Unfortunately, there exists a strong theoretical negative result in computer science that states that this is impossible. It's known as the No Free Lunch Theorem, and it was established by David Wolpert in 1996. The No Free Lunch Theorem says that no single classification algorithm can be universally better than any other algorithm on all domains. An even stronger and somewhat surprising formulation of this theorem is that all classification algorithms have the same error rate when averaged over all possible data-generating distributions. A similar result applies to the case of regression that we presented earlier. This statement might sound very surprising, so let's consider a specific example. Say we want to make a classifier that we will call "hot stock or not hot stock." The classifier will take a set of predictors, or features, x1 to xN, that we use to make a prediction. For convenience, I added one more constant predictor, x0, that equals 1 here, but don't worry if you don't know yet what it is for. We'll talk about these technicalities later. Now, the output z of the classifier would be a binary number, 0 or 1. The value of 1 for a given stock means that it's expected to beat the market, and 0 means that it isn't. We would use the output of such a classifier to make our investment decisions, so the classifier would be a kind of investment advisor. Now, to get the binary output z of 0 or 1 from the real-valued inputs x1 to xN, we need to put these inputs through some sort of nonlinear transformation. We can schematically represent such a function as this blue circle. Inside this circle, I plotted one example of a nonlinear transformation for the case of just one variable. This function would have some number of parameters: at least N + 1 parameters, because this is the number of our input variables, including our constant input x0.
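In code, such a classifier cell might look like the sketch below. This is a hypothetical illustration, not the model from the course: the particular nonlinearity (tanh) and the zero threshold are placeholders standing in for whatever transformation the blue circle represents.

```python
import numpy as np

# Hypothetical sketch of the classifier cell: N real-valued inputs plus the
# constant input x0 = 1 go through a nonlinear transformation with N + 1
# parameters, and the result is thresholded into a binary label z.
# The choice of tanh as the nonlinearity is just a placeholder.

def classify(features, weights):
    """Map features x1..xN to a binary label z in {0, 1}."""
    x = np.concatenate(([1.0], features))  # prepend constant predictor x0 = 1
    t = weights @ x                        # N + 1 parameters, one per input
    return int(np.tanh(t) > 0.0)           # nonlinearity, then threshold

n_features = 50
weights = np.zeros(n_features + 1)
weights[0] = 1.0                           # only the constant term is active
print(classify(np.zeros(n_features), weights))  # 1: predicted "hot stock"
```

With only the constant weight active, the cell's output is tanh(1), which is positive, so the label is 1.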
One simple example of such a function would be the so-called logistic function shown here, whose argument is a linear combination of all features. Such a function would have N + 1 parameters, so it would be simple enough. This is not yet a binary output, as such a function produces a continuous output, but we will see in our follow-up videos how its output can be converted to a binary value of 0 or 1. For now, let's just continue with this example and assume that we fine-tuned the parameters of this model, so that it's now trained on some large dataset of stocks, say on 2,000 days of observations for 2,000 stocks, so that we have 4 million observations in total, each having, say, 50 features. Again, I skip the details of how this can be done. We will learn it in a short while, but for now I want to focus on a high-level picture. So assume that you built such a classifier and fine-tuned its parameters by looking at the past data. Now you have a predictor that will tell you whether any particular stock is hot or not. You can now start trading using this predictor. For example, you can buy ten hot stocks and sell ten not-hot stocks. Chances are that in reality, you will not be too thrilled with the performance of your strategy, and you will want to improve your classifier. So now let's assume that we come up with this bright idea on how to do it. What if we just make a pipeline of such transformations, as shown in this picture? Each circle here represents some transformation of the inputs of that cell. So our whole pipeline would produce a sort of waterfall of different nonlinear transformations. The pipeline would consist of layers that take their inputs, make nonlinear transforms on them, and pass them up the hierarchy. Here I show just two such layers, but in principle, we could add more layers someplace in between the inputs and the outputs. Each transformation would have its own parameters.
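The two-layer pipeline just described can be sketched as follows. This is an illustrative sketch, not the course's implementation: the layer sizes are invented, the parameters are random and untrained, and the simple 0.5 threshold used to obtain a binary label is a shortcut for the conversion discussed in the follow-up videos.

```python
import numpy as np

# Illustrative sketch of the two-layer pipeline: each layer applies its own
# parameterized linear map followed by a nonlinear transformation (here the
# logistic function) and passes the result up the hierarchy. Layer sizes
# and parameter values are invented for illustration.

def logistic(t):
    """The logistic function, mapping any real t into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def two_layer_classifier(x, W1, b1, w2, b2):
    """Two stacked nonlinear transformations, ending in a binary label z."""
    h = logistic(W1 @ x + b1)   # layer 1: its own parameters W1, b1
    p = logistic(w2 @ h + b2)   # layer 2: its own parameters w2, b2
    return int(p > 0.5)         # threshold the continuous output

n_features, n_hidden = 50, 10   # 50 features, as in the stock example
rng = np.random.default_rng(0)
W1 = rng.normal(size=(n_hidden, n_features))
b1 = rng.normal(size=n_hidden)
w2 = rng.normal(size=n_hidden)
b2 = rng.normal()

print(two_layer_classifier(rng.normal(size=n_features), W1, b1, w2, b2))
```

Note how each layer carries its own parameters: W1 alone holds 10 × 50 = 500 of them, which is why stacking layers quickly produces a much richer family of functions.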
So the whole pipeline would have many parameters and would produce a sufficiently rich function. If we have lots of data or lots of predictors, maybe such a model, after some parameter tuning, would perform better than the first, less sophisticated model. In fact, what I described is a schematic working of neural networks, which we will talk about a lot in this specialization. Now let's assume that we have built such a more advanced model and even found that it indeed works better for our stock data. So maybe, because it works better for stock predictions, such a more sophisticated model would always be better than the first, less sophisticated one for any data of the same shape, which was, in our example, 4 million rows and 50 columns? Indeed, if it has more parameters than the first model, shouldn't it always be preferred to the first model for any data matrix of dimension 4 million by 50? And the answer given by the No Free Lunch Theorem is that a more sophisticated model not only would not always be better than the simple one, but the two would actually have exactly the same error rate if their performance is averaged over all possible datasets of the same size. Now, how is this possible, and what does it mean? It simply means that the set of all possible data-generating mechanisms is too rich to be adequately represented by any given machine learning algorithm whose capacity for generalization is determined by a particular model architecture. For example, while a very large neural network can beat any other model for image classification, the No Free Lunch Theorem guarantees that a much simpler model would work better at least for some types of data. Now, is such a lack of universality bad news or good news? I personally believe that this is very good news, because it's exactly what makes machine learning so exciting and open to everyone who wants to experiment with new types of datasets and new machine learning algorithms.
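The averaged-error claim can also be checked numerically in a toy setting. The sketch below is an illustration of the intuition, not Wolpert's proof: for binary labels, averaging a fixed predictor's error over all possible labelings of a set of unseen points gives exactly 0.5, no matter which predictor we pick.

```python
import itertools

# Toy illustration of the No Free Lunch averaging claim: every fixed
# predictor, however clever, has the same average error (0.5) once we
# average over ALL possible binary labelings of the unseen points.

def average_error(predict, n_points):
    """Average 0/1 error of `predict` over every possible labeling."""
    predictions = [predict(x) for x in range(n_points)]
    labelings = list(itertools.product([0, 1], repeat=n_points))
    total = 0.0
    for labels in labelings:
        total += sum(p != y for p, y in zip(predictions, labels)) / n_points
    return total / len(labelings)

# Two very different "algorithms" end up with the same averaged error rate.
always_hot = lambda x: 1          # always predicts "hot stock"
parity_rule = lambda x: x % 2     # an arbitrary deterministic rule

print(average_error(always_hot, 4))   # 0.5
print(average_error(parity_rule, 4))  # 0.5
```

The reason is simple: for each unseen point, exactly half of all labelings agree with the prediction and half disagree, so no predictor can do better than any other once all data-generating mechanisms count equally.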
Another, more practical conclusion from all of the above is that we should not seek machine learning models that would be universally better than any other across all possible domains. Instead, the focus should be on models that work better for particular domains of interest. And this is exactly one of the objectives of this specialization, where we explore methods that work best specifically in the financial domain, rather than those that are found to work better in other domains, for example, image recognition. And on this note, let's talk next about a key machine learning concept, namely the idea of regularization.