The major problem in data analytics is that computers have made it so easy to fit any set of data with a complex model that, when you then apply that model to new data, data the model did not train on or was not designed with, the model performs extremely poorly or does not work at all. This phenomenon has its own name: overfitting. At the end of the course, in the final project, you'll practice some techniques to avoid overfitting. But first I want to show you an example of data that is completely random, and how common it is for random data to have patterns that appear to be associations, patterns that appear to be something that can be modeled. What we have here are ordered pairs, (x, y) values, each selected at random from a Gaussian distribution with mean zero. We select 25 ordered pairs, then calculate the best-fit line, and, using the points and the line, we can of course calculate the correlation, r. What you'll see is that these sample correlations range from below -0.5 (one sample reaches -0.63) all the way up to greater than +0.5, and none of these correlations are due to anything other than chance. The actual correlation of the two underlying random variables x and y is zero. One way to understand overfitting is that it occurs when we are overconfident about our model's ability to extract signal from a mixture of signal and noise. Our model includes some of the noise in our training set data as if it were part of the signal, so we think the signal is stronger than it really is. Then, when we try to forecast new outcomes using our model on new data, the new noise tells us nothing about the new outcomes, and our model performs much worse than we were expecting. To better understand this view of overfitting, we will first consider an ideal situation: a simple linear regression model with no overfitting.
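A quick simulation makes this concrete. The sketch below (Python with NumPy; the sample size of 25 matches the animation, while the number of trials and the random seed are arbitrary choices of mine) repeats the experiment many times, drawing both coordinates independently from a standard Gaussian and recording the sample correlation each time:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible
n_points, n_trials = 25, 2000   # 25 ordered pairs per sample, many samples

rs = []
for _ in range(n_trials):
    x = rng.standard_normal(n_points)   # x ~ N(0, 1)
    y = rng.standard_normal(n_points)   # y ~ N(0, 1), independent of x
    rs.append(np.corrcoef(x, y)[0, 1])  # sample correlation r
rs = np.array(rs)

print(f"min r = {rs.min():.2f}, max r = {rs.max():.2f}")
print(f"fraction of samples with |r| > 0.5: {np.mean(np.abs(rs) > 0.5):.3f}")
print(f"mean r = {rs.mean():.3f}")  # close to the true correlation, zero
```

Even though the true correlation is exactly zero, a noticeable fraction of these small samples show correlations beyond plus or minus 0.5, purely by chance.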
Assume an unchanging, or stationary, process that takes in real-number values, which we'll call x sub i, and outputs real-number values y sub i. We will assume the process is parametric, meaning that the x values and the errors are both drawn from independent Gaussian distributions with mean zero and standard deviation one. That means the standard deviation of our y output, the combination of signal and noise, is also equal to one. So this is a simplified model, but it gives us the basic idea: we have a particular x sub i, and we multiply that x sub i by a slope, beta, so our estimate of y sub i is beta times x sub i. Then we add a constant, called alpha; but in our simplified model, where the means are zero, alpha will equal zero. Finally, there is an element of uncertainty in the values of our y. That uncertainty takes the form of a Gaussian distribution with a standard deviation known as sigma sub e, the standard deviation of the random error component. So think of our model as generating estimates of y equal to beta times x; we're going to use y sub i hat to symbolize that this is an estimate of y. The true value of y is different: it is the estimate plus a draw from a Gaussian distribution with mean zero and the variance of our random error. How large is our random error? Using some arithmetic that occurs elsewhere in the course, we know that the standard deviation of our random error is equal to the square root of 1 minus R squared. And we know, because of our standardization, that R is equal to the slope, beta, which is also equal to the standard deviation of our signal. So we are combining a signal, which is beta times x sub i, with noise, to get the true value of y sub i. If you imagine a point on this line, with some Gaussian distribution of error around it, the true value of y could lie anywhere within that distribution.
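Under these assumptions the generating process is easy to simulate. Here is a minimal sketch; the slope beta = 0.6 is an arbitrary illustrative choice, not a value from the lecture, and the large sample size is just there so the sample statistics land near their theoretical values:

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 0.6                       # illustrative slope; under standardization, beta = R
sigma_e = np.sqrt(1 - beta**2)   # std of the random error: sqrt(1 - R^2)

n = 100_000
x = rng.standard_normal(n)             # inputs: N(0, 1)
e = sigma_e * rng.standard_normal(n)   # noise: N(0, sigma_e^2), independent of x
y = beta * x + e                       # alpha = 0 in the standardized model

print(round(float(np.std(y)), 2))                # ~1.0: signal and noise variances sum to 1
print(round(float(np.corrcoef(x, y)[0, 1]), 2))  # ~0.6: correlation equals beta
```

The two printed checks confirm the claims in the text: the output's standard deviation is one, and the correlation between x and y equals the slope.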
It's likely to lie near the signal value, but the error gives us an additional component. The difference between y sub i and y sub i hat is known as the residual. And if we take the standard deviation of our residuals, that is what we mean by the standard deviation of our error. In most of this course, we make the simplifying assumption that the best-fit line on a set of observed ordered pairs accurately defines fixed parameters alpha, beta, and the standard deviation of error. Real life can be more challenging. Even if we are lucky enough to model a process that really is stationary, we still don't know the true values of alpha, beta, and the standard deviation of the random error. What we derive from a sample of ordered pairs drawn from some underlying process are estimates: an estimate for beta, an estimate for alpha, and an estimate for the error. The whole purpose of the animation you've just watched is to convince you that, without further testing, the parameters of a linear model generated from a small- to medium-sized set of ordered pairs cannot be relied upon to predict future unknown outcomes. Now, mathematical methods exist to calculate the probability distribution of errors for each of our estimates: beta hat, alpha hat, and sigma sub e hat. Our true error for y sub i is not sigma sub e hat, but some combination of all three of these errors. However, these calculations are beyond the scope of the present course; texts that explain these adjustments in detail are cited in the bibliography document. Instead, we raise the issue to demonstrate that overfitting is not only modeling noise as if it were signal: the effect of overfitting is to increase the residuals when we use our model on new data. So not only are our errors greater than we expect, they are actually greater than the true noise of the situation would be if we had correct knowledge of beta. In other words, overfitting makes our error even worse.
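To see how much these estimates can wander, here is a sketch that repeatedly draws modest samples from a known process and fits a least-squares line to each. The true slope of 0.5, the 25-point sample size, and the trial count are illustrative choices, not values fixed by the lecture:

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 0.5                        # true slope (and correlation) of the process
sigma_e = np.sqrt(1 - beta**2)    # true noise std, about 0.866
n = 25                            # a modest sample of ordered pairs

beta_hats = []
for _ in range(1000):
    x = rng.standard_normal(n)
    y = beta * x + sigma_e * rng.standard_normal(n)
    slope, intercept = np.polyfit(x, y, 1)  # least-squares estimates
    beta_hats.append(slope)
beta_hats = np.array(beta_hats)

print(round(float(beta_hats.mean()), 2))  # centered near the true slope, 0.5
print(round(float(beta_hats.std()), 2))   # but individual fits scatter widely
```

The estimates average out to the true slope, but any single 25-point fit can easily land well above or below it, which is exactly why a single small-sample beta hat cannot be trusted on its own.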
Let me show you what I mean. We assume a stationary process where the true correlation r equals 0.50. That means our true standard deviation of random error is the square root of 1 minus R squared, which is the square root of 1 minus 0.25, which equals 0.866. However, let's imagine that we developed this model using some collection of ordered pairs, and our best-fit line had slope beta hat equal to 0.85. We would believe that the correlation equaled 0.85, and therefore we would believe that the true error equaled the square root of 1 minus 0.85 squared, which comes out to 0.527. Taking the ratio of these two numbers, you see that the true error is 64% larger than what we expected from our overfit model. Why do I say that? Because 0.866 divided by 0.527 equals 1.64. However, there is an additional error, and let me show you what I mean. Let's say we have x sub i = 2. Then we have the point (2, 1), which is the true signal value of 0.50 times 2, and the point (2, 1.7), which is our model's estimate of 0.85 times 2. The additional error is 0.7. In fact, the additional error is going to depend on the value of x sub i: it is equal to x sub i times the difference between what we believe our slope is and what the true slope, or true correlation, is. So how can we add in this additional overfitting error? Well, since we know that the x values are also drawn from a Gaussian with mean zero and standard deviation one, we can average the impact of this additional error term over all possible x's. If we don't know the value of x, the expected size of this error, measured as a standard deviation, is simply the difference between our estimate and the true value: 0.85 minus 0.50, or 0.35. If we add the squares of this error and the previous error and take the square root, we get the square root of 0.866 squared plus 0.35 squared, which is equal to 0.93.
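The arithmetic above can be checked in a few lines, using the same numbers from the example:

```python
import math

r_true, beta_hat = 0.50, 0.85    # true correlation vs. the overfit estimate

sigma_true = math.sqrt(1 - r_true**2)        # true noise: sqrt(1 - 0.25) = 0.866
sigma_believed = math.sqrt(1 - beta_hat**2)  # believed noise: sqrt(1 - 0.7225) = 0.527

print(round(sigma_true / sigma_believed, 2))  # 1.64: errors 64% larger than expected

# Extra error from using the wrong slope: x * (beta_hat - r_true); since x ~ N(0, 1),
# its expected size, as a standard deviation, is just beta_hat - r_true = 0.35.
extra = beta_hat - r_true
sigma_total = math.sqrt(sigma_true**2 + extra**2)
print(round(sigma_total, 2))                   # 0.93: total error on new data
print(round(sigma_total / sigma_believed, 2))  # 1.77: errors 77% larger than expected
```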
So our new total error is 77% larger than we expected, rather than 64%, because 0.93 divided by 0.527 equals 1.77. That's the penalty we pay for overfitting. In general, if you develop a model with a high correlation on a modest amount of data, it is much more likely that the true correlation is lower than your estimate than that it is higher. A much lower correlation on a second set of data is a sure sign of overfitting. The closer a new correlation estimate on a new data set is to your previous estimate, the more likely it is that both estimates are relatively near the true value of the signal.
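That claim about second data sets can itself be checked by simulation. The sketch below (illustrative settings of my own: true correlation 0.5, samples of 25 points) draws pairs of independent samples from the same process and looks at what the second sample's correlation tends to be whenever the first sample's correlation came out much higher than the truth:

```python
import numpy as np

rng = np.random.default_rng(3)
r_true, n = 0.5, 25                  # true correlation; modest sample size
sigma_e = np.sqrt(1 - r_true**2)

def sample_r():
    """Correlation of one fresh n-point sample from the true process."""
    x = rng.standard_normal(n)
    y = r_true * x + sigma_e * rng.standard_normal(n)
    return np.corrcoef(x, y)[0, 1]

trials = 20_000
r_first = np.array([sample_r() for _ in range(trials)])   # the sample we fit on
r_second = np.array([sample_r() for _ in range(trials)])  # an independent second sample

lucky = r_first >= 0.7               # fits that looked much stronger than the truth
print(int(lucky.sum()))              # this happens surprisingly often by chance
print(round(float(r_second[lucky].mean()), 2))  # ~0.5: the second sample reverts to the truth
```

Whenever a first sample flatters us with a correlation of 0.7 or more, the second sample's correlation averages right back down to the true value of 0.5, which is exactly the drop the text describes as a sure sign of overfitting.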