So in this section, we'll account for the uncertainty in our estimates, our estimated slopes and intercepts, by creating confidence intervals and getting p-values for hypothesis tests. And the approach we'll take will look incredibly similar to everything you did regarding confidence intervals and p-values in term one. What we'll review is that creating confidence intervals for linear regression slopes essentially means creating confidence intervals for mean differences, because, as we've seen, each of the slopes has a mean difference interpretation. And the approach is business as usual, just like we did for mean differences in term one: we take the estimated slope and add and subtract two estimated standard errors. Similarly, we can do the same thing to create a confidence interval for a regression intercept. And that's akin to creating the confidence interval for a single population mean, the mean of the outcome y for all population members with a predictor value of zero, an x-value of zero.

So in the last section, we showed the results from several simple linear regression models. For example, when relating arm circumference to height, using a random sample of 150 Nepalese children who were less than 12 months old, the resulting regression equation was y hat, the estimated mean arm circumference given a value of height x1, equal to 2.7, the intercept, plus the slope 0.16 times x1, the height measurement. And this was estimated from the individual arm circumference and height data using a computer package. But it shouldn't matter what package I use: whether I use R, Stata, Excel, SPSS, etc., I should get the same answer. So there must be some underlying algorithm that would consistently yield the same answer for these data, and for other data sets, across different computer packages.

So the algorithm to estimate the equation of a line for a given set of data is called least squares estimation. The idea is to find the line that gets closest to all of the points in the sample. So how would you define closeness to multiple points? For any single point, we can measure the distance between the observed y-value, that child's arm circumference value, and his or her predicted mean amongst children of the same height, the regression line estimate. What we do is aggregate those distances between each individual y measurement and its corresponding mean given its x-value, the regression estimate: we square those distances, and we add them up across all the observations in our sample. And what we want to do is choose the regression equation that minimizes that sum of squared distances. In other words, minimize the squared distance between each observed y-value and its corresponding estimated mean y-value, the mean for all points with the same value of x1 as that particular observation.

So let's look at this graphically and in equation form. Each distance I'm referring to, for each observation i in our sample, is that observation's observed y-value minus the mean predicted among observations with the same x-value as the one we're looking at. So in our sample of 150 Nepali children, there will be 150 distances: the distance between each child's actual arm circumference and the mean estimated by the regression line for children of the same height.
So another way to write that is: each observed y-value, the observed arm circumference, minus the mean predicted by the regression equation. And this is computed for each data point in the sample, all n sample values.

So this is what the discrepancies look like for a given regression line. Here's our arm circumference example for the Nepalese children. I've put the actual regression line on the plot and shown graphically what each of these discrepancies, sometimes called residuals, looks like. It's the distance between each person's observed arm circumference and the regression-predicted mean for persons of the same height. So this child, who is roughly 41 cm tall, had an observed arm circumference on the order of 7-point-something centimeters, but the regression estimate of the mean arm circumference for children of this height is on the order of 9 centimeters. So that child's residual is the observed value of 7.5 centimeters minus the estimated mean of 9 centimeters for children of the same height, so it comes in below the estimated mean. And similarly, these first couple of residuals are also negative, because the mean estimate for children of the given height is larger than the observed values of arm circumference for children of those heights. But then we can see, for the majority of these observations, there's positive and negative variation of the individual values around the predicted mean. And the idea is to minimize the cumulative distance of those 150 points from the optimal regression line choice.

So the algorithm chooses the estimated values for the slope and intercept, beta 1 hat and beta 0 hat, that minimize the total sum of the squared residuals across the observations in the sample. What it actually does is solve for the values of beta 0 and beta 1 that minimize this function: the sum across the n observations of the n residuals, each squared. So this is the first observation's y-value minus its corresponding predicted mean, squared, plus the second observation's y-value minus its corresponding regression-predicted mean, squared, up through and including, in our example with Nepali children, all 150 squared discrepancies. And the algorithm chooses the values of beta 0 and beta 1 that minimize this cumulative discrepancy, or distance of the points from the line. This method will also give us standard error estimates for the estimated slope and intercept. The way it does this, it doesn't have to go through and look at a bunch of different lines, evaluate the total squared distance for each, and then choose among them the one with the lowest value. The values of beta 0 and beta 1 that minimize this can be obtained by doing calculus. But luckily, we don't have to worry about that, because the computer does it for us.

Once we have the standard errors, these allow for the computation of 95% confidence intervals and p-values for the slope and intercept. The random sampling behavior of regression slopes and intercepts is normal in large samples, because they are sample mean differences and sample means, respectively, and their sampling behavior follows a t-distribution in smaller samples. Hence, it's straight up business as usual for getting 95% confidence intervals and doing hypothesis tests, based on the same central limit theorem ideas you'll recall from term one.
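Before turning to those confidence intervals, here is a minimal sketch in Python of the least squares calculation just described. The height and arm circumference values are made up for illustration, not the actual 150-child data set; the closed-form formulas are the standard least squares solutions.

    import numpy as np

    # Hypothetical height (cm) and arm circumference (cm) values, standing in
    # for the Nepali children data (made up for illustration only).
    x = np.array([41.0, 48.0, 55.0, 60.0, 65.0, 72.0])
    y = np.array([7.5, 9.2, 10.5, 11.9, 12.8, 14.1])

    # Closed-form least squares estimates: the slope and intercept that minimize
    # the sum of squared residuals, sum over i of (y_i - (b0 + b1 * x_i))^2.
    b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0_hat = y.mean() - b1_hat * x.mean()

    residuals = y - (b0_hat + b1_hat * x)   # each observed y minus its regression-predicted mean
    rss = np.sum(residuals ** 2)            # the total squared discrepancy being minimized

    print(b0_hat, b1_hat, rss)

Any other choice of intercept and slope would give a larger sum of squared residuals for these same points, which is exactly what the calculus-based solution guarantees.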
From a confidence interval perspective, we know that if we were to replicate a study of the same size over and over again, and repeatedly estimate the regression equation, the slopes and intercepts, I'll just show it for slopes here, but the same principle applies for intercepts. If we plotted the distribution of the estimated slopes for multiple random samples of the same size from the same population, it would be normal-esque, and it would be centered at the true value of the slope, the population value. We're going to get one of the values under this curve; we don't know where it will fall under this curve, but most of the values we can get will fall within plus or minus two standard errors of the true value. So most of the time, if we start with our estimate and add and subtract two standard errors, the interval we get will include the unknown, unobservable true slope. So that process is straightforward: we take our estimated slope or intercept and add and subtract two estimated standard errors.

To get a p-value, we start with the null hypothesis: if there's no association between y and x, such that knowing the value of x does not inform us at all about the mean value of y, the slope would be zero. In other words, there's no difference in the mean of y between any two groups who differ in x-values. And what we're going to do with the hypothesis testing is assume that this is true, that our data come from a population where the true regression slope is zero, and then we're going to measure how far our estimate is from zero. So, for example, maybe we have an estimate out here. We're going to measure that in terms of standard errors and then convert it to a p-value by looking at the proportion of results that are as far or farther away from zero, in either direction, as our result. And remember, if our result is far away from zero, we would reject the null hypothesis in favor of the alternative that the true slope is not zero. But if it's relatively close, we fail to reject the null. And this would sync up with whatever decision we made about zero in terms of the confidence interval we got for the quantity of interest as well. If the confidence interval includes zero, we will get a p-value that's greater than our cutoff of 0.05 and we will fail to reject. If the confidence interval does not include zero, we know that we would reject at the 5% level and our p-value would be less than 0.05. But it's not until we do the hypothesis test that we can get the exact p-value.

So let's go back to our arm circumference and height example, where, again, the estimated best-fitting line we got relating arm circumference to height was this one with intercept of 2.7 and slope of 0.16. From the computer I get the following output: the estimated slope of 0.16 has a standard error of 0.014, and similarly the estimated intercept of 2.7 has an associated standard error of 0.88. So let's look at creating confidence intervals for these quantities of interest. A 95% confidence interval for the population slope beta 1, the true association between arm circumference and height at the population level, the true magnitude of the linear association: we would take our estimated slope of 0.16 and add and subtract two standard errors. And this gives a confidence interval of roughly 0.13 centimeters to 0.19 centimeters. So, again, this slope is a mean difference, the null value is zero, and we can see this result does not include zero.
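As a quick sketch of that plus-or-minus-two-standard-errors arithmetic, using the slope and standard error quoted above:

    # Large-sample 95% CI for the slope: estimate +/- 2 estimated standard errors.
    b1_hat, se_b1 = 0.16, 0.014
    ci_lower = b1_hat - 2 * se_b1
    ci_upper = b1_hat + 2 * se_b1
    print(ci_lower, ci_upper)   # roughly 0.13 to 0.19 cm per cm of height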
Suppose we wanted to get a p-value to see how likely it is to get a slope estimate as far or farther from zero as we did, if these data come from a population where there's no association between arm circumference and height; in other words, where the true slope is zero, and it doesn't matter what your height is, it doesn't inform us about the mean arm circumference. What we do is assume the null is the truth, assume that beta 1 at the population level equals 0, and then calculate the distance of the slope estimate from zero in units of standard error. So we take our estimated slope of 0.16 and divide by the standard error of 0.014, and we have a result that's 11.4 standard errors above what we'd expect it to be under the null hypothesis. You and I both know that that's very far under a normal curve. But what we're figuring out, and what the computer gives us, is the proportion of results under a normal curve that are as far or farther than 11.4 standard errors away from zero. That proportion's very small. We already know it's going to be less than 0.05, as our confidence interval did not include zero. In this case, it's very small, well less than 0.0001. And again, this is just the concept behind getting the p-value; the computer will give that to you as part of its regression output.

We can also do a 95% confidence interval for the intercept. I'm just doing this for illustration, as the result here is not scientifically tenable per se, because, remember, this is an equation where x is in centimeters of height, and 2.7, the intercept, is the estimated mean arm circumference for children whose height is zero. That does not describe anyone in our sample, and in fact doesn't describe anyone anywhere. So it's not a scientifically relevant quantity, but it's necessary for establishing the entire line. We could do a confidence interval for this intercept, and it would be business as usual, but it would be the estimated confidence interval for the mean arm circumference amongst children with no height, so it doesn't really mean anything scientifically. I will say this, though: the standard error for the intercept is useful, for example, if we wanted to get confidence intervals on a particular mean estimate. You may remember we used this equation to estimate the mean arm circumference for children who were 60 centimeters tall: we plugged 60 into the equation and got an estimated y hat for that group. The standard error of that estimated mean is a combination of the standard error of the intercept and of the slope. We won't compute it by hand, but the computer can give us confidence limits on any predicted point on the regression line using these two standard errors.

So if we wanted to summarize what we found with these data, write it up in bullet-point format, we could say: this research used simple linear regression to estimate the magnitude of the association between arm circumference and height in Nepali children less than 12 months old, using data on a random sample of 150 such children. A statistically significant positive association was found, as indicated by the low p-value. And if we want to put our numeric result into words: the results estimate that two groups of such children who differ by 1 cm in height will differ on average by 0.16 cm in arm circumference, with a 95% confidence interval of 0.13 centimeters to 0.19 centimeters. So that's just one way we could describe these results in words.
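Looking back at the p-value arithmetic above, here is a minimal sketch of it in Python; the tail-area calculation uses scipy's normal distribution, and the numbers are the slope and standard error from the regression output quoted earlier.

    from scipy import stats

    b1_hat, se_b1 = 0.16, 0.014
    z = (b1_hat - 0) / se_b1             # distance from the null value of 0, in standard errors (about 11.4)
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided proportion of results as far or farther from 0
    print(z, p_value)                    # p is far smaller than 0.0001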
Suppose we wanted to estimate the mean difference in arm circumference for children 70 centimeters tall compared to children 60 centimeters tall, and present a 95% confidence interval. So we actually want to compare two groups of children who differ by more than one x unit, and the difference will be a multiple of the slope, not the slope itself alone. We know from a previous section that this estimated mean difference is 70 minus 60, the difference in heights of 10 centimeters, times the slope, which estimates the mean difference per 1 centimeter difference in height. So it's 10 times the slope, 10 times 0.16, or 1.6 centimeters. So how could we get a confidence interval for this mean difference, since it's not just the slope itself, but a multiple of it, 10 times the slope? Well, the 95% confidence interval for the slope itself is 0.13 to 0.19. To get the 95% confidence interval for what we've got here, which is 10 times beta 1, we want the 95% confidence interval for the value 10 times the true slope beta 1, and we can simply multiply the endpoints of the confidence interval for the true slope by 10. So 10 times 0.13 on the lower end and 10 times 0.19 on the upper end, which yields a 95% confidence interval for the true mean difference in arm circumference, for two groups of children who differ by 10 centimeters in height, of 1.3 centimeters to 1.9 centimeters. So whatever you do to the slope estimate to get a mean difference between two groups who differ by something other than one unit in x, whatever you multiply or scale it by, you would do the same thing to the confidence interval limits for the slope.

Let's go back to our systolic blood pressure and age example. Recall the results using the 7,172 observations from the NHANES 2013-14 data relating systolic blood pressure to age in years. For these data, the estimated regression line is y hat, the estimated mean systolic blood pressure for any group given its age, equal to the intercept of 99.52 millimeters of mercury plus the slope of 0.48 times x1, age in years. So here the estimated slope is 0.48, and the estimated standard error for the slope is 0.008. And similarly, the estimated intercept is 99.52, and its estimated standard error is 0.35. So here we can go through the same approach to getting a confidence interval for the true underlying population slope and the p-value, and then a confidence interval for the true underlying population intercept. Again, in this situation the intercept is not necessarily a meaningful quantity on its own. To get the confidence interval for the true underlying relationship between systolic blood pressure and age in the population from which this NHANES sample came, we take the observed or estimated slope of 0.48 and add and subtract two estimated standard errors. And this gives us an interval: our estimated slope of 0.48, a 0.48 millimeters of mercury average increase in systolic blood pressure for each one-year increase in age, has confidence limits that go from 0.464 to 0.496. So again, this is quantifying a mean difference, and this result does not include zero, so we have a statistically significant relationship between systolic blood pressure and age. If we want to get a p-value, we would again be testing the null hypothesis of no association between blood pressure and age; in other words, knowing someone's age does not tell you anything about their mean blood pressure. Putting it in terms of our quantities, that's the null hypothesis that the slope is 0, versus the alternative that the slope is not 0.
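Here is the same plus-or-minus-two-standard-errors arithmetic for the NHANES slope, along with the rescaling rule from the 10-centimeter height comparison applied, purely as an illustration of that rule, to a 10-year age difference:

    # NHANES slope and standard error from the output above.
    b1_hat, se_b1 = 0.48, 0.008
    ci = (b1_hat - 2 * se_b1, b1_hat + 2 * se_b1)  # roughly 0.464 to 0.496 mmHg per year of age

    # Mean difference for two groups differing by 10 years of age: scale the
    # estimate and both CI endpoints by 10, just as with the 10 cm height example.
    diff_10yr = 10 * b1_hat                        # 4.8 mmHg
    ci_10yr = (10 * ci[0], 10 * ci[1])             # roughly 4.64 to 4.96 mmHg
    print(ci, diff_10yr, ci_10yr)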
So again, what we're going to do is assume the null hypothesis is true and calculate the distance of the slope estimate, beta 1 hat, from 0 in units of standard error. If we do this, we take our estimated slope of 0.48 and divide by the estimated standard error of 0.008. We already know this is going to be more than two standard errors away from 0, because the 95% confidence interval did not include 0. In fact, we have a result that's 60 standard errors above what we'd expect under the null, so obviously this is going to be a very, very small p-value.

So how can we summarize these findings? This research used simple linear regression to estimate the magnitude of the association between systolic blood pressure and age using NHANES 2013-2014 data; subjects ranged from 8 to 80 years old. A statistically significant positive association was found between systolic blood pressure and age, and the results estimate that each additional year of age is associated with a 0.48 millimeters of mercury increase in average systolic blood pressure, and then we could give the 95% confidence interval.

Now let's look at hospital length of stay and age at visit. Here we have a regression where age was dichotomized, not actually taken as continuous, and we had two groups: those who were greater than 40 years old at the time of their visit, and those who were less than or equal to 40 years old. The resulting equation for these data was that the estimated mean length of stay in days was equal to the intercept of 2.74 plus the slope of 2.13 times x1, a predictor which in this case is binary: 1 for persons greater than 40 years old, 0 for persons less than or equal to 40 years old. So here the slope estimate is 2.13 days, and the standard error of this given by the computer is 0.09 days. The intercept is 2.74 days, and the standard error given by the computer for the intercept is 0.08 days.

So let's go ahead and do our business as usual. To get the 95% confidence interval for the true population slope, which here is the true mean difference in length of stay between those older than 40 and those less than or equal to 40 years old, we take our estimate of that mean difference, which is just the slope itself, the estimated slope of 2.13, and add and subtract two standard errors. That gives a confidence interval of 1.95 days to 2.31 days. So again, all slopes estimate mean differences, the null value for a mean difference is 0, and this confidence interval does not include 0, so we know the result is statistically significant at the 5% level, and we can say the p-value is less than 0.05. What we don't know is what it is exactly. We could figure that out if we wanted to, or let the computer do it for us, but the process would be to assume that at the population level there was no association between length of stay and age, that the slope for age was zero, versus the alternative that the slope was not zero. We assume at the start that the null is true and calculate the distance of our slope estimate, beta 1 hat, from zero in units of standard error. When we do this, we have a result that is about 23.7 standard errors above what we'd expect it to be under the null. That's quite far, and the resulting p-value is very small. We wouldn't even give an exact value, just that it's less than some threshold, less than 0.0001, for example.
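If you want to see how a computer package produces all of these pieces at once, here is a minimal sketch using Python's statsmodels with a binary predictor like the over-40 indicator. The length-of-stay data are simulated, not the actual hospital data, and the chosen coefficients simply echo the values quoted above.

    import numpy as np
    import statsmodels.api as sm

    # Simulated (made-up) length-of-stay data in days, with a binary predictor:
    # x1 = 1 for patients over 40 at the time of visit, 0 otherwise.
    rng = np.random.default_rng(0)
    x1 = rng.integers(0, 2, size=500)
    los = 2.74 + 2.13 * x1 + rng.normal(0, 1.5, size=500)

    X = sm.add_constant(x1)      # adds the intercept column to the predictor
    fit = sm.OLS(los, X).fit()   # least squares estimation

    print(fit.params)            # estimated intercept and slope
    print(fit.bse)               # their estimated standard errors
    print(fit.conf_int())        # 95% confidence intervals
    print(fit.pvalues)           # p-values for testing each coefficient against 0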
I want to point out here, though, that the intercept here is a useful quantity: it is the estimated mean length of stay among subjects in the younger age group, the reference group less than or equal to 40 years old. So creating a confidence interval for the true underlying mean length of stay for that subgroup would yield a useful confidence interval. In this case, the estimated mean length of stay for that group is 2.74 days, and the confidence interval for that true mean goes from 2.58 days to 2.90 days. So this is a useful confidence interval in this case, and that's always the situation when the group with an x-value of 0 is relevant to our data.

So, in summary, the construction of confidence intervals for linear regression slopes and intercepts is business as usual: take the estimate and add and subtract two estimated standard errors for larger samples. In smaller samples, the 95% confidence intervals and p-values are based on the t distribution with n - 2 degrees of freedom. This detail will be handled by the computer, and the interpretations of the confidence intervals and p-values are the same regardless of the sample size. Why n - 2 degrees of freedom? Well, in the regression we have to estimate two things with n pieces of data: the intercept and the slope. So for similar reasons to those we talked about in the first term, the degrees of freedom is the total sample size less the number of quantities that need to be estimated. Confidence intervals for slopes are confidence intervals for mean differences, always, regardless of the type of predictor we have. And confidence intervals for intercepts are confidence intervals for the mean of y for a specific group, the group whose x1, or predictor, value is zero. That's not always a relevant quantity when x1 is continuous, but as I noted, the standard error of the intercept will be used for constructing standard errors for specific mean estimates based on specific x-values. Those mean estimates are a combination of the intercept and some multiple of the slope, and the uncertainty in that mean estimate, when you sum those together, is a function of the uncertainty in both the slope and the intercept.
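As a final sketch of that small-sample detail, the only change is swapping the multiplier of 2 for the appropriate t quantile with n - 2 degrees of freedom. The slope, standard error, and sample size here are made up purely for illustration.

    from scipy import stats

    # Hypothetical small-sample slope estimate and standard error (made up).
    n = 25
    b1_hat, se_b1 = 0.16, 0.03

    t_crit = stats.t.ppf(0.975, df=n - 2)            # about 2.07 for 23 degrees of freedom
    ci = (b1_hat - t_crit * se_b1, b1_hat + t_crit * se_b1)

    t_stat = b1_hat / se_b1                          # distance from 0 in standard errors
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value from the t distribution
    print(ci, t_stat, p_value)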