Let's look at some additional examples related to linear regression. The first thing I want to revisit is the relationship between arm circumference and sex of the child. We had originally looked at this with a linear regression model. Here, we have a binary predictor which is coded as a one for female children and a zero for males, and hence the slope estimate of negative 0.13 estimated the mean difference in arm circumference between females and males, in the direction of female minus male. So, females had lower arm circumference on average than males by 0.13 centimeters. And here, the intercept estimate is the estimated mean arm circumference when x1 is equal to zero, that is, for the reference group of males. So, the question I have for you now is: suppose we had coded the variable as a one for male children and a zero for female children, and we fit the following regression line. I'm just putting little stars on the intercept and slope here to indicate that they may be different from the values we looked at before, when the coding was a one for females and a zero for males. You might want to pause here and see if you can figure it out; if not, you can just follow along. Can you figure out, based on the previous regression results, what the values of the new intercept and slope would be based on the same data with this new coding of one for males and zero for females? And what would R-squared and r be with this new coding? So, let's just look at this. What did we have before? Let's go through it in detail. When x1 was originally coded as a one for females and a zero for males, we had y hat, the estimated mean arm circumference, equal to 12.5 plus negative 0.13 times x1. The slope is the mean difference in arm circumference for females compared to males, and we said the intercept in that coding schema was the mean for males.
If we wanted the mean for females, we would take 12.5 and add negative 0.13, which would give us a mean of 12.37. Those are the only numbers we can generate from this equation: there are only two groups for which we're estimating means, and hence one mean difference. Now, let's suppose somebody else had taken these data and coded sex as a one for males and a zero for females. So, we have this new equation based on this coding schema, with a potentially new intercept estimate and a new slope estimate. Let's see if we can figure this out relatively quickly. Well, under this new coding schema, which is one for males and zero for females, the slope is now going to estimate the difference between sexes in the direction of the mean arm circumference for males minus that for females. We already know from the previous coding that the difference in the opposite direction was negative 0.13. So, if we change the direction: if females have arm circumference 0.13 centimeters less than males on average, then males have arm circumference 0.13 centimeters more on average. That demystifies this slope immediately: it's just the opposite value, 0.13 instead of negative 0.13. What does the intercept represent now in this equation? Well, it's the estimated mean arm circumference for the group whose sex value is coded as zero, and under this new scheme females are the reference group. So this would be the mean arm circumference for females, which we already know from the previous iteration is 12.37 centimeters. So, under this new schema, the value of the intercept is 12.37 and the slope is 0.13. Notice that we still get the same overall results. We get a mean of 12.37 for females; to get the mean for males, we take 12.37 and add the slope of 0.13 to give us the 12.5 we saw before. The difference is just being computed in the opposite direction. What would happen to r and R-squared? Remember, R-squared is equal to 0.002.
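To make the coding flip concrete, here's a small sketch in Python with made-up arm circumference values, chosen so the group means match the ones above (12.5 for males, 12.37 for females). With a binary predictor, the fitted intercept is just the reference-group mean and the slope is the difference in group means:

```python
# Sketch: with a binary predictor, OLS reduces to group means.
# The individual measurements below are hypothetical, not the study data;
# they are chosen so the group means are 12.5 (males) and 12.37 (females).
import numpy as np

male = np.array([12.1, 12.6, 12.9, 12.4])      # mean 12.5
female = np.array([12.2, 12.5, 12.4, 12.38])   # mean 12.37
y = np.concatenate([male, female])

# Coding 1: x1 = 1 for females, 0 for males
x_f = np.concatenate([np.zeros(4), np.ones(4)])
slope_f, int_f = np.polyfit(x_f, y, 1)
print(int_f, slope_f)    # male mean, female minus male difference

# Coding 2: x1 = 1 for males, 0 for females
x_m = 1 - x_f
slope_m, int_m = np.polyfit(x_m, y, 1)
print(int_m, slope_m)    # female mean, male minus female difference
```

Recoding just swaps which group mean the intercept reports and flips the sign of the slope; the two fitted group means are identical either way.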
Well, we've done nothing but change the coding, if you will. You can think of this as changing the units of x from ones for females and zeros for males to the opposite direction. That changes nothing about the strength of the relationship, so the R-squared stays the same. But now, via this coding, you can think of this as estimating the relationship between arm circumference and male sex, because males are coded as one, and that relationship is positive, whereas when we looked at the opposite direction with female sex it was negative. So, the corresponding r for this analysis, when x1 is coded as one for males, is going to be the positive square root of 0.002, or 0.04, and that's only because we're coding sex in the opposite direction here. You can think of arm circumference as being positively associated with male sex, meaning males have higher arm circumference on average, or negatively associated with female sex, because females have lower arm circumference on average. We can extend this way of thinking to when we have multiple categories: if we change the coding, that is, change the reference category and the coding of the indicators, we'll still get the same overall results, but the comparisons being made by the slopes and the reference mean given by the intercept will depend on that coding. So, let's look at systolic blood pressure and ethnicity. Here's a challenge exercise that I won't cover here, but you could think about it and we can talk about it in office hours or online: what would happen if you changed the reference group for this comparison? What would the estimates of the resulting regression slopes be for the other categories with a different reference group? But right now we're going to look at this one. We're relating systolic blood pressure to ethnicity, and we had five ethnicities. The reference group for all comparisons was Mexican Americans, and we had indicators for those who identify as Hispanic,
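A quick sketch of that sign flip, again with toy data in the spirit of the example: recoding the indicator reverses the sign of r but leaves R-squared untouched.

```python
# Sketch: flipping a binary coding flips the sign of the correlation r,
# but R-squared (its square) is unchanged. Toy values, not the study data.
import numpy as np

y = np.array([12.1, 12.6, 12.9, 12.4, 12.2, 12.5, 12.4, 12.38])
x_female = np.array([0., 0., 0., 0., 1., 1., 1., 1.])  # 1 = female
x_male = 1 - x_female                                  # 1 = male

r_female = np.corrcoef(x_female, y)[0, 1]
r_male = np.corrcoef(x_male, y)[0, 1]

print(r_female, r_male)        # equal magnitude, opposite signs
print(r_female**2, r_male**2)  # identical R-squared
```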
those who identify as non-Hispanic white, those who identify as non-Hispanic black, and those who don't identify with any of the other four categories. So, now I want to talk about the mean difference in systolic blood pressures between non-Hispanic blacks and non-Hispanic whites, and we'll also comment on the resulting 95 percent confidence interval. Neither one of these groups is the reference, so we're going to have to do some combination of our slopes here. Recall that x3 is the indicator of non-Hispanic black, and its slope of 4.4, beta three hat, estimates the mean difference in systolic blood pressure between non-Hispanic blacks and the reference group of Mexican Americans. Similarly, the slope for x2, the indicator of being non-Hispanic white, beta two hat, equals the mean difference in systolic blood pressures between non-Hispanic whites and the same reference group of Mexican Americans. If we take the difference in slopes, beta three hat minus beta two hat, the reference cancels out and we get the estimated mean difference we're looking for, between non-Hispanic blacks and non-Hispanic whites. That estimated mean difference is 4.4 minus 3.4, or one: a difference on average of one millimeter of mercury. So, that's the estimated mean difference. Getting a confidence interval for this is a little more complicated. If you think about it, this is a difference in two slopes, so we're going to need the standard error of the difference in these two slopes. These two slopes are potentially correlated: they may track with each other, so that as one goes up the other goes up from sample to sample if we were to repeat the study over and over again, or they may be negatively correlated, meaning as one gets larger the other gets smaller from study to study. So, we may need to back out double-counted variation, or add in something that's missed because of the negative correlation.
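Here's a sketch of that arithmetic. The slopes are the ones from the model above; the variances and covariance are hypothetical placeholders, since in practice the software supplies the coefficient covariance matrix:

```python
# Sketch of why the SE of (b3_hat - b2_hat) needs the covariance of the
# two slopes. Slope values are from the fitted model discussed above;
# var_b3, var_b2, and cov_b3_b2 are hypothetical illustration values.
import math

b3, b2 = 4.4, 3.4   # each slope compares a group to Mexican Americans
diff = b3 - b2      # the Mexican-American reference cancels out
print(diff)         # estimated mean difference: 1 mm Hg

var_b3, var_b2 = 0.30, 0.28   # hypothetical variances of the slopes
cov_b3_b2 = 0.15              # hypothetical (positive) covariance

# Var(b3 - b2) = Var(b3) + Var(b2) - 2*Cov(b3, b2):
# positive covariance backs out double-counted variation.
se_diff = math.sqrt(var_b3 + var_b2 - 2 * cov_b3_b2)
print(se_diff)
```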
In either case, it's not something we can do by simply combining the standard errors for these two slopes; the computer has to account for the correlation to come up with the resulting standard error. So, I just want you to know that we can get confidence intervals for such differences, but even with the individual standard errors for each of the slopes, we couldn't do it by hand. You would need to appeal to the computer, or have someone do it for you who was running the data. If we do that, we get a confidence interval for this estimated difference of one millimeter of mercury that goes from negative 0.14 up to positive 2.14. So, it's not statistically significant, since the interval includes zero; the majority of possibilities for the true mean difference are positive, indicating higher average blood pressure in non-Hispanic blacks compared to non-Hispanic whites, but again, these differences are scientifically relatively small, with or without statistical significance. Just to show you that we can also get confidence intervals for single estimated means (again, I'm not going to ask you to do this by hand, because it requires a computer even given the standard errors for everything): suppose I asked you, what is the estimated mean systolic blood pressure for non-Hispanic whites? What you would do is say, "Well John, x2 is the indicator that is one for non-Hispanic whites, so to get the mean for this group I would take the intercept, the mean for the reference group, and add the difference of 3.4 between non-Hispanic whites and the reference group, and if you do this it turns out to be 119.1 millimeters of mercury." This is a mean estimate, and it's estimated by taking the intercept plus the slope for the indicator of non-Hispanic white, so the standard error of this mean estimate is the standard error of the sum of two estimates, the intercept and a slope.
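A sketch of the same idea for a single mean; note the plus sign on the covariance term, versus the minus sign for a difference. The intercept value of 115.7 is implied by the arithmetic in the text (119.1 minus 3.4), and the variances and covariance are hypothetical:

```python
# Sketch: a single group mean is intercept + slope, and its SE needs the
# covariance of the two estimates. The intercept 115.7 is implied by
# 119.1 - 3.4 in the text; variance/covariance values are hypothetical.
import math

b0 = 115.7          # intercept: mean for the Mexican-American reference
b2 = 3.4            # slope for the non-Hispanic white indicator
mean_nhw = b0 + b2
print(mean_nhw)     # 119.1 mm Hg

var_b0, var_b2, cov_b0_b2 = 0.20, 0.28, -0.18   # hypothetical values

# Var(b0 + b2) = Var(b0) + Var(b2) + 2*Cov(b0, b2):
# for a sum, a negative covariance shrinks the combined variance.
se_mean = math.sqrt(var_b0 + var_b2 + 2 * cov_b0_b2)
print(se_mean)
```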
Again, it's not a straightforward combination, because we need to take into account the potential covariation of these two from study to study, so again this isn't something you could do by hand even if I gave you the respective standard errors for these two things. But I just want you to know that with a computer, you can create confidence intervals for single group means that are a combination of the intercept and some multiple of the slope. So in this case, just FYI, the estimated mean is 119.1 millimeters of mercury for non-Hispanic whites, and the estimated confidence interval for the true mean among all non-Hispanic whites in the population is 118.4 millimeters of mercury to 119.8 millimeters of mercury. Let's look at another systolic blood pressure example, again from NHANES, but this one relating systolic blood pressure to age in years. These were the data we had, and here is the resulting equation. We saw a positive association between age and mean blood pressure: a slope for age in years of 0.48, with a standard error estimate for the slope of 0.008. The estimated intercept is 99.52; it's technically the estimated mean for newborns, who were not included in this study, so it doesn't mean much with regard to our data, but we still have a standard error for it as well, of 0.035. So, the first question I have is: what is the estimated mean difference in systolic blood pressure for 65-year-olds compared to 60-year-olds? Remember, any differences in the outcome for any given difference in x are all described through the slope. So the slope here, the estimated mean difference in blood pressure for a one-year difference in age, is 0.48 millimeters of mercury.
So all we need to do to get the difference in mean systolic blood pressure for this five-year difference in age is take five times the slope estimate, five times 0.48, which turns out to be 2.4 millimeters of mercury. And we can put a confidence interval on that as well: since we already have the standard error for the slope, we can create a 95 percent confidence interval for the slope and then multiply the endpoints by five. So, to do that, you take the 95 percent confidence interval for the slope, which runs from 0.48 minus two times the standard error of 0.008 up to 0.48 plus two times the standard error of 0.008; that would be a confidence interval for the slope itself, and to get the confidence interval for five times the slope we just multiply these endpoints by five. So, now let's talk about the estimated mean systolic blood pressure, and we'll comment on the 95 percent confidence interval for this mean, for 65-year-olds based on this model. Recall the model says that to get the estimated mean blood pressure for a group given its age, we take the intercept of 99.52 plus the slope of 0.48 times the age value. So, if we're looking at 65-year-olds, the estimated mean would be given by taking 99.52 and adding 0.48 times 65, and if we do that we get an estimated mean of 130.82 millimeters of mercury. In generic terms, in terms of the intercept and slope, this operation is taking the intercept plus 65 times our slope estimate. So, if we want to get the standard error of this mean and a confidence interval estimate, it's going to be a function of both the standard error of the intercept and the standard error of the slope, and then something about their covariation. Here's a situation where our intercept isn't, on its own, a particularly useful number, and so we wouldn't necessarily put confidence limits on it, because it describes the mean for an age group that doesn't exist in our data.
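The five-times-the-slope arithmetic can be sketched directly, using the slope and standard error given above:

```python
# Sketch of the CI arithmetic for a 5-year age difference: build the 95%
# CI for the slope, then multiply both endpoints by 5. Values from the text.
slope, se = 0.48, 0.008

ci_slope = (slope - 2 * se, slope + 2 * se)     # CI for a 1-year difference
ci_5yr = (5 * ci_slope[0], 5 * ci_slope[1])     # CI for a 5-year difference

print(5 * slope)   # estimated 5-year mean difference: 2.4 mm Hg
print(ci_5yr)
```

This works by hand because a five-year difference is just a fixed multiple of one slope; combinations of two or more coefficients are where the covariance, and hence the computer, comes in.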
However, its standard error is critical when we go to estimate the means for age groups that do occur in our population, because that's always done by starting with the intercept and adding in multiples of the slope. So the standard error of this mean estimate will be a function of the standard error for the intercept and 65 times the standard error for the slope, and will include those two things plus some measure of how they covary. I'll just cut to the chase: if you do this with a computer, we already know the estimated mean, and the confidence interval for the true mean, rounded to one decimal, is 130.3 up to 131.4 on the upper end. So anyway, I hope this gives you some more examples of handling computations with regression coefficients, the arbitrariness of coding when we have binary predictors, and something to think about for the categorical case. Certainly, I don't expect you to be able to compute by hand confidence intervals for things that involve combinations of slopes, like mean differences for two groups neither of which is the reference group when we have categorical variables, or for specific means given a specific x value, but I do want you to know that they can be computed; it just has to be done by a computer to get the correct standard error. They can absolutely be computed, so if you're working as part of a research team or doing the computing yourself, you could get those if interested.
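The general recipe the computer follows in all of these examples can be sketched as one matrix computation: any estimate of the form c-transpose times beta hat (a single mean, a difference, a multiple of a slope) has standard error sqrt(c' V c), where V is the coefficient covariance matrix. The covariance matrix below is hypothetical; real software (for example, a fitted statsmodels result's cov_params()) would supply it:

```python
# Sketch of the general linear-combination recipe: SE(c'beta) = sqrt(c'Vc).
# beta uses the intercept and age slope quoted in the lecture; the
# off-diagonal covariance in V is a hypothetical illustration value.
import numpy as np

beta = np.array([99.52, 0.48])            # intercept, slope for age
V = np.array([[0.035**2, -0.0002],        # hypothetical covariance matrix
              [-0.0002, 0.008**2]])       # (diagonals from the quoted SEs)

c = np.array([1.0, 65.0])                 # mean for 65-year-olds: b0 + 65*b1
est = c @ beta
se = np.sqrt(c @ V @ c)
# est is ~130.72 with these rounded coefficients; the lecture's 130.82
# reflects the unrounded estimates from the fitted model.
print(est, se)
```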