So, in this set of lectures, we're going to build on a concept we talked about intermittently in the first term, the idea of confounding, and define it more formally. We'll also talk about a remedy for the interference that confounding causes in crude estimates: removing it by a process called adjustment. We'll lay out the ideas of adjustment here, and then we'll segue into the next set of lectures, which will show how we can do the adjustment with multiple regression techniques. So, in this set of lectures, we will formally define confounding and give explicit examples of its impact, define adjustment and adjusted estimates conceptually, and begin a discussion of the analytics of adjustment.
So, to start, let's look at confounding, give it a formal definition, and give some examples of it. We're going to formally define confounding, establish conditions that can result in the confounding of an outcome-exposure relationship, and demonstrate the potential effects of confounding by examples.
So, the first example we're going to look at is actually fictitious. I'm a big believer in using real data 99% of the time, but I think for this concept it's helpful to have something that hits you over the head more viscerally than most real data examples could, just to start, and then we'll use real data from here on in.
So, consider the results from this fictitious study. We wanted to investigate the association between smoking and a certain disease outcome in male and female adults, so 210 smokers and 240 non-smokers were recruited for the study from the population under study. In two-by-two format, the association between our outcome, disease, and our predictor, smoking, is as follows: out of the 210 smokers, 52 in the sample have disease; out of the 240 non-smokers, 64 in the sample have disease.
So, if we compute the estimated relative risk of disease for smokers to non-smokers based on these data, we take the ratio of those two p-hats: 52 out of 210 smokers with disease, divided by 64 out of 240 non-smokers with disease. The relative risk comes in slightly less than one, about 0.93. It looks like there's no impact of smoking, or perhaps smoking is even a little bit protective. We don't have confidence limits here, and we won't put those on for a little while, just to focus on the estimates for now. So, that's kind of counterintuitive given what we know about smoking and health outcomes, but nevertheless, this is what the data say as we've analyzed them thus far.
It turns out we have additional information about these subjects, including their biological sex. So, I was able to analyze these data in a little more detail, and what I did was look at the relationship between smoking, our exposure of interest, and this third factor, sex. If you look at this two-by-two table of the distribution of the 210 smokers, you can see the majority by far are male: 160 out of 210 are males. Among the non-smokers, the majority are female. So, in this sense, this variable sex seems to be related to our predictor of interest, smoking, in that males are much more likely than females to be smokers.
Similarly, I want to look at the association between this other variable, sex, and our outcome of interest, disease, and here's a two-by-two table that shows this association. Again, I had the raw data at hand. There is no way you could generate these tables from the previous table I've given you. So, of the 116 persons with disease, 33 were male and the remainder were female, so the majority were female. Of the 334 without disease, half were male and half were female. So sex also seems to be related to disease. I won't formally compute a relative risk here, but females have a higher risk of disease than males.
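To make the arithmetic so far concrete, here is a minimal Python sketch that recomputes the crude relative risk and the two associations with sex from the counts quoted in the lecture. The helper names (`risk`, `relative_risk`) are illustrative only, and the split of non-smokers by sex (40 males, 200 females) is taken from the sex-specific tables that appear a little later in the lecture.

```python
# Counts quoted in the lecture's fictitious study.
smokers_total, smokers_diseased = 210, 52
nonsmokers_total, nonsmokers_diseased = 240, 64


def risk(cases, total):
    """Proportion with the outcome in a group (the group's p-hat)."""
    return cases / total


def relative_risk(cases_a, total_a, cases_b, total_b):
    """Risk in group A divided by risk in group B."""
    return risk(cases_a, total_a) / risk(cases_b, total_b)


# Crude (unadjusted) relative risk of disease, smokers vs non-smokers.
crude_rr = relative_risk(smokers_diseased, smokers_total,
                         nonsmokers_diseased, nonsmokers_total)

# Sex vs smoking: 160 of 200 males smoke, 50 of 250 females smoke
# (totals taken from the sex-specific tables later in the lecture).
male_smoking_share = risk(160, 200)
female_smoking_share = risk(50, 250)

# Sex vs disease: 33 of 200 males and 83 of 250 females have disease.
female_vs_male_rr = relative_risk(83, 250, 33, 200)

print(f"crude RR, smokers vs non-smokers: {crude_rr:.2f}")
print(f"share who smoke: males {male_smoking_share:.0%}, "
      f"females {female_smoking_share:.0%}")
print(f"RR of disease, females vs males: {female_vs_male_rr:.2f}")
```

The point of the sketch is only that sex tracks both smoking and disease in these data; no confidence limits are computed, in keeping with the lecture's focus on estimates for now.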
So, the original outcome of interest is disease and the original exposure of interest is smoking. In this example, the biological sex of the person seems to be related to both the outcome of disease and the exposure of smoking. This relationship behind the scenes is possibly affecting the overall relationship we assessed between disease and smoking, when we got a relative risk of disease for smokers to non-smokers of 0.93. So, you might ask how we can look at the relationship between disease and smoking while removing any possible interference from sex.
So, one approach would be to break out the data separately by sex, and look at the original relationship of interest, disease and smoking, separately for males and females. Again, you wouldn't be able to produce the tables I'm showing you here from the previous tables I showed you; you would need the entire dataset to do this. But here is the two-by-two table of the disease-smoking relationship among males only. Of the 200 males, 160 smoke, and 29 of those have disease; of the remaining 40, who are non-smokers, four have the disease. So, if we take the relative risk of disease for male smokers to male non-smokers, 29 over 160 divided by 4 over 40, it is approximately 1.8. We saw very little association when males and females were taken together in the analysis, but among males only, it appears that smoking is positively associated with disease: it increases the risk by 80 percent on the relative scale.
We do the same thing for females only. Looking at the two-by-two table of smoking and disease status in females only, we see a similar thing: out of the 50 females who smoked, 23 had disease; of the 200 females who didn't smoke, 60 had disease. If we take the relative risk of disease for female smokers to female non-smokers, 23 out of 50 over 60 out of 200, it's about 1.5. So, as we saw with males, with a slightly different numerical result, females who smoke have an increased risk of the disease outcome as well, by 50 percent.
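The sex-specific calculations above can be sketched in a few lines of Python. This is only an illustration using the counts quoted in the lecture; the dictionary layout and the `relative_risk` helper are assumptions of this sketch, not part of the course materials. It also collapses the two strata back together to confirm that they reproduce the crude result.

```python
# (diseased, total) for each smoking group within each sex,
# as quoted in the lecture's sex-specific two-by-two tables.
strata = {
    "male":   {"smokers": (29, 160), "nonsmokers": (4, 40)},
    "female": {"smokers": (23, 50),  "nonsmokers": (60, 200)},
}


def relative_risk(exposed, unexposed):
    """Ratio of risks; each argument is a (cases, total) pair."""
    return (exposed[0] / exposed[1]) / (unexposed[0] / unexposed[1])


# Sex-specific relative risks of disease, smokers vs non-smokers.
stratum_rr = {
    sex: relative_risk(t["smokers"], t["nonsmokers"])
    for sex, t in strata.items()
}


def collapse(group):
    """Add up cases and totals across sexes for one smoking group."""
    cases = sum(t[group][0] for t in strata.values())
    total = sum(t[group][1] for t in strata.values())
    return cases, total


# Ignoring sex recovers the crude table: 52/210 vs 64/240.
crude_rr = relative_risk(collapse("smokers"), collapse("nonsmokers"))

for sex, rr in stratum_rr.items():
    print(f"{sex:>6} RR: {rr:.1f}")
print(f" crude RR: {crude_rr:.2f}")
```

Both sex-specific relative risks come out above one while the crude one sits just below one, which is the discrepancy the lecture goes on to explain.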
So, when we ignored sex and combined males and females together, we saw that the relative risk relating smoking and disease was close to one, and the risk difference, if you did it on that scale, was close to zero. This overall association, not taking into account any other factors, is sometimes called the crude or unadjusted relationship between smoking and disease. But when we looked at the sex-specific results, males only and females only, we saw something very different. In both males and females, we saw a positive association with disease, with relative risks of 1.8 for males and 1.5 for females, respectively.
So, what happened here? How did we get such different results when we considered sex versus when we did not? When we ignored sex, we saw very little association, and maybe even a protective association of smoking, but when we looked at the smoking and disease data separately by sex, we saw that both sexes had largely increased risks on both the relative scale and the absolute scale.
So, recall that males in our sample were much more likely to be smokers than females, and females, conversely, were much more likely to have disease. So, the crude relative risk we got comparing the risk of disease in smokers to non-smokers has, among the smokers, an over-representation of persons with a lower risk of disease: males. So, when we took that crude ratio, comparing the risk of disease in smokers to non-smokers, the numerator was disproportionately male. I'm just going to put a bunch of M's and several F's to represent that the smokers were majority male, and in the denominator, the reverse. That is not drawn to proportion, so to speak, but just to illustrate the point. So, the numerator was over-represented by males, who were less likely to have the disease, and the denominator was over-represented by females, who were more likely to have the disease.
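One way to see the over-representation argument numerically: the crude risk in each smoking group is a weighted average of the sex-specific risks, with weights equal to that group's sex distribution. The sketch below is only an illustration built from the lecture's counts; the function names `sex_shares` and `crude_risk` are hypothetical.

```python
# (diseased, total) by sex within each smoking group, from the lecture.
smokers = {"male": (29, 160), "female": (23, 50)}
nonsmokers = {"male": (4, 40), "female": (60, 200)}


def sex_shares(group):
    """Fraction of the group made up by each sex."""
    total = sum(n for _, n in group.values())
    return {sex: n / total for sex, (_, n) in group.items()}


def crude_risk(group):
    """Crude risk written as a share-weighted average of sex-specific risks."""
    shares = sex_shares(group)
    return sum(shares[sex] * cases / n for sex, (cases, n) in group.items())


for name, group in (("smokers", smokers), ("non-smokers", nonsmokers)):
    print(f"{name:>11}: {sex_shares(group)['male']:.0%} male, "
          f"crude risk {crude_risk(group):.3f}")
```

Within each sex the smokers have the higher risk, yet the smokers' average leans heavily on the low-risk males while the non-smokers' average leans on the higher-risk females, so the two crude risks end up close together.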
So, when we compare smokers directly to non-smokers and don't take into account these differing sex distributions, the risk in the numerator is pulled down because of the higher presence of males, who were less likely to have the disease. So, what we're showing here in this example is something called Simpson's paradox: the nature of an association can change, even reverse direction or disappear, when data from several groups are combined to form a single group. So, we started with all males and females together, and we missed the relationship between disease and smoking.
Another way to say this is that the association between an exposure X, like smoking in our situation, and an outcome Y, like disease, can be confounded by another lurking or hidden variable Z, or by multiple variables Z1, Z2, etc. So, a confounder Z, or multiple confounders Z1 through Zp for example, can distort the true relationship between X and Y. This can happen if any of our confounders is related to both our exposure and our outcome.
So, just to recap, our outcome of interest was disease. We were assessing its relationship with smoking, and we had this third variable, sex, which was related to both. So, if we code this as male sex, males versus females, it was negatively associated with disease: males were less likely to have disease. But males were more likely to smoke. So, since sex was related to both of these, it had the potential to, and ultimately did, distort our understanding of the relationship when we combined males and females together. So, in the next section, we'll talk about how to actually come up with a single measure that removes the distortion caused by that imbalance in the sex distribution between smokers and non-smokers.
So, here's another example of confounding in action. Here's an observational study, and we've looked at these data before, to estimate the association between arm circumference and height in Nepali children.
So, 150 randomly selected children, 0-12 months old, had their arm circumference, weight, and height measured. This is certainly an observational study: our exposure of interest is height, and it's not possible to randomize subjects to height groups. The data look like this: here are the ranges of values for arm circumference, height, and weight in these data. So, you'll recall from the simple linear regression section that we had done this analysis, and between arm circumference and height there was a positive association. Here's our scatter plot of the arm circumference values versus the height values in the Nepali children, with the estimated regression line superimposed on the graph. We know that there was a relatively high correlation there, and it was a positive association.
But as you might suspect, weight, this third measure, is certainly associated with both arm circumference and height, that is, with both our outcome, arm circumference, and our primary predictor, height. So, here's a scatter plot of the relationship between arm circumference and weight, with a regression line relating arm circumference to weight superimposed. The R-squared for that association is 0.7, so it's a positive association and the correlation is relatively high. And if we look at height, our predictor of interest, versus weight as well, we see a strong positive association, with an R-squared of 0.85. So, weight is certainly related to both arm circumference and height.
So, if we take weight into account, we may get a different understanding of the relationship between arm circumference and height. So, here's a scatter plot of arm circumference by height after adjusting for weight. One way to think of what we have on this graph is that it shows the relationship between arm circumference and height among persons of the same weight. So, now there's no longer variability in these data in terms of weight values; we're considering only persons of the same weight.
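To illustrate what "among persons of the same weight" means, here is a small Python sketch on made-up numbers. These are not the Nepali data: the values, the coefficients, and the `slope` helper are all assumptions chosen only so that arm circumference rises with weight, height rises with weight, and taller children of a given weight have slightly smaller arms. It compares the crude regression slope of arm circumference on height with the slope computed within each weight group.

```python
def slope(xs, ys):
    """Least-squares slope of y on x (simple linear regression)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical children: weight w (kg) and a height offset d (cm) around
# the typical height for that weight. All numbers are invented.
rows = [
    (w, 40 + 4 * w + d, 8 + w - 0.15 * d)   # (weight, height, arm circ.)
    for w in range(4, 10)
    for d in (-3, 0, 3)
]

# Crude slope: every child pooled together, weight ignored.
crude_slope = slope([h for _, h, _ in rows], [a for _, _, a in rows])

# Slope of arm circumference on height within each weight group.
within_slopes = {
    w: slope([h for ww, h, _ in rows if ww == w],
             [a for ww, _, a in rows if ww == w])
    for w in range(4, 10)
}

print(f"crude slope of arm circumference on height: {crude_slope:+.2f}")
for w, s in within_slopes.items():
    print(f"  weight {w} kg: slope {s:+.2f}")
```

In this toy setup the pooled slope is positive while every within-weight slope is negative, the same qualitative pattern as the adjusted scatter plot described above. How to get a single adjusted slope from real data is the subject of the later multiple regression lectures.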
So, think about that for a moment. Now the relationship between arm circumference and height is negative in nature. Does that make some sense? I want you to think about this. If we restrict our assessment to persons of the same weight, would it make sense, within those weight groupings, to have a negative association between arm circumference and height? Just chew on that, and we'll come back and talk about it in the next section.
Something to consider for those of you who are lab-based scientists: confounding is a big issue in laboratory studies, and it has only recently, in the past 10-15 years, been talked about and accounted for. It's something you might call batch effects in lab-based analyses. So, lab-based results can be influenced by the technician, the laboratory used, the time of day, or the temperature in the lab. If the goal of the study is to ascertain differences in lab measurements between groups, for example between diseased and non-diseased groups, and the group is associated with at least some of the above characteristics, the technician, the laboratory, etc., then there can be confounding.
The most egregious examples, the most difficult to understand, are something like this: a quantity or quantities were measured on a group of patients with disease in Lab 1 and, in a case-control setup, on a group of patients with no disease in Lab 2 at a different time, and differences were found in some gene expressions between the diseased and the non-diseased. But because of the setup, it was impossible to disentangle whether the disease results in, or is associated with, a different gene expression than in those without disease, or whether this was an artifact of using two different labs. So, lab-based researchers have become more cognizant of these issues.
These ideas mean not only that such factors need to be adjusted for whenever possible, but they also give rise to some changes in how researchers might do the study design, where they might, for example, randomize samples from the diseased and the non-diseased to one of the two labs, so that there is variability in the outcome types in both labs because of the randomization. We'll get into talking about the role of randomization in reducing or minimizing the potential for confounding, and we'll talk about that in this situation as well.
So, in summary, in non-randomized studies, outcome-exposure relationships of interest may be confounded by other variables. In such a situation, the relationship between the outcome and the exposure differs after taking into account the confounder or confounders of note. In order to confound an outcome-exposure relationship, a variable must be related to both the outcome and the exposure.
So, in the next section, we'll show how to extend what we did here, when we broke things out into separate groups and estimated the overall relationship separately, for males and females for example, or for different weight groups. We'll show how to summarize that in a single number called the adjusted estimate of the association.
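Returning to the lab example, the design change mentioned above, randomizing samples from the diseased and non-diseased groups across the two labs, could be sketched as follows. This is a minimal illustration under stated assumptions: the sample IDs, the group sizes of 20, the seed, and the choice to shuffle within each disease group and alternate labs (so each lab receives half of each group) are all hypothetical, not details given in the lecture.

```python
import random
from collections import Counter


def randomize_within_group(sample_ids, labs, rng):
    """Shuffle one group's samples, then deal them out to the labs in turn."""
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    return {sid: labs[i % len(labs)] for i, sid in enumerate(shuffled)}


rng = random.Random(2024)          # fixed seed so the allocation is reproducible
labs = ("Lab 1", "Lab 2")

# Hypothetical sample IDs: 20 diseased (D..) and 20 non-diseased (N..).
diseased = [f"D{i:02d}" for i in range(20)]
non_diseased = [f"N{i:02d}" for i in range(20)]

assignment = {}
assignment.update(randomize_within_group(diseased, labs, rng))
assignment.update(randomize_within_group(non_diseased, labs, rng))

# Tabulate disease status (first letter of the ID) against lab.
counts = Counter((sid[0], lab) for sid, lab in assignment.items())
for (status, lab), n in sorted(counts.items()):
    print(f"{status} samples in {lab}: {n}")
```

With an allocation like this, disease status is no longer tied to the lab, so a difference between diseased and non-diseased samples cannot simply be an artifact of which lab processed them.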