In this module, we're going to talk about omitted variables and the challenge they pose to making claims about cause and effect in observational studies. The major problem with omitted variables is that if we fail to account for them, the relationships we observe in observational data may turn out to be spurious. So the most serious problem that we face in observational studies is really ruling out a role for omitted variables in accounting for an observed relationship.
So we think about having an X variable and a Y variable, and we would like to assess whether X has some cause and effect relationship with Y. The big problem we have is that it's always possible in an observational study that there are other variables out there that are influencing both X and Y. If that is the case, then the relationship that we think we observe between X and Y may be spurious, the product of their common association with those omitted variables.
Economists refer to this problem as endogeneity. In fact, if you've sat in on talks by economists, especially applied economists, or read their papers, you've probably heard them refer to endogeneity problems. So basically, if an X variable that's hypothesized to influence Y is itself influenced by some other variable that also influences Y directly, X is endogenous to Y, and economists recognize this as a serious problem for making claims about a causal relationship between X and Y. And again, they recognize that the observed association between X and Y may not really reflect a causal relationship.
Let's think about examples of omitted variables that could cause problems and make it difficult to make a claim about cause and effect. So let's think about the example of education and income. Generally, we'd like to assume that education influences income. That's probably why we go on to advanced study.
However, we know that there are all sorts of things that influence people's education and that also directly influence the income they have as adults. So for example, their family background or their personality may influence the amount of education they get, and it may also influence the sorts of opportunities that they get when they grow up, in ways that have nothing to do with education. For example, very ambitious, very hardworking people may, on the one hand, be more likely to gain more education, but even in the absence of that education they may be able to find jobs that give them a higher income. So if these background effects, family, personality, and so forth, have strong effects on both education and income, we might see a relationship between education and income even though education doesn't actually cause income.
Another area where we need to worry about spurious relationships created by the common influence of an omitted variable is the relationship between migration and income, when we're trying to figure out whether people benefit by migrating. Do they actually increase their incomes? Or do they end up with lower incomes? The big problem we have is that there may be personality characteristics that influence both the chances of migrating and people's subsequent income. So if people like to engage in hard work, if they're especially risk taking, they may be more likely to migrate, and, no matter what, they may end up with higher income. Even if these same people didn't migrate, perhaps they would do well anyway, because wherever they were they would take risks, they would work hard, and they would end up with a higher income. So this is a common issue that people have to think about when they're looking at the relationship between migration and income.
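To make the migration example concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: the variable `drive` stands in for the unmeasured personality traits, the coefficients are arbitrary, and migration is given no causal effect on income at all. The point is only that a naive comparison of migrants and non-migrants can still show a large income gap.

```python
import numpy as np


def simulate_migration(n=100_000, seed=0):
    """Simulate people whose unobserved 'drive' affects both migration and income.

    All names and numbers here are hypothetical, chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    # The omitted variable: willingness to work hard and take risks.
    drive = rng.normal(size=n)
    # People with more drive are more likely to migrate.
    migrated = (drive + rng.normal(size=n)) > 0
    # Income depends on drive only. Migration has NO causal effect by construction.
    income = 10.0 + 2.0 * drive + rng.normal(size=n)
    return drive, migrated, income


def naive_income_gap(migrated, income):
    """Average income of migrants minus average income of non-migrants."""
    return income[migrated].mean() - income[~migrated].mean()


if __name__ == "__main__":
    _, migrated, income = simulate_migration()
    gap = naive_income_gap(migrated, income)
    print(f"Naive migrant vs. non-migrant income gap: {gap:.2f}")
```

In this made-up world the gap comes out clearly positive even though migrating does nothing for income. It is produced entirely by `drive`, which in a real study we would usually not observe.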
Another example of a situation where we have to worry about spurious relationships and omitted variables is the relationship between alcohol consumption and mortality. So there are some observational studies that show that people who don't drink at all have a higher risk of dying. In other words, abstinence from alcohol appears to increase the risk of dying. Now, where omitted variables come in here is that people in poor health may be, on the one hand, less likely to consume alcohol and, on the other hand, more likely to die. So the apparent relationship between abstinence and higher mortality may be driven entirely by the fact that people in poor health are more likely to abstain completely from alcohol, but also experience a higher risk of dying overall.
A related problem is that of suppressor variables. Sometimes the influence of an omitted variable on X and Y means that the observed relationship between them is actually weaker than the causal relationship. So sometimes in looking at data from developing countries, we may see that there's not much association between the placement of health clinics and mortality rates. Say, average death rates in places where there are health clinics are the same as in places without health clinics. Now this is counterintuitive, but it actually reflects the problem of endogenous program placement. Sometimes there are features of the local environment that affect the placement of health clinics. For example, we may have planners who are placing health clinics in the places where they are most needed, that is, the places where death rates are highest and health is worst. And so we may see high concentrations of health clinics in the least healthy places, and then when they have an effect, all they do perhaps is bring death rates down to be the same as elsewhere.
If we're just looking at cross-sectional data or observational data, it may appear that places with health clinics are no better off than places without health clinics when it comes to mortality rates. But in fact, it's all being driven by the local environment, which is driving both the placement of the health clinics and the mortality rates. And the apparent lack of a relationship between health clinics and mortality rates is spurious.
Now, the traditional approach to dealing with these problems with omitted variables is to try to measure and control for as many relevant variables as possible. So if you've seen people give talks where they run regressions and so forth, or if you've looked at old papers, you'll notice that people will often include long lists of variables as right-hand-side variables, to try to control for possible omitted variables. Now, this can be problematic for hard-to-measure variables like personality characteristics, or for certain variables that are inherently impossible to measure, like, say, intangible features of a household or a community or an individual that affect various outcomes.
This is why economists, as we'll learn later, and other researchers sometimes look for natural experiments, situations where we have exogenous influences on X that should have nothing to do directly with Y. And then we compare Y according to whether or not the units being studied have been affected by the natural experiment, and make an argument for cause and effect on that basis. So these are situations where X is changing as a result of some exogenous influence unrelated to Y. We're going to get into that more in the next lecture. Economists also try to isolate exogenous variation in X and measure its effects on Y with instrumental variables. Again, we'll get into that in the next lecture. So there are many solutions that have been proposed for dealing with omitted variables. Many of them are controversial, and they certainly require advanced training.
So we'll touch upon them in the next lecture, but you'll have to take more advanced courses to really understand and learn how to use them.
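As a rough illustration of the health clinic example and of the traditional approach of controlling for variables, here is a small simulation sketch. All of it is assumed for illustration: `need` is a hypothetical measure of how unhealthy the local environment is, the clinic effect of 1.1 is picked so that it roughly cancels the placement bias, and the adjustment is an ordinary least squares regression that is only possible here because the simulation lets us observe `need`.

```python
import numpy as np


def simulate_clinics(n=200_000, effect=1.1, seed=1):
    """Simulate places where planners put clinics in the least healthy areas.

    All names and numbers are hypothetical, chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    # Local environment: higher 'need' means worse underlying health.
    need = rng.normal(size=n)
    # Endogenous program placement: clinics go where need is greatest.
    clinic = (need + rng.normal(size=n)) > 0
    # Assumed truth: a clinic LOWERS the death rate by `effect`.
    mortality = 20.0 + need - effect * clinic + rng.normal(size=n)
    return need, clinic, mortality


def naive_gap(clinic, mortality):
    """Average mortality with a clinic minus average mortality without one."""
    return mortality[clinic].mean() - mortality[~clinic].mean()


def adjusted_effect(need, clinic, mortality):
    """Coefficient on clinic from a regression that also controls for need."""
    X = np.column_stack([np.ones(len(mortality)), clinic.astype(float), need])
    coef, *_ = np.linalg.lstsq(X, mortality, rcond=None)
    return coef[1]


if __name__ == "__main__":
    need, clinic, mortality = simulate_clinics()
    print(f"Naive gap: {naive_gap(clinic, mortality):.2f}")
    print(f"Controlling for need: {adjusted_effect(need, clinic, mortality):.2f}")
```

In this sketch the naive comparison comes out close to zero, as if clinics did nothing, while the regression that controls for `need` recovers something close to the assumed reduction in mortality. In real data the catch is the one described above: if the relevant feature of the local environment can't be measured, it can't be put on the right-hand side.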