Hi, in this module I'd like to talk about selection bias: problems with a sample that make it unrepresentative of the population about which we would like to generalize. Now, here, enclosed in the green line, we have a sample that's representative of the larger population: green and white figures appear within the sample in roughly the same proportion as they do in the larger population. However, here's an example of an unrepresentative sample, again enclosed in green, where we have five white figures and no green figures, even though the population as a whole includes a mixture of white and green. Somehow, in drawing this hypothetical sample, there was a problem with selection bias. The most obvious source of selection bias is a problem with sampling: sometimes, because of the way we designed the sample, it's not representative of the population about which we would like to generalize. This can happen whenever the sampling design is anything other than probability-based. For example, convenience samples, respondent-driven samples, and quota samples may be especially prone to forms of selection bias that produce an unrepresentative sample. It can also arise in probability-based sampling designs, that is, random samples, if the sampling frame, the basis on which we did the sampling, doesn't correspond to the population of interest. For example, if we're conducting a household survey and the list of households is out of date or incomplete, then the sample we draw from that sampling frame will not be representative of the actual population of households about which we would like to generalize. Selection bias can still be a problem even in a probability sample that was designed properly and used a sampling frame that corresponded well to our population of interest.
If the chances of contacting a prospective respondent depend on their characteristics, that is, if it's systematically easier or more difficult to contact particular kinds of people to ask them to participate in a survey, then we can end up with a biased sample at the end of the day. If certain kinds of people, perhaps people who are busy at work, are less likely to be at home when we knock on their doors, or less likely to own a phone or answer the phone when we're conducting a phone survey, then the sample we end up with may be unrepresentative. We have a further problem in that even after we reach respondents, their willingness to participate may depend on their personal characteristics. Some people, if asked to participate in a survey, may be more or less likely to actually agree, and this may depend on characteristics that are important to our study. One common example is the wealthy. We know from many societies that the wealthy are especially unlikely to participate in surveys. They may be hard to reach because they live in gated communities, simply don't answer their doors, or are not at home. And if we do reach them, they may be much more concerned about privacy than people with less money. There are many examples of populations that may be, for one reason or another, harder or easier to reach. Think about the possibility that elderly people who remain at home are probably easier to contact than elderly people who are still active and out fairly frequently. Another source of selection bias is differential study attrition. People may differ in their chances of dropping out of a study, and those chances may be related to characteristics that we're interested in in our research. This may be especially important in situations where participating in a study is time-consuming or otherwise demanding.
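A minimal simulation can make the contact problem concrete. Here the population, the share of "busy" people, and the contact probabilities are all hypothetical numbers chosen for illustration; the point is only the mechanism, that differential contact rates distort the sample's composition.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population: 40% of people are "busy" (e.g., long work hours).
busy = rng.random(n) < 0.40

# Assumed contact probabilities: busy people are rarely home when we call.
p_contact = np.where(busy, 0.30, 0.80)
contacted = rng.random(n) < p_contact

pop_share_busy = busy.mean()               # close to 0.40 in the population
sample_share_busy = busy[contacted].mean() # noticeably lower among those contacted

print(f"population share busy: {pop_share_busy:.2f}")
print(f"sample share busy:     {sample_share_busy:.2f}")
```

Even though every person had some chance of being contacted, the contacted sample underrepresents busy people, so any characteristic correlated with being busy will be mismeasured.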
People may have initially been enthusiastic about participation, but when they realized how long the interview would take, how demanding it might be in terms of their time or focus, or that it might include the collection of biomarkers or other invasive procedures, they may lose enthusiasm and quit. If their decision to withdraw from the study is related to the characteristics that we're interested in, then we have a problem with study attrition that may lead to selection bias. Problems with selection bias via study attrition are an especial concern in longitudinal studies, where we need to reinterview people at later points in time, perhaps a second, third, or fourth time. There we run into the problem that certain kinds of people may be harder to find again when we want to locate them for a second or third wave, or may simply be less likely to agree to participate a second or third time. One concrete example of this sort of attrition, which may matter a lot if we're conducting a longitudinal study where we need to reinterview people, is that in highly mobile populations, people who migrate are probably going to be harder to locate during follow-up. That is, if people are moving around, perhaps looking for other jobs or opportunities elsewhere, they may be harder to find two or four or six years later during the second wave of a survey. That's a problem because these people, the people who are willing to move around, may be very different from people who are sedentary. The people we're able to find again two or four or six years later, because they're at the same address, may be people with very stable jobs, which is unusual in its own way, or they may simply be less ambitious or less risk-taking, and therefore different from the people we can't find.
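The mobility example can be sketched with a toy two-wave simulation. The trait name ("risk tolerance"), the moving probabilities, and the recontact rates are all made-up numbers for illustration; the mechanism is that when the trait drives both moving and being lost to follow-up, the wave-2 sample drifts away from the wave-1 sample.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical wave-1 respondents with a standardized "risk tolerance" trait.
risk_tolerance = rng.normal(0.0, 1.0, n)

# Assumption: more risk-tolerant people are more likely to move away,
# and movers are much harder to find again at wave 2.
p_move = 1.0 / (1.0 + np.exp(-risk_tolerance))  # probability of moving rises with the trait
moved = rng.random(n) < p_move
p_found = np.where(moved, 0.40, 0.90)
found_wave2 = rng.random(n) < p_found

wave1_mean = risk_tolerance.mean()               # near 0 by construction
wave2_mean = risk_tolerance[found_wave2].mean()  # shifted toward the sedentary

print(f"wave 1 mean trait: {wave1_mean:.3f}")
print(f"wave 2 mean trait: {wave2_mean:.3f}")
```

The reinterviewed sample is systematically less risk-tolerant than the original one, which is exactly the kind of drift that heavy investment in tracking and recontact is meant to limit.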
So that can be an issue related to study attrition when we're conducting a longitudinal study, and people doing longitudinal studies therefore put a lot of effort into recontact, tracking down their subjects from the first wave so that they can be reinterviewed later. Another, more subtle form of selection bias is associated with selection into the study population that makes it different from the actual population about which we would like to generalize. Let's imagine that we're interested in measuring the effect of some X on some Y, and take the example of the effect of college entrance examination scores on performance in college, measured by college GPA. If we're doing an analysis of college GPA and its relationship to college entrance examination scores, we run into problems, because the people who were actually available to have a GPA we could include in our study, the people selected into college, were selected based on their X, that is, their performance on the college entrance examination. Even though we would actually like to generalize to all students, including what the GPA would have been for students who took the entrance examination but, based on their scores, were not admitted to college, in fact we can only measure the relationship among the people who were admitted. Let's think about why that's a problem. If we're trying to figure out whether entrance examination scores predict college performance, and we want to use college GPA as the outcome, we of course only have a GPA for enrolled college students. We don't have a GPA for all of the people who, because of their low examination scores, were not admitted.
So we have a situation where we can measure the relationship between entrance scores and college performance only for a subset of the larger population of all potential students, namely those who actually went to college, and that doesn't include the people who were excluded because of their poor scores. The problem is that it's entirely possible that the people whose scores were poor enough to keep them out of college might not have done well had they been admitted. At some level, we're only looking at people who scored at least well enough to get into college. So when we compute a relationship based only on those people, the ones who passed the threshold, the apparent relationship may be weaker than the one we would get if we could somehow also examine the college GPAs that the students who didn't go to college would have had, had they gone. We could really only do the, you might say, perfect study of the predictive validity of entrance examination scores for college performance if, after allowing students to take the entrance examination, we in fact admitted students to college at random, and then looked at their GPAs in relation to their entrance examination scores. Another form of selection bias I'd like to talk about, closely related to the one we just discussed, is selection on an outcome. We have a common problem with selection bias when the outcome we're most interested in can only be measured for a subset of the population, and membership in that subset is influenced by our X variables. Let's think about an example: the relationship between education and wages or salary. To have a wage or a salary, you actually have to be employed. The unemployed may have income, but they won't have a wage or a salary.
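The attenuation from selecting on X can be shown with simulated data. The true score-GPA correlation and the admission cutoff below are invented for the sketch; the point is that restricting the sample to high scorers weakens the observed relationship relative to what a random-admission study would find.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical students: entrance scores predict GPA in the full population.
score = rng.normal(0.0, 1.0, n)
gpa = 0.6 * score + 0.8 * rng.normal(0.0, 1.0, n)

# Admission cutoff: only the top-scoring half enrolls, so only they have a GPA.
admitted = score > np.quantile(score, 0.5)

# What a random-admission study would see vs. what we can actually observe.
r_full = np.corrcoef(score, gpa)[0, 1]
r_admitted = np.corrcoef(score[admitted], gpa[admitted])[0, 1]

print(f"correlation, all test takers:    {r_full:.2f}")
print(f"correlation, admitted students:  {r_admitted:.2f}")
```

The correlation among admitted students is substantially smaller than in the full population, even though the underlying relationship is identical for everyone; truncating the range of X mechanically shrinks the observable association.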
The problem we run into is that education not only influences wages and salaries, it also influences the chances of being employed and therefore of having a wage or salary to measure. So when we compute an association between wage or salary and education, we're only looking at a subset of the larger population, and we're missing the wage or salary that the currently unemployed would have if, in fact, they were employed. This gets complicated because, for example, it may be that the least educated people who are employed are different in some special way. Maybe they were harder-working, more ambitious, or had other skills that distinguished them from similarly educated people who couldn't find a job, so they may actually have fairly decent wages or salaries. Whereas people with the same level of education who remain unemployed, if they found a job, might end up with a much lower wage or salary. This is a complex problem that has emerged in a number of areas. There are proposed solutions, like corrections for selection, and you can learn about those in an advanced econometrics class. But for the time being, what we're trying to do here is just alert you to the various problems that can arise with selection bias.
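Selection on the outcome can also be illustrated with a toy simulation. All the numbers here, the true education effect, the role of an unobserved "ability" trait, and the employment rule, are invented assumptions; the mechanism is that when wages are observed only for the employed, and employment depends on both education and ability, a simple regression of wage on education among the employed understates the true effect.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Hypothetical workers: wages depend on education and on unobserved "ability".
education = rng.normal(0.0, 1.0, n)
ability = rng.normal(0.0, 1.0, n)
true_effect = 1.0
wage = true_effect * education + 1.0 * ability + rng.normal(0.0, 0.5, n)

# Assumption: both education and ability raise the chance of being employed,
# and a wage is only observed for the employed.
employed = education + ability + rng.normal(0.0, 1.0, n) > 0.5

# OLS slope of wage on education, using everyone vs. only the employed.
slope_full = np.polyfit(education, wage, 1)[0]
slope_employed = np.polyfit(education[employed], wage[employed], 1)[0]

print(f"slope, everyone (unobservable in practice): {slope_full:.2f}")
print(f"slope, employed only:                       {slope_employed:.2f}")
```

Conditioning on employment makes low-education workers an unusually high-ability group, just as the text describes, so the observed education-wage slope is pulled well below the true effect even though nothing is wrong with the measurements themselves.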