So in the previous two sections, we estimated empirically the sampling distributions of sample means, proportions, and incidence rates via computer simulation, where we were able to take multiple random samples from a theoretical computer-based population. That was just to illustrate some of the common properties of these distributions that we'll formalize in this section. But what we'll generally be doing in real life is not taking multiple samples from a population; we're only going to take one from each population of interest. So, if we're going to use characteristics of the sampling distribution to help us extrapolate from our sample to the population, we're going to have to be able to estimate characteristics of the sampling distribution from a single sample of data. So, let's start talking about how to do this in this section. Upon completion of this lecture section, you will be able to explain the Central Limit Theorem, sometimes called the CLT for short, with regard to the properties of theoretical sampling distributions; estimate the variability of the sampling distribution for sample means and for sample proportions using the results from a single random sample; and begin to appreciate how an estimated sampling distribution can allow us to incorporate the uncertainty in an estimate into the story of what's going on with the unknown truth at the population level. So, starting with our estimate and adding in some uncertainty to potentially create a range of plausible values for the unknown truth. So, let's talk a little bit about "real life research". Again, in the previous sections, we showed the results of computer simulations to illustrate some general properties of sampling distributions. But in real life research, generally, only one sample can be taken from each population under study.
So how can we use the results of the single sample we have to estimate that, if you will, behind-the-scenes theoretical sampling distribution of a sample statistic, the one we're calculating from our single sample of data? And if we can characterize this distribution, how can we use it to help us? So let's talk about some generalities we saw from the simulations. Regardless of what type of data we were summarizing, whether it be continuous, binary, or time-to-event, with the appropriate sample statistics, whether means for continuous data, proportions for binary data, or incidence rates for time-to-event data, the resulting estimated sampling distribution from simulation was generally symmetric. In other words, the distribution of the sample estimates across multiple random samples of the same size, when we looked at a histogram of them, was generally symmetric and approximately normal regardless of the size of the sample each statistic was based on. Generally, these estimated sampling distributions were centered at the truth; that is, the average of the 5,000 sample-based estimates we had came out to be the true value of the population-level quantity being estimated by those statistics. So for example, the average of the 5,000 sample mean estimates of the underlying population mean was in fact the underlying population mean. Across all these distributions, we saw that the variability in the sample statistics from sample to sample systematically decreased the larger the sample each estimate was based upon. There's a mathematical theorem that generalizes these properties called the Central Limit Theorem, and many times this will be referred to by its initials, the CLT. Basically, the Central Limit Theorem states that the theoretical sampling distribution of a sample statistic, were we to take an infinite number of random samples of the same size and plot the distribution of the statistics across the samples, will be approximately normal.
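The simulation pattern just described can be sketched in a few lines of Python (NumPy assumed; the skewed exponential "population" here is an arbitrary choice, made to show that the sampling distribution of the mean becomes approximately normal and tighter with n even when individual values are not normal):

```python
import numpy as np

rng = np.random.default_rng(0)

# A skewed "population" of 100,000 individual values (exponential, mean ~10)
population = rng.exponential(scale=10, size=100_000)

for n in (10, 100, 1000):
    # 5,000 sample means, each based on a random sample of size n
    means = np.array([rng.choice(population, size=n).mean()
                      for _ in range(5000)])
    # Center stays near the true population mean; spread shrinks as n grows
    print(n, round(means.mean(), 1), round(means.std(), 2))
```

A histogram of `means` for any one value of n would show the symmetric, approximately normal shape described above.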
On average, the average of our estimates will be the true population-level value being estimated, and the variability in these estimates will be a function of both the variation in the individual values in the population, the standard deviation of the individual population values, and the size of the sample each statistic is based on. This variability in sample statistics across multiple samples of the same size is called the standard error of the statistic. So basically, the Central Limit Theorem or CLT says, "Look, I can tell you what would happen if you were to take an infinite number of samples of the same size from a single population, compute the sample statistics on these samples, and then plot a histogram of your estimates." It says, if you were to do this, take multiple random samples of the same size from the same population and look at the distribution of the sample estimates across these multiple samples, then if we did a histogram and drew a smooth curve over it, it would be approximately normal, centered at the true value of whatever each statistic was estimating. So, if we had sample means, it would be centered at the true mean. If we had sample proportions, the distribution of sample proportions would be centered at the true proportion. And the variability in these values will be a function of the variability of the individual values in the population and the size of the sample each estimate in the histogram is based upon. So for example, suppose we were looking at samples of size n taken from a population with mean mu and standard deviation sigma, and we plotted a histogram of multiple sample means based on random samples of the same size n. This histogram consists entirely of x-bars, each based on a sample of size n. So, across an infinite number of x-bars, the average of these estimates would be the true population mean.
Then the variability in these x-bars from sample to sample would be a function of the variability of the individual values in the population and the size of each sample. I'm going to put this out here now, and we'll formally do this in the next set of lectures, but this theoretical variability in these x-bars is actually equal to the standard deviation of the individual measurements in the population divided by the square root of the sample size. Again, we'll formally review this in lecture set seven, but this quantity, the variability in the sample means across samples of the same size, is called the standard error. So the standard error of a sample mean based on samples of size n is given by the true variability of individual values in the population divided by the square root of the size of the sample each x-bar is based on. So we have a conundrum here, right? We only have one sample. We estimate the sample mean, but in order to quantify the potential variation in sample means across samples of the same size, we need to know the population standard deviation. I can't think of many situations where we wouldn't know the population mean but would somehow know the population standard deviation. Luckily, we have a quick fix to help us move forward. We don't know sigma, but what we're going to see is that when we estimate this based on the single sample, we can substitute sigma with our sample-based estimate s and get an estimated standard error for sample means based on samples of the size we have. So for example, recall we had a single sample of 113 men where the mean blood pressure in these 113 men was 123.6 millimeters of mercury, and the standard deviation of these 113 values was 12.9 millimeters of mercury. So let's see what we can ascertain about the sampling distribution of sample means, each based on 113 men, from samples of this population of men.
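Using the numbers just given for the blood pressure sample, the substitution of s for sigma can be sketched in plain Python; the estimated standard error is simply s divided by the square root of n:

```python
import math

n = 113        # sample size
xbar = 123.6   # sample mean systolic blood pressure (mm Hg)
s = 12.9       # sample standard deviation (mm Hg)

# Estimated standard error of the sample mean: s / sqrt(n)
se_hat = s / math.sqrt(n)
print(round(se_hat, 2))   # prints 1.21 (mm Hg)
```

Notice how much smaller this is than the individual-level variability of 12.9 mm Hg: sample means vary far less from sample to sample than individual measurements do.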
Well, the Central Limit Theorem tells us, without our even having any data, that if you were to plot multiple x-bars, each the mean systolic blood pressure from a random sample of 113 men from this population, the distribution of these means across an infinite number of random samples, each based on 113 persons, would be approximately normal. While these would vary in value, the average value of all of these sample means would be the true population mean blood pressure, and the variability in the sample means from sample to sample, the true standard error for means based on an n of 113, would be equal to the population variability in blood pressures divided by the square root of 113. Well, we don't know that population variability, but what we can do is estimate the standard error of means based on samples of 113 by replacing the unknown sigma with our best sample estimate, s, the sample standard deviation. So, from this single sample, we can ascertain all of this by using both the information in the Central Limit Theorem about the normality of the theoretical sampling distribution and the fact that it's centered at the true mean, and we can estimate the variability in sample means based on 113 using this result. So, we can fully characterize, or estimate, the sampling distribution of sample means based on 113 using the results from only one random sample of 113. Similarly, remember we had a large sample of 12,928 lengths of stay, in days, from a one-year sample of patients in the Heritage Health Plan, where the sample mean was 4.3 days and the sample standard deviation was 4.9 days. So what do we know about the theoretical sampling distribution of sample mean lengths of stay from all possible random samples of 12,928 from a population of Heritage Health patients?
Well, the Central Limit Theorem tells us right off the bat, before we've collected any data, that if we did a histogram of all of these sample means based on different random samples, it would be approximately normal; the mean of all our sample means would be the true population mean length of stay; and the variation in these sample means from sample to sample of 12,928 persons, the true standard error, with no hat on it, would be equal to the individual variation in the population-level length of stay values divided by the square root of 12,928. We only have one sample; it's not the entire population, per se, if we consider it to be one manifestation of a random process. So our estimated standard error based on the single sample we have would be the sample standard deviation of 4.9 days divided by the square root of 12,928. We can do similar things for proportions, and again, we'll formally lay this out in lecture set seven, but I'm trying to give you a sense of what's to come here. So let's look at the theoretical sampling distribution of sample p-hats, based on samples of size n taken from a population with proportion p. Well, the Central Limit Theorem tells us, without collecting any data, that the distribution of these proportions across infinite random samples of the same size n should be approximately normal, and the average of our proportion estimates across all these samples should be the true population proportion. Furthermore, the variability in these estimates from sample to sample should be a function of the variability of the binary values in the population and the sample size. So the theoretical standard error for sample proportions based on samples of size n is given by the square root of the true proportion times one minus that proportion, divided by the sample size n.
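The length-of-stay standard error estimate described above works out as a quick sketch (numbers from the Heritage Health sample just given):

```python
import math

n = 12_928   # number of lengths of stay in the sample
s = 4.9      # sample standard deviation of length of stay (days)

# Estimated standard error of the sample mean length of stay
se_hat = s / math.sqrt(n)
print(round(se_hat, 3))   # prints 0.043 (days)
```

With such a large n, sample mean lengths of stay would barely vary from sample to sample, even though individual stays vary by about 4.9 days.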
So you may remember I mentioned this briefly, but that numerator is essentially the standard deviation of the binary values in the sample or population; it is a function of the proportion of yeses and the proportion of nos. We said that number itself wasn't that illuminating about the sample, but here, again, we have a standard error, just like we did with means, that depends on the variability of individual values in the population and the size of the samples we're looking at. Again, we have a conundrum here. We do not know the population proportion, and therefore we're estimating it with p-hat from an imperfect subsample of the population. Yet in order to quantify the uncertainty in our estimate p-hat, we need to know the underlying population proportion. But of course, if we knew the underlying population proportion, we wouldn't be estimating it with an imperfect estimate. So here's the deal: just like we saw with sample means, we're going to estimate the standard error of the sample proportion by replacing that population-based quantity in the formula, p, with our best sample estimate, which would be p-hat. So you can see here that the uncertainty in the sample statistic is a function of both the sample statistic estimate itself and the size of the sample. So for example, let's look at our maternal-infant HIV study for a moment. You may recall this was the study where 363 pregnant women had been randomized to either get AZT during their pregnancy or to get a placebo, and 180 of these women were given AZT. We had a single sample of the births to these 180 women who were given AZT, and in 13 of the 180 births the children were diagnosed as HIV positive within six months of birth, for a sample proportion of about seven percent.
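Substituting p-hat for p as just described, the estimated standard error for the AZT sample proportion can be sketched as:

```python
import math

n = 180        # women randomized to AZT
events = 13    # children diagnosed HIV positive within six months

p_hat = events / n   # sample proportion, about 0.072

# Estimated standard error: sqrt(p_hat * (1 - p_hat) / n)
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
print(round(p_hat, 3), round(se_hat, 3))   # prints 0.072 0.019
```

Note that both the estimate p-hat and the sample size n appear in the formula, so the uncertainty depends on both.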
Had we taken a different random sample of HIV positive pregnant women and randomized them, even if we again ended up with 180 given AZT, depending on the sample of mothers taking AZT, we might get different sample estimates of this proportion. So if we were to replicate this study an infinite number of times, where we randomized 180 pregnant women to get AZT, and we took the proportion of children in each sample who developed HIV and plotted this infinite number of proportions across the infinite number of samples, each of size 180, the Central Limit Theorem tells us that if we did a histogram, the distribution of these proportions from sample to sample would be approximately normal, and it would average out to the true population-level proportion had all HIV positive pregnant women been given AZT. Furthermore, it says that while we only have one sample here and one estimate, we can still estimate the uncertainty in sample proportions based on random samples of 180 HIV positive pregnant women. Remember, the theoretical standard error, which we can't observe, was a function of the unknown truth p, the proportion of children in the population who would develop HIV. But we can substitute p-hat where we had p before and estimate the standard error for samples of 180 with the information in this one sample. So we can completely estimate a characterization of the sampling distribution of a proportion across an infinite number of random samples of 180 women, based on a single sample of 180 women. That's the magic and amazing result that is the Central Limit Theorem. So you may say, well, how's that going to help us? Whether it's a mean or a proportion (I haven't yet shown you how to do incidence rates), if I have an estimate of some unknown truth and some sense of how variable estimates like mine are around the truth, I still don't know the truth. And that's absolutely correct. But let's think about what the Central Limit Theorem tells us.
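This thought experiment of replicating the study can itself be sketched as a simulation (NumPy assumed; a hypothetical true proportion of 0.07 is plugged in here purely for illustration, since the real population value is unknown):

```python
import numpy as np

rng = np.random.default_rng(1)
p_true, n = 0.07, 180   # hypothetical truth; 180 women per replicated study

# 10,000 replicated "studies": each counts HIV+ births out of 180,
# then converts the count to a sample proportion
p_hats = rng.binomial(n, p_true, size=10_000) / n

# Center lands near the assumed truth; spread lands near
# the theoretical standard error sqrt(p(1-p)/n), about 0.019
print(round(p_hats.mean(), 3), round(p_hats.std(), 3))
```

Of course, in real life we get exactly one of these 10,000 p-hats, which is why being able to estimate the spread from a single sample matters.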
It says, look, the distribution of estimates around their truth is approximately normal, and the center of this distribution is the unknown truth. And what do we know about the percentage of values falling under a normal curve as a function of standard units? Here the standard error measures the variability in the sample statistics, and a standard error is just a form of standard deviation, one quantifying variability in summary statistics rather than individual values, so the same properties apply. If we went plus or minus two standard errors, 95 percent of the estimates we could get just by chance would fall within plus or minus two standard errors of this unknown truth. You might say, well, great, that doesn't help me because I don't know the truth; and if I did know the truth, I wouldn't care about a potential range of values for imperfect estimates of it. And I agree with you. But we won't know the truth, and in fact, all we will have is one single point under the sampling distribution, the sample estimate from one random sample amongst infinite possibilities. Here's the thing: our estimate could be way out here, our estimate could be way down here, our estimate could be right in the middle; we'll never know. So if this were, for example, continuous data, we could get a sample mean estimate that's up here, or one that's here, or one that's here; we'll never know where it falls under this curve. But let's take what we saw before and spin it. Ninety-five percent of the time, I'm going to get a sample estimate that falls within plus or minus two standard errors of the unknown truth. So, if I take my estimate and add plus or minus two standard errors, which we can estimate, as we saw, based on a single sample, I will get an interval that contains the unknown truth most of the time: 95 percent of the time.
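Putting the pieces together for the blood pressure sample, here's a sketch of "estimate plus or minus two estimated standard errors" (this previews the confidence interval idea; the formal treatment comes in the next lecture set):

```python
import math

n, xbar, s = 113, 123.6, 12.9   # blood pressure sample from this section

se_hat = s / math.sqrt(n)       # estimated standard error, about 1.21 mm Hg

# Interval: sample mean plus or minus two estimated standard errors
lower = xbar - 2 * se_hat
upper = xbar + 2 * se_hat
print(round(lower, 1), round(upper, 1))   # prints 121.2 126.0
```

Intervals built this way, from many different random samples of 113 men, would contain the unknown true mean about 95 percent of the time.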
That is called the 95 percent confidence interval. I'm just putting this idea out there for now; we're going to formally break this down through a bunch of examples in the next set of lectures. But I just want to plant a seed in your head and give you an idea of why the Central Limit Theorem is really a useful tool. So again, in real life research, only one sample will be taken from each population being studied. The sampling distribution for the sample summary measure of interest, whether it be a mean, proportion, or incidence rate, can be estimated by coupling the results of the Central Limit Theorem with information from the single sample from the population of interest, and ultimately this process will enable the creation of an interval that gives a range of possibilities for the unknown population-level value of the quantity being estimated, the true mean or true proportion. And again, for a single sample of n observations, we can estimate the standard error of sample means based on samples of that size by taking the sample standard deviation divided by the square root of the sample size, and for proportions, we can do this by taking a function of the sample proportion divided by the sample size. We'll review this in lecture set seven, and we'll also show you how to do this for incidence rates as well.