So in this section, we'll give some examples of the sampling distribution of the sample mean, based on some computer simulations. In real life, we can't operationally take multiple samples of the same size from the same population, compute multiple sample means, and look at their distribution. But when we have a computer and a starting population, we can do this. While we can't take an infinite number of samples, we can still take a large number and look at the variability in their means. So upon completion of this lecture section, you will be able to describe the sampling distribution of a sample mean in terms of its composition, and comment on some characteristics of sampling distributions for sample means that we'll demonstrate empirically, including the general shape of these distributions, where they tend to be centered, and the relationship between the variability of the means in these distributions and the sample size each mean is based on.
So to start, let's just look at two random samples taken from a population of US adults, looking at the height distribution. One of these samples is based on 20 observations, the other is based on 50. As we know, there's no way to systematically predict which of these samples will have the larger mean and standard deviation. In this case, the sample of size 20 had a slightly lower mean height than the sample of size 50, and also a slightly lower standard deviation. While 20 isn't quite enough to get a good sense of what the underlying population distribution may be, we can start to see some evidence of basic symmetry. This gets fleshed out a little more in the sample of 50, where we see a roughly symmetric, bell-shaped curve.
What I'm going to do now is take these two sample sizes, and one more, and for each I'm going to take multiple samples of that size and compute multiple sample means. I'm doing this via computer, and I'm going to plot not the distribution of the individual height values in any of the samples, but the distribution of the sample means across the samples. So this histogram here shows the distribution of 5,000 sample mean heights, based on 5,000 samples each of size 20. Each value in this histogram is a single mean, based on a random sample of n equals 20. So again, the points in this histogram are not individual height values, but sample means, each based on 20 persons. You can see the distribution of these sample means tends to be symmetric and bell-shaped.
But now let's contrast this with what happens when we do the same thing, but each of our 5,000 samples is of size 50 persons. Each point in this histogram is a sample mean based not on 20 people, but on 50 people. This is put on exactly the same scale as the previous histogram. You'll notice it's symmetric and bell-shaped, and if you can remember what the previous one looked like, this distribution has less variability; it tends to be narrower. So I'll go back: here's the distribution of 5,000 means, each based on 20 people. Here's that distribution with each mean based on 50 people, and you can see that the distribution of means based on 50 persons is narrower.
Let's do it one more time. I'm going to take 5,000 samples, each of 150 persons, and plot the 5,000 means. So each point here is a sample mean based on 150 persons. Again we get a symmetric, bell-shaped distribution of the sample means, but it's less variable than the previous two. I'll just run through that one more time: here's the distribution of 5,000 sample means based on 20 people each, based on 50 people each, and based on 150 people each.
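A simulation like the one just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the code behind the lecture's figures. The population mean of 167 cm is the value quoted later in the lecture, but the normal shape, the standard deviation of 10 cm, the population size of 200,000, the random seed, and the helper name `simulate_sample_means` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical stand-in for the simulated population of heights (cm).
# Mean 167 is quoted in the lecture; the normal shape, SD of 10,
# and population size are assumptions for this sketch.
population = rng.normal(loc=167.0, scale=10.0, size=200_000)


def simulate_sample_means(population, n, n_samples, rng):
    """Draw n_samples random samples of size n and return their means.

    Sampling is done with replacement, which is a close approximation
    here because the population is very large relative to n.
    """
    samples = rng.choice(population, size=(n_samples, n), replace=True)
    return samples.mean(axis=1)  # one mean per sample


# Estimated sampling distributions for the three sample sizes.
results = {}
for n in (20, 50, 150):
    means = simulate_sample_means(population, n, 5_000, rng)
    results[n] = means
    print(
        f"n = {n:>3}: mean of the 5,000 means = {means.mean():.2f}, "
        f"SD of the 5,000 means = {means.std(ddof=1):.2f}"
    )
```

Each entry of `results` holds 5,000 sample means, so a histogram or box plot of `results[20]`, `results[50]`, and `results[150]` on a common scale would show the narrowing described above.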
Let me show you this in side-by-side box plot form. This box plot is the 5,000 means each based on a sample of size 20, this is the 5,000 means each based on a sample of size 50, and this is the distribution of the 5,000 means each based on a sample of size 150. You can see that the center of these distributions, the median, is very similar across the three sample sizes, but the variation in the values decreases the larger the sample each mean is based on.
I created a theoretical population of US persons and their heights on my computer. The true mean of those heights in the population is 167 centimeters. So here are the means and standard deviations of the sample means in each of those simulated sampling distributions. You'll notice that the mean of the 5,000 sample means, for means based on 20 people, means based on 50 people, and means based on 150 people, are all equal, and they're all equal to the underlying true population mean. The standard deviation, which soon we will call the standard error, measures the variation not of individual height values in any one sample, but of the 5,000 sample means in our estimated sampling distribution. You can see numerically, just as we saw visually, that this decreases the more information each of the means is based on. So the standard deviation of our 5,000 sample means decreases with increasing sample size.
I just want to note that these are simulations. I'm trying to estimate a process that theoretically describes the distribution of an infinite number of means across an infinite number of samples. I can't possibly do that in one lifetime.
So I chose 5,000 means to estimate this theoretical distribution, but I just wanted to note that if I took more samples and computed more means, that would not systematically change these estimated sampling distributions. The thing that affects the variability in the means from sample to sample is not the number of samples I take, but the size of the sample that each mean is based on. So here on the left side, these are the box plots for 5,000 sample means, based on 5,000 random samples of size 20. Next to it is the distribution of 10,000 means based on 10,000 samples of size 20. You can see those distributions are very similar. In the middle here, we have side-by-side distributions of sample means based on 50 persons each. The first one shows the distribution of 5,000 such means, the second one shows the distribution of 10,000 such means. If you look at these, they are very similar in terms of not only their center, but their variability as well. On the right here we have a similar comparison for distributions of mean heights based on 150 people each. So again, the thing that drives the decrease in variability across these graphics is not the number of means in my estimated sampling distribution, but the size of the sample that each mean is computed on.
Let's look at another example. Remember our length of stay data on Heritage Health enrollees who were hospitalized at least once. I'll look side-by-side here at two different samples of two different sizes. Here's a random sample of 50 patients, and you can see that even though there are only 50 patients in this sample, we start to see the right skewness in the data. In this sample distribution, the mean was 4.6 days. The standard deviation of the 50 values in this sample, which describes the variation in those 50 individual lengths of stay, was 4.3 days.
In our second sample, based on a random sample of 250 persons, we get a more fleshed out picture of the right skew that we saw with the 50 observations. We get a slightly lower sample mean estimate of 3.8 days and a very similar standard deviation. The variability in the 250 values in this single sample is comparable in this case to the variability of the 50 values in the previous sample, but there would have been no way to predict that without seeing the results.
What I'm going to do now is repeat the sampling process I did with the height data. I'm going to take 5,000 samples of Heritage Health patients who had a hospital stay, and each sample is going to have 50 people in it. Then I'm going to take the mean of each of those 5,000 samples, each mean again based on 50 people, and I'll plot those 5,000 means on a histogram. You can see that despite the fact that the individual length of stay values in any one sample exhibited the right skewness we'd seen, the distribution of the means across these 5,000 samples is roughly symmetric and bell-shaped.
Let's do this again, where we take 5,000 samples but this time each sample will have 250 persons. So in this histogram here I have 5,000 mean lengths of stay, each based on a single sample of 250. Note again that the original data we have in each sample of length of stay values is heavily right-skewed. But when we present a histogram of the means taken across multiple samples of this right-skewed data, we see the distribution of the means is not right-skewed at all; it's symmetric and bell-shaped. This is on the same scale as the previous histogram, so if I flash back and forth, we can see that the variability in these means, each based on 250 persons, is less than the variability in the means each based on 50 people.
Then finally, let me do this one more time, where I take 5,000 samples and each has 400 people in it. So again I have 5,000 sample means across 5,000 samples, but this time each mean is based on 400 persons, and again the variability in these means from sample to sample systematically decreases the larger the sample each mean is based on. Here's a side-by-side box plot presentation of the distribution of these 5,000 means for each of the three simulations. You can see that the middle value or the center of the distributions, the median, is very similar whether we're looking at the distribution of means each based on 50 persons, 250, or 400. But again, the variability in those means across samples, as we saw in the histograms, decreases the larger the sample each mean is based on.
So again, this is a theoretical exercise. I knew the population because I created it, based on the Heritage Health data: I created a population of individual values on my computer, where the true mean length of stay was 4.25 days. So what do I want you to notice here? When I look across the simulations of 5,000 sample means, look at the mean of the 5,000 sample means. When each of my means was based on 50 persons, the mean of the 5,000 sample means was 4.22 days, very close to the underlying population mean of 4.25 days. For the other two simulations, where my means were based on 250 persons or 400 persons respectively, the mean of the 5,000 sample means was numerically equal to the underlying population mean value for all persons in the population. You can see that the variability in these sample means numerically decreases, systematically again, as the size of the sample each mean is based on increases.
So in summary, theoretical sampling distributions for sample means across random samples of the same size from the same population can be estimated via computer simulation, and that's what we've just done.
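The length of stay simulations, and the earlier point that the number of simulated samples is not what drives the spread, can be sketched together in Python. This is a hedged illustration rather than the lecture's actual code. The population mean of 4.25 days comes from the lecture, but the exponential shape, the population size, the seed, and the helper names `skewness` and `sample_means` are assumptions. Real length of stay values would be whole days drawn from the Heritage Health data, which this sketch does not use.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Hypothetical right-skewed population of length-of-stay values (days).
# The mean of 4.25 is quoted in the lecture; the exponential shape and
# the population size are assumptions for this sketch.
population = rng.exponential(scale=4.25, size=200_000)


def skewness(x):
    """Third central moment divided by the cubed SD (0 for symmetric data)."""
    centered = x - x.mean()
    return (centered ** 3).mean() / x.std() ** 3


def sample_means(n, n_samples):
    """Means of n_samples random samples, each of size n (with replacement)."""
    samples = rng.choice(population, size=(n_samples, n), replace=True)
    return samples.mean(axis=1)


print(
    f"population: mean = {population.mean():.2f}, "
    f"skewness = {skewness(population):.2f}"
)

# For each sample size, compare 5,000 simulated samples with 10,000.
summary = {}
for n in (50, 250, 400):
    for n_samples in (5_000, 10_000):
        means = sample_means(n, n_samples)
        stats = (means.mean(), means.std(ddof=1), skewness(means))
        summary[(n, n_samples)] = stats
        print(
            f"n = {n:>3}, samples = {n_samples:>6}: "
            f"mean = {stats[0]:.2f}, SD = {stats[1]:.3f}, "
            f"skewness = {stats[2]:.2f}"
        )
```

Under these assumptions, the skewness of the sample means should sit much closer to zero than the skewness of the individual values, and the SD of the means should change with the sample size `n` rather than with the number of simulated samples.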
Simulation is a useful tool for helping explore the properties of these sampling distributions. Some properties observed with these two examples, which will be generalized shortly, include that the variation in sample means decreases with means based on larger samples. In other words, as the sample size that the means are based on increases, the variation in the means across samples decreases. Also, the average of the sample means, regardless of the size of the sample that each mean is based on, is close or equal to the underlying population mean, which we'll again call Mu. In these simulations I knew Mu because I created the populations on a computer. What we'll see is that ultimately, estimating the characteristics of a sampling distribution like the ones we estimated via simulation here will be done using the results from a single random sample from a population, and later in this lecture set the properties that have been demonstrated empirically via the simulations will be generalized. In the next section we'll show similar results when we look at the sampling behavior of sample proportions for samples of binary data, and incidence rates for samples of time-to-event data.