So in this next set of lectures, we're going to define the idea of sampling distributions of sample statistics, and something that measures the variation in a sample statistic across multiple samples of the same size from the same population: the standard error. In this first section, I'm just going to take a moment to define the idea of a sampling distribution and lay out a roadmap for what we'll do in the subsequent lecture sections to make more sense of the title, sampling distributions and standard errors.

So let's first define the sampling distribution of a sample statistic. Thus far in the course, we have summary measures for single data samples, like a sample mean, a sample proportion, and a sample incidence rate, and we also have measures of association that compare two samples: differences in means, risk differences, relative risks, and incidence rate ratios. We have discussed how these aforementioned sample estimates are not necessarily the population truth; in fact, we won't ever know whether they are or not, because we don't know the population truth. But these sample estimates are our best estimates of these unknown truths, based on the data we have at hand in our imperfect sample or samples from a population or populations. So ultimately, in addition to getting an estimate and correctly interpreting it, it is important to recognize the potential uncertainty in a sample-based estimate as it relates to the unknown truth that it estimates. This uncertainty is sometimes called sampling variability. If we understand how sample-based estimates vary across random samples of the same size from the same population, this will give us a framework for coupling our estimate with some measure of uncertainty, and putting those two things together to make a statement about the unknown truth. So this set of lectures, lecture set six, involves defining, characterizing, and estimating the theoretical sampling distribution of a sample statistic.
For example, the sampling distribution of a sample mean, or the sampling distribution of a sample proportion. As we move through the course, we will extend this concept to include sampling distributions for comparison measures like mean differences, risk differences, relative risks, etc. Ultimately, what this sampling distribution will allow for is the estimation of an interval describing a range of plausible values for the unknown truth that we can only estimate: we can use the results from our single sample, as we've seen, to estimate this unknown truth, and now add uncertainty bounds to create an interval of plausible values for this unknown truth, called a confidence interval.

So to start, let's define the concept of a sampling distribution. The sampling distribution of a sample statistic is a theoretical distribution that describes all possible values of a sample statistic based on all random samples of the same size taken from the same population. The variability in the sample statistic values, as characterized by the sampling distribution, is a measure of sampling variability; we will ultimately call this the standard error of our sample statistic. So let's think about what is meant by uncertainty in sample-based estimates, also called sampling variability. Well, you may remember earlier in the course, when we were defining continuous data measures, we looked at the weight distribution for a sample of 236 one-year-old children from Nepal, and we had the sample mean for that entire group. So, think about this.
If you were doing research in Nepal, you might take another sample of 236 children, and you might get a slightly different estimate of the mean weight for one-year-old Nepali children than my sample of 236 gave. And if you can imagine this study being repeated an infinite number of times, with an infinite number of researchers each taking a random sample of 236 one-year-olds from Nepal and computing a mean weight for their sample, then the theoretical sampling distribution of mean weights, based on random samples of 236 Nepali children who are 12 months old, would be given by a histogram that included all of those infinitely many sample mean estimates from samples of 236. So for example, my mean was 7 kilograms, the second researcher's mean was 6.8 kilograms, the third researcher's mean, on the 236 children in his or her sample, was 7.4 kilograms, and so on; in this theoretical conception, anyway, we would be looking at an infinite number of means, each based on a sample of 236 children. If we were to plot a histogram of these infinitely many means and look at the distribution of the sample mean estimates from sample to sample, this would be the theoretical sampling distribution of a sample mean based on 236 Nepali children. So this is a theoretical quantity. We would not likely do our study more than once, and we certainly wouldn't do it an infinite number of times, but the idea of a sampling distribution is a histogram where each point, each value, is a sample mean based on, in this example, a random sample of n equals 236. So this distribution shows the range of values, their distribution, and the variability in these values across the different random samples.

Now consider the sampling distribution of the proportion of Marylanders who have been vaccinated for flu in the current cycle. Let's suppose I was basing this on samples of size 500. So I, researcher one, take a sample of 500, and maybe 36% of my sample has been vaccinated for the flu.
You, researcher number two, take another random sample of 500 Marylanders, and maybe 41% have been vaccinated, and this goes on infinitely: an infinite number of researchers all take samples of 500 and compute a single summary measure, the sample proportion, on their respective samples. If we were to do a histogram of the sample proportion values across these infinite random samples, each of size 500, this would give us the sampling distribution of the sample proportion based on random samples of 500. So each point in this histogram is a sample proportion from one random sample of n equals 500. Again, this is a theoretical idea, because no researcher will likely take more than one sample for a given study, and certainly not an infinite number. So the sampling distribution, like I just said, is a theoretical entity; it cannot be observed directly or exactly specified. In real-life research, only one sample from each population under study will be taken, and even if we wanted to take multiple samples, it would be impossible to take an infinite number.

So the remaining sections in this overarching lecture set six will serve to further demonstrate and define sampling distributions by detailing the results of some computer simulations, where we do sample from a theoretical population multiple times and look at the distribution of sample statistics across the samples. By doing this, we'll empirically show some consistent properties of sampling distributions regardless of the sample statistic we're creating: a mean for continuous data, a proportion for binary data, and an incidence rate for time-to-event data. We'll unveil a mathematical result that allows us to generalize the properties shown in the simulations, and then we'll use that result to demonstrate how to estimate characteristics of the sampling distribution of a sample statistic from the results of one random sample.
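As a small preview of what those simulations look like, here is a minimal sketch of my own (not the lecture's actual code). The population values are assumptions chosen for illustration: I pretend the true mean weight is 7.1 kg with a standard deviation of 1.2 kg, and that the true vaccination proportion is 0.39; 5,000 repeated samples stand in for the "infinite" number.

```python
import numpy as np

rng = np.random.default_rng(0)
n_reps = 5_000  # stand-in for the "infinite" number of repeated samples

# --- Sampling distribution of a sample mean (n = 236) ---
# Assumed population: mean weight 7.1 kg, SD 1.2 kg (illustrative numbers,
# not the actual Nepal data).
means = np.array([rng.normal(7.1, 1.2, size=236).mean() for _ in range(n_reps)])

# --- Sampling distribution of a sample proportion (n = 500) ---
# Assumed true vaccination proportion: 0.39 (illustrative).
props = rng.binomial(500, 0.39, size=n_reps) / 500

# Each array holds one estimate per "researcher"; a histogram of either one
# approximates the sampling distribution. The spread across samples is the
# standard error, close to the theoretical values:
#   sigma / sqrt(n)     for the mean:       1.2 / sqrt(236)      ~ 0.078
#   sqrt(p(1-p)/n)      for the proportion: sqrt(.39*.61/500)    ~ 0.022
print(means.mean(), means.std(ddof=1))
print(props.mean(), props.std(ddof=1))
```

Notice that each simulated sampling distribution centers near the assumed truth, and its spread shrinks as the per-sample size n grows.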
So, even though the sampling distribution is a theoretical quantity describing the distribution of a statistic across an infinite number of random samples, we'll have some tools to estimate its characteristics based on the results of a single sample.
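To foreshadow one of those tools: for a sample mean, the standard error is commonly estimated from a single sample as the sample standard deviation divided by the square root of n. A minimal sketch, using simulated data in place of a real study sample (the population values 7.1 kg and 1.2 kg are illustrative assumptions, not the actual Nepal data):

```python
import numpy as np

rng = np.random.default_rng(1)

# One simulated sample of 236 weights, standing in for real study data.
sample = rng.normal(7.1, 1.2, size=236)

n = sample.size
s = sample.std(ddof=1)      # sample standard deviation
se_hat = s / np.sqrt(n)     # estimated standard error of the sample mean

# From this single sample, we estimate the spread of the (unobservable)
# sampling distribution of the mean -- no repeated sampling required.
print(f"mean = {sample.mean():.2f} kg, estimated SE = {se_hat:.3f} kg")
```

The point is that `se_hat`, computed from one sample, approximates the spread we would only otherwise see by repeating the study many times.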