The first statistical step in almost any genomic data analysis is pre-processing and normalization. The basic idea is that the data come in a raw format that's very complicated, or very large, or simply too unprocessed to work with directly. So your first step is to process and normalize that data, to get it set up so that it's easy to perform statistical analysis.

Here is a typical pipeline for a genomic experiment, in this case for an Illumina sequencing machine. You take the samples you're going to measure and prepare them, and then you create little clusters of the sequenced fragments on a slide. Then you do sequencing by synthesis, adding one base at a time, which generates large images where each point corresponds to a cluster of fragments, imaged once for each base you're trying to call. You then use those color images to call the bases: at base one for cluster one we think we have a C, then maybe a G, then a T, and so forth. What you end up with at the end of the day is a set of reads and their base calls. Each of these steps, from the images, to the base calls, to the reads, and ultimately to alignment and whatever else you do with that sequencing data, is a step in pre-processing and normalizing the data, and there are a large number of artifacts and differences that you might need to correct for.

Here is a simple example of something you might do after that step. In an RNA-sequencing experiment you have RNA transcripts, possibly several different variants of each transcript, and you get reads from those transcripts which are ultimately aligned back to the genome. The next thing you need to do is some kind of counting. That's a pre-processing step: you don't necessarily care about the reads themselves, you might just care about the total abundance for each gene. So a pre-processing step would be adding those reads up to get one number per gene, per sample that you're working with.

Another step might be correcting for GC content. It's been shown that if you take a set of genes, look at their GC content, and plot their expression levels against GC content, you see patterns that arise from the GC content itself, and those patterns can differ between samples. For example, in sample one expression might increase with GC content, while in sample two it decreases. That means that when you compare the two samples you might see differences that aren't real; they appear only because some genes have lower GC content and some have higher GC content, and the GC effect varies across samples. So you might need to do some sort of correction for GC content.
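To make that concrete, here is a minimal sketch of that kind of GC-content correction. The data are simulated, and a low-degree polynomial stands in for the loess-type fit that GC-correction methods typically use; everything here is an illustrative assumption, not a specific method from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: GC fraction per gene and log2 expression for two samples,
# where the two samples have opposite GC trends (purely simulated).
n_genes = 1000
gc = rng.uniform(0.3, 0.7, size=n_genes)
expr = rng.normal(8, 1, size=(n_genes, 2))
expr[:, 0] += 4 * (gc - 0.5)   # sample 1: expression rises with GC
expr[:, 1] -= 4 * (gc - 0.5)   # sample 2: expression falls with GC

# Fit and subtract the GC trend within each sample. A degree-2
# polynomial is a simple stand-in for a loess fit.
corrected = expr.copy()
for j in range(expr.shape[1]):
    coef = np.polyfit(gc, expr[:, j], deg=2)
    trend = np.polyval(coef, gc)
    corrected[:, j] = expr[:, j] - trend + expr[:, j].mean()

# Before correction the between-sample difference tracks GC content;
# after correction that apparent difference is largely gone.
print(np.corrcoef(gc, expr[:, 0] - expr[:, 1])[0, 1])
print(np.corrcoef(gc, corrected[:, 0] - corrected[:, 1])[0, 1])
```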
Those are two really simple examples, but there are a large number of pre-processing steps and they're very dependent on the pipeline that you're using.

Here's another example of normalization. In this case, you're looking at genotype calls for particular SNPs across a large number of individuals. For each SNP you can see a cluster of samples with a low intensity for the B allele and a relatively low intensity for the A allele, a cluster with a relatively higher intensity for the A allele compared to the B allele, and finally a cluster where the B allele intensity is higher. These might correspond to homozygous B, heterozygous, and homozygous A, and the clusters vary from SNP to SNP: each SNP has its own set of values. So you need to process the samples together to work out the right genotype calls to make. Moreover, there's some variability across the homozygous samples, and you want to be able to call those samples homozygous even allowing for that variability. That's an example of normalization across samples.

Another example of normalization across samples comes from a ChIP-seq experiment. Suppose you take the common peaks from replicate samples and make an MA plot, like the one we saw in the exploratory analysis: on the x axis you plot the sum of the log read counts in those peaks across the two replicates, and on the y axis you plot their difference. For technical replicates you would hope that the points lie exactly on zero, but if you look carefully you can see that zero is some distance away and the points don't lie on that line. So you can fit a line through these data and subtract it off, and afterwards the replicates show greater concordance: they lie much closer to zero. This is MA normalization, and loess normalization uses the same sort of technique: take replicate samples and make sure the bulk distributions look alike. Now, it's not always true that bulk changes in the distributions are unrelated to the biology you care about, but more often than not in genomic measurements, a really large change in the bulk distribution between two samples is due to technology, and you want to remove it.
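Here is a rough sketch of that MA-style correction for a pair of technical replicates. The simulated peak counts and the polynomial trend (standing in for the loess fit that methods like loess normalization actually use) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated read counts over common peaks for two technical replicates,
# with replicate 2 systematically scaled relative to replicate 1.
n_peaks = 2000
base = rng.lognormal(mean=5, sigma=1, size=n_peaks)
rep1 = rng.poisson(base)
rep2 = rng.poisson(1.5 * base)

# MA transform: A = average log count, M = log difference.
a = 0.5 * (np.log2(rep1 + 1) + np.log2(rep2 + 1))
m = np.log2(rep1 + 1) - np.log2(rep2 + 1)

# Fit a smooth trend of M against A (a degree-2 polynomial here, as a
# stand-in for loess) and subtract it, removing the bulk difference
# between the replicates.
coef = np.polyfit(a, m, deg=2)
m_corrected = m - np.polyval(coef, a)

print("mean M before:", m.mean())          # systematically away from zero
print("mean M after: ", m_corrected.mean())  # close to zero
```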
Here is probably the most common kind of normalization across samples. Imagine you have some raw data and the samples have very different distributions of values. The first thing you can do is something called quantile normalization, and here's how it works. Picture a matrix with a bunch of different genes in the rows and four different samples in the columns. The first step, within each sample, that is, within each column, is to order the values. For column one that gives 2, 3, 3, 4, 5, because there's a 2, two 3s, a 4, and a 5; similarly you order column two, column three, and column four. The next step is to average across the rows of this ordered matrix. For the lowest values, 2, 4, 3, and 5, you get 3.5, and you assign that average to every value at that rank. So the lowest value in each sample might have been 2, 4, 3, or 5, but now the lowest value in every sample is 3.5; the second lowest value is always 5; and so forth. The final step is to reorder each column back into its original order. Now the distribution in every column is exactly the same, because every sample has the same lowest value, the same second lowest value, the same third lowest value, and so on; the values are just arranged differently depending on where the highest and lowest values appeared in each sample.

So what does quantile normalization do? It forces the distributions to be exactly the same as each other. That's not necessarily a good thing if the big bulk differences are biological, but almost always the big bulk differences in distributions are due to technology.

Here's a nice illustration of when to use quantile normalization and when not to. Imagine you're comparing groups you care about, say three different genotypes, and you see small variability within those groups and also small variability across them. You can use quantile normalization in this case, but it's not really necessary, because there are no big bulk differences in the distributions. A different case is comparing, say, non-smokers to asthmatics to smokers: within those groups you see large variability, but you expect very little variability across the groups. In that case it makes sense to remove the bulk differences and use quantile normalization to make the distributions look exactly the same. In yet another case you might not want to do this at all. Imagine there are real global changes, for example you're comparing brain to liver tissue. There might be huge differences just due to biology, which you don't want to remove. So if the global differences are technical variability you might use quantile normalization, but if you're looking at global biological variability you don't necessarily want to use it, because it will force the brain distribution to look exactly the same as the liver distribution, and that difference might be biology. There's a package called quantro, which I've linked to, that will help you decide when you should and shouldn't use quantile normalization.

Another thing to be careful about with quantile normalization is that, again, it forces the distributions to be exactly the same, but sometimes they shouldn't be exactly the same, so it's something you have to be nuanced about and pay attention to. Here's an example from DNA methylation arrays, showing two different types of probes in two different channels: the Infinium I probes in red and the Infinium II probes in blue. It turns out that the two kinds of probes genuinely have different distributions, and you don't want to force them to be the same, because they're supposed to be different. So when you do quantile normalization, it sometimes makes sense to quantile normalize within groups of probes, or within groups of measurements that are similar and should have similar distributions.
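To make the mechanics concrete, here is a bare-bones sketch of the quantile normalization walkthrough above. The toy matrix is made up, though its first column and the lowest and second-lowest values in each column are chosen to reproduce the numbers from the walkthrough (the lowest rank averages to 3.5 and the second lowest to 5); ties are handled crudely by rank order, which production implementations treat more carefully. Quantile normalizing within groups of probes would just mean applying the same function separately to each group's rows.

```python
import numpy as np

def quantile_normalize(x):
    """Force every column of x to share the same distribution:
    sort each column, average across the sorted rows, then give each
    value the row mean corresponding to its rank within its column."""
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)  # rank of each value within its column
    sorted_cols = np.sort(x, axis=0)                   # step 1: order values within each column
    row_means = sorted_cols.mean(axis=1)               # step 2: average across columns at each rank
    return row_means[ranks]                            # step 3: put values back in original order

# Toy expression matrix: genes in rows, four samples in columns.
x = np.array([[5.,  4., 3.,  7.],
              [2., 14., 4.,  8.],
              [3.,  8., 6.,  5.],
              [3.,  9., 8.,  9.],
              [4.,  6., 5., 12.]])

normed = quantile_normalize(x)
print(normed)
# Every column now contains exactly the same set of values
# (3.5, 5, 6, 7, 9.75), just in a different order per sample.
print(np.sort(normed, axis=0))
```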
So I've shown you a little bit about preprocessing and normalization. Preprocessing is the step where you take the raw data and turn it into a data set that you can actually do statistical modeling on. Normalization is the step where you try to make the samples have an appropriate, common distribution across samples. Both are highly platform and problem dependent: genotyping arrays, whole genome sequencing, gene expression arrays, and RNA sequencing all have different preprocessing and normalization steps. I'll talk a little about those when I talk about each specific problem type, but it makes sense to go and look up the actual preprocessing and normalization steps for the type of data you're working with. In general, though, the thing you're checking for is that there aren't big, bulk differences between samples, especially differences due to technology. If you go to Bioconductor, they have really helpful workflows for a lot of different technology types, which explain some of the most common preprocessing and normalization techniques. Again, visualization is your friend here: it's the way to detect big differences that you haven't accounted for.

One thing that researchers starting out in genomics must keep in mind is that outliers will almost inevitably contain a bunch of experimental or analytic artefacts. That's a quote from a paper, but it makes a lot of sense, and it's something to really keep in mind when doing these analyses. Almost always, when you find a really surprising, really huge effect in genomic or genetic data, the first thing you should think is that it's probably due to some experimental or technological artefact. Before you make any big claims about it, do a very careful analysis and ask: are there any artefacts I missed? Is there a step in the normalization, pre-processing, or correction that I missed that's causing that big difference to appear? That way you won't be embarrassed by reporting an effect that appears to be real but turns out to just be an artefact.