[MUSIC] Hello everyone and welcome to another lecture of this course. My name is Sofia Duarte, and I work at the Technical University of Denmark, National Food Institute, in the division of Genomic Epidemiology. This lecture will be about the application of metagenomics in surveillance, and particularly about the statistical methods that can be used to analyze metagenomics data. First I will tell you about the different types of data that should be collected along a metagenomics project. Then I will give you an overview of some statistical methods that can be used in the analysis of the data, both to describe the distribution of the components of the metagenome and to find determinants for the pattern that we see in the metagenome. Then a little bit about some of the challenges of data interpretation, and finally what we can do beyond metagenomics. So if we go back to our flow of a metagenomics project, right now we are here at the end of the process. You have learned about all these steps so far, and the last lectures were about analyzing your sequence. But of course we need to go beyond just analyzing the sequence, because ultimately we need to do an epidemiological analysis of our results in order to find what determines antimicrobial resistance or the presence of a certain pathogen in a community. So for that we need to know about statistical methods and which ones are appropriate for which goal. And of course we need to collect explanatory data that can be used in the statistical analysis, as determinants of what we find in our results. So first about metadata: what kind of metadata do we need to collect in our projects? These are just some examples. For example, whenever we collect a sample, we should assign a sample ID to it.
And the sample ID should ideally have a meaning behind it, so that when we look at the number it is not just a random number, but we can read from it, for example, the country where the sample was collected, the animal species it relates to, and the setting. So really, when you design your projects, think through how to code your samples in a meaningful way. Then of course, you should characterize the sampling site, because depending on where you collect your sample, you might find different results. Many times it's important to record the collection date, and even the time of day if you are analyzing a microbiome, as this may change in the same animal at different times of the day. Then of course, very important, is to characterize which sample type you are collecting, as the bacterial community you find will of course differ very much between sample types. You have also learned why it is so important to characterize the conditions during transport and storage of your samples. You have also learned why it is important to record which protocol you use for DNA or RNA extraction. And the same is valid for the library preparation method and the sequencing technology you use. And not only the technology but also the thresholds, all the decisions you take along the process of library preparation and sequencing. The same is true for the quality control step and for the bioinformatics analysis: not only which algorithm you choose and which reference database you use, but also which criteria you choose along the way. Epidemiological data, or epidata as you may call it, will consist of potential explanatory variables that will actually be of crucial importance when you do your epidemiological analysis. And this may include, again, the characteristics of the sample site: whether your samples characterize a country or a region, or whether you collect them from a clinical setting or from a community setting.
The collection day might be important because of the seasonality of certain diseases, for example. For the sampled individual, this can be the age of the animal. The microbiome of an animal, or the pathogens an animal or a human is exposed to, may differ with age. If you're talking about food, you need of course to characterize this food, and then again you need to record whether you are sampling hospitalized individuals or the community. If you are sampling, for example, hospitalized patients, you need to characterize their health status. Do they have a clinical infection, or is it just a screening sample? If your goal is to analyze antimicrobial resistance in your samples, then, of course, you should collect records on antimicrobial use prior to sampling: which agents were used, the amounts, and also how long before the sample date. For microbiome studies, it's important to record the diet of the individuals that are sampled, and then when we consider the country or regional level it might also be important to collect data on the development status or economic indicators. And I'm sure there are plenty more depending on the particular study that you are running. So it is important when you design your study that you not only think of the metagenomics data and the metadata associated with it, but also of what other possible explanatory variables you may need in the final step of the flow. That brings us from having the right data to how to analyze it: what statistical methods can you apply to your data? You can use methods with two goals. One is to describe the distribution of genes in your community, that is, how your community is composed. You can do this by applying different descriptive statistical methods to the abundance of different genes, using methods that you might be familiar with, like boxplots, heatmaps, bar plots or Venn diagrams.
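As a small illustration of describing gene abundance, here is a minimal sketch that turns read counts into relative abundances and summarizes one gene across samples, which is roughly the information a boxplot would show. The sample IDs, gene names and counts are all made up for the example, and the ID pattern (country, species, setting, number) is just one assumed way of coding samples meaningfully.

```python
from statistics import median

# Hypothetical read counts per resistance gene. The sample IDs follow an
# assumed country-species-setting-number pattern, purely for illustration.
counts = {
    "DK-PIG-FARM-001": {"tetW": 120, "blaTEM": 30, "ermB": 50},
    "DK-PIG-FARM-002": {"tetW": 80, "blaTEM": 10, "ermB": 110},
    "ES-PIG-FARM-001": {"tetW": 300, "blaTEM": 60, "ermB": 40},
}


def relative_abundance(sample_counts):
    """Convert raw counts for one sample into fractions that sum to 1."""
    total = sum(sample_counts.values())
    return {gene: n / total for gene, n in sample_counts.items()}


# Relative abundance of every gene in every sample.
rel = {sample_id: relative_abundance(c) for sample_id, c in counts.items()}


def summarize(gene):
    """Minimum, median and maximum relative abundance of a gene across samples."""
    values = sorted(sample[gene] for sample in rel.values())
    return {"min": values[0], "median": median(values), "max": values[-1]}


for gene in ("tetW", "blaTEM", "ermB"):
    print(gene, summarize(gene))
```

Working with relative abundance rather than raw counts is one simple way to make samples with different sequencing depth comparable; other normalizations may suit your data better.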
You can of course characterize the community in terms of its diversity and evenness, as you have learned in the previous lecture. You can see graphically how diverse your samples are, and how far apart they cluster from each other, by using ordination analysis, and you can use different types of ordination analysis for this: principal components analysis, principal coordinate analysis, canonical correspondence analysis. Depending on what type of problem and what kind of data you have at hand, you need to consider which one is most appropriate. And then you can also use network analysis. Again, there are many more methods you may use, and this is just a snapshot of some of them. And then the second goal of your statistical analysis, and the ultimate goal of your metagenomic study for surveillance, is to find determinants: what determines the particular metagenome, the particular microbiome, the particular resistome that you find in your samples. So first of all you need, of course, to have what you think are the appropriate explanatory variables. Once you have data on them, you can do a differential abundance analysis just to try to cluster your samples, and once you find those clusters you can then look into the explanatory variables and see whether they make sense with the clusters. You can compute Spearman's rank correlation coefficients, or do a multivariable regression analysis, which might be a bit cumbersome if you are dealing with many, many predictors, so many genes, for example. You can do a meta-analysis if you think your data might have a certain random component, such as a variation that you can attribute, for example, to a country, without really being able to describe where this variation comes from apart from the country variation itself. And then you can use machine learning methods like classification models to try to predict, based on what you find in your metagenome, some characteristic of the sample.
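To make two of the methods just mentioned concrete, here is a minimal sketch, using only the Python standard library, of the Shannon diversity index with Pielou's evenness, and of Spearman's rank correlation computed as the Pearson correlation of tie-averaged ranks. The antimicrobial use and gene abundance numbers are invented for illustration, and in a real project you would more likely rely on an established statistics package.

```python
import math


def shannon(counts):
    """Shannon diversity index H = -sum(p * ln p) over non-zero counts."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def pielou_evenness(counts):
    """Pielou's evenness J = H / ln(richness); 0.0 if fewer than two taxa."""
    richness = sum(1 for n in counts if n > 0)
    return shannon(counts) / math.log(richness) if richness > 1 else 0.0


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are tied
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranked values."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: antimicrobial use and relative abundance of one
# resistance gene in five samples.
antimicrobial_use = [2.0, 5.5, 1.0, 8.0, 3.5]
gene_abundance = [0.10, 0.30, 0.05, 0.45, 0.20]
rho = spearman(antimicrobial_use, gene_abundance)
print("Spearman rho:", rho)
print("Evenness of a perfectly even community:", pielou_evenness([10, 10, 10, 10]))
```

Because Spearman's coefficient works on ranks, it only asks whether the two variables rise and fall together, not whether the relationship is linear, which is often a reasonable starting point for abundance data.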
I will now present some examples of how each of these methods can be used with metagenomics data. But again, please keep in mind this is just a snapshot, and you should definitely explore different methods with the data you have at hand and see how they inform your project. [MUSIC]