Hello everyone, and welcome to this lecture on the opportunities and challenges for the application of metagenomics in surveillance. My name is Sofia Duarte, and I'm an Assistant Professor at the Division of Genomic Epidemiology at the National Food Institute at the Technical University of Denmark. Today, I'm going to talk about why metagenomics is fit for surveillance purposes. Throughout the course, we have already touched upon some of its potential, but now we will look more in depth into that. We'll also see what is needed, and what the challenges are, for its actual implementation. And we'll look into the need for epidemiology in a surveillance plan: why is epidemiology so important? Then I will focus on two particular challenges for the implementation of metagenomics, which are the collection of explanatory data along with the samples, and the need for standard protocols in a global integrated surveillance plan. And finally, I'll present some recent results that demonstrate that there is real potential for the use of metagenomics in surveillance. So, just to see where we are in the workflow of a metagenomics process: when we speak of surveillance, a great deal has to do with the statistical and epidemiological analysis of the data, in order to study the distribution of what we survey and also the possibilities for control measures. But you can see that this step will be affected by all the previous steps in the process. So let's start with why metagenomics is actually fit for surveillance. Well, think of future surveillance as global and integrated: global, in the sense that many countries contribute to the same surveillance plan, and integrated, as in a One Health approach, where we integrate data from the human side with data from the environmental and animal sides. We need data that supports both integrated and global surveillance, and metagenomics supports both. And why?
Well, with metagenomics we can survey several pathogens at the same time in one sample. This fits the purpose of integrated surveillance, because different pathogens might be circulating in different spheres, and also of global surveillance, because the distribution of different pathogens might differ between countries. It is a culture-independent analysis, so it does not depend on the growth of microorganisms in the lab, which brings us closer to real-time surveillance; this is very convenient in a global surveillance plan. The data we get out of metagenomics comes in a standard format for different sample types and different pathogens, which is convenient for data analysis but also for sharing data electronically. It discloses the microbial biodiversity in the samples, as opposed to isolate-dependent surveillance, which focuses on each pathogen separately. And it also discloses the diversity of antimicrobial resistance genes, which can, for example, differ between participating countries. Finally, sequencing data is a historical record and can always be re-investigated and re-analyzed as new discoveries are made, for example new antimicrobial resistance genes or the re-emergence of pathogens. So, knowing the potential, we also need to address what is actually needed in order to put it into practice. If we think of global surveillance, we definitely need a collaborating and functional international surveillance community, which might be difficult to establish in itself. This community needs to be engaged and to jointly develop a global sampling plan, and its members need to have similar, comparable sequencing and data storage capacity. A global, integrated plan also needs some kind of data-sharing infrastructure in place that can be managed and accessed by the different partners. And of course, the data we get out of metagenomics falls under the umbrella of so-called big data.
So some of the traditional statistical and epidemiological approaches might not fit this kind of data, and we need to develop and implement advanced mathematical modelling. And of course, the sequencing data by itself is useful to demonstrate the distribution of whatever we are surveying, but it does not fulfil the second purpose of surveillance, which is to implement control measures. For this, we need to know what actually determines that distribution, and to that aim we need to collect relevant explanatory data along with our samples. And because, as I said before, the last step, where we are right now, the statistical and epidemiological analysis of the data, is influenced by all the previous steps, we need to adopt a standard protocol for each step of the process, so that we end up with data at the global level that is comparable between the different members of the plan. The challenges at each of these steps are many. I won't address them all here, of course, but they can be classified under different categories. At each of the steps there can be logistical challenges. At some steps, political challenges: just establishing a collaborating international surveillance community might be a huge political challenge in itself. There are many technical challenges along the process. And there are important knowledge gaps that need to be addressed before we can implement it successfully and have an effective surveillance plan running. In this lecture, I will focus on the last two challenges: the collection of explanatory data, and the development and implementation of standard protocols. Why is this important? Basically, to ensure that we have reproducible and comparable results between the different participants in the surveillance plan. But before I get to those two challenges, I would like to talk a little bit about epidemiology and why we need epidemiology in surveillance.
If you recall from one of our first lectures, surveillance is defined as the continuous, systematic collection and analysis of data that is needed for the planning, implementation, and evaluation of a certain public health practice, a certain control measure. Epidemiology is defined by the same body, the World Health Organization, as the study of the distribution and determinants of a health-related state, for example the occurrence of a certain pathogen in a population. Various methods can be used in epidemiology: descriptive studies are used to study the distribution of what we are surveying in a population, and analytical studies are used to study what determines that distribution, the occurrence of the pathogen or the resistance. So, with this in mind, we go back to the two aims of a surveillance plan. The first is that we continuously assess the occurrence of, let's say, a pathogen or an antimicrobial resistance gene in a population. The second is that we plan, implement, and continuously evaluate the effect of control measures implemented to contain the spread of these pathogens or antimicrobial resistance genes. We can only fulfil the second aim if we have explanatory variables that help us identify what determines this distribution. So we can say that these analytical epidemiological studies help us implement control measures in surveillance. For that, we obviously need to collect data on relevant explanatory variables, and at the global scale, if we think of a global surveillance plan. Now, the challenges in fulfilling this step are, again, many. It can definitely be affected by political, management, or trade interests. For example, antimicrobial use data might be sensitive information that some countries do not want to disclose, and it might therefore be difficult to get some partners to collaborate in this plan.
It can also be a huge logistical challenge. Some countries might have a data collection system in place, and even historical records, for the variables that we want to assess, whereas other countries may have a complete lack of surveillance of these variables, and everything needs to be done from scratch. So, from the start, we will already have considerable differences between the partners, and many resources will be needed to implement a standard way of collecting these variables. There will be considerable technical challenges as well, and all of these will affect how systematic, reliable, and valid the data we collect is. Finally, we need to address the important knowledge gaps at this step. For example, in integrated surveillance, where we want to analyze data from humans, animals, and the environment together, it might be a challenge in itself to integrate explanatory variables from the different spheres. For example, if we want to explain how a certain change or control measure at the farm level may influence what we survey at the human level, there might be a lot of knowledge gaps in between that we need to fill in order to find the right association between the two. And then, of course, an important question is: what is the relevant explanatory data? In some cases, we might have previous studies and evidence that guide us in determining which variables we need to study, whereas in other cases we might be a little bit lost, not knowing where to start and which variables are indeed relevant to determine the occurrence of a certain pathogen or resistance gene. So, to address this last point: what makes a certain variable relevant as explanatory data? First of all, it has to be fit for purpose: it has to fit the purpose of our initial question. It will depend on the sampling plan and its representativeness.
The sampling plan we design for surveillance will have a certain level of representativeness of the target population, and we need to ensure that the explanatory variables we define represent the population at the same level. This is very much related to the second point, which is that we have to collect the data for the explanatory variables at the appropriate level, which can be national, regional, local, at the group or herd level for example, or at the individual level. Ideally, we should collect both the explanatory data and the samples for sequencing in the same time period, so that we can, without doubt, link them to each other. The way we collect these explanatory variables will also depend very much on how much variability exists between the different units. For example, how different are different countries? Let's think of, say, antimicrobial usage in hospital settings. How different are regions, or even how different are individuals? Also, some explanatory variables can be influenced by the application of different calculation methods. For example, if you want to use antimicrobial use as an explanatory variable for antimicrobial resistance, there are different formulas to calculate antimicrobial use, for example at the farm level. By using different numerators and denominators, you end up with different measures, which might mean that, in the end, you find different associations between use and resistance. This is something you want to avoid, and you want to be in control of. Of course, there are also unknown determinants, and this is very much linked to the question of what is actually relevant explanatory data: if we don't know what to look for, it's difficult to decide where to go and sample. And then finally, remember that finding a strong association between an explanatory variable and what you see in a metagenome does not necessarily mean causation. There can be a strong correlation.
It does not necessarily mean that that is what is causing what you observe in the metagenome. So, there needs to be some biologically reasonable explanation behind it. So, that was about explanatory data. Now, the second challenge is the need for standard protocols. Why are standard protocols needed along the process? Because there are different factors along the whole process that may influence the interpretation, the reproducibility, and the comparability of your final results. Among these factors are: how you decide to sample; how big your samples are and what type of samples you collect; how you handle your samples; how you extract the DNA; whether or not you need to amplify the DNA before sequencing; how the library is prepared, because there are different ways of doing that; the sequencing platform that is used; different decisions in the bioinformatics pipeline, for example which algorithms and which criteria you implement; which reference databases you use during mapping; in what way you quantify and normalize read counts into gene abundances; and even the choice of the index measures you use to analyze your gene abundance data. In this slide, I have tried to use different colors to show at what point in the workflow these factors have an influence, and you can see that they actually cover all of the points. This means that all the uncertainties, all the doubts you have about how reproducible and reliable your results are, will add up and may have a huge impact on your final epidemiological analysis, which is key in a surveillance plan. So, I hope I have convinced you that it is important to have a standard protocol. Now, let's see what the challenges are in establishing a standard protocol. Again, there might be important political and management challenges: how engaged the collaborating surveillance community is, and also how willing its members are to collaborate in an open-source research environment.
There are, of course, also technical challenges. Some participating countries or bodies may not have the logistics to perform all of the steps in the workflow in a standard way, which might mean that we need to centralize some of the steps with a particular partner, for example, and this is also linked to open-source research. So all of these points are, in a way, interlinked. Also, if you collect samples from different spheres (human samples, environmental samples, even different types of human samples), they might require different treatments, which we cannot avoid, and which in the end might make them incomparable. Then, if we address the knowledge gaps, there is definitely a need for further scientific studies that address each of these particular factors and how great their impact on the final result is. And once we reach the point where we are comfortable enough to say that we can now implement metagenomics in the surveillance plan, we will need training of all the participating partners, and we will need ring trials to ensure that the standard protocol is working and that the partners obtain comparable results. So, in the end, with all these challenges, we need to ask ourselves: is it really realistic and possible to implement a standard protocol in global integrated surveillance? And if not, what alternatives do we have to cope with different practices among the participants? The good news is that there is a growing number of scientific studies addressing the impact of different protocols on metagenomics results. Here, I list some examples, and if you want to know more, there is a reference list at the end of the lecture. There have been investigations of the effect of sample size, DNA isolation, library preparation, the reference database you use, the bioinformatic approach, and the decisions you make along your pipeline.
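To make the earlier point about quantifying and normalizing read counts concrete, here is a minimal sketch of one common convention: reads per kilobase of reference gene per million mapped reads, analogous to FPKM. The gene names, lengths, and counts below are made-up illustrative values, and this formula is only one of several normalization choices found in real pipelines, which is exactly why a surveillance plan needs to standardize this step.

```python
# Illustrative sketch only: the genes, lengths, and counts are hypothetical
# example values, and this FPKM-style formula is one of several conventions.

def normalize_counts(read_counts, gene_lengths_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    abundances = {}
    for gene, count in read_counts.items():
        length_kb = gene_lengths_bp[gene] / 1000.0      # correct for gene length
        per_million = total_mapped_reads / 1_000_000.0  # correct for sequencing depth
        abundances[gene] = count / (length_kb * per_million)
    return abundances

# Two hypothetical resistance genes with different reference lengths.
counts = {"tet(W)": 1200, "blaTEM-1": 300}
lengths_bp = {"tet(W)": 1920, "blaTEM-1": 861}

abundances = normalize_counts(counts, lengths_bp, total_mapped_reads=5_000_000)
for gene, value in sorted(abundances.items()):
    print(f"{gene}: {value:.1f}")
```

Note that the raw counts alone would suggest a four-fold difference between the two genes, while the length-normalized abundances differ by less than two-fold; two partners quantifying the same sample with different conventions would therefore report different pictures of the resistome.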
I would like to finish the lecture by saying that, although I have highlighted all the challenges that exist for the implementation of metagenomics in surveillance, I would like to really stress that there is a lot of potential in it. To demonstrate that, here are three of many available studies that have used metagenomics to investigate the occurrence of pathogens or antimicrobial resistance in different sample types, which can be an environmental sample, a sewage sample, or even toilet waste from long-distance flights. So, there is definitely a lot being done in this field, and it has been demonstrated that metagenomics can be used. I will just briefly mention this study, which was done by my colleagues here at the National Food Institute, where they compared the use of metagenomics for monitoring antimicrobial resistance in swine herds with more traditional approaches. They found that the metagenomic analysis was highly correlated with the expected resistance in the herds, and that is very good news: it gives us hope that in the future we can really implement metagenomics. Here are the references for all the studies that I have shown in this lecture. And finally, I would like to end by saying that metagenomics can indeed be the next frontier in surveillance, as long as we combine it with standardized, integrated, and global sampling, advanced mathematical modeling, and the relevant epidemiological data. Thanks for watching.