Hello. My name is Rasmus Borup Hansen, and I would like to talk to you about some of the lessons we have learned during the EFFORT project with the metagenomics analyses. Unlike some of the other talks you may have heard in this series, I would like to take a step back and provide some general take-home messages.

First, a little bit about my background. By education I'm a mathematician and a computer scientist, and I've worked with various kinds of IT for several years. For the last five years I have worked with bioinformatics at Intomics, which is a Danish company located just outside Copenhagen, close to the DTU campus, which is why you see both the Intomics and DTU logos on this slide. We mainly sell bioinformatics services to pharma companies in the European Union or in the USA, but we also work with academia in various projects, which is why you're seeing me in this video today.

So, the outline of my talk: first, I'm going to give a very brief introduction to the EFFORT project, then I will talk about the metagenomics pipeline we used in the EFFORT project and a little bit about Intomics' role in it, and finally I will talk about some of the lessons we learned during the project.

Now, the EFFORT project. Since you may already have heard other talks about the EFFORT project, I will only give a very brief introduction. As you can see, EFFORT is an acronym for Ecology from Farm to Fork Of microbial drug Resistance and Transmission. It consists of 19 European partners, and the focus of the project is to study antimicrobial resistance. In particular, we have used metagenomics to quantify antimicrobial resistance in a number of pig and poultry farms across these nine European countries. We did that by looking for traces of certain genes, called resistance genes, that are known to cause antimicrobial resistance. We have also examined the bacterial composition of the samples using metagenomics, and finally, information about actual antimicrobial usage has been collected by interviewing the farmers.

So, this is a generic overview of the metagenomics pipeline used in the EFFORT project. It is centered around the tasks that are related to bioinformatics, and you should note that a tremendous amount of work goes into just collecting the samples and preparing them for sequencing; there are other talks in this series that deal with these things. You should also know that for the EFFORT project we took a tool called MGmapper, which was originally made by Thomas Nordahl Petersen from DTU, streamlined it by removing some parts not needed for the project, and added a database to store all the results automatically.

Okay. Once a sample has been sequenced, you get something called a FASTQ file, or several FASTQ files, from the sequencing provider. Basically, these are long lists of the short DNA sequences that are found in the sample, together with some quality assessments. These sequences are called reads, and with the sequencing technology we're using, they come in pairs. So when I talk about reads, read-pairs or maybe just pairs, I'm simply referring to short DNA sequences that are present in the sample. It's a good idea to think of the FASTQ files as a digital photo of the DNA in the sample. When you have your FASTQ files, you want to relate them to some databases of things you are interested in. This could be a database of resistance genes or a database of sequences of entire genomes.
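To make the FASTQ format a little more concrete, here is a minimal sketch of how one could read such paired files. This is not part of the EFFORT pipeline, and the file names sample_R1.fastq and sample_R2.fastq are just hypothetical examples; the sketch only shows that each read is stored as four lines (identifier, sequence, separator, and per-base quality string) and that the forward and reverse files are read in step so the reads pair up.

```python
import itertools

def read_fastq(path):
    """Yield (identifier, sequence, quality) tuples from an uncompressed FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip()
            handle.readline()                    # the "+" separator line is ignored
            quality = handle.readline().rstrip()
            yield header[1:], sequence, quality  # strip the leading "@"

# Hypothetical file names: the forward and reverse reads of one sample.
pairs = zip(read_fastq("sample_R1.fastq"), read_fastq("sample_R2.fastq"))
for (id1, seq1, _), (id2, seq2, _) in itertools.islice(pairs, 5):
    print(id1, len(seq1), id2, len(seq2))
```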
And for each of the reads, or the read-pairs, in the FASTQ file, you then try to determine what it corresponds to in the database, and then you simply count how many of the pairs correspond to each part of the database. These matches are, of course, the alignments produced in the alignment step. Afterwards you get some raw data out, and we aggregate it by simply counting how many pairs aligned to which genes or which genomes. There are some other values as well in the raw data, but the counts are the most important output of the alignment step and the postprocessing of the alignments. We then aggregate the raw data to an appropriate level and put the results in a database, so we can later extract reports or do analyses that involve all the samples. Intomics' work in the EFFORT project has been centered around this box here. As I said, we spent a lot of time streamlining the pipeline, but we also spent a lot of time communicating how the output should be understood and what the inner workings of the pipeline are.

So, the first take-home message of this talk is that we only find what we look for. This is related to the alignment step in the pipeline: we get different results if we use different databases. This is pretty obvious, but what is important here is that the bacteria databases change all the time, as we get more knowledge and that knowledge is added to the databases. What we see here are two pie charts of the bacterial composition of the same sample (this is the sample identifier): the pie chart to the left is based on bacteria databases from April 2015, while the pie chart to the right is based on slightly newer bacteria databases from November 2016. You can see that even though we're analyzing the same sample using the same method, we do get slightly different results. This is of course because we used updated databases, and you will for instance see that in the 2015 data we did not find as many Clostridia as we did in 2016. Maybe that could be because some of these Clostridia were misclassified as Bacilli, but you really have to dig into the details to figure out the actual cause of the differences between the pie charts. Another thing you will notice is that this class you see over here, Thermoplasmata, is simply not present in the 2015 pie chart. The reason is that we did not have the same knowledge about Thermoplasmata in 2015 as we did in 2016, so we simply did not know what to look for, and that is why we didn't find any Thermoplasmata back in 2015.

So, as I pointed out, the bacteria databases change a lot as new knowledge is added, so we may see changes in the distribution of the bacterial composition in the samples; in the example, a whole new class showed up after updating the database. As always, we believe that using more knowledge improves the results, so we should try to use databases that are as new as possible when we do our analyses. On the other hand, in a long-running project it's important that all the samples are analyzed in the same way. So you should actually freeze the databases at some point, but do it as late as possible.

The next slide is actually related to the first lesson learned. When a database of genes we want to look for changes, we can actually restart our analysis and use our new knowledge. This is because the FASTQ files work as a digital photo of the DNA in the samples, even the DNA we did not know about when we collected the samples.
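To illustrate how such a re-analysis can be done purely from the stored FASTQ files, here is a minimal sketch. It is not the actual MGmapper/EFFORT pipeline, and the file names (sample_R1.fastq, sample_R2.fastq, resistance_genes_2016.fasta) are hypothetical; it assumes BWA is installed and on the PATH, rebuilds the BWA index for an updated gene database, realigns the stored reads with bwa mem, and counts how many pairs hit each gene, which is essentially the counting described above.

```python
import subprocess
from collections import Counter

# Hypothetical file names: the stored reads of one sample and an updated
# FASTA file with resistance gene sequences.
reads_1 = "sample_R1.fastq"
reads_2 = "sample_R2.fastq"
updated_db = "resistance_genes_2016.fasta"

# Build a BWA index for the updated database (only needed once per database).
subprocess.run(["bwa", "index", updated_db], check=True)

# Realign the stored reads against the updated database and stream the SAM output.
counts = Counter()
with subprocess.Popen(["bwa", "mem", updated_db, reads_1, reads_2],
                      stdout=subprocess.PIPE, text=True) as bwa:
    for line in bwa.stdout:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.split("\t")
        flag = int(fields[1])
        if flag & (0x4 | 0x100 | 0x800):
            continue  # skip unmapped, secondary and supplementary alignments
        if flag & 0x40:                # count each pair once, via its first read
            counts[fields[2]] += 1     # fields[2] is the reference (gene) name

# Genes that only exist in the updated database could not have been found before.
for gene, n in counts.most_common(10):
    print(gene, n)
```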
When we first analyzed the data for the EFFORT project, these two resistance genes here were not known, so obviously we couldn't look for them. But because of the digital nature of the sample data, we could later restart the analysis without going into the lab, and we did actually find traces of these genes in some of the samples.

So, the next point I'm going to talk about is more technical than the other points. The workhorse in the pipeline, here in the alignment step, is the alignment software called BWA, the Burrows-Wheeler Alignment tool. However, BWA was not designed for metagenomics. If you look in the documentation for BWA, you will find that it was designed for low-divergent sequences and a single reference genome. We, however, use multiple reference genomes, and since many of the reads come from bacteria that are not yet in our databases, our reads will be highly divergent rather than low divergent. This turned out to be a problem in the project, because BWA needs to estimate something called the insert size, and this did not work when many of the reads did not align properly. It also turned out that the insert size estimates would depend on the number of CPU cores used for the computations. So we really had to dig into these problems and figure out what was going on. First we spent a lot of time understanding the problem, and once we had that knowledge, we could change BWA, and also change our pipeline accordingly, so that we got good, consistent insert size estimates. This took a lot of work. So, you should be careful when you're using software for things it wasn't designed for.

The last thing I'll talk about is that you should aggregate your data, and you should do it in one way. If you look at the pipeline again, you probably won't be surprised that there are many different parameters you can tune in the alignment step, and if you look at the raw data, there are many different ways you could interpret it. All this could lead to several different variants of reports or analyses, and that can be quite confusing. To address this, it's very tempting just to provide all the details, or to provide alternative data sets corresponding to different parameters, and then leave the choice to the people who are using the data; then you don't risk getting blamed for a bad choice. However, people will be confused and overwhelmed by many choices or a lot of data, so it's not necessarily a good idea. You would also need to spend a lot of time explaining the different data sets. And if someone is going to work with the data that you provide, it is very unlikely that they have a better understanding of the parameters than you do. You're the expert, so you should be the best person to make good choices about the parameters. So my final recommendation is that if you're doing a project that generates a lot of data, you should recommend a single way to aggregate the data, to avoid confusion and to get the project to run smoothly.

So, let's look at the lessons learned I've talked about. Except for the second one here, they are not really specific to metagenomics, so I hope you can benefit from them in other projects too. I will admit that they look pretty obvious. Nevertheless, they have all been relevant for us at some point during the EFFORT project, so they may very well be relevant for you in the future. Thank you very much for listening to my talk.
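As a small, concrete illustration of the final recommendation about aggregating the data in one agreed way, here is a minimal sketch. The gene names, the counts and the gene-to-class mapping are made up for the example and are not EFFORT data; the point is simply that the project agrees on one reporting level (here the drug class) and everybody works from the same aggregated table.

```python
from collections import Counter

# Made-up example data: counts of aligned read pairs per resistance gene for
# one sample, and a mapping from gene to the single agreed reporting level.
pair_counts = {"tet(M)_1": 1250, "tet(O)_2": 430, "blaTEM-1_5": 88, "ermB_3": 310}
gene_to_class = {"tet(M)_1": "Tetracycline", "tet(O)_2": "Tetracycline",
                 "blaTEM-1_5": "Beta-lactam", "ermB_3": "Macrolide"}

def aggregate(counts, mapping):
    """Sum read-pair counts up to the single agreed level."""
    aggregated = Counter()
    for gene, n in counts.items():
        aggregated[mapping[gene]] += n
    return aggregated

for level, n in aggregate(pair_counts, gene_to_class).most_common():
    print(level, n)
# Tetracycline 1680
# Macrolide 310
# Beta-lactam 88
```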