Hello, my name is Jakob and I work at DTU Bioinformatics. Today I'll be speaking about MGmapper, a metagenomic annotation tool that we developed here at DTU. After this lecture, I want you to be able to do three things: first, understand in broad terms how MGmapper works; second, understand when it is appropriate to use MGmapper; and third, actually use MGmapper. If you're interested in more, you should read the article, for which the reference is given here.

You might have seen this slide before: it is an overview of the different approaches to metagenomic analysis. At the top you'll see that we begin with reads, and from there you can go one of two ways. Either you go to assembly, where you assemble the reads by determining which reads overlap, and once you have constructed your contigs you can either bin them or annotate them using other methods. Alternatively, you can classify the reads directly, using one of three methods. First, sequence similarity-based methods, where you look up your reads against a database. Second, sequence composition-based methods, where you compute statistics on your reads, for instance which DNA bases or sets of DNA bases they are composed of, and then use those statistics to look them up in a database more efficiently. Or third, you can search for very specific genes in your reads and map the reads against a database of those genes. MGmapper is a sequence similarity-based method. The sequence lookup is also called a mapping, and this is where the name MGmapper comes from.

I thought I would compare the three approaches we see here: the assembly approach, the sequence similarity-based approach, and the sequence composition-based approach.

Assembly has certain pros. It is independent of a database, which means you can in theory find anything: new species, new mutants you have never seen before, new strains, with basically no constraints. The problem is that it is very hard to do and you need great data. Even with great data you will have limited success on complex samples, and assembly is slow and resource-hungry.

In contrast, similarity-based approaches are very sensitive and very quick and lightweight. Unfortunately, you need a database, and that means you cannot find what you don't have in the database. They are also sometimes too sensitive: with millions of reads, a single read can easily be misplaced, and your method may then claim that something is present in your sample which is not actually there. You also get only one hit for each read, not all hits; a single read could fit multiple references equally well, and then one of them is picked at random, which may be the wrong one, as the small sketch below illustrates. And lastly, similarity-based methods have limited resolution. For example, you may have two very similar strains, and there is no way for a similarity-based method to distinguish the vast majority of reads between them.

Lastly, composition-based methods are even quicker than similarity-based methods, though not quite as sensitive. They are also database-dependent, and a big problem with them is a quite tough tradeoff between sensitivity and precision: if you make them sensitive, they are not very accurate, and if you make them accurate, they are not very sensitive. They also have even worse resolution than similarity-based methods.
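To illustrate the one-hit-per-read pitfall I just mentioned, here is a toy sketch. This is not MGmapper code, and the score() function is just a naive stand-in for a real aligner's alignment score:

```python
# Toy illustration of the one-hit-per-read pitfall: a read that fits two
# near-identical references equally well forces an arbitrary choice.
# Not MGmapper code; score() is a naive stand-in for an aligner's score.

def score(read: str, reference: str) -> int:
    """Best number of matching bases over all windows of the reference."""
    best = 0
    for i in range(len(reference) - len(read) + 1):
        window = reference[i:i + len(read)]
        best = max(best, sum(a == b for a, b in zip(read, window)))
    return best

references = {
    "strain_A": "ACGTACGTTTGACGTA",
    "strain_B": "ACGTACGTTTGACCTA",  # differs from strain_A by one base
}
read = "ACGTACGTTT"  # occurs identically in both strains

scores = {name: score(read, ref) for name, ref in references.items()}
top = max(scores.values())
print([name for name, s in scores.items() if s == top])
# -> ['strain_A', 'strain_B']: both fit equally well, so a mapper that
# reports a single hit per read has to break the tie arbitrarily.
```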
So when do I want to use MGmapper? You could use it in situations like these. Say you want to figure out whether certain known bacteria are present in a sewage sample. You could ask: is there any measles virus in this DNA sample? Or you could ask: what is the relative abundance of these known resistance genes?

But you cannot use it to answer questions like the following. Which bacterial species are present in the sewage sample? You cannot answer that because most bacterial species are unknown and will not be present in any database you can build. What distinguishes this particular measles virus in the sample? Any small deviations of this measles virus, say mutations, would not be picked up. And lastly, you cannot discover novel genes, for example novel resistance genes, because, as already mentioned, if it's not in the database, you're not getting it.

If you use it correctly, though, it's quite efficient. This table shows the results from an in vitro test of MGmapper, where different strains and species of bacteria were put into water and their DNA was sequenced. As you can see, at the genus level and the species level it picked out all the true positives with no false positives or false negatives. Only at the strain level did it make some mistakes, but it still compares quite favorably to other methods like Kraken.

So let's talk about how MGmapper works. At its core, MGmapper is a mapping technique: it uses the BWA-MEM algorithm to map reads to the database. This is fairly simple, tried-and-tested stuff. We know it works, but it does not give perfect results on its own: as I said, there might be random hits, or a read might map to several databases. So what really makes MGmapper work is not the mapping step; it's the automatic preprocessing and postprocessing that filter credible results from random hits.

Let's go through MGmapper in three steps: first the preprocessing, second the mapping to the database, and lastly the postprocessing.

In the preprocessing, it first does quality control of the reads. It uses the program Cutadapt to trim poor-quality bases and remove adapters from the reads. Any read below 30 base pairs is discarded, as it is not useful for mapping. And if you use a paired-end library, any read without a partner is discarded. It also removes PhiX DNA. Illumina machines in particular are known to use PhiX DNA as an internal control to test whether the sequencing worked, but some sequences from the PhiX genome will be present in the data as an artifact. So MGmapper maps the reads to the PhiX genome; any reads that map well are discarded, and the rest are kept.

The middle step is the mapping step. It simply invokes the well-known program BWA-MEM to do the mapping and creates a sequence alignment and mapping file, a SAM file, with all the information from the mapping. All the subsequent postprocessing steps simply extract information from this file. This is what the file looks like: just a wall of text. Don't worry if you don't get it, but if you want to, you can go into the data that MGmapper produces and look at it yourself.

The last step is the postprocessing. Let's first have two definitions. The first is the size-normalized abundance. We define this as the number of reads mapped to a reference divided by the size of the reference, because you would expect a larger reference to get more reads assigned to it.
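As a small illustration of this definition, here is a hedged sketch in plain Python; this is not MGmapper's code or output format. It shows how size normalization lets a short reference with few reads outrank a long reference with many:

```python
# Hedged sketch of the size-normalized abundance defined above:
# reads mapped to a reference divided by the reference's length.

def size_normalized_abundance(reads_mapped: int, reference_length: int) -> float:
    return reads_mapped / reference_length

# 500 reads on a 5 Mbp genome vs. 50 reads on a 100 kbp reference:
print(size_normalized_abundance(500, 5_000_000))  # 0.0001 (one in 10,000)
print(size_normalized_abundance(50, 100_000))     # 0.0005, higher despite fewer reads
```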
The second definition is a unique read: a read that can map to only one reference and not to any others.

So when do we believe a single hit is well mapped? It must have at least 80% of the read's bases aligned to the reference, and it must have an alignment score, as given by BWA-MEM, of at least 30. We then believe that a reference sequence is present if it has a size-normalized abundance of at least one in 10,000, at least 20 different reads aligned to it, at least 0.5% of the mapped reads unique, as per the definition above, and at most 15% of the bases, in aggregate over all the reads, mismatched against the reference. A small sketch later in this walkthrough shows how these criteria combine.

These parameters can be tuned to your liking when you invoke the program, and sometimes they should be. For example, if you want to look up bacteria in a metagenomic sample, you might expect them to have deviated somewhat from your reference; they might not be exactly the same strain. Then you want quite some leeway, and you might accept 15% of bases mismatched, as stated here. On the other hand, if you are searching for known resistance genes, where you expect hits to match the database sequence closely, you want a very low tolerance for mismatches, maybe down to only one percent.

So how do we use MGmapper? There are two basic approaches. The first is the command-line approach, if you have installed it yourself on your own machine. The advantage of this is that you can build your own databases and use those for lookups, and you can throw as many processors and as much computing power at it as you want. The second is the web-based approach; you can use the link here. It is more user-friendly and straightforward, and it is what I am going to demonstrate in a moment.

There are a few options you want to notice when you use MGmapper. The first is whether your data is paired-end or single-end; obviously, that makes a difference. The second is that for each database you can pick full mode or best mode. In full mode, a read can be mapped to several databases, with the best hit picked within each database. In best mode, a read is assigned to only one database. And lastly, there are the postprocessing parameters: you can use the defaults, as mentioned on the previous slide, or, if you have a specific reason, tweak them yourself.

Now I am going to show a demonstration of how MGmapper works on the web server. When you open the link, you will be presented with the following website. There are different fields where you have to put in your information before you can submit the MGmapper job. The first is to choose whether you have paired-end or single-end reads, as shown here in the red box. In the second red box, you choose your databases. They are comma-separated, and there are the full-mode databases and the best-mode databases, as explained a moment ago. You will see that they are represented as numbers, and you might not know what one and two mean. To have that explained, press "Click to Show" to show the available databases, and you will be presented with this table of the different databases, so you can pick from them. Then you can set the postprocessing parameters if you want to: click this Show button here in the red square, and you will be presented with the different options.
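To make concrete how these postprocessing parameters fit together, here is a hedged sketch of the presence test described earlier. The names and data structure are mine for illustration, not MGmapper's internals; the thresholds are the defaults from this talk:

```python
# Hedged sketch of MGmapper's postprocessing cutoffs as described in this
# talk. The dataclass and names are illustrative, not MGmapper internals.

from dataclasses import dataclass

def read_is_well_mapped(aligned_fraction: float, alignment_score: int) -> bool:
    """Per-read filter: >= 80% of the read's bases aligned, BWA-MEM score >= 30."""
    return aligned_fraction >= 0.80 and alignment_score >= 30

@dataclass
class ReferenceHit:
    reads_mapped: int       # well-mapped reads assigned to this reference
    unique_reads: int       # reads that map only to this reference
    mismatched_bases: int   # mismatches summed over all mapped reads
    aligned_bases: int      # aligned bases summed over all mapped reads
    reference_length: int   # size of the reference sequence

def reference_is_present(hit: ReferenceHit,
                         min_abundance: float = 1 / 10_000,
                         min_reads: int = 20,
                         min_unique_fraction: float = 0.005,
                         max_mismatch_fraction: float = 0.15) -> bool:
    """Per-reference filter with the default thresholds from the talk."""
    abundance = hit.reads_mapped / hit.reference_length
    return (abundance >= min_abundance
            and hit.reads_mapped >= min_reads
            and hit.unique_reads / hit.reads_mapped >= min_unique_fraction
            and hit.mismatched_bases / hit.aligned_bases <= max_mismatch_fraction)
```

For a resistance gene search you would lower max_mismatch_fraction to around 0.01, as mentioned earlier, whereas the default 0.15 gives the leeway appropriate for bacterial genomes that may deviate from the reference.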
Note that the web version of MGmapper does not quite have the best default settings. So instead of the 0.01, I recommend 0.15, as written here in red, and the minimum read count should be 20 instead of 10. As I said before, you may also want to tweak these yourself.

Now you upload your files by clicking Isolate File down there. Then you are ready and set: simply hit upload and wait for it to finish. Depending on the size of your file, it might take a few hours.

When it is finished, you get this result report. There is a bunch of statistics, like how long the run took and what kind of reads you have. But you might want to export these statistics from the web page. If you scroll down, you get to this part of the page, where you can download all sorts of different results. In the red box here there is a button you can press to get all the results in a single Excel file, for convenience.

This is how the Excel file looks. You can see there are several tabs down at the bottom. This tab is an overview of the different species of bacteria present, and you can see that five different bacteria have been assigned. You get a little bit of statistics about the five, like the size-normalized abundance I told you about before and the size of the genome, and of course a pie chart showing the breakdown.

You can also get the abundances broken down by the different databases you submitted. For instance, you might be interested in several databases at the same time, say a bacteria database and a resistance gene database. This slide, which is one of the tabs you can access in the Excel sheet, shows the breakdown of hits across the different databases.

And lastly, there is the insert size distribution. The underlying algorithm of MGmapper, BWA-MEM, is known not to estimate insert sizes quite properly when there are several references, which of course there are when you map against a large database of references. But, as you can see here, it has figured out an approximate distribution of insert sizes from the different DNA fragments. You can give it a look and check whether it seems consistent with what you would expect.

Thank you for paying attention. I hope you will be able to use MGmapper after this demonstration.