[MUSIC] So welcome to a demo of metagenomic classification using Kraken. In this lecture, I will briefly give you an idea of the basic concept behind Kraken, so how the algorithm works, and how to build your own Kraken database. Then I will switch to the terminal and show you live how you can use the database to classify metagenomic reads, and how you can then visualize the results you get from Kraken.

First, you need to install some requirements. For the classification, we need Kraken. For the visualization, we need R and Pavian; R is a programming language, and Pavian is an app written in R that enables the visualization of Kraken results. If we want to build our own database, we also need to install Jellyfish, but we won't show that in this lecture; it will be covered in a supplementary file offered alongside the lecture.

Kraken was developed by Derrick Wood and Steven Salzberg. If you want to check out the publication, you can find the link at the bottom of the slide, or you can just search for the title of the paper.

In a simplified fashion, what Kraken does is this: first, we supply a set of genomes to Kraken, and Kraken builds the database. It shreds the genomes into k-mers, which are short subsequences of length k. By default, the k-mer length is set to 21, meaning all the k-mers will have length 21. Kraken then builds a huge list of those k-mers and assigns to each k-mer the taxonomic ID of the genome it came from. So if we shredded an E. coli genome, we assign the taxonomic ID of E. coli to its k-mers. However, if we encounter a k-mer that is present in multiple organisms, we assign to it the lowest common ancestor in the NCBI taxonomic tree. This means, for example, if we have three different Helicobacter organisms and each of them shares the same k-mer,
then this k-mer will be assigned to their lowest common ancestor, which here is the genus Helicobacter in the NCBI taxonomy.

So now we have our huge list of k-mers linked to their respective taxonomic IDs, and we want to classify reads. We take a read and again chop it into all possible k-mers. Then we look up each k-mer in our database to find its taxonomic ID. After we have done that, we look at the result in a taxonomic fashion, that is, in the NCBI taxonomic tree. We would see that those six purple k-mers are assigned to the root node, which means they are common k-mers present in a lot of different organisms. The lower we go in the tree, the more specific the k-mers get; if we are at the very bottom of the taxonomic tree, a k-mer is assigned only to a single strain or species.

So now we have classified all the k-mers, but we still don't know the taxonomic ID of the read as a whole. This is determined by first pruning the tree, meaning we remove all branches that have no k-mers in them. Then Kraken tries to find the highest-weighted root-to-leaf path, where the weight of a path is simply the sum of the k-mers along it. In this case, that is the path indicated in red, compared to the alternative path where we only have two additional k-mers in one node. The taxonomic ID for the overall read is then just the ID stored in the lowest node of that path; so in this case, whatever ID is stored in the blue node will be the taxonomic ID of our read.

In summary, the Kraken algorithm chops our genomes into k-mers and links them to taxonomic IDs. Then, during classification, it chops each read into k-mers, searches for exact hits in our database, finds the highest-weighted root-to-leaf path, and assigns the taxonomic ID of that path's lowest node to the read.

So now, a bit away from the theory and towards practice: first, of course, we want to build our database.
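To make the k-mer chopping step concrete, here is a toy shell sketch. The read and the k-mer length are made-up values for illustration only; Kraken itself does this internally and, as mentioned, with much longer k-mers:

```shell
# Toy illustration (not Kraken code): chop one read into all its k-mers
read_seq="GATTACAGATTACA"        # hypothetical 14 bp read
k=7                              # toy k-mer length (Kraken's default is larger)
n=$(( ${#read_seq} - k + 1 ))    # a read of length L yields L - k + 1 k-mers
echo "read of length ${#read_seq} -> $n k-mers of length $k"
for (( i = 0; i < n; i++ )); do
    echo "${read_seq:i:k}"       # each k-mer would be looked up in the database
done
```

Kraken then maps each of these k-mers to a taxonomic ID, or to a lowest common ancestor if the k-mer occurs in several genomes.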
There are different possibilities for building a database. First, there is the standard database, which contains all complete archaeal, bacterial, and viral genomes from RefSeq. You can build it just by typing kraken-build, then --db to specify where you want to save your database, then how much computational power you want to assign to the build, that is, how many threads your CPU should use, for example ten, and then --standard. This builds the standard Kraken database defined by the developers. You need about 75 gigabytes of memory to build and run this database, and depending on how much computational resources you have, this can take a few hours.

Maybe you'll then think: okay, but I'm also interested in human or in plasmids. What can I do then? Then you have to build a custom database. A custom database needs a bit more hands-on time, since you need to run multiple steps. First, you run kraken-build and download the NCBI taxonomy. Then you can download the different libraries supplied by Kraken; the available libraries are "archaea", "bacteria", "plasmid", "viral", and "human". After downloading everything you want, you run kraken-build again, this time with the --build parameter, and it knows: okay, everything is here, I can start building the database.

But the standard libraries still don't cover everything that's available for classification. So you might think: I'm more interested in parasites. Then you can add a parasite genome by, for example, running kraken-build with the path to the database, the --add-to-library parameter, and the custom genome you want to add. However, there is one criterion your custom genome has to fulfill: its header must contain either the NCBI accession number, so Kraken knows which taxonomic ID the genome is associated with, or you need to tell Kraken directly: okay, Kraken, this genome belongs to the taxonomic ID 5811, which is, for example, Toxoplasma.
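Put together, the database-building steps just described might look like this on the command line. The directory and file names here are placeholders, not from the lecture; the kraken:taxid header tag is the format Kraken's manual defines for sequences without an NCBI accession:

```shell
DBDIR=myKrakenDB          # placeholder database directory

# Option 1: the standard database (complete archaeal, bacterial, viral RefSeq genomes)
kraken-build --db $DBDIR --threads 10 --standard

# Option 2: a custom database, built step by step
kraken-build --db $DBDIR --download-taxonomy            # NCBI taxonomy first
kraken-build --db $DBDIR --download-library bacteria    # also: archaea, plasmid, viral, human

# Add a custom genome, e.g. Toxoplasma (taxonomic ID 5811), via a kraken:taxid header
printf '>myContig|kraken:taxid|5811\nACGTACGTACGTACGT\n' > customGenome.fa
kraken-build --db $DBDIR --add-to-library customGenome.fa

# Finally, build the database from everything added above
kraken-build --db $DBDIR --build
```

The same --add-to-library step works for any genome, as long as the FASTA header carries either an accession number or an explicit kraken:taxid tag.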
So if you don't have an accession number for the genome, because you made an isolate and sequenced it on your own, but you still want to add it to your Kraken database, then you need to tell Kraken in this fashion exactly where to put the genome in the database.

And now we are switching to the terminal, and I will show you how to run Kraken and perform classification. Now to the live demo. For this, I prepared paired-end reads from five different bacteria. From each bacterium, I took 10,000 reads and added them to our FASTQ files: R1 contains the forward reads and R2 the reverse reads. Additionally, I have our Kraken database. This is the minikraken database, available on the developers' homepage. It's the standard database, but shrunk to a smaller size, so it only needs 4 gigabytes of RAM instead of 75. Of course, that comes at a price: the database is much smaller, so we have far fewer k-mers stored in it.

To run Kraken, you just type kraken, specify the database, and specify the number of threads you have available; I'm on my laptop, so I just use two threads. Then, since we have paired-end reads, we need to tell Kraken that by setting the --paired parameter. We need to specify an output, so we save the Kraken output in example.kraken. And then we supply the two read files, Read1 and Read2. This takes only a few seconds because, as I said, we only have 50,000 sequences. We can see that of the 50,000 sequences, we were able to classify 39,000. This is a bit unexpected, because I simulated those example reads, so they should actually all be in the database. But of course this has a reason: for some sequences, we just can't find the k-mers in the minikraken database, because they were removed to condense the database.

So next, before visualizing, we need to transform the output into a Kraken report. This is done by running kraken-report.
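The two demo commands, classification and report generation, might look like this. The read file names are placeholders; the database directory and output names are as in the demo:

```shell
# Classify the paired-end reads against the minikraken database
kraken --db minikraken --threads 2 --paired \
       --output example.kraken reads_R1.fastq reads_R2.fastq

# Convert the raw per-read output into a hierarchical report for Pavian
kraken-report --db minikraken example.kraken > example.report
```

Note that kraken-report needs the same database the reads were classified with, because the raw output stores taxonomic IDs that are resolved against that database's taxonomy.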
We specify the database, again our minikraken database, and give it the Kraken result, so our example.kraken file. We store the result in example.report. This again takes a few seconds and converts the Kraken output into a Kraken report that we can then load into Pavian for visualizing the data.

So now we are going to start Pavian. For this, we open R and just type pavian::runApp(). Now we are in Pavian, in the explorer, and we can visualize the Kraken results. First, we load the Kraken report we generated. In the results overview, we see a brief summary of the report: we have 50,000 reads, of which 78% were classified and 21% unclassified. Most of the reads are microbial, and of those, most are bacterial.

In the sample view, we see a Sankey diagram. This shows us, at the different taxonomic levels, how many reads are assigned. Here we have the species level. We see four dominant species, which are part of the species mix I put in. But the fifth species we see has a rather low number of reads. For Mycobacterium, it's either that by shrinking the database we lost the k-mers that would allow us to bring the reads down from the genus to the species level, or those bacteria are so similar that it's hard to distinguish between them. Maybe if we used a bigger database, this would be resolved, and we would actually see the reads down here at the species level, like we do for the other four organisms I put into the artificial species mix.

Then, in the comparison view, we can compare different samples. In this case we only have one report, but if you ran the analysis on multiple samples, you would see here: okay, in this sample, at the species level, for example, I have Vibrio cholerae, Salmonella enterica, Klebsiella pneumoniae, Acinetobacter baumannii, and Mycobacterium tuberculosis.
And then you could see how those species are distributed in your other samples. If we look at the genus level and at percentages, we would expect 20% for each of the top five bacteria, but we get a bit of variance here. This is because we have unclassified reads, and sequencing errors can also lead to misclassifications. So again, maybe a bigger database would make this a bit more evenly distributed.

The last thing is the alignment viewer. I won't show this; I will only briefly tell you what it is. After classification, you could try to align your reads to a reference genome, for example using Bowtie 2 or BWA. Then you could load the BAM file and its index here, and you could see where on the genome your reads actually align. In some cases, you would see that they are evenly distributed, and in other cases, you might see that they are more common in some areas of the genome than in others.

So much for today's lecture on metagenomic classification. Thank you. [MUSIC]