In this lecture, we're going to give an overview of the available metagenomic analysis tools. We'll go through the common methods, focus a bit more on the classification methods, and also mention some commonly used sequence search algorithms.

So, the reads can be analyzed in multiple ways. They can be used for classification, and they can be assembled into contigs (contiguous DNA segments) and scaffolds (composed of contigs and gaps), which can then go on to binning or annotation, where annotation could include functional annotation or taxonomic annotation. You will hear more about binning and assembly in the following sections of this course. Today, we're going to focus on the classification methods.

Classification methods can generally be grouped into four groups: sequence similarity-based methods, sequence composition-based methods, hybrid methods, which combine the first two, and, last, marker-based methods. So, we'll take them one by one.

Sequence similarity-based methods use a homology search, or comparison, against a database of reference organisms. It's a good approach and widely used. However, it has its own disadvantages, such as that you cannot identify organisms that are not present in the reference database. So, you need to be really careful about what you have in your database. And obviously, the bigger and the more varied the reference database, the better the chance that you will assign the reads correctly.

The second group is the sequence composition-based methods. They are based on characteristics of the nucleotide composition, such as GC content or codon usage, and they find the best-fitting model for each sequence read. However, they also have their own drawbacks. Those methods cannot be used for short reads, so they require longer reads, more than 1000 base pairs. So far, then, we have two approaches: sequence similarity methods and composition methods.
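To make the composition idea a bit more concrete, here is a minimal sketch of one of the features mentioned above, GC content, computed for a single read. The function name `gc_content` and the example read are made up for illustration, and a real composition-based classifier would use many more features and a trained model, not just this one number.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C among the unambiguous A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous bases such as N are ignored
    return gc / total if total else 0.0


# Hypothetical read, used only to show the call.
read = "ATGCGCGTTAACGGCC"
print(f"GC content: {gc_content(read):.3f}")
```

On a short read, a single feature like this is a very noisy estimate, which gives some intuition for why these methods need longer reads.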
The similarity-based and composition-based methods can be combined into hybrid methods, which use elements of both.

And the last group, marker-based methods, compare each metagenomic read to a curated collection of marker genes and identify the high-confidence matches. But, like all the methods, they also have their own disadvantages. They achieve a low level of sensitivity if the reads don't come from genomes represented in the marker gene database. Additionally, marker genes can be used for functional analysis: you can map, or align, your reads against different databases of, for example, antimicrobial resistance genes, virulence factors, transposons, or enzymes involved in various metabolic pathways. You can think of any database with any set of genes.

Here, we'll look at the commonly used sequence search algorithms. I'll show you five of them, and we'll go through each briefly. The first, and probably the most commonly used for a long time, is BLAST, or its variations: it can be a nucleotide alignment (blastn), a protein alignment (blastp), a translated nucleotide against protein alignment (blastx), or megablast. Those methods find regions of similarity between biological sequences. Another method is hidden Markov models. They are most commonly used with protein, or amino acid, sequences, and they search a sequence profile database, or, as they call it, a model database, for sequence homologs. Bowtie and Bowtie2 are another option, aligning reads to long reference sequences. Then we have the Burrows-Wheeler Aligner (BWA), which aligns nucleotide sequences or, as people say, maps low-divergent sequences against a large reference genome. And the last one is k-mers, where the method searches against a database of substrings of length k contained within a string.
So, a k-mer can be a certain word, it can be a length-k nucleotide sequence, or it can be a length-k amino acid sequence, whatever the focus of your project is.

All those methods have a variety of tools that implement them. Here you see the most commonly used ones, or the ones that have the most citations. A lot of people are using them, and you see that the biggest variety is among the similarity-based methods. Here, I show five commonly used ones. Some of them, such as MEGAN, MG-RAST and CARMA3, use various BLAST programs, and some can also use HMMER. Kraken uses exact-match k-mers, and MGmapper uses BWA. Additionally, those programs can also include functional classification using the KEGG, GO, or COG databases, and some of them, such as CARMA3, use Pfam and TIGRFAM hidden Markov models.

Marker-based methods also use a BLAST-based search or an HMMER search, and some of them also provide functional classification. Composition-based methods use completely different approaches, which could include self-learning machines and neural networks, and they don't use sequence search methods. But then the similarity search and the composition search can be combined into the hybrid methods, which also use a variety of tools. You can read more about those tools in the paper by Peabody et al.

And this is it for the metagenomic analysis tools overview. In the further lectures, you will hear more about two tools: Kraken, which is a k-mer-based tool, and MGmapper, which is a BWA-based tool. Both are similarity-based tools. And you will also hear more about assembly and binning.
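As a small illustration of the exact-match k-mer idea from this lecture, here is a toy sketch that builds a k-mer database from two made-up reference sequences and assigns a read to the reference sharing the most k-mers with it. This is not how Kraken is implemented. The function names, the reference sequences, the labels and the choice of k = 5 are all assumptions made just for the example.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k):
    """Map every k-mer to the set of reference labels that contain it."""
    index = defaultdict(set)
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(label)
    return index


def classify(read, index, k):
    """Assign the read to the label with the most exact k-mer hits."""
    hits = Counter()
    for kmer in kmers(read, k):
        for label in index.get(kmer, ()):
            hits[label] += 1
    if not hits:
        # No k-mer of the read is in the database at all.
        return "unclassified"
    # On a tie, the label that was counted first is returned.
    return hits.most_common(1)[0][0]


# Hypothetical reference sequences, far shorter than real genomes.
references = {
    "organism_A": "ATGCGTACGTTAGCCGATAGC",
    "organism_B": "TTGACCGGTAACCGTTAGGCA",
}
k = 5
index = build_index(references, k)

print(classify("CGTACGTTAGC", index, k))  # read taken from organism_A
print(classify("GGGGGGGG", index, k))     # shares no k-mer with either reference
```

The second read shows the limitation we discussed for similarity-based methods: if nothing in the database matches, the read simply stays unclassified.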