Today we're going to talk about RNA-Seq. First, we have to introduce what RNA-Seq can do. At the simplest level, RNA-Seq measures how much each gene is expressed. That part is simple enough that we really only need to talk about RPKM. Level 2 is the main point. After sequencing, we work not at the gene level but at the transcript level: which transcripts are present, which gene they belong to, and how much each is expressed. Then we can study whether there is alternative splicing within a gene. We can compare different samples to find differences in expression. RNA-Seq also supports deeper studies, such as RNA editing and eQTL. Finally, the gene lists that come out of these analyses can be analyzed by GO and pathway analysis. That is the concept of RNA-Seq, and level 2 is the main point today.
We need a mapping method that is different from ordinary genome mapping, so that we can identify which transcripts are expressed and how much. So let us come to some mapping methods. At the very beginning, scientists considered using genome mapping software, which cannot split reads when mapping. Typical tools such as bwa and bowtie cannot split reads either. One strategy for solving this problem is to cut the genome sequence into transcripts and use the transcript sequences as a new reference.
This is a part I cut out from the example, and the sequences all belong to one gene. But the gene has three transcripts. Parts of them overlap, and only the 5' UTRs differ, so the sequences of the three transcripts are largely similar. A problem arises from here. When we map a read, we can identify which gene it comes from, but we cannot identify the specific transcript. Like this: the SRA read appears in all three transcripts. So if we follow the bwa result, we get a misleading answer, because we don't know which transcript the read came from.
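The ambiguity just described can be sketched in a few lines. This is only a toy illustration, not how bwa works: the sequences, the transcript names `t1` to `t3`, the gene name `geneA`, and both helper functions are hypothetical, and exact substring search stands in for real alignment.

```python
# Toy illustration (hypothetical data): three transcripts of one gene that
# share the same body and differ only at their 5' ends.
SHARED_BODY = "GGCTTACGATCGTAGCTAGGCTA"
TRANSCRIPT_SEQS = {
    "t1": "AAACCC" + SHARED_BODY,
    "t2": "TTTGGG" + SHARED_BODY,
    "t3": "CCGTAT" + SHARED_BODY,
}
TRANSCRIPT_TO_GENE = {"t1": "geneA", "t2": "geneA", "t3": "geneA"}


def matching_transcripts(read):
    """Return every transcript that contains the read exactly."""
    return sorted(name for name, seq in TRANSCRIPT_SEQS.items() if read in seq)


def matching_genes(read):
    """Roll the transcript hits up to the gene level."""
    return sorted({TRANSCRIPT_TO_GENE[t] for t in matching_transcripts(read)})


shared_read = "TACGATCGTAGC"   # lies in the region common to all three
five_prime_read = "AAACCCGGCT"  # overlaps the 5' end that only t1 has

print(matching_transcripts(shared_read))      # ambiguous at transcript level
print(matching_genes(shared_read))            # unambiguous at gene level
print(matching_transcripts(five_prime_read))  # only reads like this are decisive
```

A read from the shared region hits all three transcripts but only one gene, which is why the gene-level count stays correct while the transcript-level count does not.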
But if we roll the transcripts up to the gene level, we get an exact expression value for the gene. Then how can we get expression at the transcript level? We do need a new alignment strategy. After splitting the reads, we can get junctions and define the boundaries of the different transcripts. Here we actually use the whole genome as the reference.
To solve the problem, there are two strategies: exon-first alignment and seed-extension. The processes of both are on the screen. We mostly focus on the first one, because it is similar to ordinary remapping to the genome. Moreover, it is much faster than seed-extension. The seed-extension algorithm is based on dynamic programming, so it is slow and uses a lot of RAM, which makes it impractical.
Although speed is an advantage of exon-first, a severe problem arises with it: pseudogenes. For instance, a read may cross a junction in the real gene (an intron or something else), while the pseudogene has no junction at that location, or differs by just a few SNPs. When the alignment is scored, the penalty for a junction is much higher than for a SNP, so reads prefer to map onto the pseudogene. As a result, the pseudogene appears highly expressed instead of the real gene. How do we deal with it? Previous research proposed a method to solve the problem. The principle is very simple: first map to cDNA, like the ref-RNA file shown on the preceding slides. Mapping to cDNA first and then going back to the genome gives a more accurate result. For this, we highly recommend the TopHat software.
This is the result of remapping with TopHat, and it is a BAM file. We can see a gap of 659 bp in the alignment. This is how the read sits on the genome, and this is a junction. So we can use remapping results like this to reconstruct the transcripts.
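To make the junction idea concrete, here is a minimal sketch of pulling junctions out of spliced alignments. It assumes the 659 bp gap is stored as an `N` (skipped region) operation in the read's CIGAR string, which is how the SAM format represents a spliced alignment. The positions, the example reads, and the function names are hypothetical; this is not TopHat's own code.

```python
import re
from collections import Counter

# CIGAR operations defined by the SAM format; M, D, N, = and X advance
# the position on the reference.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")


def junctions_from_cigar(pos, cigar):
    """Return (intron_start, intron_end), 1-based inclusive, for each N op.

    `pos` is the 1-based leftmost mapping position (the SAM POS field).
    """
    ref_pos = pos
    junctions = []
    for length_text, op in CIGAR_RE.findall(cigar):
        length = int(length_text)
        if op == "N":
            junctions.append((ref_pos, ref_pos + length - 1))
        if op in REF_CONSUMING:
            ref_pos += length
    return junctions


def count_junction_support(alignments):
    """Count how many reads support each junction."""
    support = Counter()
    for pos, cigar in alignments:
        support.update(junctions_from_cigar(pos, cigar))
    return support


# Hypothetical alignments: two reads span the same 659 bp gap, one does not.
reads = [(1000, "30M659N45M"), (1010, "20M659N55M"), (5000, "75M")]
print(count_junction_support(reads))
```

Junctions supported by several independent reads are the ones worth using when linking exons into transcripts.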
We can pull out the reads that have gaps and consider whether each one mapped across a junction between two exons. We can define junctions from these reads. After that, we will be able to link the exons into a full transcript.
Besides this strategy, there is another one, called de novo assembly, which is not based on mapping. Of course, this method is mainly applied when we don't know the transcripts. Its theoretical principle is graph traversal. For example, say we obtain five short fragments from a circular DNA sequence such as a plasmid. We can split these five fragments into dimers such as AA and AT. In the same way, we split the fragments into trimers. The full list is shown here. Each trimer can in turn be split into dimers. If we look for the relationship between dimers and trimers, we get a matching: for each trimer, we find its two flanking dimers. From this data structure we can draw a graph, which is called a De Bruijn graph.
Then we try to walk the complete graph. First, we can use a long read as a reference. For example, the read here is ATGGCGT, and we can find its start point, which is at AT in the graph. As the route goes on, we find two choices at the point of G. We use the read to pick the route, as before: in the read, G comes after ATG, so we select GG. With this method we can walk the complete graph. We can see that the graph is a cycle, and the actual sequence is circular like that. That is roughly the mechanism of reconstruction.
By the way, when we can use methods other than de novo reconstruction, especially for well-studied species such as human or mouse, it is better to use software like Cufflinks. If you want to know more about alternative splicing, such as the 8 types of splicing covered in the course, I recommend the Science paper published in 2008, which is shown on the right.
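Going back to the de novo assembly walk-through, the De Bruijn graph and the read-guided walk can be sketched as follows. The list of trimers is a hypothetical toy set chosen so that the read ATGGCGT from the talk fits it; the slide's actual fragments are not reproduced here, and the function names are illustrative only. Nodes are dimers, and each trimer is an edge from its first two letters to its last two.

```python
from collections import defaultdict


def build_de_bruijn(kmers):
    """Each k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def guided_walk(graph, guide_read, k):
    """Walk every edge once, using the long read to choose at branch points."""
    unused = {node: list(succs) for node, succs in graph.items()}
    node = guide_read[:k - 1]
    path = [node]
    read_pos = k - 1  # index of the next read character to follow
    while unused.get(node):
        choices = unused[node]
        if read_pos < len(guide_read):
            nxt = node[1:] + guide_read[read_pos]
            if nxt not in choices:
                raise ValueError("guide read does not fit the graph")
            read_pos += 1
        elif len(set(choices)) == 1:
            nxt = choices[0]
        else:
            raise ValueError("ambiguous branch with no read left to guide it")
        choices.remove(nxt)
        node = nxt
        path.append(node)
    return path


def spell_circular(path, k):
    """Spell the sequence of a closed walk, dropping the wrap-around overlap."""
    if path[0] != path[-1]:
        raise ValueError("walk is not a cycle")
    sequence = path[0] + "".join(node[-1] for node in path[1:])
    return sequence[:-(k - 1)]


# Hypothetical trimers from a small circular sequence.
trimers = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA", "CAA", "AAT"]
graph = build_de_bruijn(trimers)
path = guided_walk(graph, "ATGGCGT", k=3)
print(spell_circular(path, k=3))
```

In this toy graph the node TG has two outgoing edges (to GG and to GC), so without the long read there would be two valid cycles; the read resolves the choice, which is the point made in the talk.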
That Science paper explains the 8 types of splicing in detail. If you need to do analysis with this method, you may read the paper shown on the left. It explains the methods and introduces a model called MISO.
Finally, we will talk about differential expression. First we need to understand which measure and definition we use for expression. For example, Cufflinks uses FPKM. The difference between FPKM and RPKM is that FPKM counts fragments, so it is based on paired-end data. And why do we need FPKM? Because we need normalization. For example, take transcripts 3 and 4, whose reads are shown here: 4 obviously has more reads than 3. If we simply count reads, we will find that 4 is obviously higher than 3. However, if we normalize by transcript length, we find there is actually no differential expression between them, which matches what we expect from the figure, where the spacing between reads is the same for both. So FPKM also reflects that spacing, for different transcripts and for two different genes.
Then we will talk about how to define the boundaries between transcripts. When we get a read, how do we decide which transcript it belongs to? Take this yellow paired-end read. First suppose it belongs to C, which is very long. If its length on C works out to 500, the probability that it came from C is very low under the normal distribution, whose center is 150. If we suppose the read belongs to B, the probability is higher. And if the yellow read is on transcript A, its projected length on A is 150, which is right at the peak of the normal distribution. With this method, we get the probabilities for A, B, and C separately. We can then apply this to all the reads in the figure, and we get a figure like this. We then know what percentage each transcript contributes to this gene. Finally we get this result.
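Two small calculations from this part of the talk can be sketched in code: length normalization with FPKM, and scoring the yellow read against transcripts A, B, and C by fragment length. This is a simplified sketch, not Cufflinks' actual model. The fragment counts, the standard deviation of 50, and the implied length of 300 on B are assumptions made for illustration; only the mean of 150 and the lengths 150 (A) and 500 (C) come from the talk. All candidate transcripts are treated as equally abundant beforehand.

```python
import math


def fpkm(fragments, length_bp, total_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_fragments)


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def assignment_probabilities(implied_lengths, mean=150.0, sd=50.0):
    """Relative probability that a fragment came from each transcript.

    `implied_lengths` maps a transcript name to the fragment length the read
    pair would have if it came from that transcript. The sd is an assumed
    value, and equal prior abundance is assumed for every transcript.
    """
    weights = {
        name: normal_pdf(length, mean, sd)
        for name, length in implied_lengths.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical counts: transcript 4 is twice as long and has twice the fragments.
print(fpkm(100, 1000, 1_000_000), fpkm(200, 2000, 1_000_000))

# The yellow read: 150 on A and 500 on C as in the talk; 300 on B is assumed.
probabilities = assignment_probabilities({"A": 150, "B": 300, "C": 500})
for name, p in sorted(probabilities.items()):
    print(name, round(p, 4))
```

After length normalization the two transcripts come out equal, and the yellow read is assigned almost entirely to A, whose implied fragment length sits at the peak of the distribution.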
That is the differential expression of one gene's different transcripts. If we combine multiple samples, we can get a picture like figure a. If we don't consider transcripts and just look at all genes, we can get a heatmap like figure b on the right. All of this can be done with RNA-Seq. If you are interested in doing this, there is a Nature Protocols paper that describes it very well. It walks through a whole process, from TopHat to the final analysis. And that's the reference. Thank you all!