Welcome back to Peking University MOOC: "Bioinformatics: Introduction and Methods". Let's start this week's topic: transcriptome study by deep sequencing techniques. Similar to the previous two weeks, in the following two units I will first illustrate how to process the data generated by RNA-Seq. Then I will use noncoding RNAs as an example to illustrate how to explore biological questions further based on the processed results. First, let's have a quick look at the background. The transcriptome is the set of all transcripts in a given type of cell. In other words, it is a snapshot of the cells' expression profile at a given time. The transcriptome contains not only the classical messenger RNAs (mRNAs) that code for proteins, but also microRNAs, long non-coding RNAs and other recently discovered non-coding RNAs that do not code for proteins. These RNA transcripts cooperate to regulate cell growth, development, apoptosis and many other important physiological processes. Therefore, we often study the transcriptome qualitatively or quantitatively. A qualitative study aims at identifying all expressed transcripts. A quantitative study determines the expression level of each of these transcripts. We can quantitate the initial [transcript] template by real-time detection of the fluorescence signal emitted in each cycle of the classical PCR. This is the so-called "Real-Time quantitative Reverse Transcription PCR", or "Real-Time qRT-PCR". With correctly designed primers and probes, the qRT-PCR technique can quantitate the copy number, i.e. the expression level, of the target transcript over a wide measurement range. qRT-PCR is thus often treated as the gold standard in transcriptome analysis. However, qRT-PCR suffers from the limitation that it can quantitate the expression level of only ONE transcript at a time. Also, this technique requires the sequence of the transcript to be known in advance, making it difficult to discover unknown transcripts.
Microarray had been the major transcriptome analysis technique before next-generation sequencing (NGS) was widely used. A microarray, or "gene chip", is a solid substrate of about 1 cm x 1 cm to which several hundred thousand probes are attached. By exploiting the base-pairing rules by which complementary nucleotide sequences form a double strand, a microarray can simultaneously detect all nucleotide segments in the sample that are complementary to the probes. The expression profile of genes in this sample can thus be obtained easily. Therefore, microarray was widely used in biology, medical science, agricultural science and other fields shortly after its invention in the 1990s. Compared with qRT-PCR, microarray has a considerably higher throughput, but it still requires the sequences of the transcripts to be known in advance. The expressed sequence tag (EST) technique obtains part of the sequence of a randomly chosen cDNA segment by cloning it and sequencing it once. Because it is based on sequencing, EST differs from microarray in that it can sequence a transcript without knowing its sequence in advance. Therefore, EST can be used to discover new transcripts. In fact, Craig Venter and others from NIH had already applied EST in 1991 to discover new human genes. However, due to the limited sequencing throughput at that time, an EST run could often yield the sequences of only several thousand transcripts, which was hardly enough for [transcript profiling] at the level of the whole transcriptome. The development of deep sequencing techniques allowed researchers to study the whole transcriptome both qualitatively and quantitatively. This is the so-called "RNA-Seq" technique. Specifically, we first generate cDNAs from the RNAs in the biological samples by reverse transcription, then break them into smaller fragments, load the fragments into sequencers and sequence them.
RNA-Seq allows researchers to obtain the transcriptome quickly and to identify existing alternative splicing isoforms, which is very hard for traditional techniques such as microarray. Therefore, the RNA-Seq technique allows researchers to study the transcriptome both qualitatively and quantitatively. Please note that RNA-Seq is by nature a random sampling of transcript sequences. Thus its detection power and sensitivity depend strongly on the depth of sequencing. Without enough sequencing depth, it is very hard to detect low-copy genes. In principle, the depth cannot be considered sufficient unless the saturation curve reaches its plateau. A rule of thumb for the sequencing depth of mammalian transcriptomes is 100x to 150x coverage. Under the random sampling condition, the count of reads mapped to a specific transcript is proportional to the abundance of that transcript. Therefore, we can estimate the expression level of a transcript by the total number of reads mapped to that transcript. However, the number of reads mapped to a transcript is also proportional BOTH to the length of the transcript AND to the total sequencing depth. For example, take two genes A and B. Assume that they have the same expression level; each of them is transcribed into two transcript copies. Because A is twice as long as B, the number of reads mapped to A is twice as large as the number of reads mapped to B. If we only looked at the number of reads, we would think that the expression level of A is twice as high as that of B. This is, however, obviously not correct. Let's take another example, with two RNA-Seq runs. Gene B has the same expression level in both runs.
However, as the sequencing depth of the first run is twice that of the second run, the number of reads mapped to gene B in the first run is also twice that in the second run. Again, if we only looked at the number of reads, we would think that the expression level of gene B in the first run is twice as high as in the second run. This is also obviously incorrect. Therefore, in practice we often apply a linear scaling to the raw read counts, transforming them into RPKM values to normalize them. RPKM (Reads Per Kilobase of transcript per Million mapped reads) is a commonly used normalization method: RPKM = 10^9 x C / (N x L). Here C denotes the total number of reads mapped to the transcript, N denotes the total number of reads in this experiment/run, i.e. the sequencing depth, and L denotes the length of the transcript in base pairs. Assuming that the global distributions of RNAs are consistent across different samples, RPKM can correct for the artifacts caused by transcript length and sequencing depth. This makes it possible to correctly compare expression profiles between different genes, different sequencing runs, and even different samples. Please note that RPKM is not the only way of normalization. Different normalization methods can be constructed by considering different sources of bias and introducing different biological assumptions. In fact, it has been documented that, compared to later methods such as TMM and DESeq, RPKM does not perform the best in analyses such as differential gene expression analysis between samples. Also, please note that the strand specificity of the RNA-Seq technique matters. As we all know, both strands of DNA can be transcribed to generate different transcripts. However, the commonly used Illumina RNA-Seq kit is not strand-specific. In other words, we cannot know which one of the paired reads is in the same direction as the transcript, and which one is in the opposite direction. For data that are strand-specific, there are also two different cases.
In the dUTP-labelling method adopted by the Illumina strand-specific kit, the second read [of paired reads] is in the same direction as the transcript, while the first read is in the opposite direction. When it comes to the second strand method adopted by SOLiD and other platforms, the first read is the one in the same direction as the transcript and the second read is in the opposite direction. Therefore, we must make it clear before analysis whether the data are strand-specific, and if so, which kind of strand specificity they have. For more details, you can watch the computer lab video course for this MOOC made by Mei Hou and Feng Tian, two students from the Center for Bioinformatics, Peking University. We have had a brief introduction to the basic background and the common experimental measurement techniques for transcriptome study. Here are some summary questions. You are encouraged to think about them and discuss them with other students and TAs in the online forum. In the next unit we will explain the specific methods, starting from the mapping of RNA-Seq reads. See you next unit!
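As a supplement to the RPKM discussion in this unit, here is a minimal Python sketch of the normalization. It assumes the standard definition RPKM = 10^9 x C / (N x L), with C the reads mapped to the transcript, N the total number of reads in the run and L the transcript length in base pairs. The function name `rpkm`, its parameter names and all the numbers are made up for illustration; they simply mirror the two examples from the lecture (gene A twice as long as gene B, and a second run with half the sequencing depth).

```python
def rpkm(read_count, total_reads, length_bp):
    """Reads Per Kilobase of transcript per Million mapped reads.

    read_count  -- C: reads mapped to the transcript
    total_reads -- N: total reads in the run (sequencing depth)
    length_bp   -- L: transcript length in base pairs
    """
    if total_reads <= 0 or length_bp <= 0:
        raise ValueError("total_reads and length_bp must be positive")
    return 1e9 * read_count / (total_reads * length_bp)


# Hypothetical numbers, chosen only to mirror the lecture's two examples.

# Example 1: same run, same expression level, gene A twice as long as gene B,
# so gene A collects twice as many reads.
run1_depth = 1_000_000
gene_a = rpkm(read_count=200, total_reads=run1_depth, length_bp=2000)
gene_b_run1 = rpkm(read_count=100, total_reads=run1_depth, length_bp=1000)

# Example 2: gene B again, in a second run with half the sequencing depth,
# so it collects half as many reads.
run2_depth = 500_000
gene_b_run2 = rpkm(read_count=50, total_reads=run2_depth, length_bp=1000)

print(gene_a, gene_b_run1, gene_b_run2)
```

With these made-up counts the three RPKM values come out equal, even though the raw read counts differ by a factor of two in each comparison. This is only the simple linear scaling described above; it does not implement TMM or DESeq, which rest on different assumptions.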