Once bits are aligned to the genome, the next step is to assemble them into transcripts. And for that purpose, we'll illustrate how we can use Cufflinks. So go back to our dataset. And we have alignments in a directory called Tophat. And now we want to run Cufflinks. Cufflinks uses an overlap graph, in order to represent in a compact way, a gene. And then it finds the minimum number of pass through, through the overlap graph, that can explain all the weights, or rather all the splicing paths. So that's, in a nutshell, the algorithm. Now let's see how we can use it. Again, I'm going to just call Cufflinks. To see the command line usage. All of this tools have fairly complicated command line arguments, and that's why like them in a file, and you can do so too, cufflinks.log. Okay, so just like before, we need to create a directory ideally for the output to be placed in. So we're calling Cufflinks, a list of options, and then the list of alignments, read alignments, the BAM file. Okay. So we have to create a directory for the output. We can specify the number of threads. We can give it a reference, but then it couldn't perform an assembly. Some information about the inner mate distance. But again, this is not necessary, it can be estimated from the input read. One that I found very useful is to give a label. So this is going to be used as the prefix for the name of the genes and transcripts that we assemble. And it's very useful to differentiate from isoforms from transcripts assembled from another sample. And then there are a couple of parameters, such as -F/--min-isoform-fraction, that I found particularly useful. By default -F is ten percent, which means that for a gene all the transcripts, only transcripts that are above ten percent of the transcript level, expression level, of the most abundant one for the gene are going to be reported. And similarly, -j/--pre-mrna-fraction to suppress intra-intronic transfer below this level. It provides a way to interpret the intronic reads as being either pre-MRNA fraction or being. And the default here is 15 percent, and I usually leave it as is. But it can be modified, for instance, when working with total RNA. So, these are some of the basic commands. In the case of strand specific libraries, one can use one of the options here, depending on the library. So, let's give it a try. First of all, we're going to create a directory called Cufflinks, and we would like for the outputs to be in Cufflinks. And we'll create one subdirectory for every sample. So just like before, let's create a little command or batch file. com.cufflinks. And now we pretty much know what we need. We're probably going to need the location of the same files, so I'm going to say THDIR. And I already know what that is, but it's easy to copy and paste. L4/Tophat. And then, we need to create a directory. We're going to create a directory. So let's create DATADIR as being, actually, WORKDIR, because that's where we're going to be working, and there's going to be our Cufflinks directory. So under WORKDIR, we're going to create for the first sample, directory Test1. And Cufflinks creates transcripts and intermediate files in the directory where it's being called, so we're going to go to the directory. Okay, once we are located there, we can call Cufflinks, cufflinks2. We're going to tell it a label. I'm just going to come up with that one. So I happen to know this, these are patient identifier, for the particular data set. We would like, say 8 threads, and then we will be using the BAM file in THDIR, in the Tophat directory, this one set accepted_hits.bat. So that's it. And these will create a set of files in the Cufflinks directory, Cufflinks and subdirectory Test1. And again, I'm going to show how you can call this command line. Command5, nohup sh shell com.cufflinks, and we'll save all the standard error to com.cufflinks.log. And put this in the background so we can exit without a comment. And this might take up to one day, or maybe sometimes longer to execute, and in the mean time, we can perform other operations. But now I'm going to check again, and instead of waiting for a day, I'm going to show you what the output looks like. So, let's go to results, and we're going to go to Cufflinks. And there's one directory for each of the samples. Let's go to Test1. And as you can see, the main file here is transcripts.gtf. These are assembled transcripts in gtf format. And you will recognize the format. So it's chromosome, the source, which is the program that's reading those, whether it's a transcript or exome, so the type of feature. Beginning of the feature, end of the feature, a score that can be represented graphically, the information about the strand, if none, and the coding sequence, the frame, and finally, the ninth column gives information about the gene ID, transcript ID, estimated expression level. And either below our bond of the confidence interval, higher bond, and so on. So a variety of types of informations that can be added. So this is the main file, transcripts.gtf. And you will see that there are a number of other files, for instance, a number of skipped transcripts. That usually contains a small number of transcripts that could not be processed, because there was some sort of exception. For instance, there were too many reads at the logs. And then, together with those, we have two files that give us the estimates, the expression estimates, fpkm. Let's take a look at genes.fpkm_tracking, and similarly for transcripts, or for isoforms. So the label, as you might recall, we asked for the label bja0z2. So all the gene names, and the transfer name are going to have this prefix. Class code not known, nearest preference, gene identifier, the same one, the location on the genome, and then information about the FPKM, the confidence intervals, and the FPKM status, whether it could be calculated or not, and why. So let's count how many genes and how many transcripts we have in the trascripts.gtf file. So we can find that information in column nine. So cut -f9, using the tools that were described in the first lecture, transcripts.gtf. Again, cut, but now we're going to have to separate the fields by a space. -F, I'm looking at column number 24G9B. F2, let's see what this looks like. And now we're going to call this unique, so that we have only, among consecutive records of the same type, we'll only keep one. We sort that uniquely, and then we're counting. So we have 43,983 genes. And if we want to identify the number of transcripts, then, you might recall, let's do another more, or rather, a head. So we can find that in column nine, one, two, three, four, so it's field number four. So, again, cut column nine from transcript, and cut by space -F4 uniq, and lets see how many we get. So, mostly we have one transcript per gene, in this particular case, but many of them might be single exon, and therefore, cannot have alternative variants. So this concludes the section on Cufflinks. And next, we're going to be looking at how we can perform differential expression with Cuffdiff.