In this lecture we're going to continue RNA-seq analysis and look at RNA-seq assembly, quantitation, and estimation of differential expression using Galaxy.
Once we've mapped our RNA-seq data to a reference genome, we can use the spliced alignment data to understand two things. The first is the locations of exons: where you have reads aligning to the genome, probably unspliced, that suggests there's an exon in that region. The second is the locations of splice junctions, based on the splice-mapped reads that we see. While this information is not sufficient to know exactly what transcripts are present, we can make a reasonable prediction, and that's what the process of assembly is going to do.
The tool we're going to use for this is called Cufflinks. We're going to use our gene annotations to guide the identification of the locations of exons and splice junctions, and we're going to do this for each of the four accepted_hits files from Tophat. If you went through the last lecture, you should have the results of running Tophat on four sets of RNA-seq data, so we have four data sets labeled accepted_hits. These will form the input to Cufflinks.
If we go to Cufflinks, there are a number of different tools that are part of the Cufflinks package. The one we want at this point is just Cufflinks. If you click on Cufflinks, you now have the tool form for Cufflinks. We want to do a couple of things differently. The first is that we have three options for reference annotation. We can choose not to use reference annotation, in which case all of the gene information is going to be inferred from the RNA-seq data that we have. We can choose to use a reference annotation, in which case that information will come from an existing gene annotation that we provide. In that case the Cufflinks tool will use the gene models from the existing annotation, and use our RNA-seq alignments against them for estimating levels of expression and such.
What we want to do here is something in between, which is to use reference annotation as a guide. This uses the information from both the RNA-seq data and the reference annotation for developing gene models. This is useful because, for example, if you've discovered genes in the RNA-seq data that are consistent with an existing annotation, they will have the gene name associated with them. So we're going to say use reference annotation as guide and select data set 5, chromosome 19-annotations.gtf. This is a file that came from the data library and contains the gene annotation information.
As I said before, we have to run this over all four of our accepted_hits data sets, and so we can use the run over multiple datasets feature in Galaxy. If we click multiple datasets, all four of the accepted_hits data sets show up here. The reason for this is that these are the only data sets in BAM format in our current history. So we can just go ahead and select all four of these, leave all of our other options at the defaults, and click Execute. This will run four Cufflinks jobs, one over each Tophat output, and give us sets of transcripts that are assembled from each set of splice-mapped reads.
Once Cufflinks has completed, each Cufflinks job has generated a couple of different data sets. First we have the gene expression information. This uses the RNA-seq reads to try to quantify the expression levels of each of the gene models that are discovered. It is reported using a number called FPKM, fragments per kilobase of exon per million mapped fragments, and this gives you estimated levels of expression for different genes. You'll see here this tracking ID. Some of these have gene names like CUFF.1. These are novel transcripts that were discovered by the Cufflinks assembly.
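To make the FPKM unit mentioned above concrete, here is a minimal Python sketch of the arithmetic behind the name. It is only the unit definition, not what Cufflinks actually computes: Cufflinks estimates abundances with a statistical model, so its values will not match this simple ratio exactly. The function name `fpkm` and the example numbers are invented for illustration.

```python
def fpkm(fragments_in_gene, exon_length_bp, total_mapped_fragments):
    """Naive FPKM: fragments per kilobase of exon per million mapped fragments.

    Illustrative only; this is the plain unit definition, not the
    model-based estimate that Cufflinks reports.
    """
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    kilobases = exon_length_bp / 1_000               # exon length in kb
    millions = total_mapped_fragments / 1_000_000    # library size in millions
    return fragments_in_gene / kilobases / millions


# Hypothetical gene: 500 fragments, 2,000 bp of exon, 10 million mapped fragments.
print(fpkm(500, 2_000, 10_000_000))  # 25.0
```

Dividing by exon length and by library size is what lets you compare a long gene with a short one, or a deeply sequenced sample with a shallow one.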
Other tracking IDs have more standard gene names; these are genes that are coming in from our reference annotation. So we have expression at the level of genes and transcripts, and then we also have GTF files, which are the actual gene models that Cufflinks has discovered. These show you the location of transcripts and of the exons inside those transcripts. The expression information is also in these files.
We'd like to go a step further. We have two conditions here, and what we'd like to do is a statistical test of whether genes are differentially expressed between the two conditions. However, we have a problem, which is that we have four different sets of gene annotations coming from the four different RNA-seq samples. So what we want to do first is run a tool called Cuffmerge, which is just going to join these different Cufflinks assemblies together.
If you search for Cuffmerge, or go into the NGS RNA-seq category in Galaxy and click Cuffmerge, you'll see that the Cuffmerge tool allows you to select a number of GTF files produced by Cufflinks. This is a repeat parameter in Galaxy, so you can have an arbitrary number of GTF files. We're going to add three more, and we'll select the Cufflinks output for each of the four original RNA-seq data sets. In my case these are 29, 33, 37, and 41. We also want to include our reference annotation, so say use reference annotation and select that chromosome 19 annotations GTF again. You can execute that, and this will result in a set of merged transcripts.
Once our Cuffmerge job has finished, we'll get one data set out, a GTF data set, which merges all of the gene models from each of the four original RNA-seq data sets into a single gene annotation. Once again, we can take a look at this by clicking the eye icon.
And so now we have a set of exons grouped into transcripts, linked to the original names from the gene annotation. We can use this now as our consensus gene model set for additional analysis.
Now that we have this, we want to go back and ask the original question we were interested in, which is: are things differentially expressed? We're going to use the tool Cuffdiff for this. Again, under NGS RNA-seq you'll find Cuffdiff. Cuffdiff is going to allow us to bring in all four of our data sets to ask whether things are differentially expressed.
We need a couple of things. First, we need one set of transcripts. This is going to be the set of gene models that Cuffdiff will use for its differential expression testing, and this is why we created that Cuffmerge data set. So we'll select the output of Cuffmerge here.
We have two conditions. In this case, our conditions are the cell types. You remember the original data we used was data from CD20 cells and data from H1-hESC cells, so I can just say condition one, CD20, and condition two, H1hesc. For each of these conditions we have two replicates. We have a repeat parameter here again, so we can click insert replicate for both. For CD20, I have the first two data sets from Tophat, so now we're going all the way back to our Tophat output and bringing in those accepted_hits again. I'll select the two data sets that came from our CD20 data and the two data sets that came from the H1hesc data.
There are a number of additional options here. As with all these tools, I would strongly suggest reading through the documentation and understanding the different options that may be appropriate for your experiment. But for the purposes of this data, the default options are reasonable. So we are going to go ahead and click Execute, and this will run Cuffdiff. Now, as you see, Cuffdiff generates a very large number of data sets for us.
It's going to create differential expression data at the level of genes, transcripts, transcription start sites, promoters and so on. All of these data sets are tabular format data sets that we can use to ask whether different gene models, promoters, etc., are differentially expressed between the two conditions.
Once Cuffdiff has completed, you'll have the data sets that I just mentioned. Let's take a look at, for example, our gene differential expression testing data set. If we look at this, we have the IDs for each of the genes and a status, which says whether a test was done. You can see that in cases where there's no expression data in one of the two samples, the differential expression test can't be performed. But in cases where we have data for both samples, we have information including a p_value and a q_value, which is the p_value corrected for multiple testing, for each of these genes, and then a call as to whether the gene is significantly differentially expressed. In this last column we see no for the first case, where no test was done, but our second case here is the gene MADCAM1. What Cuffdiff is telling us is that, with a q_value of 0.0084, this gene is significantly differentially expressed. That's how these files can be interpreted, and using Galaxy's filtering tools you could extract genes that are differentially expressed at different thresholds.
So in summary, using spliced alignment data from Tophat, Cufflinks is one tool that allows us to assemble transcripts in Galaxy. It will also quantify the relative abundance of each transcript in the sample. And then using Cuffdiff, we can perform a statistical test for differential expression based on the quantitation data, for multiple conditions and potentially multiple replicates within those conditions.
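If you download the gene differential expression table rather than filtering it inside Galaxy, the same kind of threshold filter can be sketched in a few lines of Python. This is only an illustration under assumptions: it assumes the table is tab-separated with a header row containing columns named `gene`, `status`, and `q_value`, and that a completed test is marked with the status `OK`, so check the header of your own file. The function name `significant_genes` is hypothetical, and the example rows are made up apart from the MADCAM1 q_value of 0.0084 quoted in the lecture.

```python
import csv
import io


def significant_genes(tsv_text, max_q=0.05):
    """Return (gene, q_value) pairs for tested genes with q_value <= max_q.

    Assumes a tab-separated table whose header includes the columns
    'gene', 'status' and 'q_value' (an assumption about the file layout).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row in reader:
        if row["status"] != "OK":      # skip rows where no test was performed
            continue
        q = float(row["q_value"])
        if q <= max_q:
            hits.append((row["gene"], q))
    return hits


# Trimmed, invented example table; only the MADCAM1 q_value comes from the lecture.
example = "\n".join([
    "gene\tstatus\tp_value\tq_value\tsignificant",
    "GENE_A\tNOTEST\t1\t1\tno",
    "MADCAM1\tOK\t0.0001\t0.0084\tyes",
    "GENE_B\tOK\t0.2\t0.4\tno",
])

print(significant_genes(example))             # strict threshold
print(significant_genes(example, max_q=0.5))  # looser threshold
```

Changing `max_q` is the scripted equivalent of extracting genes "at different thresholds" with Galaxy's filtering tools.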