All right. Welcome to the second module of Plant Bioinformatics. This week we'll be doing some expression analysis. What can gene expression data tell us? Well, sequence similarity seems to provide function identification for only about 40% to 60% of the genes identified in genome sequencing projects, and there are many lineage- and species-specific genes. Also, sequence similarity will not identify novel functions of proteins or different functions under different conditions. Genes that are involved in regulation, interaction, or integration of pathways are the most difficult to identify in this context. Traditionally, these have been identified using genetic/mutant analysis and also biochemically. Many of those kinds of genes are expressed at low levels or show transient expression, and these may have been missed by typical molecular biological or genetic methods. So, this has led to the whole field of functional genomics, which seeks to devise and apply technologies that take advantage of the growing body of sequence information to analyze the full complement of genes and proteins encoded by an organism. There are several major approaches that might provide insight into the possible function of genes: determining the expression pattern for all genes; determining the expression and distribution of all proteins; knocking out genes and then examining the organisms for their phenotype or expression patterns; and identifying interactions among proteins using two-hybrid analysis or TAP-tagging methods. There are different levels of functional genomics, and this is a table modified from Oliver et al. At the genome level of analysis, what we're looking at is the complete set of genes of an organism or its organelles. This is context-independent, in the sense that the genome doesn't change in different cell types or in different tissues, and the method of analysis is systematic DNA sequencing.
The other levels of analysis that we can have start with the epigenome, which is the set of epigenetic modifications to DNA or histones in a cell, tissue, or organ. This is context-dependent, in the sense that changes may be made in response to the environment or pathogens, or in specific cell types, tissues, or organs. The way of determining the epigenome is systematic DNA sequencing. The next level of analysis is the transcriptome, which is the complete set of mRNA molecules present in a cell, tissue, or organ. Again, this is context-dependent, depending on changes in physiology, development, or pathology. The way we can analyze the transcriptome is with RNA-seq or microarrays, ESTs, Serial Analysis of Gene Expression, or, in quotes, "high-throughput" Northern analysis. The other couple of levels of analysis that we can consider in terms of functional genomics are proteomics and metabolomics, both of which are context-dependent. So, we'll have a brief overview of profiling technologies. The first we'll talk about is RNA-seq. In the case of RNA-seq, what we do is we take mRNA purified from a cell type, an organ, or a tissue, we convert that, in one of these three possible ways here, into cDNA, and we attach linkers to create libraries. Then we sequence, and we follow that up with bioinformatic analysis. That bioinformatic analysis basically is mapping these fairly short reads onto a reference genome or a de novo assembled transcriptome, and then looking at how many reads for a particular gene map back to that position. In this way, we can tell genes that have high expression from those that have low expression, where we have fewer reads mapping. The other main kind of technology is oligonucleotide microarrays, which were popular a couple of years ago. In this case, what we do is we synthesize probes that correspond to the 3' end of the mRNA, and there are several probes in the probe set, and these probe sets are grown on silicon wafers, and then that creates the GeneChip.
Then what we do is we take our cells, we isolate mRNA, we convert it into labeled transcript cRNA, we hybridize to the oligonucleotide microarray, we wash and stain, and then we scan, and basically the brighter the signal at a given location on the cell, the more mRNA was present in the original sample. The need for bioinformatics in terms of RNA-seq is great. We have the experimental design and sequencing design aspect, where we want to perhaps randomize our libraries and randomize the sequencing run. We also need to consider some statistical elements. How many replicates do we need? There's some need for bioinformatics in terms of quality control. In terms of transcript profiling, what we often want to do is figure out the level of expression of a particular transcript. Another important analysis aspect is differential expression. We do talk about differential expression in the fourth lab of Bioinformatic Methods II, where you can explore some command line tools for doing differential analysis and also read mapping. Further analyses include data visualization and comparing with the existing databases, and that's what we're actually going to be doing today. Some other analyses that we can do are looking at small non-coding RNAs, gene fusion discovery, long reads, and single-cell analysis, which we'll touch on briefly in the lab. So, once a gene expression profiling experiment is conducted, there are several databases around for depositing your gene expression profiles into, and those are ArrayExpress or GEO at the NCBI. There are tens of thousands of samples that have been archived in those databases. There's also the Sequence Read Archive at NCBI for RNA-seq data, and this is explorable for many plants on NCBI's Gene pages, and we'll be looking at that in this lab. There are many organism-specific gene expression databases that have been developed.
The Bio-Analytic Resource for Plant Biology has expression data for more than a dozen plants, Araport.org has an RNA-seq compendium used to generate the Araport11 build, and Genevestigator also contains array-based and RNA-seq expression data for several plant species, and we'll be exploring that in today's lab too. So, what can gene expression data help us with? Well, it can help guide our wet-lab experiments. Here's an example from the Christendat Lab, which is in the Department of Cell and Systems Biology at the University of Toronto. The lab is interested in the shikimate biosynthesis pathway, and the first step of that pathway is catalyzed by an enzyme called 3-deoxy-D-arabino-heptulosonate 7-phosphate synthase. There are three isoforms, encoded by three genes in Arabidopsis, that can catalyze that reaction. So, the interesting thing is that if you order the knockouts of those genes and you look at the knockout plants, the plants lacking those genes, under normal growth conditions you don't see any phenotype. So, where do you start to look for a phenotype? Well, you can use gene expression data to narrow down the phenotypic search space. If we go to gene expression databases, we actually see that those three isoforms are strongly increased in expression in response to UV-B light. This is just a heat map, with red colour showing stronger expression in response to the various conditions that are denoted along the top here by the coloured bar. Lo and behold, if you look at the mutant plants and expose them to ultraviolet light, you see that they don't do as well under ultraviolet light as a wild-type plant. So this use of gene expression data can really be valuable for narrowing down the phenotypic search space, as I said.
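To make the idea of narrowing down the phenotypic search space a bit more concrete, here is a minimal sketch of how one might rank conditions by how strongly a set of genes responds relative to a control. Everything in it is an assumption for illustration: the expression values are invented (they are not taken from any database or from the study mentioned above), the gene labels DHS1 to DHS3 are used only as placeholders for the three isoforms, and the function names and the pseudocount of 1 are arbitrary choices.

```python
import math

# Invented expression values (arbitrary units) for illustration only.
# These are NOT real measurements from any expression database.
EXPRESSION = {
    "control": {"DHS1": 100.0, "DHS2": 100.0, "DHS3": 100.0},
    "cold":    {"DHS1": 110.0, "DHS2": 90.0,  "DHS3": 100.0},
    "drought": {"DHS1": 150.0, "DHS2": 120.0, "DHS3": 200.0},
    "UV-B":    {"DHS1": 400.0, "DHS2": 800.0, "DHS3": 1600.0},
}


def log2_ratio(treated, control, pseudocount=1.0):
    """Log2 of treated over control; the pseudocount avoids division by zero."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


def rank_conditions(expression, control="control"):
    """Return (condition, mean log2 ratio) pairs, strongest induction first."""
    baseline = expression[control]
    scores = {}
    for condition, values in expression.items():
        if condition == control:
            continue
        ratios = [log2_ratio(values[gene], baseline[gene]) for gene in baseline]
        scores[condition] = sum(ratios) / len(ratios)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for condition, score in rank_conditions(EXPRESSION):
        print(f"{condition}\t{score:+.2f}")
```

With these invented numbers, the condition at the top of the ranking is the one you would test first in the knockout plants. A real analysis would use replicated, normalized data and a proper statistical test rather than a simple mean of log ratios.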
So we've developed a tool called the eFP Browser, which allows you to explore gene expression data from these online expression databases that have been created over the years, and basically, what you're looking at is a pictograph of the samples that were used to generate the RNA expression data. You can switch between different compendia that we've compiled over the years. You can view in absolute, relative, and compare modes, and relative is quite useful if you're looking at perturbation kinds of experiments. If you click on the tissues, you'll be taken to the literature record for the particular sample. The expression level distribution graph tells you if your gene of interest is strongly expressed relative to all of the other genes in this particular compendium, and in this case, it is: the further to the right along that little graph, the stronger that gene's expression level is. We have a colour scale, so red denotes strong expression, and we see strong expression of DHS3 in stems, and also in some floral organs. You can also click on those little magnifying glasses to be taken to some cell-type-specific datasets that have been generated by various labs over the years. Again, we see the strong expression in the stem and in some of the meristem tissue for DHS3. There's also a very high-resolution root dataset generated by Phil Benfey and Siobhan Brady, and we see strong expression of DHS3 in parts of the developing root. What its role there is remains to be elucidated biologically, but certainly the expression pattern would suggest that it's important in the developing root. We also have a light series, and we see some oscillations in DHS3 over the course of the day. There's also a pathogen response series, and we do see some response to pathogens for DHS3, and again here's that abiotic stress series and the increase in DHS3 expression in response to UV-B light. We also see strong expression of DHS3 in certain parts of the developing seed, namely the seed coat.
What its role there is also remains to be elucidated, but certainly again, the expression patterns are highly suggestive of an important role because of the strong levels of expression. There's also an inhibitor series and natural variation, and several other series that are worth exploring. RNA-seq data can be used to identify alternative splicing events, and that's the nice thing about RNA-seq data, apart from cost. What we do with RNA-seq data, just to go over this again, is we convert the mRNA population into cDNA. We get lots of cDNAs in the case of large levels of mRNA, or few cDNAs if we don't have a lot of mRNA. Then we generate short sequence reads from that cDNA population, we map those reads back to a reference genome, and we count the number of reads at a given position, which is reflective of the original transcript abundance. What we sometimes see, however, in certain conditions is that a given exon, for instance, is skipped, so there won't be any reads mapping to that exon. What we can also sometimes see is that for a given intron, we will have some reads mapping to that intron, implying intron retention. This aspect of intron retention or exon skipping can have important biological implications. This is just an example from a 2010 paper where Tom Brutnell and colleagues aimed to understand the expression level differences and alternative splicing events in tissues of different developmental stages, and in bundle sheath and mesophyll cells, from maize, which is a major crop that uses C4 photosynthesis, a very efficient kind of photosynthesis, and the bundle sheath cells are important for that C4 pathway. So, what they did is they took parts of the leaf, four sections, and then also sequenced the isolated bundle sheath cells, and they generated sequencing libraries from those four sections. Now, as you proceed towards the tip of the leaf, the C4 photosynthesis pathway kicks on.
Here the leaf is not photosynthetically competent; here the leaf is C4 photosynthetically competent. Then they did the analysis, and basically, a very interesting take-home message from this paper is that many alternative splicing events were detected, things like exon skipping, where several exons are skipped, or cases of intron retention, alternative 3' splice sites, and alternative 5' splice sites. A very interesting quote from that paper is that 56.4 percent of all possible targets showed evidence of alternative splicing along the developmental gradient. So, more than half of the cases where alternative splicing is possible seem to have an alternative splicing event. This is just an example from that paper of a glycoside hydrolase, where the transcript changes along the developmental gradient of the leaf from this version of the transcript, where we don't have this particular part of the gene included, to this particular version, SJ3, where that part is included in the transcript. Now, what the effect is on the function of the glycoside hydrolase wasn't reported in the paper, but certainly these patterns of alternative transcript usage do suggest some importance in terms of biology. All right. In today's lab, we'll be exploring the eFP Browser that I talked about, and also Genevestigator, which is a tool for exploring expression data for many plants. TraVA is an RNA-seq database for Arabidopsis thaliana. We'll also be looking at Araport JBrowse, which is for Arabidopsis only. NCBI's RNA-seq Viewer encompasses RNA-seq data for many plants. It's very useful for visualizing RNA-seq based gene expression level data across many plant classes. The last thing that we'll be looking at is the MPSS database from Blake Meyers' lab, which has small RNA data for many plants. Thanks and I hope you enjoy the lab.
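As a companion to the read-counting and alternative splicing discussion above, here is a rough sketch of how counting reads per exon and per intron can hint at exon skipping or intron retention. It rests on several simplifying assumptions that are not from the lecture: the gene model and read positions are invented, each read is reduced to a single mapped position (real reads span splice junctions and need a splice-aware aligner), and the 10% density threshold and the function names are arbitrary choices for illustration.

```python
# Invented gene model: (name, kind, start, end), 1-based inclusive coordinates.
FEATURES = [
    ("exon1", "exon", 1, 100),
    ("intron1", "intron", 101, 200),
    ("exon2", "exon", 201, 300),
    ("intron2", "intron", 301, 400),
    ("exon3", "exon", 401, 500),
]

# Invented read positions; each read is simplified to one mapped coordinate.
READS = [10, 20, 50, 90, 150, 160, 410, 450, 480, 499]


def count_reads(features, read_positions):
    """Count how many read positions fall inside each feature."""
    counts = {name: 0 for name, _, _, _ in features}
    for position in read_positions:
        for name, _, start, end in features:
            if start <= position <= end:
                counts[name] += 1
                break
    return counts


def flag_splicing(features, counts, min_fraction=0.1):
    """Flag exons with no reads and introns with noticeable read density."""
    density = {
        name: counts[name] / (end - start + 1) for name, _, start, end in features
    }
    exon_densities = [density[name] for name, kind, _, _ in features if kind == "exon"]
    mean_exon_density = sum(exon_densities) / len(exon_densities)
    flags = {}
    if mean_exon_density == 0:
        return flags  # gene not detectably expressed; nothing to call
    for name, kind, _, _ in features:
        if kind == "exon" and counts[name] == 0:
            flags[name] = "possible exon skipping"
        elif kind == "intron" and density[name] >= min_fraction * mean_exon_density:
            flags[name] = "possible intron retention"
    return flags


if __name__ == "__main__":
    counts = count_reads(FEATURES, READS)
    print(counts)
    print(flag_splicing(FEATURES, counts))
```

In practice, calls like these are made from junction-spanning reads and replicated samples using dedicated tools; the sketch only shows why per-feature read counts carry the signal.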