Lab 05 RNAseq - jfgout/AppliedGenomics GitHub Wiki
RNAseq (RNA sequencing) is a genomics/transcriptomics technique which consists in extracting RNA molecules from cells, generating the corresponding cDNA via reverse-transcription and sequencing these cDNA molecules (see lectures #10 and 11).
The main application of RNAseq is to quantify the expression level of genes. The number of RNAseq reads coming from a given gene is used as a proxy for its expression level (see the RPKM formula in lecture #10).
Another application of RNAseq is to identify the location of genes (and more precisely, the location of exons).
In this lab, we will use a real RNAseq dataset for both applications.
The folder /mnt/scratch/AppliedGenomics/RNAseq-yeast/PRJNA484406 contains a file (XVI.fastq) with RNAseq reads from chromosome XVI of S. cerevisiae.
Our first goal is to map these reads onto the yeast genome. The program that we will be using for this task is called tophat: http://ccb.jhu.edu/software/tophat/manual.shtml#toph
Read the documentation from this link to find the general syntax of tophat. Note that <argument> denotes mandatory arguments to the program.
For the first mandatory argument: <genome_index_base> , use: /mnt/scratch/AppliedGenomics/RNAseq-yeast/DB/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel
This is an indexed database of the yeast reference genome. It is fairly similar to the database that you have built in a previous lab to use BLAST (except that this one is for tophat, not BLAST...).
The second mandatory argument is the file containing the RNAseq reads.
Note that tophat can take several sequencing reads and even several pairs of reads (in the case of paired-end data) for its second argument. Here, we are working with only a single sequencing read file.
The only option you will need to specify here is the location of the directory in which topHat will write all of its output. Direct this output to a folder named "tophat-XVI" somewhere in your home directory (for example, it could be in a sub-folder named "Lab-05").
Before using tophat, you will need to load the corresponding module with: module load tophat2
As with the genome assembly (see previous labs), you will need to submit the tophat command as a job using a script file and qsub. However, it is a good idea to first test the command directly in the terminal and make sure it works. If you immediately obtain an error message, modify your tophat command until you obtain an error-free command.
For example, if your tophat command terminates immediately with this error message:
Error: Could not find Bowtie 2 index files (Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.*.bt2l)
It indicates that the path to the first mandatory argument is erroneous.
Once you have the right command line, it should take a while to go past this message:
Mapping left_kept_reads to genome Saccharomyces_cerevisiae.R64-1-1.dna.toplevel with Bowtie2
At this point, you can interrupt the command with ctrl+c and copy-paste the correct command in your script file which you then submit with qsub. Mapping should take only a few minutes to be completed.
Once the mapping is done, you should have several files inside the output directory, including a 19M file named accepted_hits.bam.
The file accepted_hits.bam contains the mapping coordinates of each read that was mapped by tophat. However, this file is in a binary format, so you wont be able to explore it with the usual tools (cat, less, head, ...). For those of you curious to know how the mapping information is encoded, you can see a plain-text version of the same file here: /mnt/scratch/AppliedGenomics/RNAseq-yeast/PRJNA484406/XVI-accepted_hits.sam
You will now use the program cufflinks (http://cole-trapnell-lab.github.io/cufflinks/cufflinks/index.html) to parse the accepted_hits.bam file (this program is capable of working with this type of binary file).
module load cufflinks/2.2.1
cufflinks will perform two actions: (1) merge each overlapping transcribed region into one transcript and (2) compute the FPKM value (see lecture #10) for each of these transcripts.
To illustrate point (1), imagine there are overlapping reads covering positions 100 to 400 of chromosome I, then no read from position 401 to 600, then again overlapping reads from 601 to 800.
Under this scenario, cufflinks will annotate two genes: one from 100 to 400 and the other from position 601 to 800.
The only option that should be specified is the location of the directory where cufflinks will output the results.
After you successfully run cufflinks, the result directory should contain a number of files, including one named transcripts.gtf
You may use less (or any other command you prefer) to look at the content of this file. It should look quite familiar to you if you remember the previous lab (Genome Browsers and GFF files).
Question: Looking at this annotation file, do introns seems to be frequent or rare in S. cerevisiae? Is this surprising? Why?
You will now use the ensembl website to visualize the annotation (transcripts.gtf) that you have generated. To do this, you will need to transfer the transcripts.gtf file to your computer using scp (see: https://kb.iu.edu/d/agye).
If you experience difficulties using scp, I've also uploaded the transcripts.gtf on canvas, so you can download it directly from there.
Next step: go to https://fungi.ensembl.org/Saccharomyces_cerevisiae/Location/View?r=XVI:20266-41273;db=core and load the transcripts.gtf file as a custom track (see previous lab).
Make sure you are looking at the region: XVI:9763-51778 and compare your custom track to the SGD track.
Question: Compare your custom track to the SGD track (already displayed by default in ensembl). Assuming that the SGD track is perfect, what kind of errors do you find for your custom track in this region?
We are now going to use RNAseq to measure gene expression level (still in S. cerevisiae) in two different conditions: aerobic and anaerobic growth.
The data you will be using is a subset from a large dataset that was uploaded to the NCBI SRA (Short Read Archive) public database: https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=484406
You can also read the publication that resulted form this analysis here (optional, although reading the abstract would be a good idea...): https://www.g3journal.org/content/8/12/3881
You will find 4 RNAseq Illumina fastq files in /mnt/scratch/AppliedGenomics/RNAseq-yeast/PRJNA484406:
SRX4506363.fastq --> aerobic condition, replicate #1
SRX4506364.fastq --> aerobic condition, replicate #2
SRX4506365.fastq --> anaerobic condition, replicate #1
SRX4506366.fastq --> anaerobic condition, replicate #2
Question: What is the point of having replicates of the same condition?
You will use tophat (same as in 1.1) to map each of these 4 fastq files onto the yeast reference genome. However, this time we will 'help' tophat by providing it the yeast genome annotation. This allows tophat to know where the introns are, speeding up the mapping process.
The yeast annotation file is available here: /mnt/scratch/AppliedGenomics/RNAseq-yeast/DB/Saccharomyces_cerevisiae.R64-1-1.84.gff3
Locate in the help of tophat (http://ccb.jhu.edu/software/tophat/manual.shtml#toph) under the paragraph Supplying your own transcript annotation data how to do so. We will also be using the --no-novel-juncs option to tell tophat to not search for splice junctions absent from the annotation (remember: annotations are never perfect and it is common practice to use novel RNAseq dataset to improve the annotation and discover new splicing junctions - however, we will not be doing this here).
Submit your 4 tophat commands as a job. Make sure to specify a different output folder for each of the 4 files. This step can take over an hour to complete, so when your job is running you can go directly to the step 2.3 of this lab and come back to perform 2.2 after your mapping job is finished.
Use cufflinks (as in 1.2) to compute the FPKM values for each of the 4 mapping results (each one should have its own accepted_hits.bam file). Contrary to step 1.3, this time we will provide cufflinks with the official yeast genome annotation (/mnt/scratch/AppliedGenomics/RNAseq-yeast/DB/Saccharomyces_cerevisiae.R64-1-1.84.gff3) so that it does not have to guess the location of genes/transcripts. To do so, you will use the "--GTF " when running cufflinks.
Each cufflinks run should produce a file named genes.fpkm_tracking. This file contains (among other things) the FPKM value for each gene in the annotation.
Because the mapping can take a long time, I've pre-computed these files and made them available here: /mnt/scratch/AppliedGenomics/RNAseq-yeast/PRJNA484406/solution-public
Use the command head to look at these files. The data is organised in columns separated by a tabulation. The first line is a header line indicating what each column represents, the rest of the file is the actual data. Locate the column containing the FPKM value (4th from the end).
Use the command grep on these files to find the FPKM value for these two genes: YIL053W and YJR150C
Include in your lab report a table with the FPKM value of these two genes in each of the 4 samples analyzed.
Questions: Based on these FPKM values, what can you say regarding the induction/repression of these genes between aerobic and anaerobic conditions? How do the differences in FPKM between conditions compare to the differences between replicates?
Use the https://www.yeastgenome.org/ website to obtain some information on each of these two genes. Based on this information, are the expression patterns that you have observed surprising/expected/something else?
This part is not mandatory/not graded. However, if you have already finished the lab and want to learn more, you'll find some useful information below.
Writing the job submission file with the tophat command for each sequencing read file is manageable when dealing with only 4 files but it would be extremely tedious to do so with 400 files. One way to handle large number of files is to write a bash script that will perform the task of crafting the tophat and cufflinks commands and even submit the jobs for you.
I've created such a file which you can find here: /mnt/scratch/AppliedGenomics/RNAseq-yeast/PRJNA484406/solution-public/runAll.sh
Copy this file into your home directory and edit the script so that the "bd_out" variable points to a folder inside your home directory (where you have write permissions). Remember: everything after a "#" in a script file is a comment.
Then, run the script (from the folder where you have copied the script, type: ./runAll.sh) and see how it creates all the job files for you.
I've created a table that contains the FPKM values for all 4 samples: /mnt/scratch/AppliedGenomics/RNAseq-yeast/PRJNA484406/solution-public/all.tab
Here, I will show you how to use the program R (https://www.r-project.org/) to visualize this data.
Let's start by loading the R module: module load R
You can start R simply by typing "R"
To exit the R program, type q() and then "n" when prompted if you want to save the workspace image.
Type the following commands in R:
options(width=Sys.getenv("COLUMNS")) # This is just to make sure the display is using the full width of your screen.
# Now, we are loading the table of FPKM values with this command:
t = read.table("/mnt/scratch/AppliedGenomics/RNAseq-yeast/PRJNA484406/solution-public/all.tab", h=T, as.is=T, sep="\t")
# You can display the first 5 lines of this table with the following command:
t[1:5,]
# Now, let's make a histogram of FPKM values for the first replicate of the aerobic samples.
# This part requires that you have the X11 forwarding activated (if you cannot open gedit on the server, this part will not work). It might take a few seconds before the image appears if you are working from home unless you have an extremely fast internet connection.
hist( log2(t$aero1 + 0.01) , nclass=100, col="red" )
# Note that we are using the log2 transformation of the FPKM values and we add 0.01 to each FPKM value to avoid doing a log2-transformation of a zero value.

# The next step consists in looking at how similar the FPKM values are between replicates.
# Visual representation with a scatter plot:
plot( log2(t$aero1+0.01) , log2(t$aero2+0.01) )

# And calculation of the r-square value:
summary(lm(log2(t$aero1+0.01) ~ log2(t$aero2+0.01) ) )
# In an ideal world, all points on the graph will be on the diagonal (y=x) and the r-square value would be equal to 1.0
# The variation that you observe here comes from both biological variation (the expression level of genes is not going to be the same in two different cells, even if these two cells have the same genotype and are maintained in the same condition) and technical variation.
# If we now compare the FPKM values between aerobic and anaerobic, you will see that the points tend to fall much further away from the diagonal.
plot( log2(t$aero1+0.01) , log2(t$anaero1+0.01), xlab="Aerobic replicate #1", ylab="Anaerobic replicate #1" )

# Computing the average FPKM value across the two replicates:
t$aero = (t$aero1 + t$aero2) / 2
t$anaero = (t$anaero1 + t$anaero2) / 2
# Finally, let's highlight the two genes that we looked at earlier in the lab:
tr = t[which( is.element(t$geneID, c("YIL053W", "YJR150C"))==T ) , ]
plot( log2(t$aero+0.01) , log2(t$anaero+0.01), xlab="Aerobic (average)", ylab="Anaerobic (average)" )
points( log2(tr$aero+0.01) , log2(tr$anaero+0.01), pch=23, col="red", bg="red" )

While tophat is still frequently used, more recent methods based on pseudoalignment have been developed recently. The most popular one is called Kallisto: https://pachterlab.github.io/kallisto/about