Running Tophat - ccsstudentmentors/tutorials GitHub Wiki
On Pegasus, head to ~/RNASeq/thisProject/Control1/
and we will use the vi text editor to make our first job file.
NB: To learn more about the vi editor, visit this vi tutorial or find an easy tutorial on YouTube. It is not as difficult as it seems, and will make your life as an analyst a lot easier.
mv ~/RNASeq/thisProject/Control1/
vi Tophat_Control1.job
The vi editor opens a file in "command mode". To convert to "insert mode", type i
, and then you will be free to type as normal.
Every job file that we will write in this tutorial has an analogous header region which has the following structure:
#!/bin/bash
#BSUB -J Tophat_Sample1
#BSUB -o Tophat_Sample1.out
#BSUB -e Tophat_Sample1.err
#BSUB -W 80:00
#BSUB -q general
#BSUB -n 4
Let's go through each of these lines one at a time and see what they mean:
Line #1: this command is the first thing you need to write in all of your job files. It specifies which scripting language the file uses. In this case (and in all cases in this tutorial), we are using the language bash.
Lines 2-7 are us telling the computing cluster a few basic things about what we want it to do. For a complete list of the parameters that can be specified here, see this page.
- Line #2: the #BSUB means we are telling the pegasus cluster something. That something is specified by which parameter we choose. In this line, we use the -J parameter to specify a name for this job.
- Line #3: The -o parameter allows us to name the standard output file that will be created by this job's execution.
Most jobs will create a standard output and a standard error file filled with information about the job's run/execution (the output file) or any errors that it ran into (the error file).
In most cases (I don't know of any exceptions), these files are distinct from the result files created by the execution of the specific job. - Line #4: The -e parameter allows us to name the standard error file that I referred to before.
- Line #5: The -W parameter allows us to specify the time limit for how long to allow the job to run. The time is specified as HH:MM (two digits of hours separated by a colon, then two digits of minutes).
- Line #6: The -q parameter allows us to choose which of Pegasus' predefined queues to use for this job. Each queue serves a different purpose and has different types of computers available. But in most cases, you will just use the 'general' queue. See this page for more information.
- Line #7: The -n parameter allows us to request a particular number of processors for this job.
See? That wasn't too bad. Now we just need one more line of code in this job file which actually tells Tophat to run (and how to run) on this particular file. The syntax of this line will depend on what format your data is in. We will cover the four most common situations below.
But first, let's talk about the different parameters that we will be using when calling tophat:
The complete list of these parameters and their official descriptions are located here.
First, we call the executable file 'tophat'. The computer knows where this executable file is because we loaded the tophat module earlier (see the section on Preparing to Use Tophat).
Next, we specify two optional parameters. You don't have to set/use these parameters if you don't want to.
- The first is the -G parameter which allows us to specify a genomic annotation to use. Tophat does not technically need this information. It can generate compute a transcriptome de novo, but giving it this genomic annotation provides a nice starting point to guide its efforts.
- The second optional parameter is -o which allows us to set the name of the directory where the output will be placed. I like to set this to the current directory (whichever directory you are in when you submit the job file) by setting the value to '.' (which basically means 'here'). Setting this -o parameter this way is more of a personal preference than anything else, but it works well for keeping everything organized.
The optional parameters are denoted by a dash (-) because you are writing in unix and this is the way unix understands code that contains options. You can choose to set any or none of them.
Now on to the required parameters!
The required parameters do not get written in the same way. Tophat just expects you to put them where they need to be and in the correct order.
The first required parameter that tophat expects is the 'base name' of the genome sequence/index data so that it can go and find it.
Remember back in the Preparing to Use Tophat file when we made those files called GRCh38.fa, GRCh38.1.bt2, GRCh38.rev.1.bt2, etc (or similarly with GRCm38 for mouse)? Well this parameter wants us to specify the name leading up to that first decimal.
So, we tell tophat which folder we put all of those files in. And then end with 'GRCh38' (or 'GRCm38' for mouse). Tophat knows that the files are going to have the extensions '.fa', '.1.bt2', etc. and this way we are providing it with all the information it needs to find that whole set of files.
After specifying the name and location of the genomic sequence and index files, Tophat will need to know where the fastq files are for this sample. Below are examples pertaining to specific types of experiments.
This is the simplest example:
tophat -G ~/RNA-Seq/hg38/hg38.gtf -p4 -o . ~/RNA-Seq/hg38/hg38 ../FastQFiles/Sample1.R1.fastq
If you just have one fastq file for your sample, then this is easy! This first example shows that you just enter the name of that file and its location, and you are done!
Unfortunately, most RNA-Seq projects these days get a bit more complicated than that.
tophat -G ~/RNA-Seq/hg38/hg38.gtf -p4 -o . ~/RNA-Seq/hg38/hg38 ../FastQFiles/Sample1.R1.fastq ../FastQFiles/Sample1.R2.fastq
If your sample was generated with paired-ended reads, then you will have (at least) two fastq files for each sample.
To enter your paired-ended fastq files properly to tophat, you need to give it the name of the 'left ended' read file and then the name of the 'right ended' read file separated by a space.
Do not separate these two files with a comma -- that tells tophat that they are two sets of single-ended reads, rather than the paired ends of the same reads (which is the example case shown next).
tophat -G ~/RNA-Seq/hg38/hg38.gtf -p4 -o . ~/RNA-Seq/hg38/hg38 ../FastQFiles/Sample1.Lane1.R1.fastq,../FastQFiles/Sample1.Lane2.R1.fastq,../FastQFiles/Sample1.Lane3.R1.fastq
You will often get multiple sets of fastq files for each sample. This is usually a result of simultaneously running a single sample out on several lanes of the HiSeq machine.
To tell Tophat to align several single-ended fastq files as a single sample, follow the example shown above and input the files as a comma-separated list.
tophat -G ~/RNA-Seq/hg38/hg38.gtf -p4 -o . ~/RNA-Seq/hg38/hg38 ../FastQFiles/Sample1.Lane1.R1.fastq,../FastQFiles/Sample1.Lane2.R1.fastq,../FastQFiles/Sample1.Lane3.R1.fastq ../FastQFiles/Sample1.Lane1.R2.fastq,../FastQFiles/Sample1.Lane2.R2.fastq,../FastQFiles/Sample1.Lane3.R2.fastq
Finally, if your sample was run across multiple lanes and paired-ended sequencing was performed, follow this example.
Essentially, you need to give a comma-separated list of your 'left-ended' reads followed by a comma-separated list of your 'right-ended' reads where the two lists are separated by a space.
It is very important to input the sets of read pairs in the same order in each of the two lists. If you don't do this, Tophat will throw an error.
Alright, that is all the information in the example Tophat job file! Once you have edited the job for your sample (we have called our example here the Control1 sample), submit it to the cluster with the following command:
bsub < Tophat_Control1.job
You can check on the status of any jobs submitted in this manner by using this command:
bjobs
You should expect your Tophat alignment to take several hours. I usually just let it run overnight. Make sure to take advantage of the fact that Pegasus can run many jobs at once and submit jobs for aligning each of your samples so it all gets done in a single night.
A successful run of Tophat will return the following files (as well as many other files, which are not useful as far as genomic analysis is concerned but may be useful to your system admin in case something goes wrong):
accepted_hits.bam
junctions.bed
insertions.bed
deletions.bed
The 'accepted_hits.bam' file will be our input for our future work. It is an aligned read file, which basically has all of the information found in the fastq file plus where each read aligned in the genome. BAM files are compressed in binary format though, so if you want to view it and make sense of it, you will need to convert it to a SAM file by using the 'samtools view' command. To do that, you will need to load samtools:
module load samtools/0.1.19
And then run the command:
samtools view accepted_hits.bam | head -100
The [pipe] head -100 is not necessary, but if you don't include it the command will spit out tens of millions of lines of information to your screen. And nobody really wants that. This allows you to see the first 100 lines of the file. If you really want to see the first 500, just use -500 instead, and so on (but why would you).
Congratulations! You have now aligned your sequences to a genome. Now comes the fun part. You now need to choose how you want to analyze the rest of your data. The easiest way to make this decision is to choose whether you want to look at gene-level information or isoform-level information, and then head to those corresponding tutorials on our wiki! We have a short discussion of the pros and cons of each of those analysis approaches here.