Intro to Tophat - ccsstudentmentors/tutorials GitHub Wiki

So, you've gotten your FASTQ files (files containing millions of reads for each sample) onto your chosen computing server. The next step is to align those reads to the correct reference genome.

Lots of different next-generation sequencing applications require aligning the produced reads to the genome, but RNA-Seq has an additional layer of complexity. The reads come from processed mRNA, which contains only exons because the introns have been spliced out. A read that overlaps a splice junction therefore starts in one exon and ends in the next, and it will not align as one contiguous stretch to the genome, where the intron still sits between those two exons. To deal with this, RNA-Seq reads must be aligned to the genome with a 'splice-aware aligner', which can split a read across an intron.
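To make the problem concrete, here is a toy sketch in Python. The sequences, the function names (`contiguous_align`, `spliced_align`) and the `min_anchor` parameter are all made up for illustration, and the naive split-and-search approach is only meant to show why a junction-spanning read fails a plain alignment. It is not how Tophat2 or STAR work internally; real aligners use genome indexes and tolerate mismatches.

```python
# Toy genome: exon1 + intron + exon2 (sequences are invented for illustration).
exon1 = "ATGGCCAAGT"
intron = "GTAAGTCCCTTTTCAG"
exon2 = "GGATCCTTGA"
genome = exon1 + intron + exon2

# Processed mRNA has the intron spliced out.
mrna = exon1 + exon2

# A 10-base read that spans the exon1/exon2 junction.
read = mrna[5:15]


def contiguous_align(read, reference):
    """Return the 0-based position of an exact, unsplit match, or -1."""
    return reference.find(read)


def spliced_align(read, reference, min_anchor=3):
    """Naive splice-aware search (hypothetical helper, exact matches only).

    Tries every split point that leaves at least `min_anchor` bases on each
    side, and returns (left_pos, right_pos, split) for the first split whose
    two pieces both match, with the right piece downstream of a gap.
    Only the first occurrence of the left piece is considered.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = reference.find(left)
        if left_pos == -1:
            continue
        # Require at least one skipped base between the two pieces.
        right_pos = reference.find(right, left_pos + len(left) + 1)
        if right_pos != -1:
            return left_pos, right_pos, split
    return None


print(contiguous_align(read, mrna))    # matches the transcript
print(contiguous_align(read, genome))  # fails against the genome
print(spliced_align(read, genome))     # succeeds once the read is split
```

The unsplit read matches the transcript but not the genome. Once the read is allowed to split, the gap between the two matched pieces corresponds to the intron in this toy example.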

There are numerous splice-aware aligners available. The two most common are Tophat2 and STAR. Both are installed on Pegasus and can be loaded as modules if you wish to use them. Currently, Tophat2 is the more commonly used tool, but STAR is quickly gaining popularity because it is much faster while being just as accurate. There are plenty of papers comparing the two; one that I think does a particularly excellent job (and that also compares many other RNA-Seq related tools) is Williams et al., 2014.

Which tool you use is up to you, and the paper above can help you make an informed decision. In this module, you will find tutorials on how to install and use both Tophat and STAR.