Alignment - Bioinformatics-Institute/transcriptomics_WBC GitHub Wiki

RNA-seq Flowchart - Module 3

2-i. Alignment

Use Bowtie2/Tophat2 to align all pairs of read files to the genome. The output of this step will be a SAM/BAM file for each data set.

Refer to TopHat manual and tutorial for a more detailed explanation:

TopHat2 basic usage:

#tophat2 [options] <bowtie_index> <lane1_reads1[,lane2_reads1,...]> <lane1_reads2[,lane2_reads2,...]>

Extra options specified below:

'-p 2' tells TopHat to use 2 CPUs for bowtie alignments
'-r 60' tells TopHat the expected inner distance between the reads of a pair. [fragment size - (2 x read length)]. 260 - (2 x 100) = 60
'--library-type fr-firststrand' since the TruSeq strand-specific library was used to make these libraries
'-o' tells TopHat to write the output to a particular directory (one per sample)
'--rg-id' specifies a read group ID
'--rg-sample' specified a read group sample ID. This together with rg-id will allow you to determine which reads came from which library in the merged bam later on
'-G ' supplies a list of known transcript models. These will be used to help TopHat measure known exon-exon connections (novel connections will still be predicted)
Note that the '-G' option for TopHat has a different meaning than the '-G' option of Cufflinks that we will use in step 9 later

TopHat alignment

Tophat expects the fasta file that a bowtie2 index was derived from to be in the same folder as the index. Let's put it there, otherwise Tophat will waste time recreating it every time.

Then we will run a loop as before, which will run 6 Tophat runs on the 6 paired FASTQ files to produce 6 directories with a SAM (mapping) file output.

    cd $RNAWORKING
    cp fasta/chr22_ERCC92.fa index/bwt2/chr22_ERCC92.fa

    mkdir tophat

    for sample in HBR UHR
    do
        for rep in 1 2 3
        do

        tophat2 -p 2 -r 60 --library-type fr-firststrand --rg-id=${sample}_R${rep} --rg-sample=${sample}_R${rep} -o tophat/${sample}_R${rep} -G annotation/genes_chr22_ERCC92.gtf index/bwt2/chr22_ERCC92 final/${sample}_R${rep}_1.fastq.gz final/${sample}_R${rep}_2.fastq.gz

        done
    done

Note: in the above alignments, we are treating each library as an independent data set. If you had multiple lanes of data for a single library, you would want to align them all together in one TopHat command Similarly you might combine technical replicates into a single alignment run (perhaps after examining them and removing outliers...) To combine multiple lanes, you would provide all the read1 files as a comma separated list, followed by a space, and then all read2 files as a comma separated list (where both lists have the same order): You can also use samtools merge to combine bam files after alignment. This is the approach we will take.

TopHat Alignment Summary

TopHat generates a summary of the alignments in a text file next to the aligned BAM file. The below command will print the summary for each alignment to the terminal.

cat tophat/*/align_summary.txt

Optional Alternative - HISAT2 alignment

Perform alignments with HISAT2. HISAT2 uses a graph-based alignment and has succeeded HISAT and TOPHAT2.

    cd $RNAWORKING
    mkdir  hisat2
    
    for sample in HBR UHR
    do
        for rep in 1 2 3
        do

	hisat2 -p 2 -x index/hisat2/chr22_ERCC92 --known-splicesite-infile index/hisat2/splicesites.txt --dta-cufflinks --rna-strandness RF -1 final/${sample}_R${rep}_1.fastq.gz -2 final/${sample}_R${rep}_2.fastq.gz -S hisat2/${sample}_R${rep}.sam
#Convert HISAT2 sam files to bam files (required for cufflinks)
        samtools view -b -S hisat2/${sample}_R${rep}.sam > hisat2/${sample}_R${rep}.bam
#Now sort the HISAT2 bam files (also required for cufflinks)
        samtools sort hisat2/${sample}_R${rep}.bam -f hisat2/${sample}_R${rep}_sorted.bam

        done
    done

END OF OPTIONAL ALTERNATIVE - HISAT2 alignment

Previous Section	This Section	Next Section
Preprocessing	Alignment	Postalignment-visualization