Alignment - Bioinformatics-Institute/transcriptomics_WBC GitHub Wiki
2-i. Alignment
Use Bowtie2/Tophat2 to align all pairs of read files to the genome. The output of this step will be a SAM/BAM file for each data set.
Refer to TopHat manual and tutorial for a more detailed explanation:
TopHat2 basic usage:
#tophat2 [options] <bowtie_index> <lane1_reads1[,lane2_reads1,...]> <lane1_reads2[,lane2_reads2,...]>
Extra options specified below:
- '-p 2' tells TopHat to use 2 CPUs for bowtie alignments
- '-r 60' tells TopHat the expected inner distance between the reads of a pair. [fragment size - (2 x read length)]. 260 - (2 x 100) = 60
- '--library-type fr-firststrand' since the TruSeq strand-specific library was used to make these libraries
- '-o' tells TopHat to write the output to a particular directory (one per sample)
- '--rg-id' specifies a read group ID
- '--rg-sample' specified a read group sample ID. This together with rg-id will allow you to determine which reads came from which library in the merged bam later on
- '-G ' supplies a list of known transcript models. These will be used to help TopHat measure known exon-exon connections (novel connections will still be predicted)
- Note that the '-G' option for TopHat has a different meaning than the '-G' option of Cufflinks that we will use in step 9 later
TopHat alignment
Tophat expects the fasta file that a bowtie2 index was derived from to be in the same folder as the index. Let's put it there, otherwise Tophat will waste time recreating it every time.
Then we will run a loop as before, which will run 6 Tophat runs on the 6 paired FASTQ files to produce 6 directories with a SAM (mapping) file output.
cd $RNAWORKING
cp fasta/chr22_ERCC92.fa index/bwt2/chr22_ERCC92.fa
mkdir tophat
for sample in HBR UHR
do
for rep in 1 2 3
do
tophat2 -p 2 -r 60 --library-type fr-firststrand --rg-id=${sample}_R${rep} --rg-sample=${sample}_R${rep} -o tophat/${sample}_R${rep} -G annotation/genes_chr22_ERCC92.gtf index/bwt2/chr22_ERCC92 final/${sample}_R${rep}_1.fastq.gz final/${sample}_R${rep}_2.fastq.gz
done
done
Note: in the above alignments, we are treating each library as an independent data set. If you had multiple lanes of data for a single library, you would want to align them all together in one TopHat command Similarly you might combine technical replicates into a single alignment run (perhaps after examining them and removing outliers...) To combine multiple lanes, you would provide all the read1 files as a comma separated list, followed by a space, and then all read2 files as a comma separated list (where both lists have the same order): You can also use samtools merge to combine bam files after alignment. This is the approach we will take.
TopHat Alignment Summary
TopHat generates a summary of the alignments in a text file next to the aligned BAM file. The below command will print the summary for each alignment to the terminal.
cat tophat/*/align_summary.txt
Optional Alternative - HISAT2 alignment
Perform alignments with HISAT2. HISAT2 uses a graph-based alignment and has succeeded HISAT and TOPHAT2.
cd $RNAWORKING
mkdir hisat2
for sample in HBR UHR
do
for rep in 1 2 3
do
hisat2 -p 2 -x index/hisat2/chr22_ERCC92 --known-splicesite-infile index/hisat2/splicesites.txt --dta-cufflinks --rna-strandness RF -1 final/${sample}_R${rep}_1.fastq.gz -2 final/${sample}_R${rep}_2.fastq.gz -S hisat2/${sample}_R${rep}.sam
#Convert HISAT2 sam files to bam files (required for cufflinks)
samtools view -b -S hisat2/${sample}_R${rep}.sam > hisat2/${sample}_R${rep}.bam
#Now sort the HISAT2 bam files (also required for cufflinks)
samtools sort hisat2/${sample}_R${rep}.bam -f hisat2/${sample}_R${rep}_sorted.bam
done
done
END OF OPTIONAL ALTERNATIVE - HISAT2 alignment
Previous Section | This Section | Next Section |
---|---|---|
Preprocessing | Alignment | Postalignment-visualization |