Step 3: Alignment - srkoppolu/SK_RNA-Seq GitHub Wiki
If the raw read quality is sufficiently clean or after the data has been filtered and trimmed to acceptable standards, the next step would be to align the reads to a reference genome/transcriptome.
There is no standard aligner tool that can be applied to all projects. Novel aligners are actively being developed and so the choice of aligners will depend on several factors:
- reference to be used (genome / transcriptome)
- the data type (short reads / longer reads)
- the available computational power, etc.
Most recent aligners use either BWT (Burrows-Wheeler transformation) or MEM (Maximum Exact Matches) based approaches to perform alignment, while older generation aligners relied on a spliced-seed approach. There are a variety of options in selecting the alignment strategy. Picking the most accurate aligner with the optimal parameter set for a project is a huge task, and there is usually no guideline available as to which aligner is most appropriate for a given situation. Therefore, I have presented the implementation protocols for several of the most used genome-based aligners, such as STAR, HISAT2, and transcriptome-based aligners, such as Bowtie2, BWA-MEM and TopHat. For a more updated list of the different aligners, check here.
These alignment tools generally require two steps:
- The reference genome/transcriptome is converted to an 'indexed reference'
- Fast read mapping
Note: Based on the speed of the aligner and the number of reads to align, consider breaking up reads into multiple files for parallel processing across multiple nodes on a HPC cluster.
Reference Genome Alignment:
The main consideration for aligners to map reads to a reference genome is their ability to split between distant regions of the reference to account for reads that span across two or more exons. The sequencing libraries are constructed from transcribed RNAs, which are usually free of intronics sequences, and thus when aligning back to reference genomes,we need to consider the possibility of a read to be split across different distant regions of the genome. Some aligners can do de novo spliced alignment while others need intron/exon boundary annotations from the reference genome through a GFF3 or GTF file.
Most popular aligners for genome based alignment are STAR, HISAT2.
STAR
STAR (Spliced Transcripts Alignment to a Reference) is a fast RNA-Seq read mapper, with support for splice-junction and fusion read detection. It aligns reads by finding the maximum mappable prefix (MMP) hits between reads (or read pairs) and the genome. Different parts of a read can be mapped to different genomic positions, corresponding to splicing or RNA-fusions.
The indexing and alignment can be achieved using the following code:
# Generating genome indices
STAR --runMode genomeGenerate --genomeDir indices/genome --genomeFastaFiles genome.fa --runThreadN 16 --sjdbOverhang 99 --sjdbGTFfile genome.gff3
# Aligning sample reads to indexed genome
STAR --genomeDir indices/genome --readFilesIn sample_1_sortmerna_trimmomatic_1.fq.gz sample_1_sortmerna_trimmomatic_2.fq.gz --runThreadN 16 --sjdbGTFfile genome.gff3 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix results/sample_1_sortmerna_trimmomatic_STAR_
HISAT2
HISAT2 (Hierarchical Indexing for Spliced Alignment of Transcripts) is a fast and sensitive alignment program for mapping NGS reads (whole-genome, transcriptome, and exome sequencing data) against the general human population or against a single reference genome. The HISAT2 index files can be downloaded from here.
The alignment can be achieved using the following code on the command line:
# Aligning sample reads to indexed genome using HISAT2
hisat2 -p 11 --dta -x ~/indices/genome -1 ~/samples/sample_1_sortmerna_trimmomatic_1.fq.gz -2 ~/samples/sample_1_sortmerna_trimmomatic_2.fq.gz -S sample_1_sortmerna_trimmomatic_HISAT2.sam
In case of multiple samples, use the hisat2align.sh script to run the alignment for all samples.
Reference Transcriptome Alignment:
When mapping to a reference transriptome, the intron/exon boundaries are irrelevant. The only consideration is for gapped alignments depending on the type of transcriptome assembled. This is usually achieved through two basic strategies:
- assemble a single, longest gene model for each set of splice variants, and
- try to reconstruct each variant separately.
The first case needs intron/eon boundary information to account for splicing, while the second does not need that. However, it is much more difficult to assemble all variants accurately, and so might be practical to assemble a single, longest gene model.
BWA
BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads. [source]
Depending on read length, BWA has different modes optimized for different sequence lengths:
- BWA-backtrack: designed for Illumina sequence reads up to 100bp (3-step)
- BWA-SW: designed for longer sequences ranging from 70bp to 1Mbp, long-read support and split alignment
- BWA-MEM: shares similar features to BWA-SW, but faster and more accurate.
The alignment can be performed using the following code:
# Generating the index files
bwa index -a bwtsw indices/transcripts.fa
# Align using the bash
bwa mem indices/transcripts.fa -t 16 sample_1_sortmerna_trimmomatic_1.fq.gz sample_1_sortmerna_trimmomatic_2.fq.gz | gzip -3 > sample_1_sortmerna_trimmomatic_BWAmem_sam.gz
BowTie2
Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.
The alignment can be performed using the following code:
# Generating the index files
bowtie2-build indices/transcripts.fa transcripts
# Align using the bash
bowtie2 -x transcripts -1 sample_1_sortmerna_trimmomatic_1.fq.gz -2 sample_1_sortmerna_trimmomatic_2.fq.gz -S sample_1_sortmerna_trimmomatic_BWAmem_sam
The outputs of these aligners are usually SAM or BAM (compressed, binary equivalents of SAM) files which contain the summarized information of the short read alignments necessary for the downstream analyses such as annotation, transcript abundance comparisons and polymorphism detection. SAMtools and its companion package Picard provide a convenient platform for manipulating SAM and BAM files, which contain one row for each alignment with the information regarding the alignment is represented in various columns, such as read name, position of alignment, quality score, etc.