Step 3a: Genome Alignment (STAR, HISAT2) - srkoppolu/SK_RNA-Seq GitHub Wiki

Reference Genome Alignment:

The main consideration for aligners to map reads to a reference genome is their ability to split between distant regions of the reference to account for reads that span across two or more exons. The sequencing libraries are constructed from transcribed RNAs, which are usually free of intronics sequences, and thus when aligning back to reference genomes,we need to consider the possibility of a read to be split across different distant regions of the genome. Some aligners can do de novo spliced alignment while others need intron/exon boundary annotations from the reference genome through a GFF3 or GTF file.

Most popular aligners for genome based alignment are STAR, HISAT2. For a complete list of spliced aligners, click here.


STAR

STAR (Spliced Transcripts Alignment to a Reference) is a fast RNA-Seq read mapper, with support for splice-junction and fusion read detection. It aligns reads by finding the maximum mappable prefix (MMP) hits between reads (or read pairs) and the genome. Different parts of a read can be mapped to different genomic positions, corresponding to splicing or RNA-fusions.
.

Indexing:

First, the genome needs to be indexed:

# Generating genome indices
STAR --runMode genomeGenerate --genomeDir indices/genome --genomeFastaFiles genome.fa --runThreadN 16 --sjdbOverhang 99 --sjdbGTFfile genome.gff3 

where, the basic parameters to be set are:

  • --runThreadN NumberOfThreads (16)
  • --runMode genomeGenerate (genomeGenerate)
  • --genomeDir /path/to/genomeDir (indices/genome)
  • --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 ... (genome.fa)
  • --sjdbGTFfile /path/to/annotations.gtf (genome.gff3) {ENSEMBL - gtf format; UCSC - gff3 format}
  • --sjdbOverhang ReadLength-1 (99)

Note: When creating a draft genome with large number of scaffolds, STAR is known to get resource-greedy. In such cases, the parameters --genomeChrBinNbits and --limitGenomeGenerateRAM can protect from abuse of memory usage during index creation. For more information, refer to the STAR documentation.

.

Alignment:

After generating the genome index, we need to align our sample reads to the indexed genome with the following command:

# Aligning sample reads to indexed genome
STAR --genomeDir indices/genome --readFilesIn sample_1_sortmerna_trimmomatic_1.fq.gz sample_1_sortmerna_trimmomatic_2.fq.gz --runThreadN 16 --sjdbGTFfile genome.gff3 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix results/sample_1_sortmerna_trimmomatic_STAR_  

In case of multiple alignments, edit and run the STARalign.sh script to run the alignment for all samples.

After the alignment, the following types of output files would be generated:

  • Log Files:
    • Log.out
    • Log.progress.out
    • Log.final.out
  • SAM Files
  • Sorted-by-Coordinate BAM Files: the output would be sorted by coordinate, similar to samtools sort command.
  • Splice junctions: contains high confidence collapsed splice junctions in tab-delimited format with the following columns:
    • column 1: chromosome
    • column 2: first base of the intron (1-based)
    • column 3: last base of the intron (1-based)
    • column 4: strand (0:undefined; 1:+; 2:-)
    • column 5: intron motif (0:non-canonical; 1:GT/AG; 2:CT/AC; 3:GC/AG; 4:CT/GC; 5:AT/AC; 6:GT/AT)
    • column 6: annotation (0:unannotated; 1:annotated)
    • column 7: number of uniquely mapping reads crossing the junction
    • column 8: number of multimapping reads crossing the junction
    • column 9: maximum spliced alignment overhang

HISAT2

HISAT2 (Hierarchical Indexing for Spliced Alignment of Transcripts) is a fast and sensitive alignment program for mapping NGS reads (whole-genome, transcriptome, and exome sequencing data) against the general human population or against a single reference genome. It is developed based on HISAT and Bowtie2 implementations, and gives ouput alignments in SAM format. It can be run on the command line under Linux, Mac OS X and Windows. For proper installation instructions, please refer to the manual.
.

HISAT2 indices:

Most of the HISAT2 indices are downloadable through their website. Each of these indexes comes with a shell script (e.g. make_grch38.sh) that provides instructions (or shell commands) for downloading a reference sequence, gene annotations, and SNPs, and building a HISAT2 index, which can be used to build the same downloadable HISAT2 index.

HISAT2 indexes named genome_tran or genome_snp_tran use Ensembl gene annotations, which include many more transcripts than RefSeq annotations, due to the inclusion of annotations as predicted by software.

Note: The HISAT2 indexes for the human genome include only primary assembly. If you choose to include alternative sequences introduced in GRCh38 and build your own HISAT2 indexes, then please be aware that those alternative sequences are nearly identical to the primary assembly in GRCh38 and as a result some reads may map to many more locations compared to when using the primary assembly only. HISAT2 often skips those multi-mapped reads if the number of potential locations is more than the value specified by the -k option. You may want to use a higher value for the -k option to resolve the issue.
.

Alignment:

The alignment can be achieved using the following code on the command line:

# Aligning sample reads to indexed genome using HISAT2
hisat2 -p 11 --dta -x ~/indices/genome -1 ~/samples/sample_1_sortmerna_trimmomatic_1.fq.gz -2 ~/samples/sample_1_sortmerna_trimmomatic_2.fq.gz -S sample_1_sortmerna_trimmomatic_HISAT2.sam 

In case of multiple samples, use the hisat2align.sh script to run the alignment for all samples.