Step 4: Alignment Quality Control - srkoppolu/SK_RNA-Seq GitHub Wiki

After the alignment of the sequence reads to a reference genome/transcriptome, it is very important to check the quality of the mapping process. The percentage of mapped reads serve as a global indicator of the overall alignment accuracy and of the presence of any DNA contaminants. Sometimes, rather than analyzing each independent read alignment, it can be more useful to obtain a general statistical overview of the mapping process. There are several tools that provide such statistics.

Most of the following descriptions have been adapted from this source.

STAR

If the alignment is performed through the STAR aligner, we would obtain a number of output files like:

  • Log.final.out - a summary of mapping statistics for the sample
  • Aligned.sorttedByCoord.out.bam - the aligned reads, sorted by coordinate, in BAM format
  • Log.out - a running log from STAR, with information about the run
  • Log.progress.out - job progress with the number of processed reads, % of mapped reads etc., updated every ~1 minute
  • SJ.out.tab - high confidence collapsed splice junctions in tab-delimited format. Only junctions supported by uniquely mapping reads are reported.

The Log.final.out output file contains the mapping statistics. We can take a closer look at that file using the following command:

$ less Sample_1_Log.final.out

This log file provides information on reads that,

  1. mapped uniquely
  2. reads that mapped to multiple locations, and
  3. reads that are unmapped.

Note: Additionally, we get details on splicing, insertion and deletion as well. From this file the most informative statistics include the mapping rate and the number of multimappers. A good quality sample will have at least 75% of the reads uniquely mapped. If this value falls below 60%, it is advisable to troubleshoot. However, these numbers vary largely with respect to the organism that we are working with.


In addition to the aligner-specific summary, we can also obtain quality metrics using tools like Qualimap or RNA-SeQC. These tools examine sequencing alignment data according to the features of the mapped reads and their genomic properties and provide an overall view of the data that helps to detect the biases in the sequencing and/or mapping of the data.The input can be one or more BAM files and the output consists of HTML or PDF reports with useful figures and tab delimited files of metrics data.

Some of the features to look for when assessing alignment quality of RNA-Seq data:

  • Reads genomic origin: Even if we have high genomic mapping rate for all samples, check to see where the reads are mapping. Ensure that there is not an unusually high number of reads mapping to intronic regions (~30% expected) and fewer than normally observed mapping to exons (~55%). A high intronic mapping suggests possible genomic DNA contamination and/or pre-mRNA.
  • Ribosomal RNA (rRNA) constitutes a large majority of the RNA species in any total RNA preparation. Despite depletion methods, we can never achieve complete rRNA removal. Even with Poly-A enrichment, a small percentage of ribosomal RNA can stick to the enrichment beads non-specifically. Excess ribosomal content (> 2%) will normally have to be filtered out so that differences in rRNA mapped reads across samples do not affect alignment rates and skew subsequent normalization of the data.
  • Transcript coverage and 5’-3’ bias: assess the affect on expression level and on levels of transcript GC content
  • Junction analysis: analysis of junction positions in spliced alignments (i.e known, partly known, novel)
  • Strand specificity: assess the performance of strand-specific library construction methods. The percentage of sense-derived reads is given for each end of the read pair. A non-strand-specific protocol would give values of 50% | 50%, whereas strand-specific protocols typically yield 99% | 1% or 1% | 99% for this metric.

Qualimap

Qualimap is a platform-independent application written in Java and R that provides both a Graphical User Interface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data.

The Qualimap tool takes BAM files as input and explores the features of mapped reads and their genomic properties. It provides an overall view of the data quality in an HTML report format. The reports are comprehensive with mapping statistics summarized along with useful figures and tab delimited files of metrics data.

To run Qualimap, first unset the Display to suppress the GUI (Default) option and add to the path:

# By default, Qualimap will try to open a GUI to run Qualimap, so run the unset DISPLAY command:
$ unset DISPLAY

# Add the location of the Qualimap tool to the PATH variable:
$ export PATH=/n/app/bcbio/dev/anaconda/bin:$PATH

Now, run Qualimap on the aligned data. There are different tools/modules available through Qualimap, and the documentation details the tools and options available. For the rnaseq tool, use the following command to see the arguments available using help:

$ qualimap rnaseq --help

We can run Qualimap with the following specifications:

  • -outdir: output directory for html report
  • -a: Counting algorithm - uniquely-mapped-reads(default) or proportional (each multi-mapped read is weighted according to the number of mapped locations)
  • -bam: path/to/bam/file(s)
  • -p: Sequencing library protocol - strand-specific-forward, strand-specific-reverse or non-strand-specific (default)
  • -gtf: path/to/gtf/file - needs to match the genome build and GTF used in alignment
  • --java-mem-size=: set Java memory
$ qualimap rnaseq \
-outdir results/qualimap/Sample_1 \
-a proportional \
-bam results/STAR/Sample_1_Aligned.sortedByCoord.out.bam \
-p strand-specific-reverse \
-gtf /reference/Homo_sapiens.GRCh38.92.1.gtf \
--java-mem-size=8G

The Qualimap report should be present in the results/qualimap directory. To view this report we would need to transfer it over to our local computers.