RNASeq pipeline - mountetna/ruby_pipe GitHub Wiki

This is a description of the RNA alignment pipeline, accessed with rna_paired_align.

QC Metrics

The pipeline reports a series of QC metrics to assess sample quality. When the pipeline has completed, you can find this output in output/<cohort_name>/<cohort_name>.rna_seq_table. Here is a brief description of the QC fields in that table and their utility:

  • rrna_count - The pipeline begins by aligning reads against an rRNA reference ("soaking"). This is a count of non-mitochondrial rRNA reads (i.e., cytoplasmic rRNA reads).
  • mt_rrna_count - The complement of the above, a count of mitochondrial rRNA reads in this sample.

rRNA contamination is a common issue in many RNA preps, usually arising when whole RNA is depleted of rRNAs using a ribosome-depletion kit. These statistics can help diagnose the presence of contaminating rRNA reads. In a clean sample the rRNA fraction, i.e. (rrna_count + mt_rrna_count) / read_count, might be < 10%. If > 50% of your sample's reads come from rRNA, you should clean up your prep.
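As a rough illustration of the rule of thumb above, the rRNA fraction can be computed from the three counts and compared against the 10% and 50% marks. This is only a sketch: the function names, the "intermediate" label for values between the two marks, and the example counts are assumptions made for illustration and are not part of the pipeline.

```python
def rrna_fraction(rrna_count, mt_rrna_count, read_count):
    """Return (rrna_count + mt_rrna_count) / read_count, as defined above."""
    if read_count <= 0:
        raise ValueError("read_count must be positive")
    return (rrna_count + mt_rrna_count) / read_count


def classify_rrna(fraction, clean_below=0.10, cleanup_above=0.50):
    """Apply the rule-of-thumb marks; the labels are hypothetical."""
    if fraction < clean_below:
        return "clean"
    if fraction > cleanup_above:
        return "clean up prep"
    return "intermediate"  # between the two marks: left to judgment


# Made-up counts for a single sample, for illustration only.
sample = {"rrna_count": 400_000, "mt_rrna_count": 100_000, "read_count": 10_000_000}
fraction = rrna_fraction(**sample)
print(f"{fraction:.1%} -> {classify_rrna(fraction)}")  # prints "5.0% -> clean"
```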

  • read_count - The total number of reads remaining after ribosomal soaking.

  • mapped_count - The total number of reads that the aligner could align against the reference. Your mapping fraction (mapped_count / read_count) is ideally 100%. Unmapped reads may be a sign of foreign genome contamination (e.g. bacteria or virus in your sample) or unclipped adapters on your reads.

  • duplicate_count - Duplication rates are expected to be high in RNAseq, where per-gene coverage spans several orders of magnitude within the same sample. A high duplication fraction may be a sign of PCR amplification issues, but it is also expected to increase with coverage.

  • intergenic_count, introns_count, utr_count, coding_count, mt_coding_count - counts according to genomic region. The number of reads in protein-coding regions (coding_count) is often a useful baseline QC filter.

  • median_3prime_bias, median_cv_coverage - These two statistics (from Picard tools' CollectRnaSeqMetrics) measure transcript fragmentation. The former identifies transcripts with reads piled up abnormally toward the 3' end. The latter measures the coefficient of variation of coverage along the whole transcript and is a more general measure of fragmentation. A rule-of-thumb cutoff for CV coverage is 1.0.

  • eisenberg_score - A score based on 10 housekeeping genes from this paper, which should have well-defined expression levels across any sample. This statistic reports the number of Eisenberg housekeeping genes whose expression for the given sample falls outside of the expected range.
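
The fields above can be combined into a simple per-sample check. The sketch below assumes one row of the table has already been loaded into a dict keyed by field name. The function name, the flag labels, and the 0.9 mapping-fraction threshold are hypothetical choices for illustration. Only the 1.0 CV coverage cutoff comes from the description above, and treating values above it as the problematic side is also an assumption.

```python
def summarize_qc(row, min_mapped_fraction=0.9, max_cv_coverage=1.0):
    """Return (mapped_fraction, flags) for one sample's QC fields.

    min_mapped_fraction is a hypothetical threshold, not a pipeline default.
    """
    mapped_fraction = row["mapped_count"] / row["read_count"]
    flags = []
    if mapped_fraction < min_mapped_fraction:
        flags.append("low_mapping")
    # Assumption: CV coverage above the rule-of-thumb cutoff is the bad side.
    if row["median_cv_coverage"] > max_cv_coverage:
        flags.append("fragmented")
    # eisenberg_score counts housekeeping genes outside the expected range.
    if row["eisenberg_score"] > 0:
        flags.append("housekeeping_out_of_range")
    return mapped_fraction, flags


# Made-up values for a single sample, for illustration only.
row = {
    "read_count": 10_000_000,
    "mapped_count": 9_200_000,
    "median_cv_coverage": 1.3,
    "eisenberg_score": 0,
}
mapped_fraction, flags = summarize_qc(row)
print(f"mapped {mapped_fraction:.1%}, flags: {flags}")
```

Keeping the thresholds as keyword arguments makes it easy to tune them per cohort rather than treating any single value as authoritative.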