Step 1: Quality Analysis - srkoppolu/SK_RNA-Seq GitHub Wiki
The first step in the pipeline is the quality control validation of the raw sequence data files.
FastQC is the tool most commonly used for checking the quality of raw sequence files. It accepts BAM, SAM or FASTQ files as input and provides a quick overview of read quality. Its summary graphs and tables can be used to identify areas that may be problematic.
FastQC analyses the sequence data to provide the following metrics:
- Basic statistics: filename, filetype, total sequences, sequence length and GC content
- Per base sequence quality: a box-and-whisker plot for each read position showing the median (red line), mean (blue line), 25th-75th percentile range (yellow box) and 10th-90th percentile range (black whiskers). FastQC should also recognize the quality encoding, and hence the platform used, automatically.
- Per sequence quality scores: A density plot of the mean sequence quality, expressed as a Phred score (Q = -10 × log10(P), where P is the probability of an incorrect base call).
- Per base sequence content: We should typically see an even distribution for the four bases across the sequence length. Parallel lines are best, while wobbly lines are of concern.
- Per base GC content: Distribution of GC content across the sequence length. Irregularities in GC content along the sequence length are of concern.
- Per sequence GC content: A density plot of the mean GC content across all sequences. Typically, a good overlay between the theoretical and observed distributions is expected. A sharp peak that overshoots an otherwise good overlay suggests some kind of contamination.
- Per base N content: Tells if there are any uncalled bases in the library.
- Sequence Length Distribution: A density plot for sequence lengths.
- Sequence Duplication Levels: Tells us how unique the sequences are within the library.
- Overrepresented sequences: Looks at individual sequences that are overrepresented within the library. This is another way of looking at duplication.
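The Phred relationship quoted in the list above can be made concrete with a few lines of Python. This is a minimal sketch and not part of FastQC itself: the function names are illustrative, the decoding assumes the Phred+33 ASCII offset used by Sanger and Illumina 1.8+ FASTQ files, and the per-read summary is a plain arithmetic mean of the per-base scores.

```python
import math


def phred_to_error_prob(q):
    """Return P, the probability of an incorrect base call, for Phred score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Return the Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Decode one FASTQ quality line into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual]


def mean_read_quality(qual, offset=33):
    """Arithmetic mean of the per-base Phred scores of a single read."""
    scores = decode_quality_string(qual, offset)
    return sum(scores) / len(scores)
```

Under this definition, Q20 corresponds to a 1% chance of an incorrect base call and Q30 to a 0.1% chance.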
Here is a good FastQC manual and here are some files describing the outputs of fastqc.
To run fastqc on every sequencing file from terminal/command line:
fastqc -t 8 -o qc/raw/ *.fq.gz
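The same call can be scripted when it needs to run as part of a larger pipeline. The sketch below assumes `fastqc` is on the PATH and uses only the thread-count and output-directory options from the command above; the helper names, the `raw_dir` parameter and the default glob pattern are illustrative. It creates the output directory first, because FastQC expects that directory to exist already.

```python
import subprocess
from pathlib import Path


def build_fastqc_command(fastq_files, outdir="qc/raw", threads=8):
    """Build the argument list for: fastqc -t <threads> -o <outdir> <files...>"""
    return ["fastqc", "-t", str(threads), "-o", str(outdir)] + [str(f) for f in fastq_files]


def run_fastqc(raw_dir=".", outdir="qc/raw", threads=8, pattern="*.fq.gz"):
    """Run FastQC on every file in raw_dir that matches pattern."""
    files = sorted(Path(raw_dir).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern} in {raw_dir}")
    # FastQC does not create the output directory itself.
    Path(outdir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if fastqc exits with an error.
    subprocess.run(build_fastqc_command(files, outdir, threads), check=True)
```

Calling `run_fastqc()` from the directory that holds the raw reads should then be equivalent to the one-line command.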
MultiQC is a relatively new tool that creates a single report with interactive plots for multiple bioinformatics analyses across many samples. It is a Python command-line tool available from the Python Package Index or through conda using Bioconda.
Note: The major advantage of MultiQC is that a single report, and even a single plot, can cover multiple analysis steps, multiple analysis tools and a large number of samples, making it ideal for fast, routine quality control.
Generally, run the following command:
# If in the directory containing the FastQC output files
multiqc [options] .
# If not,
multiqc [options] path/to/FastQC-output_directory
We usually get the following statistics as output (example):
- Sample Name
- % Dups
- % GC
- Length
- M Seqs (millions of sequences)
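MultiQC also writes these statistics to disk as tab-separated text (in the versions we are aware of, `multiqc_data/multiqc_general_stats.txt`), which makes them easy to screen programmatically. The sketch below is an illustration under stated assumptions: the exact column headers vary between MultiQC versions, so the column name is passed in as a parameter, and the example table contains made-up headers and values used only to exercise the code.

```python
import csv
import io


def flag_samples(tsv_text, column, threshold):
    """Return the names of samples whose value in `column` exceeds `threshold`.

    Assumes the first column holds the sample name, as in MultiQC's
    general statistics table.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    idx = header.index(column)
    flagged = []
    for row in reader:
        if row and float(row[idx]) > threshold:
            flagged.append(row[0])
    return flagged


# Made-up example table; real headers and values will differ.
EXAMPLE = (
    "Sample\tpercent_duplicates\tpercent_gc\n"
    "sample_A\t12.5\t48.0\n"
    "sample_B\t61.3\t51.0\n"
)

print(flag_samples(EXAMPLE, "percent_duplicates", 50.0))
```

For a real run, the text of the general statistics file would replace `EXAMPLE`, and the threshold would be chosen to suit the library type.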
According to its GitHub page, tools currently supported by MultiQC include (complete list and instructions found here):
Read QC & pre-processing | Aligners / quantifiers | Post-alignment processing | Post-alignment QC |
---|---|---|---|
Adapter Removal | BBMap | Bamtools | biobambam2 |
AfterQC | BISCUIT | Bcftools | BUSCO |
Bcl2fastq | Bismark | GATK | Conpair |
BBTools | Bowtie | HOMER | DamageProfiler |
BioBloom Tools | Bowtie 2 | HTSeq | DeDup |
ClipAndMerge | HiCUP | MACS2 | deepTools |
Cluster Flow | HiC-Pro | Picard | Disambiguate |
Cutadapt | HISAT2 | Prokka | goleft |
leeHom | Kallisto | RSEM | HiCExplorer |
InterOp | Long Ranger | Samblaster | methylQA |
FastQC | Salmon | Samtools | miRTrace |
FastQ Screen | Slamdunk | SnpEff | mosdepth |
Fastp | STAR | Subread featureCounts | Peddy |
FLASh | Tophat | Stacks | phantompeakqualtools |
Flexbar | | THetA2 | Preseq |
Jellyfish | | | QoRTs |
KAT | | | Qualimap |
MinIONQC | | | QUAST |
Skewer | | | RNA-SeQC |
SortMeRNA | | | RSeQC |
| | | | Sargasso |
| | | | Supernova |
| | | | VCFTools |
| | | | VerifyBAMID |
The following figure shows a summary of the Quality Analysis steps involved in Step-1(source):
Citation format for MultiQC:
Please consider citing MultiQC if you use it in your analysis.
MultiQC: Summarize analysis results for multiple tools and samples in a single report
Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller
Bioinformatics (2016)
doi: 10.1093/bioinformatics/btw354
PMID: 27312411
@article{doi:10.1093/bioinformatics/btw354,
author = {Ewels, Philip and Magnusson, Måns and Lundin, Sverker and Käller, Max},
title = {MultiQC: summarize analysis results for multiple tools and samples in a single report},
journal = {Bioinformatics},
volume = {32},
number = {19},
pages = {3047},
year = {2016},
doi = {10.1093/bioinformatics/btw354},
URL = {http://dx.doi.org/10.1093/bioinformatics/btw354},
eprint = {/oup/backfile/Content_public/Journal/bioinformatics/32/19/10.1093_bioinformatics_btw354/3/btw354.pdf}
}