Step 1: Quality Analysis - srkoppolu/SK

The first step in the pipeline is the quality control validation of the raw sequence data files.

FastQC is one of the most commonly used tools for checking the quality of raw sequence files. It accepts BAM, SAM or FASTQ files as input and provides a quick overview of read quality. Its summary graphs and tables can be used to spot potentially problematic areas.

FastQC analyses the sequence data to provide the following metrics:

Basic statistics: filename, filetype, total sequences, sequence length and GC content
Per base sequence quality: A box plot of quality at each read position, showing the median (red line), mean (blue line), 25th-75th percentile range (yellow box) and 10th-90th percentile range (black whiskers). FastQC also tries to detect the quality-score encoding, and hence the likely platform, automatically.
Per sequence quality scores: A density plot of the mean sequence quality, expressed as a Phred score (Q = -10log₁₀P, where P is the probability of an incorrect base call), so Q30 corresponds to a 1 in 1,000 chance of error.
Per base sequence content: We should typically see an even distribution of the four bases across the sequence length. Parallel lines are best, while wobbly lines are of concern.
Per base GC content: Distribution of GC content across the sequence length. Irregularities in GC content along the read are of concern.
Per sequence GC content: A density plot of the mean GC content across all sequences. Typically, a good overlay between the theoretical and observed distributions is expected. A sharp peak on an otherwise well-matched distribution suggests some kind of contamination.
Per base N content: Tells if there are any uncalled bases in the library.
Sequence Length Distribution: A density plot for sequence lengths.
Sequence Duplication Levels: Tells us how unique the sequences are within the library.
Overrepresented sequences: Looks at individual sequences that are overrepresented within the library. This is another way of looking at duplication.
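
As a rough illustration of two of the metrics above, the sketch below converts between Phred scores and error probabilities using Q = -10log₁₀P, and computes the mean quality and GC fraction of a single read. It assumes Phred+33 quality encoding (the common Sanger / Illumina 1.8+ convention). The function names and the example read are made up for illustration and are not part of FastQC.

```python
import math

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def phred_to_error_prob(q):
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Q = -10 * log10(P), where P is the probability of a wrong base call."""
    return -10 * math.log10(p)


def mean_read_quality(quality_string):
    """Arithmetic mean of the per-base Phred scores of one read."""
    scores = [ord(ch) - PHRED_OFFSET for ch in quality_string]
    return sum(scores) / len(scores)


def gc_fraction(sequence):
    """Fraction of bases in the read that are G or C."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical read, for illustration only
seq = "GATTACAGGC"
qual = "IIIIIIII##"  # 'I' encodes Q40, '#' encodes Q2

print(phred_to_error_prob(30))  # roughly 0.001, i.e. 1 error in 1,000 calls
print(mean_read_quality(qual))  # 32.4
print(gc_fraction(seq))         # 0.5
```

FastQC computes these summaries over every read in the file; the sketch only shows the arithmetic behind a single value.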

Here is a good FastQC manual and here are some files describing the outputs of fastqc.

To run fastqc on every sequencing file from terminal/command line:

fastqc -t 8 -o qc/raw/ *.fq.gz
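
When the same check has to be scripted, the command above can be assembled from Python. The following is a minimal sketch and not part of the original pipeline: the helper names `build_fastqc_command` and `run_fastqc` and the example file names are hypothetical, and only the `-t` and `-o` options already shown above are used. The runner creates the output directory first, since FastQC expects it to exist.

```python
import subprocess
from pathlib import Path


def build_fastqc_command(fastq_files, out_dir, threads=8):
    """Return the fastqc argument list for the given input files."""
    if not fastq_files:
        raise ValueError("no input files given")
    return ["fastqc", "-t", str(threads), "-o", str(out_dir), *map(str, fastq_files)]


def run_fastqc(fastq_files, out_dir, threads=8):
    """Create the output directory, then run FastQC (must be on PATH)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_fastqc_command(fastq_files, out_dir, threads), check=True)


# Hypothetical file names, for illustration only; nothing is executed here
cmd = build_fastqc_command(["sample1.fq.gz", "sample2.fq.gz"], "qc/raw")
print(" ".join(cmd))  # fastqc -t 8 -o qc/raw sample1.fq.gz sample2.fq.gz
```

Keeping the command construction separate from the call that launches FastQC makes it easy to inspect or log the exact command before running it.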

MultiQC is a relatively new tool that creates a single report with interactive plots for multiple bioinformatics analyses across many samples. It is a Python command line tool available through the Python Package Index (PyPI) or through conda using Bioconda.

Note: The major advantage of MultiQC is that a single report can summarize multiple analysis steps, multiple analysis tools and a large number of samples, often within a single plot, which makes it well suited to fast, routine quality control.

Generally, run the following command:

# If in the directory containing the FastQC output files
multiqc [options] .

# If not,
multiqc [options] path/to/FastQC-output_directory
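
Before calling MultiQC it can help to confirm that the directory really contains FastQC results. The sketch below relies on FastQC naming its result archives `*_fastqc.zip`, and uses MultiQC's `-o` option to set the report directory. The helper names and directory names are hypothetical, and the demonstration uses empty placeholder files rather than real FastQC output.

```python
import tempfile
from pathlib import Path


def find_fastqc_reports(directory):
    """List FastQC result archives (*_fastqc.zip) in a directory, sorted by name."""
    return sorted(p.name for p in Path(directory).glob("*_fastqc.zip"))


def build_multiqc_command(fastqc_dir, report_dir):
    """Return the multiqc argument list; -o sets the report output directory."""
    return ["multiqc", str(fastqc_dir), "-o", str(report_dir)]


# Demonstration with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample1_fastqc.zip", "sample2_fastqc.zip", "notes.txt"):
        (Path(tmp) / name).touch()
    reports = find_fastqc_reports(tmp)

print(reports)  # ['sample1_fastqc.zip', 'sample2_fastqc.zip']
print(build_multiqc_command("qc/raw", "qc/multiqc"))
```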

The general statistics table of the report usually contains the following columns for each sample (example):

Sample Name
% Dups
% GC
Length
M Seqs (millions of sequences)
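
To show how such a table might be screened automatically, here is a small sketch that flags samples with a high duplication percentage. It assumes a tab-separated table whose headers match the column names listed above; the headers MultiQC actually writes to its data files can differ, and the sample values and the 50% threshold are invented purely for illustration.

```python
import csv
import io

# Made-up example values; the headers mirror the column names listed above
EXAMPLE_TABLE = (
    "Sample Name\t% Dups\t% GC\tLength\tM Seqs\n"
    "sample1\t12.5\t48\t100\t25.3\n"
    "sample2\t61.0\t52\t100\t18.7\n"
)


def flag_high_duplication(table_text, max_dups=50.0):
    """Return names of samples whose '% Dups' value exceeds max_dups."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [row["Sample Name"] for row in reader if float(row["% Dups"]) > max_dups]


print(flag_high_duplication(EXAMPLE_TABLE))  # ['sample2']
```

A suitable threshold depends on the library type: high duplication is often expected in RNA-Seq data, so the cut-off here should be treated as a placeholder.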

According to its GitHub page, tools currently supported by MultiQC include (complete list and instructions found here):

Read QC & pre-processing	Aligners / quantifiers	Post-alignment processing	Post-alignment QC
Adapter Removal	BBMap	Bamtools	biobambam2
AfterQC	BISCUIT	Bcftools	BUSCO
Bcl2fastq	Bismark	GATK	Conpair
BBTools	Bowtie	HOMER	DamageProfiler
BioBloom Tools	Bowtie 2	HTSeq	DeDup
ClipAndMerge	HiCUP	MACS2	deepTools
Cluster Flow	HiC-Pro	Picard	Disambiguate
Cutadapt	HISAT2	Prokka	goleft
leeHom	Kallisto	RSEM	HiCExplorer
InterOp	Long Ranger	Samblaster	methylQA
FastQC	Salmon	Samtools	miRTrace
FastQ Screen	Slamdunk	SnpEff	mosdepth
Fastp	STAR	Subread featureCounts	Peddy
FLASh	Tophat	Stacks	phantompeakqualtools
Flexbar		THetA2	Preseq
Jellyfish			QoRTs
KAT			Qualimap
MinIONQC			QUAST
Skewer			RNA-SeQC
SortMeRNA			RSeQC
			Sargasso
			Supernova
			VCFTools
			VerifyBAMID

The following figure shows a summary of the quality analysis steps involved in Step 1 (source): [figure: steps in step-1]

Citation format for MultiQC:

Please consider citing MultiQC if you use it in your analysis.

MultiQC: Summarize analysis results for multiple tools and samples in a single report
Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller
Bioinformatics (2016)
doi: 10.1093/bioinformatics/btw354
PMID: 27312411

@article{doi:10.1093/bioinformatics/btw354,
author = {Ewels, Philip and Magnusson, Måns and Lundin, Sverker and Käller, Max},
title = {MultiQC: summarize analysis results for multiple tools and samples in a single report},
journal = {Bioinformatics},
volume = {32},
number = {19},
pages = {3047},
year = {2016},
doi = {10.1093/bioinformatics/btw354},
URL = {http://dx.doi.org/10.1093/bioinformatics/btw354},
eprint = {/oup/backfile/Content_public/Journal/bioinformatics/32/19/10.1093_bioinformatics_btw354/3/btw354.pdf}
}
