Raw Read Quality Control‐Metagenomics - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki
6.2.3 Raw Read Quality Control
Before any downstream analysis, it’s crucial to inspect and clean your raw FASTQ files to remove adapters, low-quality bases, and contaminants.
A. Installation
# Using Conda (recommended)
conda install -c conda-forge -c bioconda fastqc multiqc cutadapt
# Or system-wide on Ubuntu
sudo apt update
sudo apt install fastqc multiqc cutadapt
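After installing, it can help to confirm that all three tools are actually on your PATH before moving on. The snippet below is a minimal sketch; `missing_tools` is a hypothetical helper name introduced here for illustration, not part of any of the tools.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every listed tool that is not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools fastqc multiqc cutadapt)
if [[ -n "$missing" ]]; then
    # Word splitting is intentional so the names print on one line.
    echo "Not found on PATH:" $missing >&2
else
    # Each tool reports its own version string.
    fastqc --version
    multiqc --version
    cutadapt --version
fi
```

If a tool is reported as missing inside a Conda setup, the usual cause is that the environment holding it has not been activated.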
B. Demultiplexing
If you receive BCL files from the sequencer, convert them to per-sample FASTQ with Illumina’s bcl2fastq or dragen:
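# Example with bcl2fastq
# Lanes are kept separate, so output file names carry a lane field (_L001).
# Adding --no-lane-splitting would merge lanes and drop that field from the names.
bcl2fastq \
  --runfolder-dir /path/to/run_folder \
  --output-dir raw_reads/ \
  --sample-sheet SampleSheet.csv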
This produces a pair of FASTQ files (R1 and R2) for each sample. The S number follows the sample's position in the sample sheet:
raw_reads/
├─ SampleA_S1_L001_R1_001.fastq.gz
├─ SampleA_S1_L001_R2_001.fastq.gz
├─ SampleB_S2_L001_R1_001.fastq.gz
└─ SampleB_S2_L001_R2_001.fastq.gz
If your provider already delivered FASTQs, skip this step.
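Whether the FASTQs come from your own conversion or from a provider, a quick pairing check can catch missing or truncated files before trimming. The sketch below assumes the `_R1_001.fastq.gz` / `_R2_001.fastq.gz` naming shown above; `mate_of` and `count_reads` are hypothetical helper names introduced for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail
shopt -s nullglob   # an empty raw_reads/ gives zero iterations, not a literal pattern

# Hypothetical helper: derive the R2 path from an R1 path.
mate_of() {
    printf '%s\n' "${1%_R1_001.fastq.gz}_R2_001.fastq.gz"
}

# Hypothetical helper: number of reads in a gzipped FASTQ (4 lines per record).
count_reads() {
    echo $(( $(gzip -cd < "$1" | wc -l) / 4 ))
}

for r1 in raw_reads/*_R1_001.fastq.gz; do
    r2=$(mate_of "$r1")
    if [[ ! -f "$r2" ]]; then
        echo "Missing mate for $r1" >&2
        continue
    fi
    n1=$(count_reads "$r1")
    n2=$(count_reads "$r2")
    if [[ "$n1" != "$n2" ]]; then
        echo "Read count mismatch: $r1 ($n1) vs $r2 ($n2)" >&2
    else
        echo "$r1: $n1 read pairs"
    fi
done
```

Counting lines decompresses every file once, so on large runs this takes a while; it is still cheaper than discovering a truncated file halfway through trimming.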
C. Adapter & Quality Trimming
Use Cutadapt to remove residual Illumina adapters and trim low-quality bases:
mkdir -p trimmed/
# -j 8                 number of CPU cores
# -a / -A              3' adapter for R1 / R2
# -q 20,20             trim bases below Q20 from the 5' and 3' ends
# --minimum-length 50  discard pairs in which a read is < 50 bp after trimming
cutadapt \
  -j 8 \
  -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
  -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  -q 20,20 \
  --minimum-length 50 \
  -o trimmed/SampleA_R1.trimmed.fastq.gz \
  -p trimmed/SampleA_R2.trimmed.fastq.gz \
  raw_reads/SampleA_S1_L001_R1_001.fastq.gz \
  raw_reads/SampleA_S1_L001_R2_001.fastq.gz
- -q 20,20 trims bases with Phred < 20 from both the 5' and 3' ends.
- --minimum-length 50 ensures that very short reads are discarded.
Repeat for each sample (or wrap in a loop).
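One way to wrap the trimming step in a loop is sketched below. It assumes bcl2fastq-style file names (`<sample>_S<n>_L<lane>_R1_001.fastq.gz`) in raw_reads/; `sample_name` is a hypothetical helper introduced here, and the pattern would need adjusting if your provider uses a different naming scheme.

```shell
#!/usr/bin/env bash
set -euo pipefail
shopt -s nullglob   # an empty raw_reads/ gives zero iterations, not a literal pattern

# Hypothetical helper: strip the _S<n>_L<lane>_R1_001.fastq.gz suffix from an R1 file name.
sample_name() {
    local base
    base=$(basename "$1")
    printf '%s\n' "${base%_S[0-9]*_L[0-9][0-9][0-9]_R1_001.fastq.gz}"
}

mkdir -p trimmed/
for r1 in raw_reads/*_R1_001.fastq.gz; do
    r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"
    sample=$(sample_name "$r1")
    # Same options as the single-sample command above.
    cutadapt \
        -j 8 \
        -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
        -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
        -q 20,20 \
        --minimum-length 50 \
        -o "trimmed/${sample}_R1.trimmed.fastq.gz" \
        -p "trimmed/${sample}_R2.trimmed.fastq.gz" \
        "$r1" "$r2"
done
```

If a sample was sequenced across several lanes, each lane's files map to the same output name here and later lanes would overwrite earlier ones, so merge lanes first or include the lane in the output name.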
D. Quality Filtering & Reporting
1. FastQC generates an HTML report and a ZIP archive of QC data for each input file:
mkdir -p qc/fastqc
fastqc -t 4 -o qc/fastqc trimmed/*.fastq.gz
2. MultiQC aggregates all FastQC reports into a single dashboard:
cd qc/fastqc
multiqc .
# Outputs: multiqc_report.html
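3. Inspect the reports:
   - Per-base quality: look for drop-offs at read ends
   - Adapter content: should be near zero after trimming
   - Per-sequence GC content: for a mixed community, a broad or multi-peaked distribution is common, so a FastQC warning here is not necessarily a problem; sharp, narrow spikes may indicate contamination
   - Overrepresented sequences: none, or only expected spike-ins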
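For many samples, opening every HTML report is tedious. Each FastQC ZIP archive contains a tab-separated summary.txt (status, module, file name), so the modules that need attention can be listed from the command line. The sketch below assumes the archives sit in qc/fastqc/ and that `unzip` is installed; `flagged_modules` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
set -euo pipefail
shopt -s nullglob   # no archives means zero iterations

# Hypothetical helper: keep only WARN/FAIL rows from a FastQC summary.txt read on stdin.
flagged_modules() {
    awk -F'\t' '$1 == "WARN" || $1 == "FAIL" { print $1 "\t" $2 "\t" $3 }'
}

for zip in qc/fastqc/*_fastqc.zip; do
    # Stream summary.txt from inside the archive without extracting it to disk.
    unzip -p "$zip" '*/summary.txt' | flagged_modules
done
```

A WARN or FAIL is a prompt to look at the corresponding plot, not a verdict: FastQC's thresholds were designed around single-genome libraries, and metagenomic data often trips some of them for benign reasons.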
E. Example Directory Layout
Bioinformatics/
├─ raw_reads/
│  ├─ SampleA_S1_L001_R1_001.fastq.gz
│  └─ SampleA_S1_L001_R2_001.fastq.gz
├─ trimmed/
│  ├─ SampleA_R1.trimmed.fastq.gz
│  └─ SampleA_R2.trimmed.fastq.gz
└─ qc/
   └─ fastqc/
      ├─ SampleA_R1.trimmed_fastqc.html
      ├─ SampleA_R1.trimmed_fastqc.zip
      ├─ SampleA_R2.trimmed_fastqc.html
      ├─ SampleA_R2.trimmed_fastqc.zip
      └─ multiqc_report.html
Next: Proceed to 6.2.4 16S/ITS Amplicon Analysis or 6.2.5 Shotgun Taxonomic Profiling, depending on your data type.