Raw Read Quality Control‐Metagenomics - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

6.2.3 Raw Read Quality Control

Before any downstream analysis, it’s crucial to inspect and clean your raw FASTQ files to remove adapters, low-quality bases, and contaminants.

A. Installation

# Using Conda (recommended)
conda install -c conda-forge -c bioconda fastqc multiqc cutadapt

# Or system-wide on Ubuntu
sudo apt update
sudo apt install fastqc multiqc cutadapt
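
After installing, it can help to confirm that all three tools are on your PATH before you start. The snippet below is a minimal sketch; the helper name `missing_tools` is made up for illustration and is not part of any of these tools.

```shell
# Hypothetical helper: print the name of every command that is NOT on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report any of the three QC tools that are missing.
missing=$(missing_tools fastqc multiqc cutadapt)
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing
else
  echo "fastqc, multiqc and cutadapt are all available"
fi
```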

B. Demultiplexing

If you receive BCL files from the sequencer, convert them to per-sample FASTQ with Illumina’s bcl2fastq or dragen:

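# Example with bcl2fastq
bcl2fastq \
  --runfolder-dir /path/to/run_folder \
  --output-dir raw_reads/ \
  --sample-sheet SampleSheet.csv

This produces one pair of FASTQ files per sample and lane:

raw_reads/
├─ SampleA_S1_L001_R1_001.fastq.gz
├─ SampleA_S1_L001_R2_001.fastq.gz
├─ SampleB_S2_L001_R1_001.fastq.gz
└─ SampleB_S2_L001_R2_001.fastq.gz

The S number follows the sample's position in the sample sheet. If you add --no-lane-splitting, reads from all lanes are merged into one pair per sample and the lane field (L001) is dropped from the file names.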

If your provider already delivered FASTQs, skip this step.
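
Whether the FASTQs come from demultiplexing or from your provider, a quick sanity check is to confirm that R1 and R2 of each pair contain the same number of reads. The following is a sketch under the assumption that the files are gzip-compressed, four-line-per-record FASTQ; `count_reads` and `check_pair` are hypothetical helper names.

```shell
# Hypothetical helper: count reads in a gzipped FASTQ (4 lines per record).
# A non-integer result indicates a truncated or malformed file.
count_reads() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Hypothetical helper: succeed only if R1 and R2 hold the same number of reads.
check_pair() {
  r1_count=$(count_reads "$1")
  r2_count=$(count_reads "$2")
  if [ "$r1_count" = "$r2_count" ]; then
    echo "OK: $r1_count read pairs"
  else
    echo "MISMATCH: R1=$r1_count R2=$r2_count" >&2
    return 1
  fi
}

# Example call:
# check_pair raw_reads/SampleA_S1_L001_R1_001.fastq.gz \
#            raw_reads/SampleA_S1_L001_R2_001.fastq.gz
```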

C. Adapter & Quality Trimming

Use Cutadapt to remove residual Illumina adapters and trim low-quality bases:

mkdir -p trimmed/

# -j 8                 worker threads
# -a / -A              3' adapters for R1 / R2 (Illumina TruSeq)
# -q 20,20             quality-trim the 5' and 3' ends at a cutoff of Q20
# --minimum-length 50  discard read pairs in which either read is < 50 bp after trimming
cutadapt \
  -j 8 \
  -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
  -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  -q 20,20 \
  --minimum-length 50 \
  -o trimmed/SampleA_R1.trimmed.fastq.gz \
  -p trimmed/SampleA_R2.trimmed.fastq.gz \
  raw_reads/SampleA_S1_L001_R1_001.fastq.gz \
  raw_reads/SampleA_S1_L001_R2_001.fastq.gz

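-q 20,20 quality-trims both ends of each read with a cutoff of Q20 (the first value is the 5' cutoff, the second the 3' cutoff).
--minimum-length 50 discards reads shorter than 50 bp after trimming; for paired-end data, the whole pair is dropped if either read is too short.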
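
To put the Q20 cutoff in context: a Phred score Q corresponds to a base-call error probability of 10^(-Q/10), so Q20 means roughly 1 error in 100 base calls. A small sketch of that arithmetic, where `phred_to_error` is a hypothetical helper name:

```shell
# Convert a Phred quality score Q to the base-call error probability 10^(-Q/10).
phred_to_error() {
  awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_to_error 20   # prints 0.0100 (the cutoff used above)
phred_to_error 30   # prints 0.0010
```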

Repeat for each sample (or wrap in a loop).
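
One way to wrap the command in a loop is sketched below. It assumes bcl2fastq-style file names (<sample>_S<n>_L<lane>_R1_001.fastq.gz), one lane per sample, and sample names that do not themselves contain an _S<digits>_ field; `trim_all` is a hypothetical helper name. Adjust the patterns if your files are named differently.

```shell
# Sketch: run the Cutadapt command above on every R1/R2 pair in a directory.
# $1 = input directory, $2 = output directory.
trim_all() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for r1 in "$in_dir"/*_R1_001.fastq.gz; do
    [ -e "$r1" ] || continue   # skip the literal pattern when nothing matches
    # Derive the R2 path by swapping the read field in the file name.
    r2=$(printf '%s\n' "$r1" | sed 's/_R1_001\.fastq\.gz$/_R2_001.fastq.gz/')
    # Sample name = file name up to the first "_S<digits>_" field.
    sample=$(basename "$r1" | sed 's/_S[0-9][0-9]*_.*$//')
    cutadapt \
      -j 8 \
      -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
      -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
      -q 20,20 \
      --minimum-length 50 \
      -o "$out_dir/${sample}_R1.trimmed.fastq.gz" \
      -p "$out_dir/${sample}_R2.trimmed.fastq.gz" \
      "$r1" "$r2"
  done
}

trim_all raw_reads trimmed
```

If a sample was sequenced on several lanes, this sketch would overwrite its output once per lane, so merge the lanes first or include the lane in the output name.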

D. Quality Filtering & Reporting

1. FastQC generates an HTML report and a .zip archive of results for each FASTQ file:

mkdir -p qc/fastqc
fastqc -t 4 -o qc/fastqc trimmed/*.fastq.gz

2. MultiQC aggregates all FastQC reports into a single dashboard:

cd qc/fastqc
multiqc .
# Outputs: multiqc_report.html

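3. Inspect the reports:

Per-base sequence quality: look for drop-offs toward the read ends
Adapter content: should be near zero after trimming
Per-sequence GC content: a smooth curve without sharp spikes; in metagenomes a broad or multi-peaked distribution is normal because community members differ in GC content
Overrepresented sequences: ideally none, or only expected spike-ins (for example PhiX)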
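
Each FastQC .zip archive contains a summary.txt file with one tab-separated line per module (status, module name, file name), where the status is PASS, WARN, or FAIL. The sketch below lists every module that did not pass, which can be quicker than opening each HTML report. `flag_modules` is a hypothetical helper name, and the loop assumes the reports are in qc/fastqc and that unzip is installed.

```shell
# Hypothetical helper: read FastQC summary.txt lines on stdin and print every
# module whose status is not PASS, as: status, file name, module.
flag_modules() {
  awk -F '\t' '$1 != "PASS" { printf "%s\t%s\t%s\n", $1, $3, $2 }'
}

# Stream summary.txt out of every FastQC archive and keep the WARN/FAIL lines.
for zip in qc/fastqc/*_fastqc.zip; do
  [ -e "$zip" ] || continue   # skip the literal pattern when nothing matches
  unzip -p "$zip" '*/summary.txt'
done | flag_modules
```

A WARN or FAIL is a prompt to look at the corresponding plot, not proof of a problem; FastQC's thresholds were not designed with metagenomes in mind.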

E. Example Directory Layout

Bioinformatics/
├─ raw_reads/
│   ├─ SampleA_S1_L001_R1_001.fastq.gz
│   └─ SampleA_S1_L001_R2_001.fastq.gz
├─ trimmed/
│   ├─ SampleA_R1.trimmed.fastq.gz
│   └─ SampleA_R2.trimmed.fastq.gz
└─ qc/
    └─ fastqc/
        ├─ SampleA_R1.trimmed_fastqc.html
        ├─ SampleA_R1.trimmed_fastqc.zip
        ├─ SampleA_R2.trimmed_fastqc.html
        ├─ SampleA_R2.trimmed_fastqc.zip
        ├─ multiqc_data/
        └─ multiqc_report.html

Next: Proceed to 6.2.4 16S/ITS Amplicon Analysis or 6.2.5 Shotgun Taxonomic Profiling, depending on your data type.