Read quality control and read assembly - BGIGPD/BestPractices4Pathogenomics GitHub Wiki

Process Document: Quality Control - Read Quality Control and Read Assembly

Overview

This document outlines the steps for performing read quality control and assembly using the fastp tool and MEGAHIT assembler.

Objectives

  • To ensure the quality of raw sequencing reads.
  • To assemble reads into longer sequences.

Steps

1. Download Raw Data

Before starting, ensure that the raw sequencing data is downloaded:

wget -c http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/069/SRR17982369/SRR17982369_1.fastq.gz
wget -c http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/069/SRR17982369/SRR17982369_2.fastq.gz
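Once both files are downloaded, a quick sanity check can catch truncated transfers before quality control starts. The following is a minimal sketch, assuming gzip and awk are available and that the FASTQ records use the usual four-line layout. The helper names count_reads and check_pair are hypothetical and not part of any tool used in this document.

```shell
# Count reads in a gzip-compressed FASTQ file (4 lines per record).
# Assumes plain 4-line records with no wrapped sequence lines.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Test both archives, then compare the read counts of the two mate files.
# A mismatch suggests a truncated or corrupted download.
check_pair() {
    gzip -t "$1" && gzip -t "$2" || { echo "corrupt gzip"; return 1; }
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" = "$n2" ]; then
        echo "OK: $n1 read pairs"
    else
        echo "MISMATCH: $n1 vs $n2"
        return 1
    fi
}
```

For the files above, this could be called as check_pair SRR17982369_1.fastq.gz SRR17982369_2.fastq.gz.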

2. Assessing Read Quality

Understand the quality of your data by:

  • Checking the overall distribution of base quality values.
  • Analyzing the proportion of each base (A, C, G, T) at each read position.
  • Reviewing the distribution of GC content.
  • Checking for sequencing adapter sequences.
  • Evaluating the duplication rate of reads.
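To make the first few checks concrete, here is a rough sketch that computes the GC content and the fraction of Q20 and Q30 bases directly from a FASTQ stream. It assumes Phred+33 quality encoding and four-line records. The function name fastq_summary is hypothetical. It does not cover adapter detection or duplication, and fastp reports all of these figures anyway, so this is only meant to show what the numbers mean.

```shell
# Summarize a FASTQ stream read from standard input:
# read count, GC content, and Q20/Q30 base fractions.
# Assumes Phred+33 quality encoding and 4-line records.
fastq_summary() {
    awk '
    # Build a character-to-ASCII-code lookup table (awk has no ord function).
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 2 {                      # sequence line
        reads++
        bases += length($0)
        s = $0
        gc += gsub(/[GCgc]/, "", s)    # gsub returns the number of matches
    }
    NR % 4 == 0 {                      # quality line
        n = length($0)
        for (j = 1; j <= n; j++) {
            q = ord[substr($0, j, 1)] - 33
            if (q >= 20) q20++
            if (q >= 30) q30++
        }
    }
    END {
        if (bases == 0) { print "no bases found"; exit 1 }
        printf "reads=%d gc=%.1f%% q20=%.1f%% q30=%.1f%%\n", reads, 100 * gc / bases, 100 * q20 / bases, 100 * q30 / bases
    }'
}
```

The per-base loop is slow on full data sets, so a subsample is usually enough for a first look, for example: gzip -dc SRR17982369_1.fastq.gz | head -n 400000 | fastq_summary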

3. Perform Quality Control with fastp

Activate your conda environment and install fastp from the bioconda channel:

conda activate YourEnvName
conda install -c bioconda fastp

Run fastp for single-end or paired-end sequencing:

# For single-end sequencing
fastp -i input.fastq -o output.fastq -j report.json -h report.html

# For paired-end sequencing
fastp -i R1.fq -o R1.clean.fq -I R2.fq -O R2.clean.fq -j report.json -h report.html
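The file names in the commands above are placeholders. As a sketch of how the paired-end call might be applied to the files downloaded in step 1, the hypothetical helper below builds the command from a sample prefix and prints it instead of running it, so it can be inspected first. The output file names are an assumed naming scheme, not one required by fastp, and only the flags already shown above are used.

```shell
# Print (not run) a paired-end fastp command for a given sample prefix.
# Expects <prefix>_1.fastq.gz and <prefix>_2.fastq.gz, as downloaded in step 1.
# Output and report names are an assumed convention for illustration only.
fastp_pe_cmd() {
    prefix=$1
    echo "fastp -i ${prefix}_1.fastq.gz -o ${prefix}_1.clean.fastq.gz" \
         "-I ${prefix}_2.fastq.gz -O ${prefix}_2.clean.fastq.gz" \
         "-j ${prefix}.fastp.json -h ${prefix}.fastp.html"
}
```

Running fastp_pe_cmd SRR17982369 prints the command; piping the result to sh would execute it once fastp is installed. fastp reads and writes gzip-compressed files when the file names end in .gz, so the downloads do not need to be decompressed first.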

4. Understand Quality Control Results

Ensure that:

  • More than 95% of bases reach Q20 or higher; 90% is the lowest acceptable value.
  • More than 85% of bases reach Q30 or higher; 80% is the lowest acceptable value.
  • The mean base quality at each read position stays above 30, with minimal fluctuation along the read.
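The first two thresholds can be turned into a simple three-way check. The sketch below takes a rate as a fraction between 0 and 1, which is how fastp's JSON report records its Q20 and Q30 rates (the exact field layout may differ between versions, so read the values from your own report). The function names classify_rate, check_q20, and check_q30 are hypothetical.

```shell
# Classify a rate (fraction between 0 and 1) against a target and a minimum.
# Prints PASS if rate > target, WARN if rate >= minimum, FAIL otherwise.
classify_rate() {
    awk -v r="$1" -v target="$2" -v min="$3" 'BEGIN {
        if (r > target)    print "PASS"
        else if (r >= min) print "WARN"
        else               print "FAIL"
    }'
}

# Thresholds from the list above: Q20 target 95% (minimum 90%),
# Q30 target 85% (minimum 80%).
check_q20() { classify_rate "$1" 0.95 0.90; }
check_q30() { classify_rate "$1" 0.85 0.80; }
```

For example, check_q20 0.97 prints PASS and check_q30 0.82 prints WARN. The per-position quality criterion is easier to judge from the plots in the HTML report.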

5. Read Assembly with MEGAHIT

Install MEGAHIT and perform assembly:

conda install -c bioconda megahit

# For paired-end sequence assembly
megahit -1 pe_1.fq -2 pe_2.fq -o out

# For single-end sequence assembly
megahit -r single_end.fq -o out

# For interleaved paired-end sequences
megahit --12 interleaved.fq -o out
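MEGAHIT writes the assembled contigs to final.contigs.fa inside the output directory. A quick way to look at the result is to summarize the contig lengths. The sketch below is one possible approach using awk and sort; the function name fasta_stats is hypothetical, and N50 is computed here as the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size.

```shell
# Summarize a FASTA file: number of sequences, total length, longest, N50.
# Handles sequences wrapped over multiple lines.
fasta_stats() {
    awk '
    /^>/ { if (len > 0) print len; len = 0; next }   # header: emit previous length
    { len += length($0) }                            # sequence line
    END { if (len > 0) print len }
    ' "$1" | sort -rn | awk '
    { l[NR] = $1; total += $1 }
    END {
        if (NR == 0) { print "no sequences found"; exit 1 }
        half = total / 2
        for (i = 1; i <= NR; i++) {
            run += l[i]
            if (run >= half) { n50 = l[i]; break }
        }
        printf "contigs=%d total=%d longest=%d n50=%d\n", NR, total, l[1], n50
    }'
}
```

With the commands above, this would be called as fasta_stats out/final.contigs.fa.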

6. Set Kmer Parameters

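Set the k-mer sizes for assembly either with an explicit list (--k-list) or with a minimum, maximum, and step (--k-min, --k-max, --k-step). The input reads must still be given:

# Using --k-list
megahit -1 pe_1.fq -2 pe_2.fq --k-list 21,29,39,59,79,99,119,141 -o out

# Using --k-min, --k-max, and --k-step
megahit -1 pe_1.fq -2 pe_2.fq --k-min 21 --k-max 141 --k-step 12 -o out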
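To see which k-mer sizes a minimum, maximum, and step imply, the series can be listed with seq. The sketch below also checks an explicit list against the constraints that MEGAHIT's help text describes as far as recalled here (odd values, between 15 and 255, increments of at most 28); treat those limits as an assumption and confirm them with megahit --help. The function names k_series and check_k_list are hypothetical.

```shell
# List the k-mer sizes implied by a minimum, maximum, and step.
# Usage: k_series MIN MAX STEP
k_series() {
    seq "$1" "$3" "$2" | paste -sd, -
}

# Check a comma-separated k list against assumed MEGAHIT constraints:
# every k odd, within 15..255, and consecutive increments of at most 28.
check_k_list() {
    echo "$1" | tr ',' '\n' | awk '
    $1 % 2 == 0              { print "even k: " $1; bad = 1 }
    $1 < 15 || $1 > 255      { print "k out of range: " $1; bad = 1 }
    NR > 1 && $1 - prev > 28 { print "step too large before " $1; bad = 1 }
    { prev = $1 }
    END { if (!bad) print "ok"; exit bad + 0 }'
}
```

For the second command above, k_series 21 141 12 lists eleven sizes from 21 to 141, with the step landing exactly on the maximum. The explicit list in the first command passes check_k_list.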

Conclusion

After completing these steps, you should have quality-filtered reads and assembled contigs (written by MEGAHIT to final.contigs.fa in the output directory) ready for further analysis.
