Read Quality Control and Read Assembly - BGIGPD/BestPractices4Pathogenomics GitHub Wiki

Process Document: Quality Control - Read Quality Control and Read Assembly

Overview

This document outlines the steps for performing read quality control and assembly using the fastp tool and MEGAHIT assembler.

Objectives

To ensure the quality of raw sequencing reads.
To assemble reads into longer sequences.

Steps

1. Download Raw Data

Before starting, ensure that the raw sequencing data is downloaded:

wget -c http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/069/SRR17982369/SRR17982369_1.fastq.gz
wget -c http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/069/SRR17982369/SRR17982369_2.fastq.gz
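Because `wget -c` resumes interrupted transfers, it is worth confirming that both archives are complete before moving on. The following is a minimal sketch, not part of the original workflow: the function name `check_fastq_gz` is hypothetical, and it relies only on `gzip -t`, which tests a compressed file's integrity without extracting it.

```shell
# Verify that each downloaded FASTQ archive is a complete, readable gzip file.
# Returns non-zero if any file is missing, empty, or fails the gzip integrity test.
check_fastq_gz() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "MISSING  $f"
            status=1
        elif gzip -t "$f" 2>/dev/null; then
            echo "OK       $f"
        else
            echo "CORRUPT  $f"
            status=1
        fi
    done
    return "$status"
}

# Example: check_fastq_gz SRR17982369_1.fastq.gz SRR17982369_2.fastq.gz
```

If a file is reported as CORRUPT, re-running the matching `wget -c` command is usually enough to finish the download.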

2. Assessing Read Quality

Understand the quality of your data by:

Checking the overall distribution of base quality values.
Analyzing the proportion of each base (A, C, G, T) at each read position.
Reviewing the distribution of GC content.
Checking for sequencing adapter sequences.
Evaluating the duplication rate of reads.
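For a quick first look before running a full QC tool, a few of these figures can be computed directly from the FASTQ file. The sketch below is illustrative only: `fastq_stats` is a hypothetical helper, it assumes the standard four-line FASTQ record layout, and it covers only read count, mean read length, and GC content (not adapters or duplication).

```shell
# Print read count, mean read length and GC percentage for one FASTQ file.
# Assumes four lines per record (sequence lines are not wrapped).
fastq_stats() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk '
        NR % 4 == 2 {
            reads++
            bases += length($0)
            gc += gsub(/[GCgc]/, "")   # gsub returns the number of G/C characters replaced
        }
        END {
            if (reads == 0) { print "reads=0"; exit }
            printf "reads=%d mean_length=%.1f gc_percent=%.1f\n", reads, bases / reads, 100 * gc / bases
        }'
}

# Example: fastq_stats SRR17982369_1.fastq.gz
```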

3. Perform Quality Control with fastp

Activate the conda environment and install fastp:

conda activate YourEnvName
conda install -c bioconda fastp

Run fastp for single-end or paired-end sequencing:

# For single-end sequencing
fastp -i input.fastq -o output.fastq -j report.json -h report.html

# For paired-end sequencing
fastp -i R1.fq -o R1.clean.fq -I R2.fq -O R2.clean.fq -j report.json -h report.html
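After paired-end filtering, the two cleaned files should contain the same number of reads, since downstream assemblers expect mates to stay in sync. The sketch below is one way to check this; `count_reads` and `check_pairs` are hypothetical helper names, and the check assumes four-line FASTQ records and that no unpaired-read outputs were requested from fastp.

```shell
# Count reads in a FASTQ file (plain or gzip-compressed): line count divided by four.
count_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { print int(NR / 4) }'
}

# Succeed only if both mates of a paired-end run contain the same number of reads.
check_pairs() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "R1=$n1 R2=$n2"
    [ "$n1" -eq "$n2" ]
}

# Example: check_pairs R1.clean.fq R2.clean.fq
```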

4. Understand Quality Control Results

Ensure that:

More than 95% of bases reach Q20 or above; 90% is the minimum acceptable.
More than 85% of bases reach Q30 or above; 80% is the minimum acceptable.
The mean base quality at each read position stays above Q30, with minimal fluctuation along the read.
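The Q20 and Q30 thresholds above can be turned into a simple pass/warn/fail check. This is a sketch under stated assumptions: `qc_verdict` is a hypothetical helper, and it expects the two rates as fractions between 0 and 1. fastp's report.json is believed to expose them as `q20_rate` and `q30_rate` under `summary.after_filtering`; confirm the field names against your own report before scripting around them.

```shell
# Classify a run from its Q20 and Q30 rates, given as fractions between 0 and 1.
# PASS: above the preferred thresholds (Q20 > 95%, Q30 > 85%).
# WARN: at or above the minimums only (Q20 >= 90%, Q30 >= 80%).
# FAIL: below at least one minimum.
qc_verdict() {
    awk -v q20="$1" -v q30="$2" 'BEGIN {
        if (q20 > 0.95 && q30 > 0.85)        print "PASS"
        else if (q20 >= 0.90 && q30 >= 0.80) print "WARN"
        else                                 print "FAIL"
    }'
}

# Example: qc_verdict 0.97 0.91
```

The per-position quality criterion is not covered here; it is easiest to judge from the plots in report.html.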

5. Read Assembly with MEGAHIT

Install MEGAHIT and perform assembly:

conda install -c bioconda megahit

# For paired-end sequence assembly
megahit -1 pe_1.fq -2 pe_2.fq -o out

# For single-end sequence assembly
megahit -r single_end.fq -o out

# For interleaved paired-end sequences
megahit --12 interleaved.fq -o out
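MEGAHIT writes the assembled contigs to final.contigs.fa inside the output directory. A quick way to judge the result is to count the contigs and compute their total length and N50. The following is a minimal sketch using standard tools only; `contig_stats` is a hypothetical helper, and N50 is taken here as the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half of the total.

```shell
# Report contig count, total length and N50 for a FASTA file.
# Sequences may span several lines; lengths are summed per header.
contig_stats() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }
        { len += length($0) }
        END { if (len > 0) print len }
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "contigs=0"; exit }
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running * 2 >= total) { n50 = lens[i]; break }
            }
            printf "contigs=%d total_length=%d n50=%d\n", NR, total, n50
        }'
}

# Example: contig_stats out/final.contigs.fa
```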

6. Set Kmer Parameters

Use either --k-list or the combination of --k-min, --k-max and --k-step to set the kmer sizes for assembly:

# Using --k-list (input reads are still required; pe_1.fq and pe_2.fq as in step 5)
megahit -1 pe_1.fq -2 pe_2.fq --k-list 21,29,39,59,79,99,119,141 -o out

# Using combined parameters
megahit -1 pe_1.fq -2 pe_2.fq --k-min 21 --k-max 141 --k-step 12 -o out
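To see which kmer sizes a min/max/step combination produces, the range can be expanded by hand. The sketch below is illustrative: `k_list` is a hypothetical helper, and the checks reflect the constraints described in MEGAHIT's help text as understood here (odd kmer sizes between 15 and 255, an even step of at most 28); run `megahit --help` to confirm them for your version. It assumes k-max minus k-min is a multiple of the step.

```shell
# Expand k-min/k-max/k-step into the explicit comma-separated list of kmer sizes.
# Rejects even k, k outside 15-255, an odd step, or a step above 28.
k_list() {
    kmin=$1; kmax=$2; kstep=$3
    if [ $((kmin % 2)) -eq 0 ] || [ $((kmax % 2)) -eq 0 ]; then
        echo "k-min and k-max must be odd" >&2; return 1
    fi
    if [ "$kmin" -lt 15 ] || [ "$kmax" -gt 255 ] || [ "$kmin" -gt "$kmax" ]; then
        echo "k must lie in 15-255 with k-min <= k-max" >&2; return 1
    fi
    if [ "$kstep" -le 0 ] || [ "$kstep" -gt 28 ] || [ $((kstep % 2)) -ne 0 ]; then
        echo "k-step must be even and between 2 and 28" >&2; return 1
    fi
    list=$kmin
    k=$((kmin + kstep))
    while [ "$k" -le "$kmax" ]; do
        list="$list,$k"
        k=$((k + kstep))
    done
    echo "$list"
}

# Example: k_list 21 141 12
```

For the values used above, `k_list 21 141 12` prints 21,33,45,57,69,81,93,105,117,129,141, a denser set of kmer sizes than the explicit --k-list example.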

Conclusion

After completing these steps, you should have quality-filtered reads and assembled contigs (MEGAHIT writes them to final.contigs.fa in the output directory) ready for further analysis.

Thanks
OMICS FOR ALL - Genomic Technologies for the Benefit of Humanity