Process Document: Quality Control - Read Quality Control and Read Assembly
Overview
This document outlines how to quality-control raw sequencing reads with fastp and assemble the cleaned reads with the MEGAHIT assembler.
Objectives
- To ensure the quality of raw sequencing reads.
- To assemble reads into longer contiguous sequences (contigs).
Steps
1. Download Raw Data
Before starting, download the raw paired-end sequencing data (the -c option lets wget resume an interrupted download):
wget -c http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/069/SRR17982369/SRR17982369_1.fastq.gz
wget -c http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/069/SRR17982369/SRR17982369_2.fastq.gz
2. Assessing Read Quality
Understand the quality of your data by:
- Checking the overall distribution of base quality values.
- Analyzing the proportion of each base (A, C, G, T) at each read position.
- Reviewing the distribution of GC content.
- Checking for sequencing adapter sequences.
- Evaluating the duplication rate of reads.
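To make a few of these checks concrete, the sketch below computes read count, base count, GC percentage, and mean base quality from a FASTQ stream using only awk. It is an illustration rather than a replacement for fastp, which reports all of these and more. The function name fastq_quick_stats is hypothetical, and the code assumes Phred+33 quality encoding and unwrapped four-line FASTQ records.

```shell
# fastq_quick_stats (hypothetical helper): read uncompressed FASTQ on stdin and
# print read count, base count, GC percentage, and mean Phred quality.
# Assumptions: Phred+33 encoding, exactly four lines per record.
fastq_quick_stats() {
  awk '
    BEGIN {
      # awk has no ord(), so build a character -> ASCII code lookup table
      for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    NR % 4 == 2 {                      # sequence line
      reads++
      bases += length($0)
      s = $0
      gc += gsub(/[GCgc]/, "", s)      # gsub returns the number of G/C matches
    }
    NR % 4 == 0 {                      # quality line
      n = length($0)
      for (j = 1; j <= n; j++) qsum += ord[substr($0, j, 1)] - 33
    }
    END {
      if (bases == 0) { print "reads=0"; exit }
      printf "reads=%d bases=%d gc_pct=%.2f mean_q=%.2f\n",
             reads, bases, 100 * gc / bases, qsum / bases
    }'
}

# Example (first 100,000 reads of the downloaded file):
#   gzip -dc SRR17982369_1.fastq.gz | head -n 400000 | fastq_quick_stats
```

Sampling only the first reads keeps the check fast, but the start of a file is not always representative of the whole run, so treat the numbers as a rough preview.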
3. Perform Quality Control with fastp
Activate the conda environment and install fastp:
conda activate YourEnvName
conda install -c bioconda fastp
Run fastp for single-end or paired-end sequencing:
# For single-end sequencing
fastp -i input.fastq -o output.fastq -j report.json -h report.html
# For paired-end sequencing
fastp -i R1.fq -o R1.clean.fq -I R2.fq -O R2.clean.fq -j report.json -h report.html
4. Understand Quality Control Results
Ensure that:
- At least 95% of bases reach Q20 or better (90% is the minimum acceptable).
- At least 85% of bases reach Q30 or better (80% is the minimum acceptable).
- The mean base quality at each read position stays above 30, with minimal fluctuation along the read.
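The Q20 and Q30 thresholds above can be checked automatically against the JSON report that fastp writes with -j. The following is a minimal sketch, not part of fastp itself: the helper names classify_rate and report_qc are hypothetical, jq is assumed to be installed, and the rates are read from the summary.after_filtering section of the report, where fastp stores them as fractions between 0 and 1.

```shell
# classify_rate RATE TARGET MINIMUM (hypothetical helper)
# Prints "good" when RATE >= TARGET, "acceptable" when RATE >= MINIMUM,
# and "poor" otherwise. All three arguments are fractions between 0 and 1.
classify_rate() {
  awk -v r="$1" -v target="$2" -v lo="$3" 'BEGIN {
    if (r + 0 >= target + 0)   print "good"
    else if (r + 0 >= lo + 0)  print "acceptable"
    else                       print "poor"
  }'
}

# report_qc REPORT_JSON (hypothetical helper, requires jq)
# Reads the post-filtering Q20/Q30 rates from a fastp JSON report and
# classifies them with the thresholds used in this document.
report_qc() {
  local json="$1" q20 q30
  q20=$(jq -r '.summary.after_filtering.q20_rate' "$json")
  q30=$(jq -r '.summary.after_filtering.q30_rate' "$json")
  echo "Q20 ${q20}: $(classify_rate "$q20" 0.95 0.90)"
  echo "Q30 ${q30}: $(classify_rate "$q30" 0.85 0.80)"
}

# Example:
#   report_qc report.json
```

The per-position quality criterion is easier to judge visually in report.html, so this sketch leaves it out.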
5. Read Assembly with MEGAHIT
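Install MEGAHIT and run the command that matches your read layout. The directory given with -o must not already exist, so use a new name for each run: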
conda install -c bioconda megahit
# For paired-end sequence assembly
megahit -1 pe_1.fq -2 pe_2.fq -o out
# For single-end sequence assembly
megahit -r single_end.fq -o out
# For interleaved paired-end sequences
megahit --12 interleaved.fq -o out
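MEGAHIT writes the assembled contigs to final.contigs.fa inside the output directory. As a quick sanity check on the result, the sketch below summarizes a FASTA file with the number of contigs, total length, longest contig, and N50 (the contig length at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size). The function name contig_stats is hypothetical, and dedicated tools report far more detail than this.

```shell
# contig_stats FASTA (hypothetical helper)
# Prints contig count, total length, longest contig, and N50 for a FASTA file.
contig_stats() {
  # Step 1: emit one length per contig (sequences may span several lines)
  awk '/^>/ { if (len) print len; len = 0; next }
       { len += length($0) }
       END { if (len) print len }' "$1" |
  # Step 2: sort lengths from longest to shortest
  sort -rn |
  # Step 3: accumulate lengths until half of the total is reached
  awk '{ lens[NR] = $1; total += $1 }
       END {
         if (NR == 0) { print "contigs=0"; exit }
         half = total / 2
         for (i = 1; i <= NR; i++) {
           run += lens[i]
           if (run >= half) { n50 = lens[i]; break }
         }
         printf "contigs=%d total_bp=%d longest=%d N50=%d\n", NR, total, lens[1], n50
       }'
}

# Example:
#   contig_stats out/final.contigs.fa
```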
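6. Set K-mer Parameters
Set the k-mer sizes either with an explicit --k-list or with the combination of --k-min, --k-max, and --k-step. Every k must be odd and between 15 and 255, and consecutive values may differ by at most 28. Input reads are still required; paired-end files are shown here:
# Using --k-list (this is the default list in recent MEGAHIT releases)
megahit -1 pe_1.fq -2 pe_2.fq --k-list 21,29,39,59,79,99,119,141 -o out
# Using combined parameters (steps from 21 to 141 in increments of 12, a different set from the list above)
megahit -1 pe_1.fq -2 pe_2.fq --k-min 21 --k-max 141 --k-step 12 -o out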
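To see which k values a min/max/step combination implies, and to check a list before starting a long assembly, a small sketch follows. The helper names expand_k_list and validate_k_list are hypothetical. The expansion assumes a plain arithmetic progression, which is unambiguous for the values used here because the step divides the range evenly; the validation encodes the constraints from the MEGAHIT help text (odd values, range 15 to 255, increment of at most 28) and assumes the list is given in ascending order.

```shell
# expand_k_list K_MIN K_MAX K_STEP (hypothetical helper)
# Prints the comma-separated k values from K_MIN to K_MAX in steps of K_STEP.
expand_k_list() {
  seq "$1" "$3" "$2" | paste -sd, -
}

# validate_k_list LIST (hypothetical helper)
# Succeeds when every k is odd, lies within 15-255, and consecutive values
# differ by at most 28. Assumes LIST is comma-separated and ascending.
validate_k_list() {
  echo "$1" | tr ',' '\n' | awk '
    {
      k = $1 + 0
      if (k % 2 == 0 || k < 15 || k > 255) bad = 1
      if (NR > 1 && k - prev > 28)         bad = 1
      prev = k
    }
    END { exit bad }'
}

# Example:
#   expand_k_list 21 141 12
#   validate_k_list "$(expand_k_list 21 141 12)" && echo "k list OK"
```

For the parameters shown above, the expansion gives 21,33,45,57,69,81,93,105,117,129,141, which makes it clear that the two commands do not assemble with the same set of k values.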
Conclusion
After completing these steps, you should have high-quality assembled sequences ready for further analysis.
Thanks
OMICS FOR ALL - Genomic Technologies for the Benefit of Humanity