Malaria_upstream - BGIGPD/BestPractices4Pathogenomics GitHub Wiki

Introduction to WGS Analysis for Malaria Surveillance

Whole Genome Sequencing (WGS) is a powerful tool in modern genomics research, especially in the surveillance of infectious diseases. In malaria monitoring, the application of WGS allows us to gain insights into the genetic diversity of Plasmodium spp., the distribution of drug resistance genes, and the epidemiological characteristics of transmission chains. This is crucial for developing effective malaria control strategies and monitoring the spread of drug resistance.

Role of WGS in Malaria Surveillance

Genomic Diversity Analysis: By conducting high-throughput sequencing of the Plasmodium genome, we can study the genetic diversity of parasite populations. This helps in understanding the transmission and epidemiological characteristics between different geographical regions.
Drug Resistance Monitoring: WGS can detect drug resistance mutations in the Plasmodium genome. This allows researchers to track the emergence and spread of resistance genes, aiding in the formulation of appropriate treatment strategies.
Transmission Chain Tracking: Genomic data can be used to map the transmission routes of Plasmodium across different regions, revealing transmission chains and providing a basis for disease prevention and control.

Overview of Bioinformatics Analysis Workflow

The bioinformatics analysis of WGS data mainly includes the following steps:

Raw Data Quality Control: First, tools such as FastQC are used to assess the quality of sequencing data, filtering out low-quality reads to ensure the accuracy of subsequent analyses.
Reference Genome Alignment: High-quality reads are aligned to the reference genome of Plasmodium. Common alignment tools include BWA and Bowtie2.
Variant Calling: After alignment, tools such as GATK or FreeBayes are used for variant calling to identify single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels) in the genome.
Population Genetic Analysis: The variant information is used to construct phylogenetic trees, analyze drug resistance mutations, and evaluate population structure.
Result Visualization and Interpretation: The analysis results are visualized, such as mutation hotspots and phylogenetic trees, to better understand malaria transmission and resistance patterns.

Through this workflow, we can derive meaningful biological conclusions from raw sequencing data, providing important genomic support for malaria control efforts.

Upstream Bioinformatics Workflow for Plasmodium falciparum Analysis

WGS_pipeline

Below is a specific workflow for analyzing Plasmodium falciparum genome data, focusing on quality control, alignment, and variant calling (using chromosome 1 for instance)

Analysis Environment Configuration:

0.1. First, create a conda environment to manage the required bioinformatics tools.
```
conda create -n WGS_analysis -c bioconda seqkit fastqc fastp bwa samtools freebayes vcftools
conda activate WGS_analysis
```
This step ensures all necessary tools are installed and available in a isolated vitural environment (Control the version of softwares and avoid version confilits).

0.2. Make a new directory for today's workshop
```
mkdir -p ~/workshop_WGS_upstream
cd ~/workshop_WGS_upstream
```
0.3. Make symlinks of today's sequencing data
```
mkdir dataset
ln -s /home/renzirui/dataset/SRR629180_fastq/SRR629180_1.fastq.gz dataset/
ln -s /home/renzirui/dataset/SRR629180_fastq/SRR629180_2.fastq.gz dataset/
```
Select & Download Reference Genome: Select and Download the reference genome for Plasmodium falciparum.

1.1. Select: Learn how to find the reference genome for a specific species on NCBI

1.2. Download: Then download the reference genome using wget

Tips: Before download, usually we create a standlone directory to store the reference genome:
```
mkdir reference
cd reference
```
Then download the reference genome using the link we get on NCBI:
```
wget <LINK_OF_REFERENCE_GENOME>
```
Finally we can rename the file for easier using:
```
mv GCF_000002765.6_GCA_000002765_genomic.fna.gz Pf3D7_genomic.fna.gz
cd .. # go back to ~/workshop_WGS_upstream
```
Indexing the Reference Genome: Before alignment, the reference genome must be indexed using BWA.
```
bwa index reference/Pf3D7_genomic.fna.gz
```

（Not Necessary) FastQC Quality Assessment: Use FastQC to assess the quality of the raw sequencing reads.

mkdir -p FastQC_Reports
fastqc -o FastQC_Reports dataset/SRR629180_1.fastq.gz dataset/SRR629180_2.fastq.gz

Data Filtering and Trimming with fastp: Use fastp to filter and trim the paired-end reads based on quality.
```
mkdir -p cleanreads
fastp -w 4 -i dataset/SRR629180_1.fastq.gz -I dataset/SRR629180_2.fastq.gz -o cleanreads/SRR629180.filtered_r1.fq.gz -O cleanreads/SRR629180.filtered_r2.fq.gz -h fastp_report.html
```
This step removes low-quality reads and trims adapter sequences, ensuring high-quality data for alignment.
Alignment with BWA and Filtering Aligned Reads: Use BWA to align the quality-filtered paired-end reads to the Plasmodium falciparum reference genome and filter out unaligned reads using Samtools in a single step.

Tips:
- Here shows the mapping & SAM to BAM converting & mapped reads filtering & sorting in one step, its better than split into a few steps to avoid generating intermediate files or consuming storages.
- Don't forget to add @RG of sample ID into the header (Joint-calling is not allowed without unique ID in header)
```
bwa mem -t 4 -R '@RG\tID:SRR629180\tSM:SRR629180' reference/Pf3D7_genomic.fna.gz cleanreads/SRR629180.filtered_r1.fq.gz cleanreads/SRR629180.filtered_r2.fq.gz | samtools view -Sb -F 4 | samtools sort -o SRR629180_sorted.bam
```
File format: SAM format definition document: https://samtools.github.io/hts-specs/SAMv1.pdf

This command produces a sorted BAM file containing only the reads that successfully aligned to the reference genome, which is essential for efficient variant calling.
BAM File Statistics: Use Samtools to generate statistics for the BAM file, including general alignment metrics, coverage, and depth.
- Basic Alignment Statistics:
```
samtools stats SRR629180_sorted.bam > alignment_stats.txt
grep ^SN alignment_stats.txt | cut -f 2-  # Grep basic summary statistics info
```
  This command generates a report with general information about the alignment, such as total reads, mapped reads, and other metrics.
- Coverage Statistics:
```
samtools coverage SRR629180_sorted.bam > coverage_stats.txt # tabular format coverage per chromosome
```
  This step provides information on coverage depth across each chromosome, which is useful for assessing sequencing uniformity.
  
  samtools coverage also provides simple plot output:
  
  -m/ --histogram parameter show coverage histogram instead of tabular output
```
samtools coverage -m SRR629180_sorted.bam 
```
  -D/ --plot-depth parameter plot depth instead of tabular output
```
samtools coverage -D SRR629180_sorted.bam
```
  Depth Profile**:
- ```
samtools depth SRR629180_sorted.bam > depth.txt # depth per site
```
  This command generates a depth profile showing the sequencing depth at each position, which can be used to visualize and evaluate coverage across the genome.
- Using softwares to visualize mapping results: IGV, Geneious, etc
Variant Calling with FreeBayes: Use FreeBayes to perform variant calling on chromosome 1 (NC_004325.2) of Plasmodium falciparum.

Tips: freebayes receive uncompressed FASTA file as reference input, and both FASTA file and BAM file have to be indexed when calling a specific region on the genome. So we have to do this before the variant calling

# Uncompress FASTA
gzip -dc reference/Pf3D7_genomic.fna.gz > reference/Pf3D7_genomic.fna
# Index FASTA
samtools faidx reference/Pf3D7_genomic.fna
# Index BAM
samtools index SRR629180_sorted.bam

You can find files named reference/Pf3D7_genomic.fna.fai and SRR629180_sorted.bam.bai when succeeded

After these preparations, we can do the variant calling using freebayes

freebayes -f reference/Pf3D7_genomic.fna -r NC_004325.2 SRR629180_sorted.bam > SRR629180_chr1.vcf

This command generates a VCF file containing SNPs and indels for chromosome 1.

File format: VCF format definition document: https://samtools.github.io/hts-specs/VCFv4.2.pdf

Joint-Variant Calling with Freebayes

The main difference between multi-sample joint variant calling and single-sample variant calling lies in their scope and accuracy:
1. Scope of Variant Detection: Single-sample variant calling is performed on each sample independently, and the results reflect only the variants present in that specific sample. In contrast, multi-sample joint variant calling is done across multiple samples simultaneously, which helps identify shared variants and better detect low-frequency mutations, especially useful in population studies.
2. Accuracy of Variants: Joint variant calling takes advantage of the information from multiple samples to improve the accuracy of variant detection, reducing both false positives and false negatives. The statistical model in joint calling can utilize the evidence from different samples to increase confidence in calling variants, even those that are rare.
3. Population Analysis: Multi-sample joint variant calling is particularly beneficial for population genetics analysis, as it provides information about both common and unique variants across all samples, which is crucial for understanding population structure and evolutionary dynamics.
Here we use 2 bam file as an example to try to joint-calling the variants in 2 samples
```
# Create a symlink from a generated BAM file from another sample
ln -s /home/renzirui/workshop_WGS_upstream/ERR1787741_sorted.bam ./
# index BAM file
samtools index ERR1787741_sorted.bam
# Create BAM file list
ls ERR1787741_sorted.bam SRR629180_sorted.bam > joint.rglist
```
Then run the freebayes in joint mode, using -L to input the BAM file list

(Here use the head 200kb region for example, due to the joint calling will consume more computing resources)
```
freebayes -f reference/Pf3D7_genomic.fna -L joint.rglist -0 -r NC_004325.2:1-200000 > joint.head200kb.vcf
```

By following this workflow, we can obtain variant information for chromosome 1, which can be used for further population genetics studies and drug resistance monitoring.