VCF_Annotation - BGIGPD/BestPractices4Pathogenomics GitHub Wiki
Workshop: Annotating Malaria WGS VCF Files with snpEff
1. Introduction
In this workshop, we will learn how to use snpEff to annotate variant data (VCF files) from whole genome sequencing (WGS) of malaria samples. snpEff is a powerful and widely used tool that can annotate variants based on a reference genome, helping researchers better understand the functional impact of genomic changes.
2. Installing snpEff
Install snpEff into the WGS_analysis environment with conda:
conda install -n WGS_analysis -c bioconda -c conda-forge snpeff
Remember to activate the environment (conda activate WGS_analysis) before running the analysis.
3. Preparing the Malaria Reference Genome Database
To annotate malaria WGS data, we need the appropriate reference genome. snpEff supports many reference genomes, and genome information for Plasmodium falciparum (the malaria parasite) can be downloaded from the snpEff database. In this workshop we instead build a custom database from the NCBI reference files.
3.1 Configure the Database
Before using snpEff, you need to add the target genome to the configuration file.
The configuration file is located in ~/miniconda/envs/WGS_analysis/share/snpeff-5.2-1/
Edit the snpEff.config file using vim to add the malaria genome entry, for example:
Pf3D7.genome : Plasmodium_falciparum_3D7
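If you prefer not to edit the file by hand, the entry can also be appended from the command line. The following is a minimal sketch, not part of the original workshop: the helper name add_genome_entry is hypothetical, and the demo runs on a temporary file so that the real snpEff.config is left untouched.

```shell
# Sketch: register a genome in snpEff.config only if it is not already listed.
# add_genome_entry is a hypothetical helper written for this example.
add_genome_entry() {
    config="$1"   # path to snpEff.config
    genome="$2"   # database name, e.g. Pf3D7
    label="$3"    # free-text description of the genome
    # Look for an existing "<genome>.genome" key at the start of a line
    if grep -q "^${genome}\.genome[[:space:]]*:" "$config"; then
        echo "Entry for ${genome} already present in ${config}"
    else
        printf '%s.genome : %s\n' "$genome" "$label" >> "$config"
        echo "Added ${genome} to ${config}"
    fi
}

# Demo on a throwaway file (assumption: stands in for the real snpEff.config)
CONFIG="$(mktemp)"
add_genome_entry "$CONFIG" Pf3D7 Plasmodium_falciparum_3D7
add_genome_entry "$CONFIG" Pf3D7 Plasmodium_falciparum_3D7   # second call changes nothing
cat "$CONFIG"
```

To use it for real, pass the path to the snpEff.config in your conda environment as the first argument.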
Then create the directory that will hold the new custom database.
- Tip: If the data directory does not exist, create it first.
# Create directory for this new genome
cd ~/miniconda/envs/WGS_analysis/share/snpeff-5.2-1/data
mkdir Pf3D7
cd Pf3D7
3.2 Download the Database Files
Building a database from GTF files
GTF 2.2 files are supported by SnpEff (e.g. ENSEMBL releases genome annotations in this format).
- Get the genome and uncompress it:
# Get the genome reference sequence file
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/765/GCF_000002765.6_GCA_000002765/GCF_000002765.6_GCA_000002765_genomic.fna.gz
# Uncompress and rename it
gzip -dc GCF_000002765.6_GCA_000002765_genomic.fna.gz > sequences.fa
- Get the annotation file (GTF file) and uncompress it:
# Get the annotation file
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/765/GCF_000002765.6_GCA_000002765/GCF_000002765.6_GCA_000002765_genomic.gtf.gz
# Uncompress and rename it
gzip -dc GCF_000002765.6_GCA_000002765_genomic.gtf.gz > genes.gtf
- If the download is too slow, you can use the copies that have already been downloaded to:
/home/renzirui/database/Pf3D7
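snpEff matches annotations to sequences by name, so the sequence names in genes.gtf need to appear in sequences.fa. A quick sanity check before building could look like the sketch below. The helper check_seq_names is hypothetical, and the demo uses toy files with made-up sequence names (seqA, seqB) rather than the real Pf3D7 files.

```shell
# Sketch: confirm every sequence named in the GTF also exists in the FASTA.
# check_seq_names is a hypothetical helper written for this example.
check_seq_names() {
    fasta="$1"
    gtf="$2"
    tmpdir="$(mktemp -d)"
    # FASTA IDs: header text after '>' up to the first whitespace
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$tmpdir/fa.ids"
    # GTF IDs: first column of non-comment lines
    grep -v '^#' "$gtf" | cut -f1 | sort -u > "$tmpdir/gtf.ids"
    # Keep only names that occur in the GTF list but not in the FASTA list
    missing="$(comm -13 "$tmpdir/fa.ids" "$tmpdir/gtf.ids")"
    rm -rf "$tmpdir"
    if [ -n "$missing" ]; then
        echo "Sequence names in GTF but not in FASTA:"
        echo "$missing"
        return 1
    fi
    echo "All GTF sequence names are present in the FASTA"
}

# Demo with toy files (assumption: names and contents are invented)
DEMO="$(mktemp -d)"
printf '>seqA toy chromosome\nACGT\n>seqB\nTTGA\n' > "$DEMO/sequences.fa"
printf '#!toy header\nseqA\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n' > "$DEMO/genes.gtf"
check_seq_names "$DEMO/sequences.fa" "$DEMO/genes.gtf"
```

In the Pf3D7 directory, the equivalent call would be check_seq_names sequences.fa genes.gtf.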
3.3 Build the database
Once the preparation above is finished, build the database. To keep the build simple, we skip the CDS and protein checks with the -noCheckCds and -noCheckProtein options:
snpEff build -noCheckCds -noCheckProtein Pf3D7
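A successful build writes a snpEffectPredictor.bin file into the genome's data directory. The sketch below checks for that file. The helper db_is_built is hypothetical, and the demo uses a temporary directory with a placeholder file in place of a real database.

```shell
# Sketch: report whether a snpEff database directory contains the built predictor file.
# db_is_built is a hypothetical helper written for this example.
db_is_built() {
    db_dir="$1"   # e.g. the data/Pf3D7 directory created earlier
    # -s is true only if the file exists and is not empty
    if [ -s "$db_dir/snpEffectPredictor.bin" ]; then
        echo "Database found: $db_dir/snpEffectPredictor.bin"
    else
        echo "No snpEffectPredictor.bin in $db_dir; the build may have failed" >&2
        return 1
    fi
}

# Demo (assumption: a temporary directory stands in for data/Pf3D7)
DB_DEMO="$(mktemp -d)"
db_is_built "$DB_DEMO" || echo "Not built yet"
printf 'placeholder' > "$DB_DEMO/snpEffectPredictor.bin"
db_is_built "$DB_DEMO"
```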
4. Annotating VCF Files with snpEff
Assuming you already have a VCF file containing malaria WGS variants, here are the steps to annotate these variants using snpEff.
4.1 Running snpEff
You can use the following command to annotate the VCF file generated yesterday:
snpEff eff Pf3D7 SRR629180_chr1.vcf > SRR629180_chr1.annotated.vcf
In the above command, Pf3D7 is the name of the reference genome database we configured earlier.
4.2 Interpreting the Output
snpEff will provide detailed annotation information for each variant, including:
- Gene name
- Type of variant effect (e.g., missense, synonymous, etc.)
- Potential impact on gene function
snpEff writes these annotations into the ANN field of each record's INFO column. You can review the SRR629180_chr1.annotated.vcf file to understand the functional impact of the variants.
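Reading ANN strings by eye is tedious, so it can help to pull out a few fields per variant. The sketch below assumes the standard ANN layout, in which entries are separated by commas and fields by "|", with the effect, impact and gene name in fields 2, 3 and 4. The helper summarize_ann is hypothetical, it reports only the first annotation of each record, and the demo record is invented (GENE1 is not a real gene).

```shell
# Sketch: print CHROM, POS, REF, ALT plus effect, impact and gene name
# taken from the first ANN entry of each VCF record.
# summarize_ann is a hypothetical helper written for this example.
summarize_ann() {
    awk -F'\t' '
        /^#/ { next }                          # skip header lines
        {
            n = split($8, info, ";")           # INFO column, key=value pairs
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^ANN=/) {
                    sub(/^ANN=/, "", info[i])
                    split(info[i], anns, ",")  # one entry per annotation
                    split(anns[1], f, "|")     # fields within the first entry
                    # f[2] = effect, f[3] = impact, f[4] = gene name
                    print $1, $2, $4, $5, f[2], f[3], f[4]
                }
            }
        }
    ' OFS='\t' "$1"
}

# Demo with a single invented record (assumption: toy data, not real output)
DEMO_VCF="$(mktemp)"
printf '##fileformat=VCFv4.2\n' > "$DEMO_VCF"
printf 'chr1\t100\t.\tA\tG\t50\tPASS\tDP=10;ANN=G|missense_variant|MODERATE|GENE1|GENE1|transcript|T1|protein_coding|1/1|c.10A>G|p.Lys4Glu|10/100|10/90|4/30||\tGT\t1/1\n' >> "$DEMO_VCF"
summarize_ann "$DEMO_VCF"
```

On the workshop data, the call would be summarize_ann SRR629180_chr1.annotated.vcf, optionally piped through sort or grep to focus on a particular impact level such as HIGH or MODERATE.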