VCF_Annotation - BGIGPD/BestPractices4Pathogenomics GitHub Wiki

Workshop: Annotating Malaria WGS VCF Files with snpEff

1. Introduction

In this workshop, we will learn how to use snpEff to annotate variant data (VCF files) from whole genome sequencing (WGS) of malaria samples. snpEff is a powerful and widely used tool that can annotate variants based on a reference genome, helping researchers better understand the functional impact of genomic changes.

2. Installing snpEff

snpEff can be installed into the WGS_analysis environment with conda:

conda install -n WGS_analysis -c bioconda -c conda-forge snpeff

Remember to activate the environment (conda activate WGS_analysis) before running the analysis.

3. Preparing the Malaria Reference Genome Database

To annotate malaria WGS data, we need a database for the appropriate reference genome. snpEff provides pre-built databases for many reference genomes, including Plasmodium falciparum (the malaria parasite); in this workshop we build a custom database for the 3D7 reference from NCBI files.

3.1 Configure the Database

Before using snpEff, you need to add the target genome to the configuration file.

The configuration file, snpEff.config, is located in ~/miniconda/envs/WGS_analysis/share/snpeff-5.2-1/ (the exact directory depends on where conda is installed and on the snpEff version).
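Because the directory name includes the snpEff version, the exact path can differ between installations. The following is a minimal sketch for locating the file, assuming the conda environment is activated so that CONDA_PREFIX points at it. find_snpeff_config is a hypothetical helper name, not part of snpEff.

```shell
# Hypothetical helper: print the path of snpEff.config inside a conda environment.
# $1 is the environment prefix; it defaults to CONDA_PREFIX, which conda sets on activation.
find_snpeff_config() {
    prefix="${1:-${CONDA_PREFIX:-}}"
    # Search share/<package directory>/ for the config file and print the first match
    find "$prefix/share" -maxdepth 2 -name snpEff.config 2>/dev/null | head -n 1
}

# Example (inside the activated environment):
#   find_snpeff_config
# Example (explicit prefix, path taken from this workshop):
#   find_snpeff_config ~/miniconda/envs/WGS_analysis
```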

Edit the snpEff.config file using vim to add the malaria genome entry, for example:

Pf3D7.genome : Plasmodium_falciparum_3D7
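As an alternative to editing in vim, the entry can be appended from the shell. This is only a sketch: add_genome_entry is a hypothetical helper name, and the example path is the one used in this workshop, so adjust it to your installation.

```shell
# Hypothetical helper: append "<genome>.genome : <label>" to snpEff.config,
# but only if an entry for that genome is not already present.
add_genome_entry() {
    config="$1"   # path to snpEff.config
    genome="$2"   # database name, e.g. Pf3D7
    label="$3"    # free-text description, e.g. Plasmodium_falciparum_3D7
    if grep -q "^${genome}\.genome[[:space:]]*:" "$config"; then
        echo "Entry for ${genome} already present in ${config}"
    else
        printf '%s.genome : %s\n' "$genome" "$label" >> "$config"
        echo "Added ${genome} to ${config}"
    fi
}

# Example call:
#   add_genome_entry ~/miniconda/envs/WGS_analysis/share/snpeff-5.2-1/snpEff.config \
#       Pf3D7 Plasmodium_falciparum_3D7
```

Checking for an existing entry first keeps the config free of duplicate lines if the step is repeated.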

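Then create the directory that will hold the new custom database:

  • Tip: If the data directory does not exist yet, mkdir -p creates it along with Pf3D7.
# Create the directory for this new genome (-p also creates data/ if it is missing)
mkdir -p ~/miniconda/envs/WGS_analysis/share/snpeff-5.2-1/data/Pf3D7
cd ~/miniconda/envs/WGS_analysis/share/snpeff-5.2-1/data/Pf3D7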

3.2 Download the Database Files

Building a database from GTF files

GTF 2.2 files are supported by SnpEff (e.g. ENSEMBL releases genome annotations in this format).

  1. Get the genome and uncompress it:

    # Get the genome reference sequence file
    wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/765/GCF_000002765.6_GCA_000002765/GCF_000002765.6_GCA_000002765_genomic.fna.gz
    # Uncompress and rename it 
    gzip -dc GCF_000002765.6_GCA_000002765_genomic.fna.gz > sequences.fa
    
  2. Get the annotation file (GTF file) and uncompress it:

    # Get annotation files
    wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/765/GCF_000002765.6_GCA_000002765/GCF_000002765.6_GCA_000002765_genomic.gtf.gz
    # Uncompress and rename it 
    gzip -dc GCF_000002765.6_GCA_000002765_genomic.gtf.gz > genes.gtf
    
    • If the download is too slow, you can use the copies already downloaded to /home/renzirui/database/Pf3D7
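snpEff matches annotations to sequences by name, so the sequence names in genes.gtf (column 1) need to match the FASTA header names in sequences.fa. The following is a small sketch of a check that can be run before building. missing_in_fasta is a hypothetical helper name, not a snpEff command.

```shell
# Hypothetical helper: list sequence names used in the GTF that have no matching FASTA record.
# Prints nothing when every GTF sequence name is present in the FASTA file.
missing_in_fasta() {
    fasta="$1"
    gtf="$2"
    fa_names=$(mktemp)
    gtf_names=$(mktemp)
    # FASTA: first word of each header line, without the leading '>'
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$fasta" | sort -u > "$fa_names"
    # GTF: column 1 of every non-comment, non-empty line
    awk -F'\t' '!/^#/ && NF { print $1 }' "$gtf" | sort -u > "$gtf_names"
    # comm -13 keeps lines that appear only in the second (GTF) list
    comm -13 "$fa_names" "$gtf_names"
    rm -f "$fa_names" "$gtf_names"
}

# Example (run inside the Pf3D7 data directory):
#   missing_in_fasta sequences.fa genes.gtf
```

Any name printed by the example call points to a naming mismatch that should be resolved before the build step.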

3.3 Build the database

Once the files are in place, build the database. The -noCheckCds and -noCheckProtein options skip the CDS and protein sequence checks, which would otherwise require additional reference files (cds.fa and protein.fa):

snpEff build -noCheckCds -noCheckProtein Pf3D7
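A successful build writes a binary database file, snpEffectPredictor.bin, into the genome's data directory. The following sketch checks for it. check_snpeff_db is a hypothetical helper name, and the example path assumes the installation directory used in this workshop.

```shell
# Hypothetical helper: report whether a genome's data directory contains a built database.
# $1 is the genome's data directory, e.g. .../snpeff-5.2-1/data/Pf3D7
check_snpeff_db() {
    db_file="$1/snpEffectPredictor.bin"
    if [ -s "$db_file" ]; then
        echo "Database found: $db_file"
    else
        echo "No database at $db_file; check the build output for errors" >&2
        return 1
    fi
}

# Example call:
#   check_snpeff_db ~/miniconda/envs/WGS_analysis/share/snpeff-5.2-1/data/Pf3D7
```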

4. Annotating VCF Files with snpEff

Assuming you already have a VCF file containing malaria WGS variants, here are the steps to annotate these variants using snpEff.

4.1 Running snpEff

You can use the following command to annotate the VCF file generated yesterday:

snpEff eff Pf3D7 SRR629180_chr1.vcf > SRR629180_chr1.annotated.vcf

In the above command, Pf3D7 is the name of the reference genome database we configured earlier.

4.2 Interpreting the Output

snpEff will provide detailed annotation information for each variant, including:

  • Gene name
  • Type of variant effect (e.g., missense, synonymous, etc.)
  • Potential impact on gene function

The annotations are written to the ANN field in the INFO column of each record. Review the annotated file (SRR629180_chr1.annotated.vcf) to understand the functional impact of the variants.
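To get a quick overview of the effect types and impact levels, the annotations can be tabulated from the shell. This is a rough sketch that assumes the standard ANN layout, in which each annotation is a list of |-separated subfields starting with Allele, Annotation (effect type), Annotation_Impact and Gene_Name, and in which the first annotation listed for a variant is the most severe one. count_effects is a hypothetical helper name.

```shell
# Hypothetical helper: count variants by effect type and impact,
# using only the first annotation reported for each variant.
count_effects() {
    vcf="$1"
    grep -v '^#' "$vcf" \
      | awk -F'\t' '{
            n = split($8, info, ";")              # INFO column: key=value pairs
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^ANN=/) {
                    sub(/^ANN=/, "", info[i])
                    split(info[i], anns, ",")     # one entry per annotation
                    split(anns[1], f, "[|]")      # subfields of the first annotation
                    print f[2] "\t" f[3]          # effect type, impact
                }
            }
        }' \
      | sort | uniq -c | sort -rn
}

# Example call:
#   count_effects SRR629180_chr1.annotated.vcf
```

Each output line shows a count followed by the effect type and its impact level, with the most frequent combinations first. Records without an ANN field are skipped.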