Annotation and Enrichment - igheyas/Bioinformatics GitHub Wiki

Annotation & Enrichment

After you’ve filtered your variants (see Variant Filtering Strategies), the next step is to predict their functional effects and enrich your VCF with annotations from external databases:

Effect Prediction

SnpEff (local or custom DB)

VEP (Ensembl Variant Effect Predictor)

Adding Population Frequencies

Annotate with gnomAD (or custom .tsv)

Adding Clinical Data

Annotate with ClinVar (or custom .vcf)

Prepare a Toy VCF

First, let’s make a tiny VCF (toy.vcf.gz) so we can demo everything without huge downloads:

# 1) Create a minimal VCF
cat > toy.vcf << 'EOF'
##fileformat=VCFv4.2
##contig=<ID=NC_000913.3,length=4641652>
#CHROM  POS     ID      REF ALT QUAL FILTER INFO
NC_000913.3 1000   .       A   G   50   PASS   .
NC_000913.3 2000   .       T   C   60   PASS   .
EOF

# 2) Compress with bgzip (indexing happens after the whitespace fix below)
bgzip toy.vcf

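Decompress, fix, recompress, index

VCF columns must be tab-separated, but the heredoc above separates them with spaces, so tabix and downstream tools will not parse toy.vcf.gz correctly as written. Decompress to plain VCF: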

gunzip -c toy.vcf.gz > toy.vcf

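Convert runs of spaces → tabs on the column-header and data lines, leaving the ## meta lines alone (GNU sed; \t in the replacement is not portable to BSD sed)

sed -E '/^##/! s/ +/\t/g' toy.vcf > toy.tab.vcf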
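One way to confirm that a conversion like this worked is to count tab-separated fields on the column-header and data lines: a VCF without sample columns has exactly eight. The sketch below is only an illustration. `check_vcf_tabs` is a hypothetical helper, not part of any tool, and the `demo.*.vcf` files are small stand-ins with the same shape as toy.vcf so the block runs on its own.

```shell
# Hypothetical helper: list non-meta lines that do not have exactly
# 8 tab-separated fields (CHROM..INFO, no sample columns).
# Returns non-zero if any such line is found.
check_vcf_tabs() {
  awk -F'\t' '!/^##/ && NF != 8 { bad++; printf "line %d: %d field(s)\n", NR, NF }
              END { exit (bad > 0) }' "$1"
}

# Space-delimited stand-in (same shape as toy.vcf)
printf '%s\n' '##fileformat=VCFv4.2' \
  '#CHROM  POS  ID  REF ALT QUAL FILTER INFO' \
  'NC_000913.3 1000 . A G 50 PASS .' > demo.spaces.vcf

# Tab-delimited version of the same lines; $1 = $1 makes awk rebuild
# the line with OFS between fields
awk 'BEGIN { OFS = "\t" } /^##/ { print; next } { $1 = $1; print }' \
  demo.spaces.vcf > demo.tabs.vcf

check_vcf_tabs demo.spaces.vcf || echo "demo.spaces.vcf: not tab-delimited"
check_vcf_tabs demo.tabs.vcf   && echo "demo.tabs.vcf: OK"
```

The check assumes no sample columns; for a multi-sample VCF the expected field count would be 9 plus the number of samples.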

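A quick fix: keep the ## meta lines as they are and force tab-separation everywhere else

# 1) Decompress the original VCF, rewrite the #CHROM line and the records with tabs, recompress.
#    ($1 = $1 makes awk rebuild each line with OFS; safe here because no field contains spaces.
#    Two greps sharing one pipe would not work: the first grep consumes the whole stream.)
gunzip -c toy.vcf.gz \
  | awk 'BEGIN { OFS = "\t" } /^##/ { print; next } { $1 = $1; print }' \
  | bgzip -c > toy.fixed.vcf.gz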

# 2) Now index with Tabix
tabix -p vcf toy.fixed.vcf.gz

# You should now see:
#   toy.fixed.vcf.gz
#   toy.fixed.vcf.gz.tbi

Tabix will happily index a VCF without printing anything to the screen on success. To verify that it actually created the .tbi file, just list the files:

ls -lh toy.fixed.vcf.gz toy.fixed.vcf.gz.tbi

Effect Prediction with SnpEff

A. Install SnpEff

# via Conda (isolated environment)
conda create -n snpeff -c conda-forge -c bioconda snpeff
conda activate snpeff

# or system-wide (Ubuntu)
sudo apt update
sudo apt install snpeff

B. Build a “toy” database locally

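# 1) Make a data dir
mkdir -p snpEff_data/toy

# 2) Create a one-feature GFF3 where SnpEff looks for it (<dataDir>/<genome>/genes.gff);
#    printf guarantees real tabs between the nine GFF3 columns
{
  printf '##gff-version 3\n'
  printf '##sequence-region NC_000913.3 1 1000\n'
  printf 'NC_000913.3\t.\tgene\t1\t1000\t.\t+\t.\tID=gene1;Name=DemoGene\n'
} > snpEff_data/toy/genes.gff

# 3) Copy your genome FASTA into that folder
cp ref_genome.fa snpEff_data/toy/sequences.fa

# 4) Register the genome in a config file (a local one here; the display name is arbitrary)
echo "toy.genome : Toy" > snpEff.config

# 5) Build; the command name comes first, then its options (no `cd` needed).
#    Recent SnpEff releases may also need -noCheckCds -noCheckProtein here,
#    because no CDS/protein FASTA is supplied.
snpEff build -c snpEff.config -dataDir "$(pwd)/snpEff_data" -gff3 -v toy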

At the end you should see snpEff_data/toy/snpEffectPredictor.bin, which is the built database.
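SnpEff matches records by sequence name, so the FASTA header, the GFF3 seqid column, and the VCF CHROM column all have to spell the contig the same way. A minimal sketch of that check follows; `demo.fa`, `demo.gff`, and `demo.vcf` are made-up stand-ins so the block runs on its own, and you would substitute your real paths.

```shell
# Stand-in files for illustration only; substitute your real paths.
printf '>NC_000913.3 demo sequence\nACGTACGT\n' > demo.fa
printf 'NC_000913.3\t.\tgene\t1\t8\t.\t+\t.\tID=gene1\n' > demo.gff
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nNC_000913.3\t2\t.\tC\tT\t50\tPASS\t.\n' > demo.vcf

# Sequence names as each file spells them
# (FASTA: first word after ">"; GFF3 and VCF: first column of non-comment lines)
fa_names=$(grep '^>' demo.fa | sed -E 's/^>([^[:space:]]+).*/\1/' | sort -u)
gff_names=$(grep -v '^#' demo.gff | cut -f1 | sort -u)
vcf_names=$(grep -v '^#' demo.vcf | cut -f1 | sort -u)

if [ "$fa_names" = "$gff_names" ] && [ "$fa_names" = "$vcf_names" ]; then
  echo "contig names agree: $fa_names"
else
  echo "contig name mismatch" >&2
fi
```

Strict equality suits a single-contig toy genome. With several contigs, a VCF may legitimately cover only a subset, so a subset test would be the better fit there.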

C. Annotate your filtered VCF

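# Input is the tab-fixed VCF from above.
# -c names the config file in which the toy genome is registered (a line "toy.genome : Toy").
snpEff ann \
  -c snpEff.config \
  -dataDir "$(pwd)/snpEff_data" \
  -v toy \
  toy.fixed.vcf.gz \
| bgzip -c > annotated.vcf.gz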

Next, index the output:

# index the annotated VCF
tabix -p vcf annotated.vcf.gz

The INFO column now has an ANN= tag describing predicted impacts (e.g. missense_variant, synonymous_variant, etc.).
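Each ANN entry is a pipe-separated list whose first fields are the allele, the effect, the impact, and the gene name, with multiple entries separated by commas. The sketch below pulls those fields out with awk. The record in `demo.ann.vcf` is made up for illustration and is not real SnpEff output, and `summarise_ann` is a hypothetical helper name.

```shell
# Made-up annotated record for illustration (not real SnpEff output)
printf 'NC_000913.3\t1000\t.\tA\tG\t50\tPASS\tANN=G|missense_variant|MODERATE|DemoGene|gene1|transcript|tx1|protein_coding|1/1|c.100A>G|p.Thr34Ala|100/300|100/300|34/99||\n' > demo.ann.vcf

# Hypothetical helper: print CHROM, POS, effect, impact and gene name
# for every ANN entry of every record
summarise_ann() {
  awk -F'\t' -v OFS='\t' '!/^#/ {
    n = split($8, info, ";")                        # INFO key=value pairs
    for (i = 1; i <= n; i++) {
      if (info[i] !~ /^ANN=/) continue
      m = split(substr(info[i], 5), entries, ",")   # one entry per allele/feature
      for (j = 1; j <= m; j++) {
        split(entries[j], f, "|")                   # f[2]=effect, f[3]=impact, f[4]=gene
        print $1, $2, f[2], f[3], f[4]
      }
    }
  }' "$1"
}

summarise_ann demo.ann.vcf
```

For a bgzipped file, the same function can read from a decompressed stream, for example `gunzip -c annotated.vcf.gz | summarise_ann -`.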

Adding Population Frequencies

Here we’ll simulate a tiny popfreq.tsv file, then annotate with bcftools:

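# bcftools annotate expects tab-separated columns, so write real tabs with printf
# (the format string is reused for every group of four arguments)
printf '%s\t%s\t%s\t%s\n' \
  '#CHROM'    POS  ID AF \
  NC_000913.3 1000 .  0.12 \
  NC_000913.3 2000 .  0.03 > popfreq.tsv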