Annotation and Enrichment - igheyas/Bioinformatics GitHub Wiki
Annotation & Enrichment
After you’ve filtered your variants (see Variant Filtering Strategies), the next step is to predict their functional effects and enrich your VCF with data from external databases:
Effect Prediction
- SnpEff (local or custom DB)
- VEP (Ensembl Variant Effect Predictor)
Adding Population Frequencies
- Annotate with gnomAD (or a custom .tsv)
Adding Clinical Data
- Annotate with ClinVar (or a custom .vcf)
- Prepare a Toy VCF
First, let’s make a tiny VCF (toy.vcf.gz) so we can demo everything without huge downloads:
# 1) Create a minimal VCF
cat > toy.vcf << 'EOF'
##fileformat=VCFv4.2
##contig=<ID=NC_000913.3,length=4641652>
#CHROM POS ID REF ALT QUAL FILTER INFO
NC_000913.3 1000 . A G 50 PASS .
NC_000913.3 2000 . T C 60 PASS .
EOF
# 2) Compress (indexing comes later, once the columns are tab-separated)
bgzip toy.vcf
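The VCF specification requires tab-separated columns, and a heredoc typed with spaces will not satisfy that. The following is a minimal sketch for spotting the problem early; `check_vcf_tabs` is a hypothetical helper name, and the two input lines are illustrative rather than taken from a real file.

```shell
# Hypothetical helper: reads a VCF on stdin and lists every header or record
# line that has fewer than the 8 mandatory tab-separated columns.
check_vcf_tabs() {
  awk -F'\t' '
    /^##/  { next }   # meta lines are free-form, skip them
    NF < 8 { bad++; printf "line %d: only %d tab-separated field(s)\n", NR, NF }
    END    { exit (bad > 0) }   # non-zero exit status if anything was flagged
  '
}

# Illustrative input: a record typed with spaces is flagged
printf '%s\n' '##fileformat=VCFv4.2' 'NC_000913.3 1000 . A G 50 PASS .' \
  | check_vcf_tabs || echo "space-separated lines found"

# Illustrative input: a record typed with real tabs passes
printf 'NC_000913.3\t2000\t.\tT\tC\t60\tPASS\t.\n' \
  | check_vcf_tabs && echo "tab-separated: OK"
```

On the toy file this could be run as `gunzip -c toy.vcf.gz | check_vcf_tabs`.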
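- Decompress, fix, recompress, index
VCF columns must be tab-separated, but the heredoc above wrote them with spaces, so the file has to be repaired before Tabix can index it. Decompress to plain VCF: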
gunzip -c toy.vcf.gz > toy.vcf
- Convert runs of spaces → tabs
sed -E 's/ +/\t/g' toy.vcf > toy.tab.vcf
A quick fix: leave the ## meta lines untouched and force tab-separation on the #CHROM line and the records
# 1) Decompress your original VCF, re-join the columns with tabs, and recompress:
gunzip -c toy.vcf.gz \
| awk 'BEGIN { OFS = "\t" } /^##/ { print; next } { $1 = $1; print }' \
| bgzip -c > toy.fixed.vcf.gz
# 2) Now index with Tabix
tabix -p vcf toy.fixed.vcf.gz
# You should now see:
# toy.fixed.vcf.gz
# toy.fixed.vcf.gz.tbi
Tabix will happily index a VCF without printing anything to the screen on success. To verify that it actually created the .tbi file, just list the files:
ls -lh toy.fixed.vcf.gz toy.fixed.vcf.gz.tbi
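An index file only proves that Tabix ran, not that the records survived the repair step. As a rough sanity check, the sketch below counts meta lines, header lines, and records; `vcf_summary` is a hypothetical helper name and the sample input simply mirrors the toy file.

```shell
# Hypothetical helper: summarize a VCF read from stdin
vcf_summary() {
  awk '
    /^##/ { meta++; next }      # "##" meta-information lines
    /^#/  { header++; next }    # the single "#CHROM ..." column header
    NF    { records++ }         # any other non-empty line is a record
    END   { printf "meta=%d header=%d records=%d\n", meta, header, records }
  '
}

# Illustrative input mirroring the toy file
printf '%s\n' '##fileformat=VCFv4.2' \
  '#CHROM POS ID REF ALT QUAL FILTER INFO' \
  'NC_000913.3 1000 . A G 50 PASS .' \
  'NC_000913.3 2000 . T C 60 PASS .' | vcf_summary
```

Running `gunzip -c toy.fixed.vcf.gz | vcf_summary` should report two records for the toy data; a pipeline that silently dropped the data lines would show `records=0`.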
- Effect Prediction with SnpEff
A. Install SnpEff
# via Conda (isolated)
conda install -c bioconda snpeff
# or system-wide (Ubuntu)
sudo apt update
sudo apt install snpeff
B. Build a “toy” database locally
# 1) Make a data dir
mkdir -p snpEff_data/toy
# 2) Create a one‐line GFF3 for your genome
cat > toy.gff3 << 'EOF'
##gff-version 3
##sequence-region NC_000913.3 1 1000
NC_000913.3 . gene 1 1000 . + . ID=gene1;Name=DemoGene
EOF
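# 3) Copy your genome FASTA and the GFF3 (tab-separated, named genes.gff) into that folder
cp ref_genome.fa snpEff_data/toy/sequences.fa
awk 'BEGIN { OFS = "\t" } /^##/ { print; next } { $1 = $1; print }' toy.gff3 > snpEff_data/toy/genes.gff
# 4) Build, telling snpEff about your dataDir (skip `cd`). The command name comes first,
#    and the genome must be registered in snpEff.config with a line such as "toy.genome : Toy"
snpEff build -dataDir "$(pwd)/snpEff_data" -gff3 -v toy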
At the end you should see the built database, snpEffectPredictor.bin, inside the snpEff_data/toy folder.
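A quick way to confirm the database layout is to check for the three files SnpEff works with in a genome folder. This is only a sketch: `check_snpeff_db` is a hypothetical helper, and it assumes the uncompressed names genes.gff, sequences.fa, and snpEffectPredictor.bin (SnpEff also accepts other layouts, such as compressed inputs, which this check ignores).

```shell
# Hypothetical helper: check_snpeff_db <dataDir> <genome>
check_snpeff_db() {
  dir="$1/$2"
  rc=0
  for f in genes.gff sequences.fa snpEffectPredictor.bin; do
    if [ -s "$dir/$f" ]; then      # file exists and is not empty
      echo "ok       $dir/$f"
    else
      echo "missing  $dir/$f"
      rc=1
    fi
  done
  return "$rc"
}

# Illustrative run against a throwaway folder that has inputs but no built DB
demo=$(mktemp -d)
mkdir -p "$demo/toy"
echo placeholder > "$demo/toy/genes.gff"
echo placeholder > "$demo/toy/sequences.fa"
check_snpeff_db "$demo" toy || echo "database not built yet"
rm -rf "$demo"
```

For the toy setup the call would be `check_snpeff_db "$(pwd)/snpEff_data" toy`.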
C. Annotate your filtered VCF
snpEff \
-dataDir $(pwd)/snpEff_data \
-v toy \
toy.fixed.vcf.gz \
| bgzip -c > annotated.vcf.gz
Then index the output:
# index the annotated VCF
tabix -p vcf annotated.vcf.gz
The INFO column now has an ANN= tag describing predicted impacts (e.g. missense_variant, synonymous_variant, etc.).
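To read the predictions without scrolling through the raw INFO column, the ANN value can be split on its `|` separators: in the ANN layout the second subfield is the effect and the third is the impact. The sketch below assumes that layout; `ann_effects` is a hypothetical helper, and the sample record is a shortened, made-up ANN entry, not real SnpEff output.

```shell
# Hypothetical helper: print CHROM, POS, effect, and impact for every ANN entry
ann_effects() {
  awk -F'\t' '
    BEGIN { OFS = "\t" }
    /^#/  { next }                            # skip meta and header lines
    {
      n = split($8, info, ";")                # INFO is a ";"-separated list
      for (i = 1; i <= n; i++) {
        if (info[i] !~ /^ANN=/) continue
        sub(/^ANN=/, "", info[i])
        m = split(info[i], anns, ",")         # one entry per allele/transcript
        for (j = 1; j <= m; j++) {
          split(anns[j], f, "|")
          print $1, $2, f[2], f[3]            # f[2] = effect, f[3] = impact
        }
      }
    }
  '
}

# Illustrative record with a shortened, made-up ANN value
printf 'NC_000913.3\t1000\t.\tA\tG\t50\tPASS\tANN=G|missense_variant|MODERATE|DemoGene|gene1\n' \
  | ann_effects
```

On the real output this could be run as `gunzip -c annotated.vcf.gz | ann_effects`.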
- Adding Population Frequencies
Here we’ll simulate a tiny popfreq.tsv file, then annotate with bcftools:
cat > popfreq.tsv << 'EOF'
#CHROM POS ID AF
NC_000913.3 1000 . 0.12
NC_000913.3 2000 . 0.03
EOF
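One possible way to carry the AF column into the VCF is sketched below. `bcftools annotate` expects a tab-delimited, bgzip-compressed, Tabix-indexed table plus a header line declaring the new INFO tag. The output names popfreq.tab.tsv, af.hdr, and annotated.af.vcf.gz are assumptions chosen for illustration, as is declaring AF with Number=1 for this toy data; the bcftools step only runs if the tools and annotated.vcf.gz are available.

```shell
# Recreate the toy table only if it is not already in the working directory
[ -f popfreq.tsv ] || printf '%s\n' '#CHROM POS ID AF' \
  'NC_000913.3 1000 . 0.12' 'NC_000913.3 2000 . 0.03' > popfreq.tsv

# Re-join the columns with tabs (bcftools expects a tab-delimited table)
awk 'BEGIN { OFS = "\t" } { $1 = $1; print }' popfreq.tsv > popfreq.tab.tsv

if command -v bcftools >/dev/null 2>&1 && command -v bgzip >/dev/null 2>&1 \
   && command -v tabix >/dev/null 2>&1 && [ -f annotated.vcf.gz ]; then
  # Compress and index the table: column 1 = sequence, column 2 = position
  bgzip -c popfreq.tab.tsv > popfreq.tab.tsv.gz
  tabix -s 1 -b 2 -e 2 popfreq.tab.tsv.gz

  # Declare the new INFO tag so bcftools can add it to the VCF header
  printf '##INFO=<ID=AF,Number=1,Type=Float,Description="Toy allele frequency">\n' > af.hdr

  # "-" skips the ID column of the table; AF is copied into INFO/AF
  bcftools annotate -a popfreq.tab.tsv.gz -h af.hdr -c CHROM,POS,-,AF \
    -Oz -o annotated.af.vcf.gz annotated.vcf.gz
fi
```

A real gnomAD table would normally also carry REF and ALT columns so that frequencies are matched per allele rather than per position.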