Annotation and Enrichment - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki
Annotation & Enrichment
Once you’ve hard-filtered your VCF, the next step is to predict the functional effect of each variant and annotate it with external data such as population allele frequencies and clinical significance.
## 1. Effect Prediction
### 1.1 SnpEff
```bash
# Install (Conda is recommended)
conda install -c bioconda snpeff
# Download a pre-built database (e.g. human GRCh38)
# Note: the Bioconda wrapper script is named snpEff (case-sensitive)
snpEff download GRCh38.86
# Annotate your VCF
snpEff -v GRCh38.86 \
  filtered.vcf.gz \
  | bgzip -c > annotated.snpeff.vcf.gz
# Index for fast lookup
tabix -p vcf annotated.snpeff.vcf.gz
```
- Output: Adds an `ANN=` field to INFO.

Example:
```
#CHROM  POS     REF  ALT  …  INFO
1       879317  G    A    …  ANN=A|missense_variant|MODERATE|TP53|…
```
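Each `ANN` entry is pipe-delimited, and SnpEff separates multiple entries for the same variant with commas. The sketch below pulls the effect, impact, and gene name out of the first entry using standard Unix tools. The INFO string is made up for illustration (real entries carry more subfields after the gene name), and the variable names are arbitrary.

```bash
# Hypothetical INFO string with two ANN entries (illustrative data only)
info='DP=35;ANN=A|missense_variant|MODERATE|TP53,A|upstream_gene_variant|MODIFIER|TP53;AF=0.0021'

# Split INFO on ";", keep the ANN value, then take the first comma-separated entry
first_ann=$(printf '%s\n' "$info" | tr ';' '\n' | sed -n 's/^ANN=//p' | cut -d, -f1)

# Subfields 2-4 of an ANN entry are effect, impact, and gene name
ann_summary=$(printf '%s\n' "$first_ann" | awk -F'|' '{print $2 "\t" $3 "\t" $4}')
printf '%s\n' "$ann_summary"
```

For whole files, a dedicated parser is usually safer than ad hoc splitting, since a variant can carry many entries.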
### 1.2 VEP (Ensembl Variant Effect Predictor)
```bash
# Install
conda install -c bioconda ensembl-vep
# Fetch the cache for your species (once)
vep_install -a cf -s homo_sapiens -y GRCh38
# Annotate your VCF
vep \
  --cache --assembly GRCh38 \
  --vcf --compress_output bgzip \
  --input_file filtered.vcf.gz \
  --output_file annotated.vep.vcf.gz \
  --fields "Consequence,SYMBOL,IMPACT,Protein_position" \
  --fork 4
# Index
tabix -p vcf annotated.vep.vcf.gz
```
- Output: Adds a `CSQ=` field to INFO. Its subfields appear in the order given to `--fields`.

Example:
```
#CHROM  POS     REF  ALT  …  INFO
1       879317  G    A    …  CSQ=missense_variant|TP53|MODERATE|…
```
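Because the layout of `CSQ` depends on the `--fields` setting, it is safer to read the subfield order from the VCF header than to hard-code it. VEP records the order after `Format:` in the `CSQ` INFO header line. A minimal sketch, using a hypothetical header line of that shape instead of a real file:

```bash
# Hypothetical header line of the kind VEP writes in --vcf mode
header='##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Consequence|SYMBOL|IMPACT|Protein_position">'

# Capture everything between "Format: " and the closing quote
csq_fields=$(printf '%s\n' "$header" | sed -n 's/.*Format: \([^"]*\)".*/\1/p')

# 1-based position of SYMBOL within each CSQ entry
symbol_col=$(printf '%s\n' "$csq_fields" | tr '|' '\n' | grep -n -x 'SYMBOL' | cut -d: -f1)

printf 'fields: %s\nSYMBOL is subfield %s\n' "$csq_fields" "$symbol_col"
```

On a real file the header line could be obtained with `bcftools view -h annotated.vep.vcf.gz | grep 'ID=CSQ'`.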
## 2. Population-Frequency Annotation
### 2.1 Public Databases (gnomAD / 1000 Genomes)
```bash
# Download or point to a sites-only VCF with AF info, e.g.:
#   gnomad.genomes.r3.1.sites.vcf.gz
# The annotation VCF must be bgzip-compressed and indexed, and its contig
# names must match those in your VCF (e.g. "1" vs "chr1").
# Annotate
bcftools annotate \
  -a gnomad.genomes.r3.1.sites.vcf.gz \
  -c INFO/AF \
  annotated.snpeff.vcf.gz \
  | bgzip -c > with_gnomad.vcf.gz
# Re-index
tabix -p vcf with_gnomad.vcf.gz
```
- Output: INFO now contains `AF=<allele-frequency>`.

Example:
```
…;ANN=…;AF=0.0021;…
```
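Once `AF` is present, a common next step is to keep only rare variants. On a real file this can be done with a bcftools filter expression such as `bcftools view -i 'INFO/AF<0.01'`. The dependency-free sketch below shows the same logic with `awk` on three made-up records. The 0.01 cutoff, the sample lines, and the choice to keep sites that have no `AF` at all are assumptions for illustration, and multi-allelic `AF` values (comma-separated) are not handled.

```bash
# Three hypothetical VCF body lines (CHROM POS ID REF ALT QUAL FILTER INFO)
vcf_lines=$(printf '1\t879317\t.\tG\tA\t50\tPASS\tANN=x;AF=0.0021\n1\t879450\t.\tC\tT\t50\tPASS\tANN=x;AF=0.25\n1\t880000\t.\tT\tC\t50\tPASS\tANN=x\n')

rare=$(printf '%s\n' "$vcf_lines" | awk -F'\t' -v max_af=0.01 '
  {
    af = ""
    n = split($8, kv, ";")
    for (i = 1; i <= n; i++)
      if (kv[i] ~ /^AF=/) af = substr(kv[i], 4)
    # keep sites without AF (not in the reference panel) or below the cutoff
    if (af == "" || af + 0 < max_af)
      print $1 ":" $2 "\t" (af == "" ? "NA" : af)
  }')
printf '%s\n' "$rare"
```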
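### 2.2 Custom Frequency Table
Prepare a simple tab-separated file (`my_freq.tsv`):
```
#CHROM	POS	ID	AF
1	879317	.	0.0054
1	879450	.	0.0008
```
```bash
# bcftools annotate requires the table to be bgzip-compressed and tabix-indexed
bgzip my_freq.tsv
tabix -s1 -b2 -e2 my_freq.tsv.gz
bcftools annotate \
  -a my_freq.tsv.gz \
  -c CHROM,POS,ID,AF \
  with_gnomad.vcf.gz \
  | bgzip -c > with_customAF.vcf.gz
tabix -p vcf with_customAF.vcf.gz
```
Because the table has no REF/ALT columns, records are matched by position only. `with_gnomad.vcf.gz` already defines `INFO/AF`, so this step overwrites the gnomAD value at matching sites; to keep both, give the custom column a distinct tag name and declare it in a header file passed with `-h`.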
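If a custom table is going to be compressed and indexed with `tabix` for use as an annotation source, its records need to be sorted by contig and position. A minimal sketch of sorting such a table while keeping the `#` header line first; the temporary file and its two rows are made up for illustration:

```bash
# Hypothetical unsorted table written to a temporary file
tmp_tsv=$(mktemp)
printf '#CHROM\tPOS\tID\tAF\n1\t879450\t.\t0.0008\n1\t879317\t.\t0.0054\n' > "$tmp_tsv"

# Emit the header first, then records sorted by contig and numeric position
tab=$(printf '\t')
sorted_tsv=$( { grep '^#' "$tmp_tsv"; grep -v '^#' "$tmp_tsv" | sort -t "$tab" -k1,1 -k2,2n; } )
printf '%s\n' "$sorted_tsv"

rm -f "$tmp_tsv"
```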
## 3. Clinical Annotation
### 3.1 ClinVar
```bash
# Download the ClinVar VCF (sites-only) and its index
#   clinvar.vcf.gz
# Annotate clinical significance and disease name
# (current ClinVar releases call the disease-name tag CLNDN; CLNDBN is the older name)
bcftools annotate \
  -a clinvar.vcf.gz \
  -c INFO/CLNSIG,INFO/CLNDN \
  with_customAF.vcf.gz \
  | bgzip -c > with_clinvar.vcf.gz
tabix -p vcf with_clinvar.vcf.gz
```
- Output: INFO now contains `CLNSIG` (e.g. `Pathogenic`) and `CLNDN` (disease name, with spaces written as underscores).

Example:
```
…;AF=0.0021;CLNSIG=Likely_pathogenic;CLNDN=Breast_cancer;…
```
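When selecting variants by `CLNSIG`, a plain substring match on "pathogenic" is risky, because values such as `Conflicting_interpretations_of_pathogenicity` contain the same substring. The sketch below uses an anchored pattern instead. The list of values is a small hand-written sample, not a complete inventory of what ClinVar emits, and the pattern is one possible choice.

```bash
# Hand-written sample of CLNSIG values (illustrative, not exhaustive)
clnsig_values='Pathogenic
Likely_pathogenic
Pathogenic/Likely_pathogenic
Conflicting_interpretations_of_pathogenicity
Benign'

# Match only values that START with Pathogenic or Likely_pathogenic,
# followed by a separator or the end of the value
kept=$(printf '%s\n' "$clnsig_values" |
  awk -v pat='^(Pathogenic|Likely_pathogenic)([/|,]|$)' '$0 ~ pat')
printf '%s\n' "$kept"
```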
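### 3.2 COSMIC (Cancer Mutations)
```bash
# Download the COSMIC mutation VCF
#   cosmic.vcf.gz
# Check which INFO tags your COSMIC release defines before copying one;
# bcftools annotate stops with an error if the requested tag is not in the header:
#   bcftools view -h cosmic.vcf.gz | grep '^##INFO'
bcftools annotate \
  -a cosmic.vcf.gz \
  -c INFO/MAF \
  with_clinvar.vcf.gz \
  | bgzip -c > enriched.vcf.gz
tabix -p vcf enriched.vcf.gz
```
- Output: Adds `MAF=<mutation-frequency-in-COSMIC>`.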
## 4. Final VCF & Downstream
- **Load** `enriched.vcf.gz` into IGV, UCSC Genome Browser, etc.
- **Export** selected columns to TSV for R/Python:
```bash
bcftools query \
  -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/AF\t%INFO/CLNSIG\t%INFO/ANN\n' \
  enriched.vcf.gz \
  > final_annotation.tsv
```
- **Visualize** (e.g. allele-frequency histograms, variant counts per gene) in R or Python.
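Before moving to R or Python, a quick sanity check of the exported table can be done in the shell. The sketch below counts variants per gene, assuming the seven-column layout produced by the `bcftools query` command above (ANN in column 7, gene name as the fourth pipe-delimited subfield of the first ANN entry). The rows and the gene names `GENE1` and `GENE2` are made up for illustration.

```bash
# Hypothetical rows in the layout of final_annotation.tsv
tsv=$(printf '1\t879317\tG\tA\t0.0021\tLikely_pathogenic\tA|missense_variant|MODERATE|GENE1\n1\t879450\tC\tT\t0.0008\t.\tT|synonymous_variant|LOW|GENE1\n2\t12345\tA\tG\t0.05\t.\tG|missense_variant|MODERATE|GENE2\n')

# Tally the gene name from the first ANN entry of each row
gene_counts=$(printf '%s\n' "$tsv" | awk -F'\t' '
  { split($7, ann, "|"); count[ann[4]]++ }
  END { for (g in count) print g "\t" count[g] }' | sort)
printf '%s\n' "$gene_counts"
```

With a real file, replace the inline rows with `cat final_annotation.tsv`.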
## Required Tools & Libraries
- SnpEff (effect prediction)
- VEP (deep consequence annotation)
- bcftools (VCF annotate/query/index)
- tabix/bgzip (VCF compression & indexing)
- ClinVar, gnomAD, COSMIC (public annotation VCFs)
- R or Python (for downstream stats & plots)