# Annotation & Enrichment

Once you have hard-filtered your VCF, the next step is to predict each variant's functional effect and to annotate it with external data such as population allele frequencies and clinical significance.

## 1. Effect Prediction

### 1.1 SnpEff

```bash
# Install (Conda is recommended)
conda install -c bioconda snpeff

# Download a pre-built database (e.g. human GRCh38)
# Note: the Bioconda wrapper script is named snpEff (case-sensitive)
snpEff download GRCh38.86

# Annotate your VCF
snpEff -v GRCh38.86 \
  filtered.vcf.gz \
| bgzip -c > annotated.snpeff.vcf.gz

# Index for fast lookup
tabix -p vcf annotated.snpeff.vcf.gz
```

  • Output: Adds an ANN= field to INFO.
    Example (illustrative):
    #CHROM  POS     REF  ALT  …  INFO
    1       879317  G    A    …  ANN=A|missense_variant|MODERATE|TP53|…
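
The ANN value is a comma-separated list of annotations, each made of pipe-delimited subfields (allele, effect, impact, gene name, and so on). As a minimal sketch of how those subfields can be pulled apart with standard tools, the helper below (`ann_summary`, a hypothetical name) prints the effect, impact and gene of the first annotation per record. The input record is made up and modelled on the example above; with real data you would pipe `bgzip -dc annotated.snpeff.vcf.gz` into the function instead.

```bash
#!/usr/bin/env bash
# Sketch only: ann_summary is a hypothetical helper, not part of SnpEff.
# Reads VCF text on stdin; prints CHROM, POS, effect, impact and gene name
# taken from the first ANN entry of each record.
ann_summary() {
  awk -F'\t' '
    /^#/ { next }                               # skip header lines
    {
      n = split($8, info, ";")                  # INFO is column 8
      for (i = 1; i <= n; i++) {
        if (info[i] ~ /^ANN=/) {
          sub(/^ANN=/, "", info[i])
          split(info[i], anns, ",")             # one entry per predicted effect
          split(anns[1], f, "|")                # subfields of the first entry
          print $1 "\t" $2 "\t" f[2] "\t" f[3] "\t" f[4]
        }
      }
    }'
}

# Made-up record standing in for one line of annotated.snpeff.vcf.gz
record=$(printf '1\t879317\t.\tG\tA\t50\tPASS\tDP=30;ANN=A|missense_variant|MODERATE|TP53,A|upstream_gene_variant|MODIFIER|GENE2\n')

summary=$(printf '%s\n' "$record" | ann_summary)
printf '%s\n' "$summary"
```

Only the first annotation is reported here; looping over every element of `anns` would list all predicted effects for a variant.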
    
### 1.2 VEP (Ensembl Variant Effect Predictor)
```bash
# Install
conda install -c bioconda ensembl-vep

# Fetch cache for your species (once)
vep_install -a cf -s homo_sapiens -y GRCh38

# Annotate your VCF (--symbol is required for SYMBOL to be populated)
vep \
  --cache --assembly GRCh38 \
  --vcf --compress_output bgzip \
  --input_file filtered.vcf.gz \
  --output_file annotated.vep.vcf.gz \
  --symbol \
  --fields "Consequence,SYMBOL,IMPACT,Protein_position" \
  --fork 4

# Index
tabix -p vcf annotated.vep.vcf.gz
```

  • Output: Adds a CSQ= field to INFO, with subfields in the order given to --fields.
    Example (illustrative):
    #CHROM  POS     REF  ALT  …  INFO
    1       879317  G    A    …  CSQ=missense_variant|TP53|MODERATE|…
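
VEP records the order of the CSQ subfields in the VCF header, in the `Format:` part of the CSQ INFO description. The sketch below reads that order back so that downstream parsing does not have to hard-code positions. `csq_fields` is a hypothetical helper name, and the header line is a hand-written stand-in for what the command above would be expected to produce. With real data, feed it `bcftools view -h annotated.vep.vcf.gz`.

```bash
#!/usr/bin/env bash
# Sketch only: csq_fields is a hypothetical helper, not a VEP or bcftools command.
# Reads VCF header text on stdin; prints the CSQ subfield names, one per line.
csq_fields() {
  grep '^##INFO=<ID=CSQ,' \
    | sed -e 's/.*Format: //' -e 's/">.*$//' \
    | tr '|' '\n'
}

# Hand-written stand-in for the CSQ header line of annotated.vep.vcf.gz
header='##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Consequence|SYMBOL|IMPACT|Protein_position">'

fields=$(printf '%s\n' "$header" | csq_fields)
printf '%s\n' "$fields"

# 1-based position of SYMBOL within each CSQ entry
symbol_idx=$(printf '%s\n' "$fields" | grep -n -x 'SYMBOL' | cut -d: -f1)
printf 'SYMBOL is subfield %s\n' "$symbol_idx"
```

Looking the position up this way keeps scripts working if the `--fields` list is later reordered or extended.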
    
## 2. Population-Frequency Annotation

### 2.1 Public Databases (gnomAD / 1000 Genomes)

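```bash
# Download or point to a sites-only VCF with AF info:
#   gnomad.genomes.r3.1.sites.vcf.gz
# The annotation VCF must be bgzip-compressed and tabix-indexed, and its
# chromosome names must match your VCF (e.g. "chr1" vs "1").

# Annotate (by default this replaces any AF already present in INFO)
bcftools annotate \
  -a gnomad.genomes.r3.1.sites.vcf.gz \
  -c INFO/AF \
  annotated.snpeff.vcf.gz \
| bgzip -c > with_gnomad.vcf.gz

# Re-index
tabix -p vcf with_gnomad.vcf.gz
```

  • Output: INFO now contains AF=<allele-frequency>
    Example:
    …;ANN=…;AF=0.0021;…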
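
A common use of the AF annotation is to keep only rare variants. The sketch below shows the idea with plain awk on a made-up three-record VCF. `rare_only` is a hypothetical helper, and the 0.01 cutoff is an assumption chosen for illustration. It reads only the first AF value, so it assumes biallelic records, and it keeps records with no AF on the assumption that they are absent from the frequency resource. On real files, `bcftools view -i 'INFO/AF<0.01'` is the more direct route, although that expression drops records lacking AF.

```bash
#!/usr/bin/env bash
# Sketch only: rare_only is a hypothetical helper.
# Usage: rare_only MAX_AF < in.vcf
# Keeps header lines and records whose INFO/AF is below MAX_AF.
rare_only() {
  awk -F'\t' -v max_af="$1" '
    /^#/ { print; next }                        # pass header lines through
    {
      af = ""
      n = split($8, info, ";")                  # INFO is column 8
      for (i = 1; i <= n; i++)
        if (info[i] ~ /^AF=/) af = substr(info[i], 4)
      # No AF at all: treated as not seen in the frequency resource, so kept
      if (af == "" || af + 0 < max_af + 0) print
    }'
}

# Made-up records standing in for with_gnomad.vcf.gz
vcf=$(printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t879317\t.\tG\tA\t50\tPASS\tAF=0.0021\n1\t879450\t.\tC\tT\t50\tPASS\tAF=0.25\n1\t879500\t.\tT\tC\t50\tPASS\tDP=12\n')

rare=$(printf '%s\n' "$vcf" | rare_only 0.01)
printf '%s\n' "$rare"
```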

### 2.2 Custom Frequency Table

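Prepare a simple tab-separated file (my_freq.tsv), sorted by chromosome and position:

    #CHROM  POS     ID  AF
    1       879317  .   0.0054
    1       879450  .   0.0008

```bash
# The table must be bgzip-compressed and tabix-indexed
bgzip my_freq.tsv
tabix -s 1 -b 2 -e 2 my_freq.tsv.gz

# "-" skips the placeholder ID column; AF here replaces the gnomAD AF from 2.1
bcftools annotate \
  -a my_freq.tsv.gz \
  -c CHROM,POS,-,AF \
  with_gnomad.vcf.gz \
| bgzip -c > with_customAF.vcf.gz
tabix -p vcf with_customAF.vcf.gz
```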

## 3. Clinical Annotation

### 3.1 ClinVar

```bash
# Download ClinVar VCF (sites-only) together with its .tbi index
#   clinvar.vcf.gz

# Annotate clinical significance (CLNSIG) & disease name (CLNDN)
bcftools annotate \
  -a clinvar.vcf.gz \
  -c INFO/CLNSIG,INFO/CLNDN \
  with_customAF.vcf.gz \
| bgzip -c > with_clinvar.vcf.gz
tabix -p vcf with_clinvar.vcf.gz
```

  • Output: INFO now contains CLNSIG (e.g. Pathogenic) and CLNDN (disease name; older ClinVar releases called this tag CLNDBN)
    Example (illustrative):
    …;AF=0.0021;CLNSIG=Likely_pathogenic;CLNDN=Breast_cancer;…
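
Before moving on, it can help to check how many variants picked up each clinical significance label. The sketch below tallies CLNSIG values with awk. `clnsig_counts` is a hypothetical helper, the label `not_in_ClinVar` is an invented placeholder for records without a CLNSIG tag, and the three records are made up. With real data, pipe `bgzip -dc with_clinvar.vcf.gz` into the function.

```bash
#!/usr/bin/env bash
# Sketch only: clnsig_counts is a hypothetical helper.
# Reads VCF text on stdin; prints "<CLNSIG value><TAB><record count>" sorted by label.
clnsig_counts() {
  awk -F'\t' '
    /^#/ { next }                               # skip header lines
    {
      sig = "not_in_ClinVar"                    # placeholder for unannotated records
      n = split($8, info, ";")                  # INFO is column 8
      for (i = 1; i <= n; i++)
        if (info[i] ~ /^CLNSIG=/) sig = substr(info[i], 8)
      count[sig]++
    }
    END { for (s in count) print s "\t" count[s] }' | LC_ALL=C sort
}

# Made-up records standing in for with_clinvar.vcf.gz
vcf=$(printf '1\t879317\t.\tG\tA\t50\tPASS\tAF=0.0021;CLNSIG=Likely_pathogenic\n1\t879450\t.\tC\tT\t50\tPASS\tAF=0.25;CLNSIG=Benign\n1\t879500\t.\tT\tC\t50\tPASS\tDP=12\n')

counts=$(printf '%s\n' "$vcf" | clnsig_counts)
printf '%s\n' "$counts"
```

A tally with every record under the placeholder label usually points to a chromosome-naming or reference-build mismatch between the two files.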

### 3.2 COSMIC (Cancer Mutations)
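```bash
# Download COSMIC mutation VCF
#   cosmic.vcf.gz
# INFO tag names differ between COSMIC releases; confirm that the tag you
# request (MAF here) exists with: bcftools view -h cosmic.vcf.gz

bcftools annotate \
  -a cosmic.vcf.gz \
  -c INFO/MAF \
  with_clinvar.vcf.gz \
| bgzip -c > enriched.vcf.gz
tabix -p vcf enriched.vcf.gz
```

  • Output: Adds MAF=<mutation-frequency-in-COSMIC>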

## 4. Final VCF & Downstream

  1. Load enriched.vcf.gz into IGV, UCSC Genome Browser, etc.
  2. Export selected columns to TSV for R/Python:

```bash
bcftools query \
  -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/AF\t%INFO/CLNSIG\t%INFO/ANN\n' \
  enriched.vcf.gz \
  > final_annotation.tsv
```

  3. **Visualize** (e.g. allele-frequency histograms, variant counts per gene) in R or Python.
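
For a quick look before opening R or Python, variant counts per gene can be computed straight from the exported table. The sketch below assumes the seven-column layout produced by the query above, with ANN in column 7 and the gene name as the fourth pipe-delimited subfield of the first annotation. `variants_per_gene` is a hypothetical helper and the rows are made up. With real data, run `variants_per_gene < final_annotation.tsv`.

```bash
#!/usr/bin/env bash
# Sketch only: variants_per_gene is a hypothetical helper.
# Reads the exported TSV on stdin; prints "<gene><TAB><count>",
# most frequently hit gene first, ties broken by gene name.
variants_per_gene() {
  awk -F'\t' '
    {
      split($7, f, "|")                         # column 7 holds the ANN string
      if (f[4] != "") count[f[4]]++             # f[4] is the gene name
    }
    END { for (g in count) print g "\t" count[g] }' \
    | LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Made-up rows standing in for final_annotation.tsv
tsv=$(printf '1\t879317\tG\tA\t0.0021\tLikely_pathogenic\tA|missense_variant|MODERATE|TP53|x\n1\t879450\tC\tT\t0.25\tBenign\tT|synonymous_variant|LOW|TP53|x\n1\t879500\tT\tC\t.\t.\tC|missense_variant|MODERATE|GENE_B|x\n')

per_gene=$(printf '%s\n' "$tsv" | variants_per_gene)
printf '%s\n' "$per_gene"
```

Rows whose ANN column is missing (`.`) have no fourth subfield and are skipped rather than counted under an empty gene name.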

## Required Tools & Libraries

  • SnpEff (effect prediction)
  • VEP (deep consequence annotation)
  • bcftools (VCF annotate/query/index)
  • tabix/bgzip (VCF compression & indexing)
  • ClinVar, gnomAD, COSMIC (public annotation VCFs)
  • R or Python (for downstream stats & plots)