Annotating with ANNOVAR - asoltis/MutEnricher GitHub Wiki

Annotating mutations with ANNOVAR

Last updated: October 29, 2019

Introduction

MutEnricher's coding module requires somatic VCFs whose INFO field carries gene annotations and terms that identify non-silent mutations. Several bioinformatics tools can add these annotations; this page describes how to do so with the popular ANNOVAR tool.

Annotation procedure

1a. Ensure `perl` is available on your system
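
A quick check can confirm that `perl` (and, for the later steps, the compression and indexing tools) can be found on the `PATH`. The helper below is only a minimal sketch; the function name `require_tools` is made up for this page and is not part of ANNOVAR or MutEnricher.

```shell
# Report any of the named programs that cannot be found on the PATH.
# Returns 0 if all are present, 1 otherwise.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `require_tools perl bcftools` or `require_tools perl bgzip tabix`, depending on which of the two routes below you plan to use.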

1b. Download ANNOVAR and databases

Download and install ANNOVAR from the ANNOVAR download and instructions page. Follow the instructions to obtain the gene annotation database of interest.

2. Prepare somatic VCF(s) for annotation

ANNOVAR can run on VCF files directly with the -vcfinput flag; however, we have encountered errors when attempting to run on VCFs without the genotype (i.e. GT) field set (which may not be set in somatic VCFs depending on the program). A workaround for this is to modify somatic VCFs to include only the first 8 columns, e.g.:

# with bcftools
bcftools view sample.vcf.gz | cut -f1-8 | bcftools view -Oz > sample.cut.vcf.gz
bcftools index -t sample.cut.vcf.gz

# with bgzip/tabix
zcat sample.vcf.gz | cut -f1-8 | bgzip > sample.cut.vcf.gz
tabix -p vcf sample.cut.vcf.gz
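
To confirm that the cut had the intended effect, one option is to count the columns on the `#CHROM` header line before and after. This is a sketch rather than a required step; `vcf_header_columns` is a hypothetical helper name, and `gzip -dc` is used on the assumption that it can read bgzip-compressed files (bgzip output is gzip-compatible).

```shell
# Print the number of tab-separated columns on the #CHROM header line
# of a gzip- or bgzip-compressed VCF.
vcf_header_columns() {
    gzip -dc "$1" | awk -F'\t' '/^#CHROM/ && !seen { print NF; seen = 1 }'
}
```

Running `vcf_header_columns sample.cut.vcf.gz` should print 8 for a cut file; the uncut file will report more (8 fixed columns, plus FORMAT, plus one column per sample).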

3. Run ANNOVAR

Run ANNOVAR's table_annovar.pl script on the original somatic VCF (or on the cut version from step 2, if needed) with the desired gene annotation database:

# hg19, refGene models
perl /path/to/annovar/table_annovar.pl \
    /path/to/sample.vcf.gz \
    /path/to/annovar/humandb \
    -buildver hg19 \
    -out /path/to/output/directory/sample.annovar \
    -vcfinput \
    -remove \
    -protocol refGene \
    -operation g \
    -nastring .

# hg38, refGene models
perl /path/to/annovar/table_annovar.pl \
    /path/to/sample.vcf.gz \
    /path/to/annovar/humandb \
    -buildver hg38 \
    -out /path/to/output/directory/sample.annovar \
    -vcfinput \
    -remove \
    -protocol refGene \
    -operation g \
    -nastring .

# Other annotations follow the same general format #

Each command above produces several output files; the one of interest is the annotated VCF file (e.g. sample.annovar.hg19_multianno.vcf or sample.annovar.hg38_multianno.vcf, depending on the genome build).
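
When many samples need annotating, the same command can be wrapped in a small function and called in a loop. The following is a sketch under assumptions: the variable names (`ANNOVAR_DIR`, `BUILD`, `OUT_DIR`, `DRY_RUN`) and the function name `annotate_vcf` are illustrative only, and the paths are the same placeholders used above. The ANNOVAR flags are exactly those shown in the commands above.

```shell
# Assumed locations; adjust to your installation.
ANNOVAR_DIR="${ANNOVAR_DIR:-/path/to/annovar}"
BUILD="${BUILD:-hg19}"                         # or hg38
OUT_DIR="${OUT_DIR:-/path/to/output/directory}"

# Annotate one VCF with refGene models. With DRY_RUN=1 the command is
# printed instead of executed.
annotate_vcf() {
    vcf="$1"
    # Derive the output prefix from the file name (strip .vcf.gz or .vcf).
    sample=$(basename "$vcf")
    sample="${sample%.vcf.gz}"
    sample="${sample%.vcf}"
    set -- perl "$ANNOVAR_DIR/table_annovar.pl" "$vcf" "$ANNOVAR_DIR/humandb" \
        -buildver "$BUILD" \
        -out "$OUT_DIR/$sample.annovar" \
        -vcfinput -remove \
        -protocol refGene -operation g -nastring .
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

A loop such as `for vcf in /path/to/vcfs/*.vcf.gz; do annotate_vcf "$vcf"; done` would then write one `<sample>.annovar.<build>_multianno.vcf` per input into `OUT_DIR`. Setting `DRY_RUN=1` first is a cheap way to inspect the commands before running them.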

4. Compress and index VCF file(s)

MutEnricher requires sorted, bgzipped, and tabix-indexed VCF files. Run one of the following commands on the ANNOVAR output VCF(s):

# with bcftools (hg19 annotated output example)
bcftools view sample.annovar.hg19_multianno.vcf -Oz > sample.annovar.hg19_multianno.vcf.gz
bcftools index -t sample.annovar.hg19_multianno.vcf.gz

# with bgzip/tabix directly (hg19 example)
bgzip -c sample.annovar.hg19_multianno.vcf > sample.annovar.hg19_multianno.vcf.gz
tabix -p vcf sample.annovar.hg19_multianno.vcf.gz
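
Before passing the compressed files to MutEnricher, it can be worth checking that they are in fact sorted and that the gene annotations are present. The two helpers below are a rough sketch with hypothetical names (`vcf_is_sorted`, `count_gene_annotated`). The second assumes that the refGene protocol writes a `Gene.refGene` key into the INFO column; inspect your own output if in doubt.

```shell
# Succeed (exit status 0) if the records of a compressed VCF are grouped by
# contig and non-decreasing in position within each contig.
vcf_is_sorted() {
    gzip -dc "$1" | awk -F'\t' '
        BEGIN { bad = 0 }
        /^#/ { next }
        $1 != chrom {
            if ($1 in seen) { bad = 1 }   # contig reappears later in the file
            seen[$1] = 1; chrom = $1; last = 0
        }
        $2 + 0 < last { bad = 1 }         # position goes backwards
        { last = $2 + 0 }
        END { exit bad }'
}

# Count records whose INFO column (column 8) carries a Gene.refGene entry.
count_gene_annotated() {
    gzip -dc "$1" | awk -F'\t' '
        !/^#/ && $8 ~ /(^|;)Gene\.refGene=/ { n++ }
        END { print n + 0 }'
}
```

For instance, `vcf_is_sorted sample.annovar.hg19_multianno.vcf.gz && echo sorted` reports on ordering, and a `count_gene_annotated` result of 0 suggests that the annotation step did not write what was expected.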

5. Remove temporary files (optional)

After the annotated bgzipped VCF files and their indexes are generated (i.e. .vcf.gz and .vcf.gz.tbi files), the additional ANNOVAR files and other temporary files can be removed.
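
One cautious way to approach the cleanup is to list the candidate files first and delete them only after review. The sketch below deletes nothing; `list_annovar_leftovers` is a hypothetical helper that prints every file sharing a given output prefix except the final `.vcf.gz` and `.vcf.gz.tbi` files.

```shell
# List files in a directory that share an ANNOVAR output prefix but are not
# the final .vcf.gz or its .tbi index. Nothing is deleted here.
list_annovar_leftovers() {
    dir="$1"
    prefix="$2"
    for f in "$dir"/"$prefix"*; do
        [ -e "$f" ] || continue          # glob matched nothing
        case "$f" in
            *.vcf.gz|*.vcf.gz.tbi) ;;    # keep these
            *) echo "$f" ;;
        esac
    done
}
```

For example, `list_annovar_leftovers /path/to/output/directory sample.annovar` prints the files that could be removed for that sample.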