User Guide - diskin-lab-chop/AutoGVP GitHub Wiki

AutoGVP input requirements

It is recommended to place all input files into the `data/` folder:

VEP-, ANNOVAR-, and ClinVar-annotated VCF file with multiallelic sites split (*VEP.vcf)
ANNOVAR multianno file (*hg38_multianno.txt)
InterVar file (*intervar.hg38_multianno.txt.intervar)
AutoPVS1 file (*autopvs1.txt)
Variant submissions file (ClinVar-selected-submissions.tsv generated by select-clinVar-submissions.R)
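
A quick pre-flight check can confirm that the expected inputs are present before running anything. The sketch below is illustrative only: the helper names (`find_inputs`, `missing_inputs`) are hypothetical, and the glob patterns are loosely adapted from the list above (the InterVar pattern is relaxed to `*hg38_multianno.txt.intervar`), so adjust them to your own file names.

```python
from pathlib import Path

# Glob patterns adapted from the input list above (assumption: files live in one folder).
REQUIRED_PATTERNS = {
    "vcf": "*VEP.vcf",
    "multianno": "*hg38_multianno.txt",
    "intervar": "*hg38_multianno.txt.intervar",
    "autopvs1": "*autopvs1.txt",
}


def find_inputs(data_dir):
    """Return {input_name: [matching paths]} for each required input type."""
    root = Path(data_dir)
    return {name: sorted(root.glob(pattern)) for name, pattern in REQUIRED_PATTERNS.items()}


def missing_inputs(data_dir):
    """Return the names of input types with no matching file in data_dir."""
    return [name for name, paths in find_inputs(data_dir).items() if not paths]
```

For example, `missing_inputs("data")` returns an empty list when every input type has at least one matching file.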

Custom input workflow - step by step

Setting up input files for the custom workflow

Annotate the germline VCF with VEP and any additional desired annotations, including control population allele frequencies, such as gnomAD. Note: It is recommended to run VEP 104 to ensure optimal tool compatibility since AutoPVS1 hg38 uses gene symbols from VEP 104.

If using VEP > 104, it is recommended to lift over the gene symbols in the PVS1.level file. More information here.

Example VEP command (the --xref_refseq argument is required):

vep --offline --cache --dir_cache $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --use_given_ref --species homo_sapiens --assembly GRCh38 --fork 1 --xref_refseq --hgvs --hgvsg --canonical --symbol --distance 0 --exclude_predicted --flag_pick --lookup_ref --force --input_file filename.vcf --output_file filename_VEP.vcf --format vcf --vcf --no_stats --numbers

Split multiallelic sites

Example command:

bcftools norm --threads 4 -m "-any" filename_VEP.vcf | \
vt normalize - -n -r <reference_fasta> | \
bgzip -@ 4 -c >  data/test_VEP.vcf  && \
tabix data/test_VEP.vcf
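
To illustrate what "splitting multiallelic sites" means, here is a deliberately naive Python sketch that turns one VCF data line with several ALT alleles into one line per allele. It is not a substitute for `bcftools norm`: it does not left-align or trim alleles and does not split per-allele INFO or FORMAT values. The function name `split_multiallelic` is hypothetical.

```python
def split_multiallelic(vcf_line):
    """Naively split one tab-separated VCF data line into one line per ALT allele.

    Only the ALT column (index 4) is rewritten; per-allele INFO/FORMAT fields
    are copied unchanged, which real tools such as bcftools norm handle properly.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    records = []
    for alt in fields[4].split(","):
        record = list(fields)
        record[4] = alt
        records.append("\t".join(record))
    return records


# A site with two ALT alleles becomes two biallelic records.
for rec in split_multiallelic("chr1\t100\t.\tA\tG,T\t50\tPASS\t."):
    print(rec)
```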

Run InterVar with the following command:

python InterVar.py -b hg38 -i data/test_VEP.vcf --input_type=VCF -o test_VEP

Run ANNOVAR with the following options to create an ANNOVAR-annotated file from VCF input:

perl table_annovar.pl data/test_VEP.vcf hg38 --buildver hg38 --out test_VEP --remove --protocol gnomad211_exome,gnomad211_genome --operation f,f --vcfinput

Run D3b-AutoPVS1 v2.0.0:

python autoPVS1_from_VEP_vcf.py --genome_version hg38 --vep_vcf test_VEP.vcf > test_autopvs1.txt

Optional: provide a ClinVar VCF file. If not supplied by the user, the most recent ClinVar file will be downloaded with download_db_files.sh and used in AutoGVP.

Run AutoGVP for the custom workflow

Run select-clinVar-submissions.R:

Rscript scripts/select-clinVar-submissions.R --variant_summary data/variant_summary.txt.gz --submission_summary data/submission_summary.txt.gz --outdir results --conceptID_list data/clinvar_cpg_concept_ids.txt --conflict_res "latest"

Run the AutoGVP wrapper script using --workflow="custom". Example command:

bash run_autogvp.sh --workflow="custom" \
--vcf=data/test_VEP.vcf \
--filter_criteria=<filter criteria> \
--clinvar=data/clinvar.vcf.gz \
--intervar=data/test_VEP.hg38_multianno.txt.intervar \
--multianno=data/test_VEP.vcf.hg38_multianno.txt \
--autopvs1=data/test_autopvs1.txt \
--outdir=results \
--out="test_custom" \
--selected_clinvar_submissions=results/ClinVar-selected-submissions.tsv \
--variant_summary=data/variant_summary.txt.gz \
--submission_summary=data/submission_summary.txt.gz \
--conceptIDs=data/clinvar_cpg_concept_ids.txt \
--conflict_res="latest"
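
When many samples need to be processed, it can help to assemble the wrapper call programmatically. The following is a minimal sketch, assuming the script is invoked from the repository root; `build_autogvp_command` is a hypothetical helper, and only flags shown in the example command above are used.

```python
import shlex


def build_autogvp_command(workflow, options):
    """Build the argument list for run_autogvp.sh from a dict of flag -> value."""
    cmd = ["bash", "run_autogvp.sh", f"--workflow={workflow}"]
    for flag, value in options.items():
        cmd.append(f"--{flag}={value}")
    return cmd


cmd = build_autogvp_command(
    "custom",
    {
        "vcf": "data/test_VEP.vcf",
        "intervar": "data/test_VEP.hg38_multianno.txt.intervar",
        "autopvs1": "data/test_autopvs1.txt",
        "outdir": "results",
        "out": "test_custom",
    },
)
# Print a shell-quoted version for inspection; pass `cmd` to subprocess.run(cmd, check=True) to execute.
print(shlex.join(cmd))
```

Passing the command as a list avoids shell quoting problems with values that contain spaces, such as a filter_criteria string.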

CAVATICA input workflow - step by step

Run the Kids First Germline Annotation Workflow first. This workflow currently annotates variants with ClinVar (2022-05-07).
Run the Pathogenicity Preprocessing Workflow, which performs ANNOVAR with InterVar and AutoPVS1 annotations.
Run select-clinVar-submissions.R:

Rscript scripts/select-clinVar-submissions.R --variant_summary data/variant_summary.txt.gz --submission_summary data/submission_summary.txt.gz --outdir results --conceptID_list data/clinvar_cpg_concept_ids.txt --conflict_res "latest"

Run the AutoGVP wrapper script using --workflow="cavatica". Example command:

bash run_autogvp.sh --workflow="cavatica" \
--vcf=data/test_pbta.single.vqsr.filtered.vep_105.vcf \
--filter_criteria=<filter criteria> \
--intervar=data/test_pbta.hg38_multianno.txt.intervar \
--multianno=data/test_pbta.hg38_multianno.txt \
--autopvs1=data/test_pbta.autopvs1.tsv \
--outdir=results \
--out="test_pbta" \
--selected_clinvar_submissions=results/ClinVar-selected-submissions.tsv \
--variant_summary=data/variant_summary.txt.gz \
--submission_summary=data/submission_summary.txt.gz \
--conceptIDs=data/clinvar_cpg_concept_ids.txt \
--conflict_res="latest"

AutoGVP script descriptions

Resolve conflicting ClinVar variants. The R script select-clinVar-submissions.R takes the ClinVar variant and submission summary files as input and identifies variants with conflicting interpretations, which are resolved as follows:

If a concept ID list is provided, submissions for conflicting variants are filtered to those associated with any concept ID in the list. If a single submission is retained for a variant, its call is taken as the final call.
For remaining submissions associated with concept IDs, conflicts are resolved first by determining whether a consensus/majority call exists among submissions. If not, a user-provided conflict resolution criterion is applied: either the call at the latest date evaluated (the default) or the most severe call (P > LP > VUS > LB > B) is taken.
For variants with no submissions associated with concept IDs (if a list is provided), or for all conflicting variants (if no list is provided), conflicts are resolved by determining whether a consensus/majority call exists among submissions.
If no consensus call exists, the call at the latest date evaluated is taken.
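
The resolution order described above can be sketched in Python as follows. This is an illustration, not the logic of select-clinVar-submissions.R itself: "majority" is assumed here to mean a call that is strictly more frequent than every other call, and the option value `"most_severe"` is a hypothetical name for the most-severe criterion.

```python
from collections import Counter
from datetime import date

# Calls ordered from most to least severe, as in P > LP > VUS > LB > B.
SEVERITY = ["P", "LP", "VUS", "LB", "B"]


def resolve_conflict(submissions, conflict_res="latest"):
    """Pick one call from a list of (call, date_evaluated) tuples for a single variant."""
    if not submissions:
        raise ValueError("at least one submission is required")
    counts = Counter(call for call, _ in submissions)
    ranked = counts.most_common()
    # Majority call: one call strictly more frequent than all others (assumption).
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    if conflict_res == "most_severe":
        return min(counts, key=SEVERITY.index)
    # Default: the call with the latest date evaluated.
    return max(submissions, key=lambda s: s[1])[0]


tied = [("P", date(2019, 5, 1)), ("B", date(2021, 3, 1))]
print(resolve_conflict(tied))                              # latest date evaluated
print(resolve_conflict(tied, conflict_res="most_severe"))  # most severe call
```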

Filter VCF file. By default, 01-filter_vcf.sh filters on the FILTER column, retaining variants with PASS or '.'. Other criteria can be specified with the filter_criteria argument as follows:

filter_criteria='FORMAT/DP>=10 (FORMAT/AD[0:1-])/(FORMAT/DP)>=0.2 (gnomad_3_1_1_AF_non_cancer<0.001|gnomad_3_1_1_AF_non_cancer=".")'

NOTE: As of GATK 4.2.3.0, the nomenclature for missing genotype calls has changed from . to 0 (see blog post here). Because missing genotypes will have FORMAT/DP set to 0, we strongly recommend including FORMAT/DP>0 as a minimum filtering criterion to remove these variants.
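
To make the example filter_criteria concrete, the sketch below applies the same three checks to already-parsed values for one sample. It assumes the space-separated expressions are combined with AND, that multiallelic sites were split beforehand (so AD holds one reference and one alternate depth), and that a missing gnomAD frequency (".") is passed as `None`. The function name and defaults are hypothetical.

```python
def passes_filter(dp, ad, gnomad_af=None, min_dp=10, min_vaf=0.2, max_af=0.001):
    """Return True if a variant passes depth, allele-fraction, and population-frequency checks.

    dp        -- FORMAT/DP (total read depth)
    ad        -- FORMAT/AD as [ref_depth, alt_depth]
    gnomad_af -- gnomAD allele frequency, or None when the annotation is "."
    """
    # DP of 0 marks a missing genotype in recent GATK output; always reject it.
    if dp <= 0 or dp < min_dp:
        return False
    # Variant allele fraction: alternate-allele depth over total depth.
    if ad[1] / dp < min_vaf:
        return False
    return gnomad_af is None or gnomad_af < max_af


print(passes_filter(dp=20, ad=[10, 10]))                  # well covered, not in gnomAD
print(passes_filter(dp=20, ad=[19, 1]))                   # low allele fraction
print(passes_filter(dp=20, ad=[10, 10], gnomad_af=0.01))  # common in gnomAD
```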

Run pathogenicity assessment. The R scripts 02-annotate_variants_CAVATICA_input.R and 02-annotate_variants_custom_input.R perform the following steps:
        Read in the ClinVar-annotated VCF file
        Assign ClinVar stars based on CLNREVSTAT (see AutoGVP ClinVar star annotation)
        For ClinVar variants, report CLINSIG as the final call; resolve ambiguous variants (criteria_provided,_conflicting_interpretations) by checking against the ClinVar variant submission file
        Identify variants that need further InterVar annotation and possible re-adjustment (variants with 0 stars or not in the ClinVar database)
        Load and merge the ANNOVAR multianno, InterVar, and AutoPVS1 files
        Create columns for evidencePVS1, evidencePS, evidencePM, evidencePP, evidenceBP, evidenceBS, and evidenceBA1 (variables that may need re-adjusting) by parsing the "InterVar: InterVar and Evidence" column
        Adjust evidence columns based on the AutoPVS1 criterion column
        Report the InterVar final call (if unadjusted) or the final call based on re-calculated evidence variables (if adjusted)
        Save output

Parse VCF file. 03-parse_vcf.sh converts the VCF file to a TSV file with INFO fields as tab-separated columns.

Resolve gene annotations and produce final output files (04-filter_gene_annotations.R):
        Read in the parsed VCF file and select a single VEP annotation based on the PICK column (PICK == 1). See the R script for the criteria used to select pick transcripts.
        Merge gene annotation with AutoGVP results
        Select columns to retain in the final output, and save full and abridged output files
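
The evidence-parsing step can be illustrated with a short Python sketch. It assumes the InterVar cell has the usual layout, a call followed by `PVS1=0 PS=[...] PM=[...] PP=[...] BA1=0 BS=[...] BP=[...]`, and that each evidence count is the sum of the 0/1 flags in its bracketed vector. Both the function name `parse_intervar` and the summing rule are assumptions for illustration; the AutoGVP R scripts may differ in detail.

```python
import re

# Matches scalar items (PVS1=0, BA1=0) and bracketed vectors (PS=[0, 1, ...]).
EVIDENCE_RE = re.compile(r"(PVS1|BA1|PS|PM|PP|BS|BP)=(\[[^\]]*\]|\d+)")


def parse_intervar(cell):
    """Return (call, evidence_counts) parsed from an 'InterVar: InterVar and Evidence' cell."""
    call_match = re.search(r"InterVar:\s*(.+?)\s+PVS1=", cell)
    call = call_match.group(1) if call_match else None
    evidence = {}
    for name, value in EVIDENCE_RE.findall(cell):
        if value.startswith("["):
            # Count how many criteria in the vector are flagged.
            evidence[name] = sum(int(x) for x in re.findall(r"\d+", value))
        else:
            evidence[name] = int(value)
    return call, evidence


cell = (
    "InterVar: Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] "
    "PM=[1, 0, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 "
    "BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]"
)
call, evidence = parse_intervar(cell)
print(call, evidence)
```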

AutoGVP ClinVar star annotation

Stars are assigned based on the ClinVar review status (CLNREVSTAT), following the ClinVar stars reference:

1 = 'criteria_provided,_single_submitter','criteria_provided,_conflicting_interpretations'
2 = 'criteria_provided,_multiple_submitters'
3 = 'reviewed_by_expert_panel'
4 = 'practice_guideline'
0 = 'no_assertion_provided','no_assertion_criteria_provided','no_assertion_for_the_individual_variant'
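
The star assignment above amounts to a lookup table. A minimal Python sketch follows, using exactly the review status strings listed above; the function names are hypothetical, and returning `None` for an unlisted or absent status is an assumption made here for illustration.

```python
# Review status -> star count, copied from the table above.
CLNREVSTAT_STARS = {
    "no_assertion_provided": 0,
    "no_assertion_criteria_provided": 0,
    "no_assertion_for_the_individual_variant": 0,
    "criteria_provided,_single_submitter": 1,
    "criteria_provided,_conflicting_interpretations": 1,
    "criteria_provided,_multiple_submitters": 2,
    "reviewed_by_expert_panel": 3,
    "practice_guideline": 4,
}


def clinvar_stars(clnrevstat):
    """Return the star count for a CLNREVSTAT value, or None if it is not listed."""
    return CLNREVSTAT_STARS.get(clnrevstat)


def needs_intervar(clnrevstat):
    """Variants with 0 stars or without a ClinVar record go on to InterVar assessment."""
    stars = clinvar_stars(clnrevstat)
    return stars is None or stars == 0


print(clinvar_stars("reviewed_by_expert_panel"))
print(needs_intervar("no_assertion_criteria_provided"))
```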

AutoGVP InterVar adjustments

Based on Abou Tayoun et al. 2018

if criterion is NF1|SS1|DEL1|DEL2|DUP1|IC1 then PVS1 = 1
if criterion is NF3|NF5|SS3|SS5|SS8|SS10|DEL4|DEL8|DEL6|DEL10|DUP3|IC2 then PVS1 = 0; PS = PS+1
if criterion is NF6|SS6|SS9|DEL7|DEL11|IC3 then PVS1 = 0; PM = PM+1
if criterion is IC4 then PVS1 = 0; PP = PP+1
if criterion is na|NF0|NF2|NF4|SS2|SS4|SS7|DEL3|DEL5|DEL9|DUP2|DUP4|DUP5|IC5 then PVS1 = 0
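
The adjustment rules above can be written as a small Python function. This is a sketch for illustration rather than the AutoGVP R code: the name `adjust_evidence` is hypothetical, evidence counts are held in a plain dict, and a criterion not listed in any rule leaves the evidence unchanged (an assumption).

```python
# AutoPVS1 criterion groups, copied from the rules above.
KEEP_PVS1 = {"NF1", "SS1", "DEL1", "DEL2", "DUP1", "IC1"}
TO_STRONG = {"NF3", "NF5", "SS3", "SS5", "SS8", "SS10", "DEL4", "DEL8", "DEL6", "DEL10", "DUP3", "IC2"}
TO_MODERATE = {"NF6", "SS6", "SS9", "DEL7", "DEL11", "IC3"}
TO_SUPPORTING = {"IC4"}
NO_EVIDENCE = {"na", "NF0", "NF2", "NF4", "SS2", "SS4", "SS7", "DEL3", "DEL5", "DEL9", "DUP2", "DUP4", "DUP5", "IC5"}


def adjust_evidence(criterion, evidence):
    """Return a copy of the evidence counts adjusted for the AutoPVS1 criterion."""
    adjusted = dict(evidence)
    if criterion in KEEP_PVS1:
        adjusted["PVS1"] = 1
    elif criterion in TO_STRONG:
        adjusted["PVS1"] = 0
        adjusted["PS"] = adjusted.get("PS", 0) + 1
    elif criterion in TO_MODERATE:
        adjusted["PVS1"] = 0
        adjusted["PM"] = adjusted.get("PM", 0) + 1
    elif criterion in TO_SUPPORTING:
        adjusted["PVS1"] = 0
        adjusted["PP"] = adjusted.get("PP", 0) + 1
    elif criterion in NO_EVIDENCE:
        adjusted["PVS1"] = 0
    return adjusted


print(adjust_evidence("NF3", {"PVS1": 1, "PS": 0, "PM": 1, "PP": 0}))
```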

New ClinSig
Pathogenic - Criteria 1
  (i) 1 Very strong (PVS1) AND
        (a) ≥1 Strong (PS1–PS4) OR
        (b) ≥2 Moderate (PM1–PM6) OR
        (c) 1 Moderate (PM1–PM6) and 1 supporting (PP1–PP5) OR
        (d) ≥2 Supporting (PP1–PP5)

Pathogenic - Criteria 2
  (ii) ≥2 Strong (PS1–PS4) OR

Pathogenic - Criteria 3
  (iii) 1 Strong (PS1–PS4) AND
        (a) ≥3 Moderate (PM1–PM6) OR
        (b) 2 Moderate (PM1–PM6) AND ≥2 Supporting (PP1–PP5) OR
        (c) 1 Moderate (PM1–PM6) AND ≥4 supporting (PP1–PP5)

Likely pathogenic
        (i) 1 Very strong (PVS1) AND 1 moderate (PM1–PM6) OR
        (ii) 1 Strong (PS1–PS4) AND 1–2 moderate (PM1–PM6) OR
        (iii) 1 Strong (PS1–PS4) AND ≥2 supporting (PP1–PP5) OR
        (iv)  ≥3 Moderate (PM1–PM6) OR
        (v) 2 Moderate (PM1–PM6) AND ≥2 supporting (PP1–PP5) OR
        (vi) 1 Moderate (PM1–PM6) AND ≥4 supporting (PP1–PP5)

Benign
        (i) 1 Stand-alone (BA1) OR
        (ii) ≥2 Strong (BS1–BS4)

Likely Benign
        (i) 1 Strong (BS1–BS4) and 1 supporting (BP1–BP7) OR
        (ii) ≥2 Supporting (BP1–BP7)

Uncertain significance
        (i) None of the criteria above were met OR
        (ii) The criteria for benign and pathogenic are contradictory.
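
The combining rules above translate directly into boolean expressions over the evidence counts. The Python sketch below is one possible reading, not the AutoGVP implementation: the function name `classify` is hypothetical, and a variant meeting both a pathogenic-side and a benign-side rule is treated as contradictory and returned as uncertain significance.

```python
def classify(pvs1, ps, pm, pp, ba1, bs, bp):
    """Combine evidence counts into a five-tier call using the rules listed above."""
    pathogenic = (
        (pvs1 == 1 and (ps >= 1 or pm >= 2 or (pm == 1 and pp == 1) or pp >= 2))
        or ps >= 2
        or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2) or (pm == 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs1 == 1 and pm == 1)
        or (ps == 1 and 1 <= pm <= 2)
        or (ps == 1 and pp >= 2)
        or pm >= 3
        or (pm == 2 and pp >= 2)
        or (pm == 1 and pp >= 4)
    )
    benign = ba1 >= 1 or bs >= 2
    likely_benign = (bs == 1 and bp == 1) or bp >= 2

    # Contradictory evidence: both sides meet a rule (assumption on how conflicts are handled).
    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "Uncertain significance"
    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely pathogenic"
    if benign:
        return "Benign"
    if likely_benign:
        return "Likely benign"
    return "Uncertain significance"


print(classify(pvs1=1, ps=1, pm=0, pp=0, ba1=0, bs=0, bp=0))
print(classify(pvs1=0, ps=0, pm=0, pp=0, ba1=0, bs=0, bp=2))
```

Checking the pathogenic rules before the likely pathogenic ones matters, because several evidence combinations satisfy both and the stronger call should win.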