User Guide - diskin-lab-chop/AutoGVP GitHub Wiki
AutoGVP input requirements
It is recommended to place all input files in the data/ folder:
- VEP-, ANNOVAR-, and ClinVar-annotated VCF file with multiallelic sites split (*VEP.vcf)
- ANNOVAR multianno file (*hg38_multianno.txt)
- InterVar file (*intervar.hg38_multianno.txt.intervar)
- AutoPVS1 file (*autopvs1.txt)
- Variant submissions file (ClinVar-selected-submissions.tsv, generated by select-clinVar-submissions.R)
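Before launching a run, it can help to confirm that each expected input is present. The following is a minimal sketch, not part of AutoGVP: the glob patterns are copied from the list above, while the dictionary labels and the function name find_missing_inputs are hypothetical. Adjust the patterns if your files are named differently.

```python
from pathlib import Path

# Glob patterns copied from the input list; the labels are illustrative only.
EXPECTED_INPUTS = {
    "annotated VCF": "*VEP.vcf",
    "ANNOVAR multianno": "*hg38_multianno.txt",
    "InterVar": "*intervar.hg38_multianno.txt.intervar",
    "AutoPVS1": "*autopvs1.txt",
    "variant submissions": "ClinVar-selected-submissions.tsv",
}


def find_missing_inputs(data_dir):
    """Return the labels of expected inputs with no matching file in data_dir."""
    folder = Path(data_dir)
    return [
        label
        for label, pattern in EXPECTED_INPUTS.items()
        if not any(folder.glob(pattern))
    ]
```

Calling find_missing_inputs("data") returns an empty list when every pattern matches at least one file.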
Custom input workflow - step by step
Setting up input files for the custom workflow
- Annotate the germline VCF with VEP and any additional desired annotations, including control population allele frequencies, such as gnomAD. Note: It is recommended to run VEP 104 to ensure optimal tool compatibility since AutoPVS1 hg38 uses gene symbols from VEP 104.
If using VEP > 104, it is recommended to lift over the gene symbols in the PVS1.level file. More information here.
Example VEP command - must have --xref_refseq argument:
vep --offline --cache --dir_cache $VEP_CACHEDIR --fasta $VEP_CACHEDIR/GRCh38.fa --use_given_ref --species homo_sapiens --assembly GRCh38 --fork 1 --xref_refseq --hgvs --hgvsg --canonical --symbol --distance 0 --exclude_predicted --flag_pick --lookup_ref --force --input_file filename.vcf --output_file filename_VEP.vcf --format vcf --vcf --no_stats --numbers
- Split multiallelic sites
Example command:
bcftools norm --threads 4 -m "-any" filename_VEP.vcf | \
vt normalize - -n -r <reference_fasta> | \
bgzip -@ 4 -c > data/test_VEP.vcf && \
tabix data/test_VEP.vcf
- Run InterVar with the following command:
python InterVar.py -b hg38 -i data/test_VEP.vcf --input_type=VCF -o test_VEP
- Run ANNOVAR with the following options in order to create ANNOVAR annotated file using VCF input:
perl table_annovar.pl data/test_VEP.vcf hg38 --buildver hg38 --out test_VEP --remove --protocol gnomad211_exome,gnomad211_genome --operation f,f --vcfinput
- Run AutoPVS1 with the following command:
python autoPVS1_from_VEP_vcf.py --genome_version hg38 --vep_vcf test_VEP.vcf > test_autopvs1.txt
- Optional: provide a ClinVar VCF file. If not supplied by the user, the most recent ClinVar file will be downloaded with download_db_files.sh and used in AutoGVP.
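Because AutoGVP expects multiallelic sites to be split, it may be worth checking the output of the split step before continuing. The sketch below is an assumption-laden helper, not an AutoGVP script: open_vcf and multiallelic_records are hypothetical names, and the check simply looks for a comma in the ALT column. The file is opened by inspecting its first two bytes, since the split command pipes through bgzip.

```python
import gzip


def open_vcf(path):
    """Open a VCF as text, whether plain or gzip/bgzip compressed."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number (bgzip output is gzip-compatible)
        return gzip.open(path, "rt")
    return open(path, "rt")


def multiallelic_records(lines):
    """Yield (CHROM, POS, ALT) for records whose ALT column lists several alleles."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alt = fields[0], fields[1], fields[4]
        if "," in alt:
            yield chrom, pos, alt
```

For example, passing open_vcf("data/test_VEP.vcf") to multiallelic_records should yield nothing if the split succeeded.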
Run AutoGVP for the custom workflow
- Run select-clinVar-submissions.R:
Rscript scripts/select-clinVar-submissions.R --variant_summary data/variant_summary.txt.gz --submission_summary data/submission_summary.txt.gz --outdir results --conceptID_list data/clinvar_cpg_concept_ids.txt --conflict_res "latest"
- Run the AutoGVP wrapper script using --workflow="custom". Example command:
bash run_autogvp.sh --workflow="custom" \
--vcf=data/test_VEP.vcf \
--filter_criteria=<filter criteria> \
--clinvar=data/clinvar.vcf.gz \
--intervar=data/test_VEP.hg38_multianno.txt.intervar \
--multianno=data/test_VEP.vcf.hg38_multianno.txt \
--autopvs1=data/test_autopvs1.txt \
--outdir=results \
--out="test_custom" \
--selected_clinvar_submissions=results/ClinVar-selected-submissions.tsv \
--variant_summary=data/variant_summary.txt.gz \
--submission_summary=data/submission_summary.txt.gz \
--conceptIDs=data/clinvar_cpg_concept_ids.txt \
--conflict_res="latest"
CAVATICA input workflow - step by step
- Run the Kids First Germline Annotation Workflow first. This workflow currently annotates variants with ClinVar (2022-05-07).
- Run the Pathogenicity Preprocessing Workflow, which performs ANNOVAR with InterVar and AutoPVS1 annotations.
- Run select-clinVar-submissions.R:
Rscript scripts/select-clinVar-submissions.R --variant_summary data/variant_summary.txt.gz --submission_summary data/submission_summary.txt.gz --outdir results --conceptID_list data/clinvar_cpg_concept_ids.txt --conflict_res "latest"
- Run the AutoGVP wrapper script using --workflow="cavatica". Example command:
bash run_autogvp.sh --workflow="cavatica" \
--vcf=data/test_pbta.single.vqsr.filtered.vep_105.vcf \
--filter_criteria=<filter criteria> \
--intervar=data/test_pbta.hg38_multianno.txt.intervar \
--multianno=data/test_pbta.hg38_multianno.txt \
--autopvs1=data/test_pbta.autopvs1.tsv \
--outdir=results \
--out="test_pbta" \
--selected_clinvar_submissions=results/ClinVar-selected-submissions.tsv \
--variant_summary=data/variant_summary.txt.gz \
--submission_summary=data/submission_summary.txt.gz \
--conceptIDs=data/clinvar_cpg_concept_ids.txt \
--conflict_res="latest"
AutoGVP script descriptions
- Resolve conflicting ClinVar variants. The R script select-clinVar-submissions.R takes as input the ClinVar variant and submission summary files and identifies variants with conflicting interpretations, which are resolved as follows:
  - If a concept ID list is provided, submissions for conflicting variants are filtered to those associated with any concept ID in the list. If a single submission is retained for a variant, its call is taken as the final call.
  - For remaining submissions associated with concept IDs, conflicts are resolved first by determining whether a consensus/majority call exists among submissions. If not, a user-provided conflict resolution criterion is applied: either the call at the latest date evaluated (the default) or the most severe call (P > LP > VUS > LB > B) is taken.
  - For variants with no submissions associated with concept IDs (if a list is provided), or for all conflicting variants (if no list is provided), conflicts are resolved by determining whether a consensus/majority call exists among submissions.
  - If no consensus call exists, the call at the latest date evaluated is taken.
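The resolution order described above can be sketched in a few lines. This is an illustration only, not the logic of select-clinVar-submissions.R: the Submission type, the function names, and the "most_severe" option value are hypothetical, and "consensus/majority" is assumed here to mean a single call that is strictly more frequent than any other.

```python
from collections import Counter
from datetime import date
from typing import NamedTuple

SEVERITY = ["B", "LB", "VUS", "LP", "P"]  # least to most severe


class Submission(NamedTuple):
    call: str
    date_evaluated: date
    concept_ids: frozenset = frozenset()


def _resolve(submissions, conflict_res):
    counts = Counter(s.call for s in submissions).most_common()
    # Assumed meaning of consensus/majority: one call strictly ahead of the rest.
    if len(counts) == 1 or counts[0][1] > counts[1][1]:
        return counts[0][0]
    if conflict_res == "latest":
        return max(submissions, key=lambda s: s.date_evaluated).call
    if conflict_res == "most_severe":  # hypothetical option name
        return max((s.call for s in submissions), key=SEVERITY.index)
    raise ValueError(f"unknown conflict_res: {conflict_res!r}")


def resolve_variant(submissions, concept_ids=None, conflict_res="latest"):
    """Return one call for a variant with conflicting submissions."""
    if concept_ids:
        wanted = set(concept_ids)
        matched = [s for s in submissions if s.concept_ids & wanted]
        if matched:
            return _resolve(matched, conflict_res)
    # No concept ID list, or no submission matched it: majority, else latest.
    return _resolve(submissions, "latest")
```

With two LP/VUS submissions tied under a matching concept ID, "latest" returns the more recently evaluated call, whereas "most_severe" returns LP.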
- Filter VCF file. By default, 01-filter_vcf.sh filters based on the FILTER column (PASS or .). Other criteria can be specified with the filter_criteria argument as follows:
filter_criteria='FORMAT/DP>=10 (FORMAT/AD[0:1-])/(FORMAT/DP)>=0.2 (gnomad_3_1_1_AF_non_cancer<0.001|gnomad_3_1_1_AF_non_cancer=".")'
NOTE: As of GATK 4.2.3.0, the nomenclature for missing genotype calls has changed from . to 0 (see blog post here). Because missing genotypes will have FORMAT/DP set to 0, we strongly recommend including FORMAT/DP>0 as a minimum filtering criterion to remove these variants.
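The default FILTER check and the recommended FORMAT/DP>0 floor can be illustrated on a single VCF record. This is a plain-Python sketch of the criteria only, not of how 01-filter_vcf.sh applies them; the function names are hypothetical, and the first sample column is used unless sample_index says otherwise.

```python
def passes_default_filter(record_line):
    """True if the FILTER column (7th field) is PASS or '.'."""
    fields = record_line.rstrip("\n").split("\t")
    return fields[6] in ("PASS", ".")


def sample_depth(record_line, sample_index=0):
    """Return FORMAT/DP for one sample as an int, or None if it is absent."""
    fields = record_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")
    values = fields[9 + sample_index].split(":")
    if "DP" not in keys:
        return None
    position = keys.index("DP")
    # Trailing FORMAT values may be omitted, and '.' marks a missing value.
    if position >= len(values) or values[position] == ".":
        return None
    return int(values[position])


def keep_record(record_line):
    """Default FILTER check plus the recommended DP > 0 minimum."""
    depth = sample_depth(record_line)
    return passes_default_filter(record_line) and depth is not None and depth > 0
```

A record with genotype 0/0 and DP set to 0, as produced for missing calls by recent GATK versions, is rejected by keep_record even when its FILTER value is PASS.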
- Run Pathogenicity Assessment. The R scripts 02-annotate_variants_CAVATICA_input.R and 02-annotate_variants_custom_input.R perform the following steps:
  - Read in the ClinVar-annotated VCF file
  - Assign ClinVar stars based on CLNREVSTAT (see AutoGVP ClinVar star annotation)
  - For ClinVar variants, report CLINSIG as the final call; resolve ambiguous variants (criteria_provided,_conflicting_interpretations) by checking against the ClinVar variant submission file
  - Identify variants that need further InterVar annotation and possible re-adjustment (variants with 0 stars or not in the ClinVar database)
  - Load and merge the ANNOVAR multianno, InterVar, and AutoPVS1 files
  - Create columns for evidencePVS1, evidencePS, evidencePM, evidencePP, evidenceBP, evidenceBS, and evidenceBA1 (variables that may need re-adjusting) by parsing the InterVar: InterVar and Evidence column
  - Adjust evidence columns based on the AutoPVS1 criterion column
  - Report the InterVar final call (if unadjusted) or the final call based on re-calculated evidence variables (if adjusted)
  - Save output
- Parse VCF file. 03-parse_vcf.sh converts the VCF file to a TSV file with INFO fields as tab-separated columns.
- Resolve gene annotations and produce final output files (04-filter_gene_annoations.R):
  - Read in the parsed VCF file and select a single VEP annotation based on the PICK column (PICK == 1). See the R script for the criteria used to select pick transcripts.
  - Merge gene annotation with AutoGVP results
  - Select columns to retain in the final output, and save full and abridged output files
AutoGVP ClinVar star annotation
Based on review status, stars ref
1 = 'criteria_provided,_single_submitter','criteria_provided,_conflicting_interpretations'
2 = 'criteria_provided,_multiple_submitters'
3 = 'reviewed_by_expert_panel'
4 = 'practice_guideline'
0 = 'no_assertion_provided','no_assertion_criteria_provided','no_assertion_for_the_individual_variant'
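The mapping above amounts to a simple lookup table. The sketch below restates it in Python for illustration; the names CLNREVSTAT_STARS and clinvar_stars are hypothetical, and returning None for a review status not listed above is an assumption rather than documented AutoGVP behaviour.

```python
# Review status to star count, as listed above.
CLNREVSTAT_STARS = {
    "no_assertion_provided": 0,
    "no_assertion_criteria_provided": 0,
    "no_assertion_for_the_individual_variant": 0,
    "criteria_provided,_single_submitter": 1,
    "criteria_provided,_conflicting_interpretations": 1,
    "criteria_provided,_multiple_submitters": 2,
    "reviewed_by_expert_panel": 3,
    "practice_guideline": 4,
}


def clinvar_stars(clnrevstat):
    """Return the star count for a CLNREVSTAT value, or None if it is not listed."""
    return CLNREVSTAT_STARS.get(clnrevstat)
```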
AutoGVP InterVar adjustments
Based on Abou Tayoun et al. 2018
if criterion is NF1|SS1|DEL1|DEL2|DUP1|IC1 then PVS1 = 1
if criterion is NF3|NF5|SS3|SS5|SS8|SS10|DEL4|DEL8|DEL6|DEL10|DUP3|IC2 then PVS1 = 0; PS = PS+1
if criterion is NF6|SS6|SS9|DEL7|DEL11|IC3 then PVS1 = 0; PM = PM+1
if criterion is IC4 then PVS1 = 0; PP = PP+1
if criterion is na|NF0|NF2|NF4|SS2|SS4|SS7|DEL3|DEL5|DEL9|DUP2|DUP4|DUP5|IC5 then PVS1 = 0
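The adjustment rules above can be restated as a small function. This is a sketch for illustration, not the code in the AutoGVP R scripts: the set names and adjust_evidence are hypothetical, and raising an error for a criterion not listed above is an assumption.

```python
# AutoPVS1 criterion groups, copied from the rules above.
KEEP_PVS1 = {"NF1", "SS1", "DEL1", "DEL2", "DUP1", "IC1"}
TO_STRONG = {"NF3", "NF5", "SS3", "SS5", "SS8", "SS10",
             "DEL4", "DEL8", "DEL6", "DEL10", "DUP3", "IC2"}
TO_MODERATE = {"NF6", "SS6", "SS9", "DEL7", "DEL11", "IC3"}
TO_SUPPORTING = {"IC4"}
NO_EVIDENCE = {"na", "NF0", "NF2", "NF4", "SS2", "SS4", "SS7",
               "DEL3", "DEL5", "DEL9", "DUP2", "DUP4", "DUP5", "IC5"}


def adjust_evidence(criterion, ps, pm, pp):
    """Return (PVS1, PS, PM, PP) after applying the AutoPVS1 criterion."""
    if criterion in KEEP_PVS1:
        return 1, ps, pm, pp
    if criterion in TO_STRONG:
        return 0, ps + 1, pm, pp
    if criterion in TO_MODERATE:
        return 0, ps, pm + 1, pp
    if criterion in TO_SUPPORTING:
        return 0, ps, pm, pp + 1
    if criterion in NO_EVIDENCE:
        return 0, ps, pm, pp
    # Assumption: an unlisted criterion is treated as an error in this sketch.
    raise ValueError(f"unrecognised AutoPVS1 criterion: {criterion!r}")
```

For example, adjust_evidence("SS3", ps=0, pm=1, pp=0) downgrades PVS1 to strong evidence and returns (0, 1, 1, 0).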
New ClinSig
Pathogenic - Criteria 1
(i) 1 Very strong (PVS1) AND
(a) ≥1 Strong (PS1–PS4) OR
(b) ≥2 Moderate (PM1–PM6) OR
(c) 1 Moderate (PM1–PM6) and 1 supporting (PP1–PP5) OR
(d) ≥2 Supporting (PP1–PP5)
Pathogenic - Criteria 2
(ii) ≥2 Strong (PS1–PS4) OR
Pathogenic - Criteria 3
(iii) 1 Strong (PS1–PS4) AND
(a)≥3 Moderate (PM1–PM6) OR
(b)2 Moderate (PM1–PM6) AND ≥2 Supporting (PP1–PP5) OR
(c)1 Moderate (PM1–PM6) AND ≥4 supporting (PP1–PP5)
Likely pathogenic
(i) 1 Very strong (PVS1) AND 1 moderate (PM1– PM6) OR
(ii) 1 Strong (PS1–PS4) AND 1–2 moderate (PM1–PM6) OR
(iii) 1 Strong (PS1–PS4) AND ≥2 supporting (PP1–PP5) OR
(iv) ≥3 Moderate (PM1–PM6) OR
(v) 2 Moderate (PM1–PM6) AND ≥2 supporting (PP1–PP5) OR
(vi) 1 Moderate (PM1–PM6) AND ≥4 supporting (PP1–PP5)
Benign
(i) 1 Stand-alone (BA1) OR
(ii) ≥2 Strong (BS1–BS4)
Likely Benign
(i) 1 Strong (BS1–BS4) and 1 supporting (BP1– BP7) OR
(ii) ≥2 Supporting (BP1–BP7)
Uncertain significance
(i) None of the criteria above are met, OR
(ii) The criteria for benign and pathogenic are contradictory.
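The combining rules above can be written out as boolean expressions over the evidence counts. The following sketch reads each rule literally; the function name classify and its argument names are hypothetical. Two points are assumptions rather than statements from this guide: Pathogenic is checked before Likely pathogenic (and Benign before Likely benign), and "contradictory" is taken to mean that a pathogenic-side rule and a benign-side rule are met at the same time.

```python
def classify(pvs1, ps, pm, pp, ba1, bs, bp):
    """Combine evidence counts into a five-tier call using the rules above."""
    pathogenic = (
        (pvs1 == 1 and (ps >= 1 or pm >= 2 or (pm == 1 and pp == 1) or pp >= 2))
        or ps >= 2
        or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2) or (pm == 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs1 == 1 and pm == 1)
        or (ps == 1 and 1 <= pm <= 2)
        or (ps == 1 and pp >= 2)
        or pm >= 3
        or (pm == 2 and pp >= 2)
        or (pm == 1 and pp >= 4)
    )
    benign = ba1 >= 1 or bs >= 2
    likely_benign = (bs == 1 and bp == 1) or bp >= 2

    # Assumed reading of "contradictory": both sides meet at least one rule.
    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "Uncertain significance"
    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely pathogenic"
    if benign:
        return "Benign"
    if likely_benign:
        return "Likely benign"
    return "Uncertain significance"
```

For instance, a variant with PVS1 = 1 and one moderate criterion, and no benign evidence, is classified as Likely pathogenic under these rules.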