DeepVariant Analysis - tuiaswath/karyosoft GitHub Wiki

Introduction

Deep Variant can learn to call variants in a variety of sequencing technologies and experimental designs, from deep whole genomes from 10X Genomics to Ion Ampliseq exomes.
The pre-built model that ships with DeepVariant generalizes across genome builds and other mammalian species, thus allowing for non-human sequencing projects to benefit from the wealth of human ground truth data.
The widely-used GATK uses logistic regression to model base errors, hidden Markov models to compute read likelihoods, and naive Bayes classification to identify variants.
These techniques allow the GATK to achieve high but still imperfect accuracy on the Illumina sequencing platform 3,4.
Deep Variant is robust to changes in sequencing depth, preparation protocol, instrument type, genome build, and even mammalian species.
We can either call the existing deep variant base model or train a model on top of the base model and call it.

DeepVariant Variant Calling Pipeline

Stages of DeepVariant Variant Calling

1. make_examples

make_examples consumes sequence reads and the reference genome to create TensorFlow examples for evaluation with our deep learning models.
It supports sharding of the input files thus allowing multiple shards to be processed simultaneous.
It also supports parsing specific regions identified by the chromosome ids or the indices of the bases (--regions chr20:10,000,000-11,000,000)
When running in training mode the truth_vcf and confident_regions arguments should point to VCF and BED files containing the true variants and regions where we are confident in our calls

2. call_variants

call_variants consume TFRecord file(s) of tf.Examples protos created by make_examples and a deep learning model checkpoint and evaluates the model on each example in the input TFRecord file.
It can make use of multiple cores, but scaling is sub-linear
The output here is a TFRecord of CallVariantsOutput protos. It is not the full output as in the VCF file yet. The numbers are just field numbers in the data structure.

3. postprocess_variants

postprocess_variants reads all of the output TFRecord files from call_variants, sorts them, combines multi-allelic records, and writes out a VCF file
When gVCF output is requested, it also outputs a gVCF file which merges the VCF with the non-variant sites

DeepVariant Variant Calling Flowchart

Example command for Variant Calling using DeepVariant

BIN_VERSION="1.1.0"
docker run \
  -v "/mnt/disks/snpvariant/deepvariant/input":"/input" \
  -v "/mnt/disks/snpvariant/deepvariant/output":"/output" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \ **Replace this string with exactly one of the following [WGS,WES,PACBIO,HYBRID_PACBIO_ILLUMINA]**
  --ref=/input/GRCh38.fasta \
  --reads=/input/SRR098401.bam \
  --output_vcf=/output/output.vcf.gz  \
  --output_gvcf=/output/output.g.vcf.gz  \
  --call_variants_extra_args="use_openvino=true" \ **Optional. Setting this will use OpenVINO on Intel CPUs, which empirically reduces call_variants runtime by 15%-25%.
  --num_shards=$(nproc) \ **This will use all your cores to run make_examples. Feel free to change.**
  --logging_dir=/output/logs **Optional. This saves the log output for each stage separately.

DeepVariant Pre-built Data Models

All the number of examples is from the current version of r1.1 of DeepVariant as of 07/12/2021. All referenced to Coriell university datasets.

Whole Genome Sequence (WGS) Model

It is trained with 388,337,190 variant calls from the following genomes samples:

12 Samples of HG001 - NA12878
3 Samples of HG002 - NA24385
3 Samples of HG004 - NA24149
3 Samples of HG005 - NA24631
3 Samples of HG006 - NA24694
3 Samples of HG007 - NA24695

Whole Exome Sequence (WES) Model

It is trained with 13,450,688 variant calls from the following exome samples:

41 Samples of HG001 - NA12878
9 Samples of HG002 - NA24385
6 Samples of HG004 - NA24149
12 Samples of HG005 - NA24631
9 Samples of HG006 - NA24694
9 Samples of HG007 - NA24695

PacBio HiFi

It is trained with 569,225,616 variant calls from the following genome samples:

1 Samples of HG001 - NA12878
9 Samples of HG002 - NA24385
2 Samples of HG004 - NA24149
1 Samples of HG005 - NA24631

Training Data

The Genome in a Bottle consortium from the National Institute of Standards and Technology (NIST) creates gold-standard samples with known variants covering the regions of the genome. These are used as labels to train DeepVariant. Using long-read technologies the Genome in a Bottle expanded the set of confident variants, increasing the regions described by the standard set from 85% of the genome to 92% of it.

Data Directory of Genome in a Bottle

Human Genome Data

The VCF files that is used to train the DeepVariant model is located inside the folders mentioned below

INFO Fields in the Base Truth VCF files of DeepVariant

datasets: Number of different datasets for which at least one callset called this genotype, whether filtered or not
datasetnames: Names of datasets for which at least one callset called this genotype, whether filtered or not
datasetsmissingcall: Names of datasets that are missing a call or have an incorrect call at this location, and the high-confidence call is a variant
callsets: Number of different callsets that called this genotype, whether filtered or not.
callsetnames: Names of callsets that called this genotype, whether filtered or not.
varType: Type of variant.
filt: List of callsets that had this call filtered.
callable: List of callsets that had this call in a region with low coverage of high MQ reads.
difficultregion: List of difficult region bed files containing this call.
arbitrated: TRUE if callsets had discordant calls so that arbitration was needed.
callsetwiththisuniqgenopassing: Callset that uniquely calls the PASSing genotype in GT when 2+ PASSing callsets support a different genotype.
callsetwithotheruniqgenopassing: Callset that uniquely calls a PASSing genotype different from GT when 2+ PASSing callsets support the genotype in GT.

Example Variant Call

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG002
chr1    602113  .       T       TGCCCA  50      PASS    platforms=3;platformnames=PacBio,10X,Illumina;datasets=3;datasetnames=CCS15kb_20kb,10XChromiumLR,HiSeq250x250;callsets=4;callsetnames=CCS15kb_20kbDV,CCS15kb_20kbGATK4,10XLRGATK,HiSeq250x250Sentieon;datasetsmissingcall=HiSeqPE300x,CGnormal,HiSeq250x250,HiSeqMatePair,IonExome,SolidSE75bp;callable=CS_CCS15kb_20kbDV_callable,CS_CCS15kb_20kbGATK4_callable;filt=CS_HiSeqPE300xSentieon_filt,CS_HiSeqPE300xfreebayes_filt,CS_HiSeq250x250Sentieon_filt,CS_HiSeq250x250freebayes_filt,CS_HiSeqMatePairfreebayes_filt;difficultregion=AJtrio-HG002.hg38.300x.bam.bilkentuniv.072319.dups,hg38.segdups_sorted_merged,lowmappabilityall        GT:PS:DP:ADALL:AD:GQ    1/1:.:93:2,56:2,38:99
   
INFO Fields:
    
platforms=3;
platformnames=PacBio,10X,Illumina;
datasets=3;
datasetnames=CCS15kb_20kb,10XChromiumLR,HiSeq250x250;
callsets=4;
callsetnames=CCS15kb_20kbDV,CCS15kb_20kbGATK4,10XLRGATK,HiSeq250x250Sentieon;
datasetsmissingcall=HiSeqPE300x,CGnormal,HiSeq250x250,HiSeqMatePair,IonExome,SolidSE75bp;
callable=CS_CCS15kb_20kbDV_callable,CS_CCS15kb_20kbGATK4_callable;
filt=CS_HiSeqPE300xSentieon_filt,CS_HiSeqPE300xfreebayes_filt,CS_HiSeq250x250Sentieon_filt,CS_HiSeq250x250freebayes_filt,CS_HiSeqMatePairfreebayes_filt;
difficultregion=AJtrio-HG002.hg38.300x.bam.bilkentuniv.072319.dups,hg38.segdups_sorted_merged,lowmappabilityall

In the above example, we can see that the variant call has been identified in 3 callsets which is 3 VCF files from different pipelines and platforms.

References

Technical Improvements in DeepVariant v1.0 and Training a Hybrid model