gatkRefine - gsudre/autodenovo GitHub Wiki

09/19/2017

I'm now playing with the refinement steps to the GATK pipeline which take into consideration population and family priors. From that we can also infer denovo mutations. Following directions from https://gatkforums.broadinstitute.org/gatk/discussion/4723/genotype-refinement-workflow and https://software.broadinstitute.org/gatk/documentation/article.php?id=4727.

Note that I'll need to make small modifications to these commands (e.g. reference, file locations) after I run all pipeline on my own. But for now, let's use a few hacks to use the files Linus has produced, and also a few commands from Sijung's pipeline to produce a jointly called VCF file (instead of merging it...).

gatk_memory="50g"

ref_fa="/fdb/igenomes/Homo_sapiens/NCBI/build37.2/Sequence/WholeGenomeFasta/genome.fa"
GATK -m ${gatk_memory} GenotypeGVCFs -R ${ref_fa} -V ../../Linus/gVCFs/CCGO_800734_raw_variants.g.vcf -V ../../Linus/gVCFs/CCGO_800983_raw_variants.g.vcf -V ../../Linus/gVCFs/CCGO_800984_raw_variants.g.vcf -V ../../Linus/gVCFs/CCGO_800986_raw_variants.g.vcf -o joint.vcf

GATK -m ${gatk_memory} VariantAnnotator -R /fdb/GATK_resource_bundle/b37/human_g1k_v37.fasta -V joint.vcf -A Coverage -o joint_annotated.vcf

 GATK -m ${gatk_memory} VariantRecalibrator -R ${ref_fa} -input joint_annotated.vcf -resource:hapmap,VCF,known=false,training=true,truth=true,prior=12.0 /fdb/GATK_resource_bundle/b37/hapmap_3.3.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 /fdb/GATK_resource_bundle/b37/1000G_phase3_v4_20130502.sites.vcf.gz -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /fdb/GATK_resource_bundle/b37/dbsnp_138.b37.vcf -an DP -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R -nt 24

GATK -m ${gatk_memory} ApplyRecalibration -R ${ref_fa} -input joint_annotated.vcf -mode SNP --ts_filter_level 99.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -o recalibrated_snps_raw_indels.vcf

GATK -m ${gatk_memory} VariantRecalibrator -R ${ref_fa} -input recalibrated_snps_raw_indels.vcf -resource:1K_indels,known=false,training=true,truth=true,prior=12.0 /fdb/GATK_resource_bundle/b37/1000G_phase1.indels.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /fdb/GATK_resource_bundle/b37/dbsnp_138.b37.vcf -an DP -an QD -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches -rscriptFile recalibrate_INDEL_plots.R

I had to fudge some of the resources so that Sijung's script would work. I didn't have read access to the actual resources he used in his original script. Also, keep in mind that GATK recommends at least 30 samples to do this calling, but it might be worth a try doing it with the ones we have. At least we can have a joint VCF without needing to merge them. In othe words, do the joint-VCF calling, but don't do the re-calibration. Keep in mind that we might need to do a per-sample calling (e.g. no recalibration) to find denovo mutations... don't know. Didn't read that anywhere, but I'm just thinking that it might be better to have more variants to look at.

Now we actually start the refinement steps:

GATK -m ${gatk_memory} CalculateGenotypePosteriors -R /fdb/GATK_resource_bundle/b37/human_g1k_v37_decoy.fasta --supporting /fdb/GATK_resource_bundle/b37/1000G_phase3_v4_20130502.sites.vcf.gz -ped ../9020_quartet.ped -V recalibrated_variants.vcf -o recalibrated_variants.postCGP.vcf

GATK -m ${gatk_memory} VariantFiltration -R /fdb/GATK_resource_bundle/b37/human_g1k_v37_decoy.fasta -V recalibrated_variants.postCGP.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibrated_variants.postCGP.Gfiltered.vcf

GATK -m ${gatk_memory} VariantAnnotator -R /fdb/GATK_resource_bundle/b37/human_g1k_v37_decoy.fasta -V recalibrated_variants.postCGP.Gfiltered.vcf -A PossibleDeNovo -ped ../9020_quartet.ped -o recalibrated_variants.postCGP.Gfiltered.deNovos.vcf

It's possible that some of the steps above might need only the trios, and not the quartets, so I'll need to test that.

10/04/2017

Let's swarm this to get denovo calls for all trios:

while read s; do echo "bash ~/autodenovo/gatk_refine.sh $s" >> swarm.refine; done < trio_ids.txt
swarm -f swarm.refine -t 2 -g 10 --job-name gatkr --logdir trash --time=48:00:00 --gres=lscratch:100