indels - gsudre/autodenovo GitHub Wiki

09/08/2017

This is going well, but it takes a while to run. There is a challenge in running single/joint/batch calling in all subjects, so I'll see if that works. I want to check the differences.

But the main problem here is our data, which has several lanes for each sample. I converted each aligned sample BAM to 2 (paired) VCFs, but Jenny mentioned in an e-mail that it might be better to go from the lane BAMs back to VCF. So I need to redo that, and see how best to specify all these variables in bcbio-nextgen (i.e. multiple VCF files, lane ID, sample ID, etc).

09/15/2017

I just saved to my Evernote a bunch of Q/A on how to use the GATK framework to call de novo mutations. I guess I can use it on top of TrioDenovo (or polymutt), and see what we get. Also, is it worth using SURVIVOR to condense the calls first, and then running these tools?

Another option is to use:

It might actually even be possible to use some of these with the CNVs we got from WES?

Other interesting tools, when we start thinking about which filters to use and visualization:

While I'm waiting for my VCFs from bcbio-nextgen, I can use either Linus' VCFs or even the ones we got from NISC... at least I can try out some of these tools, while I wait on developments on the other 2 arms.

Another interesting point is that triodenovo takes in VCF, but Polymutt takes in GLF or VCF. So, it would be interesting to compare the 3 approaches:

triodenovo using VCF from pipeline
polymutt using VCF from pipeline
polymutt using GLF

It'd also be interesting if the calls are different if we include all families in the pedigree, compared to pedigrees with single families. In any case, it needs a merged VCF file, which is taking forever...

vcf-merge *.vcf.gz > merge.vcf

A quick note on bcbio-nextgen: apparently it's dying during variant calling with multiple callers and people when I set up an lscratch directory. I removed that from the YAML, so let's see how it goes.

09/18/2017

Nope... it broke because of a malformed VCF. I don't think I'll spend much time on that, because it's not the data I'll be using. I do need to figure out how to convert the lane BAMs from NISC, but I can either work on that, or just use the scripts Tri will provide later. Let's put that on the back burner for now.

09/19/2017

Wow, this merging process is taking forever (more than a day already and not done, for 4 samples). I even found a faster way to do it (supposedly), but it's still taking a while:

 bcftools merge -O z --threads 8 -o 9020_merge.vcf.gz CCGO_800734.mpg.snv.vcf.gz CCGO_800983.mpg.snv.vcf.gz CCGO_800984.mpg.snv.vcf.gz CCGO_800986.mpg.snv.vcf.gz
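Both vcf-merge and bcftools merge expect bgzipped, indexed inputs, so a quick pre-check can rule out one source of trouble before committing to a long run. This is only a sketch: `check_vcf_indexes` is a made-up helper name, and it just looks for a `.tbi` or `.csi` file next to each input.

```shell
# Print every input VCF that lacks a tabix (.tbi) or CSI (.csi) index.
# Returns non-zero if at least one index is missing.
check_vcf_indexes() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -e "${f}.tbi" ] && [ ! -e "${f}.csi" ]; then
            echo "missing index: ${f}"
            missing=1
        fi
    done
    return "$missing"
}

# Example with two of the files from the merge above:
# check_vcf_indexes CCGO_800734.mpg.snv.vcf.gz CCGO_800983.mpg.snv.vcf.gz \
#     || echo "index the files listed above with: tabix -p vcf <file>"
```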

So, let's explore some of the other methods. From what I've seen, I could go with:

I could give DNMFilter another try, but I wasn't happy about having to build the training database... maybe I'm missing something. TADA seemed a bit too complicated, and maybe a bit outside of what I want to do. I think that running these 3 options should give us a good overview of the data.

I can also play with those filtering/visualization options later. Let's give DenovoGear a chance for now.

09/20/2017

Merging from lscratch was a lot faster. It took only a couple hours, while merging from /data took over 10h! So, we know what to do here (if not doing joint calling).
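For next time, the staging step could be wrapped up roughly like this. It is a sketch, not what was actually run: `merge_on_scratch` is a made-up helper, the bcftools options are copied from the 09/19 command, and it assumes the input basenames are unique and the scratch directory already exists.

```shell
# Copy the inputs (and their indexes, if present) to a scratch directory,
# merge there, and copy the merged file back to the current directory.
# Usage: merge_on_scratch SCRATCH_DIR OUT.vcf.gz IN1.vcf.gz IN2.vcf.gz ...
merge_on_scratch() {
    local scratch="$1" out="$2" n f idx
    shift 2
    n=$#
    for f in "$@"; do
        cp "$f" "$scratch"/ || return 1
        for idx in "$f.tbi" "$f.csi"; do
            if [ -e "$idx" ]; then cp "$idx" "$scratch"/ || return 1; fi
        done
        # Append the staged path; the original paths are dropped after the loop.
        set -- "$@" "$scratch/$(basename "$f")"
    done
    shift "$n"
    bcftools merge -O z --threads 8 -o "$scratch/$out" "$@" || return 1
    cp "$scratch/$out" .
}

# Example, assuming the job's local scratch lives at /lscratch/$SLURM_JOB_ID
# (adjust to wherever lscratch is mounted for the job):
# merge_on_scratch "/lscratch/$SLURM_JOB_ID" 9020_merge.vcf.gz CCGO_8009*.mpg.snv.vcf.gz
```

The inputs keep their original order on the bcftools command line, so the sample order in the merged file should match what an in-place merge would give.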

10/04/2017

Just finished running de novo calls using GATK, triodenovo and denovogear. Time to get ensemble calls, and start evaluating them. Do we need to filter more? Can we go ahead and compare the affected and non-affected siblings? First, we use bcbio-variation-recall to get an ensemble call:

module load bcbio-nextgen
ref_fa=/fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa;
while read t; do
   bcbio-variation-recall ensemble --numpass 3 --names GATK,DeNovoGear,TrioDeNovo ${t}_ensemble.vcf $ref_fa ${t}_hiConfDeNovo.vcf ../dng/${t}_dnm.vcf ../triodenovo/${t}_denovo_v2.vcf;
   gunzip ${t}_ensemble.vcf.gz;
   rm -rf ${t}_ensemble-work ${t}_ensemble.vcf.gz.tbi;
done < ../trio_ids.txt

But before we run the command above we need to fix something in the TrioDenovo VCFs... badly formatted header!!! It declares Type=Denovo Quality, which is not a valid VCF type, so we swap it for Type=Float:

while read t; do
   sed -e 's/Type=Denovo Quality/Type=Float/g' "${t}_denovo.vcf" > "${t}_denovo_v2.vcf"
done < ../trio_ids.txt
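The substitution can be tried on a single line before touching every trio, and the fixed files can be scanned for any other header types the VCF spec doesn't allow. This is a rough sketch: both function names are made up, and the sample header line is hypothetical (only the `Type=Denovo Quality` fragment comes from the sed command above).

```shell
# Same substitution as the sed above, reading from stdin.
fix_triodenovo_header() {
    sed -e 's/Type=Denovo Quality/Type=Float/g'
}

# List INFO/FORMAT header lines whose Type is not one the VCF spec allows.
bad_header_types() {
    grep -E '^##(INFO|FORMAT)=' "$1" \
        | grep -vE 'Type=(Integer|Float|Flag|Character|String)[,>]'
}

# Hypothetical header line: the ID and Description are invented for the demo.
sample='##FORMAT=<ID=DQ,Number=1,Type=Denovo Quality,Description="example">'
printf '%s\n' "$sample" | fix_triodenovo_header
# Then, per trio: bad_header_types "${t}_denovo_v2.vcf"   (no output = header OK)
```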

Then, the approach would be to find which (if any) variants are present across all affected trios. Then, if we are lucky, we can check the likelihood that those variants are also in the unaffected trios.
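A first pass at the "present across all affected trios" check could look like the sketch below. It is only one way to do it: `shared_variants` is a made-up helper, it matches on exact CHROM, POS, REF and ALT (multi-allelic ALT strings are compared as-is), and the file naming in the example assumes the `${t}_ensemble.vcf` outputs from the loop above.

```shell
# Print CHROM:POS:REF:ALT keys that occur in every VCF given as an argument.
# Each key is counted at most once per file, so a key counted N times across
# N files must be present in all of them.
shared_variants() {
    local n=$# f
    for f in "$@"; do
        # Skip header lines, build one key per record, de-duplicate per file.
        awk -F '\t' '!/^#/ { print $1 ":" $2 ":" $4 ":" $5 }' "$f" | sort -u
    done | sort | uniq -c | awk -v n="$n" '$1 == n { print $2 }'
}

# Example, assuming a hypothetical affected_trios.txt with one trio ID per line:
# shared_variants $(sed 's/$/_ensemble.vcf/' affected_trios.txt)
```

Exact matching is strict, so an empty result would not rule out different variants hitting the same gene in different families.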