AutoDenovo Wiki - gsudre/autodenovo GitHub Wiki

Welcome to the AutoDenovo wiki!

Here I plan to take notes about the journey of creating this pipeline. As expected, there are hurdles in all 3 arms of the project.

And of course, I'm having issues with the data I'm dealing with.

Data issues

I plan to describe these issues here, and go along with actual issue tickets in the appropriate tab, so that I can keep track of what still needs to be done.

09/12/2017

Today I got to thinking about where I should spend most of my time. The main goal of this project is to apply it to the quartet WES data I'm waiting to arrive from the sequencing center. Although I assume we'll have genotyping data with it (similar to the test data I'm using now), and that I'll have to re-align everything, starting from lane BAMs, according to GATK best practices, I don't know any of that for sure. The only thing I know for sure is that I'll need to be able to run a de novo analysis on the WES data I get.

So, I'll focus on making sure I can do that, using the SNP/indel arm and also the CNV/WES arm. Ben also asked me to explore some of the PSST scripts that look into multiple variants related to an outcome. Those are the priorities for now, and I'll wait to do more on the re-alignment and genotyping front later.

09/13/2017

I was also thinking that even though variant calling might work better using multiple samples (within family or even across all samples), I should try to optimize this pipeline to do calls either per sample or, at most, within families. That way it doesn't depend on the total number of families (trios/quartets) that exist in the data. It could be a future improvement to call variants using all samples, but maybe not a priority?
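To make the per-family idea concrete, here is a minimal sketch that groups samples by family so each trio or quartet could be called on its own. It assumes the pedigree is available in the standard six-column PED layout (family ID, individual ID, father, mother, sex, phenotype). The function name and the sample IDs are made up for illustration and are not part of the actual pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_samples_by_family(ped_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each family ID to the list of sample IDs that belong to it."""
    families: Dict[str, List[str]] = defaultdict(list)
    for line in ped_lines:
        line = line.strip()
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        # The first two PED columns are family ID and individual ID.
        family_id, sample_id = line.split()[:2]
        families[family_id].append(sample_id)
    return dict(families)


# Hypothetical pedigree: one quartet and one trio.
example_ped = [
    "#family sample father mother sex phenotype",
    "FAM1 kid1 dad1 mom1 1 2",
    "FAM1 kid2 dad1 mom1 2 1",
    "FAM1 dad1 0 0 1 1",
    "FAM1 mom1 0 0 2 1",
    "FAM2 kid3 dad2 mom2 1 2",
    "FAM2 dad2 0 0 1 1",
    "FAM2 mom2 0 0 2 1",
]

families = group_samples_by_family(example_ped)
```

Each entry in `families` could then be passed to the variant caller as its own job, so the amount of work per job stays the same no matter how many families are in the data.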

One of the main questions I have at the moment is when to use filtering. Specifically, I've found SURVIVOR, which can do some sort of majority voting on the variants called by the different methods. I also found cnvScan, which seems to do a good job filtering CNV calls for each package. Finally, I can use the de novo aspect of the studies to filter based on family relationships. In other words, I need to test what the best order is here. Maybe do cnvScan first, then SURVIVOR, then de novo? No... if we're looking for de novo variants, then that should be the very first filter. We can then look at the input/output for SURVIVOR and cnvScan to see which order makes more sense for the remaining steps.
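A rough sketch of that ordering, with the de novo filter applied first and a vote across callers second. This is only an illustration under simplifying assumptions: variants are matched by an exact (chrom, pos, ref, alt) key, genotypes are VCF-style strings such as `0/1`, and the exact-match vote is a stand-in for the idea of majority voting, not SURVIVOR's actual algorithm. All names and data here are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical variant key: (chrom, pos, ref, alt).
Variant = Tuple[str, int, str, str]
# Genotypes for one trio at one site: (child, father, mother).
TrioGenotypes = Tuple[str, str, str]


def alleles(gt: str) -> List[str]:
    """Split a VCF-style genotype such as '0/1' or '0|1' into its alleles."""
    return gt.replace("|", "/").split("/")


def is_de_novo(child_gt: str, father_gt: str, mother_gt: str) -> bool:
    """Child carries a non-reference allele while both parents are
    homozygous reference. Missing parental alleles ('.') fail the check."""
    child_has_alt = any(a not in ("0", ".") for a in alleles(child_gt))
    parents_hom_ref = all(
        a == "0" for gt in (father_gt, mother_gt) for a in alleles(gt)
    )
    return child_has_alt and parents_hom_ref


def de_novo_then_vote(
    calls_by_caller: Dict[str, Dict[Variant, TrioGenotypes]],
    min_callers: int = 2,
) -> Set[Variant]:
    """Apply the de novo filter per caller, then keep variants that at
    least `min_callers` callers agree on."""
    votes: Counter = Counter()
    for caller_calls in calls_by_caller.values():
        for variant, (child, father, mother) in caller_calls.items():
            if is_de_novo(child, father, mother):
                votes[variant] += 1
    return {variant for variant, n in votes.items() if n >= min_callers}


# Made-up calls from two hypothetical callers for a single trio.
v1 = ("chr1", 1000, "A", "G")
v2 = ("chr2", 2000, "C", "T")
v3 = ("chr3", 3000, "G", "A")
example_calls = {
    "callerA": {v1: ("0/1", "0/0", "0/0"), v2: ("0/1", "0/1", "0/0")},
    "callerB": {v1: ("0/1", "0/0", "0/0"), v3: ("1/1", "0/0", "0/0")},
}

kept = de_novo_then_vote(example_calls, min_callers=2)
```

Running the de novo check first shrinks the call set before any cross-caller comparison, which is the reason for putting it at the front. For a quartet, the same check would be run once per child.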

09/15/2017

I just ran into this: http://centre.bioinformatics.zj.cn/mirTrios/, which is a pipeline to call de novo mutations in NGS data. Very similar to what I'm doing, at least in one of the arms. Let me see how it runs. It only runs through their website though... I don't like that. I prefer open-source tools...