CNVnator - gsudre/autodenovo GitHub Wiki

09/12/2017

Started playing with CNVnator today. The first thing I noticed is that it's not set up to work with hg38. I'll keep working with hg18, but we'll need to convert it later. Actually, this might not matter: if we don't specify the genome option, CNVnator should be able to get the positional information from the BAM headers.

Running the first step looks quite easy:

module load cnvnator
cnvnator -unique -root out.root -tree project2/final/CCGO_800818/CCGO_800818_ready.bam

It takes only one thread, but 13 GB of memory, so we can easily swarm it later. Note that the recommended bin size varies according to data quality, coverage, and read length. But, assuming homogeneous quality and read length, we can pick the bin size based on our estimated coverage. We'll need to estimate coverage for XHMM anyway, but for now let's use whatever works.
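
For reference, the CNVnator docs suggest roughly 30-bp bins for ~100x coverage, 100-bp for 20-30x, and 500-bp for 4-6x. Here's a quick (slow but simple) sketch for getting a ballpark mean coverage from a BAM with samtools; the cutoffs below are my own rough interpolation of those suggestions, not anything official:

# ballpark mean coverage over all positions (running this on a subset of the
# genome would be much faster and good enough for a rough number)
cov=$(samtools depth -a project2/final/CCGO_800818/CCGO_800818_ready.bam \
   | awk '{sum+=$3; n++} END {if (n) printf "%.0f", sum/n}')
# map coverage to a bin size; cutoffs are my own rough interpolation
if [ "$cov" -ge 80 ]; then bin=30;
elif [ "$cov" -ge 15 ]; then bin=100;
else bin=500; fi
echo "mean coverage ${cov}x -> bin size ${bin}"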

Also, we can/should merge the output for different participants before calling any of the stats below. However, we won't be able to specify different bin sizes if we merge files...
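
If I'm reading the usage right, CNVnator does have a -merge option for combining the read-depth trees of several root files into one, though I believe it's intended for multiple runs/lanes of the same sample, so whether it makes sense across participants needs checking:

# assumption: -merge combines the read-depth trees of the listed root files
# into merged.root; across different people this would mix signals rather than
# enable joint calling, so this needs testing before relying on it
cnvnator -root merged.root -merge kid.root mom.root dad.root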

Then we need some more prep:

for i in {1..22} X Y; do
   ln -s /fdb/ensembl/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.${i}.fa chr${i}.fa;
done

cnvnator -root out.root -his 100
cnvnator -root out.root -stat 100
cnvnator -root out.root -partition 100
cnvnator -root out.root -call 100 > cnvnator_calls.txt

And this is it... note that calling the CNVs takes 2 threads, but not too much memory. Just a long time...

The 100 bin size was just a placeholder so we could keep going; I need to figure out the right value later! Also, I will need to include more symlinks above if I want to play with anything other than those chromosomes.

Something else to keep in mind is that we should be able to do this on single trios/quartets, so we don't necessarily need the power of all samples for anything other than computing the differences between the affected and non-affected siblings. True, using all samples at the same time might help when calling CNVs, but we technically shouldn't need it.

Fake families

Alright, let's see what we get running our fake quartets. I'll run bin sizes similar to the ones I'm using in CNVkit, for comparison. I need to check whether they translate to comparable parameters, but hopefully they do.

module load cnvnator
my_chr='' ; for c in {1..22} X Y; do my_chr=${my_chr}' chr'$c; done
cd /data/NCR_SBRB/fake_trios/cnvnator
while read s; do 
   echo $s; 
   cnvnator -unique -root ${s}.root -chrom $my_chr -tree ../bams/${s}-sort.bam;
done < ../sample_ids.txt

# windows mentioned in the paper, but need to check what works best for us. Per sample?
for window in 100 500 30; do 
   while read s; do
      cnvnator -root ${s}.root -chrom $my_chr -his $window -d ../;
      cnvnator -root ${s}.root -chrom $my_chr -stat $window;
      cnvnator -root ${s}.root -chrom $my_chr -partition $window;
      cnvnator -root ${s}.root -chrom $my_chr -call $window > ${s}_calls_w${window}.txt;
      cnvnator2VCF.pl ${s}_calls_w${window}.txt > ${s}_calls_w${window}.vcf;
   done < ../sample_ids.txt;
done

If this approach works, swarming it is trivial. I do need to test whether it's worth merging the members of a family into one root file (before any histogram action; see the -merge sketch above), which would complicate things a bit, but still not bad.

Just because it was taking a while to run, I created this bash script that I later swarmed:

#!/bin/bash
s=$1       # sample ID
window=$2  # bin size (bp)
root_dir=/data/NCR_SBRB/fake_trios

cd /lscratch/$SLURM_JOBID
for i in {1..22} X Y; do
   ln -s /fdb/ensembl/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.${i}.fa chr${i}.fa;
done
my_chr='' ; for c in {1..22} X Y; do my_chr=${my_chr}' chr'$c; done
cnvnator -unique -root ${s}.root -chrom $my_chr -tree ${root_dir}/bams/${s}-sort.bam;
cnvnator -root ${s}.root -chrom $my_chr -his $window;
cnvnator -root ${s}.root -chrom $my_chr -stat $window;
cnvnator -root ${s}.root -chrom $my_chr -partition $window;
cnvnator -root ${s}.root -chrom $my_chr -call $window > ${s}_calls_w${window}.txt;
cnvnator2VCF.pl ${s}_calls_w${window}.txt > ${s}_calls_w${window}.vcf;

mv *_calls_* ${root_dir}/cnvnator/

The swarm was simple:

while read s; do 
   for w in 30 100 500; do 
      echo "bash /data/NCR_SBRB/fake_trios/cnvnator/run_sample.sh $s $w" >> swarm.run;
   done;
done < ../sample_ids.txt

swarm -f swarm.run -t 2 -g 16 --job-name cnvnator --logdir trash -m cnvnator --gres=lscratch:100

And we can potentially just use the same strategy they used in their paper to check for de novo CNVs:

To find de novo CNVs, we did the following: For each trio, we genotyped across all members the q0-filtered CNV calls made for the child. We selected putative de novo candidates that satisfied the following criteria: (1) The normalized RD signal in the child is less than 1.4 (more than 2.6 for duplications); and (2) the normalized average RD signal in each parent is more than 1.6 (less than 2.4 for duplications). Although 1.5 is the cutoff to discriminate between heterozygous deletions and the normal diploid state, we note that genotyping estimation can be biased by ~0.05 (see Fig. 2), and thus we made the cutoff more stringent by subtracting/adding the double (0.1) of the value. In the same way we made a more stringent cutoff to detect de novo duplications.

They also did some previous filtering to only include CNVs longer than 1 kb. Need to check whether we should do that or not.
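
If we do, here's a minimal untested sketch, assuming the standard CNVnator call output columns (1 = CNV type, 2 = coordinates, 3 = size, 4 = normalized RD, 5-8 = e-values, 9 = q0) and the paper's q0 < 0.5 cutoff; the _filtered suffix is just illustrative:

# keep calls at least 1 kb long with q0 < 0.5 (q0 assumed to be column 9)
awk '$3 >= 1000 && $9 < 0.5' ${s}_calls_w${window}.txt > \
   ${s}_calls_w${window}_filtered.txt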

It would also be nice if we could estimate the number of probes (or something analogous to that), so we can compute a statistic similar to what PennCNV does and keep it constant across methods. Maybe that's something we can derive before the call command? Of course, it only makes sense for calls in the kid that are not in the parents, but it's still worth checking!
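
On the probe-count question, the closest analog I can think of for read-depth data is the number of bins a call spans (size divided by the window), which we can get straight from the calls file. Below is an untested sketch of that plus the trio genotyping filter from the quote; all file names (kid.root, kid_calls_w100_filtered.txt, etc.) are placeholders, and the column positions are assumptions I still need to verify:

# bins spanned per call (size / window) as a rough PennCNV-numsnp analog;
# assumes size is column 3 of the calls file
awk -v w=100 '{print $0"\t"int($3/w)}' kid_calls_w100_filtered.txt

# genotype the kid's q0-filtered calls in all trio members, then apply the
# thresholds from the quote (deletions: kid < 1.4, both parents > 1.6;
# duplications: kid > 2.6, both parents < 2.4)
awk '{print $2}' kid_calls_w100_filtered.txt > kid_regions.txt  # coordinates
for m in kid mom dad; do
   # assumption: -genotype reads regions from stdin, one per line, and quits on
   # "exit"; see also the 11/02 note below about the 1000-bp genotyping window
   { cat kid_regions.txt; echo exit; } | \
      cnvnator -root ${m}.root -genotype 100 > ${m}_rd.txt;
done
# assumption: each genotype line has 3 fields with the normalized RD last, so
# after paste the kid/mom/dad RDs land in fields 3, 6, and 9
paste kid_rd.txt mom_rd.txt dad_rd.txt | awk \
   '($3 < 1.4 && $6 > 1.6 && $9 > 1.6) || ($3 > 2.6 && $6 < 2.4 && $9 < 2.4)'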

All fake families

Because of the work I had done before, adapting the previous script to the entire family wasn't a big deal. Then, we just swarm it:

while read s; do 
   for w in 30 100 500; do 
      echo "bash ~/autodenovo/cnvnator_full.sh $s $w" >> swarm.cnvnator;
   done;
done < ../sample_ids.txt

swarm -f swarm.cnvnator -t 2 -g 16 --job-name cnvnator --logdir trash -m cnvnator --gres=lscratch:50 --time=48:00:00

TODO

  • figure out the best bin size to run. Would it be dependent on results? (e.g., optimize it to find de novo results?)
  • should we merge the root files and calculate stats on everything, or do it by subject and be able to specify different bin sizes?
  • write a script to figure out de novo mutations from trios (see the sketch above)

11/02/2017

I'll likely stop trying to run CNVnator. Not only does the author seem to discourage its use for WES (see one of the issues tracked on GitHub), but it also has some weird hard-coded values, such as requiring a 1000-bp window during genotyping (not possible with some of our samples), and it keeps giving weird -1 errors in our genotypes. I'll play with other tools for now and check whether they work better.