CNVkit - gsudre/autodenovo GitHub Wiki

09/13/2017

I started playing with CNVKit, but I'm not terribly hopeful of it, because it didn't look very well suited to run analysis other than cancer. It does have an option for germline analysis, but it says it's not very accurate.

In any case, I had to jump through a few hoops to get it working. First, to create the access file, I had to clone the latest repository because the code imported in the module was breaking. There's a closed issue saying the bug was fixed, so it worked alright. After cloning the repo, it's just:

 ./cnvkit.py access /fdb/igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa -s 5000 -o access-5kb.hg38.bed

Then, I used the .bed I changed using crossmap and added the chr prefix (see Excavator), and had to copy the whole reference file to the local directory, as I couldn't figure out how to tell it not to rebuild the fasta index without going into the code. Then, it's just a matter of running:

cnvkit.py batch project2/final/CCGO_800818/CCGO_800818-ready.bam -n -t capture_targets_GRCh38_chrAdded.bed -f genome.fa --access access-5kb.hg38.bed --output-reference my_flat_reference.cnn -d example2/

Fake families

As this worked well the first time, let's try using the 3 fake quartets we have. First, generate a few different access windows (using the code compiled from github):

./cnvkit.py access /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa -s 5000 -o access-5kb.hg38.bed
./cnvkit.py access /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa -s 10000 -o access-10kb.hg38.bed
./cnvkit.py access /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa -s 50000 -o access-50kb.hg38.bed
./cnvkit.py access /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa -s 100000 -o access-100kb.hg38.bed
./cnvkit.py access /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa -s 1000 -o access-1kb.hg38.bed
cp access-* ../../fake_trios/cnvkit

Now, run all our samples, for the different windows:

for w in 1 5 10 50 100; do  
   cnvkit.py batch ../bams/*.bam -n -t ../capture_targets_GRCh38_chrAdded.bed -f ../genome.fa --access access-${w}kb.hg38.bed --output-reference my_flat_reference.cnn -d 3quartets_${w}kb/ -p 6;
done

for w in 1 5 10 50 100; do
   while read s; do
      cnvkit.py call 3quartets_${w}kb/${s}-sort.cns -y -m threshold -t=-1.1,-0.4,0.3,0.7 -o 3quartets_${w}kb/${s}-sort.call.cns;
      cnvkit.py export vcf 3quartets_${w}kb/${s}-sort.call.cns -i "${s}" -o 3quartets_${w}kb/${s}-sort.call.cnv.vcf;
   done < ../sample_ids.txt;
done

4 processes are taking about 3.5 Gb of RAM, so I increased it to 6.

Note that if this approach works, then swarming it is trivial.

One of the columns in the call file is probes, so we could potentially use a similar algorithm to what PennCNV uses to investigate how many of those probes are in the parents, and calculate a p-value from that (http://penncnv.openbioinformatics.org/en/latest/user-guide/denovo/).

But maybe I need to do some filtering first, according to this: http://cnvkit.readthedocs.io/en/stable/germline.html

It'd be nice if we could similar methods to detect CNVs across WES callers, but we might not be able to extract the same variables because of the algorithms they're running. In any case, let's at least try to determine de novo variants regardless of the algorithm we use. And then we can compare them?

09/18/2017

It looks like I'll be able to use the same method from CNVnator for cnvkit calls. The threshold values might not be exactly the same, so I'll need to do some type of histogram in the RD values (coverage depth?) from CNVkit. But the same approach seems plausible.

We cannot use the same thing they do for PennCNV because the binomial approach requires the single nucleotide calls.

All fake families

Let's run this with all our fake simplex families, swarming when needed. First, create the access files for hg19:

for w in 1 5 10 50 100; do
   ../../software/cnvkit/cnvkit.py access /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa -s ${w}000 -o access-${w}kb.hg19.bed;
done

Now, run all our samples, for the different windows:

while read s; do for w in 1 5 10 50 100; do echo "bash ~/autodenovo/cnvkit_full.sh $s $w" >> swarm.cnvkit; done; done < ../sample_ids.txt
swarm -f swarm.cnvkit -t 6 -g 16 --job-name cnvkit --logdir trash --time=48:00:00 --gres=lscratch:10

TODO:

write script to figure out de novo mutations from trios