structVariants - christianparobek/cambodiaWGS GitHub Wiki

Might want to use lumpy, a probabilistic SV-calling program from Aaron Quinlan's lab. Also, Wham is a newer program that somehow uses machine learning to classify calls... if only we had a dataset of validated SNVs that we could use for training. Wham can also perform association tests, so it would be interesting to use it to search for differences between CP3 and CP4. Anyway, performance of Wham is similar to lumpy, but Wham is more sensitive at detecting small tandem repeats (<60 bp).

Installed in /proj/julianog/src and /proj/julianog/bin on Kure. Must run addpython2.7.6 (which is defined in the .bashrc file) to be able to run the scripts in here. Also installed the svtyper Python script in both /proj/julianog/src and /proj/julianog/bin on Kure. Run it to annotate the VCF files made by lumpy. It must also be run with python/2.7.6. Note that you'll have to switch back and forth between different versions of Python to run the various scripts in here... yikes.

The final step, svtyper, has big problems with the headers of my BAMs from the standard pipeline: it says they're all messed up, and they might be, because they've been through so many programs. So I'm going to remake the BAMs myself, and not merge or deduplicate. To do this, I need to make concatenated files of all the reads for each sample. I'm making these in the folder catreads, which holds symlinks for the onesies (samples with a single run) and concatenated files for samples with more than one run. Populating this folder with the readCombiner.sh script (only with our 70 good reads). Then going to run snalumper.py to align these files snakemake-style for lumpy. Then run lumpy.sh and use svtyperStarter.sh to actually make the SV calls and annotate them.
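The symlink-or-concatenate step can be sketched in Python as below. This is only an illustration of the idea, not the contents of readCombiner.sh: the function name `combine_reads`, the `runs_by_sample` mapping, and the `<sample>.fastq.gz` naming are all assumptions, and it handles one file per run (paired-end data would need one call per mate). It relies on the fact that gzip files concatenated byte-for-byte still form a valid gzip stream.

```python
import os
import shutil
from pathlib import Path


def combine_reads(runs_by_sample, out_dir):
    """Symlink single-run samples; concatenate multi-run samples.

    runs_by_sample: dict of sample name -> list of FASTQ paths
    (a hypothetical layout, not taken from the real pipeline).
    Returns a dict of sample name -> path written in out_dir.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = {}
    for sample, runs in runs_by_sample.items():
        runs = [Path(r).resolve() for r in runs]
        target = out_dir / f"{sample}.fastq.gz"  # assumed naming scheme
        # Clear any stale output so the function can be re-run safely.
        if target.is_symlink() or target.exists():
            target.unlink()
        if len(runs) == 1:
            # One run: a symlink avoids duplicating the data.
            os.symlink(runs[0], target)
        else:
            # Several runs: append the raw bytes of each file in order.
            with open(target, "wb") as out:
                for run in runs:
                    with open(run, "rb") as src:
                        shutil.copyfileobj(src, out)
        outputs[sample] = target
    return outputs
```

Calling `combine_reads({"sampleA": ["run1.fastq.gz"], "sampleB": ["run1.fastq.gz", "run2.fastq.gz"]}, "catreads")` would leave a symlink for the first (made-up) sample and a single concatenated file for the second.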

I ran svtyper for several days, and some samples finished and others didn't. Ended up killing the remaining jobs, but they were hung up on the contigs, so they had already finished characterizing the chromosomes. lumpy identified Duffy duplications in the same samples as Derrick's pipeline, plus two more (KP063, KP067). Want to do an analysis of selective sweeps (iHS, nSL or something), but DBP occurs in the end telomere of chr06 (~970000), and I filtered that region out when I variant called. Poopy. So what if I re-call all variants across all of chr06, then look at heterozygosity and nSL over the whole thing? Once I call those variants and filter, maybe I can use IBDseq: http://faculty.washington.edu/browning/ibdseq/ibdseq.07Nov13.pdf

##### 21 February 2016

Just coming back to this now. And decided not to add the lumpy analysis to the snanalysis.py snakemake file, since this doesn't rely on a VCF. In theory we could do this once from the bams and not worry about it any more.

##### 07 March 2016

To search for an event in a particular location...
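One possible way to do such a search, sketched in Python rather than whatever command was originally used here: scan the lumpy VCF and keep the records whose span overlaps a region of interest. The helper names `parse_info` and `svs_in_region` are made up for this example. It assumes standard VCF columns with `SVTYPE` and `END` in the INFO field, and treats breakend (BND) records, which carry no `END`, as single points at `POS`.

```python
def parse_info(info_field):
    """Turn a VCF INFO string like 'SVTYPE=DUP;END=980000' into a dict."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    return info


def svs_in_region(vcf_lines, chrom, start, end):
    """Yield (pos, sv_end, svtype, id) for SVs on chrom overlapping [start, end]."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8 or fields[0] != chrom:
            continue
        pos = int(fields[1])
        info = parse_info(fields[7])
        # BND records have no END key, so fall back to POS.
        sv_end = int(info.get("END", pos))
        # Two intervals overlap when each starts before the other ends.
        if pos <= end and sv_end >= start:
            yield pos, sv_end, info.get("SVTYPE", "NA"), fields[2]


if __name__ == "__main__":
    import sys

    # Usage: python find_sv.py calls.vcf chr06 960000 980000
    # (the script name, file name, and contig name are placeholders;
    # use the contig name from your own reference)
    path, chrom, start, end = sys.argv[1:5]
    with open(path) as handle:
        for hit in svs_in_region(handle, chrom, int(start), int(end)):
            print(*hit, sep="\t")
```

For the DBP region mentioned above, the window would be something around position 970000 on chromosome 6, using whatever name that chromosome has in the reference the BAMs were aligned to.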