Excavator - gsudre/autodenovo GitHub Wiki

09/12/2017

Here are my experiences with EXCAVATORtool. First, it's not in BW, so I had to install it by hand. Then, we need to load the requirements to run it:

module load R
module load samtools
module load bedtools

Now, unfortunately this one needs hg19... in fact I don't even know where I'd add hg38 to it. I'll contact the author about how to generate the uniqueome file for hg38, as this would be the only tool that would run with hg19, and I don't want to realign all BAMs for that.

Never mind: looking at the discussion section, I should be using EXCAVATOR2tool instead... https://sourceforge.net/projects/excavator2tool/

I still needed to get a .bed file for our kit (SeqCap EZ Exome + UTR lib), so I got it from here: http://sequencing.roche.com/products/nimblegen-seqcap-target-enrichment/seqcap-ez-system/seqcap-ez-exome-v3.html

The issue there is that the file is on hg19, and again, I don't want to realign everything. So, let's convert it to hg38.

module load crossmap
crossmap bed ../crossmap_chains/hg19ToGRCh37.over.chain.gz SeqCapEZ_Exome_v3.0_Design_Annotation_files/SeqCap_EZ_Exome_v3_hg19_capture_targets.bed capture_targets_GRCh37.bed
crossmap bed ../crossmap_chains/GRCh37_to_GRCh38.chain.gz capture_targets_GRCh37.bed capture_targets_GRCh38.bed
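It is probably worth checking how many targets survived the two conversions. A minimal sketch, assuming plain BED files where only `track`, `browser`, or `#` lines are headers; the helper name `liftover_summary` is hypothetical. A record can be split into several pieces during liftover, so the "lost" count is only a rough indicator and can even be negative:

```shell
# Compare the number of BED records before and after a liftover.
# $1 = BED file before conversion, $2 = BED file after conversion.
liftover_summary() {
    # grep -vc counts the lines that are NOT headers
    before=$(grep -vcE '^(track|browser|#)' "$1")
    after=$(grep -vcE '^(track|browser|#)' "$2")
    echo "input=$before output=$after lost=$((before - after))"
}

# Usage with the file names from these notes:
# liftover_summary capture_targets_GRCh37.bed capture_targets_GRCh38.bed
```

As far as I know, CrossMap also writes the records it could not map to a file next to the output with an `.unmap` suffix, which is the place to look if the counts differ a lot.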

And then I had to add chr to all lines in .bed (used vi...).
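The same edit can be scripted instead of done in vi. A minimal sketch, assuming the lifted-over file uses bare contig names (1, 2, ..., X, Y, MT) and that any `track`, `browser`, or `#` header lines should pass through untouched. The function name `add_chr_prefix` is hypothetical, and renaming MT to chrM is an assumption about the reference naming that should be checked against the BAM headers:

```shell
# Prefix the chromosome column of every BED record with "chr".
# Reads a BED file on stdin and writes the result to stdout.
add_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^(track|browser|#)/ { print; next }                 # header lines pass through
         $1 ~ /^chr/          { print; next }                 # already prefixed, leave alone
         $1 == "MT"           { $1 = "chrM"; print; next }    # assumption: UCSC-style chrM
                              { $1 = "chr" $1; print }'
}

# Usage with the file names from these notes:
# add_chr_prefix < capture_targets_GRCh38.bed > capture_targets_GRCh38_chrAdded.bed
```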

Note that we can see many reference genomes in /fdb/igenomes/Homo_sapiens/. So, my input file reads:

./data/GCA_000001405.15_GRCh38.bw /fdb/igenomes/Homo_sapiens/NCBI/GRCh38Decoy/Sequence/WholeGenomeFasta/genome.fa

and then I just run:

perl TargetPerla.pl mySourceTarget.txt ../../fake_trios/capture_targets_GRCh38_chrAdded.bed MyTarget_w10000 10000 hg38
perl EXCAVATORDataPrepare.pl ExperimentalFilePrepare.w10000.txt --processors 14 --target MyTarget_w10000 --assembly hg38
perl EXCAVATORDataAnalysis.pl ExperimentalFileAnalysis.w10000.txt --processors 6 --target MyTarget_w10000 --assembly hg38 --output ./OutEXCAVATOR2/Results_MyProject_w10K --mode pooling

I honestly don't know what window to use here, so I'd have to take a look at the paper (apparently, the options are 10000, 20000 or 500000?)... doesn't make sense. Also, the parameter file used in the last command can be highly customized. But let's keep going. I'm also not sure how well it works the way I have formatted it, with all samples working as controls. But that was the only way I could think to test everyone. For now I only have one sample, but the idea is to list every sample as CX and TX.
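Once there are more samples, listing every one as both CX and TX by hand gets tedious, so the analysis file could be generated. A rough sketch only: the helper `make_analysis_file` is hypothetical, and the line layout (label, prepared-data folder, sample name, separated by spaces) is my assumption about what EXCAVATORDataAnalysis.pl expects and needs to be checked against the EXCAVATOR2 manual before use:

```shell
# Print one T line and one C line per sample, so that every sample
# is listed both as a test (TX) and as a control (CX).
# $1 = folder holding the EXCAVATORDataPrepare.pl outputs; remaining args = sample names.
# ASSUMED line layout: <label> <prepared-data folder> <sample name>
make_analysis_file() {
    prep_dir=$1
    shift
    i=1
    for sample in "$@"; do
        echo "T$i $prep_dir/$sample $sample"
        i=$((i + 1))
    done
    i=1
    for sample in "$@"; do
        echo "C$i $prep_dir/$sample $sample"
        i=$((i + 1))
    done
}

# Usage (the folder and sample names here are placeholders):
# make_analysis_file ./DataPrepare_w10000 sampleA sampleB > ExperimentalFileAnalysis.w10000.txt
```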

The last command broke. It could be because there is only one sample, because it's listed as both control and target, or just because the software is buggy. But I'll only be able to figure it out later.

TODO

  • Figure out best window to use
  • Figure out what parameters to use
  • Run several samples in a family to do de novo analysis