XHMM - gsudre/autodenovo GitHub Wiki
09/13/2017
I have successfully used XHMM before, but my concern is that it uses a PCA of the read-depth (RD) by-samples matrix in its analysis. While it's a very cool approach, I'm worried about how well it would work with a reduced number of samples. In any case, here are the commands I used. I did it all on GRCh37, so that's something I'd have to change in the future as well.
It turns out that Biowulf already has XHMM, GATK, and plink/seq installed. So, let's just go ahead and use those modules.
I'm following the XHMM tutorial just to make sure it all works fine. After downloading everything to /data/NCR_SBRB/examples, it's time to run it.
First I had to run picard to create a .dict file for GATK:
java -jar $PICARDJARPATH/picard.jar CreateSequenceDictionary R=human_g1k_v37.fasta O=human_g1k_v37.dict
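GATK also expects a .fai index sitting next to the reference; the tutorial download may already include one, but if not, samtools (available as a Biowulf module) can create it:
module load samtools
samtools faidx human_g1k_v37.fasta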
Then we run the tutorial command, adapted to fit Biowulf, once per group (each group is a different family); a loop version is sketched after the three calls:
GATK -m 30g DepthOfCoverage \
-I group1.READS.bam.list -L EXOME.interval_list \
-R ./human_g1k_v37.fasta \
-dt BY_SAMPLE -dcov 5000 -l INFO --omitDepthOutputAtEachBase --omitLocusTable \
--minBaseQuality 0 --minMappingQuality 20 --start 1 --stop 5000 --nBins 200 \
--includeRefNSites \
--countType COUNT_FRAGMENTS \
-o group1.DATA
GATK -m 30g DepthOfCoverage \
-I group2.READS.bam.list -L EXOME.interval_list \
-R ./human_g1k_v37.fasta \
-dt BY_SAMPLE -dcov 5000 -l INFO --omitDepthOutputAtEachBase --omitLocusTable \
--minBaseQuality 0 --minMappingQuality 20 --start 1 --stop 5000 --nBins 200 \
--includeRefNSites \
--countType COUNT_FRAGMENTS \
-o group2.DATA
GATK -m 30g DepthOfCoverage \
-I group3.READS.bam.list -L EXOME.interval_list \
-R ./human_g1k_v37.fasta \
-dt BY_SAMPLE -dcov 5000 -l INFO --omitDepthOutputAtEachBase --omitLocusTable \
--minBaseQuality 0 --minMappingQuality 20 --start 1 --stop 5000 --nBins 200 \
--includeRefNSites \
--countType COUNT_FRAGMENTS \
-o group3.DATA
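Since the three calls only differ in the group number, the same thing can be written as a loop (just a shorthand for the commands above):
for g in 1 2 3; do
GATK -m 30g DepthOfCoverage \
-I group${g}.READS.bam.list -L EXOME.interval_list \
-R ./human_g1k_v37.fasta \
-dt BY_SAMPLE -dcov 5000 -l INFO --omitDepthOutputAtEachBase --omitLocusTable \
--minBaseQuality 0 --minMappingQuality 20 --start 1 --stop 5000 --nBins 200 \
--includeRefNSites \
--countType COUNT_FRAGMENTS \
-o group${g}.DATA
done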
GATK -m 30g GCContentByInterval -L EXOME.interval_list \
-R ./human_g1k_v37.fasta \
-o ./DATA.locus_GC.txt
cat ./DATA.locus_GC.txt | awk '{if ($2 < 0.1 || $2 > 0.9) print $1}' \
> ./extreme_gc_targets.txt
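In my case this file came out empty (more on that below); a quick way to check:
wc -l ./extreme_gc_targets.txt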
Now we start using XHMM:
xhmm --mergeGATKdepths -o ./DATA.RD.txt \
--GATKdepths group1.DATA.sample_interval_summary \
--GATKdepths group2.DATA.sample_interval_summary \
--GATKdepths group3.DATA.sample_interval_summary
/usr/local/apps/XHMM/2016-01-04/sources/scripts/interval_list_to_pseq_reg EXOME.interval_list > ./EXOME.targets.reg
Note that the pseq steps didn't quite work out, as pseq couldn't find a SEQDB. I can explore that further, but since the awk command above also didn't select any extreme-GC targets, let's just skip the target-exclusion step for now:
xhmm --matrix -r ./DATA.RD.txt --centerData --centerType target \
-o ./DATA.filtered_centered.RD.txt \
--outputExcludedTargets ./DATA.filtered_centered.RD.txt.filtered_targets.txt \
--outputExcludedSamples ./DATA.filtered_centered.RD.txt.filtered_samples.txt \
--minTargetSize 10 --maxTargetSize 10000 \
--minMeanTargetRD 10 --maxMeanTargetRD 500 \
--minMeanSampleRD 25 --maxMeanSampleRD 200 \
--maxSdSampleRD 150
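To see how aggressive that filtering was, the excluded-target and excluded-sample lists can simply be line-counted:
wc -l ./DATA.filtered_centered.RD.txt.filtered_targets.txt \
./DATA.filtered_centered.RD.txt.filtered_samples.txt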
xhmm --PCA -r ./DATA.filtered_centered.RD.txt --PCAfiles ./DATA.RD_PCA
xhmm --normalize -r ./DATA.filtered_centered.RD.txt --PCAfiles ./DATA.RD_PCA \
--normalizeOutput ./DATA.PCA_normalized.txt \
--PCnormalizeMethod PVE_mean --PVE_mean_factor 0.7
xhmm --matrix -r ./DATA.PCA_normalized.txt --centerData --centerType sample --zScoreData \
-o ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt \
--outputExcludedTargets ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt.filtered_targets.txt \
--outputExcludedSamples ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt.filtered_samples.txt \
--maxSdTargetRD 30
xhmm --matrix -r ./DATA.RD.txt \
--excludeTargets ./DATA.filtered_centered.RD.txt.filtered_targets.txt \
--excludeTargets ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt.filtered_targets.txt \
--excludeSamples ./DATA.filtered_centered.RD.txt.filtered_samples.txt \
--excludeSamples ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt.filtered_samples.txt \
-o ./DATA.same_filtered.RD.txt
xhmm --discover -p params.txt \
-r ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt -R ./DATA.same_filtered.RD.txt \
-c ./DATA.xcnv -a ./DATA.aux_xcnv -s ./DATA
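The tutorial follows discovery with a genotyping step, which is what produces the per-sample quality scores (SQ, NQ) used for the de novo filtering described below. From memory, the call looks roughly like this, but it's worth double-checking the flags against the tutorial copy that ships with the Biowulf module:
xhmm --genotype -p params.txt \
-r ./DATA.PCA_normalized.filtered.sample_zscores.RD.txt -R ./DATA.same_filtered.RD.txt \
-g ./DATA.xcnv -F ./human_g1k_v37.fasta \
-v ./DATA.vcf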
OK, so it looks like it's working. It's time to run it with our data, and see what we get.
09/18/2017
The XHMM authors call de novo variants in an interesting way:
"To answer this in the affirmative, we require that SQ ≥ Q in the child and that NQ ≥ Q in each of the parents so that we are confident that the read depths support at least some deletion event existing in the child's exome and not a hint of a deletion in either parent."
And then there's a similar criterion for the child being confidently non-diploid while both parents are confidently diploid. We still have an issue with needing many samples, but we should have enough there.
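Just to make the deletion rule concrete, it boils down to a check like the one below (Q is whatever quality threshold we pick, and the SQ/NQ values here are made up, not real output):
Q=60          # quality threshold (example value)
child_SQ=75   # child's "some deletion" quality (made up)
dad_NQ=90     # father's "no deletion" quality (made up)
mom_NQ=88     # mother's "no deletion" quality (made up)
if [ "$child_SQ" -ge "$Q" ] && [ "$dad_NQ" -ge "$Q" ] && [ "$mom_NQ" -ge "$Q" ]; then
echo "candidate de novo deletion in the child"
fi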
09/22/2017
All fake families
Let's run this with the data we just got out of the GATK pipeline. It looks like it'll take about 1h per sample to run DepthOfCoverage, so let's swarm it:
while read s; do echo "bash ~/autodenovo/xhmm_get_DOC.sh $s" >> xhmm/swarm.DOC; done < sample_ids.txt
cd xhmm
swarm -f swarm.DOC -t 4 -g 55 --job-name xhmmDOC --logdir trash --time=48:00:00 --gres=lscratch:100
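For reference, xhmm_get_DOC.sh is essentially a per-sample version of the DepthOfCoverage call from the tutorial run above. This is only a sketch of what that amounts to, not the actual script; the BAM path and output prefix are placeholders:
s=$1
# the BAM location below is hypothetical; the real script points at the GATK pipeline output
GATK -m 30g DepthOfCoverage \
-I ${s}.bam -L EXOME.interval_list \
-R ./human_g1k_v37.fasta \
-dt BY_SAMPLE -dcov 5000 -l INFO --omitDepthOutputAtEachBase --omitLocusTable \
--minBaseQuality 0 --minMappingQuality 20 --start 1 --stop 5000 --nBins 200 \
--includeRefNSites \
--countType COUNT_FRAGMENTS \
-o ${s}.DOC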
And when the results are ready (use an interactive node; it takes less than 1h):
cp /usr/local/apps/XHMM/2016-01-04/params.txt .
bash ~/autodenovo/xhmm_eval.sh
TODO
- Try a different window size (more targets, bigger matrix)
- Write a script to find de novo mutations in trios