SNPchip - gsudre/autodenovo GitHub Wiki

09/08/2017

I'm currently struggling with a few things here:

  • The IDs for participants' SNP chip data does not correspond to their IDs in the WES data. I need to wait for my co-worker get back next Monday to get the maping from him.
  • Participants were genotyped in two different arrays, so I need to come up with different processing depending on the array in which samples were run.
  • Need to find Chr and position for all SNPs in each array. I might need to re-export the project final report files using BeadStudio.

Bullet 2 is not necessarily an issue, as the data I'l, be finally using won't be this. I'm currently using a standby just to test the scripts. So, hopefully I won't have this issue with the actual data.

But in the spirit that I'm just trying to get this to work, I don't care about the participant identity. I did find files in the Illumina website to give me population frequency for B allele (PFB), so I don't have to calculate it in my own data. https://support.illumina.com/array/downloads.html and ftp://webdata:[email protected]/Downloads/ProductFiles/HumanExome/ProductSupportFiles/HumanExome-12v1-2_A_MAF.txt

So, I should be about ready to call CNVs for a set of participants, and then do the de novo calculations.

09/15/2017

I just found this other option: http://www.bioconductor.org/packages/release/bioc/html/MinimumDistance.html which according to their paper is a bit better than PennCNV in some cases. Worth trying?

09/21/2017

I got PennCNV working. I should still play with a couple things (see below) to change the algorithm sensitivity, but at least we can get results now:

 detect_cnv.pl -test -hmm example.hmm -pfb HumanExome-12v1-2_A_MAF.pbf -log sampleall.log fs_ccgo_box4/CCGO_800986 fs_ccgo_box4/CCGO_800984 fs_ccgo_box1/CCGO_800734 -out 9020_trio1.rawcnv
 detect_cnv.pl -trio -hmm example.hmm -pfb HumanExome-12v1-2_A_MAF.pbf -cnv 9020_trio1.rawcnv  fs_ccgo_box4/CCGO_800986 fs_ccgo_box4/CCGO_800984 fs_ccgo_box1/CCGO_800734 -out 9020_trio1.triocnv

Then we can look at the .triocnv file and test each of the CNVs in the offspring but not in one of the parents. All the parameters to be used in the call are in that file. Unfortunately for this sample we didn't find anything. For example, take this CNV:

chr15:35834615-35834667       numsnp=3      length=53          state2,cn=1 fs_ccgo_box4/CCGO_800984 startsnp=exm1147192 endsnp=exm1147195 mother triostate=322
chr15:35834615-35834667       numsnp=3      length=53          state2,cn=1 fs_ccgo_box1/CCGO_800734 startsnp=exm1147192 endsnp=exm1147195 offspring triostate=322

It's also in the mom. But say it wasn't. Then, to test it would look like something this:

infer_snp_allele.pl -pfb HumanExome-12v1-2_A_MAF.pbf -hmm example.hmm -denovocn 1 fs_ccgo_box4/CCGO_800986 fs_ccgo_box4/CCGO_800984 fs_ccgo_box1/CCGO_800734 -out tempfile --startsnp exm1147192 --endsnp exm1147195

Then I started trying to create the GC content file for adjustment. It needs 500Kb around each SNP position, so in R:

pos = read.table('HumanExome-12v1-2_A_MAF.pbf')
start=pos[,3]-500000
end=pos[,3]+500000
start[start<0] = 0
chrs = vector()
for (i in pos[2]) {
    chrs = c(chrs, sprintf('chr%s', i))
}
df=data.frame(chr=chrs,start=start,end=end, rsid=pos[,1], pos=pos[,3])  # might need rsid and pos later for merging
keep_me = df$chr!='chrMT' & df$chr!='chrXY' & df$chr!='chrX' & df$chr!='chrY'
df = df[keep_me,]
# compromising based on chromosome lengths based on error outputs from bedtools
df[df$chr=='chr19' & df$end > 59128983,]$end = 59128983
df[df$chr=='chr16' & df$end > 90354753,]$end = 90354753
df[df$chr=='chr5' & df$end > 180915260,]$end = 180915260
df[df$chr=='chr22' & df$end > 51304566,]$end = 51304566
df[df$chr=='chr18' & df$end > 78077248,]$end = 78077248
df[df$chr=='chr8' & df$end > 146364022,]$end = 146364022
df[df$chr=='chr12' & df$end > 133851895,]$end = 133851895
df[df$chr=='chr13' & df$end > 115169878,]$end = 115169878
df[df$chr=='chr15' & df$end > 102531392,]$end = 102531392
df[df$chr=='chr17' & df$end > 81195210,]$end = 81195210
df[df$chr=='chr20' & df$end > 63025520,]$end = 63025520
df[df$chr=='chr21' & df$end > 48129895,]$end = 48129895
df[df$chr=='chr10' & df$end > 135534747,]$end = 135534747
df[df$chr=='chr1' & df$end > 249250621,]$end = 249250621
df[df$chr=='chr2' & df$end > 243199373,]$end = 243199373
df[df$chr=='chr3' & df$end > 198022430,]$end = 198022430
df[df$chr=='chr7' & df$end > 159138663,]$end = 159138663
df[df$chr=='chr4' & df$end > 191154276,]$end = 191154276
df[df$chr=='chr6' & df$end > 171115067,]$end = 171115067
df[df$chr=='chr9' & df$end > 141213431,]$end = 141213431
df[df$chr=='chr14' & df$end > 107349540,]$end = 107349540
df[df$chr=='chr11' & df$end > 135006516,]$end = 135006516
write.table(df,file='for_gc.bed',row.names=F,col.names=F,sep='\t',quote=F)

Then we do:

module load bedtools
bedtools nuc -fi /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa -bed for_gc.bed > gc_content.txt

And go back to R:

gc=read.table('gc_content.txt')
df = gc[,c(4,1,5,7)]
df[,4] = df[,4]*100
 write.table(df,file='HumanExome-12v1-2_A.h19.gcmodel',row.names=F,col.names=F,sep='\t',quote=F)

Now it's just a matter of re-running PennCNV:

detect_cnv.pl -test -hmm example.hmm -pfb HumanExome-12v1-2_A_MAF.pbf -log sampleall.log fs_ccgo_box4/CCGO_800986 fs_ccgo_box4/CCGO_800984 fs_ccgo_box1/CCGO_800734 -out 9020_trio1.adjusted.rawcnv -gcmodel HumanExome-12v1-2_A.h19.gcmodel

And the other commands stay the same.

NOTE that this GC file is exclusive to this array. We'll need to generate it again for other arrays! It's also a good idea to test the two different files, in case there are issues with this adjustment function.

I then created a quick script to check the box each of my subjects was in. Note that this might not be an issue in the future, only if different chips were used for genotyping. In any case, here are the results:

[sudregp@cn3125 penncnv]$ for f in `ls ../*trio*ped`; do echo $f;  bash ~/autodenovo/penncnv_check_chip.sh $f; done
../9004_trio.ped
CCGO_800754 in fs_ccgo_box1 = Content HumanExome-12v1-2_A.bpm
CCGO_800753 in fs_ccgo_box1 = Content HumanExome-12v1-2_A.bpm
CCGO_800736 in fs_ccgo_box1 = Content HumanExome-12v1-2_A.bpm
../9005a_trio.ped
CCGO_800824 in fs_ccgo_box2 = Content HumanExome-12v1-2_A.bpm
CCGO_800825 in fs_ccgo_box3 = Content HumanExome-12v1-2_A.bpm
CCGO_800878 in fs_ccgo_box5 = Content HumanExome-12v1-2_A.bpm
../9005b_trio1.ped
CCGO_800818 in fs_ccgo_box2 = Content HumanExome-12v1-2_A.bpm
CCGO_800821 in fs_ccgo_box2 = Content HumanExome-12v1-2_A.bpm
CCGO_800820 in fs_ccgo_box2 = Content HumanExome-12v1-2_A.bpm
../9005b_trio2.ped
CCGO_800821 in fs_ccgo_box2 = Content HumanExome-12v1-2_A.bpm
CCGO_800820 in fs_ccgo_box2 = Content HumanExome-12v1-2_A.bpm
CCGO_800819 in fs_ccgo_box2 = Content HumanExome-12v1-2_A.bpm
../9010a_trio.ped
CLIA_400099 in fs_clia_box2 = Content InfiniumExome-24v1-0_A1.bpm
CLIA_400100 in fs_clia_box2 = Content InfiniumExome-24v1-0_A1.bpm
CCGO_800802 in fs_ccgo_box2 = Content HumanExome-12v1-2_A.bpm
../9010b_trio.ped
CCGO_800790 in fs_ccgo_box2 = Content HumanExome-12v1-2_A.bpm
CLIA_400042 in fs_clia_box1 = Content InfiniumExome-24v1-0_A1.bpm
CLIA_400101 in fs_clia_box2 = Content InfiniumExome-24v1-0_A1.bpm
../9010c_trio1.ped
CCGO_800800 in fs_ccgo_box2 = Content HumanExome-12v1-2_A.bpm
CCGO_800916 in fs_ccgo_box5 = Content HumanExome-12v1-2_A.bpm
CCGO_800801 in fs_ccgo_box2 = Content HumanExome-12v1-2_A.bpm
../9010c_trio2.ped
CCGO_800916 in fs_ccgo_box5 = Content HumanExome-12v1-2_A.bpm
CCGO_800801 in fs_ccgo_box2 = Content HumanExome-12v1-2_A.bpm
CLIA_400084 in fs_clia_box3 = Content InfiniumExome-24v1-0_A1.bpm
../9011_trio.ped
CCGO_800944 in fs_ccgo_box5 = Content HumanExome-12v1-2_A.bpm
CCGO_800945 in fs_ccgo_box5 = Content HumanExome-12v1-2_A.bpm
CLIA_400044 in fs_clia_box1 = Content InfiniumExome-24v1-0_A1.bpm
../9016_trio.ped
CLIA_400048 in fs_clia_box1 = Content InfiniumExome-24v1-0_A1.bpm
CLIA_400049 in fs_clia_box1 = Content InfiniumExome-24v1-0_A1.bpm
CCGO_800793 in fs_ccgo_box2 = Content HumanExome-12v1-2_A.bpm
../9019_trio1.ped
CCGO_800948 in fs_ccgo_box5 = Content HumanExome-12v1-2_A.bpm
CCGO_801004 in fs_ccgo_box6 = Content InfiniumExome-24v1-0_A1.bpm
CLIA_400098 in fs_clia_box2 = Content InfiniumExome-24v1-0_A1.bpm
../9019_trio2.ped
CCGO_800948 in fs_ccgo_box5 = Content HumanExome-12v1-2_A.bpm
CCGO_801004 in fs_ccgo_box6 = Content InfiniumExome-24v1-0_A1.bpm
CLIA_400097 in fs_clia_box2 = Content InfiniumExome-24v1-0_A1.bpm
../9020_trio1.ped
CCGO_800986 in fs_ccgo_box4 = Content HumanExome-12v1-2_A.bpm
CCGO_800734 in fs_ccgo_box1 = Content HumanExome-12v1-2_A.bpm
CCGO_800984 in fs_ccgo_box4 = Content HumanExome-12v1-2_A.bpm
../9020_trio2.ped
CCGO_800986 in fs_ccgo_box4 = Content HumanExome-12v1-2_A.bpm
CCGO_800984 in fs_ccgo_box4 = Content HumanExome-12v1-2_A.bpm
CCGO_800985 in fs_ccgo_box4 = Content HumanExome-12v1-2_A.bpm
../9030_trio1.ped
CCGO_800980 in fs_ccgo_box4 = Content HumanExome-12v1-2_A.bpm
CCGO_800977 in fs_ccgo_box4 = Content HumanExome-12v1-2_A.bpm
CCGO_800976 in fs_ccgo_box4 = Content HumanExome-12v1-2_A.bpm
../9030_trio2.ped
CCGO_800979 in fs_ccgo_box4 = Content HumanExome-12v1-2_A.bpm
CCGO_800977 in fs_ccgo_box4 = Content HumanExome-12v1-2_A.bpm
CCGO_800976 in fs_ccgo_box4 = Content HumanExome-12v1-2_A.bpm
../9034_trio1.ped
CLIA_400095 in fs_clia_box2 = Content InfiniumExome-24v1-0_A1.bpm
CLIA_400094 in fs_clia_box3 = Content InfiniumExome-24v1-0_A1.bpm
CCGO_800990 in fs_ccgo_box4 = Content HumanExome-12v1-2_A.bpm
../9034_trio2.ped
CLIA_400071 in fs_clia_box3 = Content InfiniumExome-24v1-0_A1.bpm
CLIA_400094 in fs_clia_box3 = Content InfiniumExome-24v1-0_A1.bpm
CCGO_800990 in fs_ccgo_box4 = Content HumanExome-12v1-2_A.bpm

So, in 9 out of my 18 fake trios I have mixed chips. Let's find a way to use only the intersection of SNPs in both chips in those cases. If we don't miss much, we might end up using the intersection config files for everybody. But let's see.

Out of the 243346 SNPs in InfiniumExome and 244771 in HumanExome, 240953 are repeated. So, we should be OK. I'll operate with an intersection file from now on. Let's create the PFB file first:

comm -12 <(cut -f 1 HumanExome-12v1-2_A_MAF.txt | sort) <(cut -f 1 InfiniumExome-24v1-0_A1_PopulationReport_MAF.txt | sort) > unique_snps.txt
while read s; do grep $s HumanExome-12v1-2_A_MAF.pbf >> filtered_MAF.pfb; done < unique_snps.txt
while read s; do grep $s HumanExome-12v1-2_A.h19.gcmodel >> filtered.h19.gcmodel; done < unique_snps.txt

09/25/2017

Just because it takes less than 5 minutes to run a trio, let's run them in a for loop instead of swarming:

for t in `ls -1 *trio*ped | sed -e 's/\.ped//'`; do 
   bash ~/autodenovo/penncnv_run_trio.sh $t;
done

Now we're ready to check for de novo mutations.

TODO

  • Write scripts to run it for everybody, and then to test each CNV that looks denovo
  • Play with hmm file to increase/decrease sensitivity of denovo findings