selscan - christianparobek/cambodiaWGS GitHub Wiki
selscan is a new program that looks for positive selection. Source code and binaries here. To compile on Kure, must comment the OSX lines and uncomment the Linux lines in the src/Makefile. The make selscan and put that binary in /proj/julianog/bin.
Implements the following three methods:
- EHH looks at haplotype-block decay in a population.
- XP-EHH will compare haploype-block decay between two populations.
- iHS somehow builds on EHH.
- nSL is similar to iHS but doesn't require a genetic map (in cM) and can detect soft sweeps. More robust to recombination rate variation than the others.
So one issue is going to be that the VCF GT field must be coded X/X, diploid style. Not sure if this is exactly fair, but going to turn 0 into 0/0 and 1 into 1/1.
sed 's/\t0:/\t0\/0:/g' our_goods_UG.pass.vcf | sed 's/\t1:/\t1\/1:/g' > our_goods_UG.pass.diploid.vcf
Also, apparently I need to split by chromosome:
for i in 01 02 03 04 05 06 07 08 09 10 11 12 13 14
do
grep "^#\|^Pv_Sal1_chr$i" our_goods_UG.pass.diploid.vcf > chr$i.vcf
done
####2016-01-13 I wasn't able to find any signatures of selection in my P. vivax Cambodia data. Not sure if that's because I don't know what I'm doing or because there's nothing there to find. So, I'm going to run nSL on the Gambia genomes to see if I can find what they found. Paper here. Details of how I got the genomes here.
# make it diploid style
sed 's/\t0:/\t0\/0:/g' guinea.pass.vcf | sed 's/\t1:/\t1\/1:/g' > guinea.pass.diploid.vcf
# split by chromosome
for i in 01 02 03 04 05 06 07 08 09 10 11 12 13 14
do
grep "^#\|^Pf3D7_$i\_v3" guinea.pass.diploid.vcf > chr$i.vcf
done
# calculate nSL
for i in 01 02 03 04 05 06 07 08 09 10 11 12 13 14
do
../bin/linux/selscan --nsl --vcf chr$i.vcf --out chr$i.res --skip-low-freq --maf 0.05
done
## DO I NEED TO NORMALIZE??
./norm --ihs --files chr$i.res.nsl.out > chr$i/.res/nsl/norm
A potential problem I just found is with the format of the VCF file. I'm having to convert it to diploid by adding pseudo haplotypes in there... is that a problem? I don't think so from looking at the nSL calculation... at least not for un-normalized, because the ancestral freq is always divided by derived frequency.
Also, instead of filtering out minor alleles in GATK, I could filter out in selscan:
--skip-low-freq --maf 0.05 #(which is the default)
Do it for P. vivax:
sed 's/\t0:/\t0\/0:/g' our_goods_UG.pass.vcf | sed 's/\t1:/\t1\/1:/g' > our_goods_UG.pass.diploid.vcf
for i in 01 02 03 04 05 06 07 08 09 10 11 12 13 14
do
grep "^#\|^Pv_Sal1_chr$i" our_goods_UG.pass.diploid.vcf > chr$i.vcf
../bin/linux/selscan --nsl --vcf chr$i.vcf --out chr$i.res --skip-low-freq --maf 0.05
done
This gave kind of a gimish of results - no locus stood out as under strong positive selection in the P. vivax genome. Even though nSL is supposed to be more sensitive and easier to use than iHS, we still want to use iHS to as well because it's more established and more commonly used. Problem is to make a genetic map. Apparently this can be done using LDhat...
####2016-01-26
Some clues: Auton et al. published genetic maps in both Science (methods here) and PLOS Genetics (methods here). In both cases, he split the genome into 4000 SNP windows with a 200 SNP overlap and estimated recombination rates independently for each window using LDhat interval: block penalty 5, 60M MCMC iterations, sampling every 40K iterations, and the first 500 samplings (i.e. first 20M iterations) were discarded. Moreover, any intervals with an estimate of >= 100 was zeroed out, and the surrounding 50 SNPs in each direction were also zeroed out. Once you have rho values, you must multiply by 100/4Ne to convert to cM for selscan. Does this apply to haploids?
####2016-02-16
Just getting back to this now. Chang used LDhat in her MBE paper to make a genetic map for the iHS analysis. Seems like she didn't break up the genome into little blocks for subanalysis like Auton did. Looking into this more, it looks like Auton recommends breaking up into blocks if you have a huge genome segment (multiple Mb) and hope to estimate mean rho over the whole segment. Although stat produces an estimate of mean rho, I don't think we care about that, so perhaps we can forego breaking into blocks. Here's a nice and transparent explanation of some legume recombination.
####2016-02-17 Here's the plan:
VCF --> VCFtools --> ldhat.sites & ldhat.locs --> LDhat --> map --> plinkize the map --> map --> selscan (iHS) --> normalize within allele-freq bins --> convert to p-values --> plot the p-values.
And now some notes about it. When VCFtools converts VCF -> locs file, it puts physical coords in KB! This is good. And the last three steps (normalizing, converting to p-values, and plotting that) is taken from Chang 2012.
####2016-02-03 Wrote an R script ``
iHS output headers:
<locusID> <physicalPos> <'1' freq> <ihh1> <ihh0> <unstandardized iHS>
####2016-03-01 I have the nSL and iHS working now. For iHS, I needed to raise the ehh cutoff because at the default 0.05 it was hitting the ends of the chromosomes for all chrs except 12, 13, 14 which are presumably the longest. Now that we have some loci identified, we want to look at these more closely using EHH. The following EHH command works to give some results, but haven't had time to look at them yet or understand what they mean. This is for the big nSL value on chromosome 14 in the pv_mono group.
selscan --ehh locus1511 --vcf /proj/julianog/users/ChristianP/cambodiaWGS/selscan/pv_mono/chr14.vcf --map /proj/julianog/users/ChristianP/cambodiaWGS/geneticMap/pv_mono/14.map --out ehh_pvmono_chr14
One idea APM used was to draw those cool graphs using rehh.