resSNP - christianparobek/cambodiaWGS GitHub Wiki

Want to characterize frequency of mutations in known or putative Pv and Pf drug resistance genes in our Pv cohort and in Pf CP2.

These are the P. vivax genes we want to look at:

Name	ID	Chr	Start	End	+/-
pvkelch	PVX_083080	Pv_Sal1_chr12	447729	449867	+
pvcrt	PVX_087980	Pv_Sal1_chr01	330260	334540	+
pvmdr1	PVX_080100	Pv_Sal1_chr10	361701	366095	-
pvmdr2	PVX_118100	Pv_Sal1_chr12	2412009	2416826	+
pvmrp1	PVX_097025	Pv_Sal1_chr02	153642	158822	-
abcb7	PVX_084521	Pv_Sal1_chr13	385659	389101	-
dhfr	PVX_089950	Pv_Sal1_chr05	964590	966464	+
dhps	PVX_123230	Pv_Sal1_chr14	1256701	1259581	-

And these are the P. falciparum genes we want to look at:

Name	ID	Chr	Start	End	+/-
kelch K13	PF3D7_1343700	Pf3D7_13_v3	1724817	1726997	-
pfcrt	PF3D7_0709000	Pf3D7_07_v3	403222	406317	+
pfmdr1	PF3D7_0523000	Pf3D7_05_v3	957890	962149	+
pfmdr2	PF3D7_1447900	Pf3D7_14_v3	1954601	1957675	-
pfmrp1	PF3D7_0112200	Pf3D7_01_v3	464726	470194	+
pfdhps	PF3D7_0810800	Pf3D7_08_v3	548200	550616	+
pfdhfr	PF3D7_0417200	Pf3D7_04_v3	748088	749914	+

And these are the P. vivax nSL hits we want to look at:

Name	ID	Chr	Start	End	+/-
pvmrp1	PVX_097025	Pv_Sal1_chr02	153,642	158,822	-
sera5	PVX_003830	Pv_Sal1_chr04	572,172	575,852	+
sera4	PVX_003825	Pv_Sal1_chr04	578,341	582,085	+
ApiAP2	PVX_092570	Pv_Sal1_chr09	1,515,724	1,524,557	-
AP2-O	PVX_092760	Pv_Sal1_chr09	1,700,267	1,706,017	-
pvmdr1	PVX_080100	Pv_Sal1_chr10	361,701	366,095	-
SET-domain	PVX_114585	Pv_Sal1_chr11	795,907	816,553	-
ApiAP2	PVX_113370	Pv_Sal1_chr11	1,863,442	1,869,261	+
pvmdr2	PVX_118100	Pv_Sal1_chr12	2,412,009	2,416,826	+
abcB7	PVX_084521	Pv_Sal1_chr13	385,659	389,101	-
ApiAP2	PVX_122680	Pv_Sal1_chr14	785,708	793,408	+
HP1	PVX_123682	Pv_Sal1_chr14	1,651,896	1,652,723	-
SET10	PVX_123685	Pv_Sal1_chr14	1,660,815	1,666,706	-

Then make these into BED files, intersect with the appropriate VCF files, and run snpEff.

It looks like I'm going to have to look at allele frequencies in the unfiltered VCF file (i.e. the original one before any filters coverage, depth, or quality filters were applied). Perhaps in this case we should just look for previously characterized SNPs? JON AGREES! Should we look in the entire Pf and Pv populations, or just in CP2 and mono? Maybe I should look in just CP2 and mono because part of the reason we're doing this is to provide evidence for why we're not seeing additional sweeps in P. falciparum CP2. JON AGREES! So need to go back to the original P. vivax and P. falciparum VCF and run the subsetting functions on those directly.

Made a bash script for getting the resistance genes regions out of the VCF file of interest, annotating those variants, then runs an Rscript that digests those variants and outputs a usable table format.

This output table got tricky in the case of the nsl results since so many had multiple variants. I ended up having to do lots of gedit manipulation of the file to subset it down to just the non-synonymous variants that occurred in at least two samples. This file is ~600 lines long, so turning it into Supplemental Appendix A.

This helped me get just the variants that occurred in more than one sample. Couldn't do a simple grep -v "0.036" file.txt because that gets rid of the tri-allelic sites with one allele occurring in only one sample. So had to do the following:

grep "0.071\|0.107\|0.143\|0.179\|0.214\|0.250\|0.286\|0.321\|0.357\|0.393\|0.429\|0.464\|0.500\|0.536\|0.571\|0.607\|0.643\|0.679\|0.714\|0.750\|0.786\|0.821\|0.857\|0.893\|0.929\|0.964\|1.000" nonsyn.txt > nonsyn_mults.txt