resSNP - christianparobek/cambodiaWGS GitHub Wiki

Want to characterize frequency of mutations in known or putative Pv and Pf drug resistance genes in our Pv cohort and in Pf CP2.

These are the P. vivax genes we want to look at:

Name ID Chr Start End +/-
pvkelch PVX_083080 Pv_Sal1_chr12 447729 449867 +
pvcrt PVX_087980 Pv_Sal1_chr01 330260 334540 +
pvmdr1 PVX_080100 Pv_Sal1_chr10 361701 366095 -
pvmdr2 PVX_118100 Pv_Sal1_chr12 2412009 2416826 +
pvmrp1 PVX_097025 Pv_Sal1_chr02 153642 158822 -
abcb7 PVX_084521 Pv_Sal1_chr13 385659 389101 -
dhfr PVX_089950 Pv_Sal1_chr05 964590 966464 +
dhps PVX_123230 Pv_Sal1_chr14 1256701 1259581 -

And these are the P. falciparum genes we want to look at:

Name ID Chr Start End +/-
kelch K13 PF3D7_1343700 Pf3D7_13_v3 1724817 1726997 -
pfcrt PF3D7_0709000 Pf3D7_07_v3 403222 406317 +
pfmdr1 PF3D7_0523000 Pf3D7_05_v3 957890 962149 +
pfmdr2 PF3D7_1447900 Pf3D7_14_v3 1954601 1957675 -
pfmrp1 PF3D7_0112200 Pf3D7_01_v3 464726 470194 +
pfdhps PF3D7_0810800 Pf3D7_08_v3 548200 550616 +
pfdhfr PF3D7_0417200 Pf3D7_04_v3 748088 749914 +

And these are the P. vivax nSL hits we want to look at:

Name ID Chr Start End +/-
pvmrp1 PVX_097025 Pv_Sal1_chr02 153642 158822 -
sera5 PVX_003830 Pv_Sal1_chr04 572172 575852 +
sera4 PVX_003825 Pv_Sal1_chr04 578341 582085 +
ApiAP2 PVX_092570 Pv_Sal1_chr09 1515724 1524557 -
AP2-O PVX_092760 Pv_Sal1_chr09 1700267 1706017 -
pvmdr1 PVX_080100 Pv_Sal1_chr10 361701 366095 -
SET-domain PVX_114585 Pv_Sal1_chr11 795907 816553 -
ApiAP2 PVX_113370 Pv_Sal1_chr11 1863442 1869261 +
pvmdr2 PVX_118100 Pv_Sal1_chr12 2412009 2416826 +
abcB7 PVX_084521 Pv_Sal1_chr13 385659 389101 -
ApiAP2 PVX_122680 Pv_Sal1_chr14 785708 793408 +
HP1 PVX_123682 Pv_Sal1_chr14 1651896 1652723 -
SET10 PVX_123685 Pv_Sal1_chr14 1660815 1666706 -

Then make these into BED files, intersect with the appropriate VCF files, and run snpEff.
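A minimal sketch of the table-to-BED step, in case it helps. It assumes the Start/End values in the tables above are 1-based inclusive (as in a GFF), so Start is shifted down by one to get BED's 0-based half-open intervals. The function name genes_to_bed is made up for illustration. Columns are read from the right-hand side so that a two-word name like "kelch K13" doesn't shift the coordinates, and thousands separators are stripped if present.

```bash
# Convert a whitespace-separated gene table (Name ID Chr Start End Strand,
# with a header line) into BED6 on stdout.
# ASSUMPTION: Start/End are 1-based inclusive, so BED start = Start - 1.
genes_to_bed() {
  awk 'NR > 1 && NF >= 6 {
    strand = $NF; end = $(NF - 1); start = $(NF - 2); chr = $(NF - 3)
    gsub(",", "", start); gsub(",", "", end)   # tolerate 1,234,567 style numbers
    printf "%s\t%d\t%d\t%s\t0\t%s\n", chr, start - 1, end, $1, strand
  }'
}

# Two rows copied from the P. vivax table above.
genes_to_bed <<'EOF'
Name ID Chr Start End +/-
pvkelch PVX_083080 Pv_Sal1_chr12 447729 449867 +
pvcrt PVX_087980 Pv_Sal1_chr01 330260 334540 +
EOF
```

The first output line should be Pv_Sal1_chr12, 447728, 449867, pvkelch, 0, + (tab-separated). The score column is a placeholder 0.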

It looks like I'm going to have to look at allele frequencies in the unfiltered VCF file (i.e. the original one, before any coverage, depth, or quality filters were applied). Perhaps in this case we should just look for previously characterized SNPs? JON AGREES! Should we look in the entire Pf and Pv populations, or just in CP2 and mono? Maybe I should look in just CP2 and mono, because part of the reason we're doing this is to provide evidence for why we're not seeing additional sweeps in P. falciparum CP2. JON AGREES! So I need to go back to the original P. vivax and P. falciparum VCFs and run the subsetting functions on those directly.
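The subsetting functions themselves aren't shown here, so the following is only a stand-in sketch of what restricting a VCF to a list of samples can look like in plain awk. The function name subset_vcf_samples and the sample list file are hypothetical. It relies on the standard VCF layout (nine fixed columns, then one column per sample) and does not recompute INFO fields such as AC/AF or drop sites that become invariant in the subset.

```bash
# Keep only the samples named in a list file (one sample name per line).
# usage: subset_vcf_samples keep_samples.txt < in.vcf > out.vcf
# ASSUMPTION: uncompressed, tab-delimited VCF on stdin.
subset_vcf_samples() {
  awk -v list="$1" '
    BEGIN {
      FS = OFS = "\t"
      while ((getline name < list) > 0) if (name != "") keep[name] = 1
    }
    /^##/ { print; next }                     # meta lines pass through untouched
    /^#CHROM/ {                               # choose columns from the header line
      ncol = 0
      for (i = 1; i <= NF; i++) if (i <= 9 || ($i in keep)) cols[++ncol] = i
    }
    {                                         # header and records: print chosen columns
      line = $(cols[1])
      for (j = 2; j <= ncol; j++) line = line OFS $(cols[j])
      print line
    }
  '
}
```

A dedicated tool would be the safer choice for real data. This is just to make the step concrete.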

Made a bash script that pulls the resistance gene regions out of the VCF file of interest, annotates those variants, then runs an Rscript that digests the annotated variants and outputs a usable table.
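The actual script isn't reproduced on this page. A rough sketch of the three steps it describes might look like the following. The function name, the output file suffixes, the R script name digest_variants.R and its two arguments, and the snpEff database name passed in are all placeholders, not the real ones. The bedtools and snpEff invocations use their standard forms (bedtools intersect with -header, -u, -a, -b; snpEff as "genome_version input.vcf" writing to stdout).

```bash
# Sketch of: extract gene regions -> annotate -> digest to a table.
# usage: resgene_pipeline in.vcf genes.bed snpeff_db out_prefix
# ASSUMPTIONS: bedtools, snpEff and Rscript are installed; SNPEFF can be set
# to however snpEff is launched locally; digest_variants.R is hypothetical.
resgene_pipeline() {
  local vcf=$1 bed=$2 db=$3 prefix=$4

  # 1. Keep variants overlapping the BED intervals. -header keeps the VCF
  #    header; -u reports each variant once even if intervals overlap.
  bedtools intersect -header -u -a "$vcf" -b "$bed" > "$prefix.genes.vcf" || return 1

  # 2. Annotate with snpEff (deliberately unquoted so the default splits into words).
  ${SNPEFF:-java -jar snpEff.jar} "$db" "$prefix.genes.vcf" > "$prefix.ann.vcf" || return 1

  # 3. Hand the annotated VCF to an R script that writes the final table.
  Rscript "${DIGEST_SCRIPT:-digest_variants.R}" "$prefix.ann.vcf" "$prefix.table.txt"
}

# Example call (file and database names are placeholders):
# resgene_pipeline pv_unfiltered.vcf pv_resgenes.bed My_Pv_Sal1_db pv_res
```

Nothing runs until the function is called, so the database name and paths can be swapped in for whatever the local setup uses.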

This output table got tricky in the case of the nSL results, since so many of the hits had multiple variants. I ended up doing a lot of manual editing in gedit to subset the file down to just the non-synonymous variants that occurred in at least two samples. The file is ~600 lines long, so I'm turning it into Supplemental Appendix A.

The grep below gave me just the variants that occurred in more than one sample. I couldn't do a simple grep -v "0.036" file.txt (0.036 being the frequency of an allele seen in a single sample), because that also throws out tri-allelic sites where only one of the alleles occurs in a single sample. So had to do the following:

grep "0\.071\|0\.107\|0\.143\|0\.179\|0\.214\|0\.250\|0\.286\|0\.321\|0\.357\|0\.393\|0\.429\|0\.464\|0\.500\|0\.536\|0\.571\|0\.607\|0\.643\|0\.679\|0\.714\|0\.750\|0\.786\|0\.821\|0\.857\|0\.893\|0\.929\|0\.964\|1\.000" nonsyn.txt > nonsyn_mults.txt
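The frequencies in that pattern look like k/28 for k = 2 to 28 (with 0.036 being 1/28), so the list can be generated rather than typed. The sketch below assumes 28 samples, which is my inference from the numbers and not something stated in these notes. The function names are made up. It uses grep -F so the dots are matched literally, and passes the patterns as a newline-separated list, which grep treats as alternatives.

```bash
# Print k/n to three decimals for k = 2..n, one value per line.
freq_patterns() {
  awk -v n="$1" 'BEGIN { for (k = 2; k <= n; k++) printf "%.3f\n", k / n }'
}

# Keep input lines containing at least one of those frequencies.
# usage: keep_multi_sample 28 < nonsyn.txt > nonsyn_mults.txt
keep_multi_sample() {
  grep -F "$(freq_patterns "$1")"
}

# Toy input (invented lines, not real data): siteA is a singleton,
# siteC is tri-allelic with one singleton allele and should be kept.
printf '%s\n' 'siteA 0.036' 'siteB 0.071' 'siteC 0.036 0.964' | keep_multi_sample 28
```

On the toy input this prints the siteB and siteC lines. It is still a substring match, so a value like 0.071 would also match inside a longer number. If the frequencies sit in a known column, comparing that column numerically would be more robust.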