resSNP - christianparobek/cambodiaWGS GitHub Wiki
Want to characterize frequency of mutations in known or putative Pv and Pf drug resistance genes in our Pv cohort and in Pf CP2.
These are the P. vivax genes we want to look at:
| Name | ID | Chr | Start | End | +/- |
|---|---|---|---|---|---|
| pvkelch | PVX_083080 | Pv_Sal1_chr12 | 447729 | 449867 | + |
| pvcrt | PVX_087980 | Pv_Sal1_chr01 | 330260 | 334540 | + |
| pvmdr1 | PVX_080100 | Pv_Sal1_chr10 | 361701 | 366095 | - |
| pvmdr2 | PVX_118100 | Pv_Sal1_chr12 | 2412009 | 2416826 | + |
| pvmrp1 | PVX_097025 | Pv_Sal1_chr02 | 153642 | 158822 | - |
| abcb7 | PVX_084521 | Pv_Sal1_chr13 | 385659 | 389101 | - |
| dhfr | PVX_089950 | Pv_Sal1_chr05 | 964590 | 966464 | + |
| dhps | PVX_123230 | Pv_Sal1_chr14 | 1256701 | 1259581 | - |
And these are the P. falciparum genes we want to look at:
| Name | ID | Chr | Start | End | +/- |
|---|---|---|---|---|---|
| kelch K13 | PF3D7_1343700 | Pf3D7_13_v3 | 1724817 | 1726997 | - |
| pfcrt | PF3D7_0709000 | Pf3D7_07_v3 | 403222 | 406317 | + |
| pfmdr1 | PF3D7_0523000 | Pf3D7_05_v3 | 957890 | 962149 | + |
| pfmdr2 | PF3D7_1447900 | Pf3D7_14_v3 | 1954601 | 1957675 | - |
| pfmrp1 | PF3D7_0112200 | Pf3D7_01_v3 | 464726 | 470194 | + |
| pfdhps | PF3D7_0810800 | Pf3D7_08_v3 | 548200 | 550616 | + |
| pfdhfr | PF3D7_0417200 | Pf3D7_04_v3 | 748088 | 749914 | + |
And these are the P. vivax nSL hits we want to look at:
| Name | ID | Chr | Start | End | +/- |
|---|---|---|---|---|---|
| pvmrp1 | PVX_097025 | Pv_Sal1_chr02 | 153,642 | 158,822 | - |
| sera5 | PVX_003830 | Pv_Sal1_chr04 | 572,172 | 575,852 | + |
| sera4 | PVX_003825 | Pv_Sal1_chr04 | 578,341 | 582,085 | + |
| ApiAP2 | PVX_092570 | Pv_Sal1_chr09 | 1,515,724 | 1,524,557 | - |
| AP2-O | PVX_092760 | Pv_Sal1_chr09 | 1,700,267 | 1,706,017 | - |
| pvmdr1 | PVX_080100 | Pv_Sal1_chr10 | 361,701 | 366,095 | - |
| SET-domain | PVX_114585 | Pv_Sal1_chr11 | 795,907 | 816,553 | - |
| ApiAP2 | PVX_113370 | Pv_Sal1_chr11 | 1,863,442 | 1,869,261 | + |
| pvmdr2 | PVX_118100 | Pv_Sal1_chr12 | 2,412,009 | 2,416,826 | + |
| abcB7 | PVX_084521 | Pv_Sal1_chr13 | 385,659 | 389,101 | - |
| ApiAP2 | PVX_122680 | Pv_Sal1_chr14 | 785,708 | 793,408 | + |
| HP1 | PVX_123682 | Pv_Sal1_chr14 | 1,651,896 | 1,652,723 | - |
| SET10 | PVX_123685 | Pv_Sal1_chr14 | 1,660,815 | 1,666,706 | - |
Then make these into BED files, intersect with the appropriate VCF files, and run snpEff.
It looks like I'm going to have to look at allele frequencies in the unfiltered VCF file (i.e. the original one before any filters coverage, depth, or quality filters were applied). Perhaps in this case we should just look for previously characterized SNPs? JON AGREES! Should we look in the entire Pf and Pv populations, or just in CP2 and mono? Maybe I should look in just CP2 and mono because part of the reason we're doing this is to provide evidence for why we're not seeing additional sweeps in P. falciparum CP2. JON AGREES! So need to go back to the original P. vivax and P. falciparum VCF and run the subsetting functions on those directly.
Made a bash script for getting the resistance genes regions out of the VCF file of interest, annotating those variants, then runs an Rscript that digests those variants and outputs a usable table format.
This output table got tricky in the case of the nsl results since so many had multiple variants. I ended up having to do lots of gedit manipulation of the file to subset it down to just the non-synonymous variants that occurred in at least two samples. This file is ~600 lines long, so turning it into Supplemental Appendix A.
This helped me get just the variants that occurred in more than one sample. Couldn't do a simple grep -v "0.036" file.txt because that gets rid of the tri-allelic sites with one allele occurring in only one sample. So had to do the following:
grep "0.071\|0.107\|0.143\|0.179\|0.214\|0.250\|0.286\|0.321\|0.357\|0.393\|0.429\|0.464\|0.500\|0.536\|0.571\|0.607\|0.643\|0.679\|0.714\|0.750\|0.786\|0.821\|0.857\|0.893\|0.929\|0.964\|1.000" nonsyn.txt > nonsyn_mults.txt