PopGenome - christianparobek/cambodiaWGS GitHub Wiki

PopGenome on CRAN
PopGenome manual
PopGenome vignettes
PopGenome paper

To run ms and msms, we need to put the executable/folder in our workspace. For ms, download the ms source code here, upload it to Kure, and compile it with `gcc -o ms ms.c streec.c rand1.c -lm`. Finally, put the resulting `ms` executable in the PopGenome folder.

Looking for selective sweeps:

Hitchhiking events can "greatly reduce differentiation among subdivided populations" (Kim 2002, Stephan 1998). Could the fact that we see essentially zero structure in the P. vivax population indicate that many selective sweeps are occurring in it?

#### Hypotheses to Explain Difference in Pv and Pf Population Structure:

1. P. vivax is better at recombining with broadly dispersed isolates.
2. Hitchhiking events keep differentiation between subpopulations low. The implication would be that there are more hitchhiking events in P. vivax than in P. falciparum. Does "background selection" have some light to shed on this?
3. Gates & WHO anti-ART-R interventions have reduced P. falciparum's ability to recombine and have fractured this population into subpopulations, while they have not had the same effect on P. vivax in Cambodia.
4. Different genetic backgrounds are being maintained in Cambodian P. falciparum for whatever reason (ART-R?), whereas that's not the case for P. vivax.

Since we have the first sympatric whole-genome datasets, we could maybe test out some of these hypotheses to figure out what the truth is.

For all of our gene- and exon-wise tests, is the fact that many samples have MOI>1 a big problem? If PopGenome calculates stats (Tajima's D, Fst) by allele, could these MOI>1 samples be introducing false alleles into the population? SNP-by-SNP, MOI>1 is not a problem. But maybe it is gene-by-gene?

##### 05 May 2015

Got a list of P. vivax and P. falciparum orthologs from PlasmoDB. Susanne helped me, I think. These should be all the orthologs for which there is a 1:1 comparison: one from Pv and one from Pf. To process the file I downloaded from PlasmoDB, I did this:

grep "PVX" GenesByOrthologs_summary.txt | grep -v ", PF" | cut -f 1,2 > pv_pf_orthologs.txt
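For reference, the same filter can be sketched in Python. This is only an illustration: the column layout (Pv gene ID in column 1, Pf ortholog list in column 2) and the gene IDs in the toy rows are assumptions, not taken from the real PlasmoDB download.

```python
def filter_one_to_one(lines):
    """Mimic: grep "PVX" | grep -v ", PF" | cut -f 1,2"""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        # grep "PVX": keep only rows that mention a P. vivax gene ID
        if "PVX" not in line:
            continue
        # grep -v ", PF": drop rows where a Pf ID follows a comma,
        # i.e. rows that list more than one Pf ortholog
        if ", PF" in line:
            continue
        # cut -f 1,2: keep the first two tab-separated fields
        kept.append("\t".join(line.split("\t")[:2]))
    return kept


# Toy rows with hypothetical IDs and an assumed column layout.
toy_rows = [
    "Gene ID\tOrthologs\tProduct",
    "PVX_000001\tPF3D7_0000001\thypothetical protein",
    "PVX_000002\tPF3D7_0000002, PF3D7_0000003\tkinase",
]
for row in filter_one_to_one(toy_rows):
    print(row)
```

Like the grep pipeline, this only removes rows that list more than one Pf ortholog. It does not check whether the same Pf gene appears on several rows, so a Pf gene shared by multiple Pv genes would still pass.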

##### 23 May 2015

GO term enrichment is possible with PlasmoDB. When I take the list of genes with a deltaTajD < -3 (i.e. genes which have a higher Tajima's D in P. falciparum than in P. vivax), we find a Bonferroni-corrected p-value of ~0.025 for metal/cation/ion-binding proteins. This is a 5x enrichment (23 genes out of a background of 434). On the other extreme, for the genes with deltaTajD > 1 (i.e. genes that have a higher Tajima's D in P. vivax than in P. falciparum), there may be a slight but statistically significant enrichment for genes that are part of cellular differentiation / developmental processes.
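A minimal sketch of how the two gene lists could be pulled out, assuming deltaTajD is defined as Tajima's D in Pv minus Tajima's D in Pf for each ortholog pair (this matches the sign convention described above). The function names, input dictionaries, and gene IDs are hypothetical; the real values would come from the PopGenome output.

```python
import math


def delta_tajd(pairs, tajd_pv, tajd_pf):
    """Return {(pv_id, pf_id): D_pv - D_pf} for pairs with usable values."""
    delta = {}
    for pv_id, pf_id in pairs:
        d_pv = tajd_pv.get(pv_id)
        d_pf = tajd_pf.get(pf_id)
        # Skip pairs with a missing or undefined (NaN) value; Tajima's D is
        # undefined for genes with no segregating sites.
        if d_pv is None or d_pf is None:
            continue
        if math.isnan(d_pv) or math.isnan(d_pf):
            continue
        delta[(pv_id, pf_id)] = d_pv - d_pf
    return delta


def split_tails(delta, low=-3.0, high=1.0):
    """Split pairs into the two tails used for GO enrichment."""
    higher_in_pf = sorted(k for k, v in delta.items() if v < low)
    higher_in_pv = sorted(k for k, v in delta.items() if v > high)
    return higher_in_pf, higher_in_pv
```

The default thresholds (-3 and 1) are the ones quoted in the note. The resulting ID lists would then be pasted into PlasmoDB for the enrichment test.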

##### 11 April 2016

Nick B is working on coalescent simulations to get a selection-free distribution of Tajima's D values to compare our observed Tajima's D values against. He ran into trouble because he had a different number of genes in his simulated data than I did in my PopGenome GFF files. Most of this difference was due to what we were excluding. Nick is excluding paralogs (Neafsey or var/stevor/rifin), subtelomeres, AAKMs (for Pv), and Mito/Apico. This leaves him with 4909 Pv and 5293 Pf genes. So I need to subset my Pf and Pv GFF files similarly, and then split them, hopefully giving me the same number of genes in the end. These notes and work are in `gff_prepper.sh`.
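The actual subsetting lives in `gff_prepper.sh`, which is not reproduced here. As a rough illustration of the idea, the sketch below drops gene records by ID or by contig and leaves the rest. It assumes standard GFF3 (nine tab-separated columns, an `ID=` attribute in column 9); the function names and the exclusion sets are hypothetical, and coordinate-based exclusions such as subtelomeres are not covered.

```python
def parse_attributes(field):
    """Parse a GFF3 attribute column ("ID=x;Name=y") into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def subset_genes(gff_lines, excluded_ids, excluded_seqids):
    """Keep 'gene' records not excluded by gene ID or by contig name."""
    kept = []
    for line in gff_lines:
        # Skip comments, directives, and blank lines
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # Only well-formed gene features are counted
        if len(cols) != 9 or cols[2] != "gene":
            continue
        gene_id = parse_attributes(cols[8]).get("ID")
        if cols[0] in excluded_seqids or gene_id in excluded_ids:
            continue
        kept.append(line.rstrip("\n"))
    return kept
```

Comparing `len(subset_genes(...))` against Nick's counts (4909 Pv, 5293 Pf) would be a quick check that both sides are excluding the same things.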