Documentation - santaci/SimOutbreakSelection GitHub Wiki
This can be a well-defined model such as the Gravel model of human evoluton or can be a custom designed one based on curated information from relevant literature or in-house demographic inferencing.
For example purposes, the demographic history of African Cape buffalo (rinderpest/demographies/
) and the Medieval Swedish human population (plague/demographies/
) have been provided to simulate in SLiM.
This is the most bespoke part of the framework—designing the trajectory and implementing the key components into the simulation:
- Bottleneck induced by the epidemic
- Viability of those who are homozygous for the advantageous allele (i.e. selected variant)
- The starting frequency of the variant under selection
The selection-driven simulations used in our manuscript can be found in the
inputs/
folder of each respective epidemic. Note that each mode of inheritance (mode of selection) is a separate SLiM script.
Within the SLiM model of epidemic-driven selection, make sure to keep track of Pedigree IDs of the individuals that survive in the generation* of interest.
This would be done within the intialize() {}
callback with the following command:
initializeSLiMOptions(keepPedigrees=T);
Survivor pedigree IDs will be used to subset apart the dead individuals from the same generation
* when we write those IDs out into a separate file. Those IDs will then be used by SOS
downstream.
In order to keep the dead from the generations of interest within the output tree sequence, we save the individuals in the early part of the generation*
The command to do this in SLiM would be:
sim.treeSeqRememberIndividuals(p2.individuals);
In this example, p2
is the population of interest.
*This is now cycle
or tick
for the newer SLiM versions.
Below is a breakdown of the command-line arguments available in SOS
:
-
--tree <file>
: Path to the tree file without mutations. -
--muTree <file>
: Path to the mutated and recapitated tree file.
-
--bottle <int>
: SLiM generation(s) marking a bottleneck event. Default:58002
. -
--select <int>
: Site of the selected variant. Default:13811692
. -
--generations <int int>
: SLiM generations to compare (e.g.,58002 58012
). Default:[58002, 58012]
. -
--chromosomes <int+>
: Chromosomes simulated in SLiM (e.g.,21 22
). Default:[21, 22]
. -
--dead <int>
: Generation of the deceased individual of interest. -
--demegen <int>
: Last SLiM generation of demographic history. Default:58000
. -
--ne <int>
: Ancestral effective population size. Default:7310
. -
--mu <float>
: Mutation rate for neutral mutations. Default:2.36e-8
. -
--rec <float>
: Uniform recombination rate for recapitation. Default:1e-8
. -
--seed <int>
: Seed for the simulation. Default: current Unix time.
-
--putparents <file>
: CSV file containing both dead and alive individuals. -
--sampling <int+>
: Number of samples for each generation/grouping. If comparing with deceased individuals, input their sample size first. Default:100
. -
--mixCem <float>
: Proportion of diseased dead in a mixed cemetery sample (e.g.,0.5
). -
--realhet
: Counts true heterozygous sites for a given generation (After). -
--midchrom <int+>
: Position to divide SLiM-created chromosomes for plots and VCF output. Default:[33759515]
.
-
--plots
: Generates visual representations of results. -
--winfst
: Calculates windowed FST (100kb windows). -
--vcf
: Creates files necessary to build a VCF. -
--nosel
: Skips runningselscan
. -
--pmap
: Uses physical positions instead of a genetic map foriHS
. -
--recmaps <dir>
: Directory containing per-chromosome recombination maps, named asslim_chr#_recode.map
. -
--ihs
: Produces input files forselscan
orrehh
. -
--gwas
: Produces input files for GWAS. -
--threads <int>
: Number of threads forselscan
. Default:1
.
-
--dir <dir>
: Specifies a different working directory.
Below is an example command using SOS
to analyze a SLiM simulation output (i.e. tree sequence input):
python3 sos.py --nosel --seed 42 --tree add_grad_6.trees --muTree add_grad_6_mu.trees \
--select 42136758 --bottle 40002 --dead 40002 --putparents nwf_putparents_6.csv \
--sampling 200 --vcf --gwas --ihs --plots --pmap --generations 40002 40002 \
--ne 100000 --mu 1.5e-8 --rec 1e-8 --demegen 40000
-
Treeseq Files: Specifies the original tree (
add_grad_6.trees
) and the mutated version (add_grad_6_mu.trees
). -
Selection Analysis: Investigates selection at site
42136758
. -
Demographic History: Defines a bottleneck at generation
40002
. -
Sampling & Comparisons: Uses
nwf_putparents_6.csv
to include both dead and alive individuals, sampling200
per group. -
Output Generation: Creates VCF, GWAS, and
iHS
input files, plots results, and uses a physical map (--pmap
). -
Performance Control: Uses seed
42
for reproducibility and sets--nosel
to skipselscan
.
- Ensure all input files are correctly formatted and available before running
SOS
. - Default values are provided for many parameters, but some require user specification.
- Running with multiple analysis flags (
--vcf
,--gwas
,--ihs
) may require significant computational resources.
Snakemake workflows can be found within the rules/
folders of each example epidemic in this repository. There are rules to run the demographic history (if you haven't done so already) as well as the epidemic part of the simulation.
The epidemic simulation rules require a configuration file (.YAML) with the necessary parameters such as starting allele frequency (f
), viability of the homozygous advantageous allele(v
), and corresponding inheritance model
* (i.e. add
or rec
). Other information such as which generations will be compared (gen1
and gen2
), the generation where there are dead to compare to survivors (dead
), number of samples per grouping(n
), number of simulations to perform (sim
), and number of subsamplings per simulation (resamp
) can be found within the configuration file. All parameters can be adjusted by the user.
## Example configuration YAML
model: "add"
recovery: "grad"
v: 1.0
f: 0.1
sim: 1
n: 100
resamp: 100
deme: 1
gen1: 40001
gen2: 40003
midchrom: 33759515
chrom: [21, 22]
OUTMAIN: "output"
dead:
More examples can be found in the configs/
folder of each respective epidemic.
*NOTE: At the moment only SLiM code for additive and recessive selection have been provided as examples. An equation to implement a dominant model can be found in the Methods section of our manuscript.
When the genome length is reduced by half, the top 3 signals remain stable and retain comparable statistical power. However, inflation in power becomes more pronounced when analyzing beyond the top 3. This suggests that users should carefully consider the number of top signals included in their analysis, as including more than the top 3 may lead to inflated or misleading results in reduced genome contexts.
Test-case: Additive Selection for Rinderpest in African Cape Buffalo. In panels A-C, we can see a comparison of estimated power when searching for epidemic-driven selection in African Cape Buffalo when using full-length chromosomes versus halved sized genomes comparing (A) Top 3 SNPs, (B) Top 10 SNPs, and (C) Top 1 SNP using FST whole genome scans. The first column is comparing samples from before the rinderpest epidemic versus a few generations after while the second column depicts the results when comparing samples from before the rinderpest epidemic versus present-day individuals. Each row is a different viability for those that carry the beneficial allele.
Activate the SOS environment
micromamba activate SOS
An example usage using the demo data (rinderpest/
) in this repository would be to first run the demographic history workflow:
cd rinderpest/
snakemake all -p -s rules/african_buffalo_deme.snakemake --cores 10
Once that has completed and we have our tree sequence outputs of the host's demographic history, we can move on to the our epidemic-driven selection workflow.
In this example below, the Snakemake file contains rules to use the demopgraphic history tree sequences as input for our nonWF SLiM simulations. The number of simulations and subsamplings can be found in the configuration file. The --configfile
path lets us know we are running an additive model with n=100 per group where Dead and Alive are being compared starting at an allele frequency of 0.1 (DA01
).
snakemake all -p -s rules/buffalo_runs.snakemake --configfile configs/add/100/n100_grad_DA01.yaml --cores 20 --debug-dag
Further information can be adjused within the configuration file and the Snakemake file.
Once your epidemic-selection workflow is done, gather the results for those simulations into one mastertable using the command below. You will multiple mastertables if you explore different parameter combinations.
## Usage: get_results.sh <model> <generation1> <generation2> <sample_size> [demeGeneration] [midchrom] [nsims] [nresamp] [workdir]
bash get_results.sh add_grad_v1.0_f0.1_sim 40002 40002 100
Then calculate power for all the performed simulations of the same viability (v
).
This script points to the directory where mastertables for the simulations of VAA=1 are stored (outputs/plots/add/VAA1/
). 3
signifies the number of top candidates to use when considering outliers for all power calculations (this is the default).
## Usage: Rscript calculate_power.R <path to candidate tables> [# of top candidates. Default = 3]
Rscript calculate_power.R outputs/plots/add/VAA1/ 3