Documentation - santaci/SimOutbreakSelection GitHub Wiki

Welcome to the SOS wiki!

fig1_sos_framework_v2

Preparing Inputs

Input I. Simulating the demographic history of the host

This can be a well-defined model such as the Gravel model of human evoluton or can be a custom designed one based on curated information from relevant literature or in-house demographic inferencing.

For example purposes, the demographic history of African Cape buffalo (rinderpest/demographies/) and the Medieval Swedish human population (plague/demographies/) have been provided to simulate in SLiM.

Input II. Simulating the epidemic of interest with selection using SLiM

This is the most bespoke part of the framework—designing the trajectory and implementing the key components into the simulation:

  • Bottleneck induced by the epidemic
  • Viability of those who are homozygous for the advantageous allele (i.e. selected variant)
  • The starting frequency of the variant under selection The selection-driven simulations used in our manuscript can be found in the inputs/ folder of each respective epidemic. Note that each mode of inheritance (mode of selection) is a separate SLiM script.

Input III. Save generations of interest

Within the SLiM model of epidemic-driven selection, make sure to keep track of Pedigree IDs of the individuals that survive in the generation* of interest.

This would be done within the intialize() {} callback with the following command: initializeSLiMOptions(keepPedigrees=T);

Survivor pedigree IDs will be used to subset apart the dead individuals from the same generation* when we write those IDs out into a separate file. Those IDs will then be used by SOS downstream.

In order to keep the dead from the generations of interest within the output tree sequence, we save the individuals in the early part of the generation*

The command to do this in SLiM would be: sim.treeSeqRememberIndividuals(p2.individuals);

In this example, p2 is the population of interest.

*This is now cycle or tick for the newer SLiM versions.

Usage

Using SimOutbreakSelection: SOS

Command-Line Arguments

Below is a breakdown of the command-line arguments available in SOS:

Required Inputs

  • --tree <file>: Path to the tree file without mutations.
  • --muTree <file>: Path to the mutated and recapitated tree file.
Simulation Parameters
  • --bottle <int>: SLiM generation(s) marking a bottleneck event. Default: 58002.
  • --select <int>: Site of the selected variant. Default: 13811692.
  • --generations <int int>: SLiM generations to compare (e.g., 58002 58012). Default: [58002, 58012].
  • --chromosomes <int+>: Chromosomes simulated in SLiM (e.g., 21 22). Default: [21, 22].
  • --dead <int>: Generation of the deceased individual of interest.
  • --demegen <int>: Last SLiM generation of demographic history. Default: 58000.
  • --ne <int>: Ancestral effective population size. Default: 7310.
  • --mu <float>: Mutation rate for neutral mutations. Default: 2.36e-8.
  • --rec <float>: Uniform recombination rate for recapitation. Default: 1e-8.
  • --seed <int>: Seed for the simulation. Default: current Unix time.
Sampling and Analysis
  • --putparents <file>: CSV file containing both dead and alive individuals.
  • --sampling <int+>: Number of samples for each generation/grouping. If comparing with deceased individuals, input their sample size first. Default: 100.
  • --mixCem <float>: Proportion of diseased dead in a mixed cemetery sample (e.g., 0.5).
  • --realhet: Counts true heterozygous sites for a given generation (After).
  • --midchrom <int+>: Position to divide SLiM-created chromosomes for plots and VCF output. Default: [33759515].
Output Options
  • --plots: Generates visual representations of results.
  • --winfst: Calculates windowed FST (100kb windows).
  • --vcf: Creates files necessary to build a VCF.
  • --nosel: Skips running selscan.
  • --pmap: Uses physical positions instead of a genetic map for iHS.
  • --recmaps <dir>: Directory containing per-chromosome recombination maps, named as slim_chr#_recode.map.
  • --ihs: Produces input files for selscan or rehh.
  • --gwas: Produces input files for GWAS.
  • --threads <int>: Number of threads for selscan. Default: 1.
Working Directory
  • --dir <dir>: Specifies a different working directory.

Example Usage

Below is an example command using SOS to analyze a SLiM simulation output (i.e. tree sequence input): epidemic_sim_layout

python3 sos.py --nosel --seed 42 --tree add_grad_6.trees --muTree add_grad_6_mu.trees \
  --select 42136758 --bottle 40002 --dead 40002 --putparents nwf_putparents_6.csv \
  --sampling 200 --vcf --gwas --ihs --plots --pmap --generations 40002 40002 \
  --ne 100000 --mu 1.5e-8 --rec 1e-8 --demegen 40000
Explanation of the Example Command:
  • Treeseq Files: Specifies the original tree (add_grad_6.trees) and the mutated version (add_grad_6_mu.trees).
  • Selection Analysis: Investigates selection at site 42136758.
  • Demographic History: Defines a bottleneck at generation 40002.
  • Sampling & Comparisons: Uses nwf_putparents_6.csv to include both dead and alive individuals, sampling 200 per group.
  • Output Generation: Creates VCF, GWAS, and iHS input files, plots results, and uses a physical map (--pmap).
  • Performance Control: Uses seed 42 for reproducibility and sets --nosel to skip selscan.

Notes

  • Ensure all input files are correctly formatted and available before running SOS.
  • Default values are provided for many parameters, but some require user specification.
  • Running with multiple analysis flags (--vcf, --gwas, --ihs) may require significant computational resources.

Workflow

Running SOS using Snakemake workflows

Snakemake workflows can be found within the rules/ folders of each example epidemic in this repository. There are rules to run the demographic history (if you haven't done so already) as well as the epidemic part of the simulation.

The epidemic simulation rules require a configuration file (.YAML) with the necessary parameters such as starting allele frequency (f), viability of the homozygous advantageous allele(v), and corresponding inheritance model* (i.e. add or rec). Other information such as which generations will be compared (gen1 and gen2), the generation where there are dead to compare to survivors (dead), number of samples per grouping(n), number of simulations to perform (sim), and number of subsamplings per simulation (resamp) can be found within the configuration file. All parameters can be adjusted by the user.

## Example configuration YAML

model: "add"
recovery: "grad"
v: 1.0
f: 0.1
sim: 1
n: 100
resamp: 100
deme: 1
gen1: 40001
gen2: 40003
midchrom: 33759515
chrom: [21, 22]
OUTMAIN: "output"
dead:

More examples can be found in the configs/ folder of each respective epidemic.

*NOTE: At the moment only SLiM code for additive and recessive selection have been provided as examples. An equation to implement a dominant model can be found in the Methods section of our manuscript.

In-House Comparison: Full-Length Two Chromosomes vs. Halved Genome Size

When the genome length is reduced by half, the top 3 signals remain stable and retain comparable statistical power. However, inflation in power becomes more pronounced when analyzing beyond the top 3. This suggests that users should carefully consider the number of top signals included in their analysis, as including more than the top 3 may lead to inflated or misleading results in reduced genome contexts.

Test-case: Additive Selection for Rinderpest in African Cape Buffalo. In panels A-C, we can see a comparison of estimated power when searching for epidemic-driven selection in African Cape Buffalo when using full-length chromosomes versus halved sized genomes comparing (A) Top 3 SNPs, (B) Top 10 SNPs, and (C) Top 1 SNP using FST whole genome scans. The first column is comparing samples from before the rinderpest epidemic versus a few generations after while the second column depicts the results when comparing samples from before the rinderpest epidemic versus present-day individuals. Each row is a different viability for those that carry the beneficial allele.

RinderpestAdditiveGithub

Quick snakemake usage

Activate the SOS environment

micromamba activate SOS

An example usage using the demo data (rinderpest/) in this repository would be to first run the demographic history workflow:

cd rinderpest/
snakemake all -p -s rules/african_buffalo_deme.snakemake --cores 10

Once that has completed and we have our tree sequence outputs of the host's demographic history, we can move on to the our epidemic-driven selection workflow.

In this example below, the Snakemake file contains rules to use the demopgraphic history tree sequences as input for our nonWF SLiM simulations. The number of simulations and subsamplings can be found in the configuration file. The --configfile path lets us know we are running an additive model with n=100 per group where Dead and Alive are being compared starting at an allele frequency of 0.1 (DA01).

snakemake all -p -s rules/buffalo_runs.snakemake --configfile configs/add/100/n100_grad_DA01.yaml --cores 20 --debug-dag

Further information can be adjused within the configuration file and the Snakemake file.

Once your epidemic-selection workflow is done, gather the results for those simulations into one mastertable using the command below. You will multiple mastertables if you explore different parameter combinations.

## Usage: get_results.sh <model> <generation1> <generation2> <sample_size> [demeGeneration] [midchrom] [nsims] [nresamp] [workdir]

bash get_results.sh add_grad_v1.0_f0.1_sim 40002 40002 100

Then calculate power for all the performed simulations of the same viability (v). This script points to the directory where mastertables for the simulations of VAA=1 are stored (outputs/plots/add/VAA1/). 3 signifies the number of top candidates to use when considering outliers for all power calculations (this is the default).

## Usage: Rscript calculate_power.R <path to candidate tables> [# of top candidates. Default = 3]

Rscript calculate_power.R outputs/plots/add/VAA1/ 3
⚠️ **GitHub.com Fallback** ⚠️