Generating test suites with Snakemake - thekswenson/Zombi_wiki GitHub Wiki

Overview

This page describes how to generate trees over a varying set of parameters. Our Snakemake library generates several Zombi projects, each holding the simulated data for a single combination of parameters. There are two ways to use the library:

  1. Use the zombiSnakemake command to generate a suite of test datasets in your project directory.

  2. Import our library in your Snakefile.

Using zombiSnakemake

Call zombiSnakemake my_simulation_dir and follow the instructions printed to the terminal. This creates a directory called my_simulation_dir containing my_simulation_dir/Snakefile and the default config and parameter files. To test your setup, run snakemake -c 1 from inside the my_simulation_dir directory.

Accessing our library from a Snakefile

  1. Install Zombi in your mamba environment.
  2. Import the library by adding these lines to your Snakefile:
    from zombi.snakemake.parameters import ZOMBI_EXPORT_SNAKEFILE
    include: ZOMBI_EXPORT_SNAKEFILE
    
  3. Use helper lists like ZOMBIPARAMDIRS and ZOMBIPARAMDIRS_NOREPS to write rules that depend on the Zombi output files.
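
As a rough illustration of step 3, the sketch below turns a list of project directories into a list of target files, which is the kind of list a rule could then request as input. It is plain Python and makes several assumptions: `example_dirs` is a stand-in for ZOMBIPARAMDIRS (assumed here to hold one project directory path per parameter combination), and both the helper `targets` and the file name `summary.txt` are hypothetical.

```python
# Stand-in for ZOMBIPARAMDIRS. In a real Snakefile the list comes from the
# include shown in step 2; its exact contents are an assumption here.
example_dirs = [
    "simulations/sequences/treeparams-T-rep0/genomeparams-G-rep0/sequenceparams-S-rep0",
    "simulations/sequences/treeparams-T-rep1/genomeparams-G-rep0/sequenceparams-S-rep0",
]


def targets(project_dirs, filename):
    """Return one output path per project directory."""
    return [f"{d.rstrip('/')}/{filename}" for d in project_dirs]


# "summary.txt" is a hypothetical per-project output produced by your own rule.
summary_files = targets(example_dirs, "summary.txt")
for path in summary_files:
    print(path)
```

In a Snakefile, a list built this way would typically be given as the input of a top-level rule so that Snakemake builds one target per simulated project.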

Directory structure of simulated data

Each set of simulated parameter values yields a new Zombi project directory, containing all of the files resulting from a normal Zombi run. The path to this directory has the following structure (by default):

simulations/sequences/treeparams-{TMODE}-rep{X}/{TREE_PARAMS}/genomeparams-{GMODE}-rep{Y}/{GENOME_PARAMS}/sequenceparams-{SMODE}-rep{Z}/{SEQUENCE_PARAMS}/

where X, Y, and Z are replicate numbers, and TMODE, GMODE, and SMODE are the modes under which the corresponding commands were run (e.g. GMODE is one of {G, Gu, Gf, Gm}). TREE_PARAMS, GENOME_PARAMS, and SEQUENCE_PARAMS are each a path containing one directory per non-default simulation parameter. Each directory is named after the parameter as it appears in the config file, followed by its value, separated by a dash (-). Therefore, a test suite where none of the default parameters have been changed would produce the project directory

simulations/sequences/treeparams-T-rep0/genomeparams-G-rep0/sequenceparams-S-rep0/.

If, for example, the parameter TOTAL_LINEAGES was set to (non-default) 5, and TANDEMDUP was set to (non-default) f:0, then we would see

simulations/sequences/treeparams-T-rep0/TOTAL_LINEAGES-5/genomeparams-G-rep0/TANDEMDUP-f:0/sequenceparams-S-rep0/.
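
The naming scheme described above can be sketched in plain Python. This is an illustration only, not Zombi's implementation: the function names `stage_dirs` and `project_dir` are hypothetical, and the sketch assumes the caller already lists the non-default parameters in the order used by the default parameter files.

```python
from pathlib import PurePosixPath


def stage_dirs(stage, mode, rep, params):
    """Return the path components for one stage (tree, genome, or sequence)."""
    parts = [f"{stage}params-{mode}-rep{rep}"]
    # One directory per non-default parameter, named NAME-VALUE.
    parts += [f"{name}-{value}" for name, value in params.items()]
    return parts


def project_dir(tree, genome, sequence, root="simulations/sequences"):
    """Build a project path; each argument is (mode, replicate, non_default_params)."""
    parts = []
    stages = zip(("tree", "genome", "sequence"), (tree, genome, sequence))
    for stage, (mode, rep, params) in stages:
        parts += stage_dirs(stage, mode, rep, params)
    return str(PurePosixPath(root, *parts)) + "/"


# No parameter changed from its default.
default_path = project_dir(("T", 0, {}), ("G", 0, {}), ("S", 0, {}))

# TOTAL_LINEAGES and TANDEMDUP set to non-default values.
custom_path = project_dir(
    ("T", 0, {"TOTAL_LINEAGES": 5}),
    ("G", 0, {"TANDEMDUP": "f:0"}),
    ("S", 0, {}),
)

print(default_path)
print(custom_path)
```

The two printed paths correspond to the two examples given above.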

[!TIP] The directory names appear in these paths in the same order as in the default parameter files of the Zombi installation, so the paths are predictable, stable, and follow the general Snakemake paradigm (i.e. parameters are stored in path names).
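
Because the parameters are stored in the path names, they can be recovered from a project path without opening any file. The following is a minimal sketch under stated assumptions: `parse_project_dir` is a hypothetical helper, not part of Zombi, and it assumes that parameter names contain no dash, so the first dash in a directory name separates the name from the value.

```python
import re

# Matches stage directories such as "treeparams-T-rep0".
STAGE_RE = re.compile(r"^(tree|genome|sequence)params-([^-]+)-rep(\d+)$")


def parse_project_dir(path):
    """Map each stage to its mode, replicate number, and non-default parameters."""
    stages = {}
    current = None
    for part in path.strip("/").split("/"):
        match = STAGE_RE.match(part)
        if match:
            current = match.group(1)
            stages[current] = {
                "mode": match.group(2),
                "rep": int(match.group(3)),
                "params": {},
            }
        elif current is not None:
            # Parameter directory: split NAME-VALUE on the first dash only.
            name, _, value = part.partition("-")
            stages[current]["params"][name] = value
        # Components before the first stage directory (the root) are skipped.
    return stages


parsed = parse_project_dir(
    "simulations/sequences/treeparams-T-rep0/TOTAL_LINEAGES-5/"
    "genomeparams-G-rep0/TANDEMDUP-f:0/sequenceparams-S-rep0/"
)
print(parsed["tree"])
print(parsed["genome"])
print(parsed["sequence"])
```

Values are returned as strings (for example "5" and "f:0"); converting them to numbers is left to the caller, since the sketch does not know each parameter's type.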