Pipeline (Snakemake) - EXIOBASE/docs GitHub Wiki

Pipeline (Snakemake)

Background

We're using snakemake to automate the actual pipeline. Snakemake builds a directed acyclic graph (DAG) based on the input-output files specified in the rule set.
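To make the DAG idea concrete, here is a toy sketch in plain Python. It is not how Snakemake is implemented; the rule names and file names are made up, and it only shows how an execution order falls out of matching each rule's inputs to another rule's outputs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical rules: name -> (input files, output files).
RULES = {
    "download": ([], ["raw.csv"]),
    "clean": (["raw.csv"], ["clean.csv"]),
    "report": (["clean.csv"], ["report.txt"]),
}


def build_dag(rules):
    """Map each rule to the set of rules that produce its input files."""
    producer = {out: name for name, (_, outs) in rules.items() for out in outs}
    return {
        name: {producer[f] for f in ins if f in producer}
        for name, (ins, _) in rules.items()
    }


def execution_order(rules):
    """Return rule names so that every rule comes after its dependencies."""
    return list(TopologicalSorter(build_dag(rules)).static_order())
```

If two rules depended on each other's outputs, TopologicalSorter would raise a CycleError, which is the toy equivalent of Snakemake refusing a cyclic rule set.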

Generally, the Snakemake stuff is extraordinarily fragile. It has to be pretty much perfect to run.

Interfacing with our Code

There are several ways to have Snakemake interface with our Python code including but not limited to:

using shell invocation (not a good choice);
using the run syntax; and,
using the script syntax.

The shell syntax is intended for running shell commands and scripts. Although it could be used, it's by far the least favourable option.

The run syntax executes Python code, including possibly import of other modules, directly in the Snakemake context. This is arguably the easiest method but it imposes certain limitations, which could cause problems.

The script syntax executes other Python scripts and supplies the list of input files, list of output files, and parameters as objects within these scripts. Behind the scenes, Snakemake actually serialises these and then transparently deserialises them for the scripts. However, since functions or objects are not being interacted with directly, there has to be some extra interface code to handle this communication.

Our main choice is to use the script syntax for reasons that are explained below.
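The "extra interface code" for the script syntax can be kept very thin. The sketch below is an assumption about layout, not our actual code: the function name process and the prefix parameter are hypothetical. The real work lives in a plain function, and a small guard hands over the values only when Snakemake has injected its snakemake object.

```python
from pathlib import Path


def process(input_files, output_files, params):
    """Plain function holding the real work; it knows nothing about Snakemake."""
    text = "".join(Path(p).read_text() for p in input_files)
    prefix = params.get("prefix", "")  # "prefix" is a made-up parameter
    Path(output_files[0]).write_text(prefix + text)


# When Snakemake runs this file via `script:`, it injects a global named
# `snakemake`. When the file is imported or run directly, the name is absent
# and nothing below executes.
if "snakemake" in globals():
    smk = globals()["snakemake"]
    # Assumes params were given by name in the rule, so items() yields pairs.
    process(list(smk.input), list(smk.output), dict(smk.params.items()))
```

Keeping the function free of Snakemake types means it can be called from a test or a notebook with ordinary lists and dictionaries.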

Interfacing with datamanager, macrodb, etc.

There is code in datamanager's exec_control/external.py, which will construct an ExtArgs object from the provided Snakemake parameters. This needs to be in a specific format. See the ExtArgs section in the datamanager page.

There are some technical details, though, about how Snakemake interacts with other code when using script:.

Snakemake's `snakemake` Object

When using script:, as we do, Snakemake will serialize input:, output:, params:, and log: for the rule in question. It'll deserialize these within the context of your Python module by creating an InputFiles, OutputFiles, Params, and Log object, respectively. These are defined in snakemake.io and are all essentially just a snakemake.io.Namedlist.

Snakemake's Namedlist is descended from Python's list, seemingly as an attempt to add name->value features, which otherwise would be known as a dictionary. It's not particularly good; you'll encounter a few problems unless you're consistent both on the Snakemake and script side. To handle this, we assume that input and output files are just lists. We assume that log and params are using a name->value arrangement: we specify a stdout and stderr for log; params are naturally name->value.
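The list-with-names behaviour is easier to see in a stand-in. The class below is a simplified imitation written for this page, not the real snakemake.io.Namedlist code, but it shows why the same value is reachable both by position and by name, and why a purely positional list has no names at all.

```python
class NamedListSketch(list):
    """Stand-in for the idea behind snakemake.io.Namedlist (not its real code)."""

    def __init__(self, positional=(), **named):
        super().__init__(positional)
        self._names = {}  # name -> position in the underlying list
        for name, value in named.items():
            self._names[name] = len(self)
            self.append(value)
            setattr(self, name, value)

    def items(self):
        """Return (name, value) pairs for the named entries only."""
        return [(name, self[i]) for name, i in self._names.items()]


# Our convention: log and params are addressed by name ...
log = NamedListSketch(stdout="run.out", stderr="run.err")
# ... while input and output files are treated as plain lists.
inputs = NamedListSketch(["a.csv", "b.csv"])
```

Here log[0] and log.stdout refer to the same value, while inputs.items() is empty because nothing was named. Mixing the two styles within one field is where the confusion starts.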

For the params, the main fields are best extracted manually. These correspond directly with ExtArgs, or at least should do. The command_str and query_str can actually be specified as a dictionary and the script will see them as a dictionary. This makes it much easier to instantiate ExtArgsCommandStr and ExtArgsQueryStr.
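A minimal sketch of extracting the fields manually might look like this. The field names command_str and query_str come from the text above; the function name, the keys inside the dictionaries, and the use of SimpleNamespace as a stand-in for snakemake.params are all assumptions for illustration. The real ExtArgs constructors are not shown here.

```python
from types import SimpleNamespace

DICT_FIELDS = ("command_str", "query_str")


def extract_params(params):
    """Pull the dictionary-valued fields out by name and validate their type."""
    fields = {}
    for name in DICT_FIELDS:
        value = getattr(params, name, None)
        if not isinstance(value, dict):
            raise TypeError(
                f"params.{name} must be a dict, got {type(value).__name__}"
            )
        fields[name] = dict(value)  # copy, so later edits don't leak back
    return fields


# Stand-in for snakemake.params; the inner keys are hypothetical.
example_params = SimpleNamespace(
    command_str={"action": "load"},
    query_str={"year": 2020},
)
extracted = extract_params(example_params)
```

Failing early with a TypeError when a field is missing or has the wrong type keeps the error close to the rule that caused it.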

Snakemake `log` Field

The log: field in the rule specification only supplies your script with the log filename(s). It doesn't actually do any logging itself. In datamanager and macrodb, we redirect stdout and stderr to snakemake.log['stdout'] and snakemake.log['stderr'], which are defined in the corresponding Snakemake rule. There are also the main datamanager logs, handled via loguru.
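One way to do such a redirection with the standard library is sketched below. This is not necessarily how datamanager does it; run_with_logs is a hypothetical helper, and log_paths stands in for snakemake.log with its stdout and stderr entries.

```python
import contextlib


def run_with_logs(log_paths, func, *args, **kwargs):
    """Call func with stdout and stderr redirected to the given log files."""
    with open(log_paths["stdout"], "w") as out, open(log_paths["stderr"], "w") as err:
        with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
            return func(*args, **kwargs)
```

Note that this only captures output written through Python's sys.stdout and sys.stderr. Output from subprocesses or C extensions writing directly to the file descriptors would not end up in the files.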

Conda Environments and Snakemake

Snakemake allows specification of a conda environment for each rule. This cannot be used with the run syntax, because run blocks execute inside the Snakemake process itself; it does work with the script syntax (and, according to the Snakemake documentation, with shell). There are also two flavours for this: either specifying an existing conda environment, or having Snakemake spin up an environment automatically based on a YAML file.

This may appear to be unnecessary complexity. However, the software used along the pipeline is quite diverse and includes optimisation software. The current version of EXIOBASE also uses Matlab (although it's not clear if it'll be required in this new version). This could quite easily present a challenge if the different software requires different conda environments, e.g. Matlab has specific requirements depending on version. So it makes some practical sense to keep the option of being able to specify rule-specific environments. Also, the datamanager environment is already different from the Snakemake environment, and keeping the specifications minimal lessens the potential for problems in conda's package dependency resolution.

It is also arguably helpful in terms of narrowing the window for potential bugs: having a well-defined interface forces compliance, and so many mistakes can be identified immediately.

Notes / Reminders

Location of Snakemake Working Files

We've set up Snakemake so that the working files end up in var/work/.snakemake. This may not be obvious in directory listings because the directory is hidden, i.e. remember that in the shell you'd need ls -a.

Problem with Missing Packages with `script` and `conda`

A hard-to-debug problem can sometimes appear when using the script and conda statements together in a rule, when the conda environment already exists (so the rule points to an environment name, not a YAML environment specification). In the target script, it looks like conda has been set up to activate the correct environment, but loading third-party packages fails. On closer inspection, this is due to sys.path not being set correctly.

The resolution, both times so far, appears to have been reinstalling the whole Snakemake environment. The reinstall probably bumps the Snakemake version, which would suggest a bug that has since been fixed upstream. Still, it's weird.
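When this happens again, a quick check from inside the target script can confirm whether the interpreter really belongs to the activated environment. The helper below is a hypothetical diagnostic, not part of our code. It assumes conda sets the CONDA_PREFIX environment variable on activation.

```python
import os
import sys


def environment_report(environ=None):
    """Collect interpreter and conda details to spot a mismatched environment."""
    environ = os.environ if environ is None else environ
    conda_prefix = environ.get("CONDA_PREFIX")
    prefix_matches = conda_prefix is not None and (
        os.path.realpath(conda_prefix) == os.path.realpath(sys.prefix)
    )
    return {
        "executable": sys.executable,    # which python binary is running
        "sys_prefix": sys.prefix,        # where that interpreter lives
        "conda_prefix": conda_prefix,    # what conda thinks is active
        "prefix_matches": prefix_matches,
        "sys_path": list(sys.path),      # where imports are searched
    }
```

Printing this report at the top of the script (it ends up in the stdout log) shows at a glance whether sys.path points into the expected environment's site-packages.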

Environment

The standard Snakemake install (see the official docs) pulls in a plethora of packages via conda/mamba.

One of these, PuLP, adds an entry to PYTHONPATH, '/opt/gurobi201/linux32/lib/python2.5'. This took just shy of an hour to track down, not least because it only appears when running code under a snakemake process. It turns out that snakemake may use PuLP to solve a MILP for the scheduling, see scheduler.py, and that PuLP itself can use Gurobi.
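A small filter like the following would have shortened the hunt. It is a hypothetical helper that lists search-path entries which don't exist on disk, which is how a stray entry such as the Gurobi one would typically show up.

```python
import os


def missing_path_entries(entries):
    """Return the non-empty path entries that do not exist on disk."""
    return [p for p in entries if p and not os.path.exists(p)]
```

Inside a script it can be applied to sys.path, or to os.environ.get("PYTHONPATH", "").split(os.pathsep). Expect the odd harmless hit, such as the standard library zip file that Python lists in sys.path even when it isn't present.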

Old / Example Setups for Working with Slurm

Back in early 2023, some exploratory runs with Snakemake were done on Idun using Slurm. The code produces some figures for the pseudospectrum of various matrices, if I recall correctly. Anyway, it's not directly related to the EXIOBASE stuff, but I left it in the pipeline repo for future reference since it does work.