Modularization: snakefiles and subworkflows - RomainFeron/workshop-snakemake-sibdays2020 GitHub Wiki

For simple workflows consisting of only a few rules, it makes sense to define all rules in Snakefile. However, as workflows grow more complex and accumulate rules, Snakefile can become messy and hard to maintain. In this case, it's a good idea to organize your workflow into modules, which is common practice in programming in general. This approach also makes it easier to reuse pieces of a workflow in the future, although at the time of writing Snakemake does not have a proper module system like most programming languages.


The first and simplest way to organize your workflow is to group rules in separate snakefiles that are then included in Snakefile. How you organize rules is up to you, but a common approach is to create "thematic" modules, i.e. to group rules involved in the same general step of the workflow.

In practice, all you need to do is define the rules you want to group in a new file and include this file in Snakefile with the syntax include: '<path/to/file.smk>'. As this example shows, the recommended extension for snakefiles is .smk. This does not apply to Snakefile itself, which is a special snakefile. In the following example, we split a very simple workflow into two modules that we load in Snakefile:

first_step.smk:

rule first_step:
    input:
        'data/first_step.tsv'
    output:
        'results/first_step.txt'
    shell:
        'cp {input} {output}'

second_step.smk:

rule second_step:
    input:
        'results/first_step.txt'
    output:
        'results/second_step.txt'
    shell:
        'grep "snakemake" {input} > {output}'

Snakefile:

include: 'first_step.smk'
include: 'second_step.smk'

rule all:
    input:
        'results/second_step.txt'

Includes do not affect the default target rule, which is always the first rule defined directly in Snakefile, no matter how many includes appear above it. Therefore, in this example, running snakemake without specifying a target will use the rule all as the default target.

In practice, you can imagine that the line include: '<path/to/snakefile.smk>' is replaced by the entire content of snakefile.smk in Snakefile. This means that expressions like rules.<rule_name>.output can still be used in a snakefile even if the rule <rule_name> is defined in another snakefile, as long as the snakefile defining <rule_name> is included before the snakefile that uses rules.<rule_name>.output. This also works for input and output functions.
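As a minimal sketch of this behavior, second_step.smk from the example above could refer to the output of first_step by rule name instead of repeating the path. This variant is an illustration rather than part of the original workflow, and it assumes that Snakefile still includes first_step.smk before second_step.smk:

```snakemake
# second_step.smk (illustrative variant)
# rules.first_step.output is resolved when this file is parsed, so the rule
# first_step must already be defined, i.e. first_step.smk must be included first.
rule second_step:
    input:
        rules.first_step.output
    output:
        'results/second_step.txt'
    shell:
        'grep "snakemake" {input} > {output}'
```

If the two include lines in Snakefile were swapped, Snakemake would be expected to fail while parsing second_step.smk, because first_step would not be defined yet.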

You can place snakefiles in a sub-directory without changing input and output paths, as these paths are relative to the working directory. However, you will need to edit paths to external scripts and conda environments, as these paths are relative to the snakefile from which they are called.
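The sketch below illustrates which paths change when a snakefile is moved into a sub-directory. The directory names rules/, envs/ and scripts/, as well as the files first_step.yaml and first_step.py, are hypothetical and only serve the example. It assumes the file is saved as rules/first_step.smk and loaded from Snakefile with include: 'rules/first_step.smk':

```snakemake
# rules/first_step.smk (hypothetical layout)
rule first_step:
    input:
        'data/first_step.tsv'        # relative to the working directory: unchanged
    output:
        'results/first_step.txt'     # relative to the working directory: unchanged
    conda:
        '../envs/first_step.yaml'    # relative to rules/first_step.smk (hypothetical file)
    script:
        '../scripts/first_step.py'   # relative to rules/first_step.smk (hypothetical file)
```

Only the conda and script paths gain the leading '../', since they are resolved from the location of the snakefile rather than from the working directory.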


Another way to modularize workflows is to use sub-workflows. A sub-workflow is a self-contained Snakemake workflow that is executed independently before the main workflow. You should therefore use a sub-workflow if your main workflow depends on the results of an analysis that is already implemented, or can be implemented, in another workflow, especially if this analysis can be run on its own or as part of other workflows. To learn how to define and use sub-workflows, refer to the relevant section of the official documentation.
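For orientation, here is a rough sketch of what a sub-workflow declaration can look like, based on the subworkflow directive as documented for Snakemake versions from around the time of this workshop. The sub-workflow name preprocessing, its location and all file names are hypothetical. More recent Snakemake releases favor a module statement instead, so check the documentation for the version you are using:

```snakemake
# Snakefile of the main workflow (illustrative sketch)
subworkflow preprocessing:
    workdir:
        '../preprocessing'              # hypothetical directory of the other workflow
    snakefile:
        '../preprocessing/Snakefile'    # hypothetical snakefile of the other workflow

rule summarize:
    input:
        # Wrapping the path in preprocessing(...) marks this file as produced
        # by the sub-workflow, relative to the sub-workflow's working directory.
        preprocessing('results/clean.tsv')
    output:
        'results/summary.txt'
    shell:
        'wc -l {input} > {output}'
```

In this sketch, Snakemake would first run the preprocessing workflow to create or update results/clean.tsv, and only then execute the rules of the main workflow.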
