Defining rules - RomainFeron/workshop-snakemake-sibdays2020 GitHub Wiki

Rules are the basic blocks of a Snakemake workflow. A rule is like a recipe indicating how to produce a specific output; the actual application of a rule to create an output is called a job.

A rule is defined in a Snakefile with the keyword rule and contains directives, which specify the rule's properties. We will learn about directives over the course of the workshop.

To create a basic rule, we need two directives:

  • output: path of the output file for this rule
  • shell: shell command to execute in order to generate the output

The following example shows the syntax to implement a basic rule using these two directives. The rule defined in this example creates a file first_step.txt containing the line "snakemake" and located in a results folder, using the echo shell command:

rule first_step:
    output:
        'results/first_step.txt'
    shell:
        'echo "snakemake" > {output}'

As this example shows, values for these two directives are strings. For the shell directive, the string can be split across multiple lines for clarity, with each line in its own set of quotes; the pieces are joined into a single command, so each piece must end with a space or separator where one is needed. In addition, values from other directives can be accessed in the shell command with the syntax {directive}, and Snakemake automatically inserts the value when running a job for this rule; in the example, the value of output was obtained with {output}. Note that Snakemake automatically creates all missing folders in the output path.
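
The multi-line form works because a Snakefile is Python-based, and Python joins adjacent string literals into one string. The sketch below is plain Python, not a Snakefile; str.format is used only as an analogy for Snakemake's substitution, not as a description of how Snakemake implements it.

```python
# Adjacent string literals are joined by Python into a single string,
# so each fragment must carry its own trailing space or ';' separator.
command = (
    'echo "snakemake" '
    '> {output}'
)

# str.format is used here only as an analogy for Snakemake's substitution.
rendered = command.format(output='results/first_step.txt')
print(rendered)
```

Dropping the trailing space after "snakemake" in the first fragment would glue the two pieces together and change the command.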

The next directive used by most rules is input. Like output, input takes a file path: the path to a file that the rule requires to generate its output. In the following example, we modify the previous rule to use an input file first_step.tsv in a data folder and copy this file to results/first_step.txt:

rule first_step:
    input:
        'data/first_step.tsv'
    output:
        'results/first_step.txt'
    shell:
        'cp {input} {output}'

Note that with this rule definition, Snakemake will stop with an error instead of running the rule if data/first_step.tsv does not exist and no other rule can produce it.


Rules can have multiple input and/or output files, with each file on its own line and a comma between consecutive files. In the shell command, multiple inputs are unpacked, meaning {input} is replaced with a space-separated list of the input files. More information about multiple inputs and outputs will be provided in a later section of the workshop.

Do not forget the commas between input/output files! Missing commas are a common source of errors when starting to write workflows.
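
One reason a missing comma is easy to overlook, at least for unnamed paths: Python joins adjacent string literals, so two paths without a comma between them silently become a single path instead of raising a syntax error. A plain-Python sketch, with file names that are only illustrative:

```python
# With the comma: two separate input paths.
with_comma = [
    'data/first_step_1.tsv',
    'data/first_step_2.tsv'
]

# Without the comma: Python joins the two literals into a single string.
without_comma = [
    'data/first_step_1.tsv'
    'data/first_step_2.tsv'
]

print(len(with_comma))
print(without_comma)
```

The second list holds one merged path, which most likely does not exist on disk, so the resulting error names a file that was never meant to be requested.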

Inputs and outputs can also be accessed by their index with the syntax {input[N]} or {output[N]}, where N starts at 0. The following example shows a rule that concatenates two inputs into a single output file with cat and prints the path of the first input, illustrating the two ways to handle multiple inputs in the shell directive:

rule first_step:
    input:
        'data/first_step_1.tsv',
        'data/first_step_2.tsv'
    output:
        'results/first_step.txt'
    shell:
        'cat {input} > {output};'  # Will be evaluated as "cat data/first_step_1.tsv data/first_step_2.tsv > results/first_step.txt"
        'echo {input[0]}'  # Will be evaluated as "echo data/first_step_1.tsv"
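
To make the two access styles concrete, here is a plain-Python sketch. FileList is a hypothetical stand-in, not Snakemake's actual class; it only mimics the behavior described here, where the whole list renders as space-separated paths and [N] selects a single path. str.format again serves as an analogy for the placeholder substitution.

```python
class FileList(list):
    """Hypothetical stand-in for a rule's inputs (not Snakemake's own class)."""

    def __str__(self):
        # Render the whole list as space-separated paths.
        return ' '.join(self)


inputs = FileList(['data/first_step_1.tsv', 'data/first_step_2.tsv'])

# str.format serves as an analogy for Snakemake's placeholder substitution.
cat_command = 'cat {input} > {output}'.format(
    input=inputs, output='results/first_step.txt'
)
echo_command = 'echo {input[0]}'.format(input=inputs)

print(cat_command)
print(echo_command)
```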

Inputs and outputs can be named and later be accessed by this name in the shell directive. This is a good practice, especially when inputs or outputs are of different types. Named inputs and outputs are defined with the following syntax: <name> = <value>.

The following example adds named inputs to the first_step rule defined previously:

rule first_step:
    input:
        sample_1 = 'data/first_step_1.tsv',
        sample_2 = 'data/first_step_2.tsv'
    output:
        'results/first_step.txt'
    shell:
        'cat {input} > {output};'
        'echo {input.sample_1}'
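
The name-based access can be pictured with the same kind of plain-Python sketch. NamedFiles is a hypothetical class written for this illustration, not Snakemake's implementation: it keeps the paths as a list, so {input} still expands to every file, and exposes each path as an attribute, so {input.sample_1} selects one file.

```python
class NamedFiles(list):
    """Hypothetical stand-in for named inputs (not Snakemake's own class)."""

    def __init__(self, **named):
        # Keep the paths in definition order and expose each one by name.
        super().__init__(named.values())
        for name, path in named.items():
            setattr(self, name, path)

    def __str__(self):
        # Render the whole list as space-separated paths.
        return ' '.join(self)


inputs = NamedFiles(
    sample_1='data/first_step_1.tsv',
    sample_2='data/first_step_2.tsv',
)

# str.format serves as an analogy for Snakemake's placeholder substitution.
by_name = 'echo {input.sample_1}'.format(input=inputs)
everything = 'cat {input}'.format(input=inputs)

print(by_name)
print(everything)
```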

For more detailed information about defining rules, check the relevant section in Snakemake's official documentation.
