Defining rules - RomainFeron/workshop-snakemake-sibdays2020 GitHub Wiki
Rules are the basic blocks of a Snakemake workflow. A rule is like a recipe indicating how to produce a specific output; the actual application of a rule to create an output is called a job.
A rule is defined in a Snakefile with the keyword rule and contains directives that specify the rule's properties. We will learn about directives over the course of the workshop.
To create a basic rule, we need two directives:
- output: path of the output file for this rule
- shell: shell command to execute in order to generate the output
The following example shows the syntax to implement a basic rule using these two directives. The rule defined in this example creates a file first_step.txt containing the line "snakemake" and located in a results folder, using the echo shell command:
rule first_step:
    output:
        'results/first_step.txt'
    shell:
        'echo "snakemake" > {output}'

As this example shows, values for these two directives are strings. For the shell directive, the string can be split over multiple lines for clarity by giving each line its own set of quotes; the pieces are joined without any separator, so each piece must carry its own trailing space or semicolon. In addition, values from other directives can be accessed in the shell command with the syntax {directive}, and Snakemake will automatically insert the value when running a job for this rule; in the example, the value of output was obtained with {output}. Note that Snakemake automatically creates all missing folders in the output path.
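The two mechanisms described above, multi-line strings and {directive} placeholders, can be mimicked in plain Python. This is only a rough analogy: `str.format` stands in for Snakemake's own substitution, and the variable names `command_template` and `command` are made up for this sketch.

```python
# Adjacent string literals are joined by Python with nothing in between,
# so each piece needs its own trailing space or separator.
command_template = (
    'echo "snakemake" '
    '> {output}'
)

# Plain str.format is used here as a stand-in for Snakemake's
# placeholder substitution (an assumption made for illustration).
command = command_template.format(output="results/first_step.txt")
print(command)
```

Dropping the trailing space after `"snakemake"` would produce `echo "snakemake"> {output}`, which still runs in a shell, but the same mistake between two words of a command would fuse them together.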
The next directive used by most rules is input. Like output, input takes a file path; here, it is the path to a file that the rule requires in order to generate the output. In the following example, we modify the previous rule to use an input file first_step.tsv in a data folder and copy this file to results/first_step.txt:
rule first_step:
    input:
        'data/first_step.tsv'
    output:
        'results/first_step.txt'
    shell:
        'cp {input} {output}'

Note that with this rule definition, Snakemake will stop with an error if data/first_step.tsv does not exist and no other rule can create it.
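As a rough mental model of what happens when this rule runs as a job, the following plain-Python sketch checks the input, creates the missing output folder, and copies the file. The helper `run_copy_job` is hypothetical, and `FileNotFoundError` is only a stand-in: Snakemake reports a missing input with its own error and does so before any shell command is executed.

```python
import shutil
import tempfile
from pathlib import Path


def run_copy_job(input_path: Path, output_path: Path) -> None:
    """Hypothetical helper mimicking the `cp {input} {output}` job."""
    if not input_path.exists():
        # Stand-in for Snakemake's own missing-input error.
        raise FileNotFoundError(f"Missing input file: {input_path}")
    # Mirrors Snakemake creating missing folders in the output path.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(input_path, output_path)


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    source = workdir / "data" / "first_step.tsv"
    target = workdir / "results" / "first_step.txt"

    # First attempt: the input does not exist yet, so the job is refused.
    try:
        run_copy_job(source, target)
        missing_input_detected = False
    except FileNotFoundError:
        missing_input_detected = True

    # Second attempt: create the input, then the copy succeeds.
    source.parent.mkdir(parents=True)
    source.write_text("a\tb\n")
    run_copy_job(source, target)
    copied_text = target.read_text()
```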
Rules can have multiple input and/or output files, with each file on its own line and a comma after each entry. In the shell command, multiple inputs are unpacked, meaning {input} is replaced with a space-separated list of the input files. More information about multiple inputs and outputs will be provided in a later section of the workshop.
Do not forget the commas between input/output files! Missing commas are the source of many errors when starting to write workflows.
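The reason a missing comma is so easy to overlook is that it is not a syntax error. Because Snakefiles build on Python, two adjacent string literals are silently merged into one. The plain-Python sketch below shows the effect; the list names `with_commas` and `without_comma` are made up for the illustration.

```python
with_commas = [
    "data/first_step_1.tsv",
    "data/first_step_2.tsv",
]

# The comma after the first path is missing: Python merges the two
# adjacent string literals into a single, wrong path.
without_comma = [
    "data/first_step_1.tsv"
    "data/first_step_2.tsv",
]

print(len(with_commas), with_commas)
print(len(without_comma), without_comma)
```

In a rule, the likely symptom is a complaint about a missing file whose name looks like two paths glued together.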
Inputs and outputs can also be accessed by their index with the syntax {input[N]} or {output[N]}, where N starts at 0. The following example shows a rule that concatenates two inputs into a single output file with cat and then prints the path of the first input with echo, illustrating the two ways to handle multiple inputs in the shell directive:
rule first_step:
    input:
        'data/first_step_1.tsv',
        'data/first_step_2.tsv'
    output:
        'results/first_step.txt'
    shell:
        'cat {input} > {output};'  # Will be evaluated as "cat data/first_step_1.tsv data/first_step_2.tsv > results/first_step.txt"
        'echo {input[0]}'  # Will be evaluated as "echo data/first_step_1.tsv"

Inputs and outputs can be named and later be accessed by this name in the shell directive. This is a good practice, especially when inputs or outputs are of different types. Named inputs and outputs are defined with the following syntax: <name> = <value>.
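To see how the whole-list, index, and name-based forms fit together, here is a rough plain-Python model. The class `NamedFiles` is hypothetical and is not Snakemake's implementation, the names sample_1 and sample_2 are chosen only for the illustration, and `str.format` again stands in for Snakemake's placeholder substitution.

```python
class NamedFiles(list):
    """Hypothetical stand-in for a rule's inputs: a list with optional names."""

    def __init__(self, *paths, **named_paths):
        # Positional paths come first, then named ones, in definition order.
        super().__init__([*paths, *named_paths.values()])
        # Each named path is also reachable as an attribute.
        for name, path in named_paths.items():
            setattr(self, name, path)

    def __str__(self):
        # The whole collection renders as a space-separated list.
        return " ".join(self)


inputs = NamedFiles(
    sample_1="data/first_step_1.tsv",
    sample_2="data/first_step_2.tsv",
)

whole = "cat {input} > {output}".format(
    input=inputs, output="results/first_step.txt"
)
by_index = "echo {input[0]}".format(input=inputs)
by_name = "echo {input.sample_1}".format(input=inputs)

print(whole)
print(by_index)
print(by_name)
```

In this model, naming an input does not remove it from the list, so the whole-list and index forms keep working alongside the named form.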
The following example adds named inputs to the first_step rule defined previously:
rule first_step:
    input:
        sample_1 = 'data/first_step_1.tsv',
        sample_2 = 'data/first_step_2.tsv'
    output:
        'results/first_step.txt'
    shell:
        'cat {input} > {output};'
        'echo {input.sample_1}'

For more detailed information about defining rules, check the relevant section in Snakemake's official documentation.