Wildcards - RomainFeron/workshop-snakemake-sibdays2020 GitHub Wiki

Wildcards are a core feature of Snakemake. In short, wildcards act as Snakemake's variables: placeholders whose values are filled in when Snakemake evaluates the workflow. They are written with the {wildcard_name} syntax familiar from Python format strings.

To introduce the concept of wildcards, let us look at this rule from the Defining rules section:

rule first_step:
    input:
        'data/first_step.tsv'
    output:
        'results/first_step.txt'
    shell:
        'cp {input} {output}'

With this definition, this rule only works for the input file data/first_step.tsv and the output file results/first_step.txt. Let's assume we have data for several samples, and we would like to apply this rule to any of them. Using wildcards, we could modify this rule to generate an output for any sample:

rule first_step:
    input:
        'data/{sample}.tsv'
    output:
        'results/{sample}.txt'
    shell:
        'cp {input} {output}'

Now, if we wanted to generate the output results/sample_1.txt, we could run snakemake as follows: snakemake results/sample_1.txt. Then, when evaluating the workflow, Snakemake would identify that the rule first_step can generate this output and replace {sample} with sample_1 everywhere in the definition of first_step for this specific job. Wildcards are a powerful tool to make rules more generic and therefore simplify a workflow's implementation; the traditional programming equivalent would be implementing a function instead of a code snippet.
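
The matching step described above can be sketched in plain Python. This is only an illustration of the idea, not Snakemake's actual implementation: the helper names pattern_to_regex and resolve are hypothetical, and every wildcard is assumed to match one or more arbitrary characters.

```python
import re

def pattern_to_regex(pattern):
    """Turn 'results/{sample}.txt' into a regex with one named group per wildcard."""
    # re.split with a capturing group alternates: literal, wildcard name, literal, ...
    parts = re.split(r'\{(\w+)\}', pattern)
    regex = ''
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)       # literal text, e.g. 'results/'
        else:
            regex += f'(?P<{part}>.+)'     # wildcard, e.g. {sample}
    return re.compile(regex + '$')

def resolve(target, output_pattern, input_pattern):
    """Infer wildcard values from the requested target, then fill in the input."""
    match = pattern_to_regex(output_pattern).match(target)
    if match is None:
        return None                        # this rule cannot produce the target
    wildcards = match.groupdict()
    return wildcards, input_pattern.format(**wildcards)

wildcards, input_file = resolve(
    'results/sample_1.txt', 'results/{sample}.txt', 'data/{sample}.tsv'
)
print(wildcards)   # {'sample': 'sample_1'}
print(input_file)  # data/sample_1.tsv
```

The direction of inference is the point of the sketch: the requested output determines the wildcard values, and the input is derived from them, never the other way around.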

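Several outputs can be requested at the same time by listing them as targets. In shells that support brace expansion, such as bash, the list can be written compactly; the shell expands the following command into three targets (results/sample_1.txt, results/sample_2.txt and results/sample_3.txt) before Snakemake runs: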

snakemake results/sample_{1,2,3}.txt

Wildcards are inferred from a rule's output and are then propagated to all other directives. All outputs generated by a rule (including log and benchmark files) must contain the same wildcards; otherwise, the wildcard values could not all be determined from a single requested output, and different jobs of the same rule could end up writing to the same file (see the Rule dependencies section).
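
As a rough illustration of this requirement, the wildcard names used in each file pattern can be extracted and compared. The sketch below uses Python's standard string.Formatter; the helper wildcard_names and the log file paths are hypothetical and are not part of Snakemake.

```python
from string import Formatter

def wildcard_names(pattern):
    """Return the set of {wildcard} names used in a file pattern."""
    # Formatter().parse yields (literal, field_name, format_spec, conversion);
    # field_name is None for trailing literal text.
    return {name for _, name, _, _ in Formatter().parse(pattern) if name}

output = 'results/{sample}.txt'
log = 'logs/{sample}.log'          # same wildcards as the output: fine
bad_log = 'logs/first_step.log'    # no wildcard: every job would share this file

print(wildcard_names(output) == wildcard_names(log))      # True
print(wildcard_names(output) == wildcard_names(bad_log))  # False
```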

Wildcards can be accessed in the shell command with the syntax wildcards.<wildcard_name>. Example:

rule first_step:
    input:
        'data/{sample}.tsv'
    output:
        'results/{sample}.txt'
    shell:
        'echo {wildcards.sample};'
        'cp {input} {output}'

There can be multiple wildcards in a workflow, and even within a single rule; in this case, all outputs of the rule must still contain the same set of wildcards.

When using multiple wildcards within a rule, it is common to run into "ambiguous" wildcard problems. To illustrate this problem, let us consider a rule with the output results/{sample}_{treatment}.txt:

rule first_step:
    input:
        'data/{sample}_{treatment}.tsv'
    output:
        'results/{sample}_{treatment}.txt'
    shell:
        'cp {input} {output}'

Imagine that we would like to apply this rule to generate the output results/sample_1_control.txt. The file name fits the pattern in two ways: sample='sample_1', treatment='control' or sample='sample', treatment='1_control'. Nothing in the rule tells Snakemake which of the two we intend, so the wildcard values it ends up with may not be the ones we expect.
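
The two readings can be reproduced with Python's re module. The sketch below is an illustration under the assumption that each wildcard is translated to an unconstrained .+ group; it is not taken from Snakemake's code. Which reading a regular expression engine returns depends only on how greedy the first group is.

```python
import re

target = 'results/sample_1_control.txt'

# Greedy first group: {sample} takes as many characters as it can.
greedy = re.match(r'results/(?P<sample>.+)_(?P<treatment>.+)\.txt$', target)
print(greedy.groupdict())  # {'sample': 'sample_1', 'treatment': 'control'}

# Lazy first group: {sample} takes as few characters as it can.
lazy = re.match(r'results/(?P<sample>.+?)_(?P<treatment>.+)\.txt$', target)
print(lazy.groupdict())    # {'sample': 'sample', 'treatment': '1_control'}
```

Both matches are valid for the same file name, which is exactly why the pattern is ambiguous.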

One way to solve this issue is to use wildcard constraints. In short, the values of wildcards can be constrained using regular expressions. Regular expressions are patterns that describe sequences of characters; there are multiple implementations, and Snakemake uses Python's syntax (similar to Perl's, if you are familiar with that one). Regular expressions are beyond the scope of this workshop, but they are very powerful tools for parsing text in general; a good place to start learning about them is this tutorial.

By default, wildcards match the regular expression .+, meaning "one or more occurrences of any character except newline". Let's implement a constraint so that the wildcards in the previous rule are always resolved as sample='sample_1', treatment='control'. There are two main ways to do that:

  • Specify that the value of the sample wildcard always follows the pattern sample_<number>. This pattern would be implemented as sample_[\d]+ ('sample_' followed by one or more digits).
  • Specify that the value of the treatment wildcard cannot contain the character _. This pattern would be implemented as [^_]+ (one or more characters other than '_').
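
To see why either constraint removes the ambiguity, the sketch below lists every way of splitting sample_1_control at an underscore and keeps only the splits whose two halves satisfy the given patterns. The helper valid_splits is hypothetical and only mimics the effect of a constraint; it is not how Snakemake is implemented.

```python
import re

def valid_splits(stem, sample=r'.+', treatment=r'.+'):
    """List every (sample, treatment) split of stem that satisfies both patterns."""
    splits = []
    for i, char in enumerate(stem):
        if char == '_':
            left, right = stem[:i], stem[i + 1:]
            # Each half must match its pattern in full, not just a prefix.
            if re.fullmatch(sample, left) and re.fullmatch(treatment, right):
                splits.append((left, right))
    return splits

stem = 'sample_1_control'

# Default .+ for both wildcards: two valid splits, hence the ambiguity.
print(valid_splits(stem))
# [('sample', '1_control'), ('sample_1', 'control')]

# Constraining either wildcard leaves a single valid split.
print(valid_splits(stem, sample=r'sample_[\d]+'))  # [('sample_1', 'control')]
print(valid_splits(stem, treatment=r'[^_]+'))      # [('sample_1', 'control')]
```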

Wildcard constraints can be defined for a single rule using the directive wildcard_constraints: <pattern>. There can be multiple constraints for a single rule:

rule first_step:
    input:
        'data/{sample}_{treatment}.tsv'
    output:
        'results/{sample}_{treatment}.txt'
    wildcard_constraints:
        sample = r'sample_[\d]+',
        treatment = '[^_]+'
    shell:
        'cp {input} {output}'

Constraints can also be defined for the entire snakefile using the exact same syntax:

wildcard_constraints:
    sample = r'sample_[\d]+',
    treatment = '[^_]+'

rule first_step:
    input:
        'data/{sample}_{treatment}.tsv'
    output:
        'results/{sample}_{treatment}.txt'
    shell:
        'cp {input} {output}'

In this case, constraints will be applied to the relevant wildcards in all rules.
