Advanced directives: threads, log, and benchmark - RomainFeron/workshop-snakemake-sibdays2020 GitHub Wiki

So far, we have seen the directives input and output to manage files, shell, run, and script to execute the rule, and params for non-file parameters. There are a few other directives available to control a rule's advanced features. In this section, we will learn about the directives threads, log, and benchmark.

The 'threads' directive

The threads directive specifies the number of threads that Snakemake allocates to each job spawned by a rule. It follows the syntax threads: <number_of_threads>:

rule first_step:
    input:
        'data/first_step.tsv'
    output:
        'results/first_step.txt'
    threads: 4
    shell:
        'command --threads {threads} {input} > {output}'

This directive is only useful for software that accepts a threads parameter; Snakemake cannot parallelize a program automatically. However, you can specify the total number of cores allocated to Snakemake with the runtime parameter --cores <number_of_cores>, and Snakemake will then run multiple jobs in parallel when possible. Note that the number of threads allocated to all jobs running at a given time cannot exceed the value specified with --cores. Therefore, a rule requesting 4 threads will run with only 2 threads if --cores was set to 2.
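The interaction between threads and --cores can be sketched as simple arithmetic. The Python snippet below is a simplified model, not Snakemake's actual scheduler: the function names effective_threads and max_parallel_jobs are made up for illustration, and the model only considers jobs from a single rule and ignores any other resources.

```python
def effective_threads(requested, cores):
    """Threads a job actually gets: the rule's request, capped at --cores."""
    return min(requested, cores)


def max_parallel_jobs(requested, cores):
    """How many jobs of the same rule fit within --cores at the same time."""
    return cores // effective_threads(requested, cores)


# A rule declaring `threads: 4`, run with different values of --cores
for cores in (2, 4, 8):
    print(
        "--cores", cores,
        "threads per job:", effective_threads(4, cores),
        "jobs in parallel:", max_parallel_jobs(4, cores),
    )
```

With --cores 2 the job is scaled down to 2 threads, and only with --cores 8 do two 4-thread jobs fit side by side.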

The 'log' directive

The log directive specifies the path to a log file for the rule. It follows the syntax log: <path/to/log/file.log>. The value of log can then be accessed from the shell directive with {log}:

rule first_step:
    input:
        'data/first_step.tsv'
    output:
        'results/first_step.txt'
    log:
        'logs/first_step.log'
    shell:
        'head {input} > {output} 2> {log}'

Note that logs have to be handled manually for each command. Some programs have a parameter to specify a log file; others write logs to stderr, which can be redirected to a file with 2> as in the example above. Still others produce no logs at all, or mix logs and output in the same stream.

As with input and output, Snakemake will automatically create all directories in the log file path. Log files can have wildcards, and the wildcards in the log file path have to be exactly the same as the wildcards in the output; otherwise, multiple jobs could create the same log file.
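To see why the wildcards must match, the following Python sketch fills in a wildcard for several jobs and counts the distinct paths that result. It only mimics the path substitution with str.format and is not how Snakemake resolves wildcards internally; the sample names, the {sample} wildcard, and the helper expand_paths are hypothetical.

```python
samples = ["A", "B", "C"]  # hypothetical sample names, one job per sample

output_pattern = "results/{sample}.txt"
shared_log = "logs/first_step.log"            # no wildcard in the log path
per_job_log = "logs/first_step/{sample}.log"  # same wildcard as the output


def expand_paths(pattern, names):
    """Fill the {sample} wildcard for each name, giving one path per job."""
    return [pattern.format(sample=name) for name in names]


outputs = expand_paths(output_pattern, samples)
shared = expand_paths(shared_log, samples)
per_job = expand_paths(per_job_log, samples)

# Count distinct paths: every job should have its own output and its own log
print("distinct outputs:", len(set(outputs)))
print("distinct logs without wildcard:", len(set(shared)))
print("distinct logs with wildcard:", len(set(per_job)))
```

Without the wildcard, all three jobs point to the same log path, which is exactly the collision the rule above prevents.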

Suggestion: it can be convenient to group all log files for your workflow in a logs/ folder at the root of your workflow's directory. This way, you can easily check the logs for a job in case of failure.

Note: since Snakemake 5.7.1, you can print the log files of failed jobs using the runtime parameter --show-failed-logs.

The 'benchmark' directive

The benchmark directive specifies the path to a file that will contain benchmark results for the rule. It follows the syntax benchmark: <path/to/benchmark/file.txt>:

rule first_step:
    input:
        'data/first_step.tsv'
    output:
        'results/first_step.txt'
    benchmark:
        'benchmarks/first_step.txt'
    shell:
        'head {input} > {output}'

In practice, Snakemake will measure runtime and memory usage for the job and store the values in the benchmark file. As this does not affect performance, benchmarking the rule is virtually free, and the results can help you identify performance bottlenecks in your workflow or estimate the resource requirements of a specific step. Snakemake can also repeat the benchmark N times with the syntax benchmark: repeat('path/to/benchmark/file.txt', N). In this case, however, the job itself is run N times.
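Benchmark files are easy to summarize with a few lines of Python. The sketch below assumes a tab-separated file with a header row that includes a column named s (runtime in seconds) and a column named max_rss (peak memory), with one row per repetition. Check these column names against the files your Snakemake version writes, as real files contain more columns. The values are made up for illustration, and the function summarize is hypothetical.

```python
import csv
import io
import statistics

# Made-up content standing in for a benchmark file written with repeat(..., 3).
# Assumption: tab-separated, with at least the columns 's' and 'max_rss'.
example_benchmark = (
    "s\th:m:s\tmax_rss\n"
    "1.5\t0:00:01\t12.0\n"
    "2.0\t0:00:02\t12.5\n"
    "2.5\t0:00:02\t12.25\n"
)


def summarize(handle):
    """Return the number of runs, mean runtime, and peak memory from a benchmark table."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    runtimes = [float(row["s"]) for row in rows]
    memory = [float(row["max_rss"]) for row in rows]
    return {
        "runs": len(rows),
        "mean_seconds": statistics.mean(runtimes),
        "peak_max_rss": max(memory),
    }


summary = summarize(io.StringIO(example_benchmark))
print(summary)
```

To use it on a real file, replace the io.StringIO object with an open file handle, for example open('benchmarks/first_step.txt').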

As with input, output, and log, Snakemake will automatically create all directories in the benchmark file path. Benchmark files can have wildcards, and the wildcards in the benchmark file path have to be exactly the same as the wildcards in the output; otherwise, multiple jobs could create the same benchmark file.