Automatic software deployment with Conda - RomainFeron/workshop-snakemake-sibdays2020 GitHub Wiki

As we mentioned in the introduction, one major obstacle to reproducible data analysis is software, specifically installation and versioning. To reproduce a workflow exactly, you should be able to easily install the exact version of every piece of software it uses. One solution to this problem is the package and environment manager Conda.

In short, Conda is an open-source system for installing, running, and updating software on Linux, macOS, and Windows. Software is packaged by maintainers and made available through channels, which are repositories containing hundreds or thousands of packages. Conda is part of the Anaconda Python distribution, whose team maintains the default channel. However, most software is available from community-managed channels; the two important ones for us are:

Conda-forge: contains lots of general-purpose software and libraries, often required by other software
Bioconda: a repository of bioinformatics software, started by the creator of Snakemake

In the vast majority of cases, channels retain every released version of a package, and Conda lets you install any specific version, which addresses the versioning side of the reproducibility problem. Overall, Conda is a great tool for managing your software in general, especially on Linux.

We have already been using Conda to manage all the software required for this workshop. The specifics of packaging and installing software with Conda are beyond the scope of this workshop; for more information, refer to the official documentation.

In practice, Conda lets you define environments, i.e. collections of software (with versions) that are installed and loaded together. Environments can be defined in YAML files, which contain at least the environment's name and its dependencies, and should also list the channels when they are required. For instance, this is the content of the file workshop.yaml at the root of the GitHub repository, which contains all the software required for the workshop:

name: snakemake-workshop
channels:
    - conda-forge
    - bioconda
dependencies:
    - python=3.6.8
    - snakemake=5.7.0
    - jinja2=2.10
    - networkx=2.1
    - matplotlib=2.2.3
    - graphviz=2.38.0
    - bcftools=1.9
    - samtools=1.9
    - bwa=0.7.17
    - pysam=0.15.0
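Since reproducibility depends on every dependency carrying a version, it can help to check an environment file for entries that lack one. The following is a minimal sketch, not part of the workshop material: the function name `unversioned_dependencies` is hypothetical, and the line-based parsing assumes a flat file like the one above (a real YAML parser would be more robust).

```python
import re

# Matches specs of the form name=version, optionally followed by =build.
VERSIONED_SPEC = re.compile(r"^[A-Za-z0-9_.-]+=[0-9][A-Za-z0-9_.]*(=[A-Za-z0-9_.]+)?$")


def unversioned_dependencies(env_text):
    """Return dependency specs that are not written as `name=version`.

    Assumes a flat environment file: a top-level `dependencies:` key
    followed by `- spec` items. Nested sections such as a `pip:`
    sub-list are not handled by this sketch.
    """
    flagged = []
    in_dependencies = False
    for raw_line in env_text.splitlines():
        line = raw_line.split("#", 1)[0].rstrip()  # drop trailing comments
        if not line.strip():
            continue
        if not line.startswith((" ", "-")):
            # A top-level key opens or closes the dependencies section.
            in_dependencies = line.strip() == "dependencies:"
            continue
        item = line.strip()
        if in_dependencies and item.startswith("- "):
            spec = item[2:].strip()
            if not VERSIONED_SPEC.match(spec):
                flagged.append(spec)
    return flagged


# Hypothetical environment file with one dependency missing its version.
example = """\
name: rule-env
dependencies:
    - bwa=0.7.17
    - samtools
"""
print(unversioned_dependencies(example))  # prints ['samtools']
```

Applied to workshop.yaml as shown above, this check would return an empty list, since every dependency there has a version.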

The easiest way to find out whether a piece of software is available in a Conda channel is simply to search the web for "Conda" followed by the name of the software!

Snakemake provides Conda integration to automatically install and load the software required by a rule. In practice, you define a Conda environment in a YAML file (see example above), and you then associate this environment with a rule using the syntax conda: 'path/to/env/file.yaml':

rule_env.yaml:

name: rule-env
channels:
    - conda-forge
    - bioconda
dependencies:
    - <software1>=<version1>
    - <software2>=<version2>

Snakefile:

rule first_step:
    input:
        'data/first_step.tsv'
    output:
        'results/first_step.txt'
    conda:
        'rule_env.yaml'
    shell:
        '<software1> {input} | <software2> {output}'

By default, Snakemake will not install and deploy Conda environments unless you specify the command-line option --use-conda; this way, users can still execute the workflow with their own versions of the software if they wish.
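As a rough illustration of this opt-in behaviour, the sketch below builds the command line for a run with or without Conda deployment. The helper `snakemake_command` is hypothetical and the value `cores=1` is only an illustrative choice; `--use-conda` is the option described above, and `--cores` is Snakemake's option for the number of cores to use.

```python
def snakemake_command(use_conda=True, cores=1):
    """Build the argument list for a Snakemake run.

    With use_conda=False the workflow relies on whatever software
    versions are already available on the user's PATH.
    """
    command = ["snakemake", "--cores", str(cores)]
    if use_conda:
        # Ask Snakemake to create and activate the per-rule environments.
        command.append("--use-conda")
    return command


print(" ".join(snakemake_command()))
print(" ".join(snakemake_command(use_conda=False)))

# To actually launch the run (requires Snakemake to be installed):
#     import subprocess
#     subprocess.run(snakemake_command(), check=True)
```

The first printed line corresponds to what you would type in a terminal to run the workflow with automatic environment deployment.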

The same Conda environment can be used in multiple rules, and you can have multiple environments in your workflow. How you organize the environments is up to you; a suggested approach is to be modular without creating a separate environment for every rule (installing environments can take some time on the first run).
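To see how environments are shared across a workflow, one option is to list which rules use which environment file. The sketch below is an assumption-laden illustration: the Snakefile content, rule names, and `envs/` paths are hypothetical, and the text scan only recognizes `rule name:` headers and quoted paths after `conda:` rather than parsing Snakemake syntax fully.

```python
import re
from collections import defaultdict

RULE_HEADER = re.compile(r"^rule\s+(\w+)\s*:")
QUOTED_PATH = re.compile(r"""['"]([^'"]+)['"]""")

# Hypothetical workflow: two rules share one environment, a third has its own.
SNAKEFILE = """\
rule map_reads:
    input: 'data/reads.fastq'
    output: 'results/reads.sam'
    conda: 'envs/mapping.yaml'
    shell: 'bwa mem data/ref.fasta {input} > {output}'

rule sort_alignments:
    input: 'results/reads.sam'
    output: 'results/reads.sorted.bam'
    conda:
        'envs/mapping.yaml'
    shell: 'samtools sort -o {output} {input}'

rule call_variants:
    input: 'results/reads.sorted.bam'
    output: 'results/variants.vcf'
    conda: 'envs/calling.yaml'
    shell: 'bcftools mpileup -f data/ref.fasta {input} | bcftools call -mv -o {output}'
"""


def rules_by_environment(snakefile_text):
    """Map each environment file to the names of the rules that use it."""
    groups = defaultdict(list)
    current_rule = None
    expect_path = False
    for line in snakefile_text.splitlines():
        header = RULE_HEADER.match(line)
        if header:
            current_rule = header.group(1)
            expect_path = False
            continue
        stripped = line.strip()
        if stripped.startswith("conda:"):
            # The path may follow on the same line or on the next one.
            stripped = stripped[len("conda:"):]
            expect_path = True
        if expect_path and current_rule:
            found = QUOTED_PATH.search(stripped)
            if found:
                groups[found.group(1)].append(current_rule)
                expect_path = False
            elif stripped:
                expect_path = False  # not a quoted path; give up on this directive
    return dict(groups)


for env_file, rules in rules_by_environment(SNAKEFILE).items():
    print(env_file, "->", ", ".join(rules))
```

Here the two alignment-related rules share a single environment, so only two environments need to be installed on the first run instead of three.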
