1. Installation - sneuensc/mapache GitHub Wiki

Preface

In order to use mapache, you will need to follow these steps:

  1. Install mapache
  2. Prepare the input file pointing to your FASTQ files
  3. Adapt the config files to your own needs

We strongly recommend installing mapache and running it with the provided test dataset to become familiar with the behavior of the pipeline.

Additionally, you can

  1. Manage the software that will be used by mapache
  2. Configure a profile which will help you to automatically submit and monitor jobs in a queueing system

Installation

There are two options to install mapache. Both options require Conda, so please get Conda on your system first.

Once Conda is installed, you can proceed to install mamba with the following commands:

## create mamba environment
conda create -n base-mamba -c conda-forge mamba

## activate mamba environment
conda activate base-mamba

Now you are ready to install mapache from GitHub or via snakedeploy.

Installing mapache from GitHub

The easiest way to install mapache is to clone the repository from GitHub and use Conda to create an environment that contains snakemake and all dependencies of mapache.

## clone mapache repository
git clone https://github.com/sneuensc/mapache.git
cd mapache

## create conda environment for mapache
mamba env create -n mapache --file config/mapache-env.yaml

## now you can activate the mapache environment
conda activate mapache

🚨 GATK is not needed for the tutorial. However, if you plan to realign reads around indels, please follow the instructions to install GATK (unless GATK is already installed on your system). You will then need to set either the name of the GATK executable or the path to the .jar file in the config file (config/config.yaml). In this example, the executable is called GenomeAnalysisTK.

software:
    picard_jar: 'picard'
    gatk3_jar: 'GenomeAnalysisTK'

If you want or need more control over the specific versions of the tools that will be executed, continue on this page and specify the software that will be used.

Installing mapache with snakedeploy

Another way to install mapache is to use snakedeploy.

## create snakemake environment
mamba create -c bioconda -c conda-forge --name snakemake snakemake snakedeploy mamba

## activate the new environment
conda activate snakemake

## create project folder
mkdir -p project_folder
cd project_folder

## deploy mapache (any branch, release, or commit may be used as the tag)
snakedeploy deploy-workflow https://github.com/sneuensc/mapache . --tag master

Snakedeploy can deploy any version of mapache by changing the argument of --tag. The possibilities are:

  • branch name to get the last commit of this branch (e.g., master)
  • release (e.g., v0.1.1)
  • commit (e.g., 69c820d)
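The three kinds of tag can be compared side by side with a small sketch. The helper name mapache_deploy_cmd is hypothetical (it is not part of mapache or snakedeploy); it only prints the command so that you can review it before running it yourself.

```shell
# Hypothetical helper (an assumption, not shipped with mapache):
# print the snakedeploy command for a given tag without executing it.
mapache_deploy_cmd() {
    # $1: branch name, release, or commit; defaults to the master branch
    tag="${1:-master}"
    printf 'snakedeploy deploy-workflow %s . --tag %s\n' \
        "https://github.com/sneuensc/mapache" "$tag"
}

# Print the command for each kind of tag listed above
mapache_deploy_cmd master     # last commit of a branch
mapache_deploy_cmd v0.1.1     # release
mapache_deploy_cmd 69c820d    # commit
```

Pinning a release or a commit instead of a branch name keeps the deployed workflow identical across reruns, which is usually preferable for reproducible analyses.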

Manage software (advanced)

mapache needs to find the executable of each bioinformatic tool that will be run. Here, we explain how to set up different software versions or executables. There are four main ways to tell mapache which software to run, each with pros and cons.

  1. Using the mapache environment with Conda or mamba (described above)
  2. Using rule-specific conda environments
  3. Using pre-installed software (for example, load a module)
  4. Hybrid mode, using pre-installed software and rule-specific conda environments

Using the mapache environment (recommended method)

The easiest and recommended way to manage software versions is to create a single conda environment and run mapache within this environment:

## create conda environment for mapache
mamba env create -n mapache --file config/mapache-env.yaml
conda activate mapache

## run mapache
snakemake --cores all

The yaml file config/mapache-env.yaml lists the packages that were installed during the creation of the mapache environment. If you need to change a specific version of a software, you can do it by editing this file prior to creating the environment.

Using rule-specific conda environments

The trick here is to add --use-conda when executing mapache.

For each rule (i.e., function or step) a conda environment is specified containing a specific software version. You can find all the available yaml environments for each software in workflow/envs.

These small environments are created the first time that mapache is executed by adding --use-conda to the command line.

## running mapache requires the parameter '--use-conda'
snakemake --cores all --use-conda

The next time you run this command, the environments will already exist, and mapache will reuse them instead of installing the packages again.

The environments may also be created in advance:

## create rule-specific conda environments for mapache
## (snakemake requires '--use-conda' together with '--conda-create-envs-only')
snakemake --use-conda --conda-create-envs-only --cores all

## running mapache requires the parameter '--use-conda'
snakemake --cores all --use-conda

Using pre-installed software

Global installation

If all dependencies are already installed and globally accessible (including python and R packages), mapache can be run directly with

## run mapache
snakemake --cores all 
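
Before relying on a global installation, it can help to check that the tools are actually on your PATH. The following is a minimal sketch: the function name check_tools is hypothetical, the tool list is only a subset of the dependencies, and it assumes that each executable carries the same name as the tool.

```shell
# Hypothetical helper (an assumption, not shipped with mapache):
# report which of the given executables can be found on the PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf 'found: %s\n' "$tool"
        else
            printf 'missing: %s\n' "$tool"
            missing=1
        fi
    done
    # exit status 0 only if every tool was found
    return "$missing"
}

# Partial list; executable names are assumed to match the tool names
check_tools samtools bwa bowtie2 fastqc bedtools seqtk \
    || echo "Some tools are missing: install them or use one of the other methods."
```

The check only confirms that an executable exists; it does not verify its version.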

Environment modules

The trick here is to add --use-envmodules when executing mapache.

If your computing infrastructure uses environment modules, you can take advantage of these installations and specify the modules to load for each dependency in the config file (config/config.yaml), e.g.:

envmodules:
    samtools:       "gcc samtools/1.12"
    bowtie2:        "gcc bowtie2/2.4.2"
    bwa:            "gcc bwa/0.7.17"
    picard:         "gcc picard/2.24.0"
    gatk3:          "gcc gatk/3.8-1"
    fastqc:         "gcc fastqc/0.11.9"
    r:              "gcc r/4.0.4"
    adapterremoval: "gcc adapterremoval/2.3.2"
    bedtools:       "gcc bedtools2/2.29.2"
    mapdamage:      "gcc mapdamage2/2.2.1"
    seqtk:          "gcc seqtk/1.3"

and run mapache as follows

## running mapache requires the parameter '--use-envmodules'
snakemake --cores all --use-envmodules

Hybrid solution: pre-installed software and rule-specific environments

The trick here is to add --use-envmodules --use-conda when executing mapache.

If not all dependencies are available on your system, you can use a hybrid setup. For example, suppose that some dependencies are not available as environment modules (here mapdamage and seqtk):

envmodules:
    samtools:       "gcc samtools/1.12"
    bowtie2:        "gcc bowtie2/2.4.2"
    bwa:            "gcc bwa/0.7.17"
    picard:         "gcc picard/2.24.0"
    gatk3:          "gcc gatk/3.8-1"
    fastqc:         "gcc fastqc/0.11.9"
    r:              "gcc r/4.0.4"
    adapterremoval: "gcc adapterremoval/2.3.2"
    bedtools:       "gcc bedtools2/2.29.2"
    mapdamage:      ""
    seqtk:          ""

you can run mapache as follows

## running mapache requires both parameters '--use-conda' and '--use-envmodules'
snakemake --cores all --use-conda --use-envmodules

mapache will use the dependencies that are available as environment modules and will use conda environments for the other dependencies (those that are not set or are left empty in the config file).
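
The four ways of running mapache described in this section differ only in the flags passed to snakemake. As a compact reminder, here is a small sketch; the helper mapache_run_cmd and its mode names are hypothetical and only print the commands shown above.

```shell
# Hypothetical helper (an assumption, not shipped with mapache):
# print the snakemake command that matches a software-management mode.
mapache_run_cmd() {
    case "$1" in
        env)        echo "snakemake --cores all" ;;                  # single mapache environment or global installation
        conda)      echo "snakemake --cores all --use-conda" ;;      # rule-specific conda environments
        envmodules) echo "snakemake --cores all --use-envmodules" ;; # environment modules
        hybrid)     echo "snakemake --cores all --use-conda --use-envmodules" ;;
        *)          echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# List the command for every mode
for mode in env conda envmodules hybrid; do
    printf '%-11s %s\n' "$mode:" "$(mapache_run_cmd "$mode")"
done
```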

Dependencies of mapache

mapache depends on the following software. The pipeline was tested with the specified versions; other versions should also work, as long as they do not change the command-line interface or the names of the input and output files.

- snakemake = 7.18.2
- fastp = 0.23.2
- bwa = 0.7.17-r1188
- Bowtie 2 = 2.4.4
- FastQC = 0.11.9
- GenomeAnalysisTK = 3.8-1-0
- mapDamage = 2.2.1
- Picard MarkDuplicates = 2.25.5
- samtools = 1.14
- bedtools = 2.30.0
- seqtk = 1.3
- DeDup = 0.12.8
- QualiMap = 2.2.2
- multiqc = 1.13
- glimpse = 1.1.1
- bcftools = 1.15
- R = 4.0.5
  - ggplot2
  - rcolorbrewer
  - reshape2
  - svglite
  - gridextra
- python = 3.9.7
  - pandas
  - numpy
  - itertools
  - pathlib
  - re
  - os
  - argparse
  - pysam
  - sys
  - math
  - collections
  - subprocess

Installation of GATK v3.8

Due to license restrictions for GATK v3.8, the mapache conda environment cannot distribute and install GATK 3.8 directly (please note that the GATK IndelRealigner used in the pipeline is not available in GATK v4 and later). To fully install GATK, you must download a licensed copy of GATK from the Broad Institute and call gatk3-register, which will copy GATK into your mapache conda environment:

# (download licensed copy of GATK)
gatk3-register /path/to/GenomeAnalysisTK.jar

In short, you have to:

  1. Activate the conda environment containing GATK.

conda activate mapache

If you use rule-specific conda environments, you first need to find the appropriate environment, e.g.:

## find conda environment
$ grep -i gatk /path/to/mapache/.snakemake/conda/*.yaml
/path/to/mapache/.snakemake/conda/82f2bb3b7e488f30c27f6219ba709caf.yaml:  - gatk = 3.8

## load this environment
$ conda activate /path/to/mapache/.snakemake/conda/82f2bb3b7e488f30c27f6219ba709caf

  2. Execute gatk3 to see if GATK is already registered. If you get the following output, you have to register GATK:

$ gatk3
GATK jar file not found. Have you run "gatk3-register"?

  3. Download the registration file from the Broad Institute.

wget link_to_file

  4. Register GATK.

gatk3-register GenomeAnalysisTK-3.8-1-0-gf15c1c3ef.tar.bz2
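
The registration check described above can also be scripted. This is only a sketch: the helper gatk_needs_registration is hypothetical, and it assumes that an unregistered gatk3 prints the "GATK jar file not found" message quoted above.

```shell
# Hypothetical helper (an assumption, not shipped with mapache or GATK):
# succeed if the given gatk3 output contains the "jar file not found" message.
gatk_needs_registration() {
    # $1: text printed by gatk3
    case "$1" in
        *"jar file not found"*) return 0 ;;
        *)                      return 1 ;;
    esac
}

if command -v gatk3 >/dev/null 2>&1; then
    # capture stdout and stderr; ignore the exit status of gatk3 itself
    output="$(gatk3 2>&1 || true)"
    if gatk_needs_registration "$output"; then
        echo "GATK is not registered yet: run gatk3-register"
    else
        echo "GATK appears to be registered"
    fi
else
    echo "gatk3 not found: activate the environment that contains GATK first"
fi
```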