1. Installation - sneuensc/mapache GitHub Wiki
Preface
In order to use mapache, you will need to follow these steps:
- Install mapache
- Prepare the input file pointing to your FASTQ files
- Adapt the config files to your own needs
We strongly recommend installing mapache and running it with the provided test dataset to get familiar with the behavior of the pipeline.
Additionally, you can:
- Manage the software that will be used by mapache
- Configure a profile, which will help you to automatically submit and monitor jobs in a queueing system
Installation
There are two options to install mapache. Both options require Conda, so please get Conda on your system first.
Once Conda is installed, you can proceed to install mamba with the following commands:
## create mamba environment
conda create -n base-mamba -c conda-forge mamba
## activate mamba environment
conda activate base-mamba
Now you are ready to install mapache from GitHub or via snakedeploy.
Installing mapache from GitHub
The easiest way to install mapache is to clone the repository from GitHub and use Conda to create an environment that contains the base for snakemake and all dependencies for mapache.
## clone mapache repository
git clone https://github.com/sneuensc/mapache.git
cd mapache
## create conda environment for mapache
mamba env create -n mapache --file config/mapache-env.yaml
## now you can activate the mapache environment
conda activate mapache
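Before running the pipeline, it can help to confirm that the mapache environment is really the active one. The sketch below relies on the CONDA_DEFAULT_ENV variable, which conda exports on activation. The helper name in_mapache_env is hypothetical and not part of mapache, and the check assumes the environment was created by name (with -n mapache, as above) rather than by path.

```shell
# Hypothetical helper: succeeds only if the active conda environment
# is named "mapache" (conda exports the name in CONDA_DEFAULT_ENV).
in_mapache_env() {
  [ "${CONDA_DEFAULT_ENV:-}" = "mapache" ]
}

if in_mapache_env; then
  echo "mapache environment is active"
else
  echo "mapache environment is not active; run: conda activate mapache"
fi
```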
Note: GATK installation is not needed for the tutorial. However, if you plan to realign reads around indels, please follow the instructions to install GATK (see Installation of GATK v3.8 below), unless GATK is already installed on your system.
Then, you will need to indicate the name of the GATK executable, or the path to the .jar file, in the config file (config/config.yaml). In this example, the executable is called GenomeAnalysisTK.
software:
  picard_jar: 'picard'
  gatk3_jar: 'GenomeAnalysisTK'
If you want or need more control over the specific versions of the tools that will be executed, you can continue on this page and specify the software that will be used.
Installing mapache with snakedeploy
Another way to install mapache is to use snakedeploy.
## create snakemake environment
mamba create -c bioconda -c conda-forge --name snakemake snakemake snakedeploy mamba
## activate the new environment
conda activate snakemake
## create project folder
mkdir -p project_folder
cd project_folder
## deploy mapache (any branch, release, or commit may be used as the tag)
snakedeploy deploy-workflow https://github.com/sneuensc/mapache . --tag master
Snakedeploy allows deploying any version of mapache by changing the argument of --tag. Possibilities are:
- branch name to get the last commit of this branch (e.g., master)
- release (e.g., v0.1.1)
- commit (e.g., 69c820d)
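To illustrate the three kinds of tag, the sketch below only prints the corresponding deploy commands instead of running them, so it can be tried without snakedeploy installed. The helper name deploy_cmd is hypothetical; the tag values are the examples from the list above.

```shell
# Hypothetical helper: print (do not run) the snakedeploy command for a tag.
deploy_cmd() {
  echo "snakedeploy deploy-workflow https://github.com/sneuensc/mapache . --tag $1"
}

deploy_cmd master    # latest commit of a branch
deploy_cmd v0.1.1    # a release
deploy_cmd 69c820d   # a specific commit
```

Pinning a release or commit rather than a branch name is usually the safer choice when an analysis needs to be reproduced later, since a branch keeps moving.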
Manage software (advanced)
mapache needs to find the executables of each bioinformatic tool that will be run.
Here, we explain how to set up different software versions or executables.
There are four main ways to tell mapache which software will be run, each with pros and cons.
- Using the mapache environment with Conda or mamba (described above)
- Using rule-specific conda environments
- Using pre-installed software (for example, load a module)
- Hybrid mode, using pre-installed software and rule-specific conda environments
Using the mapache environment (recommended method)
The easiest and recommended way to manage software versions is to create a single conda environment and run mapache within this environment:
## create conda environment for mapache
mamba env create -n mapache --file config/mapache-env.yaml
conda activate mapache
## run mapache
snakemake --cores all
The yaml file config/mapache-env.yaml lists the packages that are installed when the mapache environment is created.
If you need a different version of a tool, edit this file before creating the environment.
Using rule-specific conda environments
The trick here is to add --use-conda when executing mapache.
For each rule (i.e., function or step), a conda environment is specified containing a specific software version.
You can find all the available yaml environments for each software in workflow/envs.
These small environments are created the first time that mapache is executed with --use-conda added to the command line.
## running mapache requires the parameter '--use-conda'
snakemake --cores all --use-conda
The next time you run this command, the packages will already be installed, and mapache will reuse the existing environments.
The environments may also be created in advance:
## create rule-specific conda environments for mapache ('--use-conda' must also be set)
snakemake --use-conda --conda-create-envs-only --cores all
## running mapache requires the parameter '--use-conda'
snakemake --cores all --use-conda
Using pre-installed software
Global installation
If all dependencies are already installed and globally accessible (including python and R packages), mapache can be run directly with:
## run mapache
snakemake --cores all
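With a global installation, a missing tool typically only shows up when the rule that needs it fails. A quick check of the PATH beforehand can save a failed run. The sketch below is one way to do it: check_tools is a hypothetical helper, and the executable names are assumptions based on the default names these packages usually install under.

```shell
# Hypothetical helper: print every named executable that is not on PATH.
# Returns 0 if all were found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Assumed executable names; adjust to your installation.
check_tools snakemake fastp bwa bowtie2 samtools bedtools seqtk fastqc \
  || echo "some tools are missing; install them or use one of the other modes"
```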
Environment modules
The trick here is to add --use-envmodules when executing mapache.
If your computer infrastructure uses environment modules, you can take advantage of these installations and specify the dependencies in the config file (config/config.yaml), e.g.:
envmodules:
  samtools: "gcc samtools/1.12"
  bowtie2: "gcc bowtie2/2.4.2"
  bwa: "gcc bwa/0.7.17"
  picard: "gcc picard/2.24.0"
  gatk3: "gcc gatk/3.8-1"
  fastqc: "gcc fastqc/0.11.9"
  r: "gcc r/4.0.4"
  adapterremoval: "gcc adapterremoval/2.3.2"
  bedtools: "gcc bedtools2/2.29.2"
  mapdamage: "gcc mapdamage2/2.2.1"
  seqtk: "gcc seqtk/1.3"
and run mapache as follows:
## running mapache requires the parameter '--use-envmodules'
snakemake --cores all --use-envmodules
Hybrid solution: pre-installed software and rule-specific environments
The trick here is to add --use-envmodules --use-conda when executing mapache.
If not all dependencies are available on your system as environment modules, you can use a hybrid setup. For example, with mapdamage and seqtk left empty in the config file:
envmodules:
  samtools: "gcc samtools/1.12"
  bowtie2: "gcc bowtie2/2.4.2"
  bwa: "gcc bwa/0.7.17"
  picard: "gcc picard/2.24.0"
  gatk3: "gcc gatk/3.8-1"
  fastqc: "gcc fastqc/0.11.9"
  r: "gcc r/4.0.4"
  adapterremoval: "gcc adapterremoval/2.3.2"
  bedtools: "gcc bedtools2/2.29.2"
  mapdamage: ""
  seqtk: ""
you can run mapache as follows:
## running mapache requires both parameters '--use-conda' and '--use-envmodules'
snakemake --cores all --use-conda --use-envmodules
mapache will use the dependencies that are available as environment modules and will use conda environments for the other dependencies (those not set, or left empty, in the config file).
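The four ways of managing software differ only in the flags passed to snakemake. As a compact summary, the sketch below maps each mode to its flags. The function name snakemake_flags and the mode labels are hypothetical, chosen for this illustration; the flags themselves are the ones used in the sections above.

```shell
# Hypothetical helper: print the snakemake flags for a software-management mode.
snakemake_flags() {
  case "$1" in
    environment|global) echo "--cores all" ;;                        # single env or global install
    conda)   echo "--cores all --use-conda" ;;                       # rule-specific conda envs
    modules) echo "--cores all --use-envmodules" ;;                  # environment modules
    hybrid)  echo "--cores all --use-conda --use-envmodules" ;;      # modules plus conda fallback
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

for mode in environment conda modules hybrid; do
  echo "$mode: snakemake $(snakemake_flags "$mode")"
done
```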
Dependencies of mapache
mapache depends on the following software. The pipeline was tested with the specified versions; other versions should also work, as long as they don't change the command line and/or the input and output file names.
- snakemake = 7.18.2
- fastp = 0.23.2
- bwa = 0.7.17-r1188
- Bowtie 2 = 2.4.4
- FastQC = 0.11.9
- GenomeAnalysisTK = 3.8-1-0
- mapDamage = 2.2.1
- Picard MarkDuplicates = 2.25.5
- samtools = 1.14
- bedtools = 2.30.0
- seqtk = 1.3
- DeDup = 0.12.8
- QualiMap = 2.2.2
- multiqc = 1.13
- glimpse = 1.1.1
- bcftools = 1.15
- R = 4.0.5
  - ggplot2
  - rcolorbrewer
  - reshape2
  - svglite
  - gridextra
- python = 3.9.7
  - pandas
  - numpy
  - itertools
  - pathlib
  - re
  - os
  - argparse
  - pysam
  - sys
  - math
  - collections
  - subprocess
Installation of GATK v3.8
Due to license restrictions for GATK v3.8, the mapache conda package cannot distribute and install GATK 3.8 directly (please note that the GATK IndelRealigner used in the pipeline is not available in GATK v4 or later). To fully install GATK, you must download a licensed copy of GATK from the Broad Institute and call "gatk3-register", which will copy GATK into your mapache conda environment:
# (download licensed copy of GATK)
gatk3-register /path/to/GenomeAnalysisTK.jar
In short, you have to:
- activate the conda environment containing GATK.
conda activate mapache
If you use rule-specific conda environments, you first need to find the appropriate environment, e.g.:
## find conda environment
$ grep -i gatk /path/to/mapache/.snakemake/conda/*.yaml
/path/to/mapache/.snakemake/conda/82f2bb3b7e488f30c27f6219ba709caf.yaml: - gatk = 3.8
## load this environment
$ conda activate /path/to/mapache/.snakemake/conda/82f2bb3b7e488f30c27f6219ba709caf
- execute gatk3 to see if GATK is already registered. If you get the following output, you have to register GATK:
$ gatk3
GATK jar file not found. Have you run "gatk3-register"?
- download the GATK archive (the file to register) from the Broad Institute.
wget link_to_file
- register GATK
gatk3-register GenomeAnalysisTK-3.8-1-0-gf15c1c3ef.tar.bz2
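For rule-specific environments, the lookup step shown earlier (grep over .snakemake/conda/*.yaml, then activating the directory with the same name minus the .yaml suffix) can be wrapped in a small helper. This is only a sketch: find_gatk_envs is a hypothetical name, and it assumes each environment directory sits next to its yaml spec, as in the listing above.

```shell
# Hypothetical helper: print the conda environment directories whose
# yaml spec mentions GATK. Assumes <hash>.yaml sits next to the <hash>
# environment directory inside the given folder.
find_gatk_envs() {
  conda_dir="$1"   # e.g. /path/to/mapache/.snakemake/conda
  for spec in "$conda_dir"/*.yaml; do
    [ -f "$spec" ] || continue          # skip if the glob matched nothing
    if grep -qi 'gatk' "$spec"; then
      echo "${spec%.yaml}"              # strip the .yaml suffix
    fi
  done
}

# Prints nothing if the folder does not exist or has no matching spec.
find_gatk_envs /path/to/mapache/.snakemake/conda
```

Each printed path can then be passed to conda activate before running gatk3-register.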