4. Tutorial: Running mapache - sneuensc/mapache GitHub Wiki

Throughout this section we will explain how to use mapache to map and impute chr21 with the test dataset. Please make sure to install mapache and GATK first. You do not need to further modify the samples file or the config file (steps 2 and 3 below) if you plan to run only the test.

Otherwise, for a real case with your own data, the recommended workflow is to

  1. Install mapache (and GATK if realigning around indels) and copy its scripts to your working directory
  2. Prepare the sample file with paths to the FASTQ files
  3. Configure the pipeline
  4. Configure a profile for job submissions (optional, highly recommended)
  5. Visualize the directed acyclic graph (DAG) (optional, recommended)
  6. Perform a dry run (optional, highly recommended)
  7. Execute the mapping pipeline with mapache
  8. Generate the html report

Installing mapache from GitHub

The easiest way to install mapache is to clone the repository from GitHub and use Conda to create an environment containing Snakemake and all of mapache's dependencies.

## create mamba environment
yes | conda create -n base-mamba -c conda-forge mamba

## activate mamba environment
conda activate base-mamba

## clone mapache repository
git clone https://github.com/sneuensc/mapache
cd mapache

## create conda environment for mapache
yes | mamba env create -n mapache --file config/mapache-env.yaml

## now you can activate the mapache environment
conda activate mapache


# These lines create an alias to copy the required directories into any
# working directory. Make sure to run them from the mapache directory first

# alias to copy directories
echo "alias copy_mapache='cp -r $(pwd -P)/{config,workflow,test_data,slurm} . '" >> ~/.bash_profile

source ~/.bash_profile

cd .. ; mkdir -p tutorial ; cd tutorial ; copy_mapache ; ls -l

File structure and copying scripts

Let's first get familiar with the file structure obtained by cloning the mapache GitHub repository (see Installing mapache from GitHub):

tree -L 2 mapache

mapache/
├── config/
│   ├── config.yaml      <= config file to specify the mapping
│   ├── mapache-env.yaml <= conda definition for entire environment
│   ├── README.md
│   └── samples.tsv      <= sample file listing all fastq files (example dataset)
├── LICENSE
├── README.md
├── slurm/               <= cluster profile for slurm
├── test_data/           <= folder containing the test dataset (FASTQ files)
└── workflow/
    ├── envs/            <= folder containing all rule-specific conda definitions
    ├── report/          <= folder containing headings for report
    ├── rules/           <= folder containing all other snakefiles
    ├── scripts/         <= folder containing all scripts
    └── Snakefile        <= main Snakefile

☝🏿 To run mapache, you can work directly in mapache's folder (cd mapache), but we recommend working in a different directory and copying these folders:

  • config and workflow to run mapache at any time
  • slurm if you plan to submit jobs to a system managed by slurm
  • test_data to map the test dataset

Reference genome

Cloning the mapache GitHub repository gave us all the input FASTQ files for this tutorial, but we still need the human reference genome.

## create a folder
mkdir -p ref

## download reference
curl -o ref/hs37d5.fa.gz  ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

## gunzip the reference and extract chromosomes 1-22, X, Y, MT
gunzip ref/hs37d5.fa.gz 
samtools faidx ref/hs37d5.fa
samtools faidx ref/hs37d5.fa {1..22} X Y MT > ref/hs37d5_subset.fa
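A quick sanity check after the extraction is to count the FASTA headers: the subset should contain 25 sequences (chromosomes 1-22 plus X, Y and MT). The sketch below demonstrates the idea on a tiny toy FASTA, since the real file is large:

```shell
# Build a toy FASTA (a stand-in for ref/hs37d5_subset.fa) and count its
# sequence headers; each FASTA record starts with a '>' line.
printf '>1\nACGTACGT\n>2\nGGCCGGCC\n>MT\nTTAATTAA\n' > toy_subset.fa
grep -c '^>' toy_subset.fa    # prints 3 here; the real subset should report 25
```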

Now, we have to specify the reference genome location in the config file (config/config.yaml):

genome: 
    hg19: ref/hs37d5_subset.fa

Tip: you can edit the config file with nano config/config.yaml. To find the above line, press Ctrl + W, then type "genome:" (without quotes). Edit the file, save it by pressing Ctrl + O, and exit with Ctrl + X.

Sample file

Let's have a look at the sample file:

$ column -t config/samples.tsv

SM    LB       ID              Data
ind1  lib1_lb  lib1_R1_001_fq  test_data/lib1_R1_001.fastq.gz
ind1  lib1_lb  lib1_R1_002_fq  test_data/lib1_R1_002.fastq.gz
ind1  lib2_lb  lib2_R1_001_fq  test_data/lib2_R1_001.fastq.gz
ind2  lib3_lb  lib3_R1_001_fq  test_data/lib3_R1_001.fastq.gz
ind2  lib3_lb  lib3_R1_002_fq  test_data/lib3_R1_002.fastq.gz

Everything looks fine. We have five fastq files belonging to three libraries; these three libraries come from two samples/individuals. The paths to the fastq files appear to be correct (mapache will throw an error if a file does not exist).
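To see how the SM/LB/ID columns define mapache's three-level hierarchy (sample > library > fastq file), the illustrative snippet below recreates the example table and counts distinct libraries per sample with awk:

```shell
# Recreate the example sample file (tab-separated); printf reuses the
# format string for each group of four fields
printf '%s\t%s\t%s\t%s\n' \
  SM   LB      ID             Data \
  ind1 lib1_lb lib1_R1_001_fq test_data/lib1_R1_001.fastq.gz \
  ind1 lib1_lb lib1_R1_002_fq test_data/lib1_R1_002.fastq.gz \
  ind1 lib2_lb lib2_R1_001_fq test_data/lib2_R1_001.fastq.gz \
  ind2 lib3_lb lib3_R1_001_fq test_data/lib3_R1_001.fastq.gz \
  ind2 lib3_lb lib3_R1_002_fq test_data/lib3_R1_002.fastq.gz > samples_demo.tsv

# Count distinct libraries (LB) per sample (SM), skipping the header.
# Expected: ind1 has 2 libraries, ind2 has 1 (output order may vary).
awk -F'\t' 'NR > 1 && !seen[$1 FS $2]++ { n[$1]++ }
            END { for (s in n) print s, n[s], "libraries" }' samples_demo.tsv
```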

Input files to impute chr20 and chr21

This section is optional. You can skip to the next section if you just want to map the test data.

Since we are mapping to hg19, we need a panel (in VCF format) mapped to the same version of the reference genome. The lines below will download the 1000 Genomes panel (.vcf.gz) and its index (.vcf.gz.tbi) for chr20 and chr21.

Beware that this will download ~530 MB of data.


for chr in {20..21}
do
  for ext in vcf.gz vcf.gz.tbi
  do
     curl  -o 1000G.chr$chr.$ext "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr$chr.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.$ext"
  done
done

We will also download the genetic maps for both chromosomes:

mkdir maps
git clone https://github.com/odelaneau/shapeit4
tar -xzf shapeit4/maps/genetic_maps.b37.tar.gz --directory maps

If you used those commands and want to run imputation on the test dataset, you will need to edit the corresponding section of the config file:

imputation:
    hg19:
        run: True

# [other lines are not shown here]

        ## path_map and path_panel may specify a single file OR may contain the variable '{chr}' to specify it per chromosome 
        path_map: "maps/chr{chr}.b37.gmap.gz"
        path_panel: "1000G.chr{chr}.vcf.gz"

Tip: you can edit the config file with nano config/config.yaml. To find the above line, press Ctrl + W, then type "imputation:" (without quotes). Edit the file, save it by pressing Ctrl + O, and exit with Ctrl + X.

Running mapache

Recent versions of snakemake may throw an error caused by an incompatible tabulate release. If you hit it, run yes | mamba install 'tabulate=0.8.10' before trying the dry run. More info here: https://github.com/snakemake/snakemake/issues/1899

Everything is set and we are ready to launch our first mapping. Before launching the jobs, it is good practice to visualize the directed acyclic graph (DAG) and perform a dry run.

To create simplified DAGs:

snakemake dag --dag | dot -Tpng > dag.png               
snakemake dag --rulegraph | dot -Tpng > rulegraph.png

The following line will print the commands to be run:

    snakemake -np

You can launch the mapping using a single CPU, printing each command as it is executed:

    snakemake --cores 1 -p

For the test dataset, the bottleneck is indexing the genome with BWA; most other tasks finish quickly. The full run takes about one hour on 1 CPU.

To automatically submit jobs to the queue in a server managed by SLURM, you can try this command:

    snakemake --jobs 999 -p --profile slurm

This will automatically queue at most 999 jobs using the profile included in mapache's repository.

All output is stored by default in the folder results.

results/                     <= folder containing all results
├── 00_reference 
│   └── hg19                 <= folder containing the reference fasta and all indexes 
├── 01_fastq                 <= folder containing all steps computed at the fastq level
│   ├── 00_reads
│   ├── 01_trimmed
│   ├── 02_mapped
│   ├── 03_filtered
│   └── 04_final_fastq       <= folder containing the final bam files of the fastq level
├── 02_library               <= folder containing all steps computed at the library level
│   ├── 00_merged_fastq
│   ├── 01_duplicated
│   └── 03_final_library     <= folder containing the final bam files of the library level
├── 03_sample                <= folder containing all steps computed at the sample level
│   ├── 00_merged_library
│   ├── 01_realigned
│   ├── 02_md_flag
│   └── 03_final_sample      <= ** FOLDER WITH FINAL BAM FILES **
└── 04_stats                 <= folder containing all statistics
    ├── 01_sparse_stats
    ├── 02_separate_tables
    ├── 03_summary           <= ** FOLDER WITH SUMMARY STATISTICS TABLES **
    └── 04_plots             <= ** FOLDER WITH SOME INFORMATIVE PLOTS **
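The final, analysis-ready BAM files end up in results/03_sample/03_final_sample. A quick way to list them after a run is with find; the sketch below mocks that part of the tree with empty placeholder files (the per-genome subfolder and file names are illustrative assumptions) so the command can be tried anywhere:

```shell
# Mock the relevant part of the results tree with empty placeholder files
# (hypothetical names; real runs produce one BAM per sample and genome)
mkdir -p results_demo/03_sample/03_final_sample/hg19
touch results_demo/03_sample/03_final_sample/hg19/ind1.hg19.bam \
      results_demo/03_sample/03_final_sample/hg19/ind2.hg19.bam

# List the final BAMs; on a real run, replace results_demo with results
find results_demo/03_sample/03_final_sample -name '*.bam'
```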

Generate an html report with mapping statistics:

snakemake --report report.html

Finally, if you did not configure the pipeline to run imputation, you can do so now as explained above. Once you have the input files and have modified the config file, the command below should print the remaining jobs to run:

snakemake -n --rerun-triggers mtime

And you can run them using one CPU:

snakemake -p --rerun-triggers mtime --cores 1

Update the report:

snakemake --report report.html

Older versions of snakemake generate nicer reports. We recommend creating a new environment with Snakemake 6.10.0:

mamba create -n snakemake610 snakemake=6.10.0

mamba activate snakemake610

snakemake --report report.html
# or:
# snakemake --report report.zip
# for a zipped directory including the html report, tables and plots

The following information is reported:

  • Workflow: Shows the executed workflow as a graph. Clicking on a node displays the code that was run.
  • Statistics: Two tables are shown, one with the runtime and the other with the run date of each rule (function).
  • Configuration: The entire config file is listed. Great for reproducibility.
  • Results:
    • Damage pattern: Shows tables and plots of the damage pattern per library.
    • Mapping statistics: Shows tables and plots of summary statistics for samples, libraries and fastq files.
    • Read length: Shows tables and plots of summary statistics for libraries.

For details and more statistics, please see section 5. Parameters for statistics.

Local machine or submitting jobs to a queueing system?

The snakemake implementation of mapache allows you to run multiple tasks in parallel.

However, when running locally you need to specify the number of CPUs that snakemake may use to run tasks in parallel:

snakemake --cores 32

In that example, snakemake will run up to 32 tasks simultaneously. If you execute that line with fewer than 32 CPUs available, snakemake will simply use as many CPUs as your machine provides.

A better approach is to use a profile adapted to your job scheduler. Mapache includes a profile for SLURM; you can find the repositories for other systems here.

To use the SLURM profile and submit a maximum of 999 jobs at a time:

snakemake -p --jobs 999 --profile slurm
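A Snakemake profile is just a directory (here slurm/) containing a config.yaml with default command-line options. The sketch below shows the general shape of such a file; the field names and values are illustrative assumptions, not the exact content of the profile shipped with mapache:

```yaml
# Hypothetical slurm/config.yaml sketch (illustrative only; see the profile
# in mapache's repository for the real settings)
jobs: 999                      # maximum number of jobs queued at once
cluster: "sbatch --cpus-per-task={threads} --mem={resources.mem}G --time={resources.runtime}"
default-resources:
  - mem=4                      # GB, matching the units used in the config file
  - runtime=60
```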

Running a realistic sized mapping

The test data are great for getting familiar with mapache: the files are much smaller than a regular sequencing run, so the mapping can be run on a small computer. With real-sized libraries, mapping is time- and memory-consuming. Therefore, for larger input files, it is advisable to follow these steps:

a) Request enough running time and memory

If you are using a queueing system, it is advisable to modify the configuration file to specify the number of CPUs, the time and the memory that will be requested for each job.

The config file contains the keywords time and mem for all the tasks that normally take more than 15 minutes to compute, and the keyword threads when the tool allows multithreading.

You can easily find them in a text editor (e.g., Ctrl + F). Time is indicated in hours and memory in GB. If you are not using a queueing system but have several CPUs available, the keyword threads still applies, indicating the maximum number of CPUs per task. However, this value is scaled down if you grant fewer CPUs to the entire pipeline: e.g., if you indicated threads: 8 for AdapterRemoval and launched mapache as snakemake --cores 4, then at most 4 CPUs will be used to run AdapterRemoval.
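In other words, the effective thread count for a rule is the minimum of its threads value and the total passed to --cores. A minimal shell illustration of that rule of thumb:

```shell
# Effective threads = min(rule threads, snakemake --cores), e.g. for
# AdapterRemoval with 'threads: 8' and a 'snakemake --cores 4' invocation
rule_threads=8
total_cores=4
effective=$(( rule_threads < total_cores ? rule_threads : total_cores ))
echo "AdapterRemoval runs with $effective threads"   # prints 4
```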

b) Draw a DAG

Visualizing all jobs and their dependencies as a directed acyclic graph (DAG) gives an overview of what will be computed. A DAG is drawn with the command

snakemake dag --dag | dot -Tsvg > dag.svg

c) Dry run

Snakemake allows you to perform a dry run of the entire pipeline: nothing is executed, but you will see the list of tasks that would be run.

If the pipeline stops at some point (e.g., wrong inputs specified, or a lack of memory or time resources), the dry run prints only the jobs that have not been run yet or that need to be updated:

snakemake -n

d) Cluster submission

Real-sized libraries are often not processed on a single computer, but on an HPC or a cloud infrastructure. Snakemake lets you scale up and launch mapache on a cluster or in the cloud. For example, launching mapache on a cluster with SLURM as the queueing system is straightforward:

snakemake --jobs 999 --profile slurm

The option --jobs 999 indicates that at most 999 jobs will be sent to the queue at the same time.