4. Tutorial: Running mapache - sneuensc/mapache GitHub Wiki
Throughout this section we will explain how to use mapache to map and impute chr21 with the test dataset. Please make sure to install mapache and GATK first. You do not need to further modify the samples file or the config file (steps 2 and 3 below) if you plan to run only the test.
Otherwise, for a real case with your own data, the recommended workflow is to
- Install mapache (and GATK if realigning around indels) and copy its scripts to your working directory
- Prepare the sample file with paths to the FASTQ files
- Configure the pipeline
- Configure a profile for job submissions (optional, highly recommended)
- Visualize the directed acyclic graph (DAG) (optional, recommended)
- Perform a dry run (optional, highly recommended)
- Execute the mapping pipeline with mapache
- Generate the html report
mapache
from GitHub
Installing The easiest way to install mapache
is to clone the repository from GitHub and using Conda to create an environment that contains the base for snakemake
and all dependencies for mapache
.
## create mamba environment
yes | conda create -n base-mamba -c conda-forge mamba
## activate mamba environment
conda activate base-mamba
## clone mapache repository
git clone https://github.com/sneuensc/mapache
cd mapache
## create conda environment for mapache
yes | mamba env create -n mapache --file config/mapache-env.yaml
## now you can activate the mapache environment
conda activate mapache
# These lines will help you to create an alias to copy or create symbolic links
# to the required directories. Make sure to move to the mapache directory first
# alias to copy directories
echo "alias copy_mapache='cp -r $(pwd -P)/{config,workflow,test_data,slurm} . '" >> ~/.bash_profile
source ~/.bash_profile
cd .. ; mkdir -p tutorial ; cd tutorial ; copy_mapache ; ls -l
File structure and copying scripts
Let's first get familiar with the file structure obtained by cloning the mapache
GitHub repository (see Installing mapache from GitHub):
tree -L 2 mapache
mapache/
├── config/
│ ├── config.yaml <= config file to specify the mapping
│ ├── mapache-env.yaml <= conda definition for entire environment
│ ├── README.md
│ └── samples.tsv <= sample file listing all fastq files (example dataset)
├── LICENSE
├── README.md
├── slurm/ <= cluster profile for slurm
├── test_data/ <= folder containing the test dataset (FASTQ files)
└── workflow/
├── envs/ <= folder containing all rule-specific conda definitions
├── report/ <= folder containing headings for report
├── rules/ <= folder containing all other sankefiles
├── scripts/ <= folder containing all scripts
└── Snakefile <= main Snakefile
☝🏿 To run mapache, you can move to mapache's folder (cd mapache
), but we recommend to work in a different directory and copy these folders:
config
andworkflow
to run mapache at any timeslurm
if you plan to submit jobs to a system managed by slurmtest_data
to map the test dataset
Reference genome
When cloning the mapache
GitHub repository we got all input FASTQ files for this tutorial, but we still need the human reference genome.
## create a folder
mkdir -p ref
## download reference
curl -o ref/hs37d5.fa.gz ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
## gunzip the reference and extract chromosomes 1-22, X, Y, MT
gunzip ref/hs37d5.fa.gz
samtools faidx ref/hs37d5.fa
samtools faidx ref/hs37d5.fa {1..22} X Y MT > ref/hs37d5_subset.fa
Now, we have to specify the reference genome location in the config file (config/config.yaml
):
genome:
hg19: ref/hs37d5_subset.fa
Tip: you can edit the config file with
nano config/config.yaml
. To find the above line, press Ctrl + W, then type "genome:" (without quotes). Edit the file, save it by pressing Ctrl + O, and exit with Ctrl + X.
Sample file
Let's have a look at the sample file:
$ column -t config/samples.tsv
SM LB ID Data
ind1 lib1_lb lib1_R1_001_fq test_data/lib1_R1_001.fastq.gz
ind1 lib1_lb lib1_R1_002_fq test_data/lib1_R1_002.fastq.gz
ind1 lib2_lb lib2_R1_001_fq test_data/lib2_R1_001.fastq.gz
ind2 lib3_lb lib3_R1_001_fq test_data/lib3_R1_001.fastq.gz
ind2 lib3_lb lib3_R1_002_fq test_data/lib3_R1_002.fastq.gz
Everything looks fine. We have five fastq files which belong to three libraries. These three libraries are from two samples/individuals. The path to the fastq files seems to be correct (mapache
will throw an error if the files do not exist).
Input files to impute chr20 and chr21
This section is optional. You can skip to the next section if you just want to map the test data.
Since we are mapping to hg19, we need a panel (in VCF format) mapped to the same version of the reference genome. The lines below will download the 1000 Genomes panel (.vcf.gz) and its extension (.vcf.gz.tbi) for chr20 and chr21.
Beware that this will download ~530 Mb of data.
for chr in {20..21}
do
for ext in vcf.gz vcf.gz.tbi
do
curl -o 1000G.chr$chr.$ext "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr$chr.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.$ext"
done
done
We will also download the genetic maps for both chromosomes:
mkdir maps
git clone https://github.com/odelaneau/shapeit4
tar -xzf shapeit4/maps/genetic_maps.b37.tar.gz --directory maps
If you used those commands and want to run imputation on the test dataset, you will need to edit the corresponding section of the config file:
imputation:
hg19:
run: True
# [other lines are not shown here]
## path_map and path_panel may specify a single file OR may contain the variable '{chr}' to specify it per chromosome
path_map: "maps/chr{chr}.b37.gmap.gz"
path_panel: "1000G.chr{chr}.vcf.gz"
Tip: you can edit the config file with
nano config/config.yaml
. To find the above line, press Ctrl + W, then type "imputation:" (without quotes). Edit the file, save it by pressing Ctrl + O, and exit with Ctrl + X.
mapache
Running Recent versions of snakemake throw an error. You can run
y | mamba install 'tabulate=0.8.10'
before trying the dry run. More info here: https://github.com/snakemake/snakemake/issues/1899
Everything is set and we are ready to launch our first mapping. Before launching the jobs, it is good practice to visualize the directed acyclic graph (DAG) and perform a dry run.
To create simplified DAGs:
snakemake dag --dag | dot -Tpng > dag.png
snakemake dag --rulegraph | dot -Tpng > rulegraph.png
The following line will print the commands to be run:
snakemake -np
You can launch the mapping using a single CPU and printing the commands to be executed:
snakemake --cores 1 -p
For the test dataset, the bottleneck is indexing the genome with BWA. Otherwise, most of the tasks are quickly executed. This takes about one hour (for 1 CPU).
To automatically submit jobs to the queue in a server managed by SLURM, you can try this command:
snakemake --jobs 999 -p --profile slurm
This will automatically queue at most 999 jobs using the profile included in mapache's repository.
All output is stored by default in the folder results
.
results/ <= folder containing all results
├── 00_reference
│ └── hg19 <= folder containing the reference fasta and all indexes
├── 01_fastq <= folder containing all steps computed at the fastq level
│ ├── 00_reads
│ ├── 01_trimmed
│ ├── 02_mapped
│ ├── 03_filtered
│ └── 04_final_fastq <= folder containing the final bam files of the fastq level
├── 02_library <= folder containing all steps computed at the library level
│ ├── 00_merged_fastq
│ ├── 01_duplicated
│ └── 03_final_library <= folder containing the final bam files of the library level
├── 03_sample <= folder containing all steps computed at the sample level
│ ├── 00_merged_library
│ ├── 01_realigned
│ ├── 02_md_flag
│ └── 03_final_sample <= ** FOLDER WITH FINAL BAM FILES **
└── 04_stats <= folder containing all statistics
├── 01_sparse_stats
├── 02_separate_tables
├── 03_summary <= ** FOLDER WITH SUMMARY STATISTICS TABLES **
└── 04_plots <= ** FOLDER WITH SOME INFORMATIVE PLOTS **
Generate an html report with mapping statistics:
snakemake --report report.html
Finally, if you did not modify the pipeline to run imputation, you can do so as explained above. Once you have the input files and have modified the config file, the command below should print the remaining jobs to run:
snakemake -n --rerun-triggers mtime
And you can run them using one CPU:
snakemake -p --rerun-triggers mtime --cores 1
Update the report:
snakemake --report report.html
Older versions of snakemake generate nicer reports. We recommend creating a new environment with Snakemake 6.10.0:
mamba create -n snakemake610 snakemake=6.10.0
mamba activate snakemake610
snakemake --report report.html
# or:
# snakemake --report report.zip
# for a zipped directory including the html report, tables and plots
The following information is reported:
- Workflow: It shows the used workflow. By clicking on the nodes, the executed code is shown.
- Statistics: Two tables are shown, one with the runtime and the other one with the run date of each rule (function).
- Configuration: The entire config file is listed. Great for reproducibility.
- Results:
- Damage pattern: Shows tables and plots of the damage pattern per library.
- Mapping statistics: Shows tables and plots of summary statistics for samples, libraries and fastq files.
- Read length: Shows tables and plots of summary statistics for libraries.
For details and more statistics, please see section 5. Parameters for statistics.
Local machine or submitting jobs to a queueing system?
The snakemake implementation of mapache
allows you to run multiple tasks in parallel.
However, you need to specify the number of CPUs that can be used to run the tasks in parallel locally in a machine:
snakemake --cores 32
In that example, snakemake will run 32 tasks simultaneously. If you do not have 32 CPUs available and execute that line, then snakemake will use as many CPUs as possible in your machine.
A better approach is to use a profile adapted to your system manager. Mapache includes a profile for SLURM, you can find the repositories for other systemes here.
To use the SLURM profile and submit a maximum of 999 jobs at a time:
snakemake -p --jobs 999 --profile slurm
Running a realistic sized mapping
The test-data are great to get familiar with mapache
, but they are smaller than a regular screen, and the mapping can be run on a small computer. With real sized libraries, mapping is time and memory consuming. Therefore, for larger input files, it is advisable to follow these steps:
a) Request enough running time and memory
If you are using a queueing system, it is advisable to modify the configuration file to specify the amount of CPUs, time and memory that will be requested for each of the jobs.
The config file contains the keywords time
and mem
for all the tasks that normally take more than 15 minutes to compute, and the keyword threads
when the tool allows multithreading.
You can easily find them on a text editor (e.g., Ctrl + F). Time is indicated in hours and memory in GB.
If you are not using a queueing system but have several CPUs available, the keyword threads
will still apply, indicating the maximum number of CPUs per task. However, this value will be downscaled if you specify less CPUs for the entire pipeline. E.g., if you indicated threads: 8
for AdapterRemoval, and launched mapache as snakemake --cores 4
, then a maximum of 4 CPUs will be used to run ApaterRemoval.
b) Draw a DAG
By visualizing all jobs and their dependencies using a directed acyclic graph (DAG) one gains an overview of what will be computed. A DAG is drawn using the command
snakemake dag --dag | dot -Tsvg > dag.svg
c) Dry run
Snakemake allows you to do a a dry run of the entire pipeline using the command. This means that nothing is run, but you will see the list of tasks that will be executed.
If the pipeline stops running at some point (e.g., wrong inputs specified, or due to a lack of memory or time resources), the dry run would print only the jobs that have not been run yet or that need to be updated:
snakemake -n
d) Cluster submission
Real-sized libraries are often not executed on a single computer, but on a HPC or a cloud infrastructure. Snakemake allows to scale up and launch mapache
on a cluster or a cloud. For example launching mapache
on a cluster with SLURM as queuing system is straightforward:
snakemake --jobs 999 --profile slurm
the option --jobs 999
indicates that a maximum of 999 jobs will be sent to the queue at the same time.