3. Config file (parameters) - sneuensc/mapache Wiki

Introduction

The pipeline can be customised using a configuration file in the yaml format (default config/config.yaml).

The minimal input files to run mapache are

  • a plain text file (config/samples.tsv) with paths and metadata of the input FASTQ files (see Sample file)
  • a reference genome in FASTA format listed in the config file config/config.yaml (see Reference genome)

Additionally, if you plan to impute genomes, a reference panel (in VCF format) needs to be specified in the config file as well.

In the file config/config.yaml, you will see blocks corresponding to each of the steps that could be run. There, you can specify whether the step should be run, include additional arguments, and specify the number of threads and amount of memory to be used.

Here is an example for the subsampling step, which allows you to run mapache only on a few reads (number: 1000) of your data (to test whether the pipeline will succeed with your current configuration). Note that the additional parameter -s1 (random seed for the tool seqtk) was passed using the keyword params.

## subsampling (optional)
subsampling:
    run: False
    mem: 2
    params: '-s1'
    number: 1000

For some tools, the parameters have been set to be optimized for ancient human DNA. As a general rule, we try to add as few extra parameters as possible, and you can find them all (and change them as needed) on the config file. For example, to use the default parameters for a particular tool, the params has to be empty (params: '').

The following paragraphs list and describe all available parameters in mapache. We hope that the config file is self-explanatory, but in this page we extend the documentation for each of the sections in the configuration in case of doubts.

Software

Job resubmission

Sample file

Specifies the location of the sample file. The description on how to write a sample file is found on section 2 of this wiki (Sample file)

sample_file: config/samples.tsv                  ## path to the sample file

Output folder

Mapping

Reference genome

Input FASTQ files may be mapped to one or multiple reference genomes.

The minimal specification of the human reference genome is for the GRCh37:

genome: 
  GRCh37: 
    fasta: path_to_reference/GRCh37.fasta

and for the human reference genome GRCh38:

genome: 
  GRCh38:                                         
    fasta: path_to_reference/GRCh38.fasta

If you plan to map to multiple reference genomes (.e.g, organism1, organism2, organism3), you can do so by listing them all under the keyword genome:

genome: 
  ORG1:
    fasta: path_to_reference/organism1.fasta
  ORG2:
    fasta: path_to_reference/organism2.fasta
  ORG3:
    fasta: path_to_reference/organism3.fasta

Note that the names for each reference genome (in the example above, ORG1, ORG2 and ORG3) have to be unique. Your final output files will be named with the format {sample_name}.{genome_name}.bam

Indexing

Reference genome(s) and bam files have to be indexed for some tools. The indexing is done automatically if needed (for fasta and bam files). If indexes of the reference gnome (fasta) are already available beforehand, mapache may include them as inputs if they are located in the same directory as their matching reference genome (e.g., foo.fasta) and are named with the following formats:

## reference genome name specified in config file:
foo.fasta

## bwa index file names:
foo.fasta.sa
foo.fasta.amb
foo.fasta.ann
foo.fasta.bwt
foo.fasta.pac

## bowtie2 index file names:
foo.fasta.1.bt2
foo.fasta.2.bt2
foo.fasta.3.bt2
foo.fasta.4.bt2
foo.fasta.rev.1.bt2
foo.fasta.rev.2.bt2

## samtools faidx index file name:
foo.fasta.fai

## picard index file name:
foo.dict

Indexing is specified with the following parameters:

indexing:
  bwa_params: ''                                 ## parameters for 'bwa index'
  bowtie2_params: ''                             ## parameters for 'bowtie2-build'
  samtools_params: ''                            ## parameters for 'samtools faidx'
  mem: 16                                        ## memory allocation for queuing system in GB
  threads: 1                                     ## number of CPUs requested for each indexing job
  time: 2                                        ## runtime allocation for queuing system in hours

Subsampling (optional)

If you are not yet familiar with mapache, we recommend you to try and subsample a few reads of your data to see mapache's behavior and get to know its output files quickly.

This step is run with the tool [seqtk] (https://github.com/lh3/seqtk).

seqtk sample {params} {input} {number}

Subsampling is specified with the following parameters:

subsampling:
  run: False                                     ## should the fastq files be randomly subsampled?
  number: 1000                                   ## number/ratio of reads to keep:
                                                 ##   <1: ratio of reads
                                                 ##   >=1: absolute number of reads 
  params: '-s1'                                  ## parameters for seqtk sample (if paired reads are 
                                                 ##   subsampled, the seed has to be fixed (e.g., -s1)

AdapterRemoval (optional)

Ancient DNA is usually fragmented, with fragment lengths usually shorter than the number of sequenced cycles, resulting in reads including adapter sequences. Removing these remnants of adapters is crucial.

Mapache uses AdapterRemoval2 to remove adapters and to perform cleaning of the reads in general. Mapache applies by default settings optimized for ancient DNA, namely:

  • remove default adapters
  • trim ambiguous bases (N) at 5'/3' termini (--trimns)
  • trim consecutive stretches of low-quality (<=2) bases at the termini (--trimqualities)
  • discard reads shorter than 30bp following trimming (--minlength 30)

Paired-end reads may be collapsed with AdapterRemoval2 by including either --collapse, --collapse-deterministic or --collapse-conservatively to the params option in the config file. If you decide to collapse the read pairs, only collapsed pairs (usually the majority of the dataset) will be mapped to the reference genome; that is, singletons and trashed reads that did not pass the aforementioned filters will not be mapped to the reference genomes.

If you collapse paired-end reads, please have a look at DeDup to remove duplicates (see Removing duplicates)

AdapterRemoval is run for single-end reads as follows

AdapterRemoval ${params} --file1 ${input} --basename ${output%%.fastq.gz} --gzip --output1 ${output}

AdapterRemoval is specified with the following parameters:

adapterremoval: 
  run: True                                          ## should AdapterRemoval be run?
  params: '--minlength 30 --trimns --trimqualities'  ## parameters for AdapterRemoval
  threads: 4                                         ## number of threads
  mem: 4                                             ## memory allocation for queuing system in GB
  time: 2                                            ## runtime allocation for queuing system in hours

Mapping

Mapping to a reference genome is a mandatory step. The default mapping is optimized for ancient DNA using bwa_aln with the parameter -l 1024 to disable seeding (Schubert et al., 2012). However, other mappers are available. We show below the commands that are used to map single-end (or collapsed) and paired-end reads with the available mapping softwares (BWA and bowtie2).

  • bwa_aln (default) using the command
    • for single-end reads:
bwa aln ${params} ${ref} -f ${sai} ${fastq}
bwa samse ${bwa_samse_params} -r \"@RG\\tID:${ID}\\tLB:${LB}\\tSM:${SM}\\tPL:${PL}\" ${ref} ${sai} ${fastq}
    • for paired-end reads:
bwa aln {params} {ref} -f {sai1} {fastq1}
bwa aln {params} {ref} -f {sai2} {fastq2}
bwa sampe ${bwa_sampe_params} -r \"@RG\\tID:${ID}\\tLB:${LB}\\tSM:${SM}\\tPL:${PL}\" ${ref} ${sai1} ${sai2} ${fastq1} ${fastq2}
bwa mem ${bwa_mem_params} -R \"@RG\\tID:${ID}\\tLB:${LB}\\tSM:${SM}\\tPL:${PL}\" ${ref} ${fastq(s)}

The values in ID, LB, SM and PL are automatically extracted from the corresponding columns in the sample file. This is useful, for example, if in the future you need to split your BAM files by library type with samtools.

  • bowtie2 using the command
    • for single-end reads:
bowtie2 ${bowtie2_params} -x ${ref} -U ${fastq}
    • for paired-end reads:
bowtie2 ${bowtie2_params} -x ${ref} -1 ${fastq1} -2 ${fastq2}

Mapping is specified with the following parameters:

mapping:
  mapper: bwa_aln                                ## mapper (available: bwa_aln, bwa_mem, bowtie2)
  bwa_aln_params: '-l 1024'                      ## parameter for 'bwa aln'
  bwa_samse_params: '-n 3'                       ## parameter for 'bwa samse' (single-end libraries)
  bwa_sampe_params: '-n 3'                       ## parameter for 'bwa sampe' (paired-end libraries)
  bwa_mem_params: ''                             ## parameter for 'bwa mem'
  bowtie2_params: ''                             ## parameter for 'bowtie2'
  threads: 4                                     ## number of threads
  mem: 16                                        ## memory allocation for queuing system in GB
  time: 30                                       ## runtime allocation for queuing system in hour

Sorting

Mapped BAM files are always sorted using samtools with the command

samtools sort --threads ${threads} ${input} > ${output}

Sorting is specified with the following parameters:

sorting:
  threads: 4                                     ## number of threads
  mem: 16                                        ## memory allocation for queuing system in GB
  time: 10                                       ## runtime allocation for queuing system in hours

Filtering (optional)

It is common practice to keep only high-quality mapped reads in a final BAM file. By default, unmapped and low-quality reads are saved in a separate BAM file (save_low_qual: True) in case you want to use them for other type of analyses, such as investigating the metagenomic composition of the sample.

In brief, filtering is done with samtools using the command

samtools view -b -F 4 -q ${MAPQ} -U ${out_low_qual} ${input} > ${out_mapped}

Just as mentioned above, this command will save reads with a mapping quality equal or higher than MAPQ (indicated in the sample file) to {out_mapped}, and use samtools to store (-U ${out_low_qual}) unmapped reads (-F 4) and reads with lower mapping qualities (-q ${MAPQ}) to a separate BAM file.

Filtering is specified with the following parameters:

filtering:
  run: True                                      ## perform filtering or not (if `False`, the final bam file 
                                                 ##   contains all alignments, including not mapping reads)
  save_low_qual: True                            ## if filtering, shall the not mapped reads and low quality  
                                                 ##   alignments be saved in a separate bam file? 
  threads: 4                                     ## number of threads
  mem: 16                                        ## memory allocation for queuing system in GB
  time: 2                                        ## runtime allocation for queuing system in hours

Merging

Merging of bam files happens twice:

  • fastq to library: All bam files of the fastq files are merged at the library level (same name in LB and SM)
  • library to sample: All bam files of the libraries are merged at the sample level (same name in SM)

Merging is done using samtools with the command

samtools merge -f ${output} ${inputs}

Merging is specified with the following parameters:

merging:
  threads: 4                                     ## number of threads
  mem: 16                                        ## memory allocation for queuing system in GB
  time: 2                                        ## runtime allocation for queuing system in hours

Removing duplicates (optional)

A characteristic of ancient DNA is its scarcity. Consequently the chance to resequence the same molecule is quite large leading to unwanted duplicated sequences. This module allows detecting and removing such duplicated sequences. By default MarkDuplicates from Picard is used with the command

picard MarkDuplicates --INPUT ${input} --OUTPUT ${out_bam} --METRICS_FILE ${out_stats} \
            ${params} --ASSUME_SORT_ORDER coordinate --VALIDATION_STRINGENCY LENIENT

Alternatively, DeDup may be used which was especially developed for collapsed paired-end reads in ancient DNA using the command

dedup -i ${input} ${params}  -o $(dirname $bam)

Removing duplicates is specified with the following parameters:

remove_duplicates:
  run: adapterremoval                            ## should duplicates be inferred/removed?
                                                 ##   (adaperremoval, dedup, False)
  params_adapterremoval: '--REMOVE_DUPLICATES true'   ## parameters for markDuplicates
  params_dedup: '-m'                             ## parameters for dedup ('-m/--merged' is 
                                                 ##   automatically removed for single-end reads)
  threads: 4                                     ## number of threads
  mem: 16                                        ## memory allocation for queuing system in GB
  time: 2                                        ## runtime allocation for queuing system in hours

Rescaling damage (optional)

Reads originated from ancient samples present a high frequency of apparent C to T and G to A substitutions at the read termini, with specific patterns that vary depending on the protocol used to build the libraries.

An option to deal with these deamination patterns is to reduce the base qualities of the likely affected bases. You can do so by rescaling the qualities with mapDamage2, although this option is not run by default in mapache.

mapDamage -i ${input} -r ${ref} -d $(dirname ${output})_results ${params} --rescale \
      --merge-reference-sequences --rescale-out ${output};

Rescaling damage is specified with the following parameters:

damage_rescale:
  run: False                                     ## should the bam file be rescaled?
  params: ''                                     ## parameters for mapDamage
  threads: 1                                     ## number of threads (only single threaded)
  mem: 8                                         ## memory allocation for queuing system in GB
  time: 2                                        ## runtime allocation for queuing system in hours

Realigning indels (optional)

Taking the information across alignments allows correcting for mis-alignments (indels) at the edges of the reads. Realigning indels is done using IndelRealigner from GATK v3.8 using the command

GenomeAnalysisTK -T RealignerTargetCreator -I ${input} -R ${ref} -o ${intervals} ;
GenomeAnalysisTK -T IndelRealigner -I ${input} -R ${ref} -targetIntervals ${intervals} -o ${output}

Realigning indels is specified with the following parameters:

realign:
  run: True                                      ## should GATK IndelRealigner be run
  threads: 1                                     ## number of threads (only single threaded)
  mem: 8                                         ## memory allocation for queuing system in GB
  time: 2                                        ## runtime allocation for queuing system in hours

Recomputing md flags (optional)

This step allows to recompute the md flag of each alignment to avoid downstream analyses problems. Incorrect md flags may occur by modifications to the bam file in previous steps (such as realignment around indels) without adjusting the md flag and may lead to issues when using this BAM file with certain programs. The md flag is recomputed with samtools using the command

samtools calmd ${input} ${ref} | samtools view -bS - > ${output}

Recomputing md flag is specified with the following parameters:

run_compute_md:
  run: True                                      ## should samtools calmd be run [True]
  threads: 4                                     ## number of threads (only single threaded)
  mem: 8                                         ## memory allocation for queuing system in GB
  time: 2                                        ## runtime allocation for queuing system in hours

Imputation

Low-coverage imputation has as input genotype likelihoods for variable sites in a reference panel.

In the first step, we generate genotype likelihoods at the sites of the reference panel. Then, we proceed to impute. Besides the input data, for this step we need:

  1. path_panel: the reference panel (vcf file; e.g., 1000 Genomes)
  2. path_map: the genetic map that contains the recombination rates across the human genome in the coordinates of the genome build used to map the data

We can assess how confidently each site was imputed using the genotype probabilities (GP), that are contained in the output vcf/bcf files (FORMAT/GP). For any biallelic site, a GP value for each three possible genotypes (0, 1 and 2) is reported with values ranging between 0 and 1. They sum to 1 across the three genotypes. Mapache outputs a histogram with the resulting maximum GP for all imputed sites as well as the average maximum GP. Mapache allows filtering on the GP value (parameter ‘GP_filter’), which is a trade-off between imputation accuracy and loss of sites. Stricter thresholds for this filtering result in a greater loss of sites containing the alternative allele, which is particularly noticeable at lower depths of coverage (e.g., 0.1x). As default, mapache applies a GP filter of 0.8. In any case, we recommend removing rare variants, at least with minor allele frequency (MAF) below 1%, as these are less accurately imputed.

Statistics

FASTQC, qualimap and multiqc

Plots (optional)

A few plots are generated by default to get an idea of the amount of reads that have been analysed at some steps of the mapping pipeline. If you are mapping a large number of samples or to a large number of genomes, the default plotting settings might need some adjustments in order to improve the visualisation of the data.

stats:
    plots:
        x_axis: sample
        split_plot: F # possible values: FALSE, F, TRUE, T
        # n_col: 15
        # n_row: 1

  • x_axis: sample or genome. Select the value that will be shown on the x-axis. If sample, then all the samples will be listed on the x-axis, and different bars will be shown and coloured per genome. If genome, then all the genomes will be listed on the x-axis, and different bars will be shown and coloured per sample. If n_samples > n_genomes, we recommend to set this option to "sample". If n_genomes > n_samples, we recommend to set this option to "genome"

  • split_plot: TRUE (also T) or FALSE (F). This option will split the main plot into smaller sub-panels. We recommend to set it to TRUE if you are mapping many samples or genomes and the bars are crowded in the figures.

  • n_col: numeric. How many columns should be shown? We recommend a large number (20 - 40) for figures in landscape mode.

  • n_row: numeric. How many rows should be shown? We recommend a small number (1 - 5) for figures in landscape mode.

stats These parameters allow to define which statistics should be computed and the time and memory allocation

run_damage: 'False'                              ## infer the deamination pattern ['False']; 
                                                 ##   Options: 'False' (or anything), 
                                                 ##   'mapDamage', 'bamdamage'
run_depth: True                                  ## run the script deptyh.py and report coverage 
                                                 ## statistics [False]
run_bammds: False                                ## run bammds (not yet implemented)

## bamdamage 
bamdamage_fraction: 0                            ## fraction/number of alignments to consider to infer
                                                 ##   the damage pattern:
                                                 ##     0:   use all (default)
                                                 ##     <1:  a fraction of alignments to consider 
                                                 ##     >=1: absolute number of alignments to consider
                            
## memory (default 2GB)
fastqc_mem
samtools_flagstat_mem
multiqc_mem
depth_mem

## runtime (default 1h)
fastqc_time
samtools_flagstat_time
multiqc_time
depth_time

By default the pipeline computes intermediate summary statistics of the mappings to better evaluate the final results. In addition a few optional statistics are offered to be computed on the final bam file of the sample and/or library level.

Sex inference (optional)

Any human reference genome, but also any other reference genome in FASTA format may be specified with or without sex-specific chromosomes. To account for this flexibility several parameters need to be specified in order to compute underlying summary statistics (e.g. determine the genomic sex). Mapache recognizes automatically the two human reference genomes GRCh37and GRCh38 based on their chromosome names and assigns automatically the chromosomes to the autosome, the sex and the mt chromosomes. Expected chromosome names are

  • GRCh37: 1, 2, ..., 22 ,X ,Y ,MT
  • GRCh38: chr1, chr2, ..., chr22, chrX, chrY, chrM

For all other reference genomes, the assignments to autosome, sex and mt chromosomes have to be made manually if certain options (e.g., sex determination) are selected.

To infer sex, we follow the method presented by Mittnik et al. (2016) for humans. This method computes the average ratio of the normalised depth of the X chromosome to the normalised depth on the autosomes. Thus, for females, a ratio of 1 is expected, and for males a ratio of 0.5 is expected. Assuming this ratio can be approximated by a normal distribution, a 95% confidence interval is generated by taking ~1.96 standard errors of the mean. The sex is then assigned depending on the values spanned by this confidence interval (parameter thresholds).

We reimplemented the code to adapt it to organisms with a different chromosomal sex-determination systems, including XX/XY, ZW/ZZ, XX/XO and ZZ/ZO.

As the method relies on the number of reads mapped to the homogametic sex chromosome and to the autosomes, and the naming of chromosomes varies across reference genomes (i.e., 1,2,..,22,X,Y vs chr1,chr2,...,chr22,chrX,chrY), the parameters specific to sex inference have to be declared under the genome section of the config file.

Please make sure to pass the correct names of the autosomes and the sex chromosome as specified in the reference FASTA file, otherwise mapache will not run until you fix this error.

Thus, assuming that we are mapping to two different reference genome, the parameters should be specified as follows.


genome: 
  GRCh37: 
    fasta: path/hs.build37.1.fa                  ## the genome 'GRCh37' is automatically recognized
    sex_inference:
      run: False                                 ## should the sex be inferred?
      params:
        autosomes: 'list(range(1,23))'           ## specify the autosome chromosomes
        sex_chr: X                               ## specify the homogametic sex
        system: XY                               ## either of XY, ZW, XO or ZO
        signif: 0.95                             ## confidence interval wide
        thresholds: list( "XX" = c(0.8, 1), 
                          "XY" = c(0, 0.6), 
                          "consistent with XX but not XY" = c(0.6, 1), 
                          "consistent with XY but not XX" = c(0, 0.8))   

  GRCh38: 
    fasta: path/GCA_000001405.15_GRCh38.fa   ## this genome is automatically recognized
    sex_inference:
      run: True                                  ## should the sex be inferred?
      params:
        autosomes: '[f"chr{x}" for x in range(1,23)]'  ## specify the autosome chromosomes
        sex_chr: chrX                            ## specify the homogametic sex chromosome
        system: XY                               ## either of XY, ZW, XO or ZO
        signif: 0.95                             ## confidence interval wide
        thresholds: list( "XX" = c(0.8, 1), 
                          "XY" = c(0, 0.6), 
                          "consistent with XX but not XY" = c(0.6, 1), 
                          "consistent with XY but not XX" = c(0, 0.8))   
    

GRCh37 and GRCh38

Notice that mapache recognizes the two human reference genomes GRCh37 and GRCh38, and assigns automatically the chromosomes to autosome, sex and mt chromosomes. Therefore, the parameters autosomes and sex_chr most not be set manually for these two reference genomes.

For other reference genomes, these parameters have to be set manually. If the specified chromosome names are not present in the reference genome, an error is drawn.

Autosomes

The autosomes should be listed as a single-quoted python expression, e.g.

  • autosomes: '["chr1", "chr2", "chr3"]'.

This can be tedious if you have many more chromosomes. Assuming you are mapping to an organism with 22 autosomes, you can pass any compressed list in python syntax, e.g.

  • autosomes: 'list(range(1,23))' or
  • autosomes: '[f"chr{x}" for x in range(1,23)]' if you need the chromosomes with the chr prefix

Sex thresholds

As mentioned at the beginning, we can use this method to assign sex to other sex systems. We have pre-defined thresholds for each systems, but you might change these thresholds according to your needs.

        system: XY                               ## either of XY, ZW, XO or ZO
        signif: 0.95                             ## confidence interval wide (e.g. 95%)
        thresholds: list( "XX" = c(0.8, 1), 
                          "XY" = c(0, 0.6), 
                          "consistent with XX but not XY" = c(0.6, 1), 
                          "consistent with XY but not XX" = c(0, 0.8))   

These are the default thresholds proposed by Mittnik et al. (2016) for humans. You can edit them by passing a list with R syntax. Note that each name contains the "sex" that will be reported. For example, if you wish to report Female instead of XX).

In this specific example (default values), if the 95% confidence interval falls within

  • 0.8 and 1, it returns XX
  • 0 and 0.6, it returns XY
  • 0.6 and 1, it returns "consistent with XX but not XY"
  • 0 and 0.8, it returns "consistent with XY but not XX"
                # thresholds:  list( "XX" = c(0.8, 1), "XY" = c(0, 0.6), "consistent with XX but not XY" = c(0.6, 1), "consistent with XY but not XX" = c(0, 0.8) )

You can also uncomment this option and change the labels that will be returned (for example, if you wish to report Female instead of XX).

Report

Please note that if you make any modification in the config file concerning the stats section, you might need to recreate some tables or figures before being able to add them to the report.

First, have a look at the outputs that need to be updated with the following command

snakemake --list-params-changes

If the list of outputs looks reasonable (for example, a change in the plotting parameters means that only the plots need to be updated), you might feed that list of outputs to snakemake like this:

snakemake -R `snakemake --list-params-changes` --cores 1

or indicating a profile instead of the number of cores.

Finally, you can update the report with snakemake --report

Mapping statistics

To generate the mapping statistics, you just need to run the pipeline with

snakemake

or

snakemake mapping

Those commands will produce three csv tables with mapping statistics by FASTQ file (FASTQ), library (LB) and sample (SM) that was mapped.

results/04_stats/03_summary/
├── FASTQ_stats.csv
├── LB_stats.csv
├── SM_stats.csv

In addition to the csv tables, some statistics are plotted in svg format.

results/04_stats/04_plots/
├── 1_nb_reads.svg
├── 2_mapped.svg
├── 3_endogenous.svg
├── 4_duplication.svg
├── 5_AvgReadDepth.svg

The csv files and plots produced with these commands can be integrated in an html report by running

snakemake --report

Damage (optional)

You can run mapDamage or an extended version of bamdamage to get lengths and deamination rates of the mapped reads (in plain text files and plots), or to re-scale the base qualities (mapDamage only, see Rescaling damage).

stats:
    run_damage: False                            ## either 'False', 'mapDamage', or 'bamdamage'
    bamdamage_fraction: 0                        ## fraction/number of alignments to consider for bamdamage:
                                                 ##     0: use all alignments (default)
                                                 ##    <1: fraction of alignments to consider 
                                                 ##   >=1: absolute number of alignments to consider
    bamdamage_params: ''                         ## parameters to pass to bamdamage

Bamdamage was extended to be able to subsample the bam file and to define the plot area for the damage plot. Using all alignments is time consuming and often already a small fraction of the alignments is sufficient to get representative plots. The second modification allows narrowing down the number of bases from each end of the fragment to visualize in the damage plot. The parameters for the extended bamdamage are

  --mapquality|-m=i                        skip reads with mapping quality smaller than qual (30)
  --basequality|-b=i                       skip bases with base quality smaller than qual (20)
  --rlength|-r=i                           assumes reads are at most length (100)
  --sample|-s=s                            use name for the legend in plots (default is the file name)
  --outputfile|--output|-o=s               filename of damage plot (foo.dam.pdf)
  --outputfile_length|--output_length|-O=s filename of length plot (foo.length.pdf)
  --nth_read=i                             subsample: consider only every nth alignment (1; all)
  --plot_length=i                          damage plot: number of bases to plot from each end (25)
  --help|-h                                show this help
  --debug|-D                               write debug information
  --verbose|-v                             verbose output
  --version|-V                             show version

Please note, that in mapache the specification of the subsampling (see above) is different to the one using bamdamagedirectly. By default bamdamgeis returning the length distribution for reads up to 100bp (parameter --rlength) and longer reads are cropped. The longer the specified max read length is, the longer the execution of bamdamage takes. Thus, it is wise to set it as small as possible (e.g., initial read length of Illumina sequences), but also as long as needed (e.g., collapsed reads).

Depth of coverage for specific chromosomes (optional)

By default, the LB_stats.csv and SM_stats.csv files include a column with the average genome depth of coverage (DoC). If you wish to report the depth of coverage on a specific chromosome(s), you can do so by passing the name of the chromosome(s) under the genome section with the option depth_chromosomes. Please note that they should be double quoted and separated by commas.

The csv files would then have extra columns with the name of the chromosomes selected, prefixed with "depth_".

genome: 
  GRCh37: 
    fasta: path/hs.build37.1.fa
    depth_chromosomes: "X,Y,MT"                  ## compute DoC on given chromosomes

Please make sure to pass the correct names of the chromosomes as specified in the reference FASTA file, otherwise mapache will not run until you fix this error.