3. Config file (parameters) - sneuensc/mapache GitHub Wiki
Introduction
The pipeline can be customised using a configuration file in the yaml
format (default config/config.yaml
).
The minimal input files to run mapache
are
- a plain text file (
config/samples.tsv
) with paths and metadata of the input FASTQ files (see Sample file) - a reference genome in FASTA format listed in the config file
config/config.yaml
(see Reference genome)
Additionally, if you plan to impute genomes, a reference panel (in VCF format) needs to be specified in the config file as well.
In the file config/config.yaml
, you will see blocks corresponding to each of the steps that could be run. There, you can specify whether the step should be run, include additional arguments, and specify the number of threads and amount of memory to be used.
Here is an example for the subsampling step, which allows you to run mapache only on a few reads (number: 1000
) of your data (to test whether the pipeline will succeed with your current configuration). Note that the additional parameter -s1
(random seed for the tool seqtk) was passed using the keyword params
.
## subsampling (optional)
subsampling:
run: False
mem: 2
params: '-s1'
number: 1000
For some tools, the parameters have been set to be optimized for ancient human DNA. As a general rule, we try to add as few extra parameters as possible, and you can find them all (and change them as needed) on the config file. For example, to use the default parameters for a particular tool, the params
has to be empty (params: ''
).
The following paragraphs list and describe all available parameters in mapache
. We hope that the config file is self-explanatory, but in this page we extend the documentation for each of the sections in the configuration in case of doubts.
Samples
Sample file
Specifies the location of the sample file. The description on how to write a sample file is found on section 2 of this wiki (Sample file):
sample_file: config/samples.tsv ## path to the sample file
delim: "\s+" ## deliminator used to read the sample
## file with pandas read_csv()
Samples may also be directly defined in the config file. Instead of passing a file name to the parameter sample_file
a nested dictionary may be passed, e.g.:
sample_file:
ind1: ## SM
a_L2: ## LB
a_L2_R1_001: ## ID
Data: reads/a_L2_R1_001.fastq.gz ## Data and path to the file
a_L2_R1_002:
Data: reads/a_L2_R1_002.fastq.gz
b_L2:
b_L2_R1_001:
Data: reads/b_L2_R1_001.fastq.gz
b_L2_R1_002:
Data: reads/b_L2_R1_002.fastq.gz
External samples
Analyses on the final bam file can also be run on bam files generated by other means (e.g., downloaded from public repositories). These bam files can be defined either in a file analog to the sample file, e.g.:
External sample file ('config/external_samples.tsv')
SM Bam Genome
external1 external/file1.bam GRCh37
external2 external/file2.bam GRCh37
The column Genome
allows assigning the bam file to the correct reference genome. The external sample file has to be specified in the config file:
external_sample: config/external_samples.tsv
External samples may also be directly defined in the config file using a nested dictionary, e.g.:
external_sample: config/samples_stats.tsv
GRCh37: ## Genome
external1: external/file1.bam ## SM and path to the file
external2: external/file2.bam
Output folder
The name of the output folder may be defined as follows:
result_dir: 'results' ## output folder name
Reference genome
Reference genome
Input FASTQ files may be mapped to one or multiple reference genomes.
The minimal specification of the human reference genome is for the GRCh37
:
genome:
GRCh37: path_to_reference/GRCh37.fasta
and for the human reference genome GRCh38
:
genome:
GRCh38: path_to_reference/GRCh38.fasta
If you plan to map to multiple reference genomes (.e.g, organism1, organism2, organism3), you can do so by listing them all under the keyword genome
:
genome:
ORG1: path_to_reference/organism1.fasta
ORG2: path_to_reference/organism2.fasta
ORG3: path_to_reference/organism3.fasta
Note that the names for each reference genome (in the example above, ORG1
, ORG2
and ORG3
) have to be unique. Your final output files will be named with the format {sample_name}.{genome_name}.bam
Indexing
Reference genome(s) and bam files have to be indexed for some tools. The indexing is done automatically if needed (for fasta
and bam
files). If indexes of the reference gnome (fasta
) are already available beforehand, mapache
may include them as inputs if they are located in the same directory as their matching reference genome (e.g., foo.fasta
) and are named with the following formats:
## reference genome name specified in config file:
foo.fasta
## bwa index file names:
foo.fasta.sa
foo.fasta.amb
foo.fasta.ann
foo.fasta.bwt
foo.fasta.pac
## bowtie2 index file names:
foo.fasta.1.bt2
foo.fasta.2.bt2
foo.fasta.3.bt2
foo.fasta.4.bt2
foo.fasta.rev.1.bt2
foo.fasta.rev.2.bt2
## samtools faidx index file name:
foo.fasta.fai
## picard index file name:
foo.dict
Indexing is specified with the following parameters:
indexing:
bwa_params: '' ## parameters for 'bwa index'
bowtie2_params: '' ## parameters for 'bowtie2-build'
samtools_params: '' ## parameters for 'samtools faidx'
mem: 16 ## memory allocation for queuing system in GB
threads: 1 ## number of CPUs requested for each indexing job
time: 2 ## runtime allocation for queuing system in hours
Mapping workflow
Subsampling (optional)
If you are not yet familiar with mapache, we recommend you to try and subsample a few reads of your data to see mapache's behavior and get to know its output files quickly.
This step is run with the tool seqtk
.
seqtk sample ${params} ${input} ${number}`
Subsampling is specified with the following parameters:
subsampling:
run: False ## should the fastq files be randomly subsampled?
number: 1000 ## number/ratio of reads to keep:
## <1: ratio of reads
## >=1: absolute number of reads
params: '-s1' ## parameters for seqtk sample (if paired reads are
## subsampled, the seed has to be fixed (e.g., -s1)
Cleaning (optional)
Ancient DNA is usually fragmented, with fragment lengths usually shorter than the number of sequenced cycles, resulting in reads including adapter sequences. Removing these remnants of adapters is crucial.
Mapache
uses AdapterRemoval2 (default) or fastp to remove adapters, combine paired-reads, and to perform cleaning of the reads in general. Mapache
uses by default AdapterRemoval2 with settings optimized for ancient DNA, namely:
- remove default adapters
- trim ambiguous bases (N) at 5'/3' termini (
--trimns
) - trim consecutive stretches of low-quality (<=2) bases at the termini (
--trimqualities
) - discard reads shorter than 30bp following trimming (
--minlength 30
)
Paired-end reads may be collapsed with AdapterRemoval2 using the arguments --collapse
, --collapse-deterministic
or --collapse-conservatively
in the parameter cleaning:params_adapterremoval
. Paired-end reads may also be merged with fastp using the arguments -m
(has to stand alone) or --merge
in the parameter cleaning:params_fastp
. Mapache automatically detects these collapsing arguments and will retain just the collapsed pairs (usually the majority of the dataset) and will map them to the reference genome; that is, singletons and trashed reads that did not pass the aforementioned filters will not be mapped to the reference genomes. The parameter cleaning:collapse_opt
allows to adjust what reads should be retained for mapping.
If you collapse paired-end reads, please have a look at DeDup to remove duplicates (see Removing duplicates)
AdapterRemoval is run for single-end reads as follows
AdapterRemoval ${params} --file1 ${input} --basename ${output%%.fastq.gz} --gzip --output1 ${output}
Fastp is run for single-end reads as follows
fastp --in1 ${input} --out1 ${output.R} --json ${output.json} --html ${output.html} ${params}
The FASTQ cleaning is specified with the following parameters:
cleaning:
run: 'adapterremoval' ## 'adapterremoval', 'fastp', or False
params_adapterremoval: '--minlength 30 --trimns --trimqualities' ## parameters for AdapterRemoval
params_fastp: '' ## parameters for fastp
collapse_opt: 'only_collapse' ## what to retain when collapsing pairs:
## for AdapterRemoval:
## 'only_collapse': collapsed
## 'collapse_trunc': collapsed + collapsed_truncated
## 'all': collapsed + collapsed_truncated + R1 + R2 + singleton.truncated
## for fastp:
## 'only_collapse': merged
## 'all': merged + R1 + R2
threads: 4 ## number of threads
mem: 4 ## memory allocation for queuing system in GB
time: 2 ## runtime allocation for queuing system in hours
Mapping
Mapping to a reference genome is a mandatory step. The default mapping is optimized for ancient DNA using bwa_aln
with the argument -l 1024
to disable seeding (Schubert et al., 2012). However, other mappers are available. We show below the commands that are used to map single-end (or collapsed) and paired-end reads with the available mapping softwares (BWA and bowtie2).
- bwa_aln (default) using the command
- for single-end reads:
bwa aln ${params} ${ref} -f ${sai} ${fastq}
bwa samse ${bwa_samse_params} -r \"@RG\\tID:${ID}\\tLB:${LB}\\tSM:${SM}\\tPL:${PL}\" ${ref} ${sai} ${fastq}
-
- for paired-end reads:
bwa aln ${params} ${ref} -f ${sai1} ${fastq1}
bwa aln ${params} ${ref} -f ${sai2} ${fastq2}
bwa sampe ${bwa_sampe_params} -r \"@RG\\tID:${ID}\\tLB:${LB}\\tSM:${SM}\\tPL:${PL}\" ${ref} ${sai1} ${sai2} ${fastq1} ${fastq2}
- bwa_mem using the command
bwa mem ${bwa_mem_params} -R \"@RG\\tID:${ID}\\tLB:${LB}\\tSM:${SM}\\tPL:${PL}\" ${ref} ${fastq(s)}
The values in ID
, LB
, and SM
are automatically extracted from the corresponding columns in the sample file. The PL
is specified by the parameter pl
. This is useful, for example, if in the future you need to split your BAM files by library type with samtools.
- bowtie2 using the command
- for single-end reads:
bowtie2 ${bowtie2_params} -x ${ref} -U ${fastq}
-
- for paired-end reads:
bowtie2 ${bowtie2_params} -x ${ref} -1 ${fastq1} -2 ${fastq2}
Mapping is specified with the following parameters:
mapping:
mapper: bwa_aln ## mapper (available: bwa_aln, bwa_mem, bowtie2)
bwa_aln_params: '-l 1024' ## parameter for 'bwa aln'
bwa_samse_params: '-n 3' ## parameter for 'bwa samse' (single-end libraries)
bwa_sampe_params: '-n 3' ## parameter for 'bwa sampe' (paired-end libraries)
bwa_mem_params: '' ## parameter for 'bwa mem'
bowtie2_params: '' ## parameter for 'bowtie2'
pl: 'ILLUMINA' ## the platform to specify in the RG
threads: 4 ## number of threads
mem: 16 ## memory allocation for queuing system in GB
time: 30 ## runtime allocation for queuing system in hour
Sorting
Mapped BAM files are always sorted using samtools with the command
samtools sort --threads ${threads} ${input} > ${output}
Sorting is specified with the following parameters:
sorting:
threads: 4 ## number of threads
mem: 16 ## memory allocation for queuing system in GB
time: 10 ## runtime allocation for queuing system in hours
Filtering (optional)
It is common practice to keep only high-quality mapped reads in a final BAM file. By default, unmapped and low-quality reads are saved in a separate BAM file (save_low_qual: True
) in case you want to use them for other type of analyses, such as investigating the metagenomic composition of the sample.
In brief, filtering is done with samtools using the command
## if filtering:mapq is specified:
samtools view -b -F 4 -q ${mapq} -U ${out_low_qual} ${input} > ${out_mapped}
## if filtering:params is specified:
samtools view -b ${params} -U ${out_low_qual} ${input} > ${out_mapped}
Just as mentioned above, this command will save reads with a mapping quality equal or higher than mapq
(indicated in the sample file) to {out_mapped}
, and use samtools to store (-U ${out_low_qual}
) unmapped reads (-F 4
) and reads with lower mapping qualities (-q ${mapq}
) to a separate BAM file.
Filtering is specified with the following parameters:
filtering:
run: True ## perform filtering or not (if `False`, the final bam file
## contains all alignments, including not mapping reads)
mapq: 30 ## filtering threshold (parameter -q in samtools)
params: '' ## any parameters to samtools (only 'params' OR 'mapq' may be specified)
save_low_qual: True ## if filtering, shall the not mapped reads and low quality
## alignments be saved in a separate bam file?
threads: 4 ## number of threads
mem: 16 ## memory allocation for queuing system in GB
time: 2 ## runtime allocation for queuing system in hours
Merging 'fastq' to library
Merging of all bam files of a library (same name in LB
and SM
).
Merging is done using samtools with the command
samtools merge -f ${output} ${inputs}
Merging is specified with the following parameters (merging libraries to sample is specified using the same parameters (see below)):
merging:
threads: 4 ## number of threads
mem: 16 ## memory allocation for queuing system in GB
time: 2 ## runtime allocation for queuing system in hours
Removing duplicates (optional)
A characteristic of ancient DNA is its scarcity. Consequently the chance to resequence the same molecule is quite large leading to unwanted duplicated sequences. This module allows detecting and removing such duplicated sequences. By default MarkDuplicates from Picard is used with the command
picard MarkDuplicates --INPUT ${input} --OUTPUT ${out_bam} --METRICS_FILE ${out_stats} \
${params} --ASSUME_SORT_ORDER coordinate --VALIDATION_STRINGENCY LENIENT
Alternatively, DeDup may be used which was especially developed for collapsed paired-end reads in ancient DNA using the command
dedup -i ${input} ${params} -o $(dirname $bam)
Removing duplicates is specified with the following parameters:
remove_duplicates:
run: markduplicates ## should duplicates be inferred/removed?
## (markduplicates, dedup, False)
params_markduplicates: '--REMOVE_DUPLICATES true' ## parameters for markDuplicates
params_dedup: '-m' ## parameters for dedup ('-m/--merged' is
## automatically removed for single-end reads)
threads: 4 ## number of threads
mem: 16 ## memory allocation for queuing system in GB
time: 2 ## runtime allocation for queuing system in hours
Rescaling damage (optional)
Reads originated from ancient samples present a high frequency of apparent C to T and G to A substitutions at the read termini, with specific patterns that vary depending on the protocol used to build the libraries.
An option to deal with these deamination patterns is to reduce the base qualities of the likely affected bases. You can do so by rescaling the qualities with mapDamage2, although this option is not run by default in mapache
.
mapDamage -i ${input} -r ${ref} -d $(dirname ${output})_results ${params} --rescale \
--merge-reference-sequences --rescale-out ${output};
Rescaling damage is specified with the following parameters:
damage_rescale:
run: False ## should the bam file be rescaled?
params: '' ## parameters for mapDamage
threads: 1 ## number of threads (only single threaded)
mem: 8 ## memory allocation for queuing system in GB
time: 2 ## runtime allocation for queuing system in hours
Trimming alignments (optional)
Alignments may be trimmed at the reads end in order for example to reduce the effect of the damage. Trimming is done with BamUtil with the command
bam trimBam ${input} ${output} ${params}
Following parameters are available:
bamutil:
run: False
mem: 4 ## memory allocation for queuing system in GB
time: 2 ## runtime allocation for queuing system in hours
Merging libraries to sample
Merging of libraries of the same sample (same name in SM
).
Merging is done using samtools with the command
samtools merge -f ${output} ${inputs}
Merging is specified with the following parameters (merging 'fastq' to library is specified using the same parameters (see above)):
merging:
threads: 4 ## number of threads
mem: 16 ## memory allocation for queuing system in GB
time: 2 ## runtime allocation for queuing system in hours
Realigning indels (optional)
Taking the information across alignments allows correcting for mis-alignments (indels) at the edges of the reads. Realigning indels is done using IndelRealigner
from GATK v3.8 using the command
GenomeAnalysisTK -T RealignerTargetCreator -I ${input} -R ${ref} -o ${intervals} ;
GenomeAnalysisTK -T IndelRealigner -I ${input} -R ${ref} -targetIntervals ${intervals} -o ${output}
Realigning indels is specified with the following parameters:
realign:
run: True ## should GATK IndelRealigner be run
threads: 1 ## number of threads (only single threaded)
mem: 8 ## memory allocation for queuing system in GB
time: 2 ## runtime allocation for queuing system in hours
Recomputing md flags (optional)
This step allows to recompute the md flag
of each alignment to avoid downstream analyses problems. Incorrect md flags
may occur by modifications to the bam file in previous steps (such as realignment around indels) without adjusting the md flag
and may lead to issues when using this BAM file with certain programs. The md flag
is recomputed with samtools using the command
samtools calmd ${input} ${ref} | samtools view -bS - > ${output}
Recomputing md flag is specified with the following parameters:
run_compute_md:
run: True ## should samtools calmd be run [True]
threads: 4 ## number of threads (only single threaded)
mem: 8 ## memory allocation for queuing system in GB
time: 2 ## runtime allocation for queuing system in hours
Statistics
Mapping statstics
Mapache computes several summary statistics during the mapping process. The following tables in csv format are generated:
results/04_stats/03_summary/ ## path to the files
├── FASTQ_stats.csv ## statistics for each fastq file and gneome
├── LB_stats.csv ## statistics for each library and gneome
├── SM_stats.csv ## statistics for each sample and gneome
In addition to the csv tables, some statistics at the sample level are plotted in svg format.
results/04_stats/04_plots/ ## path to the files
├── 1_nb_reads.svg ## number of raw reads
├── 2_mapped.svg ## number of mapped reads (unique and duplicated)
├── 3_endogenous.svg ## endogenous content (unique reads)
├── 4_duplication.svg ## % of duplicated reads
├── 5_AvgReadDepth.svg ## average read depth (unique reads)
├── 6_Sex.svg ## genomic sex assignment
The frollowing parameters allow fine-tuning the plots. If you are mapping a large number of samples or to a large number of genomes, the default plotting settings might need some adjustments in order to improve the readability of the plots.
stats:
plots:
x_axis: auto ## "auto" (default), "sample" or "genome"
## Which value should be included as the variable
## on the x-axis? E.g.,
## sample vs DoC (and bars colored by sample name)
## genome vs DoC (and bars colored by genome name)
## We recommend ("auto"):
## "sample" if n_samples > n_genomes
## "genome" if n_genomes > n_samples
n_col: 1 ## number of panels per row if multiple samples
## and genomes are used
width: 18 ## width of the plot
height: 12 ## height of th plot
color: "#f2ad78" ## color of the bars
show_numbers: 15 ## show the numbers in the plot if the number of items
## (samples/genomes) is below the given threshold
sex_ribbons: 'c("XX"="#7e3075", "XY"="#003e83")'
## ribbon coloring for plot 6_Sex.svg. Expected is a
## vector of 2 colors (R format). Names must much
## 'sex_inference:threshold' names.
Reports
Mapache can produce several reports.
qualimap
(per sample and genome): Qualimap reports statistics on single bam file. In mapache qualimap is run on all final bam files.multiqc
(per genome): MultiQC combines log files of several tools, such asfastqc
,samtools stats
,AdapterRemoval
,qualimap
. It is a good resource to get detailed information of indvidual steps if something went wrong. It also includes the qualimap reports.snakemake
(per run): This is the most complete report, including the the mutliqc report, the mapping statistic tables and plots (see Mapping statistcs).) and also runtime information. It has to be generated manually after a successful run with the commandsnakemake --report report.html
.
The reports may be enabled as follows. Depending on the data valume the memory and/or runtime have to be adjusted when run on a cluster.
stats:
fastqc_time: 4 ## runtime specification for fastqc
qualimap: False ## should the qualimap report be generated?
qualimap_mem: 8 ## memory allocation for qualimap
multiqc: False ## should the multiqc report be generated?
Analyses on final bam file
Damage (optional)
The damage (fragment length and deamination pattern of the mapped reads) may be inferred for each library using mapDamage or an extended version of bamdamage. To re-scale the base qualities based on the deamination pattern please see Rescaling damage).
The damage inference is parametrized with the following parameters:
damage:
## damage statistics (read length, deamination rates)
run: False ## "False", "bamdamage", "mapDamage"
bamdamage_fraction: 0 ## fraction/number of alignments to consider
## 0: use all alignments (default)
## <1: fraction of alignments to consider
## >=1: absolute number of alignments to consider
bamdamage_params: '' ## parameters to pass to bamdamage (see below)
Bamdamage
was extended to be able to subsample the bam file and to define the plot area for the damage plot. Using all alignments is time consuming and often already a small fraction of the alignments is sufficient to get representative plots. The second modification allows narrowing down the number of bases from each end of the fragment to visualize in the damage plot. The parameters for the extended bamdamage
are
--mapquality|-m=i skip reads with mapping quality smaller than qual (30)
--basequality|-b=i skip bases with base quality smaller than qual (20)
--rlength|-r=i assumes reads are at most length (100)
--sample|-s=s use name for the legend in plots (default is the file name)
--outputfile|--output|-o=s filename of damage plot (foo.dam.pdf)
--outputfile_length|--output_length|-O=s filename of length plot (foo.length.pdf)
--nth_read=i subsample: consider only every nth alignment (1; all)
--plot_length=i damage plot: number of bases to plot from each end (25)
--help|-h show this help
--debug|-D write debug information
--verbose|-v verbose output
--version|-V show version
Please note, that in mapache
the specification of the subsampling (see above) is different to the one using bamdamage
directly. By default bamdamge
is returning the length distribution for reads up to 100bp (parameter --rlength
) and longer reads are cropped. The longer the specified max read length is, the longer the execution of bamdamage
takes. Thus, it is wise to set it as small as possible (e.g., initial read length of Illumina sequences), but also as long as needed (e.g., collapsed reads).
Depth of coverage for specific chromosomes (optional)
By default, the LB_stats.csv
and SM_stats.csv
files report the average genome depth of coverage (DoC). this section allows reporting the depth of coverage on a specific chromosome(s).
The csv files will then have extra columns with the name of the chromosomes selected, prefixed with "depth_
".
depth:
hg19: ## genome name (has to defined under 'genome')
run: False ## compute DoC of specified chromosomes
chromosomes: ["X", "Y", "MT"] ## list of chromosome names
Please make sure to pass the correct names of the chromosomes as specified in the reference FASTA file, otherwise mapache will not run until you fix this error.
Sex inference (optional)
Any human reference genome, but also any other reference genome in FASTA format may be specified with or without sex-specific chromosomes. To account for this flexibility several parameters need to be specified in order to compute underlying summary statistics (e.g. determine the genomic sex). Mapache
recognizes automatically the two human reference genomes GRCh37
and GRCh38
. For all other reference genomes, the assignments to autosome, sex and mt chromosomes have to be made manually if the inference of the genomic sex is desired.
To infer sex, we follow the method presented by Mittnik et al. (2016) for humans.
This method computes the average ratio of the normalised depth of the X chromosome to the normalised depth on the autosomes. Thus, for females, a ratio of 1 is expected, and for males a ratio of 0.5 is expected. Assuming this ratio can be approximated by a normal distribution, a 95% confidence interval is generated by taking ~1.96 standard errors of the mean. The sex is then assigned depending on the values spanned by this confidence interval (parameter thresholds
).
We reimplemented the code to adapt it to organisms with a different chromosomal sex-determination systems, including XX/XY, ZW/ZZ, XX/XO and ZZ/ZO.
As the method relies on the number of reads mapped to the homogametic sex chromosome and to the autosomes, and the naming of chromosomes varies across reference genomes (i.e., 1,2,..,22,X,Y vs chr1,chr2,...,chr22,chrX,chrY), the parameters specific to sex inference have to be declared.
Please make sure to pass the correct names of the autosomes and the sex chromosome as specified in the reference FASTA file, otherwise mapache will not run until you fix this error.
Thus, assuming that we are mapping to two different reference genome, the parameters should be specified as follows.
sex_inference:
GRCh37: ## genome name (has to defined under 'genome')
run: True ## should the sex_inference be run for this genome?
autosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
## a list of autosome chromosome names
sex_chr: X ## the homogametic sex
system: XY ## either of XY, ZW, XO or ZO
signif: 0.95 ## confidence interval wide
thresholds: 'list("XX" = c(0.8, 1),
"XY" = c(0, 0.6),
"consistent with XX but not XY" = c(0.6, 1),
"consistent with XY but not XX" = c(0, 0.8))'
## These are the thresholds used to assign molecular sex, in R syntax.
## For humans, we expect
## females: DoC(X)/DoC(genome) = 1
## males: DoC(X)/DoC(genome) = 0.5
## Note that sex will be assigned by considering the confidence intervals
## around the estimate for such ratio, which can be a bit lower or
## higher than 1 or 0.5.
## This option can be commented as the sex script comes with default
## values for different sex systems.
Autodetection of GRCh37 and GRCh38
Mapache
recognizes the two human reference genomes GRCh37
and GRCh38
automatically based on the chromosome names, and assigns automatically the chromosomes to autosome, sex and mt chromosomes. It assumes the folling names:
GRCh37
: 1, 2, ..., 22 ,X ,Y ,MTGRCh38
: chr1, chr2, ..., chr22, chrX, chrY, chrM
Therefore, for these two genomes the parameters autosomes
and sex_chr
most not be set manually. For other reference genomes, these parameters have to be set manually. If the specified chromosome names are not present in the reference genome, no sex inference is possilbe and an error is drawn.
Autosomes
The autosomes
should be listed as a list, e.g.
autosomes: '["chr1", "chr2", "chr3"]'
.
This can be tedious if you have many more chromosomes. We recommend using a python terminal to obtain the correct lit of chromosome names, by executing for example the code below:
list(range(1,23))
or[f"chr{x}" for x in range(1,23)]
if you need the chromosomes with the chr prefix
Sex thresholds
As mentioned at the beginning, we can use this method to assign sex to other sex systems. We have pre-defined thresholds for each systems, but you might change these thresholds according to your needs. Mapache
uses the default thresholds proposed by Mittnik et al. (2016) for humans. You can edit them by passing a list with R syntax to the parameter thresholds
. Note that each name contains the "sex" that will be reported. For example, if you wish to report Female
instead of XX
).
In this specific example (default values), if the 95% confidence interval falls within
- 0.8 and 1, it returns XX
- 0 and 0.6, it returns XY
- 0.6 and 1, it returns "consistent with XX but not XY"
- 0 and 0.8, it returns "consistent with XY but not XX"
Imputation (optional)
Low-coverage imputation has as input genotype likelihoods for variable sites in a reference panel.
In the first step, we generate genotype likelihoods at the sites of the reference panel. Then, we proceed to impute. Besides the input data, for this step we need:
- path_panel: the reference panel (vcf file; e.g., 1000 Genomes)
- path_map: the genetic map that contains the recombination rates across the human genome in the coordinates of the genome build used to map the data
We can assess how confidently each site was imputed using the genotype probabilities (GP), that are contained in the output vcf/bcf files (FORMAT/GP). For any biallelic site, a GP value for each three possible genotypes (0, 1 and 2) is reported with values ranging between 0 and 1. They sum to 1 across the three genotypes. Mapache outputs a histogram with the resulting maximum GP for all imputed sites as well as the average maximum GP. Mapache allows filtering on the GP value (parameter ‘GP_filter’), which is a trade-off between imputation accuracy and loss of sites. Stricter thresholds for this filtering result in a greater loss of sites containing the alternative allele, which is particularly noticeable at lower depths of coverage (e.g., 0.1x). As default, mapache applies a GP filter of 0.8. In any case, we recommend removing rare variants, at least with minor allele frequency (MAF) below 1%, as these are less accurately imputed.
imputation:
hg19: ## genome name (has to defined under 'genome')
run: False ## should imputation be run
gp_filter: [0.8] ## gp filter
## path_map and path_panel may specify a single file OR may contain the variable '{chr}'
## to specify it per chromosome
path_map: "path_to_map/chr{chr}.b37.gmap.gz"
path_panel: "path_to_panel/ALL.chr{chr}.genotypes.vcf.gz"
chromosomes: [20,21] ## list of chromosomes to impute
glimse_chunk_params: "" ## parameters for GLIMPSE chunk
glimse_phase_params: "" ## parameters for GLIMPSE phase
Variable parameter
Mapache supports library specific settings. In general, defined settings are common across all fastq files and libraries, e.g.:
cleaning:
run: 'adapterremoval'
params_adapterremoval: '--minlength 30 --trimns --trimqualities'
To use fastq/library specific settings, you may allocate each fastq file to a key
in the sample file (using a new column named Group
) and refer to this key
in the config file (using a nested dictionary). A key
may be any text. For example, let's assume that library a_L2
is UDG treated and libraray b_L2
is not. To use UDG specific settings, we first have to define the key
in the sample file (column Group
; the keys 'UDG' and 'nonUDG' are freely chosen):
SM LB ID Data Group
ind1 a_L2 a_L2_R1_001 reads/a_L2_R1_001.fastq.gz UDG
ind1 a_L2 a_L2_R1_002 reads/a_L2_R1_002.fastq.gz UDG
ind1 b_L2 b_L2_R1_001 reads/b_L2_R1_001.fastq.gz nonUDG
ind1 b_L2 b_L2_R1_002 reads/b_L2_R1_002.fastq.gz nonUDG
Now we can specfiy UDG specific settings in the config file using a nested dictionary, for example by using different trimming lengths at the 5'/3' termini (2 bases for 'UDG' and 6 bases for 'nonUDG' libraries):
cleaning:
run: 'adapterremoval'
params_adapterremoval:
UDG: '--trim5p 2 --trim3p 2 --minlength 30 --trimns --trimqualities'
nonUDG: '--trim5p 6 --trim3p 6 --minlength 30 --trimns --trimqualities'
Above is a simple case. You may extend the variable setting as follows:
-
default
: If a parameter contains variable setting (a nested dictionary), but the specifiedkey
in the sample file is not defined in the config file, then the default parameter setting is used. One may override the default parameter setting, using 'default' askey
in the config file. Like this not all fastq files in the sample file have to contain akey
(key
'strange' in the example below). -
reuse of key
: The same key may be used for different parameters. -
multiple keys
: Thekey
allows grouping fastq files. It is possible to use multiplekeys
to define different groups. Multiplekeys
are seperated by a comma.
Below is a more complex scenario:
Sample file:
SM LB ID Data Group
ind1 a_L2 a_L2_R1_001 reads/a_L2_R1_001.fastq.gz index8,UDG
ind1 a_L2 a_L2_R1_002 reads/a_L2_R1_002.fastq.gz index8,UDG
ind1 b_L2 b_L2_R1_001 reads/b_L2_R1_001.fastq.gz index7,nonUDG,strange
ind1 b_L2 b_L2_R1_002 reads/b_L2_R1_002.fastq.gz index8,nonUDG
Config file:
cleaning:
run: ‘adapterremoval’
params_adapterremoval:
index8: '–adapter1 ${adapter1_8b} –adapter2 ${adapter2_8b}'
index7: '–adapter1 ${adapter1_7b} –adapter2 ${adapter2_7b}'
filtering:
run: True
mapq:
strange: '25'
default: '30'
bamutil:
run: True
params:
UDG: '2'
noUDG: '6'
Please note, that the listed adapters above (e.g, '${adapter1_8b}') are used as placeholders for an adapter sequence to increase the readability. YAML does not allow such substitutions, the adapter sequences have to be writen. Keys
defined for parameter settings of tools used at the library or sample level, have to be identical for each fastq file within a library and sample, respectively (othwise an error is drawn).
In the eaxample above, the fastq files will have the following parameter settings
- fastq files
a_L2_R1_001
anda_L2_R1_002
:- cleaning: '–adapter1 ${adapter1_8b} –adapter2 ${adapter2_8b}'
- filtering: '30'
- bamutil: '2'
- fastq file
b_L2_R1_001
:- cleaning: '–adapter1 ${adapter1_7b} –adapter2 ${adapter2_7b}'
- filtering: '25'
- bamutil: '6'
- fastq file
b_L2_R1_002
:- cleaning: '–adapter1 ${adapter1_8b} –adapter2 ${adapter2_8b}'
- filtering: '30'
- bamutil: '6'
Variable parameters are not avaible for all parameters. The following parameters support variable settings (nested dictionary):
## FASTQ LEVEL
subsampling:
run: False
params: '-s1'
number: 1000
cleaning:
run: ‘adapterremoval’
params_adapterremoval: ''
params_fastp: ''
collapse_opt: only_collapse
mapping:
bwa_aln_params: '-l 1024'
bwa_samse_params: '-n 3'
bwa_sampe_params: ''
bwa_mem_params: ''
bowtie2_params: ''
platform: 'ILLUMINA'
filtering:
params: ''
mapq: 30
save_low_qual: True
## LIBRARY LEVEL
remove_duplicates:
params_markduplicates: '--REMOVE_DUPLICATES true'
params_dedup: '-m'
damage_rescale:
run: False
params: ''
bamutil:
run: False
params: '3'
## SAMPLE LEVEL
realign:
run: True
compute_md:
run: True
Job resubmission
On an HPC system, it may happen that a job is killed due to memory or runtime limits. Snakemake does not only clean up the space of killed/unfinished jobs, but also allows relaunching them automatically. At such a relaunch one may increase automatically the memory or runtime allocation. This is very handy if the input file sizes are highly variable (wath is the normal in aDNA). For example, if most libraries are of small size and only a few large, it would be a pitty to allocate huge resources to all libraries and then spend most of the time waiting in the queue. It may be more efficient to allocate less resources (runtime and memory) which allows for most of the libraries to map successfully. The larger libraries will then fail due to lack of resources and be relaunched automatically with more resources. The parameters below allow defining the increase of resources at a relaunch.
memory_increment_ratio: 1 ## memory increment
runtime_increment_ratio: 0 ## runtime increment
The number specifies the ratio in relation ot the initial allocation which is added at each relaunch:
- 0: no change at relaunch
- 1: adding the inital allocation to the previous one at each relaunch
Software
Java tools
Picard and GATK are Java programs. To launch them the .jar
file has to be known. If you have used the conda installation, the following settings should be fine.
software:
picard_jar: 'picard'
gatk3_jar: 'GenomeAnalysisTK'
Environment modules
The underlying software used by mapache can be provided in different ways. The easiest way is to use global or rule specifc conda environment(s). However, it is also possible to use pre-installed software available via 'module', which is a widly used feature on HPC systems. Using the 'module' feature can be invoced by launching mapache with the command line parameter --use-envmodules
. The parameters below allow defining how the different tools are loaded on your computer infrastucture. E.g., if you have to load samtools as module add samtools
then you shoud state it in the config file as
envmodules:
samtools: "samtools"
The following tools may be defined by enfironment modules:
envmodules:
samtools: "gcc samtools/1.12"
bowtie2: "gcc bowtie2/2.4.2"
bwa: "gcc bwa/0.7.17"
picard: "gcc picard/2.24.0"
gatk3: "gcc gatk/3.8-1"
fastqc: "gcc fastqc/0.11.9"
r: "gcc r/4.0.4"
adapterremoval: "gcc adapterremoval/2.3.2"
mapdamage: ""
bedtools: "gcc bedtools2/2.29.2"
seqtk: ""
qualimap: ""
multiqc: ""
glimpse: ""
bamutil: ""