2. Configuring & Running - DKFZ-ODCF/AlignmentAndQCWorkflows GitHub Wiki

All configuration variables are documented in the XML files in the resources/configurationFiles/ directory. There is one XML for each workflow. Note that workflows depend on each other, i.e. the WES and WGBS workflows are extensions of the WGS workflow -- this can be recognized from the imports attribute of the top-level configuration tag in the XMLs. This means that most options of the WGS workflow also affect the other two workflows. Conversely, settings in the WGBS and WES workflows may override those in the WGS workflow. Some processing steps, notably those of the ACEseq quality control (QC), are not valid for the WES workflow. Note that the plugin depends on the COWorkflowsBasePlugin, which has its own configurations affecting this alignment plugin.

See the Roddy documentation for general information on how to configure and run workflows.

BAM and FASTQ Usage

The workflow (and Roddy) was originally designed to retrieve the files -- FASTQs and BAMs -- from the filesystem by predicting their names using the "filename patterns", i.e. rules that define the expected names of files, either for finding those files or for creating them. Usually, the workflow was run on one directory that represented both the input and output directories, and the workflow made sure that all expected files were present after the execution. However, it turned out that this use case was too strict. Sometimes additional FASTQs ("lane files") present on the filesystem should actually be ignored even though they match the filename patterns, because of quality problems in the data. Sometimes BAM files needed to be overwritten. Sometimes externally generated BAMs should be used to add QC information using the workflow. This resulted in a somewhat messy set of configuration values.

Here are the use cases and the appropriate combinations of configuration parameters:

  1. If you start from scratch with FASTQ files and you want the workflow to create "lane"-BAMs, merged/sample BAMs and add QC, use one of the following three mutually exclusive ways to provide the FASTQs. The parameters are given in the order of their priority:
    1. list the FASTQs and metadata in a metadata file (WGBS only).
    2. use the fastq_list parameter.
    3. get the FASTQs (and their metadata) from the filesystem structure using the filename patterns defined by the plugin's XML.
  2. You have a merged/sample-BAM file and want to get just the QC without doing alignments or duplication marking. Provide the BAM via the bam parameter and set useOnlyExistingTargetBam=true. Note that if you have lane-BAMs, the job will recognize their presence (by filename pattern matching) and only do the QC on them.
  3. You have a merged/sample-BAM file and want to add the data from more lane-FASTQs to that BAM. Provide the BAM via the bam parameter and leave useOnlyExistingTargetBam=false (default). The additional FASTQs need to be provided via one of the three methods listed in (1). For instance, if you use fastq_list with bam, only the listed FASTQs will be added to the BAM file provided via bam. Note that read groups already in the BAM file (no matter whether provided via bam or via filename pattern matching) are ignored.
  4. Some of your lane-BAMs are corrupt and you want to recalculate them. Remove these files and let the alignment job recognize automatically which lane-BAMs are present and which are not. For existing lane-BAMs only the QC will be run, while missing BAMs will be created anew. The merged/sample/duplication-marked BAM will be created anew.

Note that only the data processing using the approaches 1a and 1c (from scratch with a metadata table or with filename patterns) is suited for processing more than a single sample (and thus also patient).

A description of the parameter meanings.

| Variable | Default Value | Description |
|---|---|---|
| bam | NULL | If the bam parameter is set to some file, an existing target merged/duplication-marked/sample-BAM file (derived from the filename pattern in the XML) will be rescued by suffixing it with a date. |
| useOnlyExistingTargetBam | false | The "target" BAM refers to the merged BAM. If this is set to "true", only an existing merged BAM is considered and thus basically just the QC is run on this file. Using useOnlyExistingTargetBam=true with bam, fastq_list, or useExistingLaneBams set is considered a configuration error and will prevent job submission. |
| useExistingLaneBams | false | Setting this to true prevents the addition of more FASTQ files (and therefore the actual alignment jobs), but lane BAMs that are found at the expected places (see filename patterns) and were previously not included might still be included in the merged/marked BAM. |
| fastq_list | NULL | If set, this needs to be a ';'-separated list of alternating matching R1 and R2 FASTQ files. The names of matching files should be identical except for the '1' and '2' for the first and second read. These names should also still match the filename patterns, because this is how the metadata (read group, sample name, etc.) are retrieved from the filenames. If fastq_list is set, FASTQ files in the input or output directories matching filename patterns are ignored. |
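
To illustrate the fastq_list format described above, the following minimal Python sketch joins R1/R2 pairs into a ';'-separated, alternating list and checks that the two names of a pair differ only by the '1' and '2'. The helper name build_fastq_list and the example paths are hypothetical and not part of the workflow. Whether the names also match your filename patterns still has to be checked separately.

```python
def build_fastq_list(pairs):
    """Join (R1, R2) path pairs into a ';'-separated fastq_list value.

    Hypothetical helper for illustration only; not part of the workflow.
    """
    items = []
    for r1, r2 in pairs:
        # Matching files should be identical except for the '1' and '2'.
        diffs = [(a, b) for a, b in zip(r1, r2) if a != b]
        if len(r1) != len(r2) or diffs != [("1", "2")]:
            raise ValueError(f"not a matching R1/R2 pair: {r1!r}, {r2!r}")
        items.extend([r1, r2])
    return ";".join(items)


# Made-up example paths.
fastq_list = build_fastq_list([
    ("run1/lane1_R1.fastq.gz", "run1/lane1_R2.fastq.gz"),
    ("run1/lane2_R1.fastq.gz", "run1/lane2_R2.fastq.gz"),
])
print(fastq_list)
```

The resulting string could then be passed as the value of fastq_list.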

Parsing of Metadata from Filenames

The workflow can parse metadata from filenames using the "filename patterns" in the plugin configuration XML. Note that this is optional and depends on the actual configuration, as mentioned in the previous section. For instance, with a metadata input table you can provide FASTQ files with arbitrary file names that are not parsed, because the metadata is taken from the table's columns.

Dataset/Patient Identifier

The dataset or patient identifier is provided by the Roddy core. There it is determined by (1) retrieving the identifier list or wildcard from the command line, and (2) matching these in the input directory.
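
The matching step can be pictured with the following Python sketch. It assumes that the wildcards behave like shell globs, which is an assumption of this example and not a statement about the Roddy implementation. The function name select_datasets and the directory names are hypothetical.

```python
import fnmatch


def select_datasets(available, requested):
    """Return the available dataset names matching any requested name or wildcard.

    Assumption of this sketch: wildcards follow shell glob rules.
    """
    selected = []
    for name in sorted(available):
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in requested):
            selected.append(name)
    return selected


# Made-up subdirectory names of an input directory.
datasets = select_datasets(["patientA", "patientB", "control1"], ["patient*"])
print(datasets)
```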

Sample Identifier

The way the sample identifier is retrieved depends on a number of configuration variables. The following conditions are checked in order and the first true condition determines the algorithm:

  • with the configuration value sample_list being set, the names are taken from there and appropriate files are identified by templating the names into the filename patterns
  • with the configuration value fastq_list the FASTQ filenames are matched by the filename patterns to retrieve the ${sample} path component.
  • with the configuration value extractSamplesFromOutputFiles the sample identifiers are retrieved from the alignmentOutputDirectory (usually "alignment"; located in outputAnalysisBaseDirectory). The names are then parsed from the input BAM files (using COMetadataAccessor#extractSampleNameFromBamBasename). This is suited e.g. if BAM files are supposed to be extended by data from more FASTQs.
  • with the configuration value bam_list being set the sample again is parsed from the BAM file names (using COMetadataAccessor#extractSampleNameFromBamBasename).
  • with all previous conditions failing, the sample names are the subdirectory names in the inputAnalysisBaseDirectory which is the "dataset" directory.

The actual algorithm for parsing BAM file names is determined by the selectSampleExtractionMethod variable which can be set to "version_1" or "version_2".
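
The order of the checks listed above can be summarized in a short Python sketch. The dict config stands in for the Roddy configuration values and the returned labels are hypothetical names chosen for this example. Treating extractSamplesFromOutputFiles as a boolean-like switch is also an assumption.

```python
def sample_identifier_source(config):
    """Return which rule would determine the sample identifiers.

    The first condition that holds wins, mirroring the order of the list.
    """
    if config.get("sample_list"):
        return "sample_list"
    if config.get("fastq_list"):
        return "fastq_list filename patterns"
    if config.get("extractSamplesFromOutputFiles") in (True, "true"):
        return "BAM names in alignmentOutputDirectory"
    if config.get("bam_list"):
        return "bam_list BAM names"
    # Fallback: subdirectory names in the "dataset" directory.
    return "subdirectories of inputAnalysisBaseDirectory"


print(sample_identifier_source({"fastq_list": "a_R1.fastq.gz;a_R2.fastq.gz",
                                "bam_list": "x.bam"}))
```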

Library Identifier

The library identifier is only relevant for the WGBS workflow and is determined in the same way as the sample identifier.

Run Identifier

The run identifier is parsed from the FASTQ path according to the filename pattern component .../${run}/....

Lane and Index Identifiers

FASTQs of read pairs should always separate the read index number (1, 2, etc.) from the base filename with an underscore _. For instance, run1_R1.fastq.gz and run1_R2.fastq.gz are recognized as matching read 1 and read 2 files, because they (1) differ in total in only a single character, (2) have the same prefix run1, (3) which is separated from the index number by an underscore. In the workflow, the part before the _ is called lane identifier, while the part between the _ and the dot separating off the file suffix is called index identifier (e.g. "1", or "R1").
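
The naming rule can be sketched in Python as follows. The sketch assumes that the last underscore separates lane and index identifier and that the index identifier ends at the first following dot. The workflow's actual parsing code may differ, and the function name split_lane_and_index is hypothetical.

```python
def split_lane_and_index(filename):
    """Split a FASTQ basename into (lane identifier, index identifier).

    Assumption of this sketch: the LAST underscore separates lane and index,
    and the index ends at the first dot after that underscore.
    """
    lane, sep, rest = filename.rpartition("_")
    if not sep:
        raise ValueError(f"no underscore in {filename!r}")
    index = rest.split(".", 1)[0]
    return lane, index


print(split_lane_and_index("run1_R1.fastq.gz"))  # ('run1', 'R1')
print(split_lane_and_index("run1_R2.fastq.gz"))  # ('run1', 'R2')
```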

Read Group Identifiers

For input BAMs, read group IDs are read from the ID attribute of the @RG header lines; for output BAMs, they are stored there. Usually, read group IDs for FASTQ files are determined from the filenames using the pattern ${RUN}_${LANE}.
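
As a small illustration, the following Python sketch builds an ID following the ${RUN}_${LANE} pattern and extracts the ID attribute from an @RG header line as defined by the SAM format (tab-separated TAG:VALUE fields). The function names and the example values are hypothetical.

```python
def read_group_id_from_names(run, lane):
    """Build a read group ID following the ${RUN}_${LANE} pattern."""
    return f"{run}_{lane}"


def read_group_id_from_header(line):
    """Extract the ID attribute from one @RG line of a SAM/BAM header."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    for field in fields[1:]:
        if field.startswith("ID:"):
            return field[3:]
    raise ValueError("@RG line without ID attribute")


# Made-up header line, e.g. as printed by `samtools view -H`.
header_line = "@RG\tID:run1_L001\tSM:sample_tumor\tPL:ILLUMINA\n"
print(read_group_id_from_header(header_line))
```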

Running the Workflow

The most important parameters are:

| Parameter | Example | Description |
|---|---|---|
| INDEX_PREFIX | /path/to/assembly/assembly.fa | The path to the FASTA file with the assembled genome(s). Note that the BWA index needs to be in the same directory and use the string 'assembly.fa' as prefix. |
| CHROM_SIZES_FILE | /path/to/assembly/sizes.tsv | A two-column TSV file with chromosome identifiers (1) and number of bases (2). Usually you want to count only the bases from the set {A, T, C, G}, to ignore all lower-quality bases in the genome in the statistics. This file also determines for which chromosomes the coverage is calculated, and which are included in the overall ("all") coverage. This is important for xenografts and in particular the 1.2.73 branch. IMPORTANT: The CHROM_SIZES_FILE is usually not the same as the CHROMOSOME_LENGTH_FILE that is used in the ACEseq QC. |
| INSERT_SIZE_LIMIT | 1000 | Maximal insert size for a proper pair (default 1000), used by flags_isizes_PEaberrations.pl for the QC output. |
| CHR_PREFIX | chrMmu | On the master branch, used with xenograft data to discern 'matching' and 'nonmatching' identifiers, i.e. those that do or do not match the /$CHR_PREFIX$chr$CHR_SUFFIX/ pattern, respectively. Also used for WGBS. |
| CHR_SUFFIX | _hsa | See CHR_PREFIX. |
| CHR_GROUP_NOT_MATCHING | human | On the master branch, see CHR_PREFIX. Default: "nonmatching" |
| CHR_GROUP_MATCHING | mouse | On the master branch, see CHR_PREFIX. Default: "matching" |
| CHROMOSOME_INDICES | "( 1 2 3 )" | Needed for the WGBS workflow to select chromosomes to be processed. This should be a quoted bash array, i.e. with spaces as element separators and including the parentheses. |
| useAdapterTrimming | false | Set to true to turn on adapter trimming. |
| CLIP_INDEX | | Path to the trimmomatic adapter file. Only used in ADAPTOR_TRIMMING_OPTIONS_1. |
| ADAPTOR_TRIMMING_OPTIONS_0 | "PE -threads 4 -phred33" | Trimmomatic execution parameters, such as PE, SE, -phred33, -phred64. Don't use -trimlog as it is devastating for the workflow's performance. |
| ADAPTOR_TRIMMING_OPTIONS_1 | "ILLUMINACLIP:${CLIP_INDEX}:2:30:10:8:true SLIDINGWINDOW:4:15" | |
| markDuplicatesVariant | sambamba | Allowed values: biobambam, picard, sambamba. Default: empty. If set, this option takes precedence over the older useBioBamBamMarkDuplicates option. |
| SAMBAMBA_MARKDUP_OPTS | "-t 1 -l 0 --hash-table-size=2000000 --overflow-list-size=1000000 --io-buffer-size=64" | Please use -l 0; the workflow unpacks the BAM directly with samtools. Compression is faster and more stable with samtools. |
| runFingerprinting | false | Fingerprint the individuals using a set of reference positions. |
| fingerprintingSitesFile | | BED file with the reference positions used for fingerprinting. Used to discover sample swaps. |
| runFastQC | | Run fastqc, unless runAlignmentOnly is "true". |
| runAlignmentOnly | | Skip the fastqc step, even if runFastQC is "true". |
| runFastQCOnly | | Skip the alignment. Note that to run the fastqc step you still have to turn on runFastQC. |

The three options runFastQC, runFastQCOnly, and runAlignmentOnly have somewhat unclear semantics, best summarized with the following pseudo-code:

if (runFastQC && !runAlignmentOnly) 
  runFastQC()

else if (!runFastQCOnly) 
  runAlignment()

Furthermore note that this interaction is different in the very old parts of the workflow that used bwa sampe (the non-"slim" part of the workflow) that is long deprecated and should not be used.

Note that for older Roddy versions you should use quotes around the arguments containing spaces if you also use Bash <4.4. This is because of a bug in Bash, due to which array variables are not exported to called programs correctly. Newer Roddy versions recognize the spaces and may quote automatically if necessary (if the feature toggle AutoQuoteBashArrayVariables is turned on).

A full description of all options in the different workflows can be found in the XML files in resources/configurationFiles. Note that workflow configurations inherit from each other in the order "WGS" <- "WES" <- "WGBS". Thus the WGS configuration (analysisQc.xml) contains variables that are overridden by values in the WES configuration (analysisExome.xml), and so forth.

Optical Duplicates

Whether optical and PCR duplicates are discerned depends on the chosen duplication marking tool.

Xenograft

The WGS and WES workflows can deal with xenograft data simply by aligning against the combined genome. Thus to process xenograft data you need a FASTA file, a genome index, and a matching "chromosome sizes file".

1.2.73 Branch

If you want your QC only for certain chromosomes, e.g. the human chromosomes in a human/mouse xenograft, then the CHROM_SIZES_FILE should contain only these chromosomes.
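
A restricted sizes file could, for example, be derived from a combined one as in the following Python sketch. It only assumes the two-column, tab-separated layout of the CHROM_SIZES_FILE. The function name filter_chrom_sizes is hypothetical and the chromosome names and sizes are made up.

```python
import io


def filter_chrom_sizes(lines, keep):
    """Yield only those 'chromosome<TAB>size' lines whose chromosome is in `keep`."""
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        chrom = line.split("\t", 1)[0]
        if chrom in keep:
            yield line


# Hypothetical combined human/mouse sizes file; the sizes are made-up numbers.
combined = io.StringIO("1\t1000\nX\t500\nchrMmu1\t800\nchrMmuX\t400\n")
human_only = list(filter_chrom_sizes(combined, keep={"1", "X"}))
print("".join(human_only), end="")
```

With a real file you would pass an open file handle instead of the io.StringIO object and write the kept lines to a new sizes file.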

Master-Branch (will probably be renamed to "develop")

Here, you have more flexibility and can calculate QC metrics for both the host and embedded cell types.

The "chromosome sizes file" should contain both the host's and the xenografted species' genomes. Make sure that the chromosomes from both species have different identifiers, e.g. by pre- or suffixing one of the sets of chromosome names, e.g. with chrMmu or whatever is appropriate.

NOTE: Currently, the WGBS workflow variant uses the CHR_PREFIX variable for another purpose and, therefore, cannot collect dedicated statistics for xenograft data.

If you want to have species-specific coverage you additionally need to set some variables. The chromosomes of one of the two species need to be prefixed (CHR_PREFIX) and/or suffixed (CHR_SUFFIX). For instance you may use a FASTA file with the human chromosomes without (explicitly configured) prefixes, e.g. with chromosomes 1, 2, ..., X, Y, MT and mouse chromosomes prefixed by 'chrMmu'. In this situation use the following configuration values:

  • CHR_PREFIX=chrMmu
  • CHR_GROUP_MATCHING=mouse
  • CHR_GROUP_NOT_MATCHING=human

The result will be that in the files $sample_$pid(_targetExtract)?.rmdup.bam.DepthOfCoverage(_Target)?_Grouped.txt two lines are created called "matching" (or "mouse" in the example) and "nonmatching" (or "human" in the example). Additionally, these values are collected into the _qualitycontrol.json files.
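
The grouping can be pictured with the following Python sketch. It assumes that an identifier counts as "matching" if it starts with CHR_PREFIX, ends with CHR_SUFFIX, and has at least one character in between. This reading of the /$CHR_PREFIX$chr$CHR_SUFFIX/ pattern is an assumption of the example, and the function name chromosome_group is hypothetical.

```python
import re


def chromosome_group(chrom, prefix="", suffix="",
                     matching="matching", not_matching="nonmatching"):
    """Classify a chromosome identifier by the prefix/suffix pattern.

    Assumption of this sketch: prefix + at least one character + suffix.
    """
    pattern = re.compile("^" + re.escape(prefix) + ".+" + re.escape(suffix) + "$")
    return matching if pattern.match(chrom) else not_matching


# Values from the example configuration above.
settings = dict(prefix="chrMmu", matching="mouse", not_matching="human")
for chrom in ["1", "X", "MT", "chrMmu1", "chrMmuX"]:
    print(chrom, chromosome_group(chrom, **settings))
```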

Information for specific protocols

The plugin contains three related workflows for WGS, WES and WGBS data. The way to invoke a specific workflow is to set the availableAnalyses section in the project configuration to the configuration name of the desired workflow (i.e. the name of the configuration file in the plugin's resources/configurationFiles directory). Examples will be given below.

Whole Genome Sequencing (WGS)

In summary, the WGS workflow first aligns the reads per "lane" file and sorts the resulting BAMs. In a second job, the "lane" BAMs are combined and duplication marking is done (no reads are removed). The remaining jobs do some additional QC on the results. In particular, the WGS workflow does some GC- and replication-timing bias corrections for the coverage estimates, as described in the documentation of the ACEseq workflow.

WGS job structure

A basic WGS analysis project XML may look like this:

<configuration
        configurationType="project"
        name="configurationName"
        description="The description">

    <subconfigurations>
        <configuration name="config" usedresourcessize="xl">
            <availableAnalyses>
                <analysis id="WGS" configuration="qcAnalysis"/>
            </availableAnalyses>
        </configuration>
    </subconfigurations>
</configuration>

| Variable | Default | Description |
|---|---|---|
| runACEseqQc | true | Run ACEseq QC steps. |
| CHROMOSOME_LENGTH_FILE | | A two-column TSV with chromosome name and chromosome length (full length). Usually, only the "real" chromosomes 1-22, X, and Y are included. IMPORTANT: The CHROMOSOME_LENGTH_FILE is usually not the same as the CHROM_SIZES_FILE that is used for the per-base QC. |
| GC_CONTENT_FILE_ALN | | Needed for ACEseq QC |
| MAPPABILITY_FILE_ALN | | Needed for ACEseq QC |
| REPLICATION_TIME_FILE_ALN | | Needed for ACEseq QC |
| min_X_ratio_ALN | 0.8 | Used for annotateCnvFiles |
| min_Y_ratio_ALN | 0.12 | Used for annotateCnvFiles |
| cnv_min_coverage_ALN | 50 | Should be 50 for controls and 0 for tumors. Take care that you correctly set the possible{Tumor,Control}SampleNamePrefixes variables! |
| mapping_quality_ALN | 1000 | TBD |
| min_windows_ALN | 5 | TBD |
| LOWESS_F_ALN | 0.1 | Used by correctGC |
| SCALE_FACTOR_ALN | 0.9 | Used by correctGC |
| COVERAGEPLOT_YLIMS_ALN | 4 | Used by correctGC |
| GC_bias_json_key_ALN | gc-bias | Used by correctGC |
| BASE_QUALITY_CUTOFF | 0 | Quality cutoff used in the coverageQc.d program. Defined in COWorkflowsBasePlugin. |

Whole Exome Sequencing (WES)

For WES data the workflow structure is generally the same as for WGS data, except that the ACEseq QC is not done. The QC statistics for exome data account for the target regions only, otherwise the estimates would be widely off any relevant value.

WES job structure

You can create a target for a WES analysis by referencing the "exomeAnalysis" in the <availableAnalyses> block.

...
                <analysis id="WES" configuration="exomeAnalysis"/>
...

| Variable | Default | Description |
|---|---|---|
| BASE_QUALITY_CUTOFF | 25 | Defined in analysisExome.xml. Note that the default for the WES workflow deviates from the default 0 for the WGS and WGBS workflows, which is defined in the COWorkflowsBasePlugin. This default has historical reasons and you may want to change it. |
| WINDOW_SIZE | 10 | TBD |
| runACEseqQc | false | Leave this set to false for WES data. |
| TARGET_REGIONS_FILE | | Exome/target-regions BED file with accompanying tabix index. |
| TARGETSIZE | | The number of bases covered by the target regions. |
| SAMTOOLS_VIEW_OPTIONS | " -bu -q 1 -F 1024" | -q 1: uniquely mapped reads; -bu: output uncompressed BAM for the pipe to coverageBed. |
| INTERSECTBED_OPTIONS | | TBD |
| COVERAGEBED_OPTIONS | " -d " | TBD |
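
If you need to derive a value for TARGETSIZE from a target-regions BED file, a computation along the following lines may help. This Python sketch counts every covered base once, i.e. it merges overlapping intervals. Whether the workflow expects overlaps to be merged is an assumption of this example, and the function name target_size and the intervals are made up.

```python
def target_size(bed_lines):
    """Count bases covered by BED intervals, counting overlaps only once.

    BED coordinates are 0-based and half-open, so one interval covers
    end - start bases.
    """
    intervals = {}
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip empty lines and header lines
        chrom, start, end = line.split("\t")[:3]
        intervals.setdefault(chrom, []).append((int(start), int(end)))

    total = 0
    for spans in intervals.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:      # overlapping or adjacent: extend block
                cur_end = max(cur_end, end)
            else:                     # gap: close the current block
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


# Made-up intervals: 100-250 on chromosome 1 (merged) plus 0-50 on chromosome 2.
bed = ["1\t100\t200\n", "1\t150\t250\n", "2\t0\t50\n"]
print(target_size(bed))  # 200
```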

Whole Genome Bisulfite Sequencing (WGBS)

The WGBS variant does bisulfite calling on the fly with a patched version of methylCtools that is included in this repository. The workflow can handle not only WGBS but also tagmentation-WGBS data as described by Wang et al., 2013. Tagmentation data is based on independently amplified libraries, which makes it necessary to do independent duplication marking for each individual library before merging everything into a final merged-BAM. Note that this additional merging step is only invoked if multiple libraries are actually provided.

WGBS job structure

The WGBS workflow is invoked if the "bisulfiteCoreAnalysis" configuration is referenced in the <availableAnalyses> section of the config file:

...
                <analysis id="WGBS" configuration="bisulfiteCoreAnalysis"/>
...

| Variable | Default | Description |
|---|---|---|
| markDuplicatesVariant | sambamba | Allowed values: biobambam, picard, sambamba. Default: sambamba. Better stick to sambamba. |
| useBioBamBamSort | false | Use biobambam instead of samtools for sorting. |
| IS_TAGMENTATION | false | true: tagmentation; false: standard WGBS. |
| reorderUndirectionalWGBSReadPairs | false | Try to infer the correct orientation of undirectional WGBS reads from the TC and AG dinucleotide frequencies in both reads. |
| METH_CALLS_CONVERTER | none | none: Keep methylCtools output format. moabs: Convert to moabs format. |
| METH_CALL_PARAMETERS | "-t -e 5 -x -z" | Parameters for the methylationCallingScript, of which there are two, for normal and tagmentation WGBS. |
| CYTOSINE_POSITIONS_INDEX | | Absolute path to the tabixed BED file containing all cytosine positions. This file is based on the INDEX_PREFIX file. |
| CHROMOSOME_INDICES | '( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT )' | Bash array (enclosed in parentheses) of chromosome identifiers as found in the BAM header. Mind 'chr' prefixes! See also CHROM_SIZES_FILE. |
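
When the chromosome list is generated by a script, the Bash array notation expected for CHROMOSOME_INDICES can be produced as in this Python sketch. The function name bash_array is hypothetical. Quoting the result for the command line or the XML is left to you.

```python
def bash_array(values):
    """Format values as a Bash array literal, e.g. ( 1 2 3 )."""
    values = [str(v) for v in values]
    for v in values:
        # Spaces separate the elements, so identifiers must not contain any.
        if not v or any(c.isspace() for c in v):
            raise ValueError(f"identifier must be non-empty without whitespace: {v!r}")
    return "( " + " ".join(values) + " )"


chromosomes = list(range(1, 23)) + ["X", "Y", "MT"]
print(bash_array(chromosomes))
```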

Swift Biosciences ACCEL-NGS 1S PLUS & METHYL-SEQ

The protocol produces a second read (R2) fragment end of on average 8 bp containing non-genomic sequences with low complexity. As these sequences are not of genomic origin they should be trimmed off. Because of read-through with fragments shorter than the read length, the advice by Swift Biosciences is to trim off these non-genomic 10 bp from both fragment ends (compare here). As in the WGS workflow, the trimming is done by trimmomatic before the actual alignment and can be customized by adapting the option ADAPTOR_TRIMMING_OPTIONS_1. The variable ADAPTOR_TRIMMING_OPTIONS_1_SwiftAccelNgs has the correct trimming parameters for the Swift ACCEL-NGS protocol predefined. You can add the following configuration to your <configurationvalues> section:

<cvalue name="IS_TAGMENTATION" value="false" type="boolean"
        description="true: Ignore 9 bp at read start for methylation statistics"/>
<cvalue name='ADAPTOR_TRIMMING_OPTIONS_1_SwiftAccelNgs' value='"ILLUMINACLIP:${CLIP_INDEX}:2:30:10:8:true SLIDINGWINDOW:4:15 MINLEN:36 HEADCROP:10"' type="string"
        description="Swift Biosciences Accel-NGS Methyl-Seq DNA Library Kit"/>
<cvalue name="ADAPTOR_TRIMMING_OPTIONS_1" value="${ADAPTOR_TRIMMING_OPTIONS_1_SwiftAccelNgs}" type="string"/>

Note that here IS_TAGMENTATION is set to false, so the 9 bp at the read start are not additionally ignored during bisulphite calling.

Finally, note that we currently use trimmomatic, which -- by itself -- cannot just trim 10 bp from the distal fragment end independent of quality scores. This means that for short fragments only the adapter is trimmed, but not, as suggested by the protocol producer, the additional 10 bp.

WGBS-Tagmentation

WGBS-tagmentation (Wang et al., 2013) produces about 9 bp of genomic sequences on both fragment ends that show a conversion bias. Because these sequences are genomic, they contain information for the alignment and should not get trimmed off completely. However, because they are biased they need to be ignored during the bisulphite calling. This is the function of the patch of methylCtools, to ignore the biased 9 bp (see here). Therefore, for tagmentation you need to set

<cvalue name="IS_TAGMENTATION" value="true"
        description="true: Ignore 9 bp at read start for methylation statistics"/>

in your configuration.

Note that tagmentation data is based on independently amplified libraries, which makes it necessary to do independent duplication marking for each individual library before merging everything into a final merged-BAM.

Post-Bisulfite Adapter Tagging (PBAT)

The Post-Bisulfite Adapter Tagging (PBAT; Miura et al., 2012) protocol produces undirectional read pairs.

If reorderUndirectionalWGBSReadPairs is set, a read-reordering script is run that decides, based on the relative frequencies of TC and AG dinucleotides in both reads, which orientation of the reads is most likely correct, and may then swap the two reads. Reads that cannot be unambiguously classified are currently dropped. Note that after the swapping, the read numbers of swapped reads are reversed: what was R1 in the input FASTQ will be R2 in the output BAM, and vice versa. The original script for swapping, including a documentation of the underlying ideas, can be found here.

Furthermore, for PBAT data the "tagmentation" variant of the bisulphite calling should be used, in which the first 9 bp of the reads are ignored. Apparently, there is a conversion bias in the first read bases, probably because of random priming. For more information you can read this article. Note that the bias may extend further than 9 bp, but our current script versions can only ignore 9 bases. Feel free to make a pull request, or wait until we fix this ourselves.

<cvalue name="IS_TAGMENTATION" value="true"
        description="true: Ignore 9 bp at read start for methylation statistics"/>
<cvalue name="reorderUndirectionalWGBSReadPairs" value="true"
        description="true: swap R1/R2 based on nucleotide statistics to approximate directional protocol"/>

GRCh38 ALT processing

The 1.2.73-2 branch of the workflow supports the alignment and post-processing of reads against the GRCh38 assembly. It uses Heng Li's bwa-postalt.js script, which can be found in the bwa.kit package. There are a number of options related to bwa-postalt.js.

  • Set runBwaPostAltJs=true to activate the ALT chromosome processing. Default: false.
  • ALT_FILE: Defaults to $INDEX_PREFIX.aln
  • K8_BINARY: Path to the k8 binary. Defaults to a k8 executable located beside bwa (like in bwakit).
  • Set bwaPostAltJsPath to point to the bwa-postalt.js script. Defaults to a bwa-postalt.js located beside bwa (like in bwakit).
  • Set bwaPostAltJsHla to "true" if you want FASTQs with HLA-mapping reads (-p option). HLA FASTQs are placed beside the lane-BAMs.
  • Set bwaPostAltJsMinPaRatio to set the -r option of bwa-postalt.js.

The master switch really is runBwaPostAltJs. The other parameters have reasonable defaults and are mostly meant for customization and debugging.

Note that for the tbi-lsf-cluster.sh environment script the situation is a bit different. This script is used as default environment setup for our local cluster at the DKFZ and BIH. It assumes that software is provided via cluster modules and differs in some of the variables and their defaults:

  • K8_BINARY: The default binary is k8-Linux located in a specific k8 module.
  • K8_VERSION: The version tag of the k8 module. Default: "0.2.5".
  • bwaPostAltJsPath: The script is assumed to be available in one of the loaded modules and executable (which is used to get its location). At our department the module is "bwakit".
  • BWAKIT_VERSION: The version tag for loading the "bwakit" module. By default the same as that of bwa.

Note that the post-processing is done for each lane-BAM individually. No attempt is made to integrate the HLA FASTQs that are created if bwaPostAltJsHla is active.

Resource Requirements

The workflow is rather tuned to minimize IO. For instance, the tools are glued together using pipes. However, the duplication marking and the BAM sorting steps produce temporary files. These two and the BWA step are also the memory-hungry steps, while BWA is the step that requires most CPU time.

Have a look at the resource definitions in the XMLs in the resources/configurationFiles/ directory, which are rather conservative and will cover almost all 30x-80x human data sets we received from our X10 sequencers. The XMLs contain multiple parameters that allow you to tweak the actually used memory and cores.

  • bwa (BWA_MEM_THREADS)
  • mbuffer (MBUFFER_SIZE_LARGE, MBUFFER_SIZE_SMALL)
  • sambamba (SAMBAMBA_MARKDUP_OPTS)
  • samtools (SAMPESORT_MEMSIZE)
  • picard (PICARD_MARKDUP_JVM_OPTS)

Other relevant options are

  • The resource requirements depend on the workflow variant that is used (e.g. whether biobambam's bamsort or samtools sort is used to sort the BAM file). Use a fast duplication marker. We found sambamba-0.5.9 to be optimal. Use the markDuplicatesVariant variable.
  • If you have a large and fast local filesystem, then set useRoddyScratchAsBigFileScratch to true and set scratchBaseDirectory in the applicationProperties.ini to a path on that filesystem. This will speed up all temporary file IO.
  • Mbuffer is used to buffer short timescale throughput fluctuation and for copying data to multiple output (named) pipes.