Configuration - RCChan5/BioLockJ GitHub Wiki
Configuration files contain all system properties, program inputs, cutoff values, external dependencies, and format specifications used during pipeline execution.
BioLockJ takes a single configuration file as a runtime parameter. Although all properties can be configured in one file, we recommend chaining default files through the pipeline.defaultProps option. This can often improve the portability, maintainability, and readability of the project-specific configuration files.
Our recommended approach is as follows:
1. Use standard.properties to assign universal default values
2. Use environment.properties to assign environment-specific defaults
3. Create a new configuration file for each pipeline to assign project-specific properties:
- Set the BioModule execution order
- Set pipeline.defaultProps = environment.properties
- Override environment.properties and standard.properties as needed
- Example project configuration files can be found in templates.
A copy of each configuration file is stored in the pipeline root directory to serve as primary project documentation.
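The chaining described above can be sketched in a few lines of Python. This is a minimal illustration, not BioLockJ's implementation: it assumes that environment.properties itself points to standard.properties through pipeline.defaultProps, the file contents are hypothetical, and the files are held in memory rather than read from disk.

```python
def parse_props(text):
    """Parse 'name = value' lines; skip blank lines and lines starting with '#'."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        props[name.strip()] = value.strip()
    return props


def resolve(name, files):
    """Follow pipeline.defaultProps starting at `name`; nearer files override defaults."""
    chain, seen = [], set()
    while name and name not in seen:  # `seen` guards against circular chains
        seen.add(name)
        props = parse_props(files[name])
        chain.append(props)
        name = props.get("pipeline.defaultProps")
    merged = {}
    for props in reversed(chain):  # apply the most general defaults first
        merged.update(props)
    return merged


# Hypothetical file contents, keyed by file name.
FILES = {
    "standard.properties": "script.numThreads = 1\npipeline.logLevel = INFO\n",
    "environment.properties": (
        "pipeline.defaultProps = standard.properties\n"
        "pipeline.env = cluster\n"
        "script.numThreads = 8\n"
    ),
    "project.properties": (
        "pipeline.defaultProps = environment.properties\n"
        "pipeline.logLevel = DEBUG\n"
    ),
}

config = resolve("project.properties", FILES)
print(config["script.numThreads"], config["pipeline.logLevel"], config["pipeline.env"])
# prints: 8 DEBUG cluster
```

The project file only states what differs from its defaults, which is what keeps project-specific configuration files short and portable.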
BioModule execution order
To include a BioModule in your pipeline, add a #BioModule line to the top of your configuration file, as shown in the examples found in templates. Each line has the #BioModule keyword followed by the path to the jar file for that module. For example:
#BioModule biolockj.module.seq.PearMergeReads
#BioModule biolockj.module.classifier.wgs.Kraken2Classifier
#BioModule biolockj.module.report.r.R_PlotMds
BioModules are executed in the order in which they are listed here. A typical pipeline contains one classifier module. Any number of sequence pre-processing modules may come before the classifier module, and any number of report modules may come after it. In addition to the BioModules specified in the configuration file, BioLockJ may add implicit modules that are required by the specified modules. See Example Pipeline.
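To make the ordering rule concrete, the following Python sketch reads the #BioModule lines from a configuration text and lists them in file order. It is an illustration only: the function name module_order and the input.dirPaths value are hypothetical, and the sketch does not model the implicit modules BioLockJ may add.

```python
def module_order(config_text):
    """Return the identifiers that follow each #BioModule keyword, in file order."""
    modules = []
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "#BioModule":
            modules.append(parts[1])
    return modules


CONFIG = """\
#BioModule biolockj.module.seq.PearMergeReads
#BioModule biolockj.module.classifier.wgs.Kraken2Classifier
#BioModule biolockj.module.report.r.R_PlotMds
input.dirPaths = /path/to/reads
"""

for position, module in enumerate(module_order(CONFIG), start=1):
    print(position, module)
```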
Summary of Properties
Properties are defined as name-value pairs. List-values are comma separated. Leading and trailing whitespace is removed so "propName=x,y" is equivalent to "propName = x, y".
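The equivalence of the two spellings can be checked with a short Python sketch. The helper parse_line is hypothetical and only mirrors the trimming rule stated above; it is not the parser BioLockJ uses.

```python
def parse_line(line):
    """Split 'name = value' and return the name with its comma-separated list values."""
    name, value = line.split("=", 1)
    items = [item.strip() for item in value.split(",")]
    return name.strip(), items


# Both spellings yield the same name and the same list of values.
assert parse_line("propName=x,y") == parse_line("propName = x, y")
print(parse_line("propName = x, y"))  # ('propName', ['x', 'y'])
```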
aws
Property | Description
aws.profile | String
aws.ram | AWS memory applied through Nextflow. Example value: "8 GB"
aws.stack | String
aws.s3 | String
cluster
Property | Description
cluster.batchCommand | The command to submit jobs on the cluster
cluster.host | Cluster host address
cluster.jobHeader | Job script header to define # of nodes, # of cores, RAM, walltime, etc.
cluster.modules | List of modules to load before execution. Adds “module load” command to bash scripts
cluster.prologue | Command(s) to run at the start of every script after loading cluster modules (if any)
cluster.runJavaAsScript | Options: Y/N. If Y, each JavaModule will instantiate a clone of the application in direct mode on a job node via a single worker script to avoid overworking the head node where BioLockJ is deployed
cluster.validateParams | Options: Y/N. If Y, validate cluster.jobHeader "ppn:" or "procs:" value matches script.numThreads
demultiplexer
Property | Description
demultiplexer.barcodeCutoff | desc
demultiplexer.barcodeRevComp | Options: Y/N. Use the reverse complement of metadata.barcodeColumn if demultiplexer.strategy = barcode_in_header or barcode_in_seq.
demultiplexer.strategy | Options: barcode_in_header, barcode_in_seq, id_in_header, do_not_demux. Set the Demultiplexer strategy. If using barcodes, they must be provided in the metadata.filePath file, in the column named by metadata.barcodeColumn.
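For readers unfamiliar with the term, the reverse complement used by the barcodeRevComp option can be computed as in the Python sketch below. The function name and the example barcode are illustrative and not taken from BioLockJ.

```python
# Map each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(barcode):
    """Complement every base, then reverse the sequence."""
    return barcode.translate(COMPLEMENT)[::-1]


print(reverse_complement("AACGT"))  # ACGTT
```

With the option set to Y, the sequence searched for in the header or read would be the reverse complement of the barcode listed in the metadata, rather than the barcode as written.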
docker
Property | Description
docker.imgVersion | By default, docker will always use 'latest', but advanced users may specify a different tag.
docker.user | Docker Hub user name with the BioLockJ containers. By default the "biolockj" user is used to pull the standard modules, but advanced users can deploy their own versions of these modules and add new modules in their own Docker Hub account.
docker.saveContainerOnExit | Options: Y/N. If Y, remove the default --rm flag from the docker run command
exe
Property | Description
exe.awk | Define executable awk command, if default "awk" is not included in your $PATH
exe.docker | Define executable docker command, if default "docker" is not included in your $PATH
exe.gzip | Define executable gzip command, if default "gzip" is not included in your $PATH
exe.humann2 | Define executable humann2 command, if default "humann2" is not included in your $PATH
exe.humann2Params | Optional humann2 parameters
exe.humann2JoinTableParams | Optional parameters
exe.humann2RenormTableParams | Optional parameters
exe.java | Define executable java command, if default "java" is not included in your $PATH
exe.javaParams | Optional parameters
exe.kneaddata | Define executable kneaddata command, if default "kneaddata" is not included in your $PATH
exe.kneaddataParams | Optional kneaddata parameters
exe.kraken | Define executable kraken command, if default "kraken" is not included in your $PATH
exe.krakenParams | Optional kraken parameters
exe.kraken2 | Define executable kraken2 command, if default "kraken2" is not included in your $PATH
exe.kraken2Params | Optional kraken2 parameters
exe.metaphlan2 | Define executable metaphlan2 command, if default "metaphlan2" is not included in your $PATH
exe.metaphlan2Params | Optional metaphlan2 parameters
exe.pear | Define executable pear command, if default "pear" is not included in your $PATH
exe.pearParams | Optional pear parameters
exe.python | Define executable python command, if default "python" is not included in your $PATH
exe.Rscript | Define executable Rscript command, if default "Rscript" is not included in your $PATH
exe.vsearch | Define executable vsearch command, if default "vsearch" is not included in your $PATH
exe.vsearchParams | Optional vsearch parameters
genMod
Property | Description
genMod.launcher | Define executable language command if it is not included in your $PATH
genMod.param | Any parameters needed by the user's script
genMod.scriptPath | Path where the user script is stored
humann2
Property | Description
humann2.disableGeneFamilies | Options: Y/N. If Y, disable HUMAnN2 Gene Family report
humann2.disablePathAbundance | Options: Y/N. If Y, disable HUMAnN2 Pathway Abundance report
humann2.disablePathCoverage | Options: Y/N. If Y, disable HUMAnN2 Pathway Coverage report
humann2.keepUnintegrated | Options: Y/N. If Y, keep UNINTEGRATED column in count tables (otherwise this column is dropped)
humann2.keepUnmapped | Options: Y/N. If Y, keep UNMAPPED column in count tables (otherwise this column is dropped)
humann2.nuclDB | Directory property may contain multiple nucleotide database files
humann2.protDB | Directory property may contain multiple protein database files
input
Property | Description
input.dirPaths | List of directories containing pipeline input files
input.ignoreFiles | List of files to ignore if found in input.dirPaths
input.requireCompletePairs | Options: Y/N. Stop pipeline if any unpaired FW or RV read sequence file is found
input.suffixFw | File name suffix to indicate a forward read
input.suffixRv | File name suffix to indicate a reverse read
input.trimPrefix | For files named by Sample ID, provide the prefix preceding the ID to trim when extracting Sample ID. For multiplexed sequences, provide any characters in the sequence header preceding the ID. For fastq, this value could be “@” if the sample ID was added to the header immediately after the "@" symbol.
input.trimSuffix | For files named by Sample ID, provide the suffix after the ID; often this is just the file extension. Do not include read direction indicators listed in input.suffixFw/input.suffixRv. For multiplexed sequences, provide the 1st character in the sequence header found after every embedded Sample ID. If undefined, “_” is used as the default end-of-sample-ID delimiter.
kraken
Property | Description
kraken.db | Path to database for KRAKEN
kraken2
Property | Description
kraken2.db | Path to database for KRAKEN2
mail
Property | Description
mail.encryptedPassword | Encrypted password for the mail.from account. If BioLockJ is passed a 2nd parameter (in addition to the config file), the 2nd parameter should be the clear-text password. The password will be encrypted and stored in the prop file for future use. WARNING: Base64 encryption is only a trivial roadblock for malicious users. This functionality is intended merely to keep clear-text passwords out of the configuration files and should only be used with a disposable mail.from account.
mail.from | Notification emails are sent from this account, provided mail.encryptedPassword is valid
mail.smtp.auth | Options: Y/N. Set the SMTP authorization property
mail.smtp.host | Email SMTP host
mail.smtp.port | Email SMTP port
mail.smtp.starttls.enable | Options: Y/N. Set the SMTP start TLS property
mail.to | Comma-separated email recipients list
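The warning about Base64 is easy to demonstrate: Base64 is an encoding with no secret key, so anyone who can read the stored value can recover the original. The Python sketch below uses a made-up password and makes no claim about the exact scheme BioLockJ applies.

```python
import base64

clear_text = "not-a-real-password"  # hypothetical value, for illustration only

# Encoding and decoding need no key, so the stored form protects nothing.
encoded = base64.b64encode(clear_text.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded == clear_text)  # True
```

This is why the stored password should belong to a disposable account only.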
metadata
Property | Description
metadata.barcodeColumn | Metadata column name containing the barcode used for demultiplexing
metadata.columnDelim | Define column delimiter for metadata.filePath file, default = tab
metadata.commentChar | Define how comments are indicated in metadata.filePath file, default = ""
metadata.fileNameColumn | Column in metadata file giving file names used to identify each sample. Standard default: "InputFileName". Values should be simple names, not file paths, and unique to each sample. Using this column in the metadata overrides the use of input.trimPrefix and input.trimSuffix. For paired reads, give the forward read file and use input.suffixFw and input.suffixRv to link to the reverse file.
metadata.filePath | Metadata file path, must have unique column headers
metadata.nullValue | Define how null values are represented in metadata
metadata.required | Options: Y/N. Require every sequence file to have a corresponding row in the metadata file
metadata.useEveryRow | Options: Y/N. Require every metadata row to have a corresponding sequence file
metaphlan2
Property | Description
metaphlan2.db | Directory property containing alternate database. Must always be paired with metaphlan2.mpa_pkl
metaphlan2.mpa_pkl | File property containing path to the mpa_pkl file used to reference an alternate DB. Must always be paired with metaphlan2.db
multiplexer
Property | Description
multiplexer.gzip | Options: Y/N. If Y, gzip the multiplexed output
pipeline
Property | Description
pipeline.copyInput | Options: Y/N. If Y, copy input.dirPaths into a new directory under the project root directory
pipeline.defaultDemultiplexer | Assign module to demultiplex datasets. Default = Demultiplexer
pipeline.defaultFastaConverter | Assign module to convert fastq sequence files into fasta format when required. Default = AwkFastaConverter
pipeline.defaultSeqMerger | Assign module to merge paired reads when required. Default = PearMergeReads
pipeline.defaultStatsModule | Java class name for the default module used to generate p-values and other stats
pipeline.defaultProps | Path to a default BioLockJ configuration file containing default property values that are overridden if defined in the primary configuration file
pipeline.deleteTempFiles | Options: Y/N. If Y, delete module temp dirs after execution
pipeline.disableAddImplicitModules | Options: Y/N. If Y, implicit modules will not be added to the pipeline
pipeline.disableAddPreReqModules | Options: Y/N. If Y, prerequisite modules will not be added to the pipeline.
pipeline.downloadDir | The pipeline summary includes an scp command for the user to download the pipeline analysis if executed on a cluster server. This property defines the target directory on the user's workstation to which the analysis will be downloaded.
pipeline.env | Options: aws, cluster, local. Describes runtime environment
pipeline.limitDebugClasses | Used to limit the classes that log debug statements
pipeline.logLevel | Options: DEBUG, INFO, WARN, ERROR. Determines Java log level sensitivity
pipeline.permissions | Set chmod -R command security bits on pipeline root directory (Ex. 770)
pipeline.userProfile | Bash users typically use ~/.bash_profile (the standard default).
qiime
Property | Description
qiime.alphaMetrics | Options listed online: scikit-bio.org
qiime.params | Optional parameters passed to qiime scripts
qiime.pynastAlignDB | File property to define ~/.qiime_config pynast_template_alignment_fp. If supplied, qiime.refSeqDB and qiime.taxaDB must also be supplied and all three must share some parent directory.
qiime.refSeqDB | File property to define ~/.qiime_config pick_otus_reference_seqs_fp and assign_taxonomy_reference_seqs_fp. If supplied, qiime.pynastAlignDB and qiime.taxaDB must also be supplied and all three must share some parent directory.
qiime.removeChimeras | Options: Y/N. If Y, remove chimeras after open or de novo OTU picking using exe.vsearch
qiime.taxaDB | File property to define ~/.qiime_config assign_taxonomy_id_to_taxonomy_fp. If supplied, qiime.pynastAlignDB and qiime.refSeqDB must also be supplied and all three must share some parent directory.
r
Property | Description
r.colorBase | This is the base color used for labels & headings in the PDF report
r.colorHighlight | This color is used to highlight significant OTU plot titles
r.colorPalette | Palette argument passed to get_palette {ggpubr} to select colors for some output visualizations
r.colorPoint | Sets the color of scatterplot and strip-chart plot points
r.debug | Options: Y/N. If Y, will generate R Script log files
r.excludeFields | List metadata columns to exclude from R script reports
r.nominalFields | Explicitly override default field type assignment to model as a nominal field in R
r.numericFields | Explicitly override default field type assignment to model as a numeric field in R
r.pch | Sets R plot pch parameter for PDF report
r.pvalCutoff | Sets p-value cutoff used to assign label r.colorHighlight
r.pValFormat | Sets the format used in R sprintf() function
r.rareOtuThreshold | If >1, R will filter OTUs below value provided. If <1, R will interpret the value as a percentage and discard OTUs not found in at least that percentage of samples
r.reportFields | Override field used to explicitly list metadata columns to report in the R scripts. If left undefined, all columns are reported
r.saveRData | Options: Y/N. If Y, all R script generating BioModules will save R Session data to the module output directory to a file using the extension ".RData"
r.timeout | Sets # minutes before R Script will time out and fail
r_CalculateStats
Property | Description
r_CalculateStats.pAdjustScope | Options: GLOBAL, LOCAL, TAXA, ATTRIBUTE. Used to set the p.adjust "n" parameter for how many simultaneous p-value calculations
r_CalculateStats.pAdjustMethod | Sets the p.adjust "method" parameter
r_PlotEffectSize
Property | Description
r_PlotEffectSize.parametricPval | Options: Y/N. If Y, the parametric p-value is used when determining which taxa to include in the plot and which should get a (*). If N (default), the non-parametric p-value is used.
r_PlotEffectSize.disablePvalAdj | Options: Y/N. If Y, the non-adjusted p-value is used when determining which taxa to include in the plot and which should get a (*). If N (default), the adjusted p-value is used.
r_PlotEffectSize.excludePvalAbove | Options: [0,1]. Taxa with a p-value above this value are excluded from the plot.
r_PlotEffectSize.taxa | Override other criteria for selecting which taxa to include in the plot by specifying which taxa should be included
r_PlotEffectSize.maxNumTaxa | Each plot is given one page. This is the maximum number of bars to include in each one-page plot.
r_PlotEffectSize.disableCohensD | Options: Y/N. If N (default), produce plots for binary attributes showing effect size calculated as Cohen's d. If Y, skip this plot type.
r_PlotEffectSize.disableRSquared | Options: Y/N. If N (default), produce plots showing effect size calculated as the r-squared value. If Y, skip this plot type.
r_PlotEffectSize.disableFoldChange | Options: Y/N. If N (default), produce plots for binary attributes showing the fold change. If Y, skip this plot type.
r_PlotMds
Property | Description
r_PlotMds.numAxis | Sets the number of MDS axes to plot
r_PlotMds.distance | Distance metric for calculating MDS (default: bray)
r_PlotMds.reportFields | Override field used to explicitly list metadata columns to build MDS plots. If left undefined, all columns are reported
rarefyOtuCounts
Property | Description
rarefyOtuCounts.iterations | Positive integer. The number of iterations to randomly select the rarefyOtuCounts.quantile of OTUs
rarefyOtuCounts.lowAbundantCutoff | Minimum percentage of samples that must contain an OTU.
rarefyOtuCounts.quantile | Quantile for rarefaction. Samples are ordered by their number of OTUs; every sample with more OTUs than the quantile sample is subsampled without replacement until it has the same number of OTUs as the quantile sample
rarefyOtuCounts.rmLowSamples | Options: Y/N. If Y, all samples below the rarefyOtuCounts.quantile quantile sample are removed
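The quantile-based rarefaction described in this table can be sketched as follows. This Python example is only an approximation under stated assumptions: the way the quantile sample is picked from the ordered totals is a guess, a single iteration is performed (rarefyOtuCounts.iterations is ignored), and the function, sample, and OTU names are hypothetical.

```python
import random


def rarefy_otu_counts(table, quantile, rm_low_samples=False, seed=0):
    """Subsample every sample above the quantile sample's total down to that total."""
    rng = random.Random(seed)
    totals = sorted(sum(counts.values()) for counts in table.values())
    # Assumed rule for choosing the "quantile sample" from the ordered totals.
    target = totals[min(int(quantile * len(totals)), len(totals) - 1)]
    result = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        if total < target and rm_low_samples:
            continue  # drop samples below the quantile sample
        if total > target:
            # Expand counts into one entry per observation, then draw without replacement.
            pool = [otu for otu, n in counts.items() for _ in range(n)]
            picked = rng.sample(pool, target)
            counts = {otu: picked.count(otu) for otu in counts}
        result[sample] = dict(counts)
    return result


table = {
    "S1": {"otuA": 6, "otuB": 4},
    "S2": {"otuA": 15, "otuB": 5},
    "S3": {"otuA": 10, "otuB": 20},
}
rarefied = rarefy_otu_counts(table, quantile=0.5)
print({sample: sum(counts.values()) for sample, counts in rarefied.items()})
# prints: {'S1': 10, 'S2': 20, 'S3': 20}
```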
rarefySeqs
Property | Description
rarefySeqs.max | Randomly select maximum number of sequences per sample
rarefySeqs.min | Discard samples without minimum number of sequences
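The combined effect of the two rarefySeqs limits can be illustrated with a small Python sketch. The function name, the fixed seed, and the sample data are hypothetical; the actual module operates on sequence files rather than in-memory lists.

```python
import random


def rarefy_seqs(samples, min_seqs, max_seqs, seed=0):
    """Drop samples with fewer than min_seqs; cap the rest at max_seqs random sequences."""
    rng = random.Random(seed)
    kept = {}
    for sample_id, seqs in samples.items():
        if len(seqs) < min_seqs:
            continue  # too few sequences: discard the sample
        if len(seqs) > max_seqs:
            seqs = rng.sample(seqs, max_seqs)  # random subset, without replacement
        kept[sample_id] = list(seqs)
    return kept


samples = {
    "A": ["seq%d" % i for i in range(10)],
    "B": ["seq0", "seq1"],
    "C": ["seq%d" % i for i in range(5)],
}
result = rarefy_seqs(samples, min_seqs=3, max_seqs=6)
print({sample_id: len(seqs) for sample_id, seqs in result.items()})
# prints: {'A': 6, 'C': 5}
```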
rdp
Property | Description
rdp.db | File property used to define an alternate RDP database file
rdp.jar | File property for RDP java executable JAR
rdp.minThresholdScore | Required RDP minimum threshold score for valid OTUs
report
Property | Description
report.logBase | Options: 10/e. If e, use natural log (base e), otherwise use log base 10
report.minCount | Integer, minimum table count allowed. If a count less than this value is found, it is set to 0.
report.numHits | Options: Y/N. If Y, add Num_Hits to metadata
report.numReads | Options: Y/N. If Y, add Num_Reads to metadata
report.scarceCountCutoff | Minimum percentage of samples that must contain a count value for it to be kept.
report.scarceSampleCutoff | Minimum percentage of data columns that must be non-zero to keep the sample.
report.taxonomyLevels | Options: domain, phylum, class, order, family, genus, species. Generate reports for listed taxonomy levels
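The minCount and scarceCountCutoff filters can be pictured with the Python sketch below. It rests on assumptions that this page does not confirm: the cutoff is written as a fraction (0.5 for 50%), minCount is applied before the scarcity filter, and the table layout, function names, and counts are invented for illustration.

```python
def apply_min_count(table, min_count):
    """Set any count below min_count to 0."""
    return {
        sample: {taxon: (n if n >= min_count else 0) for taxon, n in counts.items()}
        for sample, counts in table.items()
    }


def drop_scarce_taxa(table, cutoff):
    """Keep taxa with a non-zero count in at least `cutoff` (a fraction) of samples."""
    taxa = {taxon for counts in table.values() for taxon in counts}
    keep = set()
    for taxon in taxa:
        present = sum(1 for counts in table.values() if counts.get(taxon, 0) > 0)
        if present / len(table) >= cutoff:
            keep.add(taxon)
    return {
        sample: {taxon: n for taxon, n in counts.items() if taxon in keep}
        for sample, counts in table.items()
    }


table = {
    "S1": {"Bacteroides": 40, "Prevotella": 1},
    "S2": {"Bacteroides": 12, "Prevotella": 0},
    "S3": {"Bacteroides": 7, "Prevotella": 9},
    "S4": {"Bacteroides": 3, "Prevotella": 0},
}
filtered = drop_scarce_taxa(apply_min_count(table, min_count=2), cutoff=0.5)
print(filtered)
```

In this example Prevotella survives in only one of four samples once counts below 2 are zeroed, so it falls under the 50% cutoff and is dropped, while Bacteroides is kept.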
script
Property | Description
script.batchSize | Number of sequence files to process per worker script
script.defaultHeader | Used to set shebang line to define scripts as bash executables, such as "#!/bin/bash"
script.numThreads | Integer value passed to any module that takes a number of threads parameter
script.permissions | Set chmod command security bits on generated scripts (Ex. 770)
script.timeout | Integer, time (minutes) before a worker script times out.
seqFileValidator
Property | Description
seqFileValidator.requireEqualNumPairs | Options: Y/N. default Y.
seqFileValidator.seqMaxLen | maximum number of bases per read
seqFileValidator.seqMinLen | minimum number of bases per read
trimPrimers
Property | Description
trimPrimers.filePath | Path to file containing one primer sequence per line.
trimPrimers.requirePrimer | Options: Y/N. If Y, TrimPrimers will discard reads that do not include a primer sequence.
validation
Property | Description
validation.compareOn | Which columns in the expectation file should be used for the comparison. Options: name, size, md5. Default: use all columns in the expectation file.
validation.disableValidation | Turn off validation. No validation file output is produced. Options: Y/N. default: N
validation.expectationFile | File path to the table of expectations. If a directory is given, BioLockJ will look for a file named after the module being evaluated.
validation.reportOn | Which attributes of the file should be included in the validation report file. Options: name, size, md5
validation.sizeWithinPercent | What percentage difference is permitted between an output file and its expectation. Options: any positive number
validation.stopPipeline | If enabled, the validation utility will stop the pipeline if any module fails validation. Options: Y/N
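One plausible reading of validation.sizeWithinPercent is a relative size check against the expected file size, sketched below in Python. The formula and the function name are assumptions; this page does not state how BioLockJ computes the difference.

```python
def size_within_percent(actual_size, expected_size, percent):
    """Return True if actual_size differs from expected_size by at most `percent` percent."""
    if expected_size == 0:
        return actual_size == 0  # avoid dividing by zero for empty expected files
    difference = abs(actual_size - expected_size) / expected_size * 100
    return difference <= percent


print(size_within_percent(1040, 1000, 5))  # True: 4% difference
print(size_within_percent(1100, 1000, 5))  # False: 10% difference
```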