Pipeline implementation
This page outlines the development process for this project. This is not an exhaustive run-down of every detail, but should provide enough insight into how some key features of the pipeline have been developed and what their roles are.
I've essentially implemented the same approach as the nf-core pipelines by using a schema file. The schema contains metadata about the implicit pipeline itself, as well as all the information about the sub-workflows and their arguments. The idea is that the schema file is updated in one place, with changes propagating throughout the workflow without manual intervention.
The structure of the schema file is (currently) as follows.
```
{
    version: "",
    title: "",
    description: "",
    definitions: {
        mandatory: {
            title: "",
            description: "",
            arguments: {
                arg-1: {
                    type: "<string/integer/boolean/memoryUnit/duration>",
                    format: "<directory-path/file-path>",
                    pattern: <regex>,
                    nfiles: <int>,
                    valid: [ ..., ..., ..., ... ],
                    description: "<text>",
                    optional: <boolean>
                }
            }
        },
        sub-workflow-1: {
            title: "",
            description: "",
            arguments: {
                arg-1: {
                    type: "<string/integer/boolean/memoryUnit/duration>",
                    format: "<directory-path/file-path>",
                    pattern: <regex>,
                    nfiles: <int>,
                    valid: [ ..., ..., ..., ... ],
                    description: "<text>",
                    optional: <boolean>
                },
                arg-2: {
                    type: "<string/integer/boolean/memoryUnit/duration>",
                    format: "<directory-path/file-path>",
                    pattern: <regex>,
                    nfiles: <int>,
                    valid: [ ..., ..., ..., ... ],
                    description: "<text>",
                    optional: <boolean>
                }
            }
        }
    }
}
```
The top-level keys provide the following information:
- version: Version of the implicit workflow
- title: Title of the implicit workflow
- description: Description of the implicit workflow
- definitions: Contains nested keys for mandatory arguments and the sub-workflows
The second-level keys, which reside within `definitions` (name taken from nf-core), correspond to the mandatory arguments needed by the implicit workflow to run, along with all sub-workflows and the information relating to them.
You'll notice that each second-level key has the same `title` and `description` fields. They also have a new key:
- arguments: The argument names for the mandatory section or the sub-workflows.
Once we're at the level of a sub-workflow key (or the `mandatory` key), we're at the point where we're defining custom arguments. The arguments are usually described with a combination of the following fields, typically requiring at least `type`, `description` and `optional` (a filled-in example follows the list below).
- type: The expected data type of the variable (e.g. 'string', 'integer', 'boolean', etc.)
- format: A custom field describing whether the input is a file-path or a directory-path
- description: A simple description of the argument (printed in the help page)
- valid: Typically a list of accepted inputs; a user-provided argument will be checked against this list
- pattern: A regular expression used to match files in user-provided paths
- nfiles: The number of files to match and return using Nextflow built-in functions
- optional: Boolean indicating whether the argument is optional or not
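To make that concrete, here is a hypothetical argument definition as it might look once the schema has been parsed into Groovy data structures (for example with `groovy.json.JsonSlurper`). The `genome` argument name and every value below are invented purely for illustration:

```groovy
// Hypothetical 'genome' argument after the schema has been parsed into Groovy
// data structures; every value below is invented for illustration only.
def genomeArg = [
    type       : 'string',
    format     : 'directory-path',
    pattern    : '*.fasta',
    nfiles     : 1,
    valid      : null,                       // no fixed list of accepted values
    description: 'Directory containing the genome assembly',
    optional   : false
]
assert genomeArg.type == 'string' && !genomeArg.optional
```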
The idea is that sub-workflows and their meta-information and arguments are all specified and described in the schema. Consequently, this requires a little bit of forward planning as to what kinds of input your sub-workflows might have. However, once this is done, there is no need to define arguments in the `nextflow.config` file or to write helper functions specific to each sub-workflow. Instead, we can write generalised parsing functions that traverse the schema, compare the user-provided arguments to the schema definitions, and either progress or error if they are incorrect.
Similarly, some quality-of-life functions, like writing help pages and printing pipeline summaries, can all be built using the information in the schema, rather than requiring the duplication of code and text. Below I've detailed some of the main methods I've written that work with the schema and the user-provided arguments.
Possibly the most important part of the pipeline is the code in the `/lib` directory, especially the `NfSchema.groovy` file. This file contains all the code relating to parsing the schema file and checking/validating the user-provided arguments. I have implemented three main methods in the `NfSchema` class:
- checkMandatory: A method to check that the user has provided the mandatory arguments needed by the implicit workflow. If they haven't, the pipeline will error and exit early.
- checkPipelineArgs: This method compares the user provided arguments to the argument definitions in the schema. It will error early and intelligently if any argument has been provided incorrectly.
- checkCluster: Check that the user has provided a valid run-profile and partition.
These three methods are wrapped in a single entry point:
- validateParameters: Wrapper for all three of the above methods
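In outline, the wrapper simply chains the three checks and fails fast if any of them reports a problem. The sketch below shows the shape of that arrangement; the signatures and stub bodies are illustrative rather than the actual code in `lib/NfSchema.groovy`:

```groovy
// Sketch of how the wrapper might hang together; signatures and stubs are
// illustrative, not the actual code in lib/NfSchema.groovy.
class NfSchema {

    public static void validateParameters(Map params, Map schema) {
        checkMandatory(params, schema)     // are all required top-level arguments present?
        checkPipelineArgs(params, schema)  // do supplied arguments match their schema definitions?
        checkCluster(params)               // is the run-profile/partition combination valid?
    }

    // Stubs only - the real methods contain the logic described above.
    private static void checkMandatory(Map params, Map schema)    { /* ... */ }
    private static void checkPipelineArgs(Map params, Map schema) { /* ... */ }
    private static void checkCluster(Map params)                  { /* ... */ }
}
```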
The `WorkflowMain.groovy` script contains the code for generating the help page, along with printing the pipeline summary information while the pipeline is running.
- help: The help method simply parses the information from the schema and prints it to the terminal. The `title` and `description` fields at all levels of the schema are used to build the sections, with the arguments being pulled from the `arguments` level. Rather than having to write help code for each sub-workflow, the information is all present in one location, reducing the amount of code written and the duplication of text. A sketch of this idea is shown below.
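A minimal sketch of how a schema-driven help method could be assembled; the function name, layout and formatting are assumptions for illustration, not the exact code in `WorkflowMain.groovy`:

```groovy
// Sketch: build the help text purely from the schema. Section headings come
// from each definition's 'title' and 'description', argument lines from the
// nested 'arguments' map. Formatting here is illustrative.
def helpMessage(Map schema) {
    def lines = ["${schema.title} (v${schema.version})", schema.description, '']
    schema.definitions.each { key, section ->
        lines << section.title
        lines << section.description
        section.arguments.each { name, spec ->
            def tag = spec.optional ? '(optional)' : '(required)'
            lines << "  --${name.padRight(20)} ${spec.description} ${tag}"
        }
        lines << ''
    }
    return lines.join('\n')
}
```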
Another important feature of any Nextflow pipeline is its `.config` files. These contain a range of configuration variables that assist the execution of the pipeline.
This workflow has two main configuration files:
- nextflow.config - The main configuration file, present in the project directory.
- conf/base.config - The accessory configuration file that contains process-specific variables (a sketch of the kind of content it holds follows below).
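To illustrate what process-specific variables might look like in `conf/base.config`, here is a hedged sketch; the label name and resource values are invented, not the pipeline's actual settings:

```groovy
// conf/base.config (sketch): default resources plus per-label overrides.
// Label names and values are illustrative only.
process {
    cpus   = 1
    memory = 4.GB
    time   = 2.h

    withLabel: high_memory {
        cpus   = 8
        memory = 64.GB
        time   = 24.h
    }
}
```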
The `profiles` scope is configured in the `nextflow.config` file. A profile is a set of configuration attributes that are applied when that profile is selected. This is a way of having multiple configurations in place, enabling the pipeline to operate across many different infrastructures.
As this pipeline was built around Adelaide University's Phoenix HPC, I've implemented three 'profiles'.
- SLURM: Phoenix uses SLURM for job management. As such, I have configured a profile to ensure the pipeline works nicely with the HPC job-scheduler.
- Conda: Conda is a simple package manager that can install almost all bioinformatics software. I use it to manage software installation for the pipeline so the user doesn't need to do this themselves.
- Standard: This profile is the base form of the pipeline. It runs jobs on the current machine, without requiring any job-scheduler.
Other profiles can easily be added for compatible job-management tools, or even cloud environments. Simply add a profile to the `nextflow.config` file with the required attributes and specify it at the command line with `-profile <new-profile>`.
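A hedged sketch of how such a `profiles` block might look in `nextflow.config`; the queue name and other attribute values are illustrative rather than the pipeline's actual settings:

```groovy
// nextflow.config (sketch): profiles roughly matching those described above.
// Queue names and other values are illustrative only.
profiles {

    standard {
        process.executor = 'local'     // run jobs on the current machine
    }

    slurm {
        process.executor = 'slurm'     // submit jobs through the SLURM scheduler
        process.queue    = 'batch'     // hypothetical partition name
    }

    conda {
        conda.enabled = true           // let Nextflow create/manage conda environments
    }
}

// Profiles are selected (and can be combined) at run time, e.g.
//   nextflow run <pipeline> -profile slurm,conda
```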