Pipeline implementation

Introduction

This page outlines the development process for this project. This is not an exhaustive run-down of every detail, but should provide enough insight into how some key features of the pipeline have been developed and what their roles are.

Schema File

I've essentially implemented the same approach as the nf-core pipelines by using a schema file. The schema contains metadata about the implicit pipeline itself, along with all the information about the sub-workflows and their arguments. The idea is that the schema file is updated in one place, with changes propagating throughout the workflow without manual intervention.

Schema structure

The structure of the schema file is (currently) as follows.

{
    "version": "",
    "title": "",
    "description": "",
    "definitions": {
        "mandatory": {
            "title": "",
            "description": "",
            "arguments": {
                "arg-1": {
                    "type": "<string/integer/boolean/memoryUnit/duration>",
                    "format": "<directory-path/file-path>",
                    "pattern": "<regex>",
                    "nfiles": "<int>",
                    "valid": [ "...", "..." ],
                    "description": "<text>",
                    "optional": "<boolean>"
                }
            }
        },
        "sub-workflow-1": {
            "title": "",
            "description": "",
            "arguments": {
                "arg-1": {
                    "type": "<string/integer/boolean/memoryUnit/duration>",
                    "format": "<directory-path/file-path>",
                    "pattern": "<regex>",
                    "nfiles": "<int>",
                    "valid": [ "...", "..." ],
                    "description": "<text>",
                    "optional": "<boolean>"
                },
                "arg-2": {
                    "type": "<string/integer/boolean/memoryUnit/duration>",
                    "format": "<directory-path/file-path>",
                    "pattern": "<regex>",
                    "nfiles": "<int>",
                    "valid": [ "...", "..." ],
                    "description": "<text>",
                    "optional": "<boolean>"
                }
            }
        }
    }
}

Top level keys

The top level keys provide the following information:

  • version: Version of the implicit workflow
  • title: Title of the implicit workflow
  • description: Description of the implicit workflow
  • definitions: Contains nested keys for mandatory arguments and the sub-workflows

Second level keys

The second-level keys, which reside within definitions (a name taken from nf-core), correspond to the mandatory arguments needed by the implicit workflow to run, along with all the sub-workflows and their associated information.

You'll notice that each second-level key has the same title and description fields as the top level. Each also introduces a new key:

  • arguments: The argument definitions for the mandatory set or for a given sub-workflow.

Argument keys

Once we're at the level of a sub-workflow key (or the mandatory key), we're defining custom arguments. Each argument is described by some combination of the following fields, with type, description and optional typically required (a filled-in example follows the list).

  • type: The expected data type of the argument (e.g. 'string', 'integer', 'boolean')
  • format: A custom field describing whether the input is a file-path or a directory-path
  • description: A short description of the argument (printed in the help page)
  • valid: A list of accepted values; a user-provided argument is checked against this list
  • pattern: A regular expression used to match files in user-provided paths
  • nfiles: The number of files to match and return using Nextflow's built-in file functions
  • optional: Boolean indicating whether the argument is optional
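
As a concrete illustration, a hypothetical assembly argument (the name, pattern and values here are mine for demonstration, not taken from the pipeline) might be defined as:

"assembly": {
    "type": "string",
    "format": "file-path",
    "pattern": "*.fasta",
    "nfiles": 1,
    "description": "Path to the genome assembly file",
    "optional": false
}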

How the schema is used

The idea is that sub-workflows and their meta-information/arguments are all specified and described in the schema. This requires a little forward planning as to what kinds of input your sub-workflows might have, but the payoff is that there is no need to define arguments in the nextflow.config file or to write helper functions specific to each sub-workflow. Instead, we can write generalised parsing functions that traverse the schema, compare the user-provided arguments to the schema definitions, and either proceed or error if they are incorrect.
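
To make the idea concrete, here is a minimal sketch of such a generic loop as it might appear in a Nextflow script. The schema file name is an assumption, and the real checks in lib/NfSchema.groovy are more thorough; this only shows that one traversal covers every sub-workflow without per-workflow helper code.

import groovy.json.JsonSlurper

// Load the schema once; every check below is driven by its contents.
def schema = new JsonSlurper().parse(new File("${projectDir}/schema.json"))

schema.definitions.each { section, content ->
    content.arguments.each { name, defn ->
        def value = params[name]
        // Check a provided value against the schema's list of accepted values.
        if (value != null && defn.valid && !(value in defn.valid)) {
            error "--${name} must be one of ${defn.valid}, received '${value}'"
        }
    }
}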

Similarly, quality-of-life functions, like writing help pages and printing pipeline summaries, can all be built from the information in the schema rather than requiring duplicated code and text. Below I've detailed some of the main methods I've written that work with the schema and user-provided arguments.

Helper functions: Parsing the schema

Possibly the most important part of the pipeline is the code in the /lib directory, especially the NfSchema.groovy file. This file contains all the code relating to parsing the schema file to check and validate the user-provided arguments. I have implemented three main methods in the NfSchema class:

  1. checkMandatory: A method to check that the user has provided the mandatory arguments needed by the implicit workflow. If they haven't, the pipeline will error and exit early.
  2. checkPipelineArgs: This method compares the user provided arguments to the argument definitions in the schema. It will error early and intelligently if any argument has been provided incorrectly.
  3. checkCluster: Check that the user has provided a valid run-profile and partition.

These three methods are wrapped in a single entry point, validateParameters, which runs each of them in turn. A simplified sketch of the class layout is shown below.
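
The method signatures in this sketch are my assumption rather than the pipeline's exact API; it is meant only to show the shape of the class and the fail-fast wrapper.

class NfSchema {

    static void checkMandatory(Map schema, Map params) {
        schema.definitions.mandatory.arguments.each { name, defn ->
            if (!defn.optional && params[name] == null) {
                throw new IllegalArgumentException("Missing mandatory argument: --${name}")
            }
        }
    }

    static void checkPipelineArgs(Map schema, Map params) {
        // Compare each user-provided value against its schema definition
        // (type, format, pattern, nfiles, valid)...
    }

    static void checkCluster(Map params) {
        // Confirm a valid run-profile and partition were supplied...
    }

    // Single entry point called from the implicit workflow; errors early.
    static void validateParameters(Map schema, Map params) {
        checkMandatory(schema, params)
        checkPipelineArgs(schema, params)
        checkCluster(params)
    }
}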

Helper functions: Help and summaries

The WorkflowMain.groovy script contains the code relating to generating the help page, along with printing the pipeline summary information when it is running.

  • help: The help method simply parses the information in the schema and prints it to the terminal. The title and description fields at each level of the schema are used to build the sections, with the individual arguments pulled from the arguments level. Rather than writing help code for each sub-workflow, the information all lives in one location, reducing the code written and the duplication of text.
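
A schema-driven help builder can be as simple as the following sketch (the method in WorkflowMain.groovy will differ in formatting and detail; the layout here is illustrative):

def help(Map schema) {
    def lines = [schema.title, schema.description, '']
    schema.definitions.each { section, content ->
        lines << content.title
        content.arguments.each { name, defn ->
            def req = defn.optional ? '(optional)' : '(required)'
            lines << "  --${name.padRight(20)} ${defn.description} ${req}"
        }
        lines << ''
    }
    return lines.join('\n')
}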

Config files

Another important feature of any Nextflow pipeline is its .config files. These files hold a range of configuration variables that assist the execution of the pipeline.

This workflow has two main configuration files:

  1. nextflow.config - This is the main configuration file, present in the project directory.
  2. conf/base.config - This is the accessory configuration file that contains process-specific variables.
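
The two files are connected in the usual Nextflow way: nextflow.config pulls in the accessory file with an includeConfig statement. A rough sketch (the resource values and process name are illustrative, not the pipeline's actual settings):

// nextflow.config (project directory)
includeConfig 'conf/base.config'   // pull in the process-specific settings

// conf/base.config
process {
    cpus   = 2        // illustrative defaults only
    memory = 4.GB
    time   = 2.h

    // Per-process overrides use Nextflow's withName selector;
    // 'EXAMPLE_PROCESS' is a hypothetical process name.
    withName: 'EXAMPLE_PROCESS' {
        cpus   = 8
        memory = 16.GB
    }
}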

Profiles

The profile scope is configured in the nextflow.config file. A profile is a set of configuration attributes that are applied when that profile is selected. This is a way of keeping multiple configurations in place, enabling a pipeline to operate across many different infrastructures.

As this pipeline was built around Adelaide University's Phoenix HPC, I've implemented three profiles.

  1. SLURM: Phoenix uses SLURM for job management, so I have configured a profile to ensure the pipeline works nicely with the HPC job-scheduler.
  2. Conda: Conda is a simple package manager that can install almost all bioinformatic software. I use it to manage software installation for the pipeline so that users don't have to do this themselves.
  3. Standard: This profile is the base form of the pipeline. It runs jobs on the current machine and does not require any job-scheduler.

Other profiles can easily be added for compatible job-management tools, or even cloud environments: simply add a profile to the nextflow.config file with the required attributes and select it at the command line with -profile <new-profile>. A sketch of the profiles block is shown below.
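
For reference, the three profiles map onto a profiles block of roughly this shape. The executor and conda settings are standard Nextflow configuration; the partition parameter and environment file name are my assumptions, not the pipeline's exact values.

profiles {
    standard {
        process.executor = 'local'            // run on the current machine
    }
    slurm {
        process.executor = 'slurm'
        process.queue    = params.partition   // assumed: partition supplied via --partition
    }
    conda {
        conda.enabled = true
        process.conda = "${projectDir}/environment.yml"   // hypothetical environment file
    }
}

Profiles can also be combined at the command line, e.g. -profile slurm,conda.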
