Nextflow - koppsteinlab/knowledge-repo GitHub Wiki

Nextflow

What is Nextflow?

Nextflow is a scalable and reproducible workflow management system written in Groovy and Java that provides a custom domain-specific language (DSL). It is centered around three main concepts: processes, channels, and workflows. A Nextflow process receives an input, performs a computation, and produces an output. Processes exchange data through channels, which connect the output of one process to the input of another. When multiple processes are linked together by channels, they form a Nextflow workflow.

If you want to learn more about Nextflow, you can check out the Nextflow Training.
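The three concepts can be illustrated with a minimal, hypothetical Nextflow script (the process and names here are made up for demonstration). The snippet below writes it to a file called main.nf; once Nextflow is installed (see below), you could run it with nextflow run main.nf:

```shell
# Write a minimal example workflow to main.nf.
# The quoted 'EOF' prevents the shell from expanding ${name}.
cat > main.nf <<'EOF'
// Process: receives an input, runs a command, produces an output
process SAY_HELLO {
    input:
    val name

    output:
    stdout

    script:
    """
    echo Hello, ${name}!
    """
}

// Workflow: a channel of inputs is piped into the process,
// and the process output is printed with view()
workflow {
    Channel.of('Alice', 'Bob') | SAY_HELLO | view
}
EOF
```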

Where do I find Nextflow pipelines?

The bioinformatics community has developed a variety of Nextflow pipelines. These curated workflows are collected in a central location called Nextflow core, abbreviated as nf-core.

nf-core can be installed, for example, via the package manager conda.

Setting up Nextflow on the DKFZ Cluster

To set up Nextflow on the DKFZ cluster, please install conda. A guide for the installation of conda can be found here. The conda channel priorities (conda-forge, bioconda) then need to be configured as described here.

Now you have everything in place and are ready to set up Nextflow!

To install a conda environment with Nextflow inside, execute the following command:

conda create -n nextflow_env python=3.12 nextflow nf-core

Alternatively, you can create a YAML file that includes the necessary dependencies and set up your conda environment with conda env create -f nextflow.yaml. The YAML file might look like this:

name: nextflow_env
dependencies:
  - python=3.12
  - nextflow
  - nf-core

The advantage of a YAML file lies in the improved readability and reproducibility of your conda environment: you can pin versions in the file (e.g. python=3.12) and then share it with a colleague, who can install the exact same conda environment on their machine.

NOTE: YAML is whitespace-sensitive, so be careful with the indentation!
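For full reproducibility you can also declare the channels and pin tool versions directly in the file. A sketch (the pinned version numbers below are only illustrative; pick the ones you actually tested with):

```yaml
# nextflow.yaml (sketch): channels and pinned versions are illustrative
name: nextflow_env
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.12
  - nextflow=24.10.0   # pin the workflow engine version you tested with
  - nf-core            # can be pinned the same way, e.g. nf-core=3.0.2
```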

Great! Now, you have access to Nextflow by activating the conda environment with conda activate nextflow_env.

Before we can actually run nf-core pipelines, we have to make some changes to our ~/.bashrc, which will save us some future headache.

You can edit this file with your favorite terminal text editor, such as nano or vim, or with an IDE such as Visual Studio Code.

nf-core pipelines can depend on Singularity images. Since you usually don't want to re-download these images for every pipeline, Nextflow can store previously downloaded Singularity images in a local cache directory and pull them from there. This saves you both time and storage. :)

To enable this option, add the following lines to your ~/.bashrc:

export MYHOME=/omics/groups/<group OE number>/internal/$USER
export NXF_SINGULARITY_CACHEDIR=$MYHOME/.cache/nxf_singularity

Note that the nxf_singularity folder must already exist at the given file path; if it does not, create it with:

mkdir -p /omics/groups/<group OE number>/internal/$USER/.cache/nxf_singularity
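Putting the two steps together, the setup can be sketched as follows. Since the real /omics/groups path is cluster-specific, a temporary directory stands in for your group share here; on the cluster you would use your actual group path instead:

```shell
# Sketch of the cache setup. mktemp -d stands in for your real
# /omics/groups/<group OE number>/internal/$USER directory.
MYHOME=$(mktemp -d)
export NXF_SINGULARITY_CACHEDIR="$MYHOME/.cache/nxf_singularity"

# Create the cache directory if it does not exist yet
mkdir -p "$NXF_SINGULARITY_CACHEDIR"

# Confirm that the directory is in place
ls -ld "$NXF_SINGULARITY_CACHEDIR"
```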

You can also use Apptainer instead of Singularity; in that case, add this to your ~/.bashrc as well:

export APPTAINER_CACHEDIR=$MYHOME/.apptainer/cache

Generally, you have limited space in your home directory (the default Nextflow home is $HOME/.nextflow).

It is therefore advisable to change the location of your default Nextflow home directory in your ~/.bashrc, e.g. with

export NXF_HOME=/omics/groups/<group OE number>/internal/$USER/.nextflow

Note that the .nextflow folder must already exist at the given file path; if it does not, create it with:

mkdir -p /omics/groups/<group OE number>/internal/$USER/.nextflow

You can further include the following lines in your ~/.bashrc to enable pipeline debugging (there are different debugging levels!):

export NXF_DEBUG=1
export APPTAINERENV_NXF_DEBUG=1
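As a sketch of how the two variables relate: NXF_DEBUG accepts the levels 1, 2, and 3 (higher means more verbose), and Apptainer injects any variable prefixed with APPTAINERENV_ into the container environment, so the second export forwards the setting to tasks running inside a container:

```shell
# NXF_DEBUG accepts levels 1-3 (higher = more verbose output)
export NXF_DEBUG=2

# Apptainer passes APPTAINERENV_* variables into the container,
# so the debug level also reaches containerized tasks
export APPTAINERENV_NXF_DEBUG=$NXF_DEBUG

echo "debug level: $NXF_DEBUG"
```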

Running an example nf-core pipeline on the DKFZ Cluster

After setting all these parameters, you can execute an nf-core pipeline like this:

nextflow run nf-core/rnaseq \
    --input <SAMPLESHEET> \
    --outdir <OUTDIR> \
    --gtf <GTF> \
    --fasta <GENOME FASTA> \
    -profile <docker/singularity/.../institute>

Here, the standardized nf-core/rnaseq pipeline is used for demonstration purposes.

On the DKFZ cluster, just make sure to provide -profile singularity along with -c dkfz.config. The default DKFZ config can be found here, but we need to add clusterOptions = '-L /bin/bash' to the executor scope so that our environment variables are sourced; see also this PR.
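A complete invocation might then look like the launcher script below. All file names (samplesheet.csv, genome.fa, annotation.gtf, the dkfz.config location) are placeholders you must adapt to your own data; the snippet only writes the script, it does not start the pipeline:

```shell
# Write a hypothetical launcher script; all paths are placeholders.
cat > run_rnaseq.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

nextflow run nf-core/rnaseq \
    --input samplesheet.csv \
    --outdir results \
    --gtf annotation.gtf \
    --fasta genome.fa \
    -profile singularity \
    -c dkfz.config
EOF
chmod +x run_rnaseq.sh
```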

Debugging

If you encounter any errors during the execution of your Nextflow pipeline, and you want to troubleshoot them, see here.
