011 Nextflow basics & project structure - McGranahanLab/Guidebook GitHub Wiki

Background

Nextflow is a powerful and flexible workflow language that enables the development of scalable and reproducible scientific workflows.

It can integrate various software package and environment management systems such as Docker, Singularity, and Conda. It allows for existing pipelines written in common scripting languages, such as BASH, R and Python, to be seamlessly coupled together.
Nextflow simplifies the implementation and running of workflows on cloud or high-performance computing (HPC) infrastructures.
Nextflow is supported by nf-core, a community effort to collect a curated set of analysis pipelines built using Nextflow.

This page

There is a huge amount of Nextflow documentation available, most of which is not covered here. Instead, this page summarises a simple example pipeline which others can use as a guide when setting up their own pipelines.

Project structure

All Nextflow pipelines should follow roughly the same top-level structure.

.
├── assets
├── bin
├── conda_envs
├── conf
│   ├── crick.config
│   ├── modules.config
│   └── test.config
├── containers
├── inventories
│   └── example_inventory.csv
├── lib
│   └── core_functions.nf
├── main.nf
├── modules
│   ├── subworkflow_1_modules.nf
│   └── subworkflow_2_modules.nf
├── nextflow.config
├── README.md
├── results
├── subworkflows
│   ├── subworkflow_1.nf
│   └── subworkflow_2.nf
├── test_data
│   ├── fastqs
│   │   ├── SRR2584863_1.fastq.gz
│   │   ├── SRR2584863_2.fastq.gz
│   │   ├── SRR2589044_1.fastq.gz
│   │   └── SRR2589044_2.fastq.gz
│   └── reference_genome
│       └── ecoli_rel606.fasta
├── work
└── workflows
    └── example_workflow.nf

15 directories, 20 files

assets/ contains static support files for the analysis, e.g. reference genomes, COSMIC databases, etc.

bin/ contains scripts used by processes within the pipeline

conf/ contains configuration files specific to the pipeline itself as well as the infrastructure running the pipeline

containers/ contains modularised Singularity or Docker containers for all software used within the pipeline

inventories/ contains any sample inventory tables (usually CSV files) to be used by the pipeline
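
As an illustration of how an inventory might be consumed, here is a minimal sketch that reads a CSV into a channel. The column names (sample_id, fastq_1, fastq_2) and the `params.inventory` parameter are assumptions made for this example; the actual layout of example_inventory.csv is not shown on this page.

```nextflow
// Hypothetical default; assumes the inventory has a header row with the
// (assumed) columns sample_id, fastq_1 and fastq_2.
params.inventory = "${projectDir}/inventories/example_inventory.csv"

workflow {
    Channel
        .fromPath(params.inventory, checkIfExists: true)  // fail early if the file is missing
        .splitCsv(header: true)                           // one map per row, keyed by column name
        .map { row -> tuple(row.sample_id, [file(row.fastq_1), file(row.fastq_2)]) }
        .view()                                           // print each tuple for inspection
}
```

Each emitted item is a tuple of the sample identifier and its pair of FASTQ files, which is a convenient shape to pass on to downstream processes.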

lib/ contains useful core functions such as help messages and the pipeline logo.

main.nf is the central Nextflow script; it specifies which pipeline workflows are run.
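
A main.nf in this layout can be very short. The sketch below assumes DSL2 and assumes that workflows/example_workflow.nf defines a workflow named EXAMPLE_WORKFLOW; that name is hypothetical and only serves to show the pattern.

```nextflow
// main.nf (sketch). EXAMPLE_WORKFLOW is an assumed name for the workflow
// defined in workflows/example_workflow.nf.
include { EXAMPLE_WORKFLOW } from './workflows/example_workflow'

// The unnamed workflow is the entry point that Nextflow executes.
workflow {
    EXAMPLE_WORKFLOW()
}
```

Keeping main.nf this thin means the logic lives in workflows/, subworkflows/ and modules/, where it is easier to reuse.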

modules/ contains processes used within the pipeline
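
For readers new to Nextflow, a module file is essentially a collection of process definitions. The following is a minimal sketch of what one such process could look like; the process name, the input tuple shape and the output glob are assumptions for illustration, not the guidebook's actual module.

```nextflow
// Hypothetical process, as it might appear in modules/subworkflow_1_modules.nf.
process FASTQC {
    tag "${sample_id}"   // label each task with its sample in the log

    input:
    tuple val(sample_id), path(reads)   // assumed shape: sample id plus its FASTQ files

    output:
    tuple val(sample_id), path("*_fastqc.{zip,html}"), emit: reports

    script:
    """
    fastqc ${reads}
    """
}
```

A subworkflow or workflow would then pull this in with an `include { FASTQC } from ...` statement and call it on a channel of reads.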

nextflow.config contains all the parameters required to run the pipeline
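
A sketch of how nextflow.config could tie the files in conf/ together is shown below. The parameter names (`inventory`, `outdir`) and the profile names are assumptions chosen to match the file names in the tree above; the real contents of those config files are not shown on this page.

```nextflow
// nextflow.config (sketch). Parameter and profile names are hypothetical.
params {
    inventory = "${projectDir}/inventories/example_inventory.csv"
    outdir    = "${projectDir}/results"
}

// Per-process settings (e.g. publishing options) kept in their own file.
includeConfig 'conf/modules.config'

// Profiles are selected at run time with -profile.
profiles {
    test  { includeConfig 'conf/test.config' }
    crick { includeConfig 'conf/crick.config' }
}
```

With a layout like this, `nextflow run main.nf -profile test` would load the test settings on top of the defaults.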

README.md contains information on what the pipeline does and how to run it

results/ contains all published pipeline results

subworkflows/ contains the subworkflows, each a chain of multiple modules that offers a higher level of functionality within the context of a pipeline.

test_data/ contains any test data which will allow users to run the pipeline locally to check it is working as expected

work/ contains all the process working directories that are produced as the pipeline runs

workflows/ contains scripts combining multiple subworkflows and/or modules together into a single end-to-end pipeline

The pipeline

This pipeline consists of the following steps:

Subworkflow_1:

Adapter and quality trimming of FASTQ files using trimgalore
Quality control checks using fastqc
Generate a QC report using multiqc

Subworkflow_2:

Build a reference index using BWA
Align QC'd FASTQ files to the reference genome using BWA
Produce a summary of the number of aligned reads using samtools-flagstat
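
To make the second subworkflow more concrete, here is a rough single-file sketch of its three steps. All process names, channel shapes and the `params.reference` parameter are assumptions for illustration; in the project layout above the processes would live in modules/ and the workflow in subworkflows/. For brevity the entry workflow feeds the raw test FASTQ files directly, whereas the real pipeline aligns the trimmed reads.

```nextflow
// Hypothetical sketch of Subworkflow_2; not the guidebook's actual code.
params.reference = "${projectDir}/test_data/reference_genome/ecoli_rel606.fasta"

process BWA_INDEX {
    input:
    path fasta

    output:
    path "${fasta}.*", emit: index   // index files written next to the FASTA by bwa index

    script:
    """
    bwa index ${fasta}
    """
}

process BWA_ALIGN {
    tag "${sample_id}"

    input:
    tuple val(sample_id), path(reads)
    path fasta
    path index   // staged alongside the FASTA so bwa can find the index

    output:
    tuple val(sample_id), path("${sample_id}.bam"), emit: bam

    script:
    """
    bwa mem ${fasta} ${reads} | samtools sort -o ${sample_id}.bam -
    """
}

process FLAGSTAT {
    tag "${sample_id}"

    input:
    tuple val(sample_id), path(bam)

    output:
    tuple val(sample_id), path("${sample_id}.flagstat"), emit: stats

    script:
    """
    samtools flagstat ${bam} > ${sample_id}.flagstat
    """
}

workflow SUBWORKFLOW_2 {
    take:
    reads_ch   // tuple(sample_id, [fastq_1, fastq_2])

    main:
    BWA_INDEX(file(params.reference))
    BWA_ALIGN(reads_ch, file(params.reference), BWA_INDEX.out.index)
    FLAGSTAT(BWA_ALIGN.out.bam)

    emit:
    stats = FLAGSTAT.out.stats
}

workflow {
    // Pair up *_1 / *_2 files from the test data by their shared prefix.
    reads_ch = Channel.fromFilePairs("${projectDir}/test_data/fastqs/*_{1,2}.fastq.gz")
    SUBWORKFLOW_2(reads_ch)
}
```

Because the index is built from a single file rather than a queue of inputs, its output can be reused for every sample passed to the alignment step.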