011 Nextflow basics & project structure - McGranahanLab/Guidebook GitHub Wiki

Background

Nextflow is a powerful and flexible workflow language that enables the development of scalable and reproducible scientific workflows.

  • It integrates with software packaging and environment management systems such as Docker, Singularity and Conda, and allows existing scripts written in common languages such as Bash, R and Python to be coupled together in a single pipeline.

  • Nextflow simplifies the implementation and running of workflows on cloud or high-performance computing (HPC) infrastructures.

  • Nextflow is backed by nf-core, a community effort to collect a curated set of analysis pipelines built using Nextflow.

This page

There is a huge amount of Nextflow documentation available, most of which will not be covered here. Instead, this page presents and summarises a simple example pipeline that others can use as a guide when setting up their own pipelines.

Project structure

All Nextflow pipelines should follow roughly the same top-level structure.

.
├── assets
├── bin
├── conda_envs
├── conf
│   ├── crick.config
│   ├── modules.config
│   └── test.config
├── containers
├── inventories
│   └── example_inventory.csv
├── lib
│   └── core_functions.nf
├── main.nf
├── modules
│   ├── subworkflow_1_modules.nf
│   └── subworkflow_2_modules.nf
├── nextflow.config
├── README.md
├── results
├── subworkflows
│   ├── subworkflow_1.nf
│   └── subworkflow_2.nf
├── test_data
│   ├── fastqs
│   │   ├── SRR2584863_1.fastq.gz
│   │   ├── SRR2584863_2.fastq.gz
│   │   ├── SRR2589044_1.fastq.gz
│   │   └── SRR2589044_2.fastq.gz
│   └── reference_genome
│       └── ecoli_rel606.fasta
├── work
└── workflows
    └── example_workflow.nf

15 directories, 20 files

assets/ contains static support files for the analysis, e.g. reference genomes and COSMIC databases

bin/ contains scripts used by processes within the pipeline

conf/ contains configuration files specific to the pipeline itself as well as the infrastructure running the pipeline
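
As an illustration of the infrastructure side of conf/, the sketch below shows what a cluster configuration file might look like. It assumes a SLURM scheduler and Singularity containers. The file name, queue name, resource values and cache directory are hypothetical placeholders and are not taken from the actual crick.config.

```nextflow
// conf/example_hpc.config (hypothetical file name)
// Infrastructure settings only: how and where processes are executed.

process {
    executor = 'slurm'   // assumption: the cluster uses SLURM
    queue    = 'cpu'     // hypothetical queue/partition name
    cpus     = 2         // default resources applied to every process
    memory   = '8 GB'
    time     = '4h'
}

singularity {
    enabled    = true    // run each process inside a Singularity container
    autoMounts = true    // bind-mount host paths automatically
    cacheDir   = "${projectDir}/containers"   // assumption: images are kept in containers/
}
```

Keeping these settings out of the pipeline logic means the same workflow code can be run on a laptop or a cluster by switching configuration files.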

containers/ contains modularised Singularity or Docker containers for all software used within the pipeline

inventories/ contains any sample inventory tables (usually CSV files) to be used by the pipeline
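
A sample inventory is typically turned into a channel of per-sample inputs. The sketch below shows one way this might be done with the `splitCsv` operator. The column names (`sample_id`, `fastq_1`, `fastq_2`) and the `params.inventory` parameter are assumptions made for illustration; the real layout of example_inventory.csv is not shown on this page.

```nextflow
// Assumed inventory layout (hypothetical column names):
//   sample_id,fastq_1,fastq_2
//   SRR2584863,test_data/fastqs/SRR2584863_1.fastq.gz,test_data/fastqs/SRR2584863_2.fastq.gz

params.inventory = "${projectDir}/inventories/example_inventory.csv"

workflow {
    Channel
        .fromPath(params.inventory, checkIfExists: true)
        .splitCsv(header: true)   // one map per row, keyed by column name
        .map { row ->
            // emit tuple(sample_id, [read_1, read_2]) for each sample
            tuple(row.sample_id, [file(row.fastq_1), file(row.fastq_2)])
        }
        .view()                   // print each emitted tuple
}
```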

lib/ contains useful core functions such as help messages and the pipeline logo.

main.nf is the central Nextflow script - it specifies which pipeline workflows will be run.
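
A minimal sketch of what such a main.nf might contain is given below. The workflow name `EXAMPLE_WORKFLOW` is hypothetical; the include path follows the directory layout shown above.

```nextflow
#!/usr/bin/env nextflow

// main.nf: entry point of the pipeline.
// Explicitly enabling DSL2 is only needed on older Nextflow releases;
// recent releases use DSL2 by default.
nextflow.enable.dsl = 2

// EXAMPLE_WORKFLOW is a hypothetical name for the workflow
// defined in workflows/example_workflow.nf
include { EXAMPLE_WORKFLOW } from './workflows/example_workflow'

// The unnamed workflow is what Nextflow executes when the script is run
workflow {
    EXAMPLE_WORKFLOW()
}
```

A pipeline laid out like this would be launched with `nextflow run main.nf`, usually adding `-profile <name>` to select a configuration profile.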

modules/ contains processes used within the pipeline
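
For readers new to Nextflow, a process wraps a single command together with its inputs and outputs. The sketch below shows what a FastQC process in a module file might look like. The process name, the `reports` output name and the `params.outdir` parameter are assumptions, not the pipeline's actual code.

```nextflow
// Hypothetical process: run FastQC on the reads of one sample
process FASTQC {
    tag "${sample_id}"                                   // label shown in the execution log
    publishDir "${params.outdir}/fastqc", mode: 'copy'   // assumes params.outdir is defined

    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path("*_fastqc.{zip,html}"), emit: reports

    script:
    """
    fastqc --threads ${task.cpus} ${reads}
    """
}
```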

nextflow.config contains all the parameters required to run the pipeline
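
The sketch below illustrates how a nextflow.config might declare parameters and pull in the files under conf/. All parameter names, default values and profile names are hypothetical; only the file paths are taken from the directory layout above.

```nextflow
// nextflow.config (sketch; parameter and profile names are hypothetical)

params {
    inventory = "${projectDir}/inventories/example_inventory.csv"
    reference = "${projectDir}/test_data/reference_genome/ecoli_rel606.fasta"
    outdir    = "${projectDir}/results"
}

// Per-process settings are kept in a separate file
includeConfig 'conf/modules.config'

// Profiles are selected at run time with `-profile <name>`
profiles {
    test  { includeConfig 'conf/test.config' }
    crick { includeConfig 'conf/crick.config' }
}
```

Any parameter declared in the `params` block can be overridden on the command line, for example `--outdir my_results`.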

README.md contains information on what the pipeline does and how to run it

results/ contains all published pipeline results

subworkflows/ contains the subworkflows, each a chain of multiple modules that provides a higher level of functionality within the context of a pipeline.
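
A subworkflow is a named workflow with explicit inputs (`take`) and outputs (`emit`). The sketch below shows how a QC subworkflow might chain three processes. The process names and their output names (`trimmed`, `reports`, `report`) are hypothetical and are assumed to be defined in modules/subworkflow_1_modules.nf.

```nextflow
// subworkflows/subworkflow_1.nf (sketch)
// TRIMGALORE, FASTQC and MULTIQC are hypothetical process names
include { TRIMGALORE; FASTQC; MULTIQC } from '../modules/subworkflow_1_modules'

workflow SUBWORKFLOW_1 {
    take:
    reads   // channel of tuple(sample_id, [read_1, read_2])

    main:
    TRIMGALORE(reads)
    FASTQC(TRIMGALORE.out.trimmed)

    // Drop the sample IDs and gather every FastQC report into one list,
    // so that MultiQC runs once over all samples
    MULTIQC(FASTQC.out.reports.map { sample_id, files -> files }.collect())

    emit:
    trimmed = TRIMGALORE.out.trimmed
    report  = MULTIQC.out.report
}
```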

test_data/ contains any test data which will allow users to run the pipeline locally to check it is working as expected

work/ contains all the process working directories that are produced as the pipeline runs

workflows/ contains scripts combining multiple subworkflows and/or modules together into a single end-to-end pipeline
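
As a rough illustration, a workflow script might build the input channels and pass them through the subworkflows in turn. In the sketch below the workflow and subworkflow names, the `trimmed` output, the inventory column names and the `params.inventory` and `params.reference` parameters are all assumptions.

```nextflow
// workflows/example_workflow.nf (sketch; all names are hypothetical)
include { SUBWORKFLOW_1 } from '../subworkflows/subworkflow_1'
include { SUBWORKFLOW_2 } from '../subworkflows/subworkflow_2'

workflow EXAMPLE_WORKFLOW {
    // One tuple(sample_id, [read_1, read_2]) per row of the inventory
    reads = Channel
        .fromPath(params.inventory, checkIfExists: true)
        .splitCsv(header: true)
        .map { row -> tuple(row.sample_id, [file(row.fastq_1), file(row.fastq_2)]) }

    // Single reference genome shared by all samples
    reference = file(params.reference, checkIfExists: true)

    SUBWORKFLOW_1(reads)                                   // trimming and QC
    SUBWORKFLOW_2(SUBWORKFLOW_1.out.trimmed, reference)    // indexing and alignment
}
```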

The pipeline

This pipeline consists of the following steps:

Subworkflow_1:

  1. Adapter and quality trimming of FASTQ files using Trim Galore
  2. Quality control checks using FastQC
  3. Generate a QC report using MultiQC

Subworkflow_2:

  4. Build a reference index using BWA
  5. Align the QC'd FASTQ files to the reference genome using BWA
  6. Produce a summary of the number of aligned reads using samtools flagstat
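
The alignment steps (4 to 6) could be sketched as follows. This is not the pipeline's actual code: the process names, channel shapes and output file names are hypothetical. The processes and the subworkflow are shown in one block for readability, whereas in the layout above they would live in modules/ and subworkflows/ respectively.

```nextflow
process BWA_INDEX {
    input:
    path fasta

    output:
    path "bwa_index", emit: index   // directory holding the index files

    script:
    """
    mkdir bwa_index
    bwa index -p bwa_index/ref ${fasta}
    """
}

process BWA_MEM {
    tag "${sample_id}"

    input:
    tuple val(sample_id), path(reads)   // trimmed paired-end FASTQ files
    path index                          // directory produced by BWA_INDEX

    output:
    tuple val(sample_id), path("${sample_id}.bam"), emit: bam

    script:
    """
    bwa mem -t ${task.cpus} ${index}/ref ${reads} | samtools sort -o ${sample_id}.bam -
    """
}

process SAMTOOLS_FLAGSTAT {
    tag "${sample_id}"

    input:
    tuple val(sample_id), path(bam)

    output:
    tuple val(sample_id), path("${sample_id}.flagstat.txt"), emit: stats

    script:
    """
    samtools flagstat ${bam} > ${sample_id}.flagstat.txt
    """
}

workflow SUBWORKFLOW_2 {
    take:
    reads   // channel of tuple(sample_id, [read_1, read_2])
    fasta   // single reference FASTA passed as a value, e.g. created with file()

    main:
    // Because fasta is a single value, the index is a value channel
    // and is reused for every sample in reads
    BWA_INDEX(fasta)
    BWA_MEM(reads, BWA_INDEX.out.index)
    SAMTOOLS_FLAGSTAT(BWA_MEM.out.bam)

    emit:
    bam   = BWA_MEM.out.bam
    stats = SAMTOOLS_FLAGSTAT.out.stats
}
```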