011 Nextflow basics & project structure - McGranahanLab/Guidebook GitHub Wiki
Background
Nextflow is a workflow language for developing scalable and reproducible scientific workflows.
It integrates with software packaging and environment management systems such as Docker, Singularity and Conda, and it allows existing pipelines written in common scripting languages, such as Bash, R and Python, to be coupled together.
Nextflow simplifies implementing and running workflows on cloud or high-performance computing (HPC) infrastructure.
Nextflow is backed by nf-core, a community effort to collect a curated set of analysis pipelines built using Nextflow.
This page
There is a large amount of Nextflow documentation available, most of which is not covered here. Instead, this page presents and summarises a simple example pipeline that others can use as a guide when setting up their own pipelines.
Project structure
All Nextflow pipelines should follow roughly the same top-level structure.
.
├── assets
├── bin
├── conda_envs
├── conf
│   ├── crick.config
│   ├── modules.config
│   └── test.config
├── containers
├── inventories
│   └── example_inventory.csv
├── lib
│   └── core_functions.nf
├── main.nf
├── modules
│   ├── subworkflow_1_modules.nf
│   └── subworkflow_2_modules.nf
├── nextflow.config
├── README.md
├── results
├── subworkflows
│   ├── subworkflow_1.nf
│   └── subworkflow_2.nf
├── test_data
│   ├── fastqs
│   │   ├── SRR2584863_1.fastq.gz
│   │   ├── SRR2584863_2.fastq.gz
│   │   ├── SRR2589044_1.fastq.gz
│   │   └── SRR2589044_2.fastq.gz
│   └── reference_genome
│       └── ecoli_rel606.fasta
├── work
└── workflows
    └── example_workflow.nf
15 directories, 20 files
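As a starting point, the layout above can be created with a few lines of Python. This is only a minimal sketch: the function name `scaffold` is hypothetical, the files it creates are empty placeholders, and `results/` and `work/` are left out on the assumption that they appear once the pipeline runs.

```python
from pathlib import Path

# Directory names copied from the tree above. results/ and work/ are omitted
# here on the assumption that they are produced when the pipeline runs.
DIRECTORIES = [
    "assets", "bin", "conda_envs", "conf", "containers", "inventories",
    "lib", "modules", "subworkflows", "workflows",
    "test_data/fastqs", "test_data/reference_genome",
]

# Placeholder files, also named as in the tree above; they are created empty.
FILES = [
    "conf/crick.config", "conf/modules.config", "conf/test.config",
    "inventories/example_inventory.csv", "lib/core_functions.nf",
    "main.nf", "nextflow.config", "README.md",
    "modules/subworkflow_1_modules.nf", "modules/subworkflow_2_modules.nf",
    "subworkflows/subworkflow_1.nf", "subworkflows/subworkflow_2.nf",
    "workflows/example_workflow.nf",
]


def scaffold(root):
    """Create the skeleton under `root`; existing files are left untouched."""
    root = Path(root)
    for directory in DIRECTORIES:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for file_name in FILES:
        (root / file_name).touch(exist_ok=True)
    return root
```

Calling `scaffold("my_pipeline")` (the directory name is an example) creates the skeleton in the current directory. Running it a second time is harmless.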
assets/
contains various static support files for the analysis, e.g. reference genomes and COSMIC databases
bin/
contains scripts used by processes within the pipeline
conf/
contains configuration files specific to the pipeline itself as well as the infrastructure running the pipeline
containers/
contains modularised Singularity or Docker containers for all software used within the pipeline
inventories/
contains any sample inventory tables (usually CSV files) to be used by the pipeline
lib/
contains useful core functions such as help messages and the pipeline logo.
main.nf
is the central Nextflow script; it specifies which pipeline workflows will be run
modules/
contains processes used within the pipeline
nextflow.config
contains all the parameters required to run the pipeline
README.md
contains information on what the pipeline does and how to run it
results/
contains all published pipeline results
subworkflows/
contains the subworkflows, each a chain of multiple modules that offers a higher level of functionality within the context of a pipeline
test_data/
contains any test data which will allow users to run the pipeline locally to check it is working as expected
work/
contains all the process working directories that are produced as the pipeline runs
workflows/
contains scripts combining multiple subworkflows and/or modules together into a single end-to-end pipeline
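To show how `bin/`, `inventories/` and `test_data/` might fit together, the sketch below is a helper script of the kind that could live in `bin/`. It pairs the `_1`/`_2` FASTQ files in a directory and writes a sample inventory CSV. The function name `build_inventory` and the column names `sample_id`, `fastq_1` and `fastq_2` are assumptions made for illustration. The actual columns of `example_inventory.csv` are not described on this page.

```python
import csv
import re
from pathlib import Path

# Matches paired-end names such as SRR2584863_1.fastq.gz / SRR2584863_2.fastq.gz
PAIRED_FASTQ = re.compile(r"^(?P<sample>.+)_(?P<read>[12])\.fastq\.gz$")

# Hypothetical column names; adjust them to whatever the pipeline expects.
COLUMNS = ["sample_id", "fastq_1", "fastq_2"]


def build_inventory(fastq_dir, inventory_csv):
    """Write one row per sample with both mate files; return the rows."""
    pairs = {}
    for path in sorted(Path(fastq_dir).glob("*.fastq.gz")):
        match = PAIRED_FASTQ.match(path.name)
        if match is None:
            continue  # skip files that do not follow the _1/_2 naming pattern
        pairs.setdefault(match["sample"], {})[match["read"]] = str(path)

    rows = []
    for sample, reads in sorted(pairs.items()):
        if set(reads) != {"1", "2"}:
            raise ValueError(f"sample {sample} is missing one of its mate files")
        rows.append(
            {"sample_id": sample, "fastq_1": reads["1"], "fastq_2": reads["2"]}
        )

    with open(inventory_csv, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

For the test data shown in the tree, `build_inventory("test_data/fastqs", "inventories/example_inventory.csv")` would write two rows, one for each SRR accession.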
The pipeline
This pipeline consists of the following steps:
Subworkflow_1:
- Adapter and quality trimming of FASTQ files using Trim Galore
- Quality control checks using FastQC
- Generate a QC report using MultiQC
Subworkflow_2:
- Build a reference index using BWA
- Align the quality-controlled FASTQ files to the reference genome using BWA
- Produce a summary of the number of aligned reads using samtools flagstat