011 Nextflow basics & project structure - McGranahanLab/Guidebook GitHub Wiki
Background
- Nextflow is a powerful and flexible workflow language that enables the development of scalable and reproducible scientific workflows.
- It integrates with software packaging and environment management systems such as Docker, Singularity, and Conda, and allows existing pipelines written in common scripting languages, such as Bash, R and Python, to be coupled together.
- Nextflow simplifies the implementation and running of workflows on cloud or high-performance computing (HPC) infrastructures.
- Nextflow is backed by nf-core, a community effort to collect a curated set of analysis pipelines built using Nextflow.
This page
There is a huge amount of Nextflow documentation available, most of which is not covered here. Instead, this page walks through a simple example pipeline that others can use as a guide when setting up their own pipelines.
Project structure
Nextflow pipelines should follow roughly the same top-level structure:
.
├── assets
├── bin
├── conda_envs
├── conf
│   ├── crick.config
│   ├── modules.config
│   └── test.config
├── containers
├── inventories
│   └── example_inventory.csv
├── lib
│   └── core_functions.nf
├── main.nf
├── modules
│   ├── subworkflow_1_modules.nf
│   └── subworkflow_2_modules.nf
├── nextflow.config
├── README.md
├── results
├── subworkflows
│   ├── subworkflow_1.nf
│   └── subworkflow_2.nf
├── test_data
│   ├── fastqs
│   │   ├── SRR2584863_1.fastq.gz
│   │   ├── SRR2584863_2.fastq.gz
│   │   ├── SRR2589044_1.fastq.gz
│   │   └── SRR2589044_2.fastq.gz
│   └── reference_genome
│       └── ecoli_rel606.fasta
├── work
└── workflows
    └── example_workflow.nf
15 directories, 20 files
assets/ contains static support files for the analysis, e.g. reference genomes and COSMIC databases
bin/ contains scripts used by processes within the pipeline
conf/ contains configuration files specific to the pipeline itself as well as the infrastructure running the pipeline
containers/ contains modularised Singularity or Docker containers for all software used within the pipeline
inventories/ contains any sample inventory tables (usually CSV files) to be used by the pipeline
lib/ contains useful core functions such as help messages and the pipeline logo
main.nf is the central Nextflow script; it specifies which pipeline workflows will be run
modules/ contains processes used within the pipeline
nextflow.config contains all the parameters required to run the pipeline
README.md contains information on what the pipeline does and how to run it
results/ contains all published pipeline results
subworkflows/ contains the subworkflows, each a chain of multiple modules that offers a higher level of functionality within the context of a pipeline
test_data/ contains any test data which will allow users to run the pipeline locally to check it is working as expected
work/ contains all the process working directories that are produced as the pipeline runs
workflows/ contains scripts combining multiple subworkflows and/or modules together into a single end-to-end pipeline
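As a rough illustration of the layout described above, the following Python sketch creates the skeleton for a new pipeline and reports which expected entries are missing from an existing one. It is not part of the example pipeline: the helper names `scaffold` and `missing_entries` are hypothetical, and leaving out results/ and work/ is an assumption based on both being written when the pipeline runs.

```python
from pathlib import Path

# Top-level directories and files taken from the layout above.
# results/ and work/ are omitted (assumption: they appear when the pipeline runs).
EXPECTED_DIRS = [
    "assets", "bin", "conda_envs", "conf", "containers", "inventories",
    "lib", "modules", "subworkflows", "test_data", "workflows",
]
EXPECTED_FILES = ["main.nf", "nextflow.config", "README.md"]


def missing_entries(root):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


def scaffold(root):
    """Create any missing directories and empty placeholder files under root."""
    root = Path(root)
    for d in EXPECTED_DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)
    for f in EXPECTED_FILES:
        # touch() creates an empty file but never changes existing content.
        (root / f).touch(exist_ok=True)


if __name__ == "__main__":
    scaffold("my_pipeline")  # hypothetical project directory
    print(missing_entries("my_pipeline"))
```

Running `missing_entries` against an existing repository gives a quick check that nothing from the standard layout has been forgotten.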
The pipeline
This pipeline consists of the following steps:
Subworkflow_1:
- Adapter and quality trimming of FASTQ files using Trim Galore
- Quality control checks using FastQC
- Generate a QC report using MultiQC
Subworkflow_2:
- Build a reference index using BWA
- Align the QC'd FASTQ files to the reference genome using BWA
- Produce a summary of the number of aligned reads using samtools flagstat
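To make the six steps concrete, the sketch below builds the command line each step might run for one paired-end sample. It is a dry run that only returns strings; in the actual pipeline each command would live inside a Nextflow process. Everything here is illustrative: the function name `plan_commands`, the output directory layout, running FastQC on the trimmed reads, and the `_val_1`/`_val_2` file names (assumed to match Trim Galore's default paired-end naming) are assumptions, not taken from the pipeline itself.

```python
import shlex
from pathlib import PurePosixPath


def plan_commands(sample, read1, read2, reference, outdir="results"):
    """Return the shell commands for one paired-end sample, in pipeline order."""
    out = PurePosixPath(outdir)
    trimmed_dir = out / "trimmed"
    # Assumption: Trim Galore names paired outputs <input>_val_1.fq.gz / _val_2.fq.gz.
    trimmed1 = trimmed_dir / PurePosixPath(read1).name.replace(".fastq.gz", "_val_1.fq.gz")
    trimmed2 = trimmed_dir / PurePosixPath(read2).name.replace(".fastq.gz", "_val_2.fq.gz")
    sam = out / "aligned" / f"{sample}.sam"

    steps = [
        # Subworkflow_1: trimming, per-file QC, aggregated QC report
        ["trim_galore", "--paired", "--output_dir", str(trimmed_dir), read1, read2],
        ["fastqc", "--outdir", str(out / "fastqc"), str(trimmed1), str(trimmed2)],
        ["multiqc", "--outdir", str(out / "multiqc"), str(out)],
        # Subworkflow_2: index the reference, align, summarise the alignment
        ["bwa", "index", reference],
        ["bwa", "mem", "-o", str(sam), reference, str(trimmed1), str(trimmed2)],
        ["samtools", "flagstat", str(sam)],
    ]
    # shlex.join quotes arguments where needed so each string is safe to paste into a shell.
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    for command in plan_commands(
        "SRR2584863",
        "test_data/fastqs/SRR2584863_1.fastq.gz",
        "test_data/fastqs/SRR2584863_2.fastq.gz",
        "test_data/reference_genome/ecoli_rel606.fasta",
    ):
        print(command)
```

The output directories are not created by this sketch, so they would need to exist before the commands are run by hand. Building the reference index is listed once per call here, whereas a real pipeline would build it a single time and reuse it across samples.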