Nextflow Pipeline - egenomics/agb2025 GitHub Wiki

The Nextflow pipeline automates the processing and analysis of 16S rRNA amplicon data through a modular, containerised workflow. It is designed for reproducibility, traceability, and compatibility with standard bioinformatics tools and formats.

Pipeline Configuration Files

main.nf

This is the main script of the pipeline, written in the Nextflow DSL (domain-specific language). It defines the workflow logic, including the processes (FastQC, Trimmomatic, Kraken2, QIIME 2, etc.), their inputs and outputs, execution order, and how data flows between them.

In AGB2025, main.nf coordinates the six pipeline stages, each implemented as one or more independent processes. It ensures modular execution, proper parallelization, and dependency management.
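
The exact process definitions live in the repository, but the general shape of a DSL2 main.nf is roughly the following. This is a minimal sketch only: the process name, container image, glob pattern, and channel names are illustrative, not the actual AGB2025 code.

```nextflow
// Minimal DSL2 sketch of how main.nf defines and chains processes.
// Process name, container image, and channel names are illustrative.
process FASTQC {
    container 'biocontainers/fastqc:v0.11.9_cv8'   // example image

    input:
    tuple val(sample_id), path(reads)

    output:
    path "*_fastqc.zip", emit: reports

    script:
    """
    fastqc --outdir . ${reads}
    """
}

workflow {
    // Pair R1/R2 FASTQ files per sample and fan them out to the first stage
    reads_ch = Channel.fromFilePairs("${params.raw_data}/*_R{1,2}*.fastq.gz")
    FASTQC(reads_ch)
}
```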

nextflow.config

This file defines the configuration settings for running the pipeline. It includes:

  • Parameters: such as paths to input data, metadata, or resources.
  • Profiles: environment-specific settings (e.g., Docker).
  • Process resources: such as CPUs, memory, and the container image assigned to each step.
  • Global settings: such as working directory, resume behavior, or reporting options.
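
For illustration, a minimal nextflow.config covering these four areas could look like the sketch below; the parameter names, paths, resource values, and the process selector are placeholders rather than the project's actual settings.

```nextflow
// Illustrative nextflow.config; all names, paths, and values are placeholders.
params {
    raw_data   = 'raw_data'              // input FASTQ folder
    metadata   = 'metadata/samples.tsv'  // sample metadata sheet
    kraken2_db = '/path/to/kraken2_db'   // contaminant database
    outdir     = 'results'
}

profiles {
    docker {
        docker.enabled = true
    }
}

process {
    cpus   = 2          // defaults, overridden per step below
    memory = '4 GB'

    withName: 'QIIME2_DADA2' {
        cpus   = 8
        memory = '16 GB'
    }
}

report.enabled   = true   // execution report
timeline.enabled = true   // timeline of tasks
```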

Together, main.nf and nextflow.config enable reproducible, portable, and scalable bioinformatics workflows. The pipeline orchestrates six main stages in a fixed order:

1. Input Monitoring

For each declared RunID, the pipeline scans the corresponding raw_data/ folder for paired-end FASTQ files.
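
In Nextflow terms this is typically done with a file-pair channel. The sketch below assumes a runs/<RunID>/raw_data/ layout and parameter names chosen for illustration only.

```nextflow
// Sketch of paired-end input discovery for a single run.
// The runs/<RunID>/raw_data/ layout and parameter names are assumptions.
params.run_id   = 'run01'
params.raw_data = "runs/${params.run_id}/raw_data"

workflow {
    // Groups *_R1 / *_R2 files into (sample_id, [R1, R2]) tuples
    reads_ch = Channel.fromFilePairs(
        "${params.raw_data}/*_R{1,2}*.fastq.gz",
        checkIfExists: true
    )
    reads_ch.view()   // e.g. [sampleA, [sampleA_R1.fastq.gz, sampleA_R2.fastq.gz]]
}
```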

2. Preprocessing

Raw reads are quality-checked with FastQC, then cleaned and synchronized using Trimmomatic. Adapter trimming parameters are pre-configured but can be adjusted by the user.
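
A paired-end Trimmomatic step in this kind of pipeline commonly looks like the sketch below; the adapter file parameter (params.adapters), thresholds, and output naming are example values, not the pipeline's configured defaults.

```nextflow
// Hedged sketch of a paired-end Trimmomatic process; thresholds and the
// params.adapters file are illustrative, not the pipeline's configured values.
process TRIMMOMATIC {
    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path("${sample_id}_*P.fastq.gz"), emit: trimmed_pairs
    path "${sample_id}_trim.log",                           emit: logs

    script:
    """
    trimmomatic PE -threads ${task.cpus} \\
        ${reads[0]} ${reads[1]} \\
        ${sample_id}_1P.fastq.gz ${sample_id}_1U.fastq.gz \\
        ${sample_id}_2P.fastq.gz ${sample_id}_2U.fastq.gz \\
        ILLUMINACLIP:${params.adapters}:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50 \\
        2> ${sample_id}_trim.log
    """
}
```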

3. Contamination Detection

Cleaned reads are classified with Kraken2 against a database of human and common laboratory contaminants. This step does not remove any reads; contamination levels are detected and annotated in the metadata for transparency and optional downstream filtering.
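
A minimal Kraken2 classification step might be shaped like this; the database parameter name (params.kraken2_db) and report file naming are assumptions, and the reads themselves pass through unmodified.

```nextflow
// Hedged sketch of the Kraken2 contamination screen; reads are classified,
// not filtered. params.kraken2_db and the file names are assumptions.
process KRAKEN2 {
    input:
    tuple val(sample_id), path(reads)

    output:
    path "${sample_id}.kraken2.report", emit: reports

    script:
    """
    kraken2 --db ${params.kraken2_db} \\
        --threads ${task.cpus} \\
        --paired \\
        --report ${sample_id}.kraken2.report \\
        --output ${sample_id}.kraken2.txt \\
        ${reads[0]} ${reads[1]}
    """
}
```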

4. MultiQC and Metadata Integration

FastQC, Trimmomatic, and Kraken2 outputs are aggregated by MultiQC. These metrics are then merged with the original sample metadata to inform sample quality classification and diagnostics.
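
Aggregation usually amounts to staging all per-sample QC files into one MultiQC run, as in the hedged sketch below; the metadata merge itself is project-specific and is only hinted at in the final comment.

```nextflow
// Hedged sketch of report aggregation; MultiQC scans whatever QC files are
// staged into its working directory and writes a single combined report.
process MULTIQC {
    input:
    path qc_files   // FastQC zips, Trimmomatic logs, Kraken2 reports, ...

    output:
    path "multiqc_report.html"
    path "multiqc_data"

    script:
    """
    multiqc . --force
    """
}

// In the workflow block the upstream outputs are gathered first, e.g.:
// MULTIQC( FASTQC.out.reports.mix(TRIMMOMATIC.out.logs, KRAKEN2.out.reports).collect() )
```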

5. QIIME 2 Analysis

High-quality reads are imported into QIIME 2 for denoising with DADA2, taxonomic classification, feature table generation, and optional phylogenetic and diversity analyses. Parameters such as truncation length or sampling depth can be provided manually or suggested automatically from the quality reports.
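
As an example of how the denoising step is parameterised, a DADA2 process might look like the sketch below; the parameter names (params.trunc_len_f, params.trunc_len_r) and the upstream import step are assumptions for illustration.

```nextflow
// Hedged sketch of DADA2 denoising inside QIIME 2; truncation lengths are
// exposed as parameters because they depend on each run's quality profile.
process QIIME2_DADA2 {
    input:
    path demux_qza   // paired-end reads already imported as a .qza artifact

    output:
    path "table.qza"              // feature table
    path "rep-seqs.qza"           // representative sequences
    path "denoising-stats.qza"    // per-sample denoising statistics

    script:
    """
    qiime dada2 denoise-paired \\
        --i-demultiplexed-seqs ${demux_qza} \\
        --p-trunc-len-f ${params.trunc_len_f} \\
        --p-trunc-len-r ${params.trunc_len_r} \\
        --o-table table.qza \\
        --o-representative-sequences rep-seqs.qza \\
        --o-denoising-stats denoising-stats.qza
    """
}
```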

6. Export and Reporting

Key outputs — including .qza/.qzv artifacts, taxonomy tables, rarefaction summaries, and diversity plots — are exported in standard formats. All results are organized in a versioned output directory: