General overview

Overview

This pipeline was created by the Hospital del Mar bioinformatics group (HdMBioinfo) to process microbiota data from stool samples handled by the hospital's analytical facilities at the request of the Digestology Unit. The main objective is to classify samples as coming from healthy or non-healthy individuals, so that clinicians can afterwards decide on useful interventions (e.g., whether an individual could be a potential stool donor, or whether they need a stool transplant).

The sample metadata comes from the hospital database (a relational SQL database); we retrieve it using the Sample_Accession provided by the barcode on each stool sample. When the sample is then processed by a technician (DNA extraction and amplification of the V4 region of the 16S rRNA gene), he/she enters the technical metadata into the database, linking the RunID with the Sample_Accessions that will be sequenced in that sequencing batch.
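
As an illustration, the metadata lookup might resemble the following minimal sketch (the table name, column names, and the use of Python's sqlite3 as a stand-in for the hospital's actual database driver are all assumptions):

```python
import sqlite3

def fetch_sample_metadata(db_path: str, sample_accession: str) -> dict:
    """Retrieve the clinical metadata row for one Sample_Accession.

    The schema (table 'samples' and its columns) is hypothetical; the
    real hospital database would be queried through its own driver.
    """
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # access columns by name
    cur = conn.execute(
        "SELECT * FROM samples WHERE sample_accession = ?",
        (sample_accession,),
    )
    row = cur.fetchone()
    conn.close()
    return dict(row) if row else {}

# Example: metadata = fetch_sample_metadata("hospital.db", "S0001")
```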

After sequencing the samples on an Illumina MiSeq sequencer (the sequencer available at the hospital), a small team of bioinformatics technicians demultiplexes the raw sequencer output (BCL files) and sends us the FASTQ files for each sample processed in that sequencing run (files are named by Sample ID for compatibility with QIIME 2).

Sample Workflow

  • Every stool specimen (“sample”; unique Sample_Accession) is first registered in the hospital database together with its clinical and technical metadata. After collection, the tube is stored at the recorded temperature and the Sample_Accession is paired with a sequencing batch identifier (“run”, e.g. R01030525).

  • Library preparation typically follows 16S rRNA amplification (although shotgun metagenomics is also supported), after which the MiSeq run produces raw FASTQ files whose names retain both the Sample_Accession and the RunID.

  • Demultiplexed reads enter the HdMBioinfo pipeline for trimming, contaminant filtering and other QC steps; the resulting high-quality reads feed taxonomic and (when applicable) functional profilers to generate per-sample abundance tables and diversity metrics.

  • Finally, the processed outputs, together with the harmonised metadata (parsed and cleaned via metadata_parsing.R), are loaded into the R Shiny dashboard, where the combined dataset is treated as an “analysis” object.

  • The HdMBioinfo team generated a .csv file, ./metadata/healthy_controls/curated/metadata_cleaned.csv, containing only those PRJEB41316 donors who passed our most stringent inclusion filters (no chronic disease, BMI 18.5-24.9, antibiotic-free ≥ 6 months, and comprehensive lifestyle data). This verified list is incorporated at the start of the process, so that subsequent analyses and the R Shiny panel can automatically flag each incoming sample as a healthy or unhealthy control by comparing its metadata to this reference set (see the sketch after this list).
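
One way to approximate that flagging step is to check membership in the curated cohort or re-apply the same inclusion filters, as in this minimal sketch (pandas is assumed, and the column names 'chronic_disease', 'bmi' and 'months_since_antibiotics' are illustrative, not the pipeline's actual schema):

```python
import pandas as pd

# Curated reference set of healthy donors (PRJEB41316 subset).
reference = pd.read_csv("metadata/healthy_controls/curated/metadata_cleaned.csv")
healthy_ids = set(reference["sample_accession"])

def flag_healthy(sample: pd.Series) -> bool:
    """Flag a sample as a healthy control if it belongs to the reference
    cohort or passes the same inclusion filters used to curate it.

    Column names are illustrative assumptions, not the actual schema.
    """
    return (
        sample["sample_accession"] in healthy_ids
        or (
            sample["chronic_disease"] == "none"
            and 18.5 <= sample["bmi"] <= 24.9
            and sample["months_since_antibiotics"] >= 6
        )
    )

incoming = pd.read_csv("metadata_sample_merged.csv")
incoming["healthy_flag"] = incoming.apply(flag_healthy, axis=1)
```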

This dashboard lets clinicians compare healthy versus non-healthy profiles, view batch-aware QC summaries, and export run-level or patient-level reports.

Data preprocessing and quality control

Once the matching FASTQ files arrive in the raw_data/ directory for a given RunID, the modular Nextflow pipeline performs a three-stage cleanup that prepares the reads for taxonomic analysis:

| Stage | What Happens | Key Artifacts |
| --- | --- | --- |
| 1. Adapter & quality trimming | trimmomatic (via nf-core module) removes adapters, low-quality bases and very short reads while keeping pairs in sync. | Clean FASTQs in processed_data/data_filtered/ |
| 2. Contamination filtering | kraken2 screens the trimmed reads against a configurable database to discard host or non-microbial contaminants. | Contamination-free FASTQs (same folder) |
| 3. Run-level QC reporting | FastQC snapshots (pre- and post-trim) are merged with MultiQC, and summary metrics (read counts, length, Phred, % contamination) are appended to the original sample metadata. | qc_reports/multiqc_report.html, metadata_sample_merged.csv |
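
As an illustration of the stage 3 metadata merge, the following sketch joins MultiQC's per-sample summary onto the sample metadata (pandas is assumed, and the stats filename, input metadata filename, and join columns are assumptions for demonstration):

```python
import pandas as pd

# MultiQC writes machine-readable summaries alongside the HTML report;
# the path and the 'Sample' column below are assumptions.
qc = pd.read_table("qc_reports/multiqc_data/multiqc_general_stats.txt")
metadata = pd.read_csv("metadata.csv")

# Join per-sample QC metrics onto the original sample metadata.
merged = metadata.merge(qc, left_on="sample_id", right_on="Sample", how="left")
merged.to_csv("metadata_sample_merged.csv", index=False)
```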

Additional cleaning

  • All directories live under outputs/run_<run_id>/ for reproducibility.
  • The pipeline log with exact parameter settings is saved as <run_id>_pipeline.log.
  • Optional .tar.gz archives of filtered FASTQs can be generated for downstream teams.
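
A minimal sketch of the optional archiving step, using Python's standard tarfile module (the archive filename is an assumption; the source directory follows the layout described above):

```python
import tarfile
from pathlib import Path

def archive_filtered_fastqs(run_id: str) -> Path:
    """Bundle a run's filtered FASTQs into a .tar.gz for downstream teams."""
    src = Path(f"outputs/run_{run_id}/processed_data/data_filtered")
    dest = Path(f"outputs/run_{run_id}/filtered_fastqs.tar.gz")
    with tarfile.open(dest, "w:gz") as tar:
        for fastq in sorted(src.glob("*.fastq.gz")):
            tar.add(fastq, arcname=fastq.name)
    return dest
```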

This overview is aligned with the broader HdMBioinfo workflow, ensuring that every sample passed downstream is high-quality, contamination-free, and fully traceable.

Taxonomic and functional analysis

After processing the raw reads, the pipeline generates representative sequences through DADA2 denoising (Deblur can be used as well). It then performs taxonomic and, where applicable, functional classification to identify the microbial composition of each sample. Taxonomy is assigned by comparing the representative sequences against the SILVA reference database with a pre-trained Naive Bayes classifier; each sequence is placed at a taxonomic level (phylum, genus, etc.), allowing users to determine which microbes are present and in what relative abundance.

To generate the appropriate outputs for taxonomic and functional analysis, the Nextflow pipeline performs these processes:

| Stage | What Happens | Key Artifacts |
| --- | --- | --- |
| 1. Denoising (DADA2 or Deblur) | DADA2 or Deblur denoises the cleaned reads and groups them into representative features (ASVs). | table.qza, rep_seqs.qza |
| 2. Taxonomy assignment | Assigns taxonomy to each ASV based on a trained classifier database. | taxonomy.qza, taxonomy.qzv, taxonomy.tsv |
| 3. Phylogenetic tree | Builds a tree showing the evolutionary relationships between features. | tree.qza, tree.nwk |
| 4. Rarefaction (optional) | Automatically determines a rarefaction depth, or uses a provided one, to normalize read counts across samples. | rarefaction_threshold.txt, rarefaction_summary.txt |
| 5. Alpha diversity | Calculates within-sample diversity (e.g., Shannon, Observed Features). | alpha_diversity.qza, alpha_diversity.tsv |
| 6. Results export | Converts key outputs to user-friendly formats for downstream analysis or reports. | feature-table.tsv, taxonomy.tsv, tree.nwk, rep-seqs.fasta |
| 7. Final summary | Collects key outputs and metadata into a single summary file for quick inspection. | results_summary.txt |
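
As an illustration of how the exported text outputs can be consumed downstream, the sketch below collapses ASV counts to genus level and computes Shannon diversity (pandas/numpy are assumed, as is the usual exported orientation of feature-table.tsv: features as rows, samples as columns, with one leading comment line):

```python
import numpy as np
import pandas as pd

# Stage 6 exports; orientation and the leading comment line are assumptions.
table = pd.read_table("feature-table.tsv", index_col=0, skiprows=1)
taxonomy = pd.read_table("taxonomy.tsv", index_col=0)

# Collapse ASV counts to genus-level relative abundances using the
# 'g__' rank prefix found in SILVA-style taxonomy strings.
genus = taxonomy["Taxon"].str.extract(r"g__([^;]+)")[0].fillna("unassigned")
genus_counts = table.groupby(genus.reindex(table.index)).sum()
relative = genus_counts / genus_counts.sum(axis=0)

def shannon(counts: np.ndarray) -> float:
    """Shannon diversity H' = -sum(p * ln p) over nonzero proportions."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

# Per-sample alpha diversity, comparable to alpha_diversity.tsv.
alpha = table.apply(lambda col: shannon(col.to_numpy(dtype=float)), axis=0)
```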

The taxonomic and functional analysis stage of the pipeline automatically processes the quality-checked microbiota sequencing data to produce:

  • A feature table (counts of microbial features/ASVs per sample)
  • The representative sequences for each feature
  • Taxonomic identification of features
  • A phylogenetic tree for downstream ecological analysis
  • Alpha diversity metrics
  • Rarefaction summaries (optional)
  • All results in both QIIME 2 visualization format and exportable text format

Result Integration and Clinical Reporting

The results delivery represents the final stage of the HdMBioinfo analysis pipeline. Building upon the processed data generated from stool samples collected for the Digestology Unit, this component consists of an interactive R Shiny dashboard that transforms taxonomic and diversity analysis results into clinical insights. The dashboard directly supports the pipeline's primary objective of classifying samples as healthy or non-healthy, enabling clinicians to make informed decisions regarding donor suitability for fecal microbiota transplantation and other therapeutic interventions.

Dashboard Module Structure

  1. Data Upload & Overview Module
  2. Taxonomic Composition Analysis
  3. Diversity Analysis Module
  4. Clinical & Lifestyle Correlations
  5. Summary & Reports

Technical Implementation Rationale

R Shiny Framework Selection: R Shiny was chosen for its robust statistical computing capabilities, seamless integration with existing R-based bioinformatics workflows, and capacity for developing interactive visualizations suitable for clinical environments.

Validation

We implemented several steps to validate the pipeline.

First of all, we performed benchmarking using synthetic data. We generated 30 synthetic microbiome samples with InSilicoSeq, based on real human gut microbiome data. These synthetic datasets include known ground-truth abundances across multiple conditions: standard replicates, varying sequencing depths (0.25x-5x), different error models (MiSeq variants), and quality variations. Our benchmarking framework achieved a mean Pearson correlation of 85.5% for abundance estimation and a 71.8% genus detection rate, with diversity metrics showing less than 2% error. The pipeline demonstrated robust performance under standard conditions (r > 0.9) while degrading gracefully under challenging scenarios such as ultra-low coverage.
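
For reference, the per-sample abundance correlation underlying those figures can be computed as in this sketch (numpy is assumed; the variable names and the pairing of ground-truth with estimated vectors are illustrative):

```python
import numpy as np

def abundance_correlation(truth: np.ndarray, estimate: np.ndarray) -> float:
    """Pearson correlation between ground-truth and estimated relative
    abundances for one synthetic sample (vectors aligned on the same taxa)."""
    return float(np.corrcoef(truth, estimate)[0, 1])

# Mean over all 30 benchmark samples, as reported above (~0.855):
# mean_r = np.mean([abundance_correlation(t, e) for t, e in sample_pairs])
```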

To validate the first part of the pipeline, we generated mock data using a Python script. This enabled us to simulate multiple edge cases and test the pipeline's robustness, a process known as "stress testing". We covered the following scenarios (a sketch of the generator follows the list):

  • Corrupted files
    • malformed header
    • missing quality
    • missing sequence
    • no separator
  • Incorrectly formatted FASTQ files
    • the sequence contains invalid characters
    • file not compressed
  • Datasets with various characteristics
    • High read counts (100,000)
    • High number of samples (150)
    • Low quality reads
    • Very low read counts (50)
    • Mixed quality samples
    • Files containing RNA sequences
    • Single sample dataset

For the second part of the pipeline, we will use both positive controls (datasets expected to yield microbial results) and negative controls to ensure the pipeline properly identifies false positives. So far, we have conducted an extensive search for positive control datasets with the following characteristics: 16S raw sequencing data, paired-end reads, and existing abundance and taxonomy tables. However, identifying datasets that meet all three criteria has proven very challenging. Many datasets with known abundance tables contain only single-end reads, while others have no raw data available, which may require purchase or special access requests.

In collaboration with Group 3, we also worked on selecting an appropriate rarefaction threshold for our microbiome dataset.
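
One common heuristic for that choice, shown here as a sketch rather than the exact criterion agreed with Group 3, is to pick the deepest threshold that still retains a target fraction of samples:

```python
import pandas as pd

def rarefaction_threshold(depths: pd.Series, keep_fraction: float = 0.9) -> int:
    """Return the per-sample read depth at the (1 - keep_fraction) quantile,
    so rarefying at this depth keeps ~keep_fraction of the samples."""
    return int(depths.quantile(1 - keep_fraction))

# depths = feature_table.sum(axis=0)  # total reads per sample
# threshold = rarefaction_threshold(depths)
```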

As for the metadata of the testing datasets (the controls dataset and the testing dataset), we curated the CSVs provided by the publications so that they conform to our metadata structure and its standardized fields. We also created empty CSV templates in case other datasets need to be included for testing. Once the pipeline is deployed at the hospital, however, the metadata CSV will be retrieved automatically from the SQL database by querying the run_accession code (which is the FASTQ ID).
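
A minimal sketch of that curation step (the column mapping and field list below are illustrative assumptions; the real standardized fields live in the pipeline's metadata templates):

```python
import pandas as pd

# Map publication-specific column names onto our standardized fields
# (this mapping is illustrative, not the actual template).
COLUMN_MAP = {"Age (years)": "age", "Sex": "sex", "BMI_kgm2": "bmi"}
FIELDS = ["sample_accession", "age", "sex", "bmi"]

raw = pd.read_csv("published_metadata.csv")
curated = raw.rename(columns=COLUMN_MAP)

# Keep only standardized fields, adding any missing ones as empty columns.
for field in FIELDS:
    if field not in curated.columns:
        curated[field] = pd.NA
curated[FIELDS].to_csv("metadata_cleaned.csv", index=False)
```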
