General overview

Overview

This pipeline was created by the Hospital del Mar bioinformatics group (HdMBioinfo) to process microbiota data from stool samples handled by the hospital's analytical facilities at the request of the Digestology Unit. The main objective is to classify samples as coming from healthy or non-healthy individuals, so that clinicians can afterwards decide on useful interventions (e.g., whether an individual could be a potential stool donor, or whether they need a stool transplant).

The sample metadata comes from the hospital database (a relational SQL database); we retrieve it using the Sample_Accession provided by the barcode on each stool sample. When the sample is then processed by a technician (DNA extraction and amplification of the V4 region of the 16S rRNA gene), he/she enters the technical metadata into the database, linking the RunID with the Sample_Accessions that will be sequenced in that sequencing batch.
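
As an illustration, the metadata lookup might resemble the following minimal sketch (the table name, column names, and the use of Python's sqlite3 as a stand-in for the hospital's actual database driver are all assumptions):

```python
import sqlite3

def fetch_sample_metadata(db_path: str, sample_accession: str) -> dict:
    """Retrieve the clinical metadata row for one Sample_Accession.

    The schema (table 'samples' and its columns) is hypothetical; the
    real hospital database would be queried through its own driver.
    """
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # access columns by name
    cur = conn.execute(
        "SELECT * FROM samples WHERE sample_accession = ?",
        (sample_accession,),
    )
    row = cur.fetchone()
    conn.close()
    return dict(row) if row else {}

# Example: metadata = fetch_sample_metadata("hospital.db", "S0001")
```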

After sequencing the samples on an Illumina MiSeq sequencer (the sequencer available at the hospital), a small team of bioinformatics technicians demultiplexes the raw sequencer output (BCL files) and sends us the FASTQ files for each sample processed in that sequencing run (files are named by Sample ID for compatibility with QIIME 2).

Sample Workflow

  • Every stool specimen (“sample”; unique Sample_Accession) is first registered in the hospital database together with its clinical and technical metadata. After collection, the tube is stored at the recorded temperature and the Sample_Accession is paired with a sequencing batch identifier (“run”, e.g. R01030525).

  • Library preparation typically follows 16S rRNA amplification (although shotgun metagenomics is also supported), after which the MiSeq run produces raw FASTQ files whose names retain both the Sample_Accession and the RunID.

  • Demultiplexed reads enter the HdMBioinfo pipeline for trimming, contaminant filtering and other QC steps; the resulting high-quality reads feed taxonomic and (when applicable) functional profilers to generate per-sample abundance tables and diversity metrics.

  • Finally, the processed outputs, together with the harmonised metadata (parsed and cleaned via metadata_parsing.R), are loaded into the R Shiny dashboard, where the combined dataset is treated as an “analysis” object.

  • The HdMBioinfo team generated a .csv file, ./metadata/healthy_controls/curated/metadata_cleaned.csv, containing only those PRJEB41316 donors who passed our most stringent inclusion filters (no chronic disease, BMI 18.5-24.9, antibiotic-free ≥ 6 months, and comprehensive lifestyle data). This verified list is incorporated at the start of the process, so that subsequent analyses and the R Shiny panel can automatically flag each incoming sample as a healthy or unhealthy control by comparing its metadata to this reference set (see the sketch after this list).
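
One way to approximate that flagging step is to check membership in the curated cohort or re-apply the same inclusion filters, as in this minimal sketch (pandas is assumed, and the column names 'chronic_disease', 'bmi' and 'months_since_antibiotics' are illustrative, not the pipeline's actual schema):

```python
import pandas as pd

# Curated reference set of healthy donors (PRJEB41316 subset).
reference = pd.read_csv("metadata/healthy_controls/curated/metadata_cleaned.csv")
healthy_ids = set(reference["sample_accession"])

def flag_healthy(sample: pd.Series) -> bool:
    """Flag a sample as a healthy control if it belongs to the reference
    cohort or passes the same inclusion filters used to curate it.

    Column names are illustrative assumptions, not the actual schema.
    """
    return (
        sample["sample_accession"] in healthy_ids
        or (
            sample["chronic_disease"] == "none"
            and 18.5 <= sample["bmi"] <= 24.9
            and sample["months_since_antibiotics"] >= 6
        )
    )

incoming = pd.read_csv("metadata_sample_merged.csv")
incoming["healthy_flag"] = incoming.apply(flag_healthy, axis=1)
```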

This dashboard lets clinicians compare healthy versus non-healthy profiles, view batch-aware QC summaries, and export run-level or patient-level reports.

Data preprocessing and quality control

Once the matching FASTQ files arrive in the raw_data/ directory for a given RunID, the modular Nextflow pipeline performs a three-stage cleanup that prepares the reads for taxonomic analysis:

| Stage | What Happens | Key Artifacts |
| --- | --- | --- |
| 1. Adapter & quality trimming | trimmomatic (via nf-core module) removes adapters, low-quality bases and very short reads while keeping pairs in sync. | Clean FASTQs in processed_data/data_filtered/ |
| 2. Contamination filtering | kraken2 screens the trimmed reads against a configurable database to discard host or non-microbial contaminants. | Contamination-free FASTQs (same folder) |
| 3. Run-level QC reporting | FastQC snapshots (pre- and post-trim) are merged with MultiQC, and summary metrics (read counts, length, Phred, % contamination) are appended to the original sample metadata. | qc_reports/multiqc_report.html, metadata_sample_merged.csv |
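
As an illustration of the stage 3 metadata merge, the following sketch joins MultiQC's per-sample summary onto the sample metadata (pandas is assumed, and the stats filename, input metadata filename, and join columns are assumptions for demonstration):

```python
import pandas as pd

# MultiQC writes machine-readable summaries alongside the HTML report;
# the path and the 'Sample' column below are assumptions.
qc = pd.read_table("qc_reports/multiqc_data/multiqc_general_stats.txt")
metadata = pd.read_csv("metadata.csv")

# Join per-sample QC metrics onto the original sample metadata.
merged = metadata.merge(qc, left_on="sample_id", right_on="Sample", how="left")
merged.to_csv("metadata_sample_merged.csv", index=False)
```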

Additional cleaning

  • All directories live under outputs/run_<run_id>/ for reproducibility.
  • The pipeline log with exact parameter settings is saved as <run_id>_pipeline.log.
  • Optional .tar.gz archives of filtered FASTQs can be generated for downstream teams.
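
A minimal sketch of the optional archiving step, using Python's standard tarfile module (the archive filename is an assumption; the source directory follows the layout described above):

```python
import tarfile
from pathlib import Path

def archive_filtered_fastqs(run_id: str) -> Path:
    """Bundle a run's filtered FASTQs into a .tar.gz for downstream teams."""
    src = Path(f"outputs/run_{run_id}/processed_data/data_filtered")
    dest = Path(f"outputs/run_{run_id}/filtered_fastqs.tar.gz")
    with tarfile.open(dest, "w:gz") as tar:
        for fastq in sorted(src.glob("*.fastq.gz")):
            tar.add(fastq, arcname=fastq.name)
    return dest
```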

This overview is aligned with the broader HdMBioinfo workflow, ensuring that every sample passed downstream is high-quality, contamination-free, and fully traceable.

Taxonomic and functional analysis

After processing the raw reads, the pipeline generates representative sequences through DADA2 denoising (Deblur can be used as well). It then performs taxonomic and, where applicable, functional classification to identify the microbial composition of each sample. Taxonomy is assigned by comparing the representative sequences against the SILVA reference database with a pre-trained Naive Bayes classifier; each sequence is placed at a taxonomic level (phylum, genus, etc.), allowing users to determine which microbes are present and in what relative abundance.

To generate the appropriate outputs for taxonomic and functional analysis, the Nextflow pipeline performs these processes:

| Stage | What Happens | Key Artifacts |
| --- | --- | --- |
| 1. Denoising (DADA2 or Deblur) | DADA2 or Deblur denoises the cleaned reads and groups them into representative features (ASVs). | table.qza, rep_seqs.qza |
| 2. Taxonomy assignment | Assigns taxonomy to each ASV based on a trained classifier database. | taxonomy.qza, taxonomy.qzv, taxonomy.tsv |
| 3. Phylogenetic tree | Builds a tree showing the evolutionary relationships between features. | tree.qza, tree.nwk |
| 4. Rarefaction (optional) | Automatically determines a rarefaction depth, or uses a provided one, to normalize read counts across samples. | rarefaction_threshold.txt, rarefaction_summary.txt |
| 5. Alpha diversity | Calculates within-sample diversity (e.g., Shannon, Observed Features). | alpha_diversity.qza, alpha_diversity.tsv |
| 6. Results export | Converts key outputs to user-friendly formats for downstream analysis or reports. | feature-table.tsv, taxonomy.tsv, tree.nwk, rep-seqs.fasta |
| 7. Final summary | Collects key outputs and metadata into a single summary file for quick inspection. | results_summary.txt |
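
As an illustration of how the exported text outputs can be consumed downstream, the sketch below collapses ASV counts to genus level and computes Shannon diversity (pandas/numpy are assumed, as is the usual exported orientation of feature-table.tsv: features as rows, samples as columns, with one leading comment line):

```python
import numpy as np
import pandas as pd

# Stage 6 exports; orientation and the leading comment line are assumptions.
table = pd.read_table("feature-table.tsv", index_col=0, skiprows=1)
taxonomy = pd.read_table("taxonomy.tsv", index_col=0)

# Collapse ASV counts to genus-level relative abundances using the
# 'g__' rank prefix found in SILVA-style taxonomy strings.
genus = taxonomy["Taxon"].str.extract(r"g__([^;]+)")[0].fillna("unassigned")
genus_counts = table.groupby(genus.reindex(table.index)).sum()
relative = genus_counts / genus_counts.sum(axis=0)

def shannon(counts: np.ndarray) -> float:
    """Shannon diversity H' = -sum(p * ln p) over nonzero proportions."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

# Per-sample alpha diversity, comparable to alpha_diversity.tsv.
alpha = table.apply(lambda col: shannon(col.to_numpy(dtype=float)), axis=0)
```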

The taxonomic and functional analysis stage of the pipeline automatically processes the quality-checked microbiota sequencing data to produce:

  • A feature table (counts of microbial features/ASVs per sample)
  • The representative sequences for each feature
  • Taxonomic identification of features
  • A phylogenetic tree for downstream ecological analysis
  • Alpha diversity metrics
  • Rarefaction summaries (optional)
  • All results in both QIIME 2 visualization format and exportable text format

Result Integration and Clinical Reporting

The results delivery represents the final stage of the HdMBioinfo analysis pipeline. Building upon the processed data generated from stool samples collected for the Digestology Unit, this component consists of an interactive R Shiny dashboard that transforms taxonomic and diversity analysis results into clinical insights. The dashboard directly supports the pipeline's primary objective of classifying samples as healthy or non-healthy, enabling clinicians to make informed decisions regarding donor suitability for fecal microbiota transplantation and other therapeutic interventions.

Dashboard Module Structure

  1. Data Upload & Overview Module
  2. Taxonomic Composition Analysis
  3. Diversity Analysis Module
  4. Clinical & Lifestyle Correlations
  5. Summary & Reports

Technical Implementation Rationale

R Shiny Framework Selection: R Shiny was chosen for its robust statistical computing capabilities, seamless integration with existing R-based bioinformatics workflows, and capacity for developing interactive visualizations suitable for clinical environments.

Validation

We implemented several steps to validate the pipeline.

First of all, we performed benchmarking using synthetic data. We generated 30 synthetic microbiome samples with InSilicoSeq, based on real human gut microbiome data. These synthetic datasets include known ground-truth abundances across multiple conditions: standard replicates, varying sequencing depths (0.25x-5x), different error models (MiSeq variants), and quality variations. Our benchmarking framework achieved a mean Pearson correlation of 85.5% for abundance estimation and a 71.8% genus detection rate, with diversity metrics showing less than 2% error. The pipeline demonstrated robust performance under standard conditions (r > 0.9) while degrading gracefully under challenging scenarios such as ultra-low coverage.
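
For reference, the per-sample abundance correlation underlying those figures can be computed as in this sketch (numpy is assumed; the variable names and the pairing of ground-truth with estimated vectors are illustrative):

```python
import numpy as np

def abundance_correlation(truth: np.ndarray, estimate: np.ndarray) -> float:
    """Pearson correlation between ground-truth and estimated relative
    abundances for one synthetic sample (vectors aligned on the same taxa)."""
    return float(np.corrcoef(truth, estimate)[0, 1])

# Mean over all 30 benchmark samples, as reported above (~0.855):
# mean_r = np.mean([abundance_correlation(t, e) for t, e in sample_pairs])
```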

To validate the first part of the pipeline, we generated mock data using a Python script. This enabled us to simulate multiple edge cases and test the pipeline's robustness, a process known as "stress testing". We covered the following scenarios (a sketch of the generator follows the list):

  • Corrupted files
    • malformed header
    • missing quality
    • missing sequence
    • no separator
  • Incorrectly formatted FASTQ files
    • the sequence contains invalid characters
    • file not compressed
  • Datasets with various characteristics
    • High read counts (100,000)
    • High number of samples (150)
    • Low quality reads
    • Very low read counts (50)
    • Mixed quality samples
    • Files containing RNA sequences
    • Single sample dataset

For the second part of the pipeline, we will use both positive controls (datasets expected to yield microbial results) and negative controls to ensure the pipeline properly identifies false positives. So far, we have conducted an extensive search for positive control datasets with the following characteristics: 16S raw sequencing data, paired-end reads, and existing abundance and taxonomy tables. However, identifying datasets that meet all three criteria has proven very challenging. Many datasets with known abundance tables contain only single-end reads, while others have no raw data available, which may require purchase or special access requests.

In collaboration with Group 3, we also worked on selecting an appropriate rarefaction threshold for our microbiome dataset.
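
One common heuristic for that choice, shown here as a sketch rather than the exact criterion agreed with Group 3, is to pick the deepest threshold that still retains a target fraction of samples:

```python
import pandas as pd

def rarefaction_threshold(depths: pd.Series, keep_fraction: float = 0.9) -> int:
    """Return the per-sample read depth at the (1 - keep_fraction) quantile,
    so rarefying at this depth keeps ~keep_fraction of the samples."""
    return int(depths.quantile(1 - keep_fraction))

# depths = feature_table.sum(axis=0)  # total reads per sample
# threshold = rarefaction_threshold(depths)
```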

As for the metadata of the testing datasets (the controls dataset and the testing dataset), we curated the CSVs provided by the publications so that they conform to our metadata structure and its standardized fields. We also created empty CSV templates in case other datasets need to be included for testing. Once the pipeline is deployed at the hospital, however, the metadata CSV will be retrieved automatically from the SQL database by querying the run_accession code (which is the FASTQ ID).
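
A minimal sketch of that curation step (the column mapping and field list below are illustrative assumptions; the real standardized fields live in the pipeline's metadata templates):

```python
import pandas as pd

# Map publication-specific column names onto our standardized fields
# (this mapping is illustrative, not the actual template).
COLUMN_MAP = {"Age (years)": "age", "Sex": "sex", "BMI_kgm2": "bmi"}
FIELDS = ["sample_accession", "age", "sex", "bmi"]

raw = pd.read_csv("published_metadata.csv")
curated = raw.rename(columns=COLUMN_MAP)

# Keep only standardized fields, adding any missing ones as empty columns.
for field in FIELDS:
    if field not in curated.columns:
        curated[field] = pd.NA
curated[FIELDS].to_csv("metadata_cleaned.csv", index=False)
```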
