Validation - egenomics/agb2025 GitHub Wiki
To test and validate the pipeline, we performed benchmarking, stress testing, and validation with real samples.
This page describes a comprehensive synthetic dataset generation and benchmarking framework for evaluating microbiome analysis pipelines using InSilicoSeq.
It consists of a complete workflow for generating synthetic microbiome datasets with known ground truth for benchmarking amplicon sequencing analysis pipelines. The framework tests multiple aspects of pipeline performance, including ASV detection accuracy, abundance estimation, diversity metric calculation, and robustness across different sequencing platforms and depths.
Accurate benchmarking of microbiome analysis pipelines is critical for ensuring reliable results in microbiome research. However, creating comprehensive benchmarks requires datasets with known ground truth - something rarely available with real experimental data. This framework addresses this challenge by generating realistic synthetic datasets using InSilicoSeq with biologically realistic log-normal abundance distributions, enabling rigorous evaluation of pipeline performance across multiple dimensions.
The synthetic datasets are based on real amplicon sequencing data from:
Primary Study: Liao, C., Taylor, B.P., Ceccarani, C. et al. Compilation of longitudinal microbiota data and hospitalome from hematopoietic cell transplantation patients. Sci Data 8, 71 (2021). https://doi.org/10.1038/s41597-021-00860-8
The original dataset contains over 10,000 fecal samples from hematopoietic cell transplantation patients, analyzed using 16S rRNA amplicon sequencing (V4-V5 region) to characterize gut microbiota composition. This provides a realistic foundation for synthetic data generation based on authentic human gut microbiome profiles.
Data Repository: https://github.com/Jinyuan1998/scientific_data_metagenome_shotgun/tree/main/deidentified_data_tables
The files `tblcounts_asv_melt.csv` and `tblASVtaxonomy_silva132_v4v5_filter.csv` were used to create our two basic input files: `library_gut.fasta` and `read_count_gut.tsv`.
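As an illustration, that conversion can be sketched as follows. This is a simplified stand-in for `create_iss_library.py`, not the actual script, and the column names `asv_id`, `count`, and `sequence` are assumptions about the raw tables:

```python
import csv
from collections import defaultdict

def build_iss_inputs(counts_csv, taxonomy_csv, fasta_out, tsv_out, top_n=100):
    """Select the top-N most abundant ASVs and write the ISS reference
    FASTA plus a read-count table (sketch; column names are assumptions)."""
    totals = defaultdict(int)
    with open(counts_csv) as fh:
        # Melted count table: one row per (sample, ASV) observation.
        for row in csv.DictReader(fh):
            totals[row["asv_id"]] += int(row["count"])
    top = sorted(totals, key=totals.get, reverse=True)[:top_n]
    with open(taxonomy_csv) as fh:
        # The taxonomy table is assumed to carry the ASV sequences.
        seqs = {row["asv_id"]: row["sequence"] for row in csv.DictReader(fh)}
    with open(fasta_out, "w") as fa, open(tsv_out, "w") as tsv:
        tsv.write("asv_id\tread_count\n")
        for asv in top:
            if asv in seqs:
                fa.write(f">{asv}\n{seqs[asv]}\n")
                tsv.write(f"{asv}\t{totals[asv]}\n")
```

With the real tables, calling `build_iss_inputs("tblcounts_asv_melt.csv", "tblASVtaxonomy_silva132_v4v5_filter.csv", "library_gut.fasta", "read_count_gut.tsv")` would produce the two input files.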
```
tests/group4/microbiome-benchmark/
├── README.md
├── input_data/
│   ├── library_gut.fasta                        # Reference ASV sequences (top 100 ASVs)
│   ├── asv_metadata.tsv                         # ASV-to-genus mapping metadata
│   ├── create_iss_library.py                    # Script to generate input files from raw data
│   ├── tblcounts_asv_melt.csv                   # Original count data
│   └── tblASVtaxonomy_silva132_v4v5_filter.csv  # Original taxonomy data
├── scripts/
│   ├── complete_benchmarking_workflow.sh        # Main workflow script
│   ├── benchmarking_script.sh                   # Synthetic sample generation
│   └── per_sample_ground_truth.py               # Per-sample ground truth calculator
├── benchmarking_output/                         # Generated synthetic data
│   │                                            # (Not in GitHub - ask group4 for this directory if needed)
│   └── synthetic_samples/
│       ├── {sample_name}_R1.fastq               # Paired-end FASTQ files (30 samples)
│       ├── {sample_name}_R2.fastq               # Generated by InSilicoSeq
│       ├── {sample_name}_abundance.txt          # ISS ground truth abundances
│       └── sample_summary.txt                   # Sample group information
├── benchmarking_results/                        # Pipeline validation results
│   ├── benchmarking_summary.txt                 # Performance metrics summary
│   ├── detailed_metrics.csv                     # Per-sample validation data
│   ├── abundance_correlations.png               # Individual sample correlation plots
│   ├── diversity_comparison.png                 # Alpha diversity validation plots
│   └── performance_by_condition.png             # Condition-specific performance analysis
├── images/                                      # README visualization assets
│   ├── individual_sample_correlations.png       # For GitHub display
│   ├── diversity_metrics_validation.png         # For GitHub display
│   └── condition_performance_analysis.png       # For GitHub display
├── genus_benchmark_v15.py                       # Benchmarking analysis script
└── run_benchmark/                               # Example pipeline run results
    └── S01070625/                               # Sample run ID
        ├── qiime_output/relevant_results/       # Pipeline outputs for validation
        │   ├── feature_table.tsv                # ASV abundance table
        │   └── taxonomy.tsv                     # Taxonomic assignments
        ├── kraken/                              # Alternative classifier results
        ├── raw_data/                            # Compressed input files
        └── trimmed_reads/                       # (Not in GitHub - ask group4 for this directory if needed)
```
Some of the files in this repository are not provided through GitHub due to file size limits.
- `input_data/`: Original reference data and ASV metadata for synthetic generation
- `scripts/`: Automated workflow scripts for generating synthetic datasets
- `benchmarking_output/synthetic_samples/`: 30 synthetic FASTQ files with ground truth
- `benchmarking_results/`: Validation metrics and performance visualizations
- `genus_benchmark_v15.py`: Analysis script for comparing pipeline results to ground truth
We validated our microbiome analysis pipeline using the synthetic dataset, achieving good performance across multiple metrics:
Overall Performance Summary:
- Abundance Correlation: 85.5% mean Pearson correlation
- Genus Detection Rate: 71.8% (detecting 32/44 genera on average)
- Shannon Diversity Error: Only 1.9% relative error
- Samples with Excellent Correlation (r>0.9): 11 out of 27 samples
- Samples with Good Correlation (r>0.7): 23 out of 27 samples
The scatter plots show genus-level abundance correlations between pipeline output and ground truth for each sample. Key observations:
- Standard samples show consistently excellent correlations (r>0.9)
- High-depth samples (2x, 5x) maintain strong performance, though slightly below standard samples
- Low-depth samples (0.25x, 0.5x) show reduced but acceptable performance
- MiSeq variants demonstrate robust performance across different sequencing protocols (standard depth = 100,000 reads)
Alpha diversity metrics comparison reveals strong pipeline performance:
- Shannon Diversity (r=0.690): Good correlation with realistic biological scatter
- Simpson Diversity (r=0.681): Excellent tight clustering around 1:1 line
- Richness (r=NaN): consistent detection across samples, but the correlation is undefined because richness barely varies (interpret with caution)
- Evenness (r=0.683): Good preservation of community structure patterns
The diversity metrics are borderline acceptable; we believe this stems from rarefaction threshold artifacts.
Comprehensive analysis across synthetic experimental conditions shows:
- 100K samples: Mean r=0.937 (highly reproducible)
- MiSeq variants (standard, 24, 28): r=0.91-0.93 (platform robust)
- Higher-quality samples (200K reads): r=0.889 (still good, but below the 100K-depth samples)
- No GC bias samples: r=0.845 (shows GC bias correction helps)
- Depth 0.25x & 0.5x: r=0.68-0.69 (expected with low coverage)
- Depth 5x: r=0.760 (potential oversaturation effects)
- Basic error samples: Complete failure (0% detection)
- Platform Robustness: Excellent performance across different MiSeq variants (24, 28, standard)
- Depth Sensitivity: Performance degrades with low coverage
- Quality Dependence: High-quality samples show superior results
- Error Model Impact: Basic error samples failed completely
- Consistent detection: 70-75% across most conditions
- Best detection: Depth 2x and 5x samples (75%)
- Lowest detection: Depth 0.25x samples (60%)
- Platform independence: Similar detection rates across MiSeq variants
Shannon diversity errors remain low across all successful conditions:
- Lowest errors: High quality and MiSeq variant samples (<1.5%)
- Highest errors: Low depth samples (3-5%)
- Standard samples: Excellent consistency (1.7% error)
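The relative error reported here is simply the absolute difference normalized by the true value; a hypothetical sketch:

```python
def relative_error(predicted: float, true: float) -> float:
    """Relative error as a fraction of the true value."""
    return abs(predicted - true) / true

# A predicted Shannon index of 3.05 against a true value of 3.00
# corresponds to roughly 1.7% relative error.
```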
These results demonstrate that our pipeline:
- Preserves biological signal in high-quality, high-depth samples (100K)
- Maintains ecological relationships between samples moderately well for community analyses
- Handles platform variation effectively across different sequencing protocols
- Degrades under challenging conditions (lower depth, poor quality)
Based on benchmarking results, the pipeline is recommended for:
- Standard 16S amplicon analysis (100K+ reads per sample)
- Multi-platform studies (robust across MiSeq variants)
Use with caution for:
- Medium to low-depth samples (<50K reads)
- Samples with severe sequencing artifacts
- Studies requiring detection of very rare taxa
The benchmarking was not performed in
All benchmarking results are available in `benchmarking_results/`:
- `benchmarking_summary.txt`: Comprehensive performance metrics
- `detailed_metrics.csv`: Per-sample detailed statistics
The framework generates 6 distinct sample groups for comprehensive testing:
**1. Standard replicates**
- Purpose: Test reproducibility and baseline performance
- Configuration: 100K reads, MiSeq, log-normal abundance, GC bias
- Files: `standard_rep_1` to `standard_rep_8`

**2. Platform variants**
- Purpose: Test platform-specific robustness
- Configurations: 100K reads each
  - MiSeq (standard): `miseq_rep_1`, `miseq_rep_2`
  - MiSeq-24: `miseq-24_rep_1`, `miseq-24_rep_2`
  - MiSeq-28: `miseq-28_rep_1`, `miseq-28_rep_2`

**3. Sequencing depth variants**
- Purpose: Test rarefaction effects and depth-dependent diversity
- Configurations:
  - Low depth (25K reads): `depth_0.25x_rep_1`, `depth_0.25x_rep_2`
  - Medium depth (50K reads): `depth_0.5x_rep_1`, `depth_0.5x_rep_2`
  - High depth (200K reads): `depth_2.0x_rep_1`, `depth_2.0x_rep_2`
  - Very high depth (500K reads): `depth_5.0x_rep_1`, `depth_5.0x_rep_2`

**4. No GC bias**
- Purpose: Assess GC bias impact
- Configuration: 100K reads, MiSeq without GC bias
- Files: `no_gc_bias_1` to `no_gc_bias_3`

**5. Basic error model**
- Purpose: Test with a simpler error structure
- Configuration: 100K reads, basic error model (no indels)
- Files: `basic_error_1` to `basic_error_3`

**6. High quality**
- Purpose: Test with higher-quality reads
- Configuration: 100K reads, MiSeq-36 with GC bias
- Files: `high_quality_1`, `high_quality_2`
- Python 3.7+
- InSilicoSeq
- Required Python packages: `pandas`, `numpy`

```bash
conda install -c bioconda insilicoseq
# or
pip install InSilicoSeq
```
Clone the repo and:

```bash
cd tests/group4/microbiome-benchmark/
chmod +x scripts/*.sh

# Run the complete workflow (estimated time: 30-45 minutes)
./scripts/complete_benchmarking_workflow.sh
```
This single command will:
- Create organized output directories
- Generate 30 synthetic samples across multiple conditions using ISS log-normal abundance distribution
- Calculate per-sample ground truth for each synthetic sample
- Create summary files for easy analysis
After completion, you'll have:

```
benchmarking_output/
├── ground_truth/                            # Input reference (currently unused)
└── synthetic_samples/
    ├── standard_rep_1_R1.fastq              # FASTQ files for your pipeline
    ├── standard_rep_1_R2.fastq
    ├── standard_rep_1_abundance.txt         # ISS-generated abundance
    ├── standard_rep_1_ground_truth.json     # True diversity metrics
    ├── standard_rep_1_ground_truth_asvs.csv # True ASV abundances
    ├── ... (all 30 samples)
    ├── all_samples_ground_truth.json        # Combined metrics
    ├── ground_truth_summary.csv             # Summary table
    └── sample_summary.txt                   # Human-readable summary
```
Run your microbiome analysis pipeline on all generated FASTQ files:

```bash
cd benchmarking_output/synthetic_samples

# Example for each sample
your_pipeline standard_rep_1_R1.fastq standard_rep_1_R2.fastq
```
Compare your pipeline outputs against the corresponding ground truth files. For each sample, compare:

- Your ASV table → `{sample_name}_ground_truth_asvs.csv`
- Your diversity metrics → `{sample_name}_ground_truth.json`
For each synthetic sample, the framework generates:
- `{sample_name}_abundance.txt`: ISS-generated abundance file (genome_id, abundance, coverage, nb_reads)
- `{sample_name}_ground_truth.json`: Diversity metrics (Shannon, Simpson, richness, evenness)
- `{sample_name}_ground_truth_asvs.csv`: ASV-level abundances and relative abundances

Combined across all samples:
- `all_samples_ground_truth.json`: Combined metrics for all samples
- `ground_truth_summary.csv`: Tabular summary for easy analysis
- `sample_summary.txt`: Human-readable summary with sample group information
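The diversity metrics stored in the ground-truth JSON can be reproduced with a short sketch (the actual computation lives in `per_sample_ground_truth.py`; this assumes natural-log Shannon, Gini-Simpson, and Pielou evenness):

```python
import math

def diversity_metrics(counts):
    """Alpha diversity from a vector of per-ASV counts or abundances (sketch)."""
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    shannon = -sum(x * math.log(x) for x in p)      # Shannon index (natural log)
    simpson = 1 - sum(x * x for x in p)             # Gini-Simpson index
    richness = len(p)                               # number of observed ASVs
    evenness = shannon / math.log(richness) if richness > 1 else 0.0  # Pielou's J
    return {"shannon": shannon, "simpson": simpson,
            "richness": richness, "evenness": evenness}
```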
```python
# Detection metrics, calculated per sample
sensitivity = true_positives / (true_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)
f1_score = 2 * (precision * sensitivity) / (precision + sensitivity)

# Compare predicted vs. true abundances
# (pearsonr/spearmanr return a (statistic, p-value) pair)
from scipy.stats import pearsonr, spearmanr
pearson_r, _ = pearsonr(true_abundances, predicted_abundances)
spearman_r, _ = spearmanr(true_abundances, predicted_abundances)

# Compare diversity metrics
shannon_error = abs(predicted_shannon - true_shannon)
simpson_error = abs(predicted_simpson - true_simpson)
richness_error = abs(predicted_richness - true_richness)
```
- Compare performance across different sequencing depths (25K to 500K reads)
- Evaluate rarefaction curve accuracy
- Assess rare ASV detection limits
To ensure the reliability and resilience of the 16S rRNA pipeline, we generated a comprehensive set of synthetic datasets using the script located at: /tests/mock_data_generator.py
This utility simulates a wide range of input conditions, including valid and intentionally corrupted datasets. The goal is to verify that the pipeline:
- Successfully processes valid data
- Handles edge cases
- Produces coherent warnings or informative errors for invalid inputs
The script creates paired-end FASTQ files with 16S sequences and metadata for a variety of test scenarios. Below is a summary of the dataset types:
| Dataset ID | Description |
|---|---|
| `stress_test_low_quality` | Sequences with degraded quality scores to test trimming and filtering logic |
| `stress_test_low_reads` | Samples with very low read counts |
| `stress_test_high_samples` | Large number of samples to evaluate scalability |
| `stress_test_high_depth` | Very high sequencing depth per sample |
| `stress_test_mixed_quality` | Mix of high, medium, low, and very low quality reads |
| `stress_test_standard` | Standard quality and depth, used as a control |
| `stress_test_single_sample` | One-sample dataset to validate minimal inputs |
| `stress_test_corrupted` | Files with intentional FASTQ format violations |
| `stress_test_rna` | RNA sequences containing 'U' instead of 'T' |
| `stress_test_format_validation` | Files for compression validation tests |
Each generated sample includes:
- Paired-end FASTQ files: `*_1.fastq.gz` and `*_2.fastq.gz`
- Metadata file
FASTQ files contain:
- Realistic 16S rRNA sequences (e.g., V4 or V3V4 regions)
- Reverse-complemented R2 reads with mutation simulation
- Quality scores generated with degradation profiles
- Corrupt files are intentionally malformed (e.g., missing quality lines, malformed headers, invalid separators)
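A minimal sketch of how such a paired read could be produced (the function name, mutation rate, and quality handling are illustrative, not the actual `mock_data_generator.py` implementation):

```python
import random

COMP = str.maketrans("ACGT", "TGCA")

def make_pair(read_id, seq, qual_char="I", mutation_rate=0.01, seed=0):
    """One paired-end FASTQ record (sketch): R2 is the reverse
    complement of R1 with random point mutations."""
    rng = random.Random(seed)
    r2 = list(seq.translate(COMP)[::-1])        # reverse-complement R1
    for i in range(len(r2)):
        if rng.random() < mutation_rate:        # simulate sequencing errors
            r2[i] = rng.choice("ACGT")
    q = qual_char * len(seq)                    # flat quality string placeholder
    r1_rec = f"@{read_id}/1\n{seq}\n+\n{q}\n"
    r2_rec = f"@{read_id}/2\n{''.join(r2)}\n+\n{q}\n"
    return r1_rec, r2_rec
```

Corrupted variants are then produced by deliberately breaking one of the four record lines (dropping the quality line, removing the leading `@`, and so on).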
Each test run produces a structured report with the following sections:
- Name and goal of the test
- Example FASTQ filenames used
- Path references
- Description of correct pipeline response
- Notes on proper error handling and user feedback
- Step at which the failure occurred (e.g., TRIMMOMATIC)
- Error messages (e.g., `RuntimeException: Invalid FASTQ name line: read0`)
- Notes on whether the failure was gracefully handled or caused a crash
- Add pre-validation for FASTQ structure
- Add existence checks for required files
- Improve clarity of user-facing error messages
These reports provide a reproducible framework for improving the pipeline's robustness and enhancing the user experience in edge-case scenarios.
Goal: Validate pipeline robustness when receiving invalid FASTQ inputs; in this case a malformed header missing the leading `@`.

Input: `malformed_header_1.fastq.gz` and `malformed_header_2.fastq.gz` from the folder `stress_test_corrupted`.

Expected: The pipeline should provide informative error messages rather than dumping a raw stack trace, which is not user-friendly.

Result:
- Pipeline failed at the TRIMMOMATIC step.
- Errors:
  - `Command error: application/gzip Failed to process malformed_header_1.gz`
  - `Command error: application/gzip Failed to process malformed_header_2.gz`
Goal: Validate pipeline robustness when receiving invalid FASTQ inputs; in this case a missing quality line.

Input: `missing_quality_1.fastq.gz` and `missing_quality.fastq.gz` from the folder `stress_test_corrupted`.

Expected: The pipeline should provide informative error messages rather than dumping a raw stack trace, which is not user-friendly.

Result:
- Pipeline failed at the TRIMMOMATIC step.
- Errors:
  - `java.io.FileNotFoundException: group2a/adapters/TruSeq3-PE-2.fa (No such file or directory)`
  - `Exception in thread "main" java.lang.RuntimeException: Sequence and quality length don't match: 'ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG' vs '@read1'`
- Failure not gracefully handled (standard Java crash)
Goal: Validate pipeline robustness when receiving invalid FASTQ inputs; in this case a missing sequence line.

Input: `missing_sequence_1.fastq.gz` and `missing_sequence.fastq.gz` from the folder `stress_test_corrupted`.

Expected: The pipeline should provide informative error messages rather than dumping a raw stack trace.

Result:
- Pipeline failed at the TRIMMOMATIC step.
- Errors:
  - `Failed to process missing_sequence_1.gz uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Midline 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' didn't start with '+' at 3`
  - `Failed to process missing_sequence_2.gz uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Midline 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' didn't start with '+' at 3`
Goal: Validate pipeline robustness when receiving invalid FASTQ inputs; in this case no separator line (`+`).

Input: `no_separator_1.fastq.gz` and `no_separator.fastq.gz` from the folder `stress_test_corrupted`.

Expected: The pipeline should provide informative error messages rather than dumping a raw stack trace.

Result:
- Pipeline failed at the TRIMMOMATIC step.
- Errors:
  - `java.io.FileNotFoundException: group2a/adapters/TruSeq3-PE-2.fa (No such file or directory)`
  - `Exception in thread "main" java.lang.RuntimeException: Invalid FASTQ comment line: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA`
Goal: Validate pipeline robustness when receiving a FASTQ file with a `.gz` extension that is not actually compressed.

Input: `not_compressed_1.fastq.gz` and `not_compressed_2.fastq.gz` from the folder `stress_test_format_validation`.

Expected: The pipeline should output a message informing the user that the file is not compressed.

Result:
- Pipeline failed at the FASTQC_RAW step.
- Pipeline failed immediately with the following message:

      Failed to process not_compressed_1.gz
      java.util.zip.ZipException: Not in GZIP format
          at java.base/java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:165)
          at java.base/java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
          at java.base/java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
          at uk.ac.babraham.FastQC.Utilities.MultiMemberGZIPInputStream.<init>(MultiMemberGZIPInputStream.java:37)
          at uk.ac.babraham.FastQC.Sequence.FastQFile.<init>(FastQFile.java:84)
          at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:106)
          at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:62)
          at uk.ac.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:163)
          at uk.ac.babraham.FastQC.Analysis.OfflineRunner.<init>(OfflineRunner.java:125)
          at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:316)
      application/gzip
Goal: Verify pipeline behavior when it receives a high-read-count dataset (100,000 reads per sample).

Input: `Sample001_1.fastq.gz` and `Sample001_2.fastq.gz` through `Sample010_1.fastq.gz` and `Sample010_2.fastq.gz` from the folder `stress_test_high_depth`.

Expected: The pipeline should output a message indicating that it cannot handle files with such high read counts.

Result:
- Pipeline failed at the DENOISE_DADA2 step.
- Error: No reads passed the filter. trunc_len_f (240) or trunc_len_r (180) may be individually longer than read lengths, or trunc_len_f + trunc_len_r may be shorter than the length of the amplicon + 12 nucleotides (the length of the overlap). Alternatively, other arguments (such as max_ee or trunc_q) may be preventing reads from passing the filter.
Goal: Verify pipeline behavior when it receives 150 samples in one go.

Input: `Sample001_1.fastq.gz` and `Sample001_2.fastq.gz` up to `Sample150_1.fastq.gz` and `Sample150_2.fastq.gz`.

Expected: The pipeline should indicate the maximum number of files it can handle in one go, so the user can plan how to batch their inputs.

Result:
- Pipeline failed at the DENOISE_DADA2 step.
- Error: No reads passed the filter. trunc_len_f (240) or trunc_len_r (180) may be individually longer than read lengths, or trunc_len_f + trunc_len_r may be shorter than the length of the amplicon + 12 nucleotides (the length of the overlap). Alternatively, other arguments (such as max_ee or trunc_q) may be preventing reads from passing the filter.
Goal: Verify what happens when the pipeline receives low-quality reads.

Input: `Sample001_1.fastq.gz` and `Sample001_2.fastq.gz` through `Sample010_1.fastq.gz` and `Sample010_2.fastq.gz` from the folder `stress_test_low_quality`.

Expected: The pipeline outputs a message indicating that it has received low-quality data that it cannot process.

Result:
- Pipeline failed at the DENOISE_DADA2 step.
- Error: No reads passed the filter. trunc_len_f (240) or trunc_len_r (180) may be individually longer than read lengths, or trunc_len_f + trunc_len_r may be shorter than the length of the amplicon + 12 nucleotides (the length of the overlap). Alternatively, other arguments (such as max_ee or trunc_q) may be preventing reads from passing the filter.
Goal: Verify pipeline behavior when it receives a low-read-count dataset (50 reads).

Input: `Sample001_1.fastq.gz` and `Sample001_2.fastq.gz` through `Sample005_1.fastq.gz` and `Sample005_2.fastq.gz` from the folder `stress_test_low_reads`.

Expected: The pipeline outputs a message indicating that it has received data with too few reads to process.

Result:
- Pipeline failed at the DENOISE_DADA2 step.
- Error: No reads passed the filter. trunc_len_f (240) or trunc_len_r (180) may be individually longer than read lengths, or trunc_len_f + trunc_len_r may be shorter than the length of the amplicon + 12 nucleotides (the length of the overlap). Alternatively, other arguments (such as max_ee or trunc_q) may be preventing reads from passing the filter.
Goal: Verify the pipeline result with a dataset containing high, medium, low, and very low quality reads.

Input: `Sample001_1.fastq.gz` and `Sample001_2.fastq.gz` through `Sample020_1.fastq.gz` and `Sample020_2.fastq.gz` from the folder `stress_test_mixed_quality`.

Expected: The pipeline outputs a message indicating that it has received mixed-quality data that it cannot process.

Result:
- Pipeline failed at the DENOISE_DADA2 step.
- Error: No reads passed the filter. trunc_len_f (240) or trunc_len_r (180) may be individually longer than read lengths, or trunc_len_f + trunc_len_r may be shorter than the length of the amplicon + 12 nucleotides (the length of the overlap). Alternatively, other arguments (such as max_ee or trunc_q) may be preventing reads from passing the filter.
Goal: Check the pipeline result when RNA sequence files are provided as input.

Input: `RNA_Sample001_1.fastq.gz` and `RNA_Sample001_2.fastq.gz` from the folder `stress_test_rna`.

Expected: The pipeline should output a message specifying the type of files it can process.

Result:
- Pipeline failed at the DENOISE_DADA2 step.
- Error: No reads passed the filter. trunc_len_f (240) or trunc_len_r (180) may be individually longer than read lengths, or trunc_len_f + trunc_len_r may be shorter than the length of the amplicon + 12 nucleotides (the length of the overlap). Alternatively, other arguments (such as max_ee or trunc_q) may be preventing reads from passing the filter.
Goal: Check the pipeline result for a single-sample dataset.

Input: `Sample001_1.fastq.gz` and `Sample001_2.fastq.gz` from the folder `stress_test_single_sample`.

Expected: The pipeline should output an informative message.

Result:
- Pipeline failed at the DENOISE_DADA2 step.
- Error: No reads passed the filter. trunc_len_f (240) or trunc_len_r (180) may be individually longer than read lengths, or trunc_len_f + trunc_len_r may be shorter than the length of the amplicon + 12 nucleotides (the length of the overlap). Alternatively, other arguments (such as max_ee or trunc_q) may be preventing reads from passing the filter.
Goal: Check the pipeline result when the raw_data folder contains a sample whose name is not in the metadata file.

Input: `malformed_header_1.fastq.gz` and `malformed_header_2.fastq.gz` from the folder `stress_test_corrupted`.

Expected: The pipeline should output a message indicating that the sample name could not be found in the metadata.

Result:
- Pipeline failed at the TRIMMOMATIC step.
- Error (failing Trimmomatic invocation): `threads 1 -trimlog malformed_header_trim.log -summary malformed_header.summary malformed_header_1.fastq.gz malformed_header_2.fastq.gz malformed_header.paired.trim_1.fastq.gz malformed_header.unpaired.trim_1.fastq.gz malformed_header.paired.trim_2.fastq.gz malformed_header.unpaired.trim_2.fastq.gz ILLUMINACLIP:group2a/adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:187`
Goal: Check the pipeline result when the metadata folder is empty.

Input: `malformed_header_1.fastq.gz` and `malformed_header_2.fastq.gz` from the folder `stress_test_corrupted`.

Expected: The pipeline should output a message indicating that the metadata could not be found.

Result:
- Pipeline exits immediately.
- Messages:
  - `Output directory: runs/R01050625`
  - `Metadata file not found: runs/R01050625/metadata/metadata.tsv`
- Note: This is a good example of a clear fatal message for the user.
After running all the stress-testing cases detailed above, Group 4 offers the following recommendations to improve pipeline robustness and the end-user experience.
Implement dependency checks: Ensure all external files and resources exist and are available before pipeline execution begins. This prevents confusing downstream errors, such as the `java.io.FileNotFoundException` for the TruSeq3 adapter file observed during testing.
Add pre-validation of FASTQ files: Introduce a validation step to check the integrity of the FASTQ files provided as input. For example, missing headers, missing sequence, etc. This validation should cover all cases evaluated in this report.
Improve user-facing error messages: Errors should clearly distinguish between:
- Informative messages: the input is not optimal, but processing can continue (e.g., a dataset with only one run).
- Fatal messages: pipeline execution must halt due to invalid input files (e.g., missing sequences, incorrect file format).

In both cases, the pipeline should avoid raw stack traces and instead guide users towards a resolution.
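Such a FASTQ pre-validation step could look like the following sketch (a hypothetical helper, not part of the current pipeline), which checks exactly the failure modes exercised above: missing `@` headers, missing `+` separators, sequence/quality length mismatches, and truncated records:

```python
def validate_fastq(lines):
    """Minimal FASTQ structure check (sketch). Takes the file's lines
    and returns a list of human-readable error strings (empty if valid)."""
    errors = []
    for i in range(0, len(lines) - len(lines) % 4, 4):
        head, seq, sep, qual = (lines[i + j].rstrip("\n") for j in range(4))
        rec = i // 4 + 1
        if not head.startswith("@"):
            errors.append(f"record {rec}: header does not start with '@'")
        if not sep.startswith("+"):
            errors.append(f"record {rec}: separator line does not start with '+'")
        if len(seq) != len(qual):
            errors.append(f"record {rec}: sequence/quality length mismatch")
    if len(lines) % 4:
        errors.append("file truncated: record count not a multiple of 4")
    return errors
```

Running such a check before TRIMMOMATIC would let the pipeline emit one clear fatal message per corrupt file instead of a Java stack trace.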
In summary, we emphasize the importance of providing clear feedback to the user rather than allowing the pipeline to fail silently or crash.