Validation - egenomics/agb2025 GitHub Wiki
To test and validate the pipeline, we performed benchmarking, stress testing, and validation with real samples.
This page describes a comprehensive synthetic dataset generation and benchmarking framework for evaluating microbiome analysis pipelines using InSilicoSeq.
It consists of a complete workflow for generating synthetic microbiome datasets with known ground truth for benchmarking amplicon sequencing analysis pipelines. The framework tests multiple aspects of pipeline performance, including ASV detection accuracy, abundance estimation, diversity metric calculation, and robustness across different sequencing platforms and depths.
Accurate benchmarking of microbiome analysis pipelines is critical for ensuring reliable results in microbiome research. However, creating comprehensive benchmarks requires datasets with known ground truth - something rarely available with real experimental data. This framework addresses this challenge by generating realistic synthetic datasets using InSilicoSeq with biologically realistic log-normal abundance distributions, enabling rigorous evaluation of pipeline performance across multiple dimensions.
The synthetic datasets are based on real amplicon sequencing data from:
Primary Study: Liao, C., Taylor, B.P., Ceccarani, C. et al. Compilation of longitudinal microbiota data and hospitalome from hematopoietic cell transplantation patients. Sci Data 8, 71 (2021). https://doi.org/10.1038/s41597-021-00860-8
The original dataset contains over 10,000 fecal samples from hematopoietic cell transplantation patients, analyzed using 16S rRNA amplicon sequencing (V4-V5 region) to characterize gut microbiota composition. This provides a realistic foundation for synthetic data generation based on authentic human gut microbiome profiles.
Data Repository: https://github.com/Jinyuan1998/scientific_data_metagenome_shotgun/tree/main/deidentified_data_tables
The files `tblcounts_asv_melt.csv` and `tblASVtaxonomy_silva132_v4v5_filter.csv` were used to create our two basic input files: `library_gut.fasta` and `read_count_gut.tsv`.
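As an illustration, that conversion can be sketched as follows. This is a simplified stand-in for `create_iss_library.py`, not the actual script, and the column names `asv_id`, `count`, and `sequence` are assumptions about the raw tables:

```python
import csv
from collections import defaultdict

def build_iss_inputs(counts_csv, taxonomy_csv, fasta_out, tsv_out, top_n=100):
    """Select the top-N most abundant ASVs and write the ISS reference
    FASTA plus a read-count table (sketch; column names are assumptions)."""
    totals = defaultdict(int)
    with open(counts_csv) as fh:
        # Melted count table: one row per (sample, ASV) observation.
        for row in csv.DictReader(fh):
            totals[row["asv_id"]] += int(row["count"])
    top = sorted(totals, key=totals.get, reverse=True)[:top_n]
    with open(taxonomy_csv) as fh:
        # The taxonomy table is assumed to carry the ASV sequences.
        seqs = {row["asv_id"]: row["sequence"] for row in csv.DictReader(fh)}
    with open(fasta_out, "w") as fa, open(tsv_out, "w") as tsv:
        tsv.write("asv_id\tread_count\n")
        for asv in top:
            if asv in seqs:
                fa.write(f">{asv}\n{seqs[asv]}\n")
                tsv.write(f"{asv}\t{totals[asv]}\n")
```

With the real tables, calling `build_iss_inputs("tblcounts_asv_melt.csv", "tblASVtaxonomy_silva132_v4v5_filter.csv", "library_gut.fasta", "read_count_gut.tsv")` would produce the two input files.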
```
tests/group4/microbiome-benchmark/
├── README.md
├── input_data/
│   ├── library_gut.fasta                        # Reference ASV sequences (top 100 ASVs)
│   ├── asv_metadata.tsv                         # ASV-to-genus mapping metadata
│   ├── create_iss_library.py                    # Script to generate input files from raw data
│   ├── tblcounts_asv_melt.csv                   # Original count data
│   └── tblASVtaxonomy_silva132_v4v5_filter.csv  # Original taxonomy data
├── scripts/
│   ├── complete_benchmarking_workflow.sh        # Main workflow script
│   ├── benchmarking_script.sh                   # Synthetic sample generation
│   └── per_sample_ground_truth.py               # Per-sample ground truth calculator
├── benchmarking_output/                         # Generated synthetic data
│   │                                            # (Not in GitHub - ask group4 for this directory if needed)
│   └── synthetic_samples/
│       ├── {sample_name}_R1.fastq               # Paired-end FASTQ files (30 samples)
│       ├── {sample_name}_R2.fastq               # Generated by InSilicoSeq
│       ├── {sample_name}_abundance.txt          # ISS ground truth abundances
│       └── sample_summary.txt                   # Sample group information
├── benchmarking_results/                        # Pipeline validation results
│   ├── benchmarking_summary.txt                 # Performance metrics summary
│   ├── detailed_metrics.csv                     # Per-sample validation data
│   ├── abundance_correlations.png               # Individual sample correlation plots
│   ├── diversity_comparison.png                 # Alpha diversity validation plots
│   └── performance_by_condition.png             # Condition-specific performance analysis
├── images/                                      # README visualization assets
│   ├── individual_sample_correlations.png       # For GitHub display
│   ├── diversity_metrics_validation.png         # For GitHub display
│   └── condition_performance_analysis.png       # For GitHub display
├── genus_benchmark_v15.py                       # Benchmarking analysis script
└── run_benchmark/                               # Example pipeline run results
    └── S01070625/                               # Sample run ID
        ├── qiime_output/relevant_results/       # Pipeline outputs for validation
        │   ├── feature_table.tsv                # ASV abundance table
        │   └── taxonomy.tsv                     # Taxonomic assignments
        ├── kraken/                              # Alternative classifier results
        ├── raw_data/                            # Compressed input files
        └── trimmed_reads/                       # (Not in GitHub - ask group4 for this directory if needed)
```
Some of the files in this repository are not provided through GitHub due to file size limits.
- `input_data/`: Original reference data and ASV metadata for synthetic generation
- `scripts/`: Automated workflow scripts for generating synthetic datasets
- `benchmarking_output/synthetic_samples/`: 30 synthetic FASTQ files with ground truth
- `benchmarking_results/`: Validation metrics and performance visualizations
- `genus_benchmark_v15.py`: Analysis script for comparing pipeline results to ground truth
We validated our microbiome analysis pipeline using the synthetic dataset, achieving good performance across multiple metrics:
Overall Performance Summary:
- Abundance Correlation: 85.5% mean Pearson correlation
- Genus Detection Rate: 71.8% (detecting 32/44 genera on average)
- Shannon Diversity Error: Only 1.9% relative error
- Samples with Excellent Correlation (r>0.9): 11 out of 27 samples
- Samples with Good Correlation (r>0.7): 23 out of 27 samples
The scatter plots show genus-level abundance correlations between pipeline output and ground truth for each sample. Key observations:
- Standard samples show consistently excellent correlations (r>0.9)
- High-depth samples (2x, 5x) maintain strong performance, though slightly below standard samples
- Low-depth samples (0.25x, 0.5x) show reduced but acceptable performance
- MiSeq variants demonstrate robust performance across different sequencing protocols (standard depth = 100,000 reads)
Alpha diversity metrics comparison reveals strong pipeline performance:
- Shannon Diversity (r=0.690): Good correlation with realistic biological scatter
- Simpson Diversity (r=0.681): Excellent tight clustering around 1:1 line
- Richness (r=NaN): consistent detection across samples, but the correlation is undefined because richness barely varies (interpret with caution)
- Evenness (r=0.683): Good preservation of community structure patterns
The diversity metrics are borderline acceptable; we believe this stems from rarefaction threshold artifacts.
Comprehensive analysis across synthetic experimental conditions shows:
- 100K samples: Mean r=0.937 (highly reproducible)
- MiSeq variants (standard, 24, 28): r=0.91-0.93 (platform robust)
- Higher-quality samples (200K reads): r=0.889 (still good, but below the 100K-depth samples)
- No GC bias samples: r=0.845 (shows GC bias correction helps)
- Depth 0.25x & 0.5x: r=0.68-0.69 (expected with low coverage)
- Depth 5x: r=0.760 (potential oversaturation effects)
- Basic error samples: Complete failure (0% detection)
- Platform Robustness: Excellent performance across different MiSeq variants (24, 28, standard)
- Depth Sensitivity: Performance degrades with low coverage
- Quality Dependence: High-quality samples show superior results
- Error Model Impact: Basic error samples failed completely
- Consistent detection: 70-75% across most conditions
- Best detection: Depth 2x and 5x samples (75%)
- Lowest detection: Depth 0.25x samples (60%)
- Platform independence: Similar detection rates across MiSeq variants
Shannon diversity errors remain low across all successful conditions:
- Lowest errors: High quality and MiSeq variant samples (<1.5%)
- Highest errors: Low depth samples (3-5%)
- Standard samples: Excellent consistency (1.7% error)
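The relative error reported here is simply the absolute difference normalized by the true value; a hypothetical sketch:

```python
def relative_error(predicted: float, true: float) -> float:
    """Relative error as a fraction of the true value."""
    return abs(predicted - true) / true

# A predicted Shannon index of 3.05 against a true value of 3.00
# corresponds to roughly 1.7% relative error.
```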
These results demonstrate that our pipeline:
- Preserves biological signal in high-quality, high-depth samples (100K)
- Maintains ecological relationships between samples moderately well for community analyses
- Handles platform variation effectively across different sequencing protocols
- Degrades under challenging conditions (lower depth, poor quality)
Based on benchmarking results, the pipeline is recommended for:
- Standard 16S amplicon analysis (100K+ reads per sample)
- Multi-platform studies (robust across MiSeq variants)
Use with caution for:
- Medium to low-depth samples (<50K reads)
- Samples with severe sequencing artifacts
- Studies requiring detection of very rare taxa
The benchmarking was not performed in
All benchmarking results are available in `benchmarking_results/`:
- `benchmarking_summary.txt`: Comprehensive performance metrics
- `detailed_metrics.csv`: Per-sample detailed statistics
The framework generates 6 distinct sample groups for comprehensive testing:
**1. Standard replicates**
- Purpose: Test reproducibility and baseline performance
- Configuration: 100K reads, MiSeq, log-normal abundance, GC bias
- Files: `standard_rep_1` to `standard_rep_8`

**2. Platform variants**
- Purpose: Test platform-specific robustness
- Configurations: 100K reads each
  - MiSeq (standard): `miseq_rep_1`, `miseq_rep_2`
  - MiSeq-24: `miseq-24_rep_1`, `miseq-24_rep_2`
  - MiSeq-28: `miseq-28_rep_1`, `miseq-28_rep_2`

**3. Sequencing depth variants**
- Purpose: Test rarefaction effects and depth-dependent diversity
- Configurations:
  - Low depth (25K reads): `depth_0.25x_rep_1`, `depth_0.25x_rep_2`
  - Medium depth (50K reads): `depth_0.5x_rep_1`, `depth_0.5x_rep_2`
  - High depth (200K reads): `depth_2.0x_rep_1`, `depth_2.0x_rep_2`
  - Very high depth (500K reads): `depth_5.0x_rep_1`, `depth_5.0x_rep_2`

**4. No GC bias**
- Purpose: Assess GC bias impact
- Configuration: 100K reads, MiSeq without GC bias
- Files: `no_gc_bias_1` to `no_gc_bias_3`

**5. Basic error model**
- Purpose: Test with a simpler error structure
- Configuration: 100K reads, basic error model (no indels)
- Files: `basic_error_1` to `basic_error_3`

**6. High quality**
- Purpose: Test with higher-quality reads
- Configuration: 100K reads, MiSeq-36 with GC bias
- Files: `high_quality_1`, `high_quality_2`
- Python 3.7+
- InSilicoSeq
- Required Python packages: `pandas`, `numpy`

```bash
conda install -c bioconda insilicoseq
# or
pip install InSilicoSeq
```
Clone the repo and:

```bash
cd tests/group4/microbiome-benchmark/
chmod +x scripts/*.sh

# Run the complete workflow (estimated time: 30-45 minutes)
./scripts/complete_benchmarking_workflow.sh
```
This single command will:
- Create organized output directories
- Generate 30 synthetic samples across multiple conditions using ISS log-normal abundance distribution
- Calculate per-sample ground truth for each synthetic sample
- Create summary files for easy analysis
After completion, you'll have:

```
benchmarking_output/
├── ground_truth/                            # Input reference (currently unused)
└── synthetic_samples/
    ├── standard_rep_1_R1.fastq              # FASTQ files for your pipeline
    ├── standard_rep_1_R2.fastq
    ├── standard_rep_1_abundance.txt         # ISS-generated abundance
    ├── standard_rep_1_ground_truth.json     # True diversity metrics
    ├── standard_rep_1_ground_truth_asvs.csv # True ASV abundances
    ├── ... (all 30 samples)
    ├── all_samples_ground_truth.json        # Combined metrics
    ├── ground_truth_summary.csv             # Summary table
    └── sample_summary.txt                   # Human-readable summary
```
Run your microbiome analysis pipeline on all generated FASTQ files:

```bash
cd benchmarking_output/synthetic_samples

# Example for each sample
your_pipeline standard_rep_1_R1.fastq standard_rep_1_R2.fastq
```
Compare your pipeline outputs against the corresponding ground truth files. For each sample, compare:

- Your ASV table → `{sample_name}_ground_truth_asvs.csv`
- Your diversity metrics → `{sample_name}_ground_truth.json`
For each synthetic sample, the framework generates:
- `{sample_name}_abundance.txt`: ISS-generated abundance file (genome_id, abundance, coverage, nb_reads)
- `{sample_name}_ground_truth.json`: Diversity metrics (Shannon, Simpson, richness, evenness)
- `{sample_name}_ground_truth_asvs.csv`: ASV-level abundances and relative abundances

Combined across all samples:
- `all_samples_ground_truth.json`: Combined metrics for all samples
- `ground_truth_summary.csv`: Tabular summary for easy analysis
- `sample_summary.txt`: Human-readable summary with sample group information
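The diversity metrics stored in the ground-truth JSON can be reproduced with a short sketch (the actual computation lives in `per_sample_ground_truth.py`; this assumes natural-log Shannon, Gini-Simpson, and Pielou evenness):

```python
import math

def diversity_metrics(counts):
    """Alpha diversity from a vector of per-ASV counts or abundances (sketch)."""
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    shannon = -sum(x * math.log(x) for x in p)      # Shannon index (natural log)
    simpson = 1 - sum(x * x for x in p)             # Gini-Simpson index
    richness = len(p)                               # number of observed ASVs
    evenness = shannon / math.log(richness) if richness > 1 else 0.0  # Pielou's J
    return {"shannon": shannon, "simpson": simpson,
            "richness": richness, "evenness": evenness}
```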
```python
# Detection metrics, calculated per sample
sensitivity = true_positives / (true_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)
f1_score = 2 * (precision * sensitivity) / (precision + sensitivity)

# Compare predicted vs. true abundances
# (pearsonr/spearmanr return a (statistic, p-value) pair)
from scipy.stats import pearsonr, spearmanr
pearson_r, _ = pearsonr(true_abundances, predicted_abundances)
spearman_r, _ = spearmanr(true_abundances, predicted_abundances)

# Compare diversity metrics
shannon_error = abs(predicted_shannon - true_shannon)
simpson_error = abs(predicted_simpson - true_simpson)
richness_error = abs(predicted_richness - true_richness)
```
- Compare performance across different sequencing depths (25K to 500K reads)
- Evaluate rarefaction curve accuracy
- Assess rare ASV detection limits
To ensure the reliability and resilience of the 16S rRNA pipeline, we generated a comprehensive set of synthetic datasets using the script located at: /tests/mock_data_generator.py
This utility simulates a wide range of input conditions, including valid and intentionally corrupted datasets. The goal is to verify that the pipeline:
- Successfully processes valid data
- Handles edge cases
- Produces coherent warnings or informative errors for invalid inputs
The script creates paired-end FASTQ files with 16S sequences and metadata for a variety of test scenarios. Below is a summary of the dataset types:
| Dataset ID | Description |
|---|---|
| `stress_test_low_quality` | Sequences with degraded quality scores to test trimming and filtering logic |
| `stress_test_low_reads` | Samples with very low read counts |
| `stress_test_high_samples` | Large number of samples to evaluate scalability |
| `stress_test_high_depth` | Very high sequencing depth per sample |
| `stress_test_mixed_quality` | Mix of high, medium, low, and very low quality reads |
| `stress_test_standard` | Standard quality and depth, used as a control |
| `stress_test_single_sample` | One-sample dataset to validate minimal inputs |
| `stress_test_corrupted` | Files with intentional FASTQ format violations |
| `stress_test_rna` | RNA sequences containing 'U' instead of 'T' |
| `stress_test_format_validation` | Files for compression validation tests |
Each generated sample includes:
- Paired-end FASTQ files: `*_1.fastq.gz` and `*_2.fastq.gz`
- Metadata file
FASTQ files contain:
- Realistic 16S rRNA sequences (e.g., V4 or V3V4 regions)
- Reverse-complemented R2 reads with mutation simulation
- Quality scores generated with degradation profiles
- Corrupt files are intentionally malformed (e.g., missing quality lines, malformed headers, invalid separators)
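A minimal sketch of how such a paired read could be produced (the function name, mutation rate, and quality handling are illustrative, not the actual `mock_data_generator.py` implementation):

```python
import random

COMP = str.maketrans("ACGT", "TGCA")

def make_pair(read_id, seq, qual_char="I", mutation_rate=0.01, seed=0):
    """One paired-end FASTQ record (sketch): R2 is the reverse
    complement of R1 with random point mutations."""
    rng = random.Random(seed)
    r2 = list(seq.translate(COMP)[::-1])        # reverse-complement R1
    for i in range(len(r2)):
        if rng.random() < mutation_rate:        # simulate sequencing errors
            r2[i] = rng.choice("ACGT")
    q = qual_char * len(seq)                    # flat quality string placeholder
    r1_rec = f"@{read_id}/1\n{seq}\n+\n{q}\n"
    r2_rec = f"@{read_id}/2\n{''.join(r2)}\n+\n{q}\n"
    return r1_rec, r2_rec
```

Corrupted variants are then produced by deliberately breaking one of the four record lines (dropping the quality line, removing the leading `@`, and so on).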
Each test run produces a structured report with the following sections:
- Name and goal of the test
- Example FASTQ filenames used
- Path references
- Description of correct pipeline response
- Notes on proper error handling and user feedback
- Step at which the failure occurred (e.g., TRIMMOMATIC)
- Error messages (e.g., `RuntimeException: Invalid FASTQ name line: read0`)
- Notes on whether the failure was gracefully handled or caused a crash
- Add pre-validation for FASTQ structure
- Add existence checks for required files
- Improve clarity of user-facing error messages
These reports provide a reproducible framework for improving the pipeline's robustness and enhancing the user experience in edge-case scenarios.
Goal: Validate pipeline robustness when receiving invalid FASTQ inputs; in this case a malformed header missing the leading `@`.

Input: `malformed_header_1.fastq.gz` and `malformed_header_2.fastq.gz` from the folder `stress_test_corrupted`.

Expected: The pipeline should provide informative error messages rather than dumping a raw stack trace, which is not user-friendly.

Result:
- Pipeline failed at the TRIMMOMATIC step.
- Errors:
  - `Command error: application/gzip Failed to process malformed_header_1.gz`
  - `Command error: application/gzip Failed to process malformed_header_2.gz`
Goal: Validate pipeline robustness when receiving invalid FASTQ inputs; in this case a missing quality line.

Input: `missing_quality_1.fastq.gz` and `missing_quality.fastq.gz` from the folder `stress_test_corrupted`.

Expected: The pipeline should provide informative error messages rather than dumping a raw stack trace, which is not user-friendly.

Result:
- Pipeline failed at the TRIMMOMATIC step.
- Errors:
  - `java.io.FileNotFoundException: group2a/adapters/TruSeq3-PE-2.fa (No such file or directory)`
  - `Exception in thread "main" java.lang.RuntimeException: Sequence and quality length don't match: 'ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG' vs '@read1'`
- Failure not gracefully handled (standard Java crash)
Goal: Validate pipeline robustness when receiving invalid FASTQ inputs; in this case a missing sequence line.

Input: `missing_sequence_1.fastq.gz` and `missing_sequence.fastq.gz` from the folder `stress_test_corrupted`.

Expected: The pipeline should provide informative error messages rather than dumping a raw stack trace.

Result:
- Pipeline failed at the TRIMMOMATIC step.
- Errors:
  - `Failed to process missing_sequence_1.gz uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Midline 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' didn't start with '+' at 3`
  - `Failed to process missing_sequence_2.gz uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Midline 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' didn't start with '+' at 3`
Goal: Validate pipeline robustness when receiving invalid FASTQ inputs; in this case no separator line (`+`).

Input: `no_separator_1.fastq.gz` and `no_separator.fastq.gz` from the folder `stress_test_corrupted`.

Expected: The pipeline should provide informative error messages rather than dumping a raw stack trace.

Result:
- Pipeline failed at the TRIMMOMATIC step.
- Errors:
  - `java.io.FileNotFoundException: group2a/adapters/TruSeq3-PE-2.fa (No such file or directory)`
  - `Exception in thread "main" java.lang.RuntimeException: Invalid FASTQ comment line: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA`
Goal: Validate pipeline robustness when receiving a FASTQ file with a `.gz` extension that is not actually compressed.

Input: `not_compressed_1.fastq.gz` and `not_compressed_2.fastq.gz` from the folder `stress_test_format_validation`.

Expected: The pipeline should output a message informing the user that the file is not compressed.

Result:
- Pipeline failed at the FASTQC_RAW step.
- Pipeline failed immediately with the following message:

      Failed to process not_compressed_1.gz
      java.util.zip.ZipException: Not in GZIP format
          at java.base/java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:165)
          at java.base/java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
          at java.base/java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
          at uk.ac.babraham.FastQC.Utilities.MultiMemberGZIPInputStream.<init>(MultiMemberGZIPInputStream.java:37)
          at uk.ac.babraham.FastQC.Sequence.FastQFile.<init>(FastQFile.java:84)
          at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:106)
          at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:62)
          at uk.ac.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:163)
          at uk.ac.babraham.FastQC.Analysis.OfflineRunner.<init>(OfflineRunner.java:125)
          at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:316)
      application/gzip
Goal: Verify pipeline behavior when it receives a high-read-count dataset (100,000 reads per sample).

Input: `Sample001_1.fastq.gz` and `Sample001_2.fastq.gz` through `Sample010_1.fastq.gz` and `Sample010_2.fastq.gz` from the folder `stress_test_high_depth`.

Expected: The pipeline should output a message indicating that it cannot handle files with such high read counts.

Result:
- Pipeline failed at the DENOISE_DADA2 step.
- Error: No reads passed the filter. trunc_len_f (240) or trunc_len_r (180) may be individually longer than read lengths, or trunc_len_f + trunc_len_r may be shorter than the length of the amplicon + 12 nucleotides (the length of the overlap). Alternatively, other arguments (such as max_ee or trunc_q) may be preventing reads from passing the filter.
Goal: Verify pipeline behavior when it receives 150 samples in one go.

Input: `Sample001_1.fastq.gz` and `Sample001_2.fastq.gz` up to `Sample150_1.fastq.gz` and `Sample150_2.fastq.gz`.

Expected: The pipeline should indicate the maximum number of files it can handle in one go, so the user can plan how to batch their inputs.

Result:
- Pipeline failed at the DENOISE_DADA2 step.
- Error: No reads passed the filter. trunc_len_f (240) or trunc_len_r (180) may be individually longer than read lengths, or trunc_len_f + trunc_len_r may be shorter than the length of the amplicon + 12 nucleotides (the length of the overlap). Alternatively, other arguments (such as max_ee or trunc_q) may be preventing reads from passing the filter.
Goal: Verify what happens when the pipeline receives low-quality reads.

Input: `Sample001_1.fastq.gz` and `Sample001_2.fastq.gz` through `Sample010_1.fastq.gz` and `Sample010_2.fastq.gz` from the folder `stress_test_low_quality`.

Expected: The pipeline outputs a message indicating that it has received low-quality data that it cannot process.

Result:
- Pipeline failed at the DENOISE_DADA2 step.
- Error: No reads passed the filter. trunc_len_f (240) or trunc_len_r (180) may be individually longer than read lengths, or trunc_len_f + trunc_len_r may be shorter than the length of the amplicon + 12 nucleotides (the length of the overlap). Alternatively, other arguments (such as max_ee or trunc_q) may be preventing reads from passing the filter.
Goal: Verify pipeline behavior when it receives a low-read-count dataset (50 reads).

Input: `Sample001_1.fastq.gz` and `Sample001_2.fastq.gz` through `Sample005_1.fastq.gz` and `Sample005_2.fastq.gz` from the folder `stress_test_low_reads`.

Expected: The pipeline outputs a message indicating that it has received data with too few reads to process.

Result:
- Pipeline failed at the DENOISE_DADA2 step.
- Error: No reads passed the filter. trunc_len_f (240) or trunc_len_r (180) may be individually longer than read lengths, or trunc_len_f + trunc_len_r may be shorter than the length of the amplicon + 12 nucleotides (the length of the overlap). Alternatively, other arguments (such as max_ee or trunc_q) may be preventing reads from passing the filter.
Goal: Verify the pipeline result with a dataset containing high, medium, low, and very low quality reads.

Input: `Sample001_1.fastq.gz` and `Sample001_2.fastq.gz` through `Sample020_1.fastq.gz` and `Sample020_2.fastq.gz` from the folder `stress_test_mixed_quality`.

Expected: The pipeline outputs a message indicating that it has received mixed-quality data that it cannot process.

Result:
- Pipeline failed at the DENOISE_DADA2 step.
- Error: No reads passed the filter. trunc_len_f (240) or trunc_len_r (180) may be individually longer than read lengths, or trunc_len_f + trunc_len_r may be shorter than the length of the amplicon + 12 nucleotides (the length of the overlap). Alternatively, other arguments (such as max_ee or trunc_q) may be preventing reads from passing the filter.
Goal: Check the pipeline result when RNA sequence files are provided as input.

Input: `RNA_Sample001_1.fastq.gz` and `RNA_Sample001_2.fastq.gz` from the folder `stress_test_rna`.

Expected: The pipeline should output a message specifying the type of files it can process.

Result:
- Pipeline failed at the DENOISE_DADA2 step.
- Error: No reads passed the filter. trunc_len_f (240) or trunc_len_r (180) may be individually longer than read lengths, or trunc_len_f + trunc_len_r may be shorter than the length of the amplicon + 12 nucleotides (the length of the overlap). Alternatively, other arguments (such as max_ee or trunc_q) may be preventing reads from passing the filter.
Goal: Check the pipeline result for a single-sample dataset.

Input: `Sample001_1.fastq.gz` and `Sample001_2.fastq.gz` from the folder `stress_test_single_sample`.

Expected: The pipeline should output an informative message.

Result:
- Pipeline failed at the DENOISE_DADA2 step.
- Error: No reads passed the filter. trunc_len_f (240) or trunc_len_r (180) may be individually longer than read lengths, or trunc_len_f + trunc_len_r may be shorter than the length of the amplicon + 12 nucleotides (the length of the overlap). Alternatively, other arguments (such as max_ee or trunc_q) may be preventing reads from passing the filter.
Goal: Check the pipeline result when the raw_data folder contains a sample whose name is not in the metadata file.

Input: `malformed_header_1.fastq.gz` and `malformed_header_2.fastq.gz` from the folder `stress_test_corrupted`.

Expected: The pipeline should output a message indicating that the sample name could not be found in the metadata.

Result:
- Pipeline failed at the TRIMMOMATIC step.
- Error (failing Trimmomatic invocation): `threads 1 -trimlog malformed_header_trim.log -summary malformed_header.summary malformed_header_1.fastq.gz malformed_header_2.fastq.gz malformed_header.paired.trim_1.fastq.gz malformed_header.unpaired.trim_1.fastq.gz malformed_header.paired.trim_2.fastq.gz malformed_header.unpaired.trim_2.fastq.gz ILLUMINACLIP:group2a/adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:187`
Goal: Check the pipeline result when the metadata folder is empty.

Input: `malformed_header_1.fastq.gz` and `malformed_header_2.fastq.gz` from the folder `stress_test_corrupted`.

Expected: The pipeline should output a message indicating that the metadata could not be found.

Result:
- Pipeline exits immediately.
- Messages:
  - `Output directory: runs/R01050625`
  - `Metadata file not found: runs/R01050625/metadata/metadata.tsv`
- Note: This is a good example of a clear fatal message for the user.
After running all the stress-testing cases detailed above, Group 4 offers the following recommendations to improve pipeline robustness and the end-user experience.
Implement dependency checks: Ensure all external files and resources exist and are available before pipeline execution begins. This prevents confusing downstream errors, such as the `java.io.FileNotFoundException` for the TruSeq3 adapter file observed during testing.
Add pre-validation of FASTQ files: Introduce a validation step to check the integrity of the FASTQ files provided as input. For example, missing headers, missing sequence, etc. This validation should cover all cases evaluated in this report.
Improve user-facing error messages: Errors should clearly distinguish between:
- Informative messages: the input is not optimal, but processing can continue (e.g., a dataset with only one run).
- Fatal messages: pipeline execution must halt due to invalid input files (e.g., missing sequences, incorrect file format).

In both cases, the pipeline should avoid raw stack traces and instead guide users towards a resolution.
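Such a FASTQ pre-validation step could look like the following sketch (a hypothetical helper, not part of the current pipeline), which checks exactly the failure modes exercised above: missing `@` headers, missing `+` separators, sequence/quality length mismatches, and truncated records:

```python
def validate_fastq(lines):
    """Minimal FASTQ structure check (sketch). Takes the file's lines
    and returns a list of human-readable error strings (empty if valid)."""
    errors = []
    for i in range(0, len(lines) - len(lines) % 4, 4):
        head, seq, sep, qual = (lines[i + j].rstrip("\n") for j in range(4))
        rec = i // 4 + 1
        if not head.startswith("@"):
            errors.append(f"record {rec}: header does not start with '@'")
        if not sep.startswith("+"):
            errors.append(f"record {rec}: separator line does not start with '+'")
        if len(seq) != len(qual):
            errors.append(f"record {rec}: sequence/quality length mismatch")
    if len(lines) % 4:
        errors.append("file truncated: record count not a multiple of 4")
    return errors
```

Running such a check before TRIMMOMATIC would let the pipeline emit one clear fatal message per corrupt file instead of a Java stack trace.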
In summary, we emphasize the importance of providing clear feedback to the user rather than allowing the pipeline to fail silently or crash.