Week 1: Reproducibility - bcb420-2025/Izumi_Ando GitHub Wiki

⏰ (expected vs actual time taken) - 1 hour : 2.25 hours

Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers

This paper discusses how bioinformatics workflow managers address the challenges of reproducibility, portability, and scalability, making computational pipelines more efficient, shareable, and maintainable.

Citation

Wratten, L., Wilm, A., & Göke, J. (2021). Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nature Methods, 18(10), 1161–1168. https://doi.org/10.1038/s41592-021-01254-9

Notes

1. Challenges in Computational Biology Workflows

Problem of Reproducibility

  • Software versions, parameters, and reference data versions can significantly impact results.
  • Traditional pipelines built with custom scripts or Makefiles often suffer from: high dependence on local infrastructure, poor documentation and version tracking, and an inability to easily resume failed runs.

Portability Issues

  • Running the same analysis across different environments (local machine, HPC, cloud) is difficult.
  • Software dependencies and OS-specific issues prevent seamless execution.

Scalability Concerns

  • Large datasets require efficient resource management.
  • Traditional methods lack built-in parallelization and job scheduling.

2. Benefits of Workflow Managers

Workflow managers were developed to solve these issues by automating and standardizing computational pipelines.

Data Provenance

  • Workflow managers track software versions, parameters, and execution environments automatically.
  • Some provide execution reports with: input parameters, tool versions, resource usage details (CPU, memory, execution time), and a visualization of pipeline steps.
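The execution-report idea can be sketched in plain Python. This is a minimal illustration, not any manager's actual API; `run_step` and the toy "trim" step are hypothetical names invented for the example.

```python
import json
import platform
import time

def run_step(name, func, params, report):
    """Run one pipeline step and record provenance details in the report."""
    start = time.perf_counter()
    result = func(**params)
    elapsed = time.perf_counter() - start
    report["steps"].append({
        "step": name,
        "parameters": params,
        "runtime_seconds": round(elapsed, 4),
    })
    return result

# Capture the execution environment once, then per-step details.
report = {
    "python_version": platform.python_version(),
    "platform": platform.platform(),
    "steps": [],
}

# Toy "trim" step standing in for a real pipeline tool (hypothetical example).
reads = ["ACGTACGT", "TTGACGTA"]
trimmed = run_step("trim", lambda reads, length: [r[:length] for r in reads],
                   {"reads": reads, "length": 4}, report)

print(json.dumps(report, indent=2))
```

A real workflow manager records the same categories of information (parameters, versions, runtimes) automatically, without the pipeline author writing any bookkeeping code.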

Portability

  • Workflow managers use containerization (e.g., Docker, Singularity) and package managers (e.g., Conda, Bioconda) to ensure cross-platform reproducibility.
  • Bioinformatics workflow repositories like Dockstore and BioContainers simplify tool distribution.

Scalability

  • Built-in parallelization and resource-aware scheduling optimize performance.
  • Support for HPC, cloud computing, and container orchestration (e.g., Kubernetes, Docker Swarm).
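The parallelization point above can be sketched in plain Python; here a thread pool stands in for the manager's HPC/cloud scheduler, and `align_sample` is a hypothetical per-sample job, not a real aligner.

```python
from concurrent.futures import ThreadPoolExecutor

def align_sample(sample):
    """Toy per-sample task standing in for an independent alignment job."""
    return f"{sample}.bam"

samples = ["sampleA", "sampleB", "sampleC", "sampleD"]

# A workflow manager submits independent per-sample jobs concurrently,
# subject to resource limits; max_workers plays the role of those limits.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(align_sample, samples))

print(results)
```

The key property workflow managers exploit is that per-sample steps are independent, so they can be dispatched to however many cores or nodes are available.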

Re-Entrancy (Checkpointing)

  • Allows workflows to resume from the last successful step instead of restarting from scratch in case of failure.
  • Uses caching to avoid recomputing intermediate results, saving time and cost.
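The re-entrancy mechanism can be illustrated with a file-based checkpoint, similar in spirit to how Make or Snakemake skip steps whose outputs already exist. This is a sketch with invented names (`run_step`, the toy counts step), not any tool's real behavior.

```python
from pathlib import Path
import tempfile

def run_step(out_path, compute, log):
    """Run a step only if its output is missing; otherwise reuse the cache."""
    if out_path.exists():
        log.append(f"skip {out_path.name} (cached)")
        return out_path.read_text()
    result = compute()
    out_path.write_text(result)  # checkpoint: persist the intermediate result
    log.append(f"run {out_path.name}")
    return result

workdir = Path(tempfile.mkdtemp())
log = []

# First pass runs the step; a "resumed" second pass reuses the checkpoint.
run_step(workdir / "counts.txt", lambda: "gene1\t42\n", log)
run_step(workdir / "counts.txt", lambda: "gene1\t42\n", log)

print(log)
```

After a crash, rerunning the whole pipeline only recomputes steps whose checkpoints are missing, which is what saves time and compute cost.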

3. Comparison of Different Workflow Managers

The paper compares several workflow managers based on usability, expressiveness, portability, scalability, and available learning resources.

Table taken from the paper cited above

4. Pipeline Sharing and Community-Curated Workflows

To prevent redundant effort in pipeline development, community efforts have created curated repositories of reusable pipelines.

Table taken from the paper cited above

5. Future Directions

Standardization and Benchmarking

  • Need for systematic benchmarking of entire workflows, not just individual tools.
  • More performance evaluations of workflow managers themselves (memory, storage, execution speed).

Long-Term Software Maintenance

  • Many bioinformatics tools are open-source but lack sustained maintenance.
  • Funding initiatives (e.g., Chan Zuckerberg Initiative) are essential for supporting long-term pipeline development.

Improved Accessibility for Non-Computational Users

  • Increasing integration of graphical interfaces in DSL-based workflow managers (e.g., Nextflow Tower).
  • Workflow repositories like WorkflowHub.eu are helping bridge gaps across different workflow languages.

The Five Pillars of Computational Reproducibility: Bioinformatics and Beyond

Introduces a framework for improving computational reproducibility so that research is transparent, reliable, and reusable.

Citation

Ziemann, M., Poulain, P., & Bora, A. (2023). The five pillars of computational reproducibility: bioinformatics and beyond. Briefings in Bioinformatics, 24(6), bbad375. https://doi.org/10.1093/bib/bbad375

Notes

Motivation

  • Computational reproducibility is critical for ensuring research reliability.
  • There are real-world consequences to a lack of reproducibility; see the case study below.

A 2006 study on genomic signatures for chemotherapy selection was later retracted due to severe data errors such as mislabeled samples and duplicated data. These flaws influenced clinical trials, leading to patient lawsuits against Duke University.

5 Pillars of Computational Reproducibility

  1. Literate Programming - combining code + commentary
  • R Markdown, Quarto, Jupyter Notebooks
  • results are embedded in the document, which reduces copy & paste errors and ensures provenance tracking
  2. Code Version Control & Sharing
  • use Git, GitHub/GitLab/Bitbucket, and Zenodo/Software Heritage (for archiving final versions)
  3. Compute Environment Control
  • software version differences can lead to different results, and package updates can break older scripts
  • use containers (Docker, Singularity) or package managers (Conda, Guix), and document dependencies (sessionInfo(), pip freeze)
  4. Persistent Data Sharing
  • "data available on request" is not always honored
  • deposit datasets in FAIR (findable, accessible, interoperable, reusable) repositories
  • domain-specific repositories are preferred: GEO, SRA, ENA, PRIDE
  • general datasets can be shared via Zenodo, Figshare, or Dryad

In a study of Jupyter Notebooks in biomedical research, only 5.9% produced the expected results, largely due to missing data and broken dependencies - this is crazy!

  5. Documentation
  • include a README in repos with the project purpose, installation steps, and expected outputs
  • use protocol repositories like protocols.io for detailed workflows
  • use standardized reporting guidelines (MDAR, MIABi)
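Pillar 3's advice to document dependencies (the role sessionInfo() plays in R, or pip freeze in a shell) can be sketched in Python using the standard library; `environment_snapshot` is a hypothetical helper name for this example.

```python
import platform
from importlib import metadata

def environment_snapshot():
    """Record interpreter, OS, and installed package versions, analogous
    to R's sessionInfo() or pip freeze."""
    packages = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # some distributions lack a name field
    )
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": packages,
    }

snap = environment_snapshot()
print(snap["python"], "-", len(snap["packages"]), "packages recorded")
```

Saving such a snapshot alongside the analysis code makes it possible to rebuild (or at least diagnose) the original environment years later.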

Taken from the article

Challenges

  • Lack of incentives – Scientists are rewarded for novelty, not reproducibility.
  • Poor journal policies – Most journals do not check if computational results can be reproduced.
  • Training gaps – Many life science researchers lack computational skills.
  • Poor awareness – Many bioinformatics workflows are not documented properly.

Future Directions

  • Automated reproducibility testing: Journals could integrate continuous validation tools to check reproducibility before publication.
  • More training programs: Initiatives like The Carpentries train researchers in reproducible workflows.
  • Policy changes: Funding agencies could require reproducibility as a condition for grants.
  • Encouraging preprints & open science: Platforms like bioRxiv, eLife and Open Science Framework can host fully reproducible research.