Week 1: Reproducibility - bcb420-2025/Izumi_Ando GitHub Wiki

⏰ (expected vs actual time taken) - 1 hour : 2.25 hours

Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers

This paper discusses how bioinformatics workflow managers address the challenges of reproducibility, portability, and scalability, making computational pipelines more efficient, shareable, and maintainable.

Citation

Wratten, L., Wilm, A., & Göke, J. (2021). Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nature Methods, 18(10), 1161–1168. https://doi.org/10.1038/s41592-021-01254-9

Notes

1. Challenges in Computational Biology Workflows

Problem of Reproducibility

  • Software versions, parameters, and reference data versions can significantly impact results.
  • Traditional pipelines built with custom scripts or Makefiles often suffer from: high dependence on local infrastructure, poor documentation and version tracking, and an inability to easily resume failed runs.

Portability Issues

  • Running the same analysis across different environments (local machine, HPC, cloud) is difficult.
  • Software dependencies and OS-specific issues prevent seamless execution.

Scalability Concerns

  • Large datasets require efficient resource management.
  • Traditional methods lack built-in parallelization and job scheduling.

2. Benefits of Workflow Managers

Workflow managers were developed to solve these issues by automating and standardizing computational pipelines.

Data Provenance

  • Workflow managers track software versions, parameters, and execution environments automatically.
  • Some provide execution reports with: input parameters, tool versions, resource usage details (CPU, memory, execution time), and a visualization of pipeline steps.
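The execution-report idea can be sketched in plain Python. This is a minimal illustration, not any manager's actual API; `run_step` and the toy "trim" step are hypothetical names invented for the example.

```python
import json
import platform
import time

def run_step(name, func, params, report):
    """Run one pipeline step and record provenance details in the report."""
    start = time.perf_counter()
    result = func(**params)
    elapsed = time.perf_counter() - start
    report["steps"].append({
        "step": name,
        "parameters": params,
        "runtime_seconds": round(elapsed, 4),
    })
    return result

# Capture the execution environment once, then per-step details.
report = {
    "python_version": platform.python_version(),
    "platform": platform.platform(),
    "steps": [],
}

# Toy "trim" step standing in for a real pipeline tool (hypothetical example).
reads = ["ACGTACGT", "TTGACGTA"]
trimmed = run_step("trim", lambda reads, length: [r[:length] for r in reads],
                   {"reads": reads, "length": 4}, report)

print(json.dumps(report, indent=2))
```

A real workflow manager records the same categories of information (parameters, versions, runtimes) automatically, without the pipeline author writing any bookkeeping code.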

Portability

  • Workflow managers use containerization (e.g., Docker, Singularity) and package managers (e.g., Conda, Bioconda) to ensure cross-platform reproducibility.
  • Bioinformatics workflow repositories like Dockstore and BioContainers simplify tool distribution.

Scalability

  • Built-in parallelization and resource-aware scheduling optimize performance.
  • Support for HPC, cloud computing, and container orchestration (e.g., Kubernetes, Docker Swarm).
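The parallelization point above can be sketched in plain Python; here a thread pool stands in for the manager's HPC/cloud scheduler, and `align_sample` is a hypothetical per-sample job, not a real aligner.

```python
from concurrent.futures import ThreadPoolExecutor

def align_sample(sample):
    """Toy per-sample task standing in for an independent alignment job."""
    return f"{sample}.bam"

samples = ["sampleA", "sampleB", "sampleC", "sampleD"]

# A workflow manager submits independent per-sample jobs concurrently,
# subject to resource limits; max_workers plays the role of those limits.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(align_sample, samples))

print(results)
```

The key property workflow managers exploit is that per-sample steps are independent, so they can be dispatched to however many cores or nodes are available.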

Re-Entrancy (Checkpointing)

  • Allows workflows to resume from the last successful step instead of restarting from scratch in case of failure.
  • Uses caching to avoid recomputing intermediate results, saving time and cost.
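The re-entrancy mechanism can be illustrated with a file-based checkpoint, similar in spirit to how Make or Snakemake skip steps whose outputs already exist. This is a sketch with invented names (`run_step`, the toy counts step), not any tool's real behavior.

```python
from pathlib import Path
import tempfile

def run_step(out_path, compute, log):
    """Run a step only if its output is missing; otherwise reuse the cache."""
    if out_path.exists():
        log.append(f"skip {out_path.name} (cached)")
        return out_path.read_text()
    result = compute()
    out_path.write_text(result)  # checkpoint: persist the intermediate result
    log.append(f"run {out_path.name}")
    return result

workdir = Path(tempfile.mkdtemp())
log = []

# First pass runs the step; a "resumed" second pass reuses the checkpoint.
run_step(workdir / "counts.txt", lambda: "gene1\t42\n", log)
run_step(workdir / "counts.txt", lambda: "gene1\t42\n", log)

print(log)
```

After a crash, rerunning the whole pipeline only recomputes steps whose checkpoints are missing, which is what saves time and compute cost.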

3. Comparison of Different Workflow Managers

The paper compares several workflow managers based on usability, expressiveness, portability, scalability, and available learning resources.

Table taken from the paper cited above

4. Pipeline Sharing and Community-Curated Workflows

To prevent redundant effort in pipeline development, community efforts have created curated repositories of reusable pipelines.

Table taken from the paper cited above

5. Future Directions

Standardization and Benchmarking

  • Need for systematic benchmarking of entire workflows, not just individual tools.
  • More performance evaluations of workflow managers themselves (memory, storage, execution speed).

Long-Term Software Maintenance

  • Many bioinformatics tools are open-source but lack sustained maintenance.
  • Funding initiatives (e.g., Chan Zuckerberg Initiative) are essential for supporting long-term pipeline development.

Improved Accessibility for Non-Computational Users

  • Increasing integration of graphical interfaces in DSL-based workflow managers (e.g., Nextflow Tower).
  • Workflow repositories like WorkflowHub.eu are helping bridge gaps across different workflow languages.

The Five Pillars of Computational Reproducibility: Bioinformatics and Beyond

Introduces a framework for improving computational reproducibility so that research is transparent, reliable, and reusable.

Citation

Ziemann, M., Poulain, P., & Bora, A. (2023). The five pillars of computational reproducibility: bioinformatics and beyond. Briefings in Bioinformatics, 24(6), bbad375. https://doi.org/10.1093/bib/bbad375

Notes

Motivation

  • Computational reproducibility is critical for ensuring research reliability.
  • There are real-world consequences to a lack of reproducibility; see the case study below.

A 2006 study on genomic signatures for chemotherapy selection was later retracted due to severe data errors such as mislabeled samples and duplicated data. These flaws influenced clinical trials, leading to patient lawsuits against Duke University.

5 Pillars of Computational Reproducibility

  1. Literate Programming - combining code + commentary
  • R Markdown, Quarto, Jupyter Notebooks
  • results are embedded in the document, which reduces copy & paste errors and ensures provenance tracking
  2. Code Version Control & Sharing
  • use Git, GitHub/GitLab/Bitbucket, and Zenodo/Software Heritage (for archiving final versions)
  3. Compute Environment Control
  • software version differences can lead to different results, and package updates can break older scripts
  • use containers (Docker, Singularity) or package managers (Conda, Guix), and document dependencies (sessionInfo(), pip freeze)
  4. Persistent Data Sharing
  • "data available on request" is not always honored
  • deposit datasets in FAIR (findable, accessible, interoperable, reusable) repositories
  • domain-specific repositories are preferred: GEO, SRA, ENA, PRIDE
  • general datasets can be shared via Zenodo, Figshare, or Dryad

In a study of Jupyter Notebooks in biomedical research, only 5.9% produced the expected results, largely due to missing data and broken dependencies - this is crazy!

  5. Documentation
  • include a README in repos with the project purpose, installation steps, and expected outputs
  • use protocol repositories like protocols.io for detailed workflows
  • use standardized reporting guidelines (MDAR, MIABi)
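Pillar 3's advice to document dependencies (the role sessionInfo() plays in R, or pip freeze in a shell) can be sketched in Python using the standard library; `environment_snapshot` is a hypothetical helper name for this example.

```python
import platform
from importlib import metadata

def environment_snapshot():
    """Record interpreter, OS, and installed package versions, analogous
    to R's sessionInfo() or pip freeze."""
    packages = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # some distributions lack a name field
    )
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": packages,
    }

snap = environment_snapshot()
print(snap["python"], "-", len(snap["packages"]), "packages recorded")
```

Saving such a snapshot alongside the analysis code makes it possible to rebuild (or at least diagnose) the original environment years later.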

Taken from the article

Challenges

  • Lack of incentives – Scientists are rewarded for novelty, not reproducibility.
  • Poor journal policies – Most journals do not check if computational results can be reproduced.
  • Training gaps – Many life science researchers lack computational skills.
  • Poor awareness – Many bioinformatics workflows are not documented properly.

Future Directions

  • Automated reproducibility testing: Journals could integrate continuous validation tools to check reproducibility before publication.
  • More training programs: Initiatives like The Carpentries train researchers in reproducible workflows.
  • Policy changes: Funding agencies could require reproducibility as a condition for grants.
  • Encouraging preprints & open science: Platforms like bioRxiv, eLife and Open Science Framework can host fully reproducible research.