# External pipelines

| author | date | tags |
|--------|------|------|
| KH | 2022-10-10 | cubi, internal, workflow, pipeline, extern, standard, criteria |
| PE | 2023-01-25 | update, forks, policy |
| KH | 2023-02-24 | update, git, tags |

## Usage of external workflows

For frequently used and standardized bioinformatic analyses, workflows are often already available from third-party sources (typically, a GitHub repository). Using those workflows may save resources such as development time and, ideally, lets us benefit directly from an existing community for bug fixing, support, and regular updates.
In the best case, the locally produced workflow results are directly comparable to those of other institutions relying on the same workflow; this can be a solid starting point for collaborations.

## Criteria for CUBI use

External workflows should fulfil the following criteria for running, maintaining, and ensuring reproducible results within the CUBI:

1. **Fitting your project and data**
   This is probably the most important requirement: the workflow should produce the needed output with methods that are suited to your data. For example, if your data includes unique molecular identifiers (UMIs), the workflow should also account for that.
   - Note: if you find that the external workflow fulfills all requirements but lacks clear documentation of the input data specification, the right way to fix that is to update the documentation and create a pull request in the original workflow repository.
2. **Running on local infrastructure**
   This includes infrastructure requirements such as execution on high-performance clusters, restricted use of online resources, available software, etc.
   - Note: the possible ways of deploying the workflow on compute infrastructure must cover an offline deployment option (always consider HPC environments to be disconnected from the outside world). If an offline deployment is not possible, are there ways around that, e.g., by using local mirror servers for software setup, or by using resources that were downloaded locally before the workflow deployment starts? Is the workflow small enough to be containerized including all dependencies and then simply be executed on a single node of the cluster? See the deployment sketch after this list.
3. **Usability**
   The installation and configuration should not be too complex. As a guideline, these workflows should be usable and understandable by other CUBI members without special knowledge; e.g., they should not contain cryptic configuration code or configuration files with idiosyncratic syntax.
   There should also be proper documentation that enables fast and easy configuration and debugging. Also consider that the workflows may at some point be executed by non-bioinformaticians such as biologists or technicians. For people who are not experts in the respective domain, a simple and clearly documented user interface is vital.
4. **Reporting and quality control**
   The workflow should provide comprehensive quality control and results reporting. The reports should give information about the processing, the created output, and its quality. The workflow steps and output should be transparent and not represent a "black box".
5. **Flexibility**
   Integration into the infrastructure and adaptation to your project should be possible and easy. For example, switching to another reference genome should require no more than one change in the configuration file. The same should hold for pulling updates from the external workflow into your running setup.
6. **Reproducibility**
   The workflow's release model should clearly identify stable versions. Important changes between versions (changes in runtime behavior, default parameters, etc.) should be documented in a transparent manner (preferably in a public CHANGELOG file in the repo).
7. **Open access**
   Of course, the workflow and all its code should be completely public and open-access, following the principles of open science. The programs and dependencies it uses should also be non-proprietary and usable within their license agreements.
   - Note: code without a license must not be used. If you think the author(s) simply forgot to add a license file, reach out to them and kindly ask them to add a license to their repository.
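
For the offline deployment mentioned under criterion 2, a minimal pre-staging sketch could look as follows, assuming a Snakemake-based workflow; the Snakemake flags are real, but all paths and URLs are placeholders:

```sh
# Phase 1: on a node WITH internet access, pre-stage everything.

# Pre-build all conda environments into a shared prefix without
# running any jobs
snakemake --use-conda --conda-create-envs-only \
    --conda-prefix /shared/conda-envs --cores 1

# Download reference resources before the workflow run starts
wget -P /shared/references https://example.org/ref/genome.fa.gz

# Phase 2: on the (offline) compute nodes, reuse the pre-built
# environments and the locally stored references.
snakemake --use-conda --conda-prefix /shared/conda-envs --cores 48
```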

When evaluating whether to use an external workflow, you can ask yourself the following:

- How many project-specific adaptations do I need?
- If some steps are missing or not suited: can I integrate adaptations as valuable contributions to the external workflow?
- Is it more reasonable to choose another workflow, or to create my own?

## Best practices: incorporating an external workflow

Assuming that you evaluated an external workflow positively and decided that using it is better than writing something from scratch, you should adopt the following strategy to make the workflow locally available:

1. Fork the workflow into the CUBI organization on GitHub.
   - Only fork the main branch (commonly called `main`, `master`, or `trunk`) that contains the release versions of the code base. Forking the entire upstream repository including all development and feature branches, tags, etc. will "litter" your repository with information that is irrelevant to your users.
   - Set up multiple push targets as per CUBI policy (see the sketch after this list).
2. Read and understand the license agreement of the forked workflow, and never violate its conditions! As an example, even the liberal MIT license includes this statement:

   > The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

3. Add the CUBI metadata files (our license, citation and `pyproject.toml` etc.), and also record the version of the upstream repo (see step 8). If adding these files clashes with existing files - quite likely at least for the license - create a subfolder `cubi/` and add "our" metadata files there.
   - Note: obviously, these files should never be pushed upstream.
4. Even if the original code base does not contain specific citation information, always update the default CUBI citation file with a reference to the original project. Citing the CUBI version of the workflow must ensure that the original creators are credited in the appropriate form.
5. If you add new code files, mark them explicitly (at the beginning of the file) as being licensed under CUBI's default (= MIT) license, and point explicitly to the location of CUBI's license file.
6. Simple fixes or feature extensions in the existing code base that you feed back to the original project via pull request to upstream do not need to be specifically marked (or licensed) - that is a quid pro quo service.
7. Follow the CUBI git development process. This implies working with the usual CUBI branch structure (`main`, `dev`, etc.).
8. Working with a forked repository requires tracking two records of code version numbers: (i) the version of the upstream repository (`x.y.z`), and (ii) the version of your own (adapted) code base (`a.b.c`). Both version records can be tracked in the `pyproject.toml` file (see the sketch below). It is strongly recommended to work only with proper release versions of upstream, and to avoid frequent syncs with upstream. For your own versioning, remember to follow the CUBI versioning standards. When pulling a new release version from upstream into your forked repository, keep in mind that breaking changes upstream should also be reflected in the version number of your repository: for example, if upstream updates from `x.y.z` to `x+1.y.z`, you should likewise increment the major version of your repo to signal a breaking change (i.e., `a.b.c` to `a+1.b.c`).
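
As a minimal sketch of steps 1 to 3, assume the fork already exists under the CUBI organization on GitHub (the GitHub fork dialog offers to copy only the default branch); all repository names, URLs, and template paths below are placeholders:

```sh
# Clone only the main branch of the CUBI fork to keep upstream's
# development branches and tags out of your working copy
git clone --branch main --single-branch \
    https://github.com/core-unit-bioinformatics/example-workflow.git
cd example-workflow

# Keep a reference to upstream for pulling future release versions
git remote add upstream https://github.com/original-org/example-workflow.git

# Set up multiple push targets as per CUBI policy; once a push URL is
# added explicitly, the default one is no longer used, so both targets
# are added here
git remote set-url --add --push origin \
    https://github.com/core-unit-bioinformatics/example-workflow.git
git remote set-url --add --push origin \
    https://git.internal.example.org/cubi/example-workflow.git

# Step 3: place the CUBI metadata files in a subfolder to avoid
# clashes with upstream's own files (template paths are illustrative)
mkdir cubi
cp /path/to/cubi-templates/LICENSE cubi/LICENSE
cp /path/to/cubi-templates/CITATION.md cubi/CITATION.md
```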
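
For step 8, both version records could be kept in the `pyproject.toml` file. Since the exact layout of the CUBI `pyproject.toml` is not specified here, the `[tool.cubi]` table and its key names below are assumptions, not a documented standard:

```sh
# Hypothetical sketch: append both version records to pyproject.toml
# (table and key names are assumptions)
cat >> pyproject.toml <<'EOF'
[tool.cubi]
# version of the adapted CUBI code base (a.b.c)
version = "1.0.0"
# upstream release this fork is currently based on (x.y.z)
upstream-version = "3.2.1"
EOF
```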
