Complementary Tools - EDS-Bioinformatics-Laboratory/ENCORE GitHub Wiki

The use of complementary (GitHub) tools in combination with ENCORE

This page discusses the use of complementary software tools alongside ENCORE to improve computational reproducibility.

ENCORE requirements

In the main text we explained that ENCORE was driven by eight main requirements. Our view on the use of complementary (GitHub) tools with ENCORE mainly concerns requirements 1, 5, 7 and 8:

  1. Consist of a single self-contained project compendium. The computational project should be organized and available as a self-contained and integrated compendium of data, code, results, and (conceptual) documentation, stored at a single location. It should also be easily transferable to other researchers or reviewers without breaking its internal consistency.
  2. Facilitate transparency and documentation. ENCORE should facilitate transparency and a deep understanding (e.g., addressing why specific methods were selected and how these were applied) of the project through its standardized structure and documentation of concepts, methodology, data, code, and results.
  3. Enable reproducibility. The project compendium should enable an external researcher to autonomously execute and understand the computational techniques and recreate the (published) outcomes.
  4. Adhere to proposed guidelines. ENCORE should follow published guidelines for computational reproducibility as much as possible.
  5. Enable version control. ENCORE should allow version control of code and code documentation.
  6. Facilitate harmonization. The ENCORE approach itself should be standardized and well-documented such that it can easily be adopted by any researcher. This allows harmonization within research groups, enabling further joint development of best practices within the ENCORE framework. Moreover, harmonization also facilitates checking transparency and reproducibility prior to publication by direct colleagues.
  7. Provide a generic approach. ENCORE should be agnostic to the type of computational project (e.g., statistical analysis, mathematical modelling), data, programming language, and ICT infrastructure (e.g., operating system and computer hardware). ENCORE should make use of a software versioning system but otherwise should not rely on tools for project management, data processing, etc.
  8. Allow adaptation to different styles of working. ENCORE should leave sufficient flexibility to accommodate different styles of working. The underlying sFSS should be accessible from any software tool the researcher might be using.

Requirement 5. Software version control

From the initial stages of ENCORE development, we recognized the need for a software versioning system, for which we selected Git/GitHub. Consequently, the ENCORE template (documentation and github.txt in \Processing) and the Step-by-Step Guide are based on the use of GitHub. However, any other version control system or hosting platform, such as GitLab, Bitbucket, Subversion, or Mercurial, could be used, as this would not alter the ENCORE approach.

Requirement 8. Allow adaptation to different styles of working

Most researchers develop their own methods for organizing research projects and utilize specific software tools for software development, computation, project management, reporting, etc. Therefore, any platform aimed at supporting reproducibility must accommodate diverse working styles and be compatible with the software tools researchers already use. Imposing an entirely different way of working (e.g., different tools) is, in our view, too disruptive and a recipe for failure. Although ENCORE prescribes a certain project organization structure, we believe it also provides sufficient flexibility. Additionally, since ENCORE is file system-based, it will be compatible with most software used by researchers.

Requirement 1. Single self-contained project compendium & Requirement 7. Provide a generic approach

We strongly believe that for a platform like ENCORE to be successful, it should neither depend on specific software tools for project management, data storage, or processing, nor on particular hardware or operating systems. Such dependencies would be too restrictive and disruptive for many researchers, resulting in a platform that will not be widely adopted.

Currently, ENCORE only requires the use of a software versioning system (e.g., GitHub or an alternative platform) and a small Python program (sFSS Navigator). However, this Python program is also available as an executable for Windows and macOS, and as a shell script for Unix, ensuring that it can be used even if Python is not installed on Windows or macOS. Moreover, the sFSS Navigator is not essential for working with ENCORE and is only meant to generate an HTML file (Navigate.html) for the compendium recipient.

‘Neglecting’ software tools: Good or bad?

Many tools that significantly enhance computational reproducibility may initially appear to be overlooked by ENCORE. Examples mentioned in the main text include software versioning systems (e.g., GitHub, Bitbucket, GitLab), tools to preserve the computing environment (e.g., conda, renv, Docker, Apptainer (formerly Singularity)), workflow management systems (e.g., Snakemake, Nextflow, Galaxy, KNIME), Integrated Development Environments (IDEs; e.g., Visual Studio Code, NetBeans, PyCharm), software documentation tools (e.g., Doxygen, Sphinx), AI-based tools to support software engineering (e.g., Copilot, ChatGPT, Cody AI), and project management and documentation tools (e.g., wikis, Trello, Word, PowerPoint, LaTeX). Additionally, platforms like GitHub, Gitpod, and AWS Cloud offer cloud-based IDEs that provide a range of functionalities for software development. For example, GitHub provides an integrated toolset that includes software versioning, wikis, project discussions, integration with Copilot, workflows (i.e., GitHub Actions), and a development environment (Codespaces/containers). The use of such a platform seems attractive since it may contribute substantially to reproducibility. However, the choice of tools depends greatly on the researcher’s preferences and the specific requirements of a project.

By design, ENCORE does not impose the use of any specific tools other than basic Git/GitHub functionalities for software versioning. This flexibility does not preclude the use of additional tools, some of which are essential for achieving reproducibility, such as environment management and containerization. We emphasize that ENCORE is neither intended to replace these tools, nor does it exclude their use. In fact, we believe that the aforementioned tools should be utilized as much as possible in combination with ENCORE. However, ENCORE leaves it to the researcher to decide which tools to use. Incorporating specific tools in ENCORE by design would make the approach far less attractive to the broader community given individual preferences, expertise, and experiences.

An exception: approaches for the preservation of software environment

The preservation of the software environment using tools like conda, renv, Docker, Apptainer, and virtual machines is essential for reproducibility. We are currently investigating within our group how best to include such tools in the ENCORE specifications. A next ENCORE version will likely include a set of preferred tools and approaches to preserve the computing environment. We expect to do this in a similar way as we do now for Git/GitHub, i.e., by providing detailed instructions and automation scripts.

Can GitHub functionalities provide an alternative for ENCORE?

Since ENCORE relies on Git/GitHub for software versioning, it is worthwhile to evaluate the extent to which we can leverage other functionalities of GitHub, or if the GitHub platform could serve as an alternative to ENCORE.

The ENCORE file system structure (sFSS) is the central hub and entry point for a project (Figure 1). While ENCORE incorporates basic Git/GitHub functionalities for software versioning, it is not primarily based on Git/GitHub. ENCORE synchronizes only part of a project with the GitHub repository, specifically a selection of files (code, notebooks, and code documentation) within the \Processing directory. The ENCORE sFSS is intended to be shared, whereas sharing the GitHub repository is optional and only necessary if the recipient is interested in accessing previous software versions.

We believe it is important to keep in mind that GitHub is designed to support collaborative software development, with versioning of code and text-based documents as core functionalities. GitHub is not designed to support (large-scale) data analysis projects although additional GitHub functionalities (described below) make it a more general computation platform. Let us first review part of the GitHub functionalities in the context of ENCORE.

ENCORE setup

Figure 1. The sFSS and its environment. The green box denotes the Project Compendium (sFSS) with part of the directory structure shown. The sFSS is the central point of entry for a project and is initially cloned from the ENCORE template GitHub repository when starting a new project. The project team is responsible for the organization and documentation of the project. Only the code and code documentation within the project compendium are synchronized to a project specific GitHub repository. An sFSS project compendium can be shared with a compendium recipient. The compendium recipient starts exploring the project by opening navigate.html in a web browser. (copied from main text).

GitHub Large File Storage (LFS)

Git LFS (Large File Storage) is an extension to Git that enables handling large (data) files efficiently. Traditional Git is not optimized for storing large (binary) files, which can cause performance issues and inflate the repository size. Git LFS addresses this problem by storing large files outside the main Git repository on a separate server and only keeping pointers to these large files in the repository itself. Only the versions of the large files needed for the current ‘checkout’ are downloaded.
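
To make the pointer mechanism concrete, the following minimal Python sketch builds the text of a pointer for a given large file. It assumes the pointer layout of the Git LFS v1 specification (a version line, a SHA-256 object id, and the file size in bytes). The function name `lfs_pointer_text` is hypothetical and is not part of Git LFS or ENCORE; in practice the `git lfs` client creates these pointers itself.

```python
import hashlib
from pathlib import Path


def lfs_pointer_text(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the text of a Git LFS v1-style pointer for the file at `path`.

    Hypothetical helper for illustration only; the real pointers are
    written by the `git lfs` client.
    """
    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as handle:
        # Read in chunks so that large data files need not fit in memory.
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{digest.hexdigest()}\n"
        f"size {size}\n"
    )
```

The pointer is only a few lines of text regardless of the size of the data file, which is why the Git repository itself stays small while the file content lives on the LFS server.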

Currently, every account using Git LFS receives 1 Gbyte of free storage and 1 Gbyte/month of free bandwidth, which is far from sufficient for most projects performed in our group. To extend these quotas, one needs to purchase additional data packs. One data pack costs $5/month and provides a monthly quota of 50 Gbyte for bandwidth (which is also consumed by anyone cloning the repository) and 50 Gbyte for storage. You can purchase as many data packs as you need. For example, 150 Gbyte of storage requires three data packs, which can quickly become expensive, especially for long-term data retention. Additionally, data stored in the local (hard disk) Git repository also consumes local storage, which may create additional costs.

  • Example. For AIRR-seq B-cell/T-cell repertoire experiments we would need about 150 Gbyte for a full analysis, corresponding to 3 data packs ($15/month = $180/year). Each year we analyze multiple projects that together add about 800 Gbyte of data ($960/year), and we currently hold about 8000 Gbyte of data in total ($9600/year), which we store for free on a Dutch cloud-storage facility.
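
The cost figures in this example follow directly from the data pack pricing quoted above. A minimal sketch of the calculation, assuming that only storage (not bandwidth) determines the number of packs and ignoring the 1 Gbyte free allowance; the function name `lfs_yearly_cost` is hypothetical:

```python
import math

PACK_STORAGE_GB = 50      # storage per data pack, as quoted above
PACK_PRICE_USD_MONTH = 5  # price per data pack per month, as quoted above


def lfs_yearly_cost(storage_gb: float) -> tuple[int, int]:
    """Return (number of data packs, yearly cost in USD) for a storage need.

    Assumption: bandwidth use and the free allowance are ignored.
    """
    packs = math.ceil(storage_gb / PACK_STORAGE_GB)
    return packs, packs * PACK_PRICE_USD_MONTH * 12


for label, gb in [("single AIRR-seq project", 150),
                  ("data added per year", 800),
                  ("current total", 8000)]:
    packs, usd = lfs_yearly_cost(gb)
    print(f"{label}: {gb} Gbyte -> {packs} data packs, ${usd}/year")
```

This reproduces the 3, 16, and 160 data packs behind the $180, $960, and $9600 yearly amounts. Bandwidth consumed by cloning would come on top of these figures.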

In principle, LFS makes it possible to host complete ENCORE projects on GitHub, but this seems useful only if one intends to use the data with Codespaces, since otherwise the data can be shared/archived as part of the ENCORE sFSS. The pricing of LFS may also be a significant limitation, potentially increasing storage costs if free storage is not available, as one would likely maintain a local copy as well. Additionally, storing confidential or patient data outside the firewall of a research institute or hospital may not be permitted, even temporarily. Furthermore, the fact that GitHub is owned by Microsoft could raise additional concerns regarding data privacy and security.

Documentation: https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage

GitHub Codespaces/ development containers

Codespaces is a very useful functionality if one needs a cloud-based IDE and if every member of the project team must use the same development environment and tools. Within Codespaces, a large range of extensions and tools can be installed. When working with Codespaces, one is using a Docker container on a virtual machine, which largely supports reproducibility. The Docker image is hosted in the cloud, where anyone can use it. However, Codespaces is meant for software development, not necessarily for (large-scale) data analysis/simulations. Codespaces may also be too restrictive if one needs specific hardware (e.g., GPUs) or complete control over the environment. Nevertheless, Codespaces may prove very useful for software development and testing, and also for specific types of data analysis projects.

In principle, one could host a complete ENCORE project (i.e., not only the \Processing directory; Figure 1) on GitHub and make use of Codespaces for the software development and data analysis/simulations/etc. This would improve reproducibility but in practice one would run into various limitations:

  • GitHub file size and storage limits. ENCORE contains much more than only code and documentation. Given current GitHub limits (<5 Gbyte for a repository, and 100 Mbyte for single files) it would in many cases not be possible to host a complete project on GitHub.

  • Large datasets. Projects that make use of large datasets are limited by the default GitHub storage limits and/or Codespaces limits (currently, 15 GB/month and 120 core hours/month for a GitHub Free personal account; https://docs.github.com/en/billing/managing-billing-for-github-codespaces/about-billing-for-github-codespaces#pricing-for-paid-usage). Consequently, for large datasets one needs to resort to, for example, Git LFS (see above), which also has limits for free accounts. Moreover, using Codespaces in combination with LFS implies two separate bills (one for storage used in LFS and one for storage in Codespaces). Also note that a computational project may produce a large number of (large) output files (data tables, images, objects, etc.), which may further increase storage requirements and costs and which require some mechanism to store them on LFS once committed to the repository (it is unclear to us whether new output files can be pushed to LFS from Codespaces at all).

    • Example. For (single-cell) RNA-seq transcriptomics experiments we receive data files of up to 15 Gbytes per sample. For lipidomics/metabolomics studies we typically have between 120 and 500 Mbyte per sample. For studies with multiple samples (tens to hundreds) for which many and/or large (intermediate) output files are generated, the storage requirement will rapidly increase. For example, for AIRR-seq experiments we typically receive about 15 Gbytes of raw data for a single project consisting of multiple samples, which is increased to about 150 Gbytes after the full analyses.
  • Computation time. For the free version of Codespaces computation time is limited to 120 core hours/month, which might be sufficient to test (parts of) software but in general is not sufficient for a full data analysis/simulation. Nowadays, many researchers run their analyses/simulations on local/national computer facilities, which may offer more compute power and/or other functionalities than those provided by GitHub Codespaces, and/or may be free or cheaper.

    • Example. For a typical AIRR-seq project we require about 1000 CPU hours. On a yearly basis we use about 50,000 CPU hours for AIRR-seq projects. We currently run these projects on the Dutch national compute infrastructure (https://www.surf.nl/en/services) for free. Using a Codespaces machine with 32 cores, we would pay $4500 yearly. In addition, we would have to adapt our code to a multicore machine, since our software is currently designed to run on multiple single-core CPUs. If we instead used parallel computing on multiple single-core CPUs with Codespaces, the costs would be about twice as high, that is, $9000 yearly. Moreover, we would run into data size/file limits in this case.
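
The yearly Codespaces amounts in the computation time example can be reproduced with a short calculation. The sketch below does not use an official price list: it back-derives the hourly rate implied by the quoted numbers ($4500 for 50,000 CPU hours on a 32-core machine). For the second scenario it assumes that every single-core job is billed for two cores at the same per-core rate, which is one possible reading of "about twice as high". All names are hypothetical.

```python
CPU_HOURS_PER_YEAR = 50_000   # from the example above
QUOTED_COST_32_CORE = 4_500   # USD per year, from the example above
CORES_LARGE = 32

# Wall-clock hours on one 32-core machine, assuming perfect parallel scaling.
machine_hours = CPU_HOURS_PER_YEAR / CORES_LARGE

# Hourly rates implied by the quoted yearly cost (not an official price).
implied_rate_per_machine_hour = QUOTED_COST_32_CORE / machine_hours
implied_rate_per_core_hour = implied_rate_per_machine_hour / CORES_LARGE


def yearly_cost(cpu_hours: float, cores_billed_per_job: int) -> float:
    """Cost when each CPU hour of work is billed for `cores_billed_per_job` cores."""
    return cpu_hours * cores_billed_per_job * implied_rate_per_core_hour


print(f"32-core machine hours per year: {machine_hours:.1f}")
print(f"implied rate: ${implied_rate_per_machine_hour:.2f} per machine hour")
print(f"all cores used:          ${yearly_cost(CPU_HOURS_PER_YEAR, 1):.0f}/year")
print(f"2 cores billed per job:  ${yearly_cost(CPU_HOURS_PER_YEAR, 2):.0f}/year")
```

Under these assumptions the 50,000 CPU hours correspond to about 1560 wall-clock hours on a 32-core machine, and the doubling arises because half of the billed cores would sit idle.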

In summary, for projects that don’t face Codespaces CPU or storage limits, Codespaces could be a great functionality for the development and testing of the software and for small-scale data analyses, while at the same time preserving the compute environment. This would be in full agreement with ENCORE requirements since the GitHub repository (including the container specifications) could still be synchronized with the local ENCORE copy on one’s hard disk. Codespaces also integrates with local IDEs such as DataSpell/PyCharm (JetBrains) and Visual Studio Code. However, for larger data analysis/simulation projects one would probably run into GitHub/Codespaces CPU or storage limitations (even for paid usage). Moreover, researchers may prefer to use alternative compute infrastructures and/or alternative approaches towards software development.
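
Whether a compendium would fit within the repository and file size limits quoted above (5 Gbyte per repository, 100 Mbyte per file) can be estimated before attempting to host it on GitHub. The following is a minimal sketch; the function name `check_compendium_size` and the decimal interpretation of the units are assumptions for illustration and not part of ENCORE.

```python
from pathlib import Path

MAX_FILE_BYTES = 100 * 1000**2   # 100 Mbyte per file, as quoted above
MAX_REPO_BYTES = 5 * 1000**3     # 5 Gbyte per repository, as quoted above


def check_compendium_size(
    root: Path, max_file_bytes: int = MAX_FILE_BYTES
) -> tuple[int, list[Path]]:
    """Return the total size in bytes and the files above the per-file limit."""
    total = 0
    too_large = []
    for path in root.rglob("*"):
        if path.is_file():
            size = path.stat().st_size
            total += size
            if size > max_file_bytes:
                too_large.append(path)
    return total, sorted(too_large)


if __name__ == "__main__":
    # The current directory stands in for the root of a project compendium.
    total, too_large = check_compendium_size(Path("."))
    print(f"total size: {total / 1000**3:.2f} Gbyte "
          f"(repository limit exceeded: {total > MAX_REPO_BYTES})")
    for path in too_large:
        print(f"exceeds per-file limit: {path}")
```

The scan includes everything below the given directory, so a local `.git` directory would also be counted; the result is therefore an upper estimate.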

Documentation: https://docs.github.com/en/codespaces/overview

GitHub Copilot

Large Language Models (LLMs) such as those used by GitHub Copilot offer valuable functionalities for coding assistance, code documentation generation, and writing unit tests. GitHub ‘Copilot Chat’ is a chat interface to interact with GitHub Copilot within GitHub Codespaces or within supported IDEs (e.g., PyCharm/DataSpell from JetBrains). LLMs can also be used independently of the GitHub/Codespaces platform. Therefore, in general, LLMs can easily be used as part of the ENCORE framework. However, GitHub Copilot and alternatives like Codeium, AlphaCode, and ChatGPT Plus are not free; a subscription is required. Pricing for Copilot is currently 100 USD/year for individual users, which is not overly expensive given the potential benefits.

Documentation: https://github.com/features/copilot, https://docs.github.com/en/copilot/github-copilot-chat

GitHub Projects

Projects is a basic cloud-based project management tool that is visible to all project members. One of its key advantages is that it can be linked to multiple repositories, but it does not become part of a specific repository. In the context of ENCORE, we are not in favor of using Projects, since it is not in agreement with the first ENCORE requirement (self-contained project compendium).

Documentation: https://docs.github.com/en/issues/planning-and-tracking-with-projects/learning-about-projects/about-projects

GitHub Issues

Issues can be used to track ideas, bugs, etc. and can be used in the context of ENCORE but should not contain important documentation or discussions that we require to be part of ENCORE (first requirement: self-contained project compendium). Consequently, we mainly use it to document bugs or other minor software issues.

Documentation: https://github.com/features/issues

GitHub Discussions

GitHub Discussions is a platform to share information, ask questions, or have discussions. It is possible to attach documents to a discussion, but it is less suited for large files. A ‘Discussion’ is linked to a specific repository but is not part of the repository and, therefore, cannot be synchronized with the repository. Currently, Discussions also cannot be exported. Consequently, it is not in agreement with the first ENCORE requirement (self-contained project compendium). Nevertheless, Discussions has clear advantages. Project members can easily engage in project/software discussions that can be organized in different sections and threads. Discussions can be linked to GitHub Issues. Still, we believe it is better to have all documentation within the sFSS and close to the files (e.g., data, results, etc.) it is documenting, which is more difficult to achieve with a separate (cloud-based) forum. Currently, many of our ENCORE projects are hosted on the computers of the project members and synchronized to a cloud system (e.g., Dropbox, SURFdrive), making them accessible to all project members. While they can use the LabJournal.docx or other files to engage in discussions, this method is admittedly less convenient. A project team may therefore decide to use a cloud-based system for project discussions. For example, something simple like Google would be useful and would also allow the discussions to be downloaded into the ENCORE project at the moment of sharing or archiving.

However, we recently started to use ‘Discussions’ for the ENCORE template repository (https://github.com/EDS-Bioinformatics-Laboratory/ENCORE/discussions). This provides a channel to communicate with the scientific community about ENCORE. This Discussion section now reflects part of the steps that are to be addressed as part of ongoing discussions and development.

Documentation: https://docs.github.com/en/discussions

GitHub Actions

GitHub Actions enable the setup and execution of software development workflows as part of a repository. This can, for example, be used to automatically build, test, and deploy software. Actions is probably more suited for (multi-member) software development projects and less suited for initiating data analysis/simulations. In the context of ENCORE, Actions might be useful for automatically running unit tests when new code is pushed to the repository. Using Actions, these unit tests can be executed on a GitHub or local server (hosting the data). Depending on the setup, limits to the use of Actions may apply (https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits).

Documentation: https://docs.github.com/en/actions

GitHub Wiki

“A wiki is a form of online hypertext publication that is collaboratively edited and managed by its own audience directly through a web browser. A typical wiki contains multiple pages that can either be edited by the public or limited to use within an organization for maintaining its internal knowledge base” (https://en.wikipedia.org/wiki/Wiki). A GitHub Wiki is not part of the repository itself. It can be synchronized separately, but this will not pull/clone images used in the Wiki documents.
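
If a Wiki is nevertheless cloned into a compendium, one way to see which images would be missing is to scan the cloned pages for image references that point to remote URLs. The following is a minimal sketch under the assumption that the pages are Markdown files (`*.md`) in a single directory; the function name `remote_images` and the regular expression are illustrative and cover only the common Markdown and HTML image notations.

```python
import re
from pathlib import Path

# Markdown image syntax ![alt](target) and HTML <img ... src="target">.
IMAGE_PATTERN = re.compile(
    r'!\[[^\]]*\]\(\s*([^)\s]+)|<img[^>]*\bsrc=["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def remote_images(wiki_dir: Path) -> dict[str, list[str]]:
    """Map each Markdown page to the image targets that are remote URLs."""
    found = {}
    for page in sorted(wiki_dir.glob("*.md")):
        text = page.read_text(encoding="utf-8")
        # Each match fills exactly one of the two capture groups.
        targets = [md or html for md, html in IMAGE_PATTERN.findall(text)]
        remote = [t for t in targets if t.startswith(("http://", "https://"))]
        if remote:
            found[page.name] = remote
    return found
```

Images listed by such a scan would have to be downloaded separately and stored in the compendium for the documentation to remain self-contained.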

We have used wikis in the past for project documentation. For ENCORE we did not consider the use of the GitHub Wiki for three reasons: (i) We do not want to rely on a specific documentation system outside ENCORE (first requirement). Although it is possible to pull/clone the Wiki associated with a repository, this will not clone any images that are used in the documentation. In addition, within ENCORE we prefer to have documentation in the subdirectory of relevance. (ii) Markdown formatting is still too limited in specific cases. (iii) Wikis are not free for our GitHub Organization account unless a repository is public, which is neither always desired nor possible (e.g., confidential information) during the execution of a project.

However, we created a Wiki for the ENCORE template GitHub repository (https://github.com/EDS-Bioinformatics-Laboratory/ENCORE/wiki) and for the ENCORE-AUTOMATION repository (https://github.com/EDS-Bioinformatics-Laboratory/ENCORE_AUTOMATION/wiki) containing general basic information and documentation.

Documentation: https://docs.github.com/en/communities/documenting-your-project-with-wikis/about-wikis

Summary

This document describes our view on the use of complementary software tools alongside ENCORE to further improve reproducibility, in the context of four of the requirements that guided the development of ENCORE. Specifically, we focused on requirements 1, 5, 7, and 8, which are detailed in the section ‘ENCORE requirements’ above.

With respect to the use of complementary tools our view is as follows:

  1. Any platform designed to support reproducibility, including ENCORE, should avoid imposing significant restrictions on researchers (e.g., in terms of use of software tools). Excessive constraints may be perceived by individual researchers as too disruptive, leading to a platform that will not be adopted by the community.

  2. ENCORE is designed to accommodate various styles of working and to be compatible with a large range of software tools. Consequently, it does not impose the use of any specific software tool except for Git/GitHub (Figure 1; which is easily replaced with another versioning system). However, ENCORE users are encouraged to utilize complementary tools that enhance reproducibility, including those for (i) preservation of the compute environment, (ii) software development, (iii) workflow management, (iv) (software) documentation, and (v) project management. In fact, some of these tools are essential for improving reproducibility but currently it is left to the individual researcher to make appropriate choices.

  3. The first ENCORE requirement (a single self-contained project compendium) excludes the use of external (cloud-based) tools/platforms that host project documentation, project discussions, data, etc. but do not allow this information to be synchronized with or downloaded into the ENCORE project. In addition, we prefer to store project documentation in the appropriate ENCORE subdirectories.

Since ENCORE relies on Git/GitHub for software versioning, it raises the question of how other functionalities of GitHub can complement ENCORE or whether the GitHub platform could serve as an alternative to ENCORE. We summarize our evaluation in the following points:

  1. Git/GitHub has been developed for software development and version control of text-based documents, while the focus of ENCORE is on computational research which includes but goes beyond mere software development. ENCORE, therefore, includes much more than code, and using the GitHub platform as a replacement for ENCORE could be considered misuse of the GitHub platform.

  2. GitHub offers numerous functionalities that could enhance reproducibility. However, there may be limitations depending on the type of project. For example, projects with high CPU or storage requirements may face functional constraints and incur additional costs. Additionally, not all GitHub functionalities align with the first ENCORE requirement of a single self-contained project compendium.

  3. A researcher may prefer the use of (free/cheaper) local/national compute infrastructure, or there might be data privacy and confidentiality issues that prevent the use of GitHub for large-scale calculations and/or data storage.

  4. Therefore, it is up to individual researchers to decide which GitHub functionalities to use in conjunction with ENCORE for their specific projects.