Project repo org - core-unit-bioinformatics/knowledge-base GitHub Wiki

author | date | tags
PE | 2025-02-13 | convention, rule, policy, standard, structure, organization
PE | 2024-05-09 | cubi, internal, convention, sop, rule, policy, standard

CUBI project repository layout guidelines

These guidelines are intended to help harmonize the layout of project repositories for standard CUBI projects. Note that this document applies to project (git) repositories, which are not the same as project folders on a shared compute infrastructure.

The same rules as for all guidelines apply: within-project consistency is more important than adhering to these guidelines. However, any deviation from these guidelines must be well-motivated and must not be rooted in laziness, stubbornness, or bad time or self-management (the "I'll fix that later" disease).

Preliminaries

Initialize a new project repository following the CUBI default procedure:

  1. set the project name according to the CUBI naming standards
  2. initialize the project repository using the CUBI tool auto_git.py
  3. add and commit the required metadata using the CUBI tool update_metadata.py

Hard rules

  1. the main content type of a project repo must be (searchable) plain text
    • Windows-style line endings (carriage return / CR / CRLF / ^M / \r) are forbidden
    • an exception to this rule applies to client-supplied documents (see below, section Subfolders::Annotations)
  2. text files should be written in (github-flavored) Markdown and if so, must have the file extension .md
  3. readme files must be written in Markdown and be named README.md
  4. the project repository must not be used to capture results that were generated programmatically
    • in simple terms: no file that was generated by executing code should be committed to the repository
    • an exception to this rule applies to client-supplied documents (see below, section Subfolders::Annotations)
  5. static or slowly developing information such as sample sheets, annotation files etc. must always be kept up-to-date in the main branch of the repository; see section Project branches for details
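
The line-ending rule above lends itself to an automated check. The following is a minimal sketch, not an official CUBI tool: the function names are hypothetical, and the exemption for annotations/raw is taken from the Annotations section. It flags any file containing a carriage return byte, which is a crude test that would also flag binary files.

```python
from pathlib import Path
from typing import List, Tuple


def has_carriage_return(path: Path) -> bool:
    """Return True if the file contains at least one carriage return byte."""
    return b"\r" in path.read_bytes()


def find_crlf_files(root: Path, exempt: Tuple[str, ...] = ("annotations/raw",)) -> List[Path]:
    """List files below root that contain CR bytes.

    Skips the .git folder and all exempt subfolders (paths relative to root).
    """
    offenders = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root).as_posix()
        if not path.is_file() or rel.startswith(".git/"):
            continue
        if any(rel.startswith(prefix + "/") for prefix in exempt):
            continue
        if has_carriage_return(path):
            offenders.append(path)
    return offenders
```

Calling find_crlf_files(Path(".")) from the repository root before committing would list the offending files.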

Default project layout

The main README.md should document only the most relevant information about the project (purpose, minimal summary of data types and biological samples, pointers to working directories and result files etc.). The main README.md should not be used as an infinite "live documentation" of everything project-related.

Subfolders

The following subfolders are binding defaults that are likely --- but not necessarily --- present in every project repository that has reached a certain maturity.

Important: all (!) data organized in these subfolders must exist in the project repository main branch. That is, if they are added to a non-main branch, they must immediately be merged into main, e.g. by making use of git cherry-pick.

Annotations

This subfolder is intended to collect all client-supplied annotation files (most commonly Excel files). Such annotation files must be handled as follows:

  1. ROOT/annotations/README.md: create this readme to document important information etc.
  2. ROOT/annotations/raw: copy the annotation file verbatim, i.e. w/o changing anything about its content
    • only accept and commit client files that can be opened in some way in a Linux environment
    • raw annotation files are the only files where Windows-style line endings are allowed
    • add date: prefix the file name with a standard-formatted date if needed
  3. ROOT/annotations/export: produce an unchanged export of the raw file in a Linux-friendly format
    • most commonly: export an Excel table into a tab-separated plain text table, i.e. go from table.xlsx to table.tsv
    • if an Excel table contains several sheets, produce several exports that are appropriately named
  4. ROOT/annotations/norm: if necessary, post-process the exported annotation to make it usable for the subsequent analysis
    • the norm procedure should by all means be codified, e.g. using Jupyter Notebooks or a tiny script
    • if the norm procedure relies on an external software, i.e. you are just converting from format A into format B, then this step should (must ...) be realized as part of the actual analysis
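
The raw and export steps above can be sketched as follows. This is an illustration under stated assumptions, not a CUBI tool: the function names are hypothetical, ISO 8601 (YYYY-MM-DD) is assumed to be the "standard-formatted date", and the per-sheet naming scheme is only one possible reading of "appropriately named".

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def stage_raw_annotation(client_file: Path, repo_root: Path, received: date) -> Path:
    """Copy a client file verbatim into ROOT/annotations/raw with a date prefix."""
    raw_dir = repo_root / "annotations" / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)
    # assumption: ISO 8601 date (YYYY-MM-DD) as the file name prefix
    target = raw_dir / f"{received.isoformat()}_{client_file.name}"
    shutil.copy2(client_file, target)  # plain copy, the content is not touched
    return target


def export_path(raw_file: Path, repo_root: Path, sheet: Optional[str] = None) -> Path:
    """Build the TSV path in ROOT/annotations/export, one path per named sheet."""
    stem = raw_file.stem
    if sheet is not None:
        # assumption: the sheet name is appended to keep multi-sheet exports apart
        stem = f"{stem}.{sheet.strip().replace(' ', '_')}"
    return repo_root / "annotations" / "export" / f"{stem}.tsv"
```

The export itself (reading the Excel file and writing the table) is left out on purpose, since the tooling for it is not prescribed by these guidelines.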

Sample sheets

The subfolder ROOT/samples captures all properly formatted sample sheets. If more than one sample sheet is needed, the file name should at least be suggestive of the respective analysis, i.e. no names such as samplesA.tsv, samples_old.tsv or samples_new.tsv.

What is the goal for ROOT/samples? The ROOT/samples folder in the main branch must be the authoritative source for the information of which samples were processed using what input data. Collecting all kinds of sample sheets in one location also reveals differences in sample naming, which are a prime source of unnecessary boilerplate code that adapts sample names on-the-fly when moving from one stage of the analysis to the next.

Note: if revealed, differences in sample naming should be dealt with by writing a small script or notebook that normalizes sample names. Do not manually edit sample sheets in this folder; such changes are prone to getting lost!
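
A small normalization script of the kind mentioned in the note could look like the sketch below. The normalization rule (upper case, hyphen as the only separator) and the function names are hypothetical; the actual target format must follow the CUBI naming standards.

```python
import re
from typing import Dict, List


def normalize_sample_name(raw_name: str) -> str:
    """Map a free-form sample label to one spelling (hypothetical rule set)."""
    name = raw_name.strip().upper()
    # collapse every run of characters other than letters/digits into one hyphen
    name = re.sub(r"[^A-Z0-9]+", "-", name)
    return name.strip("-")


def build_name_map(raw_names: List[str]) -> Dict[str, str]:
    """Return a raw -> normalized lookup table that can be written next to the sample sheet."""
    return {raw: normalize_sample_name(raw) for raw in raw_names}
```

Keeping the rule in code means that the same mapping can be re-applied whenever a client delivers an updated table.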

Project-specific configuration files

The subfolder ROOT/config captures all project-related configuration files in the respective data format. For example, if a project requires executing several different Snakemake workflows, all project-specific YAML configuration files are collected in this folder.

What is the goal for ROOT/config? The ROOT/config folder in the main branch must be the authoritative source for the information of how data were processed. In particular, avoid creating different flavors of run config files "hidden" in subfolders or in different branches, because this makes tracing back analysis steps much harder.

Extensive documentation

The subfolder ROOT/docs captures detailed project documentation. If ROOT/docs exists, it must contain a main ROOT/docs/README.md file. For larger projects, ROOT/docs may be logically subdivided, e.g. into ROOT/docs/meetings, ROOT/docs/notes, ROOT/docs/methods and so on.

Code

Update 2025

    Please pay attention to the updates below --- all code should be organized
    in an appropriate folder structure underneath codebase. Older projects
    do not have to be reorganized to meet that requirement.

    By default, there should be one analysis subfolder for standardizing
    metadata, in particular sample names, which are then propagated throughout
    the project in other parts of the codebase.

    That means, it is strongly recommended that there is one folder tree like this
    ROOT/codebase/metadata/<MORE-SUBFOLDERS-AS-REQUIRED>
    that contains code for normalizing the basic metadata of the workflow, e.g., sample
    names and sample annotations.

Code must be organized as follows:

  1. ROOT/notebooks: Jupyter notebooks, logically sorted into subfolders if applicable
    • Deprecated as of 2025: create notebooks/ subfolders underneath codebase/<ANALYSIS-NAME> if required
    • ROOT/notebooks/envs: folder to collect Conda environment specifications for Jupyter notebooks
  2. ROOT/scripts: stand-alone scripts (!), e.g., to normalize annotation files or metadata tables
    • Deprecated as of 2025: create scripts/ subfolders underneath codebase/<ANALYSIS-NAME> if required
    • ROOT/scripts/envs: folder to collect Conda environment specifications for scripts
    • for example, scripts in this folder are executed to normalize an annotation table, i.e. they read a table from ROOT/annotations/export, clean up its content and write the clean version to ROOT/annotations/norm. Other analysis code (see next, point 3) may then take the cleaned up annotation table as additional input downstream.
  3. ROOT/codebase/<ANALYSIS-NAME>: for project-specific analysis code that represents a coherent unit, create an appropriately named subfolder and organize your code as you see fit, or as recommended for the respective workflow ecosystem if applicable. Examples:
    • update 2025: ROOT/codebase/metadata/<SUBFOLDERS> for code normalizing/standardizing metadata records, in particular related to sample naming and sample annotation. Must exist at most once.
    • the analysis code is developed in the form of a Snakemake workflow. According to Snakemake best practices, the following folder structure would likely be created:
      • ROOT/codebase/<ANALYSIS-NAME>/workflow: default root folder of a Snakemake workflow
      • ROOT/codebase/<ANALYSIS-NAME>/workflow/scripts: scripts used in the workflow
      • ROOT/codebase/<ANALYSIS-NAME>/workflow/envs: Conda environment specs for the workflow
      • ROOT/codebase/<ANALYSIS-NAME>/workflow/rules: Snakemake files
      • and so on ...
    • the analysis code is just a bunch of scripts:
      • ROOT/codebase/<ANALYSIS-NAME>/scripts: the scripts
      • ROOT/codebase/<ANALYSIS-NAME>/envs: if needed, special environment specs
      • and so on ...
    • usually, when starting project-specific code development work for data analysis, this development work would happen in a new analysis- branch (see below, Project branches)
    • all analyses in ROOT/codebase must briefly be documented in the main branch ROOT/README.md or ROOT/docs/README.md
    • note: for project analyses that only make use of generic workflows, do not create a subfolder here to hold the workflow configs. Workflow configs belong in the ROOT/config subfolder and sample sheets in the ROOT/samples subfolder.
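
The folder structures listed under point 3 can be created with a few lines of code. The following is a minimal sketch with hypothetical names; the .gitkeep placeholder files are an assumption added here because git does not track empty folders, and they are not part of the guideline.

```python
from pathlib import Path
from typing import List

# subfolders as listed above for the two example cases
SNAKEMAKE_SUBFOLDERS = ("workflow/scripts", "workflow/envs", "workflow/rules")
SCRIPT_SUBFOLDERS = ("scripts", "envs")


def scaffold_analysis(repo_root: Path, analysis_name: str, snakemake: bool = True) -> List[Path]:
    """Create ROOT/codebase/<ANALYSIS-NAME> with the default subfolders."""
    base = repo_root / "codebase" / analysis_name
    subfolders = SNAKEMAKE_SUBFOLDERS if snakemake else SCRIPT_SUBFOLDERS
    created = []
    for sub in subfolders:
        folder = base / sub
        folder.mkdir(parents=True, exist_ok=True)
        # assumption: placeholder file so that git records the otherwise empty folder
        (folder / ".gitkeep").touch()
        created.append(folder)
    return created
```

For example, scaffold_analysis(Path("."), "metadata", snakemake=False) would create ROOT/codebase/metadata/scripts and ROOT/codebase/metadata/envs.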

What is the goal for ROOT/codebase/<ANALYSIS-NAME>? Several developers can implement different stages of the analysis in parallel without creating any conflicting commits. Hence, a final merge of all analysis- branches (see next section) back into main should come with little overhead.

Project branches

Standard CUBI project repositories can deviate from the default development process in that there is likely no need to start with a prototype branch and a dedicated dev branch may also not be needed. If several developers contribute to a project repository, the exact process should be discussed among them.

However, as explained above, static or slowly developing information such as sample sheets, annotation files and project documentation must always be kept up-to-date in the main branch. This implies that code solely developed for the purpose of dealing with such quasi-static files must exist in main.

For larger, project-specific code developments, e.g. a non-generic workflow that performs an integrative analysis, a separate branch with the prefix analysis- should be created. This is analogous to the feature- branch in workflow or tool repositories, with the important distinction that an analysis- branch may exist for much longer (as long as the analysis has not been finished). If the overall project development suggests making a release to mark certain milestones, e.g. paper submitted or similar, it is strongly suggested to merge any analysis- branch back into main.

The existence and the purpose of all analysis- branches must be documented in the main README.md (or docs/README.md) file of the project repository.
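
The documentation requirement for analysis- branches can be checked mechanically. The sketch below is an illustration only: the function name is hypothetical, and it merely tests whether each branch name is mentioned in the readme text, not whether its purpose is actually explained.

```python
import re
from typing import List


def undocumented_analysis_branches(branches: List[str], readme_text: str) -> List[str]:
    """Return all analysis- branches whose name is not mentioned in the readme."""
    missing = []
    for branch in branches:
        if not branch.startswith("analysis-"):
            continue
        # match the full branch name only, so that "analysis-qc" is not
        # considered documented by a mention of "analysis-qc-v2"
        pattern = rf"(?<![\w-]){re.escape(branch)}(?![\w-])"
        if re.search(pattern, readme_text) is None:
            missing.append(branch)
    return missing
```

The branch list could, for instance, be obtained from the output of git branch and the text from ROOT/README.md or ROOT/docs/README.md.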

Important note about workflows: the above applies to non-generic, project-specific workflows. If project data are analyzed with a standard CUBI workflow, there is no need to create a separate analysis- branch. This fact must simply be recorded in the project's pyproject.toml and documented in the main project README.md (or docs/README.md). If applicable, the relevant config files and sample sheets must be sorted into ROOT/config and ROOT/samples.
