Project repo org - core-unit-bioinformatics/knowledge-base GitHub Wiki
author | date | tags |
---|---|---|
PE | 2025-02-13 | convention, rule, policy, standard, structure, organization |
PE | 2024-05-09 | cubi, internal, convention, sop, rule, policy, standard |
These guidelines are intended to help harmonize the layout of project repositories for standard CUBI projects. Note that this document applies to project (git) repositories, which are not the same as project folders on a shared compute infrastructure.
The same rules as for all guidelines apply: within-project consistency is more
important than adhering to these guidelines. However, any deviation from these
guidelines must be well-motivated and not be rooted in laziness, stubbornness or
bad time or self-management (the "I'll fix that later" disease).
Initialize a new project repository following the CUBI default procedure:
- set the project name according to the CUBI naming standards
- initialize the project repository using the CUBI tool `auto_git.py`
- add and commit the required metadata using the CUBI tool `update_metadata.py`
- the main content type of a project repo must be (searchable) plain text
  - obviously, Windows-style line endings (carriage return / CR / CRLF / `^M` / `\r`) are forbidden
  - an exception to this rule applies to client-supplied documents (see below, section Subfolders::Annotations)
- text files should be written in (github-flavored) Markdown and if so, must have the file extension `.md`
- readme files must be written in Markdown and be named `README.md`
- the project repository must not be used to capture results that were generated programmatically
- in simple terms: no file that was generated by executing code should be committed to the repository
- an exception to this rule applies to client-supplied documents (see below, section Subfolders::Annotations)
- static or slowly developing information such as sample sheets, annotation files etc. must always be kept up-to-date in the `main` branch of the repository; see section Project branches for details.
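The line-ending rule above can be checked mechanically before committing. The following is a minimal sketch, not a CUBI tool: the function names and the default exemption for `annotations/raw` are assumptions made for illustration. It flags every file that contains a carriage return byte, so binary files outside the exempt folder may be reported as well.

```python
from pathlib import Path


def has_carriage_return(path: Path) -> bool:
    """Return True if the file contains at least one carriage return byte."""
    return b"\r" in path.read_bytes()


def find_crlf_files(root: Path, exempt_dirs=("annotations/raw",)) -> list[Path]:
    """List files below root that contain CR bytes.

    The .git folder and all folders listed in exempt_dirs (hypothetical
    default: the raw client annotations) are skipped.
    """
    offenders = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if rel.startswith(".git/"):
            continue
        if any(rel.startswith(folder + "/") for folder in exempt_dirs):
            continue
        if has_carriage_return(path):
            offenders.append(path)
    return offenders
```

Calling `find_crlf_files(Path("."))` from the repository root returns the offending paths; an empty list means no Windows-style line endings were found outside the exempt folder.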
The main `README.md` should document only the most relevant information about
the project (purpose, minimal summary of data types and biological samples,
pointers to working directories and result files etc.). The main `README.md`
should not be used as an infinite "live documentation" of everything
project-related.
The following subfolders are binding defaults that are likely, but not necessarily, present in every project repository that has reached a certain maturity.
Important: all (!) data organized in these subfolders must exist in
the project repository `main` branch. That is, if they are added to a
non-main branch, they must immediately be merged into `main`, e.g. by making
use of `git cherry-pick`.
This subfolder is intended to collect all client-supplied annotation files (i.e., most commonly Excel files). Such annotation files must be handled as follows:
- `ROOT/annotations/README.md`: create this readme to document important information etc.
- `ROOT/annotations/raw`: copy the annotation file verbatim, i.e. without changing anything about its content
  - only accept and commit client files that can be opened in some way in a Linux environment
  - `raw` annotation files are the only files where Windows-style line endings are allowed
  - add date: prefix the file name with a standard-formatted date if needed
- `ROOT/annotations/export`: produce an unchanged export of the `raw` file in a Linux-friendly format
  - most commonly: export an Excel table into a tab-separated plain text table, i.e. go from `table.xlsx` to `table.tsv`
  - if an Excel table contains several sheets, produce several exports that are appropriately named
- `ROOT/annotations/norm`: if necessary, post-process the exported annotation to make it usable for the subsequent analysis
  - the `norm` procedure should by all means be codified, e.g. using Jupyter Notebooks or a tiny script
  - if the `norm` procedure relies on an external software, i.e. you are just converting from format A into format B, then this step should (must ...) be realized as part of the actual analysis
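As an illustration of a codified `norm` step, the sketch below reads a tab-separated table from `ROOT/annotations/export`, cleans it, and writes the result to `ROOT/annotations/norm`. The function name and the cleaning rules (trim whitespace, drop empty rows) are assumptions chosen for the example; a real project would apply whatever normalization its annotation actually needs.

```python
import csv
from pathlib import Path


def normalize_table(export_path: Path, norm_path: Path) -> int:
    """Copy a TSV export to the norm folder in cleaned form.

    Illustrative cleaning rules: strip surrounding whitespace from every
    cell and drop rows that are completely empty. Returns the number of
    rows written (including the header row, if any).
    """
    norm_path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(export_path, newline="", encoding="utf-8") as src, \
            open(norm_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t")
        # Force Unix line endings in the output, independent of the platform.
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in reader:
            cleaned = [cell.strip() for cell in row]
            if not any(cleaned):
                continue
            writer.writerow(cleaned)
            written += 1
    return written
```

Because the procedure is a script rather than a manual edit, it can be rerun whenever the client supplies an updated annotation file.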
The subfolder `ROOT/samples` captures all properly formatted sample sheets.
If more than one sample sheet is needed, the file name should at least be suggestive of the
respective analysis, i.e. no `samplesA.tsv`, `samples_old.tsv`, `samples_new.tsv` etc.
What is the goal for `ROOT/samples`? The `ROOT/samples` folder in the `main` branch
must be the authoritative source for the information of which samples were processed
using what input data. Collecting all kinds of sample sheets in one location also
reveals differences in sample naming, which are a prime source of unnecessary boilerplate
code to adapt sample names on-the-fly when moving from one stage of the analysis to
the next.
Note: if differences in sample naming are revealed, they should be dealt with by writing a small script or notebook that normalizes the sample names. Do not manually edit sample sheets in this folder; such changes are prone to getting lost!
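A small name-normalization script of the kind mentioned in the note could look like the following sketch. The normalization convention used here (trim, replace runs of non-alphanumeric characters with a hyphen, uppercase) and the sample names in the usage example are purely hypothetical placeholders; the actual convention must follow the CUBI naming standards.

```python
import re


def normalize_sample_name(raw_name: str) -> str:
    """Apply a placeholder naming convention to a single sample name."""
    name = re.sub(r"[^A-Za-z0-9]+", "-", raw_name.strip())
    return name.strip("-").upper()


def group_by_normalized(raw_names):
    """Group raw sample names by their normalized form.

    Groups with more than one entry reveal spelling variants of the
    same sample across different sample sheets.
    """
    groups = {}
    for raw in raw_names:
        groups.setdefault(normalize_sample_name(raw), []).append(raw)
    return groups
```

For example, `group_by_normalized(["NA12878.HiFi", "na12878_hifi", "HG002"])` puts the first two spellings into one group, which makes the naming difference explicit instead of hiding it in ad-hoc code downstream.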
The subfolder `ROOT/config` captures all project-related configuration files in the
respective data format. For example, if a project requires executing several different
Snakemake workflows, all project-specific YAML configuration files are collected in this folder.
What is the goal for `ROOT/config`? The `ROOT/config` folder in the `main` branch must be the
authoritative source for the information of how data were processed. In particular,
do not create different flavors of run config files "hidden" in subfolders
or in different branches, which makes tracing back analysis steps much harder.
The subfolder `ROOT/docs` captures detailed project documentation. If `ROOT/docs` exists,
it must contain a main `ROOT/docs/README.md` file.
For larger projects, `ROOT/docs` may be logically subdivided, e.g.
into `ROOT/docs/meetings`, `ROOT/docs/notes`, `ROOT/docs/methods` and so on.
Update 2025
Please pay attention to the updates below: all code should be organized
in an appropriate folder structure underneath `codebase`. Older projects
do not have to be reorganized to meet that requirement.
By default, there should be one analysis subfolder for standardizing
metadata, in particular sample names, which are then propagated throughout
the project in other parts of the codebase.
That means it is strongly recommended that there is one folder tree like
`ROOT/codebase/metadata/<MORE-SUBFOLDERS-AS-REQUIRED>`
that contains code for normalizing the basic metadata of the workflow, e.g., sample
names and sample annotations.
Code must be organized as follows:
1. `ROOT/notebooks`: Jupyter notebooks, logically sorted into subfolders if applicable
   - Deprecated as of 2025: create `notebooks/` subfolders underneath `codebase/<ANALYSIS-NAME>` if required
   - `ROOT/notebooks/envs`: folder to collect Conda environment specifications for Jupyter notebooks
2. `ROOT/scripts`: stand-alone scripts (!), e.g., to normalize annotation files or metadata tables
   - Deprecated as of 2025: create `scripts/` subfolders underneath `codebase/<ANALYSIS-NAME>` if required
   - `ROOT/scripts/envs`: folder to collect Conda environment specifications for scripts
   - for example, scripts in this folder are executed to normalize an annotation table, i.e. they read a table from `ROOT/annotations/export`, clean up its content and write the clean version to `ROOT/annotations/norm`. Other analysis code (see next, point 3) may then take the cleaned up annotation table as additional input downstream.
3. `ROOT/codebase/<ANALYSIS-NAME>`: for project-specific analysis code that represents a coherent unit, create an appropriately named subfolder and organize your code as you see fit, or as recommended for the respective workflow ecosystem if applicable. Examples:
   - update 2025: `ROOT/codebase/metadata/<SUBFOLDERS>` for code normalizing/standardizing metadata records, in particular related to sample naming and sample annotation. Must exist at most once.
   - the analysis code is developed in form of a Snakemake workflow. According to Snakemake best practices, the following folder structure would likely be created:
     - `ROOT/codebase/<ANALYSIS-NAME>/workflow`: default root folder of a Snakemake workflow
     - `ROOT/codebase/<ANALYSIS-NAME>/workflow/scripts`: scripts used in the workflow
     - `ROOT/codebase/<ANALYSIS-NAME>/workflow/envs`: Conda environment specs for the workflow
     - `ROOT/codebase/<ANALYSIS-NAME>/workflow/rules`: Snakemake files
     - and so on ...
   - the analysis code is just a bunch of scripts:
     - `ROOT/codebase/<ANALYSIS-NAME>/scripts`: the scripts
     - `ROOT/codebase/<ANALYSIS-NAME>/envs`: if needed, special environment specs
     - and so on ...
   - usually, when starting project-specific code development work for data analysis, this development work would happen in a new `analysis-` branch (see below, Project branches)
   - all analyses in `ROOT/codebase` must briefly be documented in the `main` branch `ROOT/README.md` or `ROOT/docs/README.md`
   - note: for project analyses that only make use of generic workflows, do not create a subfolder here to put the workflow configs. They belong into the `ROOT/config` subfolder and sample sheets into the `ROOT/samples` subfolder.
What is the goal for `ROOT/codebase/<ANALYSES>`? Several developers can implement different
stages of the analysis in parallel without creating any conflicting commits. Hence, a final
merge of all `analysis-` branches (see next section) back into `main` should come with
little overhead.
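The two example layouts for `ROOT/codebase/<ANALYSIS-NAME>` described above could be scaffolded with a few lines of Python. This is only a sketch under assumptions: the function name and the `snakemake` switch are hypothetical, and the folder lists simply mirror the examples given in this document rather than any official template.

```python
from pathlib import Path

# Subfolders taken from the two examples in this guideline.
SNAKEMAKE_SUBFOLDERS = ("workflow/scripts", "workflow/envs", "workflow/rules")
SCRIPT_SUBFOLDERS = ("scripts", "envs")


def scaffold_analysis(root: Path, analysis_name: str, snakemake: bool = True) -> list[Path]:
    """Create the folder tree for one analysis below ROOT/codebase.

    Existing folders are left untouched. Returns the list of subfolders
    that make up the layout.
    """
    base = root / "codebase" / analysis_name
    subfolders = SNAKEMAKE_SUBFOLDERS if snakemake else SCRIPT_SUBFOLDERS
    folders = []
    for sub in subfolders:
        folder = base / sub
        folder.mkdir(parents=True, exist_ok=True)
        folders.append(folder)
    return folders
```

Note that git does not track empty directories, so the folders only show up in the repository once they contain at least one committed file.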
Standard CUBI project repositories can deviate from the default development process
in that there is likely no need to start with a `prototype` branch and a dedicated `dev`
branch may also not be needed. If several developers contribute to a project repository, the
exact process should be discussed among them.
However, as explained above, static or slowly developing information such as sample sheets, annotation
files and project documentation must always be kept up-to-date in the `main` branch. This implies that
code solely developed for the purpose of dealing with such quasi-static files must exist in `main`.
For larger, project-specific code developments, e.g. a non-generic workflow that performs an integrative
analysis, a separate branch with the prefix `analysis-` should be created. This is analogous to the
`feature-` branch in workflow or tool repositories with the important distinction that an `analysis-`
branch may exist for much longer (as long as the analysis has not been finished). If the overall project
development suggests making a release to mark certain milestones, e.g. paper submitted or similar,
it is strongly suggested to merge any `analysis-` branch back into `main`.
The existence and the purpose of all `analysis-` branches must be documented in the
main `README.md` (or `docs/README.md`) file of the project repository.
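Whether every `analysis-` branch is at least mentioned in the readme can be checked with a simple helper. The sketch below is an assumption-laden illustration: the function name is hypothetical, the branch names must be supplied by the caller (e.g. collected from `git branch`), and the plain substring test only confirms that a branch name appears in the text, not that its purpose is actually explained.

```python
def undocumented_analysis_branches(branch_names, readme_text, prefix="analysis-"):
    """Return the sorted names of analysis branches not mentioned in the readme.

    Limitation: uses a substring match, so a branch whose name is a prefix
    of another documented branch name is also treated as documented.
    """
    return sorted(
        name
        for name in branch_names
        if name.startswith(prefix) and name not in readme_text
    )
```

An empty result means that every branch carrying the `analysis-` prefix appears somewhere in the readme text; any returned name points to a branch that still needs to be documented.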
Important note about workflows: the above applies to non-generic, project-specific workflows.
If project data are analyzed with a standard CUBI workflow, there is no need to create a separate
`analysis-` branch. This fact must simply be recorded in the project's `pyproject.toml`, documented in the
main project `README.md` (or `docs/README.md`) and, if applicable, relevant config files and
sample sheets sorted into `ROOT/config` and `ROOT/samples`.