Cookiecutter

What is cookiecutter?

Cookiecutter is a command-line tool that helps you to quickly create projects from predefined templates. It’s perfect for setting up Python packages and other types of projects with a consistent folder structure.

Our cookiecutter template is a fork of cookiecutter-bioinformatics-project and follows a similar layout for Snakemake workflows:

├── CITATION.cff       <- Contains metadata on how the project might eventually be published. 
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── config             <- Configuration options for the analysis. 
│   ├── config.yaml    <- Snakemake config file.
│   └── samples.tsv    <- A metadata table for all the samples run in the analysis.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── environment.yaml   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `conda env export > environment.yaml`
│
├── img                <- A place to store images associated with the project/pipeline, e.g. a
│                         figure of the pipeline DAG.
│
├── notebooks          <- Jupyter or Rmd notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── resources          <- Place for data. By default excluded from the git repository. 
│   ├── external       <- Data from third party sources.
│   └── raw_data       <- The original, immutable data dump.
│
├── results            <- Final output of the data processing pipeline. By default excluded from the git repository.
│ 
├── sandbox            <- A place to test scripts and ideas. By default excluded from the git repository.
│ 
├── scripts            <- A place for short shell or python scripts.
│ 
├── setup.py           <- Makes project pip installable (pip install -e .) so src can be imported
│
├── src                <- Source code for use in this project.
│   └── __init__.py    <- Makes src a Python module
│
├── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
│
├── workflow           <- Place to store the main pipeline for rerunning all the analysis. 
│   ├── envs           <- Contains different conda environments in .yaml format for running the pipeline. 
│   ├── rules          <- Contains .smk files that are included by the main Snakefile, including common.smk for functions. 
│   ├── scripts        <- Contains different R or python scripts used by the script: directive in Snakemake.
│   └── Snakefile      <- Contains the main entrypoint to the pipeline.
│
└── workspace          <- Space for intermediate results in the pipeline. By default excluded from the git repository.
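
A quick sanity check after generating a project is to confirm that key files from the tree above exist. The following is a minimal sketch: the function name check_layout and the particular subset of entries it checks are illustrative choices, not part of the template.

```shell
# Hypothetical helper: report which expected entries are missing from a
# generated project folder. The list is a subset of the tree above.
check_layout() {
    project_dir="$1"
    missing=0
    for entry in README.md Makefile environment.yaml \
                 config/config.yaml config/samples.tsv \
                 workflow/Snakefile src/__init__.py; do
        if [ ! -e "$project_dir/$entry" ]; then
            echo "missing: $entry"
            missing=1
        fi
    done
    # Exit status 0 means every listed entry was found.
    return "$missing"
}
```

Calling check_layout path/to/my_project prints one line per missing entry and returns a non-zero status if anything is missing.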

Your main code, with its different rules, is stored in a GNU Makefile, so that someone else can later run the whole pipeline with a single command, e.g. make test.

Why should I use cookiecutter?

The goal is to have a standardized folder structure to ensure consistency and reproducibility across your different research projects.

Setting up the cookiecutter template on HILBERT

Here is a brief tutorial for setting up the cookiecutter template on HILBERT.

HILBERT is not directly connected to the internet, so the setup there may differ slightly from the one on the DKFZ Cluster. Before following this tutorial, make sure you have a package manager such as conda installed and have set the right channels (conda-forge, bioconda) and channel priorities in your .condarc file. You can find a brief description on how to do this here.
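
The channel setup mentioned above could look like the following sketch. It writes an example file named condarc.example (a hypothetical name) into the current folder rather than touching your real ~/.condarc. The strict priority setting and the channel order are assumptions based on common bioconda usage, not something this page specifies. On HILBERT the channels may additionally need to point to a cluster-internal mirror, which is not shown here.

```shell
# Write an example conda configuration; copy it to ~/.condarc only after review.
cat > condarc.example <<'EOF'
channels:
  - conda-forge   # listed first, so it has the highest priority
  - bioconda
channel_priority: strict   # assumption: strict priority
EOF

# Show the result.
cat condarc.example
```

Equivalent settings can be made with conda config --add channels <name> and conda config --set channel_priority strict; note that --add puts the channel at the top of the list.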

Here is a step-by-step guide:

Set up the cookiecutter environment.

conda create --name cookiecutter_env cookiecutter

Activate your conda environment.

conda activate cookiecutter_env

The Koppstein Lab cookiecutter template is located at /gpfs/project/projects/KoppstBioCore/cookiecutter_template. Do not modify this template folder! Change into your analysis folder in KoppstBioCore, e.g. with cd analyses_username.
From there, you can now generate the predefined folder structure in your own project folder by passing the path to the template folder (which holds the template together with its metadata and JSON file) to cookiecutter inside the activated environment, e.g. cookiecutter ../cookiecutter_template. The path can be absolute or relative. Run the command in the analysis folder where you want the new project to be created.
Fill in the requested entries, e.g. by accepting the default values.
Happy coding! :smiley:
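
The steps above can be wrapped in a small helper, sketched here under a few assumptions: the function name generate_project is made up for illustration, and the two pre-flight checks (template folder present, cookiecutter on the PATH) are additions that this page does not prescribe. The template path is the one given above.

```shell
# Path of the shared template as given in this tutorial.
TEMPLATE_DIR="/gpfs/project/projects/KoppstBioCore/cookiecutter_template"

# Hypothetical helper: create a new project inside the folder given as $1,
# e.g. your analyses_username folder (placeholder name).
generate_project() {
    target_dir="$1"
    if [ ! -d "$TEMPLATE_DIR" ]; then
        echo "Template not found: $TEMPLATE_DIR" >&2
        return 1
    fi
    if ! command -v cookiecutter >/dev/null 2>&1; then
        echo "cookiecutter not on PATH; run 'conda activate cookiecutter_env' first" >&2
        return 2
    fi
    # The subshell keeps your current directory unchanged; cookiecutter
    # writes the new project into target_dir.
    (cd "$target_dir" && cookiecutter "$TEMPLATE_DIR")
}
```

With the environment activated, generate_project /gpfs/project/projects/KoppstBioCore/analyses_username would then prompt for the template entries as described above.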

Setting up the cookiecutter template on the DKFZ Cluster

Make sure you have the cookiecutter environment installed. It is assumed that you have already set up conda and its channels as described above.

conda create --name cookiecutter_env cookiecutter

Activate your conda environment.

conda activate cookiecutter_env

Go into your analysis folder.
Inside the activated conda environment, run the following command:

cookiecutter gh:koppsteinlab/cookiecutter-bioinformatics-project

To create your project from the template, fill in the requested entries, e.g. with the default values.
Happy coding! :smiley: