How to use Recipes

This guide provides the essential information for using MrBiomics Recipes. Following these steps will help you successfully adapt and run these end-to-end best practice analyses on your own data.

Definition & Objectives

A Recipe combines several independent self-contained MrBiomics Modules into a comprehensive, reproducible, end-to-end analysis workflow.

Recipe wikis focus on the strategic 'why' and 'how' of the analysis—explaining the scientific reasoning behind each step and the process of interpreting results to make informed decisions. The technical 'what'—such as specific parameters and code—is thoroughly documented within the modules and their configurations.

Each Recipe is designed to achieve three primary goals:

  1. To Demonstrate & Teach: Showcase a complete, best-practice analysis (e.g., for RNA-seq Analysis) from start to finish, serving as a reproducible, step-by-step tutorial.
  2. To Validate & Verify: Prove the effectiveness and scientific validity of our methods by successfully re-analyzing high-quality, published datasets and reproducing their key biological findings.
  3. To Empower & Accelerate: Provide a robust template that you can directly apply to your own data, allowing you to leverage our expertise and accelerate your research without reinventing the wheel.

(*) Reproducibility is achieved by directly using the specified versions of the respective MrBiomics Modules from GitHub and by tracking all changes to the configurations under version control.

[!NOTE] How reproducibility is built in by design: MrBiomics ensures reproducibility through two key mechanisms: it uses specific, locked versions of each module directly from GitHub, and it defines all parameters in configuration/annotation files that are under version control. This combination creates a complete, traceable record of your analysis.
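As an illustration, module pinning is typically expressed with Snakemake's module directive in the parent Snakefile. The following is a minimal sketch in which the module name, version tag, and config key are illustrative; the exact versions used are specified per Recipe and in the Module Usage in Projects wiki.

# pin a MrBiomics module to a specific GitHub release in the parent Snakefile
module dea_limma:
    snakefile:
        github("epigen/dea_limma", path="workflow/Snakefile", tag="v1.0.0")
    config:
        config["dea_limma"]

# import all rules of the module under a distinct prefix
use rule * from dea_limma as dea_limma_*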

Prerequisites

This guide covers the general principles for Recipes. For more detailed instructions on the technical aspects of running the underlying MrBiomics Modules, please read the respective wiki pages (Installation, Configuration, Execution, and Module Usage in Projects) beforehand.

All Recipes assume a consistent project structure. Before running, ensure you have the following directories set up in your repository (i.e., your project's root folder):

  • data/: For your raw input data files (not necessary for re-running the Recipe on public data).
  • resources/: For database files, genome annotations, etc.
  • results/: The designated output directory for all generated files.

[!TIP] On Unix/Linux/HPC systems, or when working with large datasets, we recommend using symbolic links (ln -s) so that your data/, resources/, and results/ folders point to another location (e.g., a different partition). This avoids the unnecessary accumulation of large files in your repository.
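For example, on a shared system this could be set up as follows; the target paths are placeholders for your own storage location:

# create the project folders on a large-storage partition and link them into the repository
mkdir -p /scratch/myproject/data /scratch/myproject/resources /scratch/myproject/results
ln -s /scratch/myproject/data data
ln -s /scratch/myproject/resources resources
ln -s /scratch/myproject/results results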

Run them Module-by-Module

Each Recipe comes with pre-filled configuration and annotation files to analyze a specific public dataset and can be run end-to-end to reproduce the reported results out-of-the-box.

To apply a Recipe to your own data, you must adapt or replace these files with your own and execute the workflow iteratively, module by module.

[!IMPORTANT] Do not run the entire Recipe from start to finish on your first try. Instead, execute it one module at a time. This iterative process is the key to a successful and robust analysis.

  1. Run only the first module in the chain (e.g., fetch_ngs) by commenting out the other modules' results in the target rule all of the main Snakefile (see the sketch after this list).
  2. Inspect the results and diagnostic plots to understand what happened and ensure quality and correctness.
  3. Configure the next module using the outputs and insights from the previous module.
  4. Repeat this process until you reach the final module.
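A minimal sketch of what this could look like in the main Snakefile's target rule; the module and rule names are illustrative and depend on how the modules are imported in your project:

# target rule: request only the first module's results for now;
# comment the downstream modules back in one at a time
rule all:
    input:
        rules.fetch_ngs_all.input,
        # rules.rnaseq_pipeline_all.input,
        # rules.spilterlize_integrate_all.input,
        # rules.dea_limma_all.input,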

This iterative approach allows you to make informed decisions at each stage. By inspecting the outputs of each module, you can understand how the data is being transformed, catch potential issues early (e.g., a sample failing QC), and tailor the analysis to the specific needs of your data.

This iterative method allows for a quick initial completion, followed by refinement in subsequent iterations based on your own feedback or that of collaborators. Adjustments in later iterations are straightforward, requiring only changes to individual configurations or annotations. Ultimately, you end up with a reproducible and readable end-to-end analysis for each dataset.

Why Module-by-Module?

When applying a Recipe to your own data, the module-by-module approach is not just a best practice—it's a technical requirement. Many Recipes contain modules that depend on files generated by upstream modules for their configuration and annotation. For example, a sample annotation file created by rnaseq_pipeline is needed to configure spilterlize_integrate.

[!CAUTION] If you try to run a full Recipe end-to-end before all necessary configuration inputs exist, Snakemake will fail.

The solution is the iterative, module-by-module approach:

  1. Run the upstream module (e.g., rnaseq_pipeline).
  2. Locate the generated file needed for configuration (e.g., results/MyData/rnaseq_pipeline/counts/sample_annotation.csv).
  3. Copy (and rename) this file into a static location, such as your project's config/MyData/ directory.
  4. Update the configuration of the downstream module to point to this new, static file.

This makes your entire analysis re-runnable, reproducible, and easy to modify in the future.
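For the rnaseq_pipeline to spilterlize_integrate hand-off described above, this could look as follows; the dataset name, file paths, and configuration filename are illustrative:

# 2. locate the generated sample annotation and 3. copy it to a static location
mkdir -p config/MyData
cp results/MyData/rnaseq_pipeline/counts/sample_annotation.csv config/MyData/spilterlize_integrate_annotation.csv
# 4. point the annotation entry of the downstream module's configuration
#    (e.g., config/MyData/spilterlize_integrate_config.yaml) to this copied file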

Once this initial module-by-module analysis phase is complete, your analysis becomes a fully reproducible and automated workflow to be built upon. With all configuration and annotation files—including those informed by module outputs—statically in place, you can now execute the entire analysis from start to finish with a single command. This unlocks the true power of MrBiomics: you can easily tweak a parameter or change a setting, and then simply re-run the entire workflow to see the impact, making your research agile while keeping it reproducible.
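Assuming a standard Snakemake project layout with the main Snakefile at workflow/Snakefile, such a full run could be started from the project root with a single command; adjust the number of cores or the profile to your environment:

# run the complete, fully configured Recipe end-to-end
snakemake --use-conda --cores 8

# or, on a cluster, via a pre-configured Snakemake profile (profile name is illustrative)
snakemake --use-conda --profile slurm_profile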

Results

To ensure transparency and reproducibility, all figures and tables shown in a Recipe's wiki page are direct outputs from the Modules themselves—no results are re-plotted or altered. While the wiki highlights key findings, each Recipe is also part of our full, interactive Snakemake report where you can explore many more generated outputs. Remember, Recipes showcase core functionalities for standard analyses applied to publicly available data, but each Module is capable of much more. For complete details on all available features, please consult the individual module documentation.

"Bridge Rules" for End-to-End Execution (Advanced Topic)

A key challenge in creating fully end-to-end Recipes is that Snakemake must build its entire dependency graph (DAG) before execution. This becomes a problem when an upstream module generates files dynamically within a specified directory, while a downstream module requires those exact, specific filenames as input. This mismatch leads to a MissingInputException, as Snakemake cannot find a rule that produces the required file (only the parent directory).

To solve this, we employ a workaround we call the "bridge rule." It is a neat, if "dirty," trick that acts as a bridge, making dynamically generated files visible to Snakemake's DAG and enabling true end-to-end execution.

The concept is best explained with an example.

  • Scenario: The dea_limma module dynamically creates feature lists for each cell type comparison (e.g., Bcell_featureScores_annot.csv). The enrichment_analysis module needs this specific file as an input for its configuration.
  • The Problem: The main rule in dea_limma is only aware of its output directory (e.g., results/CorcesRNA/dea_limma/normCQN_OvA_cell_type/feature_lists/), not the individual files that will be created inside it.
  • The Solution: We add a "bridge rule" inside the dea_limma module itself.

This bridge rule has two key properties:

  1. Its output is defined with the exact, specific filename (using wildcards) that the downstream enrichment_analysis module needs, and it is flagged with the update() directive so that the file is not removed by Snakemake before rule execution.
  2. Its shell command is simply touch {output}.

This touch command does not perform any real computation; it only ensures the file exists and its timestamp is updated. When Snakemake builds the DAG, it sees that enrichment_analysis requires Bcell_featureScores_annot.csv. It then finds the bridge rule, which explicitly produces this file. The bridge rule, in turn, depends on the main dea_limma rule that creates the directory. The dependency chain is now complete and resolvable, and no unnecessary computation is performed.

# bridge rule to enable downstream processing
# requires knowing that the file will exist at that exact location, otherwise a MissingInputException is raised
rule fetch_file:
    input:
        dea_stats = os.path.join(result_path,'{analysis}','stats.csv'),
    output:
        feature_list = update(os.path.join(result_path,'{analysis}','feature_lists',"{group}_{type,(up_features.txt|up_features_annot.txt|down_features.txt|down_features_annot.txt|featureScores.csv|featureScores_annot.csv)}")),
    resources:
        mem_mb="1000",
    shell:
        """
        # only if the file already exists
        if [ -f {output.feature_list} ]; then \
            touch {output.feature_list}; \
        fi
        """

Bridge Rule used in dea_limma module

Why Not Use Checkpoints?

While Snakemake's checkpoints are designed to handle dynamic outputs, they are intended to work within a single workflow. They cannot resolve dependencies across different modules that are loaded into a parent Snakefile, which is the core architecture of MrBiomics Recipes. The bridge rule provides the necessary link across these module boundaries.

A Known Limitation & A Call for Collaboration

The bridge rule is a pragmatic workaround that enables the significant benefit of end-to-end automation, making Recipes more reproducible and easier to iterate on. We use this pattern in modules like fetch_ngs (to handle dynamic *.bam/*.fastq.gz filenames) and dea_limma (for feature lists, as described above).

[!NOTE] We acknowledge that this is a workaround for a known limitation in cross-module dependency resolution. We are always open to more elegant solutions. If you know of a better way to handle this challenge in Snakemake, please let us know—your contribution would be highly appreciated.