Calibration Architectures - laser-base/laser-core GitHub Wiki

LASER Calibration Architecture

Technical Choices

  • Optuna
  • Docker
  • Kubernetes
  • Azure


Design Summary

  1. Local Development & Testing
  • The controller runs natively on your machine.
  • The model (laser.py) and objective function run in transient Docker containers (one per worker).
  • A single database instance runs in a long-lived container, backed by persistent storage.
  2. Scaling to Kubernetes in Azure
  • The controller runs inside a Kubernetes pod instead of natively.
  • The model (laser.py) and objective function continue to run in transient worker containers, managed by Kubernetes.
  • The database runs as a persistent service in the cluster, backed by an Azure-managed storage solution (e.g., Azure Disk, Azure Database for PostgreSQL/MySQL).

We don't foresee needing to use Ray.

Rationale

Our Optuna + Docker + Kubernetes approach balances ease of use, scalability, and reproducibility, making it ideal for researchers who know Python but don’t want excessive complexity. Here’s why:

Key Strengths

  1. Simple Python Workflow for the Researcher

    • Researchers only need to write Python code for:
      • The disease model (laser.py).
      • The objective function (objective.py).
      • Any custom input parameters.
    • Optuna handles the optimization logic, reducing the need for manual hyperparameter tuning.
    • No need for researchers to manually manage containers—the infrastructure is abstracted.
  2. Local Development Feels Like Running a Normal Python Script

    • The controller (run.py) runs locally, allowing researchers to test calibration with minimal setup.
    • Workers execute in Docker containers, but the researcher doesn’t need to manage them explicitly.
    • The database (SQLite locally, MySQL in production) is abstracted away.

    Benefit: Researchers can start with a simple, local setup and later scale to Kubernetes without rewriting their code.

  3. Seamless Transition from Local to Cloud (Kubernetes on Azure)

    • The same Docker image used for local testing can be deployed to Azure Kubernetes Service (AKS).
    • Researchers don’t need to rewrite scripts to scale.
    • Kubernetes automatically spins up worker pods to handle additional trials.

    Benefit: Researchers don’t have to worry about managing infrastructure—scaling is handled by Kubernetes.

  4. Reproducibility & Version Control with Docker

    • Each trial runs in an isolated container, ensuring consistent execution.
    • All dependencies are packaged in the Docker image, preventing version conflicts.
    • Researchers can easily share their setup with colleagues by sharing the Docker image + objective function.

    Benefit: Eliminates the "works on my machine" problem.

  5. Persistence & Checkpointing with Database Integration

    • Optuna stores results in the database, ensuring that:
      • Partial runs are saved (no data loss if a run crashes).
      • Results can be reviewed and analyzed later.
    • The database runs in a persistent volume, making the calibration fault-tolerant.

    Benefit: Researchers can stop and restart calibration without losing progress.

  6. Familiarity

    • Our Infra Team already uses this stack (Optuna, Docker, Kubernetes, Azure) and has experience supporting it.

Alternatives: Pros & Cons

Approach                                     Pros                                   Cons
-------------------------------------------  -------------------------------------  ----------------------------------
Manually running Optuna trials in Python     ✅ Simple for small-scale runs          ❌ Hard to parallelize
(no containers)                              ✅ No Docker complexity                 ❌ No reproducibility
                                                                                    ❌ No fault tolerance
-------------------------------------------  -------------------------------------  ----------------------------------
Running everything on a single large server  ✅ No need for container orchestration  ❌ Doesn't scale easily
(no K8s)                                     ✅ Easier debugging                     ❌ Wastes resources if idle
                                                                                    ❌ Harder to share results
-------------------------------------------  -------------------------------------  ----------------------------------
Using cloud-only tools (Azure AutoML, etc.)  ✅ Fully managed (no infra to manage)   ❌ Less customization
                                             ✅ Scales well                          ❌ Vendor lock-in
                                                                                    ❌ Learning curve for Python users
-------------------------------------------  -------------------------------------  ----------------------------------
HPC clusters with SLURM                      ✅ Well-suited for supercomputing       ❌ Requires learning SLURM
                                             ✅ Can handle complex jobs              ❌ Not as flexible as Kubernetes
                                                                                    ❌ Harder to reproduce runs

Python-Friendly

  1. Python-first approach: No need to learn Kubernetes or Docker in-depth.
  2. Local-to-cloud flexibility: Start on a laptop, scale up when needed.
  3. Scalability without complexity: Kubernetes manages parallel workers.
  4. Reproducibility & sharing: Docker ensures that everyone gets the same environment.
  5. Fault tolerance: Optuna + DB ensures no lost work.