Calibration Architectures - laser-base/laser-core GitHub Wiki

LASER Calibration Architecture

Technical Choices

  • Optuna
  • Docker
  • Kubernetes
  • Azure


Design Summary

  1. Local Development & Testing
  • The controller runs natively on your machine.
  • The model (laser.py) and objective function run in transient Docker containers (one per worker).
  • A single database instance runs in a long-lived container, backed by persistent storage.
  2. Scaling to Kubernetes in Azure
  • The controller runs inside a Kubernetes pod instead of natively.
  • The model (laser.py) and objective function continue to run in transient worker containers, managed by Kubernetes.
  • The database runs as a persistent service in the cluster, backed by an Azure-managed storage solution (e.g., Azure Disk, Azure Database for PostgreSQL/MySQL).

We don't foresee needing to use Ray.

Rationale

Our Optuna + Docker + Kubernetes approach balances ease of use, scalability, and reproducibility, making it ideal for researchers who know Python but don’t want excessive complexity. Here’s why:

Key Strengths

  1. Simple Python Workflow for the Researcher

    • Researchers only need to write Python code for:
      • The disease model (laser.py).
      • The objective function (objective.py).
      • Any custom input parameters.
    • Optuna handles the optimization logic, reducing the need for manual hyperparameter tuning.
    • No need for researchers to manually manage containers—the infrastructure is abstracted.
  2. Local Development Feels Like Running a Normal Python Script

    • The controller (run.py) runs locally, allowing researchers to test calibration with minimal setup.
    • Workers execute in Docker containers, but the researcher doesn’t need to manage them explicitly.
    • The database (SQLite locally, MySQL in production) is abstracted away.

    Benefit: Researchers can start with a simple, local setup and later scale to Kubernetes without rewriting their code.

  3. Seamless Transition from Local to Cloud (Kubernetes on Azure)

    • The same Docker image used for local testing can be deployed to Azure Kubernetes Service (AKS).
    • Researchers don’t need to rewrite scripts to scale.
    • Kubernetes automatically spins up worker pods to handle additional trials.

    Benefit: Researchers don’t have to worry about managing infrastructure—scaling is handled by Kubernetes.

  4. Reproducibility & Version Control with Docker

    • Each trial runs in an isolated container, ensuring consistent execution.
    • All dependencies are packaged in the Docker image, preventing version conflicts.
    • Researchers can easily share their setup with colleagues by sharing the Docker image + objective function.

    Benefit: Eliminates the "works on my machine" problem.

  5. Persistence & Checkpointing with Database Integration

    • Optuna stores results in the database, ensuring that:
      • Partial runs are saved (no data loss if a run crashes).
      • Results can be reviewed and analyzed later.
    • The database runs in a persistent volume, making the calibration fault-tolerant.

    Benefit: Researchers can stop and restart calibration without losing progress.

  6. Familiarity

    • Our Infra Team already uses this stack (Optuna, Docker, Kubernetes, Azure) and has experience supporting it.

Alternatives: Pros & Cons

Approach                                     Pros                                   Cons
-------------------------------------------  -------------------------------------  ----------------------------------
Manually running Optuna trials in Python     ✅ Simple for small-scale runs          ❌ Hard to parallelize
(no containers)                              ✅ No Docker complexity                 ❌ No reproducibility
                                                                                    ❌ No fault tolerance
-------------------------------------------  -------------------------------------  ----------------------------------
Running everything on a single large server  ✅ No need for container orchestration  ❌ Doesn't scale easily
(no K8s)                                     ✅ Easier debugging                     ❌ Wastes resources if idle
                                                                                    ❌ Harder to share results
-------------------------------------------  -------------------------------------  ----------------------------------
Using cloud-only tools (Azure AutoML, etc.)  ✅ Fully managed (no infra to manage)   ❌ Less customization
                                             ✅ Scales well                          ❌ Vendor lock-in
                                                                                    ❌ Learning curve for Python users
-------------------------------------------  -------------------------------------  ----------------------------------
HPC clusters with SLURM                      ✅ Well-suited for supercomputing       ❌ Requires learning SLURM
                                             ✅ Can handle complex jobs              ❌ Not as flexible as Kubernetes
                                                                                    ❌ Harder to reproduce runs

Python-Friendly

  1. Python-first approach: No need to learn Kubernetes or Docker in-depth.
  2. Local-to-cloud flexibility: Start on a laptop, scale up when needed.
  3. Scalability without complexity: Kubernetes manages parallel workers.
  4. Reproducibility & sharing: Docker ensures that everyone gets the same environment.
  5. Fault tolerance: Optuna + DB ensures no lost work.