Calibration Architectures - laser-base/laser-core GitHub Wiki
# LASER Calibration Architecture

## Technical Choices
- Optuna
- Docker
- Kubernetes
- Azure
## Design Summary
- **Local Development & Testing**
  - The controller runs natively on your machine.
  - The model (`laser.py`) and objective function run in transient Docker containers (one per worker).
  - A single database instance runs in a long-lived temporary container, backed by persistent storage.
- **Scaling to Kubernetes in Azure**
  - The controller runs inside a Kubernetes pod instead of natively.
  - The model (`laser.py`) and objective function continue to run in transient worker containers, managed by Kubernetes.
  - The database runs as a persistent service in the cluster, backed by an Azure-managed storage solution (e.g., Azure Disk, Azure Database for PostgreSQL/MySQL).
We don't foresee needing to use Ray.
## Rationale
Our Optuna + Docker + Kubernetes approach balances ease of use, scalability, and reproducibility, making it ideal for researchers who know Python but don’t want excessive complexity. Here’s why:
### Key Strengths
- **Simple Python Workflow for the Researcher**
  - Researchers only need to write Python code for:
    - The disease model (`laser.py`).
    - The objective function (`objective.py`).
    - Any custom input parameters.
  - Optuna handles the optimization logic, reducing the need for manual hyperparameter tuning.
  - No need for researchers to manually manage containers; the infrastructure is abstracted.
- **Local Development Feels Like Running a Normal Python Script**
  - The controller (`run.py`) runs locally, allowing researchers to test calibration with minimal setup.
  - Workers execute in Docker containers, but the researcher doesn't need to manage them explicitly.
  - The database (SQLite locally, MySQL in production) is abstracted away.

  Benefit: Researchers can start with a simple, local setup and later scale to Kubernetes without rewriting their code.
- **Seamless Transition from Local to Cloud (Kubernetes on Azure)**
  - The same Docker image used for local testing can be deployed to Kubernetes (AKS).
  - Researchers don't need to rewrite scripts to scale.
  - Kubernetes automatically spins up worker pods to handle additional trials.

  Benefit: Researchers don't have to worry about managing infrastructure; scaling is handled by Kubernetes.
- **Reproducibility & Version Control with Docker**
  - Each trial runs in an isolated container, ensuring consistent execution.
  - All dependencies are packaged in the Docker image, preventing version conflicts.
  - Researchers can easily share their setup with colleagues by sharing the Docker image plus the objective function.

  Benefit: Eliminates the "works on my machine" problem.
- **Persistence & Checkpointing with Database Integration**
  - Optuna stores results in the database, ensuring that:
    - Partial runs are saved (no data loss if a run crashes).
    - Results can be reviewed and analyzed later.
  - The database runs on a persistent volume, making the calibration fault-tolerant.

  Benefit: Researchers can stop and restart calibration without losing progress.
- **Familiarity**
  - Our Infra Team is familiar with this approach and has experience supporting it.
### Alternatives: Pros & Cons
| Approach | Pros | Cons |
|---|---|---|
| Manually running Optuna trials in Python (no containers) | ✅ Simple for small-scale runs<br>✅ No Docker complexity | ❌ Hard to parallelize<br>❌ No reproducibility<br>❌ No fault tolerance |
| Running everything on a single large server (no K8s) | ✅ No need for container orchestration<br>✅ Easier debugging | ❌ Doesn't scale easily<br>❌ Wastes resources if idle<br>❌ Harder to share results |
| Using cloud-only tools (Azure AutoML, etc.) | ✅ Fully managed (no infra to manage)<br>✅ Scales well | ❌ Less customization<br>❌ Vendor lock-in<br>❌ Learning curve for Python users |
| HPC clusters with SLURM | ✅ Well-suited for supercomputing<br>✅ Can handle complex jobs | ❌ Requires learning SLURM<br>❌ Not as flexible as Kubernetes<br>❌ Harder to reproduce runs |
## Python-Friendly
- Python-first approach: No need to learn Kubernetes or Docker in-depth.
- Local-to-cloud flexibility: Start on a laptop, scale up when needed.
- Scalability without complexity: Kubernetes manages parallel workers.
- Reproducibility & sharing: Docker ensures that everyone gets the same environment.
- Fault tolerance: Optuna + DB ensures no lost work.