Synthetic Spatial Multi‐Modal Calibration for Documentation - laser-base/laser-measles GitHub Wiki
Three-cluster chain · stochastic invasion · progressive multi-model calibration · AI-assisted workflow
A new laser-measles user typically arrives with a question like "can I calibrate my model to my data?" The honest answer is "yes, but stochastic spatial-ABM calibration is fundamentally a multi-stage process -- and it has to be, given how expensive a single ABM evaluation is and how noisy a single trial estimate is." This scenario is designed to make that concrete.
The central methodological argument is a progressive multi-model cascade: each stage uses a less expensive, more constrained model to provide a prior that the next, more expensive, less constrained model refines. Cheap deterministic CMP locates the basin; expensive stochastic ABM, run with progressively higher M (seeds per trial), pins the parameters down to within the irreducible binomial sampling noise. Crucially, every stage builds on the trials of every previous stage -- the search history is cumulative across models, never thrown away.
We chose a deliberately non-trivial scenario -- a chain of three clusters with custom mixing geometry, multi-level SIA campaigns, and bimodal stochastic invasion -- because the lessons of multi-model cascading don't show up clearly on toy single-population fits. The audience is experienced ABM measles modellers, so we don't shy away from the complexity; we showcase it.

Three-cluster chain layout; 45 patches, 1.6 M total population, three years of simulated time. Top-3 A patches are seeded on day 0; everywhere else starts naive. ChainMixing forbids A<->C, A<->B_near, and B_far<->C shortcuts -- all transmission must travel A -> B_far -> B_near -> C.
Generated by generate_reference_v4.py with hidden TRUE parameters
beta=0.5, k=0.01, c=1.5 (the calibration targets) and eps_far=0.85,
eps_near=0.50 (intervention parameters held fixed and known throughout).
Run for 1095 days x 20 RNG seeds.
The day-10 SIA design is what creates the interesting dynamics: eps_far=0.85
on cluster_b:far drives B_far's R_0_eff to 0.6 -- a stochastic
bottleneck -- and eps_near=0.50 on cluster_b:near leaves B_near's R_0_eff
at 2.0, supercritical only if seeded. Result: 9 of 20 seeds invade C,
11 don't. The mean attack rate AR_C ~ 0.44 is unphysical for any single
realization; the diagnostic signature is std(AR_C) ~ 0.50 -- the
bimodality fingerprint.
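The bimodality fingerprint is easy to compute from per-seed attack rates. A minimal sketch (numpy; the invasion threshold and array names are illustrative, not from the actual scripts):

```python
import numpy as np

def invasion_stats(ar_c, threshold=0.05):
    """Summarize per-seed cluster-C attack rates from one ensemble.

    ar_c: 1-D array, one attack rate per RNG seed.
    A seed counts as "invading" C if its attack rate exceeds `threshold`
    (the 0.05 cutoff is illustrative).
    """
    return {
        "c_inv_frac": float(np.mean(ar_c > threshold)),  # bottleneck signal
        "mean_AR_C": float(np.mean(ar_c)),   # unphysical for a bimodal split
        "std_AR_C": float(np.std(ar_c)),     # ~0.5 is the bimodality fingerprint
    }

# Toy bimodal ensemble: 9 of 20 seeds invade (AR ~ 0.95), 11 fizzle (AR ~ 0.01)
ar_c = np.array([0.95] * 9 + [0.01] * 11)
stats = invasion_stats(ar_c)
# c_inv_frac comes out 0.45 and std_AR_C ~ 0.47, even though no single seed
# ever sees an attack rate anywhere near the mean of ~0.43
```

The toy ensemble reproduces the headline numbers: a mean nothing ever realizes, and a standard deviation close to 0.5 that only a two-peaked distribution can produce.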
Note on what is and isn't calibrated. Only (beta, k, c) -- the transmission and mixing parameters -- are calibration targets. SIA efficacies are treated as known intervention parameters, fixed at their TRUE values during calibration. This matches real-world practice: campaign efficacies are usually constrained by program records or assumed from the literature, while transmission and mixing parameters are inferred from the dynamics the campaigns produce. Adding eps_far/eps_near to the search space would have been a different and harder experiment (with step-function regime transitions), and it would have obscured the methodological lessons we wanted this scenario to teach.

The calibrator sees only six summary statistics across the 20-seed ensemble:
| stat | reference | role |
|---|---|---|
| c_inv_frac | 0.45 | fraction of seeds invading C -- the bottleneck signal |
| mean_AR_A | 0.980 | within-cluster A attack rate (pins beta) |
| mean_AR_BN | 0.679 | chain coupling reaching B_near |
| std_AR_C | 0.500 | bimodality fingerprint -- stochastic-only target |
| mean_peak_A | 96.7 | epidemic timing (pins beta + k jointly) |
| std_peak_A | 2.4 | inter-seed timing variability |
mean_AR_C and mean_AR_BF are deliberately not used in the loss -- the former because the mean of a bimodal distribution is unphysical, the latter because the SIA pins it independently of (beta, k, c).
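A loss of this shape can be sketched as a sum of squared scaled residuals over the six statistics. Only the c_inv_frac scale of 0.15 appears in this write-up; the other scales below are placeholder assumptions:

```python
# Reference targets from the 20-seed ensemble (values from the table above)
REFERENCE = {"c_inv_frac": 0.45, "mean_AR_A": 0.980, "mean_AR_BN": 0.679,
             "std_AR_C": 0.500, "mean_peak_A": 96.7, "std_peak_A": 2.4}

# Per-statistic scales. The text only pins c_inv_frac at 0.15 (its binomial
# noise floor); the remaining values are illustrative assumptions.
SCALES = {"c_inv_frac": 0.15, "mean_AR_A": 0.02, "mean_AR_BN": 0.05,
          "std_AR_C": 0.05, "mean_peak_A": 2.0, "std_peak_A": 1.0}

def summary_loss(model_stats):
    """Sum of squared scaled residuals over the six target statistics."""
    return sum(((model_stats[k] - REFERENCE[k]) / SCALES[k]) ** 2
               for k in REFERENCE)

# A perfect match scores zero; a c_inv_frac miss of one scale unit adds 1.0
assert summary_loss(dict(REFERENCE)) == 0.0
```

Scaling each residual by its expected noise level is what makes losses from different statistics commensurable, and it is also why the c_inv_frac scale matters so much in the M-vs-trials discussion later.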

Before launching any optimizer, both CMP and ABM identifiability sweeps
(cmp_identifiability_v4.py, abm_identifiability_v4.py) scan one-D and two-D
across each parameter with the others held fixed. These reveal what each
parameter can be expected to do to each summary statistic, before we ever
ask Optuna to find an optimum.
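The sweep pattern itself is simple: vary one parameter on a grid with the others pinned, and record the loss. A sketch with a toy stand-in (run_cmp_loss is hypothetical; its near-flat c response mimics the identifiability finding reported next):

```python
import numpy as np

def run_cmp_loss(beta, k, c):
    """Hypothetical stand-in for one deterministic CMP evaluation.

    Toy surface: bowl in (beta, k), nearly flat in c, offset by a baseline --
    shaped to mimic the sweep findings, not the real model.
    """
    return (1.0 + (beta - 0.5) ** 2 / 0.01
            + np.log10(k / 0.01) ** 2
            + 0.01 * (c - 1.5) ** 2)

def sweep_1d(param, grid, fixed):
    """1-D identifiability scan: vary one parameter, hold the others fixed."""
    losses = []
    for value in grid:
        args = dict(fixed)
        args[param] = value
        losses.append(run_cmp_loss(**args))
    return np.array(losses)

fixed = {"beta": 0.5, "k": 0.01, "c": 1.5}
c_losses = sweep_1d("c", np.linspace(0.5, 3.0, 11), fixed)
# A small relative variation across the grid flags c as unidentifiable here
rel_var = (c_losses.max() - c_losses.min()) / c_losses.max()
```

The real scripts do the same loop over actual model runs; the diagnostic (relative loss variation across the scanned range) is what justifies freezing c in Stage 1.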
Two findings shape every subsequent stage:
- c is essentially unidentifiable in CMP -- the loss varies < 30 % across c in [0.5, 3.0] because ChainMixing has already collapsed the long-distance routes that c would otherwise control. We hold c=1.5 throughout Stage 1.
- CMP at TRUE peaks 17 ticks late versus the ABM ensemble mean. The deterministic limit is slower than the stochastic mean; the only knob CMP has to compensate is k. This is structural cross-model bias, diagnosed up front so we don't chase it later.
The perennial calibration adage applies here: the sweep is the science, calibration is the finishing touch.

CMP sweep: 1D scans of beta, k, c (rows) against loss components, attack rates, and peak timing (columns), plus 2D log10(loss) heatmap in (beta, k) at fixed c. The bottom row makes c's flatness visible.

ABM 2D sweep at c=c_true: log10(loss), c_inv_frac (red contour at the reference 0.45), mean peak_A (cyan contour at reference 97), and AR_A (red contour at reference 0.98). The three target ridges intersect at TRUE (red star) -- the loss landscape really does have a well-defined minimum at the right place, even if M=3 noise made it hard to find directly.
Stage 1 uses CompartmentalModel with ChainMixing as mixer= and c held fixed
at 1.5. The loss is R_A(t)/N_A landmark crossing times + peak_A timing +
final AR_A -- the parts of the dynamics a deterministic model can match.
100 TPE trials, no warm-start at TRUE.
Stage 1 result: beta = 0.529 (1.06x TRUE), k = 0.0259 (2.59x TRUE -- biased high).
The CMP's k recovery is biased -- and that's by design, not by failure.
CMP cannot reproduce the ABM ensemble peak timing without inflating mixing.
Stage 1's role isn't to recover TRUE; it's to deliver a useful prior --
specifically, a (beta, k, c) point in the right neighbourhood -- for the
expensive Stage 2 to refine. beta within 6 % qualifies. The bias on k
is the deterministic-vs-stochastic structural mismatch we expected.
This stage costs about 5 minutes of compute. That cheapness is the whole reason it's the first stage.

CMP cold-start (no TRUE warm-start). The 100-trial Optuna convergence panel (top-left) shows the running min plateauing around trial 31 -- our defined convergence point. Param-recovery bars (top-right) show the biased k. Bottom row: trial cloud in (beta, k) and 1D loss profiles showing how each parameter responds.
Stage 2 is where the calibration story actually plays out. Optuna is
warm-started at the Stage 1 best (CMP cold-best, not TRUE), and runs
M-seed ABM ensembles per trial under a summary-statistics loss that includes
the bimodality term std(AR_C). We tested four allocations of the same
total compute budget -- and the headline lesson is that, at fixed total
compute, seeds-per-trial dominates n_trials:
| variant | trials x M | total ABM runs | loss(M=20) | beta rec | k rec | c rec |
|---|---|---|---|---|---|---|
| original | 24 x 12 | 288 | 65.4 | 1.17x | 0.21x [x] | 0.59x |
| A (more trials) | 60 x 12 | 720 | 16.3 | 1.30x | 0.16x [x] | 0.47x |
| B (more seeds) | 24 x 20 | 480 | 1.29 | 1.06x | 0.70x | 0.77x |
| F (cumulative) | 20 x 60 | 1,200 | 0.568 | 0.99x | 0.91x | 0.72x |
A's failure is the lesson made visible. With M=12 the binomial standard
deviation on c_inv_frac (~ 0.144) is comparable to the loss-function scale
(0.15) -- TPE happily exploits favourable noise draws rather than searching
the landscape. More trials at insufficient M let the optimizer overfit
harder, not converge faster. Spend the budget on M first, then on n_trials.
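The arithmetic behind that diagnosis is one line: the standard error of an invasion fraction estimated from M seeds is sqrt(p(1-p)/M):

```python
import math

def binomial_se(p, m):
    """Standard error of a Bernoulli fraction (e.g. c_inv_frac) from m seeds."""
    return math.sqrt(p * (1 - p) / m)

# At the reference c_inv_frac = 0.45:
se_m12 = binomial_se(0.45, 12)  # ~0.144 -- comparable to the 0.15 loss scale
se_m20 = binomial_se(0.45, 20)  # ~0.111
se_m60 = binomial_se(0.45, 60)  # ~0.064 -- noise now well below the scale
```

Only at M=60 does the per-trial noise on c_inv_frac drop clearly below the scale the loss uses for it, which is exactly where variant F's allocation sits.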

The M-vs-trials trade-off, side by side. Left: convergence curves for each variant (running min in solid). Middle: per-parameter recovery ratios. F is the only variant where all three parameters land near 1.00. Right: final M=20 loss vs total ABM runs spent -- A spent 50 % more compute than B and got 12x worse loss. F's cumulative TPE prior (108 prior trials inherited) lets it land below B with only 2.5x the compute.
This is the second methodological move that's worth emphasizing for
experienced modellers. Variant F is launched with all 108 prior cascade
trials pre-loaded into the TPE study via study.add_trial(). The
search history accumulates across runs:
- The CMP cold-start gave Stage 2 its initial warm-start point.
- The original 24-trial cascade told us where M=12 fails.
- Variant A (60 trials at M=12) confirmed the M-not-trials diagnosis.
- Variant B (24 trials at M=20) found the basin within ~30 %.
- Variant F (20 trials at M=60, with all 108 prior trials as TPE prior) reached 0.99x / 0.91x / 0.72x recovery on (beta, k, c) with loss(M=20) = 0.568.

Cascade F internals: 20-trial Optuna convergence (top-left), trial cloud in (beta, k) coloured by log-loss with the TRUE point starred (top-middle), c distribution among low-loss trials, per-loss-component decomposition, and parameter recovery -- all evaluated at M=60 per trial.
A new practitioner shouldn't re-discover the same low-loss basins from scratch every time they re-run an experiment. Cumulative TPE preserves every previous evaluation as prior knowledge -- even ones at lower M -- so each new run sharpens the answer rather than restarting the search.
We took variant F's best (beta=0.4963, k=0.00910, c=1.087), ran 50 fresh ABM
seeds at those parameters, and compared to the 20-seed reference at every
level -- trajectories, distributions, and summary scalars
(v4_calibration_validation.py).

Per-cluster I/N and R/N bands overlap. Calibrated ensemble (50 seeds, dashed) falls within the reference band (20 seeds, solid) across all four clusters, including the bimodal "some invade, some don't" splay in B_near and C visible in the per-seed spaghetti overlay.
Per-cluster AR distributions match shape. Calibrated AR_C is bimodal with peaks near 0 and 1 -- this is the part no deterministic model can produce.
Bimodality preserved: std(AR_C) = 0.488 vs reference 0.500 (within ~3 %).
peak_A: 97.65 vs 96.65 -- within one tick.
Parameter recovery without TRUE leakage: beta = 0.4963 (0.99x TRUE) and k = 0.00910 (0.91x TRUE). This is the load-bearing result.
For a sophisticated audience, being explicit about residual bias matters:
- c is biased low: 1.087 vs TRUE 1.5 (0.72x). This shows up in the validation as c_inv_frac ~ 0.68 in the 50-seed calibrated ensemble vs 0.45 in the reference -- a real ~3-sigma overshoot, not noise. Weaker distance decay -> stronger A->B_far->B_near coupling -> more frequent C invasion.
- The summary-stats loss didn't punish this hard enough, because the c_inv_frac term is scaled at 0.15 (matching its irreducible binomial noise floor). Tightening the scale or adding a per-patch spatial loss term -- e.g. arrival-time differences for patches invaded in >= 50 % of seeds -- would constrain c better.
- Resolution is finite, not infinite. The noise floor at TRUE under M=10 ranged 2.92-61.21 across replicas (v4_loss_curves.png); a 1D beta profile shows that values in [0.48, 0.52] are statistically indistinguishable from TRUE. Practical resolution at this loss design and compute budget: beta within ~5 %, k within ~10 %, c within ~25 %.

- Every spatial-ABM feature in one scenario -- custom mixing geometry, multi-level SIA filtering, sub/supercritical R_0_eff configuration, per-patch state tracking, summary-statistics calibration targets, and the multi-model scenario builder -- small enough to run end-to-end on a workstation, complex enough that the calibration lessons actually surface.
- The cascade pattern is teachable here. The user sees Stage 1 (cheap, biased, useful), Stage 2 (expensive, stochastic, refining), and the cumulative build-up of trials across stages. They also see the failure modes (cascade A vs B) that make the M-vs-trials lesson stick.
- Synthetic-recovery framing with hidden TRUE. The deliverable is honest precision claims (beta within ~5 %, etc.) rather than a fitted point estimate without uncertainty.
- std(AR_C) ~ 0.5 is a target only a stochastic model can meet. A new user gets to see why the ABM stage is necessary, not just told.
- The CMP-bias diagnosis is itself a deliverable. Showing that deterministic CMP biases k by 2-3x under a particular loss design -- and that a stochastic Stage 2 can correct it -- is a transferable modelling lesson, not just a calibration result.
This entire scenario -- reference generation, identifiability sweeps, four
cascade variants, a cumulative variant, validation, figures, and this
write-up -- was built and run interactively with a Claude Code agent
(Anthropic) in roughly one working day. The
agent did the documentation lookups (JENNER-Measles MCP Server), proposed and edited scripts, ran experiments
in parallel on the local machine, monitored progress with 10-min stream
updates, parsed logs, generated diagnostic figures, diagnosed bugs (the
silently-dropped mixer= field on the ABM InfectionParams, a
prior_files typo from a global rename), and proposed the M-vs-trials
follow-up that produced the working calibration.
Two specific demonstrations worth highlighting:
- Iterative collaborative diagnosis. When the warm-start cascade produced "exact recovery" that looked too clean, the user pushed back; the agent proposed and ran a noise-floor study, a 1D beta profile around TRUE, and a full cold-start cascade that produced the honest result and the M lesson. Each step was a small ~10-min round-trip, not a multi-day delay.
- Cumulative experiment design. The cascade variants weren't a pre-designed factorial -- they emerged in response to discovering that variant A was worse than the original despite more compute. Hypothesis -> experiment -> result -> next hypothesis, in real time, with the agent maintaining the bookkeeping.
The first-day-productive bar that this lifts is meaningful: a new user can produce a defensible spatial-ABM calibration with explicit uncertainty bounds and a published-quality figure pack in hours, not weeks -- and they can do it while learning the methodological lessons, because the journey through the failure modes is itself part of the workflow.
sandbox/2026-04-23/
+-- generate_reference_v4.py # 20-seed reference (~10 min)
+-- reference_v4/ # I_by_patch.npy, R_by_patch.npy, meta
|
+-- cmp_identifiability_v4.py # CMP sweep (run first)
+-- calibrate_cmp_v4_coldstart.py # Stage 1 -- CMP cold-start (~5 min)
|
+-- abm_identifiability_v4.py # ABM sweep
+-- calibrate_abm_v4_cascade_F.py # Stage 2 -- winning cumulative cascade
+-- calibrate_abm_v4_cascade_{B,A,...} # earlier variants -- kept for the M lesson
|
+-- v4_loss_curves.py # noise-floor + beta profile diagnostic
+-- v4_calibration_validation.py # validation figure above
+-- INTRO_SCENARIO.md # this document
End-to-end reproduction from a fresh checkout: ~10 min reference, 5 min Stage 1, ~2 hours Stage 2 (1 trial = 60 ABM seeds = ~5 min on a 16-core machine), 5 min validation. About three hours of compute total, less of human time.
A natural extension for a user wanting to push c recovery: add a
per-patch spatial loss term (e.g. sum of (arrival_tick_p_model - arrival_tick_p_ref)^2 for patches invaded in >= 50 % of seeds). c controls
which patches in B_far the chain reaches first, and that signal is washed
out in cluster-aggregate summary statistics.
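One way such a term could be sketched (array shapes, the invasion threshold, and the aggregation choices are assumptions, not taken from the v4 scripts):

```python
import numpy as np

def arrival_time_loss(I_model, I_ref, min_invaded_frac=0.5, threshold=1):
    """Per-patch arrival-time loss sketch.

    I_model, I_ref: (seeds, days, patches) infected-count arrays.
    A patch's arrival tick in one seed is the first day I exceeds
    `threshold`; patches invaded in fewer than `min_invaded_frac` of
    reference seeds are excluded, per the >= 50 % rule above.
    """
    def mean_arrival(I):
        hit = I > threshold                                  # (seeds, days, patches)
        invaded = hit.any(axis=1)                            # (seeds, patches)
        first = np.argmax(hit, axis=1).astype(float)         # first crossing day
        first[~invaded] = np.nan                             # never-invaded seeds
        return np.nanmean(first, axis=0), invaded.mean(axis=0)

    t_model, _ = mean_arrival(I_model)
    t_ref, frac_ref = mean_arrival(I_ref)
    use = frac_ref >= min_invaded_frac
    return float(np.nansum((t_model[use] - t_ref[use]) ** 2))

# Tiny check: 2 seeds, 5 days, 2 patches; an ensemble compared with itself
I_ref = np.zeros((2, 5, 2))
I_ref[0, 1:, 0] = 5   # seed 0: patch 0 invaded on day 1
I_ref[0, 3:, 1] = 5   # seed 0: patch 1 invaded on day 3
I_ref[1, 2:, 0] = 5   # seed 1: patch 0 invaded on day 2; patch 1 never
loss_self = arrival_time_loss(I_ref, I_ref)  # identical ensembles -> 0.0
```

Because arrival ticks vary patch by patch along the chain, this term sees exactly the spatial ordering signal that cluster-level attack rates average away.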