Synthetic Spatial Multi-Modal Calibration for Documentation

A synthetic spatial-calibration scenario for new laser-measles users

Three-cluster chain · stochastic invasion · progressive multi-model calibration · AI-assisted workflow


What this scenario is, and why we built it

A new laser-measles user typically arrives with a question like "can I calibrate my model to my data?" The honest answer is "yes, but stochastic spatial-ABM calibration is fundamentally a multi-stage process -- and it has to be, given how expensive a single ABM evaluation is and how noisy a single trial estimate is." This scenario is designed to make that concrete.

The central methodological argument is a progressive multi-model cascade: each stage uses a less expensive, more constrained model to provide a prior that the next, more expensive, less constrained model refines. Cheap deterministic CMP locates the basin; expensive stochastic ABM, run with progressively higher M (seeds per trial), pins the parameters down to within the irreducible binomial sampling noise. Crucially, every stage builds on the trials of every previous stage -- the search history is cumulative across models, never thrown away.

We chose a deliberately non-trivial scenario -- a chain of three clusters with custom mixing geometry, multi-level SIA campaigns, and bimodal stochastic invasion -- because the lessons of multi-model cascading don't show up clearly on toy single-population fits. The audience is experienced ABM measles modellers, so we don't shy away from the complexity; we showcase it.

reference geography

Three-cluster chain layout; 45 patches, 1.6 M total population, three years of simulated time. Top-3 A patches are seeded on day 0; everywhere else starts naive. ChainMixing forbids A<->C, A<->B_near, and B_far<->C shortcuts -- all transmission must travel A -> B_far -> B_near -> C.
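
One way to picture the constraint is as a mask over the patch-to-patch mixing matrix. The sketch below is a conceptual numpy illustration, not the ChainMixing API; the patch counts and the mask-then-normalise comment are our assumptions.

```python
import numpy as np

# Conceptual sketch of the chain constraint (not the ChainMixing API):
# clusters are ordered A -> B_far -> B_near -> C, and mixing is only allowed
# within a cluster or between adjacent clusters in the chain.
clusters = ["A", "B_far", "B_near", "C"]
order = {name: i for i, name in enumerate(clusters)}

def chain_mask(patch_cluster: list) -> np.ndarray:
    """Boolean mask over patch pairs: True where mixing is allowed."""
    n = len(patch_cluster)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            # adjacent (or same) clusters only -- forbids A<->C, A<->B_near, B_far<->C
            mask[i, j] = abs(order[patch_cluster[i]] - order[patch_cluster[j]]) <= 1
    return mask

# hypothetical 45-patch assignment (counts are illustrative, not the real layout)
patch_cluster = ["A"] * 12 + ["B_far"] * 11 + ["B_near"] * 11 + ["C"] * 11
mask = chain_mask(patch_cluster)  # e.g. apply as gravity_matrix * mask before row-normalising
```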


The synthetic reference

Generated by generate_reference_v4.py with hidden TRUE parameters beta=0.5, k=0.01, c=1.5 (the calibration targets) and eps_far=0.85, eps_near=0.50 (intervention parameters held fixed and known throughout). Run for 1095 days x 20 RNG seeds.

The day-10 SIA design is what creates the interesting dynamics: eps_far=0.85 on cluster_b:far drives B_far's R_0_eff to 0.6 -- a stochastic bottleneck -- and eps_near=0.50 on cluster_b:near leaves B_near's R_0_eff at 2.0, supercritical only if seeded. Result: 9 of 20 seeds invade C, 11 don't. The mean attack rate AR_C ~ 0.44 is unphysical for any single realization; the diagnostic signature is std(AR_C) ~ 0.50 -- the bimodality fingerprint.
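
A back-of-envelope check of those two regimes. The baseline R_0 ~ 4 is inferred from the quoted post-SIA values, and treating the SIA as all-or-nothing immunisation of a fraction eps is our simplifying assumption, not a statement about the scripts.

```python
# Baseline R0 ~ 4 is inferred from the quoted post-SIA values (0.6 and 2.0);
# the SIA is treated here as all-or-nothing immunisation of a fraction eps.
R0 = 4.0

def r_eff_after_sia(eps: float, r0: float = R0) -> float:
    """Effective reproduction number when a fraction eps is immunised."""
    return r0 * (1.0 - eps)

print(r_eff_after_sia(0.85))  # ~0.6 -> B_far subcritical: the stochastic bottleneck
print(r_eff_after_sia(0.50))  # 2.0  -> B_near supercritical, but only if seeded
```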

Note on what is and isn't calibrated. Only (beta, k, c) -- the transmission and mixing parameters -- are calibration targets. SIA efficacies are treated as known intervention parameters, fixed at their TRUE values during calibration. This matches real-world practice: campaign efficacies are usually constrained by program records or assumed from the literature, while transmission and mixing parameters are inferred from the dynamics the campaigns produce. Adding eps_far/eps_near to the search space would have been a different and harder experiment (with step-function regime transitions), and it would have obscured the methodological lessons we wanted this scenario to teach.

bimodal C invasion

The calibrator sees only six summary statistics across the 20-seed ensemble:

| stat | reference value | role |
|---|---|---|
| c_inv_frac | 0.45 | fraction of seeds invading C -- the bottleneck signal |
| mean_AR_A | 0.980 | within-cluster A attack rate (pins beta) |
| mean_AR_BN | 0.679 | chain coupling reaching B_near |
| std_AR_C | 0.500 | bimodality fingerprint -- stochastic-only target |
| mean_peak_A | 96.7 | epidemic timing (pins beta + k jointly) |
| std_peak_A | 2.4 | inter-seed timing variability |

mean_AR_C and mean_AR_BF are deliberately not used in the loss -- the former because the mean of a bimodal distribution is unphysical, the latter because the SIA pins it independently of (beta, k, c).
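
In code, the reduction from a per-seed ensemble to these six numbers is small. A minimal sketch, assuming per-seed cluster attack rates and cluster-A peak days have already been extracted; the invasion threshold is our assumption.

```python
import numpy as np

def summary_stats(ar_a, ar_bn, ar_c, peak_day_a, invasion_threshold=0.01):
    """Six-statistic summary of a multi-seed ensemble.

    ar_a, ar_bn, ar_c : per-seed cluster attack rates (1D arrays, one entry per seed)
    peak_day_a        : per-seed day of peak incidence in cluster A
    invasion_threshold: attack rate above which C counts as invaded (an assumption)
    """
    return {
        "c_inv_frac": float(np.mean(ar_c > invasion_threshold)),
        "mean_AR_A": float(np.mean(ar_a)),
        "mean_AR_BN": float(np.mean(ar_bn)),
        "std_AR_C": float(np.std(ar_c)),
        "mean_peak_A": float(np.mean(peak_day_a)),
        "std_peak_A": float(np.std(peak_day_a)),
    }
```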

reference epidemic curves


Stage 0 -- Identifiability sweeps before any calibration

Before launching any optimizer, we ran both CMP and ABM identifiability sweeps (cmp_identifiability_v4.py, abm_identifiability_v4.py): 1D and 2D scans across each parameter with the others held fixed. These reveal what each parameter can be expected to do to each summary statistic, before we ever ask Optuna to find an optimum.

Two findings shape every subsequent stage:

  1. c is essentially unidentifiable in CMP -- the loss varies < 30 % across c in [0.5, 3.0] because ChainMixing has already collapsed the long-distance routes that c would otherwise control. We hold c=1.5 throughout Stage 1.
  2. CMP at TRUE peaks 17 ticks late versus the ABM ensemble mean. The deterministic limit is slower than the stochastic mean; the only knob CMP has to compensate is k. This is structural cross-model bias, diagnosed up front so we don't chase it later.

The perennial calibration adage applies here: the sweep is the science, calibration is the finishing touch.
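
The sweep loop itself is short. A minimal sketch, where evaluate_model stands in for a hypothetical wrapper around a CMP or ABM run (it is not a laser-measles function) and the nominal values are the ones used above.

```python
import numpy as np

# Vary one parameter, hold the others at nominal values, and record every
# summary statistic -- not just the loss. `evaluate_model` is a hypothetical
# wrapper around the CMP or ABM run.
def sweep_1d(evaluate_model, name, values, nominal):
    rows = []
    for v in values:
        params = {**nominal, name: v}
        stats = evaluate_model(**params)      # returns a dict of summary statistics
        rows.append({"param": name, "value": v, **stats})
    return rows

# e.g. sweep c over [0.5, 3.0] at beta, k fixed -- the scan that exposed c's flatness
# rows = sweep_1d(run_cmp, "c", np.linspace(0.5, 3.0, 11),
#                 nominal={"beta": 0.5, "k": 0.01, "c": 1.5})
```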

CMP identifiability sweep

CMP sweep: 1D scans of beta, k, c (rows) against loss components, attack rates, and peak timing (columns), plus 2D log10(loss) heatmap in (beta, k) at fixed c. The bottom row makes c's flatness visible.

ABM identifiability sweep -- 2D heatmaps

ABM 2D sweep at c=c_true: log10(loss), c_inv_frac (red contour at the reference 0.45), mean peak_A (cyan contour at reference 97), and AR_A (red contour at reference 0.98). The three target ridges intersect at TRUE (red star) -- the loss landscape really does have a well-defined minimum at the right place, even if M=3 noise made it hard to find directly.


Stage 1 -- Cheap deterministic prior (CMP cold-start)

CompartmentalModel, ChainMixing as mixer=, c held fixed at 1.5. Loss is R_A(t)/N_A landmark crossing times + peak_A timing + final AR_A -- the parts of the dynamics that a deterministic model can match. 100 TPE trials, no warm-start at TRUE.
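
A minimal sketch of what such an objective can look like. run_cmp is a hypothetical wrapper returning the cluster-A recovered-fraction trajectory and peak day; the search bounds, landmark levels, and weights are illustrative, not the values used in the scripts.

```python
import numpy as np
import optuna

LEVELS = (0.25, 0.50, 0.75)   # illustrative landmark crossing levels for R_A(t)/N_A

def landmark_day(r_frac, level):
    """First day the recovered fraction crosses `level` (inf if it never does)."""
    hits = np.nonzero(np.asarray(r_frac) >= level)[0]
    return float(hits[0]) if hits.size else float("inf")

def make_objective(ref_r_frac, ref_peak):
    ref_days = {lvl: landmark_day(ref_r_frac, lvl) for lvl in LEVELS}

    def objective(trial: optuna.Trial) -> float:
        beta = trial.suggest_float("beta", 0.2, 1.0)
        k = trial.suggest_float("k", 0.001, 0.05, log=True)
        r_frac, peak_day = run_cmp(beta=beta, k=k, c=1.5)   # c held fixed in Stage 1
        loss = sum((landmark_day(r_frac, lvl) - ref_days[lvl]) ** 2 for lvl in LEVELS)
        loss += (peak_day - ref_peak) ** 2                  # peak_A timing term
        loss += 1e4 * (r_frac[-1] - ref_r_frac[-1]) ** 2    # final AR_A term
        return loss

    return objective

# study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=0))
# study.optimize(make_objective(ref_r_frac, ref_peak=96.7), n_trials=100)
```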

Stage 1 result: beta = 0.529 (1.06x TRUE), k = 0.0259 (2.59x TRUE -- biased high).

The CMP's k recovery is biased -- and that's by design, not by failure. CMP cannot reproduce the ABM ensemble peak timing without inflating mixing. Stage 1's role isn't to recover TRUE; it's to deliver a useful prior -- specifically, a (beta, k, c) point in the right neighbourhood -- for the expensive Stage 2 to refine. beta within 6 % qualifies. The bias on k is the deterministic-vs-stochastic structural mismatch we expected.

This stage costs about 5 minutes of compute. That cheapness is the whole reason it's the first stage.

CMP cold-start diagnostics

CMP cold-start (no TRUE warm-start). The 100-trial Optuna convergence panel (top-left) shows the running min plateauing around trial 31 -- our defined convergence point. Param-recovery bars (top-right) show the biased k. Bottom row: trial cloud in (beta, k) and 1D loss profiles showing how each parameter responds.


Stage 2 -- Stochastic refinement, with the M-vs-trials lesson

Stage 2 is where the calibration story actually plays out. Optuna is warm-started at the Stage 1 best (CMP cold-best, not TRUE), and runs M-seed ABM ensembles per trial under a summary-statistics loss that includes the bimodality term std(AR_C). We tested four ways of allocating the compute budget -- and the headline lesson is that, at a fixed total budget, seeds-per-trial dominates n_trials:

| variant | trials x M | total ABM runs | loss(M=20) | beta rec | k rec | c rec |
|---|---|---|---|---|---|---|
| original | 24 x 12 | 288 | 65.4 | 1.17x | 0.21x [x] | 0.59x |
| A (more trials) | 60 x 12 | 720 | 16.3 | 1.30x | 0.16x [x] | 0.47x |
| B (more seeds) | 24 x 20 | 480 | 1.29 | 1.06x | 0.70x | 0.77x |
| F (cumulative) | 20 x 60 | 1,200 | 0.568 | 0.99x | 0.91x | 0.72x |

A's failure is the lesson made visible. With M=12 the binomial standard deviation on c_inv_frac (~ 0.144) is as large as the loss-function scale (0.15) -- TPE happily exploits favourable noise rather than descending the true loss surface. More trials at insufficient M let the optimizer overfit harder, not converge faster. Spend the budget on M first, then on n_trials.
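
The arithmetic behind that claim, for the three M values used in the cascade:

```python
import math

# Binomial noise floor on c_inv_frac: with a true invasion probability p and M
# seeds per trial, a single-trial estimate has standard deviation sqrt(p*(1-p)/M).
p = 0.45  # reference invasion fraction
for M in (12, 20, 60):
    sd = math.sqrt(p * (1 - p) / M)
    print(f"M={M:3d}: sd(c_inv_frac) ~ {sd:.3f}")
# M= 12: ~0.144 -- same size as the 0.15 loss scale: TPE can "win" by luck
# M= 20: ~0.111
# M= 60: ~0.064 -- comfortably below the loss scale
```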

cascade comparison

The M-vs-trials trade-off, side by side. Left: convergence curves for each variant (running min in solid). Middle: per-parameter recovery ratios. F is the only variant where all three parameters land near 1.00. Right: final M=20 loss vs total ABM runs spent -- A spent 50 % more compute than B and got 12x worse loss. F's cumulative TPE prior (108 prior trials inherited) lets it land below B with only 2.5x the compute.

Each Stage 2 step is cumulative

This is the second methodological move that's worth emphasizing for experienced modellers. Variant F is launched with all 108 prior cascade trials pre-loaded into the TPE study via Study.add_trial() (a minimal sketch follows the list below). The search history accumulates across runs:

  • The CMP cold-start gave Stage 2 its initial warm-start point.
  • The original 24-trial cascade told us where M=12 fails.
  • Variant A (60 trials at M=12) confirmed the M-not-trials diagnosis.
  • Variant B (24 trials at M=20) found the basin within ~30 %.
  • Variant F (20 trials at M=60, with all 108 prior trials as TPE prior) reached 0.99x / 0.91x / 0.72x recovery on beta / k / c, with loss(M=20) = 0.568.
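
In Optuna terms, the mechanism is just Study.add_trial(): each earlier evaluation is re-registered as a completed trial before the new optimization starts, so the TPE sampler conditions on it. A minimal sketch, assuming the prior evaluations have been saved as simple (params, loss) records; the record format and bounds are ours, not the scripts'.

```python
import optuna
from optuna.distributions import FloatDistribution
from optuna.trial import create_trial

# Search space shared by the cascade stages (bounds here are illustrative).
SPACE = {
    "beta": FloatDistribution(0.2, 1.0),
    "k": FloatDistribution(0.001, 0.05, log=True),
    "c": FloatDistribution(0.5, 3.0),
}

def preload(study: optuna.Study, prior_records: list) -> None:
    """Re-register earlier (params, loss) evaluations as completed trials."""
    for rec in prior_records:
        study.add_trial(
            create_trial(
                params={name: rec[name] for name in SPACE},
                distributions=SPACE,
                value=rec["loss"],
            )
        )

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=0))
# prior_records would be read back from the earlier cascades' logs or storage,
# e.g. [{"beta": 0.53, "k": 0.026, "c": 1.5, "loss": 65.4}, ...]
# preload(study, prior_records)
```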

cascade F diagnostics

Cascade F internals: 20-trial Optuna convergence (top-left), trial cloud in (beta, k) coloured by log-loss with the TRUE point starred (top-middle), c distribution among low-loss trials, per-loss-component decomposition, and parameter recovery -- all evaluated at M=60 per trial.

A new practitioner shouldn't re-discover the same low-loss basins from scratch every time they re-run an experiment. Cumulative TPE preserves every previous evaluation as prior knowledge -- even ones at lower M -- so each new run sharpens the answer rather than restarting the search.


Validation against reference

We took variant F's best (beta=0.4963, k=0.00910, c=1.087), ran 50 fresh ABM seeds at those parameters, and compared to the 20-seed reference at every level -- trajectories, distributions, and summary scalars (v4_calibration_validation.py).
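
A minimal comparison loop for the scalar level of that check, assuming both ensembles have already been reduced to the summary dictionary sketched earlier (summary_stats() above); the formatting is ours.

```python
# Compare calibrated and reference ensembles statistic by statistic.
# `stats_cal` / `stats_ref` are dicts like the summary_stats() output above.
def compare(stats_cal: dict, stats_ref: dict) -> None:
    for name, ref in stats_ref.items():
        cal = stats_cal[name]
        rel = (cal - ref) / ref if ref else float("nan")
        print(f"{name:>12s}: calibrated {cal:8.3f}  reference {ref:8.3f}  ({rel:+.1%})")
```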

calibration validation

Per-cluster I/N and R/N bands overlap. Calibrated ensemble (50 seeds, dashed) falls within the reference band (20 seeds, solid) across all four clusters, including the bimodal "some invade, some don't" splay in B_near and C visible in the per-seed spaghetti overlay.

Per-cluster AR distributions match shape. Calibrated AR_C is bimodal with peaks near 0 and 1 -- this is the part no deterministic model can produce.

Bimodality preserved: std(AR_C) = 0.488 vs reference 0.500 (within ~3 %).

peak_A: 97.65 vs 96.65 -- within one tick.

Parameter recovery without TRUE leakage: beta = 0.4963 (0.99x TRUE) and k = 0.00910 (0.91x TRUE). This is the load-bearing result.


What the cascade did not fully recover

For a sophisticated audience, being explicit about residual bias matters:

  • c is biased low: 1.087 vs TRUE 1.5 (0.72x). This shows up in the validation as c_inv_frac ~ 0.68 in the 50-seed calibrated ensemble vs 0.45 in the reference -- a real ~3-sigma overshoot, not noise. Weaker distance decay -> stronger A->B_far->B_near coupling -> more frequent C invasion.
  • The summary-stats loss didn't punish this hard enough, because the c_inv_frac term is scaled at 0.15 (matching its irreducible binomial noise floor). Tightening the scale or adding a per-patch spatial loss term -- e.g. arrival-time differences for patches invaded in >= 50 % of seeds -- would constrain c better.
  • Resolution is finite, not infinite. The noise floor at TRUE under M=10 ranged 2.92-61.21 across replicas (v4_loss_curves.png); a 1D beta profile shows that values in [0.48, 0.52] are statistically indistinguishable from TRUE. Practical resolution at this loss design and compute budget: beta within ~5 %, k within ~10 %, c within ~25 %.

loss curves and noise floor


Why this scenario works as an onboarding case

  1. Every spatial-ABM feature in one scenario -- custom mixing geometry, multi-level SIA filtering, sub/supercritical R_0_eff configuration, per-patch state tracking, summary-statistics calibration targets, and the multi-model scenario builder -- packed into a case small enough to run end-to-end on a workstation, yet complex enough that the calibration lessons actually surface.

  2. The cascade pattern is teachable here. The user sees Stage 1 (cheap, biased, useful), Stage 2 (expensive, stochastic, refining), and the cumulative build-up of trials across stages. They also see the failure modes (cascade A vs B) that make the M-vs-trials lesson stick.

  3. Synthetic-recovery framing with hidden TRUE. The deliverable is honest precision claims (beta within ~5 %, etc.) rather than a fitted point estimate without uncertainty.

  4. std(AR_C) ~ 0.5 is a target only a stochastic model can meet. A new user gets to see why the ABM stage is necessary, not just told.

  5. The CMP-bias diagnosis is itself a deliverable. Showing that deterministic CMP biases k by 2-3x under a particular loss design -- and that a stochastic Stage 2 can correct it -- is a transferable modelling lesson, not just a calibration result.


What AI tooling brought to this workflow

This entire scenario -- reference generation, identifiability sweeps, four cascade variants, a cumulative variant, validation, figures, and this write-up -- was built and run interactively with a Claude Code agent (Anthropic) in roughly one working day. The agent did the documentation lookups (JENNER-Measles MCP Server), proposed and edited scripts, ran experiments in parallel on the local machine, monitored progress with 10-min stream updates, parsed logs, generated diagnostic figures, diagnosed bugs (the silently-dropped mixer= field on the ABM InfectionParams, a prior_files typo from a global rename), and proposed the M-vs-trials follow-up that produced the working calibration.

Two specific demonstrations worth highlighting:

  • Iterative collaborative diagnosis. When the warm-start cascade produced "exact recovery" that looked too clean, the user pushed back; the agent proposed and ran a noise-floor study, a 1D beta profile around TRUE, and a full cold-start cascade that produced the honest result and the M lesson. Each step was a small ~10-min round-trip, not a multi-day delay.

  • Cumulative experiment design. The cascade variants weren't a pre-designed factorial -- they emerged in response to discovering that variant A was worse than the original despite more compute. Hypothesis -> experiment -> result -> next hypothesis, in real time, with the agent maintaining the bookkeeping.

The bar this raises for first-day productivity is meaningful: a new user can produce a defensible spatial-ABM calibration with explicit uncertainty bounds and a published-quality figure pack in hours, not weeks -- and they can do it while learning the methodological lessons, because the journey through the failure modes is itself part of the workflow.


How to reproduce or build on this

sandbox/2026-04-23/
+-- generate_reference_v4.py            # 20-seed reference (~10 min)
+-- reference_v4/                       # I_by_patch.npy, R_by_patch.npy, meta
|
+-- cmp_identifiability_v4.py           # CMP sweep (run first)
+-- calibrate_cmp_v4_coldstart.py       # Stage 1 -- CMP cold-start (~5 min)
|
+-- abm_identifiability_v4.py           # ABM sweep
+-- calibrate_abm_v4_cascade_F.py       # Stage 2 -- winning cumulative cascade
+-- calibrate_abm_v4_cascade_{B,A,...}  # earlier variants -- kept for the M lesson
|
+-- v4_loss_curves.py                   # noise-floor + beta profile diagnostic
+-- v4_calibration_validation.py        # validation figure above
+-- INTRO_SCENARIO.md                   # this document

End-to-end reproduction from a fresh checkout: ~10 min reference, 5 min Stage 1, ~2 hours Stage 2 (1 trial = 60 ABM seeds = ~5 min on a 16-core machine), 5 min validation. Roughly three hours of compute in total, and rather less human time.

A natural extension for a user wanting to push c recovery: add a per-patch spatial loss term (e.g. sum of (arrival_tick_p_model - arrival_tick_p_ref)^2 for patches invaded in >= 50 % of seeds). c controls which patches in B_far the chain reaches first, and that signal is washed out in cluster-aggregate summary statistics.
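
A sketch of that spatial loss term. The array shapes are assumptions (per-seed, per-day, per-patch infection counts), and a patch is counted as invaded once its cumulative infections exceed zero; none of this is the existing loss code.

```python
import numpy as np

def arrival_ticks(I_by_patch):
    """Per-seed, per-patch first day with any infection (inf if never invaded).

    I_by_patch is assumed to have shape (seeds, days, patches).
    """
    cum = np.cumsum(I_by_patch, axis=1) > 0                  # (seeds, days, patches)
    first = np.where(cum.any(axis=1), cum.argmax(axis=1), np.inf)
    return first                                             # (seeds, patches)

def spatial_arrival_loss(I_model, I_ref, min_invasion_frac=0.5):
    """Sum of squared mean arrival-time differences over reliably invaded patches."""
    t_model, t_ref = arrival_ticks(I_model), arrival_ticks(I_ref)
    # only patches invaded in >= 50 % of reference seeds carry a timing signal
    reliable = np.mean(np.isfinite(t_ref), axis=0) >= min_invasion_frac
    mean_model = np.nanmean(np.where(np.isfinite(t_model), t_model, np.nan), axis=0)
    mean_ref = np.nanmean(np.where(np.isfinite(t_ref), t_ref, np.nan), axis=0)
    diff = mean_model - mean_ref
    return float(np.nansum(diff[reliable] ** 2))
```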
