Formalized Testing of MCP Server Code Generation - laser-base/laser-measles GitHub Wiki

Overview

We designed a suite of 20 prompts to stress-test the MCP's ability to translate plain-language epidemiological research questions into runnable laser-measles code. The prompts were written deliberately without any knowledge of the API -- no class names, no module paths, no parameter names. They read the way a modeler would write a request to a colleague, not a developer querying documentation.

The suite spans a complexity gradient: early prompts (p01-p05) are single-population basics; mid-range prompts (p06-p15) introduce spatial structure, multiple model types, vital dynamics, interventions, and visualization; late prompts (p16-p20) include advanced features -- custom components, three-model comparisons, WPP demographics, and parallel execution.

This is a reusable test suite: because the prompts contain no API specifics, they can be re-run unchanged against future versions of the server and documentation to measure improvement.

Overall result: 13/20 pass (65%)


The Prompts and Results

Note on the "output correct?" column: we validated only that scripts ran to completion without errors; there was no automated check against known-good reference values. The column therefore reflects a retrospective assessment of the printed outputs against epidemiological expectations: PASS where the output was plausible and internally consistent, SUSPECT where the script ran but produced questionable values, and INCOMPLETE where it ran but left the prompt's actual question unanswered.

| # | Prompt | Model type | Ran? | Output correct? | Notes |
|---|--------|------------|------|-----------------|-------|
| 01 | Simulate a measles outbreak in a single city of 100,000 over one year. Track peak infectious individuals. | ABM | PASS | PASS | Peak 51,548 on day 60 -- plausible for naive pop with measles R0 |
| 02 | Simulate measles in a single community of 50,000 for one year using a biweekly model. Count recovered at end. | BWK | FAIL | n/a | Polars boolean Series mismatch in chained .filter() call |
| 03 | ABM, single pop 100K, 180 days, 10 seeds. Report peak and day of peak. | ABM | PASS | PASS | Peak 51,925 on day 48 -- consistent with p01 |
| 04 | ABM, 200K, births/deaths, 85% MCV1, 1 year. Report average and max infectious. | ABM | PASS | SUSPECT | Max 103,302 (52% of pop) simultaneous infectious seems high with 85% coverage; avg 4,392 more plausible |
| 05 | Two connected communities (100K city, 50K town), gravity travel. Report fraction of each infected. | ABM | PASS | PASS | Near-total infection (~99.98%) expected with no vaccination and high R0 |
| 06 | 5-patch biweekly, endemic equilibrium, vital dynamics, importation, 5 years. Per-patch mean infectious. | BWK | FAIL | n/a | StateTrackerParams does not exist in biweekly.components |
| 07 | 3-patch ABM, gravity travel, seed patch 0 only, 6 months. Global peak and per-community attack rate. | ABM | PASS | PASS | Global peak 108,839; all patches ~99.96% -- expected without vaccination |
| 08 | 5-patch biweekly, vital dynamics, importation, 4 years. Save epidemic curve PNG. | BWK | PASS | PASS | PNG saved successfully; ran 104 ticks as expected |
| 09 | Compartmental, 500K, 82% MCV1, 30% case detection, 1 year. True infections, detected cases, under-reporting ratio. | CMP | FAIL | n/a | CaseSurveillanceTracker.get_dataframe() column name unknown; fallback hit a string-dtype column |
| 10 | 8-patch ABM, MCV1 gradient 0-90%, gravity travel, endemic equilibrium, 1 year. Attack rate per patch. | ABM | PASS | WRONG | All patches reported identical 87.5% attack rate regardless of MCV1; lowest incorrectly attributed to 0% coverage patch |
| 11 | 3-patch ABM, SIA at 3 months (80% coverage), 1 year. Report peak infectious. | ABM | PASS | INCOMPLETE | Peak 43,228 reported, but no baseline comparison -- cannot verify the SIA had any effect |
| 12 | 10-patch biweekly (10K each), births/deaths, importation, 10 years. Count fadeouts and timing. | BWK | FAIL | n/a | Same StateTrackerParams failure as p06 |
| 13 | ABM, 500K, 85% MCV1, 30% seasonal amplitude, 10 years. Compute biennial ratio. | ABM | PASS | SUSPECT | Annual peaks declining (764 to 122) suggest epidemic burndown, not stable endemic cycle; biennial ratio 1.416 from non-stationary series |
| 14 | 5-patch ABM, radiation mixing model, seed patch 0, 6 months. Print mixing matrix and attack rates. | ABM | PASS | PASS | Mixing matrix printed (near-diagonal, plausible for radiation model); all patches ~99.96% |
| 15 | Biweekly, 5x5 grid of R0 (4-20) x MCV1 (0-90%), 100K, 10 years. Print endemic burden grid. | BWK | PASS | PASS | R0 x MCV1 grid epidemiologically consistent; herd immunity threshold visible as contour |
| 16 | Two-cluster synthetic scenario, biweekly, 5 years. Save spatial attack-rate heatmap PNG. | BWK | FAIL | n/a | Same StateTrackerParams failure as p06 and p12 |
| 17 | 5-patch ABM, custom component tracking day of first infection per patch, seed patch 0, 6 months. | ABM | PASS | PASS | Wavefront: patch_0 day 4, others days 20-32; spatial progression plausible |
| 18 | ABM, age-structured demographics, 200K, 85% MCV1, 5 years. Age pyramid tracker, report change. | ABM | PASS | INCOMPLETE | Tracker ran and produced annual snapshots; age distribution change not quantitatively printed |
| 19 | ABM, WPP vital dynamics, 300K, 5 years. Report start/end population, verify WPP-driven change. | ABM | FAIL | n/a | WPPVitalDynamicsProcess exceeds LaserFrame capacity ~4% into run |
| 20 | Compare all three model types in parallel (multiprocessing), 20-patch, 500K, 85% MCV1, 5 years. 3-panel PNG. | ABM + BWK + CMP | FAIL | n/a | Lambda function not picklable in multiprocessing.Pool.starmap |

Summary Statistics

| Outcome | Count |
|---------|-------|
| Total prompts | 20 |
| Ran without errors | 13 (65%) |
| Ran AND output looks correct | 8 (40%) |
| Ran but output suspect or incomplete | 5 (25%) |
| Failed to run | 7 (35%) |

By model type

| Model type | Prompts | Pass rate |
|------------|---------|-----------|
| ABM | 12 | 92% (11/12) |
| Biweekly | 6 | 33% (2/6) |
| Compartmental | 1 | 0% (0/1) |
| All three | 1 | 0% (0/1) |

The ABM is by far the most reliable target for code generation. The biweekly model is the weakest -- three of its four failures share a single root cause (see below).


Failure Analysis

Root causes

| Root cause | Failures | Scripts |
|-----------|----------|---------|
| StateTrackerParams absent from biweekly.components | 3 | p06, p12, p16 |
| LaserFrame capacity overflow (WPP vital dynamics) | 1 | p19 |
| Polars boolean Series in chained .filter() | 1 | p02 |
| CaseSurveillanceTracker output schema undocumented | 1 | p09 |
| Lambda not picklable in multiprocessing | 1 | p20 |

1 -- StateTrackerParams missing from biweekly.components (p06, p12, p16)

Three prompts asked for per-patch tracking in biweekly models. All three generated components.StateTrackerParams(aggregation_level=0) -- which exists in the ABM and Compartmental namespaces but not in laser.measles.biweekly.components. The MCP retrieved the correct pattern from ABM documentation and applied it to the wrong model type.
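A cheap mitigation, independent of any documentation fix, would be a pre-flight check that every class a generated script references actually exists in the module it imports it from. A minimal sketch (the helper is hypothetical, demonstrated with stdlib modules since it only relies on standard import machinery):

```python
# Hypothetical pre-flight check for generated scripts: verify a class name
# exists in the target namespace before executing the script. A check like
# this would have flagged StateTrackerParams being absent from
# laser.measles.biweekly.components before the script ever ran.
import importlib

def class_exists(module_path: str, class_name: str) -> bool:
    """Return True iff `module_path` imports and exposes `class_name`."""
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return False
    return hasattr(module, class_name)

# Demonstrated with stdlib modules, since laser-measles may not be installed:
print(class_exists("collections", "Counter"))             # True
print(class_exists("collections", "StateTrackerParams"))  # False
```

Running such a check over the generated imports before execution turns a mid-run AttributeError into an actionable "class X not in namespace Y" message.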

This is the single highest-leverage documentation fix available: one Gotchas entry clarifying which param classes exist in which model namespace would resolve 43% of all failures.

2 -- Incorrect epidemiological output: p10

Script p10 ran successfully but produced scientifically wrong output. Eight communities with MCV1 ranging from 0% to 90% all reported identical attack rates of 87.5%, and the "lowest attack rate" was incorrectly attributed to the community with zero vaccination. The MCV1 gradient had no visible effect on transmission.

This is the most concerning class of failure: the script ran and printed plausible-looking numbers, but the answer is wrong. It is not clear whether this is a code generation error (mcv1 values not wired into the right parameter), a model behavior issue (short run time prevents vaccination effect from showing), or something else. A passing runtime check is not a correctness check.

3 -- LaserFrame capacity overflow (p19)

WPPVitalDynamicsProcess crashed at 4% of the run (tick 80/1827):

```
ValueError: frame.add() exceeds capacity (self._count=301695 + count=478758 > self._capacity=360219)
```

The ABM pre-allocates a fixed LaserFrame at construction time. Vital dynamics components that add births require headroom beyond the initial population. This is undocumented, and the generated code provided no capacity buffer. This failure mode was also observed in the original issues audit.
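The mechanism is easy to reproduce in isolation. The sketch below uses a toy frame class whose names only mimic the error message above (the real pre-allocation lives in laser-core's LaserFrame; the headroom figure is illustrative, not a documented multiplier):

```python
# Toy reproduction of the p19 failure mode: a fixed-capacity frame that
# rejects additions beyond its pre-allocated size. Class and attribute
# names mimic the traceback; they are not the laser-core API.
class ToyFrame:
    def __init__(self, capacity: int, count: int = 0):
        self._capacity = capacity
        self._count = count

    def add(self, count: int) -> None:
        if self._count + count > self._capacity:
            raise ValueError(
                f"frame.add() exceeds capacity (self._count={self._count} "
                f"+ count={count} > self._capacity={self._capacity})"
            )
        self._count += count

pop = 300_000

# Capacity barely above the initial population fails once births accumulate
# (magnitudes taken from the p19 error message):
tight = ToyFrame(capacity=360_219, count=pop)
try:
    tight.add(478_758)
except ValueError as err:
    print(err)

# Sizing capacity for the initial population PLUS the expected birth inflow
# over the whole run avoids the overflow:
roomy = ToyFrame(capacity=pop + 500_000, count=pop)
roomy.add(478_758)  # fits
```

The documentation fix proposed below is the real remedy; this sketch only shows why "capacity = initial population" is never enough once a vital dynamics component is attached.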

4 -- Polars filter API misuse (p02)

The generated code chained two .filter() calls, passing df["col"] == value boolean Series from the original unfiltered DataFrame into the second call. Polars rejects the length mismatch. The correct pattern is df.filter((pl.col("a") == x) & (pl.col("b") == y)). This is a general Polars pitfall, not specific to laser-measles.

5 -- CaseSurveillanceTracker schema (p09)

The model ran successfully and the tracker was retrieved correctly via model.get_instance(). The failure occurred when the code tried to sum the "detected_cases" column -- either that column does not exist under that name or it has unexpected type. The tracker's output DataFrame schema is not documented in a form the RAG can retrieve.

6 -- Lambda pickling in multiprocessing (p20)

pool.starmap(lambda f: f(sd), [...]) fails because Python's multiprocessing module uses pickle for IPC, and anonymous lambdas are not picklable. The fix is a named top-level function. This prompt was designed to be the hardest in the suite and was expected to fail -- it tests a Python-level gotcha that documentation cannot easily prevent.
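The underlying rule is that pickle serializes functions by qualified name, so an anonymous lambda cannot round-trip while a named module-level function can. A self-contained demonstration (the worker below is a stand-in, not the p20 script's actual function):

```python
# Why p20 failed: multiprocessing ships work items between processes with
# pickle, and pickle stores functions by module-qualified name. A lambda
# has no resolvable name; a named top-level function does.
import pickle

def run_scenario(model_name, seed):
    """Named module-level stand-in for the per-model worker in p20."""
    return (model_name, seed)

# The lambda fails to pickle...
try:
    pickle.dumps(lambda f: f("sd"))
    lambda_picklable = True
except Exception:
    lambda_picklable = False

# ...while the named function pickles and restores cleanly.
restored = pickle.loads(pickle.dumps(run_scenario))

print(lambda_picklable)     # False
print(restored("abm", 42))  # ('abm', 42)
```

With a Pool, the same principle means passing the named function directly, e.g. pool.starmap(run_scenario, [("abm", 1), ("bwk", 2), ("cmp", 3)]) inside an `if __name__ == "__main__":` guard.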


Observations on Passing Scripts

Several results are worth highlighting positively:

p15 (R0 x MCV1 grid, 25 biweekly runs) produced an epidemiologically coherent heatmap showing increasing endemic burden with R0 and decreasing burden with MCV1 coverage, with the herd immunity threshold visible as a contour. This is non-trivial to get right and the output is scientifically interpretable.
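The contour p15 recovers has a simple analytic counterpart: with a perfect vaccine, the critical coverage is 1 - 1/R0. A quick sketch of where the threshold should fall on the grid (the exact axis spacing of the 5x5 grid is assumed, not read from the generated script):

```python
# Analytic herd-immunity contour for a p15-style grid: elimination is
# expected where MCV1 coverage exceeds 1 - 1/R0 (basic SIR theory, perfect
# vaccine assumed). Axis values are an assumed even spacing of the prompt's
# R0 4-20 and MCV1 0-90% ranges.
r0_values = [4, 8, 12, 16, 20]
mcv1_values = [0.0, 0.225, 0.45, 0.675, 0.90]

for r0 in r0_values:
    threshold = 1 - 1 / r0
    row = ["eliminated" if cov > threshold else "endemic" for cov in mcv1_values]
    print(f"R0={r0:2d} (threshold {threshold:.2f}):", row)
```

Note that for R0 >= 12 the threshold exceeds 90%, so even the highest-coverage column should remain endemic -- a cheap sanity check on the heatmap's right-hand edge.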

p17 (custom wavefront tracker) asked the MCP to write a completely novel tracking component from scratch -- recording the day of first infection per patch. It passed and produced a sensible spatial progression. This is the strongest positive result in the suite.

p14 (radiation mixing model) correctly generated and printed a mixing matrix for the radiation travel model, an alternative to the more common gravity model. All five patches' attack rates were reported correctly.


What We Did Not Test

  • Correctness validation: Pass/fail was determined by runtime exit code, not by comparison to reference values. p10 illustrates why this matters -- it passed the runtime test and failed the science test.
  • Stochastic stability: Each prompt was run once with a fixed seed. We did not assess whether results are stable across seeds.
  • Long runs: Most prompts used short or medium durations. Memory and performance at large scale were not tested.
  • All three models on the same scenario: p20 was designed to do this but failed. A working version of that comparison would be the most valuable single test case to add.
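One lightweight reference check that could close part of the correctness gap: in an unvaccinated closed population, the final attack rate should satisfy the classic SIR final-size relation z = 1 - exp(-R0 * z). A sketch of such a check (R0 = 15 is an assumed measles-like value, not something read from the generated scripts):

```python
# Sketch of an automated correctness check: compare a reported attack rate
# against the SIR final-size relation z = 1 - exp(-R0 * z), solved by
# fixed-point iteration. Assumes a closed, fully susceptible population.
import math

def final_size(r0: float, tol: float = 1e-12) -> float:
    """Solve z = 1 - exp(-r0 * z) for the epidemic final size z."""
    z = 0.99  # start near 1, appropriate for large r0
    for _ in range(1000):
        z_next = 1 - math.exp(-r0 * z)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z

expected = final_size(15.0)       # assumed measles-like R0
reported = 0.9996                 # attack rate printed by p05/p07
print(f"expected {expected:.4f}, reported {reported:.4f}")
assert abs(expected - reported) < 0.01  # would flag p10-style anomalies
```

A check like this would have caught p10 immediately: 87.5% attack rate in a patch with 0% coverage is far below any measles-like final size.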

Recommended Documentation Improvements

| Priority | Fix | Addresses |
|----------|-----|-----------|
| High | Add Gotchas entry: StateTrackerParams is not available in biweekly.components; document the biweekly per-patch tracking pattern explicitly | p06, p12, p16 (3 failures) |
| High | Add Gotchas entry: ABM with vital dynamics requires pre-allocated LaserFrame headroom -- specify the parameter and recommended multiplier | p19, and all ABM vital dynamics |
| Medium | Document CaseSurveillanceTracker.get_dataframe() column schema in the API reference | p09 |
| Medium | Add note to tutorials: use pl.col() expressions in .filter() calls, not boolean Series derived from the original frame | p02 |
| Low | Add a note that laser.measles models are not designed for multiprocessing.Pool; recommend sequential or thread-based approaches | p20 |

Setup

Date: 2026-03-20

MCP server: jenner-measles-mcp-prod (laser-measles 0.9.2)

Method: 20 prompts written in researcher language (no API class names), generated via generate_code, executed under python3.11 with the laser-measles venv.