Formalized Testing of MCP Server Code Generation - laser-base/laser-measles GitHub Wiki
Overview
We designed a suite of 20 prompts to stress-test the MCP's ability to translate plain-language epidemiological research questions into runnable laser-measles code. The prompts were written deliberately without any knowledge of the API -- no class names, no module paths, no parameter names. They read the way a modeler would write a request to a colleague, not a developer querying documentation.
The suite spans a complexity gradient: early prompts (p01-p05) are single-population basics; mid-range prompts (p06-p15) introduce spatial structure, multiple model types, vital dynamics, interventions, and visualization; late prompts (p16-p20) include advanced features -- custom components, three-model comparisons, WPP demographics, and parallel execution.
The suite is reusable: the same prompts can be rerun unchanged after documentation fixes to measure improvement.
Overall result: 13/20 pass (65%)
The Prompts and Results
Note on "output correct?" column: We validated that scripts ran to completion without errors. We did not run a separate automated correctness check against known-good reference values. The "output correct?" column reflects our retrospective assessment of the printed outputs against epidemiological expectations: where the output was plausible and internally consistent we mark it PASS; where it ran but produced suspect values we mark it SUSPECT; where it ran but the prompt's actual question went unanswered we mark it INCOMPLETE; where it ran but the output is demonstrably wrong we mark it WRONG.
| # | Prompt | Model type | Ran? | Output correct? | Notes |
|---|---|---|---|---|---|
| 01 | Simulate a measles outbreak in a single city of 100,000 over one year. Track peak infectious individuals. | ABM | PASS | PASS | Peak 51,548 on day 60 -- plausible for naive pop with measles R0 |
| 02 | Simulate measles in a single community of 50,000 for one year using a biweekly model. Count recovered at end. | BWK | FAIL | n/a | Polars boolean Series mismatch in chained .filter() call |
| 03 | ABM, single pop 100K, 180 days, 10 seeds. Report peak and day of peak. | ABM | PASS | PASS | Peak 51,925 on day 48 -- consistent with p01 |
| 04 | ABM, 200K, births/deaths, 85% MCV1, 1 year. Report average and max infectious. | ABM | PASS | SUSPECT | Max 103,302 (52% of pop) simultaneous infectious seems high with 85% coverage; avg 4,392 more plausible |
| 05 | Two connected communities (100K city, 50K town), gravity travel. Report fraction of each infected. | ABM | PASS | PASS | Near-total infection (~99.98%) expected with no vaccination and high R0 |
| 06 | 5-patch biweekly, endemic equilibrium, vital dynamics, importation, 5 years. Per-patch mean infectious. | BWK | FAIL | n/a | StateTrackerParams does not exist in biweekly.components |
| 07 | 3-patch ABM, gravity travel, seed patch 0 only, 6 months. Global peak and per-community attack rate. | ABM | PASS | PASS | Global peak 108,839; all patches ~99.96% -- expected without vaccination |
| 08 | 5-patch biweekly, vital dynamics, importation, 4 years. Save epidemic curve PNG. | BWK | PASS | PASS | PNG saved successfully; ran 104 ticks as expected |
| 09 | Compartmental, 500K, 82% MCV1, 30% case detection, 1 year. True infections, detected cases, under-reporting ratio. | CMP | FAIL | n/a | CaseSurveillanceTracker.get_dataframe() column name unknown; fallback hit a string-dtype column |
| 10 | 8-patch ABM, MCV1 gradient 0-90%, gravity travel, endemic equilibrium, 1 year. Attack rate per patch. | ABM | PASS | WRONG | All patches reported identical 87.5% attack rate regardless of MCV1; lowest incorrectly attributed to 0% coverage patch |
| 11 | 3-patch ABM, SIA at 3 months (80% coverage), 1 year. Report peak infectious. | ABM | PASS | INCOMPLETE | Peak 43,228 reported, but no baseline comparison -- cannot verify the SIA had any effect |
| 12 | 10-patch biweekly (10K each), births/deaths, importation, 10 years. Count fadeouts and timing. | BWK | FAIL | n/a | Same StateTrackerParams failure as p06 |
| 13 | ABM, 500K, 85% MCV1, 30% seasonal amplitude, 10 years. Compute biennial ratio. | ABM | PASS | SUSPECT | Annual peaks declining (764 to 122) suggest epidemic burndown, not stable endemic cycle; biennial ratio 1.416 from non-stationary series |
| 14 | 5-patch ABM, radiation mixing model, seed patch 0, 6 months. Print mixing matrix and attack rates. | ABM | PASS | PASS | Mixing matrix printed (near-diagonal, plausible for radiation model); all patches ~99.96% |
| 15 | Biweekly, 5x5 grid of R0 (4-20) x MCV1 (0-90%), 100K, 10 years. Print endemic burden grid. | BWK | PASS | PASS | R0 x MCV1 grid epidemiologically consistent; herd immunity threshold visible as contour |
| 16 | Two-cluster synthetic scenario, biweekly, 5 years. Save spatial attack-rate heatmap PNG. | BWK | FAIL | n/a | Same StateTrackerParams failure as p06 and p12 |
| 17 | 5-patch ABM, custom component tracking day of first infection per patch, seed patch 0, 6 months. | ABM | PASS | PASS | Wavefront: patch_0 day 4, others days 20-32; spatial progression plausible |
| 18 | ABM, age-structured demographics, 200K, 85% MCV1, 5 years. Age pyramid tracker, report change. | ABM | PASS | INCOMPLETE | Tracker ran and produced annual snapshots; age distribution change not quantitatively printed |
| 19 | ABM, WPP vital dynamics, 300K, 5 years. Report start/end population, verify WPP-driven change. | ABM | FAIL | n/a | WPPVitalDynamicsProcess exceeds LaserFrame capacity ~4% into run |
| 20 | Compare all three model types in parallel (multiprocessing), 20-patch, 500K, 85% MCV1, 5 years. 3-panel PNG. | ABM + BWK + CMP | FAIL | n/a | Lambda function not picklable in multiprocessing.Pool.starmap |
Summary Statistics
| Outcome | Count |
|---|---|
| Total prompts | 20 |
| Ran without errors | 13 (65%) |
| Ran AND output looks correct | 8 (40%) |
| Ran but output wrong, suspect, or incomplete | 5 (25%) |
| Failed to run | 7 (35%) |
By model type
| Model type | Prompts | Pass rate |
|---|---|---|
| ABM | 12 | 92% (11/12) |
| Biweekly | 6 | 33% (2/6) |
| Compartmental | 1 | 0% (0/1) |
| All three | 1 | 0% (0/1) |
The ABM model is by far the most reliable target for code generation. The biweekly model is the weakest -- three of its four failures share a single root cause (see below).
Failure Analysis
Root causes
| Root cause | Failures | Scripts |
|---|---|---|
| StateTrackerParams absent from biweekly.components | 3 | p06, p12, p16 |
| LaserFrame capacity overflow (WPP vital dynamics) | 1 | p19 |
| Polars boolean Series in chained .filter() | 1 | p02 |
| CaseSurveillanceTracker output schema undocumented | 1 | p09 |
| Lambda not picklable in multiprocessing | 1 | p20 |
1 -- StateTrackerParams missing from biweekly.components (p06, p12, p16)
Three prompts asked for per-patch tracking in biweekly models. All three generated components.StateTrackerParams(aggregation_level=0) -- which exists in the ABM and Compartmental namespaces but not in laser.measles.biweekly.components. The MCP retrieved the correct pattern from ABM documentation and applied it to the wrong model type.
This is the single highest-leverage documentation fix available: one Gotchas entry clarifying which param classes exist in which model namespace would resolve 43% of all failures.
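Beyond the documentation fix, this class of error could also be caught at generation time by verifying the target namespace before emitting a component class. A minimal sketch in plain Python (the demonstration uses a stdlib module; the laser-measles module paths themselves are not reproduced here):

```python
import importlib

def component_exists(module_path: str, class_name: str) -> bool:
    """Return True if class_name is importable from module_path.

    Generic sketch: checking the target model's namespace before emitting
    code would catch StateTrackerParams-style mismatches at generation
    time instead of at run time.
    """
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return False
    return hasattr(module, class_name)

# Stdlib demonstration; a real check would probe the model's components module.
print(component_exists("math", "sqrt"))    # True
print(component_exists("math", "Missing")) # False
```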
2 -- Incorrect epidemiological output: p10
Script p10 ran successfully but produced scientifically wrong output. Eight communities with MCV1 ranging from 0% to 90% all reported identical attack rates of 87.5%, and the "lowest attack rate" was incorrectly attributed to the community with zero vaccination. The MCV1 gradient had no visible effect on transmission.
This is the most concerning class of failure: the script ran and printed plausible-looking numbers, but the answer is wrong. It is not clear whether this is a code generation error (mcv1 values not wired into the right parameter), a model behavior issue (short run time prevents vaccination effect from showing), or something else. A passing runtime check is not a correctness check.
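One cheap mitigation is a post-run sanity assertion on the science, not just the exit code. The sketch below uses invented attack-rate values, not output from the actual p10 script, to show the shape of such a check:

```python
import numpy as np

# Illustrative post-run check with made-up numbers: with MCV1 coverage
# rising from 0% to 90% across eight patches, attack rates should both
# vary between patches and trend downward with coverage. p10's identical
# 87.5% values would fail both assertions despite a clean exit code.
coverage = np.linspace(0.0, 0.9, 8)
attack_rates = np.array([0.98, 0.95, 0.90, 0.82, 0.70, 0.52, 0.30, 0.08])

assert attack_rates.std() > 0.01, "flat attack rates suggest MCV1 was never wired in"
r = np.corrcoef(coverage, attack_rates)[0, 1]
assert r < -0.9, "attack rate should decline with coverage"
print(f"coverage/attack-rate correlation: {r:.3f}")
```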
3 -- LaserFrame capacity overflow (p19)
WPPVitalDynamicsProcess crashed at 4% of the run (tick 80/1827):
```
ValueError: frame.add() exceeds capacity (self._count=301695 + count=478758 > self._capacity=360219)
```
The ABM pre-allocates a fixed LaserFrame at construction time. Vital dynamics components that add births require headroom beyond the initial population. This is undocumented, and the generated code provided no capacity buffer. This failure mode was also observed in the original issues audit.
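The headroom arithmetic can be sketched as follows. The helper name, its signature, and the crude birth rate of 30/1000 are all assumptions for illustration; the real laser-measles capacity parameter may be named differently:

```python
def capacity_with_headroom(initial_pop: int, cbr_per_1000: float,
                           years: float, safety: float = 1.1) -> int:
    """Upper-bound the agent count a fixed-capacity frame must hold.

    Hypothetical helper, not the laser-measles API. The point: births
    accumulate over the run, so capacity must exceed the initial
    population by the expected births plus a safety margin.
    """
    expected_births = initial_pop * (cbr_per_1000 / 1000.0) * years
    return int((initial_pop + expected_births) * safety)

# p19's scenario: 300K agents over 5 years. With an assumed CBR of
# 30/1000, a safe capacity lands well above the 360,219 slots the
# generated script's frame actually had.
print(capacity_with_headroom(300_000, 30, 5))  # 379500
```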
4 -- Polars filter API misuse (p02)
The generated code chained two .filter() calls, passing df["col"] == value boolean Series from the original unfiltered DataFrame into the second call. Polars rejects the length mismatch. The correct pattern is df.filter((pl.col("a") == x) & (pl.col("b") == y)). This is a general Polars pitfall, not specific to laser-measles.
5 -- CaseSurveillanceTracker schema (p09)
The model ran successfully and the tracker was retrieved correctly via model.get_instance(). The failure occurred when the code tried to sum the "detected_cases" column -- either that column does not exist under that name or it has unexpected type. The tracker's output DataFrame schema is not documented in a form the RAG can retrieve.
6 -- Lambda pickling in multiprocessing (p20)
pool.starmap(lambda f: f(sd), [...]) fails because Python's multiprocessing module uses pickle for IPC, and anonymous lambdas are not picklable. The fix is a named top-level function. This prompt was designed to be the hardest in the suite and was expected to fail -- it tests a Python-level gotcha that documentation cannot easily prevent.
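The fix in sketch form. The worker body is a stand-in (the real script would build and run a model there), but the structural point is real: a named, module-level function is picklable where a lambda is not, and the `__main__` guard keeps worker processes from re-executing the pool setup:

```python
import multiprocessing as mp

def run_model(model_type: str, seed: int) -> str:
    """Named top-level function: picklable, unlike a lambda.

    Stand-in body; a real version would construct and run one model.
    """
    return f"{model_type}:{seed}"

if __name__ == "__main__":
    scenarios = [("abm", 42), ("biweekly", 42), ("compartmental", 42)]
    with mp.Pool(processes=3) as pool:
        # starmap unpacks each tuple into run_model's positional arguments
        results = pool.starmap(run_model, scenarios)
    print(results)
```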
Observations on Passing Scripts
Several results are worth highlighting positively:
p15 (R0 x MCV1 grid, 25 biweekly runs) produced an epidemiologically coherent heatmap showing increasing endemic burden with R0 and decreasing burden with MCV1 coverage, with the herd immunity threshold visible as a contour. This is non-trivial to get right and the output is scientifically interpretable.
p17 (custom wavefront tracker) asked the MCP to write a completely novel tracking component from scratch -- recording the day of first infection per patch. It passed and produced a sensible spatial progression. This is the strongest positive result in the suite.
p14 (radiation mixing model) correctly generated and printed a mixing matrix for the radiation travel model, an alternative to the more common gravity model. All five patches' attack rates were reported correctly.
What We Did Not Test
- Correctness validation: Pass/fail was determined by runtime exit code, not by comparison to reference values. p10 illustrates why this matters -- it passed the runtime test and failed the science test.
- Stochastic stability: Each prompt was run once with a fixed seed. We did not assess whether results are stable across seeds.
- Long runs: Most prompts used short or medium durations. Memory and performance at large scale were not tested.
- All three models on the same scenario: p20 was designed to do this but failed. A working version of that comparison would be the most valuable single test case to add.
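A seed-stability check would be straightforward to bolt onto the harness. The sketch below substitutes smoothed Poisson noise for a real model run; only the run-under-many-seeds-and-report-spread structure carries over:

```python
import numpy as np

# Run the "same scenario" under several seeds and report the spread of
# the summary metric. The curve here is a stand-in, not a real model.
peaks = []
for seed in range(10):
    rng = np.random.default_rng(seed)
    curve = np.convolve(rng.poisson(5.0, 200), np.ones(14), mode="same")
    peaks.append(curve.max())

cv = np.std(peaks) / np.mean(peaks)
print(f"peak coefficient of variation across 10 seeds: {cv:.3f}")
```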
Recommended Documentation Improvements
| Priority | Fix | Addresses |
|---|---|---|
| High | Add Gotchas entry: StateTrackerParams is not available in biweekly.components; document the biweekly per-patch tracking pattern explicitly | p06, p12, p16 (3 failures) |
| High | Add Gotchas entry: ABM with vital dynamics requires pre-allocated LaserFrame headroom -- specify the parameter and recommended multiplier | p19, and all ABM vital dynamics |
| Medium | Document CaseSurveillanceTracker.get_dataframe() column schema in the API reference | p09 |
| Medium | Add note to tutorials: use pl.col() expressions in .filter() calls, not boolean Series derived from the original frame | p02 |
| Low | Add a note that laser.measles models are not designed for multiprocessing.Pool; recommend sequential or thread-based approaches | p20 |
Setup
Date: 2026-03-20
MCP server: jenner-measles-mcp-prod (laser-measles 0.9.2)
Method: 20 prompts written in researcher language (no API class names), generated via generate_code, executed under python3.11 with the laser-measles venv.