Formalized Testing of MCP Server Code Generation - laser-base/laser-measles GitHub Wiki

Overview

We designed a suite of 20 prompts to stress-test the MCP's ability to translate plain-language epidemiological research questions into runnable laser-measles code. The prompts were written deliberately without any knowledge of the API -- no class names, no module paths, no parameter names. They read the way a modeler would write a request to a colleague, not a developer querying documentation.

The suite spans a complexity gradient: early prompts (p01-p05) are single-population basics; mid-range prompts (p06-p15) introduce spatial structure, multiple model types, vital dynamics, interventions, and visualization; late prompts (p16-p20) include advanced features -- custom components, three-model comparisons, WPP demographics, and parallel execution.

This is a reusable test suite: because the prompts contain no API specifics, they can be re-run unchanged against future versions of the server and documentation to measure improvement.

Overall result: 13/20 pass (65%)


The Prompts and Results

Note on the "output correct?" column: we validated only that scripts ran to completion without errors; there was no automated check against known-good reference values. The column therefore reflects a retrospective assessment of the printed outputs against epidemiological expectations: PASS where the output was plausible and internally consistent, SUSPECT where the script ran but produced questionable values, and INCOMPLETE where it ran but left the prompt's actual question unanswered.

| # | Prompt | Model type | Ran? | Output correct? | Notes |
|---|--------|------------|------|-----------------|-------|
| 01 | Simulate a measles outbreak in a single city of 100,000 over one year. Track peak infectious individuals. | ABM | PASS | PASS | Peak 51,548 on day 60 -- plausible for naive pop with measles R0 |
| 02 | Simulate measles in a single community of 50,000 for one year using a biweekly model. Count recovered at end. | BWK | FAIL | n/a | Polars boolean Series mismatch in chained .filter() call |
| 03 | ABM, single pop 100K, 180 days, 10 seeds. Report peak and day of peak. | ABM | PASS | PASS | Peak 51,925 on day 48 -- consistent with p01 |
| 04 | ABM, 200K, births/deaths, 85% MCV1, 1 year. Report average and max infectious. | ABM | PASS | SUSPECT | Max 103,302 (52% of pop) simultaneous infectious seems high with 85% coverage; avg 4,392 more plausible |
| 05 | Two connected communities (100K city, 50K town), gravity travel. Report fraction of each infected. | ABM | PASS | PASS | Near-total infection (~99.98%) expected with no vaccination and high R0 |
| 06 | 5-patch biweekly, endemic equilibrium, vital dynamics, importation, 5 years. Per-patch mean infectious. | BWK | FAIL | n/a | StateTrackerParams does not exist in biweekly.components |
| 07 | 3-patch ABM, gravity travel, seed patch 0 only, 6 months. Global peak and per-community attack rate. | ABM | PASS | PASS | Global peak 108,839; all patches ~99.96% -- expected without vaccination |
| 08 | 5-patch biweekly, vital dynamics, importation, 4 years. Save epidemic curve PNG. | BWK | PASS | PASS | PNG saved successfully; ran 104 ticks as expected |
| 09 | Compartmental, 500K, 82% MCV1, 30% case detection, 1 year. True infections, detected cases, under-reporting ratio. | CMP | FAIL | n/a | CaseSurveillanceTracker.get_dataframe() column name unknown; fallback hit a string-dtype column |
| 10 | 8-patch ABM, MCV1 gradient 0-90%, gravity travel, endemic equilibrium, 1 year. Attack rate per patch. | ABM | PASS | WRONG | All patches reported identical 87.5% attack rate regardless of MCV1; lowest incorrectly attributed to 0% coverage patch |
| 11 | 3-patch ABM, SIA at 3 months (80% coverage), 1 year. Report peak infectious. | ABM | PASS | INCOMPLETE | Peak 43,228 reported, but no baseline comparison -- cannot verify the SIA had any effect |
| 12 | 10-patch biweekly (10K each), births/deaths, importation, 10 years. Count fadeouts and timing. | BWK | FAIL | n/a | Same StateTrackerParams failure as p06 |
| 13 | ABM, 500K, 85% MCV1, 30% seasonal amplitude, 10 years. Compute biennial ratio. | ABM | PASS | SUSPECT | Annual peaks declining (764 to 122) suggest epidemic burndown, not stable endemic cycle; biennial ratio 1.416 from non-stationary series |
| 14 | 5-patch ABM, radiation mixing model, seed patch 0, 6 months. Print mixing matrix and attack rates. | ABM | PASS | PASS | Mixing matrix printed (near-diagonal, plausible for radiation model); all patches ~99.96% |
| 15 | Biweekly, 5x5 grid of R0 (4-20) x MCV1 (0-90%), 100K, 10 years. Print endemic burden grid. | BWK | PASS | PASS | R0 x MCV1 grid epidemiologically consistent; herd immunity threshold visible as contour |
| 16 | Two-cluster synthetic scenario, biweekly, 5 years. Save spatial attack-rate heatmap PNG. | BWK | FAIL | n/a | Same StateTrackerParams failure as p06 and p12 |
| 17 | 5-patch ABM, custom component tracking day of first infection per patch, seed patch 0, 6 months. | ABM | PASS | PASS | Wavefront: patch_0 day 4, others days 20-32; spatial progression plausible |
| 18 | ABM, age-structured demographics, 200K, 85% MCV1, 5 years. Age pyramid tracker, report change. | ABM | PASS | INCOMPLETE | Tracker ran and produced annual snapshots; age distribution change not quantitatively printed |
| 19 | ABM, WPP vital dynamics, 300K, 5 years. Report start/end population, verify WPP-driven change. | ABM | FAIL | n/a | WPPVitalDynamicsProcess exceeds LaserFrame capacity ~4% into run |
| 20 | Compare all three model types in parallel (multiprocessing), 20-patch, 500K, 85% MCV1, 5 years. 3-panel PNG. | ABM + BWK + CMP | FAIL | n/a | Lambda function not picklable in multiprocessing.Pool.starmap |

Summary Statistics

| Outcome | Count |
|---------|-------|
| Total prompts | 20 |
| Ran without errors | 13 (65%) |
| Ran AND output looks correct | 8 (40%) |
| Ran but output suspect or incomplete | 5 (25%) |
| Failed to run | 7 (35%) |

By model type

| Model type | Prompts | Pass rate |
|------------|---------|-----------|
| ABM | 12 | 92% (11/12) |
| Biweekly | 6 | 33% (2/6) |
| Compartmental | 1 | 0% (0/1) |
| All three | 1 | 0% (0/1) |

The ABM is by far the most reliable target for code generation. The biweekly model is the weakest -- three of its four failures share a single root cause (see below).


Failure Analysis

Root causes

| Root cause | Failures | Scripts |
|-----------|----------|---------|
| StateTrackerParams absent from biweekly.components | 3 | p06, p12, p16 |
| LaserFrame capacity overflow (WPP vital dynamics) | 1 | p19 |
| Polars boolean Series in chained .filter() | 1 | p02 |
| CaseSurveillanceTracker output schema undocumented | 1 | p09 |
| Lambda not picklable in multiprocessing | 1 | p20 |

1 -- StateTrackerParams missing from biweekly.components (p06, p12, p16)

Three prompts asked for per-patch tracking in biweekly models. All three generated components.StateTrackerParams(aggregation_level=0) -- which exists in the ABM and Compartmental namespaces but not in laser.measles.biweekly.components. The MCP retrieved the correct pattern from ABM documentation and applied it to the wrong model type.
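A cheap mitigation, independent of any documentation fix, would be a pre-flight check that every class a generated script references actually exists in the module it imports it from. A minimal sketch (the helper is hypothetical, demonstrated with stdlib modules since it only relies on standard import machinery):

```python
# Hypothetical pre-flight check for generated scripts: verify a class name
# exists in the target namespace before executing the script. A check like
# this would have flagged StateTrackerParams being absent from
# laser.measles.biweekly.components before the script ever ran.
import importlib

def class_exists(module_path: str, class_name: str) -> bool:
    """Return True iff `module_path` imports and exposes `class_name`."""
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return False
    return hasattr(module, class_name)

# Demonstrated with stdlib modules, since laser-measles may not be installed:
print(class_exists("collections", "Counter"))             # True
print(class_exists("collections", "StateTrackerParams"))  # False
```

Running such a check over the generated imports before execution turns a mid-run AttributeError into an actionable "class X not in namespace Y" message.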

This is the single highest-leverage documentation fix available: one Gotchas entry clarifying which param classes exist in which model namespace would resolve 43% of all failures.

2 -- Incorrect epidemiological output: p10

Script p10 ran successfully but produced scientifically wrong output. Eight communities with MCV1 ranging from 0% to 90% all reported identical attack rates of 87.5%, and the "lowest attack rate" was incorrectly attributed to the community with zero vaccination. The MCV1 gradient had no visible effect on transmission.

This is the most concerning class of failure: the script ran and printed plausible-looking numbers, but the answer is wrong. It is not clear whether this is a code generation error (mcv1 values not wired into the right parameter), a model behavior issue (short run time prevents vaccination effect from showing), or something else. A passing runtime check is not a correctness check.

3 -- LaserFrame capacity overflow (p19)

WPPVitalDynamicsProcess crashed at 4% of the run (tick 80/1827):

```
ValueError: frame.add() exceeds capacity (self._count=301695 + count=478758 > self._capacity=360219)
```

The ABM pre-allocates a fixed LaserFrame at construction time. Vital dynamics components that add births require headroom beyond the initial population. This is undocumented, and the generated code provided no capacity buffer. This failure mode was also observed in the original issues audit.
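The mechanism is easy to reproduce in isolation. The sketch below uses a toy frame class whose names only mimic the error message above (the real pre-allocation lives in laser-core's LaserFrame; the headroom figure is illustrative, not a documented multiplier):

```python
# Toy reproduction of the p19 failure mode: a fixed-capacity frame that
# rejects additions beyond its pre-allocated size. Class and attribute
# names mimic the traceback; they are not the laser-core API.
class ToyFrame:
    def __init__(self, capacity: int, count: int = 0):
        self._capacity = capacity
        self._count = count

    def add(self, count: int) -> None:
        if self._count + count > self._capacity:
            raise ValueError(
                f"frame.add() exceeds capacity (self._count={self._count} "
                f"+ count={count} > self._capacity={self._capacity})"
            )
        self._count += count

pop = 300_000

# Capacity barely above the initial population fails once births accumulate
# (magnitudes taken from the p19 error message):
tight = ToyFrame(capacity=360_219, count=pop)
try:
    tight.add(478_758)
except ValueError as err:
    print(err)

# Sizing capacity for the initial population PLUS the expected birth inflow
# over the whole run avoids the overflow:
roomy = ToyFrame(capacity=pop + 500_000, count=pop)
roomy.add(478_758)  # fits
```

The documentation fix proposed below is the real remedy; this sketch only shows why "capacity = initial population" is never enough once a vital dynamics component is attached.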

4 -- Polars filter API misuse (p02)

The generated code chained two .filter() calls, passing df["col"] == value boolean Series from the original unfiltered DataFrame into the second call. Polars rejects the length mismatch. The correct pattern is df.filter((pl.col("a") == x) & (pl.col("b") == y)). This is a general Polars pitfall, not specific to laser-measles.

5 -- CaseSurveillanceTracker schema (p09)

The model ran successfully and the tracker was retrieved correctly via model.get_instance(). The failure occurred when the code tried to sum the "detected_cases" column -- either that column does not exist under that name or it has unexpected type. The tracker's output DataFrame schema is not documented in a form the RAG can retrieve.

6 -- Lambda pickling in multiprocessing (p20)

pool.starmap(lambda f: f(sd), [...]) fails because Python's multiprocessing module uses pickle for IPC, and anonymous lambdas are not picklable. The fix is a named top-level function. This prompt was designed to be the hardest in the suite and was expected to fail -- it tests a Python-level gotcha that documentation cannot easily prevent.
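The underlying rule is that pickle serializes functions by qualified name, so an anonymous lambda cannot round-trip while a named module-level function can. A self-contained demonstration (the worker below is a stand-in, not the p20 script's actual function):

```python
# Why p20 failed: multiprocessing ships work items between processes with
# pickle, and pickle stores functions by module-qualified name. A lambda
# has no resolvable name; a named top-level function does.
import pickle

def run_scenario(model_name, seed):
    """Named module-level stand-in for the per-model worker in p20."""
    return (model_name, seed)

# The lambda fails to pickle...
try:
    pickle.dumps(lambda f: f("sd"))
    lambda_picklable = True
except Exception:
    lambda_picklable = False

# ...while the named function pickles and restores cleanly.
restored = pickle.loads(pickle.dumps(run_scenario))

print(lambda_picklable)     # False
print(restored("abm", 42))  # ('abm', 42)
```

With a Pool, the same principle means passing the named function directly, e.g. pool.starmap(run_scenario, [("abm", 1), ("bwk", 2), ("cmp", 3)]) inside an `if __name__ == "__main__":` guard.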


Observations on Passing Scripts

Several results are worth highlighting positively:

p15 (R0 x MCV1 grid, 25 biweekly runs) produced an epidemiologically coherent heatmap showing increasing endemic burden with R0 and decreasing burden with MCV1 coverage, with the herd immunity threshold visible as a contour. This is non-trivial to get right and the output is scientifically interpretable.
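The contour p15 recovers has a simple analytic counterpart: with a perfect vaccine, the critical coverage is 1 - 1/R0. A quick sketch of where the threshold should fall on the grid (the exact axis spacing of the 5x5 grid is assumed, not read from the generated script):

```python
# Analytic herd-immunity contour for a p15-style grid: elimination is
# expected where MCV1 coverage exceeds 1 - 1/R0 (basic SIR theory, perfect
# vaccine assumed). Axis values are an assumed even spacing of the prompt's
# R0 4-20 and MCV1 0-90% ranges.
r0_values = [4, 8, 12, 16, 20]
mcv1_values = [0.0, 0.225, 0.45, 0.675, 0.90]

for r0 in r0_values:
    threshold = 1 - 1 / r0
    row = ["eliminated" if cov > threshold else "endemic" for cov in mcv1_values]
    print(f"R0={r0:2d} (threshold {threshold:.2f}):", row)
```

Note that for R0 >= 12 the threshold exceeds 90%, so even the highest-coverage column should remain endemic -- a cheap sanity check on the heatmap's right-hand edge.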

p17 (custom wavefront tracker) asked the MCP to write a completely novel tracking component from scratch -- recording the day of first infection per patch. It passed and produced a sensible spatial progression. This is the strongest positive result in the suite.

p14 (radiation mixing model) correctly generated and printed a mixing matrix for the radiation travel model, an alternative to the more common gravity model. All five patches' attack rates were reported correctly.


What We Did Not Test

  • Correctness validation: Pass/fail was determined by runtime exit code, not by comparison to reference values. p10 illustrates why this matters -- it passed the runtime test and failed the science test.
  • Stochastic stability: Each prompt was run once with a fixed seed. We did not assess whether results are stable across seeds.
  • Long runs: Most prompts used short or medium durations. Memory and performance at large scale were not tested.
  • All three models on the same scenario: p20 was designed to do this but failed. A working version of that comparison would be the most valuable single test case to add.
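One lightweight reference check that could close part of the correctness gap: in an unvaccinated closed population, the final attack rate should satisfy the classic SIR final-size relation z = 1 - exp(-R0 * z). A sketch of such a check (R0 = 15 is an assumed measles-like value, not something read from the generated scripts):

```python
# Sketch of an automated correctness check: compare a reported attack rate
# against the SIR final-size relation z = 1 - exp(-R0 * z), solved by
# fixed-point iteration. Assumes a closed, fully susceptible population.
import math

def final_size(r0: float, tol: float = 1e-12) -> float:
    """Solve z = 1 - exp(-r0 * z) for the epidemic final size z."""
    z = 0.99  # start near 1, appropriate for large r0
    for _ in range(1000):
        z_next = 1 - math.exp(-r0 * z)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z

expected = final_size(15.0)       # assumed measles-like R0
reported = 0.9996                 # attack rate printed by p05/p07
print(f"expected {expected:.4f}, reported {reported:.4f}")
assert abs(expected - reported) < 0.01  # would flag p10-style anomalies
```

A check like this would have caught p10 immediately: 87.5% attack rate in a patch with 0% coverage is far below any measles-like final size.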

Recommended Documentation Improvements

| Priority | Fix | Addresses |
|----------|-----|-----------|
| High | Add Gotchas entry: StateTrackerParams is not available in biweekly.components; document the biweekly per-patch tracking pattern explicitly | p06, p12, p16 (3 failures) |
| High | Add Gotchas entry: ABM with vital dynamics requires pre-allocated LaserFrame headroom -- specify the parameter and recommended multiplier | p19, and all ABM vital dynamics |
| Medium | Document CaseSurveillanceTracker.get_dataframe() column schema in the API reference | p09 |
| Medium | Add note to tutorials: use pl.col() expressions in .filter() calls, not boolean Series derived from the original frame | p02 |
| Low | Add a note that laser.measles models are not designed for multiprocessing.Pool; recommend sequential or thread-based approaches | p20 |

Setup

Date: 2026-03-20

MCP server: jenner-measles-mcp-prod (laser-measles 0.9.2)

Method: 20 prompts written in researcher language (no API class names), generated via generate_code, executed under python3.11 with the laser-measles venv.