Benchmark testing plan (1.0 readiness)

The goal: before 1.0, know that coding + mcp-bridge + full-mcp model testing is solid, across all installed models. Data sensitivity gates which stack a deployment uses (aggregate -> cloud + full MCP; PII -> local + bridge), so all three benchmarks must be in place and trustworthy.
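The sensitivity gate above can be written down as a small lookup. This is only an illustrative sketch: the `Sensitivity` enum, the `STACKS` table, and `stack_for` are hypothetical names, not part of the benchmark code; only the two mappings come from the plan.

```python
from enum import Enum


class Sensitivity(Enum):
    """Data classes named in the plan (enum itself is hypothetical)."""
    AGGREGATE = "aggregate"
    PII = "pii"


# Mapping taken from the plan: aggregate -> cloud + full MCP; PII -> local + bridge.
STACKS = {
    Sensitivity.AGGREGATE: ("cloud", "full-mcp"),
    Sensitivity.PII: ("local", "mcp-bridge"),
}


def stack_for(sensitivity: Sensitivity) -> tuple[str, str]:
    """Return (model location, tool surface) for a deployment's data class."""
    return STACKS[sensitivity]
```

Because either stack may be selected in production, both the bridge and the full-MCP benchmarks have to be trustworthy, along with the coding baseline.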

Models under test

The installed LLMs (make bench-list) — embeddings and roleplay models excluded:

google/gemma-4-26b-a4b-qat — the oracle (256K ctx, tool-trained)
google/gemma-4-12b-qat (256K ctx)
google/gemma-4-e4b (128K ctx)
qwen/qwen3.5-4b (256K ctx, tool-trained — strong tool format, weak discovery)

Add candidates (lms get ...) as they come up — notably the agentic-coder MoEs (Qwen3-Coder-30B-A3B, GLM-4.7-Flash). No roster edits needed; just pass MODELS=.

The three benchmarks

| # | Benchmark | Command | What it measures | Status |
| --- | --- | --- | --- | --- |
| 1 | coding | `make bench-general` | python (14) + cli (3) + multi-turn tooling (7), no DHIS2 | ready (extended; discriminates) |
| 2 | mcp-bridge | `make bench-bridge` (read+write), `bench-matrix` (discovery), `bench-composite` (hard writes) | single-tool `dhis2_cli`; the model must discover the ~200-command surface | ready |
| 3 | full mcp | `make bench-mcp` | the full dhis2-mcp server, all ~311 typed tools loaded up front | ready (loads at 128k; oracle passes) |
| 1c | cloud claude on coding | `make bench-claude-general` | the same coding suite (python + cli + tooling), driven by a cloud Claude model (one-shot code-gen + in-process SDK mock tools) | ready (ambient subscription auth) |
| 2c | cloud claude over the bridge | `make bench-claude-bridge` | the single `dhis2_cli` bridge, but driven by a cloud Claude model through the Agent SDK's native loop | ready (read+write+composite; ambient subscription auth) |
| 3c | cloud claude over full mcp | `make bench-claude-mcp` | the full server, but driven by a cloud Claude model through the Agent SDK's native loop (not the local OpenAI loop) | ready (read+write+composite; ambient subscription auth) |

The cloud lanes (2c, 3c) reuse the local tasks and scoring, so cloud-vs-local results are directly comparable. Each lane runs three rounds: a read round (play42, read-only gate); a single-setting write round (local_basic, restored), which is a smoke test, a plumbing/auth canary that ranks nothing because every model passes; and the hard composite authoring round (local_basic), the discriminator that actually separates capable from weak agents.

bench-mcp: to build (with a safety guard)

Mirrors bench-bridge but drives the full FastMCP server (uv run dhis2w-mcp) instead of the single tool. Two design points:

Write safety (critical): the full server has no readonly mode, and its typed write tools do NOT pass through the bridge's host-guard. So the read round on play42 must expose only read-verb tools (*_get/*_list/*_count/*_find/*_search/system_* info), never write tools; otherwise a stray call could mutate the public demo. Writes run against local_basic only.
Scale (measured): read-only tools = 119 / ~16k tokens; all tools = 311 / ~49k tokens. Either set overflows LM Studio's default 8192-token load context (HTTP 400), so bench-mcp loads each model at BENCH_CONTEXT (default 128k) via ModelBackend.load(model, context). Finding so far: the oracle (26B MoE, 256k-capable) passes all reads + the write at 128k, so a strong local model with a big context CAN drive full MCP. Whether smaller models can is what the sweep settles. Context is now a test dimension (vary BENCH_CONTEXT).
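The two design points can be sketched as a name filter plus a budget check. This is a minimal sketch under assumptions, not the bench-mcp implementation: `read_only_tools` and `fits_context` are hypothetical helpers, the sample tool names in the usage note are invented, and treating every `system_*` name as a read-only info tool is an assumption that should be tightened if the server exposes `system_*` writes.

```python
from fnmatch import fnmatchcase

# Read-verb patterns copied from the write-safety point.
# ASSUMPTION: "system_*" only matches info tools on the real server.
READ_PATTERNS = ("*_get", "*_list", "*_count", "*_find", "*_search", "system_*")


def read_only_tools(tool_names):
    """Allow-list filter: keep a tool only if its name matches a read pattern."""
    return [
        name
        for name in tool_names
        if any(fnmatchcase(name, pattern) for pattern in READ_PATTERNS)
    ]


def fits_context(tool_tokens: int, context: int, headroom: int = 0) -> bool:
    """True if the tool schemas plus reserved headroom fit the load context."""
    return tool_tokens + headroom <= context
```

With hypothetical names, `read_only_tools(["data_element_list", "data_element_create", "system_info"])` keeps the first and last entries and drops the create tool. The filter is an allow-list on purpose: an unrecognized tool name is hidden rather than exposed, which is the safe default against a public demo. For the measured sizes, `fits_context(16_000, 8192)` and `fits_context(49_000, 8192)` are both false, which matches the HTTP 400 at the default load context.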

Sequence (runs are inherently serial: one model loaded at a time)

Coding across all models — in progress. Oracle 62/62 (clean); qwen3.5-4b 54/58.
mcp-bridge across all models (read + write) — next (start with the bridge).
Build bench-mcp (safe), then full mcp across all models.

Each run sets BENCH_ORACLE=google/gemma-4-26b-a4b-qat so the oracle flags any mis-specified task (an oracle failure = fix the task, not the model).
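The oracle rule can be expressed as a per-task triage. The sketch below is an assumption about how results might be shaped (task id mapped to per-model pass/fail); `triage` and the verdict strings are hypothetical, and the real harness may record results differently.

```python
ORACLE = "google/gemma-4-26b-a4b-qat"


def triage(task_results: dict[str, dict[str, bool]], oracle: str = ORACLE) -> dict[str, str]:
    """Classify each task given {task id: {model name: passed}}.

    An oracle failure points at the task, not at the other models.
    """
    verdicts = {}
    for task, by_model in task_results.items():
        if oracle not in by_model:
            verdicts[task] = "no-oracle-run"   # cannot judge the task yet
        elif not by_model[oracle]:
            verdicts[task] = "fix-task"        # oracle failed: suspect the task spec
        elif all(by_model.values()):
            verdicts[task] = "ok"
        else:
            verdicts[task] = "model-failure"   # oracle passed, another model did not
    return verdicts
```

Checking the oracle first means a weaker model's failure on a task the oracle also fails is never counted against that model until the task itself has been reviewed.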

Safety (all benchmarks)

Reads -> play42 (DHIS2_MCP_READONLY=1 on the bridge; read-tool-only filter on full mcp); writes -> local_basic (self-restoring). The bridge host-guard refuses writes to public hosts structurally. make dhis2-run must be up for any write round.
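A structural write guard of the kind described here might look like the following. This is a sketch only: how the real host-guard decides that a host is public is not stated in this plan, so the loopback-only rule, the function names, and the example URLs in the usage note are all assumptions.

```python
import ipaddress
from urllib.parse import urlsplit


def is_local_host(base_url: str) -> bool:
    """ASSUMPTION: 'local' means localhost or a loopback IP address."""
    host = urlsplit(base_url).hostname
    if host is None:
        return False  # no parsable host: fail closed
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # any other DNS name is treated as public


def check_write_allowed(base_url: str) -> None:
    """Raise before a write is attempted against a non-local host."""
    if not is_local_host(base_url):
        raise PermissionError(f"refusing write to non-local host: {base_url}")
```

Under these assumptions, `check_write_allowed("http://localhost:8080")` returns silently, while a hypothetical public URL such as `https://play.example.org` raises `PermissionError`. The check fails closed, so a malformed or unrecognized URL is refused rather than trusted.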

Output

A per-model x per-benchmark scorecard (correctness + timing + tok/s). "All testing is still good" = the oracle is clean on every benchmark and the table reproduces across re-runs.
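The "still good" criterion can be checked mechanically across re-runs. The sketch below assumes a scorecard keyed by (model, benchmark) with (passed, total) counts; that shape and the helper names are hypothetical. It compares correctness only, since timing and tok/s are expected to vary between runs.

```python
# (model, benchmark) -> (passed, total); the shape is an assumption.
Scorecard = dict[tuple[str, str], tuple[int, int]]


def oracle_clean(card: Scorecard, oracle: str) -> bool:
    """True if the oracle has rows in the card and passes every task in each."""
    rows = [counts for (model, _), counts in card.items() if model == oracle]
    return bool(rows) and all(passed == total for passed, total in rows)


def all_still_good(runs: list[Scorecard], oracle: str) -> bool:
    """Oracle clean in every run, and correctness counts identical across runs."""
    if len(runs) < 2:
        return False  # reproducibility needs at least one re-run
    if not all(oracle_clean(card, oracle) for card in runs):
        return False
    return all(card == runs[0] for card in runs[1:])
```

Requiring at least two runs keeps a single lucky pass from being reported as a reproducible result.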