Benchmark testing plan (1.0 readiness)
The goal: before 1.0, know that coding + mcp-bridge + full-mcp model testing is solid, across all installed models. Data sensitivity gates which stack a deployment uses (aggregate -> cloud + full MCP; PII -> local + bridge), so all three benchmarks must be in place and trustworthy.
Models under test
The installed LLMs (make bench-list) — embeddings and roleplay models excluded:
- `google/gemma-4-26b-a4b-qat`: the oracle (256K ctx, tool-trained)
- `google/gemma-4-12b-qat` (256K ctx)
- `google/gemma-4-e4b` (128K ctx)
- `qwen/qwen3.5-4b` (256K ctx, tool-trained; strong tool format, weak discovery)
Add candidates (lms get ...) as they come up — notably the agentic-coder MoEs
(Qwen3-Coder-30B-A3B, GLM-4.7-Flash). No roster edits needed; just pass MODELS=.
The three benchmarks
| # | Benchmark | Command | What it measures | Status |
|---|---|---|---|---|
| 1 | coding | `make bench-general` | python (14) + cli (3) + multi-turn tooling (7), no DHIS2 | ready (extended; discriminates) |
| 2 | mcp-bridge | `make bench-bridge` (read+write), `bench-matrix` (discovery), `bench-composite` (hard writes) | single-tool `dhis2_cli`; the model must discover the ~200-command surface | ready |
| 3 | full mcp | `make bench-mcp` | the full dhis2-mcp server, all ~311 typed tools loaded up front | ready (loads at 128k; oracle passes) |
| 1c | cloud claude on coding | `make bench-claude-general` | the same coding suite (python + cli + tooling), driven by a cloud Claude model (one-shot code-gen + in-process SDK mock tools) | ready (ambient subscription auth) |
| 2c | cloud claude over the bridge | `make bench-claude-bridge` | the single `dhis2_cli` bridge, but driven by a cloud Claude model through the Agent SDK's native loop | ready (read+write+composite; ambient subscription auth) |
| 3c | cloud claude over full mcp | `make bench-claude-mcp` | the full server, but driven by a cloud Claude model through the Agent SDK's native loop (not the local OpenAI loop) | ready (read+write+composite; ambient subscription auth) |
The cloud lanes (2c, 3c) reuse the local tasks + scoring, so cloud-vs-local is directly comparable. They run three rounds:
- a read round (`play42`, read-only gate);
- a single-setting write round (`local_basic`, restored). This is a smoke test: a plumbing/auth canary that ranks nothing because everyone passes;
- the hard composite authoring round (`local_basic`), the discriminator that actually separates capable from weak agents.
bench-mcp: to build (with a safety guard)
Mirrors bench-bridge but drives the full FastMCP server (uv run dhis2w-mcp) instead of the single
tool. Two design points:
- Write safety (critical): the full server has no readonly mode, and its typed write tools do
NOT pass through the bridge's host-guard. So the read round on `play42` must expose only
read-verb tools (`*_get`/`*_list`/`*_count`/`*_find`/`*_search`/`system_*info`), never write
tools; otherwise a stray call could mutate the public demo. Writes run against `local_basic` only.
- Scale (measured): read-only tools = 119 / ~16k tokens; all = 311 / ~49k tokens. Either set
overflows LM Studio's default 8192 load context (HTTP 400), so `bench-mcp` loads each model at
`BENCH_CONTEXT` (default 128k) via `ModelBackend.load(model, context)`. Finding so far: the oracle
(26B MoE, 256k-capable) passes all reads + the write at 128k, so a strong local model with a big
context CAN drive full MCP. Whether smaller models can is what the sweep settles. Context is now a
test dimension (vary `BENCH_CONTEXT`).
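The two design points above can be sketched as an allowlist filter plus a context-budget check. This is a minimal illustration, not the project's code: the read-verb patterns and token counts come from the plan, while the function names, the example tool names, and the `reserve` headroom are assumptions made for the sketch.

```python
from fnmatch import fnmatchcase

# Read-verb patterns as listed in the plan. Anything that matches none of
# them is treated as a write tool and hidden from the play42 read round.
READ_PATTERNS = ("*_get", "*_list", "*_count", "*_find", "*_search", "system_*info")


def is_read_tool(name: str) -> bool:
    """Return True only if the tool name matches an allowed read-verb pattern."""
    return any(fnmatchcase(name, pattern) for pattern in READ_PATTERNS)


def read_only_subset(tool_names: list[str]) -> list[str]:
    """Allowlist filter: unknown or write-like names are dropped, never kept."""
    return [name for name in tool_names if is_read_tool(name)]


def fits_context(schema_tokens: int, context: int, reserve: int = 4096) -> bool:
    """Check that the tool schemas plus some headroom fit the load context.

    `reserve` is a hypothetical allowance for the prompt and the reply.
    """
    return schema_tokens + reserve <= context


if __name__ == "__main__":
    # Hypothetical tool names, for illustration only.
    tools = ["data_element_list", "data_element_create", "system_info"]
    print(read_only_subset(tools))
    print(fits_context(49_000, 8_192), fits_context(49_000, 128_000))
```

An allowlist is the safer default here: a newly added tool with an unrecognised verb is hidden from the read round until someone adds a pattern for it, rather than being exposed by accident.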
Sequence (runs are inherently serial: one model loaded at a time)
- Coding across all models — in progress. Oracle 62/62 (clean); qwen3.5-4b 54/58.
- mcp-bridge across all models (read + write) — next (start with the bridge).
- Build bench-mcp (safe), then full mcp across all models.
Each run sets BENCH_ORACLE=google/gemma-4-26b-a4b-qat so the oracle flags any mis-specified task
(an oracle failure = fix the task, not the model).
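The oracle rule can be expressed as a small triage step over per-task results. The sketch below is illustrative only: the `TaskResult` record and the `triage` function are hypothetical names, not part of the benchmark harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    model: str
    task: str
    passed: bool


def triage(results: list[TaskResult], oracle: str) -> dict[str, list]:
    """Split failures into suspect tasks and genuine model failures.

    A task the oracle fails is assumed to be mis-specified, so failures of
    other models on that task are not counted against them.
    """
    suspect_tasks = sorted(
        {r.task for r in results if r.model == oracle and not r.passed}
    )
    model_failures = [
        (r.model, r.task)
        for r in results
        if not r.passed and r.model != oracle and r.task not in suspect_tasks
    ]
    return {"fix_task": suspect_tasks, "model_failures": model_failures}
```

Tasks listed under `fix_task` would be repaired and re-run before any model is ranked on them.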
Safety (all benchmarks)
Reads -> play42 (DHIS2_MCP_READONLY=1 on the bridge; read-tool-only filter on full mcp); writes ->
local_basic (self-restoring). The bridge host-guard refuses writes to public hosts structurally.
make dhis2-run must be up for any write round.
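The routing rule above amounts to a fixed round-to-profile map plus a refusal check. The following is a rough sketch under stated assumptions: the profile names come from the plan, but the `PUBLIC_PROFILES` set, the round labels, and every function name are hypothetical, and the real host-guard works on hosts rather than profile names.

```python
# Round kinds mapped to the profiles named in the plan.
PROFILE_FOR_ROUND = {
    "read": "play42",
    "write": "local_basic",
    "composite": "local_basic",
}

# Assumption for this sketch: play42 is the only public instance.
PUBLIC_PROFILES = frozenset({"play42"})


class WriteRefused(RuntimeError):
    """Raised when a mutating round is pointed at a public instance."""


def profile_for(round_kind: str) -> str:
    """Look up the profile a round must run against."""
    try:
        return PROFILE_FOR_ROUND[round_kind]
    except KeyError:
        raise ValueError(f"unknown round kind: {round_kind!r}") from None


def check_round(round_kind: str, profile: str) -> None:
    """Refuse any non-read round aimed at a public profile."""
    if round_kind != "read" and profile in PUBLIC_PROFILES:
        raise WriteRefused(f"{round_kind} round must not target {profile}")
```

Checking the pairing at the point where a round starts means a misconfigured target fails before any tool call is issued.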
Output
A per-model x per-benchmark scorecard (correctness + timing + tok/s). "All testing is still good" = the oracle is clean on every benchmark and the table reproduces across re-runs.
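The acceptance criterion can be written down as a check over repeated scorecards. This is a sketch, assuming a scorecard is reduced to pass counts keyed by model and benchmark; the `Scorecard` alias and both function names are hypothetical. Timing and tok/s are deliberately left out of the comparison, since they vary between runs even when correctness is stable.

```python
# (model, benchmark) -> (tasks passed, tasks total)
Scorecard = dict[tuple[str, str], tuple[int, int]]


def oracle_is_clean(card: Scorecard, oracle: str, benchmarks: list[str]) -> bool:
    """True if the oracle has a full score on every listed benchmark."""
    for benchmark in benchmarks:
        passed, total = card.get((oracle, benchmark), (0, 0))
        if total == 0 or passed != total:
            return False  # missing, empty, or imperfect oracle result
    return True


def still_good(runs: list[Scorecard], oracle: str, benchmarks: list[str]) -> bool:
    """Oracle clean in every run, and correctness identical across re-runs."""
    if len(runs) < 2:
        return False  # reproducibility needs at least one re-run
    clean = all(oracle_is_clean(run, oracle, benchmarks) for run in runs)
    stable = all(run == runs[0] for run in runs[1:])
    return clean and stable
```

Requiring at least two runs keeps a single lucky pass from being reported as reproducible.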