Local-model bridge benchmark
Living benchmark for the models we drive dhis2w-mcp-bridge with. Re-run with make bench-bridge
(harness: packages/dhis2w-bench/src/dhis2w_bench/bridge.py). Companion to the discovery notes in
small-model-bridge.md.
This is axis 2 of model validation (driving the bridge: read + write). Axis 1 — general
capability (python / cli / tooling, no DHIS2) — is general-benchmark.md.
Validate one model across both axes at once with make bench-validate MODEL=<key>.
Choosing models
There is no hardcoded roster — you name the model(s) to benchmark. List what's installed with
make bench-list, then pass one or more keys via MODELS= (one model = a single run; several = a
side-by-side comparison, run one at a time). The harness skips and logs any named model that isn't
installed. Set BENCH_ORACLE=<key> to designate an oracle (see below).
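The skip-and-log behaviour described above can be sketched as follows. This is only an illustration, not the harness code: `select_models` is a hypothetical helper, and the set of installed keys is assumed to come from whatever backs make bench-list. The only detail taken from this page is that MODELS= holds whitespace-separated keys.

```python
import logging

log = logging.getLogger("bench")


def select_models(requested: str, installed: set) -> tuple:
    """Split a MODELS= string and partition it into runnable and skipped keys.

    Hypothetical helper: the real harness may resolve keys differently.
    """
    selected, skipped = [], []
    for key in requested.split():
        if key in installed:
            selected.append(key)
        else:
            # Named but not installed: skip it and leave a trace in the log.
            log.warning("skipping %s: not installed", key)
            skipped.append(key)
    return selected, skipped
```

Models run in the order they were named, which keeps a side-by-side comparison deterministic.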
Method
- Reads → play42 (DHIS2_MCP_READONLY=1); writes → local_basic (READONLY=0, the write round-trips minPasswordLength and restores it). Never the shared public demo.
- Each task runs through the real bridge (FastMCP client → dhis2_cli) with LM Studio's OpenAI-compatible API as the brain. Per task: pass/fail (heuristic), tool-call count, wall-clock, completion tokens. read tok/s is total read completion-tokens / total read wall-clock (approx).
- Tasks: count ("how many data elements" → 1037) · schema ("what fields does a data element have" → must call d2w schema and name real fields) · filter (ANC indicators) · write (set minPasswordLength to 10 with a command hint, then verify). Single representative runs; expect run-to-run variance, especially on the write (model nondeterminism).
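The read tok/s figure defined above is a ratio of two sums, not an average of per-task rates. A minimal sketch of that arithmetic, assuming a hypothetical `TaskResult` record whose field names are illustrative and may not match the harness:

```python
from dataclasses import dataclass

# The three read tasks named in the Method list; "write" is excluded.
READ_TASKS = {"count", "schema", "filter"}


@dataclass
class TaskResult:
    # Hypothetical record shape, one per task run.
    task: str
    passed: bool
    tool_calls: int
    wall_clock_s: float
    completion_tokens: int


def read_tok_per_s(results: list) -> float:
    """Total read completion tokens divided by total read wall-clock."""
    reads = [r for r in results if r.task in READ_TASKS]
    total_tokens = sum(r.completion_tokens for r in reads)
    total_seconds = sum(r.wall_clock_s for r in reads)
    # Guard against an empty or zero-duration run.
    return total_tokens / total_seconds if total_seconds > 0 else 0.0
```

Because wall-clock includes tool round-trips as well as generation, the result understates raw decode speed, which is why the column is marked approximate.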
Latest results (2026-06-17, play42 / local_basic)
| model | count | schema | filter | write | read tok/s |
|---|---|---|---|---|---|
| google/gemma-4-26b-a4b-qat (oracle) | PASS 8.9s | PASS 9.9s | PASS 9.0s | PASS 15.7s | ~21 |
| google/gemma-4-12b-qat | PASS 10.5s | PASS 14.9s | PASS 10.7s | PASS 20.3s | ~17 |
| google/gemma-4-e4b | PASS 10.4s | PASS 12.6s | PASS 13.5s | PASS 21.4s | ~30 |
| qwen/qwen3.5-4b | PASS 8.2s | PASS 9.1s | PASS 25.8s | PASS 10.1s | ~33 |
All four serious candidates pass every read + the write round (oracle clean); qwen3.5-4b writes
fastest. Note: this run also caught + fixed a stale write-command in the task itself
(dev customize set → system settings set) — see
benchmark-results.md for the full three-benchmark write-up.
The oracle (opt-in)
Set BENCH_ORACLE=<key> to designate one model in the run as the correctness reference — the
should-pass bar. The harness then asserts that model passed every task and prints a loud
SUSPECT TASK(S) banner if it didn't, because an oracle failure almost always means the task is
mis-specified, not the model. With no BENCH_ORACLE set there is no oracle check. The natural
choice is the strongest local model, google/gemma-4-26b-a4b-qat. This is a local, reproducible
oracle (no cloud agent in the loop).
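A rough sketch of the oracle check, assuming results are held as a hypothetical `{model: {task: passed}}` mapping; the harness's real data shape and banner formatting are not shown on this page, so only the SUSPECT TASK(S) wording is taken from it.

```python
from typing import Optional


def suspect_tasks(results: dict, oracle: Optional[str]) -> list:
    """Return the tasks the oracle failed; empty when no oracle is set."""
    if not oracle:
        return []  # no BENCH_ORACLE, so no oracle check
    # Assumption: an oracle that was not part of this run yields no check.
    per_task = results.get(oracle, {})
    return [task for task, passed in per_task.items() if not passed]


def report_oracle(results: dict, oracle: Optional[str]) -> list:
    """Print a loud banner when the oracle failed any task."""
    failed = suspect_tasks(results, oracle)
    if failed:
        print("=" * 60)
        print(f"SUSPECT TASK(S): {', '.join(failed)} (oracle: {oracle})")
        print("=" * 60)
    return failed
```

The banner names tasks, not the model, since an oracle failure points at the task specification first.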
Writes need local infra (and the bridge refuses public hosts)
The write round targets local_basic, so the harness preflights it with a cheap read and exits
loudly if the stack is down (start it with make dhis2-run); it never silently skips the write. Independently, the
bridge itself refuses any mutating command whose resolved host is a shared public DHIS2 demo
(play.*.dhis2.org, debug.dhis2.org) regardless of read-only mode — a structural backstop so no
harness can write to the public demo by mis-wiring a profile. Override with
DHIS2_MCP_PROTECTED_HOSTS (empty value disables it).
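One way such a host guard could look, as a sketch under stated assumptions: the two default patterns come from this page, but treating them as shell-style globs and parsing DHIS2_MCP_PROTECTED_HOSTS as a comma-separated list are guesses, and `is_protected` is a hypothetical name.

```python
import os
from fnmatch import fnmatch
from urllib.parse import urlparse

# Defaults taken from the text above; glob matching is an assumption.
DEFAULT_PROTECTED = ("play.*.dhis2.org", "debug.dhis2.org")


def protected_patterns(env=None) -> tuple:
    """Resolve the active patterns; an empty override disables the guard."""
    env = os.environ if env is None else env
    raw = env.get("DHIS2_MCP_PROTECTED_HOSTS")
    if raw is None:
        return DEFAULT_PROTECTED
    # Assumed format: comma-separated patterns.
    return tuple(p.strip() for p in raw.split(",") if p.strip())


def is_protected(base_url: str, env=None) -> bool:
    """True when the resolved host matches a protected pattern."""
    host = (urlparse(base_url).hostname or "").lower()
    return any(fnmatch(host, pat) for pat in protected_patterns(env))
```

A mutating command would be refused whenever `is_protected` returns True, independent of the read-only flag.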
Takeaways
- All four gemmas pass both axes. On reads + the (easy) write round they're equivalent on correctness; only speed separates them. So neither axis at full budget is a capability ranking for these four; it is a confirmation that all are viable bridge drivers.
- Caveat on the write column: it is an easy task. A single hinted setting plus a verify read; it measures execution + arg-formatting, not discovery or multi-object composition. "PASS write" is a low bar. The hard writes are the composite scenarios (make bench-composite).
- e4b is the efficiency surprise: fastest reads (~28 tok/s) and the fastest write (31.2s, in a single tool call) despite being the smallest (4B / 6.86 GB).
- bf16 gemma-4-12b is strictly dominated by its qat sibling: same correctness, ~2x slower write (112.8s vs 55.9s), nearly 2x the disk (12.84 vs 7.15 GB).
Prune decision (2026-06-16)
Validation across both axes (this page + general-benchmark.md) settles the
keep/delete question:
- Keep google/gemma-4-26b-a4b-qat (oracle), google/gemma-4-12b-qat (runner-up), google/gemma-4-e4b (fast, low-RAM).
- Delete google/gemma-4-12b (bf16): strictly dominated by 12b-qat on speed and size at identical correctness. The only reason it was kept was the qat-vs-bf16 head-to-head, which is now settled.
Because all four tie on correctness at full token budget, the general suite no longer separates
them — to do that, tighten BENCH_MAX_TOKENS (see general-benchmark.md).
Re-running
make bench-list # what's installed
make bench-bridge MODELS="google/gemma-4-12b-qat" # one model
make bench-bridge MODELS="gemma-4-12b-qat gemma-4-e4b" # compare several
BENCH_ORACLE=google/gemma-4-26b-a4b-qat make bench-bridge MODELS="..." # with an oracle
Needs the backend running and local_basic up for the write round (make dhis2-run). The harness
loads/unloads each model itself (one instance at a time, to avoid the ambiguous-model-id 400).
Per-model JSON is appended to /tmp/bench_bridge_results.jsonl.
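Since the results file is append-only, it accumulates records across runs. A small sketch for reading it back, assuming one JSON object per line and a hypothetical `model` key in each record (the actual record schema is not documented here):

```python
import json
from pathlib import Path


def load_results(path: str = "/tmp/bench_bridge_results.jsonl") -> list:
    """Read every JSON record from the append-only results file."""
    p = Path(path)
    if not p.exists():
        return []  # nothing benchmarked yet
    records = []
    with p.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records


def latest_per_model(records: list, key: str = "model") -> dict:
    """Keep the last record seen for each model (later lines win)."""
    latest = {}
    for rec in records:
        latest[rec.get(key)] = rec
    return latest
```

Taking the last record per model relies on append order matching run order, which holds as long as the file is never rewritten.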
The bench-matrix grid is NOT a capability ranking
The full metadata×roster grid (docs/notes/cli-matrix.md, 1230 cells) finished. The "found the right
command" rates: bf16-12b 12%, 12b-qat 10%, qwen3.5-4b 8%, 26b-a4b-qat 4%, qwen2.5-7b 4%,
e4b 2%. The oracle on read+write+perf (26b-a4b-qat) scored near the bottom — proof the grid
measures vague-goal disambiguation (pick the exact command among ~200 siblings from a one-line
goal), which is interpretation-noise-dominated, not capability. Use bench-bridge (read+write+perf)
to judge models; the matrix is a discoverability stress-test of the help surface, not a leaderboard.