Benchmark results — 2026-06-17

A full sweep of every installed local LLM across all three benchmarks (coding, mcp-bridge, full mcp), run overnight with google/gemma-4-26b-a4b-qat as the oracle. The oracle was clean on all three (it passed every task), so the tasks are well-formed and the weaker-model numbers are trustworthy.

Reproduce: make bench-general, make bench-bridge, make bench-mcp (each MODELS="..."); the raw tables are in /tmp/sweep_{coding,bridge,mcp}.out.

TL;DR

  • All four serious candidates — gemma-4-26b-a4b-qat, gemma-4-12b-qat, gemma-4-e4b, qwen3.5-4b — passed all three benchmarks, including the full MCP server (~311 tools) at 128k context. The assumption that "only cloud frontier models can drive full MCP" does not hold for these read+write tasks: capable local models handle it, given enough context.
  • gemma-4-e4b (4B, 6.9 GB) wins reads, easy writes, and speed: it is near-perfect on every cheap task and has the fastest full-MCP write (54.1s) and coding sweep (346s). But it cannot do the hard composite writes (0/3; see below).
  • Easy writes don't rank models; the composite (hard) writes do. Over 3 runs, the two strong models (26b-a4b-qat, 12b-qat) author reliably-ish (⅔ dataset, 3/3 program) while e4b and qwen3.5-4b cannot (0/3 dataset). So for authoring (the PII-write job), the 26B is the pick (it ties the 12B on composites and is faster + more general); the 12B is the half-the-RAM equivalent.
  • Authoring is flaky (~⅔) even for the strong models — a real 1.0 caveat; a local PII-author needs retries + verification. (Single composite runs are misleading — always run N times.)
  • Long-context retrieval is not size-bound: the 4B qwen3.5-4b and the 26B oracle both retrieve a planted fact cleanly at 100k tokens; the 12B is too slow past 16k and e4b caps at its 128k max (effective 64k). See the long-context section.
  • Two real bugs were caught by the oracle and fixed mid-run (see Findings): a stale bridge write-command, and the full-MCP context-window requirement. A third (the composite oracle itself) was caught when its deterministic run failed.
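
The "~⅔, so retry" caveat can be made concrete with a little arithmetic. The sketch below assumes attempts are independent and each succeeds with the observed rate of about 2/3; both are idealisations, since three runs is a small sample. The helper names are hypothetical and not part of the harness.

```python
import math


def success_within(p_success: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p_success) ** n


def attempts_needed(p_success: float, target: float) -> int:
    """Smallest n with 1 - (1 - p_success)**n >= target.

    Assumes independent attempts with a fixed per-attempt success rate,
    which idealises the flaky authoring runs.
    """
    if not 0.0 < p_success < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p_success and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_success))


if __name__ == "__main__":
    p = 2 / 3  # observed dataset-composite rate for the strong local models
    for n in (1, 2, 3, 5):
        print(n, round(success_within(p, n), 3))
    print("attempts for 99%:", attempts_needed(p, 0.99))
```

Under these assumptions, three attempts reach roughly 96% and five reach roughly 99.6%, which is why a retry budget only helps when each attempt is also verified.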

Coding (bench-general) — 62 objective cases (python 52, cli 3, tooling 7)

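| model | python | cli | tooling | total | time | tok/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma-4-26b-a4b-qat (oracle) | 52/52 | 3/3 | 7/7 | 62/62 | 612s | 60 |
| gemma-4-12b-qat | 52/52 | 3/3 | 7/7 | 62/62 | 936s | 35 |
| gemma-4-e4b | 48/49 | 3/3 | 7/7 | 58/59 | 346s | 52 |
| qwen3.5-4b | 44/46 | 2/3 | 7/7 | 53/56 | 471s | 85 |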
  • The two qats are perfect; e4b drops one python case; qwen3.5-4b drops a couple of python cases (the LRU-cache class) plus a wc command.
  • All four pass 7/7 tooling — including the four multi-turn agentic chains (look-up-then-email, read-then-count, fetch-rate-then-calc, look-up-then-ticket). So a 4B model (qwen3.5-4b, e4b) handles multi-turn tool chaining fine; size is not the bottleneck there.
  • Cloud Claude (bench-claude-general) scores a perfect 62/62 (python 52/52, cli 3/3, tooling 7/7; ~$1.58, session-default) — matching the local oracle (gemma-4-26b-a4b-qat, also 62/62). So coding is not where cloud's edge shows: the strong models, local or cloud, all ace it; the 4B locals drop a few. Cloud's real advantage is the composite authoring round (below), where the local 4B models fail — not the coding suite. (The tooling suite runs Claude over an in-process SDK mock toolbox, scored by the same full-transcript checker as the local lane.)

mcp-bridge (bench-bridge) — single-tool discovery, read + write

| model | count | schema | filter | write | read tok/s |
| --- | --- | --- | --- | --- | --- |
| gemma-4-26b-a4b-qat | PASS | PASS | PASS | PASS 15.7s | ~21 |
| gemma-4-12b-qat | PASS | PASS | PASS | PASS 20.3s | ~17 |
| gemma-4-e4b | PASS | PASS | PASS | PASS 21.4s | ~30 |
| qwen3.5-4b | PASS | PASS | PASS | PASS 10.1s | ~33 |
  • All four candidates pass every read and the write (after the write-command fix below). Notably qwen3.5-4b passes the write fastest (10.1s) — earlier in the session it "couldn't find" the command, but that was the stale-command bug, not the model.

Full MCP (bench-mcp) — the whole dhis2-mcp server, read-only-filtered reads, 128k context

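| model | count | filter | whoami | write | tools | read tok/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma-4-26b-a4b-qat | PASS | PASS | PASS | PASS 80.3s | 119 | ~21 |
| gemma-4-12b-qat | PASS | PASS | PASS | PASS 222.1s | 119 | ~7 |
| gemma-4-e4b | PASS | PASS | PASS | PASS 54.1s | 119 | ~33 |
| qwen3.5-4b | PASS | PASS | PASS | PASS 122.8s | 119 | ~13 |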
  • All four candidates drive the full MCP server end-to-end — selecting the right tool among 119 read tools (and ~311 on the write round) and completing a write. This is the surprising result of the night: full MCP is not exclusively a cloud-frontier capability.
  • It is much slower than the bridge (e.g. oracle MCP write 80s vs bridge 16s) and needs a big load context — so the bridge is still the right default for PII/local, but full MCP is viable.

Composite writes (bench-composite) — the HARD, multi-object authoring test

The single-setting write round above is not a benchmark — it's a smoke test. Everyone passes it, so it ranks nothing. Its job is diagnostic: it confirms the write path works (the agent can call a write tool, the profile has write authority, the round-trip read confirms it landed), and it decomposes a composite failure into "can't write at all" vs "can write but can't orchestrate". (It earned that keep once — when the task hinted a removed command, every model failed including the oracle, which fired the SUSPECT-task flag and caught benchmark drift; see Findings.) Read the single-write PASS as a sanity light, not a result.

The real write test is a multi-object authoring workflow driven through the bridge: "create a Monthly data set with two new data elements attached", and "create an event program with two stages". This is the round that actually ranks models. Each run is verified (find the named object, count its children) and cleaned up. Run 3× per model, because a single run is misleading (see below):

| model | dataset + elements (3 runs) | program + stages (3 runs) |
| --- | --- | --- |
| gemma-4-26b-a4b-qat | 2/3 | 3/3 |
| gemma-4-12b-qat | 2/3 | 3/3 |
| gemma-4-e4b | 0/3 | |
| qwen3.5-4b | 0/3 | |

Three things this says — and a methodology warning:

  • Composite authoring genuinely separates the models (unlike easy writes, where all four pass): the two strong models author reliably-ish; e4b and qwen3.5-4b cannot (0/3 on the dataset).
  • 26b-a4b-qat and 12b-qat are a tie — both ⅔ dataset, 3/3 program. So the 26B is the better pick for authoring too (it's faster + more general); the 12B is the half-the-RAM equivalent. (An earlier single-run version of this table showed the oracle failing and the 12B winning — that was noise; the 3× run corrects it.)
  • The dataset composite is flaky even for the strong models (~⅔). Multi-object authoring is doable but not yet dependable on local models — a real 1.0 caveat: a local PII-authoring agent needs retries + verification (which this harness models).
  • Methodology: single-run composite results are not trustworthy — always run the hard-write dimension N times.
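
The "retries + verification" requirement can be sketched as a small loop. This is an illustration under stated assumptions, not the harness's code: `create`, `count_children`, and `cleanup` are hypothetical callables standing in for "ask the agent to author the object", "find the named object and count its children", and "delete it again".

```python
from typing import Callable, Optional


def author_with_retry(
    create: Callable[[], str],
    count_children: Callable[[str], Optional[int]],
    cleanup: Callable[[str], None],
    expected_children: int,
    max_attempts: int = 3,
) -> Optional[str]:
    """Run create() until verification passes or attempts run out.

    create() returns the name of the object the agent claims to have made.
    count_children(name) returns how many children are attached, or None
    when the object cannot be found at all.
    """
    for _ in range(max_attempts):
        name = create()
        found = count_children(name)
        if found == expected_children:
            return name      # verified: object exists with the right children
        if found is not None:
            cleanup(name)    # partial object: remove it so the retry starts clean
    return None              # caller decides how to report the failure
```

Returning `None` instead of raising keeps the decision about a failed authoring job with the caller, which can log it, escalate, or fall back to another lane.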

Cloud Claude vs local (bench-claude-mcp / bench-claude-bridge)

Since we have Claude access (most deployments won't), we ran a cloud Claude model (session default = Opus) over both surfaces through the Claude Agent SDK's native loop — reusing the same tasks + scoring, so it's directly comparable to the local rows. Auth is ambient (the logged-in Claude Code subscription); the read round is read-only-gated on play42, write/composite run on local_basic.

| surface | read | write | dataset+elements | program+stages | cost |
| --- | --- | --- | --- | --- | --- |
| bridge (dhis2_cli) | 3/3 | PASS | 1/1 | 1/1 | $1.81 |
| full mcp (typed tools) | 3/3 | PASS | 1/1 | 1/1 | $1.56 |

The column that matters here is composite — reads and the single write are the easy floor (every serious agent passes them, cloud or local). Cloud Claude cleared both composite authoring scenarios first try (RUNS=1), which is the round where local 4B models score 0/3 and the strong local qats only manage ~⅔. So the headline isn't "100%", it's "Claude reliably authors multi-object metadata where local models are flaky". (The local composite bench runs N times precisely because local models are flaky there.)

  • The composite (hard authoring) round is where this matters, and cloud Claude clears it cleanly. Over the bridge, Claude built the Monthly data set with two new data elements and the event program with two stages — both verified 2/2 children — first try. That is exactly the round where the local 4B models score 0/3 and even the strong local qats only manage ~⅔ (flaky). So the local/cloud gap is real and lands precisely on multi-object authoring, not on reads or easy writes (where local models already pass).
  • Cost is the trade: about $1.81 of subscription budget for the whole bridge suite (read + write + 2 composites), versus free-but-flaky local. This is the quantified case for the data-sensitivity split: aggregate/non-PII -> cloud (reliable authoring); PII -> local + bridge (free, private, needs retry+verify on composites).
  • Methodology mirrors the local composite bench: each scenario is verified (find the named object, count its children) and cleaned up; the write round restores the baseline. Repeat the flaky round with RUNS=3.

Router lane (bench-router) — local models over search+dispatch at tiny context

The router (dhis2w-mcp-router) fronts the full 311-tool surface behind two meta-tools (search_tools + call_tool), so a local model can drive it loaded at 16k context — where the full-MCP payload (~49k tokens) won't even fit. Read suite, play42, router read-only:

| model | context | count | filter | whoami | passed |
| --- | --- | --- | --- | --- | --- |
| gemma-4-26b-a4b-qat | 16K | PASS | PASS | PASS | 3/3 |
| gemma-4-e4b | 16K | FAIL | FAIL | FAIL | 0/3 |
| qwen3.5-4b | 16K | PASS | PASS | PASS | 3/3 |
  • The router works — two small models drive the full surface at 16k, which they cannot do over full-MCP (the payload overflows that context). gemma-4-26b-a4b-qat and the 4B qwen3.5-4b both go 3/3; the win is real and matches the cloud-ToolSearch behaviour, portably.
  • But the indirection has a capability floor. gemma-4-e4b scores 0/3, one turn each — it answers without ever calling search_tools. It can drive direct typed tools (it passes the full-MCP read round), but the two-step search→dispatch indirection is a step too far for the weakest model. So the router is not a free win for every local model — it asks the model to grasp "search, then call," and the floor sits above e4b.
  • Implication for "router as the universal default" (see roadmap): it's the right default for capable small models and growing surfaces, but the weakest models still want direct tools. Feeding a failed call_tool back as model feedback matters — gemma recovered a bad metadata_list call (filter: 8 turns) instead of dead-ending, which is why the harness surfaces tool errors rather than raising.
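
The search-then-dispatch pattern, and the choice to hand tool errors back to the model, can be sketched as follows. Only the meta-tool names `search_tools` and `call_tool` come from the text above; the class, the keyword-matching search, and the result shape are assumptions made for illustration and are not the dhis2w-mcp-router implementation.

```python
from typing import Any, Callable, Dict, List


class ToolRouter:
    """Two meta-tools in front of a larger registry (hypothetical sketch)."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self._docs: Dict[str, str] = {}

    def register(self, name: str, description: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn
        self._docs[name] = description

    def search_tools(self, query: str, limit: int = 5) -> List[Dict[str, str]]:
        """Return tools whose name or description contains every query word."""
        words = query.lower().split()
        hits = [
            {"name": name, "description": doc}
            for name, doc in sorted(self._docs.items())
            if all(w in f"{name} {doc}".lower() for w in words)
        ]
        return hits[:limit]

    def call_tool(self, name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
        """Dispatch by name; report failures as data so the model can retry."""
        fn = self._tools.get(name)
        if fn is None:
            return {"ok": False, "error": f"unknown tool: {name}"}
        try:
            return {"ok": True, "result": fn(**arguments)}
        except Exception as exc:  # surfaced to the model, not raised
            return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

Because the model only ever sees the two meta-tool schemas, the prompt cost stays flat as the registry grows; the price is that the model must decide to search before it calls.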

Context-window dimension (full MCP, gemma-4-e4b)

The full-MCP payload is the gate, so loaded context decides what works (read tools ~16k tokens, the write round's full toolset ~49k):

| loaded context | reads (119 tools) | write (311 tools) |
| --- | --- | --- |
| 8k (LM Studio default) | fail (HTTP 400: payload > context) | fail |
| 32k | pass | fail (loops; 49k payload doesn't fit) |
| 64k | pass* | pass |
| 128k | pass | pass |

So full-MCP reads need roughly ≥32k context and writes need ≥64k; 128k is comfortably safe (the default BENCH_CONTEXT). (*the 64k read FAILs in one run were small-model answer-variance, not a context effect — they pass at 32k and 128k.) Vary BENCH_CONTEXT to test other models/levels.
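
The thresholds above follow from a simple fit check: the tool payload plus some working room must fit in the loaded context. In the sketch below the 4,000-token reserve is a hypothetical figure chosen for illustration, and the payload sizes are the approximate ones quoted above. Fitting is a necessary condition only; it does not guarantee the model answers correctly.

```python
from typing import Optional, Sequence

CONTEXT_LEVELS = (8_192, 32_768, 65_536, 131_072)  # 8k, 32k, 64k, 128k


def fits(payload_tokens: int, context_tokens: int, reserve_tokens: int = 4_000) -> bool:
    """True when the tool payload plus a reserve for prompt and answer fits."""
    return payload_tokens + reserve_tokens <= context_tokens


def min_context(
    payload_tokens: int,
    levels: Sequence[int] = CONTEXT_LEVELS,
    reserve_tokens: int = 4_000,
) -> Optional[int]:
    """Smallest listed context level that fits the payload, or None."""
    for level in sorted(levels):
        if fits(payload_tokens, level, reserve_tokens):
            return level
    return None


if __name__ == "__main__":
    print("reads  (~16k payload):", min_context(16_000))
    print("writes (~49k payload):", min_context(49_000))
```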

Long-context retrieval (bench-longcontext) — effective context

Needle-in-a-haystack: plant one fact in a filler log at increasing lengths, ask for it, score exact retrieval. Each model loads at min(BENCH_CONTEXT, its own max), 256k target. This measures the effective context — what a model can actually use, vs its advertised max.

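| model | 2k | 16k | 64k | 100k | effective |
| --- | --- | --- | --- | --- | --- |
| gemma-4-26b-a4b-qat (oracle) | PASS | PASS | PASS | PASS* | 100k |
| gemma-4-12b-qat | PASS | PASS | timeout | timeout | 16k |
| gemma-4-e4b | PASS | PASS | PASS | n/a (128k cap) | 64k |
| qwen3.5-4b | PASS | PASS | PASS | PASS | 100k |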
  • qwen3.5-4b and the 26B oracle both retrieve cleanly at 100k — the two long-context leaders, and qwen (a 4B) does it fastest. Long-context retrieval is not size-bound.
  • e4b genuinely caps at 128k (its model max), so 100k can't fit alongside the answer — reported as a skip, not a failure. Its real ceiling here is 64k.
  • gemma-4-12b-qat is too slow past 16k — 64k/100k hit the 600s timeout (not a retrieval miss, a latency wall). Effective 16k in practice on this hardware.
  • *Methodology — a transient cold-allocation 400. In the batched sweep the 26B's 100k first returned an HTTP 400 (not a context-overflow message); an isolated retry retrieved correctly in ~15s (warm KV cache). So a large cold prompt can 400 on first allocation then succeed — bench-longcontext now retries once on a fast non-200 (but never on a timeout, which means the model is simply slow). The 26B's true effective context is 100k, not the 64k the raw sweep first showed.
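
The retry rule in the last bullet (once on a fast non-200, never on a timeout) can be sketched like this. The `(status, body)` tuple, the `send` callable, and the 5-second "fast" threshold are assumptions for illustration; the text does not state what bench-longcontext uses.

```python
import time
from typing import Callable, Optional, Tuple

# (status, body); a status of None means the request timed out.
Response = Tuple[Optional[int], str]


def ask_with_cold_retry(
    send: Callable[[], Response],
    fast_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
) -> Response:
    """Send once; retry exactly once if the failure was a fast non-200."""
    started = clock()
    status, body = send()
    elapsed = clock() - started
    if status is None:
        return status, body   # timeout: the model is slow, retrying will not help
    if status != 200 and elapsed < fast_seconds:
        return send()         # fast failure: likely a cold allocation, try again
    return status, body
```

Passing the clock in as a parameter keeps the policy testable without real waiting.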

Token-budget dimension (coding at BENCH_MAX_TOKENS=2048)

Tightening the generation budget from 16384 to 2048 inverts the ranking:

| model | full (16384) | tight (2048) | behaviour |
| --- | --- | --- | --- |
| gemma-4-26b-a4b-qat | 62/62 | 47/51 | collapses |
| gemma-4-12b-qat | 62/62 | 42/48 | collapses |
| gemma-4-e4b | 58/59 | 54/56 | holds |
| qwen3.5-4b | 53/56 | 52/56 | holds (140s, fastest) |

The reasoning models (oracle, 12b-qat) spend the budget on chain-of-thought and truncate the actual code on the hard tasks (lru_cache, rpn_eval, word_break, edit_distance), so under token pressure the small models win: e4b and qwen3.5-4b barely move, and qwen is ~3x faster than at the full budget (140s vs 471s). Practical rule: pick the model for the budget. A generous budget favours the qats (perfect but slow); a tight budget or a low-latency need favours e4b or qwen.
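
The inversion can be checked directly from the table. Because the denominators differ between runs, the sketch below ranks by pass rate over the cases actually scored; that metric is a choice made for this illustration, not something the harness is stated to report.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

# (passed, scored) per budget, copied from the table above
RESULTS: Dict[str, Dict[str, Tuple[int, int]]] = {
    "gemma-4-26b-a4b-qat": {"full": (62, 62), "tight": (47, 51)},
    "gemma-4-12b-qat": {"full": (62, 62), "tight": (42, 48)},
    "gemma-4-e4b": {"full": (58, 59), "tight": (54, 56)},
    "qwen3.5-4b": {"full": (53, 56), "tight": (52, 56)},
}


def ranking(budget: str) -> List[str]:
    """Models ordered by pass rate at the given budget, best first."""
    return sorted(RESULTS, key=lambda m: Fraction(*RESULTS[m][budget]), reverse=True)


if __name__ == "__main__":
    for budget in ("full", "tight"):
        print(budget, ranking(budget))
```

At the full budget the two qats lead; at 2048 tokens the order becomes e4b, qwen3.5-4b, 26b-a4b-qat, 12b-qat.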

Note on the oracle: the SUSPECT banner did fire at 2048 (the oracle failed 4 tasks). That is a true signal here, not a task bug — the oracle's pass-everything assumption only holds at a generous budget; deliberately handicapping the oracle is expected to break it.

Findings (what the night taught us)

  1. The oracle caught a real test bug. The bridge write task hinted dev customize set <key> <value> — a command the CLI removed (it moved to the discoverable system settings set). Every model parroted the dead command and looped to the step limit, and the oracle failed too → SUSPECT banner. Fixed the task; the oracle now passes the write in 3 calls. This is exactly what the oracle is for: catching benchmark drift against an evolving CLI.
  2. Context window is the gate for full MCP, not raw capability. The tool payload is ~49k tokens (311 tools); LM Studio's default 8192 load context rejects it outright (HTTP 400). Loading at 128k (BENCH_CONTEXT) makes full MCP work for every candidate. So "can this model do full MCP" is really "is it loaded with enough context" — a config decision, now a test dimension.
  3. Multi-turn tool chaining is not size-bound. Both 4B models pass all four multi-turn tool scenarios. The thing that actually separates models on the bridge/MCP is discovery + speed, not tool mechanics.
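
The oracle check from finding 1 reduces to a small rule: any task the oracle fails is flagged SUSPECT, and other models' failures on it are not counted against them. A minimal sketch, assuming results are held as a model -> task -> passed mapping (a hypothetical shape; the harness's real data structures are not shown in this report):

```python
from typing import Dict, List

Results = Dict[str, Dict[str, bool]]  # model -> task -> passed


def suspect_tasks(results: Results, oracle: str) -> List[str]:
    """Tasks the oracle failed; these may be broken tasks rather than weak models."""
    return sorted(task for task, ok in results[oracle].items() if not ok)


def trusted_failures(results: Results, oracle: str) -> Dict[str, List[str]]:
    """Failures per non-oracle model, ignoring tasks the oracle also failed."""
    suspects = set(suspect_tasks(results, oracle))
    return {
        model: sorted(t for t, ok in tasks.items() if not ok and t not in suspects)
        for model, tasks in results.items()
        if model != oracle
    }
```

As the token-budget note shows, the flag is only meaningful when the oracle runs under conditions where it is expected to pass everything.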

Recommendations

| Use case | Pick | Why |
| --- | --- | --- |
| PII authoring (multi-object writes, the real local job) | gemma-4-26b-a4b-qat (or 12b-qat) | The two strong models tie at ⅔ dataset, 3/3 program over 3 runs; the 26B is faster + more general, the 12B is the half-the-RAM equal. e4b/qwen can't author (0/3). |
| PII reads + easy writes, fast | gemma-4-e4b | Passes reads + the easy/MCP writes, smallest (6.9 GB), fastest, but not for composite authoring. |
| Full MCP, if going local | gemma-4-26b-a4b-qat | Passes reads + write at 128k and is ~2.8× faster than the 12B; e4b if RAM is tight. |
| Coding | oracle or e4b | 62/62 and 58/59; qwen3.5-4b for speed with weaker class/edge-case coding. |

Prune decision (2026-06-17)

  • Keep gemma-4-26b-a4b-qat, gemma-4-12b-qat, gemma-4-e4b, qwen3.5-4b — each wins an axis: the qats author (and the 26B is the faster all-rounder), e4b is efficiency/reads, qwen3.5-4b is the fastest tool driver.
  • The 12B is not redundant despite being slower: it ties the 26B on composite authoring at half the RAM. (An earlier single-run table over-claimed the 12B as the unique author and the oracle as failing — the 3× confidence run corrected both to a tie.)