General-capability benchmark (axis 1)
Living benchmark for general model capability — independent of DHIS2. This is axis 1 of model
validation; axis 2 (driving the bridge) lives in model-benchmark.md. Re-run
with make bench-general (harness: packages/dhis2w-bench/src/dhis2w_bench/general.py).
Choosing models
There is no hardcoded roster — you name the model(s) to benchmark. List what's installed with
make bench-list, then pass one or more keys via MODELS=:
- one model -> a single-model run
- several -> a side-by-side comparison table (they run one at a time through the same suites)
Optionally set BENCH_ORACLE=<key> to mark one of them as the oracle (see below). The harness skips
and logs any named model that isn't installed.
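
The selection step described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the function name `select_models`, the exact-match rule, and the sample installed set are all assumptions made for the example.

```python
from __future__ import annotations

import os


def select_models(requested: str, installed: set[str]) -> tuple[list[str], list[str]]:
    """Split a whitespace-separated MODELS value into (runnable, skipped) keys."""
    runnable: list[str] = []
    skipped: list[str] = []
    for key in requested.split():
        # Assumption: a requested key must match an installed key exactly.
        (runnable if key in installed else skipped).append(key)
    return runnable, skipped


# Hypothetical installed set; in practice this would come from the backend.
installed = {"google/gemma-4-12b-qat", "google/gemma-4-e4b"}
runnable, skipped = select_models(os.environ.get("MODELS", ""), installed)
for key in skipped:
    print(f"skipping {key}: not installed")
if runnable:
    mode = "comparison table" if len(runnable) > 1 else "single-model run"
    print(f"{len(runnable)} model(s) selected -> {mode}")
```

Keeping the skipped keys in their own list makes it easy to log them once up front instead of failing halfway through a long run.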
Method
Three suites, all scored by execution or structural match — never an AI judge:
- python: the model writes one function or class; the harness extracts the fenced code block, execs it, and runs hidden test cases. Tasks span easy (Roman numerals, balanced brackets) to harder (LRU cache class, longest-common-subsequence) so models actually separate.
- cli: the model writes a single shell command for a goal; the harness runs it in a curated-PATH temp sandbox and checks the effect (a created file, stdout). Only an allowlist of read/format tools is reachable (echo, cat, wc, awk, ...; no rm/curl/sudo), and commands containing an absolute path, ~, or .. are rejected before running. This bounds, but does not perfectly isolate, the model's shell; run it on a machine you trust.
- tooling: the model is given mock tool specs (get_weather, send_email, search) and a goal; the harness checks it emits the right tool call with the right args. This is the function-calling foundation the bridge depends on, so it predicts axis-2 competence.
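
To make the python suite concrete, here is a rough sketch of the extract, exec, and test loop. It is an illustration under stated assumptions rather than the code in general.py: the names `extract_code` and `score`, the choice of the last complete fenced block, and the `(args, expected)` case format are all hypothetical.

```python
from __future__ import annotations

import re
from typing import Optional

FENCE = "`" * 3  # built this way so the example can sit inside a fenced block
FENCE_RE = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply: str) -> Optional[str]:
    """Return the body of the last complete fenced block, or None if there is none."""
    blocks = FENCE_RE.findall(reply)
    return blocks[-1] if blocks else None


def score(reply: str, func_name: str, cases: list[tuple[tuple, object]]) -> tuple[int, int]:
    """Exec the extracted code and count how many hidden cases pass."""
    code = extract_code(reply)
    if code is None:
        return 0, len(cases)  # no closed fence: nothing to run
    namespace: dict = {}
    try:
        exec(code, namespace)  # runs model-written code; only do this on a trusted machine
        func = namespace[func_name]
    except Exception:
        return 0, len(cases)  # syntax error, import error, or missing function
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing case counts as a failure
    return passed, len(cases)
```

Taking the last block rather than the first is a design choice for reasoning models, which may emit draft snippets before the final answer.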
Per-task: pass/fail (cases passed / total), wall-clock, completion tokens. Per-model JSON is appended
to /tmp/bench_general_results.jsonl.
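
As an illustration of the per-task metrics and the per-model JSONL record, the sketch below appends one JSON object per model to a file. The field names (`task`, `passed`, `total`, `seconds`, `completion_tokens`) and the record layout are assumptions for the example; the real schema is whatever general.py writes.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class TaskResult:
    task: str               # hypothetical task identifier
    passed: int             # cases passed
    total: int              # cases attempted
    seconds: float          # wall-clock time for the task
    completion_tokens: int  # tokens generated for the task


def append_model_record(path: Path, model: str, results: list[TaskResult]) -> dict:
    """Append one JSON line summarising a model's run and return the record."""
    record = {
        "model": model,
        "tasks": [asdict(r) for r in results],
        "passed": sum(r.passed for r in results),
        "total": sum(r.total for r in results),
    }
    # Mode "a" appends, so earlier runs in the same file are preserved.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

One object per line means the file can be read back incrementally with `json.loads` on each line, without parsing the whole history at once.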
Token budget is a knob. The models benchmarked so far are reasoning models: they emit a long
chain-of-thought before the answer, so a low generation cap truncates the actual code or command (the
closing code fence never arrives). The cap defaults to a generous 16384 tokens but is configurable via
BENCH_MAX_TOKENS. Lowering it is a deliberate stress test: at a generous budget the gemmas all
pass everything (no separation), so to tell them apart you tighten the budget and see which degrade
gracefully and which break first.
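
A small sketch of the two ideas in this paragraph: reading the cap from BENCH_MAX_TOKENS with the 16384 default, and spotting a reply whose closing fence never arrived. The helper names and the odd-fence-count heuristic are assumptions for illustration, not necessarily how the harness detects truncation.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional

DEFAULT_MAX_TOKENS = 16384
FENCE = "`" * 3  # built this way so the example can sit inside a fenced block


def max_tokens(env: Optional[Mapping[str, str]] = None) -> int:
    """Read the generation cap from BENCH_MAX_TOKENS, falling back to the default."""
    env = os.environ if env is None else env
    return int(env.get("BENCH_MAX_TOKENS", DEFAULT_MAX_TOKENS))


def has_unclosed_fence(reply: str) -> bool:
    """True when the reply holds an odd number of fence markers.

    An odd count means the last block was opened but never closed, which is
    the typical shape of an answer cut off by the token cap.
    """
    return reply.count(FENCE) % 2 == 1
```

Note that this heuristic does not catch a reply that ran out of tokens before opening any fence at all; such a reply simply has no code to extract.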
The oracle (opt-in)
Set BENCH_ORACLE=<key> to designate one model in the run as the oracle — the should-pass bar.
The harness then asserts that model passed every task and prints a loud SUSPECT TASK(S) banner if
not; an oracle failure almost always means the task is mis-specified, not the model, so fix the
task before trusting the other columns. With no BENCH_ORACLE set there is no oracle check. Our
strongest local model, google/gemma-4-26b-a4b-qat, is the natural choice.
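
The oracle check can be pictured roughly like this. The data shape (model key to a map of task name to pass/fail), the function names, and the banner wording beyond "SUSPECT TASK(S)" are assumptions for the example.

```python
from __future__ import annotations

from typing import Optional


def suspect_tasks(results: dict[str, dict[str, bool]], oracle: Optional[str]) -> list[str]:
    """Return the tasks the oracle failed; empty when no oracle is set or it did not run."""
    if not oracle or oracle not in results:
        return []  # no BENCH_ORACLE, or the oracle model was skipped: no check
    return sorted(task for task, ok in results[oracle].items() if not ok)


def banner(tasks: list[str]) -> str:
    """Format a loud warning line listing the suspect tasks."""
    bar = "!" * 60
    return f"{bar}\nSUSPECT TASK(S): {', '.join(tasks)}\n{bar}"


# Hypothetical results for two models on three tasks.
results = {
    "oracle-model": {"roman": True, "lru_cache": False, "lcs": True},
    "small-model": {"roman": True, "lru_cache": False, "lcs": False},
}
failed = suspect_tasks(results, "oracle-model")
if failed:
    print(banner(failed))
```

In this made-up run the oracle fails `lru_cache`, so that task would be reviewed first before reading anything into the small model's failure on it.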
Latest results (2026-06-17, extended suite: 62 cases)
| model | python | cli | tooling | total | time | tok/s |
|---|---|---|---|---|---|---|
| google/gemma-4-26b-a4b-qat (oracle) | 52/52 | 3/3 | 7/7 | 62/62 | 612s | 60 |
| google/gemma-4-12b-qat | 52/52 | 3/3 | 7/7 | 62/62 | 936s | 35 |
| google/gemma-4-e4b | 48/49 | 3/3 | 7/7 | 58/59 | 346s | 52 |
| qwen/qwen3.5-4b | 44/46 | 2/3 | 7/7 | 53/56 | 471s | 85 |
Oracle clean. The two qats are perfect; e4b near-perfect and fast; qwen3.5-4b fast but drops the
LRU-cache class and a wc command. Both 4B models pass all four multi-turn agentic tool chains, so
chaining is not size-bound. Full write-up across all three benchmarks:
benchmark-results.md.
Re-running
make bench-list # what's installed
make bench-general MODELS="google/gemma-4-12b-qat" # one model
make bench-general MODELS="gemma-4-12b-qat gemma-4-e4b" # compare several
BENCH_MAX_TOKENS=2048 make bench-general MODELS="..." # tighter token budget
BENCH_ORACLE=google/gemma-4-26b-a4b-qat make bench-general MODELS="..." # with an oracle
make bench-validate MODEL=google/gemma-4-12b-qat # this axis + the bridge axis together
Needs a running backend (LM Studio by default; set MODEL_BACKEND to switch — see
packages/dhis2w-bench/src/dhis2w_bench/backend.py). The harness loads/unloads each model itself, one at a time.