AI agent testing

This toolkit is built to be driven by AI agents: the CLI, the MCP server, and the single-tool bridge all exist so a model can operate a DHIS2 instance. This page describes how we verify that this actually works, which models can do it, and what we learned. The detailed run logs are linked at the bottom.

Two questions

1. Does every command work for an agent? — answered deterministically, not by a model:

A test renders --help for each of the ~363 leaf commands and asserts exit code 0, so no command is broken or unregistered. This is the structural 100% baseline (packages/dhis2w-cli/tests/test_cli_surface.py::test_every_command_renders_help).
A capable agent is the oracle: Claude Code or Codex should form every command correctly. Any command a capable agent can't drive is a real CLI defect (bad help, undiscoverable command), not a model limitation. Composite write workflows (make bench-composite) are proven oracle-first.
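The structural check above can be pictured with a small, self-contained sketch. It is not the code in test_cli_surface.py: it assumes a toy argparse CLI (build_toy_cli and its three commands are hypothetical stand-ins), walks the parser tree to find every leaf command, and confirms that --help exits with code 0 for each one.

```python
import argparse
import contextlib
import io


def build_toy_cli() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the real CLI: two groups, three leaf commands."""
    root = argparse.ArgumentParser(prog="toy")
    groups = root.add_subparsers(dest="group")
    metadata = groups.add_parser("metadata").add_subparsers(dest="cmd")
    metadata.add_parser("list", help="List metadata objects.")
    metadata.add_parser("get", help="Fetch one metadata object.")
    system = groups.add_parser("system").add_subparsers(dest="cmd")
    system.add_parser("info", help="Show server info.")
    return root


def leaf_commands(parser, prefix=()):
    """Yield the argv path of every command that has no subcommands."""
    sub_actions = [
        action for action in parser._actions
        if isinstance(action, argparse._SubParsersAction)
    ]
    if not sub_actions:
        yield prefix
        return
    for action in sub_actions:
        for name, child in action.choices.items():
            yield from leaf_commands(child, prefix + (name,))


def help_exit_code(parser, path):
    """Render --help for one command path; return (exit code, help text)."""
    out = io.StringIO()
    with contextlib.redirect_stdout(out):
        try:
            parser.parse_args([*path, "--help"])
        except SystemExit as exc:  # argparse exits after printing help
            return exc.code, out.getvalue()
    return None, out.getvalue()


if __name__ == "__main__":
    cli = build_toy_cli()
    for path in leaf_commands(cli):
        code, _ = help_exit_code(cli, path)
        print(" ".join(path), "exit", code)
```

Enumerating the leaves from the parser tree rather than from a hand-written list is what lets a check of this kind also catch commands that were never registered.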

2. Which local models can drive it, and how well? — measured as a gradient (small local models will never be 100%). This is the privacy use case: a small model on-box, against data that can't leave the machine, driving the bridge.

The harnesses

All harnesses are runnable from the Makefile; the model roster lives in packages/dhis2w-bench/src/dhis2w_bench/bridge.py (ROSTER). Reads run against play42 (read-only); writes run against local_basic (self-cleaning).

| Command | What it measures |
| --- | --- |
| `make bench-list` | List the installed models (with context window + tool-use flag) available to benchmark. |
| `make bench-bridge` | The roster over the single-tool bridge: read + write + performance. The primary capability benchmark. |
| `make bench-matrix` | A command × model grid: does each model find and form each CLI command. |
| `make bench-composite` | Multi-object write workflows (data set + elements, program + stages), oracle reference. |
| `make bench-round` | Drive one model through a read / write / benchmark round interactively. |
| `make bench-general` | Coding axis (python + cli + multi-turn tooling, no DHIS2); predicts tool competence. |
| `make bench-mcp` | The full dhis2-mcp server (~311 typed tools): read + write at a configurable load context. |
| `make bench-longcontext` | Effective context (needle-in-a-haystack): how many tokens a model can actually use. |
| `make bench-validate` | One model across both the coding and bridge axes in a single run. |
| `make bench-claude-general` | Cloud Claude on the coding suite (python + cli + tooling); the cloud peer of `bench-general`. |
| `make bench-claude-mcp` | Cloud Claude over the full dhis2-mcp server via the Agent SDK (read + write + composite). |
| `make bench-claude-bridge` | Cloud Claude over the single-tool bridge via the Agent SDK (read + write + composite). |

The cloud bench-claude-* lanes use ambient Claude Code subscription auth (no API key) and exist so local-vs-cloud is directly comparable on the same tasks. See docs/notes/benchmark-plan.md and docs/notes/benchmark-results.md for the full methodology and latest numbers.
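For readers unfamiliar with the needle-in-a-haystack idea behind bench-longcontext, here is a minimal sketch of the technique. It is an illustration under stated assumptions, not the harness itself: the function names are hypothetical, size is counted in whitespace-separated words as a rough proxy for tokens, and "effective context" is assumed to mean the largest tested size at which the needle is still recalled with no smaller size failing.

```python
def build_haystack(needle: str, filler: str, total_words: int, depth: float) -> str:
    """Return about total_words of filler with the needle inserted at a relative depth (0.0 to 1.0)."""
    filler_words = filler.split()
    repeats = total_words // len(filler_words) + 1
    words = (filler_words * repeats)[:total_words]
    position = int(len(words) * depth)
    return " ".join(words[:position] + [needle] + words[position:])


def needle_recalled(answer: str, secret: str) -> bool:
    """Score one probe: did the model's answer contain the hidden fact?"""
    return secret.lower() in answer.lower()


def effective_context(results: dict[int, bool]) -> int:
    """Largest tested size that passed with every smaller tested size passing too."""
    best = 0
    for size in sorted(results):
        if not results[size]:
            break
        best = size
    return best


if __name__ == "__main__":
    prompt = build_haystack("The access code is 7421.", "lorem ipsum dolor", 30, 0.5)
    print(prompt)
    # Hypothetical pass/fail results per haystack size, for illustration only.
    print(effective_context({1000: True, 4000: True, 16000: False, 32000: True}))
```

Stopping at the first failure, rather than taking the largest passing size, keeps a lucky hit at a large size from overstating what the model can reliably use.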

Headline findings

Best local driver: google/gemma-4-26b-a4b-qat. On read + write + performance it passes everything and is fast — the MoE (26B / 4B active) keeps 12b-class speed with more capability. The qat builds beat their bf16 siblings at equal correctness. The qwens are strong, fast read-only drivers — they stall on writes.
Writes are the ceiling. A single hinted setting is easy for every model. A multi-object write (a data set + 10 data elements, wired together) defeats every local model: the attach step (correlating many just-created UIDs) is the wall, and more turns don't help (it's a coherence limit, not a budget one). The capable-agent oracle completes the same write 100% of the time. That gap is the story: trivial for a capable agent, a wall for local models.
The command×model grid is a stress-test, not a leaderboard. On the 1,230-cell bench-matrix, "found the right command" sits at ~10% for everyone, and the best driver (26b-a4b-qat) scored near the bottom — because the metric is "pick the exact command among ~200 siblings from a vague one-line goal", which is interpretation noise, not capability. Judge models with bench-bridge; read the grid as a discoverability stress-test of the help surface.
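To make the attach step in the multi-object write finding concrete, the sketch below builds such a payload in memory. It makes assumptions that are not taken from this toolkit: the field names (dataSetElements, valueType, periodType and so on) follow the DHIS2 metadata schema as commonly documented, the UIDs are generated locally instead of being returned by a server, and build_payload and dangling_references are hypothetical helpers. The point is the bookkeeping: every UID produced in step 1 has to reappear, character for character, in step 2.

```python
import random
import string


def make_uid(rng: random.Random) -> str:
    """DHIS2-style UID: 11 characters, a letter first, then letters or digits."""
    first = rng.choice(string.ascii_letters)
    rest = "".join(rng.choice(string.ascii_letters + string.digits) for _ in range(10))
    return first + rest


def build_payload(element_count: int = 10, seed: int = 0) -> dict:
    rng = random.Random(seed)
    # Step 1: create the data elements and remember every UID.
    elements = [
        {
            "id": make_uid(rng),
            "name": f"Bench element {i}",
            "shortName": f"Bench el {i}",
            "valueType": "NUMBER",
            "aggregationType": "SUM",
            "domainType": "AGGREGATE",
        }
        for i in range(1, element_count + 1)
    ]
    # Step 2 (the attach step): each UID from step 1 must be repeated exactly.
    data_set = {
        "id": make_uid(rng),
        "name": "Bench data set",
        "shortName": "Bench DS",
        "periodType": "Monthly",
        "dataSetElements": [{"dataElement": {"id": e["id"]}} for e in elements],
    }
    return {"dataElements": elements, "dataSets": [data_set]}


def dangling_references(payload: dict) -> list[str]:
    """Return referenced data element UIDs that were never created."""
    known = {e["id"] for e in payload["dataElements"]}
    return [
        ref["dataElement"]["id"]
        for ds in payload["dataSets"]
        for ref in ds["dataSetElements"]
        if ref["dataElement"]["id"] not in known
    ]


if __name__ == "__main__":
    payload = build_payload()
    print(len(payload["dataElements"]), "elements,", len(dangling_references(payload)), "dangling")
```

For a program this is a one-line comprehension. An agent has to carry the same ten opaque identifiers across turns and reproduce each one without error.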

Why this shapes the design

The findings drive the bridge design. A small model can't carry ~304 tool schemas or pick among hundreds of tools, so the bridge gives it one tool and a self-describing CLI to discover progressively. That approach is only as good as the help and error output, hence the read-surface hardening (did-you-mean, metadata type list, d2w schema <type>, --fields warnings). See MCP servers — which one?
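The one-tool idea can be sketched in a few lines. This is a hypothetical illustration, not the bridge's implementation: the COMMANDS registry, the bridge function, and the response shape are all assumptions, and the did-you-mean hint uses the standard library's difflib rather than whatever matching the real CLI uses.

```python
import difflib

# Hypothetical command registry; the real CLI surface is far larger.
COMMANDS = {
    "metadata list": "List objects of a metadata type.",
    "metadata get": "Fetch one metadata object by UID.",
    "system info": "Show server version and status.",
}


def bridge(command: str) -> dict:
    """Single tool entry point: run a command, or explain how to recover."""
    command = " ".join(command.split())  # normalise whitespace
    if command in ("", "help"):
        listing = "\n".join(f"{name}: {doc}" for name, doc in COMMANDS.items())
        return {"ok": True, "output": listing}
    if command in COMMANDS:
        return {"ok": True, "output": f"(would run) {command}"}
    # Unknown command: turn the failure into a next step for the model.
    suggestions = difflib.get_close_matches(command, list(COMMANDS), n=2, cutoff=0.6)
    if suggestions:
        hint = f" Did you mean: {', '.join(suggestions)}?"
    else:
        hint = " Send 'help' to list commands."
    return {"ok": False, "error": f"Unknown command '{command}'.{hint}"}


if __name__ == "__main__":
    print(bridge("help")["output"])
    print(bridge("metadta list")["error"])
```

The model only ever sees one tool schema. Everything else is discovered through the text the tool returns, which is why the quality of help and error messages carries so much weight in this design.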

Detailed logs

The full run records (working notes — raw data, per-round findings):

Model benchmark — the roster, the read/write/perf table, the rankings.
Small-model bridge notes — the design log: read-surface hardening and Rounds 1-6 (incl. the multi-object write ceiling).
CLI command × model matrix — the 1,230-cell discovery grid + the not-a-ranking caveat.
Bridge verification — the earlier capable-agent + qwen benchmark (tool-call parse reliability).