# Troubleshooting
What we found chasing macOS hard-freezes during multi-Claude-Code workloads. The repo's own code is clean — every loop and async fan-out is bounded. The freezes came from elsewhere; the rest of this doc is what to check.
## What actually caused the macOS freezes
Three signals taken together:
- No kernel panic file in `/Library/Logs/DiagnosticReports/`. The system hung; the kernel didn't formally panic. That rules out a driver crash (Docker's vmnet kext, etc.), since those leave a `.panic` file.
- 9 Python 3.13 SIGSEGV crashes in user crash logs over 4 days, all `EXC_BAD_ACCESS / KERN_INVALID_ADDRESS`. That's a native C extension memory violation (pydantic-core / cryptography / lxml / Pillow / httpcore-rs are the usual suspects). Crashes alone don't freeze the OS, but a constant trickle of native-extension corruption alongside heavy memory pressure means processes die in unhelpful ways.
- One Docker container with no memory limit ate everything during an INLA-style workload. A live `docker stats` snapshot showed `chap-core-worker-1` running unbounded (Docker reports the host RAM ceiling, 15.6 GiB, when no `mem_limit` is set in compose). Idle, the worker is ~1.5 GiB; during an INLA evaluation it spikes to 4-8 GiB. With DHIS2 (~3 GiB) + Postgres (~4 GiB) + the idle chap workload + the host running 4-5 Claude sessions + Chrome, the host went into severe memory pressure. macOS could not page out fast enough; user input stopped being scheduled; only a hard restart recovered.
So: memory pressure from one unbounded container during a workload spike, not a fork bomb, not a loop, not a kext panic.
## What recurring native segfaults mean
`python3.13` SIGSEGVs in `~/Library/Logs/DiagnosticReports/python3.13-*.ips` are coming from a C extension. Look at the most recent one:
```bash
latest=$(ls -t ~/Library/Logs/DiagnosticReports/python3.13-*.ips 2>/dev/null | head -1)
head -60 "$latest"
```
The interesting fields:
- `parentProc`: what spawned the process (`ghostty`, `pytest`, `pycharm`, …). Tells you which workflow is the trigger.
- `responsibleProc`: the highest-level process responsible. Often a terminal.
- `exception.subtype`: `KERN_INVALID_ADDRESS at 0x...` says native pointer corruption.
- The thread backtrace below the header: the first non-Python frame names the offending C library.
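A quick sketch for pulling those fields out of the report (assumes `jq` is installed; a `.ips` file is a one-line JSON header followed by a JSON body, so `tail -n +2` skips the header):

```bash
# Extract the fields named above from the crash body (keys per the .ips body format)
latest=$(ls -t ~/Library/Logs/DiagnosticReports/python3.13-*.ips 2>/dev/null | head -1)
tail -n +2 "$latest" | jq '{parentProc, responsibleProc, exception}'
```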
Common culprits:
| Library | Smell | Mitigation |
|---|---|---|
| `pydantic-core` (Rust) | Crashes during model validation under heavy concurrency | Pin to latest; older releases had real bugs |
| `cryptography` (Rust+OpenSSL) | Crashes during TLS handshake | `uv lock --upgrade-package cryptography && uv sync` |
| `lxml` (libxml2) | Crashes during XML parsing | Rare for our workload, but possible if an example hits a tracker XML import |
| `Pillow` (libpng/libjpeg) | Crashes during image decode | dhis2w-browser screenshot capture is the only path that does this |
| `httpcore-rs` / `h11` | Crashes during connection close | Switching to httpx 0.28+ helps |
Once or twice a week is unusual but tolerable. Daily, with a stable repro, file a bug against the offending library. Until you have a fix, run Python with faulthandler enabled so you get a Python-level traceback at SIGSEGV time:
```bash
PYTHONFAULTHANDLER=1 uv run pytest ...
# or in code: `import faulthandler; faulthandler.enable()` at startup
```
## Resource budgets that survive contention
For a typical Apple Silicon dev machine (32-64 GB RAM) running 4-5 Claude Code sessions + browser + IDE:
| Budget | Policy |
|---|---|
| Docker Desktop memory cap | 6-8 GB unless you specifically need the full DHIS2 e2e seed. The default 16 GB is too generous given everything else you run. Set in Settings → Resources → Memory. |
| Per-container `mem_limit` | Always set one in `compose.yml` for any container that runs computation. Without it, Docker reports the VM ceiling as the limit, which means the container can eat all of Docker's memory. The DHIS2 container in this repo's `infra/compose.yml` is capped at 5g already; if you wire up new compose stacks (chap-core, chap-scheduler, anything similar), audit each `services:` block for a `deploy.resources.limits.memory` or `mem_limit` key (see the sketch after this table). |
| Multiple compose stacks running simultaneously | Avoid. Docker's VM doesn't shrink when containers stop; running two stacks in a day means the VM stays at the high-water mark even after `docker compose down`. |
| `uv tool install` over `uvx` | Each `uvx <pkg>` spawns a fresh resident interpreter per shell. 4 sessions × ~220 MB ≈ 900 MB just for one tool. `uv tool install` keeps a single resident copy across sessions. |
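A minimal sketch of what that audit should find, or what to add when it doesn't. The service name and the 6g figure are illustrative; `mem_limit` and `deploy.resources.limits.memory` are real compose keys:

```bash
# Cap a compute-heavy service via a compose override file.
# `chap-core-worker` and 6g are placeholders -- size them to your workload.
cat > compose.override.yml <<'EOF'
services:
  chap-core-worker:
    mem_limit: 6g
EOF
docker compose up -d  # compose merges compose.yml with compose.override.yml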
The "Docker doesn't release RAM" gotcha¶
`docker compose down` releases the container, NOT the RAM the Linux VM holds. After a day of `make dhis2-run` cycles plus other stacks, the VM sits at 20+ GB resident even with no containers running. Force release:
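```bash
# Quitting Docker Desktop frees the VM's resident RAM; relaunch starts it fresh
# (same quit-and-relaunch used in the recovery section below)
osascript -e 'quit app "Docker"' && open -a Docker
```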
## Diagnostic commands when the system is misbehaving
```bash
# 1. Live container memory -- the single most useful command. Watch for any `MEM USAGE` near the host ceiling.
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.CPUPerc}}"

# 2. macOS memory pressure right now
memory_pressure | tail -10

# 3. Top RAM users
ps aux | sort -k6 -rn | head -10 | awk '{printf "%6.1f MB %s\n", $6/1024, substr($0, index($0,$11), 80)}'

# 4. Kernel panic log (will be empty if you haven't had a real panic)
ls -lah /Library/Logs/DiagnosticReports/ | grep -i panic

# 5. Recent Python user-process crashes
ls -t ~/Library/Logs/DiagnosticReports/python3.13-*.ips 2>/dev/null | head -5

# 6. Disk free on the volume holding the repo
df -h /Users/$USER
```
## fseventsd / Spotlight protection
fseventsd watches every directory by default. Codegen writing 1000 files in a burst, plus Spotlight indexing each one, plus iCloud sync, can saturate the kernel's event pipeline. Exclude noisy paths from indexing by dropping the hidden marker file Spotlight respects:
```bash
touch /Users/$USER/dev/dhis2-utils/.venv/.metadata_never_index 2>/dev/null
touch /Users/$USER/dev/dhis2-utils/site/.metadata_never_index 2>/dev/null
touch /Users/$USER/dev/dhis2-utils/dist/.metadata_never_index 2>/dev/null
mkdir -p /Users/$USER/dev/dhis2-utils/.mypy_cache && touch /Users/$USER/dev/dhis2-utils/.mypy_cache/.metadata_never_index
mkdir -p /Users/$USER/dev/dhis2-utils/.pytest_cache && touch /Users/$USER/dev/dhis2-utils/.pytest_cache/.metadata_never_index
```
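A quick check that the markers landed (path as above):

```bash
find /Users/$USER/dev/dhis2-utils -maxdepth 2 -name .metadata_never_index
```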
Or use System Settings → Spotlight → Spotlight Privacy and drag in `~/dev/dhis2-utils/.venv`, `~/dev/dhis2-utils/site`, `~/dev/dhis2-utils/dist`, `~/dev/dhis2-utils/.mypy_cache`, `~/dev/dhis2-utils/.pytest_cache`.
Verify the repo isn't on iCloud Drive / Dropbox / OneDrive. A synced location amplifies every codegen / uv sync burst.
## Recovery without a hard restart
```bash
# 1. Stop every docker compose stack on this machine
docker stop $(docker ps -q) 2>/dev/null

# 2. Force Docker Desktop to release its VM RAM (this is the biggest single recovery win)
osascript -e 'quit app "Docker"'

# 3. Clean up any leftover MCP / uvx / playwright processes
pkill -f "dhis2w-mcp" 2>/dev/null
pkill -f "uvx" 2>/dev/null
pkill -f "playwright-mcp" 2>/dev/null

# 4. If memory pressure is still high, close unused Claude Code sessions (each ~700-850 MB) and Chrome tabs
```
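Then confirm the pressure actually dropped (same check as diagnostic #2 above):

```bash
memory_pressure | tail -10
```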
## Patterns that bite under multiple Claude Code sessions
These aren't specific to this repo, but this repo gives you many ways to trigger them:
| Pattern | Why it spirals | What helps |
|---|---|---|
| Containers with no `mem_limit` | One bad evaluation eats all RAM. Today's freeze. | Set `deploy.resources.limits.memory` on every service that runs computation. |
| Docker VM doesn't shrink | After a day of compose cycles the VM holds 20+ GB even with no containers. | Quit + relaunch Docker Desktop occasionally. Trim its RAM budget. |
| `mkdocs serve` reload storm | One session runs `make docs-serve`, another edits `docs/`. mkdocs rebuilds, triggers `make docs-cli` writing to `docs/cli-reference.md`, which mkdocs sees, rebuilds again. Self-feeding. | Don't run `mkdocs serve` while another agent edits the same paths. Use `mkdocs build` (one-shot). |
| Concurrent `make test` from two sessions | Both write to `.coverage`, both compete for the uv lock at `~/.cache/uv/.lock`. Stale lock → next command hangs. | Run heavy commands from one session at a time. |
| `uvx <pkg>` cold-start per session | Each Claude shell spawns a fresh ~220 MB interpreter. 4 sessions × multiple tools = quickly into the GBs. | Prefer `uv tool install` for things you use often. |
| Background tasks left from prior sessions | `run_in_background: true` tool calls that an agent never explicitly stops. A cancelled session can leak a 7-minute test run. | Always TaskStop background work when you context-switch. `ps aux \| grep <thing>` after closing a session (see the sweep below). |
| `make dhis2-run` Ctrl-C without `make dhis2-down` | Ctrl-C tears down the foreground process but containers stay up. Repeated cycles accumulate orphans + volumes. | `make dhis2-down` before closing the shell. |
| Codegen runs while mkdocs is watching | Codegen produces ~1000 file events in a few seconds. mkdocs processes each → rebuild → kicks off doc generation → more files. | Stop `mkdocs serve` before running codegen. |
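A hedged sweep for leftovers after closing a session (process names from the rows above; the lock removal assumes no live uv process):

```bash
# List anything that survived a closed session
pgrep -fl "dhis2w-mcp|uvx|playwright-mcp|mkdocs"
# If a uv command hangs on the stale lock and nothing uv-related is still
# running (assumption: check pgrep output first), remove the lock file
rm -f ~/.cache/uv/.lock
```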
## Loop / fan-out audit for this repo
Every `while True` / `while not` and every `asyncio.gather` in `dhis2w-*` source is bounded. No infinite loops, no unbounded fan-out. Verified at v0.6.0:
### `while` loops (7 total)
| Location | Loop | Bound |
|---|---|---|
| `dhis2w-browser/src/.../oauth2.py` (`_read_auth_url`) | Reads stderr lines from `dhis2 profile login` | EOF guard (`if not line_bytes: raise RuntimeError`); caller wraps in `asyncio.wait_for(timeout=60s)` |
| `dhis2w-client/src/.../tasks.py` (`iter_notifications`) | Task-completion poll | `timeout` arg, default 600s, raises `TaskTimeoutError` |
| `dhis2w-client/src/.../category_combos.py` (`wait_for_coc_generation`) | Poll for COC count | `deadline = loop.time() + timeout_seconds` (default 60s), raises `TimeoutError` |
| `dhis2w-client/src/.../data_values.py` (`_async_file_chunks`) | File chunk reader | EOF guard (`if not chunk: return`) |
| `dhis2w-core/src/.../maintenance/service.py` (`wait_for_task`) | Task-status poll | `deadline = loop.time() + timeout` (default 600s), raises `TimeoutError` |
| `dhis2w-core/src/.../metadata/service.py` (`stream_list`) | Pagination | Breaks when `models == []` or `len(models) < page_size` (server-driven) |
| `dhis2w-core/src/.../oauth2_redirect.py` (server-started wait) | Wait for uvicorn `server.started` | Caller wraps in `asyncio.wait_for(timeout=300s)` |
### `asyncio.gather` / parallel fan-out (10 sites)
| Location | What it fans out | Bound |
|---|---|---|
| `dhis2w-client/.../metadata.py:309` (`search`) | One request per search field | Bounded by the number of search fields (~5) |
| `dhis2w-client/.../metadata.py:385` (`usage`) | One request per `_USAGE_PATTERNS` template | Bounded by the static pattern table |
| `dhis2w-client/.../metadata.py:514` (`patch_bulk`) | One PATCH per (resource, uid, ops) tuple | `asyncio.Semaphore(concurrency)` caps concurrent in-flight requests |
| `dhis2w-client/.../metadata.py:593` (`apply_sharing_bulk`) | One POST per (resource, uid) tuple | `asyncio.Semaphore(concurrency)` caps concurrent in-flight requests |
| `dhis2w-core/.../doctor/service.py:46,50,84` | Fixed list of metadata + bug probes | Bounded by the static probe lists |
| `dhis2w-core/.../doctor/probes_metadata.py:606` | 3 read calls (programs, stages, DEs) | Bounded by the literal tuple |
| `dhis2w-core/.../files/service.py:94` (`get_many`) | One request per uid | No semaphore. Mitigation: caller controls list size; would need to be 10k+ uids before it became a real problem |
| `dhis2w-core/.../metadata/service.py:643` | 2 metadata-export calls | Bounded by the literal pair |
| `dhis2w-core/.../oauth2_redirect.py:113` | Single `create_task(server.serve())` | Single task, awaited |
### Subprocess spawn (5 sites)
All single-spawn-and-wait, no loop:
- `dhis2w-browser/.../oauth2.py:65`: `asyncio.create_subprocess_exec("dhis2 profile login")`, awaited with `wait_for(timeout=60s)`.
- `dhis2w-codegen/.../emit.py:503,508`: `subprocess.run(["ruff", ...])` after each codegen pass. One-shot.
- `dhis2w-codegen/.../_shared.py:35,40`: same pattern, one-shot.
No `subprocess.Popen` in any loop. No `asyncio.create_task` in any loop. No `multiprocessing` or `concurrent.futures` anywhere in the source.
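To re-run the audit yourself, a minimal sketch (the globs assume the `dhis2w-*/src` layout named above; run from the repo root):

```bash
# Every loop that could spin forever
grep -rnE "while (True|not)" --include="*.py" dhis2w-*/src
# Every fan-out or spawn site
grep -rnE "asyncio\.gather|create_subprocess_exec|subprocess\.(run|Popen)" --include="*.py" dhis2w-*/src
```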
## Heavy-churn patterns this repo can produce
Worth knowing what each command's footprint is so you can avoid running them at the same time as something else memory-heavy:
| Action | Footprint | Mitigation |
|---|---|---|
| `make dhis2-codegen-play` | ~1000 generated `.py` files | Don't run during `mkdocs serve`; ensure `.venv` etc. are excluded from Spotlight |
| `make dhis2-codegen-all` | ~4000 generated files (4 versions) | Run when nothing else is touching the repo |
| `uv sync` after a pyproject change | Rewrites `.venv` (thousands of file ops) | Unavoidable, but doesn't compound if Spotlight-excluded |
| `make refresh-and-verify` | Heavy block I/O against the Docker VM disk image | The single most likely trigger of Docker driver hangs. Don't run while other compose stacks share the same Docker VM. |
| `verify_examples.py` | ~165 subprocesses in series, each ~1-30s | Bounded by per-script `--timeout` (default 300s); `SKIP_BY_DEFAULT` excludes Playwright-driven ones |
| MCP server (`dhis2w-mcp`) | Long-lived stdio process per host connection, ~220 MB | Constant cost, not a loop. Prefer `uv tool install dhis2w-mcp` over `uvx` to share one resident copy across sessions (see below) |
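The one-time swap, assuming the `dhis2w-mcp` package exposes a same-named entry point:

```bash
# One resident copy shared across sessions instead of a fresh uvx interpreter each time
uv tool install dhis2w-mcp
```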
## When in doubt: the smallest reset that helps
```bash
# Exit unused Claude Code sessions (each is ~700-850 MB of RAM)
# Quit Chrome tabs you're not actively using
osascript -e 'quit app "Docker"' && open -a Docker
# Activity Monitor → Memory tab. "Memory Pressure" graph should be green.
```
If memory pressure goes yellow/red and stays there, restart Docker rather than the whole machine; it's almost always Docker's VM not releasing RAM. If kernel `.panic` files exist in `/Library/Logs/DiagnosticReports/`, that's the moment to swap Docker to QEMU mode in Settings.