Health Checks and Monitoring¶
Chapkit provides comprehensive health check capabilities for service monitoring: one-time checks for probes and load balancers, and continuous streaming for real-time monitoring.
Quick Start¶
Enable health checks in your service:
from chapkit.api import ServiceBuilder, ServiceInfo
app = (
    ServiceBuilder(info=ServiceInfo(display_name="My Service"))
    .with_health()  # Enables /health endpoint
    .build()
)
Your service now exposes health endpoints at `/health` and `/health/$stream`.
Endpoints¶
One-Time Health Check¶
Endpoint: GET /health
Returns the current health status in a single response.
Response:
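An illustrative body for a healthy service with no custom checks (the shape mirrors the Custom Checks example later in this guide):

```json
{
  "status": "healthy",
  "checks": {}
}
```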
Use Cases:

- Kubernetes liveness/readiness probes
- Load balancer health checks
- Manual health verification
- CI/CD deployment validation
Continuous Health Monitoring (SSE)¶
Endpoint: GET /health/$stream
Streams health status updates continuously using Server-Sent Events (SSE):
# Stream with default 1.0s interval
curl -N http://localhost:8000/health/\$stream
# Stream with custom 2.0s interval
curl -N "http://localhost:8000/health/\$stream?poll_interval=2.0"
Response Format (text/event-stream):
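Each poll emits one `data:` event whose payload is the same JSON document returned by `GET /health`, for example (illustrative):

```
data: {"status": "healthy", "checks": {}}

data: {"status": "healthy", "checks": {}}
```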
Query Parameters:

- `poll_interval` (float): Seconds between health checks. Default: `1.0`
Use Cases:

- Real-time dashboard monitoring
- Continuous integration tests
- Service health visualization
- Alert detection systems
- Development/debugging
Note: Stream continues indefinitely until client disconnects. Use Ctrl+C to stop.
Custom Health Checks¶
Add custom health checks to monitor specific subsystems:
from chapkit.api import ServiceBuilder, ServiceInfo
from chapkit.core.api.routers.health import HealthState

async def check_database() -> tuple[HealthState, str | None]:
    """Check database connectivity."""
    try:
        # Test database connection (get_session is your app's session factory)
        async with get_session() as session:
            await session.execute("SELECT 1")
        return (HealthState.HEALTHY, None)
    except Exception as e:
        return (HealthState.UNHEALTHY, f"Database error: {e}")

async def check_redis() -> tuple[HealthState, str | None]:
    """Check Redis connectivity."""
    try:
        # Test Redis connection (redis_client is your app's Redis client)
        await redis_client.ping()
        return (HealthState.HEALTHY, None)
    except Exception as e:
        return (HealthState.DEGRADED, f"Redis unavailable: {e}")

app = (
    ServiceBuilder(info=ServiceInfo(display_name="My Service"))
    .with_health(checks={
        "database": check_database,
        "redis": check_redis,
    })
    .build()
)
Response with Custom Checks:
{
  "status": "degraded",
  "checks": {
    "database": {
      "state": "healthy"
    },
    "redis": {
      "state": "degraded",
      "message": "Redis unavailable: Connection refused"
    }
  }
}
Health States¶
- `healthy`: All checks passed, service fully operational
- `degraded`: Some non-critical checks failed, service partially operational
- `unhealthy`: Critical checks failed, service not operational
Aggregation Logic:
- Overall status = worst state among all checks (see the sketch below)
- `unhealthy` > `degraded` > `healthy`
- An exception raised in a check = `unhealthy` with the error message
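A minimal sketch of this aggregation rule (illustrative only, not chapkit's actual implementation):

```python
from chapkit.core.api.routers.health import HealthState

# Rank states so max() picks the worst one: unhealthy > degraded > healthy
SEVERITY = {HealthState.HEALTHY: 0, HealthState.DEGRADED: 1, HealthState.UNHEALTHY: 2}

def aggregate(states: list[HealthState]) -> HealthState:
    """Overall status is the worst state among all individual checks."""
    return max(states, key=SEVERITY.__getitem__, default=HealthState.HEALTHY)
```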
Kubernetes Integration¶
Liveness and Readiness Probes¶
Use health checks for Kubernetes pod lifecycle management:
deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chapkit-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: chapkit-service
  template:
    metadata:
      labels:
        app: chapkit-service
    spec:
      containers:
        - name: app
          image: your-chapkit-app
          ports:
            - containerPort: 8000
          # Liveness probe - restart if unhealthy
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          # Readiness probe - remove from service if not ready
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
Best Practices:
- Liveness: Checks if the app is stuck/deadlocked (longer intervals, higher threshold)
- Readiness: Checks if the app can serve traffic (shorter intervals, lower threshold)
- Use `/health` for both probes (not `/health/$stream`)
- Set appropriate timeouts (3-5 seconds recommended)
Service Mesh Integration¶
For service meshes like Istio or Linkerd, health checks are used for traffic routing:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: chapkit-service
spec:
  host: chapkit-service
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 40
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 100
Python Client Examples¶
One-Time Health Check¶
import httpx

async def check_service_health(base_url: str) -> bool:
    """Check if service is healthy."""
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{base_url}/health")
        data = response.json()
        return data["status"] == "healthy"
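For example, a deployment gate can fail a CI step when the service is not healthy (the URL is a placeholder):

```python
import asyncio

if __name__ == "__main__":
    healthy = asyncio.run(check_service_health("http://localhost:8000"))
    # A non-zero exit code fails the pipeline step
    raise SystemExit(0 if healthy else 1)
```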
Continuous Monitoring with SSE¶
import httpx
import json

async def monitor_service_health(base_url: str, poll_interval: float = 1.0):
    """Monitor service health via SSE stream."""
    url = f"{base_url}/health/$stream?poll_interval={poll_interval}"
    # timeout=None disables the default read timeout for the long-lived stream
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("GET", url) as response:
            async for line in response.aiter_lines():
                if line.startswith("data: "):
                    data = json.loads(line[6:])
                    status = data["status"]
                    print(f"Health: {status}")
                    if status != "healthy":
                        # Alert or take action (send_alert is your own notifier)
                        await send_alert(f"Service unhealthy: {data}")
Health Check with Timeout¶
import httpx
import asyncio

async def health_check_with_timeout(base_url: str, timeout: float = 3.0) -> str:
    """Check health with timeout."""
    try:
        async with httpx.AsyncClient(timeout=timeout) as client:
            response = await client.get(f"{base_url}/health")
            response.raise_for_status()
            data = response.json()
            return data["status"]
    except httpx.TimeoutException:
        return "timeout"
    except Exception as e:
        return f"error: {e}"
Load Balancer Integration¶
HAProxy Configuration¶
backend chapkit_servers
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200
    # Match the status field rather than the bare substring "healthy",
    # which would also match "unhealthy"
    http-check expect rstring "status":\s*"healthy"
    server app1 10.0.1.10:8000 check inter 5s rise 2 fall 3
    server app2 10.0.1.11:8000 check inter 5s rise 2 fall 3
    server app3 10.0.1.12:8000 check inter 5s rise 2 fall 3
NGINX Configuration¶
upstream chapkit_backend {
    server 10.0.1.10:8000 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8000 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;

    location /health {
        proxy_pass http://chapkit_backend;
        proxy_connect_timeout 2s;
        proxy_read_timeout 5s;
    }

    location / {
        proxy_pass http://chapkit_backend;
        # Health check performed separately
    }
}
Monitoring Dashboards¶
Grafana Dashboard with SSE¶
Create a custom panel using SSE streaming:
// Grafana panel plugin for SSE health monitoring
// (updateHealthStatus and updateChecksTable are your panel's render helpers)
const eventSource = new EventSource('http://localhost:8000/health/$stream');

eventSource.onmessage = (event) => {
  const data = JSON.parse(event.data);
  updateHealthStatus(data.status);
  if (data.checks) {
    updateChecksTable(data.checks);
  }
};
Simple HTML Dashboard¶
<!DOCTYPE html>
<html>
<head>
  <title>Service Health Monitor</title>
</head>
<body>
  <h1>Service Health</h1>
  <div id="status">Connecting...</div>
  <pre id="checks"></pre>

  <script>
    const eventSource = new EventSource('http://localhost:8000/health/$stream?poll_interval=2.0');

    eventSource.onmessage = (event) => {
      const data = JSON.parse(event.data);
      const statusEl = document.getElementById('status');
      const checksEl = document.getElementById('checks');

      // Update status with color coding
      statusEl.textContent = `Status: ${data.status}`;
      statusEl.style.color = data.status === 'healthy' ? 'green' :
                             data.status === 'degraded' ? 'orange' : 'red';

      // Show detailed checks
      if (data.checks) {
        checksEl.textContent = JSON.stringify(data.checks, null, 2);
      }
    };

    eventSource.onerror = () => {
      document.getElementById('status').textContent = 'Connection lost';
      document.getElementById('status').style.color = 'red';
    };
  </script>
</body>
</html>
Best Practices¶
Recommended Practices¶
- Enable in all services: Health checks are essential for production reliability
- Use custom checks: Monitor critical dependencies (database, cache, external APIs)
- Keep checks fast: Health checks should complete in <1 second (see the sketch after this list)
- Avoid expensive operations: Don't run migrations, heavy queries, or external calls
- Use appropriate states: `degraded` for non-critical failures, `unhealthy` for critical ones
- SSE for dashboards: Use `/health/$stream` for real-time monitoring UIs
- One-time for probes: Use `/health` for Kubernetes and load balancers
- Unauthenticated: Keep health endpoints public for infrastructure access
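One way to enforce the "keep checks fast" rule is to wrap a slow dependency check in a time budget so `/health` degrades instead of hanging. A minimal sketch, assuming checks are plain async callables as shown earlier; `with_timeout` and the 0.8-second budget are illustrative, not a chapkit API:

```python
import asyncio
from collections.abc import Awaitable, Callable

from chapkit.api import ServiceBuilder
from chapkit.core.api.routers.health import HealthState

CheckResult = tuple[HealthState, str | None]

def with_timeout(check: Callable[[], Awaitable[CheckResult]], seconds: float = 0.8):
    """Wrap a health check so it reports degraded instead of hanging /health."""
    async def wrapped() -> CheckResult:
        try:
            return await asyncio.wait_for(check(), timeout=seconds)
        except asyncio.TimeoutError:
            return (HealthState.DEGRADED, f"check exceeded {seconds}s budget")
    return wrapped

# info and check_database come from the earlier examples
app = (
    ServiceBuilder(info=info)
    .with_health(checks={"database": with_timeout(check_database)})
    .build()
)
```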
Avoid¶
- Expensive checks: Heavy database queries, full table scans, complex computations
- External dependencies in liveness: Don't make liveness depend on external services
- High-frequency polling: Don't poll `/health` more than once per second
- Authenticated health: Health endpoints should be unauthenticated for infrastructure access
- Incomplete aggregation: Always include all critical subsystems in checks
Combining with Other Features¶
With Authentication¶
Health endpoints should remain unauthenticated:
app = (
    ServiceBuilder(info=info)
    .with_health()
    .with_auth(
        unauthenticated_paths=[
            "/health",          # Health check
            "/health/$stream",  # Health monitoring
        ]
    )
    .build()
)
With Monitoring¶
Combine health checks with Prometheus metrics:
app = (
    ServiceBuilder(info=info)
    .with_health(checks={"database": check_database})
    .with_monitoring()  # Prometheus metrics at /metrics
    .build()
)
Operational Endpoints:
- `/health` - Kubernetes liveness/readiness (one-time)
- `/health/$stream` - Real-time monitoring (continuous)
- `/metrics` - Prometheus metrics (scraping)
All operational endpoints use root-level paths for easy discovery.
Troubleshooting¶
Health Check Times Out¶
Problem: Health check takes too long or times out.
Solution:

1. Review custom health check functions
2. Ensure checks complete in <1 second
3. Remove expensive operations (heavy queries, external calls)
4. Use connection pooling for database checks (see the sketch below)
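For example, a sketch of a pooled database check, assuming SQLAlchemy's async engine (the connection URL is a placeholder). The engine is created once, so each probe reuses a pooled connection instead of paying connection setup:

```python
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine

from chapkit.core.api.routers.health import HealthState

# Created once at startup; probes borrow connections from the pool
engine = create_async_engine("postgresql+asyncpg://user:pass@db/app", pool_size=5)

async def check_database() -> tuple[HealthState, str | None]:
    try:
        async with engine.connect() as conn:
            await conn.execute(text("SELECT 1"))
        return (HealthState.HEALTHY, None)
    except Exception as e:
        return (HealthState.UNHEALTHY, f"Database error: {e}")
```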
Health Returns Unhealthy¶
Problem: Service reports unhealthy but seems functional.
Solution:

1. Check logs for health check errors
2. Review custom check implementations
3. Test dependencies manually (database, cache, APIs)
4. Verify network connectivity to dependencies
SSE Stream Disconnects¶
Problem: `/health/$stream` disconnects frequently.

Solution:

1. Check nginx/proxy timeout settings
2. Increase the client timeout
3. Verify network stability
4. Check for reverse proxy buffering (should be disabled for SSE, e.g. `proxy_buffering off;` in nginx)
Kubernetes Pod Restarts¶
Problem: Pods restart frequently due to liveness probe failures.
Solution:
1. Increase `initialDelaySeconds` for slower startup
2. Increase `failureThreshold` to tolerate temporary failures
3. Increase `timeoutSeconds` for slower responses
4. Review health check performance
Examples¶
- `examples/monitoring_api.py` - Service with health checks and monitoring
- `examples/docs/monitoring_api.postman_collection.json` - Postman collection with health endpoints
Next Steps¶
- Metrics: Add Prometheus monitoring with `.with_monitoring()`
- Alerting: Set up alerts based on health status
- Dashboards: Create real-time monitoring dashboards with SSE
- Custom Checks: Implement checks for your specific dependencies
For related features, see:

- Monitoring Guide - Prometheus metrics and OpenTelemetry
- Job Scheduler Guide - Background job health monitoring