Monitoring with OpenTelemetry and Prometheus¶
Chapkit provides built-in monitoring through OpenTelemetry instrumentation with automatic Prometheus metrics export.
Quick Start¶
Enable monitoring in your service with a single method call:
```python
from chapkit.api import ServiceBuilder, ServiceInfo

app = (
    ServiceBuilder(info=ServiceInfo(display_name="My Service"))
    .with_monitoring()  # Enables OpenTelemetry + Prometheus endpoint
    .with_database()
    .with_health()
    .build()
)
```
Your service now exposes Prometheus metrics at `/metrics`.
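To try it locally, run the app with any ASGI server (a minimal sketch, assuming the snippet above is saved as `main.py`):

```bash
# Run the service with uvicorn (module name main.py is illustrative)
uvicorn main:app --host 0.0.0.0 --port 8000

# In another shell, confirm the endpoint responds
curl http://localhost:8000/metrics
```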
Features¶
Automatic Instrumentation¶
- FastAPI: HTTP request metrics (duration, status codes, paths)
- SQLAlchemy: Database query metrics (connection pool, query duration)
- Python Runtime: Garbage collection, memory usage, CPU time
Metrics Endpoint¶
- Path: `/metrics` (operational endpoint, root level)
- Format: Prometheus text format
- Content-Type: `text/plain; version=0.0.4; charset=utf-8` (see the check below)
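To confirm the content type from a running service (a GET request is used, since the route may not answer HEAD):

```bash
# Print only the response headers of a GET request to /metrics
curl -s -D - -o /dev/null http://localhost:8000/metrics | grep -i content-type
```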
Zero Configuration¶
No manual instrumentation needed - Chapkit automatically:
- Instruments all FastAPI routes
- Tracks SQLAlchemy database operations
- Exposes Python runtime metrics
- Handles OpenTelemetry lifecycle
Configuration¶
Basic Configuration¶
Defaults:
- Metrics endpoint: `/metrics`
- Service name: from `ServiceInfo.display_name`
- Tags: `["monitoring"]`
Custom Configuration¶
```python
.with_monitoring(
    prefix="/custom/metrics",             # Custom endpoint path
    tags=["Observability", "Telemetry"],  # Custom OpenAPI tags
    service_name="production-api",        # Override service name
)
```
Parameters¶
- `prefix` (`str`): Metrics endpoint path. Default: `/metrics`
- `tags` (`List[str]`): OpenAPI tags for the metrics endpoint. Default: `["monitoring"]`
- `service_name` (`str | None`): Service name in metrics labels. Default: from `ServiceInfo`
Metrics Endpoint¶
Testing the Endpoint¶
```bash
# Get metrics
curl http://localhost:8000/metrics

# Filter specific metrics
curl http://localhost:8000/metrics | grep http_request

# Monitor continuously
watch -n 1 'curl -s http://localhost:8000/metrics | grep http_request_duration'
```
Expected Output¶
```
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 234.0

# HELP http_server_request_duration_seconds HTTP request duration
# TYPE http_server_request_duration_seconds histogram
http_server_request_duration_seconds_bucket{http_method="GET",http_status_code="200",le="0.005"} 45.0

# HELP db_client_connections_usage Number of connections that are currently in use
# TYPE db_client_connections_usage gauge
db_client_connections_usage{pool_name="default",state="used"} 2.0
```
Kubernetes Integration¶
Deployment with Service Monitor¶
deployment.yaml:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chapkit-service
spec:
  replicas: 3
  selector:            # required by apps/v1; must match the template labels
    matchLabels:
      app: chapkit-service
  template:
    metadata:
      labels:
        app: chapkit-service
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: app
          image: your-chapkit-app
          ports:
            - containerPort: 8000
              name: http
```
service.yaml:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: chapkit-service
  labels:
    app: chapkit-service
spec:
  ports:
    - port: 8000
      targetPort: 8000
      name: http
  selector:
    app: chapkit-service
```
servicemonitor.yaml (Prometheus Operator):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chapkit-service
  labels:
    app: chapkit-service
spec:
  selector:
    matchLabels:
      app: chapkit-service
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
```
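Once applied, you can check that the target was discovered (a sketch, assuming the Prometheus Operator's default `prometheus-operated` service):

```bash
# Confirm the ServiceMonitor exists
kubectl get servicemonitor chapkit-service

# Port-forward the operator-managed Prometheus and look for the target
kubectl port-forward svc/prometheus-operated 9090:9090 &
curl -s http://localhost:9090/api/v1/targets | grep chapkit-service
```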
Prometheus Configuration¶
Scrape Configuration¶
Add to `prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'chapkit-services'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:8000']
        labels:
          service: 'chapkit-api'
          environment: 'production'
```
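Before (re)loading Prometheus, the file can be validated with `promtool`, which ships with the Prometheus distribution:

```bash
# Check the configuration for syntax and semantic errors
promtool check config prometheus.yml
```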
Docker Compose Setup¶
docker-compose.yml:
```yaml
version: '3.8'

services:
  chapkit-app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - LOG_LEVEL=INFO

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_AUTH_ANONYMOUS_ENABLED=true

volumes:
  prometheus-data:
  grafana-data:
```
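To bring the stack up (note: inside Compose, Prometheus must scrape the service by its Compose name, e.g. `chapkit-app:8000`, rather than `localhost:8000`):

```bash
# Start all services in the background
docker compose up -d

# Verify Prometheus registered the target
curl -s http://localhost:9090/api/v1/targets | grep chapkit
```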
Grafana Dashboards¶
Adding Prometheus Data Source¶
1. Navigate to Configuration → Data Sources
2. Click Add data source
3. Select Prometheus
4. Set URL: `http://prometheus:9090` (Docker) or `http://localhost:9090` (local)
5. Click Save & Test
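Alternatively, the data source can be provisioned from a file instead of the UI (a sketch using Grafana's standard provisioning layout; the file path is illustrative):

```yaml
# grafana/provisioning/datasources/prometheus.yml (illustrative path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```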
Example Queries¶
The queries below are representative examples; the `job` label matches the scrape configuration shown earlier, so adjust it to your setup.

HTTP Request Rate:

```promql
sum(rate(http_server_requests_total{job="chapkit-services"}[5m]))
```

Request Duration (p95):

```promql
histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket{job="chapkit-services"}[5m])) by (le))
```

Database Connection Pool Usage:

```promql
db_client_connections_usage{job="chapkit-services", state="used"}
```

Error Rate:

```promql
sum(rate(http_server_requests_total{job="chapkit-services", http_status_code=~"5.."}[5m]))
  / sum(rate(http_server_requests_total{job="chapkit-services"}[5m]))
```

ML Training Job Rate:

```promql
sum(rate(ml_train_jobs_total{job="chapkit-services"}[5m]))
```

ML Prediction Job Rate:

```promql
sum(rate(ml_predict_jobs_total{job="chapkit-services"}[5m]))
```

Total ML Jobs (Train + Predict):

```promql
sum(rate(ml_train_jobs_total{job="chapkit-services"}[5m])) +
sum(rate(ml_predict_jobs_total{job="chapkit-services"}[5m]))
```
Available Metrics¶
HTTP Metrics (FastAPI)¶
- `http_server_request_duration_seconds` - Request duration histogram
- `http_server_requests_total` - Total requests counter
- `http_server_active_requests` - Active requests gauge

Labels: `http_method`, `http_status_code`, `http_route`
Database Metrics (SQLAlchemy)¶
- `db_client_connections_usage` - Connection pool usage
- `db_client_connections_limit` - Connection pool limit
- `db_client_operation_duration_seconds` - Query duration

Labels: `pool_name`, `state`, `operation`
Python Runtime Metrics¶
- `python_gc_objects_collected_total` - Objects collected during GC
- `python_gc_collections_total` - GC runs
- `python_info` - Python version info
- `process_cpu_seconds_total` - CPU time
- `process_resident_memory_bytes` - Memory usage
ML Metrics (when using `.with_ml()`)¶
- `ml_train_jobs_total` - Total number of ML training jobs submitted
- `ml_predict_jobs_total` - Total number of ML prediction jobs submitted

Labels: `service_name`
Best Practices¶
Recommended Practices¶
- Enable monitoring in production for observability
- Set meaningful service names to identify services in multi-service setups
- Monitor key metrics: request rate, error rate, duration (RED method)
- Set up alerts for error rates, high latency, and resource exhaustion (see the rule sketch after this list)
- Use service labels to tag metrics with environment, version, region
- Keep `/metrics` unauthenticated for Prometheus access (use network policies)
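As a starting point for the alerting practice above, a minimal Prometheus rule file might look like this (the threshold, duration, and labels are illustrative and should be tuned):

```yaml
# alerts.yml - illustrative alert on a sustained 5xx error rate
groups:
  - name: chapkit-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_total{http_status_code=~"5.."}[5m]))
            / sum(rate(http_server_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "More than 5% of requests failed over the last 10 minutes"
```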
Avoid¶
- Exposing metrics publicly (use internal network or auth proxy)
- Scraping too frequently (15-30s interval is usually sufficient)
- Ignoring high cardinality (avoid unbounded label values)
- Skipping resource limits (monitor and limit Prometheus storage growth)
Combining with Other Features¶
With Authentication¶
```python
app = (
    ServiceBuilder(info=info)
    .with_monitoring()
    .with_auth(
        unauthenticated_paths=[
            "/health",   # Health check
            "/metrics",  # Prometheus scraping
            "/docs",     # API docs
        ]
    )
    .build()
)
```
With Health Checks¶
```python
app = (
    ServiceBuilder(info=info)
    .with_health()      # /health - Health check endpoint
    .with_system()      # /api/v1/system - System metadata
    .with_monitoring()  # /metrics - Prometheus metrics
    .build()
)
```
Operational monitoring endpoints (`/health`, `/health/$stream`, `/metrics`) use root-level paths for easy discovery by Kubernetes, monitoring dashboards, and Prometheus. Service metadata endpoints (`/api/v1/system`, `/api/v1/info`) use versioned API paths.
For detailed health check configuration and usage, see the Health Checks Guide.
Troubleshooting¶
Metrics Endpoint Returns 404¶
Problem: The `/metrics` endpoint is not found.

Solution: Ensure you called `.with_monitoring()` in your `ServiceBuilder` chain.
No Metrics Appear¶
Problem: Endpoint returns empty or minimal metrics.
Solution:
1. Make some requests to your API endpoints
2. Verify FastAPI instrumentation with: `curl http://localhost:8000/api/v1/configs`
3. Check metrics again: `curl http://localhost:8000/metrics | grep http_request`
Prometheus Cannot Scrape¶
Problem: Prometheus shows targets as "DOWN".
Solution:
1. Verify the service is running: `curl http://localhost:8000/health`
2. Check network connectivity
3. Verify the scrape config matches the service port and path
4. Check for firewall/network policies blocking access
High Memory Usage¶
Problem: Prometheus uses too much memory.
Solution:
1. Reduce retention time: `--storage.tsdb.retention.time=15d`
2. Increase the scrape interval: `scrape_interval: 30s`
3. Limit metric cardinality (check for unbounded labels)
Next Steps¶
- Health Checks: Add health monitoring with `.with_health()` - see the Health Checks Guide
- Alerting: Set up Prometheus Alertmanager for notifications
- Distributed Tracing: Future support for OpenTelemetry traces (see ROADMAP.md)
- Custom Metrics: Use `get_meter()` for application-specific metrics (see the sketch below)
- SLOs: Define Service Level Objectives based on metrics
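For the custom-metrics item above, a minimal sketch using the standard OpenTelemetry metrics API (the meter name and counter are illustrative; chapkit's own `get_meter()` helper may differ in detail):

```python
from opentelemetry import metrics

# Obtain a meter; with monitoring enabled, instruments created on it are
# exported through the same Prometheus endpoint as the built-in metrics.
meter = metrics.get_meter("my-service")

# Illustrative counter for an application-specific event
prediction_counter = meter.create_counter(
    "app_predictions_total",
    description="Total predictions served",
)

# Increment inside a request handler; attributes become Prometheus labels
prediction_counter.add(1, {"model": "default"})
```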
Examples¶
- `examples/monitoring_api.py` - Complete monitoring example
- `examples/docs/monitoring_api.postman_collection.json` - Postman collection
For more details, see:
- Health Checks Guide - Health check configuration
- OpenTelemetry Documentation
- Prometheus Documentation
- Grafana Documentation