Observability & Monitoring
Introduction
The Three Pillars
1. Metrics (Prometheus)
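import time

from fastapi import FastAPI
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Assumes a FastAPI application; `db` is a database handle configured elsewhere
app = FastAPI()

# Define metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)
# By convention the `_total` suffix marks counters, so the gauge omits it
active_users = Gauge(
    'active_users',
    'Number of active users'
)

# Instrument code
@app.get("/api/users/{user_id}")
async def get_user(user_id: str):
    # perf_counter() is monotonic, so the duration is immune to clock changes
    start_time = time.perf_counter()
    try:
        user = db.users.find_one({"id": user_id})
        # Record the successful request
        request_count.labels(
            method="GET",
            endpoint="/api/users",
            status="200"
        ).inc()
        return user
    except Exception:
        # Record the failure, then re-raise so the framework can respond
        request_count.labels(
            method="GET",
            endpoint="/api/users",
            status="500"
        ).inc()
        raise
    finally:
        # Always record latency, whether the request succeeded or failed
        duration = time.perf_counter() - start_time
        request_duration.labels(
            method="GET",
            endpoint="/api/users"
        ).observe(duration)

# Start metrics server (9090 is also the Prometheus server's default port,
# so pick another port if both run on the same host)
start_http_server(9090)
2. Logging (ELK Stack)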
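As a minimal sketch of the logging pillar, the block below emits one JSON object per log line using only the Python standard library. JSON lines are a common input format for log shippers feeding an ELK stack, but the shipping side is not shown here. The `JsonFormatter` class, the `build_logger` helper, and the `request_id` and `user_id` field names are hypothetical and chosen for illustration only.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical helper)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        # The field names below are assumptions for this example.
        for key in ("request_id", "user_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str = "api") -> logging.Logger:
    """Return a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]   # replace handlers to avoid duplicate output
    logger.setLevel(logging.INFO)
    logger.propagate = False      # keep records out of the root logger
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("user fetched", extra={"request_id": "abc123", "user_id": "42"})
```

Writing to stdout keeps the application unaware of where logs end up, which leaves collection and indexing to whatever agent runs alongside the service.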
3. Distributed Tracing (Jaeger)
SLOs, SLIs, and SLAs
Service Level Indicators (SLIs)
Service Level Objectives (SLOs)
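To make the relationship between an SLI and an SLO concrete, here is a small sketch that treats availability as the ratio of good requests to total requests and derives how much error budget remains. The function names and the sample numbers are assumptions for illustration, not figures from a real service.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing has failed
    return good_requests / total_requests


def error_budget_remaining(
    good_requests: int, total_requests: int, slo_target: float
) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means the SLO is violated.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0
    # The budget is the number of failures the SLO tolerates for this traffic.
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    return 1 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 500 failures, 99.9% SLO.
    print(availability_sli(999_500, 1_000_000))
    print(error_budget_remaining(999_500, 1_000_000, slo_target=0.999))
```

With a 99.9% target, one million requests tolerate about 1,000 failures, so 500 failures leave roughly half of the budget.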
Alerting
Dashboards
Health Checks
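One possible shape for a health check, sketched without any web framework: run a set of dependency probes and reduce them to an HTTP-style status code plus a JSON-serializable body. The `run_health_checks` function and the probe names are hypothetical; in a real service the probes would ping the database, cache, and so on, and the result would be returned from a route such as `/health`.

```python
from typing import Callable, Dict, Tuple


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run each dependency probe and summarize the outcome.

    Returns (status_code, body): 200 when every probe passes, 503 otherwise.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception as exc:
            # A probe that raises counts as a failure; keep the reason for debugging.
            results[name] = f"error: {exc}"
    healthy = all(status == "ok" for status in results.values())
    body = {"status": "healthy" if healthy else "unhealthy", "checks": results}
    return (200 if healthy else 503), body


if __name__ == "__main__":
    # Stand-in probes; real ones would contact the actual dependencies.
    code, body = run_health_checks({"database": lambda: True, "cache": lambda: True})
    print(code, body)
```

Returning 503 on failure lets load balancers and orchestrators take the instance out of rotation without parsing the body.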
Lessons Learned
What's Next
Last updated