Observability & Monitoring


Introduction

"You can't fix what you can't see." Observability is the practice of instrumenting a system so that its internal state can be inferred from its external outputs: metrics, logs, and traces. This article covers the monitoring patterns I use to keep production systems healthy.

The Three Pillars

1. Metrics (Prometheus)

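import time

from fastapi import FastAPI  # assumption: the @app.get decorator below suggests FastAPI
from prometheus_client import Counter, Histogram, Gauge, start_http_server

app = FastAPI()
# `db` (used below) is assumed to be a MongoDB-style client handle created elsewhere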

# Define metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

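# Gauges go up and down, so the name omits the `_total` suffix,
# which Prometheus naming conventions reserve for counters
active_users = Gauge(
    'active_users',
    'Number of active users'
)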

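# Instrument code
@app.get("/api/users/{user_id}")
async def get_user(user_id: str):
    # perf_counter() is monotonic, so the duration is unaffected by wall-clock adjustments
    start_time = time.perf_counter()

    try:
        user = db.users.find_one({"id": user_id})

        # Record a successful request
        request_count.labels(
            method="GET",
            endpoint="/api/users",
            status="200"
        ).inc()

        return user

    except Exception:
        # Record the failure, then re-raise so the framework still returns an error
        request_count.labels(
            method="GET",
            endpoint="/api/users",
            status="500"
        ).inc()
        raise

    finally:
        # Runs on both success and failure, so every request is timed
        duration = time.perf_counter() - start_time
        request_duration.labels(
            method="GET",
            endpoint="/api/users"
        ).observe(duration)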

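# Start metrics server (exposes /metrics for Prometheus to scrape).
# Note: 9090 is also the Prometheus server's default port, so pick a
# different port if both run on the same host.
start_http_server(9090)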

Prometheus query examples:
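The original queries are not preserved here, so the following is only a sketch of what they might look like, written against the metric names defined above (`http_requests_total` and `http_request_duration_seconds`). The queries are held as Python strings so they could be sent to the Prometheus HTTP API. The dictionary keys and the 5-minute windows are illustrative assumptions.

```python
# PromQL expressions over the metrics defined above, kept as plain strings.
# The keys and the 5m rate window are illustrative choices, not fixed conventions.
PROMQL_EXAMPLES = {
    # Requests per second, broken down by endpoint
    "request_rate": 'sum by (endpoint) (rate(http_requests_total[5m]))',
    # Fraction of requests that returned a 5xx status
    "error_ratio": (
        'sum(rate(http_requests_total{status=~"5.."}[5m]))'
        ' / sum(rate(http_requests_total[5m]))'
    ),
    # 95th percentile latency per endpoint, computed from histogram buckets
    "p95_latency_seconds": (
        'histogram_quantile(0.95, '
        'sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m])))'
    ),
}

for name, query in PROMQL_EXAMPLES.items():
    print(f"{name}: {query}")
```

The `_bucket` suffix in the last query comes from the Histogram type, which exports one series per bucket boundary (`le`) alongside `_sum` and `_count`.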

2. Logging (ELK Stack)
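A log pipeline such as ELK is easiest to query when each log line is a single JSON object. The sketch below shows one way to produce such lines with only the standard library. `JsonFormatter`, `build_logger`, the `context` key, and the field names are hypothetical choices for illustration, not part of any ELK API.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} are merged into the top level.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


logger = build_logger("user-service")
logger.info("user fetched", extra={"context": {"user_id": "42", "duration_ms": 12.5}})
```

Because every field is a top-level JSON key, a log shipper can index `user_id` or `duration_ms` directly instead of parsing them out of free text.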

3. Distributed Tracing (Jaeger)
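The core idea behind distributed tracing is that every service handling a request shares one trace ID while each unit of work gets its own span ID. In practice an instrumentation library (for example OpenTelemetry exporting to Jaeger) handles this. The standard-library sketch below only illustrates the propagation step, using the W3C Trace Context `traceparent` header layout. The function names are hypothetical.

```python
import secrets


def new_traceparent() -> str:
    """Start a new trace: version-traceid-spanid-flags, all lowercase hex."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes  -> 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled


def child_traceparent(parent: str) -> str:
    """Keep the caller's trace ID but mint a fresh span ID for the outgoing call."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()            # value received in the request headers
outgoing = child_traceparent(incoming)  # value sent to the next service
print("incoming:", incoming)
print("outgoing:", outgoing)
```

Since both headers carry the same trace ID, the tracing backend can stitch the spans from different services into a single request timeline.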

SLOs, SLIs, and SLAs

Service Level Indicators (SLIs)
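An SLI is usually expressed as a ratio of good events to total events. As a minimal sketch, the two functions below compute an availability SLI and a latency SLI from raw counts and durations. The function names and the choice to treat "no traffic" as 1.0 are assumptions made for illustration.

```python
from typing import Sequence


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def latency_sli(durations_s: Sequence[float], threshold_s: float) -> float:
    """Fraction of requests that completed within `threshold_s` seconds."""
    if not durations_s:
        return 1.0  # same assumption as above
    fast = sum(1 for d in durations_s if d <= threshold_s)
    return fast / len(durations_s)


print(availability_sli(999, 1000))
print(latency_sli([0.1, 0.2, 0.5, 1.2], threshold_s=0.5))
```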

Service Level Objectives (SLOs)
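An SLO sets a target for an SLI over a time window, and the gap between the target and 100% is the error budget. The sketch below works through that arithmetic. The function names and the 30-day default window are illustrative assumptions.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates over the window."""
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo_target) * total
    if allowed_failures == 0:
        return 0.0  # a 100% target or zero traffic leaves no budget to spend
    actual_failures = total - good
    return 1 - actual_failures / allowed_failures


# A 99.9% target over 30 days allows roughly 43 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
# 500 failures out of 1,000,000 requests uses about half of a 99.9% budget.
print(round(budget_remaining(0.999, good=999_500, total=1_000_000), 2))
```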

Alerting
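One common way to alert on SLOs rather than raw thresholds is burn rate: how fast the error budget is being consumed relative to the rate that would spend exactly the whole budget over the SLO window. The sketch below pages only when both a long and a short window burn fast, which helps the alert clear quickly after recovery. The function names are hypothetical, and the 14.4 threshold is an assumed default (it corresponds to spending 2% of a 30-day budget in one hour).

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1 - slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed default; tune per service
) -> bool:
    """Page only if both windows exceed the burn-rate threshold."""
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


# Sustained 2-3% errors against a 99.9% SLO: burning far too fast.
print(should_page(0.02, 0.03, slo_target=0.999))
# The long window is still elevated, but the short window has recovered.
print(should_page(0.02, 0.001, slo_target=0.999))
```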

Dashboards

Health Checks
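A health endpoint typically runs a set of dependency checks and reports both an overall status and per-check detail. The following is a framework-independent sketch of that aggregation step. `run_health_checks` and the check names are hypothetical, and the lambdas stand in for real probes such as a database ping.

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each named check and summarize the results."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception as exc:
            # A crashing check must not take down the health endpoint itself.
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    return {"status": "healthy" if healthy else "unhealthy", "checks": results}


# Placeholder checks; real ones would ping the database, cache, and so on.
report = run_health_checks({
    "database": lambda: True,
    "cache": lambda: False,
})
print(report)
```

A web handler can return this dictionary as JSON and map an "unhealthy" status to a 503 response so that load balancers stop routing traffic to the instance.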

Lessons Learned

What worked:

  1. Structured logging from day one

  2. Distributed tracing for microservices

  3. SLO-based alerting (not threshold-based)

  4. Comprehensive dashboards

  5. Regular review of metrics and alerts

What didn't work:

  1. Too many alerts (alert fatigue)

  2. Logging everything (log bloat)

  3. No log aggregation

  4. Missing distributed tracing

  5. Not documenting runbooks

What's Next

With observability in place, the next article explores security best practices.

