Observability

Introduction

In distributed systems, debugging is fundamentally different from debugging a single process. A single request might traverse dozens of services, any of which could be the source of a problem. Operating microservices in production has taught me that without proper observability, you're flying blind.

This article covers the three pillars of observability (distributed tracing, centralized logging, and metrics collection) with practical implementations using OpenTelemetry, followed by health checks, alerting rules, and a dashboard example.

The Three Pillars

| Pillar  | Purpose                        | Example Tools       |
|---------|--------------------------------|---------------------|
| Traces  | Follow request across services | Jaeger, Zipkin      |
| Logs    | Detailed event records         | ELK Stack, Loki     |
| Metrics | Aggregated measurements        | Prometheus, Grafana |

OpenTelemetry Setup

Installation and Configuration

# requirements.txt
opentelemetry-api==1.21.0
opentelemetry-sdk==1.21.0
opentelemetry-exporter-otlp==1.21.0
opentelemetry-instrumentation-fastapi==0.42b0
opentelemetry-instrumentation-httpx==0.42b0
opentelemetry-instrumentation-sqlalchemy==0.42b0
opentelemetry-instrumentation-redis==0.42b0

# telemetry.py
import os

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor


def setup_telemetry(
    service_name: str,
    service_version: str = "1.0.0",
    otlp_endpoint: str = "http://localhost:4317",
):
    """Configure OpenTelemetry for the service."""
    # Resource identifying this service
    resource = Resource.create({
        SERVICE_NAME: service_name,
        SERVICE_VERSION: service_version,
        "deployment.environment": os.getenv("ENVIRONMENT", "development"),
    })
    
    # Tracing
    tracer_provider = TracerProvider(resource=resource)
    tracer_provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint))
    )
    trace.set_tracer_provider(tracer_provider)
    
    # Metrics
    metric_reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint=otlp_endpoint),
        export_interval_millis=60000,
    )
    meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
    metrics.set_meter_provider(meter_provider)
    
    return trace.get_tracer(service_name), metrics.get_meter(service_name)


# Initialize in your FastAPI app
from fastapi import FastAPI

app = FastAPI()
tracer, meter = setup_telemetry("order-service", "1.0.0")

# Auto-instrument FastAPI
FastAPIInstrumentor.instrument_app(app)

# Auto-instrument HTTP clients
HTTPXClientInstrumentor().instrument()

# Auto-instrument database
# `engine` is the SQLAlchemy Engine your application creates elsewhere
SQLAlchemyInstrumentor().instrument(engine=engine)

Distributed Tracing

Trace Context Propagation

from functools import wraps

import httpx
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator


tracer = trace.get_tracer(__name__)
propagator = TraceContextTextMapPropagator()


def traced(name: str | None = None, attributes: dict | None = None):
    """Decorator to create spans for async functions."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            span_name = name or func.__name__
            
            with tracer.start_as_current_span(span_name) as span:
                # Add custom attributes
                if attributes:
                    for key, value in attributes.items():
                        span.set_attribute(key, value)
                
                try:
                    result = await func(*args, **kwargs)
                    span.set_status(Status(StatusCode.OK))
                    return result
                except Exception as e:
                    span.set_status(Status(StatusCode.ERROR, str(e)))
                    span.record_exception(e)
                    raise
        
        return wrapper
    return decorator


# Usage
class OrderService:
    @traced("create_order", attributes={"operation": "write"})
    async def create_order(self, order_data: dict) -> Order:
        # Get current span to add dynamic attributes
        span = trace.get_current_span()
        span.set_attribute("order.customer_id", order_data["customer_id"])
        span.set_attribute("order.item_count", len(order_data["items"]))
        
        order = await self._save_order(order_data)
        
        span.set_attribute("order.id", order.id)
        span.set_attribute("order.total", order.total)
        
        # Process payment in separate span
        with tracer.start_as_current_span("process_payment") as payment_span:
            payment_span.set_attribute("payment.amount", order.total)
            payment = await self.payment_client.charge(order)
            payment_span.set_attribute("payment.id", payment.id)
        
        return order


# HTTP client with context propagation
class TracedHTTPClient:
    def __init__(self, base_url: str):
        self.base_url = base_url
        self.client = httpx.AsyncClient()
    
    async def request(self, method: str, path: str, **kwargs) -> httpx.Response:
        # Copy the caller's headers so the injected fields don't leak back
        headers = dict(kwargs.pop("headers", None) or {})

        with tracer.start_as_current_span(f"HTTP {method} {path}") as span:
            span.set_attribute("http.method", method)
            span.set_attribute("http.url", f"{self.base_url}{path}")

            # Inject trace context inside the span, so the downstream service
            # sees this client span (not its parent) as the caller.
            # This is redundant if HTTPXClientInstrumentor is already active.
            propagator.inject(headers)

            response = await self.client.request(
                method,
                f"{self.base_url}{path}",
                headers=headers,
                **kwargs,
            )

            span.set_attribute("http.status_code", response.status_code)

            if response.status_code >= 400:
                span.set_status(Status(StatusCode.ERROR))

            return response
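
To make the propagation step less opaque, here is a minimal sketch of the header that TraceContextTextMapPropagator writes. It follows the W3C Trace Context traceparent layout (version, trace ID, parent span ID, flags) and only handles version 00. The function names build_traceparent and parse_traceparent are illustrative helpers, not part of OpenTelemetry; in real code the propagator does this for you.

```python
import re
from typing import Optional

# W3C Trace Context, version 00: "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>"
_TRACEPARENT_RE = re.compile(
    r"^00-(?P<trace_id>[0-9a-f]{32})-(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def build_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Render a version-00 traceparent header value (hypothetical helper)."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def parse_traceparent(value: str) -> Optional[dict]:
    """Return the header fields, or None if the value is malformed."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # All-zero trace or span IDs are invalid and must be rejected.
    if int(fields["trace_id"], 16) == 0 or int(fields["span_id"], 16) == 0:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    fields["sampled"] = bool(int(fields["flags"], 16) & 0x01)
    return fields


header = build_traceparent(0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7)
parsed = parse_traceparent(header)
```

The trace ID stays the same on every hop, while the span ID is replaced by the ID of the span that made the outgoing call. That is why the context has to be injected while the client span is current.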

Span Events and Annotations

from opentelemetry.trace import SpanKind


class PaymentProcessor:
    async def process(self, order_id: str, amount: float) -> Payment:
        with tracer.start_as_current_span(
            "payment.process",
            kind=SpanKind.CLIENT,
        ) as span:
            span.set_attribute("payment.order_id", order_id)
            span.set_attribute("payment.amount", amount)
            
            # Add event for authorization attempt
            span.add_event("authorization_started", {
                "payment.gateway": "stripe",
            })
            
            try:
                auth_result = await self._authorize(amount)
                
                span.add_event("authorization_completed", {
                    "authorization.id": auth_result.id,
                    "authorization.status": auth_result.status,
                })
                
                if auth_result.status == "declined":
                    span.add_event("payment_declined", {
                        "decline.reason": auth_result.reason,
                    })
                    raise PaymentDeclinedError(auth_result.reason)
                
                # Capture payment
                span.add_event("capture_started")
                capture = await self._capture(auth_result.id)
                
                span.add_event("capture_completed", {
                    "capture.id": capture.id,
                })
                
                return Payment(
                    id=capture.id,
                    order_id=order_id,
                    amount=amount,
                    status="completed",
                )
            
            except Exception as e:
                span.add_event("payment_error", {
                    "error.type": type(e).__name__,
                    "error.message": str(e),
                })
                raise

Centralized Logging

Structured Logging

import json
import logging
import os
from datetime import datetime, timezone

from opentelemetry import trace


class StructuredFormatter(logging.Formatter):
    """JSON formatter with trace context."""
    
    def format(self, record: logging.LogRecord) -> str:
        log_data = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "service": os.getenv("SERVICE_NAME", "unknown"),
        }
        
        # Add trace context
        span = trace.get_current_span()
        if span.is_recording():
            ctx = span.get_span_context()
            log_data["trace_id"] = format(ctx.trace_id, "032x")
            log_data["span_id"] = format(ctx.span_id, "016x")
        
        # Add extra fields (callers pass them as extra={"extra": {...}})
        if hasattr(record, "extra"):
            log_data.update(record.extra)
        
        # Add exception info
        if record.exc_info:
            log_data["exception"] = self.formatException(record.exc_info)
        
        # default=str keeps non-JSON values (UUID, Decimal, datetime) from raising
        return json.dumps(log_data, default=str)


def setup_logging(service_name: str, level: str = "INFO"):
    """Configure structured logging."""
    handler = logging.StreamHandler()
    handler.setFormatter(StructuredFormatter())
    
    logging.basicConfig(
        level=getattr(logging, level),
        handlers=[handler],
    )
    
    # Reduce noise from libraries
    logging.getLogger("httpx").setLevel(logging.WARNING)
    logging.getLogger("httpcore").setLevel(logging.WARNING)


# Usage
logger = logging.getLogger(__name__)


class OrderService:
    async def create_order(self, order_data: dict) -> Order:
        logger.info(
            "Creating order",
            # StructuredFormatter reads record.extra, so nest fields under "extra"
            extra={"extra": {
                "customer_id": order_data["customer_id"],
                "item_count": len(order_data["items"]),
            }},
        )
        
        try:
            order = await self._save_order(order_data)
            
            logger.info(
                "Order created successfully",
                extra={"extra": {
                    "order_id": order.id,
                    "total": order.total,
                }},
            )
            
            return order
        
        except Exception as e:
            logger.error(
                "Failed to create order",
                extra={"extra": {
                    "customer_id": order_data["customer_id"],
                    "error": str(e),
                }},
                exc_info=True,
            )
            raise
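
One detail of the standard library is worth keeping in mind here: keys passed through the extra argument of a logging call become attributes on the LogRecord itself; they are not stored in a record.extra dict. A formatter that looks at record.extra therefore only sees fields that were nested under an "extra" key. The sketch below shows the behaviour and one alternative, collecting every non-standard attribute from the record. The names extra_fields and CaptureHandler are illustrative, not part of the logging module.

```python
import logging

# Attribute names every LogRecord carries by default; anything else on a
# record was supplied by the caller through `extra=`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))
_STANDARD_ATTRS |= {"message", "asctime"}  # added later by Formatter.format()


def extra_fields(record: logging.LogRecord) -> dict:
    """Collect the caller-supplied fields from a log record."""
    return {k: v for k, v in vars(record).items() if k not in _STANDARD_ATTRS}


class CaptureHandler(logging.Handler):
    """Demo helper that keeps the records it receives."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


demo_logger = logging.getLogger("extra_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo output away from the root logger
capture = CaptureHandler()
demo_logger.addHandler(capture)

demo_logger.info("Creating order", extra={"customer_id": "c-42", "item_count": 3})
record = capture.records[0]
```

Either convention works as long as the formatter and the call sites agree on it.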

Context-Aware Logger

import logging
import uuid
from contextvars import ContextVar

request_context: ContextVar[dict] = ContextVar("request_context", default={})


class ContextLogger:
    """Logger that automatically includes context."""
    
    def __init__(self, name: str):
        self.logger = logging.getLogger(name)
    
    def _log(self, level: int, message: str, **kwargs):
        # Merge context with explicit kwargs
        extra = {**request_context.get(), **kwargs}
        self.logger.log(level, message, extra={"extra": extra})
    
    def debug(self, message: str, **kwargs):
        self._log(logging.DEBUG, message, **kwargs)
    
    def info(self, message: str, **kwargs):
        self._log(logging.INFO, message, **kwargs)
    
    def warning(self, message: str, **kwargs):
        self._log(logging.WARNING, message, **kwargs)
    
    def error(self, message: str, **kwargs):
        self._log(logging.ERROR, message, **kwargs)


# Middleware to set request context
from fastapi import Request


async def logging_context_middleware(request: Request, call_next):
    context = {
        "request_id": request.headers.get("X-Request-ID", str(uuid.uuid4())),
        "path": request.url.path,
        "method": request.method,
        "user_agent": request.headers.get("User-Agent"),
    }
    
    # Add user info if authenticated
    if hasattr(request.state, "user"):
        context["user_id"] = request.state.user.id
    
    token = request_context.set(context)
    
    try:
        response = await call_next(request)
        return response
    finally:
        request_context.reset(token)


# Usage
logger = ContextLogger(__name__)


async def handle_order(order_id: str):
    # request_id, user_id, etc. automatically included
    logger.info("Processing order", order_id=order_id)

Metrics Collection

Custom Metrics

import time

from opentelemetry import metrics


meter = metrics.get_meter(__name__)


# Counter - for counting events
order_counter = meter.create_counter(
    "orders.created",
    description="Number of orders created",
    unit="1",
)

# Histogram - for measuring distributions
order_latency = meter.create_histogram(
    "orders.latency",
    description="Order processing latency",
    unit="ms",
)

# UpDownCounter - for values that go up and down
active_orders = meter.create_up_down_counter(
    "orders.active",
    description="Number of orders being processed",
    unit="1",
)

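# Observable Gauge - for async measurements
# `redis_client` is assumed to be a synchronous Redis client created elsewhere;
# the callback is invoked synchronously at collection time, so it cannot await
def get_queue_size():
    return redis_client.llen("order_queue")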

queue_size = meter.create_observable_gauge(
    "orders.queue_size",
    callbacks=[lambda options: [metrics.Observation(get_queue_size())]],
    description="Size of the order queue",
    unit="1",
)


# Usage in code
class OrderService:
    async def create_order(self, order_data: dict) -> Order:
        start_time = time.time()
        active_orders.add(1, {"service": "order"})
        
        try:
            order = await self._save_order(order_data)
            
            # Record success
            order_counter.add(1, {
                "status": "success",
                "payment_method": order_data.get("payment_method", "unknown"),
            })
            
            return order
        
        except Exception as e:
            order_counter.add(1, {"status": "error", "error_type": type(e).__name__})
            raise
        
        finally:
            active_orders.add(-1, {"service": "order"})
            duration_ms = (time.time() - start_time) * 1000
            order_latency.record(duration_ms, {"operation": "create"})

RED Metrics (Rate, Errors, Duration)

from dataclasses import dataclass
from typing import Callable
import time

from fastapi import Request


class REDMetrics:
    """Rate, Errors, Duration metrics for a service."""
    
    def __init__(self, meter, service_name: str):
        self.request_counter = meter.create_counter(
            f"{service_name}.requests",
            description="Total requests",
        )
        
        self.error_counter = meter.create_counter(
            f"{service_name}.errors",
            description="Total errors",
        )
        
        self.duration_histogram = meter.create_histogram(
            f"{service_name}.duration",
            description="Request duration",
            unit="ms",
        )
    
    def record_request(self, endpoint: str, method: str):
        self.request_counter.add(1, {
            "endpoint": endpoint,
            "method": method,
        })
    
    def record_error(self, endpoint: str, error_type: str):
        self.error_counter.add(1, {
            "endpoint": endpoint,
            "error_type": error_type,
        })
    
    def record_duration(self, endpoint: str, duration_ms: float):
        self.duration_histogram.record(duration_ms, {
            "endpoint": endpoint,
        })


# Middleware for automatic RED metrics
class REDMiddleware:
    def __init__(self, app, red_metrics: REDMetrics):
        self.app = app
        self.metrics = red_metrics
    
    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        
        request = Request(scope, receive)
        endpoint = request.url.path
        method = request.method
        
        self.metrics.record_request(endpoint, method)
        start_time = time.time()
        
        status_code = 500
        
        async def send_wrapper(message):
            nonlocal status_code
            if message["type"] == "http.response.start":
                status_code = message["status"]
            await send(message)
        
        try:
            await self.app(scope, receive, send_wrapper)
        except Exception as e:
            self.metrics.record_error(endpoint, type(e).__name__)
            raise
        else:
            # Count HTTP error statuses only when no exception was raised,
            # so a failed request is not recorded as two errors
            if status_code >= 400:
                self.metrics.record_error(endpoint, f"http_{status_code}")
        finally:
            duration_ms = (time.time() - start_time) * 1000
            self.metrics.record_duration(endpoint, duration_ms)
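
To make the three RED numbers concrete, here is a small standard-library sketch that derives them from a list of raw request records for a fixed time window. RequestRecord and red_summary are hypothetical names used for illustration. In production the metrics backend computes these from the counters and histogram above, and it estimates percentiles from buckets rather than from raw samples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestRecord:
    status_code: int
    duration_ms: float


def red_summary(records: list, window_seconds: float, percentile: int = 95) -> dict:
    """Compute rate, error ratio, and a latency percentile for one window."""
    total = len(records)
    # Same error definition as the middleware: any status >= 400.
    errors = sum(1 for r in records if r.status_code >= 400)
    durations = sorted(r.duration_ms for r in records)

    latency: Optional[float] = None
    if durations:
        # Nearest-rank percentile, using integer ceiling division.
        rank = max(1, -(-percentile * total // 100))
        latency = durations[rank - 1]

    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        f"p{percentile}_ms": latency,
    }


# 19 successful requests (10 ms .. 190 ms) and one failure, seen in 10 seconds
sample = [RequestRecord(200, float(d)) for d in range(10, 200, 10)]
sample.append(RequestRecord(500, 200.0))
summary = red_summary(sample, window_seconds=10)
```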

Health Checks

Comprehensive Health Endpoint

import asyncio
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable

import httpx
from fastapi import APIRouter, HTTPException


class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class ComponentHealth:
    name: str
    status: HealthStatus
    latency_ms: float | None = None
    error: str | None = None
    details: dict | None = None


class HealthChecker:
    """Check health of service dependencies."""
    
    def __init__(self):
        self.checks: list[tuple[str, Callable]] = []
    
    def register(self, name: str, check: Callable):
        """Register a health check."""
        self.checks.append((name, check))
    
    async def check_all(self) -> dict:
        """Run all health checks."""
        results = await asyncio.gather(*[
            self._run_check(name, check)
            for name, check in self.checks
        ])
        
        components = {r.name: r for r in results}
        
        # Determine overall status
        if all(r.status == HealthStatus.HEALTHY for r in results):
            overall = HealthStatus.HEALTHY
        elif any(r.status == HealthStatus.UNHEALTHY for r in results):
            overall = HealthStatus.UNHEALTHY
        else:
            overall = HealthStatus.DEGRADED
        
        return {
            "status": overall.value,
            "components": {
                name: {
                    "status": c.status.value,
                    "latency_ms": c.latency_ms,
                    "error": c.error,
                }
                for name, c in components.items()
            },
        }
    
    async def _run_check(self, name: str, check: Callable) -> ComponentHealth:
        start = time.time()
        try:
            await asyncio.wait_for(check(), timeout=5.0)
            return ComponentHealth(
                name=name,
                status=HealthStatus.HEALTHY,
                latency_ms=(time.time() - start) * 1000,
            )
        except asyncio.TimeoutError:
            return ComponentHealth(
                name=name,
                status=HealthStatus.UNHEALTHY,
                error="Timeout",
                latency_ms=5000,
            )
        except Exception as e:
            return ComponentHealth(
                name=name,
                status=HealthStatus.UNHEALTHY,
                error=str(e),
                latency_ms=(time.time() - start) * 1000,
            )


# Setup health checks
# `db` and `redis_client` are the async clients your application creates elsewhere
health_checker = HealthChecker()


async def check_database():
    await db.execute("SELECT 1")


async def check_redis():
    await redis_client.ping()


async def check_payment_service():
    async with httpx.AsyncClient(timeout=2.0) as client:
        response = await client.get("http://payment-service/health")
        response.raise_for_status()


health_checker.register("database", check_database)
health_checker.register("redis", check_redis)
health_checker.register("payment_service", check_payment_service)


# FastAPI routes
router = APIRouter()


@router.get("/health")
async def health():
    """Full health check."""
    return await health_checker.check_all()


@router.get("/health/live")
async def liveness():
    """Kubernetes liveness probe."""
    return {"status": "alive"}


@router.get("/health/ready")
async def readiness():
    """Kubernetes readiness probe."""
    result = await health_checker.check_all()
    if result["status"] == "unhealthy":
        raise HTTPException(503, "Not ready")
    return {"status": "ready"}
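
Note that each check above resolves to either healthy or unhealthy, so the overall degraded status needs an additional rule before it can appear. One possible approach, shown here as an assumption and not something the checker above implements, is to mark some dependencies as critical: a failing critical dependency makes the service unhealthy, while a failing optional one only degrades it. The function overall_status and the critical set are hypothetical.

```python
from enum import Enum


class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def overall_status(results: dict, critical: set) -> HealthStatus:
    """Combine per-component results (name -> passed?) into one status.

    `critical` names the components the service cannot work without.
    """
    failed = {name for name, passed in results.items() if not passed}
    if not failed:
        return HealthStatus.HEALTHY
    if failed & critical:
        return HealthStatus.UNHEALTHY
    # Only optional dependencies are failing.
    return HealthStatus.DEGRADED


critical_deps = {"database"}
status_ok = overall_status({"database": True, "redis": True}, critical_deps)
status_cache_down = overall_status({"database": True, "redis": False}, critical_deps)
status_db_down = overall_status({"database": False, "redis": True}, critical_deps)
```

With a rule like this, the readiness probe keeps returning 200 while the cache is down, and the full health endpoint still reports the problem.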

Alerting Rules

# prometheus/alerts.yml
groups:
  - name: microservice-alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(orders_errors_total[5m])) 
          / sum(rate(orders_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate in order service"
          description: "Error rate is {{ $value | humanizePercentage }}"
      
      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, 
            rate(orders_duration_bucket[5m])
          ) > 2000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency in order service"
          description: "P95 latency is {{ $value }}ms"
      
      # Circuit breaker open
      - alert: CircuitBreakerOpen
        expr: circuit_breaker_state == 2  # 2 = open
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker open"
          description: "Circuit {{ $labels.circuit }} is open"
      
      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"
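
The HighLatency rule relies on histogram_quantile, which does not see individual request durations; it estimates the quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified Python rendition of that estimate, useful for reasoning about why bucket boundaries limit precision. The bucket boundaries and counts are made-up example data, and the function mirrors the documented behaviour only for the common cases handled here.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]

    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]

    lower, prev_count = buckets[index - 1] if index > 0 else (0.0, 0)
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical latency buckets in milliseconds
example = [(100, 50), (500, 80), (1000, 90), (2000, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)
```

For this data the estimate lands at roughly 1625 ms, below the 2000 ms threshold, even though a few requests took longer than two seconds. Anything above the last finite bucket is reported as that bound, so the largest bucket should sit above the alert threshold.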

Dashboard Example

# Example Grafana dashboard configuration, expressed as a Python dict
# (serialize it to JSON before importing it into Grafana)

dashboard_config = {
    "title": "Order Service Dashboard",
    "panels": [
        {
            "title": "Request Rate",
            "type": "graph",
            "targets": [
                {
                    "expr": "sum(rate(orders_requests_total[5m]))",
                    "legendFormat": "Requests/sec",
                }
            ],
        },
        {
            "title": "Error Rate",
            "type": "graph",
            "targets": [
                {
                    "expr": "sum(rate(orders_errors_total[5m])) / sum(rate(orders_requests_total[5m])) * 100",
                    "legendFormat": "Error %",
                }
            ],
        },
        {
            "title": "Latency Percentiles",
            "type": "graph",
            "targets": [
                {
                    "expr": "histogram_quantile(0.50, rate(orders_duration_bucket[5m]))",
                    "legendFormat": "P50",
                },
                {
                    "expr": "histogram_quantile(0.95, rate(orders_duration_bucket[5m]))",
                    "legendFormat": "P95",
                },
                {
                    "expr": "histogram_quantile(0.99, rate(orders_duration_bucket[5m]))",
                    "legendFormat": "P99",
                },
            ],
        },
    ],
}

Key Takeaways

Trace every request - Use distributed tracing with context propagation
Structured logs with context - Include trace IDs for correlation
RED metrics everywhere - Rate, Errors, Duration for every service
Health checks for dependencies - Know when services are degraded
Alerting on symptoms - Alert on user-facing issues, not internal metrics

What's Next?

With observability in place, we need to package and deploy our services. In Article 11: Containerization & Deployment, we'll cover Docker best practices and Docker Compose for local development.

This article is part of the Microservice Architecture 101 series.


Last updated 1 month ago
