Distributed Tracing

The Microservices Debugging Nightmare

I'll never forget the day a customer reported: "Checkout takes 12 seconds, but only sometimes."

My system had 7 microservices:

  • API Gateway → routes requests

  • Order Service → creates orders

  • Inventory Service → checks stock

  • Payment Service → processes payments

  • Loyalty Service → calculates points

  • Notification Service → sends emails

  • Analytics Service → tracks events

Each service had perfect logs. Each service showed <200ms response times in isolation. But together? Sometimes 12 seconds of mystery.

This is where distributed tracing saved me. A single trace ID followed the request through all 7 services, showing exactly where those 12 seconds were hiding.

Context Propagation: The Magic Glue

The key to distributed tracing is context propagation: passing trace context between services.

W3C Trace Context Standard

OpenTelemetry uses W3C Trace Context headers:
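The main header is traceparent; here is the example value from the W3C spec:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

The four dash-separated fields are the version, the trace ID, the parent span ID, and the trace flags (01 means the trace was sampled).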

With HTTP auto-instrumentation in place, every outgoing request carries this header, linking spans across service boundaries.

Building a Distributed System

Let me show you a real multi-service architecture with proper tracing.
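Below is one way to wire this up, sketched with Flask, the requests library, and the OpenTelemetry Python SDK; the service names, ports, and routes are illustrative rather than the exact production code. Each service sketch reuses one small bootstrap module, assumed to live in tracing_setup.py:

```python
# tracing_setup.py -- shared tracing bootstrap for the service sketches below
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor


def setup_tracing(app, service_name):
    """Name the service, export spans over OTLP, and auto-instrument HTTP in and out."""
    provider = TracerProvider(resource=Resource.create({"service.name": service_name}))
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
    )
    trace.set_tracer_provider(provider)
    FlaskInstrumentor().instrument_app(app)   # one server span per incoming request
    RequestsInstrumentor().instrument()       # traceparent header on every outgoing call
    return trace.get_tracer(service_name)
```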

Service 1: API Gateway
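A sketch of the gateway: it only routes the checkout request onward, and the requests instrumentation handles propagating the trace context.

```python
# api_gateway.py -- routes /checkout to the Order Service (URLs and ports illustrative)
from flask import Flask, request, jsonify
import requests

from tracing_setup import setup_tracing

app = Flask(__name__)
tracer = setup_tracing(app, "api-gateway")


@app.route("/checkout", methods=["POST"])
def checkout():
    # traceparent is injected automatically, so the Order Service joins this trace
    resp = requests.post("http://localhost:8001/orders", json=request.get_json(), timeout=10)
    return jsonify(resp.json()), resp.status_code


if __name__ == "__main__":
    app.run(port=8000)
```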

Service 2: Order Service
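The Order Service creates the order and fans out to the other services; each outgoing call appears as a child span under create_order.

```python
# order_service.py -- creates the order, then calls inventory, payment, and loyalty
from flask import Flask, request, jsonify
import requests

from tracing_setup import setup_tracing

app = Flask(__name__)
tracer = setup_tracing(app, "order-service")


@app.route("/orders", methods=["POST"])
def create_order():
    order = request.get_json()
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.item_count", len(order.get("items", [])))
        # Each call below becomes a child span carrying the same trace ID
        requests.post("http://localhost:8002/reserve", json=order, timeout=5)
        requests.post("http://localhost:8003/charge", json=order, timeout=5)
        requests.post("http://localhost:8004/points", json=order, timeout=5)
    return jsonify({"status": "created"}), 201


if __name__ == "__main__":
    app.run(port=8001)
```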

Service 3: Inventory Service
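The Inventory Service wraps its database lookup in its own span; this is the span that later exposed the slow queries.

```python
# inventory_service.py -- checks stock; the DB query gets a dedicated span
from flask import Flask, request, jsonify

from tracing_setup import setup_tracing

app = Flask(__name__)
tracer = setup_tracing(app, "inventory-service")


def query_stock_levels(items):
    return {item.get("sku"): True for item in items}  # stand-in for the real database query


@app.route("/reserve", methods=["POST"])
def reserve():
    items = request.get_json().get("items", [])
    with tracer.start_as_current_span("check_stock") as span:
        span.set_attribute("inventory.item_count", len(items))
        with tracer.start_as_current_span("db.query stock_levels"):
            availability = query_stock_levels(items)
    return jsonify({"available": availability})


if __name__ == "__main__":
    app.run(port=8002)
```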

Service 4: Payment Service
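The Payment Service records failures on its span, so a declined charge or an unreachable payment provider is visible in the trace.

```python
# payment_service.py -- charges the order; errors are recorded on the span
from flask import Flask, request, jsonify
from opentelemetry.trace import Status, StatusCode

from tracing_setup import setup_tracing

app = Flask(__name__)
tracer = setup_tracing(app, "payment-service")


def call_payment_provider(order):
    return "ch_demo_123"  # stand-in for the real payment gateway call


@app.route("/charge", methods=["POST"])
def charge():
    order = request.get_json()
    with tracer.start_as_current_span("charge_card") as span:
        try:
            charge_id = call_payment_provider(order)
            span.set_attribute("payment.charge_id", charge_id)
            return jsonify({"charge_id": charge_id})
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, "charge failed"))
            return jsonify({"error": "charge failed"}), 502


if __name__ == "__main__":
    app.run(port=8003)
```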

Service 5: Loyalty Service
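The Loyalty Service is the simplest of the five; the points rule here is illustrative.

```python
# loyalty_service.py -- calculates points for the order
from flask import Flask, request, jsonify

from tracing_setup import setup_tracing

app = Flask(__name__)
tracer = setup_tracing(app, "loyalty-service")


@app.route("/points", methods=["POST"])
def points():
    order = request.get_json()
    with tracer.start_as_current_span("calculate_points") as span:
        earned = int(order.get("total", 0))  # illustrative: one point per currency unit
        span.set_attribute("loyalty.points_earned", earned)
    return jsonify({"points": earned})


if __name__ == "__main__":
    app.run(port=8004)
```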

Visualizing Distributed Traces

When you run this system and create an order, Jaeger shows:
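A single trace whose spans nest under the gateway's root span, roughly like this sketch:

```
api-gateway        POST /checkout
└─ order-service       create_order
   ├─ inventory-service    check_stock
   │  └─ db.query stock_levels        ← dominated the slow traces
   ├─ payment-service      charge_card
   └─ loyalty-service      calculate_points
```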

The problem was obvious: Inventory service occasionally had 2-second database queries. Without distributed tracing, I would have blamed the API Gateway's "slow checkout endpoint."

Context Propagation in Message Queues

HTTP isn't the only communication method. Here's how to propagate context through message queues:
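Here is a sketch of manual propagation, assuming a generic queue client. Only inject and extract are the real OpenTelemetry propagation API; the publish call, handle_order_created, and send_confirmation_email are placeholders.

```python
# Manual context propagation through message headers (queue client is a placeholder)
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)


def publish_order_created(queue, order):
    """Producer side, e.g. the Order Service."""
    with tracer.start_as_current_span("publish order.created"):
        headers = {}
        inject(headers)  # writes traceparent/tracestate into the message headers
        queue.publish(topic="order.created", body=order, headers=headers)


def handle_order_created(body, headers):
    """Consumer side, e.g. the Notification Service."""
    ctx = extract(headers)  # rebuild the producer's context from the headers
    with tracer.start_as_current_span("consume order.created", context=ctx):
        send_confirmation_email(body)  # placeholder for the real email send
```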

Now the trace flows from the producer's publish span, across the queue, and into the consumer's span, all under the same trace ID.

Debugging Production Issues

Issue 1: Cascading Failures

Symptom: Entire checkout flow failing

The distributed trace showed every checkout stalling in the Order Service's call to the Payment Service.

Root cause: the Payment Service was down. The Order Service waited 30 seconds per request, and the API Gateway timed out.

Fix: add a circuit breaker with a 3-second timeout, as in the sketch below.
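A minimal sketch of the idea; the failure threshold and cooldown are illustrative, and a real system would likely use a library rather than this hand-rolled breaker.

```python
# Hand-rolled circuit breaker around the payment call (thresholds illustrative)
import time

import requests
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

FAILURE_THRESHOLD = 5
COOLDOWN_SECONDS = 30
_failures = 0
_opened_at = 0.0


def charge_payment(order):
    global _failures, _opened_at
    if _failures >= FAILURE_THRESHOLD and time.time() - _opened_at < COOLDOWN_SECONDS:
        raise RuntimeError("payment circuit open, failing fast")

    with tracer.start_as_current_span("charge payment") as span:
        try:
            resp = requests.post("http://localhost:8003/charge", json=order, timeout=3)
            resp.raise_for_status()
            _failures = 0
            return resp.json()
        except requests.RequestException as exc:
            _failures += 1
            _opened_at = time.time()
            span.record_exception(exc)  # the failure stays visible in the trace
            raise
```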

Issue 2: Hidden N+1 Problem

Symptom: Premium checkout slow

The trace revealed a ladder of near-identical, sequential calls where a single batch call would have done.

Fix: created a batch endpoint returning all the data in one call, as in the sketch below.
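A before-and-after sketch; the loyalty endpoints and field names are illustrative, since the trace excerpt doesn't name the service involved.

```python
# N+1 fix: replace per-item calls with a single batch call (endpoints illustrative)
import requests


def fetch_points_per_item(items):
    # Before: one HTTP call per item -- a ladder of identical child spans in the trace
    return [
        requests.get(f"http://localhost:8004/points/{item['sku']}", timeout=3).json()
        for item in items
    ]


def fetch_points_batch(items):
    # After: one call, one span
    skus = [item["sku"] for item in items]
    resp = requests.post("http://localhost:8004/points/batch", json={"skus": skus}, timeout=3)
    return resp.json()
```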

Issue 3: Silent Failures

Symptom: Analytics data missing

The trace showed checkout completing normally while the fire-and-forget analytics call quietly failed.

Issue: Analytics service was down, but we weren't monitoring fire-and-forget calls.

Fix: added error tracking for async operations, as in the sketch below.
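A sketch: give the fire-and-forget call its own span and record the failure without breaking checkout. The track_event helper and analytics URL are illustrative.

```python
# Record failures of fire-and-forget analytics calls without failing the checkout
import requests
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)


def track_event(event):
    with tracer.start_as_current_span("analytics.track") as span:
        try:
            requests.post("http://localhost:8006/events", json=event, timeout=2)
        except requests.RequestException as exc:
            # Swallow the error: analytics must never break checkout, but the span shows it
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, "analytics unreachable"))
```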

Sampling Strategies

At high volume, you can't keep every trace, so you sample.
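A minimal sketch using the SDK's built-in samplers; the 10% ratio is illustrative. ParentBased honors the decision made upstream, so a trace is kept or dropped as a whole rather than torn apart per service.

```python
# Keep roughly 10% of traces, honoring the decision made by the upstream service
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```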

Better: always keep the traces that contain errors.
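Because an error only shows up after a span has finished, this decision has to be made at the tail rather than the head. One way to do it (an assumption here, not necessarily the original setup) is the OpenTelemetry Collector's tail_sampling processor:

```yaml
# Collector config sketch: keep every errored trace, plus 10% of the rest
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```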

Best Practices

  1. Always propagate context, both in HTTP headers and in message properties

  2. Keep span names consistent across services

  3. Add service version to resource attributes

  4. Sample intelligently - always keep errors

  5. Set timeouts on downstream calls

  6. Monitor trace completion - are spans being dropped?

What's Next

Continue to Sampling Strategies for a deeper dive into:

  • Head-based vs tail-based sampling

  • Custom sampling logic

  • Managing telemetry costs

  • Sampling in production


Previous: ← Metrics Collection | Next: Sampling Strategies →

Distributed tracing connects the dots across your entire system.
