I'll never forget the day a customer reported: "Checkout takes 12 seconds, but only sometimes."
My system had 7 microservices:
API Gateway → routes requests
Order Service → creates orders
Inventory Service → checks stock
Payment Service → processes payments
Loyalty Service → calculates points
Notification Service → sends emails
Analytics Service → tracks events
Each service had perfect logs. Each service showed <200ms response times in isolation. But together? Sometimes 12 seconds of mystery.
This is where distributed tracing saved me. A single trace ID followed the request through all 7 services, showing exactly where those 12 seconds were hiding.
Context Propagation: The Magic Glue
The key to distributed tracing is context propagation: passing trace context between services.
W3C Trace Context Standard
OpenTelemetry uses W3C Trace Context headers:
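The main header is traceparent, which packs four dash-separated fields: the format version, a 16-byte trace ID, the parent span ID, and trace flags (01 means "sampled"). The values below are just an example:

```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
```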
Every HTTP request automatically includes this header, linking spans across service boundaries.
Building a Distributed System
Let me show you a real multi-service architecture with proper tracing.
Service 1: API Gateway
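A minimal sketch, assuming an Express gateway and a shared tracing.ts bootstrap (ports, URLs, and names are illustrative, not the original code): every service loads the same OpenTelemetry setup with its own service name, and the HTTP auto-instrumentation then writes and reads traceparent on every hop.

```typescript
// tracing.ts - loaded first in every service so instrumentation patches http/express
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: process.env.OTEL_SERVICE_NAME ?? 'api-gateway',
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```

The gateway itself stays boring; the instrumentation adds the traceparent header to the outgoing call for us.

```typescript
// gateway.ts - forwards the checkout request to the Order Service
import './tracing';
import express from 'express';
import axios from 'axios';

const app = express();
app.use(express.json());

app.post('/checkout', async (req, res) => {
  // The outgoing request carries traceparent automatically
  const order = await axios.post('http://order-service:3001/orders', req.body);
  res.json(order.data);
});

app.listen(3000, () => console.log('api-gateway listening on 3000'));
```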
Service 2: Order Service
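A sketch of the Order Service under the same assumptions: the order is created inside a manual create-order span, and each downstream call to Inventory, Payment, and Loyalty shows up as a child span of it.

```typescript
// order-service.ts - same tracing.ts bootstrap, with OTEL_SERVICE_NAME=order-service
import './tracing';
import express from 'express';
import axios from 'axios';
import { trace } from '@opentelemetry/api';
import { randomUUID } from 'node:crypto';

const app = express();
app.use(express.json());
const tracer = trace.getTracer('order-service');

app.post('/orders', async (req, res) => {
  const order = await tracer.startActiveSpan('create-order', async (span) => {
    span.setAttribute('order.item_count', req.body.items?.length ?? 0);

    // Each of these calls becomes a child span of create-order
    await axios.post('http://inventory-service:3002/reserve', { items: req.body.items });
    await axios.post('http://payment-service:3003/payments', { amount: req.body.total });
    await axios.post('http://loyalty-service:3004/points', {
      userId: req.body.userId,
      total: req.body.total,
    });

    span.end();
    return { id: randomUUID(), status: 'created' };
  });
  res.json(order);
});

app.listen(3001);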
Service 3: Inventory Service
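The Inventory Service sketch wraps its stock lookup in an explicit check-stock span; a span like this is what later exposed the occasional 2-second query. Table and column names are assumptions.

```typescript
// inventory-service.ts - OTEL_SERVICE_NAME=inventory-service
import './tracing';
import express from 'express';
import { Pool } from 'pg';
import { trace } from '@opentelemetry/api';

const app = express();
app.use(express.json());
const pool = new Pool(); // connection settings come from the usual PG* env vars
const tracer = trace.getTracer('inventory-service');

app.post('/reserve', async (req, res) => {
  const inStock = await tracer.startActiveSpan('check-stock', async (span) => {
    span.setAttribute('inventory.item_count', req.body.items.length);
    // The pg auto-instrumentation also records each query as its own span
    const { rows } = await pool.query(
      'SELECT sku FROM inventory WHERE sku = ANY($1) AND stock > 0',
      [req.body.items.map((i: { sku: string }) => i.sku)],
    );
    span.end();
    return rows.length === req.body.items.length;
  });
  res.status(inStock ? 200 : 409).json({ reserved: inStock });
});

app.listen(3002);
```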
Service 4: Payment Service
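A minimal Payment Service sketch; the call to the payment provider is faked with a short delay.

```typescript
// payment-service.ts - OTEL_SERVICE_NAME=payment-service
import './tracing';
import express from 'express';
import { trace } from '@opentelemetry/api';

const app = express();
app.use(express.json());
const tracer = trace.getTracer('payment-service');

app.post('/payments', async (req, res) => {
  await tracer.startActiveSpan('charge-card', async (span) => {
    span.setAttribute('payment.amount', req.body.amount);
    // Stand-in for the real payment provider call
    await new Promise((resolve) => setTimeout(resolve, 100));
    span.end();
  });
  res.json({ status: 'charged' });
});

app.listen(3003);
```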
Service 5: Loyalty Service
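And a minimal Loyalty Service sketch; the points formula is an assumption.

```typescript
// loyalty-service.ts - OTEL_SERVICE_NAME=loyalty-service
import './tracing';
import express from 'express';
import { trace } from '@opentelemetry/api';

const app = express();
app.use(express.json());
const tracer = trace.getTracer('loyalty-service');

app.post('/points', async (req, res) => {
  const points = await tracer.startActiveSpan('calculate-points', async (span) => {
    // One point per currency unit spent (an assumption for this sketch)
    const earned = Math.floor(req.body.total ?? 0);
    span.setAttribute('loyalty.points', earned);
    span.end();
    return earned;
  });
  res.json({ points });
});

app.listen(3004);
```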
Visualizing Distributed Traces
When you run this system and create an order, Jaeger shows a single trace: the API Gateway's checkout span at the root, with the Order, Inventory, Payment, and Loyalty spans nested beneath it and every timing laid out on one timeline.
The problem was obvious: Inventory service occasionally had 2-second database queries. Without distributed tracing, I would have blamed the API Gateway's "slow checkout endpoint."
Context Propagation in Message Queues
HTTP isn't the only communication method. Here's how to propagate context through message queues:
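Here's a sketch using RabbitMQ through amqplib (the broker, queue name, and helper functions are my assumptions): the producer injects the active trace context into the message headers, and the consumer extracts it so its span joins the same trace.

```typescript
import amqp from 'amqplib';
import { context, propagation, trace } from '@opentelemetry/api';

// Producer (Order Service): inject the current trace context into message headers
export async function publishOrderCreated(order: { id: string }) {
  const conn = await amqp.connect('amqp://localhost');
  const channel = await conn.createChannel();
  await channel.assertQueue('order.created');

  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers); // writes traceparent into the carrier

  channel.sendToQueue('order.created', Buffer.from(JSON.stringify(order)), { headers });
}

// Consumer (Notification Service): extract the context and continue the same trace
export async function consumeOrderCreated() {
  const conn = await amqp.connect('amqp://localhost');
  const channel = await conn.createChannel();
  await channel.assertQueue('order.created');
  const tracer = trace.getTracer('notification-service');

  await channel.consume('order.created', (msg) => {
    if (!msg) return;
    const parentCtx = propagation.extract(context.active(), msg.properties.headers);

    context.with(parentCtx, () => {
      tracer.startActiveSpan('send-order-email', (span) => {
        // ... render and send the email here
        span.end();
      });
    });
    channel.ack(msg);
  });
}
```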
Now the trace flows through the queue as well: the Order Service's publish span and the Notification Service's consume span share the same trace ID, so the asynchronous email work shows up in the same Jaeger timeline as the checkout.
Debugging Production Issues
Issue 1: Cascading Failures
Symptom: Entire checkout flow failing
Distributed trace showed: the Order Service's call to the Payment Service hanging for the full 30 seconds and ending in an error, with no Payment Service span ever appearing.
Root cause: Payment service was down. Order service waited 30 seconds per request. API Gateway timed out.
Fix: Add circuit breaker with 3-second timeout.
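Here's one way that fix could look, sketched with the opossum circuit-breaker library (the library choice, URL, and helper names are assumptions, not necessarily what we used):

```typescript
import CircuitBreaker from 'opossum';
import axios from 'axios';

// callPayment is a hypothetical helper for this sketch
async function callPayment(payload: { orderId: string; amount: number }) {
  const res = await axios.post('http://payment-service:3003/payments', payload);
  return res.data;
}

const breaker = new CircuitBreaker(callPayment, {
  timeout: 3000,                // fail fast after 3 s instead of hanging for 30 s
  errorThresholdPercentage: 50, // open the circuit once half the calls fail
  resetTimeout: 10000,          // probe the payment service again after 10 s
});

// Fallback keeps checkout responsive while payments are down
breaker.fallback(() => ({ status: 'payment_pending' }));

export const processPayment = (payload: { orderId: string; amount: number }) =>
  breaker.fire(payload);
```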
Issue 2: Hidden N+1 Problem
Symptom: Premium checkout slow
Trace revealed: the same downstream endpoint being called once per cart item, a long run of sequential child spans where a single call should have been.
Fix: Created batch endpoint returning all data in one call.
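A sketch of that fix (endpoint name, table, and SQL are illustrative): the inventory side exposes one batch route that answers for the whole cart in a single query.

```typescript
import express from 'express';
import { Pool } from 'pg';

const app = express();
app.use(express.json());
const pool = new Pool();

// One round trip and one query instead of one per SKU
app.post('/inventory/batch', async (req, res) => {
  const { skus } = req.body as { skus: string[] };
  const { rows } = await pool.query(
    'SELECT sku, stock FROM inventory WHERE sku = ANY($1)',
    [skus],
  );
  res.json(rows);
});

app.listen(3002);
```

On the caller side, the N per-item requests collapse into a single POST with the full SKU list.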
Issue 3: Silent Failures
Symptom: Analytics data missing
Trace showed: checkout traces completing normally, but with no Analytics Service span anywhere in them.
Issue: Analytics service was down, but we weren't monitoring fire-and-forget calls.
Fix: Added error tracking for async operations.
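Here's a sketch of what that error tracking could look like (trackEvent and the analytics URL are hypothetical): the call stays fire-and-forget, but failures are recorded on their own span, so missing analytics shows up in Jaeger and in span-error alerts.

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';
import axios from 'axios';

const tracer = trace.getTracer('order-service');

// Fire-and-forget, but the failure is still recorded on a dedicated span
export function trackEvent(name: string, payload: Record<string, unknown>): void {
  void tracer.startActiveSpan(`analytics ${name}`, async (span) => {
    try {
      await axios.post('http://analytics-service:3006/events', { name, payload });
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: 'analytics event dropped' });
    } finally {
      span.end();
    }
  });
}
```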
Sampling Strategies
At high volume, you can't keep every trace. Use sampling:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { TraceIdRatioBasedSampler, ParentBasedSampler } from '@opentelemetry/sdk-trace-node';

const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    // Sample 10% of root spans; child spans follow their parent's decision
    root: new TraceIdRatioBasedSampler(0.1),
  }),
  // ... other config
});
```

Better: always sample errors. A head sampler like the one above makes its decision before the span runs, so it can't see failures. To guarantee error traces are kept, sample after spans finish instead, for example with the OpenTelemetry Collector's tail_sampling processor and a status_code policy.
Best Practices
Always propagate context in HTTP headers and message properties
Keep span names consistent across services
Add service version to resource attributes
Sample intelligently - always keep errors
Set timeouts on downstream calls
Monitor trace completion - are spans being dropped?