Distributed Tracing

The Microservices Debugging Nightmare

I'll never forget the day a customer reported: "Checkout takes 12 seconds, but only sometimes."

My system had 7 microservices:

API Gateway → routes requests
Order Service → creates orders
Inventory Service → checks stock
Payment Service → processes payments
Loyalty Service → calculates points
Notification Service → sends emails
Analytics Service → tracks events

Each service had perfect logs. Each service showed <200ms response times in isolation. But together? Sometimes 12 seconds of mystery.

This is where distributed tracing saved me. A single trace ID followed the request through all 7 services, showing exactly where those 12 seconds were hiding.

Context Propagation: The Magic Glue

The key to distributed tracing is context propagation—passing trace context between services.

W3C Trace Context Standard

OpenTelemetry uses W3C Trace Context headers:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                │
             │  └─ Trace ID (128-bit)            │                └─ Flags
             │                                   └─ Span ID (64-bit)
             └─ Version

Every HTTP request automatically includes this header, linking spans across service boundaries.

Building a Distributed System

Let me show you a real multi-service architecture with proper tracing.

Service 1: API Gateway

// api-gateway/src/instrumentation.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'api-gateway',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// api-gateway/src/app.ts
import express from 'express';
import axios from 'axios';
import { trace, context, propagation } from '@opentelemetry/api';

const app = express();
const tracer = trace.getTracer('api-gateway', '1.0.0');

app.use(express.json());

app.post('/api/checkout', async (req, res) => {
  return await tracer.startActiveSpan('checkout', async (span) => {
    try {
      const { userId, items, amount } = req.body;
      
      span.setAttribute('checkout.user_id', userId);
      span.setAttribute('checkout.items_count', items.length);
      span.setAttribute('checkout.amount', amount);
      
      // Call Order Service
      const orderResponse = await axios.post('http://localhost:3001/api/orders', {
        userId,
        items,
        amount,
      }, {
        headers: {
          // Propagate trace context automatically
          ...propagation.inject(context.active(), {}),
        },
      });
      
      span.setAttribute('checkout.order_id', orderResponse.data.orderId);
      res.json(orderResponse.data);
      
    } catch (error) {
      span.recordException(error as Error);
      res.status(500).json({ error: 'Checkout failed' });
    } finally {
      span.end();
    }
  });
});

app.listen(3000, () => console.log('API Gateway on :3000'));

Service 2: Order Service

// order-service/src/instrumentation.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'order-service',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// order-service/src/app.ts
import express from 'express';
import axios from 'axios';
import { trace, context, propagation, SpanStatusCode } from '@opentelemetry/api';
import { Pool } from 'pg';

const app = express();
const tracer = trace.getTracer('order-service', '1.0.0');

app.use(express.json());

const pool = new Pool({
  host: 'localhost',
  database: 'orders',
  user: 'postgres',
  password: 'password',
});

app.post('/api/orders', async (req, res) => {
  return await tracer.startActiveSpan('createOrder', async (span) => {
    try {
      const { userId, items, amount } = req.body;
      
      span.setAttribute('order.user_id', userId);
      span.setAttribute('order.amount', amount);
      
      // Create order in database
      const orderId = `ORD-${Date.now()}`;
      await pool.query(
        'INSERT INTO orders (id, user_id, amount, status) VALUES ($1, $2, $3, $4)',
        [orderId, userId, amount, 'pending']
      );
      
      span.setAttribute('order.id', orderId);
      
      // Parallel calls to downstream services
      const [inventoryResult, loyaltyResult] = await Promise.all([
        // Check inventory
        tracer.startActiveSpan('checkInventory', async (invSpan) => {
          try {
            invSpan.setAttribute('inventory.items_count', items.length);
            
            const response = await axios.post(
              'http://localhost:3002/api/inventory/check',
              { orderId, items },
              { headers: propagation.inject(context.active(), {}) }
            );
            
            invSpan.setAttribute('inventory.available', response.data.available);
            return response.data;
          } finally {
            invSpan.end();
          }
        }),
        
        // Calculate loyalty points
        tracer.startActiveSpan('calculateLoyalty', async (loySpan) => {
          try {
            loySpan.setAttribute('loyalty.user_id', userId);
            loySpan.setAttribute('loyalty.amount', amount);
            
            const response = await axios.post(
              'http://localhost:3003/api/loyalty/calculate',
              { userId, amount },
              { headers: propagation.inject(context.active(), {}) }
            );
            
            loySpan.setAttribute('loyalty.points', response.data.points);
            return response.data;
          } finally {
            loySpan.end();
          }
        }),
      ]);
      
      if (!inventoryResult.available) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: 'Insufficient inventory' });
        return res.status(400).json({ error: 'Out of stock' });
      }
      
      // Process payment
      const paymentResult = await tracer.startActiveSpan('processPayment', async (paySpan) => {
        try {
          paySpan.setAttribute('payment.order_id', orderId);
          paySpan.setAttribute('payment.amount', amount);
          
          const response = await axios.post(
            'http://localhost:3004/api/payment/charge',
            { orderId, userId, amount },
            { headers: propagation.inject(context.active(), {}) }
          );
          
          paySpan.setAttribute('payment.success', response.data.success);
          return response.data;
        } finally {
          paySpan.end();
        }
      });
      
      if (!paymentResult.success) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: 'Payment failed' });
        return res.status(402).json({ error: 'Payment declined' });
      }
      
      // Update order status
      await pool.query(
        'UPDATE orders SET status = $1 WHERE id = $2',
        ['completed', orderId]
      );
      
      // Send notification (fire and forget - don't await)
      axios.post(
        'http://localhost:3005/api/notifications/send',
        { userId, orderId, type: 'order_confirmation' },
        { headers: propagation.inject(context.active(), {}) }
      ).catch(err => console.error('Notification failed:', err));
      
      res.json({
        orderId,
        status: 'completed',
        loyaltyPoints: loyaltyResult.points,
      });
      
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
      res.status(500).json({ error: 'Order creation failed' });
    } finally {
      span.end();
    }
  });
});

app.listen(3001, () => console.log('Order Service on :3001'));

Service 3: Inventory Service

// inventory-service/src/app.ts
import express from 'express';
import { trace } from '@opentelemetry/api';
import { Pool } from 'pg';

const app = express();
const tracer = trace.getTracer('inventory-service', '1.0.0');

app.use(express.json());

const pool = new Pool({
  host: 'localhost',
  database: 'inventory',
  user: 'postgres',
  password: 'password',
});

app.post('/api/inventory/check', async (req, res) => {
  return await tracer.startActiveSpan('checkInventoryAvailability', async (span) => {
    try {
      const { orderId, items } = req.body;
      
      span.setAttribute('inventory.order_id', orderId);
      span.setAttribute('inventory.items_count', items.length);
      
      // Simulate checking each item
      let allAvailable = true;
      
      for (const item of items) {
        const result = await pool.query(
          'SELECT quantity FROM inventory WHERE sku = $1',
          [item.sku]
        );
        
        const available = result.rows[0]?.quantity >= item.quantity;
        
        if (!available) {
          allAvailable = false;
          span.addEvent('Item out of stock', {
            'inventory.sku': item.sku,
            'inventory.requested': item.quantity,
            'inventory.available': result.rows[0]?.quantity || 0,
          });
        }
      }
      
      span.setAttribute('inventory.all_available', allAvailable);
      
      // Simulate slow database operation occasionally
      if (Math.random() < 0.1) {
        span.addEvent('Slow inventory query detected');
        await new Promise(r => setTimeout(r, 2000)); // 2 second delay!
      }
      
      res.json({ available: allAvailable });
      
    } catch (error) {
      span.recordException(error as Error);
      res.status(500).json({ error: 'Inventory check failed' });
    } finally {
      span.end();
    }
  });
});

app.listen(3002, () => console.log('Inventory Service on :3002'));

Service 4: Payment Service

// payment-service/src/app.ts
import express from 'express';
import { trace, SpanStatusCode } from '@opentelemetry/api';

const app = express();
const tracer = trace.getTracer('payment-service', '1.0.0');

app.use(express.json());

app.post('/api/payment/charge', async (req, res) => {
  return await tracer.startActiveSpan('chargePayment', async (span) => {
    try {
      const { orderId, userId, amount } = req.body;
      
      span.setAttribute('payment.order_id', orderId);
      span.setAttribute('payment.user_id', userId);
      span.setAttribute('payment.amount', amount);
      span.setAttribute('payment.currency', 'USD');
      
      // Simulate payment gateway call
      await tracer.startActiveSpan('callPaymentGateway', async (gatewaySpan) => {
        try {
          gatewaySpan.setAttribute('gateway.provider', 'stripe');
          
          // Simulate external API call
          await new Promise(r => setTimeout(r, 300));
          
          gatewaySpan.addEvent('Payment gateway response received');
        } finally {
          gatewaySpan.end();
        }
      });
      
      // 95% success rate
      const success = Math.random() > 0.05;
      
      span.setAttribute('payment.success', success);
      
      if (!success) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: 'Payment declined' });
        span.addEvent('Payment declined by gateway');
      } else {
        span.addEvent('Payment authorized', {
          'payment.transaction_id': `TXN-${Date.now()}`,
        });
      }
      
      res.json({ success });
      
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
      res.status(500).json({ error: 'Payment processing failed' });
    } finally {
      span.end();
    }
  });
});

app.listen(3004, () => console.log('Payment Service on :3004'));

Service 5: Loyalty Service

// loyalty-service/src/app.ts
import express from 'express';
import { trace } from '@opentelemetry/api';
import Redis from 'ioredis';

const app = express();
const tracer = trace.getTracer('loyalty-service', '1.0.0');

app.use(express.json());

const redis = new Redis({ host: 'localhost', port: 6379 });

app.post('/api/loyalty/calculate', async (req, res) => {
  return await tracer.startActiveSpan('calculateLoyaltyPoints', async (span) => {
    try {
      const { userId, amount } = req.body;
      
      span.setAttribute('loyalty.user_id', userId);
      span.setAttribute('loyalty.order_amount', amount);
      
      // Get user tier from cache
      const userTier = await redis.get(`user:${userId}:tier`) || 'standard';
      span.setAttribute('loyalty.user_tier', userTier);
      
      // Calculate points based on tier
      let multiplier = 1;
      if (userTier === 'premium') multiplier = 2;
      if (userTier === 'gold') multiplier = 3;
      
      const points = Math.floor(amount * 10 * multiplier);
      
      span.setAttribute('loyalty.points_awarded', points);
      span.setAttribute('loyalty.multiplier', multiplier);
      
      // Save points
      await redis.incrby(`user:${userId}:points`, points);
      
      span.addEvent('Loyalty points awarded');
      
      res.json({ points });
      
    } catch (error) {
      span.recordException(error as Error);
      res.status(500).json({ error: 'Loyalty calculation failed' });
    } finally {
      span.end();
    }
  });
});

app.listen(3003, () => console.log('Loyalty Service on :3003'));

Visualizing Distributed Traces

When you run this system and create an order, Jaeger shows:

POST /api/checkout [2.8s]
├─ api-gateway: checkout [2.8s]
   └─ HTTP POST order-service [2.7s]
      ├─ order-service: createOrder [2.7s]
      │  ├─ postgres: INSERT INTO orders [12ms]
      │  ├─ [Parallel Operations - 2.1s]
      │  │  ├─ order-service: checkInventory [2.1s]
      │  │  │  └─ HTTP POST inventory-service [2.1s]
      │  │  │     └─ inventory-service: checkInventoryAvailability [2.1s] ⚠️ SLOW!
      │  │  │        ├─ postgres: SELECT quantity [1.9s] ⚠️ BOTTLENECK
      │  │  │        └─ Event: Slow inventory query detected
      │  │  └─ order-service: calculateLoyalty [180ms]
      │  │     └─ HTTP POST loyalty-service [180ms]
      │  │        └─ loyalty-service: calculateLoyaltyPoints [180ms]
      │  │           ├─ redis: GET user:123:tier [3ms]
      │  │           └─ redis: INCRBY user:123:points [2ms]
      │  ├─ order-service: processPayment [350ms]
      │  │  └─ HTTP POST payment-service [350ms]
      │  │     └─ payment-service: chargePayment [350ms]
      │  │        └─ payment-service: callPaymentGateway [310ms]
      │  └─ postgres: UPDATE orders [8ms]
      └─ HTTP POST notification-service (async - not waited)

The problem was obvious: Inventory service occasionally had 2-second database queries. Without distributed tracing, I would have blamed the API Gateway's "slow checkout endpoint."

Context Propagation in Message Queues

HTTP isn't the only communication method. Here's how to propagate context through message queues:

import { trace, context, propagation } from '@opentelemetry/api';
import * as amqp from 'amqplib';

const tracer = trace.getTracer('publisher-service', '1.0.0');

async function publishOrderEvent(orderId: string, userId: string) {
  return await tracer.startActiveSpan('publishOrderEvent', async (span) => {
    try {
      const connection = await amqp.connect('amqp://localhost');
      const channel = await connection.createChannel();
      await channel.assertQueue('orders');
      
      const message = {
        orderId,
        userId,
        timestamp: Date.now(),
      };
      
      // Inject trace context into message properties
      const carrier: Record<string, string> = {};
      propagation.inject(context.active(), carrier);
      
      channel.sendToQueue('orders', Buffer.from(JSON.stringify(message)), {
        headers: carrier, // Trace context in headers!
      });
      
      span.addEvent('Order event published', {
        'message.queue': 'orders',
        'message.order_id': orderId,
      });
      
      await channel.close();
      await connection.close();
      
    } finally {
      span.end();
    }
  });
}

// Consumer side
async function consumeOrderEvents() {
  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createChannel();
  await channel.assertQueue('orders');
  
  channel.consume('orders', (msg) => {
    if (!msg) return;
    
    // Extract trace context from message headers
    const extractedContext = propagation.extract(context.active(), msg.properties.headers || {});
    
    // Continue the trace in consumer
    context.with(extractedContext, () => {
      tracer.startActiveSpan('processOrderEvent', (span) => {
        try {
          const order = JSON.parse(msg.content.toString());
          
          span.setAttribute('message.order_id', order.orderId);
          span.addEvent('Processing order event');
          
          // Process the order...
          
          channel.ack(msg);
        } catch (error) {
          span.recordException(error as Error);
          channel.nack(msg);
        } finally {
          span.end();
        }
      });
    });
  });
}

Now the trace flows through:

HTTP POST → Order Service → RabbitMQ → Analytics Consumer
[Same Trace ID throughout]

Debugging Production Issues

Issue 1: Cascading Failures

Symptom: Entire checkout flow failing

Distributed trace showed:

checkout [FAILED]
└─ createOrder [FAILED]
   └─ processPayment [FAILED]
      └─ HTTP POST payment-service [TIMEOUT - 30s]

Root cause: Payment service was down. Order service waited 30 seconds per request. API Gateway timed out.

Fix: Add circuit breaker with 3-second timeout.

Issue 2: Hidden N+1 Problem

Symptom: Premium checkout slow

Trace revealed:

checkout [3.2s]
└─ calculateLoyalty [2.8s]
   ├─ HTTP GET loyalty-service/tier [200ms]
   ├─ HTTP GET loyalty-service/benefits [220ms]
   ├─ HTTP GET loyalty-service/discounts [180ms]
   ├─ HTTP GET loyalty-service/multiplier [210ms]
   └─ ... (10 more calls)

Fix: Created batch endpoint returning all data in one call.

Issue 3: Silent Failures

Symptom: Analytics data missing

Trace showed:

createOrder [SUCCESS]
├─ processPayment [SUCCESS]
└─ sendAnalytics [FAILED] ⚠️ Fire-and-forget
   └─ Error: Connection refused

Issue: Analytics service was down, but we weren't monitoring fire-and-forget calls.

Fix: Added error tracking for async operations.

Sampling Strategies

At high volume, you can't keep every trace. Use sampling:

import { TraceIdRatioBasedSampler, ParentBasedSampler } from '@opentelemetry/sdk-trace-node';

const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    // Sample 10% of root spans
    root: new TraceIdRatioBasedSampler(0.1),
  }),
  // ... other config
});

Better: Always sample errors:

import { Sampler, SamplingDecision, SamplingResult } from '@opentelemetry/sdk-trace-base';

class ErrorSampler implements Sampler {
  shouldSample(context: any, traceId: string, spanName: string, spanKind: any, attributes: any): SamplingResult {
    // Always sample if error
    if (attributes['error'] === true || attributes['http.status_code'] >= 500) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    
    // Sample 1% of successful requests
    if (Math.random() < 0.01) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    
    return { decision: SamplingDecision.NOT_RECORD };
  }
  
  toString(): string {
    return 'ErrorSampler';
  }
}

Best Practices

Always propagate context in HTTP headers, message properties
Keep span names consistent across services
Add service version to resource attributes
Sample intelligently - always keep errors
Set timeouts on downstream calls
Monitor trace completion - are spans being dropped?

What's Next

Continue to Sampling Strategies for deep dive on:

Head-based vs tail-based sampling
Custom sampling logic
Managing telemetry costs
Sampling in production

Previous: ← Metrics Collection | Next: Sampling Strategies →

Distributed tracing connects the dots across your entire system.

PreviousMetrics Collection NextSampling Strategies

Last updated 1 month ago

hashtagThe Microservices Debugging Nightmare

hashtagContext Propagation: The Magic Glue

hashtagW3C Trace Context Standard

hashtagBuilding a Distributed System

hashtagService 1: API Gateway

hashtagService 2: Order Service

hashtagService 3: Inventory Service

hashtagService 4: Payment Service

hashtagService 5: Loyalty Service

hashtagVisualizing Distributed Traces

hashtagContext Propagation in Message Queues

hashtagDebugging Production Issues

hashtagIssue 1: Cascading Failures

hashtagIssue 2: Hidden N+1 Problem

hashtagIssue 3: Silent Failures

hashtagSampling Strategies

hashtagBest Practices

hashtagWhat's Next