Production Deployment

From 10 to 10,000 Requests Per Second

When I first deployed OpenTelemetry in production, we handled 10 requests/second. The setup was simple:

  • Single Jaeger instance

  • One OpenTelemetry Collector

  • Services sending directly to Collector

Fast forward 6 months: 10,000 requests/second. The original setup collapsed:

  • Jaeger ran out of storage

  • Collector became a bottleneck

  • Export failures caused memory leaks

  • Traces were getting dropped

This article covers how I scaled OpenTelemetry to handle massive production load.

Architecture Evolution

Phase 1: Simple (0-100 req/s)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Service │────▢│ Collector │────▢│ Jaeger  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Works for: Development, small apps, proof-of-concept

Phase 2: Distributed (100-1,000 req/s)
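
┌─────────┐     ┌────────────┐     ┌─────────┐     ┌─────────┐
│ Service │────▶│ Node Agent │────▶│ Gateway │────▶│ Jaeger  │
└─────────┘     └────────────┘     └─────────┘     └─────────┘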

Works for: Production apps, small teams

Phase 3: Scaled (1,000-10,000+ req/s)
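
┌──────────┐     ┌─────────────┐     ┌───────────┐     ┌───────────┐     ┌───────────────┐
│ Services │────▶│ Node Agents │────▶│ Gateways  │────▶│ Jaeger    │────▶│ Elasticsearch │
│ (many)   │     │ (DaemonSet) │     │ (xN, LB)  │     │ Collectors│     │ Cluster       │
└──────────┘     └─────────────┘     └───────────┘     └───────────┘     └───────────────┘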

Works for: High-scale production, large teams

Production Kubernetes Deployment

Collector DaemonSet (Agent Pattern)

Deploy Collector on every node:
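
A minimal sketch of such a DaemonSet (image tag, namespace, names, and resource limits are illustrative):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-agent
  template:
    metadata:
      labels:
        app: otel-agent
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.91.0
          ports:
            - containerPort: 4317   # OTLP gRPC
              hostPort: 4317        # reachable from pods via the node IP
          resources:
            limits:
              cpu: 200m
              memory: 256Mi
          volumeMounts:
            - name: config
              mountPath: /etc/otelcol-contrib
      volumes:
        - name: config
          configMap:
            name: otel-agent-config   # holds the agent config shown below
```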

Gateway Collector Deployment

Centralized collectors for processing:
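
A sketch of the gateway Deployment and the Service the node agents export to (replica count and names are illustrative; the config is mounted the same way as in the DaemonSet):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-gateway
  namespace: observability
spec:
  replicas: 3    # scale horizontally with load
  selector:
    matchLabels:
      app: otel-gateway
  template:
    metadata:
      labels:
        app: otel-gateway
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.91.0
          ports:
            - containerPort: 4317   # OTLP gRPC from the node agents
          resources:
            limits:
              cpu: "1"
              memory: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  name: otel-gateway
  namespace: observability
spec:
  selector:
    app: otel-gateway
  ports:
    - port: 4317
      targetPort: 4317
```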

DaemonSet Configuration

Node collectors forward to gateway:
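
The agents stay deliberately thin: guard their own memory, batch, and forward. A sketch, assuming the gateway Service above:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 200          # stay under the container limit
  batch:
    send_batch_size: 512
    timeout: 5s

exporters:
  otlp:
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: true        # in-cluster traffic; use mTLS in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```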

Gateway Configuration

Gateway does heavy processing:
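
A sketch of the gateway pipeline: tail sampling keeps every error trace but only a 1% baseline of healthy traffic (the rate assumed in the cost section below), then large batches go to Jaeger over OTLP. Policy names and thresholds are illustrative:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1600
  tail_sampling:
    decision_wait: 10s      # how long to buffer a trace before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
  batch:
    send_batch_size: 1024
    timeout: 10s

exporters:
  otlp:
    endpoint: jaeger-collector.observability.svc.cluster.local:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp]
```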

Jaeger High Availability

Elasticsearch Backend
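
Run at least two Jaeger collector replicas and point them at Elasticsearch; provisioning the Elasticsearch cluster itself (via Helm or ECK) is out of scope here. A sketch, with illustrative image tag, shard counts, and URL:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
  namespace: observability
spec:
  replicas: 2    # at least two for availability
  selector:
    matchLabels:
      app: jaeger-collector
  template:
    metadata:
      labels:
        app: jaeger-collector
    spec:
      containers:
        - name: jaeger-collector
          image: jaegertracing/jaeger-collector:1.53
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch.observability.svc.cluster.local:9200
            - name: ES_NUM_SHARDS
              value: "3"
            - name: ES_NUM_REPLICAS
              value: "1"
          ports:
            - containerPort: 4317   # OTLP gRPC from the gateway
```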

Monitoring the Monitoring System

Critical: Monitor your observability infrastructure!

Collector Metrics

Key metrics to alert on:
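
The Collector serves its own Prometheus metrics on port 8888 by default. The ones worth alerting on:

  • otelcol_exporter_queue_size — export queue depth; sustained growth means the backend can't keep up

  • otelcol_exporter_send_failed_spans — spans that failed to export after retries

  • otelcol_receiver_refused_spans — spans rejected at the receiver, usually memory_limiter backpressure

An example alert rule, assuming Prometheus already scrapes the collectors:

```yaml
groups:
  - name: otel-collector
    rules:
      - alert: CollectorExportFailures
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "OpenTelemetry Collector is failing to export spans"
```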

Application Configuration

Services send to local DaemonSet collector:
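
The SDK picks this up from the standard OTLP environment variables. The service name here is hypothetical, and HOST_IP is injected by the Deployment below:

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT="http://${HOST_IP}:4317"
export OTEL_SERVICE_NAME="checkout-service"
```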

Kubernetes Deployment:
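
A sketch using the Kubernetes downward API to resolve the node IP, so each pod talks to the collector on its own node (service and image names are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: app
          image: example.com/checkout-service:latest
          env:
            # downward API: the IP of the node this pod landed on
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://$(HOST_IP):4317"
```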

Cost Optimization

Storage Costs

At 10,000 req/s with 1% sampling:

  • 100 traces/second Γ— 15KB avg = 1.5 MB/s

  • Daily: 130 GB

  • 7-day retention: 910 GB

  • Monthly: ~3.9 TB

At $0.10/GB for Elasticsearch storage, that's roughly $390/month for storage alone.

Optimization:
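
The big levers are sampling less (the tail-sampling gateway above keeps all errors but only 1% of healthy traffic), retaining less, and dropping attributes you never query. A sketch of enforcing the 7-day retention with Jaeger's index cleaner (schedule and image tag are illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: jaeger-es-index-cleaner
  namespace: observability
spec:
  schedule: "0 2 * * *"    # nightly
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: index-cleaner
              image: jaegertracing/jaeger-es-index-cleaner:1.53
              # args: <days of indices to keep> <Elasticsearch URL>
              args:
                - "7"
                - http://elasticsearch.observability.svc.cluster.local:9200
```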

Runbook: Common Issues

Issue 1: Export Queue Full

Symptoms: Memory usage increasing, spans being dropped

Check:
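
For example (the pod name is a placeholder; the Collector serves its own metrics on :8888):

```bash
kubectl -n observability port-forward otel-gateway-<pod-id> 8888:8888
# in another terminal:
curl -s localhost:8888/metrics \
  | grep -E 'otelcol_exporter_queue_size|otelcol_exporter_send_failed_spans'
```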

Fix:
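
Give the exporter more queue and retry headroom, or scale out the gateway. A sketch (the defaults are queue_size: 1000 with retries enabled; numbers here are illustrative):

```yaml
exporters:
  otlp:
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000          # absorb bursts the backend can't take immediately
    retry_on_failure:
      enabled: true
      max_elapsed_time: 300s    # give up after 5 minutes instead of queueing forever
```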

Issue 2: Jaeger Storage Full

Symptoms: Collector export errors, "index read-only"

Check:
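
For example, against a port-forwarded Elasticsearch (Jaeger writes daily jaeger-span-* indices):

```bash
curl -s 'localhost:9200/_cat/indices/jaeger-*?v&h=index,store.size,docs.count'
curl -s 'localhost:9200/_cat/allocation?v'    # disk use vs. watermarks per node
```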

Fix:
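
Delete the oldest indices, then clear the read-only block Elasticsearch set when it hit the flood-stage watermark (the date is illustrative). The index-cleaner CronJob above prevents a recurrence:

```bash
curl -s -X DELETE 'localhost:9200/jaeger-span-2024-01-01'
curl -s -X PUT 'localhost:9200/jaeger-*/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'
```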

Issue 3: High Collector CPU

Symptoms: Collector CPU > 80%

Check:
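
For example (requires metrics-server):

```bash
kubectl -n observability top pods -l app=otel-gateway
```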

Fix: Scale collectors horizontally
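
For example (the target replica count is illustrative); a Horizontal Pod Autoscaler on CPU is the durable fix:

```bash
kubectl -n observability scale deployment otel-gateway --replicas=5
```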

Production Checklist
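
Condensed from the sections above:

  • Tail sampling configured at the gateway; error traces always kept

  • DaemonSet agents on every node, gateway collectors behind a Service

  • At least two Jaeger collector replicas backed by Elasticsearch

  • Collector self-metrics scraped, with alerts on queue size and export failures

  • Retention enforced with the index cleaner; disk watermarks monitored

  • Runbook linked from every alert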

Scaling Benchmarks

From my production deployment:

| Metric               | 100 req/s  | 1,000 req/s | 10,000 req/s |
|----------------------|------------|-------------|--------------|
| DaemonSet Collectors | 1 per node | 1 per node  | 1 per node   |
| Gateway Collectors   | 1          | 2           | 5            |
| Jaeger Collectors    | 1          | 2           | 3            |
| Elasticsearch Nodes  | 1          | 3           | 6            |
| Monthly Cost         | $50        | $200        | $800         |
| Data Generated       | 13 GB/day  | 130 GB/day  | 1.3 TB/day   |

What's Next

You've completed the core OpenTelemetry 101 series! Continue to Multi-Backend Integration for advanced topics on integrating with cloud providers and commercial observability platforms.


Previous: ← Security Best Practices | Next: Multi-Backend Integration β†’

Production-ready observability is a marathon, not a sprint.
