OpenTelemetry Collector

Why the Collector Changed Everything

Before the Collector, each of my 15 microservices sent telemetry directly to Jaeger. This worked fine... until Jaeger went down for maintenance.

Result: 15 services couldn't export telemetry. Export queues filled up. Memory usage spiked. Services started crashing.

After implementing the OpenTelemetry Collector as a central hub:

  • Services send to Collector (localhost, always available)

  • Collector buffers and forwards to backends

  • Backends can go down without affecting services

  • One place to configure routing, filtering, transformation

The Collector is a production necessity, not a nice-to-have.

What Is the Collector?

The OpenTelemetry Collector is a standalone service that:

  1. Receives telemetry from applications

  2. Processes it (filtering, transforming, sampling)

  3. Exports to one or more backends

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Service A   │────▢│                      │────▢│   Jaeger    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚                      β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚   OTel Collector     β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚                      β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Service B   │────▢│  - Receives          │────▢│ Prometheus  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚  - Processes         β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚  - Exports           β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚                      β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Service C   │────▢│                      │────▢│ CloudWatch  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Basic Collector Setup

Install the Collector:
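
One straightforward way to run it locally is the contrib Docker image; the mount path and the use of `latest` below are placeholders you should adjust to your environment:

docker run -d --name otel-collector \
  -p 4317:4317 -p 4318:4318 \
  -v $(pwd)/otel-collector-config.yaml:/etc/otel/config.yaml \
  otel/opentelemetry-collector-contrib:latest \
  --config=/etc/otel/config.yaml

Port 4317 is OTLP over gRPC and 4318 is OTLP over HTTP. Native packages and a Helm chart are also available if Docker isn't an option.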

Basic configuration (otel-collector-config.yaml):
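
A minimal pipeline receives OTLP from your services, batches it, and forwards traces to Jaeger. The jaeger:4317 endpoint assumes Jaeger's native OTLP gRPC port; swap in your own hostname:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]

Every pipeline is just receivers → processors → exporters; everything else in this chapter is a variation on that shape.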

Update your application:
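
The only change on the application side is the export endpoint: point the OTLP exporter at localhost:4317 (or set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317) instead of the backend's address. A minimal sketch, assuming a Python service using the OTLP gRPC exporter:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export to the local Collector; the Collector decides where data goes from here.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)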

Advanced Processing

Filtering Spans

Remove health check spans:
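
A sketch using the contrib filter processor in its OTTL form (recent releases; older ones use an include/exclude match config instead), assuming your HTTP instrumentation records the route in http.route:

processors:
  filter/healthchecks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/health"'
        - 'attributes["http.route"] == "/healthz"'

Add filter/healthchecks to the traces pipeline before batch; spans matching any condition are dropped.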

Sampling at the Collector
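
Because the Collector sees complete traces, it can make smarter sampling decisions than any single service. A sketch using the contrib tail_sampling processor: keep every error and every slow trace, plus 10% of everything else (thresholds and percentage are illustrative):

processors:
  tail_sampling:
    decision_wait: 10s        # how long to hold a trace before deciding
    num_traces: 50000         # traces kept in memory while waiting
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

A trace is kept if any policy matches, so errors and slow requests always get through.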

Adding Attributes

Enrich all spans with environment info:
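
A sketch using the resource processor; the environment and region values are placeholders for whatever your deployment tooling injects:

processors:
  resource/env:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
      - key: cloud.region
        value: us-east-1
        action: upsert

Because this runs in one place, every span, metric, and log flowing through the pipeline gets the same labels without touching application code.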

Redacting Sensitive Data

Remove PII from spans:
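
One way is the attributes processor, deleting or hashing keys that should never leave your network. The attribute names below are examples; match them to what your instrumentation actually records:

processors:
  attributes/redact:
    actions:
      - key: user.email
        action: delete
      - key: http.request.header.authorization
        action: delete
      - key: user.id
        action: hash          # keeps correlation without exposing the raw value

There is also a dedicated redaction processor in contrib that works from an allow-list if you'd rather block everything by default.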

Multi-Backend Routing

Send different data to different backends:
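
Routing by signal type is just a matter of wiring pipelines to different exporters. A sketch matching the diagram above, with traces going to Jaeger and metrics to Prometheus and CloudWatch (the awsemf exporter is part of contrib; region and endpoints are placeholders):

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889    # scraped by Prometheus
  awsemf:
    region: us-east-1         # CloudWatch via Embedded Metric Format

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus, awsemf]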

Service-Specific Routing

Route by service name:
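
A sketch using the contrib routing processor keyed on the service.name resource attribute. The service and backend names here are hypothetical, and newer Collector releases steer you toward the routing connector instead, so check the docs for your version:

processors:
  routing:
    attribute_source: resource
    from_attribute: service.name
    default_exporters: [otlp/jaeger]
    table:
      - value: payment-service
        exporters: [otlp/jaeger, otlp/audit]

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, routing]
      exporters: [otlp/jaeger, otlp/audit]

Every exporter a route can hit must also be listed on the pipeline; the processor then picks per trace.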

Production Collector Configuration

Here's my actual production setup:
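
The endpoints, limits, and sampling rates below are illustrative placeholders; the shape is what matters: memory limiting first, noise filtered early, tail sampling, batching, then exporters with queueing and retries:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:             # must run first so overload is refused, not OOM-killed
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 300
  filter/healthchecks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/health"'
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
    sending_queue:
      queue_size: 5000        # absorbs short backend outages
    retry_on_failure:
      enabled: true
  prometheus:
    endpoint: 0.0.0.0:8889

extensions:
  health_check:

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/healthchecks, tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]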

Scaling the Collector

Horizontal Scaling

Run multiple collector instances behind a load balancer:

nginx.conf:
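
A sketch for OTLP over gRPC; nginx needs an http2 listener and grpc_pass, and the collector hostnames below are placeholders:

upstream otel_collectors {
    server otel-collector-1:4317;
    server otel-collector-2:4317;
}

server {
    listen 4317 http2;

    location / {
        grpc_pass grpc://otel_collectors;
    }
}

If your services export OTLP over HTTP (4318) instead, a plain proxy_pass upstream works the same way.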

Kubernetes Deployment
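
The common gateway pattern is a Deployment behind a ClusterIP Service (an agent-per-node DaemonSet is the other option). A sketch, assuming the config lives in a ConfigMap named otel-collector-config with a config.yaml key and the health_check extension enabled:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.98.0   # pin to the version you actually run
          args: ["--config=/etc/otel/config.yaml"]
          ports:
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
            - containerPort: 8888   # internal metrics
          resources:
            requests:
              cpu: 200m
              memory: 400Mi
            limits:
              memory: 2Gi
          readinessProbe:
            httpGet:
              path: /
              port: 13133           # health_check extension
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-collector-config

In practice the official Helm chart or the OpenTelemetry Operator can generate most of this for you.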

Collector Metrics

Monitor the Collector itself:
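
The Collector exposes its own metrics in Prometheus format, by default on port 8888. A sketch of wiring that up (the address-style telemetry config shown here applies to most current releases; very recent ones configure it under readers instead):

# otel-collector-config.yaml
service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888

# prometheus.yml
scrape_configs:
  - job_name: otel-collector
    static_configs:
      - targets: ['otel-collector:8888']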

Key metrics to monitor:
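
  • otelcol_receiver_refused_spans – receivers are pushing back on clients (backpressure)

  • otelcol_exporter_send_failed_spans – exports to a backend are failing

  • otelcol_exporter_queue_size vs. otelcol_exporter_queue_capacity – a growing queue means the backend can't keep up

  • otelcol_processor_refused_spans / otelcol_processor_dropped_spans – the memory limiter or another processor is shedding data

  • The Collector's own CPU and memory usage

(Exact metric names shift a little between Collector versions; check the /metrics output of the version you run.)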

Debugging the Collector

Enable debug logging:
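
This goes in the Collector's own config, under service telemetry:

service:
  telemetry:
    logs:
      level: debug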

Export to console for testing:
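
Wire the debug exporter into a pipeline and everything it receives is printed to the Collector's stdout (in older releases this exporter was called logging):

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]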

Real Production Issue: Collector Overload

Symptom: Services timing out when sending telemetry

Investigation:
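
One quick check is the Collector's internal metrics endpoint (assuming the default :8888 address from the monitoring section):

curl -s http://otel-collector:8888/metrics | grep -E 'queue_size|send_failed'

A full exporter queue that never drains, paired with climbing send failures, points at the backend rather than the Collector itself.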

Root cause: Jaeger couldn't keep up with trace volume

Fix:

  1. Tightened tail sampling (kept 10% of successful requests instead of 100%)

  2. Added memory limiter to prevent OOM

  3. Scaled Collector replicas from 2 β†’ 5

Result: Queue size dropped to <100, no more timeouts

Best Practices

  1. Always use the Collector in production - don't export directly

  2. Enable health checks for Kubernetes probes

  3. Set memory limits to prevent OOM

  4. Monitor Collector metrics - it's critical infrastructure

  5. Use tail-based sampling at Collector for better decisions

  6. Scale horizontally for high-volume environments

  7. Filter early to reduce processing overhead

  8. Batch aggressively to reduce network calls

What's Next

Continue to Performance Optimization to learn:

  • Minimizing instrumentation overhead

  • Optimizing sampler performance

  • Reducing memory usage

  • Benchmarking telemetry impact


Previous: ← Custom Exporters | Next: Performance Optimization →

The Collector is the traffic controller for your telemetry.
