OpenTelemetry Collector

Why the Collector Changed Everything

Before the Collector, each of my 15 microservices sent telemetry directly to Jaeger. This worked fine... until Jaeger went down for maintenance.

Result: 15 services couldn't export telemetry. Export queues filled up. Memory usage spiked. Services started crashing.

After implementing the OpenTelemetry Collector as a central hub:

  • Services send to Collector (localhost, always available)

  • Collector buffers and forwards to backends

  • Backends can go down without affecting services

  • One place to configure routing, filtering, transformation

The Collector is a production necessity, not a nice-to-have.

What Is the Collector?

The OpenTelemetry Collector is a standalone service that:

  1. Receives telemetry from applications

  2. Processes it (filtering, transforming, sampling)

  3. Exports to one or more backends

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Service A   │────▢│                      │────▢│   Jaeger    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚                      β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚   OTel Collector     β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚                      β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Service B   │────▢│  - Receives          │────▢│ Prometheus  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚  - Processes         β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚  - Exports           β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚                      β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Service C   │────▢│                      │────▢│ CloudWatch  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Basic Collector Setup

Install the Collector:
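
One straightforward way to run it locally is the contrib Docker image; the mount path and the use of `latest` below are placeholders you should adjust to your environment:

docker run -d --name otel-collector \
  -p 4317:4317 -p 4318:4318 \
  -v $(pwd)/otel-collector-config.yaml:/etc/otel/config.yaml \
  otel/opentelemetry-collector-contrib:latest \
  --config=/etc/otel/config.yaml

Port 4317 is OTLP over gRPC and 4318 is OTLP over HTTP. Native packages and a Helm chart are also available if Docker isn't an option.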

Basic configuration (otel-collector-config.yaml):
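
A minimal pipeline receives OTLP from your services, batches it, and forwards traces to Jaeger. The jaeger:4317 endpoint assumes Jaeger's native OTLP gRPC port; swap in your own hostname:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]

Every pipeline is just receivers → processors → exporters; everything else in this chapter is a variation on that shape.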

Update your application:
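
The only change on the application side is the export endpoint: point the OTLP exporter at localhost:4317 (or set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317) instead of the backend's address. A minimal sketch, assuming a Python service using the OTLP gRPC exporter:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export to the local Collector; the Collector decides where data goes from here.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)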

Advanced Processing

Filtering Spans

Remove health check spans:
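
A sketch using the contrib filter processor in its OTTL form (recent releases; older ones use an include/exclude match config instead), assuming your HTTP instrumentation records the route in http.route:

processors:
  filter/healthchecks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/health"'
        - 'attributes["http.route"] == "/healthz"'

Add filter/healthchecks to the traces pipeline before batch; spans matching any condition are dropped.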

Sampling at the Collector
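
Because the Collector sees complete traces, it can make smarter sampling decisions than any single service. A sketch using the contrib tail_sampling processor: keep every error and every slow trace, plus 10% of everything else (thresholds and percentage are illustrative):

processors:
  tail_sampling:
    decision_wait: 10s        # how long to hold a trace before deciding
    num_traces: 50000         # traces kept in memory while waiting
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

A trace is kept if any policy matches, so errors and slow requests always get through.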

Adding Attributes

Enrich all spans with environment info:
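
A sketch using the resource processor; the environment and region values are placeholders for whatever your deployment tooling injects:

processors:
  resource/env:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
      - key: cloud.region
        value: us-east-1
        action: upsert

Because this runs in one place, every span, metric, and log flowing through the pipeline gets the same labels without touching application code.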

Redacting Sensitive Data

Remove PII from spans:
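
One way is the attributes processor, deleting or hashing keys that should never leave your network. The attribute names below are examples; match them to what your instrumentation actually records:

processors:
  attributes/redact:
    actions:
      - key: user.email
        action: delete
      - key: http.request.header.authorization
        action: delete
      - key: user.id
        action: hash          # keeps correlation without exposing the raw value

There is also a dedicated redaction processor in contrib that works from an allow-list if you'd rather block everything by default.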

Multi-Backend Routing

Send different data to different backends:
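
Routing by signal type is just a matter of wiring pipelines to different exporters. A sketch matching the diagram above, with traces going to Jaeger and metrics to Prometheus and CloudWatch (the awsemf exporter is part of contrib; region and endpoints are placeholders):

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889    # scraped by Prometheus
  awsemf:
    region: us-east-1         # CloudWatch via Embedded Metric Format

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus, awsemf]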

Service-Specific Routing

Route by service name:
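
A sketch using the contrib routing processor keyed on the service.name resource attribute. The service and backend names here are hypothetical, and newer Collector releases steer you toward the routing connector instead, so check the docs for your version:

processors:
  routing:
    attribute_source: resource
    from_attribute: service.name
    default_exporters: [otlp/jaeger]
    table:
      - value: payment-service
        exporters: [otlp/jaeger, otlp/audit]

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, routing]
      exporters: [otlp/jaeger, otlp/audit]

Every exporter a route can hit must also be listed on the pipeline; the processor then picks per trace.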

Production Collector Configuration

Here's my actual production setup:
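
The endpoints, limits, and sampling rates below are illustrative placeholders; the shape is what matters: memory limiting first, noise filtered early, tail sampling, batching, then exporters with queueing and retries:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:             # must run first so overload is refused, not OOM-killed
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 300
  filter/healthchecks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/health"'
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
    sending_queue:
      queue_size: 5000        # absorbs short backend outages
    retry_on_failure:
      enabled: true
  prometheus:
    endpoint: 0.0.0.0:8889

extensions:
  health_check:

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/healthchecks, tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]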

Scaling the Collector

Horizontal Scaling

Run multiple collector instances behind a load balancer:

nginx.conf:
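
A sketch for OTLP over gRPC; nginx needs an http2 listener and grpc_pass, and the collector hostnames below are placeholders:

upstream otel_collectors {
    server otel-collector-1:4317;
    server otel-collector-2:4317;
}

server {
    listen 4317 http2;

    location / {
        grpc_pass grpc://otel_collectors;
    }
}

If your services export OTLP over HTTP (4318) instead, a plain proxy_pass upstream works the same way.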

Kubernetes Deployment
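
The common gateway pattern is a Deployment behind a ClusterIP Service (an agent-per-node DaemonSet is the other option). A sketch, assuming the config lives in a ConfigMap named otel-collector-config with a config.yaml key and the health_check extension enabled:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.98.0   # pin to the version you actually run
          args: ["--config=/etc/otel/config.yaml"]
          ports:
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
            - containerPort: 8888   # internal metrics
          resources:
            requests:
              cpu: 200m
              memory: 400Mi
            limits:
              memory: 2Gi
          readinessProbe:
            httpGet:
              path: /
              port: 13133           # health_check extension
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-collector-config

In practice the official Helm chart or the OpenTelemetry Operator can generate most of this for you.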

Collector Metrics

Monitor the Collector itself:
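
The Collector exposes its own metrics in Prometheus format, by default on port 8888. A sketch of wiring that up (the address-style telemetry config shown here applies to most current releases; very recent ones configure it under readers instead):

# otel-collector-config.yaml
service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888

# prometheus.yml
scrape_configs:
  - job_name: otel-collector
    static_configs:
      - targets: ['otel-collector:8888']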

Key metrics to monitor:
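
  • otelcol_receiver_refused_spans – receivers are pushing back on clients (backpressure)

  • otelcol_exporter_send_failed_spans – exports to a backend are failing

  • otelcol_exporter_queue_size vs. otelcol_exporter_queue_capacity – a growing queue means the backend can't keep up

  • otelcol_processor_refused_spans / otelcol_processor_dropped_spans – the memory limiter or another processor is shedding data

  • The Collector's own CPU and memory usage

(Exact metric names shift a little between Collector versions; check the /metrics output of the version you run.)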

Debugging the Collector

Enable debug logging:
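
This goes in the Collector's own config, under service telemetry:

service:
  telemetry:
    logs:
      level: debug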

Export to console for testing:
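
Wire the debug exporter into a pipeline and everything it receives is printed to the Collector's stdout (in older releases this exporter was called logging):

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]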

Real Production Issue: Collector Overload

Symptom: Services timing out when sending telemetry

Investigation:
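
One quick check is the Collector's internal metrics endpoint (assuming the default :8888 address from the monitoring section):

curl -s http://otel-collector:8888/metrics | grep -E 'queue_size|send_failed'

A full exporter queue that never drains, paired with climbing send failures, points at the backend rather than the Collector itself.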

Root cause: Jaeger couldn't keep up with trace volume

Fix:

  1. Tightened tail sampling (kept 10% of successful requests instead of 100%)

  2. Added memory limiter to prevent OOM

  3. Scaled Collector replicas from 2 β†’ 5

Result: Queue size dropped to <100, no more timeouts

Best Practices

  1. Always use the Collector in production - don't export directly

  2. Enable health checks for Kubernetes probes

  3. Set memory limits to prevent OOM

  4. Monitor Collector metrics - it's critical infrastructure

  5. Use tail-based sampling at Collector for better decisions

  6. Scale horizontally for high-volume environments

  7. Filter early to reduce processing overhead

  8. Batch aggressively to reduce network calls

What's Next

Continue to Performance Optimization to learn:

  • Minimizing instrumentation overhead

  • Optimizing sampler performance

  • Reducing memory usage

  • Benchmarking telemetry impact


Previous: ← Custom Exporters | Next: Performance Optimization →

The Collector is the traffic controller for your telemetry.
