Sampling Strategies

The Cost of Complete Observability

Three months into production with OpenTelemetry, I got a wake-up call: my observability bill was $12,000/month. We were processing 50 million requests per day, and I was tracing every single one.

The math was brutal:

  • 50M requests/day = ~580 requests/second

  • Average trace size: 15KB (across 7 microservices)

  • Daily data: 50M × 15KB = 750GB/day

  • Monthly storage: 22.5TB

At $0.50/GB storage and $0.10/GB ingestion: $13,500/month for traces alone.

The solution? Smart sampling. I kept 100% visibility into errors while sampling only 1% of successful requests. New cost: $800/month. Same debugging capability.

Understanding Sampling

Sampling means deciding which traces to keep and which to discard.

Head-Based Sampling

Decision made at trace creation (the "head" of the trace).

Pros:

  • Low overhead

  • Decision made early

  • Easy to implement

Cons:

  • Can't see the future (no way to know yet whether the trace will end in an error)

  • Might discard interesting traces

  • No post-filtering

Tail-Based Sampling

Decision made after trace completes (the "tail").

Pros:

  • Can see the entire trace before deciding

  • Can keep every error and every slow request

  • More intelligent decisions

Cons:

  • Higher overhead

  • Requires buffering complete traces in memory

  • Needs a centralized collector

Built-In Samplers

1. Always On (Don't Use in Production!)

Keeps every trace. Only use for development or very low-volume services.

2. Always Off (Also Don't Use!)

Discards every trace. Why even run OTel?

3. Ratio-Based Sampling (Production Standard)

How it works: The decision is derived deterministically from the trace ID, so the same trace ID always gets the same decision in every service.

Use when: You want simple, stateless sampling.
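
For reference, a minimal sketch of how these built-in samplers are wired up with the OpenTelemetry Python SDK; the 1% ratio is just an example value:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    ALWAYS_OFF,         # 2. drop every trace
    ALWAYS_ON,          # 1. keep every trace (development only)
    TraceIdRatioBased,  # 3. deterministic ratio derived from the trace ID
)

# Keep roughly 1% of traces; the same trace ID always gets the same decision.
provider = TracerProvider(sampler=TraceIdRatioBased(0.01))
```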

4. Parent-Based Sampling (The Smart Default)

Critical for distributed tracing: it honors the parent span's sampling decision, so every service in a trace keeps or drops that trace consistently.

Example:
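
A typical Python setup wraps a ratio sampler in ParentBased: root spans are ratio-sampled, and child spans simply follow their parent's decision (the 1% value is illustrative):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Root spans: sample 1% of new traces.
# Child spans: follow whatever the parent (possibly in another service) decided.
sampler = ParentBased(root=TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```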

Custom Sampling: The Production Solution

Here's the sampler I actually use in production:
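
At its core it keeps every flagged error, gives a handful of critical endpoints a higher rate, and ratio-samples everything else. Below is a minimal sketch of that logic with the OpenTelemetry Python SDK; the class name, endpoint list, and rates are illustrative rather than the exact production values:

```python
from opentelemetry.sdk.trace.sampling import (
    Decision,
    Sampler,
    SamplingResult,
    TraceIdRatioBased,
)


class PrioritySampler(Sampler):
    """Keep flagged errors, more of the critical endpoints, 1% of the rest."""

    def __init__(self, base_rate=0.01, critical_rate=0.10,
                 critical_endpoints=("/checkout", "/payment")):
        self._base = TraceIdRatioBased(base_rate)
        self._critical = TraceIdRatioBased(critical_rate)
        self._critical_endpoints = critical_endpoints

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}

        # Errors are only catchable at the head if instrumentation flags them
        # up front (e.g. a retry of a previously failed request); full error
        # capture needs the tail-based setup described later in this chapter.
        if attributes.get("error"):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Critical endpoints get a higher rate.
        route = str(attributes.get("http.route", ""))
        if any(route.startswith(ep) for ep in self._critical_endpoints):
            return self._critical.should_sample(
                parent_context, trace_id, name, kind, attributes, links, trace_state)

        # Everything else: the low baseline ratio.
        return self._base.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state)

    def get_description(self):
        return "PrioritySampler"
```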

Measuring Request Duration for Sampling

The problem: How do you know if a request is "slow" at the start of the trace?

Answer: You don't. But you can make an educated guess:
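
One workable sketch: keep a rolling average of recent durations per endpoint, and treat new requests to historically slow endpoints as "probably slow". The LatencyTracker name and thresholds below are illustrative:

```python
import time


class LatencyTracker:
    """Rolling average of observed request durations, keyed by endpoint."""

    def __init__(self, alpha=0.1):
        self._alpha = alpha     # weight of the newest observation
        self._avg_ms = {}       # endpoint -> exponential moving average (ms)
        self._last_seen = {}    # endpoint -> last update timestamp

    def record(self, endpoint, duration_ms):
        # Exponential moving average: cheap, no per-request history to keep.
        if endpoint in self._avg_ms:
            prev = self._avg_ms[endpoint]
            self._avg_ms[endpoint] = (1 - self._alpha) * prev + self._alpha * duration_ms
        else:
            self._avg_ms[endpoint] = duration_ms
        self._last_seen[endpoint] = time.time()

    def is_probably_slow(self, endpoint, threshold_ms=1000):
        return self._avg_ms.get(endpoint, 0) >= threshold_ms
```

The sampler consults is_probably_slow(route) at span creation and applies a higher rate when it returns True; record() is called when the span ends.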

Warning: This has memory implications. Clean up old entries:
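
A simple guard, continuing the LatencyTracker sketch above: periodically evict endpoints that haven't been seen recently so the dictionaries can't grow without bound.

```python
    # Additional method on the LatencyTracker sketch above:
    def cleanup(self, max_age_seconds=3600):
        """Drop endpoints that haven't been observed for max_age_seconds."""
        cutoff = time.time() - max_age_seconds
        stale = [ep for ep, seen in self._last_seen.items() if seen < cutoff]
        for ep in stale:
            self._avg_ms.pop(ep, None)
            self._last_seen.pop(ep, None)
```

Call it periodically, for example from a background thread or every few thousand calls to record().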

Tail-Based Sampling with Collector

For true tail-based sampling, use the OpenTelemetry Collector:

collector-config.yaml:
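
A representative configuration using the tail_sampling processor from the Collector contrib distribution; the policy names, thresholds, and backend endpoint are placeholders to adapt:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  tail_sampling:
    decision_wait: 10s          # how long to buffer spans before deciding
    num_traces: 100000          # max traces held in memory
    policies:
      - name: keep-all-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow-requests
        type: latency
        latency:
          threshold_ms: 1000
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 1

exporters:
  otlp:
    endpoint: your-backend:4317   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
```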

Run the collector:
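
For example, with Docker and the contrib image (which ships the tail_sampling processor); the image tag and in-container config path may vary by version:

```bash
docker run --rm \
  -p 4317:4317 \
  -v "$(pwd)/collector-config.yaml:/etc/otelcol-contrib/config.yaml" \
  otel/opentelemetry-collector-contrib:latest
```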

Update your application to send to collector:
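
A sketch with the Python OTLP exporter, assuming the collector from the previous step is reachable on localhost:4317. Tail-based sampling only works if the application exports every span, so head sampling stays wide open here:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, ParentBased

# Send everything; the collector's tail_sampling processor makes the decision.
provider = TracerProvider(sampler=ParentBased(root=ALWAYS_ON))
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)
```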

Sampling Metrics: What Am I Actually Keeping?

Track your sampling decisions:
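
One way to do this is a counter incremented inside the sampler; the metric name and labels below are illustrative:

```python
from opentelemetry import metrics

meter = metrics.get_meter("sampling")

# Hypothetical counter; a Prometheus exporter would typically expose it
# as sampling_decisions_total.
sampling_decisions = meter.create_counter(
    "sampling.decisions",
    description="Sampling decisions, labeled by outcome and reason",
)

# Inside the sampler, right before returning each decision:
sampling_decisions.add(1, {"decision": "keep", "reason": "error"})
sampling_decisions.add(1, {"decision": "drop", "reason": "baseline"})
```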

Query in Prometheus:
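
Assuming the counter above lands in Prometheus as sampling_decisions_total, the effective keep rate and its breakdown look something like:

```promql
# Overall share of requests kept over the last 5 minutes
sum(rate(sampling_decisions_total{decision="keep"}[5m]))
  /
sum(rate(sampling_decisions_total[5m]))

# Kept traces per second, broken down by reason
sum by (reason) (rate(sampling_decisions_total{decision="keep"}[5m]))
```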

Real Production Sampling Strategy

Here's what I actually run:
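
In rate terms, the strategy reduces to a small table; here is a sketch of how the custom sampler from earlier can encode it (the values match the results table that follows):

```python
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Category -> sampling rate; values match the results table below.
RATES = {
    "error": 1.00,              # keep every error
    "slow": 0.50,               # requests predicted slow (see LatencyTracker above)
    "critical_endpoint": 0.10,  # e.g. /checkout, /payment
    "default": 0.01,            # everything else
}

# Pre-built ratio samplers the custom sampler picks from per request.
SAMPLERS = {category: TraceIdRatioBased(rate) for category, rate in RATES.items()}
```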

Results at 50M requests/day:

| Category | Requests/day | Sample Rate | Traces Kept | Cost |
| --- | --- | --- | --- | --- |
| Errors (0.5%) | 250,000 | 100% | 250,000 | $187.50 |
| Slow (2%) | 1,000,000 | 50% | 500,000 | $375.00 |
| Critical endpoints (10%) | 5,000,000 | 10% | 500,000 | $375.00 |
| Normal traffic | 43,750,000 | 1% | 437,500 | $328.13 |
| Total | 50,000,000 | 3.375% | 1,687,500 | $1,265.63 |

Down from $13,500 to $1,266, a 90% cost reduction, while still catching every error!

Adaptive Sampling (Advanced)

Dynamically adjust sampling based on load:
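
One common approach, sketched here on the assumption that you can periodically read your current request rate from metrics: pick a budget of kept traces per second and recompute the ratio from recent throughput. Names and numbers are illustrative:

```python
import threading


class AdaptiveRatio:
    """Recompute the sampling ratio so kept traces stay near a fixed budget."""

    def __init__(self, target_traces_per_sec=50.0, min_rate=0.001, max_rate=1.0):
        self._target = target_traces_per_sec
        self._min = min_rate
        self._max = max_rate
        self._rate = max_rate
        self._lock = threading.Lock()

    @property
    def rate(self):
        with self._lock:
            return self._rate

    def update(self, observed_requests_per_sec):
        # Called periodically (e.g. every 30s) with the current request rate.
        if observed_requests_per_sec <= 0:
            return
        new_rate = self._target / observed_requests_per_sec
        with self._lock:
            self._rate = max(self._min, min(self._max, new_rate))
```

The custom sampler then rebuilds its baseline TraceIdRatioBased whenever rate changes, trading strict per-trace-ID determinism across time for a predictable trace volume.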

Best Practices

  1. Always sample errors - you can't debug what you don't see

  2. Use parent-based sampling for distributed traces

  3. Track sampling metrics to understand what you're keeping

  4. Start conservative (1%), increase if needed

  5. Monitor costs - set budget alerts

  6. Test sampling in staging before production

  7. Document sampling logic for your team

Common Pitfalls

❌ Sampling after the fact (exporting everything, then filtering in the backend)

✅ Sample at trace creation or in the collector, before you pay to store the data

❌ Different sampling rates per service (you end up with broken, partial traces)

✅ Parent-based sampling, so every service follows the same decision

What's Next

Continue to Resource Detection, where you'll learn:

  • Automatic service identification

  • Environment metadata

  • Deployment information

  • Custom resource attributes


Previous: ← Distributed Tracing | Next: Resource Detection →

Sample smart, not hard. Keep what matters.
