Three months into production with OpenTelemetry, I got a wake-up call: my observability bill was $12,000/month. We were processing 50 million requests per day, and I was tracing every single one.
The math was brutal:
50M requests/day = ~580 requests/second
Average trace size: 15KB (across 7 microservices)
Daily data: 50M × 15KB = 750GB/day
Monthly storage: 22.5TB
At $0.50/GB storage and $0.10/GB ingestion: $13,500/month for traces alone.
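The arithmetic above can be reproduced with a small helper. This is only a back-of-the-envelope sketch: the function and field names are made up for illustration, decimal units are assumed (1 GB = 1,000,000 KB), and a 30-day month is assumed.

```typescript
// Rough trace cost model using the figures quoted above.
// Assumptions: decimal units and a 30-day month; all names are illustrative.
interface TraceCostInput {
  requestsPerDay: number;
  avgTraceKB: number;
  storagePerGB: number;   // USD per GB stored
  ingestionPerGB: number; // USD per GB ingested
  daysPerMonth: number;
}

function monthlyTraceCost(input: TraceCostInput) {
  const dailyGB = (input.requestsPerDay * input.avgTraceKB) / 1000000;
  const monthlyGB = dailyGB * input.daysPerMonth;
  const costUSD = monthlyGB * (input.storagePerGB + input.ingestionPerGB);
  return { dailyGB, monthlyGB, costUSD };
}

const estimate = monthlyTraceCost({
  requestsPerDay: 50000000,
  avgTraceKB: 15,
  storagePerGB: 0.5,
  ingestionPerGB: 0.1,
  daysPerMonth: 30,
});
console.log(estimate);
```

Changing `requestsPerDay` or `avgTraceKB` shows how quickly the bill scales with traffic and trace size.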
The solution? Smart sampling. I kept 100% visibility into errors while sampling only 1% of successful requests. New cost: $800/month. Same debugging capability.
Understanding Sampling
Sampling means deciding which traces to keep and which to discard.
Head-Based Sampling
Decision made at trace creation (the "head" of the trace).
Pros:
Low overhead
Decision made early
Easy to implement
Cons:
Can't see the future (you don't know yet whether the trace will error)
Might discard interesting traces
No post-filtering
Tail-Based Sampling
Decision made after trace completes (the "tail").
Pros:
Can see entire trace before deciding
Keep all errors, slow requests
More intelligent decisions
Cons:
Higher overhead
Requires buffering
Needs centralized collector
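To make the buffering requirement concrete, here is a minimal in-memory sketch of a tail-based decision. The `FinishedSpan` shape and `TailBuffer` class are hypothetical and much simpler than what a real collector does; they only illustrate why the whole trace must be held before deciding.

```typescript
// Hypothetical, simplified span shape; not the OpenTelemetry SDK's span type.
interface FinishedSpan {
  traceId: string;
  durationMs: number;
  isError: boolean;
}

// Buffers spans per trace and decides only when the trace is flushed.
class TailBuffer {
  private readonly traces = new Map<string, FinishedSpan[]>();

  constructor(private readonly slowThresholdMs: number) {}

  add(span: FinishedSpan): void {
    const spans = this.traces.get(span.traceId) ?? [];
    spans.push(span);
    this.traces.set(span.traceId, spans);
  }

  // Call once the trace is assumed complete (for example after a wait window).
  // Returns the spans to export, or an empty array if the trace is dropped.
  flush(traceId: string): FinishedSpan[] {
    const spans = this.traces.get(traceId) ?? [];
    this.traces.delete(traceId);
    const keep = spans.some(
      (s) => s.isError || s.durationMs >= this.slowThresholdMs,
    );
    return keep ? spans : [];
  }
}
```

Every span sits in memory until `flush` runs, which is where the extra overhead and the need for a central component come from.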
Built-In Samplers
1. Always On (Don't Use in Production!)
Keeps every trace. Only use for development or very low-volume services.
2. Always Off (Also Don't Use!)
Discards every trace. Why even run OTel?
3. Ratio-Based Sampling (Production Standard)
How it works: The decision is derived deterministically from the trace ID, so the same trace ID always gets the same decision.
Use when: You want simple, stateless sampling.
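The idea of a deterministic, trace-ID-based decision can be sketched in a few lines. This is a simplified illustration, not the SDK's actual implementation: `sampleByTraceId` is a hypothetical helper that treats the last 8 hex characters of the trace ID as a number and compares it against a threshold.

```typescript
// Simplified illustration of a stateless ratio decision.
// NOT the SDK's source; the function name and bucketing scheme are assumptions.
function sampleByTraceId(traceId: string, ratio: number): boolean {
  if (ratio <= 0) return false;
  if (ratio >= 1) return true;
  // Interpret the last 8 hex characters as an unsigned 32-bit integer.
  const bucket = parseInt(traceId.slice(-8), 16);
  if (isNaN(bucket)) return false; // malformed ID: drop
  // Keep the trace when its bucket falls below ratio * 2^32.
  return bucket < ratio * 0x100000000;
}
```

Because nothing but the trace ID and the ratio goes into the decision, no state has to be shared between processes.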
4. Parent-Based Sampling (The Smart Default)
Critical for distributed tracing: Ensures all services in a trace make the same sampling decision.
Example:
Custom Sampling: The Production Solution
Here's the sampler I actually use in production:
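A minimal sketch of what such a sampler might look like (an illustration, not the exact production code). It is written as plain functions so the logic is easy to test; in a real service the same decision could live inside a class implementing the SDK's `Sampler` interface. The route prefixes, the health-check rule, and all names here are assumptions. Errors cannot be detected at this stage, since head-based decisions happen before the request runs.

```typescript
// Hypothetical information available when a root span starts.
interface RootSpanInfo {
  traceId: string; // 32 hex characters
  route: string;   // e.g. "/api/checkout"
}

interface RouteRule {
  prefix: string;
  ratio: number; // 0..1
}

// Illustrative rules only; the routes and the health-check rule are assumptions.
const RULES: RouteRule[] = [
  { prefix: "/health", ratio: 0 },         // never trace health checks
  { prefix: "/api/checkout", ratio: 0.1 }, // critical endpoint: 10%
];
const DEFAULT_RATIO = 0.01; // normal traffic: 1%

function ratioFor(route: string): number {
  for (const rule of RULES) {
    if (route.indexOf(rule.prefix) === 0) return rule.ratio;
  }
  return DEFAULT_RATIO;
}

// Maps the last 8 hex characters of the trace ID to a value in [0, 1).
function traceBucket(traceId: string): number {
  const n = parseInt(traceId.slice(-8), 16);
  return isNaN(n) ? 1 : n / 0x100000000; // malformed IDs are never sampled
}

function shouldSampleRoot(info: RootSpanInfo): boolean {
  return traceBucket(info.traceId) < ratioFor(info.route);
}
```

Deriving the decision from the trace ID rather than `Math.random()` keeps it reproducible for a given trace.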
Measuring Request Duration for Sampling
The problem: How do you know if a request is "slow" at the start of the trace?
Answer: You don't. But you can make an educated guess:
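One way to make that guess, sketched here under assumptions, is to remember how slow each route has been recently and treat historically slow routes as likely to be slow again. The `LatencyTracker` class, its smoothing factor, and the threshold are all hypothetical.

```typescript
// Tracks an exponentially weighted moving average (EWMA) of latency per route.
// Hypothetical helper; names and defaults are assumptions.
class LatencyTracker {
  private readonly ewmaMs = new Map<string, number>();

  constructor(
    private readonly slowThresholdMs: number,
    private readonly alpha: number = 0.2, // weight given to the newest sample
  ) {}

  // Call when a request finishes.
  record(route: string, durationMs: number): void {
    const prev = this.ewmaMs.get(route);
    const next =
      prev === undefined
        ? durationMs
        : this.alpha * durationMs + (1 - this.alpha) * prev;
    this.ewmaMs.set(route, next);
  }

  // Call at the start of a trace to guess whether this request may be slow.
  isLikelySlow(route: string): boolean {
    const avg = this.ewmaMs.get(route);
    return avg !== undefined && avg >= this.slowThresholdMs;
  }
}
```

A head sampler could then raise the sampling ratio for routes where `isLikelySlow` returns true. It remains a guess: a normally fast route that suddenly degrades will be missed until its average catches up.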
Warning: This has memory implications. Clean up old entries:
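A cleanup pass might look like the following sketch, which assumes each entry records when it was last updated. `TimedEntry` and `pruneStale` are hypothetical names.

```typescript
// Hypothetical entry shape: a value plus the time it was last touched.
interface TimedEntry<T> {
  value: T;
  lastSeenMs: number;
}

// Removes entries not updated within maxAgeMs; returns how many were evicted.
function pruneStale<T>(
  entries: Map<string, TimedEntry<T>>,
  nowMs: number,
  maxAgeMs: number,
): number {
  let evicted = 0;
  entries.forEach((entry, key) => {
    if (nowMs - entry.lastSeenMs > maxAgeMs) {
      entries.delete(key); // deleting during Map.forEach is safe
      evicted++;
    }
  });
  return evicted;
}
```

Running this on a timer, for example `setInterval(() => pruneStale(stats, Date.now(), 600000), 60000)`, keeps the map bounded by the number of routes seen in the last ten minutes.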
Tail-Based Sampling with Collector
For true tail-based sampling, use the OpenTelemetry Collector:
collector-config.yaml:
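A minimal configuration sketch using the `tail_sampling` processor, which ships in the Collector's contrib distribution. The thresholds, buffer sizes, and the exporter endpoint (`jaeger:4317`) are assumptions to adapt, not recommended values. Note that this sketch keeps every slow trace rather than a fraction of them.

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  tail_sampling:
    decision_wait: 10s                 # how long to buffer a trace before deciding
    num_traces: 100000                 # max traces held in memory
    expected_new_traces_per_sec: 600
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 1000           # assumed definition of "slow"
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 1

exporters:
  otlp:
    endpoint: jaeger:4317              # assumed backend address
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
```

A trace is kept if any policy matches, so errors and slow requests survive even when the 1% baseline would have dropped them.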
Run the collector:
Update your application to send traces to the collector:
Sampling Metrics: What Am I Actually Keeping?
Track your sampling decisions:
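As a minimal sketch, the bookkeeping can be as simple as two counters per reason. The `SamplingStats` class and the reason names are hypothetical; in a real service the counts would be exported through a metrics library so that Prometheus can scrape them.

```typescript
// Hypothetical set of reasons a sampler might report.
type SamplingReason = "error" | "slow" | "critical" | "baseline";

class SamplingStats {
  private readonly kept = new Map<SamplingReason, number>();
  private readonly dropped = new Map<SamplingReason, number>();

  // Record one sampling decision.
  record(reason: SamplingReason, sampled: boolean): void {
    const target = sampled ? this.kept : this.dropped;
    target.set(reason, (target.get(reason) ?? 0) + 1);
  }

  keptCount(reason: SamplingReason): number {
    return this.kept.get(reason) ?? 0;
  }

  // Fraction of all decisions that kept the trace (0 if nothing recorded yet).
  keptRatio(): number {
    let keptTotal = 0;
    let droppedTotal = 0;
    this.kept.forEach((n) => { keptTotal += n; });
    this.dropped.forEach((n) => { droppedTotal += n; });
    const total = keptTotal + droppedTotal;
    return total === 0 ? 0 : keptTotal / total;
  }
}
```

Splitting the counts by reason shows which rule is responsible for most of the retained volume.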
Query in Prometheus:
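A query along these lines would show the effective sampling rate. It assumes the decisions are exported as a counter named `sampling_decisions_total` with a `decision` label; both names are hypothetical and need to match whatever your service actually emits.

```promql
# Share of sampling decisions that kept the trace over the last 5 minutes.
# sampling_decisions_total and its "decision" label are hypothetical names.
sum(rate(sampling_decisions_total{decision="sampled"}[5m]))
  /
sum(rate(sampling_decisions_total[5m]))
```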
Real Production Sampling Strategy
Here's what I actually run:
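The tiers can be expressed as a single function that maps a finished request to a keep probability. This is an illustrative sketch, not the exact production code: the rates come from the results table, while the `TraceSummary` shape, the slow threshold, and the critical route prefixes are assumptions. Because it needs the error status and duration, it would have to run after the trace completes.

```typescript
// Hypothetical summary of a finished request.
interface TraceSummary {
  hasError: boolean;
  durationMs: number;
  route: string;
}

interface TierConfig {
  slowThresholdMs: number;    // assumption: what counts as "slow"
  criticalPrefixes: string[]; // assumption: which routes are critical
}

// Returns the keep probability for a trace; the first matching tier wins.
function keepProbability(t: TraceSummary, cfg: TierConfig): number {
  if (t.hasError) return 1.0;                              // errors: 100%
  if (t.durationMs >= cfg.slowThresholdMs) return 0.5;     // slow: 50%
  const critical = cfg.criticalPrefixes.some((p) => t.route.indexOf(p) === 0);
  if (critical) return 0.1;                                // critical: 10%
  return 0.01;                                             // normal: 1%
}

// The random source is injectable so the decision can be tested deterministically.
function shouldKeep(
  t: TraceSummary,
  cfg: TierConfig,
  random: () => number = Math.random,
): boolean {
  return random() < keepProbability(t, cfg);
}
```

Ordering matters: a failing request on a critical endpoint is kept at 100% because the error check runs first.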
Results at 50M requests/day:
| Category | Requests/day | Sample Rate | Traces Kept | Cost (monthly) |
|---|---|---|---|---|
| Errors (0.5%) | 250,000 | 100% | 250,000 | $187.50 |
| Slow (2%) | 1,000,000 | 50% | 500,000 | $375.00 |
| Critical endpoints (10%) | 5,000,000 | 10% | 500,000 | $375.00 |
| Normal traffic | 43,750,000 | 1% | 437,500 | $328.13 |
| Total | 50,000,000 | 3.375% | 1,687,500 | $1,265.63 |
Down from $13,500 to $1,266: a 90% cost reduction, while still catching every error!
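The volume side of these results can be checked with a few lines of arithmetic. This sketch only recomputes the kept-trace counts and the effective sampling rate from the per-tier figures; it does not attempt to reproduce the dollar amounts. The `Tier` type and `summarize` helper are illustrative names.

```typescript
interface Tier {
  name: string;
  requestsPerDay: number;
  sampleRate: number; // 0..1
}

// Per-tier request volumes and rates, taken from the results above.
const tiers: Tier[] = [
  { name: "errors", requestsPerDay: 250000, sampleRate: 1.0 },
  { name: "slow", requestsPerDay: 1000000, sampleRate: 0.5 },
  { name: "critical", requestsPerDay: 5000000, sampleRate: 0.1 },
  { name: "normal", requestsPerDay: 43750000, sampleRate: 0.01 },
];

// Sums requests and kept traces, then derives the overall sampling rate.
function summarize(rows: Tier[]) {
  let requests = 0;
  let kept = 0;
  for (const row of rows) {
    requests += row.requestsPerDay;
    kept += row.requestsPerDay * row.sampleRate;
  }
  return { requests, kept, effectiveRate: kept / requests };
}

const summary = summarize(tiers);
console.log(summary);
```

The tiers add up to 50,000,000 requests and 1,687,500 kept traces per day, an effective rate of roughly 3.4%.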
Adaptive Sampling (Advanced)
Dynamically adjust sampling based on load:
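One simple way to do this, sketched under assumptions, is to fix a budget of kept traces per second and derive the rate from the observed request rate, clamped to sane bounds. The `AdaptiveConfig` fields and the `adaptiveRate` function are hypothetical.

```typescript
interface AdaptiveConfig {
  targetTracesPerSecond: number; // budget of kept traces
  minRate: number;               // never sample less than this
  maxRate: number;               // never sample more than this
}

// Picks a sampling rate so that kept traces stay near the target budget.
function adaptiveRate(observedRps: number, cfg: AdaptiveConfig): number {
  if (observedRps <= 0) return cfg.maxRate; // no traffic: keep whatever arrives
  const ideal = cfg.targetTracesPerSecond / observedRps;
  return Math.min(cfg.maxRate, Math.max(cfg.minRate, ideal));
}
```

At roughly 580 requests/second with a budget of 10 traces/second this gives a rate of about 1.7%; during a quiet period the rate rises toward `maxRate`, and during a spike it falls toward `minRate`. Recomputing the rate on a timer from a rolling request count avoids reacting to momentary bursts.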
Best Practices
Always sample errors - you can't debug what you don't see
Use parent-based sampling for distributed traces
Track sampling metrics to understand what you're keeping
// 1. Always On: keeps every trace (development only)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { AlwaysOnSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  sampler: new AlwaysOnSampler(),
  // ... other config
});
// 2. Always Off: discards every trace
import { NodeSDK } from '@opentelemetry/sdk-node';
import { AlwaysOffSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  sampler: new AlwaysOffSampler(),
  // ... other config
});
// 3. Ratio-based sampling
import { NodeSDK } from '@opentelemetry/sdk-node';
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // Sample 5% of traces
  sampler: new TraceIdRatioBasedSampler(0.05),
  // ... other config
});
// 4. Parent-based sampling
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    // Root spans (no parent): sample 10%
    root: new TraceIdRatioBasedSampler(0.1),
    // If parent was sampled, sample this too
    // If parent was not sampled, don't sample this
    // (This keeps distributed traces consistent)
  }),
  // ... other config
});
API Gateway samples 10% → samples trace abc123
Order Service sees parent sampled → also samples trace abc123
Payment Service sees parent sampled → also samples trace abc123
Result: Complete trace, or no trace. Never partial.
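// Application setup when the collector makes the sampling decision
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { AlwaysOnSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // Collector, not Jaeger
  }),
  // Use AlwaysOnSampler - let collector decide
  sampler: new AlwaysOnSampler(),
});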