Metrics Collection

The Metrics vs Traces Dilemma

When I first started with OpenTelemetry, I tried to answer every question with traces. "How many orders per minute?" I'd count spans. "What's our error rate?" I'd filter failed spans. "Average response time?" Span duration aggregation.

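This was a terrible idea. Traces are expensive: at high volume you can't keep every single one. I was sampling 10% of traffic, which meant my "metrics" were built on a tenth of the data. Raw counts were off by roughly 10x unless I scaled them back up, and rare events such as errors could slip through unsampled.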

Then I discovered proper metrics, and everything clicked. Traces are for debugging individual requests. Metrics are for understanding system behavior over time.

The Three Types of Metrics

1. Counter: Counting Events

Counters only go up. They track cumulative totals.

When to use:

  • Total requests processed

  • Total errors encountered

  • Total bytes sent

  • Total orders completed

import { metrics } from '@opentelemetry/api';
import { MeterProvider, PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

// Set up metrics
const metricExporter = new OTLPMetricExporter({
  url: 'http://localhost:4318/v1/metrics',
});

const meterProvider = new MeterProvider({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'order-service',
  }),
  readers: [
    new PeriodicExportingMetricReader({
      exporter: metricExporter,
      exportIntervalMillis: 10000, // Export every 10 seconds
    }),
  ],
});

metrics.setGlobalMeterProvider(meterProvider);

const meter = metrics.getMeter('order-service', '1.0.0');

// Create counters
const orderCounter = meter.createCounter('orders.created', {
  description: 'Total number of orders created',
  unit: '1',
});

const errorCounter = meter.createCounter('orders.errors', {
  description: 'Total number of order processing errors',
  unit: '1',
});

const revenueCounter = meter.createCounter('revenue.total', {
  description: 'Total revenue in USD',
  unit: 'USD',
});

// Using counters
export async function createOrder(userId: string, amount: number, items: any[]): Promise<Order> {
  try {
    const order = await saveOrderToDatabase(userId, amount, items);
    
    // Increment counter with attributes
    orderCounter.add(1, {
      'order.status': 'completed',
      'user.tier': await getUserTier(userId),
      'order.channel': 'web'
    });
    
    // Track revenue
    revenueCounter.add(amount, {
      'currency': 'USD',
      'payment.method': 'credit_card'
    });
    
    return order;
  } catch (error) {
    // Track errors
    errorCounter.add(1, {
      'error.type': (error as Error).name,
      'operation': 'createOrder'
    });
    throw error;
  }
}

async function saveOrderToDatabase(userId: string, amount: number, items: any[]): Promise<Order> {
  // Simulated
  return {
    id: `ORD-${Date.now()}`,
    userId,
    amount,
    items,
    status: 'completed',
    createdAt: new Date()
  };
}

async function getUserTier(userId: string): Promise<string> {
  return 'premium'; // Simulated
}

interface Order {
  id: string;
  userId: string;
  amount: number;
  items: any[];
  status: string;
  createdAt: Date;
}

2. Gauge: Measuring Current State

Gauges report the current value of something that can go up or down. Each reading replaces the previous one instead of adding to it.

When to use:

  • Current memory usage

  • Active connections

  • Queue size

  • Items in cart

  • Current temperature
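Here's a minimal sketch of a gauge for queue size. The `JobQueue` class, the `queue.size` metric name, and the `queue.name` attribute are hypothetical, and the small `MeterLike` interfaces are local stand-ins shaped after the `createObservableGauge` / `addCallback` / `observe` calls in `@opentelemetry/api`, so the snippet runs without the SDK installed.

```typescript
type MetricAttributes = Record<string, string | number | boolean>;

// Local stand-ins shaped after ObservableResult, ObservableGauge, and the
// one Meter method used here (assumption: mirrors '@opentelemetry/api').
interface ObservableResultLike {
  observe(value: number, attributes?: MetricAttributes): void;
}
interface ObservableGaugeLike {
  addCallback(callback: (result: ObservableResultLike) => void): void;
}
interface MeterLike {
  createObservableGauge(
    name: string,
    options?: { description?: string; unit?: string },
  ): ObservableGaugeLike;
}

// Hypothetical in-memory queue whose size we want to watch.
export class JobQueue {
  private jobs: string[] = [];
  enqueue(job: string): void {
    this.jobs.push(job);
  }
  dequeue(): string | undefined {
    return this.jobs.shift();
  }
  get size(): number {
    return this.jobs.length;
  }
}

export function observeQueueSize(meter: MeterLike, queue: JobQueue, queueName: string): void {
  const gauge = meter.createObservableGauge('queue.size', {
    description: 'Number of jobs currently waiting in the queue',
    unit: '{job}',
  });
  // The callback runs on each collection cycle, so the reported value is
  // whatever the queue holds at that moment, not a running total.
  gauge.addCallback((result) => {
    result.observe(queue.size, { 'queue.name': queueName });
  });
}
```

For values you change yourself, such as active connections, `meter.createUpDownCounter(...)` with `add(1)` on open and `add(-1)` on close is the usual alternative to a callback.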

3. Histogram: Distribution of Values

Histograms group individual measurements into buckets, so you can see the distribution of values (percentiles, not just an average) over time.

When to use:

  • Request duration

  • Request payload size

  • Order value distribution

  • Database query duration
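As a sketch of how request or query duration can be fed into a histogram, the helper below times an async function and records the elapsed milliseconds. `recordDuration`, the `outcome` attribute, and the injectable `now` clock are my own illustrative choices. `HistogramLike` is a local stand-in for the `record(value, attributes)` shape of a histogram created with `meter.createHistogram(...)` from `@opentelemetry/api`.

```typescript
type MetricAttributes = Record<string, string | number | boolean>;

// Local stand-in shaped after the Histogram interface in '@opentelemetry/api'.
interface HistogramLike {
  record(value: number, attributes?: MetricAttributes): void;
}

// Runs `fn` and records how long it took, in milliseconds.
// The duration is recorded whether `fn` resolves or rejects; the 'outcome'
// attribute lets a dashboard separate slow successes from slow failures.
export async function recordDuration<T>(
  histogram: HistogramLike,
  attributes: MetricAttributes,
  fn: () => Promise<T>,
  now: () => number = () => Date.now(),
): Promise<T> {
  const start = now();
  let outcome = 'success';
  try {
    return await fn();
  } catch (error) {
    outcome = 'error';
    throw error;
  } finally {
    histogram.record(now() - start, { ...attributes, outcome });
  }
}
```

Wrapping a call would then look like `await recordDuration(orderDuration, { operation: 'createOrder' }, () => createOrder(userId, amount, items))`, where `orderDuration` is a hypothetical histogram created once at startup. Keeping `outcome` to two values keeps the number of time series small.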

Real-World Metrics Dashboard

Here's the complete metrics setup I use in production:

Visualizing Metrics with Prometheus

Start Prometheus with Docker:

Visit http://localhost:9090 and query:

Production Learnings: The Metrics That Mattered

1. Error Budget Monitoring

I track error budgets using metrics, not traces:
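The arithmetic behind an error budget is small enough to sketch. The function below assumes you already have two counter totals for the same time window (all requests and failed requests) and an SLO target such as 0.999. The target value and the `errorBudget` helper are illustrative, not taken from a real setup.

```typescript
export interface ErrorBudget {
  allowedErrors: number; // failures the SLO tolerates for this request volume
  consumed: number; // fraction of the budget already used (1 means exhausted)
  remaining: number; // fraction of the budget left, never below 0
}

// totalRequests and totalErrors are counter totals over the same window.
// sloTarget is the success-rate objective, e.g. 0.999 for "99.9% of requests succeed".
export function errorBudget(
  totalRequests: number,
  totalErrors: number,
  sloTarget: number,
): ErrorBudget {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError('sloTarget must be between 0 and 1 (exclusive)');
  }
  const allowedErrors = totalRequests * (1 - sloTarget);
  let consumed = 0;
  if (allowedErrors > 0) {
    consumed = totalErrors / allowedErrors;
  } else if (totalErrors > 0) {
    consumed = Infinity; // no traffic budget at all, yet errors occurred
  }
  return { allowedErrors, consumed, remaining: Math.max(0, 1 - consumed) };
}
```

With 1,000,000 requests and a 99.9% target, the budget is about 1,000 failed requests, so 250 errors would have used roughly a quarter of it. Because counters see every request, this number isn't distorted by trace sampling.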

2. Capacity Planning

Metrics revealed we were hitting PostgreSQL connection limits at 1000 req/s:

Alert when utilization > 80% → time to scale!
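The utilization check itself is a one-line ratio. The sketch below assumes you export the number of active database connections as a gauge and know the configured pool or server limit. Both function names are hypothetical, and in practice this comparison would usually live in an alerting rule, not in application code.

```typescript
// Fraction of the connection limit currently in use (0.85 means 85%).
export function poolUtilization(activeConnections: number, maxConnections: number): number {
  if (maxConnections <= 0) {
    throw new RangeError('maxConnections must be positive');
  }
  return activeConnections / maxConnections;
}

// True once utilization is strictly above the threshold (80% by default).
export function needsScaling(
  activeConnections: number,
  maxConnections: number,
  threshold: number = 0.8,
): boolean {
  return poolUtilization(activeConnections, maxConnections) > threshold;
}
```

Alerting on the ratio instead of the raw connection count means the rule keeps working after the limit is raised.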

3. Business KPIs

Technical metrics don't tell the full story. Business metrics do:
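As one illustration, two business KPIs fall straight out of the `orders.created` and `revenue.total` counters defined earlier. Since counters are cumulative, a KPI for a time window comes from the difference between two readings. The snapshot shape and `windowKpis` helper below are assumptions for this sketch, and it ignores counter resets after a process restart.

```typescript
// One reading of the two cumulative counters at a point in time.
export interface CounterSnapshot {
  timestampMs: number;
  ordersCreated: number; // value of orders.created
  revenueTotal: number; // value of revenue.total, in USD
}

export interface WindowKpis {
  ordersPerMinute: number;
  averageOrderValue: number; // USD per order; 0 when the window had no orders
}

// Derives per-window KPIs from the increase between two snapshots.
export function windowKpis(start: CounterSnapshot, end: CounterSnapshot): WindowKpis {
  const elapsedMinutes = (end.timestampMs - start.timestampMs) / 60000;
  if (elapsedMinutes <= 0) {
    throw new RangeError('end snapshot must be later than start snapshot');
  }
  const orders = end.ordersCreated - start.ordersCreated;
  const revenue = end.revenueTotal - start.revenueTotal;
  return {
    ordersPerMinute: orders / elapsedMinutes,
    averageOrderValue: orders > 0 ? revenue / orders : 0,
  };
}
```

A metrics backend normally does this subtraction for you with a rate function; the point is that "orders per minute" comes from counters, not from counting sampled spans.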

Best Practices

1. Use Proper Metric Types

2. Keep Cardinality Low
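Every distinct combination of attribute values becomes its own time series, so raw user IDs, order IDs, or full URLs as attributes can multiply series without bound. Here's a small sketch of collapsing unbounded values into a bounded set before recording. The helper names and the `http.status_class` attribute are hypothetical, and the ID detection (all-digit segments and UUIDs) is only a heuristic.

```typescript
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Drops the query string and replaces ID-like path segments with ':id',
// so '/orders/1' and '/orders/2' land in the same time series.
export function normalizeRoute(path: string): string {
  const queryStart = path.indexOf('?');
  const pathname = queryStart === -1 ? path : path.slice(0, queryStart);
  return pathname
    .split('/')
    .map((segment) => (/^\d+$/.test(segment) || UUID_PATTERN.test(segment) ? ':id' : segment))
    .join('/');
}

// Collapses individual status codes into five classes: '1xx' to '5xx'.
export function statusClass(statusCode: number): string {
  return `${Math.floor(statusCode / 100)}xx`;
}

// Builds a bounded attribute set suitable for counter.add() or histogram.record().
export function requestAttributes(
  method: string,
  path: string,
  statusCode: number,
): Record<string, string> {
  return {
    'http.request.method': method.toUpperCase(),
    'http.route': normalizeRoute(path),
    'http.status_class': statusClass(statusCode),
  };
}

// requestAttributes('get', '/orders/12345/items?expand=true', 404)
// yields route '/orders/:id/items' and status class '4xx'.
```

If your web framework exposes the matched route template, that is a better source for the route attribute than pattern matching on the raw path.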

3. Namespace Your Metrics
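One way to keep names consistent is to build them through a small helper so they don't have to be typed by hand. The `createNamer` function below is a hypothetical sketch, and its rule (lowercase, dot-separated segments that start with a letter) is a deliberately strict convention of my own, narrower than what OpenTelemetry accepts.

```typescript
// Lowercase segments separated by dots, each starting with a letter.
const NAME_PATTERN = /^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)*$/;

// Returns a function that prefixes metric names with a fixed namespace
// and rejects names that break the convention.
export function createNamer(namespace: string): (name: string) => string {
  if (!NAME_PATTERN.test(namespace)) {
    throw new Error(`Invalid metric namespace: "${namespace}"`);
  }
  return (name: string): string => {
    if (!NAME_PATTERN.test(name)) {
      throw new Error(`Invalid metric name: "${name}"`);
    }
    return `${namespace}.${name}`;
  };
}

const orderMetric = createNamer('shop.orders');
const createdName = orderMetric('created'); // 'shop.orders.created'
const latencyName = orderMetric('checkout.duration'); // 'shop.orders.checkout.duration'
```

The returned string goes straight into `meter.createCounter(...)` or `meter.createHistogram(...)`, so every metric from one service shares a prefix you can filter on.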

4. Export to Multiple Backends

Production systems need both Prometheus (alerting) and cloud backends (long-term storage):

What's Next

Continue to Distributed Tracing, where you'll learn:

  • Context propagation across services

  • Trace correlation in microservices

  • Debugging distributed systems

  • Trace sampling strategies



Metrics tell you what's happening. Traces tell you why.
