# OpenTelemetry Fundamentals

## My First Production Mystery

Three years ago, I was debugging a critical production issue where user checkout requests were timing out intermittently. Our logs showed the API receiving requests and returning responses, but somewhere in between, 15% of requests took over 30 seconds. Without distributed tracing, I spent hours manually correlating timestamps across service logs, database query logs, and Redis monitoring dashboards. A diagnosis that should have taken 15 minutes took nearly 4 hours: a misconfigured connection pool in our payment service that only manifested under specific load conditions.

That incident taught me that traditional logging and metrics aren't enough for modern distributed systems. You need visibility into the entire request lifecycle, correlated context across services, and the ability to drill down from high-level metrics to individual request traces. That's when I committed to implementing comprehensive observability with OpenTelemetry.

## What is Observability?

Observability is the ability to understand the internal state of a system by examining its external outputs. Unlike monitoring (which tells you *when* something is wrong), observability helps you understand *why* it's wrong.

**The Key Difference:**

* **Monitoring**: "The API is slow" (symptom)
* **Observability**: "The API is slow because database connection pool exhaustion causes 200ms waits for available connections during traffic spikes above 500 req/s" (root cause)

In software systems, observability is achieved through **telemetry data** - the signals emitted by your application:

1. **Traces**: The journey of a request through your system
2. **Metrics**: Numeric measurements of system behavior over time
3. **Logs**: Discrete events with context and details

## The Three Pillars of Telemetry

### 1. Distributed Traces

Traces show the complete path of a request through your distributed system. Each trace contains multiple **spans** representing units of work.

**Real Example from My Experience:**

When a user places an order in my e-commerce system, the request flows through:

```
User Request → API Gateway → Order Service → Inventory Service
                                ↓
                          Payment Service → Payment Gateway (external)
                                ↓
                          Notification Service → Email Service
```

A distributed trace captures this entire flow:

```
// Trace: Order Creation (trace_id: abc123)
└─ Span: POST /api/orders (order-service)
   ├─ Span: SELECT inventory (inventory-service)
   │  └─ Span: postgres.query (database)
   ├─ Span: POST /payments (payment-service)
   │  ├─ Span: redis.get (cache)
   │  └─ Span: HTTP POST stripe.com/charges (external)
   └─ Span: POST /notifications (notification-service)
      └─ Span: smtp.send (email)
```

**Key Concepts:**

* **Trace**: The entire journey (trace\_id: abc123)
* **Span**: Individual operations (each box above)
* **Parent-Child Relationships**: Spans form a tree structure
* **Duration**: How long each operation took
* **Attributes**: Metadata (user\_id, order\_amount, payment\_method)
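These concepts need no library to demonstrate; here is a dependency-free sketch (the span names, IDs, and timings are hypothetical) of how spans carrying parent links form the tree above:

```typescript
// Dependency-free model of a trace: each span records its parent,
// and the parent links define the tree structure.
interface Span {
  spanId: string;
  parentId?: string; // undefined marks the root span
  name: string;
  startMs: number;
  endMs: number;
}

// Hypothetical spans mirroring part of the order-creation trace
const spans: Span[] = [
  { spanId: 'def456', name: 'POST /api/orders', startMs: 0, endMs: 560 },
  { spanId: 'ghi789', parentId: 'def456', name: 'SELECT inventory', startMs: 5, endMs: 45 },
  { spanId: 'jkl012', parentId: 'def456', name: 'POST /payments', startMs: 50, endMs: 400 },
];

// Duration: how long each operation took
const durationMs = (s: Span): number => s.endMs - s.startMs;

// Parent-child relationships: the children of a given span
const childrenOf = (all: Span[], id: string): Span[] =>
  all.filter((s) => s.parentId === id);
```

A real tracer does exactly this bookkeeping for you, plus timing, context propagation, and export.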

**What This Reveals:**

In my production system, traces showed that 95% of slow checkouts had a common pattern: the payment service's Stripe API call took 8+ seconds. Without tracing, I would have optimized the wrong parts of the system.

### 2. Metrics

Metrics are numeric measurements aggregated over time windows. They answer questions like "How many requests per second?" or "What's the 95th percentile latency?"

**Types of Metrics:**

**Counter**: Monotonically increasing value

```
// Example: Total orders processed
orders_total{status="completed"} = 15,432
orders_total{status="failed"} = 127
```

**Gauge**: Value that goes up and down

```
// Example: Active database connections
db_connections_active = 23
memory_usage_bytes = 512_000_000
```

**Histogram**: Distribution of values

```
// Example: API response times
http_request_duration_seconds{
  endpoint="/api/orders",
  method="POST"
} 
// Buckets: 0.1s, 0.5s, 1s, 5s, 10s
// P50: 0.2s, P95: 1.2s, P99: 3.5s
```
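Under the hood, a histogram sorts each observation into a bucket and reports cumulative counts. This dependency-free sketch mirrors the buckets above (0.1s, 0.5s, 1s, 5s, 10s, plus an overflow bucket):

```typescript
// Minimal sketch of how a histogram metric buckets observations.
class SimpleHistogram {
  private counts: number[];

  constructor(private bounds: number[]) {
    // One count per finite bucket, plus an overflow (+Inf) bucket
    this.counts = new Array(bounds.length + 1).fill(0);
  }

  record(value: number): void {
    // Place the value in the first bucket whose upper bound contains it
    const i = this.bounds.findIndex((b) => value <= b);
    this.counts[i === -1 ? this.bounds.length : i] += 1;
  }

  // Cumulative counts, the shape Prometheus-style backends expect
  cumulative(): number[] {
    let sum = 0;
    return this.counts.map((c) => (sum += c));
  }
}

const h = new SimpleHistogram([0.1, 0.5, 1, 5, 10]);
[0.05, 0.2, 0.3, 1.2, 7].forEach((v) => h.record(v));
// h.cumulative() → [1, 3, 3, 4, 5, 5]
```

Percentiles like P95 are then estimated from these cumulative counts, which is why choosing bucket boundaries that bracket your latency targets matters.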

**Real Example:**

In my order service, I track:

```typescript
// Business metrics
const orderCounter = meter.createCounter('orders.created', {
  description: 'Total orders created',
  unit: 'orders'
});

// Performance metrics  
const requestDuration = meter.createHistogram('http.request.duration', {
  description: 'HTTP request duration',
  unit: 'ms'
});

// Resource metrics
const dbPoolGauge = meter.createGauge('db.pool.connections', {
  description: 'Database connection pool size',
  unit: 'connections'
});
```

**What Metrics Reveal:**

Metrics showed me that order creation rate spiked to 500/minute during flash sales, but our database connection pool was capped at 20 connections. This caused request queuing and degraded performance - a problem invisible to traditional logging.

### 3. Logs

Logs are timestamped text records of discrete events. With OpenTelemetry, logs can be enriched with trace context, linking them directly to spans.

**Traditional Log:**

```
2026-01-08 10:23:45 ERROR Payment failed for order
```

**OpenTelemetry-Enhanced Log:**

```typescript
{
  timestamp: '2026-01-08T10:23:45.123Z',
  level: 'ERROR',
  message: 'Payment failed for order',
  trace_id: 'abc123',
  span_id: 'def456',
  attributes: {
    order_id: 'ORD-12345',
    user_id: 'USR-789',
    payment_method: 'stripe',
    error_code: 'insufficient_funds',
    amount: 99.99,
    currency: 'USD'
  }
}
```
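Producing such a record is mostly a matter of merging the active span's context into the log payload. A minimal dependency-free sketch (field names mirror the record above):

```typescript
// Sketch: attach trace context to a structured log record so a backend
// can join the log to its span.
interface TraceContext {
  traceId: string;
  spanId: string;
}

function enrichLog(
  record: Record<string, unknown>,
  ctx: TraceContext
): Record<string, unknown> {
  return { ...record, trace_id: ctx.traceId, span_id: ctx.spanId };
}

const log = enrichLog(
  { level: 'ERROR', message: 'Payment failed for order' },
  { traceId: 'abc123', spanId: 'def456' }
);
```

In practice you would wire this into your logging library (e.g. a pino mixin) and read the current context from `trace.getActiveSpan()` rather than passing it by hand.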

**The Power of Correlation:**

When investigating an error, I can:

1. See the error log with trace\_id
2. Open the full distributed trace
3. See exactly what happened before and after the error
4. Examine all logs from related spans
5. Check metrics for that time period

This correlation turns debugging from archaeology into precision investigation.

## OpenTelemetry Architecture

OpenTelemetry provides a standard way to generate, collect, and export telemetry data. Here's how the components work together:

```
┌─────────────────────────────────────────────────────────┐
│            Your TypeScript Application                   │
│                                                          │
│  ┌─────────────────────────────────────────────┐        │
│  │         OpenTelemetry API                    │        │
│  │  (Define how to create telemetry)           │        │
│  │                                              │        │
│  │  • tracer.startSpan()                        │        │
│  │  • meter.createCounter()                     │        │
│  │  • logger.emit()                             │        │
│  └──────────────┬──────────────────────────────┘        │
│                 │                                        │
│  ┌──────────────▼──────────────────────────────┐        │
│  │         OpenTelemetry SDK                    │        │
│  │  (Implementation + Configuration)            │        │
│  │                                              │        │
│  │  • Sampling decisions                        │        │
│  │  • Resource detection                        │        │
│  │  • Context propagation                       │        │
│  │  • Batch processing                          │        │
│  └──────────────┬──────────────────────────────┘        │
│                 │                                        │
│  ┌──────────────▼──────────────────────────────┐        │
│  │      Auto-Instrumentation Libraries         │        │
│  │                                              │        │
│  │  • Express.js                                │        │
│  │  • PostgreSQL                                │        │
│  │  • Redis                                     │        │
│  │  • HTTP/HTTPS                                │        │
│  └──────────────┬──────────────────────────────┘        │
└─────────────────┼──────────────────────────────────────┘
                  │ Telemetry Data (OTLP Protocol)
                  │
        ┌─────────▼─────────┐
        │  OTel Collector   │ (Optional but Recommended)
        │                   │
        │  • Receive        │
        │  • Process        │
        │  • Filter         │
        │  • Batch          │
        │  • Export         │
        └─────────┬─────────┘
                  │
      ┌───────────┼───────────┐
      │           │           │
┌─────▼────┐ ┌────▼─────┐ ┌────▼─────┐
│  Jaeger  │ │Prometheus│ │Cloud Logs│
│ (Traces) │ │(Metrics) │ │  (All)   │
└──────────┘ └──────────┘ └──────────┘
```

### Component Breakdown

**1. OpenTelemetry API**

The API defines the interfaces for creating telemetry. It's language-specific but follows the same patterns:

```typescript
import { trace, metrics } from '@opentelemetry/api';

// Get a tracer
const tracer = trace.getTracer('order-service', '1.0.0');

// Get a meter
const meter = metrics.getMeter('order-service', '1.0.0');
```

**Why separate API and SDK?**\
Your application code depends only on the API. The SDK implementation can be swapped without changing your code. This enables:

* Testing with no-op implementations
* Different SDK configurations per environment
* Library code that works with any SDK
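The benefit is easy to see in miniature. This sketch uses hypothetical minimal interfaces (not the real OpenTelemetry types) to show why code written against an interface can be satisfied by a no-op implementation:

```typescript
// Hypothetical minimal interfaces illustrating the API/SDK split.
interface MinimalSpan {
  setAttribute(key: string, value: unknown): void;
  end(): void;
}
interface MinimalTracer {
  startSpan(name: string): MinimalSpan;
}

// No-op implementation: telemetry calls cost almost nothing,
// which is exactly what the real API does when no SDK is registered.
const noopSpan: MinimalSpan = { setAttribute() {}, end() {} };
const noopTracer: MinimalTracer = { startSpan: () => noopSpan };

// Application code depends only on the interface, so tests can inject
// the no-op while production wires in a real SDK-backed tracer.
function instrumentedWork(tracer: MinimalTracer): number {
  const span = tracer.startSpan('work');
  const result = 2 + 2;
  span.setAttribute('result', result);
  span.end();
  return result;
}
```

The real `@opentelemetry/api` package follows the same pattern: until an SDK registers a global tracer provider, every API call routes to a built-in no-op.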

**2. OpenTelemetry SDK**

The SDK implements the API and handles:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // Where to send traces
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces'
  }),

  // Service identification (must be a Resource instance, not a plain object)
  resource: new Resource({
    'service.name': 'order-service',
    'service.version': '1.0.0',
    'deployment.environment': 'production'
  }),

  // Sampling (only send 10% of traces)
  sampler: new TraceIdRatioBasedSampler(0.1)
});

sdk.start();

**3. Auto-Instrumentation**

Libraries that automatically create spans for framework operations:

```typescript
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  instrumentations: [
    getNodeAutoInstrumentations({
      // Automatically trace Express routes
      '@opentelemetry/instrumentation-express': {
        enabled: true
      },
      // Automatically trace PostgreSQL queries
      '@opentelemetry/instrumentation-pg': {
        enabled: true
      },
      // Automatically trace Redis commands
      '@opentelemetry/instrumentation-redis': {
        enabled: true
      }
    })
  ]
});
```

With zero code changes in your application logic, every HTTP request, database query, and Redis command is automatically traced!

**4. Exporters**

Exporters send telemetry to observability backends:

```typescript
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';

// Send to OpenTelemetry Collector
const otlpExporter = new OTLPTraceExporter({
  url: 'http://collector:4318/v1/traces'
});

// Send to Jaeger directly (legacy Thrift endpoint; recent Jaeger
// versions also ingest OTLP natively, making this exporter optional)
const jaegerExporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces'
});

// Expose Prometheus metrics endpoint
const prometheusExporter = new PrometheusExporter({
  port: 9464
});
```

## Context Propagation: The Secret Sauce

Context propagation is how OpenTelemetry maintains trace relationships across service boundaries. Without it, each service creates isolated traces.

**The Problem:**

```
Order Service: Trace A (trace_id: abc123)
  └─ Span: Create Order

Payment Service: Trace B (trace_id: xyz789) ❌ Different trace!
  └─ Span: Process Payment
```

**The Solution: W3C Trace Context**

OpenTelemetry uses HTTP headers to propagate context:

```typescript
// Order Service calls Payment Service. You don't set these headers
// yourself - an instrumented HTTP client injects them automatically.
// They're shown explicitly here to illustrate what crosses the wire:
const response = await fetch('http://payment-service/api/charge', {
  headers: {
    'traceparent': '00-abc123-def456-01',
    'tracestate': 'vendor=value'
  }
});
```

**Header Format:**

```
traceparent: version-trace_id-parent_span_id-flags
00-abc123-def456-01
```

In real headers the trace ID is 32 hex characters and the span ID 16; the short IDs here are shorthand.
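The format is simple enough to parse by hand; here is a dependency-free sketch (in practice the SDK's propagator handles this) using a full-length example trace ID:

```typescript
// Parse a W3C traceparent header: version-trace_id-parent_span_id-flags
interface TraceParent {
  version: string;
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
}

function parseTraceparent(header: string): TraceParent | null {
  const m = /^([\da-f]{2})-([\da-f]{32})-([\da-f]{16})-([\da-f]{2})$/.exec(header);
  if (!m) return null; // malformed header
  const [, version, traceId, parentSpanId, flags] = m;
  // An all-zero trace ID or span ID is invalid per the spec
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const ctx = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
);
```

The receiving service uses the extracted `traceId` for its own spans and the `parentSpanId` as their parent, which is what stitches the cross-service tree together.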

Payment Service automatically extracts this context and creates a child span:

```
Order Service: Trace A (trace_id: abc123)
  └─ Span: Create Order (span_id: def456)
     └─ Span: Process Payment (span_id: ghi789) ✅ Same trace!
```

**Real Impact:**

In my microservices architecture, context propagation revealed that payment failures weren't due to our payment service - they originated from the inventory service returning stale data. The full trace showed:

```
1. Order Service checks inventory (200ms)
2. Inventory Service returns cached data (5ms) ← Stale!
3. Order Service creates order (50ms)
4. Payment Service charges card (300ms)
5. Inventory Service rejects (product sold out) (10ms)
6. Payment Service refunds (400ms)
7. Order Service returns error (10ms)
```

Total: 975ms wasted on a request that should have failed in step 1. Without distributed tracing, I would never have seen this pattern.

## Semantic Conventions

OpenTelemetry defines standard naming conventions for common attributes, ensuring consistency across languages and tools.

**HTTP Semantic Conventions:**

```typescript
// ✅ Correct: Following semantic conventions
span.setAttribute('http.method', 'POST');
span.setAttribute('http.url', 'https://api.example.com/orders');
span.setAttribute('http.status_code', 201);
span.setAttribute('http.user_agent', 'Mozilla/5.0...');

// ❌ Wrong: Custom naming breaks compatibility
span.setAttribute('request_type', 'POST');
span.setAttribute('api_url', 'https://api.example.com/orders');
span.setAttribute('response_code', 201);
```

These attribute names follow the conventions in wide use when this was written; newer versions of the HTTP semantic conventions rename some of them (e.g. `http.method` becomes `http.request.method`), so check which version your backend expects.

**Database Semantic Conventions:**

```typescript
span.setAttribute('db.system', 'postgresql');
span.setAttribute('db.name', 'orders_db');
span.setAttribute('db.statement', 'SELECT * FROM orders WHERE id = $1');
span.setAttribute('db.operation', 'SELECT');
```

**Why This Matters:**

Following semantic conventions means:

* Backend tools automatically recognize standard attributes
* Dashboards work without custom configuration
* Multi-language systems use consistent naming
* Community instrumentation libraries work together seamlessly

## The OTLP Protocol

OpenTelemetry Protocol (OTLP) is the standard way to transmit telemetry data. It supports:

* **HTTP/JSON**: Human-readable, easy debugging
* **gRPC/Protobuf**: Efficient binary protocol for production

**Example OTLP Trace Payload:**

```json
{
  "resourceSpans": [{
    "resource": {
      "attributes": [
        { "key": "service.name", "value": { "stringValue": "order-service" }},
        { "key": "service.version", "value": { "stringValue": "1.0.0" }}
      ]
    },
    "scopeSpans": [{
      "scope": {
        "name": "order-service",
        "version": "1.0.0"
      },
      "spans": [{
        "traceId": "abc123",
        "spanId": "def456",
        "name": "POST /api/orders",
        "kind": "SERVER",
        "startTimeUnixNano": "1704705825000000000",
        "endTimeUnixNano": "1704705825500000000",
        "attributes": [
          { "key": "http.method", "value": { "stringValue": "POST" }},
          { "key": "http.status_code", "value": { "intValue": 201 }}
        ]
      }]
    }]
  }]
}
```

## Signal Correlation: Tying It All Together

The real power of OpenTelemetry comes from correlating all three signals:

**Example: Investigating High Error Rate**

1. **Metric Alert**: `http_requests_failed` spike detected
2. **Query Traces**: Filter traces where `http.status_code >= 500`
3. **Find Pattern**: All failures have `db.statement` containing specific query
4. **Check Logs**: Find detailed error messages with same `trace_id`
5. **Root Cause**: Database index missing on recently added column

**In Code:**

```typescript
import { SpanStatusCode } from '@opentelemetry/api';

// Metric shows the problem exists
errorCounter.add(1, {
  'http.status_code': 500,
  'error.type': 'DatabaseError'
});

// Trace shows where it happens
const span = tracer.startSpan('processOrder');
span.setStatus({ code: SpanStatusCode.ERROR });
span.recordException(error);

// Log provides the details, linked to the trace via its IDs
logger.error('Order processing failed', {
  trace_id: span.spanContext().traceId,
  span_id: span.spanContext().spanId,
  order_id: orderId,
  error_message: error.message,
  stack_trace: error.stack
});

span.end(); // always end the span, or it never gets exported
```

## What You've Learned

You now understand:

✅ **The three pillars of telemetry**: Traces, metrics, and logs\
✅ **OpenTelemetry architecture**: API, SDK, instrumentation, exporters\
✅ **Distributed tracing mechanics**: Spans, traces, and context propagation\
✅ **Semantic conventions**: Standard naming for interoperability\
✅ **OTLP protocol**: How telemetry data is transmitted\
✅ **Signal correlation**: Connecting traces, metrics, and logs for investigations

## Real-World Impact

Since implementing OpenTelemetry in my production systems:

* **MTTR reduced by 70%**: From 4 hours to 45 minutes average
* **Performance optimization**: Identified and fixed 3 major bottlenecks
* **Proactive issue detection**: Caught problems before users reported them
* **Cost savings**: Rightsized infrastructure based on actual usage patterns
* **Team productivity**: Engineers debug independently without escalation

## Next Steps

Now that you understand the fundamentals, you're ready to instrument your first TypeScript application. Continue to [Getting Started with TypeScript](https://blog.htunnthuthu.com/devops-and-sre/opentelemetry-101/opentelemetry-101-typescript-setup) where you'll:

* Set up OpenTelemetry in a Node.js/TypeScript project
* Instrument an Express.js API
* See automatic traces in action
* Export telemetry to Jaeger and Prometheus
* Create your first custom spans

***

**Previous**: [← OpenTelemetry 101](https://blog.htunnthuthu.com/devops-and-sre/opentelemetry-101) | **Next**: [Getting Started with TypeScript →](https://blog.htunnthuthu.com/devops-and-sre/opentelemetry-101/opentelemetry-101-typescript-setup)

*Observability isn't overhead - it's insurance. And the time to buy insurance is before you need it.*
