OpenTelemetry Fundamentals

My First Production Mystery

Three years ago, I was debugging a critical production issue where user checkout requests were timing out intermittently. Our logs showed the API receiving requests and returning responses, but somewhere in between, 15% of requests took over 30 seconds. Without distributed tracing, I spent hours manually correlating timestamps across service logs, database query logs, and Redis monitoring dashboards. What should have taken 15 minutes took nearly 4 hours to identify - a misconfigured connection pool in our payment service that only manifested under specific load conditions.

That incident taught me that traditional logging and metrics aren't enough for modern distributed systems. You need visibility into the entire request lifecycle, correlated context across services, and the ability to drill down from high-level metrics to individual request traces. That's when I committed to implementing comprehensive observability with OpenTelemetry.

What is Observability?

Observability is the ability to understand the internal state of a system by examining its external outputs. Unlike monitoring (which tells you when something is wrong), observability helps you understand why it's wrong.

The Key Difference:

  • Monitoring: "The API is slow" (symptom)

  • Observability: "The API is slow because database connection pool exhaustion causes 200ms waits for available connections during traffic spikes above 500 req/s" (root cause)

In software systems, observability is achieved through telemetry data - the signals emitted by your application:

  1. Traces: The journey of a request through your system

  2. Metrics: Numeric measurements of system behavior over time

  3. Logs: Discrete events with context and details

The Three Pillars of Telemetry

1. Distributed Traces

Traces show the complete path of a request through your distributed system. Each trace contains multiple spans representing units of work.

Real Example from My Experience:

When a user places an order in my e-commerce system, the request flows through the API layer, the order service, the inventory service, the payment service (which calls Stripe), the database, and the Redis cache.

A distributed trace captures this entire flow:
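
Here is an illustrative trace tree for that kind of checkout (the span names and timings show the shape of the trace, not my exact production values):

```text
checkout (trace_id: abc123)                                total 1,240ms
└── POST /orders                 api                             1,240ms
    └── create-order             order-service                   1,180ms
        ├── check-inventory      inventory-service                  95ms
        ├── charge-payment       payment-service                   890ms
        │   └── POST api.stripe.com/v1/charges (external)         850ms
        └── INSERT INTO orders   database                           42ms
```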

Key Concepts:

  • Trace: The entire journey (trace_id: abc123)

  • Span: Individual operations (each line in the trace tree above)

  • Parent-Child Relationships: Spans form a tree structure

  • Duration: How long each operation took

  • Attributes: Metadata (user_id, order_amount, payment_method)

What This Reveals:

In my production system, traces showed that 95% of slow checkouts had a common pattern: the payment service's Stripe API call took 8+ seconds. Without tracing, I would have optimized the wrong parts of the system.

2. Metrics

Metrics are numeric measurements aggregated over time windows. They answer questions like "How many requests per second?" or "What's the 95th percentile latency?"

Types of Metrics:

Counter: Monotonically increasing value

Gauge: Value that goes up and down

Histogram: Distribution of values

Real Example:

In my order service, I track:
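
In concrete terms: a counter for orders created, a gauge for active database connections, and a histogram for request duration. Here is a sketch of how these might be defined with the OpenTelemetry metrics API (metric names and attributes are illustrative, not my exact production names):

```typescript
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('order-service');

// Counter: only ever goes up, e.g. total orders created
const ordersCreated = meter.createCounter('orders_created_total', {
  description: 'Total number of orders created',
});
ordersCreated.add(1, { 'payment.method': 'card' });

// Gauge: observed on demand and can go up or down, e.g. active DB connections
const getActiveConnectionCount = (): number => 7; // stand-in for a real pool query
const activeConnections = meter.createObservableGauge('db_connections_active');
activeConnections.addCallback((result) => {
  result.observe(getActiveConnectionCount());
});

// Histogram: records a distribution, e.g. request duration in milliseconds
const requestDuration = meter.createHistogram('http_request_duration_ms');
requestDuration.record(42, { 'http.route': '/orders' });
```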

What Metrics Reveal:

Metrics showed me that order creation rate spiked to 500/minute during flash sales, but our database connection pool was capped at 20 connections. This caused request queuing and degraded performance - a problem invisible to traditional logging.

3. Logs

Logs are timestamped text records of discrete events. With OpenTelemetry, logs can be enriched with trace context, linking them directly to spans.

Traditional Log:
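
A plain application log line might look like this (purely illustrative):

```text
2024-03-15T14:23:01.512Z ERROR [order-service] Payment failed for order 8812: card_declined
```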

OpenTelemetry-Enhanced Log:
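
The same event with trace context attached - a hypothetical structured log; the exact field names depend on your logging library:

```json
{
  "timestamp": "2024-03-15T14:23:01.512Z",
  "severity": "ERROR",
  "body": "Payment failed for order 8812: card_declined",
  "service.name": "order-service",
  "trace_id": "abc123def4567890abc123def4567890",
  "span_id": "1a2b3c4d5e6f7a8b"
}
```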

The Power of Correlation:

When investigating an error, I can:

  1. See the error log with trace_id

  2. Open the full distributed trace

  3. See exactly what happened before and after the error

  4. Examine all logs from related spans

  5. Check metrics for that time period

This correlation turns debugging from archaeology into precision investigation.

OpenTelemetry Architecture

OpenTelemetry provides a standard way to generate, collect, and export telemetry data. Here's how the components work together:
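
A simplified view of the pipeline (real deployments often add an OpenTelemetry Collector between the SDK and the backend):

```text
Your application code
        |  calls the vendor-neutral API
        v
OpenTelemetry API       (tracers, meters, loggers - interfaces only)
        |  implemented by
        v
OpenTelemetry SDK       (sampling, batching, processors, resources)
        |  exporters: OTLP, Prometheus, console, ...
        v
Collector / backend     (Jaeger, Prometheus, Grafana, vendor tools)
```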

Component Breakdown

1. OpenTelemetry API

The API defines the interfaces for creating telemetry. It's language-specific but follows the same patterns:
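
A minimal TypeScript sketch using @opentelemetry/api (the tracer name, span name, and attributes are placeholders):

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

export async function createOrder(userId: string): Promise<void> {
  // startActiveSpan makes this span the parent of anything created inside the callback
  await tracer.startActiveSpan('create-order', async (span) => {
    try {
      span.setAttribute('user.id', userId);
      // ... business logic ...
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```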

Why separate API and SDK? Your application code depends only on the API. The SDK implementation can be swapped without changing your code. This enables:

  • Testing with no-op implementations

  • Different SDK configurations per environment

  • Library code that works with any SDK

2. OpenTelemetry SDK

The SDK implements the API and handles the operational details: sampling decisions, batching and processing spans, attaching resource attributes such as service.name, and exporting everything to your configured backend.
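
A minimal Node.js setup sketch using @opentelemetry/sdk-node (option names shift a little between SDK releases, so treat this as the shape rather than copy-paste configuration; the endpoint URL assumes a local collector):

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  // Batches spans and ships them over OTLP/HTTP
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
});

sdk.start();

// Flush buffered telemetry before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```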

3. Auto-Instrumentation

Libraries that automatically create spans for framework operations:
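
For example, with the Node.js instrumentation packages - here assuming an Express API with a PostgreSQL database and Redis (PostgreSQL is my assumption; swap in the packages that match your stack):

```typescript
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';
import { IORedisInstrumentation } from '@opentelemetry/instrumentation-ioredis';

registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),    // spans for inbound and outbound HTTP calls
    new ExpressInstrumentation(), // route-level spans for Express handlers
    new PgInstrumentation(),      // spans for PostgreSQL queries
    new IORedisInstrumentation(), // spans for Redis commands
  ],
});
```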

With zero code changes in your application logic, every HTTP request, database query, and Redis command is automatically traced!

4. Exporters

Exporters send telemetry to observability backends: OTLP exporters ship data to an OpenTelemetry Collector or any OTLP-compatible vendor, the Prometheus exporter exposes metrics for scraping, and console exporters print to stdout for local debugging.
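
Switching backends is configuration, not code. A sketch (the collector URL and Prometheus port are assumptions for a typical local setup):

```typescript
import { ConsoleSpanExporter } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';

// Local development: print spans to stdout
const devTraceExporter = new ConsoleSpanExporter();

// Production: ship spans to a collector over OTLP/HTTP
const prodTraceExporter = new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' });

// Metrics: expose a /metrics endpoint for Prometheus to scrape
const metricsExporter = new PrometheusExporter({ port: 9464 });
```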

Context Propagation: The Secret Sauce

Context propagation is how OpenTelemetry maintains trace relationships across service boundaries. Without it, each service creates isolated traces.

The Problem:

Without context propagation, each service starts its own trace with a brand-new trace ID. A single user request then shows up in your backend as several disconnected traces with no parent-child relationship between them.

The Solution: W3C Trace Context

OpenTelemetry uses HTTP headers to propagate context:

Header Format:
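
The traceparent header carries the trace ID, the parent span ID, and sampling flags. This is the standard W3C layout; the hex values below are the spec's example values:

```text
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             |  |                                |                |
             |  trace-id (32 hex chars)          parent span id   flags (01 = sampled)
             version
```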

Payment Service automatically extracts this context and creates a child span:
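
The HTTP auto-instrumentation handles this extraction for you; done by hand it looks roughly like this (the Express route and span name are hypothetical):

```typescript
import { context, propagation, trace } from '@opentelemetry/api';
import express from 'express';

const app = express();
const tracer = trace.getTracer('payment-service');

app.post('/charge', (req, res) => {
  // Pull the W3C trace context out of the incoming request headers
  const parentCtx = propagation.extract(context.active(), req.headers);

  // Start a span whose parent is the caller's span, keeping the trace connected
  tracer.startActiveSpan('charge-card', {}, parentCtx, (span) => {
    // ... charge the card ...
    span.end();
    res.sendStatus(202);
  });
});
```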

Real Impact:

In my microservices architecture, context propagation revealed that payment failures weren't due to our payment service - they originated from the inventory service returning stale data. The full trace showed the inventory check passing on stale data, the order service proceeding anyway, and the payment only failing at the very end: a total of 975ms spent on a request that should have been rejected at the initial inventory check. Without distributed tracing, I would never have seen this pattern.

Semantic Conventions

OpenTelemetry defines standard naming conventions for common attributes, ensuring consistency across languages and tools.

HTTP Semantic Conventions:
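
A few of the commonly used HTTP attributes (values are illustrative; newer semantic-convention releases rename some of these, e.g. http.request.method):

```text
http.method        "POST"
http.route         "/orders/:id"
http.target        "/orders/8812"
http.status_code   500
http.user_agent    "Mozilla/5.0 ..."
```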

Database Semantic Conventions:
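
And the database equivalents (again, illustrative values):

```text
db.system      "postgresql"
db.name        "orders"
db.operation   "SELECT"
db.statement   "SELECT * FROM orders WHERE user_id = $1"
```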

Why This Matters:

Following semantic conventions means:

  • Backend tools automatically recognize standard attributes

  • Dashboards work without custom configuration

  • Multi-language systems use consistent naming

  • Community instrumentation libraries work together seamlessly

The OTLP Protocol

OpenTelemetry Protocol (OTLP) is the standard way to transmit telemetry data. It supports:

  • HTTP/JSON: Human-readable, easy debugging

  • gRPC/Protobuf: Efficient binary protocol for production

Example OTLP Trace Payload:
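
A heavily trimmed example of an OTLP/HTTP JSON trace export (IDs and timestamps are made up; real payloads carry many more fields):

```json
{
  "resourceSpans": [
    {
      "resource": {
        "attributes": [
          { "key": "service.name", "value": { "stringValue": "order-service" } }
        ]
      },
      "scopeSpans": [
        {
          "scope": { "name": "order-service" },
          "spans": [
            {
              "traceId": "0af7651916cd43dd8448eb211c80319c",
              "spanId": "b7ad6b7169203331",
              "name": "POST /orders",
              "kind": 2,
              "startTimeUnixNano": "1710512581000000000",
              "endTimeUnixNano": "1710512581240000000",
              "attributes": [
                { "key": "http.method", "value": { "stringValue": "POST" } },
                { "key": "http.status_code", "value": { "intValue": "201" } }
              ],
              "status": { "code": 1 }
            }
          ]
        }
      ]
    }
  ]
}
```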

Signal Correlation: Tying It All Together

The real power of OpenTelemetry comes from correlating all three signals:

Example: Investigating High Error Rate

  1. Metric Alert: http_requests_failed spike detected

  2. Query Traces: Filter traces where http.status_code >= 500

  3. Find Pattern: All failures have db.statement containing specific query

  4. Check Logs: Find detailed error messages with same trace_id

  5. Root Cause: Database index missing on recently added column

In Code:
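
A sketch of a handler that emits all three correlated signals - the metric name, span name, and logging approach are illustrative; any structured logger works as long as it carries the trace and span IDs:

```typescript
import { trace, metrics, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');
const meter = metrics.getMeter('order-service');
const failedRequests = meter.createCounter('http_requests_failed');

export async function handleCheckout(orderId: string): Promise<void> {
  await tracer.startActiveSpan('checkout', async (span) => {
    try {
      // ... call inventory, payment, database ...
    } catch (err) {
      // Metric: feeds the alert that something is failing
      failedRequests.add(1, { 'http.route': '/checkout' });

      // Trace: mark the span as errored so it shows up when filtering for failures
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });

      // Log: include trace_id and span_id so this line links back to the exact trace
      const { traceId, spanId } = span.spanContext();
      console.error(JSON.stringify({
        message: 'checkout failed',
        orderId,
        trace_id: traceId,
        span_id: spanId,
      }));
      throw err;
    } finally {
      span.end();
    }
  });
}
```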

What You've Learned

You now understand:

✅ The three pillars of telemetry: Traces, metrics, and logs

✅ OpenTelemetry architecture: API, SDK, instrumentation, exporters

✅ Distributed tracing mechanics: Spans, traces, and context propagation

✅ Semantic conventions: Standard naming for interoperability

✅ OTLP protocol: How telemetry data is transmitted

✅ Signal correlation: Connecting traces, metrics, and logs for investigations

Real-World Impact

Since implementing OpenTelemetry in my production systems:

  • MTTR reduced by more than 80%: From 4 hours to 45 minutes on average

  • Performance optimization: Identified and fixed 3 major bottlenecks

  • Proactive issue detection: Caught problems before users reported them

  • Cost savings: Rightsized infrastructure based on actual usage patterns

  • Team productivity: Engineers debug independently without escalation

Next Steps

Now that you understand the fundamentals, you're ready to instrument your first TypeScript application. Continue to Getting Started with TypeScript where you'll:

  • Set up OpenTelemetry in a Node.js/TypeScript project

  • Instrument an Express.js API

  • See automatic traces in action

  • Export telemetry to Jaeger and Prometheus

  • Create your first custom spans



Observability isn't overhead - it's insurance. And the time to buy insurance is before you need it.
