# OpenTelemetry Fundamentals

## My First Production Mystery

Three years ago, I was debugging a critical production issue where user checkout requests were timing out intermittently. Our logs showed the API receiving requests and returning responses, but somewhere in between, 15% of requests took over 30 seconds. Without distributed tracing, I spent hours manually correlating timestamps across service logs, database query logs, and Redis monitoring dashboards. A diagnosis that should have taken 15 minutes took nearly 4 hours: a misconfigured connection pool in our payment service that only manifested under specific load conditions.

That incident taught me that traditional logging and metrics aren't enough for modern distributed systems. You need visibility into the entire request lifecycle, correlated context across services, and the ability to drill down from high-level metrics to individual request traces. That's when I committed to implementing comprehensive observability with OpenTelemetry.

## What is Observability?

Observability is the ability to understand the internal state of a system by examining its external outputs. Unlike monitoring (which tells you *when* something is wrong), observability helps you understand *why* it's wrong.

**The Key Difference:**

* **Monitoring**: "The API is slow" (symptom)
* **Observability**: "The API is slow because database connection pool exhaustion causes 200ms waits for available connections during traffic spikes above 500 req/s" (root cause)

In software systems, observability is achieved through **telemetry data** - the signals emitted by your application:

1. **Traces**: The journey of a request through your system
2. **Metrics**: Numeric measurements of system behavior over time
3. **Logs**: Discrete events with context and details

## The Three Pillars of Telemetry

### 1. Distributed Traces

Traces show the complete path of a request through your distributed system. Each trace contains multiple **spans** representing units of work.

**Real Example from My Experience:**

When a user places an order in my e-commerce system, the request flows through:

```
User Request → API Gateway → Order Service → Inventory Service
                                ↓
                          Payment Service → Payment Gateway (external)
                                ↓
                          Notification Service → Email Service
```

A distributed trace captures this entire flow:

```
// Trace: Order Creation (trace_id: abc123)
└─ Span: POST /api/orders (order-service)
   ├─ Span: SELECT inventory (inventory-service)
   │  └─ Span: postgres.query (database)
   ├─ Span: POST /payments (payment-service)
   │  ├─ Span: redis.get (cache)
   │  └─ Span: HTTP POST stripe.com/charges (external)
   └─ Span: POST /notifications (notification-service)
      └─ Span: smtp.send (email)
```

**Key Concepts:**

* **Trace**: The entire journey (trace\_id: abc123)
* **Span**: Individual operations (each box above)
* **Parent-Child Relationships**: Spans form a tree structure
* **Duration**: How long each operation took
* **Attributes**: Metadata (user\_id, order\_amount, payment\_method)
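These concepts need no library to demonstrate; here is a dependency-free sketch (the span names, IDs, and timings are hypothetical) of how spans carrying parent links form the tree above:

```typescript
// Dependency-free model of a trace: each span records its parent,
// and the parent links define the tree structure.
interface Span {
  spanId: string;
  parentId?: string; // undefined marks the root span
  name: string;
  startMs: number;
  endMs: number;
}

// Hypothetical spans mirroring part of the order-creation trace
const spans: Span[] = [
  { spanId: 'def456', name: 'POST /api/orders', startMs: 0, endMs: 560 },
  { spanId: 'ghi789', parentId: 'def456', name: 'SELECT inventory', startMs: 5, endMs: 45 },
  { spanId: 'jkl012', parentId: 'def456', name: 'POST /payments', startMs: 50, endMs: 400 },
];

// Duration: how long each operation took
const durationMs = (s: Span): number => s.endMs - s.startMs;

// Parent-child relationships: the children of a given span
const childrenOf = (all: Span[], id: string): Span[] =>
  all.filter((s) => s.parentId === id);
```

A real tracer does exactly this bookkeeping for you, plus timing, context propagation, and export.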

**What This Reveals:**

In my production system, traces showed that 95% of slow checkouts had a common pattern: the payment service's Stripe API call took 8+ seconds. Without tracing, I would have optimized the wrong parts of the system.

### 2. Metrics

Metrics are numeric measurements aggregated over time windows. They answer questions like "How many requests per second?" or "What's the 95th percentile latency?"

**Types of Metrics:**

**Counter**: Monotonically increasing value

```
// Example: Total orders processed
orders_total{status="completed"} = 15,432
orders_total{status="failed"} = 127
```

**Gauge**: Value that goes up and down

```
// Example: Active database connections
db_connections_active = 23
memory_usage_bytes = 512_000_000
```

**Histogram**: Distribution of values

```
// Example: API response times
http_request_duration_seconds{
  endpoint="/api/orders",
  method="POST"
} 
// Buckets: 0.1s, 0.5s, 1s, 5s, 10s
// P50: 0.2s, P95: 1.2s, P99: 3.5s
```
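Under the hood, a histogram sorts each observation into a bucket and reports cumulative counts. This dependency-free sketch mirrors the buckets above (0.1s, 0.5s, 1s, 5s, 10s, plus an overflow bucket):

```typescript
// Minimal sketch of how a histogram metric buckets observations.
class SimpleHistogram {
  private counts: number[];

  constructor(private bounds: number[]) {
    // One count per finite bucket, plus an overflow (+Inf) bucket
    this.counts = new Array(bounds.length + 1).fill(0);
  }

  record(value: number): void {
    // Place the value in the first bucket whose upper bound contains it
    const i = this.bounds.findIndex((b) => value <= b);
    this.counts[i === -1 ? this.bounds.length : i] += 1;
  }

  // Cumulative counts, the shape Prometheus-style backends expect
  cumulative(): number[] {
    let sum = 0;
    return this.counts.map((c) => (sum += c));
  }
}

const h = new SimpleHistogram([0.1, 0.5, 1, 5, 10]);
[0.05, 0.2, 0.3, 1.2, 7].forEach((v) => h.record(v));
// h.cumulative() → [1, 3, 3, 4, 5, 5]
```

Percentiles like P95 are then estimated from these cumulative counts, which is why choosing bucket boundaries that bracket your latency targets matters.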

**Real Example:**

In my order service, I track:

```typescript
// Business metrics
const orderCounter = meter.createCounter('orders.created', {
  description: 'Total orders created',
  unit: 'orders'
});

// Performance metrics  
const requestDuration = meter.createHistogram('http.request.duration', {
  description: 'HTTP request duration',
  unit: 'ms'
});

// Resource metrics
const dbPoolGauge = meter.createGauge('db.pool.connections', {
  description: 'Database connection pool size',
  unit: 'connections'
});
```

**What Metrics Reveal:**

Metrics showed me that order creation rate spiked to 500/minute during flash sales, but our database connection pool was capped at 20 connections. This caused request queuing and degraded performance - a problem invisible to traditional logging.

### 3. Logs

Logs are timestamped text records of discrete events. With OpenTelemetry, logs can be enriched with trace context, linking them directly to spans.

**Traditional Log:**

```
2026-01-08 10:23:45 ERROR Payment failed for order
```

**OpenTelemetry-Enhanced Log:**

```typescript
{
  timestamp: '2026-01-08T10:23:45.123Z',
  level: 'ERROR',
  message: 'Payment failed for order',
  trace_id: 'abc123',
  span_id: 'def456',
  attributes: {
    order_id: 'ORD-12345',
    user_id: 'USR-789',
    payment_method: 'stripe',
    error_code: 'insufficient_funds',
    amount: 99.99,
    currency: 'USD'
  }
}
```
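Producing such a record is mostly a matter of merging the active span's context into the log payload. A minimal dependency-free sketch (field names mirror the record above):

```typescript
// Sketch: attach trace context to a structured log record so a backend
// can join the log to its span.
interface TraceContext {
  traceId: string;
  spanId: string;
}

function enrichLog(
  record: Record<string, unknown>,
  ctx: TraceContext
): Record<string, unknown> {
  return { ...record, trace_id: ctx.traceId, span_id: ctx.spanId };
}

const log = enrichLog(
  { level: 'ERROR', message: 'Payment failed for order' },
  { traceId: 'abc123', spanId: 'def456' }
);
```

In practice you would wire this into your logging library (e.g. a pino mixin) and read the current context from `trace.getActiveSpan()` rather than passing it by hand.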

**The Power of Correlation:**

When investigating an error, I can:

1. See the error log with trace\_id
2. Open the full distributed trace
3. See exactly what happened before and after the error
4. Examine all logs from related spans
5. Check metrics for that time period

This correlation turns debugging from archaeology into precision investigation.

## OpenTelemetry Architecture

OpenTelemetry provides a standard way to generate, collect, and export telemetry data. Here's how the components work together:

```
┌─────────────────────────────────────────────────────────┐
│            Your TypeScript Application                   │
│                                                          │
│  ┌─────────────────────────────────────────────┐        │
│  │         OpenTelemetry API                    │        │
│  │  (Define how to create telemetry)           │        │
│  │                                              │        │
│  │  • tracer.startSpan()                        │        │
│  │  • meter.createCounter()                     │        │
│  │  • logger.emit()                             │        │
│  └──────────────┬──────────────────────────────┘        │
│                 │                                        │
│  ┌──────────────▼──────────────────────────────┐        │
│  │         OpenTelemetry SDK                    │        │
│  │  (Implementation + Configuration)            │        │
│  │                                              │        │
│  │  • Sampling decisions                        │        │
│  │  • Resource detection                        │        │
│  │  • Context propagation                       │        │
│  │  • Batch processing                          │        │
│  └──────────────┬──────────────────────────────┘        │
│                 │                                        │
│  ┌──────────────▼──────────────────────────────┐        │
│  │      Auto-Instrumentation Libraries         │        │
│  │                                              │        │
│  │  • Express.js                                │        │
│  │  • PostgreSQL                                │        │
│  │  • Redis                                     │        │
│  │  • HTTP/HTTPS                                │        │
│  └──────────────┬──────────────────────────────┘        │
└─────────────────┼──────────────────────────────────────┘
                  │ Telemetry Data (OTLP Protocol)
                  │
        ┌─────────▼─────────┐
        │  OTel Collector   │ (Optional but Recommended)
        │                   │
        │  • Receive        │
        │  • Process        │
        │  • Filter         │
        │  • Batch          │
        │  • Export         │
        └─────────┬─────────┘
                  │
      ┌───────────┼───────────┐
      │           │           │
┌─────▼────┐ ┌────▼─────┐ ┌────▼─────┐
│  Jaeger  │ │Prometheus│ │Cloud Logs│
│ (Traces) │ │(Metrics) │ │  (All)   │
└──────────┘ └──────────┘ └──────────┘
```

### Component Breakdown

**1. OpenTelemetry API**

The API defines the interfaces for creating telemetry. It's language-specific but follows the same patterns:

```typescript
import { trace, metrics } from '@opentelemetry/api';

// Get a tracer
const tracer = trace.getTracer('order-service', '1.0.0');

// Get a meter
const meter = metrics.getMeter('order-service', '1.0.0');
```

**Why separate API and SDK?**\
Your application code depends only on the API. The SDK implementation can be swapped without changing your code. This enables:

* Testing with no-op implementations
* Different SDK configurations per environment
* Library code that works with any SDK
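The benefit is easy to see in miniature. This sketch uses hypothetical minimal interfaces (not the real OpenTelemetry types) to show why code written against an interface can be satisfied by a no-op implementation:

```typescript
// Hypothetical minimal interfaces illustrating the API/SDK split.
interface MinimalSpan {
  setAttribute(key: string, value: unknown): void;
  end(): void;
}
interface MinimalTracer {
  startSpan(name: string): MinimalSpan;
}

// No-op implementation: telemetry calls cost almost nothing,
// which is exactly what the real API does when no SDK is registered.
const noopSpan: MinimalSpan = { setAttribute() {}, end() {} };
const noopTracer: MinimalTracer = { startSpan: () => noopSpan };

// Application code depends only on the interface, so tests can inject
// the no-op while production wires in a real SDK-backed tracer.
function instrumentedWork(tracer: MinimalTracer): number {
  const span = tracer.startSpan('work');
  const result = 2 + 2;
  span.setAttribute('result', result);
  span.end();
  return result;
}
```

The real `@opentelemetry/api` package follows the same pattern: until an SDK registers a global tracer provider, every API call routes to a built-in no-op.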

**2. OpenTelemetry SDK**

The SDK implements the API and handles:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // Where to send traces
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces'
  }),

  // Service identification (must be a Resource instance, not a plain object)
  resource: new Resource({
    'service.name': 'order-service',
    'service.version': '1.0.0',
    'deployment.environment': 'production'
  }),

  // Sampling (only send 10% of traces)
  sampler: new TraceIdRatioBasedSampler(0.1)
});

sdk.start();

**3. Auto-Instrumentation**

Libraries that automatically create spans for framework operations:

```typescript
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  instrumentations: [
    getNodeAutoInstrumentations({
      // Automatically trace Express routes
      '@opentelemetry/instrumentation-express': {
        enabled: true
      },
      // Automatically trace PostgreSQL queries
      '@opentelemetry/instrumentation-pg': {
        enabled: true
      },
      // Automatically trace Redis commands
      '@opentelemetry/instrumentation-redis': {
        enabled: true
      }
    })
  ]
});
```

With zero code changes in your application logic, every HTTP request, database query, and Redis command is automatically traced!

**4. Exporters**

Exporters send telemetry to observability backends:

```typescript
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';

// Send to OpenTelemetry Collector
const otlpExporter = new OTLPTraceExporter({
  url: 'http://collector:4318/v1/traces'
});

// Send to Jaeger directly (legacy Thrift endpoint; recent Jaeger
// versions also ingest OTLP natively, making this exporter optional)
const jaegerExporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces'
});

// Expose Prometheus metrics endpoint
const prometheusExporter = new PrometheusExporter({
  port: 9464
});
```

## Context Propagation: The Secret Sauce

Context propagation is how OpenTelemetry maintains trace relationships across service boundaries. Without it, each service creates isolated traces.

**The Problem:**

```
Order Service: Trace A (trace_id: abc123)
  └─ Span: Create Order

Payment Service: Trace B (trace_id: xyz789) ❌ Different trace!
  └─ Span: Process Payment
```

**The Solution: W3C Trace Context**

OpenTelemetry uses HTTP headers to propagate context:

```typescript
// Order Service calls Payment Service. You don't set these headers
// yourself - an instrumented HTTP client injects them automatically.
// They're shown explicitly here to illustrate what crosses the wire:
const response = await fetch('http://payment-service/api/charge', {
  headers: {
    'traceparent': '00-abc123-def456-01',
    'tracestate': 'vendor=value'
  }
});
```

**Header Format:**

```
traceparent: version-trace_id-parent_span_id-flags
00-abc123-def456-01
```

In real headers the trace ID is 32 hex characters and the span ID 16; the short IDs here are shorthand.
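The format is simple enough to parse by hand; here is a dependency-free sketch (in practice the SDK's propagator handles this) using a full-length example trace ID:

```typescript
// Parse a W3C traceparent header: version-trace_id-parent_span_id-flags
interface TraceParent {
  version: string;
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
}

function parseTraceparent(header: string): TraceParent | null {
  const m = /^([\da-f]{2})-([\da-f]{32})-([\da-f]{16})-([\da-f]{2})$/.exec(header);
  if (!m) return null; // malformed header
  const [, version, traceId, parentSpanId, flags] = m;
  // An all-zero trace ID or span ID is invalid per the spec
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const ctx = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
);
```

The receiving service uses the extracted `traceId` for its own spans and the `parentSpanId` as their parent, which is what stitches the cross-service tree together.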

Payment Service automatically extracts this context and creates a child span:

```
Order Service: Trace A (trace_id: abc123)
  └─ Span: Create Order (span_id: def456)
     └─ Span: Process Payment (span_id: ghi789) ✅ Same trace!
```

**Real Impact:**

In my microservices architecture, context propagation revealed that payment failures weren't due to our payment service - they originated from the inventory service returning stale data. The full trace showed:

```
1. Order Service checks inventory (200ms)
2. Inventory Service returns cached data (5ms) ← Stale!
3. Order Service creates order (50ms)
4. Payment Service charges card (300ms)
5. Inventory Service rejects (product sold out) (10ms)
6. Payment Service refunds (400ms)
7. Order Service returns error (10ms)
```

Total: 975ms wasted on a request that should have failed in step 1. Without distributed tracing, I would never have seen this pattern.

## Semantic Conventions

OpenTelemetry defines standard naming conventions for common attributes, ensuring consistency across languages and tools.

**HTTP Semantic Conventions:**

```typescript
// ✅ Correct: Following semantic conventions
span.setAttribute('http.method', 'POST');
span.setAttribute('http.url', 'https://api.example.com/orders');
span.setAttribute('http.status_code', 201);
span.setAttribute('http.user_agent', 'Mozilla/5.0...');

// ❌ Wrong: Custom naming breaks compatibility
span.setAttribute('request_type', 'POST');
span.setAttribute('api_url', 'https://api.example.com/orders');
span.setAttribute('response_code', 201);
```

These attribute names follow the conventions in wide use when this was written; newer versions of the HTTP semantic conventions rename some of them (e.g. `http.method` becomes `http.request.method`), so check which version your backend expects.

**Database Semantic Conventions:**

```typescript
span.setAttribute('db.system', 'postgresql');
span.setAttribute('db.name', 'orders_db');
span.setAttribute('db.statement', 'SELECT * FROM orders WHERE id = $1');
span.setAttribute('db.operation', 'SELECT');
```

**Why This Matters:**

Following semantic conventions means:

* Backend tools automatically recognize standard attributes
* Dashboards work without custom configuration
* Multi-language systems use consistent naming
* Community instrumentation libraries work together seamlessly

## The OTLP Protocol

OpenTelemetry Protocol (OTLP) is the standard way to transmit telemetry data. It supports:

* **HTTP/JSON**: Human-readable, easy debugging
* **gRPC/Protobuf**: Efficient binary protocol for production

**Example OTLP Trace Payload:**

```json
{
  "resourceSpans": [{
    "resource": {
      "attributes": [
        { "key": "service.name", "value": { "stringValue": "order-service" }},
        { "key": "service.version", "value": { "stringValue": "1.0.0" }}
      ]
    },
    "scopeSpans": [{
      "scope": {
        "name": "order-service",
        "version": "1.0.0"
      },
      "spans": [{
        "traceId": "abc123",
        "spanId": "def456",
        "name": "POST /api/orders",
        "kind": "SERVER",
        "startTimeUnixNano": "1704705825000000000",
        "endTimeUnixNano": "1704705825500000000",
        "attributes": [
          { "key": "http.method", "value": { "stringValue": "POST" }},
          { "key": "http.status_code", "value": { "intValue": 201 }}
        ]
      }]
    }]
  }]
}
```

## Signal Correlation: Tying It All Together

The real power of OpenTelemetry comes from correlating all three signals:

**Example: Investigating High Error Rate**

1. **Metric Alert**: `http_requests_failed` spike detected
2. **Query Traces**: Filter traces where `http.status_code >= 500`
3. **Find Pattern**: All failures have `db.statement` containing specific query
4. **Check Logs**: Find detailed error messages with same `trace_id`
5. **Root Cause**: Database index missing on recently added column

**In Code:**

```typescript
import { SpanStatusCode } from '@opentelemetry/api';

// Metric shows the problem exists
errorCounter.add(1, {
  'http.status_code': 500,
  'error.type': 'DatabaseError'
});

// Trace shows where it happens
const span = tracer.startSpan('processOrder');
span.setStatus({ code: SpanStatusCode.ERROR });
span.recordException(error);

// Log provides the details, linked to the trace via its IDs
logger.error('Order processing failed', {
  trace_id: span.spanContext().traceId,
  span_id: span.spanContext().spanId,
  order_id: orderId,
  error_message: error.message,
  stack_trace: error.stack
});

span.end(); // always end the span, or it never gets exported
```

## What You've Learned

You now understand:

✅ **The three pillars of telemetry**: Traces, metrics, and logs\
✅ **OpenTelemetry architecture**: API, SDK, instrumentation, exporters\
✅ **Distributed tracing mechanics**: Spans, traces, and context propagation\
✅ **Semantic conventions**: Standard naming for interoperability\
✅ **OTLP protocol**: How telemetry data is transmitted\
✅ **Signal correlation**: Connecting traces, metrics, and logs for investigations

## Real-World Impact

Since implementing OpenTelemetry in my production systems:

* **MTTR reduced by 70%**: From 4 hours to 45 minutes average
* **Performance optimization**: Identified and fixed 3 major bottlenecks
* **Proactive issue detection**: Caught problems before users reported them
* **Cost savings**: Rightsized infrastructure based on actual usage patterns
* **Team productivity**: Engineers debug independently without escalation

## Next Steps

Now that you understand the fundamentals, you're ready to instrument your first TypeScript application. Continue to [Getting Started with TypeScript](https://blog.htunnthuthu.com/devops-and-sre/opentelemetry-101/opentelemetry-101-typescript-setup) where you'll:

* Set up OpenTelemetry in a Node.js/TypeScript project
* Instrument an Express.js API
* See automatic traces in action
* Export telemetry to Jaeger and Prometheus
* Create your first custom spans

***

**Previous**: [← OpenTelemetry 101](https://blog.htunnthuthu.com/devops-and-sre/opentelemetry-101) | **Next**: [Getting Started with TypeScript →](https://blog.htunnthuthu.com/devops-and-sre/opentelemetry-101/opentelemetry-101-typescript-setup)

*Observability isn't overhead - it's insurance. And the time to buy insurance is before you need it.*
