Implementing Full-Stack Observability in a Multi-Tenant POS Microservice: OpenTelemetry, Grafana, and Distributed Tracing

A Developer's Journey from Blind Operations to Complete Visibility

Hey there! πŸ‘‹

I want to share what I learned about implementing comprehensive observability in a multi-tenant POS microservice system. You know those moments when issues occur in a distributed system, and you're left wondering, "What exactly happened? Which service failed? How long did that request take?"

Running microservices without observability is like driving a car with no dashboard - you're moving, but you have no idea how fast, how much fuel you have, or if the engine is overheating.

Imagine having 6 microservices running smoothly... until they aren't. And when things break, there's no clear way to identify the root cause.

Let me show you how to transform from operating blind to having complete visibility into every layer of a distributed system.

⚑ Quick Overview (TL;DR)

For developers implementing similar observability stacks:

Observability Stack:

Grafana Dashboard (Port 3002)
β”œβ”€β”€ Tempo (Distributed Tracing)
β”œβ”€β”€ Prometheus (Metrics Collection)
β”œβ”€β”€ Loki (Log Aggregation)
└── OpenTelemetry Collector (Data Pipeline)
    β”œβ”€β”€ Traces β†’ Tempo
    β”œβ”€β”€ Metrics β†’ Prometheus
    └── Logs β†’ Loki

Key Features:

  • πŸ” Distributed Tracing - Track requests across all microservices

  • πŸ“Š Custom Metrics - CPU, memory, database connections, business KPIs

  • πŸ“ Centralized Logging - All service logs in one place

  • 🎯 Real-time Dashboards - Pre-built visualizations for each service

  • πŸ”” Alert Rules - Proactive monitoring with notifications

  • 🏒 Multi-tenant Support - Isolated metrics per tenant

Tech Stack:

  • OpenTelemetry SDK (Instrumentation)

  • Grafana v10.2.3 OSS (Visualization)

  • Tempo (Trace Backend)

  • Prometheus (Metrics Backend)

  • Loki (Log Backend)

  • Promtail (Log Shipper)

  • Docker Compose (Orchestration)

That's it! Let's dive into how this all works.

πŸ€” The Problem: Operating Blind in Distributed Systems

Think about a typical microservices architecture. You might have:

  1. Auth Service - User authentication & JWT management

  2. POS Core Service - Orders, transactions, sales processing

  3. Inventory Service - Stock management & product catalog

  4. Payment Service - Payment processing & reconciliation

  5. Restaurant Service - Store operations, menu management

  6. Chatbot Service - AI-powered business analytics

Each service logs to its own stdout. Each has its own metrics (if any). When issues occur:

Without Observability:

  • SSH into containers, grep through per-service logs, and manually correlate timestamps across services, guessing at which hop actually failed.

With Observability:

  • Open Grafana, pull up the trace for the failing request, and see every hop, its duration, and its errors in a single view.

That's the power of observability. Let's build it.

πŸ—οΈ Architecture Deep Dive

The Three Pillars of Observability

A comprehensive observability stack implements the three pillars:

  1. Metrics - What is happening? (CPU, memory, requests/sec)

  2. Logs - Why is it happening? (Error messages, debug info)

  3. Traces - Where is it happening? (Request flow, latency breakdown)

Architecture Diagram

(Diagram: all six services export OTLP data to the OpenTelemetry Collector, which routes traces to Tempo, metrics to Prometheus, and logs to Loki; Grafana queries all three backends.)

Component Responsibilities

OpenTelemetry Collector:

  • Receives telemetry data from all services

  • Processes and transforms data

  • Routes traces to Tempo, metrics to Prometheus

  • Provides buffering and retry logic

Prometheus:

  • Scrapes metrics from /metrics endpoints

  • Time-series database for metrics storage

  • Supports PromQL query language

  • Handles alerting rules

Tempo:

  • Distributed tracing backend

  • Stores and queries trace data

  • Supports TraceQL query language

  • Efficient columnar storage

Loki:

  • Log aggregation system

  • Indexes only metadata (not content)

  • LogQL query language

  • Cost-effective log storage

Grafana:

  • Unified visualization platform

  • Connects to all data sources

  • Pre-built and custom dashboards

  • Alerting and notification system

πŸ“¦ Implementation: OpenTelemetry Integration

Step 1: Install Dependencies

First, add OpenTelemetry packages to your service:
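
The exact package list depends on your exporters; a minimal set for Node auto-instrumentation with OTLP export (my assumption here) looks something like:

```bash
npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/sdk-metrics \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/exporter-metrics-otlp-http \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions
```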

Step 2: Create Telemetry Configuration

Create src/telemetry.ts to initialize OpenTelemetry:
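
A minimal sketch using the NodeSDK with OTLP/HTTP exporters pointed at a local collector. The endpoint, env var names, and the exact Resource/semantic-conventions imports vary by SDK version and are assumptions here:

```typescript
// src/telemetry.ts - must be imported before anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

// OTLP/HTTP endpoint of the OpenTelemetry Collector (assumed default port 4318)
const collectorUrl = process.env.OTEL_COLLECTOR_URL ?? 'http://localhost:4318';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME ?? 'pos-core-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '1.0.0',
  }),
  traceExporter: new OTLPTraceExporter({ url: `${collectorUrl}/v1/traces` }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: `${collectorUrl}/v1/metrics` }),
    exportIntervalMillis: 15_000,
  }),
  // Auto-instruments http, express, database drivers, redis, and more
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush any pending telemetry on shutdown
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```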

Key Learning: Import telemetry configuration FIRST in your main file, before any other imports. This ensures all HTTP, database, and framework calls are automatically instrumented.

Step 3: Initialize in Main Application
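
A minimal sketch of the entry point, assuming src/index.ts and Express; the key point is that the telemetry import comes first:

```typescript
// src/index.ts
import './telemetry'; // must come first so auto-instrumentation can patch modules

import express from 'express';

const app = express();
// ... routes, middleware, etc.

const port = Number(process.env.PORT ?? 4002); // POS Core port from the architecture
app.listen(port, () => {
  console.log(`POS Core service listening on ${port}`);
});
```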

Step 4: Add Custom Metrics

Create src/utils/metrics.ts for custom business metrics:
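
A sketch of what such a module might contain; the metric names below are assumptions and are reused in the PromQL examples later:

```typescript
// src/utils/metrics.ts
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('pos-core-service');

// Counter: cumulative business events
export const ordersCreated = meter.createCounter('orders_created_total', {
  description: 'Total number of orders created',
});

// Histogram: distributions such as query latency
export const dbQueryDuration = meter.createHistogram('db_query_duration_ms', {
  description: 'Database query duration in milliseconds',
  unit: 'ms',
});

// Observable gauge: current values sampled at each export
let activeDbConnections = 0;
export const setActiveDbConnections = (n: number) => { activeDbConnections = n; };

meter
  .createObservableGauge('db_connections_active', {
    description: 'Currently active database connections',
  })
  .addCallback((result) => result.observe(activeDbConnections));

// Usage example, e.g. in an order handler:
// ordersCreated.add(1, { tenant_id: tenantId });
```

Tagging business counters with a tenant attribute is what makes the multi-tenant metric isolation mentioned above possible.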

Key Learning: Use different metric types for different use cases:

  • Counters for cumulative values (orders created, errors)

  • Histograms for distributions (request latency, query duration)

  • Gauges for current values (active connections, memory usage)

Step 5: Add Prometheus Metrics Endpoint

For additional metrics exposure via Prometheus scraping:
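
If you back this with prom-client (an assumption, since the library isn't named here), install it first:

```bash
npm install prom-client
```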

Add metrics endpoint to your Express app:
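
A sketch using prom-client's default Node.js process metrics alongside the OTLP pipeline; the file name is assumed:

```typescript
// src/metricsEndpoint.ts
import express from 'express';
import client from 'prom-client';

const register = new client.Registry();

// Default Node.js process metrics: CPU, memory, event loop lag, handles, GC
client.collectDefaultMetrics({ register });

export function mountMetricsEndpoint(app: express.Express): void {
  app.get('/metrics', async (_req, res) => {
    res.set('Content-Type', register.contentType);
    res.end(await register.metrics());
  });
}
```

Prometheus then scrapes this endpoint on each service (see the prometheus.yml sketch below).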

Step 6: Add Periodic Metric Updates

Some metrics need periodic updates (like database connections):

Key Learning: Don't query database connections on every request - use periodic polling to update gauges efficiently.

🐳 Docker Compose Observability Stack

Create docker-compose.observability.yml:
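
The real file is longer; a trimmed sketch of its shape, with image tags and mount paths as assumptions and ports following the article:

```yaml
# docker-compose.observability.yml (abbreviated sketch; Loki/Promtail configs omitted)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.91.0
    command: ["--config=/etc/otel/otel-collector-config.yaml"]
    volumes:
      - ./observability/otel/otel-collector-config.yaml:/etc/otel/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP

  tempo:
    image: grafana/tempo:2.3.1
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./observability/tempo/tempo.yaml:/etc/tempo/tempo.yaml

  prometheus:
    image: prom/prometheus:v2.48.1
    volumes:
      - ./observability/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml

  loki:
    image: grafana/loki:2.9.3

  promtail:
    image: grafana/promtail:2.9.3
    volumes:
      - /var/lib/docker/containers:/var/lib/docker/containers:ro

  grafana:
    image: grafana/grafana-oss:10.2.3
    ports:
      - "3002:3000"   # Grafana UI on host port 3002
    volumes:
      - ./observability/grafana/provisioning:/etc/grafana/provisioning
      - ./observability/grafana/dashboards:/var/lib/grafana/dashboards
```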

OpenTelemetry Collector Configuration

Create observability/otel/otel-collector-config.yaml:
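
A sketch of a pipeline matching the routing described earlier; endpoints assume the compose service names, and the loki exporter requires the contrib distribution:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 5s

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889   # scraped by Prometheus
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```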

Prometheus Configuration

Create observability/prometheus/prometheus.yml:
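
A sketch; host names are assumed to match the compose service names, and the ports follow the architecture summary below:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  # Metrics the OTel Collector re-exposes via its Prometheus exporter
  - job_name: otel-collector
    static_configs:
      - targets: ['otel-collector:8889']

  # Direct /metrics scraping of each service
  - job_name: pos-services
    static_configs:
      - targets:
          - 'auth-service:4001'
          - 'pos-core-service:4002'
          - 'inventory-service:4003'
          - 'payment-service:4004'
          - 'restaurant-service:4005'
          - 'chatbot-service:4006'
```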

Tempo Configuration

Create observability/tempo/tempo.yaml:
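
A minimal single-binary Tempo configuration sketch; paths and retention are assumptions:

```yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
        http:

compactor:
  compaction:
    block_retention: 48h

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal
    local:
      path: /var/tempo/blocks
```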

πŸ“Š Creating Grafana Dashboards

POS Core Service Dashboard

Create observability/grafana/dashboards/pos-core-service.json:
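
The full dashboard JSON runs to hundreds of lines; a trimmed sketch of its shape, with the panel query and metric name as placeholders:

```json
{
  "title": "POS Core Service",
  "uid": "pos-core-service",
  "tags": ["pos", "microservices"],
  "panels": [
    {
      "type": "timeseries",
      "title": "Request Rate (req/sec)",
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "targets": [
        { "expr": "sum(rate(http_requests_total{service=\"pos-core-service\"}[5m]))" }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
    }
  ],
  "schemaVersion": 39,
  "time": { "from": "now-6h", "to": "now" }
}
```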

This dashboard includes panels for:

  1. Service Health

    • Service Status (up/down)

    • Uptime percentage

    • Last restart time

  2. Resource Metrics

    • CPU Usage (%)

    • Memory Usage (MB)

    • Heap Memory (used/total)

    • Active Handles

    • Event Loop Lag

  3. Database Metrics

    • Active DB Connections

    • Query Duration (p50, p95, p99)

    • Query Rate (queries/sec)

    • Connection Pool Usage

  4. HTTP Metrics

    • Request Rate (req/sec)

    • Request Duration (p50, p95, p99)

    • Error Rate (%)

    • Status Code Distribution

  5. Business Metrics

    • Orders Created (total, rate)

    • Active Orders (current)

    • Transactions Processed

    • Average Order Value

  6. Distributed Traces

    • Recent traces from Tempo

    • Trace duration visualization

    • Service dependency graph

Grafana Data Source Configuration

Create observability/grafana/provisioning/datasources/datasources.yml:
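
A provisioning sketch, with URLs assuming the compose service names:

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200

  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
```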

πŸ”„ Complete Request Flow with Observability

Let's trace a complete order creation request. From the HTTP entry point in POS Core through its database queries and downstream service calls, every hop is recorded as a span under a single trace ID, and that same ID ties the request's logs and metric exemplars back to the trace.

🎯 TraceQL Queries for Investigation

Find Slow Requests
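
For example, to surface anything slower than an arbitrary threshold:

```traceql
{ duration > 2s }
```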

Find Errors in Specific Service
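
Something like this, with the service name matching your resource attributes:

```traceql
{ resource.service.name = "pos-core-service" && status = error }
```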

Find Traces with Database Queries
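
Assuming the database instrumentation sets the standard db.system span attribute:

```traceql
{ span.db.system = "postgresql" }
```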

Find Traces for Specific Tenant
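
This relies on spans carrying a tenant attribute; the attribute name below is an assumption:

```traceql
{ span.tenant.id = "tenant-123" }
```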

Complex Query: Slow Orders with Errors
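
Combining the pieces above (route, threshold, and attribute names are assumptions):

```traceql
{ resource.service.name = "pos-core-service" && span.http.route = "/api/orders" && duration > 1s && status = error }
```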

🎯 PromQL Queries for Monitoring

Request Rate
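
Assuming a counter along the lines of http_requests_total (exact metric and label names depend on your exporter):

```promql
sum(rate(http_requests_total{service="pos-core-service"}[5m]))
```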

Request Latency
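
p95 from a duration histogram (metric name assumed):

```promql
histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket{service="pos-core-service"}[5m])) by (le))
```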

Error Rate
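
Share of 5xx responses over all requests (label names assumed):

```promql
100 * sum(rate(http_requests_total{service="pos-core-service", status_code=~"5.."}[5m]))
    / sum(rate(http_requests_total{service="pos-core-service"}[5m]))
```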

Database Connections
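
Using the gauge from the custom-metrics sketch earlier:

```promql
db_connections_active{service="pos-core-service"}
```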

Memory Usage
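
Resident memory in MB, using a default prom-client process metric:

```promql
process_resident_memory_bytes{job="pos-services"} / 1024 / 1024
```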

Business Metrics
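
Order throughput per tenant, using the counter from the custom-metrics sketch:

```promql
sum(rate(orders_created_total[5m])) by (tenant_id)
```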

πŸš€ Deployment and Usage

1. Start Observability Stack
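
Assuming the compose file sketched earlier:

```bash
docker compose -f docker-compose.observability.yml up -d
```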

2. Access Grafana
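
With the port mapping above, Grafana is available at http://localhost:3002. Log in with whatever admin credentials your compose file configures (Grafana's out-of-the-box default is admin/admin, and it prompts for a new password on first login).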

3. View Traces in Tempo

  1. Navigate to Explore in Grafana

  2. Select Tempo data source

  3. Use TraceQL queries or search by trace ID

  4. Click on spans to see details

4. Query Metrics in Prometheus

  1. Navigate to Explore in Grafana

  2. Select Prometheus data source

  3. Use PromQL queries

  4. Visualize with graphs

5. Search Logs in Loki

  1. Navigate to Explore in Grafana

  2. Select Loki data source

  3. Use LogQL queries: {service="pos-core"}

  4. Filter by time range and labels

πŸ“ˆ Monitoring Best Practices

1. The Four Golden Signals

  • Latency - How long does it take?

  • Traffic - How many requests?

  • Errors - How many failures?

  • Saturation - How full are resources?

2. Set Up Alerts

Create alert rules in Prometheus:
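
A sketch of a Prometheus rules file; thresholds and metric names are assumptions matching the queries above:

```yaml
# observability/prometheus/alert-rules.yml (assumed path)
groups:
  - name: pos-core-service
    rules:
      - alert: HighErrorRate
        expr: |
          100 * sum(rate(http_requests_total{service="pos-core-service", status_code=~"5.."}[5m]))
              / sum(rate(http_requests_total{service="pos-core-service"}[5m])) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% on pos-core-service"

      - alert: HighRequestLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket{service="pos-core-service"}[5m])) by (le)) > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 1s on pos-core-service"
```

Remember to reference the file from prometheus.yml under rule_files so Prometheus actually loads it.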

3. Dashboard Organization

Organize dashboards by concern:

  • Overview Dashboard - High-level health across all services

  • Service Dashboards - Deep dive per service

  • Infrastructure Dashboard - System resources (CPU, memory, disk)

  • Business Dashboard - KPIs (orders, revenue, users)

  • SLI/SLO Dashboard - Service level objectives tracking

🎯 Lessons Learned

1. Start with Auto-Instrumentation

Don't write custom spans for everything initially. OpenTelemetry's auto-instrumentation covers:

  • HTTP requests/responses

  • Database queries (Prisma, TypeORM, etc.)

  • Redis operations

  • External HTTP calls

Add custom spans only for business-critical operations.

2. Be Strategic with Metrics

More metrics β‰  better observability. Focus on:

  • RED metrics (Rate, Errors, Duration) for requests

  • USE metrics (Utilization, Saturation, Errors) for resources

  • Business KPIs specific to your domain

3. Sampling Strategy

For high-traffic services, implement sampling:
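
One way to do this (an assumption here, since the sampling setup isn't shown) is tail-based sampling in the collector, which keeps every error and slow trace while sampling the rest; the tail_sampling processor ships with the contrib collector image:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000
      - name: keep-payments
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/api/payments"]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

A trace is kept if any policy matches; add tail_sampling to the traces pipeline's processor list in the collector config.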

But always sample:

  • Errors (status=error)

  • Slow requests (duration > threshold)

  • High-value operations (payment processing)

4. Correlate Telemetry

The power of observability comes from correlation:

  • Link traces to logs (via trace_id) - see the sketch after this list

  • Link traces to metrics (via exemplars)

  • Link metrics to logs (via timestamps and labels)
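
To make the trace-to-log link concrete, here is a sketch of stamping the active trace ID onto every log line; pino is an assumption, and any structured logger works:

```typescript
import { trace } from '@opentelemetry/api';
import pino from 'pino';

const logger = pino({
  mixin() {
    // Attach the current trace/span IDs to every log line so Loki entries
    // can be joined with Tempo traces on trace_id.
    const spanContext = trace.getActiveSpan()?.spanContext();
    return spanContext
      ? { trace_id: spanContext.traceId, span_id: spanContext.spanId }
      : {};
  },
});

logger.info({ order_id: 'ord_123' }, 'Order created');
```

Note that the pino instrumentation bundled with auto-instrumentations-node can inject these fields for you automatically.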

5. Make Dashboards Actionable

Every dashboard should answer:

  • What is happening? (Current state)

  • Why is it happening? (Root cause)

  • What should I do? (Remediation)

Bad dashboard: shows CPU is at 80%.

Good dashboard: shows CPU is at 80% because the /api/orders endpoint has a 5x traffic spike, with a link to the runbook.

πŸ’­ Real-World Impact

Before Observability

  • MTTR (Mean Time To Resolution): 2-3 hours

  • Investigation Method: SSH + grep + guesswork

  • Visibility: Per-service logs only

  • Proactive Monitoring: None

  • Cost: High developer time on debugging

After Observability

  • MTTR: 5-10 minutes

  • Investigation Method: Grafana dashboard + trace search

  • Visibility: Full request lifecycle across all services

  • Proactive Monitoring: Alerts before users notice

  • Cost: Low developer time, more focus on features

Real Example

Issue: "Orders are taking too long to process"

Before (2 hours):

  1. Check all service logs

  2. Find order ID mentions

  3. Manually correlate timestamps

  4. Discover database query timeout

  5. Check database logs

  6. Find slow query on inventory check

  7. Fix with index

After (8 minutes):

  1. Search trace by order ID

  2. See complete request flow

  3. Identify slow span: inventory.check (4.2s)

  4. View span attributes: SELECT query without index

  5. Check Prometheus: db_query_duration increased

  6. Fix with index

πŸ”— Architecture Summary

Observability Stack:

  • OpenTelemetry SDK (Instrumentation)

  • OpenTelemetry Collector (Pipeline)

  • Tempo (Traces)

  • Prometheus (Metrics)

  • Loki (Logs)

  • Grafana (Visualization)

Services Monitored:

  • Auth Service (Port 4001)

  • POS Core (Port 4002)

  • Inventory (Port 4003)

  • Payment (Port 4004)

  • Restaurant (Port 4005)

  • Chatbot (Port 4006)

Key Features:

  • βœ… Distributed tracing with trace context propagation

  • βœ… Custom business metrics (orders, transactions, revenue)

  • βœ… System metrics (CPU, memory, connections)

  • βœ… Centralized log aggregation

  • βœ… Pre-built dashboards per service

  • βœ… Alert rules for proactive monitoring

  • βœ… Multi-tenant metric isolation

πŸ’¬ Final Thoughts

Implementing comprehensive observability can transform how teams operate distributed microservices. The key insights I've learned:

  1. Observability is not optional - You can't improve what you can't measure

  2. Start simple, iterate - Basic instrumentation first, advanced features later

  3. Auto-instrumentation is your friend - Leverage existing tools rather than building from scratch

  4. Correlation is key - Traces + Metrics + Logs together provide complete context

  5. Make it actionable - Dashboards should guide investigation and remediation

The investment in observability infrastructure provides significant value over time. When issues occur (and they will), having proper observability means knowing exactly what happened, where it happened, and how to fix it.

I hope this guide helps you implement observability in your own microservices architecture. Feel free to adapt the patterns and configurations to your specific needs!

Thanks for reading! If you found this helpful, consider sharing it with your team. Better observability leads to better software for everyone.

β€” Happy monitoring! πŸ”πŸ“Š



πŸ”– Tags

#Observability #OpenTelemetry #Grafana #Microservices #DistributedTracing #Prometheus #Tempo #Loki #Monitoring #DevOps #SRE #POS #TypeScript #NodeJS
