Implementing Full-Stack Observability in a Multi-Tenant POS Microservice: OpenTelemetry, Grafana, and Distributed Tracing

A Developer's Journey from Blind Operations to Complete Visibility

Hey there! πŸ‘‹

I want to share what I learned about implementing comprehensive observability in a multi-tenant POS microservice system. You know those moments when issues occur in a distributed system, and you're left wondering, "What exactly happened? Which service failed? How long did that request take?"

Running microservices without observability is like driving a car with no dashboard - you're moving, but you have no idea how fast, how much fuel you have, or if the engine is overheating.

Imagine having 6 microservices running smoothly... until they aren't. And when things break, there's no clear way to identify the root cause.

Let me show you how to transform from operating blind to having complete visibility into every layer of a distributed system.

⚑ Quick Overview (TL;DR)

For developers implementing similar observability stacks:

Observability Stack:

Grafana Dashboard (Port 3002)
β”œβ”€β”€ Tempo (Distributed Tracing)
β”œβ”€β”€ Prometheus (Metrics Collection)
β”œβ”€β”€ Loki (Log Aggregation)
└── OpenTelemetry Collector (Data Pipeline)
    β”œβ”€β”€ Traces β†’ Tempo
    β”œβ”€β”€ Metrics β†’ Prometheus
    └── Logs β†’ Loki

Key Features:

  • πŸ” Distributed Tracing - Track requests across all microservices

  • πŸ“Š Custom Metrics - CPU, memory, database connections, business KPIs

  • πŸ“ Centralized Logging - All service logs in one place

  • 🎯 Real-time Dashboards - Pre-built visualizations for each service

  • πŸ”” Alert Rules - Proactive monitoring with notifications

  • 🏒 Multi-tenant Support - Isolated metrics per tenant

Tech Stack:

  • OpenTelemetry SDK (Instrumentation)

  • Grafana v10.2.3 OSS (Visualization)

  • Tempo (Trace Backend)

  • Prometheus (Metrics Backend)

  • Loki (Log Backend)

  • Promtail (Log Shipper)

  • Docker Compose (Orchestration)

That's it! Let's dive into how this all works.

πŸ€” The Problem: Operating Blind in Distributed Systems

Think about a typical microservices architecture. You might have:

  1. Auth Service - User authentication & JWT management

  2. POS Core Service - Orders, transactions, sales processing

  3. Inventory Service - Stock management & product catalog

  4. Payment Service - Payment processing & reconciliation

  5. Restaurant Service - Store operations, menu management

  6. Chatbot Service - AI-powered business analytics

Each service logs to its own stdout. Each has its own metrics (if any). When issues occur:

Without Observability:

  • SSH into containers, grep through per-service logs, and manually correlate timestamps across services, guessing at which hop actually failed.

With Observability:

  • Open Grafana, pull up the trace for the failing request, and see every hop, its duration, and its errors in a single view.

That's the power of observability. Let's build it.

πŸ—οΈ Architecture Deep Dive

The Three Pillars of Observability

A comprehensive observability stack implements the three pillars:

  1. Metrics - What is happening? (CPU, memory, requests/sec)

  2. Logs - Why is it happening? (Error messages, debug info)

  3. Traces - Where is it happening? (Request flow, latency breakdown)

Architecture Diagram

(Diagram: all six services export OTLP data to the OpenTelemetry Collector, which routes traces to Tempo, metrics to Prometheus, and logs to Loki; Grafana queries all three backends.)

Component Responsibilities

OpenTelemetry Collector:

  • Receives telemetry data from all services

  • Processes and transforms data

  • Routes traces to Tempo, metrics to Prometheus

  • Provides buffering and retry logic

Prometheus:

  • Scrapes metrics from /metrics endpoints

  • Time-series database for metrics storage

  • Supports PromQL query language

  • Handles alerting rules

Tempo:

  • Distributed tracing backend

  • Stores and queries trace data

  • Supports TraceQL query language

  • Efficient columnar storage

Loki:

  • Log aggregation system

  • Indexes only metadata (not content)

  • LogQL query language

  • Cost-effective log storage

Grafana:

  • Unified visualization platform

  • Connects to all data sources

  • Pre-built and custom dashboards

  • Alerting and notification system

πŸ“¦ Implementation: OpenTelemetry Integration

Step 1: Install Dependencies

First, add OpenTelemetry packages to your service:
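
The exact package list depends on your exporters; a minimal set for Node auto-instrumentation with OTLP export (my assumption here) looks something like:

```bash
npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/sdk-metrics \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/exporter-metrics-otlp-http \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions
```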

Step 2: Create Telemetry Configuration

Create src/telemetry.ts to initialize OpenTelemetry:
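
A minimal sketch using the NodeSDK with OTLP/HTTP exporters pointed at a local collector. The endpoint, env var names, and the exact Resource/semantic-conventions imports vary by SDK version and are assumptions here:

```typescript
// src/telemetry.ts - must be imported before anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

// OTLP/HTTP endpoint of the OpenTelemetry Collector (assumed default port 4318)
const collectorUrl = process.env.OTEL_COLLECTOR_URL ?? 'http://localhost:4318';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME ?? 'pos-core-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '1.0.0',
  }),
  traceExporter: new OTLPTraceExporter({ url: `${collectorUrl}/v1/traces` }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: `${collectorUrl}/v1/metrics` }),
    exportIntervalMillis: 15_000,
  }),
  // Auto-instruments http, express, database drivers, redis, and more
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush any pending telemetry on shutdown
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```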

Key Learning: Import telemetry configuration FIRST in your main file, before any other imports. This ensures all HTTP, database, and framework calls are automatically instrumented.

Step 3: Initialize in Main Application
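
A minimal sketch of the entry point, assuming src/index.ts and Express; the key point is that the telemetry import comes first:

```typescript
// src/index.ts
import './telemetry'; // must come first so auto-instrumentation can patch modules

import express from 'express';

const app = express();
// ... routes, middleware, etc.

const port = Number(process.env.PORT ?? 4002); // POS Core port from the architecture
app.listen(port, () => {
  console.log(`POS Core service listening on ${port}`);
});
```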

Step 4: Add Custom Metrics

Create src/utils/metrics.ts for custom business metrics:
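
A sketch of what such a module might contain; the metric names below are assumptions and are reused in the PromQL examples later:

```typescript
// src/utils/metrics.ts
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('pos-core-service');

// Counter: cumulative business events
export const ordersCreated = meter.createCounter('orders_created_total', {
  description: 'Total number of orders created',
});

// Histogram: distributions such as query latency
export const dbQueryDuration = meter.createHistogram('db_query_duration_ms', {
  description: 'Database query duration in milliseconds',
  unit: 'ms',
});

// Observable gauge: current values sampled at each export
let activeDbConnections = 0;
export const setActiveDbConnections = (n: number) => { activeDbConnections = n; };

meter
  .createObservableGauge('db_connections_active', {
    description: 'Currently active database connections',
  })
  .addCallback((result) => result.observe(activeDbConnections));

// Usage example, e.g. in an order handler:
// ordersCreated.add(1, { tenant_id: tenantId });
```

Tagging business counters with a tenant attribute is what makes the multi-tenant metric isolation mentioned above possible.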

Key Learning: Use different metric types for different use cases:

  • Counters for cumulative values (orders created, errors)

  • Histograms for distributions (request latency, query duration)

  • Gauges for current values (active connections, memory usage)

Step 5: Add Prometheus Metrics Endpoint

For additional metrics exposure via Prometheus scraping:
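
If you back this with prom-client (an assumption, since the library isn't named here), install it first:

```bash
npm install prom-client
```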

Add metrics endpoint to your Express app:
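
A sketch using prom-client's default Node.js process metrics alongside the OTLP pipeline; the file name is assumed:

```typescript
// src/metricsEndpoint.ts
import express from 'express';
import client from 'prom-client';

const register = new client.Registry();

// Default Node.js process metrics: CPU, memory, event loop lag, handles, GC
client.collectDefaultMetrics({ register });

export function mountMetricsEndpoint(app: express.Express): void {
  app.get('/metrics', async (_req, res) => {
    res.set('Content-Type', register.contentType);
    res.end(await register.metrics());
  });
}
```

Prometheus then scrapes this endpoint on each service (see the prometheus.yml sketch below).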

Step 6: Add Periodic Metric Updates

Some metrics need periodic updates (like database connections):

Key Learning: Don't query database connections on every request - use periodic polling to update gauges efficiently.

🐳 Docker Compose Observability Stack

Create docker-compose.observability.yml:
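
The real file is longer; a trimmed sketch of its shape, with image tags and mount paths as assumptions and ports following the article:

```yaml
# docker-compose.observability.yml (abbreviated sketch; Loki/Promtail configs omitted)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.91.0
    command: ["--config=/etc/otel/otel-collector-config.yaml"]
    volumes:
      - ./observability/otel/otel-collector-config.yaml:/etc/otel/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP

  tempo:
    image: grafana/tempo:2.3.1
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./observability/tempo/tempo.yaml:/etc/tempo/tempo.yaml

  prometheus:
    image: prom/prometheus:v2.48.1
    volumes:
      - ./observability/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml

  loki:
    image: grafana/loki:2.9.3

  promtail:
    image: grafana/promtail:2.9.3
    volumes:
      - /var/lib/docker/containers:/var/lib/docker/containers:ro

  grafana:
    image: grafana/grafana-oss:10.2.3
    ports:
      - "3002:3000"   # Grafana UI on host port 3002
    volumes:
      - ./observability/grafana/provisioning:/etc/grafana/provisioning
      - ./observability/grafana/dashboards:/var/lib/grafana/dashboards
```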

OpenTelemetry Collector Configuration

Create observability/otel/otel-collector-config.yaml:
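
A sketch of a pipeline matching the routing described earlier; endpoints assume the compose service names, and the loki exporter requires the contrib distribution:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 5s

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889   # scraped by Prometheus
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```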

Prometheus Configuration

Create observability/prometheus/prometheus.yml:
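
A sketch; host names are assumed to match the compose service names, and the ports follow the architecture summary below:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  # Metrics the OTel Collector re-exposes via its Prometheus exporter
  - job_name: otel-collector
    static_configs:
      - targets: ['otel-collector:8889']

  # Direct /metrics scraping of each service
  - job_name: pos-services
    static_configs:
      - targets:
          - 'auth-service:4001'
          - 'pos-core-service:4002'
          - 'inventory-service:4003'
          - 'payment-service:4004'
          - 'restaurant-service:4005'
          - 'chatbot-service:4006'
```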

Tempo Configuration

Create observability/tempo/tempo.yaml:
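
A minimal single-binary Tempo configuration sketch; paths and retention are assumptions:

```yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
        http:

compactor:
  compaction:
    block_retention: 48h

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal
    local:
      path: /var/tempo/blocks
```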

πŸ“Š Creating Grafana Dashboards

POS Core Service Dashboard

Create observability/grafana/dashboards/pos-core-service.json:
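
The full dashboard JSON runs to hundreds of lines; a trimmed sketch of its shape, with the panel query and metric name as placeholders:

```json
{
  "title": "POS Core Service",
  "uid": "pos-core-service",
  "tags": ["pos", "microservices"],
  "panels": [
    {
      "type": "timeseries",
      "title": "Request Rate (req/sec)",
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "targets": [
        { "expr": "sum(rate(http_requests_total{service=\"pos-core-service\"}[5m]))" }
      ],
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
    }
  ],
  "schemaVersion": 39,
  "time": { "from": "now-6h", "to": "now" }
}
```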

This dashboard includes panels for:

  1. Service Health

    • Service Status (up/down)

    • Uptime percentage

    • Last restart time

  2. Resource Metrics

    • CPU Usage (%)

    • Memory Usage (MB)

    • Heap Memory (used/total)

    • Active Handles

    • Event Loop Lag

  3. Database Metrics

    • Active DB Connections

    • Query Duration (p50, p95, p99)

    • Query Rate (queries/sec)

    • Connection Pool Usage

  4. HTTP Metrics

    • Request Rate (req/sec)

    • Request Duration (p50, p95, p99)

    • Error Rate (%)

    • Status Code Distribution

  5. Business Metrics

    • Orders Created (total, rate)

    • Active Orders (current)

    • Transactions Processed

    • Average Order Value

  6. Distributed Traces

    • Recent traces from Tempo

    • Trace duration visualization

    • Service dependency graph

Grafana Data Source Configuration

Create observability/grafana/provisioning/datasources/datasources.yml:
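
A provisioning sketch, with URLs assuming the compose service names:

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200

  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
```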

πŸ”„ Complete Request Flow with Observability

Let's trace a complete order creation request. From the HTTP entry point in POS Core through its database queries and downstream service calls, every hop is recorded as a span under a single trace ID, and that same ID ties the request's logs and metric exemplars back to the trace.

🎯 TraceQL Queries for Investigation

Find Slow Requests
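
For example, to surface anything slower than an arbitrary threshold:

```traceql
{ duration > 2s }
```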

Find Errors in Specific Service
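
Something like this, with the service name matching your resource attributes:

```traceql
{ resource.service.name = "pos-core-service" && status = error }
```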

Find Traces with Database Queries
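
Assuming the database instrumentation sets the standard db.system span attribute:

```traceql
{ span.db.system = "postgresql" }
```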

Find Traces for Specific Tenant
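
This relies on spans carrying a tenant attribute; the attribute name below is an assumption:

```traceql
{ span.tenant.id = "tenant-123" }
```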

Complex Query: Slow Orders with Errors
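
Combining the pieces above (route, threshold, and attribute names are assumptions):

```traceql
{ resource.service.name = "pos-core-service" && span.http.route = "/api/orders" && duration > 1s && status = error }
```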

🎯 PromQL Queries for Monitoring

Request Rate
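
Assuming a counter along the lines of http_requests_total (exact metric and label names depend on your exporter):

```promql
sum(rate(http_requests_total{service="pos-core-service"}[5m]))
```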

Request Latency
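
p95 from a duration histogram (metric name assumed):

```promql
histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket{service="pos-core-service"}[5m])) by (le))
```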

Error Rate
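
Share of 5xx responses over all requests (label names assumed):

```promql
100 * sum(rate(http_requests_total{service="pos-core-service", status_code=~"5.."}[5m]))
    / sum(rate(http_requests_total{service="pos-core-service"}[5m]))
```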

Database Connections
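
Using the gauge from the custom-metrics sketch earlier:

```promql
db_connections_active{service="pos-core-service"}
```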

Memory Usage
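
Resident memory in MB, using a default prom-client process metric:

```promql
process_resident_memory_bytes{job="pos-services"} / 1024 / 1024
```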

Business Metrics
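
Order throughput per tenant, using the counter from the custom-metrics sketch:

```promql
sum(rate(orders_created_total[5m])) by (tenant_id)
```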

πŸš€ Deployment and Usage

1. Start Observability Stack
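
Assuming the compose file sketched earlier:

```bash
docker compose -f docker-compose.observability.yml up -d
```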

2. Access Grafana
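
With the port mapping above, Grafana is available at http://localhost:3002. Log in with whatever admin credentials your compose file configures (Grafana's out-of-the-box default is admin/admin, and it prompts for a new password on first login).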

3. View Traces in Tempo

  1. Navigate to Explore in Grafana

  2. Select Tempo data source

  3. Use TraceQL queries or search by trace ID

  4. Click on spans to see details

4. Query Metrics in Prometheus

  1. Navigate to Explore in Grafana

  2. Select Prometheus data source

  3. Use PromQL queries

  4. Visualize with graphs

5. Search Logs in Loki

  1. Navigate to Explore in Grafana

  2. Select Loki data source

  3. Use LogQL queries: {service="pos-core"}

  4. Filter by time range and labels

πŸ“ˆ Monitoring Best Practices

1. The Four Golden Signals

  • Latency - How long does it take?

  • Traffic - How many requests?

  • Errors - How many failures?

  • Saturation - How full are resources?

2. Set Up Alerts

Create alert rules in Prometheus:
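
A sketch of a Prometheus rules file; thresholds and metric names are assumptions matching the queries above:

```yaml
# observability/prometheus/alert-rules.yml (assumed path)
groups:
  - name: pos-core-service
    rules:
      - alert: HighErrorRate
        expr: |
          100 * sum(rate(http_requests_total{service="pos-core-service", status_code=~"5.."}[5m]))
              / sum(rate(http_requests_total{service="pos-core-service"}[5m])) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% on pos-core-service"

      - alert: HighRequestLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket{service="pos-core-service"}[5m])) by (le)) > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 1s on pos-core-service"
```

Remember to reference the file from prometheus.yml under rule_files so Prometheus actually loads it.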

3. Dashboard Organization

Organize dashboards by concern:

  • Overview Dashboard - High-level health across all services

  • Service Dashboards - Deep dive per service

  • Infrastructure Dashboard - System resources (CPU, memory, disk)

  • Business Dashboard - KPIs (orders, revenue, users)

  • SLI/SLO Dashboard - Service level objectives tracking

🎯 Lessons Learned

1. Start with Auto-Instrumentation

Don't write custom spans for everything initially. OpenTelemetry's auto-instrumentation covers:

  • HTTP requests/responses

  • Database queries (Prisma, TypeORM, etc.)

  • Redis operations

  • External HTTP calls

Add custom spans only for business-critical operations.

2. Be Strategic with Metrics

More metrics β‰  better observability. Focus on:

  • RED metrics (Rate, Errors, Duration) for requests

  • USE metrics (Utilization, Saturation, Errors) for resources

  • Business KPIs specific to your domain

3. Sampling Strategy

For high-traffic services, implement sampling:
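
One way to do this (an assumption here, since the sampling setup isn't shown) is tail-based sampling in the collector, which keeps every error and slow trace while sampling the rest; the tail_sampling processor ships with the contrib collector image:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000
      - name: keep-payments
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/api/payments"]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

A trace is kept if any policy matches; add tail_sampling to the traces pipeline's processor list in the collector config.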

But always sample:

  • Errors (status=error)

  • Slow requests (duration > threshold)

  • High-value operations (payment processing)

4. Correlate Telemetry

The power of observability comes from correlation:

  • Link traces to logs (via trace_id) - see the sketch after this list

  • Link traces to metrics (via exemplars)

  • Link metrics to logs (via timestamps and labels)
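
To make the trace-to-log link concrete, here is a sketch of stamping the active trace ID onto every log line; pino is an assumption, and any structured logger works:

```typescript
import { trace } from '@opentelemetry/api';
import pino from 'pino';

const logger = pino({
  mixin() {
    // Attach the current trace/span IDs to every log line so Loki entries
    // can be joined with Tempo traces on trace_id.
    const spanContext = trace.getActiveSpan()?.spanContext();
    return spanContext
      ? { trace_id: spanContext.traceId, span_id: spanContext.spanId }
      : {};
  },
});

logger.info({ order_id: 'ord_123' }, 'Order created');
```

Note that the pino instrumentation bundled with auto-instrumentations-node can inject these fields for you automatically.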

5. Make Dashboards Actionable

Every dashboard should answer:

  • What is happening? (Current state)

  • Why is it happening? (Root cause)

  • What should I do? (Remediation)

Bad dashboard: shows CPU is at 80%.

Good dashboard: shows CPU is at 80% because the /api/orders endpoint has a 5x traffic spike, with a link to the runbook.

πŸ’­ Real-World Impact

Before Observability

  • MTTR (Mean Time To Resolution): 2-3 hours

  • Investigation Method: SSH + grep + guesswork

  • Visibility: Per-service logs only

  • Proactive Monitoring: None

  • Cost: High developer time on debugging

After Observability

  • MTTR: 5-10 minutes

  • Investigation Method: Grafana dashboard + trace search

  • Visibility: Full request lifecycle across all services

  • Proactive Monitoring: Alerts before users notice

  • Cost: Low developer time, more focus on features

Real Example

Issue: "Orders are taking too long to process"

Before (2 hours):

  1. Check all service logs

  2. Find order ID mentions

  3. Manually correlate timestamps

  4. Discover database query timeout

  5. Check database logs

  6. Find slow query on inventory check

  7. Fix with index

After (8 minutes):

  1. Search trace by order ID

  2. See complete request flow

  3. Identify slow span: inventory.check (4.2s)

  4. View span attributes: SELECT query without index

  5. Check Prometheus: db_query_duration increased

  6. Fix with index

πŸ”— Architecture Summary

Observability Stack:

  • OpenTelemetry SDK (Instrumentation)

  • OpenTelemetry Collector (Pipeline)

  • Tempo (Traces)

  • Prometheus (Metrics)

  • Loki (Logs)

  • Grafana (Visualization)

Services Monitored:

  • Auth Service (Port 4001)

  • POS Core (Port 4002)

  • Inventory (Port 4003)

  • Payment (Port 4004)

  • Restaurant (Port 4005)

  • Chatbot (Port 4006)

Key Features:

  • βœ… Distributed tracing with trace context propagation

  • βœ… Custom business metrics (orders, transactions, revenue)

  • βœ… System metrics (CPU, memory, connections)

  • βœ… Centralized log aggregation

  • βœ… Pre-built dashboards per service

  • βœ… Alert rules for proactive monitoring

  • βœ… Multi-tenant metric isolation

πŸ’¬ Final Thoughts

Implementing comprehensive observability can transform how teams operate distributed microservices. The key insights I've learned:

  1. Observability is not optional - You can't improve what you can't measure

  2. Start simple, iterate - Basic instrumentation first, advanced features later

  3. Auto-instrumentation is your friend - Leverage existing tools rather than building from scratch

  4. Correlation is key - Traces + Metrics + Logs together provide complete context

  5. Make it actionable - Dashboards should guide investigation and remediation

The investment in observability infrastructure provides significant value over time. When issues occur (and they will), having proper observability means knowing exactly what happened, where it happened, and how to fix it.

I hope this guide helps you implement observability in your own microservices architecture. Feel free to adapt the patterns and configurations to your specific needs!

Thanks for reading! If you found this helpful, consider sharing it with your team. Better observability leads to better software for everyone.

β€” Happy monitoring! πŸ”πŸ“Š



πŸ”– Tags

#Observability #OpenTelemetry #Grafana #Microservices #DistributedTracing #Prometheus #Tempo #Loki #Monitoring #DevOps #SRE #POS #TypeScript #NodeJS
