Part 4: Instrumenting Go Services — Metrics, Traces, and Logs

Part of the SRE Playbook series

What You'll Learn: This article covers how I instrument the GoReliable Go services for full observability. You'll see the actual Prometheus client_golang instrumentation for custom metrics, how OpenTelemetry distributed tracing connects spans across the API Gateway, Order Service, and Notification Worker, and how I use zerolog for structured logging with trace correlation. Then I show how I deploy the complete observability stack — Prometheus, Grafana, Loki, and Tempo — as ArgoCD applications in the GitOps workflow.

The Observability Deficit

After deploying the services to Kubernetes in Parts 2 and 3, I had a working pipeline but almost no visibility. I could see that pods were running and health checks were passing. But I had no answer to: "What is the p99 latency on the order creation endpoint?" or "Why did that order take 4 seconds?"

That gap between "it's running" and "I understand how it's running" is the observability deficit. This article closes it.

For Prometheus fundamentals, see the Prometheus 101 series. For OpenTelemetry basics, see the OpenTelemetry 101 guide. This article focuses on the Go-specific implementation.

Prometheus Instrumentation

The Metrics Middleware

I instrument the API Gateway with a middleware that records the four golden signals (latency, traffic, errors, saturation) for every HTTP request:

// internal/gateway/middleware/metrics.go
package middleware

import (
    "net/http"
    "strconv"
    "time"

    "github.com/go-chi/chi/v5"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Namespace: "goreliable",
            Subsystem: "api_gateway",
            Name:      "http_requests_total",
            Help:      "Total number of HTTP requests",
        },
        []string{"method", "route", "status_code"},
    )

    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Namespace: "goreliable",
            Subsystem: "api_gateway",
            Name:      "http_request_duration_seconds",
            Help:      "HTTP request duration in seconds",
            // Buckets chosen to cover the range from fast cached responses to slow DB queries
            Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5},
        },
        []string{"method", "route", "status_code"},
    )

    httpRequestsInFlight = promauto.NewGauge(
        prometheus.GaugeOpts{
            Namespace: "goreliable",
            Subsystem: "api_gateway",
            Name:      "http_requests_in_flight",
            Help:      "Number of HTTP requests currently being processed",
        },
    )
)

// responseWriter wraps http.ResponseWriter to capture the status code
type responseWriter struct {
    http.ResponseWriter
    status  int
    written bool
}

func (rw *responseWriter) WriteHeader(code int) {
    if !rw.written {
        rw.status = code
        rw.written = true
        rw.ResponseWriter.WriteHeader(code)
    }
}

func Metrics(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        httpRequestsInFlight.Inc()
        defer httpRequestsInFlight.Dec()

        rw := &responseWriter{ResponseWriter: w, status: http.StatusOK}
        next.ServeHTTP(rw, r)

        // Use chi's route pattern (e.g., /api/v1/orders/{id}) not the actual path
        // This prevents high cardinality from individual order IDs becoming metric labels
        route := chi.RouteContext(r.Context()).RoutePattern()
        if route == "" {
            route = "unknown"
        }

        labels := prometheus.Labels{
            "method":      r.Method,
            "route":       route,
            "status_code": strconv.Itoa(rw.status),
        }

        httpRequestsTotal.With(labels).Inc()
        httpRequestDuration.With(labels).Observe(time.Since(start).Seconds())
    })
}

The cardinality note on route is important. If I used r.URL.Path as the label, every unique order ID (/api/v1/orders/uuid-1, /api/v1/orders/uuid-2, ...) would create a new time series, and Prometheus memory usage would grow without bound. Using the chi route pattern collapses every order GET into a single route="/api/v1/orders/{id}" series.

Order Service Metrics

The Order Service has domain-specific metrics:

Notification Worker Metrics

OpenTelemetry Distributed Tracing

Setup

Tracing in the API Gateway

Propagating Traces to the Order Service

When the API Gateway calls the Order Service, it propagates the trace context in HTTP headers. The Order Service extracts it to continue the same trace.

Order Service Span

With these traces in place, Grafana Tempo shows me the complete path of a single request: gateway.CreateOrder → order.CreateOrder → db.exec. When an order is slow, I can pinpoint which layer is responsible.

Structured Logging with Trace Correlation

The last piece is correlating logs with traces. When a request causes both a trace span and a log event, I want to be able to click the trace ID in Tempo and jump to the corresponding log lines in Loki.

Log lines in Loki now contain trace_id and span_id fields. I configure a derived field on the Grafana Loki data source so clicking a trace ID in a log line opens the corresponding trace in Tempo. This single logs ↔ traces connection cuts my mean investigation time significantly.

Deploying the Observability Stack via ArgoCD

I deploy the full observability stack using the same GitOps pattern from Part 3. Each component is an ArgoCD Application sourcing from the official Helm charts.
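As an illustration of the pattern, an Application for the Prometheus stack might look like this (the chart version and values here are assumptions, not the exact manifests from the repo):

```yaml
# observability/prometheus-app.yaml (illustrative)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: "58.0.0"   # pin the chart version (illustrative)
    helm:
      values: |
        grafana:
          enabled: false       # Grafana is its own Application
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

Grafana, Loki, Tempo, and the OpenTelemetry Collector each get an analogous Application, so the whole stack rolls forward and back through Git like the services do.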

OpenTelemetry Collector Configuration

The collector receives traces from Go services via gRPC and exports to Tempo:
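A minimal collector config for that pipeline might look like this (endpoints are illustrative; Tempo is assumed to accept OTLP over gRPC on 4317):

```yaml
# otel-collector config (sketch)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # Go services export here

processors:
  batch: {}                      # batch spans to reduce export calls

exporters:
  otlp:
    endpoint: tempo:4317         # Tempo's OTLP gRPC ingest
    tls:
      insecure: true             # plaintext inside the cluster

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```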

Grafana Dashboard

I maintain the Go services dashboard as a JSON file in the GitOps repo, loaded via Grafana's ConfigMap-based dashboard provisioning. The dashboard has four sections:

  1. Overview — Request rate, error rate, p50/p99 latency for all services

  2. Order Service — Orders per minute, DB query latency, connection pool utilization

  3. Notification Worker — Messages processed/sec, pending queue depth, processing duration

  4. Infrastructure — Pod CPU/memory vs limits, restarts, node utilization
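As a concrete example of the queries behind the Overview row, the p99 latency panel is driven by a histogram_quantile over the gateway middleware's buckets (the panel JSON lives in the repo; this is just the query shape):

```promql
histogram_quantile(
  0.99,
  sum by (le, route) (
    rate(goreliable_api_gateway_http_request_duration_seconds_bucket[5m])
  )
)
```

Summing by le and route before taking the quantile keeps one p99 series per endpoint without exploding on method and status_code.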

What the Three Pillars Now Give Me

After completing this instrumentation:

Metrics (Prometheus + Grafana): I can answer "Is anything broken right now?" with a dashboard. I can set up alerts on error rate and latency before users notice.

Traces (OpenTelemetry + Tempo): I can answer "Why was this specific request slow?" by following the trace through all services.

Logs (zerolog + Loki): I can answer "What happened around a specific event?" by filtering logs by trace ID, request ID, or time range.

The three pillars work together because every log line contains the trace ID, and every trace links to the service that generated it. In Part 5, I use these metrics to define formal SLIs and SLOs, calculate error budgets, and set up multi-window burn-rate alerts.
