Part 4: Instrumenting Go Services — Metrics, Traces, and Logs

Part of the SRE Playbook series

What You'll Learn: This article covers how I instrument the GoReliable Go services for full observability. You'll see the actual Prometheus client_golang instrumentation for custom metrics, how OpenTelemetry distributed tracing connects spans across the API Gateway, Order Service, and Notification Worker, and how I use zerolog for structured logging with trace correlation. Then I show how I deploy the complete observability stack — Prometheus, Grafana, Loki, and Tempo — as ArgoCD applications in the GitOps workflow.

The Observability Deficit

After deploying the services to Kubernetes in Parts 2 and 3, I had a working pipeline but almost no visibility. I could see that pods were running and health checks were passing. But I had no answer to: "What is the p99 latency on the order creation endpoint?" or "Why did that order take 4 seconds?"

That gap between "it's running" and "I understand how it's running" is the observability deficit. This article closes it.

For Prometheus fundamentals, see the Prometheus 101 series. For OpenTelemetry basics, see the OpenTelemetry 101 guide. This article focuses on the Go-specific implementation.

Prometheus Instrumentation

The Metrics Middleware

I instrument the API Gateway with a middleware that records the four golden signals (latency, traffic, errors, saturation) for every HTTP request:

// internal/gateway/middleware/metrics.go
package middleware

import (
    "net/http"
    "strconv"
    "time"

    "github.com/go-chi/chi/v5"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Namespace: "goreliable",
            Subsystem: "api_gateway",
            Name:      "http_requests_total",
            Help:      "Total number of HTTP requests",
        },
        []string{"method", "route", "status_code"},
    )

    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Namespace: "goreliable",
            Subsystem: "api_gateway",
            Name:      "http_request_duration_seconds",
            Help:      "HTTP request duration in seconds",
            // Buckets chosen to cover the range from fast cached responses to slow DB queries
            Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5},
        },
        []string{"method", "route", "status_code"},
    )

    httpRequestsInFlight = promauto.NewGauge(
        prometheus.GaugeOpts{
            Namespace: "goreliable",
            Subsystem: "api_gateway",
            Name:      "http_requests_in_flight",
            Help:      "Number of HTTP requests currently being processed",
        },
    )
)

// responseWriter wraps http.ResponseWriter to capture the status code
type responseWriter struct {
    http.ResponseWriter
    status  int
    written bool
}

func (rw *responseWriter) WriteHeader(code int) {
    if !rw.written {
        rw.status = code
        rw.written = true
        rw.ResponseWriter.WriteHeader(code)
    }
}

func Metrics(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        httpRequestsInFlight.Inc()
        defer httpRequestsInFlight.Dec()

        rw := &responseWriter{ResponseWriter: w, status: http.StatusOK}
        next.ServeHTTP(rw, r)

        // Use chi's route pattern (e.g., /api/v1/orders/{id}) not the actual path
        // This prevents high cardinality from individual order IDs becoming metric labels
        route := chi.RouteContext(r.Context()).RoutePattern()
        if route == "" {
            route = "unknown"
        }

        labels := prometheus.Labels{
            "method":      r.Method,
            "route":       route,
            "status_code": strconv.Itoa(rw.status),
        }

        httpRequestsTotal.With(labels).Inc()
        httpRequestDuration.With(labels).Observe(time.Since(start).Seconds())
    })
}

The cardinality note on route is important. If I used r.URL.Path as the label, every unique order ID (/api/v1/orders/uuid-1, /api/v1/orders/uuid-2, ...) would create a new time series, and Prometheus memory usage would grow without bound. Using the chi route pattern collapses every order GET into a single route="/api/v1/orders/{id}" series.

Order Service Metrics

The Order Service has domain-specific metrics:

Notification Worker Metrics

OpenTelemetry Distributed Tracing

Setup

Tracing in the API Gateway

Propagating Traces to the Order Service

When the API Gateway calls the Order Service, it propagates the trace context in HTTP headers. The Order Service extracts it to continue the same trace.

Order Service Span

With these traces in place, Grafana Tempo shows me the complete path of a single request: gateway.CreateOrder → order.CreateOrder → db.exec. When an order is slow, I can pinpoint which layer is responsible.

Structured Logging with Trace Correlation

The last piece is correlating logs with traces. When a request causes both a trace span and a log event, I want to be able to click the trace ID in Tempo and jump to the corresponding log lines in Loki.

Log lines in Loki now contain trace_id and span_id fields. I configure a derived field on the Grafana Loki data source so clicking a trace ID in a log line opens the corresponding trace in Tempo. This single logs ↔ traces connection cuts my mean investigation time significantly.

Deploying the Observability Stack via ArgoCD

I deploy the full observability stack using the same GitOps pattern from Part 3. Each component is an ArgoCD Application sourcing from the official Helm charts.
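As an illustration of the pattern, an Application for the Prometheus stack might look like this (the chart version and values here are assumptions, not the exact manifests from the repo):

```yaml
# observability/prometheus-app.yaml (illustrative)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: "58.0.0"   # pin the chart version (illustrative)
    helm:
      values: |
        grafana:
          enabled: false       # Grafana is its own Application
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

Grafana, Loki, Tempo, and the OpenTelemetry Collector each get an analogous Application, so the whole stack rolls forward and back through Git like the services do.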

OpenTelemetry Collector Configuration

The collector receives traces from Go services via gRPC and exports to Tempo:
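A minimal collector config for that pipeline might look like this (endpoints are illustrative; Tempo is assumed to accept OTLP over gRPC on 4317):

```yaml
# otel-collector config (sketch)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # Go services export here

processors:
  batch: {}                      # batch spans to reduce export calls

exporters:
  otlp:
    endpoint: tempo:4317         # Tempo's OTLP gRPC ingest
    tls:
      insecure: true             # plaintext inside the cluster

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```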

Grafana Dashboard

I maintain the Go services dashboard as a JSON file in the GitOps repo, loaded via Grafana's ConfigMap-based dashboard provisioning. The dashboard has four sections:

  1. Overview — Request rate, error rate, p50/p99 latency for all services

  2. Order Service — Orders per minute, DB query latency, connection pool utilization

  3. Notification Worker — Messages processed/sec, pending queue depth, processing duration

  4. Infrastructure — Pod CPU/memory vs limits, restarts, node utilization
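As a concrete example of the queries behind the Overview row, the p99 latency panel is driven by a histogram_quantile over the gateway middleware's buckets (the panel JSON lives in the repo; this is just the query shape):

```promql
histogram_quantile(
  0.99,
  sum by (le, route) (
    rate(goreliable_api_gateway_http_request_duration_seconds_bucket[5m])
  )
)
```

Summing by le and route before taking the quantile keeps one p99 series per endpoint without exploding on method and status_code.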

What the Three Pillars Now Give Me

After completing this instrumentation:

Metrics (Prometheus + Grafana): I can answer "Is anything broken right now?" with a dashboard. I can set up alerts on error rate and latency before users notice.

Traces (OpenTelemetry + Tempo): I can answer "Why was this specific request slow?" by following the trace through all services.

Logs (zerolog + Loki): I can answer "What happened around a specific event?" by filtering logs by trace ID, request ID, or time range.

The three pillars work together because every log line contains the trace ID, and every trace links to the service that generated it. In Part 5, I use these metrics to define formal SLIs and SLOs, calculate error budgets, and set up multi-window burn-rate alerts.
