Part 3: Monitoring and Observability - Seeing What Your System Is Really Doing

What You'll Learn: This article shares my journey from basic logging to comprehensive observability in Go microservices. You'll learn the difference between monitoring (knowing when things break) and observability (understanding why), how to implement the four golden signals, instrument Go applications with Prometheus, set up structured logging with zerolog, implement distributed tracing with OpenTelemetry, and build dashboards that actually help during incidents.

The Incident I Couldn't Debug

It was a Friday afternoon when my personal finance tracking API started behaving strangely. Users reported that some transactions were saving correctly while others were silently failing. My monitoring showed:

βœ… Server health: OK
βœ… CPU usage: 35%
βœ… Memory usage: 60%
βœ… Database connections: Normal

Everything looked fine in my monitoring, but users were experiencing real problems. I spent three hours SSH-ing into servers, grepping logs, and still couldn't figure out what was wrong.

The root cause? A subtle race condition in my transaction processing code that only manifested under specific concurrent load patterns. My monitoring told me the system was "healthy," but I had zero visibility into what was actually happening inside my application.

That's when I learned the critical difference between monitoring and observability.

Monitoring vs. Observability

After that frustrating Friday, I completely rebuilt my approach to visibility. Here's what I learned:

Monitoring: Known Unknowns

Monitoring is asking questions you already know to ask:

  • Is my service up?

  • Is CPU usage high?

  • Is the database responding?

Monitoring tells you WHEN something is wrong.

You set up dashboards and alerts for the things you anticipate failing. It's inherently reactive - it only catches the problems you thought to predict ahead of time.

Observability: Unknown Unknowns

Observability is the ability to ask arbitrary questions about your system without having to predict them beforehand:

  • Why are some transactions failing while others succeed?

  • What's different about the slow requests vs. fast ones?

  • How does this error correlate with that database query?

Observability tells you WHY something is wrong.

You instrument your code to emit detailed signals, then explore that data during incidents. It's proactive - you can debug novel failures.

The Three Pillars of Observability

After rebuilding my systems, I now implement three complementary signals:

  1. Metrics - Numeric time-series data (requests/second, latency, error rate)

  2. Logs - Discrete events with context (request processed, error occurred)

  3. Traces - Request flows through distributed systems (how one request traverses multiple services)

Together, these give me complete visibility into my Go applications.

The Four Golden Signals

Google's SRE book taught me to focus on four critical signals that matter for any service:

1. Latency

How long does it take to service a request?

Why it matters: Slow is often worse than down. Users will tolerate occasional errors, but consistent slowness drives them away.

What I track:

  • P50 (median) latency

  • P95 latency (95% of requests are faster than this)

  • P99 latency (exposes the long tail and outliers)

2. Traffic

How much demand is being placed on your system?

Why it matters: Helps identify spikes, understand usage patterns, and plan capacity.

What I track:

  • Requests per second

  • Concurrent connections

  • Data throughput (bytes in/out)

3. Errors

What's the rate of failed requests?

Why it matters: Directly impacts user experience and SLOs.

What I track:

  • Error rate by status code (4xx vs 5xx)

  • Error rate by endpoint

  • Error types (timeout, validation, database, etc.)

4. Saturation

How "full" is your service?

Why it matters: High saturation predicts future failures. Catch problems before they impact users.

What I track:

  • CPU utilization

  • Memory usage

  • Database connection pool utilization

  • Disk I/O and space

  • Goroutine count (Go-specific)

Implementing Metrics with Prometheus

Let me show you how I instrument my Go services with Prometheus to capture the four golden signals.

Setting Up Prometheus Client

First, I create a metrics package that all my services use:
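
What follows is a sketch rather than my exact code, using the official github.com/prometheus/client_golang library. The metric names and labels are my own conventions, chosen to map directly onto the four golden signals.

```go
// Package metrics holds the Prometheus collectors shared across my services.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Traffic: total requests, labeled so dashboards can slice by endpoint and outcome.
	RequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests.",
		},
		[]string{"method", "path", "status"},
	)

	// Latency: a histogram lets Prometheus compute P50/P95/P99 on the server side.
	RequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency in seconds.",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "path"},
	)

	// Saturation: how many requests are being handled right now.
	RequestsInFlight = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "http_requests_in_flight",
			Help: "Number of HTTP requests currently being served.",
		},
	)
)
```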

HTTP Middleware for Automatic Instrumentation

I wrap all HTTP handlers with middleware that automatically records metrics:
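
The sketch below shows the idea; statusRecorder is a small helper of my own, and in a real router you would label by the route pattern rather than the raw URL path to keep label cardinality bounded.

```go
package middleware

import (
	"net/http"
	"strconv"
	"time"

	// Hypothetical import path for the metrics package sketched above.
	"example.com/finance-api/internal/metrics"
)

// statusRecorder captures the status code the wrapped handler writes.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// Metrics records traffic, latency, errors, and in-flight requests for every handler it wraps.
func Metrics(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		metrics.RequestsInFlight.Inc()
		defer metrics.RequestsInFlight.Dec()

		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

		next.ServeHTTP(rec, r)

		metrics.RequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(time.Since(start).Seconds())
		metrics.RequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rec.status)).Inc()
	})
}
```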

Database Instrumentation

I also instrument my database layer to track connection pool saturation:
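
Roughly like this: gauge functions are evaluated on every scrape, so there is no background ticker to manage. The metric names are my own conventions.

```go
package dbmetrics

import (
	"database/sql"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// RegisterPoolMetrics exposes connection pool saturation from db.Stats().
// Call it once after opening the database.
func RegisterPoolMetrics(db *sql.DB) {
	promauto.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "db_open_connections",
		Help: "Open connections in the pool (in use + idle).",
	}, func() float64 { return float64(db.Stats().OpenConnections) })

	promauto.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "db_in_use_connections",
		Help: "Connections currently executing queries.",
	}, func() float64 { return float64(db.Stats().InUse) })

	promauto.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "db_connection_wait_count",
		Help: "Cumulative number of times a query had to wait for a free connection.",
	}, func() float64 { return float64(db.Stats().WaitCount) })
}
```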

Structured Logging with zerolog

After the race condition incident, I switched from basic log.Printf to structured logging with zerolog.

Why Structured Logging?

Before (plain logs):
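
Something along these lines (an illustrative line, not a real excerpt):

```
2024/06/07 15:04:05 ERROR failed to save transaction for user 42: context deadline exceeded
```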

Hard to parse, hard to query, hard to correlate.

After (structured logs):
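
Roughly what zerolog emits for the same event (field values illustrative):

```json
{"level":"error","service":"finance-api","request_id":"req_abc123","user_id":42,"error":"context deadline exceeded","time":"2024-06-07T15:04:05Z","message":"failed to save transaction"}
```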

Easy to parse, easy to query, maintains context across related logs.

Setting Up zerolog
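
A minimal version of my setup; the service field and the LOG_PRETTY toggle are my own conventions.

```go
package logger

import (
	"os"
	"time"

	"github.com/rs/zerolog"
	"github.com/rs/zerolog/log"
)

// Init configures the global zerolog logger: JSON to stdout for production,
// human-friendly console output when LOG_PRETTY is set for local development.
func Init(service string, level zerolog.Level) {
	zerolog.TimeFieldFormat = time.RFC3339
	zerolog.SetGlobalLevel(level)

	log.Logger = zerolog.New(os.Stdout).
		With().
		Timestamp().
		Str("service", service).
		Logger()

	if os.Getenv("LOG_PRETTY") != "" {
		log.Logger = log.Output(zerolog.ConsoleWriter{Out: os.Stderr})
	}
}
```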

Logging Middleware
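
A sketch of the middleware I use: it generates a request_id (here with github.com/google/uuid, my choice) and hangs a request-scoped logger on the context so every line for that request can be correlated.

```go
package middleware

import (
	"net/http"
	"time"

	"github.com/google/uuid"
	"github.com/rs/zerolog/log"
)

// RequestLogger attaches a request-scoped logger (with a generated request_id)
// to the context and emits one structured line when the request completes.
func RequestLogger(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()

		logger := log.With().
			Str("request_id", uuid.NewString()).
			Str("method", r.Method).
			Str("path", r.URL.Path).
			Logger()

		// Handlers retrieve this logger via zerolog.Ctx(r.Context()), so every
		// log line for this request carries the same request_id.
		next.ServeHTTP(w, r.WithContext(logger.WithContext(r.Context())))

		logger.Info().
			Dur("duration", time.Since(start)).
			Msg("request completed")
	})
}
```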

Application-Level Logging

In my business logic, I use structured logging extensively:
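
For example, in the transaction service. The Service, TransactionStore, and Transaction types below are simplified stand-ins, not my real domain model.

```go
package finance

import (
	"context"
	"fmt"

	"github.com/rs/zerolog"
)

// Hypothetical domain types, just enough to make the example compile.
type Transaction struct {
	ID     string
	UserID int64
	Amount float64
}

type TransactionStore interface {
	Insert(ctx context.Context, tx Transaction) error
}

type Service struct {
	store TransactionStore
}

// SaveTransaction pulls the request-scoped logger from the context and
// attaches domain fields, so every line carries user and transaction context.
func (s *Service) SaveTransaction(ctx context.Context, tx Transaction) error {
	logger := zerolog.Ctx(ctx).With().
		Int64("user_id", tx.UserID).
		Str("transaction_id", tx.ID).
		Float64("amount", tx.Amount).
		Logger()

	logger.Debug().Msg("saving transaction")

	if err := s.store.Insert(ctx, tx); err != nil {
		logger.Error().Err(err).Msg("failed to save transaction")
		return fmt.Errorf("save transaction: %w", err)
	}

	logger.Info().Msg("transaction saved")
	return nil
}
```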

Distributed Tracing with OpenTelemetry

When I started building microservices, logs and metrics weren't enough. I needed to see how a single user request flowed through multiple services. That's where distributed tracing saved me.

Why Distributed Tracing?

Imagine a user request that:

  1. Hits the API gateway

  2. Calls the auth service

  3. Calls the transaction service

  4. Calls the notification service

If it's slow, where's the bottleneck? Traces show you the complete journey.

Setting Up OpenTelemetry
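
A sketch of my bootstrap code, exporting spans over OTLP/gRPC to a collector. The sampling ratio, endpoint, and insecure connection are illustrative choices you would tune for your environment.

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// Init wires up a global tracer provider. The returned function flushes and
// shuts down the exporter; call it on exit.
func Init(ctx context.Context, serviceName, endpoint string) (func(context.Context) error, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint),
		otlptracegrpc.WithInsecure(), // assumes a collector reachable inside the cluster
	)
	if err != nil {
		return nil, err
	}

	res, err := resource.New(ctx,
		resource.WithAttributes(semconv.ServiceNameKey.String(serviceName)),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
		// Sample 10% of new traces, but always follow an upstream decision.
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
	)
	otel.SetTracerProvider(tp)

	// Propagate W3C trace context and baggage across service boundaries.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, propagation.Baggage{},
	))

	return tp.Shutdown, nil
}
```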

Tracing HTTP Requests
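
The otelhttp contrib package covers both directions for me: server spans for incoming requests and context propagation on outgoing calls. The function and route names below are illustrative.

```go
package httpserver

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// newTracedMux starts a server span per request and picks up any trace
// context propagated by an upstream service.
func newTracedMux() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/transactions", func(w http.ResponseWriter, r *http.Request) {
		// r.Context() carries the active span; pass it into business logic so
		// child spans and database spans join the same trace.
		w.WriteHeader(http.StatusOK)
	})
	return otelhttp.NewHandler(mux, "http.server")
}

// newTracedClient injects the current trace context into outgoing requests,
// so downstream services continue the same trace.
func newTracedClient() *http.Client {
	return &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
}
```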

Tracing Database Queries
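
I wrap each query in a child span by hand. There are libraries that instrument database/sql automatically, but the manual version makes it obvious what ends up in the trace. The types and SQL here are illustrative.

```go
package store

import (
	"context"
	"database/sql"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

var tracer = otel.Tracer("finance-api/store")

// Transaction is a hypothetical domain type for the example.
type Transaction struct {
	ID     string
	UserID int64
	Amount float64
}

type Store struct {
	db *sql.DB
}

// Insert wraps the query in a child span so the trace shows exactly how long
// the database call took and whether it failed.
func (s *Store) Insert(ctx context.Context, tx Transaction) error {
	ctx, span := tracer.Start(ctx, "db.insert_transaction")
	defer span.End()

	span.SetAttributes(
		attribute.String("db.operation", "INSERT"),
		attribute.String("db.table", "transactions"),
	)

	_, err := s.db.ExecContext(ctx,
		`INSERT INTO transactions (id, user_id, amount) VALUES ($1, $2, $3)`,
		tx.ID, tx.UserID, tx.Amount,
	)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "insert failed")
	}
	return err
}
```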

Building Useful Dashboards

After collecting metrics, logs, and traces, I needed dashboards that actually helped during incidents. Here's what I learned works:

Dashboard 1: The Four Golden Signals

This is my default dashboard - one screen that shows service health:
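
In Grafana this is four panels, one per signal. The queries below are a sketch and assume the metric names from the instrumentation examples earlier in this article (http_requests_total, http_request_duration_seconds, http_requests_in_flight).

```promql
# Latency: P95 per endpoint over the last 5 minutes
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: share of requests that returned a 5xx
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Saturation: in-flight requests plus the Go runtime's goroutine count
http_requests_in_flight
go_goroutines
```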

Dashboard 2: Service Deep Dive

When I need to dig deeper:
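
A sketch of the kinds of panels I keep here, again assuming the metric names used above.

```promql
# P99 latency broken down by endpoint, to spot which route regressed
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))

# Error rate per endpoint and status code
sum(rate(http_requests_total{status=~"4..|5.."}[5m])) by (path, status)

# Database pool saturation (from the pool metrics sketched earlier)
db_in_use_connections / db_open_connections

# Go runtime pressure: goroutines and heap in use
go_goroutines
go_memstats_heap_inuse_bytes
```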

Dashboard 3: SLO Tracking

Dedicated dashboard for tracking SLOs:
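
For example, with a hypothetical 99.9% availability SLO measured over 30 days:

```promql
# Measured availability over the SLO window
1 - (
  sum(rate(http_requests_total{status=~"5.."}[30d]))
  / sum(rate(http_requests_total[30d]))
)

# Fraction of the error budget still remaining (0.1% of requests may fail)
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  / (sum(increase(http_requests_total[30d])) * 0.001)
)
```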

Putting It All Together: Main Application

Here's how I wire up metrics, logging, and tracing in my main application:
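
Roughly like this; the internal import paths are hypothetical stand-ins for the metrics, middleware, logger, and tracing packages sketched above, and the collector endpoint is illustrative.

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"github.com/rs/zerolog"
	"github.com/rs/zerolog/log"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"

	// Hypothetical module paths for the packages sketched earlier.
	"example.com/finance-api/internal/logger"
	"example.com/finance-api/internal/middleware"
	"example.com/finance-api/internal/tracing"
)

func main() {
	// 1. Structured logging first, so everything below can log properly.
	logger.Init("finance-api", zerolog.InfoLevel)

	// 2. Tracing: exports spans to a collector.
	ctx := context.Background()
	shutdownTracing, err := tracing.Init(ctx, "finance-api", "otel-collector:4317")
	if err != nil {
		log.Fatal().Err(err).Msg("failed to initialise tracing")
	}
	defer shutdownTracing(ctx)

	// 3. Routes, plus the Prometheus scrape endpoint.
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.Handle("/metrics", promhttp.Handler())

	// 4. Middleware order: tracing outermost, then metrics, then logging,
	// so metrics and logs are recorded inside the server span.
	handler := otelhttp.NewHandler(
		middleware.Metrics(middleware.RequestLogger(mux)),
		"http.server",
	)

	srv := &http.Server{
		Addr:         ":8080",
		Handler:      handler,
		ReadTimeout:  10 * time.Second,
		WriteTimeout: 30 * time.Second,
	}

	// 5. Graceful shutdown so in-flight requests finish and span batches flush.
	go func() {
		stop := make(chan os.Signal, 1)
		signal.Notify(stop, os.Interrupt)
		<-stop
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()
		_ = srv.Shutdown(shutdownCtx)
	}()

	log.Info().Str("addr", srv.Addr).Msg("server starting")
	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		log.Fatal().Err(err).Msg("server exited with error")
	}
}
```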

Real Debugging Story: How Observability Saved Me

Remember that race condition I mentioned at the start? Here's how observability helped me finally debug it:

Step 1: Metrics showed the problem

Step 2: Logs showed which transactions

Step 3: Traces showed the timing. The trace revealed two concurrent requests for the same user, creating a database deadlock.

Step 4: The fix. I added optimistic locking to my transaction code:
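
The shape of the fix, sketched against a hypothetical accounts table with a version column; the schema and retry count are illustrative, not the exact production code.

```go
package store

import (
	"context"
	"database/sql"
	"errors"
)

type Store struct {
	db *sql.DB
}

// ApplyTransaction adjusts a user's balance with optimistic locking: the UPDATE
// only succeeds if the row still carries the version we read, so two concurrent
// requests for the same user can no longer interleave and corrupt or block each
// other. If the version changed underneath us, we reread and retry.
func (s *Store) ApplyTransaction(ctx context.Context, userID int64, delta float64) error {
	for attempt := 0; attempt < 3; attempt++ {
		var balance float64
		var version int64
		err := s.db.QueryRowContext(ctx,
			`SELECT balance, version FROM accounts WHERE user_id = $1`,
			userID,
		).Scan(&balance, &version)
		if err != nil {
			return err
		}

		res, err := s.db.ExecContext(ctx,
			`UPDATE accounts
			    SET balance = $1, version = version + 1
			  WHERE user_id = $2 AND version = $3`,
			balance+delta, userID, version,
		)
		if err != nil {
			return err
		}
		if n, err := res.RowsAffected(); err == nil && n == 1 {
			return nil // our write won: nobody modified the row in between
		}
		// Another request updated the row first; loop to reread and retry.
	}
	return errors.New("apply transaction: too many concurrent modifications")
}
```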

Without metrics, logs, and traces, I'd still be guessing.

Key Lessons

  1. Monitoring tells you WHEN, observability tells you WHY. You need both.

  2. Instrument from day one. Adding observability after you have a problem is too late.

  3. Focus on the four golden signals: latency, traffic, errors, saturation. They cover 90% of issues.

  4. Structured logging is non-negotiable for any production service. JSON logs are searchable and parseable.

  5. Distributed tracing becomes essential the moment you have more than one service.

  6. Dashboards should answer questions, not just display data. Build dashboards for specific debugging scenarios.

What's Next

With comprehensive observability in place, you can finally see what's happening in your systems. In Part 4, we'll cover:

  • Incident management and response

  • On-call best practices

  • Writing effective post-mortems

  • Building runbooks that actually help

Conclusion

Observability transformed how I debug and understand my systems. Before implementing these practices, I was flying blind - guessing at problems and hoping for the best. Now I have data to guide every decision.

Start small:

  1. Add Prometheus metrics to one service

  2. Switch to structured logging

  3. Build a simple dashboard

  4. Add tracing when you have multiple services

Each step makes your systems more understandable and your life easier. You'll thank yourself the next time something goes wrong at 2 AM.
