Part 3: Monitoring and Observability - Seeing What Your System Is Really Doing

What You'll Learn: This article shares my journey from basic logging to comprehensive observability in Go microservices. You'll learn the difference between monitoring (knowing when things break) and observability (understanding why), how to implement the four golden signals, instrument Go applications with Prometheus, set up structured logging with zerolog, implement distributed tracing with OpenTelemetry, and build dashboards that actually help during incidents.

The Incident I Couldn't Debug

It was a Friday afternoon when my personal finance tracking API started behaving strangely. Users reported that some transactions were saving correctly while others were silently failing. My monitoring showed:

βœ… Server health: OK
βœ… CPU usage: 35%
βœ… Memory usage: 60%
βœ… Database connections: Normal

Everything looked fine in my monitoring, but users were experiencing real problems. I spent three hours SSH-ing into servers, grepping logs, and still couldn't figure out what was wrong.

The root cause? A subtle race condition in my transaction processing code that only manifested under specific concurrent load patterns. My monitoring told me the system was "healthy," but I had zero visibility into what was actually happening inside my application.

That's when I learned the critical difference between monitoring and observability.

Monitoring vs. Observability

After that frustrating Friday, I completely rebuilt my approach to visibility. Here's what I learned:

Monitoring: Known Unknowns

Monitoring is asking questions you already know to ask:

  • Is my service up?

  • Is CPU usage high?

  • Is the database responding?

Monitoring tells you WHEN something is wrong.

You set up dashboards and alerts for the things you anticipate failing. It's inherently reactive - it only catches the problems you thought to predict ahead of time.

Observability: Unknown Unknowns

Observability is the ability to ask arbitrary questions about your system without having to predict them beforehand:

  • Why are some transactions failing while others succeed?

  • What's different about the slow requests vs. fast ones?

  • How does this error correlate with that database query?

Observability tells you WHY something is wrong.

You instrument your code to emit detailed signals, then explore that data during incidents. It's proactive - you can debug novel failures.

The Three Pillars of Observability

After rebuilding my systems, I now implement three complementary signals:

  1. Metrics - Numeric time-series data (requests/second, latency, error rate)

  2. Logs - Discrete events with context (request processed, error occurred)

  3. Traces - Request flows through distributed systems (how one request traverses multiple services)

Together, these give me complete visibility into my Go applications.

The Four Golden Signals

Google's SRE book taught me to focus on four critical signals that matter for any service:

1. Latency

How long does it take to service a request?

Why it matters: Slow is often worse than down. Users will tolerate occasional errors, but consistent slowness drives them away.

What I track:

  • P50 (median) latency

  • P95 latency (95% of requests are faster than this)

  • P99 latency (exposes the long tail and outliers)

2. Traffic

How much demand is being placed on your system?

Why it matters: Helps identify spikes, understand usage patterns, and plan capacity.

What I track:

  • Requests per second

  • Concurrent connections

  • Data throughput (bytes in/out)

3. Errors

What's the rate of failed requests?

Why it matters: Directly impacts user experience and SLOs.

What I track:

  • Error rate by status code (4xx vs 5xx)

  • Error rate by endpoint

  • Error types (timeout, validation, database, etc.)

4. Saturation

How "full" is your service?

Why it matters: High saturation predicts future failures. Catch problems before they impact users.

What I track:

  • CPU utilization

  • Memory usage

  • Database connection pool utilization

  • Disk I/O and space

  • Goroutine count (Go-specific)

Implementing Metrics with Prometheus

Let me show you how I instrument my Go services with Prometheus to capture the four golden signals.

Setting Up Prometheus Client

First, I create a metrics package that all my services use:
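
What follows is a sketch rather than my exact code, using the official github.com/prometheus/client_golang library. The metric names and labels are my own conventions, chosen to map directly onto the four golden signals.

```go
// Package metrics holds the Prometheus collectors shared across my services.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Traffic: total requests, labeled so dashboards can slice by endpoint and outcome.
	RequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests.",
		},
		[]string{"method", "path", "status"},
	)

	// Latency: a histogram lets Prometheus compute P50/P95/P99 on the server side.
	RequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency in seconds.",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "path"},
	)

	// Saturation: how many requests are being handled right now.
	RequestsInFlight = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "http_requests_in_flight",
			Help: "Number of HTTP requests currently being served.",
		},
	)
)
```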

HTTP Middleware for Automatic Instrumentation

I wrap all HTTP handlers with middleware that automatically records metrics:
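
The sketch below shows the idea; statusRecorder is a small helper of my own, and in a real router you would label by the route pattern rather than the raw URL path to keep label cardinality bounded.

```go
package middleware

import (
	"net/http"
	"strconv"
	"time"

	// Hypothetical import path for the metrics package sketched above.
	"example.com/finance-api/internal/metrics"
)

// statusRecorder captures the status code the wrapped handler writes.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// Metrics records traffic, latency, errors, and in-flight requests for every handler it wraps.
func Metrics(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		metrics.RequestsInFlight.Inc()
		defer metrics.RequestsInFlight.Dec()

		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

		next.ServeHTTP(rec, r)

		metrics.RequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(time.Since(start).Seconds())
		metrics.RequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rec.status)).Inc()
	})
}
```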

Database Instrumentation

I also instrument my database layer to track connection pool saturation:
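
Roughly like this: gauge functions are evaluated on every scrape, so there is no background ticker to manage. The metric names are my own conventions.

```go
package dbmetrics

import (
	"database/sql"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// RegisterPoolMetrics exposes connection pool saturation from db.Stats().
// Call it once after opening the database.
func RegisterPoolMetrics(db *sql.DB) {
	promauto.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "db_open_connections",
		Help: "Open connections in the pool (in use + idle).",
	}, func() float64 { return float64(db.Stats().OpenConnections) })

	promauto.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "db_in_use_connections",
		Help: "Connections currently executing queries.",
	}, func() float64 { return float64(db.Stats().InUse) })

	promauto.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "db_connection_wait_count",
		Help: "Cumulative number of times a query had to wait for a free connection.",
	}, func() float64 { return float64(db.Stats().WaitCount) })
}
```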

Structured Logging with zerolog

After the race condition incident, I switched from basic log.Printf to structured logging with zerolog.

Why Structured Logging?

Before (plain logs):
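
Something along these lines (an illustrative line, not a real excerpt):

```
2024/06/07 15:04:05 ERROR failed to save transaction for user 42: context deadline exceeded
```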

Hard to parse, hard to query, hard to correlate.

After (structured logs):
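
Roughly what zerolog emits for the same event (field values illustrative):

```json
{"level":"error","service":"finance-api","request_id":"req_abc123","user_id":42,"error":"context deadline exceeded","time":"2024-06-07T15:04:05Z","message":"failed to save transaction"}
```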

Easy to parse, easy to query, maintains context across related logs.

Setting Up zerolog
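
A minimal version of my setup; the service field and the LOG_PRETTY toggle are my own conventions.

```go
package logger

import (
	"os"
	"time"

	"github.com/rs/zerolog"
	"github.com/rs/zerolog/log"
)

// Init configures the global zerolog logger: JSON to stdout for production,
// human-friendly console output when LOG_PRETTY is set for local development.
func Init(service string, level zerolog.Level) {
	zerolog.TimeFieldFormat = time.RFC3339
	zerolog.SetGlobalLevel(level)

	log.Logger = zerolog.New(os.Stdout).
		With().
		Timestamp().
		Str("service", service).
		Logger()

	if os.Getenv("LOG_PRETTY") != "" {
		log.Logger = log.Output(zerolog.ConsoleWriter{Out: os.Stderr})
	}
}
```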

Logging Middleware
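
A sketch of the middleware I use: it generates a request_id (here with github.com/google/uuid, my choice) and hangs a request-scoped logger on the context so every line for that request can be correlated.

```go
package middleware

import (
	"net/http"
	"time"

	"github.com/google/uuid"
	"github.com/rs/zerolog/log"
)

// RequestLogger attaches a request-scoped logger (with a generated request_id)
// to the context and emits one structured line when the request completes.
func RequestLogger(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()

		logger := log.With().
			Str("request_id", uuid.NewString()).
			Str("method", r.Method).
			Str("path", r.URL.Path).
			Logger()

		// Handlers retrieve this logger via zerolog.Ctx(r.Context()), so every
		// log line for this request carries the same request_id.
		next.ServeHTTP(w, r.WithContext(logger.WithContext(r.Context())))

		logger.Info().
			Dur("duration", time.Since(start)).
			Msg("request completed")
	})
}
```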

Application-Level Logging

In my business logic, I use structured logging extensively:
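
For example, in the transaction service. The Service, TransactionStore, and Transaction types below are simplified stand-ins, not my real domain model.

```go
package finance

import (
	"context"
	"fmt"

	"github.com/rs/zerolog"
)

// Hypothetical domain types, just enough to make the example compile.
type Transaction struct {
	ID     string
	UserID int64
	Amount float64
}

type TransactionStore interface {
	Insert(ctx context.Context, tx Transaction) error
}

type Service struct {
	store TransactionStore
}

// SaveTransaction pulls the request-scoped logger from the context and
// attaches domain fields, so every line carries user and transaction context.
func (s *Service) SaveTransaction(ctx context.Context, tx Transaction) error {
	logger := zerolog.Ctx(ctx).With().
		Int64("user_id", tx.UserID).
		Str("transaction_id", tx.ID).
		Float64("amount", tx.Amount).
		Logger()

	logger.Debug().Msg("saving transaction")

	if err := s.store.Insert(ctx, tx); err != nil {
		logger.Error().Err(err).Msg("failed to save transaction")
		return fmt.Errorf("save transaction: %w", err)
	}

	logger.Info().Msg("transaction saved")
	return nil
}
```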

Distributed Tracing with OpenTelemetry

When I started building microservices, logs and metrics weren't enough. I needed to see how a single user request flowed through multiple services. That's where distributed tracing saved me.

Why Distributed Tracing?

Imagine a user request that:

  1. Hits the API gateway

  2. Calls the auth service

  3. Calls the transaction service

  4. Calls the notification service

If it's slow, where's the bottleneck? Traces show you the complete journey.

Setting Up OpenTelemetry
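
A sketch of my bootstrap code, exporting spans over OTLP/gRPC to a collector. The sampling ratio, endpoint, and insecure connection are illustrative choices you would tune for your environment.

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// Init wires up a global tracer provider. The returned function flushes and
// shuts down the exporter; call it on exit.
func Init(ctx context.Context, serviceName, endpoint string) (func(context.Context) error, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint),
		otlptracegrpc.WithInsecure(), // assumes a collector reachable inside the cluster
	)
	if err != nil {
		return nil, err
	}

	res, err := resource.New(ctx,
		resource.WithAttributes(semconv.ServiceNameKey.String(serviceName)),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
		// Sample 10% of new traces, but always follow an upstream decision.
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
	)
	otel.SetTracerProvider(tp)

	// Propagate W3C trace context and baggage across service boundaries.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, propagation.Baggage{},
	))

	return tp.Shutdown, nil
}
```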

Tracing HTTP Requests
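
The otelhttp contrib package covers both directions for me: server spans for incoming requests and context propagation on outgoing calls. The function and route names below are illustrative.

```go
package httpserver

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// newTracedMux starts a server span per request and picks up any trace
// context propagated by an upstream service.
func newTracedMux() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/transactions", func(w http.ResponseWriter, r *http.Request) {
		// r.Context() carries the active span; pass it into business logic so
		// child spans and database spans join the same trace.
		w.WriteHeader(http.StatusOK)
	})
	return otelhttp.NewHandler(mux, "http.server")
}

// newTracedClient injects the current trace context into outgoing requests,
// so downstream services continue the same trace.
func newTracedClient() *http.Client {
	return &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
}
```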

Tracing Database Queries
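
I wrap each query in a child span by hand. There are libraries that instrument database/sql automatically, but the manual version makes it obvious what ends up in the trace. The types and SQL here are illustrative.

```go
package store

import (
	"context"
	"database/sql"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

var tracer = otel.Tracer("finance-api/store")

// Transaction is a hypothetical domain type for the example.
type Transaction struct {
	ID     string
	UserID int64
	Amount float64
}

type Store struct {
	db *sql.DB
}

// Insert wraps the query in a child span so the trace shows exactly how long
// the database call took and whether it failed.
func (s *Store) Insert(ctx context.Context, tx Transaction) error {
	ctx, span := tracer.Start(ctx, "db.insert_transaction")
	defer span.End()

	span.SetAttributes(
		attribute.String("db.operation", "INSERT"),
		attribute.String("db.table", "transactions"),
	)

	_, err := s.db.ExecContext(ctx,
		`INSERT INTO transactions (id, user_id, amount) VALUES ($1, $2, $3)`,
		tx.ID, tx.UserID, tx.Amount,
	)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "insert failed")
	}
	return err
}
```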

Building Useful Dashboards

After collecting metrics, logs, and traces, I needed dashboards that actually helped during incidents. Here's what I learned works:

Dashboard 1: The Four Golden Signals

This is my default dashboard - one screen that shows service health:
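
In Grafana this is four panels, one per signal. The queries below are a sketch and assume the metric names from the instrumentation examples earlier in this article (http_requests_total, http_request_duration_seconds, http_requests_in_flight).

```promql
# Latency: P95 per endpoint over the last 5 minutes
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: share of requests that returned a 5xx
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Saturation: in-flight requests plus the Go runtime's goroutine count
http_requests_in_flight
go_goroutines
```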

Dashboard 2: Service Deep Dive

When I need to dig deeper:
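
A sketch of the kinds of panels I keep here, again assuming the metric names used above.

```promql
# P99 latency broken down by endpoint, to spot which route regressed
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))

# Error rate per endpoint and status code
sum(rate(http_requests_total{status=~"4..|5.."}[5m])) by (path, status)

# Database pool saturation (from the pool metrics sketched earlier)
db_in_use_connections / db_open_connections

# Go runtime pressure: goroutines and heap in use
go_goroutines
go_memstats_heap_inuse_bytes
```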

Dashboard 3: SLO Tracking

Dedicated dashboard for tracking SLOs:
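
For example, with a hypothetical 99.9% availability SLO measured over 30 days:

```promql
# Measured availability over the SLO window
1 - (
  sum(rate(http_requests_total{status=~"5.."}[30d]))
  / sum(rate(http_requests_total[30d]))
)

# Fraction of the error budget still remaining (0.1% of requests may fail)
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  / (sum(increase(http_requests_total[30d])) * 0.001)
)
```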

Putting It All Together: Main Application

Here's how I wire up metrics, logging, and tracing in my main application:
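
Roughly like this; the internal import paths are hypothetical stand-ins for the metrics, middleware, logger, and tracing packages sketched above, and the collector endpoint is illustrative.

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"github.com/rs/zerolog"
	"github.com/rs/zerolog/log"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"

	// Hypothetical module paths for the packages sketched earlier.
	"example.com/finance-api/internal/logger"
	"example.com/finance-api/internal/middleware"
	"example.com/finance-api/internal/tracing"
)

func main() {
	// 1. Structured logging first, so everything below can log properly.
	logger.Init("finance-api", zerolog.InfoLevel)

	// 2. Tracing: exports spans to a collector.
	ctx := context.Background()
	shutdownTracing, err := tracing.Init(ctx, "finance-api", "otel-collector:4317")
	if err != nil {
		log.Fatal().Err(err).Msg("failed to initialise tracing")
	}
	defer shutdownTracing(ctx)

	// 3. Routes, plus the Prometheus scrape endpoint.
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.Handle("/metrics", promhttp.Handler())

	// 4. Middleware order: tracing outermost, then metrics, then logging,
	// so metrics and logs are recorded inside the server span.
	handler := otelhttp.NewHandler(
		middleware.Metrics(middleware.RequestLogger(mux)),
		"http.server",
	)

	srv := &http.Server{
		Addr:         ":8080",
		Handler:      handler,
		ReadTimeout:  10 * time.Second,
		WriteTimeout: 30 * time.Second,
	}

	// 5. Graceful shutdown so in-flight requests finish and span batches flush.
	go func() {
		stop := make(chan os.Signal, 1)
		signal.Notify(stop, os.Interrupt)
		<-stop
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()
		_ = srv.Shutdown(shutdownCtx)
	}()

	log.Info().Str("addr", srv.Addr).Msg("server starting")
	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		log.Fatal().Err(err).Msg("server exited with error")
	}
}
```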

Real Debugging Story: How Observability Saved Me

Remember that race condition I mentioned at the start? Here's how observability helped me finally debug it:

Step 1: Metrics showed the problem

Step 2: Logs showed which transactions

Step 3: Traces showed the timing. The trace revealed two concurrent requests for the same user, creating a database deadlock.

Step 4: The fix. I added optimistic locking to my transaction code:
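
The shape of the fix, sketched against a hypothetical accounts table with a version column; the schema and retry count are illustrative, not the exact production code.

```go
package store

import (
	"context"
	"database/sql"
	"errors"
)

type Store struct {
	db *sql.DB
}

// ApplyTransaction adjusts a user's balance with optimistic locking: the UPDATE
// only succeeds if the row still carries the version we read, so two concurrent
// requests for the same user can no longer interleave and corrupt or block each
// other. If the version changed underneath us, we reread and retry.
func (s *Store) ApplyTransaction(ctx context.Context, userID int64, delta float64) error {
	for attempt := 0; attempt < 3; attempt++ {
		var balance float64
		var version int64
		err := s.db.QueryRowContext(ctx,
			`SELECT balance, version FROM accounts WHERE user_id = $1`,
			userID,
		).Scan(&balance, &version)
		if err != nil {
			return err
		}

		res, err := s.db.ExecContext(ctx,
			`UPDATE accounts
			    SET balance = $1, version = version + 1
			  WHERE user_id = $2 AND version = $3`,
			balance+delta, userID, version,
		)
		if err != nil {
			return err
		}
		if n, err := res.RowsAffected(); err == nil && n == 1 {
			return nil // our write won: nobody modified the row in between
		}
		// Another request updated the row first; loop to reread and retry.
	}
	return errors.New("apply transaction: too many concurrent modifications")
}
```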

Without metrics, logs, and traces, I'd still be guessing.

Key Lessons

  1. Monitoring tells you WHEN, observability tells you WHY. You need both.

  2. Instrument from day one. Adding observability after you have a problem is too late.

  3. Focus on the four golden signals: latency, traffic, errors, saturation. They cover 90% of issues.

  4. Structured logging is non-negotiable for any production service. JSON logs are searchable and parseable.

  5. Distributed tracing becomes essential the moment you have more than one service.

  6. Dashboards should answer questions, not just display data. Build dashboards for specific debugging scenarios.

What's Next

With comprehensive observability in place, you can finally see what's happening in your systems. In Part 4, we'll cover:

  • Incident management and response

  • On-call best practices

  • Writing effective post-mortems

  • Building runbooks that actually help

Conclusion

Observability transformed how I debug and understand my systems. Before implementing these practices, I was flying blind - guessing at problems and hoping for the best. Now I have data to guide every decision.

Start small:

  1. Add Prometheus metrics to one service

  2. Switch to structured logging

  3. Build a simple dashboard

  4. Add tracing when you have multiple services

Each step makes your systems more understandable and your life easier. You'll thank yourself the next time something goes wrong at 2 AM.
