SRE Playbook: Operating Go Microservices in Production

A hands-on series about running software reliably, from first deploy to MLOps, LLMOps, and beyond


This series is different from the SRE 101 guide. That series covers the what and why of SRE. This Playbook is about the how, grounded in a real Go microservices project I built and operate personally.

Every article in this series is based on knowledge I've accumulated building and running distributed systems. No fictional companies, no invented team dynamics. Just the actual patterns, mistakes, and lessons from working with these tools.

The Running Example: GoReliable Platform

Throughout this series, I use a personal Go microservices project as the reference implementation. The platform consists of:

  • API Gateway β€” Go service using chi router, handles authentication, rate limiting, and request routing

  • Order Service β€” Core business logic service (Go + PostgreSQL via pgx)

  • Notification Worker β€” Async processing service (Go + NATS JetStream)

  • ML Inference Gateway β€” Go proxy that routes prediction requests to model serving endpoints

All services are deployed on Kubernetes using Helm charts, managed via GitOps with ArgoCD, and instrumented with Prometheus, OpenTelemetry, and structured logging.

Prerequisites

This is an applied integration series for engineers who already have working knowledge of:

  • Go (intermediate level β€” goroutines, interfaces, error handling)

  • Kubernetes fundamentals (pods, deployments, services)

  • Basic SRE concepts (SLIs, SLOs, error budgets)

If any of those areas feel shaky, I recommend starting with the relevant 101 series first.

Series Structure

Phase 1: Foundation - Build and Deploy

Get the Go services running on Kubernetes with proper GitOps continuous delivery.

The articles in this phase cover:

  • Project structure, API gateway, worker services, Docker, Makefile

  • Helm chart design, health probes, resource management, multi-environment values

  • App-of-apps, ApplicationSet, CI pipeline, Sealed Secrets

Phase 2: Observability and Reliability

Instrument the services, define SLOs, handle incidents, and tune performance.

The articles in this phase cover:

  • Prometheus client_golang, OpenTelemetry, zerolog, Grafana + Loki + Tempo

  • SLI middleware, Sloth SLO tooling, multi-window burn-rate alerts

  • Alertmanager, runbooks as code, Go CLI for incident response

  • k6 load testing, pprof profiling, HPA, Chaos Mesh

Phase 3: ML and AI Operations

Integrate machine learning workflows: training, experiment tracking, LLM serving, and governance.

The articles in this phase cover:

  • Kubeflow Pipelines, Katib, KServe, Go prediction gateway

  • MLflow on K8s, experiment logging, model versioning, Go → MLflow API

  • vLLM on K8s, Go LLM gateway, TTFT SLIs, token cost monitoring

  • Drift detection, automated retraining pipeline, model audit trail

Phase 4: Platform and Perspective

Tie everything together and reflect on what was learned.

The articles in this phase cover:

  • ApplicationSet at scale, multi-cluster, Argo Rollouts, sync waves

  • Full reference architecture, network policies, DR, Kubecost

  • Lessons learned, decision log, cost breakdown, future directions

How I Use This Series

I read and reference these articles when:

  • Onboarding a new service into the platform

  • Debugging reliability issues between the Go services and the ML layer

  • Reviewing whether the SLOs I've set are actually meaningful

  • Planning the next iteration of the GitOps workflow

The goal is that each article is useful standalone but makes the most sense read in order, because each phase builds on the infrastructure established in the previous one.
