SRE Playbook: Operating Go Microservices in Production

A hands-on series about running software reliably, from first deploy to MLOps, LLMOps, and beyond


This series is different from the SRE 101 guide. That series covers the what and why of SRE. This Playbook is about the how, grounded in a real Go microservices project I built and operate personally.

Every article in this series is based on knowledge I've accumulated building and running distributed systems. No fictional companies, no invented team dynamics. Just the actual patterns, mistakes, and lessons from working with these tools.

The Running Example: GoReliable Platform

Throughout this series, I use a personal Go microservices project as the reference implementation. The platform consists of:

  • API Gateway β€” Go service using chi router, handles authentication, rate limiting, and request routing

  • Order Service β€” Core business logic service (Go + PostgreSQL via pgx)

  • Notification Worker β€” Async processing service (Go + NATS JetStream)

  • ML Inference Gateway β€” Go proxy that routes prediction requests to model serving endpoints

All services are deployed on Kubernetes using Helm charts, managed via GitOps with ArgoCD, and instrumented with Prometheus, OpenTelemetry, and structured logging.

Prerequisites

This is an applied integration series for engineers who already have working knowledge of:

  • Go (intermediate level β€” goroutines, interfaces, error handling)

  • Kubernetes fundamentals (pods, deployments, services)

  • Basic SRE concepts (SLIs, SLOs, error budgets)

If any of those areas feel shaky, I recommend starting with the relevant 101 series first.

Series Structure

Phase 1: Foundation - Build and Deploy

Get the Go services running on Kubernetes with proper GitOps continuous delivery.

The articles in this phase cover:

  • Project structure, API gateway, worker services, Docker, Makefile

  • Helm chart design, health probes, resource management, multi-environment values

  • App-of-apps, ApplicationSet, CI pipeline, Sealed Secrets

Phase 2: Observability and Reliability

Instrument the services, define SLOs, handle incidents, and tune performance.

The articles in this phase cover:

  • Prometheus client_golang, OpenTelemetry, zerolog, Grafana + Loki + Tempo

  • SLI middleware, Sloth SLO tooling, multi-window burn-rate alerts

  • Alertmanager, runbooks as code, Go CLI for incident response

  • k6 load testing, pprof profiling, HPA, Chaos Mesh

Phase 3: ML and AI Operations

Integrate machine learning workflows: training, experiment tracking, LLM serving, and governance.

The articles in this phase cover:

  • Kubeflow Pipelines, Katib, KServe, Go prediction gateway

  • MLflow on K8s, experiment logging, model versioning, Go → MLflow API

  • vLLM on K8s, Go LLM gateway, TTFT SLIs, token cost monitoring

  • Drift detection, automated retraining pipeline, model audit trail

Phase 4: Platform and Perspective

Tie everything together and reflect on what was learned.

The articles in this phase cover:

  • ApplicationSet at scale, multi-cluster, Argo Rollouts, sync waves

  • Full reference architecture, network policies, DR, Kubecost

  • Lessons learned, decision log, cost breakdown, future directions

How I Use This Series

I read and reference these articles when:

  • Onboarding a new service into the platform

  • Debugging reliability issues between the Go services and the ML layer

  • Reviewing whether the SLOs I've set are actually meaningful

  • Planning the next iteration of the GitOps workflow

The goal is that each article is useful standalone but makes the most sense read in order, because each phase builds on the infrastructure established in the previous one.
