SRE Playbook: Operating Go Microservices in Production
A hands-on series about running software reliably, from first deploy to MLOps, LLMOps, and beyond
This series is different from the SRE 101 guide: that series covers the what and why of SRE, while this Playbook is about the how, grounded in a real Go microservices project I built and operate myself.
Every article in this series is based on knowledge I've accumulated building and running distributed systems. No fictional companies, no invented team dynamics. Just the actual patterns, mistakes, and lessons from working with these tools.
The Running Example: GoReliable Platform
Throughout this series, I use a personal Go microservices project as the reference implementation. The platform consists of:
API Gateway - Go service using the chi router; handles authentication, rate limiting, and request routing
Order Service - core business logic service (Go + PostgreSQL via pgx)
Notification Worker - async processing service (Go + NATS JetStream)
ML Inference Gateway - Go proxy that routes prediction requests to model-serving endpoints
All services are deployed on Kubernetes using Helm charts, managed via GitOps with ArgoCD, and instrumented with Prometheus, OpenTelemetry, and structured logging.
Prerequisites
This is an applied integration series for engineers who already have working knowledge of:
Go (intermediate level: goroutines, interfaces, error handling)
Kubernetes fundamentals (pods, deployments, services)
Basic SRE concepts (SLIs, SLOs, error budgets)
If any of those areas feel shaky, I recommend starting with the relevant 101 series first:
SRE 101 - SRE fundamentals
Kubernetes 101 - Kubernetes basics
Helm 101 - Helm charts
GitOps 101 - ArgoCD and GitOps
MLOps 101 - MLOps fundamentals
Series Structure
Phase 1: Foundation - Build and Deploy
Get the Go services running on Kubernetes with proper GitOps continuous delivery.
Project structure, API gateway, worker services, Docker, Makefile
Helm chart design, health probes, resource management, multi-environment values
App-of-apps, ApplicationSet, CI pipeline, Sealed Secrets
Phase 2: Observability and Reliability
Instrument the services, define SLOs, handle incidents, and tune performance.
Prometheus client_golang, OpenTelemetry, zerolog, Grafana + Loki + Tempo
SLI middleware, Sloth SLO tooling, multi-window burn-rate alerts
Alertmanager, runbooks as code, Go CLI for incident response
k6 load testing, pprof profiling, HPA, Chaos Mesh
Phase 3: ML and AI Operations
Integrate machine learning workflows: training, experiment tracking, LLM serving, and governance.
Kubeflow Pipelines, Katib, KServe, Go prediction gateway
MLflow on K8s, experiment logging, model versioning, calling the MLflow API from Go
vLLM on K8s, Go LLM gateway, TTFT SLIs, token cost monitoring
Drift detection, automated retraining pipeline, model audit trail
Phase 4: Platform and Perspective
Tie everything together and reflect on what was learned.
ApplicationSet at scale, multi-cluster, Argo Rollouts, sync waves
Full reference architecture, network policies, DR, Kubecost
Lessons learned, decision log, cost breakdown, future directions
How I Use This Series
I read and reference these articles when:
Onboarding a new service into the platform
Debugging reliability issues between the Go services and the ML layer
Reviewing whether the SLOs I've set are actually meaningful
Planning the next iteration of the GitOps workflow
The goal is that each article is useful standalone but makes the most sense read in order, because each phase builds on the infrastructure established in the previous one.