Part 14: Retrospective — Patterns, Anti-Patterns, and What I'd Do Differently
Part of the SRE Playbook series
What You'll Learn: This closing article summarizes every major architectural decision in the GoReliable series, explains which patterns held up in practice and which I'd change, surfaces common anti-patterns that appear in real platforms, shares the cost breakdown of running this stack, and points toward where SRE for AI/ML systems is heading. If you only read one article in the series, this is the one that tells you whether the others are worth reading for your situation.
What Actually Worked
After building this platform across 13 articles, some patterns proved more valuable than their implementation cost.
GitOps as the Source of Truth
The single principle that prevented the most incidents was: every production change is a Git commit. Twice during the series, I was able to answer "what changed 6 hours ago that might have caused this?" by running git log --since="6 hours ago" on the GitOps repo. The answer was there both times — once a resource limit that had been adjusted, once a model version promotion that coincidentally happened near a latency spike (unrelated, but I could confirm it quickly).
The discipline cost is real. Early on I was tempted to kubectl apply -f a quick fix during an incident. I did it once. ArgoCD reconciled and reverted it within minutes, because the Git state didn't match. I had to do the change properly. After that, I stopped resenting the process.
Error Budget Policies Decided Under Calm
The error budget policy table in Part 5 saved me from two bad decisions. Once I had written down "below 10% budget, no new features, only reliability work," I didn't need to have a negotiation about it during an incident. The document made the decision in advance.
The hard part was committing to the policy before I'd ever experienced a budget exhaustion. It felt theoretical. It stopped feeling theoretical when the budget hit 8% and the policy said "no new features this sprint" and I could point to the document.
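The arithmetic behind a policy like this is small enough to encode directly. A minimal sketch, with the 10% threshold taken from the policy above (the function names are mine, not from the series code):

```go
package main

import "fmt"

// budgetRemaining returns the fraction of the error budget left for the
// SLO window, given the SLO target (e.g. 0.999) and observed counts.
func budgetRemaining(sloTarget float64, total, errors int) float64 {
	if total == 0 {
		return 1.0
	}
	allowed := float64(total) * (1 - sloTarget) // errors the budget permits
	if allowed == 0 {
		return 0
	}
	return 1 - float64(errors)/allowed
}

// featureFreezeRequired encodes the written policy: below 10% remaining
// budget, no new features, only reliability work.
func featureFreezeRequired(remaining float64) bool {
	return remaining < 0.10
}

func main() {
	// 1,000,000 requests at a 99.9% SLO permit 1,000 errors.
	// 920 errors consumed leaves 8% of the budget.
	rem := budgetRemaining(0.999, 1_000_000, 920)
	fmt.Printf("remaining budget: %.0f%%, freeze: %v\n", rem*100, featureFreezeRequired(rem))
}
```

Having this as code (or even just as a table in a document) is the point: the decision is computed, not negotiated mid-incident.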
SLIs That Measured User Experience
The availability SLI counts 5xx errors. But the latency SLI (p99 < 300ms) actually caught more user-impacting regressions. Database query performance degradations showed up in latency long before they produced errors. The recommendation model accuracy proxy (click-through rate) caught the drift incident before any user filed a report.
Invest in SLIs that measure what users actually notice, not just what's easy to measure.
Structured Logs with Trace Correlation
The zerolog + OpenTelemetry trace ID injection from Part 4 made debugging production incidents dramatically faster. Being able to go from a Prometheus alert, to a Grafana trace, to the exact log lines for that trace — without search-guessing — is the difference between a 20-minute MTTR and a 2-hour one.
The implementation cost was under 50 lines of Go code once I understood the pattern.
What I'd Change
Starting with Argo Rollouts Earlier
I started with plain Kubernetes Deployments and added Argo Rollouts much later. Every deploy between Parts 2 and 5 was effectively a big-bang rollout that relied on readiness probes to catch failures. I should have started with canary deployments from Part 2. The maxUnavailable: 0 setting helps but it's not a substitute for actual traffic splitting with automatic analysis.
Rate Limiting at the Wrong Layer
In Part 1, I built an in-memory IP-based rate limiter in the API Gateway. It worked for a single pod. When the API Gateway scaled to 3 pods (HPA), each pod had its own rate limiter state. A client could make 3× the allowed requests by distributing across all pods.
The right solution is a centralized rate limiter — Redis-backed, or handled at the Ingress layer via nginx.ingress.kubernetes.io/rate-limit annotations. I documented the single-pod limitation but didn't fix it in the series. In a real production system, this would be a priority fix.
Starting with a Namespace-Per-Environment-Per-Team Model
I used two namespaces (go-reliable-staging, go-reliable-production). For a small project this was fine. For a larger organization, the recommended pattern is namespace-per-environment-per-team, with NetworkPolicies at the namespace boundary enforcing team isolation. Starting there costs nothing additional but scales better.
KubeFlow Was Heavyweight for the Use Case
KubeFlow Pipelines is a significant operational burden — Istio, Argo Workflows, multiple controllers, and its own auth layer (dex). For a single model and a weekly retraining schedule, Argo Workflows alone would have been simpler.
KubeFlow makes sense when you have multiple teams running many experiments simultaneously and need the dashboard for experiment comparison. For my single-team project, it was overengineered. I'd use Argo Workflows + MLFlow directly for small-to-medium ML shops.
Common Anti-Patterns I Avoided (and Why They Matter)
Storing Secrets in Git
Every secret reference in the GitOps repo is an ExternalSecret CRD pointing to AWS Secrets Manager. I never committed a secret value. This seems obvious, but I've reviewed production systems where database passwords were in Helm values files committed to internal Git repos. "Internal repo" is not a security boundary.
Treating Training Accuracy as Production Accuracy
The recommendation model had 84% test accuracy at training time. In production, over time, it was effectively 71% (estimated from click-through rate). Training accuracy and production accuracy are different numbers unless you continuously monitor and retrain. The drift detection in Part 11 exists precisely because of this gap.
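The core of a CTR-based drift check is a relative-drop comparison against a baseline recorded when the model shipped. A minimal sketch — the baseline CTR and the 15% tolerance are illustrative values I chose, not the thresholds used in Part 11:

```go
package main

import "fmt"

// ctrDrift reports whether the observed click-through rate has dropped by
// more than tolerance relative to the baseline recorded at deploy time.
func ctrDrift(baselineCTR, currentCTR, tolerance float64) bool {
	if baselineCTR == 0 {
		return false // no baseline yet, nothing to compare against
	}
	return (baselineCTR-currentCTR)/baselineCTR > tolerance
}

func main() {
	baseline := 0.062 // CTR measured right after the model shipped
	current := 0.051  // CTR over the last 24h of production traffic
	// 0.051 is a ~17.7% relative drop, past the 15% tolerance.
	fmt.Println("drift detected:", ctrDrift(baseline, current, 0.15))
}
```

The important property is that the check runs against production behavior, not the held-out test set, which is how the gap between 84% and 71% becomes visible at all.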
Ignoring resources.limits for ML Workloads
KubeFlow training Jobs that run without CPU limits will consume an entire node during training, starving application pods. I set explicit resource requests and limits on every training step, sized with headroom for the actual training workload but not unbounded. The Kyverno admission policy in Part 13 enforces this even if someone forgets.
Using latest Tag in Production
Every application deployment in production uses an immutable image tag (the Git commit SHA). Running latest in production means you don't know what's running, and a node restart can pull a different image than what was tested. The ApplicationSet ignores the imageTag: "stable" parameter in production values and always requires a CI-set SHA.
Alerting on Symptoms vs. Causes
My initial alert setup had one alert for every metric crossing a threshold: CPU > 80%, memory > 80%, error rate > 1%. That's 12 alerts that all fire simultaneously when the Order Service has a DB connection problem.
The final alert setup (Part 6) has one alert tree: a burn rate alert fires when the error budget is consumed unusually fast. That root alert is the signal; the other metrics are the debugging tools. This moved from "12 alerts that tell you something is wrong" to "1 alert that tells you something is wrong" with the others in dashboards only.
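The burn rate itself is a one-line calculation: the observed error rate divided by the allowed error rate (1 minus the SLO target). A minimal sketch — the 14.4 page threshold mentioned in the comment is the commonly cited default for a 1-hour window (from the Google SRE Workbook's multiwindow alerting guidance), not necessarily the exact threshold used in Part 6:

```go
package main

import "fmt"

// burnRate is how many times faster than "exactly exhausting the budget
// over the SLO period" errors are currently arriving: the observed error
// rate divided by the allowed error rate (1 - SLO target).
func burnRate(errorRate, sloTarget float64) float64 {
	return errorRate / (1 - sloTarget)
}

func main() {
	const slo = 0.999 // 99.9% availability => 0.1% error budget

	// A 1.4% error rate over the last hour burns budget ~14x too fast;
	// a common page threshold for the 1h window is burn rate > 14.4.
	fmt.Printf("burn rate: %.1f\n", burnRate(0.014, slo))
}
```

Because the alert is defined on budget consumption rather than per-metric thresholds, one signal covers every failure mode that actually costs reliability; CPU and memory stay on dashboards as debugging context.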
Cost Breakdown
Running the complete GoReliable platform on EKS for one month (all components, both environments):
EKS cluster control plane: $73
Worker nodes (app workloads, m5.xlarge × 3): $210
Worker node (ML/vLLM, m5.4xlarge × 1): $560
RDS PostgreSQL (db.t3.medium, Multi-AZ): $98
S3 (model artifacts, logs): $12
Istio/KubeFlow overhead (2× on ML node): included above
NAT Gateway + data transfer: $45
Total: ~$1,000/month
More than half of the cost is the single ML node. If I removed the vLLM server and ran the recommendation model on a smaller CPU instance, the platform drops to ~$400/month. This is the cost-visibility argument for Kubecost: before I set it up, I didn't know vLLM was costing more than all four application microservices combined.
Where SRE for AI/ML Is Heading
A few directions I'm watching that will change how we operate these systems:
eBPF-based observability. Tools like Cilium + Hubble provide L4/L7 network observability and service mesh capabilities without sidecar proxies. For ML workloads where every CPU cycle matters, avoiding sidecar overhead is meaningful. The same network policies I wrote as Kubernetes NetworkPolicies in Part 13 become Cilium NetworkPolicies with richer debugging.
AI-assisted SRE. LLM-powered alert analysis is genuinely useful today for correlating alerts with recent Git changes and suggesting runbook steps. It's not replacing SRE judgment but it's reducing the time to first hypothesis. The same ML infrastructure I built in this series is already being used to handle routine on-call tasks.
Continuous evaluation pipelines. The drift detection in Part 11 runs daily on a schedule. The next iteration is continuous: every inference result is compared to ground truth (when it becomes available) and the model's production accuracy is tracked as a time-series metric with an SLO. Same reliability tooling, applied to model quality.
Standardized ML telemetry. The OpenTelemetry spec is expanding to cover ML training runs (spans for training steps, metrics for training loss). When that stabilizes, the same instrumentation pattern from Part 4 will apply to KubeFlow pipelines — no separate MLFlow SDK needed for tracing.
The 14-Article Summary in One Paragraph
Build your Go services with strong foundations (Part 1), deploy them with Helm charts that encode reliability defaults (Part 2), manage all changes through GitOps (Part 3), instrument everything with metrics/traces/logs (Part 4), define SLOs before you need them (Part 5), automate incident response where you can (Part 6), test your capacity assumptions with load tests and chaos (Part 7), add ML workloads to the same GitOps-managed platform (Parts 8-11), understand the full platform as a system (Parts 12-13), and then retrospect honestly about what worked (Part 14). The through-line is the same in all 14 parts: make the system's state observable, make changes reversible, and make decisions in advance.
This series was built as a practical reference, not a tutorial. If a pattern worked for GoReliable, it'll likely work for a similar-scale platform. If your scale is 10× or 100× larger, the principles hold but the implementations will differ.
The SRE Playbook series index has links to all 14 parts.