Part 13: Reliability at Every Layer — The Complete Platform Reference

Part of the SRE Playbook series

What You'll Learn: This article is the capstone reference for the GoReliable platform. It covers the full reliability architecture — network policies for service-to-service isolation, Kubernetes RBAC, Kyverno admission policies, cost observability with Kubecost, disaster recovery testing, and a final architecture diagram that shows how all 14 parts connect. This is the article I reference when someone asks "how does the platform actually work end to end?"

The Architecture at a Glance

After 12 articles of building, the platform looks like this:

Internet

    ├── Ingress NGINX (TLS termination, rate limiting)

    ├── API Gateway (Go, chi)
    │       ├── → Order Service (Go, PostgreSQL)
    │       │           └── NATS JetStream → Notification Worker (Go)
    │       ├── → ML Inference Gateway (Go)
    │       │           └── → KServe InferenceService (recommendation model)
    │       └── → LLM Gateway (Go)
    │                   └── → vLLM (Mistral 7B Q4)

    ├── KubeFlow Pipelines
    │       ├── Training pipelines (Python, runs as K8s Jobs)
    │       ├── Katib (hyperparameter optimization)
    │       └── → MLFlow (experiment tracking + model registry)

    └── Observability Stack
            ├── Prometheus (metrics)
            ├── Grafana (dashboards)
            ├── Loki (logs)
            ├── Tempo (traces)
            └── OTel Collector (pipeline)

Every component is deployed via ArgoCD from the go-reliable-gitops repository. Every secret is in AWS Secrets Manager, injected via External Secrets Operator. Every service has SLIs, and every SLI has an error budget policy.
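The External Secrets pattern looks roughly like this — a minimal sketch, where the store name, secret path, and target Secret name are illustrative rather than the platform's actual values:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: order-service-db
  namespace: go-reliable-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager        # ClusterSecretStore backed by AWS Secrets Manager
    kind: ClusterSecretStore
  target:
    name: order-service-db           # Kubernetes Secret the operator creates and syncs
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: go-reliable/order-service/database-url   # path in Secrets Manager
```

The operator polls Secrets Manager on the refresh interval and keeps the Kubernetes Secret in sync, so rotation happens without a redeploy.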

Network Policies: Least-Privilege Communication

Without network policies, every pod in the cluster can reach every other pod. That's not acceptable. I use Kubernetes NetworkPolicy to enforce that only intended communication paths work.

Each service has its own NetworkPolicy. Order Service can reach PostgreSQL and NATS but not KServe. ML Inference Gateway can reach KServe but not NATS. This is the service-mesh equivalent of firewall rules.
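A sketch of the Order Service policy, assuming illustrative pod labels and ports (the real labels live in the GitOps repo): ingress only from the API Gateway, egress only to NATS, PostgreSQL, and DNS.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: order-service
  namespace: go-reliable-production
spec:
  podSelector:
    matchLabels:
      app: order-service
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway       # only the gateway may call Order Service
      ports:
        - port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: nats
      ports:
        - port: 4222                 # NATS client port
    - ports:
        - port: 5432                 # PostgreSQL on RDS (outside the cluster,
                                     #   so port-only, no pod selector)
    - ports:
        - port: 53                   # DNS
          protocol: UDP
```

Anything not matched — KServe, the LLM Gateway, other namespaces — is denied by omission.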

Kubernetes RBAC

Application pods use dedicated ServiceAccounts with minimal permissions. The ML controller is the only component that needs RBAC permissions on other Kubernetes resources.

Application services (API Gateway, Order Service, etc.) use ServiceAccounts with zero Kubernetes API permissions. They don't need to call the Kubernetes API at all.
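As a sketch (names and the exact resource list are illustrative): a zero-permission ServiceAccount for an application pod — no Role or RoleBinding at all — next to the ML controller's narrowly scoped Role.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: order-service
  namespace: go-reliable-production
automountServiceAccountToken: false   # the pod never talks to the K8s API
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-controller
  namespace: go-reliable-ml
rules:
  - apiGroups: ["serving.kserve.io"]
    resources: ["inferenceservices"]
    verbs: ["get", "list", "watch", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-controller
  namespace: go-reliable-ml
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ml-controller
subjects:
  - kind: ServiceAccount
    name: ml-controller
    namespace: go-reliable-ml
```

Disabling token automount on application ServiceAccounts means a compromised pod has no API credentials to abuse.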

Kyverno Admission Policies

I use Kyverno (rather than OPA/Gatekeeper) for admission control. These policies run on every resource creation or update:
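For instance, a hedged sketch of one such policy — rejecting any pod in a go-reliable-* namespace that lacks CPU and memory limits (the policy name and exact pattern are illustrative):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce    # reject, don't just audit
  rules:
    - name: require-limits
      match:
        any:
          - resources:
              kinds: ["Pod"]
              namespaces: ["go-reliable-*"]
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"      # any non-empty value
                    cpu: "?*"
```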

Kyverno also has a generate rule I use to automatically create default NetworkPolicies for any new namespace that matches go-reliable-* — so I can't accidentally deploy a new service in an unprotected namespace.
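The generate rule might look roughly like this — a sketch assuming a default-deny baseline; the real policy lives in the GitOps repo:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: default-deny-new-namespaces
spec:
  rules:
    - name: generate-default-deny
      match:
        any:
          - resources:
              kinds: ["Namespace"]
              names: ["go-reliable-*"]
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny
        namespace: "{{request.object.metadata.name}}"
        synchronize: true             # re-create it if someone deletes it
        data:
          spec:
            podSelector: {}           # applies to every pod in the namespace
            policyTypes: ["Ingress", "Egress"]
```

A new namespace starts fully locked down; each service then gets an explicit allow-list policy like the ones above.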

Cost Observability with Kubecost

The platform runs on EKS. Without cost visibility, it's easy to over-provision ML workloads and not notice.

I deploy Kubecost alongside the observability stack:
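Roughly, the install points Kubecost's official Helm chart at the existing Prometheus rather than its bundled one (release and namespace names are illustrative; the FQDN assumes a Prometheus service in a monitoring namespace):

```shell
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm upgrade --install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  --set global.prometheus.enabled=false \
  --set global.prometheus.fqdn=http://prometheus-server.monitoring.svc
```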

I expose three Kubecost metrics to Prometheus for alerting:

Cost alert: if the go-reliable-production namespace exceeds $500/month projected cost, I get a Slack notification. This caught an incident where a crashed pod was restarting in a loop and generating excessive log volume charges.
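A hedged sketch of that alert as a PrometheusRule, assuming Kubecost's standard exported metrics (node_cpu_hourly_cost, node_ram_hourly_cost, container_cpu_allocation, container_memory_allocation_bytes); 730 approximates hours per month, and the join labels depend on the Kubecost/Prometheus setup:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: namespace-cost
  namespace: monitoring
spec:
  groups:
    - name: cost
      rules:
        - alert: NamespaceMonthlyCostProjection
          expr: |
            (
              sum by (namespace) (container_cpu_allocation
                * on (node) group_left () node_cpu_hourly_cost)
              + sum by (namespace) (container_memory_allocation_bytes / 2^30
                * on (node) group_left () node_ram_hourly_cost)
            ) * 730 > 500
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Projected monthly cost for {{ $labels.namespace }} exceeds $500"
```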

The ML workloads (KubeFlow training, vLLM, KServe) are labeled with cost-team: ml-platform. I use Kubecost allocation to see ML infrastructure costs separately from application runtime costs.
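That allocation view is also queryable from Kubecost's allocation API — a sketch assuming a default in-cluster install (service name and port may differ):

```shell
# Forward the Kubecost service locally, then ask for a 7-day cost window
# aggregated by the cost-team label.
kubectl -n kubecost port-forward svc/kubecost-cost-analyzer 9090 &
curl -s "http://localhost:9090/model/allocation?window=7d&aggregate=label:cost-team"
```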

Disaster Recovery Testing

The third item on my reliability checklist (after monitoring and capacity) is: can I recover from a full cluster failure?

My DR strategy for GoReliable:

Component            RPO      RTO      Recovery Mechanism
PostgreSQL           5 min    30 min   AWS RDS Multi-AZ, point-in-time restore
NATS JetStream       0        15 min   Message ACK, in-flight replay from queue
Application code     0        10 min   Helm chart + ArgoCD redeploy from Git
Model artifacts      0        5 min    S3 + KServe re-pull
MLFlow metadata      15 min   45 min   RDS backup, Velero for K8s objects
Grafana dashboards   0        5 min    Dashboards stored as ConfigMaps in GitOps

I test the RTO numbers quarterly by simulating a cluster replacement:
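The drill, roughly as run against a scratch cluster — namespace and ArgoCD application names are illustrative. Start the RTO timer, delete everything, and let ArgoCD rebuild from Git:

```shell
start=$(date +%s)

# Simulate total loss of the workload namespaces.
kubectl delete namespace go-reliable-production go-reliable-ml --wait=true

# ArgoCD sees the drift and re-creates every resource from the GitOps repo.
argocd app sync go-reliable-production go-reliable-ml --timeout 600

# The clock stops when every deployment reports Available.
kubectl wait deployment --all \
  --namespace go-reliable-production \
  --for=condition=Available --timeout=600s

echo "Recovery took $(( $(date +%s) - start ))s"
```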

The last DR test took 7 minutes to go from deleted namespaces to all deployments Available. The target was 10 minutes. ArgoCD is the recovery tool — there's nothing to reconstruct manually.

The SLO Dashboard

Everything comes together in a single Grafana dashboard that I check at the start of each week:

Panel                       Query                 Threshold
API Gateway availability    30-day rolling SLI    99.9%
API Gateway latency SLO     30-day p99 < 300ms    99%
Order success rate          30-day rolling        99.5%
Notification delivery       30-day rolling        99%
ML inference availability   30-day rolling        99.5%
LLM p99 latency             30-day rolling        99% (< 500ms)
Error budget remaining      Each SLO              Alert < 10%
Model drift status          Current state         Alert = detected

The dashboard makes reliability visible. When error budget is above 50%, I work on features. When it's below 20%, I work on reliability. The policy (from Part 5) removes the subjectivity from that decision.

In Part 14, I close the series with a retrospective — what patterns actually worked, what I'd change if starting over, decisions I'd make differently, and where I see SRE for AI/ML systems heading.
