Part 13: Reliability at Every Layer — The Complete Platform Reference

Part of the SRE Playbook series

What You'll Learn: This article is the capstone reference for the GoReliable platform. It covers the full reliability architecture — network policies for service-to-service isolation, Kubernetes RBAC, Kyverno admission policies, cost observability with Kubecost, disaster recovery testing, and a final architecture diagram that shows how all 14 parts connect. This is the article I reference when someone asks "how does the platform actually work end to end?"

The Architecture at a Glance

After 12 articles of building, the platform looks like this:

Internet

    ├── Ingress NGINX (TLS termination, rate limiting)

    ├── API Gateway (Go, chi)
    │       ├── → Order Service (Go, PostgreSQL)
    │       │           └── NATS JetStream → Notification Worker (Go)
    │       ├── → ML Inference Gateway (Go)
    │       │           └── → KServe InferenceService (recommendation model)
    │       └── → LLM Gateway (Go)
    │                   └── → vLLM (Mistral 7B Q4)

    ├── KubeFlow Pipelines
    │       ├── Training pipelines (Python, runs as K8s Jobs)
    │       ├── Katib (hyperparameter optimization)
    │       └── → MLFlow (experiment tracking + model registry)

    └── Observability Stack
            ├── Prometheus (metrics)
            ├── Grafana (dashboards)
            ├── Loki (logs)
            ├── Tempo (traces)
            └── OTel Collector (pipeline)

Every component is deployed via ArgoCD from the go-reliable-gitops repository. Every secret is in AWS Secrets Manager, injected via External Secrets Operator. Every service has SLIs, and every SLI has an error budget policy.
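The External Secrets pattern looks roughly like this — a minimal sketch, where the store name, secret path, and target Secret name are illustrative rather than the platform's actual values:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: order-service-db
  namespace: go-reliable-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager        # ClusterSecretStore backed by AWS Secrets Manager
    kind: ClusterSecretStore
  target:
    name: order-service-db           # Kubernetes Secret the operator creates and syncs
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: go-reliable/order-service/database-url   # path in Secrets Manager
```

The operator polls Secrets Manager on the refresh interval and keeps the Kubernetes Secret in sync, so rotation happens without a redeploy.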

Network Policies: Least-Privilege Communication

Without network policies, every pod in the cluster can reach every other pod. That's not acceptable. I use Kubernetes NetworkPolicy to enforce that only intended communication paths work.

Each service has its own NetworkPolicy. Order Service can reach PostgreSQL and NATS but not KServe. ML Inference Gateway can reach KServe but not NATS. This is the service-mesh equivalent of firewall rules.
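A sketch of the Order Service policy, assuming illustrative pod labels and ports (the real labels live in the GitOps repo): ingress only from the API Gateway, egress only to NATS, PostgreSQL, and DNS.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: order-service
  namespace: go-reliable-production
spec:
  podSelector:
    matchLabels:
      app: order-service
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway       # only the gateway may call Order Service
      ports:
        - port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: nats
      ports:
        - port: 4222                 # NATS client port
    - ports:
        - port: 5432                 # PostgreSQL on RDS (outside the cluster,
                                     #   so port-only, no pod selector)
    - ports:
        - port: 53                   # DNS
          protocol: UDP
```

Anything not matched — KServe, the LLM Gateway, other namespaces — is denied by omission.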

Kubernetes RBAC

Application pods use dedicated ServiceAccounts with minimal permissions. The ML controller is the only component that needs RBAC permissions on other Kubernetes resources.

Application services (API Gateway, Order Service, etc.) use ServiceAccounts with zero Kubernetes API permissions. They don't need to call the Kubernetes API at all.
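As a sketch (names and the exact resource list are illustrative): a zero-permission ServiceAccount for an application pod — no Role or RoleBinding at all — next to the ML controller's narrowly scoped Role.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: order-service
  namespace: go-reliable-production
automountServiceAccountToken: false   # the pod never talks to the K8s API
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-controller
  namespace: go-reliable-ml
rules:
  - apiGroups: ["serving.kserve.io"]
    resources: ["inferenceservices"]
    verbs: ["get", "list", "watch", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-controller
  namespace: go-reliable-ml
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ml-controller
subjects:
  - kind: ServiceAccount
    name: ml-controller
    namespace: go-reliable-ml
```

Disabling token automount on application ServiceAccounts means a compromised pod has no API credentials to abuse.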

Kyverno Admission Policies

I use Kyverno (rather than OPA/Gatekeeper) for admission control. These policies run on every resource creation or update:
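For instance, a hedged sketch of one such policy — rejecting any pod in a go-reliable-* namespace that lacks CPU and memory limits (the policy name and exact pattern are illustrative):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce    # reject, don't just audit
  rules:
    - name: require-limits
      match:
        any:
          - resources:
              kinds: ["Pod"]
              namespaces: ["go-reliable-*"]
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"      # any non-empty value
                    cpu: "?*"
```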

Kyverno also has a generate rule I use to automatically create default NetworkPolicies for any new namespace that matches go-reliable-* — so I can't accidentally deploy a new service in an unprotected namespace.
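The generate rule might look roughly like this — a sketch assuming a default-deny baseline; the real policy lives in the GitOps repo:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: default-deny-new-namespaces
spec:
  rules:
    - name: generate-default-deny
      match:
        any:
          - resources:
              kinds: ["Namespace"]
              names: ["go-reliable-*"]
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny
        namespace: "{{request.object.metadata.name}}"
        synchronize: true             # re-create it if someone deletes it
        data:
          spec:
            podSelector: {}           # applies to every pod in the namespace
            policyTypes: ["Ingress", "Egress"]
```

A new namespace starts fully locked down; each service then gets an explicit allow-list policy like the ones above.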

Cost Observability with Kubecost

The platform runs on EKS. Without cost visibility, it's easy to over-provision ML workloads and not notice.

I deploy Kubecost alongside the observability stack:
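Roughly, the install points Kubecost's official Helm chart at the existing Prometheus rather than its bundled one (release and namespace names are illustrative; the FQDN assumes a Prometheus service in a monitoring namespace):

```shell
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm upgrade --install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  --set global.prometheus.enabled=false \
  --set global.prometheus.fqdn=http://prometheus-server.monitoring.svc
```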

I expose three Kubecost metrics to Prometheus for alerting:

Cost alert: if the go-reliable-production namespace exceeds $500/month projected cost, I get a Slack notification. This caught an incident where a crashed pod was restarting in a loop and generating excessive log volume charges.
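A hedged sketch of that alert as a PrometheusRule, assuming Kubecost's standard exported metrics (node_cpu_hourly_cost, node_ram_hourly_cost, container_cpu_allocation, container_memory_allocation_bytes); 730 approximates hours per month, and the join labels depend on the Kubecost/Prometheus setup:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: namespace-cost
  namespace: monitoring
spec:
  groups:
    - name: cost
      rules:
        - alert: NamespaceMonthlyCostProjection
          expr: |
            (
              sum by (namespace) (container_cpu_allocation
                * on (node) group_left () node_cpu_hourly_cost)
              + sum by (namespace) (container_memory_allocation_bytes / 2^30
                * on (node) group_left () node_ram_hourly_cost)
            ) * 730 > 500
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Projected monthly cost for {{ $labels.namespace }} exceeds $500"
```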

The ML workloads (KubeFlow training, vLLM, KServe) are labeled with cost-team: ml-platform. I use Kubecost allocation to see ML infrastructure costs separately from application runtime costs.
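That allocation view is also queryable from Kubecost's allocation API — a sketch assuming a default in-cluster install (service name and port may differ):

```shell
# Forward the Kubecost service locally, then ask for a 7-day cost window
# aggregated by the cost-team label.
kubectl -n kubecost port-forward svc/kubecost-cost-analyzer 9090 &
curl -s "http://localhost:9090/model/allocation?window=7d&aggregate=label:cost-team"
```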

Disaster Recovery Testing

The third item on my reliability checklist (after monitoring and capacity) is: can I recover from a full cluster failure?

My DR strategy for GoReliable:

Component            RPO      RTO      Recovery Mechanism
PostgreSQL           5 min    30 min   AWS RDS Multi-AZ, point-in-time restore
NATS JetStream       0        15 min   Message ACK, in-flight replay from queue
Application code     0        10 min   Helm chart + ArgoCD redeploy from Git
Model artifacts      0        5 min    S3 + KServe re-pull
MLFlow metadata      15 min   45 min   RDS backup, Velero for K8s objects
Grafana dashboards   0        5 min    Dashboards stored as ConfigMaps in GitOps

I test the RTO numbers quarterly by simulating a cluster replacement:
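The drill, roughly as run against a scratch cluster — namespace and ArgoCD application names are illustrative. Start the RTO timer, delete everything, and let ArgoCD rebuild from Git:

```shell
start=$(date +%s)

# Simulate total loss of the workload namespaces.
kubectl delete namespace go-reliable-production go-reliable-ml --wait=true

# ArgoCD sees the drift and re-creates every resource from the GitOps repo.
argocd app sync go-reliable-production go-reliable-ml --timeout 600

# The clock stops when every deployment reports Available.
kubectl wait deployment --all \
  --namespace go-reliable-production \
  --for=condition=Available --timeout=600s

echo "Recovery took $(( $(date +%s) - start ))s"
```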

The last DR test took 7 minutes to go from deleted namespaces to all deployments Available. The target was 10 minutes. ArgoCD is the recovery tool — there's nothing to reconstruct manually.

The SLO Dashboard

Everything comes together in a single Grafana dashboard that I check at the start of each week:

Panel                       Query                 Threshold
API Gateway availability    30-day rolling SLI    99.9%
API Gateway latency SLO     30-day p99 < 300ms    99%
Order success rate          30-day rolling        99.5%
Notification delivery       30-day rolling        99%
ML inference availability   30-day rolling        99.5%
LLM p99 latency             30-day rolling        99% (< 500ms)
Error budget remaining      Each SLO              Alert < 10%
Model drift status          Current state         Alert = detected

The dashboard makes reliability visible. When error budget is above 50%, I work on features. When it's below 20%, I work on reliability. The policy (from Part 5) removes the subjectivity from that decision.

In Part 14, I close the series with a retrospective — what patterns actually worked, what I'd change if starting over, decisions I'd make differently, and where I see SRE for AI/ML systems heading.
