Part 13: Reliability at Every Layer — The Complete Platform Reference
The Architecture at a Glance
Internet
│
├── Ingress NGINX (TLS termination, rate limiting)
│
├── API Gateway (Go, chi)
│   ├── → Order Service (Go, PostgreSQL)
│   │   └── NATS JetStream → Notification Worker (Go)
│   ├── → ML Inference Gateway (Go)
│   │   └── → KServe InferenceService (recommendation model)
│   └── → LLM Gateway (Go)
│       └── → vLLM (Mistral 7B Q4)
│
├── KubeFlow Pipelines
│   ├── Training pipelines (Python, runs as K8s Jobs)
│   ├── Katib (hyperparameter optimization)
│   └── → MLFlow (experiment tracking + model registry)
│
└── Observability Stack
    ├── Prometheus (metrics)
    ├── Grafana (dashboards)
    ├── Loki (logs)
    ├── Tempo (traces)
    └── OTel Collector (pipeline)

Network Policies: Least-Privilege Communication
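The arrows in the architecture diagram can be read as an allow-list: anything not drawn is denied. Below is a minimal Go sketch of that default-deny idea. It models the policy decision only and is not a Kubernetes NetworkPolicy manifest. The workload names (`api-gateway`, `order-service`, and so on) are hypothetical labels chosen for illustration, and recording both the Order Service and the Notification Worker as clients that connect to NATS is an interpretation of the diagram, not something the source states.

```go
package main

import "fmt"

// edge is a directed "caller -> callee" connection between two workloads.
type edge struct{ from, to string }

// allowedEdges mirrors the arrows in the architecture diagram.
// All names are hypothetical labels, not taken from real manifests.
// Assumption: NATS clients open the connection to the server, so the
// Notification Worker is recorded as a caller of NATS.
var allowedEdges = map[edge]bool{
	{"api-gateway", "order-service"}:        true,
	{"api-gateway", "ml-inference-gateway"}: true,
	{"api-gateway", "llm-gateway"}:          true,
	{"order-service", "nats"}:               true,
	{"notification-worker", "nats"}:         true,
	{"ml-inference-gateway", "kserve"}:      true,
	{"llm-gateway", "vllm"}:                 true,
}

// allowed reports whether from may open a connection to to.
// Any pair missing from the map is denied (default deny).
func allowed(from, to string) bool {
	return allowedEdges[edge{from, to}]
}

func main() {
	fmt.Println(allowed("api-gateway", "llm-gateway"))   // true
	fmt.Println(allowed("llm-gateway", "order-service")) // false
	fmt.Println(allowed("order-service", "api-gateway")) // false
}
```

A table like this can double as a review checklist: each `true` entry should correspond to exactly one ingress rule in the real policies, and nothing else should be reachable.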
Kubernetes RBAC
Kyverno Admission Policies
Cost Observability with Kubecost
Disaster Recovery Testing
Component | RPO | RTO | Recovery Mechanism
The SLO Dashboard
Panel | Query | Threshold
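A common way to turn an SLO panel into a threshold is the error-budget burn rate: the observed error ratio divided by the budget the SLO allows. Below is a minimal Go sketch of that arithmetic. The function name `burnRate`, the 99.9% target, and the 14.4 alert threshold are assumptions for illustration. 14.4 is a commonly cited fast-burn value for a one-hour window on a 30-day SLO, not a figure from this article.

```go
package main

import (
	"errors"
	"fmt"
)

// fastBurnThreshold is a hypothetical alerting threshold; tune it per SLO.
const fastBurnThreshold = 14.4

// burnRate returns how many times faster than "exactly on budget" the
// error budget is being consumed. A value of 1.0 means the budget would
// last exactly the SLO window. sloTarget must be below 1.0.
func burnRate(errorRatio, sloTarget float64) (float64, error) {
	budget := 1 - sloTarget
	if budget <= 0 {
		return 0, errors.New("SLO target of 100% leaves no error budget")
	}
	return errorRatio / budget, nil
}

func main() {
	// Example: 0.5% of requests failing against a 99.9% availability SLO.
	rate, err := burnRate(0.005, 0.999)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("burn rate: %.1f (alert: %t)\n", rate, rate >= fastBurnThreshold)
	// prints "burn rate: 5.0 (alert: false)"
}
```

In a dashboard, the same ratio would typically be computed by the metrics query itself, with the panel threshold set to the chosen burn rate.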