Part 10: LLMOps — Operating Large Language Models Reliably
Part of the SRE Playbook series
What You'll Learn: This article covers deploying a quantized LLM on CPU-based Kubernetes nodes using vLLM, writing a Go LLM Gateway that enforces rate limits and token budgets, defining SLIs specific to LLM workloads (time-to-first-token, tokens per second, token budget compliance), and using Argo Rollouts for A/B model switching without downtime. No GPU cluster required for the setup shown here.
LLMs as Production Services
The GoReliable platform processes orders. Each order has a free-text description field. I wanted to use a language model to generate a short, normalized description — turning "lgr pep pzza x2" into "2× Large Pepperoni Pizza" for display in receipts.
This is a modest use case, but it has real production requirements:
Latency: must complete within 500ms (it's in the checkout path)
Cost: can't make an external API call for every order
Reliability: if this fails, the order must still succeed (graceful degradation)
I host a quantized 7B model (Mistral 7B Q4_K_M) on a CPU node with enough RAM. It's not fast, but it's private, cheap, and predictable.
For LLMOps concepts, see the LLMOps track in the AI-Machine Learning section.
Why Not Just Use an API?
For a commercial product, externally hosted APIs (OpenAI, Anthropic, etc.) are reasonable. But for this platform:
I want to control latency SLAs without dependency on a third party's uptime
I want to test the LLMOps patterns end-to-end, including deployment and monitoring
The data (order items) is internal, and I prefer it stays internal
Running CPU-only means lower throughput — but with a p99 target of 500ms, a quantized 7B on a modern CPU handles the volume I need if I size it right.
Deploying vLLM on Kubernetes
vLLM is the serving engine. It handles batching, KV caching, and the OpenAI-compatible API layer.
The model weights are stored in a PersistentVolume (backed by EFS for the cluster). A one-time Job copies the quantized model from S3 to the PV at initial setup. After that, vLLM just mounts and reads.
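A minimal Deployment sketch for this setup. The image tag, PVC name, model path, and flags are illustrative assumptions, not the platform's actual manifests:

```yaml
# vLLM server mounting pre-staged weights from a PVC (names are illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels: {app: vllm-server}
  template:
    metadata:
      labels: {app: vllm-server}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=/models/mistral-7b-q4   # quantized weights copied from S3
            - --device=cpu
          ports:
            - containerPort: 8000             # OpenAI-compatible API
          volumeMounts:
            - name: model-weights
              mountPath: /models
              readOnly: true
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: llm-model-pvc          # EFS-backed PV, filled by the one-time Job
```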
The Go LLM Gateway
Direct calls to vLLM from application services would work, but they bypass the governance I want:
Rate limiting per service (the Order Service shouldn't crowd out other services)
Token budget enforcement (unbounded prompts make latency unpredictable)
Request logging for debugging (keeping prompt/response samples for a retention window)
Graceful degradation when the LLM is slow or unavailable
LLM-Specific SLIs
Standard HTTP SLIs don't capture everything important about LLM serving. I add three LLM-specific SLIs:
Time to First Token (TTFT)
For streaming responses, TTFT measures how long before the first token appears. Even though I use non-streaming completions here, I measure the total latency as a proxy:
LLM latency SLI (p99 < 500ms):
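A PromQL sketch of this SLI, assuming the gateway exports a latency histogram named llm_gateway_request_duration_seconds (the metric name is illustrative):

```promql
histogram_quantile(0.99,
  sum by (le) (rate(llm_gateway_request_duration_seconds_bucket[5m])))
```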
LLM fallback rate (< 1% of requests should fall back to raw input):
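A PromQL sketch, assuming the gateway increments counters named llm_gateway_fallback_total and llm_gateway_requests_total (illustrative names):

```promql
sum(rate(llm_gateway_fallback_total[5m]))
  / sum(rate(llm_gateway_requests_total[5m]))
```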
Token budget utilization (monitor for prompt inflation):
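A PromQL sketch for average prompt size per request, assuming a llm_gateway_prompt_tokens_total counter (illustrative name):

```promql
sum(rate(llm_gateway_prompt_tokens_total[5m]))
  / sum(rate(llm_gateway_requests_total[5m]))
```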
If average prompt tokens creep up, it usually means an upstream service is passing longer text than expected — worth an alert.
A/B Model Switching with Argo Rollouts
When I want to try a different model (say Phi-3 Mini vs Mistral 7B), I use an Argo Rollout for the LLM server deployment:
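An illustrative Rollout strategy: shift a small slice of traffic to the new model, gate on a p99 latency analysis, then continue. The names, Prometheus address, and metric are assumptions, not the platform's actual manifests:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels: {app: vllm-server}
  template:
    metadata:
      labels: {app: vllm-server}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # candidate model configured here
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: llm-p99-latency
        - setWeight: 50
        - pause: {duration: 5m}
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: llm-p99-latency
spec:
  metrics:
    - name: p99-latency
      interval: 1m
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            histogram_quantile(0.99, sum by (le)
              (rate(llm_gateway_request_duration_seconds_bucket[5m])))
      successCondition: result[0] < 0.5
```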
If the new model's p99 latency exceeds 500ms during the canary phase, the Rollout automatically aborts and reverts to the stable deployment.
Graceful Degradation Is Not Optional
The most important design decision in this service: if the LLM gateway fails, orders must still succeed. The normalization is a nice-to-have feature, not a hard requirement.
The LLM enhancement is wrapped in an if err == nil guard. The order creation path doesn't branch on ML availability.
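The pattern, sketched in Go. normalizeDescription is a hypothetical call into the LLM gateway, stubbed here to fail so the fallback path is visible; the real client and timeout will differ:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// normalizeDescription stands in for the LLM gateway call; the stub
// always fails so the degradation path below is exercised.
func normalizeDescription(ctx context.Context, raw string) (string, error) {
	return "", errors.New("llm gateway unavailable")
}

// displayDescription returns the normalized text when the LLM call
// succeeds and falls back to the raw input otherwise. The order path
// never branches on ML availability.
func displayDescription(raw string) string {
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	display := raw // default: the raw user input
	if normalized, err := normalizeDescription(ctx, raw); err == nil && normalized != "" {
		display = normalized
	}
	return display
}

func main() {
	fmt.Println(displayDescription("lgr pep pzza x2")) // falls back to the raw text
}
```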
In Part 11, I add the governance layer — drift detection, automated retraining triggers, model audit trails, and rollback strategies for when a promoted model turns out to perform worse in production than in the test set.