Part 10: LLMOps — Operating Large Language Models Reliably
Part of the SRE Playbook series
What You'll Learn: This article covers deploying a quantized LLM on CPU-based Kubernetes nodes using vLLM, writing a Go LLM Gateway that enforces rate limits and token budgets, defining SLIs specific to LLM workloads (time-to-first-token, tokens per second, token budget compliance), and using Argo Rollouts for A/B model switching without downtime. No GPU cluster required for the setup shown here.
LLMs as Production Services
The GoReliable platform processes orders. Each order has a free-text description field. I wanted to use a language model to generate a short, normalized description — turning "lgr pep pzza x2" into "2× Large Pepperoni Pizza" for display in receipts.
This is a modest use case, but it has real production requirements:
Latency: must complete within 500ms (it's in the checkout path)
Cost: can't make an external API call for every order
Reliability: if this fails, the order must still succeed (graceful degradation)
I host a quantized 7B model (Mistral 7B Q4_K_M) on a CPU node with enough RAM. It's not fast, but it's private, cheap, and predictable.
For LLMOps concepts, see the LLMOps track in the AI-Machine Learning section.
Why Not Just Use an API?
For a commercial product, externally hosted APIs (OpenAI, Anthropic, etc.) are reasonable. But for this platform:
I want to control latency SLAs without dependency on a third party's uptime
I want to test the LLMOps patterns end-to-end, including deployment and monitoring
The data (order items) is internal, and I prefer it stays internal
Running CPU-only means lower throughput — but with a p99 target of 500ms, a quantized 7B on a modern CPU handles the volume I need if I size it right.
Deploying vLLM on Kubernetes
vLLM is the serving engine. It handles batching, KV caching, and the OpenAI-compatible API layer.
The model weights are stored in a PersistentVolume (backed by EFS for the cluster). A one-time Job copies the quantized model from S3 to the PV at initial setup. After that, vLLM just mounts and reads.
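A minimal Deployment sketch for this setup. The image tag, PVC name, model path, and flags are illustrative assumptions, not the platform's actual manifests:

```yaml
# vLLM server mounting pre-staged weights from a PVC (names are illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels: {app: vllm-server}
  template:
    metadata:
      labels: {app: vllm-server}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=/models/mistral-7b-q4   # quantized weights copied from S3
            - --device=cpu
          ports:
            - containerPort: 8000             # OpenAI-compatible API
          volumeMounts:
            - name: model-weights
              mountPath: /models
              readOnly: true
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: llm-model-pvc          # EFS-backed PV, filled by the one-time Job
```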
The Go LLM Gateway
Direct calls to vLLM from application services would work, but they bypass the governance I want:
Rate limiting per service (the Order Service shouldn't crowd out other services)
Token budget enforcement (unbounded prompts make latency unpredictable)
Request logging for debugging (keeping prompt/response samples for a retention window)
Graceful degradation when the LLM is slow or unavailable
LLM-Specific SLIs
Standard HTTP SLIs don't capture everything important about LLM serving. I add three LLM-specific SLIs:
Time to First Token (TTFT)
For streaming responses, TTFT measures how long before the first token appears. Even though I use non-streaming completions here, I measure the total latency as a proxy:
LLM latency SLI (p99 < 500ms):
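A PromQL sketch of this SLI, assuming the gateway exports a latency histogram named llm_gateway_request_duration_seconds (the metric name is illustrative):

```promql
histogram_quantile(0.99,
  sum by (le) (rate(llm_gateway_request_duration_seconds_bucket[5m])))
```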
LLM fallback rate (< 1% of requests should fall back to raw input):
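A PromQL sketch, assuming the gateway increments counters named llm_gateway_fallback_total and llm_gateway_requests_total (illustrative names):

```promql
sum(rate(llm_gateway_fallback_total[5m]))
  / sum(rate(llm_gateway_requests_total[5m]))
```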
Token budget utilization (monitor for prompt inflation):
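A PromQL sketch for average prompt size per request, assuming a llm_gateway_prompt_tokens_total counter (illustrative name):

```promql
sum(rate(llm_gateway_prompt_tokens_total[5m]))
  / sum(rate(llm_gateway_requests_total[5m]))
```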
If average prompt tokens creep up, it usually means an upstream service is passing longer text than expected — worth an alert.
A/B Model Switching with Argo Rollouts
When I want to try a different model (say Phi-3 Mini vs Mistral 7B), I use an Argo Rollout for the LLM server deployment:
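An illustrative Rollout strategy: shift a small slice of traffic to the new model, gate on a p99 latency analysis, then continue. The names, Prometheus address, and metric are assumptions, not the platform's actual manifests:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels: {app: vllm-server}
  template:
    metadata:
      labels: {app: vllm-server}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # candidate model configured here
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: llm-p99-latency
        - setWeight: 50
        - pause: {duration: 5m}
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: llm-p99-latency
spec:
  metrics:
    - name: p99-latency
      interval: 1m
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            histogram_quantile(0.99, sum by (le)
              (rate(llm_gateway_request_duration_seconds_bucket[5m])))
      successCondition: result[0] < 0.5
```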
If the new model's p99 latency exceeds 500ms during the canary phase, the Rollout automatically aborts and reverts to the stable deployment.
Graceful Degradation Is Not Optional
The most important design decision in this service: if the LLM gateway fails, orders must still succeed. The normalization is a nice-to-have feature, not a hard requirement.
The LLM enhancement is wrapped in an if err == nil guard. The order creation path doesn't branch on ML availability.
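The pattern, sketched in Go. normalizeDescription is a hypothetical call into the LLM gateway, stubbed here to fail so the fallback path is visible; the real client and timeout will differ:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// normalizeDescription stands in for the LLM gateway call; the stub
// always fails so the degradation path below is exercised.
func normalizeDescription(ctx context.Context, raw string) (string, error) {
	return "", errors.New("llm gateway unavailable")
}

// displayDescription returns the normalized text when the LLM call
// succeeds and falls back to the raw input otherwise. The order path
// never branches on ML availability.
func displayDescription(raw string) string {
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	display := raw // default: the raw user input
	if normalized, err := normalizeDescription(ctx, raw); err == nil && normalized != "" {
		display = normalized
	}
	return display
}

func main() {
	fmt.Println(displayDescription("lgr pep pzza x2")) // falls back to the raw text
}
```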
In Part 11, I add the governance layer — drift detection, automated retraining triggers, model audit trails, and rollback strategies for when a promoted model turns out to perform worse in production than in the test set.