Part 11: ModelOps — Governance, Drift Detection, and Production Lifecycle

Part of the SRE Playbook series

What You'll Learn: This article covers the operational governance layer for production models — detecting when a model's input distribution drifts from training data, building an automated retraining trigger from drift signal through KubeFlow pipeline to MLFlow to ArgoCD, creating a Go model metadata service for audit trail queries, and implementing rollback strategies for model regressions. This is the layer that keeps models behaving in production over time.

The Problem with "Ship It and Forget It"

Six weeks after I deployed the recommendation model, its accuracy degraded. I didn't notice until a user reported that the recommendations made no sense. The MLFlow metrics showed a training accuracy of 0.84, but that was measured at training time, on 90-day-old data. Production data had since shifted: a seasonal sale introduced product categories that hadn't appeared in the training set.

This is data drift: the distribution of inputs the model sees in production diverges from what it was trained on. The model doesn't fail with an error; it silently returns lower-quality predictions.

ModelOps adds the feedback loop: observe production data, detect when it diverges from training data, trigger retraining when it does, and track the full lifecycle.

Drift Detection with Evidently AI

I use Evidently AI to compute statistical tests on production input data vs. the training reference dataset. It runs as a Kubernetes CronJob that produces a Prometheus metric.

# jobs/drift_detector/detector.py
import os

import pandas as pd
from evidently.metrics import (
    DatasetDriftMetric,
    DatasetMissingValuesMetric,
)
from evidently.report import Report
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY_URL = os.environ["PUSHGATEWAY_URL"]
MLFLOW_TRACKING_URI = os.environ["MLFLOW_TRACKING_URI"]
REFERENCE_DATASET_PATH = os.environ["REFERENCE_DATASET_PATH"]


def fetch_production_data(hours: int = 24) -> pd.DataFrame:
    """Fetch the last N hours of production inference inputs."""
    # In practice, I log inference inputs to a PostgreSQL table.
    # This is the same DB as the application, in a separate schema.
    import psycopg2

    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    df = pd.read_sql(
        """
        SELECT feature_vector, predicted_at
        FROM ml.recommendation_inference_log
        WHERE predicted_at >= NOW() - %(hours)s * INTERVAL '1 hour'
        """,
        conn,
        params={"hours": hours},
    )
    conn.close()
    # feature_vector is stored as JSONB; expand it into one column per
    # feature so the schema matches the reference dataset.
    return pd.json_normalize(df["feature_vector"])


def run_drift_detection():
    reference_df = pd.read_parquet(REFERENCE_DATASET_PATH)
    current_df = fetch_production_data(hours=24)

    if len(current_df) < 100:
        print("Not enough production samples for drift detection, skipping")
        return

    report = Report(metrics=[
        DatasetDriftMetric(),
        DatasetMissingValuesMetric(),
    ])
    report.run(reference_data=reference_df, current_data=current_df)

    result = report.as_dict()
    drift_detected = result["metrics"][0]["result"]["dataset_drift"]
    drift_share = result["metrics"][0]["result"]["share_of_drifted_columns"]

    # Push metrics to Prometheus via Pushgateway
    registry = CollectorRegistry()
    drift_gauge = Gauge(
        "goreliable_model_drift_detected",
        "1 if dataset drift detected, 0 otherwise",
        ["model"],
        registry=registry,
    )
    drift_share_gauge = Gauge(
        "goreliable_model_drift_share",
        "Share of features that have drifted",
        ["model"],
        registry=registry,
    )

    drift_gauge.labels(model="recommendation").set(1 if drift_detected else 0)
    drift_share_gauge.labels(model="recommendation").set(drift_share)

    push_to_gateway(PUSHGATEWAY_URL, job="drift-detector", registry=registry)

    print(f"Drift detected: {drift_detected}, share: {drift_share:.2%}")
    return drift_detected


if __name__ == "__main__":
    run_drift_detection()
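
The detector runs as a Kubernetes CronJob, as noted above. A minimal manifest sketch — the image name, schedule, namespace, and secret names here are illustrative, not the actual deployment:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: drift-detector
  namespace: go-reliable-production
spec:
  schedule: "0 6 * * *"      # run the comparison once a day
  concurrencyPolicy: Forbid  # never overlap two detection runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: detector
              image: registry.example.com/drift-detector:latest
              env:
                - name: PUSHGATEWAY_URL
                  value: http://prometheus-pushgateway.monitoring:9091
                - name: MLFLOW_TRACKING_URI
                  value: http://mlflow.go-reliable-production:5000
                - name: REFERENCE_DATASET_PATH
                  value: /data/reference/recommendation.parquet
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: drift-detector-db
                      key: url
```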

Drift Alert to Retraining Trigger

When drift is detected, I want to automatically kick off a retraining pipeline. The alert fires in Alertmanager:
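
A sketch of what that alert rule can look like, keyed on the `goreliable_model_drift_detected` metric pushed by the CronJob (severity and label values are illustrative):

```yaml
groups:
  - name: model-drift
    rules:
      - alert: ModelDriftDetected
        expr: goreliable_model_drift_detected{model="recommendation"} == 1
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: "Dataset drift detected for model {{ $labels.model }}"
          description: "Production input distribution has diverged from the training reference; consider retraining."
```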

Alertmanager routes alerts labeled team: ml-platform to a dedicated webhook receiver, which triggers the KubeFlow retraining pipeline:
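
A minimal sketch of that routing — the controller URL and receiver names are placeholders for whatever service fronts the pipeline trigger:

```yaml
route:
  receiver: default
  routes:
    - matchers:
        - team = "ml-platform"
      receiver: ml-retrain-webhook

receivers:
  - name: default
    # ... existing receivers ...
  - name: ml-retrain-webhook
    webhook_configs:
      - url: http://ml-controller.go-reliable-production/api/v1/alerts
        send_resolved: false
```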

The full automated loop: drift detection CronJob → Prometheus alert → Alertmanager webhook → ML controller → KubeFlow pipeline run → MLFlow model registration → evaluation → Staging promotion → manual PR review → ArgoCD deploy.

The Go Model Metadata Service

Application services need to query which model version is active and its provenance. Rather than each service calling MLFlow directly, I expose a simple Go service that wraps the queries and adds caching.

This service exposes an internal HTTP API at http://model-metadata.go-reliable-production/api/v1/models/{name}. Other services can call it to include model provenance in their logs and traces without each service needing MLFlow client code.

Model Rollback

When a promoted model performs worse in production than expected, I need to roll back. Because deployment is GitOps-managed (the model URI is in a values file in the GitOps repo), rolling back a model is exactly like reverting any other deployment change: git revert the promotion PR and merge.

I also add a recording rule for a production accuracy proxy (not the training accuracy, but a business metric that correlates with it):
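
A sketch of such a rule, assuming hypothetical `recommendation_clicks_total` and `recommendation_impressions_total` counters emitted by the application:

```yaml
groups:
  - name: model-quality
    rules:
      - record: job:recommendation_ctr:ratio_rate1h
        expr: |
          sum(rate(recommendation_clicks_total[1h]))
            /
          sum(rate(recommendation_impressions_total[1h]))
```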

A sudden drop in click-through rate (users clicking recommendations and following through to actual orders) is a business signal that model quality has degraded, separate from the latency and error rate SLIs.

Governance Checklist

Before a model version is promoted to Production, the pull request template requires:
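
An illustrative sketch of what such a template commonly requires, based on the lifecycle described in this article (not the actual checklist):

```markdown
## Model Promotion Checklist
- [ ] MLFlow run URL for the candidate version linked
- [ ] Candidate evaluated against the current Production version on the same holdout set
- [ ] Latest drift report reviewed; reference dataset updated if retrained
- [ ] Rollback plan confirmed (promotion is a single revertible commit in the GitOps repo)
```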

In Part 12, I bring everything together — showing how the complete GitOps repository structure looks when managing microservices, ML infrastructure, LLM serving, and governance jobs all through ArgoCD, and introducing ApplicationSets at scale with multi-cluster patterns.
