Part 11: ModelOps — Governance, Drift Detection, and Production Lifecycle

Part of the SRE Playbook series

What You'll Learn: This article covers the operational governance layer for production models — detecting when a model's input distribution drifts from training data, building an automated retraining trigger from drift signal through KubeFlow pipeline to MLFlow to ArgoCD, creating a Go model metadata service for audit trail queries, and implementing rollback strategies for model regressions. This is the layer that keeps models behaving in production over time.

The Problem with "Ship It and Forget It"

Six weeks after I deployed the recommendation model, its accuracy degraded. I didn't notice until a user reported that the recommendations made no sense. The MLFlow metrics showed a training accuracy of 0.84, but that was measured at training time, on 90-day-old data. Production data had since shifted: a seasonal sale introduced product categories that hadn't appeared in the training set.

This is data drift: the distribution of inputs the model sees in production diverges from what it was trained on. The model doesn't fail with an error; it silently returns lower-quality predictions.

ModelOps adds the feedback loop: observe production data, detect when it diverges from training data, trigger retraining when it does, and track the full lifecycle.

Drift Detection with Evidently AI

I use Evidently AI to compute statistical tests on production input data vs. the training reference dataset. It runs as a Kubernetes CronJob that produces a Prometheus metric.

# jobs/drift_detector/detector.py
import os

import pandas as pd
from evidently.metrics import (
    DatasetDriftMetric,
    DatasetMissingValuesMetric,
)
from evidently.report import Report
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY_URL = os.environ["PUSHGATEWAY_URL"]
MLFLOW_TRACKING_URI = os.environ["MLFLOW_TRACKING_URI"]
REFERENCE_DATASET_PATH = os.environ["REFERENCE_DATASET_PATH"]


def fetch_production_data(hours: int = 24) -> pd.DataFrame:
    """Fetch the last N hours of production inference inputs."""
    # In practice, I log inference inputs to a PostgreSQL table.
    # This is the same DB as the application, in a separate schema.
    import psycopg2

    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    df = pd.read_sql(
        """
        SELECT feature_vector, predicted_at
        FROM ml.recommendation_inference_log
        WHERE predicted_at >= NOW() - %(hours)s * INTERVAL '1 hour'
        """,
        conn,
        params={"hours": hours},
    )
    conn.close()
    # feature_vector is stored as JSONB; expand it into one column per
    # feature so the schema matches the reference dataset.
    return pd.json_normalize(df["feature_vector"])


def run_drift_detection():
    reference_df = pd.read_parquet(REFERENCE_DATASET_PATH)
    current_df = fetch_production_data(hours=24)

    if len(current_df) < 100:
        print("Not enough production samples for drift detection, skipping")
        return

    report = Report(metrics=[
        DatasetDriftMetric(),
        DatasetMissingValuesMetric(),
    ])
    report.run(reference_data=reference_df, current_data=current_df)

    result = report.as_dict()
    drift_detected = result["metrics"][0]["result"]["dataset_drift"]
    drift_share = result["metrics"][0]["result"]["share_of_drifted_columns"]

    # Push metrics to Prometheus via Pushgateway
    registry = CollectorRegistry()
    drift_gauge = Gauge(
        "goreliable_model_drift_detected",
        "1 if dataset drift detected, 0 otherwise",
        ["model"],
        registry=registry,
    )
    drift_share_gauge = Gauge(
        "goreliable_model_drift_share",
        "Share of features that have drifted",
        ["model"],
        registry=registry,
    )

    drift_gauge.labels(model="recommendation").set(1 if drift_detected else 0)
    drift_share_gauge.labels(model="recommendation").set(drift_share)

    push_to_gateway(PUSHGATEWAY_URL, job="drift-detector", registry=registry)

    print(f"Drift detected: {drift_detected}, share: {drift_share:.2%}")
    return drift_detected


if __name__ == "__main__":
    run_drift_detection()
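
The detector runs as a Kubernetes CronJob, as noted above. A minimal manifest sketch — the image name, schedule, namespace, and secret names here are illustrative, not the actual deployment:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: drift-detector
  namespace: go-reliable-production
spec:
  schedule: "0 6 * * *"      # run the comparison once a day
  concurrencyPolicy: Forbid  # never overlap two detection runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: detector
              image: registry.example.com/drift-detector:latest
              env:
                - name: PUSHGATEWAY_URL
                  value: http://prometheus-pushgateway.monitoring:9091
                - name: MLFLOW_TRACKING_URI
                  value: http://mlflow.go-reliable-production:5000
                - name: REFERENCE_DATASET_PATH
                  value: /data/reference/recommendation.parquet
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: drift-detector-db
                      key: url
```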

Drift Alert to Retraining Trigger

When drift is detected, I want to automatically kick off a retraining pipeline. The alert fires in Alertmanager:
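
A sketch of what that alert rule can look like, keyed on the `goreliable_model_drift_detected` metric pushed by the CronJob (severity and label values are illustrative):

```yaml
groups:
  - name: model-drift
    rules:
      - alert: ModelDriftDetected
        expr: goreliable_model_drift_detected{model="recommendation"} == 1
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: "Dataset drift detected for model {{ $labels.model }}"
          description: "Production input distribution has diverged from the training reference; consider retraining."
```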

Alertmanager routes alerts labeled team: ml-platform to a dedicated webhook receiver, which triggers the KubeFlow retraining pipeline:
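
A minimal sketch of that routing — the controller URL and receiver names are placeholders for whatever service fronts the pipeline trigger:

```yaml
route:
  receiver: default
  routes:
    - matchers:
        - team = "ml-platform"
      receiver: ml-retrain-webhook

receivers:
  - name: default
    # ... existing receivers ...
  - name: ml-retrain-webhook
    webhook_configs:
      - url: http://ml-controller.go-reliable-production/api/v1/alerts
        send_resolved: false
```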

The full automated loop: drift detection CronJob → Prometheus alert → Alertmanager webhook → ML controller → KubeFlow pipeline run → MLFlow model registration → evaluation → Staging promotion → manual PR review → ArgoCD deploy.

The Go Model Metadata Service

Application services need to query which model version is active and its provenance. Rather than each service calling MLFlow directly, I expose a simple Go service that wraps the queries and adds caching.

This service exposes an internal HTTP API at http://model-metadata.go-reliable-production/api/v1/models/{name}. Other services can call it to include model provenance in their logs and traces without each service needing MLFlow client code.

Model Rollback

When a promoted model performs worse in production than expected, I need to roll back. Because deployment is GitOps-managed (the model URI is in a values file in the GitOps repo), rolling back a model is exactly like reverting any other deployment change: git revert the promotion PR and merge.

I also add a recording rule for a production accuracy proxy (not the training accuracy, but a business metric that correlates with it):
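
A sketch of such a rule, assuming hypothetical `recommendation_clicks_total` and `recommendation_impressions_total` counters emitted by the application:

```yaml
groups:
  - name: model-quality
    rules:
      - record: job:recommendation_ctr:ratio_rate1h
        expr: |
          sum(rate(recommendation_clicks_total[1h]))
            /
          sum(rate(recommendation_impressions_total[1h]))
```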

A sudden drop in click-through rate (users clicking recommendations and following through to actual orders) is a business signal that model quality has degraded, separate from the latency and error rate SLIs.

Governance Checklist

Before a model version is promoted to Production, the pull request template requires:
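
An illustrative sketch of what such a template commonly requires, based on the lifecycle described in this article (not the actual checklist):

```markdown
## Model Promotion Checklist
- [ ] MLFlow run URL for the candidate version linked
- [ ] Candidate evaluated against the current Production version on the same holdout set
- [ ] Latest drift report reviewed; reference dataset updated if retrained
- [ ] Rollback plan confirmed (promotion is a single revertible commit in the GitOps repo)
```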

In Part 12, I bring everything together — showing how the complete GitOps repository structure looks when managing microservices, ML infrastructure, LLM serving, and governance jobs all through ArgoCD, and introducing ApplicationSets at scale with multi-cluster patterns.
