Part 9: MLFlow — Experiment Tracking and Model Registry

Part of the SRE Playbook series

What You'll Learn: This article covers deploying MLFlow on Kubernetes with PostgreSQL for metadata and MinIO for artifact storage, integrating it with KubeFlow Pipelines and KServe, using the Model Registry to control production promotions, and building a Go service that queries the MLFlow API to know which model version is currently active. By the end, you'll have a full audit trail from training run to production prediction.

MLFlow's Role in the Stack

After building the KubeFlow training pipelines in Part 8, I realized I had a problem: the KubeFlow UI showed me pipeline runs, but I couldn't easily answer questions like:

  • What was the accuracy of the model currently running in production?

  • How does today's retrained model compare to last week's?

  • Who approved model version 3 for production?

MLFlow fills that gap. It stores experiment runs, metrics, parameters, and model artifacts in one place. The Model Registry layer adds a lifecycle: models transition from None to Staging to Production.

For MLFlow fundamentals, see the MLOps 101 series.

Deploying MLFlow on Kubernetes

MLFlow is a stateful service. I give it a PostgreSQL database (separate from the application DB) for metadata and MinIO for model artifacts.

# infrastructure/mlflow/values.yaml
mlflow:
  image:
    repository: ghcr.io/mlflow/mlflow
    tag: "2.11.0"

  backendStore:
    postgres:
      enabled: true
      host: mlflow-postgres.mlflow.svc.cluster.local
      port: 5432
      database: mlflow
      user: mlflow
      # Password injected from External Secrets Operator
      password:
        secretRef:
          name: mlflow-postgres-credentials
          key: password

  artifactRoot:
    s3:
      enabled: true
      bucket: go-reliable-mlflow-artifacts
      # Using IAM role for service account (IRSA) — no long-lived credentials
      awsServiceAccount:
        create: true
        annotations:
          eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/mlflow-artifact-store"

  service:
    type: ClusterIP
    port: 5000

  ingress:
    enabled: true
    host: mlflow.internal.go-reliable.dev
    annotations:
      nginx.ingress.kubernetes.io/auth-type: basic
      nginx.ingress.kubernetes.io/auth-secret: mlflow-basic-auth

The Helm chart is managed as an ArgoCD Application alongside the other infrastructure.
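A minimal sketch of that Application manifest (the repo URL, project name, and paths are illustrative, not the real ones from this series):

```yaml
# infrastructure/argocd/apps/mlflow.yaml -- sketch; repo URL and paths are illustrative
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: mlflow
  namespace: argocd
spec:
  project: infrastructure
  source:
    repoURL: https://github.com/example/go-reliable-gitops  # illustrative
    targetRevision: main
    path: infrastructure/mlflow
    helm:
      valueFiles:
        - values.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: mlflow
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

With `prune` and `selfHeal` on, ArgoCD keeps the deployed chart in lockstep with the repo, the same way the application services are managed.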

Integrating MLFlow with Training Pipelines

Each KubeFlow pipeline step that trains a model logs to MLFlow. I showed the pipeline wiring in Part 8; the logging itself records parameters, metrics, tags, and the model artifact in a single run.

After each successful training run, the model is registered in the Model Registry as a new version with None stage (unreviewed).

Model Registry Lifecycle

The Model Registry enforces a promotion workflow. A model goes through stages:

| Stage | Meaning | Who Sets It |
| --- | --- | --- |
| None | Just registered, not evaluated | Pipeline (automatic) |
| Staging | Passed automated evaluation, ready for review | Evaluation script (automatic) |
| Production | Approved for serving | Human (via PR or MLFlow API) |
| Archived | Replaced or retired | Human |

Automated Evaluation to Staging

After training completes, an evaluation step checks whether the new model meets minimum quality thresholds and, if it does, transitions the version to Staging.
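A sketch of that gate (the threshold values and model name are illustrative; the MLFlow import is deferred so the pure threshold check has no dependency on a tracking server):

```python
# evaluate_step.py -- sketch of the automated evaluation gate.
# Threshold values and model name are illustrative.

THRESHOLDS = {"accuracy": 0.92, "f1": 0.90}  # illustrative minimums


def meets_thresholds(metrics: dict, thresholds: dict = THRESHOLDS) -> bool:
    """True only if every threshold metric is present and at or above its floor."""
    return all(
        metrics.get(name, 0.0) >= floor for name, floor in thresholds.items()
    )


def promote_to_staging(model_name: str, version: str, metrics: dict) -> bool:
    """Transition a registered model version from None to Staging if it passes."""
    if not meets_thresholds(metrics):
        return False

    # Imported here so the threshold check above stays dependency-free.
    from mlflow.tracking import MlflowClient

    client = MlflowClient(
        tracking_uri="http://mlflow.mlflow.svc.cluster.local:5000"
    )
    client.transition_model_version_stage(
        name=model_name, version=version, stage="Staging"
    )
    return True
```

A missing metric counts as a failure rather than a pass; a model that didn't report `f1` at all should never slip into Staging by default.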

Production Promotion

Promoting from Staging to Production requires a human approval step, which I implement as a pull request in the GitOps repo.

The training pipeline creates a PR that updates the KServe InferenceService to point to the new model version. A reviewer checks the MLFlow comparison (I link to the MLFlow run in the PR description), then merges.
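The PR diff is a small change to the InferenceService manifest, something like this sketch (the namespace, annotation, run ID, and storage path are illustrative):

```yaml
# serving/order-classifier/inferenceservice.yaml -- sketch; names and paths are illustrative
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: order-classifier
  namespace: serving
  annotations:
    # Links the serving config back to the MLFlow run for the audit trail
    mlflow.run_id: "abc123"  # illustrative
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # The PR changes this URI to point at the newly promoted model version
      storageUri: s3://go-reliable-mlflow-artifacts/1/abc123/artifacts/model
```

Because the run ID travels with the manifest, the reviewer can jump straight from the diff to the MLFlow comparison before merging.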

When the PR merges, ArgoCD picks up the change and updates the KServe InferenceService with the new model.

The Model Config Service in Go

The ML Inference Gateway needs to know which model version is active and what metadata it carries. Rather than hardcoding the model version in the Go service's config, the gateway queries the MLFlow API at startup and caches the result.
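A sketch of that startup lookup against MLFlow's REST API (the base URL and model name are illustrative, and error handling is trimmed to the essentials):

```go
// modelconfig.go -- sketch: the gateway asks MLFlow which model version is in
// Production at startup and caches the answer. URL and model name are illustrative.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// ModelVersion holds the fields we need from MLFlow's
// registered-models/get-latest-versions response.
type ModelVersion struct {
	Name         string `json:"name"`
	Version      string `json:"version"`
	CurrentStage string `json:"current_stage"`
	RunID        string `json:"run_id"`
}

type latestVersionsResponse struct {
	ModelVersions []ModelVersion `json:"model_versions"`
}

// parseLatestVersions decodes a response body and returns the first version.
func parseLatestVersions(body []byte) (*ModelVersion, error) {
	var resp latestVersionsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if len(resp.ModelVersions) == 0 {
		return nil, fmt.Errorf("no model versions in requested stage")
	}
	return &resp.ModelVersions[0], nil
}

// FetchProductionVersion queries MLFlow for the latest Production version of a
// model. The caller caches the result for the lifetime of the process.
func FetchProductionVersion(baseURL, model string) (*ModelVersion, error) {
	payload, err := json.Marshal(map[string]any{
		"name":   model,
		"stages": []string{"Production"},
	})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(
		baseURL+"/api/2.0/mlflow/registered-models/get-latest-versions",
		"application/json", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseLatestVersions(body)
}
```

The parse step is deliberately separated from the HTTP call so it can be unit-tested with a canned payload, without a live tracking server.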

This means the Go service always knows which model version it's using for predictions — this version string is attached to every trace span and every prediction metric label.

What the Audit Trail Looks Like

With MLFlow + GitOps together, I can answer every question about a production model:

  1. What model is serving? Check the KServe InferenceService in the GitOps repo.

  2. What accuracy does it have? Follow the run_id from the PR to the MLFlow run.

  3. When was it trained? MLFlow start_time on the run.

  4. What data was used? MLFlow tags: data.lookback_days, training.source.

  5. Who approved it? The PR merge in the GitOps repo shows the reviewer.

  6. How does it compare to the last model? MLFlow experiment comparison view.

This is the same infrastructure-as-code audit trail I have for application changes — just applied to models.

In Part 10, I extend the platform to host a large language model for order description embedding, introducing GPU scheduling, vLLM deployment, and new SLIs specific to LLM workloads.
