Part 9: MLFlow — Experiment Tracking and Model Registry

Part of the SRE Playbook series

What You'll Learn: This article covers deploying MLFlow on Kubernetes with PostgreSQL for metadata and MinIO for artifact storage, integrating it with KubeFlow Pipelines and KServe, using the Model Registry to control production promotions, and building a Go service that queries the MLFlow API to know which model version is currently active. By the end, you'll have a full audit trail from training run to production prediction.

MLFlow's Role in the Stack

After building the KubeFlow training pipelines in Part 8, I realized I had a problem: the KubeFlow UI showed me pipeline runs, but I couldn't easily answer questions like:

  • What was the accuracy of the model currently running in production?

  • How does today's retrained model compare to last week's?

  • Who approved model version 3 for production?

MLFlow fills that gap. It stores experiment runs, metrics, parameters, and model artifacts in one place. The Model Registry layer adds a lifecycle: models transition from None to Staging to Production.

For MLFlow fundamentals, see the MLOps 101 series.

Deploying MLFlow on Kubernetes

MLFlow is a stateful service. I give it a PostgreSQL database (separate from the application DB) for metadata and MinIO for model artifacts.

# infrastructure/mlflow/values.yaml
mlflow:
  image:
    repository: ghcr.io/mlflow/mlflow
    tag: "2.11.0"

  backendStore:
    postgres:
      enabled: true
      host: mlflow-postgres.mlflow.svc.cluster.local
      port: 5432
      database: mlflow
      user: mlflow
      # Password injected from External Secrets Operator
      password:
        secretRef:
          name: mlflow-postgres-credentials
          key: password

  artifactRoot:
    s3:
      enabled: true
      bucket: go-reliable-mlflow-artifacts
      # Using IAM role for service account (IRSA) — no long-lived credentials
      awsServiceAccount:
        create: true
        annotations:
          eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/mlflow-artifact-store"

  service:
    type: ClusterIP
    port: 5000

  ingress:
    enabled: true
    host: mlflow.internal.go-reliable.dev
    annotations:
      nginx.ingress.kubernetes.io/auth-type: basic
      nginx.ingress.kubernetes.io/auth-secret: mlflow-basic-auth

The Helm chart is managed as an ArgoCD Application alongside the other infrastructure.
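A minimal sketch of that Application manifest (the repo URL, project name, and paths are illustrative, not the real ones from this series):

```yaml
# infrastructure/argocd/apps/mlflow.yaml -- sketch; repo URL and paths are illustrative
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: mlflow
  namespace: argocd
spec:
  project: infrastructure
  source:
    repoURL: https://github.com/example/go-reliable-gitops  # illustrative
    targetRevision: main
    path: infrastructure/mlflow
    helm:
      valueFiles:
        - values.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: mlflow
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

With `prune` and `selfHeal` on, ArgoCD keeps the deployed chart in lockstep with the repo, the same way the application services are managed.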

Integrating MLFlow with Training Pipelines

Each KubeFlow pipeline step that trains a model logs to MLFlow. I showed the pipeline wiring in Part 8; the logging itself records parameters, metrics, tags, and the model artifact in a single run.

After each successful training run, the model is registered in the Model Registry as a new version with None stage (unreviewed).

Model Registry Lifecycle

The Model Registry enforces a promotion workflow. A model goes through stages:

| Stage | Meaning | Who Sets It |
| --- | --- | --- |
| None | Just registered, not evaluated | Pipeline (automatic) |
| Staging | Passed automated evaluation, ready for review | Evaluation script (automatic) |
| Production | Approved for serving | Human (via PR or MLFlow API) |
| Archived | Replaced or retired | Human |

Automated Evaluation to Staging

After training completes, an evaluation step checks whether the new model meets minimum quality thresholds and, if it does, transitions the version to Staging.
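A sketch of that gate (the threshold values and model name are illustrative; the MLFlow import is deferred so the pure threshold check has no dependency on a tracking server):

```python
# evaluate_step.py -- sketch of the automated evaluation gate.
# Threshold values and model name are illustrative.

THRESHOLDS = {"accuracy": 0.92, "f1": 0.90}  # illustrative minimums


def meets_thresholds(metrics: dict, thresholds: dict = THRESHOLDS) -> bool:
    """True only if every threshold metric is present and at or above its floor."""
    return all(
        metrics.get(name, 0.0) >= floor for name, floor in thresholds.items()
    )


def promote_to_staging(model_name: str, version: str, metrics: dict) -> bool:
    """Transition a registered model version from None to Staging if it passes."""
    if not meets_thresholds(metrics):
        return False

    # Imported here so the threshold check above stays dependency-free.
    from mlflow.tracking import MlflowClient

    client = MlflowClient(
        tracking_uri="http://mlflow.mlflow.svc.cluster.local:5000"
    )
    client.transition_model_version_stage(
        name=model_name, version=version, stage="Staging"
    )
    return True
```

A missing metric counts as a failure rather than a pass; a model that didn't report `f1` at all should never slip into Staging by default.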

Production Promotion

Promoting from Staging to Production requires a human approval step, which I implement as a pull request in the GitOps repo.

The training pipeline creates a PR that updates the KServe InferenceService to point to the new model version. A reviewer checks the MLFlow comparison (I link to the MLFlow run in the PR description), then merges.
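The PR diff is a small change to the InferenceService manifest, something like this sketch (the namespace, annotation, run ID, and storage path are illustrative):

```yaml
# serving/order-classifier/inferenceservice.yaml -- sketch; names and paths are illustrative
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: order-classifier
  namespace: serving
  annotations:
    # Links the serving config back to the MLFlow run for the audit trail
    mlflow.run_id: "abc123"  # illustrative
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # The PR changes this URI to point at the newly promoted model version
      storageUri: s3://go-reliable-mlflow-artifacts/1/abc123/artifacts/model
```

Because the run ID travels with the manifest, the reviewer can jump straight from the diff to the MLFlow comparison before merging.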

When the PR merges, ArgoCD picks up the change and updates the KServe InferenceService with the new model.

The Model Config Service in Go

The ML Inference Gateway needs to know which model version is active and what metadata it carries. Rather than hardcoding the model version in the Go service's config, the gateway queries the MLFlow API at startup and caches the result.
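A sketch of that startup lookup against MLFlow's REST API (the base URL and model name are illustrative, and error handling is trimmed to the essentials):

```go
// modelconfig.go -- sketch: the gateway asks MLFlow which model version is in
// Production at startup and caches the answer. URL and model name are illustrative.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// ModelVersion holds the fields we need from MLFlow's
// registered-models/get-latest-versions response.
type ModelVersion struct {
	Name         string `json:"name"`
	Version      string `json:"version"`
	CurrentStage string `json:"current_stage"`
	RunID        string `json:"run_id"`
}

type latestVersionsResponse struct {
	ModelVersions []ModelVersion `json:"model_versions"`
}

// parseLatestVersions decodes a response body and returns the first version.
func parseLatestVersions(body []byte) (*ModelVersion, error) {
	var resp latestVersionsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if len(resp.ModelVersions) == 0 {
		return nil, fmt.Errorf("no model versions in requested stage")
	}
	return &resp.ModelVersions[0], nil
}

// FetchProductionVersion queries MLFlow for the latest Production version of a
// model. The caller caches the result for the lifetime of the process.
func FetchProductionVersion(baseURL, model string) (*ModelVersion, error) {
	payload, err := json.Marshal(map[string]any{
		"name":   model,
		"stages": []string{"Production"},
	})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(
		baseURL+"/api/2.0/mlflow/registered-models/get-latest-versions",
		"application/json", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseLatestVersions(body)
}
```

The parse step is deliberately separated from the HTTP call so it can be unit-tested with a canned payload, without a live tracking server.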

This means the Go service always knows which model version it's using for predictions — this version string is attached to every trace span and every prediction metric label.

What the Audit Trail Looks Like

With MLFlow + GitOps together, I can answer every question about a production model:

  1. What model is serving? Check the KServe InferenceService in the GitOps repo.

  2. What accuracy does it have? Follow the run_id from the PR to the MLFlow run.

  3. When was it trained? MLFlow start_time on the run.

  4. What data was used? MLFlow tags: data.lookback_days, training.source.

  5. Who approved it? The PR merge in the GitOps repo shows the reviewer.

  6. How does it compare to the last model? MLFlow experiment comparison view.

This is the same infrastructure-as-code audit trail I have for application changes — just applied to models.

In Part 10, I extend the platform to host a large language model for order description embedding, introducing GPU scheduling, vLLM deployment, and new SLIs specific to LLM workloads.
