Part 8: MLOps with KubeFlow — Training Pipelines on Kubernetes

Part of the SRE Playbook series

What You'll Learn: This article covers how I added ML operations to the GoReliable platform — deploying KubeFlow on the same Kubernetes cluster, building a training pipeline for a recommendation model, using Katib for hyperparameter tuning, deploying the trained model with KServe, and building a Go prediction gateway that routes inference requests. I also show how I apply SRE principles to ML workloads — they need SLIs too.

Why MLOps Belongs in an SRE Playbook

When I first added a recommendation model to the GoReliable platform, I treated it differently from the other services. I'd SSH into a VM, train a model manually, copy the artifact somewhere, and update a config file with a new model path. That worked for the first iteration.

Then I needed to retrain. And retrain again when my training data grew. And I wanted to try different hyperparameters. Within two weeks, I had no idea which model version was running in production, where the training code was, or how to reproduce the training run that produced it.

MLOps is SRE applied to model pipelines. The same principles apply: reproducibility, observability, automated delivery, and reliability. KubeFlow provides the platform; the same GitOps workflow from Part 3 manages it.

For MLOps fundamentals, see the MLOps 101 series. This article focuses on the platform integration.

Deploying KubeFlow via ArgoCD

I deploy KubeFlow into a dedicated kubeflow namespace on the same cluster. Since KubeFlow has many components, I use ArgoCD's sync waves to order the deployment.

# argocd/appsets/kubeflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: kubeflow
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - name: kubeflow-cert-manager
            wave: "1"
          - name: kubeflow-istio
            wave: "2"
          - name: kubeflow-dex
            wave: "3"
          - name: kubeflow-pipelines
            wave: "4"
          - name: kubeflow-katib
            wave: "5"
          - name: kserve
            wave: "6"
  template:
    metadata:
      name: "kubeflow-{{name}}"
      annotations:
        argocd.argoproj.io/sync-wave: "{{wave}}"
    spec:
      project: go-reliable
      source:
        repoURL: https://github.com/htunn/go-reliable-gitops.git
        targetRevision: main
        path: infrastructure/kubeflow/{{name}}
      destination:
        server: https://kubernetes.default.svc
        namespace: kubeflow
      syncPolicy:
        automated:
          prune: false           # Don't auto-prune KubeFlow components
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
          - ServerSideApply=true  # KubeFlow CRDs require server-side apply

KubeFlow uses Istio for its internal service mesh. I configure Istio not to intercept traffic in my application namespaces — I don't want KubeFlow's Istio installation interfering with the go-reliable-production namespace.
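One way to keep the mesh out of an application namespace is to label it so Istio's sidecar injector skips it. A sketch (the namespace name matches the one used elsewhere in this series):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: go-reliable-production
  labels:
    istio-injection: disabled   # Istio's sidecar injector skips this namespace
```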

The Recommendation Model

The model I trained is a simple order-based recommendation: given a user's order history, recommend what they're likely to order next. It's not a state-of-the-art deep learning model — it's a gradient boosting classifier trained on purchase sequences.

What matters is that it runs reliably in production and has the same SRE treatment as any other service.

Training Pipeline

I define the training pipeline using the KubeFlow Pipelines Python SDK v2:

Submitting the Pipeline

I trigger this pipeline via a scheduled Kubernetes CronJob (weekly retraining) and also manually when I want to experiment with hyperparameters.
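A sketch of what that CronJob might look like, assuming a small image that bundles the compiled pipeline spec and the kfp client (the image name and schedule are illustrative; `ml-pipeline.kubeflow:8888` is the standard in-cluster KFP API endpoint):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: recommendation-retrain
  namespace: kubeflow
spec:
  schedule: "0 3 * * 0"            # weekly, Sunday 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: submit
              image: ghcr.io/example/pipeline-submitter:latest  # illustrative
              command: ["python", "-c"]
              args:
                - |
                  import kfp
                  client = kfp.Client(host="http://ml-pipeline.kubeflow:8888")
                  client.create_run_from_pipeline_package(
                      "recommendation_pipeline.yaml",
                      arguments={},
                      run_name="weekly-retrain")
```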

Hyperparameter Tuning with Katib

Instead of manually trying hyperparameter combinations, I use Katib to run a structured search:
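A trimmed version of such a Katib Experiment, assuming the training job reports an `accuracy` metric (the image, parameter ranges, and job template details are illustrative):

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: recommendation-tuning
  namespace: kubeflow
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: bayesianoptimization
  maxTrialCount: 20        # the 20-trial search described below
  parallelTrialCount: 3
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.3"
    - name: n_estimators
      parameterType: int
      feasibleSpace:
        min: "100"
        max: "500"
  trialTemplate:
    primaryContainerName: training
    trialParameters:
      - name: learningRate
        reference: learning_rate
      - name: nEstimators
        reference: n_estimators
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: training
                image: ghcr.io/example/recommendation-train:latest  # illustrative
                args:
                  - "--learning-rate=${trialParameters.learningRate}"
                  - "--n-estimators=${trialParameters.nEstimators}"
```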

The Bayesian optimization algorithm uses results from previous trials to make informed choices about which hyperparameters to try next. My 20-trial search found parameters that improved accuracy from 0.79 to 0.84 — meaningfully better than my initial manual guess.

Model Serving with KServe

After training, I deploy the model as a KServe InferenceService. KServe handles the serving infrastructure — I define what model to serve and it handles scaling, load balancing, and the prediction API.
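A minimal InferenceService for this model might look like the following sketch. The storageUri is an illustrative placeholder; the sklearn model format and v2 protocol match the gradient boosting classifier and the `/v2/models/.../infer` endpoint described here:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: recommendation-model
  namespace: kubeflow
spec:
  predictor:
    minReplicas: 1
    scaleMetric: concurrency   # scale on in-flight requests
    scaleTarget: 10
    model:
      modelFormat:
        name: sklearn
      protocolVersion: v2      # serves /v2/models/recommendation-model/infer
      storageUri: s3://example-bucket/models/recommendation/latest  # illustrative
```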

KServe automatically:

  • Downloads the model from S3 on startup

  • Exposes a prediction REST API at /v2/models/recommendation-model/infer

  • Scales based on concurrency

  • Logs prediction requests to the observability stack

The Go ML Inference Gateway

Models don't know about authentication, routing, or the GoReliable API contract. The ML Inference Gateway is the Go service that bridges them.

SLIs for Model Serving

The ML inference endpoint gets the same SRE treatment as the other services. I define two additional SLIs:

Model serving availability:
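A sketch of how this SLI might be expressed in PromQL, assuming the gateway exposes a standard request counter (the metric name is illustrative):

```promql
sum(rate(inference_requests_total{status!~"5.."}[5m]))
/
sum(rate(inference_requests_total[5m]))
```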

Inference latency (p99 < 200ms):
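A sketch of the corresponding PromQL, assuming a standard latency histogram on the gateway (the metric name is illustrative; 0.2 seconds is the 200ms target):

```promql
histogram_quantile(
  0.99,
  sum(rate(inference_request_duration_seconds_bucket[5m])) by (le)
) < 0.2
```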

The 200ms latency threshold for inference is tighter than the 300ms I use for the order API. Recommendations are called in the user-facing path and perceptibly slow the page if they take too long.

In Part 9, I set up MLFlow as the experiment tracking and model registry layer — the system of record for what model is trained, what parameters produced which accuracy, and which version is approved for production.
