Kubernetes for MLOps

Why Kubernetes for Machine Learning?

When I first started deploying ML models, I wondered: "Why do I need Kubernetes? Can't I just run my model in a Docker container on a server?"

The answer is yes—for one model with light traffic. But the moment you need to:

  • Run multiple models simultaneously

  • Scale inference based on traffic

  • Schedule resource-intensive training jobs

  • Handle GPU allocation efficiently

  • Ensure high availability

  • Update models without downtime

...you need an orchestrator. And Kubernetes has become the de facto standard.

Kubernetes Basics for ML Engineers

If you're coming from a data science background, Kubernetes might seem intimidating. Let me break down the core concepts you actually need to know.

Core Kubernetes Objects

1. Pods

The smallest deployable unit in Kubernetes—essentially a wrapper around one or more containers.

ML Use Case: A training job runs as a pod. When training completes, the pod terminates.

Key Learning: Pods are ephemeral. If a bare pod is deleted or its node fails, Kubernetes won't reschedule it on its own. You need higher-level objects for that.
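
A minimal training pod might look like this (the image name and command are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-model
spec:
  restartPolicy: Never        # let the pod terminate when training finishes
  containers:
    - name: trainer
      image: my-registry/trainer:latest   # placeholder image
      command: ["python", "train.py"]
```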

2. Deployments

Manages a set of identical pods and ensures a specified number are always running.

ML Use Case: Serving a model to handle inference requests.

Why This Matters: If one pod crashes, Kubernetes automatically starts a replacement. If you update the image, Kubernetes performs a rolling update.
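
A bare-bones Deployment for a model server (the image is a placeholder; a fuller manifest appears in the practical example later):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 3                 # Kubernetes keeps three identical pods running
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: server
          image: my-registry/model-server:latest  # placeholder image
          ports:
            - containerPort: 8000
```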

3. Services

Provides a stable network endpoint to access pods (which have dynamic IPs).

ML Use Case: Expose your model inference API.

In Practice: Your application calls http://model-service/predict rather than tracking individual pod IPs.
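
A Service that exposes the Deployment above (the name model-service matches the URL in the previous sentence):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: model-service
spec:
  selector:
    app: model-server   # routes traffic to the Deployment's pods
  ports:
    - port: 80          # stable port clients call
      targetPort: 8000  # container port inside each pod
```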

4. Jobs

Runs a pod to completion, useful for one-off tasks.

ML Use Case: Running a training job or batch inference.
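
A sketch of a one-off training Job (placeholder image and command):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-once
spec:
  backoffLimit: 2              # retry a failed training run up to twice
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: my-registry/trainer:latest  # placeholder image
          command: ["python", "train.py"]
```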

5. CronJobs

Schedules jobs to run periodically.

ML Use Case: Retraining models daily with fresh data.
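
A daily retraining sketch (placeholder image; the schedule uses standard cron syntax):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: retrain-daily
spec:
  schedule: "0 2 * * *"        # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: my-registry/trainer:latest  # placeholder image
              command: ["python", "train.py"]
```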

Resource Management for ML Workloads

ML workloads are resource-intensive. Here's how to manage them effectively in Kubernetes.

CPU and Memory

Always specify resource requests and limits:
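
For example, in a container spec (the numbers are illustrative):

```yaml
# container spec fragment
resources:
  requests:
    cpu: "1"          # used for scheduling decisions
    memory: 2Gi
  limits:
    cpu: "2"          # CPU is throttled above this
    memory: 4Gi       # the container is OOM-killed above this
```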

Important Distinction:

  • Requests: Kubernetes uses this for scheduling. Your pod won't be scheduled on a node that can't provide these resources.

  • Limits: If your container tries to use more than this, it gets throttled (CPU) or killed (memory).

My Rule of Thumb for ML:

  • Set requests based on typical usage

  • Set limits 1.5-2x the requests for bursty workloads

  • For training, requests and limits can be the same (predictable usage)

GPU Support

GPUs require special handling in Kubernetes.

Prerequisites:

  1. Nodes must have GPUs

  2. GPU drivers must be installed on the nodes

  3. The NVIDIA device plugin must be installed

Critical: GPU requests and limits must be equal. Kubernetes doesn't support fractional GPU allocation (you can't request 0.5 GPUs with the standard plugin).
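
For example, in a container spec:

```yaml
# container spec fragment
resources:
  limits:
    nvidia.com/gpu: 1   # for extended resources, requests default to (and must equal) limits
```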

Node Selection

Direct pods to specific nodes using labels:
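
For example, assuming you've labeled your nodes (the label key is illustrative):

```yaml
# Pod spec fragment; assumes nodes were labeled first, e.g.:
#   kubectl label nodes gpu-node-1 workload-type=gpu
spec:
  nodeSelector:
    workload-type: gpu
```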

Use Cases:

  • Run training on GPU nodes

  • Run inference on CPU nodes

  • Separate dev and prod workloads

Storage for ML: Persistent Volumes

ML workloads need to store:

  • Training data

  • Model checkpoints

  • Trained model artifacts

PersistentVolumeClaim (PVC)

Request storage from the cluster:
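
For example, a claim for 50Gi (the size and access mode are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
spec:
  accessModes:
    - ReadWriteOnce       # single-node access; see the note on ReadWriteMany below
  resources:
    requests:
      storage: 50Gi
```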

Using PVC in Pods
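
Mount the claim into a pod's filesystem (fragment; the image is a placeholder):

```yaml
# Pod spec fragment
spec:
  containers:
    - name: trainer
      image: my-registry/trainer:latest  # placeholder image
      volumeMounts:
        - name: storage
          mountPath: /models             # checkpoints written here persist across pod restarts
  volumes:
    - name: storage
      persistentVolumeClaim:
        claimName: model-storage
```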

Reality Check:

  • For large datasets (>1TB), consider object storage (S3, GCS) instead

  • PVCs are great for model checkpoints and temporary data

  • ReadWriteMany (shared access) is harder to get working—check your storage provider

Practical Example: Deploying a Python 3.12 Model Server

Let's put it all together with a complete example.

Step 1: Create the Model Server
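
Here's a minimal sketch of such a server, assuming FastAPI and a pickled scikit-learn model (the model path, endpoint names, and request schema are all illustrative choices):

```python
# app.py - minimal model server sketch (assumes fastapi, pydantic, scikit-learn)
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

MODEL_PATH = "/models/model.pkl"  # hypothetical path, e.g. a mounted PVC
with open(MODEL_PATH, "rb") as f:
    model = pickle.load(f)


class PredictRequest(BaseModel):
    features: list[float]


@app.get("/healthz")
def healthz():
    # used by liveness/readiness probes
    return {"status": "ok"}


@app.post("/predict")
def predict(req: PredictRequest):
    prediction = model.predict([req.features])
    return {"prediction": prediction.tolist()}
```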

Step 2: Create Dockerfile
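
A sketch pinning the Python 3.12 base image (assumes a requirements.txt listing fastapi, uvicorn, and scikit-learn):

```dockerfile
FROM python:3.12-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```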

Step 3: Kubernetes Manifests
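
A sketch tying together the objects covered above: a Deployment with resource limits and a readiness probe, plus a Service in front of it (image name and numbers are illustrative):

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: server
          image: my-registry/model-server:0.1.0  # placeholder image
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "500m"
              memory: 1Gi
            limits:
              cpu: "1"
              memory: 2Gi
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8000
            initialDelaySeconds: 10   # give the model time to load
---
apiVersion: v1
kind: Service
metadata:
  name: model-service
spec:
  selector:
    app: model-server
  ports:
    - port: 80
      targetPort: 8000
```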

Step 4: Deploy
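
Assuming the manifests above were saved as deployment.yaml:

```bash
kubectl apply -f deployment.yaml       # creates the Deployment and Service
kubectl get pods -l app=model-server   # wait until pods show READY 1/1
```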

Step 5: Test
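
One way to test without setting up an Ingress is port-forwarding (the request body matches the sketch server's schema):

```bash
# Forward a local port to the Service, then send a test request
kubectl port-forward svc/model-service 8080:80 &
curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [1.0, 2.0, 3.0]}'
```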

Kubernetes for Different ML Workloads

Training Jobs

  • Use Jobs for one-time training

  • Use CronJobs for scheduled retraining

  • Request GPUs if needed

  • Use init containers to download data before training starts (sketched below)
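
A fragment from a training Job's pod template, using an init container to pull data down first (the image and bucket are placeholders; this assumes the data lives in S3):

```yaml
# Pod template fragment
initContainers:
  - name: fetch-data
    image: amazon/aws-cli:latest
    command: ["aws", "s3", "sync", "s3://my-bucket/data", "/data"]
    volumeMounts:
      - name: data
        mountPath: /data    # the training container mounts the same volume
```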

Model Serving

  • Use Deployments for continuous serving

  • Set replicas based on traffic

  • Use HorizontalPodAutoscaler for auto-scaling (example after this list)

  • Add health checks for reliability
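
A sketch of an HPA targeting the model-server Deployment from earlier (assumes metrics-server is installed in the cluster; the thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```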

Batch Inference

  • Use Jobs for one-time batch processing

  • Use parallel jobs to process data in chunks (sketched below)
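
A sketch using an indexed Job, where each pod reads the JOB_COMPLETION_INDEX environment variable to pick its chunk (image and script are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-inference
spec:
  completions: 10            # total chunks to process
  parallelism: 3             # pods running at once
  completionMode: Indexed    # injects JOB_COMPLETION_INDEX into each pod
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: my-registry/batch-infer:latest  # placeholder image
          command: ["python", "infer.py"]        # reads JOB_COMPLETION_INDEX
```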

Common Kubernetes Issues in ML

Issue 1: Out of Memory (OOM) Kills

Symptom: Pods crash with OOMKilled status

Solution:

  • Increase memory limits

  • Optimize model size (quantization, pruning)

  • Use batch processing with smaller batches

Issue 2: GPU Not Available

Symptom: Pod stuck in Pending, events show insufficient GPU

Solution:

  • Check GPU node has capacity: kubectl describe nodes

  • Verify device plugin is running: kubectl get pods -n kube-system | grep nvidia

  • Check if other pods are hogging GPUs

Issue 3: Slow Model Loading

Symptom: Pods take minutes to become ready

Solution:

  • Increase initialDelaySeconds in probes (see the probe sketch after this list)

  • Use init containers to pre-load models

  • Consider loading models from fast storage (local SSD)
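
A probe sketch, assuming a /healthz endpoint like the example server's (the delays are illustrative starting points):

```yaml
# container spec fragment
livenessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: 60   # large models can take a while to load
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: 30
```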

Issue 4: Inconsistent Predictions

Symptom: Same input gives different outputs across pods

Solution:

  • Ensure all pods use the same model version

  • Check model files are correctly mounted

  • Verify preprocessing is consistent

Key Takeaways

  1. Pods are ephemeral: Use Deployments for long-running services

  2. Resource management is critical: Always set requests and limits

  3. Storage needs planning: PVCs for small data, object storage for large datasets

  4. Health checks prevent issues: Implement liveness and readiness probes

  5. Start simple: One deployment, one service. Add complexity as needed.

What's Next?

Now that you understand how Kubernetes manages ML workloads, we're ready to layer Kubeflow on top. In Kubeflow Overview & Setup, we'll install Kubeflow and explore how it builds upon these Kubernetes primitives to provide a complete MLOps platform.

