Kubernetes for MLOps
Why Kubernetes for Machine Learning?
When I first started deploying ML models, I wondered: "Why do I need Kubernetes? Can't I just run my model in a Docker container on a server?"
The answer is yes—for one model with light traffic. But the moment you need to:
Run multiple models simultaneously
Scale inference based on traffic
Schedule resource-intensive training jobs
Handle GPU allocation efficiently
Ensure high availability
Update models without downtime
...you need an orchestrator. And Kubernetes has become the de facto standard.
Kubernetes Basics for ML Engineers
If you're coming from a data science background, Kubernetes might seem intimidating. Let me break down the core concepts you actually need to know.
Core Kubernetes Objects
1. Pods
The smallest deployable unit in Kubernetes—essentially a wrapper around one or more containers.
ML Use Case: A training job runs as a pod. When training completes, the pod terminates.
Key Learning: Pods are ephemeral. If a bare pod is deleted or its node fails, it isn't recreated automatically. You need higher-level objects for that.
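As a concrete example, here's a minimal sketch of a bare pod that runs a training script; the image name and command are placeholders, not from the original:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-once
spec:
  restartPolicy: Never        # let the pod finish and stay terminated
  containers:
    - name: trainer
      image: registry.example.com/ml/trainer:latest   # placeholder image
      command: ["python", "train.py"]
```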
2. Deployments
Manages a set of identical pods and ensures a specified number are always running.
ML Use Case: Serving a model to handle inference requests.
Why This Matters: If one pod crashes, Kubernetes automatically starts a replacement. If you update the image, Kubernetes performs a rolling update.
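A minimal sketch of a model-serving Deployment might look like this (the image and port are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 3                      # keep three identical pods running
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: server
          image: registry.example.com/ml/model-server:1.0   # placeholder image
          ports:
            - containerPort: 8000
```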
3. Services
Provides a stable network endpoint to access pods (which have dynamic IPs).
ML Use Case: Expose your model inference API.
In Practice: Your application calls http://model-service/predict rather than tracking individual pod IPs.
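For example, a Service fronting the Deployment sketched above could look like this:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: model-service            # gives you the stable http://model-service DNS name
spec:
  selector:
    app: model-server            # routes to pods carrying this label
  ports:
    - port: 80                   # port clients call
      targetPort: 8000           # port the container listens on
```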
4. Jobs
Runs a pod to completion, useful for one-off tasks.
ML Use Case: Running a training job or batch inference.
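A training Job, sketched with placeholder names:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
spec:
  backoffLimit: 2                # retry a failed training run at most twice
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/ml/trainer:latest   # placeholder image
          command: ["python", "train.py", "--epochs", "10"]
```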
5. CronJobs
Schedules jobs to run periodically.
ML Use Case: Retraining models daily with fresh data.
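A nightly retraining CronJob might be sketched like this (the schedule and data path are illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-retrain
spec:
  schedule: "0 2 * * *"          # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: registry.example.com/ml/trainer:latest   # placeholder image
              command: ["python", "train.py", "--data", "/data/latest"]
```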
Resource Management for ML Workloads
ML workloads are resource-intensive. Here's how to manage them effectively in Kubernetes.
CPU and Memory
Always specify resource requests and limits:
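For example, on a serving container (the numbers here are illustrative, not a recommendation):

```yaml
# Under spec.containers[*] in the pod template
resources:
  requests:
    cpu: "1"          # used by the scheduler to place the pod
    memory: 4Gi
  limits:
    cpu: "2"          # CPU is throttled above this
    memory: 8Gi       # the container is OOM-killed above this
```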
Important Distinction:
Requests: Kubernetes uses this for scheduling. Your pod won't be scheduled on a node that can't provide these resources.
Limits: If your container tries to use more than this, it gets throttled (CPU) or killed (memory).
My Rule of Thumb for ML:
Set requests based on typical usage
Set limits 1.5-2x the requests for bursty workloads
For training, requests and limits can be the same (predictable usage)
GPU Support
GPUs require special handling in Kubernetes.
Prerequisites:
Nodes must have GPUs
NVIDIA device plugin must be installed
GPU drivers must be installed on nodes
Critical: GPU requests and limits must be equal. Kubernetes doesn't support fractional GPU allocation (you can't request 0.5 GPUs with the standard plugin).
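For example, requesting a single GPU via the NVIDIA device plugin's resource name:

```yaml
resources:
  requests:
    nvidia.com/gpu: 1   # must equal the limit
  limits:
    nvidia.com/gpu: 1
```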
Node Selection
Direct pods to specific nodes using labels:
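For example, assuming you've labeled your GPU nodes yourself (gpu=true is a label name I'm picking for illustration):

```yaml
# Label the node first: kubectl label nodes <node-name> gpu=true
# Then, in the pod spec:
nodeSelector:
  gpu: "true"          # only schedule onto nodes labeled gpu=true
```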
Use Cases:
Run training on GPU nodes
Run inference on CPU nodes
Separate dev and prod workloads
Storage for ML: Persistent Volumes
ML workloads need to store:
Training data
Model checkpoints
Trained model artifacts
PersistentVolumeClaim (PVC)
Request storage from the cluster:
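A sketch of a PVC for model checkpoints (the size and access mode are placeholders to adjust):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: checkpoints-pvc
spec:
  accessModes:
    - ReadWriteOnce              # single-node read/write; see the ReadWriteMany note below
  resources:
    requests:
      storage: 50Gi
  # storageClassName: standard   # uncomment to pick a specific storage class
```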
Using PVC in Pods
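Mount the claim into the container, roughly like this:

```yaml
# Inside the pod spec:
volumes:
  - name: checkpoints
    persistentVolumeClaim:
      claimName: checkpoints-pvc
containers:
  - name: trainer
    image: registry.example.com/ml/trainer:latest   # placeholder image
    volumeMounts:
      - name: checkpoints
        mountPath: /checkpoints
```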
Reality Check:
For large datasets (>1TB), consider object storage (S3, GCS) instead
PVCs are great for model checkpoints and temporary data
ReadWriteMany (shared access) is harder to get working—check your storage provider
Practical Example: Deploying a Python 3.12 Model Server
Let's put it all together with a complete example.
Step 1: Create the Model Server
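Here's a minimal sketch of the server using FastAPI and a pickled scikit-learn model; the model.pkl artifact, the feature list, and the /healthz path are my placeholders:

```python
# app.py - minimal model server sketch (FastAPI and the model.pkl artifact are assumptions)
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the model once at startup; model.pkl is baked into the image in Step 2
with open("model.pkl", "rb") as f:
    model = pickle.load(f)


class PredictRequest(BaseModel):
    features: list[float]


@app.get("/healthz")
def healthz():
    # Used by the Kubernetes liveness/readiness probes in Step 3
    return {"status": "ok"}


@app.post("/predict")
def predict(req: PredictRequest):
    prediction = model.predict([req.features])
    return {"prediction": prediction.tolist()}
```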
Step 2: Create Dockerfile
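A matching Dockerfile sketch (the requirements.txt and model.pkl files are assumed to sit next to app.py):

```dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and the serialized model
COPY app.py model.pkl ./

EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```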
Step 3: Kubernetes Manifests
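A sketch of the Deployment and Service (the image name is a placeholder; tune resources to your model):

```yaml
# model-server.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: server
          image: registry.example.com/ml/model-server:1.0   # placeholder image
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "500m"
              memory: 1Gi
            limits:
              cpu: "1"
              memory: 2Gi
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8000
            initialDelaySeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8000
            initialDelaySeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: model-service
spec:
  selector:
    app: model-server
  ports:
    - port: 80
      targetPort: 8000
```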
Step 4: Deploy
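Build, push, and apply (the registry and filenames are placeholders):

```bash
# Build and push the image to your registry
docker build -t registry.example.com/ml/model-server:1.0 .
docker push registry.example.com/ml/model-server:1.0

# Apply the manifests and wait for the rollout to finish
kubectl apply -f model-server.yaml
kubectl rollout status deployment/model-server
```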
Step 5: Test
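One way to test without exposing the Service externally is a port-forward plus a sample request (the feature values are arbitrary):

```bash
# Forward the Service to your machine, then send a test prediction
kubectl port-forward svc/model-service 8080:80 &

curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [1.0, 2.0, 3.0]}'
```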
Kubernetes for Different ML Workloads
Training Jobs
Use Jobs for one-time training
Use CronJobs for scheduled retraining
Request GPUs if needed
Use init containers to download data before training starts
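Putting those together, here's a sketch of a GPU training Job whose init container pulls data before training starts; the images and bucket path are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-with-data
spec:
  template:
    spec:
      restartPolicy: Never
      volumes:
        - name: data
          emptyDir: {}
      initContainers:
        - name: download-data
          image: amazon/aws-cli:latest              # placeholder tooling image
          command: ["aws", "s3", "sync", "s3://my-bucket/dataset/", "/data/"]   # placeholder bucket
          volumeMounts:
            - name: data
              mountPath: /data
      containers:
        - name: trainer
          image: registry.example.com/ml/trainer:latest   # placeholder image
          command: ["python", "train.py", "--data", "/data"]
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: data
              mountPath: /data
```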
Model Serving
Use Deployments for continuous serving
Set replicas based on traffic
Use a HorizontalPodAutoscaler for auto-scaling (see the sketch after this list)
Add health checks for reliability
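Here's a minimal HPA sketch that scales the model-server Deployment on CPU utilization (the thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # add pods when average CPU exceeds 70%
```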
Batch Inference
Use Jobs for one-time batch processing
Use parallel jobs to process data in chunks
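An Indexed Job is one way to do this: each pod receives a JOB_COMPLETION_INDEX it can use to pick its shard of the data (the image and script are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-inference
spec:
  completions: 10                 # total chunks to process
  parallelism: 5                  # pods running at once
  completionMode: Indexed         # each pod gets JOB_COMPLETION_INDEX (0-9)
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: inference
          image: registry.example.com/ml/batch-inference:latest   # placeholder image
          command: ["python", "predict_chunk.py"]   # reads JOB_COMPLETION_INDEX to select its chunk
```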
Common Kubernetes Issues in ML
Issue 1: Out of Memory (OOM) Kills
Symptom: Pods crash with OOMKilled status
Solution:
Increase memory limits
Optimize model size (quantization, pruning)
Use batch processing with smaller batches
Issue 2: GPU Not Available
Symptom: Pod stuck in Pending, events show insufficient GPU
Solution:
Check that a GPU node has capacity: kubectl describe nodes
Verify the NVIDIA device plugin is running: kubectl get pods -n kube-system | grep nvidia
Check whether other pods are hogging GPUs
Issue 3: Slow Model Loading
Symptom: Pods take minutes to become ready
Solution:
Increase initialDelaySeconds in probes
Use init containers to pre-load models
Consider loading models from fast storage (local SSD)
Issue 4: Inconsistent Predictions
Symptom: Same input gives different outputs across pods
Solution:
Ensure all pods use the same model version
Check model files are correctly mounted
Verify preprocessing is consistent
Key Takeaways
Pods are ephemeral: Use Deployments for long-running services
Resource management is critical: Always set requests and limits
Storage needs planning: PVCs for small data, object storage for large datasets
Health checks prevent issues: Implement liveness and readiness probes
Start simple: One deployment, one service. Add complexity as needed.
What's Next?
Now that you understand how Kubernetes manages ML workloads, we're ready to layer Kubeflow on top. In Kubeflow Overview & Setup, we'll install Kubeflow and explore how it builds upon these Kubernetes primitives to provide a complete MLOps platform.