Kubernetes for MLOps

Why Kubernetes for Machine Learning?

When I first started deploying ML models, I wondered: "Why do I need Kubernetes? Can't I just run my model in a Docker container on a server?"

The answer is yes—for one model with light traffic. But the moment you need to:

  • Run multiple models simultaneously

  • Scale inference based on traffic

  • Schedule resource-intensive training jobs

  • Handle GPU allocation efficiently

  • Ensure high availability

  • Update models without downtime

...you need an orchestrator. And Kubernetes has become the de facto standard.

Kubernetes Basics for ML Engineers

If you're coming from a data science background, Kubernetes might seem intimidating. Let me break down the core concepts you actually need to know.

Core Kubernetes Objects

1. Pods

The smallest deployable unit in Kubernetes—essentially a wrapper around one or more containers.

ML Use Case: A training job runs as a pod. When training completes, the pod terminates.

Key Learning: Pods are ephemeral. If a bare pod is deleted or its node fails, Kubernetes won't reschedule it on its own. You need higher-level objects for that.
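
A minimal training pod might look like this (the image name and command are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-model
spec:
  restartPolicy: Never        # let the pod terminate when training finishes
  containers:
    - name: trainer
      image: my-registry/trainer:latest   # placeholder image
      command: ["python", "train.py"]
```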

2. Deployments

Manages a set of identical pods and ensures a specified number are always running.

ML Use Case: Serving a model to handle inference requests.

Why This Matters: If one pod crashes, Kubernetes automatically starts a replacement. If you update the image, Kubernetes performs a rolling update.
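
A bare-bones Deployment for a model server (the image is a placeholder; a fuller manifest appears in the practical example later):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 3                 # Kubernetes keeps three identical pods running
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: server
          image: my-registry/model-server:latest  # placeholder image
          ports:
            - containerPort: 8000
```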

3. Services

Provides a stable network endpoint to access pods (which have dynamic IPs).

ML Use Case: Expose your model inference API.

In Practice: Your application calls http://model-service/predict rather than tracking individual pod IPs.
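
A Service that exposes the Deployment above (the name model-service matches the URL in the previous sentence):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: model-service
spec:
  selector:
    app: model-server   # routes traffic to the Deployment's pods
  ports:
    - port: 80          # stable port clients call
      targetPort: 8000  # container port inside each pod
```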

4. Jobs

Runs a pod to completion, useful for one-off tasks.

ML Use Case: Running a training job or batch inference.
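
A sketch of a one-off training Job (placeholder image and command):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-once
spec:
  backoffLimit: 2              # retry a failed training run up to twice
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: my-registry/trainer:latest  # placeholder image
          command: ["python", "train.py"]
```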

5. CronJobs

Schedules jobs to run periodically.

ML Use Case: Retraining models daily with fresh data.
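
A daily retraining sketch (placeholder image; the schedule uses standard cron syntax):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: retrain-daily
spec:
  schedule: "0 2 * * *"        # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: my-registry/trainer:latest  # placeholder image
              command: ["python", "train.py"]
```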

Resource Management for ML Workloads

ML workloads are resource-intensive. Here's how to manage them effectively in Kubernetes.

CPU and Memory

Always specify resource requests and limits:
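
For example, in a container spec (the numbers are illustrative):

```yaml
# container spec fragment
resources:
  requests:
    cpu: "1"          # used for scheduling decisions
    memory: 2Gi
  limits:
    cpu: "2"          # CPU is throttled above this
    memory: 4Gi       # the container is OOM-killed above this
```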

Important Distinction:

  • Requests: Kubernetes uses this for scheduling. Your pod won't be scheduled on a node that can't provide these resources.

  • Limits: If your container tries to use more than this, it gets throttled (CPU) or killed (memory).

My Rule of Thumb for ML:

  • Set requests based on typical usage

  • Set limits 1.5-2x the requests for bursty workloads

  • For training, requests and limits can be the same (predictable usage)

GPU Support

GPUs require special handling in Kubernetes.

Prerequisites:

  1. Nodes must have GPUs

  2. GPU drivers must be installed on the nodes

  3. The NVIDIA device plugin must be installed

Critical: GPU requests and limits must be equal. Kubernetes doesn't support fractional GPU allocation (you can't request 0.5 GPUs with the standard plugin).
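
For example, in a container spec:

```yaml
# container spec fragment
resources:
  limits:
    nvidia.com/gpu: 1   # for extended resources, requests default to (and must equal) limits
```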

Node Selection

Direct pods to specific nodes using labels:
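
For example, assuming you've labeled your nodes (the label key is illustrative):

```yaml
# Pod spec fragment; assumes nodes were labeled first, e.g.:
#   kubectl label nodes gpu-node-1 workload-type=gpu
spec:
  nodeSelector:
    workload-type: gpu
```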

Use Cases:

  • Run training on GPU nodes

  • Run inference on CPU nodes

  • Separate dev and prod workloads

Storage for ML: Persistent Volumes

ML workloads need to store:

  • Training data

  • Model checkpoints

  • Trained model artifacts

PersistentVolumeClaim (PVC)

Request storage from the cluster:
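
For example, a claim for 50Gi (the size and access mode are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
spec:
  accessModes:
    - ReadWriteOnce       # single-node access; see the note on ReadWriteMany below
  resources:
    requests:
      storage: 50Gi
```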

Using PVC in Pods
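
Mount the claim into a pod's filesystem (fragment; the image is a placeholder):

```yaml
# Pod spec fragment
spec:
  containers:
    - name: trainer
      image: my-registry/trainer:latest  # placeholder image
      volumeMounts:
        - name: storage
          mountPath: /models             # checkpoints written here persist across pod restarts
  volumes:
    - name: storage
      persistentVolumeClaim:
        claimName: model-storage
```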

Reality Check:

  • For large datasets (>1TB), consider object storage (S3, GCS) instead

  • PVCs are great for model checkpoints and temporary data

  • ReadWriteMany (shared access) is harder to get working—check your storage provider

Practical Example: Deploying a Python 3.12 Model Server

Let's put it all together with a complete example.

Step 1: Create the Model Server
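
Here's a minimal sketch of such a server, assuming FastAPI and a pickled scikit-learn model (the model path, endpoint names, and request schema are all illustrative choices):

```python
# app.py - minimal model server sketch (assumes fastapi, pydantic, scikit-learn)
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

MODEL_PATH = "/models/model.pkl"  # hypothetical path, e.g. a mounted PVC
with open(MODEL_PATH, "rb") as f:
    model = pickle.load(f)


class PredictRequest(BaseModel):
    features: list[float]


@app.get("/healthz")
def healthz():
    # used by liveness/readiness probes
    return {"status": "ok"}


@app.post("/predict")
def predict(req: PredictRequest):
    prediction = model.predict([req.features])
    return {"prediction": prediction.tolist()}
```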

Step 2: Create Dockerfile
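
A sketch pinning the Python 3.12 base image (assumes a requirements.txt listing fastapi, uvicorn, and scikit-learn):

```dockerfile
FROM python:3.12-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```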

Step 3: Kubernetes Manifests
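
A sketch tying together the objects covered above: a Deployment with resource limits and a readiness probe, plus a Service in front of it (image name and numbers are illustrative):

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: server
          image: my-registry/model-server:0.1.0  # placeholder image
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "500m"
              memory: 1Gi
            limits:
              cpu: "1"
              memory: 2Gi
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8000
            initialDelaySeconds: 10   # give the model time to load
---
apiVersion: v1
kind: Service
metadata:
  name: model-service
spec:
  selector:
    app: model-server
  ports:
    - port: 80
      targetPort: 8000
```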

Step 4: Deploy
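
Assuming the manifests above were saved as deployment.yaml:

```bash
kubectl apply -f deployment.yaml       # creates the Deployment and Service
kubectl get pods -l app=model-server   # wait until pods show READY 1/1
```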

Step 5: Test
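
One way to test without setting up an Ingress is port-forwarding (the request body matches the sketch server's schema):

```bash
# Forward a local port to the Service, then send a test request
kubectl port-forward svc/model-service 8080:80 &
curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [1.0, 2.0, 3.0]}'
```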

Kubernetes for Different ML Workloads

Training Jobs

  • Use Jobs for one-time training

  • Use CronJobs for scheduled retraining

  • Request GPUs if needed

  • Use init containers to download data before training starts (sketched below)
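
A fragment from a training Job's pod template, using an init container to pull data down first (the image and bucket are placeholders; this assumes the data lives in S3):

```yaml
# Pod template fragment
initContainers:
  - name: fetch-data
    image: amazon/aws-cli:latest
    command: ["aws", "s3", "sync", "s3://my-bucket/data", "/data"]
    volumeMounts:
      - name: data
        mountPath: /data    # the training container mounts the same volume
```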

Model Serving

  • Use Deployments for continuous serving

  • Set replicas based on traffic

  • Use HorizontalPodAutoscaler for auto-scaling (example after this list)

  • Add health checks for reliability
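
A sketch of an HPA targeting the model-server Deployment from earlier (assumes metrics-server is installed in the cluster; the thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```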

Batch Inference

  • Use Jobs for one-time batch processing

  • Use parallel jobs to process data in chunks (sketched below)
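
A sketch using an indexed Job, where each pod reads the JOB_COMPLETION_INDEX environment variable to pick its chunk (image and script are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-inference
spec:
  completions: 10            # total chunks to process
  parallelism: 3             # pods running at once
  completionMode: Indexed    # injects JOB_COMPLETION_INDEX into each pod
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: my-registry/batch-infer:latest  # placeholder image
          command: ["python", "infer.py"]        # reads JOB_COMPLETION_INDEX
```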

Common Kubernetes Issues in ML

Issue 1: Out of Memory (OOM) Kills

Symptom: Pods crash with OOMKilled status

Solution:

  • Increase memory limits

  • Optimize model size (quantization, pruning)

  • Use batch processing with smaller batches

Issue 2: GPU Not Available

Symptom: Pod stuck in Pending, events show insufficient GPU

Solution:

  • Check GPU node has capacity: kubectl describe nodes

  • Verify device plugin is running: kubectl get pods -n kube-system | grep nvidia

  • Check if other pods are hogging GPUs

Issue 3: Slow Model Loading

Symptom: Pods take minutes to become ready

Solution:

  • Increase initialDelaySeconds in probes (see the probe sketch after this list)

  • Use init containers to pre-load models

  • Consider loading models from fast storage (local SSD)
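
A probe sketch, assuming a /healthz endpoint like the example server's (the delays are illustrative starting points):

```yaml
# container spec fragment
livenessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: 60   # large models can take a while to load
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: 30
```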

Issue 4: Inconsistent Predictions

Symptom: Same input gives different outputs across pods

Solution:

  • Ensure all pods use the same model version

  • Check model files are correctly mounted

  • Verify preprocessing is consistent

Key Takeaways

  1. Pods are ephemeral: Use Deployments for long-running services

  2. Resource management is critical: Always set requests and limits

  3. Storage needs planning: PVCs for small data, object storage for large datasets

  4. Health checks prevent issues: Implement liveness and readiness probes

  5. Start simple: One deployment, one service. Add complexity as needed.

What's Next?

Now that you understand how Kubernetes manages ML workloads, we're ready to layer Kubeflow on top. In Kubeflow Overview & Setup, we'll install Kubeflow and explore how it builds upon these Kubernetes primitives to provide a complete MLOps platform.

