Model Serving with KServe

From Model to API

You've trained a great model. Now what?

In production, you need to:

  • Expose the model via an API

  • Handle thousands of requests per second

  • Scale up and down based on traffic

  • Update models without downtime

  • Monitor inference performance

  • Support multiple model formats

KServe (formerly KFServing) handles all of this.

What is KServe?

KServe is a Kubernetes-based model serving platform that provides:

  • Multi-framework support: TensorFlow, PyTorch, scikit-learn, XGBoost, and more

  • Auto-scaling: Scale to zero when idle, scale up under load

  • Canary deployments: Test new models with a percentage of traffic

  • Model versioning: Serve multiple versions simultaneously

  • Request/response logging: Track all predictions

Your First Inference Service

Step 1: Save Your Model

Train and save a model:
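
A minimal sketch using scikit-learn and joblib (the iris data and file name are placeholders for your own model):

```python
# Train a classifier and serialize it for serving
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import joblib

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# KServe's sklearn runtime loads a serialized model file (model.joblib here) from the storage path
joblib.dump(model, "model.joblib")
```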

Step 2: Upload Model to Storage

KServe pulls models from object storage (S3, GCS, Azure Blob) and other supported sources such as a PersistentVolumeClaim (pvc://) or an HTTP(S) URI.

Using S3:
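
A sketch with a placeholder bucket and path; the InferenceService will also need read credentials, typically a Secret referenced by a ServiceAccount (see the KServe credentials docs for your backend):

```bash
aws s3 cp model.joblib s3://my-bucket/models/sklearn/iris/model.joblib
```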

Using GCS:
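
Same idea for GCS (placeholder bucket and path):

```bash
gsutil cp model.joblib gs://my-bucket/models/sklearn/iris/model.joblib
```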

Local testing with Minikube:
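
One hedged option for local clusters is to skip object storage and serve from a PersistentVolumeClaim using a pvc:// storage URI (the PVC, pod, and path names below are placeholders):

```bash
# Copy the model into the PVC through any pod that mounts it at /mnt/models
kubectl cp model.joblib model-store-pod:/mnt/models/iris/model.joblib

# Then point the InferenceService at the PVC-relative path:
#   storageUri: "pvc://model-pvc/iris"
```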

Step 3: Deploy Inference Service

Create an InferenceService resource:
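
A sketch for the scikit-learn model saved above (service name, bucket, and path are placeholders; match storageUri to wherever you uploaded the model):

```yaml
# sklearn-iris.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/models/sklearn/iris"
```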

Deploy:
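
Assuming the manifest above was saved as sklearn-iris.yaml:

```bash
kubectl apply -f sklearn-iris.yaml

# READY turns True once the model has been downloaded and the predictor is up
kubectl get inferenceservice sklearn-iris
```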

Step 4: Test Your Inference Service

Get the endpoint:
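
The URL lands in the InferenceService status; the ingress address depends on your setup (an Istio gateway with an external load balancer is assumed here; for local clusters see the next step):

```bash
# Hostname assigned to the service (used as the Host header below)
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)

# Ingress address and port (Istio ingress gateway shown)
INGRESS_HOST=$(kubectl get svc istio-ingressgateway -n istio-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl get svc istio-ingressgateway -n istio-system \
  -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
```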

Make a prediction:
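
Using the V1 inference protocol (POST /v1/models/&lt;name&gt;:predict with an instances array):

```bash
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict" \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4], [5.1, 3.5, 1.4, 0.2]]}'

# The response comes back as {"predictions": [...]}
```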

For local Minikube testing:
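
Port-forward the ingress gateway instead (assuming the default KServe + Istio install):

```bash
kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80

# In another terminal:
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  "http://localhost:8080/v1/models/sklearn-iris:predict" \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'
```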

Serving Different Model Types

PyTorch Models

InferenceService:
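
A hedged sketch; KServe's pytorch runtime is backed by TorchServe, so the storage path is expected to hold a TorchServe layout (a config/ directory and a model-store/ directory with the .mar archive). Name and bucket are placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torch-classifier
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://my-bucket/models/pytorch/classifier"
```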

TensorFlow Models

Upload structure:
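
The TensorFlow runtime uses TensorFlow Serving, which expects a numeric version directory inside the model path (bucket and names are placeholders):

```
s3://my-bucket/models/tensorflow/flowers/   <- storageUri points here
└── 0001/                                   <- numeric version directory (TF Serving convention)
    ├── saved_model.pb
    └── variables/
        ├── variables.data-00000-of-00001
        └── variables.index
```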

InferenceService:
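
Point storageUri at the directory that contains the version folders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: flowers
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storageUri: "s3://my-bucket/models/tensorflow/flowers"
```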

Custom Python Server

For custom inference logic:
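
A minimal sketch with the KServe Python SDK; method signatures can differ slightly between SDK versions, and the model path assumes the storage initializer's default /mnt/models mount:

```python
# model.py -- a custom predictor built on the KServe Python SDK (sketch)
from typing import Dict

import joblib
import kserve


class CustomModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.load()

    def load(self):
        # /mnt/models is where KServe's storage initializer places downloaded models
        self.model = joblib.load("/mnt/models/model.joblib")
        self.ready = True

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        # Custom pre/post-processing goes here
        instances = payload["instances"]
        return {"predictions": self.model.predict(instances).tolist()}


if __name__ == "__main__":
    kserve.ModelServer().start([CustomModel("custom-model")])
```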

Dockerfile:
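
A hedged sketch of the image (pin versions for real deployments):

```dockerfile
FROM python:3.11-slim

RUN pip install --no-cache-dir kserve scikit-learn joblib

WORKDIR /app
COPY model.py /app/model.py

# The KServe model server listens on 8080 by default
EXPOSE 8080
CMD ["python", "model.py"]
```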

InferenceService:
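
Reference the image with a container-based predictor (image name is a placeholder):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model
spec:
  predictor:
    containers:
      - name: kserve-container
        image: myregistry/custom-model:latest   # placeholder image name
```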

Advanced Features

Auto-Scaling

KServe automatically scales based on traffic.

Configure scaling:
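
A sketch of the relevant predictor fields; scaleTarget/scaleMetric exist in recent KServe releases, while older versions rely on Knative autoscaling annotations instead:

```yaml
# Fragment of the InferenceService spec
spec:
  predictor:
    minReplicas: 0          # 0 allows scale-to-zero
    maxReplicas: 10
    scaleTarget: 10         # target concurrent requests per replica
    scaleMetric: concurrency
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/models/sklearn/iris"
```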

Scale to zero: with minReplicas: 0, pods terminate after a period of inactivity (60s by default). The first request after scale-down incurs cold-start latency while a new pod starts and loads the model.

Canary Deployments

Gradually roll out new models:
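
Setting canaryTrafficPercent on an updated InferenceService splits traffic between the previous ready revision and the new one; a sketch (the v2 path is a placeholder):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    canaryTrafficPercent: 10            # 10% of requests hit the new revision
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/models/sklearn/iris-v2"
```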

Monitor canary performance, then:

  • Increase percentage if performing well

  • Roll back if issues detected

Model Versioning

Serve multiple versions:
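
One hedged pattern is one InferenceService per version, with the version encoded in both the name and the storage path (names and bucket are placeholders):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-v1
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/models/sklearn/v1"
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-v2
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/models/sklearn/v2"
```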

Request specific version:
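
Each InferenceService gets its own hostname, so clients select a version via the Host header (continuing the placeholders above):

```bash
V2_HOST=$(kubectl get inferenceservice sklearn-iris-v2 -o jsonpath='{.status.url}' | cut -d "/" -f 3)

curl -H "Host: ${V2_HOST}" -H "Content-Type: application/json" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris-v2:predict" \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'
```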

Request/Response Logging

Enable prediction logging:
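
A sketch; the logger posts each request/response as a CloudEvent to the URL you configure (the sink shown is a placeholder):

```yaml
# Fragment of the InferenceService spec
spec:
  predictor:
    logger:
      mode: all             # request, response, or all
      url: http://message-dumper.default.svc.cluster.local   # placeholder sink endpoint
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/models/sklearn/iris"
```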

Logged requests and responses are sent to the configured sink endpoint (which can forward to Kafka, CloudWatch, etc.) for:

  • Monitoring predictions

  • Detecting data drift

  • Debugging issues

  • Compliance/audit

Performance Optimization

Batching

Process multiple requests together:
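
A sketch of KServe's built-in batcher (values are illustrative; tune them against your latency budget):

```yaml
# Fragment of the InferenceService spec
spec:
  predictor:
    batcher:
      maxBatchSize: 32      # upper bound on requests merged into one batch
      maxLatency: 500       # max milliseconds to wait while a batch fills
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/models/sklearn/iris"
```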

Batching improves throughput, especially for GPU-backed models, at the cost of a little extra per-request latency while a batch fills.

GPU Acceleration
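
To serve on GPUs, request the GPU resource on the predictor container (this assumes the cluster exposes nvidia.com/gpu via the NVIDIA device plugin); a sketch:

```yaml
# Fragment of the InferenceService spec
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://my-bucket/models/pytorch/classifier"
      resources:
        limits:
          nvidia.com/gpu: "1"
```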

Model Optimization

Before deploying:
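
One hedged example: dynamic quantization of a PyTorch model stores Linear weights as int8, which often cuts memory use and CPU inference latency; measure before and after, since gains depend on the model:

```python
import torch
import torch.nn as nn

# A toy module standing in for your trained network
model = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
model.eval()

# Dynamic quantization: weights stored as int8, activations quantized at runtime
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
torch.save(quantized.state_dict(), "model-quantized.pt")
```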

Monitoring Inference Services

Check Status
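
The service name and pod label below are the ones used in the earlier examples (KServe labels predictor pods with serving.kserve.io/inferenceservice):

```bash
kubectl get inferenceservice sklearn-iris     # READY, URL, and traffic split
kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris
```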

Metrics

KServe exposes Prometheus metrics for every inference service.

Key metrics:

  • request_count: Total requests

  • request_latency: Response time

  • request_errors: Failed requests
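
Hedged PromQL sketches built from the names above; the exact metric names and labels depend on your KServe/Knative versions and Prometheus scrape configuration:

```
rate(request_count[5m])                                      # throughput
histogram_quantile(0.95, rate(request_latency_bucket[5m]))   # p95 latency
rate(request_errors[5m]) / rate(request_count[5m])           # error ratio
```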

Performance Testing
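
A quick load test with the hey generator (https://github.com/rakyll/hey) against the same placeholder endpoint used above; watch p95/p99 latency and the error count in the summary:

```bash
echo '{"instances": [[6.8, 2.8, 4.8, 1.4]]}' > payload.json

# 2000 requests at concurrency 20 against the predict endpoint
hey -n 2000 -c 20 -m POST \
  -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  -D payload.json \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict"
```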

Common Issues

Service Not Ready

Check events:
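
Start with the InferenceService conditions and events (service name and pod label as in the examples above):

```bash
kubectl describe inferenceservice sklearn-iris          # conditions explain what failed
kubectl get events --sort-by=.lastTimestamp | grep sklearn-iris
kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris
kubectl describe pod <predictor-pod-name>               # image pull / scheduling errors
```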

Common causes:

  • Storage URI incorrect or inaccessible

  • Model file format issues

  • Insufficient resources

  • Image pull errors

Slow Cold Starts

Problem: First request after scale-to-zero is slow

Solutions:

  1. Set minReplicas: 1 (KServe's equivalent of Knative's minScale annotation) to keep one replica warm

  2. Use smaller base images

  3. Optimize model loading code

High Latency

Debug:
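
A couple of hedged starting points, using the same placeholders as above (kubectl top requires metrics-server):

```bash
# Where is the time going? curl can break the request down
curl -s -o /dev/null \
  -w "connect: %{time_connect}s  first-byte: %{time_starttransfer}s  total: %{time_total}s\n" \
  -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  -d @payload.json \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict"

# Are the predictor pods CPU- or memory-bound?
kubectl top pods -l serving.kserve.io/inferenceservice=sklearn-iris
```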

Optimize: Focus on the slowest component.

Best Practices

1. Version Models Explicitly

2. Use Health Checks

Models should validate on startup:
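
A hedged sketch reusing the KServe SDK pattern from the custom server above: load() runs a sanity prediction before setting ready, so the pod does not report healthy (or receive traffic) until the model actually works.

```python
from typing import Dict

import joblib
import kserve


class ValidatedModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.load()

    def load(self):
        self.model = joblib.load("/mnt/models/model.joblib")
        # Sanity check with a known-good input shape before reporting ready
        assert len(self.model.predict([[6.8, 2.8, 4.8, 1.4]])) == 1
        self.ready = True

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        return {"predictions": self.model.predict(payload["instances"]).tolist()}


if __name__ == "__main__":
    kserve.ModelServer().start([ValidatedModel("validated-model")])
```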

3. Set Resource Limits
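
Requests and limits on the predictor keep one hungry model from starving the rest of the cluster; a fragment with illustrative values:

```yaml
# Fragment of the InferenceService spec
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/models/sklearn/iris"
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
```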

4. Monitor Everything

  • Request latency

  • Error rate

  • Model metrics (accuracy if ground truth available)

  • Resource usage

5. Plan for Failures

Key Takeaways

  1. KServe simplifies model serving on Kubernetes

  2. Supports multiple frameworks out-of-the-box

  3. Auto-scaling handles variable traffic

  4. Canary deployments enable safe rollouts

  5. Monitor latency, errors, and resource usage

Next Steps

With models deployed, we need to track them. In Model Registry, we'll learn how to manage model versions, metadata, and lineage.


