Model Serving with KServe

From Model to API

You've trained a great model. Now what?

In production, you need to:

  • Expose the model via an API

  • Handle thousands of requests per second

  • Scale up and down based on traffic

  • Update models without downtime

  • Monitor inference performance

  • Support multiple model formats

KServe (formerly KFServing) handles all of this.

What is KServe?

KServe is a Kubernetes-based model serving platform that provides:

  • Multi-framework support: TensorFlow, PyTorch, scikit-learn, XGBoost, and more

  • Auto-scaling: Scale to zero when idle, scale up under load

  • Canary deployments: Test new models with a percentage of traffic

  • Model versioning: Serve multiple versions simultaneously

  • Request/response logging: Track all predictions

Your First Inference Service

Step 1: Save Your Model

Train and save a model:
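
A minimal sketch using scikit-learn and joblib (the iris data and file name are placeholders for your own model):

```python
# Train a classifier and serialize it for serving
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import joblib

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# KServe's sklearn runtime loads a serialized model file (model.joblib here) from the storage path
joblib.dump(model, "model.joblib")
```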

Step 2: Upload Model to Storage

KServe pulls models from object storage (S3, GCS, Azure Blob) and other supported sources such as a PersistentVolumeClaim (pvc://) or an HTTP(S) URI.

Using S3:
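
A sketch with a placeholder bucket and path; the InferenceService will also need read credentials, typically a Secret referenced by a ServiceAccount (see the KServe credentials docs for your backend):

```bash
aws s3 cp model.joblib s3://my-bucket/models/sklearn/iris/model.joblib
```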

Using GCS:
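
Same idea for GCS (placeholder bucket and path):

```bash
gsutil cp model.joblib gs://my-bucket/models/sklearn/iris/model.joblib
```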

Local testing with Minikube:
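
One hedged option for local clusters is to skip object storage and serve from a PersistentVolumeClaim using a pvc:// storage URI (the PVC, pod, and path names below are placeholders):

```bash
# Copy the model into the PVC through any pod that mounts it at /mnt/models
kubectl cp model.joblib model-store-pod:/mnt/models/iris/model.joblib

# Then point the InferenceService at the PVC-relative path:
#   storageUri: "pvc://model-pvc/iris"
```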

Step 3: Deploy Inference Service

Create an InferenceService resource:
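
A sketch for the scikit-learn model saved above (service name, bucket, and path are placeholders; match storageUri to wherever you uploaded the model):

```yaml
# sklearn-iris.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/models/sklearn/iris"
```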

Deploy:
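
Assuming the manifest above was saved as sklearn-iris.yaml:

```bash
kubectl apply -f sklearn-iris.yaml

# READY turns True once the model has been downloaded and the predictor is up
kubectl get inferenceservice sklearn-iris
```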

Step 4: Test Your Inference Service

Get the endpoint:
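
The URL lands in the InferenceService status; the ingress address depends on your setup (an Istio gateway with an external load balancer is assumed here; for local clusters see the next step):

```bash
# Hostname assigned to the service (used as the Host header below)
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)

# Ingress address and port (Istio ingress gateway shown)
INGRESS_HOST=$(kubectl get svc istio-ingressgateway -n istio-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl get svc istio-ingressgateway -n istio-system \
  -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
```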

Make a prediction:
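
Using the V1 inference protocol (POST /v1/models/&lt;name&gt;:predict with an instances array):

```bash
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict" \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4], [5.1, 3.5, 1.4, 0.2]]}'

# The response comes back as {"predictions": [...]}
```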

For local Minikube testing:
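
Port-forward the ingress gateway instead (assuming the default KServe + Istio install):

```bash
kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80

# In another terminal:
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  "http://localhost:8080/v1/models/sklearn-iris:predict" \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'
```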

Serving Different Model Types

PyTorch Models

InferenceService:
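
A hedged sketch; KServe's pytorch runtime is backed by TorchServe, so the storage path is expected to hold a TorchServe layout (a config/ directory and a model-store/ directory with the .mar archive). Name and bucket are placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torch-classifier
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://my-bucket/models/pytorch/classifier"
```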

TensorFlow Models

Upload structure:
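
The TensorFlow runtime uses TensorFlow Serving, which expects a numeric version directory inside the model path (bucket and names are placeholders):

```
s3://my-bucket/models/tensorflow/flowers/   <- storageUri points here
└── 0001/                                   <- numeric version directory (TF Serving convention)
    ├── saved_model.pb
    └── variables/
        ├── variables.data-00000-of-00001
        └── variables.index
```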

InferenceService:
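
Point storageUri at the directory that contains the version folders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: flowers
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storageUri: "s3://my-bucket/models/tensorflow/flowers"
```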

Custom Python Server

For custom inference logic:
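
A minimal sketch with the KServe Python SDK; method signatures can differ slightly between SDK versions, and the model path assumes the storage initializer's default /mnt/models mount:

```python
# model.py -- a custom predictor built on the KServe Python SDK (sketch)
from typing import Dict

import joblib
import kserve


class CustomModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.load()

    def load(self):
        # /mnt/models is where KServe's storage initializer places downloaded models
        self.model = joblib.load("/mnt/models/model.joblib")
        self.ready = True

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        # Custom pre/post-processing goes here
        instances = payload["instances"]
        return {"predictions": self.model.predict(instances).tolist()}


if __name__ == "__main__":
    kserve.ModelServer().start([CustomModel("custom-model")])
```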

Dockerfile:
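
A hedged sketch of the image (pin versions for real deployments):

```dockerfile
FROM python:3.11-slim

RUN pip install --no-cache-dir kserve scikit-learn joblib

WORKDIR /app
COPY model.py /app/model.py

# The KServe model server listens on 8080 by default
EXPOSE 8080
CMD ["python", "model.py"]
```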

InferenceService:
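
Reference the image with a container-based predictor (image name is a placeholder):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model
spec:
  predictor:
    containers:
      - name: kserve-container
        image: myregistry/custom-model:latest   # placeholder image name
```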

Advanced Features

Auto-Scaling

KServe automatically scales based on traffic.

Configure scaling:
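
A sketch of the relevant predictor fields; scaleTarget/scaleMetric exist in recent KServe releases, while older versions rely on Knative autoscaling annotations instead:

```yaml
# Fragment of the InferenceService spec
spec:
  predictor:
    minReplicas: 0          # 0 allows scale-to-zero
    maxReplicas: 10
    scaleTarget: 10         # target concurrent requests per replica
    scaleMetric: concurrency
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/models/sklearn/iris"
```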

Scale to zero: with minReplicas: 0, pods terminate after a period of inactivity (60s by default). The first request after scale-down incurs cold-start latency while a new pod starts and loads the model.

Canary Deployments

Gradually roll out new models:
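
Setting canaryTrafficPercent on an updated InferenceService splits traffic between the previous ready revision and the new one; a sketch (the v2 path is a placeholder):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    canaryTrafficPercent: 10            # 10% of requests hit the new revision
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/models/sklearn/iris-v2"
```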

Monitor canary performance, then:

  • Increase percentage if performing well

  • Roll back if issues detected

Model Versioning

Serve multiple versions:
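
One hedged pattern is one InferenceService per version, with the version encoded in both the name and the storage path (names and bucket are placeholders):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-v1
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/models/sklearn/v1"
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-v2
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/models/sklearn/v2"
```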

Request specific version:
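
Each InferenceService gets its own hostname, so clients select a version via the Host header (continuing the placeholders above):

```bash
V2_HOST=$(kubectl get inferenceservice sklearn-iris-v2 -o jsonpath='{.status.url}' | cut -d "/" -f 3)

curl -H "Host: ${V2_HOST}" -H "Content-Type: application/json" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris-v2:predict" \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'
```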

Request/Response Logging

Enable prediction logging:
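
A sketch; the logger posts each request/response as a CloudEvent to the URL you configure (the sink shown is a placeholder):

```yaml
# Fragment of the InferenceService spec
spec:
  predictor:
    logger:
      mode: all             # request, response, or all
      url: http://message-dumper.default.svc.cluster.local   # placeholder sink endpoint
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/models/sklearn/iris"
```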

Logged requests and responses are sent to the configured sink endpoint (which can forward to Kafka, CloudWatch, etc.) for:

  • Monitoring predictions

  • Detecting data drift

  • Debugging issues

  • Compliance/audit

Performance Optimization

Batching

Process multiple requests together:
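
A sketch of KServe's built-in batcher (values are illustrative; tune them against your latency budget):

```yaml
# Fragment of the InferenceService spec
spec:
  predictor:
    batcher:
      maxBatchSize: 32      # upper bound on requests merged into one batch
      maxLatency: 500       # max milliseconds to wait while a batch fills
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/models/sklearn/iris"
```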

Batching improves throughput, especially for GPU-backed models, at the cost of a little extra per-request latency while a batch fills.

GPU Acceleration
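
To serve on GPUs, request the GPU resource on the predictor container (this assumes the cluster exposes nvidia.com/gpu via the NVIDIA device plugin); a sketch:

```yaml
# Fragment of the InferenceService spec
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://my-bucket/models/pytorch/classifier"
      resources:
        limits:
          nvidia.com/gpu: "1"
```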

Model Optimization

Before deploying:
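
One hedged example: dynamic quantization of a PyTorch model stores Linear weights as int8, which often cuts memory use and CPU inference latency; measure before and after, since gains depend on the model:

```python
import torch
import torch.nn as nn

# A toy module standing in for your trained network
model = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
model.eval()

# Dynamic quantization: weights stored as int8, activations quantized at runtime
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
torch.save(quantized.state_dict(), "model-quantized.pt")
```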

Monitoring Inference Services

Check Status
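
The service name and pod label below are the ones used in the earlier examples (KServe labels predictor pods with serving.kserve.io/inferenceservice):

```bash
kubectl get inferenceservice sklearn-iris     # READY, URL, and traffic split
kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris
```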

Metrics

KServe exposes Prometheus metrics for every inference service.

Key metrics:

  • request_count: Total requests

  • request_latency: Response time

  • request_errors: Failed requests
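
Hedged PromQL sketches built from the names above; the exact metric names and labels depend on your KServe/Knative versions and Prometheus scrape configuration:

```
rate(request_count[5m])                                      # throughput
histogram_quantile(0.95, rate(request_latency_bucket[5m]))   # p95 latency
rate(request_errors[5m]) / rate(request_count[5m])           # error ratio
```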

Performance Testing
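
A quick load test with the hey generator (https://github.com/rakyll/hey) against the same placeholder endpoint used above; watch p95/p99 latency and the error count in the summary:

```bash
echo '{"instances": [[6.8, 2.8, 4.8, 1.4]]}' > payload.json

# 2000 requests at concurrency 20 against the predict endpoint
hey -n 2000 -c 20 -m POST \
  -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  -D payload.json \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict"
```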

Common Issues

Service Not Ready

Check events:
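
Start with the InferenceService conditions and events (service name and pod label as in the examples above):

```bash
kubectl describe inferenceservice sklearn-iris          # conditions explain what failed
kubectl get events --sort-by=.lastTimestamp | grep sklearn-iris
kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris
kubectl describe pod <predictor-pod-name>               # image pull / scheduling errors
```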

Common causes:

  • Storage URI incorrect or inaccessible

  • Model file format issues

  • Insufficient resources

  • Image pull errors

Slow Cold Starts

Problem: First request after scale-to-zero is slow

Solutions:

  1. Set minReplicas: 1 (KServe's equivalent of Knative's minScale annotation) to keep one replica warm

  2. Use smaller base images

  3. Optimize model loading code

High Latency

Debug:
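
A couple of hedged starting points, using the same placeholders as above (kubectl top requires metrics-server):

```bash
# Where is the time going? curl can break the request down
curl -s -o /dev/null \
  -w "connect: %{time_connect}s  first-byte: %{time_starttransfer}s  total: %{time_total}s\n" \
  -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  -d @payload.json \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict"

# Are the predictor pods CPU- or memory-bound?
kubectl top pods -l serving.kserve.io/inferenceservice=sklearn-iris
```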

Optimize: Focus on the slowest component.

Best Practices

1. Version Models Explicitly

2. Use Health Checks

Models should validate on startup:
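
A hedged sketch reusing the KServe SDK pattern from the custom server above: load() runs a sanity prediction before setting ready, so the pod does not report healthy (or receive traffic) until the model actually works.

```python
from typing import Dict

import joblib
import kserve


class ValidatedModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.load()

    def load(self):
        self.model = joblib.load("/mnt/models/model.joblib")
        # Sanity check with a known-good input shape before reporting ready
        assert len(self.model.predict([[6.8, 2.8, 4.8, 1.4]])) == 1
        self.ready = True

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        return {"predictions": self.model.predict(payload["instances"]).tolist()}


if __name__ == "__main__":
    kserve.ModelServer().start([ValidatedModel("validated-model")])
```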

3. Set Resource Limits
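
Requests and limits on the predictor keep one hungry model from starving the rest of the cluster; a fragment with illustrative values:

```yaml
# Fragment of the InferenceService spec
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://my-bucket/models/sklearn/iris"
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
```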

4. Monitor Everything

  • Request latency

  • Error rate

  • Model metrics (accuracy if ground truth available)

  • Resource usage

5. Plan for Failures

Key Takeaways

  1. KServe simplifies model serving on Kubernetes

  2. Supports multiple frameworks out-of-the-box

  3. Auto-scaling handles variable traffic

  4. Canary deployments enable safe rollouts

  5. Monitor latency, errors, and resource usage

Next Steps

With models deployed, we need to track them. In Model Registry, we'll learn how to manage model versions, metadata, and lineage.


