Part 5: Production Deployment and Best Practices

Part of the PyTorch 101 Series

My First Production Deployment

Got a model to 95% validation accuracy. Felt ready to deploy.

Put it in production behind a REST API. First day: 200ms average latency. Not great but acceptable.

Second day: API timeouts, angry users, emergency rollback.

What happened? Memory leak from not properly managing PyTorch tensors in the API. Each request created tensors that weren't released.

Spent 2 days debugging. Then learned: Production deployment is completely different from training.

Let me save you those 2 days.

Model Saving and Loading

Basic Save/Load

import torch

# Save entire model (not recommended)
torch.save(model, 'model.pth')
loaded_model = torch.load('model.pth')

# Save state dict (recommended)
torch.save(model.state_dict(), 'model_weights.pth')

# Load state dict
model = MyModel()
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()

Always use state_dict() - saving the whole model pickles the class definition, so loading breaks when your code moves or changes. The state dict is just the weights: more portable and flexible.

Production Checkpoint

I save checkpoints every epoch - can resume training if interrupted.
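
A resumable checkpoint needs more than the weights. Something like this (the exact fields and file name are up to you):

checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, f'checkpoint_epoch_{epoch}.pth')

# Resuming later
checkpoint = torch.load('checkpoint_epoch_10.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1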

Cross-Device Loading
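
A model trained on GPU won't load on a CPU-only server unless you remap the tensors. torch.load's map_location argument handles it - a quick sketch:

# Load weights saved on GPU onto whatever device is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = MyModel()
model.load_state_dict(torch.load('model_weights.pth', map_location=device))
model.to(device)
model.eval()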

Model Optimization for Inference

TorchScript

TorchScript converts a PyTorch model into an optimized, serialized format - faster inference, and the result can be loaded without a Python interpreter (for example from C++ via libtorch).
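
Tracing records the operations run on one example input. Roughly (the input shape here is a placeholder - use whatever your model expects):

model.eval()
example_input = torch.randn(1, 3, 224, 224)   # placeholder shape

traced = torch.jit.trace(model, example_input)
traced.save('model_traced.pt')

# In production
traced = torch.jit.load('model_traced.pt')
with torch.no_grad():
    output = traced(example_input)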

Tracing works for most models, but it only records the single path taken by the example input. For models with data-dependent control flow (an if or loop in forward), use torch.jit.script instead.
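
A toy example just to show the branch:

class GatedModel(torch.nn.Module):
    def forward(self, x):
        if x.sum() > 0:        # data-dependent branch - tracing would freeze one path
            return x * 2
        return x - 1

scripted = torch.jit.script(GatedModel())
scripted.save('model_scripted.pt')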

I use TorchScript for production - typically 1.5-2x faster than eager mode.

ONNX Export

ONNX = Open Neural Network Exchange, an interchange format for deploying to non-PyTorch runtimes (ONNX Runtime, TensorRT, OpenVINO, etc.).
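
Export is a single call. Roughly (the dummy input shape and the 'input'/'output' names are placeholders):

model.eval()
dummy_input = torch.randn(1, 3, 224, 224)   # placeholder shape

torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}},   # allow variable batch size
)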

Use ONNX Runtime for inference:
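
Something like this - the 'input' key has to match the input_names used in the export above:

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession('model.onnx')
inputs = {'input': np.random.randn(1, 3, 224, 224).astype(np.float32)}
outputs = session.run(None, inputs)   # None means "return all outputs"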

ONNX Runtime is often faster than PyTorch for CPU inference.

Quantization

Reduce model size and increase speed by using 8-bit integers instead of 32-bit floats.

Dynamic Quantization

Dynamic quantization works for LSTM/Transformer - typically 2-4x smaller, 2-3x faster.
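
It's basically a one-liner on an already-trained model. A sketch that quantizes the Linear and LSTM layers:

import torch.nn as nn

quantized_model = torch.quantization.quantize_dynamic(
    model,                  # the trained float32 model
    {nn.Linear, nn.LSTM},   # layer types to quantize
    dtype=torch.qint8,      # 8-bit integer weights
)
torch.save(quantized_model.state_dict(), 'model_quantized.pth')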

Static Quantization (More Advanced)

Static quantization is more complex but faster - I use it for mobile deployment.
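
The eager-mode recipe, roughly: attach a qconfig, insert observers, calibrate on representative data, then convert. A sketch - the model also needs QuantStub/DeQuantStub around its forward pass, and calibration_loader stands for whatever sample of real inputs you have:

import torch.quantization as tq

model.eval()
model.qconfig = tq.get_default_qconfig('fbgemm')   # x86; use 'qnnpack' for ARM/mobile
tq.prepare(model, inplace=True)                    # insert observers

with torch.no_grad():                              # calibration pass
    for inputs, _ in calibration_loader:
        model(inputs)

tq.convert(model, inplace=True)                    # swap in quantized modules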

REST API with FastAPI

Serve model via HTTP API. Here's my production setup:
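
In outline it looks something like this - the model class, weight file, and request schema are placeholders to swap for your own:

from typing import List

import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the model once at startup, not per request
model = MyModel()
model.load_state_dict(torch.load('model_weights.pth', map_location='cpu'))
model.eval()

class PredictRequest(BaseModel):
    features: List[float]

@app.post('/predict')
def predict(request: PredictRequest):
    x = torch.tensor(request.features).unsqueeze(0)   # shape (1, num_features)
    with torch.no_grad():                             # no autograd graph, no leak
        output = model(x)
    return {'prediction': output.squeeze(0).tolist()}

@app.get('/health')
def health():
    return {'status': 'ok'}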

Run server:
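
Assuming the app above is saved as app.py:

uvicorn app:app --host 0.0.0.0 --port 8000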

Test:
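
For example, with curl (the number of features has to match your model):

curl -X POST http://localhost:8000/predict \
     -H "Content-Type: application/json" \
     -d '{"features": [0.1, 0.2, 0.3, 0.4]}'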

Batch Inference API

For better throughput:
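
The simplest version accepts many samples per request and runs them through the model as one batch. A sketch that extends the app above:

class BatchRequest(BaseModel):
    samples: List[List[float]]   # many inputs in one call

@app.post('/predict_batch')
def predict_batch(request: BatchRequest):
    batch = torch.tensor(request.samples)   # shape (batch_size, num_features)
    with torch.no_grad():
        outputs = model(batch)              # one forward pass for the whole batch
    return {'predictions': outputs.tolist()}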

Batch processing gives 3-5x higher throughput in production.

Docker Deployment

Containerize for consistent deployment:
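
A minimal Dockerfile, as a sketch (pin the versions you actually tested):

FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]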

requirements.txt:
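
At minimum (pin exact versions for a real deployment):

torch
fastapi
uvicorn
pydantic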

Build and run:
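
With a throwaway image name like pytorch-api:

docker build -t pytorch-api .
docker run -p 8000:8000 pytorch-api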

With GPU:
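
Swap the base image for one that ships CUDA-enabled PyTorch - the tag below is only an example, match it to your PyTorch/CUDA versions:

FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime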

Run with GPU:
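
This needs the NVIDIA Container Toolkit installed on the host:

docker run --gpus all -p 8000:8000 pytorch-api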

Performance Optimization

Memory Management
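
This is where my day-two outage came from: tensors created per request and never released. The habits that prevent it, as a sketch:

# 1. Never track gradients during inference
with torch.no_grad():                      # or torch.inference_mode() on recent PyTorch
    output = model(input_tensor)

# 2. Move results off the GPU and drop references you no longer need
result = output.cpu()
del output, input_tensor

# 3. Optionally release cached GPU memory back to the driver
if torch.cuda.is_available():
    torch.cuda.empty_cache()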

Batch Size Tuning
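
The goal is the largest batch that fits in memory and still meets your latency budget. A crude benchmark loop, as a sketch (the input shape is a placeholder):

import time

def benchmark(model, batch_size, n_iters=20):
    x = torch.randn(batch_size, 10)        # placeholder input shape
    with torch.no_grad():
        model(x)                           # warm-up
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
    return (time.perf_counter() - start) / n_iters

for bs in [1, 8, 16, 32, 64]:
    latency = benchmark(model, bs)
    print(f'batch={bs:3d}  latency={latency * 1000:.1f} ms  '
          f'throughput={bs / latency:.0f} samples/s')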

Multi-GPU Inference
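
One common pattern for inference (a sketch, not the only option): one model replica per GPU, with requests spread across them round-robin:

import itertools

num_gpus = torch.cuda.device_count()
replicas = []
for i in range(num_gpus):
    replica = MyModel()
    replica.load_state_dict(torch.load('model_weights.pth', map_location=f'cuda:{i}'))
    replicas.append(replica.to(f'cuda:{i}').eval())

device_cycle = itertools.cycle(range(num_gpus))   # round-robin assignment

def predict(x):
    i = next(device_cycle)
    with torch.no_grad():
        return replicas[i](x.to(f'cuda:{i}')).cpu()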

Monitoring

Performance Metrics
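
At a minimum, record per-request latency and look at percentiles rather than the mean - the p99 is what your unhappiest users see. A sketch:

import time
from collections import deque

latencies = deque(maxlen=10_000)        # rolling window of recent requests

def timed_predict(x):
    start = time.perf_counter()
    with torch.no_grad():
        output = model(x)
    latencies.append(time.perf_counter() - start)
    return output

def latency_ms(percentile):
    data = sorted(latencies)
    idx = min(len(data) - 1, int(len(data) * percentile / 100))
    return data[idx] * 1000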

Logging
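
Log enough to reconstruct what happened when something goes wrong - at least the input shape, the latency, and the full traceback on failure. A sketch with the standard logging module:

import logging
import time

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
logger = logging.getLogger('inference')

def predict_with_logging(x):
    start = time.perf_counter()
    try:
        with torch.no_grad():
            output = model(x)
        logger.info('ok shape=%s latency_ms=%.1f',
                    tuple(x.shape), (time.perf_counter() - start) * 1000)
        return output
    except Exception:
        logger.exception('inference failed for input shape %s', tuple(x.shape))
        raise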

Production Checklist

From my deployments:

Before deploying:

✅ Model evaluation on test set
✅ Benchmark inference latency
✅ Memory profiling
✅ Error handling for edge cases
✅ Input validation
✅ Rate limiting
✅ Health check endpoint
✅ Logging and monitoring
✅ Model versioning
✅ Rollback plan

Code:

Best Practices Summary

1. Always use model.eval() for inference:
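
model.eval()              # dropout off, BatchNorm uses running statistics
with torch.no_grad():     # no autograd bookkeeping during inference
    output = model(input_tensor)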

2. Load model once, reuse:
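
# At process startup, not inside the request handler
model = MyModel()
model.load_state_dict(torch.load('model_weights.pth', map_location='cpu'))
model.eval()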

3. Batch requests when possible:
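
# pending_inputs: a list of input tensors waiting to be served (placeholder name)
batch = torch.stack(pending_inputs)   # one forward pass instead of N
with torch.no_grad():
    outputs = model(batch)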

4. Use appropriate device:
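
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
inputs = inputs.to(device)            # model and inputs must be on the same device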

5. Profile before optimizing:
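
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(example_input)          # example_input: a representative batch
print(prof.key_averages().table(sort_by='cpu_time_total', row_limit=10))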

Real Production Architecture

Here's my actual production setup:

This handles 200+ req/s with p99 latency under 100ms.

You've Completed PyTorch 101!

Congratulations! You now know:

  • PyTorch fundamentals and tensors

  • Automatic differentiation

  • Building neural networks

  • Training and optimization

  • Production deployment

What's next?

  • Build real projects

  • Experiment with architectures

  • Read PyTorch documentation

  • Join PyTorch community

The best way to learn is by doing. Start building!


Previous: Part 4 - Training and Optimization
Series Home: PyTorch 101 Overview

This article is part of the PyTorch 101 series. All examples use Python 3 and are based on real projects.
