Part 5: Production Deployment and Best Practices

Part of the PyTorch 101 Series

My First Production Deployment

Got a model to 95% validation accuracy. Felt ready to deploy.

Put it in production behind a REST API. First day: 200ms average latency. Not great but acceptable.

Second day: API timeouts, angry users, emergency rollback.

What happened? Memory leak from not properly managing PyTorch tensors in the API. Each request created tensors that weren't released.

Spent 2 days debugging. Then learned: Production deployment is completely different from training.

Let me save you those 2 days.

Model Saving and Loading

Basic Save/Load

import torch

# Save entire model (not recommended)
torch.save(model, 'model.pth')
loaded_model = torch.load('model.pth')

# Save state dict (recommended)
torch.save(model.state_dict(), 'model_weights.pth')

# Load state dict
model = MyModel()
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()

Always use state_dict() - saving the whole model pickles the class definition, so loading breaks when your code moves or changes. The state dict is just the weights: more portable and flexible.

Production Checkpoint

I save checkpoints every epoch - can resume training if interrupted.
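
A resumable checkpoint needs more than the weights. Something like this (the exact fields and file name are up to you):

checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, f'checkpoint_epoch_{epoch}.pth')

# Resuming later
checkpoint = torch.load('checkpoint_epoch_10.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1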

Cross-Device Loading
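
A model trained on GPU won't load on a CPU-only server unless you remap the tensors. torch.load's map_location argument handles it - a quick sketch:

# Load weights saved on GPU onto whatever device is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = MyModel()
model.load_state_dict(torch.load('model_weights.pth', map_location=device))
model.to(device)
model.eval()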

Model Optimization for Inference

TorchScript

TorchScript converts a PyTorch model into an optimized, serialized format - faster inference, and the result can be loaded without a Python interpreter (for example from C++ via libtorch).
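
Tracing records the operations run on one example input. Roughly (the input shape here is a placeholder - use whatever your model expects):

model.eval()
example_input = torch.randn(1, 3, 224, 224)   # placeholder shape

traced = torch.jit.trace(model, example_input)
traced.save('model_traced.pt')

# In production
traced = torch.jit.load('model_traced.pt')
with torch.no_grad():
    output = traced(example_input)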

Tracing works for most models, but it only records the single path taken by the example input. For models with data-dependent control flow (an if or loop in forward), use torch.jit.script instead.
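
A toy example just to show the branch:

class GatedModel(torch.nn.Module):
    def forward(self, x):
        if x.sum() > 0:        # data-dependent branch - tracing would freeze one path
            return x * 2
        return x - 1

scripted = torch.jit.script(GatedModel())
scripted.save('model_scripted.pt')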

I use TorchScript for production - typically 1.5-2x faster than eager mode.

ONNX Export

ONNX = Open Neural Network Exchange, an interchange format for deploying to non-PyTorch runtimes (ONNX Runtime, TensorRT, OpenVINO, etc.).
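
Export is a single call. Roughly (the dummy input shape and the 'input'/'output' names are placeholders):

model.eval()
dummy_input = torch.randn(1, 3, 224, 224)   # placeholder shape

torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}},   # allow variable batch size
)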

Use ONNX Runtime for inference:
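
Something like this - the 'input' key has to match the input_names used in the export above:

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession('model.onnx')
inputs = {'input': np.random.randn(1, 3, 224, 224).astype(np.float32)}
outputs = session.run(None, inputs)   # None means "return all outputs"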

ONNX Runtime is often faster than PyTorch for CPU inference.

Quantization

Reduce model size and increase speed by using 8-bit integers instead of 32-bit floats.

Dynamic Quantization

Dynamic quantization works for LSTM/Transformer - typically 2-4x smaller, 2-3x faster.
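
It's basically a one-liner on an already-trained model. A sketch that quantizes the Linear and LSTM layers:

import torch.nn as nn

quantized_model = torch.quantization.quantize_dynamic(
    model,                  # the trained float32 model
    {nn.Linear, nn.LSTM},   # layer types to quantize
    dtype=torch.qint8,      # 8-bit integer weights
)
torch.save(quantized_model.state_dict(), 'model_quantized.pth')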

Static Quantization (More Advanced)

Static quantization is more complex but faster - I use it for mobile deployment.
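
The eager-mode recipe, roughly: attach a qconfig, insert observers, calibrate on representative data, then convert. A sketch - the model also needs QuantStub/DeQuantStub around its forward pass, and calibration_loader stands for whatever sample of real inputs you have:

import torch.quantization as tq

model.eval()
model.qconfig = tq.get_default_qconfig('fbgemm')   # x86; use 'qnnpack' for ARM/mobile
tq.prepare(model, inplace=True)                    # insert observers

with torch.no_grad():                              # calibration pass
    for inputs, _ in calibration_loader:
        model(inputs)

tq.convert(model, inplace=True)                    # swap in quantized modules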

REST API with FastAPI

Serve model via HTTP API. Here's my production setup:
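
In outline it looks something like this - the model class, weight file, and request schema are placeholders to swap for your own:

from typing import List

import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the model once at startup, not per request
model = MyModel()
model.load_state_dict(torch.load('model_weights.pth', map_location='cpu'))
model.eval()

class PredictRequest(BaseModel):
    features: List[float]

@app.post('/predict')
def predict(request: PredictRequest):
    x = torch.tensor(request.features).unsqueeze(0)   # shape (1, num_features)
    with torch.no_grad():                             # no autograd graph, no leak
        output = model(x)
    return {'prediction': output.squeeze(0).tolist()}

@app.get('/health')
def health():
    return {'status': 'ok'}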

Run server:
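
Assuming the app above is saved as app.py:

uvicorn app:app --host 0.0.0.0 --port 8000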

Test:
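
For example, with curl (the number of features has to match your model):

curl -X POST http://localhost:8000/predict \
     -H "Content-Type: application/json" \
     -d '{"features": [0.1, 0.2, 0.3, 0.4]}'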

Batch Inference API

For better throughput:
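
The simplest version accepts many samples per request and runs them through the model as one batch. A sketch that extends the app above:

class BatchRequest(BaseModel):
    samples: List[List[float]]   # many inputs in one call

@app.post('/predict_batch')
def predict_batch(request: BatchRequest):
    batch = torch.tensor(request.samples)   # shape (batch_size, num_features)
    with torch.no_grad():
        outputs = model(batch)              # one forward pass for the whole batch
    return {'predictions': outputs.tolist()}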

Batch processing gives 3-5x higher throughput in production.

Docker Deployment

Containerize for consistent deployment:
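
A minimal Dockerfile, as a sketch (pin the versions you actually tested):

FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]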

requirements.txt:
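
At minimum (pin exact versions for a real deployment):

torch
fastapi
uvicorn
pydantic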

Build and run:
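
With a throwaway image name like pytorch-api:

docker build -t pytorch-api .
docker run -p 8000:8000 pytorch-api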

With GPU:
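
Swap the base image for one that ships CUDA-enabled PyTorch - the tag below is only an example, match it to your PyTorch/CUDA versions:

FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime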

Run with GPU:
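
This needs the NVIDIA Container Toolkit installed on the host:

docker run --gpus all -p 8000:8000 pytorch-api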

Performance Optimization

Memory Management
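
This is where my day-two outage came from: tensors created per request and never released. The habits that prevent it, as a sketch:

# 1. Never track gradients during inference
with torch.no_grad():                      # or torch.inference_mode() on recent PyTorch
    output = model(input_tensor)

# 2. Move results off the GPU and drop references you no longer need
result = output.cpu()
del output, input_tensor

# 3. Optionally release cached GPU memory back to the driver
if torch.cuda.is_available():
    torch.cuda.empty_cache()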

Batch Size Tuning
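
The goal is the largest batch that fits in memory and still meets your latency budget. A crude benchmark loop, as a sketch (the input shape is a placeholder):

import time

def benchmark(model, batch_size, n_iters=20):
    x = torch.randn(batch_size, 10)        # placeholder input shape
    with torch.no_grad():
        model(x)                           # warm-up
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
    return (time.perf_counter() - start) / n_iters

for bs in [1, 8, 16, 32, 64]:
    latency = benchmark(model, bs)
    print(f'batch={bs:3d}  latency={latency * 1000:.1f} ms  '
          f'throughput={bs / latency:.0f} samples/s')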

Multi-GPU Inference
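
One common pattern for inference (a sketch, not the only option): one model replica per GPU, with requests spread across them round-robin:

import itertools

num_gpus = torch.cuda.device_count()
replicas = []
for i in range(num_gpus):
    replica = MyModel()
    replica.load_state_dict(torch.load('model_weights.pth', map_location=f'cuda:{i}'))
    replicas.append(replica.to(f'cuda:{i}').eval())

device_cycle = itertools.cycle(range(num_gpus))   # round-robin assignment

def predict(x):
    i = next(device_cycle)
    with torch.no_grad():
        return replicas[i](x.to(f'cuda:{i}')).cpu()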

Monitoring

Performance Metrics
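
At a minimum, record per-request latency and look at percentiles rather than the mean - the p99 is what your unhappiest users see. A sketch:

import time
from collections import deque

latencies = deque(maxlen=10_000)        # rolling window of recent requests

def timed_predict(x):
    start = time.perf_counter()
    with torch.no_grad():
        output = model(x)
    latencies.append(time.perf_counter() - start)
    return output

def latency_ms(percentile):
    data = sorted(latencies)
    idx = min(len(data) - 1, int(len(data) * percentile / 100))
    return data[idx] * 1000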

Logging
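
Log enough to reconstruct what happened when something goes wrong - at least the input shape, the latency, and the full traceback on failure. A sketch with the standard logging module:

import logging
import time

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
logger = logging.getLogger('inference')

def predict_with_logging(x):
    start = time.perf_counter()
    try:
        with torch.no_grad():
            output = model(x)
        logger.info('ok shape=%s latency_ms=%.1f',
                    tuple(x.shape), (time.perf_counter() - start) * 1000)
        return output
    except Exception:
        logger.exception('inference failed for input shape %s', tuple(x.shape))
        raise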

Production Checklist

From my deployments:

Before deploying:

✅ Model evaluation on test set
✅ Benchmark inference latency
✅ Memory profiling
✅ Error handling for edge cases
✅ Input validation
✅ Rate limiting
✅ Health check endpoint
✅ Logging and monitoring
✅ Model versioning
✅ Rollback plan

Code:

Best Practices Summary

1. Always use model.eval() for inference:
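
model.eval()              # dropout off, BatchNorm uses running statistics
with torch.no_grad():     # no autograd bookkeeping during inference
    output = model(input_tensor)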

2. Load model once, reuse:
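
# At process startup, not inside the request handler
model = MyModel()
model.load_state_dict(torch.load('model_weights.pth', map_location='cpu'))
model.eval()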

3. Batch requests when possible:
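
# pending_inputs: a list of input tensors waiting to be served (placeholder name)
batch = torch.stack(pending_inputs)   # one forward pass instead of N
with torch.no_grad():
    outputs = model(batch)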

4. Use appropriate device:
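
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
inputs = inputs.to(device)            # model and inputs must be on the same device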

5. Profile before optimizing:
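
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(example_input)          # example_input: a representative batch
print(prof.key_averages().table(sort_by='cpu_time_total', row_limit=10))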

Real Production Architecture

Here's my actual production setup:

This handles 200+ req/s with p99 latency under 100ms.

You've Completed PyTorch 101!

Congratulations! You now know:

  • PyTorch fundamentals and tensors

  • Automatic differentiation

  • Building neural networks

  • Training and optimization

  • Production deployment

What's next?

  • Build real projects

  • Experiment with architectures

  • Read PyTorch documentation

  • Join PyTorch community

The best way to learn is by doing. Start building!


Previous: Part 4 - Training and Optimization
Series Home: PyTorch 101 Overview

This article is part of the PyTorch 101 series. All examples use Python 3 and are based on real projects.
