Part 5: Production Deployment and Optimization
Part of the Hugging Face Transformers 101 Series
From Notebook to Production
I'll never forget my first production deployment disaster:
Model worked perfectly in Jupyter notebooks
Deployed to production... and crashed within an hour
5-second response times (users expected < 500ms)
Memory leaks killed the server
No monitoring - we were flying blind
Production is different from experimentation. Performance, reliability, and cost matter.
After deploying dozens of models, here's what I learned about production ML systems.
Production Checklist
Before deploying, ensure:
✓ Model Performance
Acceptable accuracy on test set
Tested on edge cases and adversarial examples
Latency meets requirements (< 500ms for real-time)
✓ Infrastructure
Proper error handling and logging
Health checks and monitoring
Auto-scaling configured
Backup and rollback plan
✓ Cost Optimization
Model size minimized (quantization, distillation)
Batch processing where possible
Efficient serving infrastructure
✓ Security
Input validation and sanitization
Rate limiting
Authentication/authorization
Data privacy compliance
Let's build a production system.
Serving Models with FastAPI
FastAPI is my go-to for serving ML models.
Basic API Server
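A minimal sketch of a FastAPI server wrapping a sentiment-analysis pipeline (the model name and field names are illustrative):

```python
# app.py - minimal FastAPI server around a transformers pipeline
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load the model once at startup, not on every request
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(request: PredictRequest):
    result = classifier(request.text)[0]
    return {"label": result["label"], "score": result["score"]}
```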
Run it:
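Assuming the code above is saved as app.py:

```bash
uvicorn app:app --host 0.0.0.0 --port 8000
```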
Test it:
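A quick smoke test with curl (the payload field matches the request model above):

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "I love this product!"}'
```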
Production-Ready API
With proper error handling, logging, and monitoring:
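A sketch of what that can look like - metric names, size limits, and the model are placeholders, and a real service would split this across modules:

```python
import logging
import time
from typing import List

from fastapi import FastAPI, HTTPException
from fastapi.responses import Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest
from pydantic import BaseModel, Field
from transformers import pipeline

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("sentiment-api")

app = FastAPI(title="Sentiment API")
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

REQUESTS = Counter("predict_requests_total", "Prediction requests", ["status"])
LATENCY = Histogram("predict_latency_seconds", "Prediction latency")

class PredictRequest(BaseModel):
    # Reject empty or oversized inputs before they reach the model
    text: str = Field(..., min_length=1, max_length=2000)

class BatchRequest(BaseModel):
    texts: List[str]

@app.post("/predict")
def predict(request: PredictRequest):
    start = time.time()
    try:
        result = classifier(request.text)[0]
        REQUESTS.labels(status="ok").inc()
        return {"label": result["label"], "score": result["score"]}
    except Exception:
        REQUESTS.labels(status="error").inc()
        logger.exception("Prediction failed")
        raise HTTPException(status_code=500, detail="Inference error")
    finally:
        LATENCY.observe(time.time() - start)

@app.post("/predict/batch")
def predict_batch(request: BatchRequest):
    # Batching amortizes per-call overhead for high-throughput clients
    if not request.texts or len(request.texts) > 32:
        raise HTTPException(status_code=422, detail="Send 1-32 texts per batch")
    results = classifier(request.texts, batch_size=len(request.texts))
    return [{"label": r["label"], "score": r["score"]} for r in results]

@app.get("/health")
def health():
    return {"status": "ok"}

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```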
Features:
✓ Input validation
✓ Error handling
✓ Logging
✓ Prometheus metrics
✓ Batch processing
✓ Health checks
Docker Deployment
Dockerfile:
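A minimal CPU-only image for the API above (base image and file names are assumptions to adapt):

```dockerfile
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .

EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```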
requirements.txt:
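A matching requirements file - pin exact versions in a real deployment:

```text
fastapi
uvicorn[standard]
transformers
torch
prometheus-client
```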
Build and run:
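With the image tag as a placeholder:

```bash
docker build -t sentiment-api .
docker run -p 8000:8000 sentiment-api
```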
Model Optimization for Production
1. Quantization
Reduce model size and increase speed:
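A minimal sketch using PyTorch's post-training dynamic quantization, which works well for CPU serving (for GPU serving, 8-bit loading with bitsandbytes is a common alternative); the model name is just an example:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Dynamic quantization: weights stored as int8, activations quantized on the fly.
# No calibration data needed; aimed at CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("Quantization keeps latency down.", return_tensors="pt")
with torch.no_grad():
    logits = quantized_model(**inputs).logits
```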
2. ONNX Export
For maximum performance:
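One way to export is through the optimum library (pip install optimum[onnxruntime]); the model name and output directory are placeholders:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# export=True converts the PyTorch checkpoint to ONNX on the fly
ort_model = ORTModelForSequenceClassification.from_pretrained(model_name, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

ort_model.save_pretrained("onnx_model")
tokenizer.save_pretrained("onnx_model")
```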
Use ONNX Runtime for inference:
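A sketch that loads the exported file directly with onnxruntime (model.onnx is what the export above writes):

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("onnx_model")
session = ort.InferenceSession(
    "onnx_model/model.onnx", providers=["CPUExecutionProvider"]
)

# Tokenize to numpy arrays; the ONNX graph expects input_ids and attention_mask
inputs = tokenizer("ONNX Runtime keeps CPU latency low.", return_tensors="np")
logits = session.run(None, dict(inputs))[0]
prediction = int(np.argmax(logits, axis=-1)[0])
```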
ONNX Runtime is typically 2-5x faster than eager-mode PyTorch, especially on CPU.
3. TorchScript
Alternative to ONNX:
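A tracing sketch: torchscript=True configures the model to return tuples so it can be traced, and the file name is arbitrary:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name, torchscript=True)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Trace with example inputs; traced graphs freeze control flow and tensor shapes
example = tokenizer("TorchScript example", return_tensors="pt")
traced = torch.jit.trace(model, (example["input_ids"], example["attention_mask"]))
torch.jit.save(traced, "model_traced.pt")

# Later: load and run without the original Python class definitions
loaded = torch.jit.load("model_traced.pt")
with torch.no_grad():
    logits = loaded(example["input_ids"], example["attention_mask"])[0]
```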
4. Model Distillation
Train smaller model to mimic larger one:
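A full distillation run is beyond this post, but the core is the loss: blend a soft-target term (match the teacher's temperature-scaled distribution) with the usual hard-label term. A sketch, with the temperature and mixing weight as tunable assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the soft-target KL term (teacher) with the hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Inside the training loop, the teacher runs frozen and the student is optimized:
# with torch.no_grad():
#     teacher_logits = teacher(**batch).logits
# loss = distillation_loss(student(**batch).logits, teacher_logits, batch["labels"])
```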
Distilled models are typically 40-60% smaller and 2-3x faster while retaining 95%+ of the original model's accuracy.
Batch Processing
For high-throughput scenarios:
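A sketch using the pipeline's built-in batching (batch size and the dummy texts are placeholders):

```python
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1  # GPU if present, else CPU
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=device,
)

texts = ["Great service!", "Terrible experience.", "It was fine."] * 100

# One call with a list plus batch_size runs batched forward passes
# instead of one model call per text.
results = classifier(texts, batch_size=32, truncation=True)
print(len(results), results[0])
```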
Use batch processing whenever the workload allows it.
Dynamic Batching
Collect requests and batch them:
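A minimal asyncio sketch: incoming requests land on a queue, a background worker flushes whenever the batch is full or a short wait expires, and each caller awaits its own future. The batch size, wait time, and names are assumptions:

```python
import asyncio
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

MAX_BATCH = 32
MAX_WAIT = 0.01  # seconds to wait for more requests before flushing

queue: asyncio.Queue = asyncio.Queue()

async def batch_worker():
    while True:
        text, future = await queue.get()
        batch, futures = [text], [future]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT
        # Keep pulling requests until the batch is full or the wait expires
        while len(batch) < MAX_BATCH:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                text, future = await asyncio.wait_for(queue.get(), timeout)
                batch.append(text)
                futures.append(future)
            except asyncio.TimeoutError:
                break
        # One forward pass for the whole batch, off the event loop
        results = await asyncio.to_thread(classifier, batch, batch_size=len(batch))
        for fut, res in zip(futures, results):
            fut.set_result(res)

async def predict(text: str):
    future = asyncio.get_running_loop().create_future()
    await queue.put((text, future))
    return await future
```

In a FastAPI app, batch_worker would be started once, e.g. with asyncio.create_task in a startup hook.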
In production, dynamic batching can improve throughput by 5-10x.
Monitoring and Observability
Prometheus + Grafana
Expose metrics:
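With the prometheus_client package, a counter and a histogram go a long way; metric names and the port are placeholders:

```python
from prometheus_client import Counter, Histogram, start_http_server
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

PREDICTIONS = Counter("predictions_total", "Predictions served", ["status"])
LATENCY = Histogram("prediction_latency_seconds", "Time spent in inference")

# Metrics are served on a separate port: http://localhost:9100/metrics
start_http_server(9100)

@LATENCY.time()
def predict(text: str):
    result = classifier(text)[0]
    PREDICTIONS.labels(status="ok").inc()
    return result
```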
Scrape with Prometheus (prometheus.yml):
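A matching scrape config (job name and target are placeholders):

```yaml
scrape_configs:
  - job_name: "sentiment-api"
    scrape_interval: 15s
    static_configs:
      - targets: ["sentiment-api:9100"]
```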
Visualize in Grafana - build dashboards for request rates, latencies, and error rates.
Application Logging
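Structured (JSON) logs are far easier to search and aggregate than free-form strings. A dependency-free sketch of a JSON formatter:

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line - easy to parse downstream."""
    def format(self, record):
        payload = {
            "ts": time.time(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("sentiment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("model loaded")
```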
Horizontal Scaling
Kubernetes Deployment
deployment.yaml:
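A sketch of a Deployment plus Service - the image name, replica count, resource numbers, and probe path are placeholders to adapt:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sentiment-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sentiment-api
  template:
    metadata:
      labels:
        app: sentiment-api
    spec:
      containers:
        - name: sentiment-api
          image: myregistry/sentiment-api:latest
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "2Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: sentiment-api
spec:
  selector:
    app: sentiment-api
  ports:
    - port: 80
      targetPort: 8000
```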
Deploy:
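```bash
kubectl apply -f deployment.yaml
kubectl get pods -l app=sentiment-api

# Optional: scale replicas on CPU utilization
kubectl autoscale deployment sentiment-api --min=3 --max=10 --cpu-percent=70
```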
Cost Optimization
Strategies I use:
1. Use smaller models:
DistilBERT instead of BERT (about 60% of the size)
TinyBERT for extreme efficiency (roughly 10% of the size)
2. Quantize models:
8-bit: 4x smaller
4-bit: 8x smaller
3. Spot instances (AWS, GCP):
70-90% cheaper
For non-critical workloads
4. Batch processing:
Group requests
Higher throughput per $ spent
5. Cache results:
Identical inputs repeat often - see the caching sketch after this list
6. Model sharing:
One base model, multiple LoRA adapters
Swap adapters instead of models
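A minimal caching sketch: an in-process LRU cache keyed on the input text (for multiple replicas, a shared cache such as Redis is the natural next step):

```python
from functools import lru_cache
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

@lru_cache(maxsize=10_000)
def cached_predict(text: str):
    # Repeated identical inputs hit the cache instead of re-running the model
    result = classifier(text)[0]
    return result["label"], result["score"]
```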
Security Best Practices
1. Input validation - reject empty, malformed, or oversized inputs before they reach the model
2. Rate limiting - cap requests per client so one caller can't exhaust capacity
3. API authentication - require an API key or token on every request
4. HTTPS only - terminate TLS at the load balancer or reverse proxy
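A combined sketch of the first three in FastAPI - the header name, limit, and key handling are illustrative, slowapi is just one rate-limiting option, and real keys belong in a secrets manager:

```python
import os

from fastapi import Depends, FastAPI, HTTPException, Request
from fastapi.security import APIKeyHeader
from pydantic import BaseModel, Field
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

app = FastAPI()

# 2. Rate limiting per client IP
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

# 3. Simple API-key authentication via a request header
api_key_header = APIKeyHeader(name="X-API-Key")

def verify_api_key(api_key: str = Depends(api_key_header)):
    if api_key != os.environ.get("API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid API key")

# 1. Input validation: reject empty or oversized payloads before inference
class PredictRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=2000)

@app.post("/predict", dependencies=[Depends(verify_api_key)])
@limiter.limit("60/minute")
async def predict(request: Request, body: PredictRequest):
    # Model inference goes here; 4. HTTPS is handled by the ingress or proxy
    return {"text_length": len(body.text)}
```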
Lessons from Production
What I learned the hard way:
Monitor everything - if you can't measure it, you can't improve it
Start small - deploy to 5% of traffic first
Have rollback plans - deployments fail, be ready
Test edge cases - empty strings, very long text, special characters
Budget for errors - 99.9% uptime still means roughly 43 minutes of downtime per month
Cache aggressively - most inputs repeat
Use batch processing - 10x throughput improvement
Quantize models - 75% size reduction, minimal accuracy loss
Log structured data - JSON logs for easy parsing
Security is not optional - validate inputs, rate limit, authenticate
What's Next?
Congratulations! You've completed the Hugging Face Transformers 101 series.
You now know:
✓ How to use pre-trained models with pipelines
✓ Understanding of models, tokenizers, and preprocessing
✓ Fine-tuning models on custom data
✓ Advanced techniques (PEFT, quantization, multi-modal)
✓ Production deployment and optimization
Continue your journey:
Explore Hugging Face Hub - 300k+ models
Join Hugging Face Discord - active community
Work through the Hugging Face Course - a free deep dive
Build projects - best way to learn
Thank you for reading!
Previous: Part 4 - Advanced Features and Techniques Back to: Series Overview
This article is part of the Hugging Face Transformers 101 series. Share your feedback and projects!