Part 5: Deployment and Scaling

Part of the LLM API Development 101 Series

My First Production Deployment Disaster

Deployed my chatbot API to AWS. Worked perfectly on my laptop. Tested thoroughly locally.

Production: Immediate crashes. API key not found. Redis connection failed. Environment variables missing.

Spent 6 hours debugging what turned out to be simple configuration issues. The code was fine - infrastructure and deployment were wrong.

Learned the hard way: Deployment is a separate skill. Let me show you how to do it right.

Docker Containerization

Docker packages your application and its dependencies into a single image, so it behaves the same on your laptop as it does in production.

Basic Dockerfile

FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# Expose port
EXPOSE 8000

# Run application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Build and run:
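Assuming an image name of llm-api (my choice; use whatever tag you like) and a local .env file holding your secrets:

# Build the image from the Dockerfile in the current directory
docker build -t llm-api .

# Run it, mapping port 8000 and injecting environment variables
docker run -p 8000:8000 --env-file .env llm-api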

Production Dockerfile

My production-ready Dockerfile:
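It goes roughly like this (a sketch: the /health endpoint, the appuser name, and the worker count of 4 are illustrative choices to adapt):

# Build stage: install dependencies into an isolated prefix
FROM python:3.11-slim AS builder

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Runtime stage: copy only the installed packages (smaller final image)
FROM python:3.11-slim

COPY --from=builder /install /usr/local

WORKDIR /app
COPY . .

# Run as a non-root user (security)
RUN useradd --create-home appuser
USER appuser

EXPOSE 8000

# Container-level health check against the API's health endpoint
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

# Multiple workers to use all available cores
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]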

Key improvements:

  • Multi-stage build (smaller image)

  • Non-root user (security)

  • Health check

  • Multiple workers

requirements.txt
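The exact set depends on your app; a minimal list for the examples in this series might look like this (pin versions in practice):

fastapi
uvicorn[standard]
anthropic
redis
pydantic-settings
prometheus-client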

Docker Compose

For local development with dependencies:
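A sketch of a docker-compose.yml wiring the API to Redis, Prometheus, and Grafana (it assumes a prometheus.yml scrape config sitting next to the compose file):

services:
  api:
    build: .
    ports:
      - "8000:8000"
    env_file: .env
    depends_on:
      - redis

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"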

Run everything:
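# Build the API image and start api, redis, prometheus, and grafana
docker compose up --build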

Access:

  • API: http://localhost:8000

  • Prometheus: http://localhost:9090

  • Grafana: http://localhost:3000

Environment Configuration

Proper secrets management is critical.

.env File (Local Development)
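Something like this (the variable names mirror the settings class below; the values are placeholders, and this file should never be committed):

ANTHROPIC_API_KEY=your-key-here
REDIS_URL=redis://localhost:6379/0
ENVIRONMENT=development
LOG_LEVEL=DEBUG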

Settings Management
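A minimal sketch using pydantic-settings, which reads values from environment variables or a .env file (the field names are my choices):

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Values come from environment variables, falling back to .env
    model_config = SettingsConfigDict(env_file=".env")

    anthropic_api_key: str
    redis_url: str = "redis://localhost:6379/0"
    environment: str = "development"
    log_level: str = "INFO"

settings = Settings()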

Environment-Specific Configs
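One simple approach (a sketch building on the Settings class above, not the only way): subclass the base settings and pick a class based on an ENVIRONMENT variable.

import os
from functools import lru_cache

class DevSettings(Settings):
    log_level: str = "DEBUG"

class ProdSettings(Settings):
    log_level: str = "WARNING"

@lru_cache
def get_settings() -> Settings:
    # Cached so the environment is read only once per process
    env = os.getenv("ENVIRONMENT", "development")
    return ProdSettings() if env == "production" else DevSettings()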

AWS Deployment

Deploy to AWS using ECS.

AWS ECS Task Definition
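A trimmed-down task definition sketch (the account ID, region, and ARNs are placeholders; note the API key comes from Secrets Manager, not the image):

{
  "family": "llm-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "llm-api",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/llm-api:latest",
      "portMappings": [{ "containerPort": 8000 }],
      "secrets": [
        {
          "name": "ANTHROPIC_API_KEY",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:llm-api/anthropic-key"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/llm-api",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "api"
        }
      }
    }
  ]
}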

Deploy Script
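A sketch of a push-and-redeploy script (the registry URL, cluster, and service names are placeholders):

#!/usr/bin/env bash
set -euo pipefail

REGION=us-east-1
REPO=123456789012.dkr.ecr.${REGION}.amazonaws.com/llm-api

# Authenticate Docker with ECR
aws ecr get-login-password --region "$REGION" | \
  docker login --username AWS --password-stdin "$REPO"

# Build and push the image
docker build -t "$REPO:latest" .
docker push "$REPO:latest"

# Force ECS to pull the fresh image and roll the service
aws ecs update-service \
  --cluster llm-api-cluster \
  --service llm-api \
  --force-new-deployment \
  --region "$REGION"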

Terraform Configuration

Infrastructure as code:
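A fragment covering the ECS service (it assumes the task definition, security group, and target group are defined elsewhere in the module):

resource "aws_ecs_cluster" "main" {
  name = "llm-api-cluster"
}

resource "aws_ecs_service" "api" {
  name            = "llm-api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [aws_security_group.api.id]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "llm-api"
    container_port   = 8000
  }
}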

Azure Deployment

Alternative: Deploy to Azure Container Apps.

Azure CLI Deployment
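Using the az CLI (the resource group, environment, and registry names are placeholders):

# Create a Container Apps environment
az containerapp env create \
  --name llm-api-env \
  --resource-group llm-api-rg \
  --location eastus

# Deploy the container, with the API key stored as a secret
az containerapp create \
  --name llm-api \
  --resource-group llm-api-rg \
  --environment llm-api-env \
  --image myregistry.azurecr.io/llm-api:latest \
  --target-port 8000 \
  --ingress external \
  --secrets anthropic-api-key=YOUR_KEY \
  --env-vars ANTHROPIC_API_KEY=secretref:anthropic-api-key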

Load Balancing and Scaling

Auto-scaling Configuration

AWS ECS auto-scaling:
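A target-tracking sketch via the AWS CLI: keep average CPU near 70%, scaling between 2 and 10 tasks (cluster and service names match the earlier examples):

# Register the service as a scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/llm-api-cluster/llm-api \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 10

# Track 70% average CPU utilization
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/llm-api-cluster/llm-api \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration \
    '{"TargetValue": 70.0, "PredefinedMetricSpecification": {"PredefinedMetricType": "ECSServiceAverageCPUUtilization"}}'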

Rate Limiting at Load Balancer

nginx configuration:
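A sketch using nginx's limit_req module: 10 requests/second per client IP with a burst allowance of 20 (tune both numbers to your traffic):

# Shared zone keyed by client IP, 10 requests/second steady state
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    listen 80;

    location / {
        # Allow short bursts; reject the excess immediately with 429
        limit_req zone=api_limit burst=20 nodelay;
        limit_req_status 429;

        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}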

Monitoring and Logging

CloudWatch Logging
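With the awslogs driver in the task definition above, everything your app writes to stdout/stderr lands in the /ecs/llm-api log group. To follow it live:

aws logs tail /ecs/llm-api --follow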

Structured Logging
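Log JSON rather than free text so CloudWatch (or any aggregator) can filter by field. A minimal sketch using only the standard library:

import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])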

Application Performance Monitoring

Using Datadog:
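The simplest route is launching under ddtrace-run, which auto-instruments FastAPI (the service and env names are placeholders):

# pip install ddtrace, then run the app under the Datadog tracer
DD_SERVICE=llm-api DD_ENV=production \
  ddtrace-run uvicorn main:app --host 0.0.0.0 --port 8000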

Production Checklist

Before going live:

Infrastructure:

  • Docker containerization

  • Health checks configured

  • Auto-scaling enabled

  • Load balancer configured

Security:

  • Secrets in secure storage (AWS Secrets Manager/Azure Key Vault)

  • HTTPS enforced

  • Rate limiting implemented

  • Non-root container user

Monitoring:

  • Centralized logging (CloudWatch/Azure Monitor)

  • Metrics collection (Prometheus)

  • Alerting configured

  • APM enabled

Reliability:

  • Circuit breakers implemented

  • Retry logic in place

  • Graceful degradation

  • Error handling

Performance:

  • Caching enabled

  • Token budgets configured

  • Model selection optimized

  • Connection pooling

Documentation:

  • API documentation (Swagger)

  • Deployment runbook

  • Incident response plan

  • Architecture diagrams

Congratulations!

You've completed the LLM API Development 101 series!

You now know how to:

  • ✅ Use Claude API effectively

  • ✅ Build FastAPI applications

  • ✅ Implement streaming responses

  • ✅ Apply production patterns

  • ✅ Deploy and scale

What's next?

  • Build your own LLM application

  • Experiment with different models

  • Optimize for your specific use case

  • Share what you build!

Thank you for following along! 🚀


Previous: Part 4 - Production Patterns and Best Practices
Series Home: LLM API Development 101

This article is part of the LLM API Development 101 series. All examples use Python 3 and FastAPI and are based on real production applications.
