Scalability Patterns


Understanding Scalability

Scalability is about handling growth (more users, more data, more requests) without degrading performance or requiring a complete system redesign. Through building systems that went from hundreds to millions of users, I've learned that scalability isn't just about adding more servers. It's about designing systems that can grow efficiently.

Horizontal vs Vertical Scaling

Vertical Scaling (Scale Up)

Adding more power to existing machines: more CPU, RAM, or disk space.

When I use vertical scaling:

  • PostgreSQL primary database (before considering read replicas)

  • Legacy monolithic applications that can't easily be distributed

  • In-memory caches that need fast access to large datasets

  • When operational simplicity is more important than unlimited scale

Example from a real project:

# Configuration for vertical scaling - PostgreSQL on larger instance
# docker-compose.yml for development
version: '3.8'
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: app_user
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    # Vertical scaling: Allocate more resources
    deploy:
      resources:
        limits:
          cpus: '4.0'      # Increased from 2.0
          memory: 16G      # Increased from 8G
        reservations:
          cpus: '2.0'
          memory: 8G
    volumes:
      - postgres_data:/var/lib/postgresql/data
    # Performance tuning for larger instance
    command: >
      postgres
      -c shared_buffers=4GB
      -c effective_cache_size=12GB
      -c maintenance_work_mem=1GB
      -c max_connections=200
      -c work_mem=20MB

volumes:
  postgres_data:

Limitations I've hit:

  • Hit AWS RDS instance size limits (we needed more than the largest available)

  • Cost grows much faster than capacity (the largest instances carry a steep price premium)

  • Still a single point of failure

  • Downtime required for upgrades

Horizontal Scaling (Scale Out)

Adding more machines to distribute the load.

When I use horizontal scaling:

  • Stateless API servers

  • Worker processes for background jobs

  • Read replicas for databases

  • Microservices architecture

Real implementation example: a Kubernetes Deployment for horizontal scaling.
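A minimal sketch of such a Deployment, assuming a hypothetical stateless API image (all names and resource numbers here are illustrative, not from a specific project):

```yaml
# Hypothetical Deployment: 3 identical replicas of a stateless API server.
# Horizontal scaling means raising `replicas`, not the per-pod resources.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: myorg/api:1.0.0   # illustrative image name
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
---
# Service that load-balances across whatever replicas exist.
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080
```

Because the pods are interchangeable, scaling out is a one-line change (or an autoscaler's decision) rather than a migration.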

Load Balancing Strategies

Load balancers distribute traffic across multiple servers. I've used different strategies depending on the use case:

1. Round Robin

Distributes requests evenly across all servers.

When I use it:

  • All servers have equal capacity

  • Requests have similar processing time

  • Simple setup is preferred

2. Least Connections

Routes to the server with the fewest active connections.

When I use it:

  • Requests have varying processing times

  • Some requests are long-running (WebSockets, file uploads)

3. IP Hash / Sticky Sessions

Routes requests from the same client to the same server.

When I use it (sparingly):

  • Legacy applications that store session state in memory

  • WebSocket connections that need to maintain state

⚠️ Warning: I try to avoid sticky sessions. They make scaling harder and create issues when a server fails. Better to use external session storage.

4. Weighted Load Balancing

Distributes more traffic to more powerful servers.

When I use it:

  • Servers have different capacities

  • During gradual rollouts (canary deployments)
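As a concrete illustration, all four strategies map directly onto nginx upstream configuration (`least_conn`, `ip_hash`, and `weight` are standard nginx directives; the server addresses are hypothetical):

```nginx
# Round robin is nginx's default when no balancing directive is given.
upstream api_round_robin {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

# Least connections: prefer the server with the fewest active connections.
upstream api_least_conn {
    least_conn;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

# IP hash: the same client IP always reaches the same server (sticky).
upstream api_sticky {
    ip_hash;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

# Weighted: the first server receives roughly 3x the traffic.
upstream api_weighted {
    server 10.0.0.1:8080 weight=3;
    server 10.0.0.2:8080 weight=1;
}
```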

Stateless vs Stateful Services

The most important scalability decision is whether your service maintains state.

Stateless Services

No session data stored on the server. Each request contains all necessary information.

Example of stateless design:
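A minimal sketch of the idea (not the exact scheme from any real project): instead of keeping the session in server memory, a signed token carries everything the handler needs, so any replica behind the load balancer can serve the request.

```python
import hashlib
import hmac
import json

# Illustrative shared secret; in practice this comes from a secrets manager
# and is the same on every server instance.
SECRET = b"shared-secret"

def issue_token(user_id: str) -> str:
    """Issue a signed token that carries all session state the server needs."""
    payload = json.dumps({"user_id": user_id})
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{sig}"

def handle_request(token: str) -> dict:
    """Any server instance can serve this request: the token is self-contained."""
    payload, sig = token.rsplit("|", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise PermissionError("invalid token")
    user = json.loads(payload)
    return {"user_id": user["user_id"], "status": "ok"}
```

Because no server holds anything between requests, adding or removing instances requires no session migration.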

Benefits I've experienced:

  • Easy to scale horizontally

  • No session synchronization needed

  • Servers can be added/removed freely

  • Simple load balancing

Stateful Services

Maintain session or connection state on the server.

When I need stateful services:

  • WebSocket connections

  • Real-time collaboration (like Google Docs)

  • Gaming servers

  • Video streaming sessions

Example: Stateful WebSocket server with Redis for session sharing:
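A self-contained sketch of the pattern follows. To keep it runnable here, an in-memory class stands in for Redis with the same get/set semantics; in a real deployment it would be a Redis client, which is what lets every server instance see the same session data.

```python
import json

class SessionStore:
    """Stand-in for Redis (same get/set shape). In production this would be
    a shared Redis instance reachable from every WebSocket server."""
    def __init__(self):
        self._data = {}

    def set(self, key: str, value: dict) -> None:
        self._data[key] = json.dumps(value)   # Redis stores strings/bytes

    def get(self, key: str):
        raw = self._data.get(key)
        return json.loads(raw) if raw is not None else None

class WebSocketServer:
    """Each server keeps live sockets locally but persists session state
    externally, so a reconnect can land on any instance and resume."""
    def __init__(self, name: str, store: SessionStore):
        self.name = name
        self.store = store

    def on_connect(self, session_id: str, user_id: str) -> None:
        self.store.set(session_id, {"user_id": user_id, "server": self.name})

    def resume(self, session_id: str):
        return self.store.get(session_id)
```

The live socket is unavoidably stateful, but keeping the *session* in shared storage means losing a server only drops connections, not sessions.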

Auto-Scaling Patterns

Auto-scaling automatically adjusts the number of instances based on demand.

Metrics I Use for Auto-Scaling

  1. CPU Utilization: Scale when average CPU > 70%

  2. Memory Utilization: Scale when average memory > 80%

  3. Request Queue Depth: Scale when queue > 100 pending requests

  4. Custom Metrics: Application-specific (e.g., active WebSocket connections)

Auto-Scaling Configuration (AWS)
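As a sketch, a target-tracking policy that keeps average CPU near the 70% threshold above can be attached to an Auto Scaling group with the AWS CLI (the group and policy names are illustrative):

```shell
# config.json: keep average CPU across the group at roughly 70%
cat > config.json <<'EOF'
{
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ASGAverageCPUUtilization"
  },
  "TargetValue": 70.0
}
EOF

aws autoscaling put-scaling-policy \
  --auto-scaling-group-name app-asg \
  --policy-name cpu-target-70 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration file://config.json
```

Target tracking handles both scale-up and scale-down around the target value, which pairs well with the longer scale-down cool-downs discussed below.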

Lessons Learned from Auto-Scaling

What worked:

  • Set minimum instances to handle baseline load

  • Use longer cool-down periods for scale-down (5-10 minutes)

  • Scale up aggressively, scale down conservatively

  • Monitor the metrics that matter to your application

What didn't work:

  • Scaling based on a single metric (use composite metrics)

  • Too aggressive scale-down (caused flapping)

  • Not accounting for instance startup time

  • Ignoring the cost of constantly spinning up/down instances

Database Scalability

Scaling databases requires different strategies than scaling application servers.

Read Replicas
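Read replicas let you fan read traffic out across copies of the data while all writes still go to the primary. A minimal routing sketch (connection handling elided; the query classification here is deliberately naive, since real drivers and ORMs expose explicit read-only routing):

```python
import itertools

class ReplicaRouter:
    """Send writes to the primary, rotate reads across replicas."""
    def __init__(self, primary: str, replicas: list):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, query: str) -> str:
        # Naive classification for illustration: SELECTs go to replicas,
        # everything else goes to the primary.
        if query.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.primary
```

One caveat worth keeping in mind: replicas lag the primary slightly, so read-your-own-writes flows may need to read from the primary.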

Real-World Scaling Journey

Let me share how I scaled a real application from 1,000 to 1,000,000 users:

Phase 1: Single Server (0-10K users)

  • One server running everything (app + database)

  • Vertical scaling when needed

  • Cost: $50/month

Phase 2: Separate Database (10K-50K users)

  • Moved database to dedicated server

  • App server can now scale independently

  • Cost: $200/month

Phase 3: Horizontal Scaling (50K-200K users)

  • Multiple app servers behind load balancer

  • Database read replicas

  • Redis for caching and sessions

  • Cost: $800/month

Phase 4: Full Distribution (200K-1M users)

  • 10-20 app servers with auto-scaling

  • Database sharding by user ID

  • CDN for static assets

  • Message queue for async processing

  • Cost: $3,000/month
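The shard-by-user-ID step above can be sketched as a stable hash mapping each user to one of N database shards (the shard names are illustrative):

```python
import hashlib

# Illustrative shard identifiers; in practice these would be connection strings.
SHARDS = ["shard-0.db", "shard-1.db", "shard-2.db", "shard-3.db"]

def shard_for(user_id: str) -> str:
    """Stable mapping: the same user always lands on the same shard.
    Uses a cryptographic hash rather than Python's hash() so the mapping
    is identical across processes and restarts."""
    digest = hashlib.sha256(user_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]
```

A plain modulo mapping like this makes adding shards painful, because most keys move; consistent hashing is the usual mitigation when the shard count is expected to change.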

Key insight: We scaled gradually, adding complexity only when needed. Starting with microservices would have been premature optimization.

What's Next

Now that you understand scalability patterns, let's explore how caching can dramatically improve performance.

