System Design Fundamentals
What is System Design?
System design is the process of defining the architecture, components, and interactions needed to build a software system that meets specific requirements. It's about making deliberate choices on how your application will handle data, scale with users, recover from failures, and evolve over time.
Early in my career, I thought system design was about choosing the right technologies. I was wrong. System design is primarily about understanding trade-offs. Every decision you make—whether it's choosing a database, designing an API, or structuring your services—involves trading one benefit for another.
Core Principles
Through building and maintaining distributed systems, I've found these principles to be fundamental:
1. Scalability
Scalability is the ability of a system to handle increased load. This could mean more users, more data, more transactions, or more complex operations.
Two dimensions of scalability:
Vertical Scaling (Scale Up): Adding more power to your existing machines (CPU, RAM, disk)
Pros: Simple, no code changes needed
Cons: Physical limits, single point of failure, expensive
When I use it: Databases that need strong consistency, legacy applications
Horizontal Scaling (Scale Out): Adding more machines to your pool of resources
Pros: Nearly unlimited scaling, better fault tolerance, cost-effective
Cons: Increased complexity, distributed system challenges
When I use it: Stateless services, read replicas, cache layers (see the sketch below)
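To make horizontal scaling concrete, here is a minimal sketch, assuming a stateless service where any instance can handle any request. The instance URLs and the `pick_instance` helper are hypothetical; in practice a load balancer or service registry does this job.

```python
import itertools

# Hypothetical pool of stateless service instances; in a real setup these
# would come from a service registry or load balancer configuration.
INSTANCES = ["http://app-1:8080", "http://app-2:8080", "http://app-3:8080"]

# Because the service is stateless, any instance can serve any request,
# so scaling out is just a matter of adding entries to the pool.
_round_robin = itertools.cycle(INSTANCES)

def pick_instance() -> str:
    """Return the next instance to receive a request (round-robin)."""
    return next(_round_robin)

if __name__ == "__main__":
    for i in range(6):
        print(f"request {i} -> {pick_instance()}")
```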
2. Reliability
A reliable system continues to work correctly even when things go wrong—hardware failures, software bugs, or human errors.
Key reliability patterns I've used include retries with exponential backoff, timeouts on every external call, circuit breakers, and graceful degradation with fallbacks (a minimal circuit-breaker sketch follows the lessons below).
Lessons learned:
Always have timeouts—I learned this the hard way when a downstream service hung
Design for failure—assume every network call will fail eventually
Use circuit breakers to prevent cascading failures
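As a rough illustration of the circuit-breaker idea, here is a minimal sketch that fails fast after repeated errors and lets a trial call through once a cool-down passes. The thresholds are made up, and in production you'd typically reach for a battle-tested library or a service-mesh feature rather than hand-rolling this.

```python
import time

class CircuitBreaker:
    """Fail fast after `max_failures` consecutive errors; allow a trial call
    again once `reset_timeout` seconds have passed (the "half-open" state)."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        # While the circuit is open and the cool-down hasn't elapsed, fail fast
        # instead of piling more load onto an already-failing dependency.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        else:
            self.failures = 0  # any success closes the circuit
            return result
```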
3. Availability
Availability is the percentage of time your system is operational and accessible. It's typically measured in "nines" (the arithmetic behind these figures is sketched below):
99.9% (three nines): ~8.76 hours downtime/year
99.99% (four nines): ~52.56 minutes downtime/year
99.999% (five nines): ~5.26 minutes downtime/year
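These figures are just the allowed-downtime fraction applied to a year; a quick sketch of the arithmetic, assuming a 365-day year:

```python
# Allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(availability_pct: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for target in (99.9, 99.95, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min/year")
# 99.9%   -> 525.60 min/year (~8.76 hours)
# 99.95%  -> 262.80 min/year (~4.38 hours)
# 99.99%  -> 52.56 min/year
# 99.999% -> 5.26 min/year
```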
My experience with availability:
In one project, we committed to 99.95% availability. To achieve this, I implemented:
Redundancy: Multiple instances across different availability zones
Health checks: Automated monitoring and alerting
Graceful degradation: Core features remained available even when non-critical services failed (a fallback sketch follows this list)
Zero-downtime deployments: Blue-green deployments with automated rollback
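As a sketch of the graceful-degradation idea, the wrapper below serves a degraded result when a non-critical dependency fails, so the core flow keeps working. The recommendation-service example and function names are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)

def with_fallback(primary, fallback):
    """Call `primary`; if it fails, log the failure and return the degraded
    `fallback` result instead of failing the whole request."""
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            logger.warning("non-critical dependency failed; degrading", exc_info=True)
            return fallback(*args, **kwargs)
    return wrapped

# Hypothetical example: product pages stay up even if recommendations are down.
def fetch_recommendations(user_id):
    raise TimeoutError("recommendation service unavailable")

get_recommendations = with_fallback(fetch_recommendations, lambda user_id: [])
print(get_recommendations(42))  # -> [] instead of a failed page
```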
4. Maintainability
Maintainability is about making your system easy to operate, modify, and extend. This includes code quality, documentation, monitoring, and operational simplicity.
What I've learned about maintainability:
Simple is better than clever: I've refactored "clever" code too many times at 2 AM
Document decisions, not just code: ADRs (Architecture Decision Records) are invaluable
Observability from day one: You can't fix what you can't see (a minimal instrumentation sketch follows this list)
Automate operations: Manual processes lead to human errors
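For the observability point, here is a minimal sketch of recording latency and status per endpoint as structured log lines, assuming those lines are scraped into a metrics backend. The decorator and endpoint name are illustrative, not a specific library's API.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("metrics")

def timed(endpoint):
    """Log a structured latency record for every call; in a real system this
    would feed a metrics backend instead of (or in addition to) logs."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                logger.info('{"endpoint": "%s", "status": "%s", "latency_ms": %.1f}',
                            endpoint, status, elapsed_ms)
        return wrapper
    return decorator

@timed("GET /orders")
def get_orders():
    time.sleep(0.05)  # stand-in for real work
    return []

get_orders()
```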
The CAP Theorem
The CAP theorem states that in a distributed system, you can only guarantee two out of three properties:
Consistency (C): All nodes see the same data at the same time
Availability (A): Every request receives a (non-error) response, even if it doesn't reflect the most recent write
Partition Tolerance (P): System continues to operate despite network partitions
Since network partitions are inevitable in distributed systems, you're really choosing between CP (Consistency + Partition Tolerance) and AP (Availability + Partition Tolerance).
My practical experience:
| Use case | Choice | Rationale |
| --- | --- | --- |
| Financial transactions | CP | Consistency is critical; can't have duplicate charges |
| Social media feeds | AP | Better to show slightly stale data than no data |
| Inventory system | CP | Prevent overselling products |
| Analytics dashboard | AP | Eventual consistency is acceptable for metrics |
Trade-offs Mindset
The most important skill in system design isn't knowing all the patterns—it's understanding trade-offs. Every architectural decision has costs and benefits.
Common trade-offs I encounter:
Consistency vs Latency
Strong consistency requires coordination → higher latency
Eventual consistency is faster but can show stale data
Normalization vs Denormalization
Normalized data reduces duplication but requires joins
Denormalized data is faster to read but harder to update
Synchronous vs Asynchronous
Sync operations are simpler but block the caller
Async operations are more complex but enable better scalability (a toy comparison follows this list)
Build vs Buy
Building gives you control and customization
Buying (managed services) is faster and requires less operational overhead
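A toy asyncio sketch of the synchronous-versus-asynchronous trade-off: sequential awaits add latencies up, while concurrent calls overlap them at the cost of more coordination and error handling. The service names and delays are made up.

```python
import asyncio
import time

async def fetch(name: str, delay: float) -> str:
    # Stand-in for a network call; the sleep simulates I/O latency.
    await asyncio.sleep(delay)
    return name

async def main():
    start = time.perf_counter()
    # Synchronous style: each call blocks the caller, so latencies add up.
    for name in ("users", "orders", "inventory"):
        await fetch(name, 0.5)
    print(f"sequential: {time.perf_counter() - start:.1f}s")  # ~1.5s

    start = time.perf_counter()
    # Asynchronous style: calls overlap, so total latency is roughly the
    # slowest call, at the cost of more complex coordination.
    await asyncio.gather(*(fetch(n, 0.5) for n in ("users", "orders", "inventory")))
    print(f"concurrent: {time.perf_counter() - start:.1f}s")  # ~0.5s

asyncio.run(main())
```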
My decision framework:
Understand requirements: What are the actual needs? (not wants)
Identify constraints: Budget, time, team expertise, compliance
List alternatives: Multiple ways to solve the problem
Evaluate trade-offs: What do you gain? What do you lose?
Make a decision: Document it with rationale
Validate with metrics: Measure if it's working as expected
Performance Metrics
Understanding and measuring performance is crucial. Here are the key metrics I track (a percentile sketch follows the list):
Latency: p50, p95, p99 response times
Throughput: Requests per second
Error Rate: Percentage of failed requests
Saturation: Resource utilization (CPU, memory, disk, network)
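To show what the percentile numbers mean, here is a nearest-rank sketch over made-up latency samples; real systems derive p50/p95/p99 from histograms in a metrics backend rather than in application code.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 14, 18, 22, 35, 16, 13, 250, 17]  # made-up samples
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
# Note how a single slow request dominates p95/p99 while p50 stays low.
```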
Real-World Challenges
Building distributed systems taught me that theory and practice often diverge. Here are challenges I've faced:
Challenge 1: Network is Unreliable
Networks fail, packets get lost, latency spikes happen. Design with this in mind.
Solutions:
Implement retries with exponential backoff (sketched after this list)
Use circuit breakers to prevent cascading failures
Set appropriate timeouts (I default to 5-10 seconds for external calls)
Have fallback mechanisms
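A minimal sketch of retries with exponential backoff and jitter, assuming the callable accepts a `timeout` argument (a hypothetical contract); pair it with a circuit breaker so retries don't hammer an already-failing dependency.

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay=0.5, timeout=5.0):
    """Retry `fn` with exponential backoff and jitter.

    `timeout` is passed through to the call so a hung downstream service
    can't block the caller indefinitely.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(timeout=timeout)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts; let the caller's fallback take over
            # Exponential backoff (0.5s, 1s, 2s, ...) plus jitter so many
            # clients don't retry in lockstep and re-overload the service.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            time.sleep(delay)
```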
Challenge 2: Data Consistency Across Services
In microservices, maintaining consistency across services is hard.
Solutions I've used:
Saga pattern for distributed transactions (a minimal orchestration sketch follows this list)
Event sourcing to maintain audit trail
Two-phase commit (only when absolutely necessary—it's complex)
Accept eventual consistency where business logic allows
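Here is a bare-bones sketch of an orchestrated saga: run each local transaction in order, and if one fails, run the compensations for the completed steps in reverse. The step names are hypothetical, and a real implementation would also persist saga state and retry or log failed compensations.

```python
def run_saga(steps):
    """`steps` is a list of (action, compensation) pairs of callables."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception:
        # Undo what already succeeded, newest first (best effort).
        for compensation in reversed(completed):
            compensation()
        raise

run_saga([
    (lambda: print("reserve inventory"), lambda: print("release inventory")),
    (lambda: print("charge payment"),    lambda: print("refund payment")),
    (lambda: print("create shipment"),   lambda: print("cancel shipment")),
])
```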
Challenge 3: Operational Complexity
More components = more things that can break.
Solutions:
Start with a monolith if you're unsure
Add complexity only when needed
Invest heavily in observability
Automate everything you can
Document runbooks for common issues
What's Next
Now that you understand the fundamentals, we'll dive into specific patterns and technologies:
Scalability Patterns: How to handle growth effectively
Caching Strategies: Speed up your system with intelligent caching
Database Design: Choose and design the right data storage