System Design Fundamentals

What is System Design?

System design is the process of defining the architecture, components, and interactions needed to build a software system that meets specific requirements. It's about making deliberate choices on how your application will handle data, scale with users, recover from failures, and evolve over time.

Early in my career, I thought system design was about choosing the right technologies. I was wrong. System design is primarily about understanding trade-offs. Every decision you make—whether it's choosing a database, designing an API, or structuring your services—involves trading one benefit for another.

Core Principles

Through building and maintaining distributed systems, I've found these principles to be fundamental:

1. Scalability

Scalability is the ability of a system to handle increased load. This could mean more users, more data, more transactions, or more complex operations.

Two dimensions of scalability:

  • Vertical Scaling (Scale Up): Adding more power to your existing machines (CPU, RAM, disk)

    • Pros: Simple, no code changes needed

    • Cons: Physical limits, single point of failure, expensive

    • When I use it: Databases that need strong consistency, legacy applications

  • Horizontal Scaling (Scale Out): Adding more machines to your pool of resources

    • Pros: Nearly unlimited scaling, better fault tolerance, cost-effective

    • Cons: Increased complexity, distributed system challenges

    • When I use it: Stateless services, read replicas, cache layers (a minimal sketch follows this list)
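
To make the horizontal-scaling bullet concrete, here is a minimal sketch, assuming a pool of identical stateless workers behind a naive round-robin dispatcher. The Worker class and handle_request function are hypothetical stand-ins for real instances behind a load balancer; because no worker holds request-specific state, scaling out is just a matter of adding workers to the pool.

```python
from itertools import cycle

# Hypothetical stand-in for a stateless service instance. Since no
# request-specific state lives on the worker, any worker can serve
# any request, and adding capacity just means adding more workers.
class Worker:
    def __init__(self, name: str):
        self.name = name

    def handle(self, request: str) -> str:
        return f"{self.name} handled {request}"

# A pool of identical workers; scaling out = growing this list.
pool = [Worker(f"worker-{i}") for i in range(3)]
dispatcher = cycle(pool)  # naive round-robin, as a load balancer might do

def handle_request(request: str) -> str:
    return next(dispatcher).handle(request)

if __name__ == "__main__":
    for i in range(5):
        print(handle_request(f"req-{i}"))
```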

2. Reliability

A reliable system continues to work correctly even when things go wrong—hardware failures, software bugs, or human errors.

Key reliability patterns I've used include timeouts, retries, and circuit breakers; a minimal circuit-breaker sketch follows the lessons below.

Lessons learned:

  • Always have timeouts—I learned this the hard way when a downstream service hung

  • Design for failure—assume every network call will fail eventually

  • Use circuit breakers to prevent cascading failures
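
Here is a minimal circuit-breaker sketch, assuming a hypothetical call_downstream dependency; in practice you would usually reach for an existing library rather than hand-rolling this.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    then allow a trial call again after a cool-down period."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast instead of hammering an unhealthy dependency.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cool-down elapsed: allow one trial call

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            self.failure_count = 0  # success closes the circuit again
            return result

breaker = CircuitBreaker()

def call_downstream():
    raise TimeoutError("downstream hung")  # simulate a failing dependency

if __name__ == "__main__":
    for _ in range(5):
        try:
            breaker.call(call_downstream)
        except Exception as exc:
            print(type(exc).__name__, exc)
```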

3. Availability

Availability is the percentage of time your system is operational and accessible. It's typically measured in "nines":

  • 99.9% (three nines): ~8.76 hours downtime/year

  • 99.99% (four nines): ~52.56 minutes downtime/year

  • 99.999% (five nines): ~5.26 minutes downtime/year
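
The downtime figures follow directly from the allowed unavailability fraction. Here is the arithmetic as a quick sketch, using the 365-day year the numbers above assume:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # the figures above assume a 365-day year

def downtime_minutes(availability: float) -> float:
    # Allowed downtime is simply the unavailable fraction of the year.
    return (1 - availability) * MINUTES_PER_YEAR

for availability in (0.999, 0.9999, 0.99999):
    minutes = downtime_minutes(availability)
    print(f"{availability * 100:g}%: {minutes:.2f} min/year (~{minutes / 60:.2f} h)")
```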

My experience with availability:

In one project, we committed to 99.95% availability. To achieve this, I implemented:

  1. Redundancy: Multiple instances across different availability zones

  2. Health checks: Automated monitoring and alerting (see the sketch after this list)

  3. Graceful degradation: Core features remained available even when non-critical services failed

  4. Zero-downtime deployments: Blue-green deployments with automated rollback
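
As a rough illustration of points 2 and 3, here is a sketch of a health check that separates core from non-critical dependencies, so the service can report itself as degraded and keep serving core features instead of failing outright. The dependency names and check functions are invented for the example.

```python
from typing import Callable, Dict

# Hypothetical dependency checks; in practice these would ping a
# database, cache, downstream API, etc. and return True when healthy.
def check_database() -> bool:
    return True

def check_recommendations() -> bool:
    return False  # pretend the non-critical recommendation service is down

CORE_CHECKS: Dict[str, Callable[[], bool]] = {"database": check_database}
OPTIONAL_CHECKS: Dict[str, Callable[[], bool]] = {"recommendations": check_recommendations}

def health() -> dict:
    core_ok = all(check() for check in CORE_CHECKS.values())
    optional_ok = all(check() for check in OPTIONAL_CHECKS.values())
    if not core_ok:
        status = "unhealthy"   # take the instance out of rotation, page someone
    elif not optional_ok:
        status = "degraded"    # core features still work; shed the extras
    else:
        status = "healthy"
    return {"status": status}

if __name__ == "__main__":
    print(health())  # {'status': 'degraded'}
```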

4. Maintainability

Maintainability is about making your system easy to operate, modify, and extend. This includes code quality, documentation, monitoring, and operational simplicity.

What I've learned about maintainability:

  • Simple is better than clever: I've refactored "clever" code too many times at 2 AM

  • Document decisions, not just code: ADRs (Architecture Decision Records) are invaluable

  • Observability from day one: You can't fix what you can't see

  • Automate operations: Manual processes lead to human errors

The CAP Theorem

The CAP theorem states that in a distributed system, you can only guarantee two out of three properties:

  • Consistency (C): All nodes see the same data at the same time

  • Availability (A): Every request receives a non-error response, even if it doesn't reflect the most recent write

  • Partition Tolerance (P): System continues to operate despite network partitions

Since network partitions are inevitable in distributed systems, you're really choosing between CP (Consistency + Partition Tolerance) and AP (Availability + Partition Tolerance).
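
One toy way to see that choice, assuming a replica that only knows its last seen value and whether it can currently reach the rest of the cluster: the CP path refuses to answer during a partition, while the AP path answers with possibly stale data.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    # Last value this replica has seen, and whether it can reach the
    # rest of the cluster (False = it is cut off by a network partition).
    last_known_value: str
    connected: bool

def read_cp(replica: Replica) -> str:
    # CP choice: refuse to answer rather than risk returning stale data.
    if not replica.connected:
        raise ConnectionError("partitioned: refusing a possibly stale read")
    return replica.last_known_value

def read_ap(replica: Replica) -> str:
    # AP choice: always answer, accepting that the value may be stale.
    return replica.last_known_value

if __name__ == "__main__":
    partitioned = Replica(last_known_value="balance=100", connected=False)
    print(read_ap(partitioned))   # answers, possibly with stale data
    try:
        read_cp(partitioned)
    except ConnectionError as exc:
        print(exc)                # refuses during the partition
```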

My practical experience:

  • Financial transactions → CP: Consistency is critical—can't have duplicate charges

  • Social media feeds → AP: Better to show slightly stale data than no data

  • Inventory system → CP: Prevent overselling products

  • Analytics dashboard → AP: Eventual consistency is acceptable for metrics

Trade-offs Mindset

The most important skill in system design isn't knowing all the patterns—it's understanding trade-offs. Every architectural decision has costs and benefits.

Common trade-offs I encounter:

  1. Consistency vs Latency

    • Strong consistency requires coordination → higher latency

    • Eventual consistency is faster but can show stale data

  2. Normalization vs Denormalization

    • Normalized data reduces duplication but requires joins

    • Denormalized data is faster to read but harder to update

  3. Synchronous vs Asynchronous

    • Sync operations are simpler but block the caller

    • Async operations are more complex but enable better scalability (a minimal sketch follows this list)

  4. Build vs Buy

    • Building gives you control and customization

    • Buying (managed services) is faster and requires less operational overhead
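
As a sketch of trade-off 3: the synchronous path below blocks the caller for the full duration of the work, while the asynchronous path enqueues a job and returns immediately, leaving a background worker to process it. The in-process queue stands in for a real message broker.

```python
import queue
import threading
import time

def do_work(job: str) -> None:
    time.sleep(0.1)  # stand-in for a slow operation (I/O, external call, ...)
    print(f"processed {job}")

def handle_sync(job: str) -> None:
    do_work(job)  # caller waits the full ~100 ms

jobs: "queue.Queue[str]" = queue.Queue()

def worker() -> None:
    # Background consumer; in a real system this would be a separate process
    # reading from a message broker.
    while True:
        do_work(jobs.get())
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_async(job: str) -> None:
    jobs.put(job)  # caller returns immediately; work happens later

if __name__ == "__main__":
    start = time.perf_counter()
    handle_sync("sync-job")
    print(f"sync call took {time.perf_counter() - start:.3f}s")

    start = time.perf_counter()
    handle_async("async-job")
    print(f"async call took {time.perf_counter() - start:.3f}s")
    jobs.join()  # wait so the background worker's output is visible
```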

My decision framework:

  1. Understand requirements: What are the actual needs? (not wants)

  2. Identify constraints: Budget, time, team expertise, compliance

  3. List alternatives: Multiple ways to solve the problem

  4. Evaluate trade-offs: What do you gain? What do you lose?

  5. Make a decision: Document it with rationale

  6. Validate with metrics: Measure if it's working as expected

Performance Metrics

Understanding and measuring performance is crucial. Here are the metrics I monitor:

Key metrics I track:

  • Latency: p50, p95, p99 response times (a sketch for computing these follows the list)

  • Throughput: Requests per second

  • Error Rate: Percentage of failed requests

  • Saturation: Resource utilization (CPU, memory, disk, network)
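
For the latency bullet, here is a quick sketch of computing p50/p95/p99 from a batch of response-time samples using a nearest-rank percentile; real systems normally get these numbers from their metrics backend rather than from raw lists.

```python
import random

def percentile(samples: list[float], pct: float) -> float:
    # Nearest-rank percentile: sort the samples and pick the value at
    # the requested rank. Good enough for illustration.
    ordered = sorted(samples)
    rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[rank]

if __name__ == "__main__":
    # Fake latency samples in milliseconds, skewed by a few slow calls.
    latencies = [random.gauss(120, 20) for _ in range(990)] + \
                [random.gauss(900, 100) for _ in range(10)]
    for pct in (50, 95, 99):
        print(f"p{pct}: {percentile(latencies, pct):.1f} ms")
```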

Real-World Challenges

Building distributed systems taught me that theory and practice often diverge. Here are challenges I've faced:

Challenge 1: Network is Unreliable

Networks fail, packets get lost, latency spikes happen. Design with this in mind.

Solutions:

  • Implement retries with exponential backoff (sketched after this list)

  • Use circuit breakers to prevent cascading failures

  • Set appropriate timeouts (I default to 5-10 seconds for external calls)

  • Have fallback mechanisms
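
The retry bullet in sketch form: retry a flaky call with exponentially growing waits plus a little jitter, and fall back when all attempts fail. The fetch_profile function is a made-up external call; a real one would also carry a request timeout so it cannot hang.

```python
import random
import time

def retry_with_backoff(fn, max_attempts: int = 4, base_delay: float = 0.5):
    """Call fn(), retrying on failure with exponentially growing waits
    plus jitter so many clients don't retry in lockstep."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller's fallback handle it
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)

# Hypothetical flaky external call; a real one would also set a request
# timeout (e.g. the 5-10 seconds mentioned above) so it cannot hang forever.
def fetch_profile():
    if random.random() < 0.7:
        raise TimeoutError("upstream timed out")
    return {"user": "alice"}

if __name__ == "__main__":
    try:
        print(retry_with_backoff(fetch_profile))
    except TimeoutError:
        print("all retries failed; falling back to cached profile")
```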

Challenge 2: Data Consistency Across Services

In microservices, maintaining consistency across services is hard.

Solutions I've used:

  • Saga pattern for distributed transactions (sketched after this list)

  • Event sourcing to maintain audit trail

  • Two-phase commit (only when absolutely necessary—it's complex)

  • Accept eventual consistency where business logic allows
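
To make the saga bullet concrete, here is a heavily simplified sketch: each step pairs an action with a compensating action, and when a later step fails, the completed steps are undone in reverse order. The order-flow step names are invented; a real implementation would also need to persist saga state and handle retries.

```python
from typing import Callable, List, Tuple

# Each saga step pairs an action with a compensating action that undoes it.
Step = Tuple[str, Callable[[], None], Callable[[], None]]

def run_saga(steps: List[Step]) -> bool:
    completed: List[Step] = []
    for name, action, compensate in steps:
        try:
            action()
            completed.append((name, action, compensate))
        except Exception as exc:
            print(f"step '{name}' failed ({exc}); compensating...")
            # Undo everything that succeeded, in reverse order.
            for done_name, _, undo in reversed(completed):
                print(f"compensating '{done_name}'")
                undo()
            return False
    return True

# Invented example steps for an order flow.
def reserve_inventory(): print("inventory reserved")
def release_inventory(): print("inventory released")
def charge_card(): raise RuntimeError("payment declined")
def refund_card(): print("charge refunded")

if __name__ == "__main__":
    ok = run_saga([
        ("reserve inventory", reserve_inventory, release_inventory),
        ("charge card", charge_card, refund_card),
    ])
    print("saga committed" if ok else "saga rolled back")
```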

Challenge 3: Operational Complexity

More components = more things that can break.

Solutions:

  • Start with a monolith if you're unsure

  • Add complexity only when needed

  • Invest heavily in observability

  • Automate everything you can

  • Document runbooks for common issues

What's Next

Now that you understand the fundamentals, we'll dive into specific patterns and technologies in the sections that follow.

