Distributed Systems
The Night the Database Was Fine But Everything Was Broken
The POS system was running across three services on the same machine during early development. Moving it to three different machines, each with its own network stack, clock, and failure domain, exposed assumptions I had baked into the code without realising it.
The most memorable incident: an order would be created successfully in the POS Core service, but the inventory reservation would sometimes not arrive. The payment would succeed, the order would exist, but the stock count was not decremented. The root cause was a race condition between an HTTP call and a process restart: the call was in-flight when the Inventory service restarted during a deployment.
I had assumed that if the HTTP call returned 200, the side effect was complete and durable. In a distributed system, that assumption does not hold. Networks drop packets. Services restart. Clocks drift. Distributed systems force you to be explicit about the guarantees you are relying on.
What Is a Distributed System?
A distributed system is a collection of independent processes that communicate over a network and appear to users as a coherent whole.
The key word is "network." Network communication introduces:
- Latency: messages take time to travel
- Unreliability: messages can be lost, reordered, or duplicated
- Partial failure: some nodes can fail while others continue
- No shared memory: state must be explicitly communicated
- Clock skew: clocks on different machines drift apart
My POS system became a distributed system the moment its services ran on separate machines and communicated over HTTP.
Why Distribution Is Hard
The classic formulation is the Fallacies of Distributed Computing, eight assumptions that developers incorrectly treat as true:
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology does not change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
I have been burned by the first three. The network dropped packets during a load spike. Latency spiked when the database was under pressure. Bandwidth was exhausted when a service started logging full request bodies in a loop.
Each fallacy is a category of bugs waiting to happen.
The CAP Theorem
The CAP theorem states that a distributed system can provide only two of these three guarantees at the same time:
Consistency (C): Every read returns the most recent write, or an error. All nodes see the same data at the same time.
Availability (A): Every request receives a response (not an error), though the data may not be the most recent.
Partition Tolerance (P): The system continues to operate even when network partitions split nodes.
In practice, partition tolerance is non-negotiable in any real distributed system. Networks partition. The actual choice is between CP and AP during a partition.
In my POS system:
- For payments and order confirmation I choose CP: I would rather reject a request than confirm an order twice
- For the chatbot's product cache I choose AP: slightly stale product names are acceptable; the service must stay available
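To make the split concrete, here is a minimal sketch (the function and values are illustrative, not the actual POS code) of the two policies a read path can take when the authoritative store is unreachable:

```python
def read_product(cached_value, fetch_fresh, mode):
    """CP vs AP behaviour during a partition (illustrative sketch).

    fetch_fresh raises ConnectionError when the authoritative store is
    unreachable. "CP" rejects the request rather than serve data that
    might be stale; "AP" answers from the local cache.
    """
    try:
        return fetch_fresh()
    except ConnectionError:
        if mode == "CP":
            raise  # consistency over availability: refuse to answer
        return cached_value  # availability over consistency: serve stale data

def unreachable():
    raise ConnectionError("network partition")

# AP path: the chatbot serves a possibly-stale product name.
assert read_product("Latte", unreachable, mode="AP") == "Latte"

# CP path: the payment flow rejects the request instead of guessing.
rejected = False
try:
    read_product(None, unreachable, mode="CP")
except ConnectionError:
    rejected = True
assert rejected
```

The important point is that the choice is per-operation, not per-system: the same codebase can be CP on one endpoint and AP on another.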
Consistency Models
Beyond the binary of CAP, there is a spectrum of consistency:
| Model | Guarantee | Example |
| --- | --- | --- |
| Strong consistency | All reads reflect the most recent write | Single PostgreSQL instance |
| Linearisability | Operations appear instantaneous; reads always see the latest write | etcd / Zookeeper |
| Sequential consistency | Operations appear in some total order; all nodes see the same order | Some distributed queues |
| Eventual consistency | Given no new writes, all nodes will eventually converge | Redis replication, product cache in chatbot |
| Read-your-writes | A client always reads what it has written | Sessions pinned to primary DB replica |
Most of my services use strong consistency within a service (single PostgreSQL instance) and eventual consistency across services (event-based updates to caches and read models).
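A toy sketch of that split, with illustrative names: the authoritative store is updated synchronously (strong consistency inside the service), while a read model in another service converges only once events are delivered (eventual consistency across services):

```python
class InventoryStore:
    """Authoritative store: writes commit locally first, then emit events."""

    def __init__(self):
        self.stock = {}      # strongly consistent within this service
        self.outbound = []   # events waiting to be delivered elsewhere

    def set_stock(self, sku, qty):
        self.stock[sku] = qty  # committed before any event leaves
        self.outbound.append(("stock_changed", sku, qty))

class ChatbotCache:
    """Read model in another service: lags until events arrive."""

    def __init__(self):
        self.stock_view = {}

    def apply(self, event):
        _, sku, qty = event
        self.stock_view[sku] = qty

inv = InventoryStore()
cache = ChatbotCache()

inv.set_stock("latte-beans", 40)
# Before delivery the cache is stale -- that is the accepted trade-off.
assert cache.stock_view.get("latte-beans") is None

# Deliver pending events; the read model converges.
for event in inv.outbound:
    cache.apply(event)
assert cache.stock_view["latte-beans"] == 40
```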
Failure Modes
In a distributed system, failures are not binary. The worst failures are partial:
- Crash failure: a process stops and never responds again
- Omission failure: a message is sent but never delivered
- Timing failure: the operation completes, but the response arrives too late or not at all
- Byzantine failure: a process behaves arbitrarily or reports conflicting state
The timing failure is the one that bit me in the introduction. The Inventory service received the reservation request, committed the change to the database, but the response was lost in transit. POS Core never received the 200 OK. It was a pure timing/network failure: the operation succeeded on the receiving end, but the sender had no way to know.
The solution was idempotent operations with idempotency keys:
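A minimal in-memory sketch of the idea, with illustrative names (in the real service, the processed-keys record lives in the same database transaction as the stock update): the receiving side stores each result under its idempotency key and replays it on a duplicate request:

```python
import uuid

class InventoryHandler:
    """Sketch of an idempotent reservation endpoint."""

    def __init__(self):
        self.stock = {"latte-beans": 10}
        self.processed = {}  # idempotency_key -> previously returned result

    def reserve(self, idempotency_key, sku, qty):
        # Seen this key before? Return the stored result instead of
        # decrementing stock a second time.
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]
        if self.stock.get(sku, 0) < qty:
            result = {"status": "rejected", "sku": sku}
        else:
            self.stock[sku] -= qty
            result = {"status": "reserved", "sku": sku, "qty": qty}
        self.processed[idempotency_key] = result
        return result

inv = InventoryHandler()
key = str(uuid.uuid4())

first = inv.reserve(key, "latte-beans", 2)
retry = inv.reserve(key, "latte-beans", 2)  # e.g. the 200 OK was lost

assert first == retry
assert inv.stock["latte-beans"] == 8  # decremented exactly once
```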
Now POS Core can safely retry the reservation call: if it was already processed, it gets the same result back.
Patterns for Reliability
| Pattern | What it gives you |
| --- | --- |
| Idempotency keys | Safe retries: the same request can be sent multiple times without duplicate side effects |
| Outbox pattern | Ensures a database write and an event publication happen atomically |
| Circuit breaker | Stops cascading failures by short-circuiting calls to a failing downstream service |
| Saga pattern | Manages multi-step distributed transactions with compensating actions |
| Retry with backoff | Handles transient failures without overwhelming the recovering service |
See Resilience & Fault Tolerance for detailed implementations of circuit breakers and retry patterns.
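Of these, the outbox pattern is easy to show end to end. A minimal sketch using an in-memory SQLite database (table and column names are illustrative): the order row and the event row are committed in one local transaction, and a background relay publishes pending events afterwards:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def create_order(order_id, total):
    # One atomic transaction for both inserts: an event is never lost
    # between the business write and its publication record.
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order_created", f'{{"order_id": "{order_id}"}}'),
        )

def relay_outbox(publish):
    # The relay publishes unpublished rows, then marks them done.
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

sent = []
create_order("ord-1", 4.50)
relay_outbox(lambda topic, payload: sent.append((topic, payload)))
assert sent == [("order_created", '{"order_id": "ord-1"}')]
```

If the relay crashes after publishing but before marking the row, the event is published twice, which is exactly why the consumers must be idempotent.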
Practical Example: The Distributed POS Workflow
The idempotency key propagates through the entire workflow. If any call needs to be retried, the receiving service can detect the duplicate and return the same result.
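A toy sketch of that propagation (service names and results are illustrative): one key is minted at the edge and forwarded on every downstream call, so each service can deduplicate a retry of the whole flow:

```python
import uuid

def call(service, headers, seen, handler):
    """Dispatch a call; replay the stored result on a duplicate key."""
    key = headers["Idempotency-Key"]
    if (service, key) in seen:
        return seen[(service, key)]
    result = handler()
    seen[(service, key)] = result
    return result

seen = {}
headers = {"Idempotency-Key": str(uuid.uuid4())}  # minted once, at the edge

# POS Core -> Inventory -> Payment, all sharing the same key.
r1 = call("inventory", headers, seen, lambda: "reserved")
r2 = call("payment", headers, seen, lambda: "charged")

# Retrying the workflow reuses the key; nothing executes twice.
r1_retry = call("inventory", headers, seen, lambda: "reserved-again")
assert r1_retry == "reserved"
```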
Clock Drift and Ordering Events
Distributed systems cannot rely on wall-clock timestamps for event ordering. Two messages published at the "same millisecond" from different machines may have timestamps that are off by hundreds of milliseconds.
I learned this when trying to trace order events across services using datetime.utcnow(). Events from the Inventory service and Payment service would appear out of order in the logs because their clocks were not synchronised.
Solutions:
- Use a monotonic sequence from the database: PostgreSQL sequences are reliable within a single service
- Logical clocks (Lamport timestamps): each event carries a counter incremented with every message; recipients update their own counter to max(local, received) + 1
- Correlation IDs with sequence numbers: I pass a correlation ID plus a step number in the header so I can reconstruct the call order without relying on timestamps
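The Lamport-timestamp rule above fits in a few lines. This is a generic sketch, not code from the POS system:

```python
class LamportClock:
    """Minimal Lamport timestamp: orders causally related events
    without relying on wall clocks."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event or message send: advance the counter.
        self.time += 1
        return self.time

    def receive(self, remote_time):
        # On receipt, jump past both clocks: max(local, received) + 1.
        self.time = max(self.time, remote_time) + 1
        return self.time

inventory, payment = LamportClock(), LamportClock()

t_send = inventory.tick()         # Inventory publishes an event at t=1
t_recv = payment.receive(t_send)  # Payment receives it; its clock jumps to 2

assert t_recv > t_send  # causally later events get larger timestamps
```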
Lessons Learned
Assume the network will fail. Design every cross-service call to handle failure, timeout, and partial success.
Idempotency is not optional. Any operation that can be retried must be idempotent. Add idempotency keys from the beginning; retrofitting them is painful.
Eventual consistency is a product decision, not just a technical one. "The inventory might be slightly incorrect for a few seconds" needs sign-off from the product side, not just the engineering side.
Start monitoring distributed systems immediately. Correlation IDs, structured logs, and request tracing are not something to add later; add them on day one.
The simplest distributed system is the one you did not build. Every remote call is a potential failure point. Fewer services means fewer distribution problems to solve.