# Distributed Systems

## The Night the Database Was Fine But Everything Was Broken

The POS system was running across three services on the same machine during early development. Moving it to three different machines — each with its own network stack, clock, and failure domain — exposed assumptions I had baked into the code without realising it.

The most memorable incident: an order would be created successfully in the POS Core service, but the inventory reservation would sometimes not arrive. The payment would succeed, the order would exist, but the stock count was not decremented. The root cause was a race condition between an HTTP call and a process restart — the call was in-flight when the Inventory service restarted during a deployment.

I had assumed that if the HTTP call returned 200, the side effect was complete and durable. In a distributed system, that assumption does not hold. Networks drop packets. Services restart. Clocks drift. Distributed systems force you to be explicit about the guarantees you are relying on.

## Table of Contents

* [What Is a Distributed System?](#what-is-a-distributed-system)
* [Why Distribution Is Hard](#why-distribution-is-hard)
* [The CAP Theorem](#the-cap-theorem)
* [Consistency Models](#consistency-models)
* [Failure Modes](#failure-modes)
* [Patterns for Reliability](#patterns-for-reliability)
* [Practical Example: The Distributed POS Workflow](#practical-example-the-distributed-pos-workflow)
* [Clock Drift and Ordering Events](#clock-drift-and-ordering-events)
* [Lessons Learned](#lessons-learned)

***

## What Is a Distributed System?

A distributed system is a collection of independent processes that **communicate over a network** and appear to users as a coherent whole.

The key word is "network." Network communication introduces:

* **Latency** — messages take time to travel
* **Unreliability** — messages can be lost, reordered, or duplicated
* **Partial failure** — some nodes can fail while others continue
* **No shared memory** — state must be explicitly communicated
* **Clock skew** — clocks on different machines drift apart

My POS system became a distributed system the moment its services ran on separate machines and communicated over HTTP.

{% @mermaid/diagram content="graph TB
subgraph "Single Machine"
P1\[Process A]
P2\[Process B]
P1 <-->|In-process call<br/>Reliable, fast| P2
end

subgraph "Distributed System"
S1\[Service A]
S2\[Service B]
S1 <-->|Network call<br/>Can fail, slow, reorder| S2
end" %}

***

## Why Distribution Is Hard

The classic formulation is the **Fallacies of Distributed Computing** — eight assumptions that developers incorrectly treat as true:

1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology does not change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

I have been burned by the first three. The network dropped packets during a load spike. Latency spiked when the database was under pressure. Bandwidth was exhausted when a service started logging full request bodies in a loop.

Each fallacy is a category of bugs waiting to happen.
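
Fallacies 1 and 2 translate directly into code: every remote call needs a timeout and a bounded retry policy with backoff. A minimal sketch — the `flaky_call` dependency and its failure behaviour are made up for illustration:

```python
import random
import time

def call_with_retries(fn, retries=3, base_delay=0.1, timeout=2.0):
    """Invoke fn, retrying transient failures with exponential backoff."""
    for attempt in range(retries):
        try:
            return fn(timeout=timeout)
        except TimeoutError:
            if attempt == retries - 1:
                raise  # Out of retries: surface the failure to the caller
            # Exponential backoff with jitter to avoid retry storms
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Hypothetical flaky dependency: times out twice, then succeeds
calls = {"n": 0}
def flaky_call(timeout):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated network drop")
    return "ok"

result = call_with_retries(flaky_call)  # succeeds on the third attempt
```

The important part is the final `raise`: after the retry budget is spent, the failure must reach the caller instead of being swallowed.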

***

## The CAP Theorem

The CAP theorem states that a distributed system can simultaneously provide at most **two of these three guarantees**:

**Consistency (C):** Every read returns the most recent write, or an error. All nodes see the same data at the same time.

**Availability (A):** Every request receives a response (not an error), though the data may not be the most recent.

**Partition Tolerance (P):** The system continues to operate even when network partitions split nodes.

{% @mermaid/diagram content="graph TB
subgraph CAP\["CAP Theorem"]
C((Consistency))
A((Availability))
P((Partition<br/>Tolerance))
end

CP\[CP Systems<br/>MongoDB, HBase, Zookeeper<br/>Consistent + Partition-tolerant<br/>May reject requests during partition]
AP\[AP Systems<br/>CouchDB, DynamoDB, Cassandra<br/>Available + Partition-tolerant<br/>May return stale data]
CA\[CA Systems<br/>Traditional RDBMS<br/>Consistent + Available<br/>Cannot tolerate partitions]

C --- CP
P --- CP

A --- AP
P --- AP

C --- CA
A --- CA" %}

**In practice, partition tolerance is non-negotiable** in any real distributed system. Networks partition. The actual choice is between CP and AP during a partition.

In my POS system:

* **For payments and order confirmation:** I choose CP — I would rather reject a request than confirm an order twice
* **For the chatbot's product cache:** I choose AP — slightly stale product names are acceptable; the service must stay available

***

## Consistency Models

Beyond the binary of CAP, there is a spectrum of consistency:

| Model                      | Guarantee                                                           | Example                                     |
| -------------------------- | ------------------------------------------------------------------- | ------------------------------------------- |
| **Strong consistency**     | All reads reflect the most recent write                             | Single PostgreSQL instance                  |
| **Linearisability**        | Operations appear instantaneous; reads always see latest write      | etcd / Zookeeper                            |
| **Sequential consistency** | Operations appear in some total order; all nodes see the same order | Some distributed queues                     |
| **Eventual consistency**   | Given no new writes, all nodes will eventually converge             | Redis replication, product cache in chatbot |
| **Read-your-writes**       | A client always reads what it has written                           | Sessions pinned to primary DB replica       |

Most of my services use strong consistency within a service (single PostgreSQL instance) and eventual consistency across services (event-based updates to caches and read models).
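
Eventual consistency is easiest to see in a toy model: two replicas accept writes independently and only converge when they exchange state. This is a deliberately simplified illustration (last-writer-wins by version counter), not how Redis replication actually works:

```python
class Replica:
    """Toy replica: per-key last-writer-wins, keyed by a version counter."""
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        self.data[key] = (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))[1]

    def merge(self, other):
        # Converge by keeping the higher-versioned value for each key
        for key, (version, value) in other.data.items():
            if version > self.data.get(key, (0, None))[0]:
                self.data[key] = (version, value)

a, b = Replica(), Replica()
a.write("price:latte", 350, version=1)   # write lands on replica A only
stale = b.read("price:latte")            # B has not seen it yet -> None
b.merge(a)                               # anti-entropy sync
fresh = b.read("price:latte")            # converged -> 350
```

Between the write and the merge, the two replicas disagree — that window is exactly what "slightly stale product names are acceptable" signs you up for.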

***

## Failure Modes

In a distributed system, failures are not binary. The worst failures are partial:

{% @mermaid/diagram content="graph LR
subgraph "Failures"
CRASH\[Crash Failure<br/>Service stops completely<br/>Detectable]
OMISSION\[Omission Failure<br/>Messages lost silently<br/>Hard to detect]
TIMING\[Timing Failure<br/>Response too slow<br/>May appear as success or failure]
BYZANTINE\[Byzantine Failure<br/>Service returns wrong data<br/>Hardest to detect]
end" %}

The **timing failure** is the one that bit me in the introduction. The Inventory service received the reservation request, committed the change to the database, but the response was lost in transit. POS Core never received the 200 OK. It was a pure timing/network failure — the operation succeeded on the receiving end, but the sender had no way to know.

The solution was **idempotent operations with idempotency keys**:

```python
# inventory/services.py — idempotent reservation
class InventoryService:
    async def reserve_stock(
        self,
        tenant_id: str,
        product_id: int,
        quantity: int,
        idempotency_key: str
    ) -> ReservationResult:
        # Check if this reservation was already processed
        existing = await self._db.query(
            "SELECT id, status FROM reservations WHERE idempotency_key = $1",
            idempotency_key
        )
        if existing:
            return ReservationResult(id=existing["id"], status=existing["status"])

        # Process the reservation
        # ... business logic ...

        # Insert and capture the generated id in one round trip
        row = await self._db.query(
            "INSERT INTO reservations (idempotency_key, product_id, quantity, status) "
            "VALUES ($1, $2, $3, 'confirmed') RETURNING id",
            idempotency_key, product_id, quantity
        )
        return ReservationResult(id=row["id"], status="confirmed")
```

Now POS Core can safely retry the reservation call — if it was already processed, it gets the same result back.
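
On the caller's side, the key must be generated once per logical operation and reused on every retry — regenerating it per attempt defeats the whole mechanism. A sketch with an in-memory stand-in for the Inventory service (names are illustrative):

```python
import uuid

processed = {}  # stand-in for the reservations table, keyed by idempotency key

def reserve(product_id, quantity, idempotency_key):
    """Server side: replay the stored result when the key has been seen before."""
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = {"reservation_id": len(processed) + 1, "status": "confirmed"}
    processed[idempotency_key] = result
    return result

# Client side: one key per logical operation, reused across retries
key = str(uuid.uuid4())
first = reserve(42, 2, key)    # original call; imagine the response was lost
second = reserve(42, 2, key)   # retry with the SAME key -> same result, no double reservation
```

The retry returns the identical result, and only one reservation row exists.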

***

## Patterns for Reliability

| Pattern                | Problem It Solves                                                                         |
| ---------------------- | ----------------------------------------------------------------------------------------- |
| **Idempotency keys**   | Safe retries — the same request can be sent multiple times without duplicate side effects |
| **Outbox pattern**     | Ensures a database write and an event publication happen atomically                       |
| **Circuit breaker**    | Stops cascading failures by short-circuiting calls to a failing downstream service        |
| **Saga pattern**       | Manages multi-step distributed transactions with compensating actions                     |
| **Retry with backoff** | Handles transient failures without overwhelming the recovering service                    |

See [Resilience & Fault Tolerance](https://blog.htunnthuthu.com/architecture-and-design/architecture-and-patterns/software-architecture-101/11-resilience-fault-tolerance) for detailed implementations of circuit breakers and retry patterns.
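
Of the remaining patterns, the outbox deserves a sketch, because it is the one that prevents the "order committed but event never published" split-brain: the state change and the event row commit in one local transaction, and a separate relay drains the outbox to the bus. A minimal SQLite illustration — the table names and event shape are made up for the example:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, event TEXT, published INTEGER DEFAULT 0);
""")

def place_order(total):
    # One local transaction: order row and event row commit or roll back together
    with db:
        cur = db.execute("INSERT INTO orders (total) VALUES (?)", (total,))
        event = json.dumps({"type": "order.placed", "order_id": cur.lastrowid})
        db.execute("INSERT INTO outbox (event) VALUES (?)", (event,))

def relay(publish):
    # In reality a separate process: drain unpublished events to the bus
    rows = db.execute("SELECT id, event FROM outbox WHERE published = 0").fetchall()
    for row_id, event in rows:
        publish(json.loads(event))
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

published = []
place_order(1250)
relay(published.append)  # the order.placed event reaches the bus
```

If the process crashes after the transaction but before the relay runs, the event is still safely in the outbox and gets published on the next sweep — at-least-once delivery, which is why the consumers also need idempotency.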

***

## Practical Example: The Distributed POS Workflow

{% @mermaid/diagram content="sequenceDiagram
participant Client
participant POSCore as POS Core
participant Inventory
participant Payment
participant EventBus as Redis Event Bus

Client->>POSCore: POST /orders {idempotency_key}
POSCore->>Inventory: POST /reserve {idempotency_key}

Note over Inventory: Crash here? Client retries.<br/>Idempotency key prevents duplicate reservation.

Inventory-->>POSCore: 200 Reserved
POSCore->>Payment: POST /charge {idempotency_key}
Payment-->>POSCore: 200 Charged

POSCore->>POSCore: Commit order to DB
POSCore->>EventBus: Publish order.placed event

Note over EventBus: Async consumers<br/>Restaurant service<br/>Notification service

POSCore-->>Client: 201 Order Created" %}

The idempotency key propagates through the entire workflow. If any call needs to be retried, the receiving service can detect the duplicate and return the same result.

***

## Clock Drift and Ordering Events

Distributed systems cannot rely on wall-clock timestamps for event ordering. Two messages published at the "same millisecond" from different machines may have timestamps that are off by hundreds of milliseconds.

I learned this when trying to trace order events across services using `datetime.utcnow()`. Events from the Inventory service and Payment service would appear out of order in the logs because their clocks were not synchronised.

Solutions:

* **Use a monotonic sequence from the database** — PostgreSQL sequences are reliable within a single service
* **Logical clocks (Lamport timestamps)** — each event carries a counter incremented with every message; recipients update their own counter to max(local, received) + 1
* **Correlation IDs with sequence numbers** — I pass a correlation ID + step number in the header so I can reconstruct the call order without relying on timestamps
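
The Lamport rule from the list above is small enough to sketch in full: increment the counter on every local event and send, and on receipt jump to max(local, received) + 1. The service names here mirror the POS example:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event or message send: advance by one
        self.time += 1
        return self.time

    def receive(self, remote_time):
        # Message receipt: jump past both clocks
        self.time = max(self.time, remote_time) + 1
        return self.time

pos_core, inventory = LamportClock(), LamportClock()

t_send = pos_core.tick()            # POS Core sends the reserve request at t=1
t_recv = inventory.receive(t_send)  # Inventory receives: max(0, 1) + 1 = 2
t_reply = inventory.tick()          # Inventory replies at t=3
t_done = pos_core.receive(t_reply)  # POS Core: max(1, 3) + 1 = 4
```

Causally related events now carry strictly increasing counters regardless of what each machine's wall clock says; ties between unrelated events can be broken by (counter, node_id).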

```python
# Logging with distributed context
import structlog

logger = structlog.get_logger()

def process_order(tenant_id: str, order_id: int, correlation_id: str, step: int):
    logger.info(
        "processing_order",
        tenant_id=tenant_id,
        order_id=order_id,
        correlation_id=correlation_id,
        step=step,             # Deterministic ordering
        service="pos-core"
    )
```

***

## Lessons Learned

* **Assume the network will fail.** Design every cross-service call to handle failure, timeout, and partial success.
* **Idempotency is not optional.** Any operation that can be retried must be idempotent. Add idempotency keys from the beginning — retrofitting them is painful.
* **Eventual consistency is a product decision, not just a technical one.** "The inventory might be slightly incorrect for a few seconds" needs sign-off from the product side, not just the engineering side.
* **Start monitoring distributed systems immediately.** Correlation IDs, structured logs, and request tracing are not something to add later — add them on day one.
* **The simplest distributed system is the one you did not build.** Every remote call is a potential failure point. Fewer services means fewer distribution problems to solve.
