Distributed Systems
The Night the Database Was Fine But Everything Was Broken
The POS system was running across three services on the same machine during early development. Moving it to three different machines, each with its own network stack, clock, and failure domain, exposed assumptions I had baked into the code without realising it.
The most memorable incident: an order would be created successfully in the POS Core service, but the inventory reservation would sometimes not arrive. The payment would succeed, the order would exist, but the stock count was not decremented. The root cause was a race condition between an HTTP call and a process restart: the call was in-flight when the Inventory service restarted during a deployment.
I had assumed that if the HTTP call returned 200, the side effect was complete and durable. In a distributed system, that assumption does not hold. Networks drop packets. Services restart. Clocks drift. Distributed systems force you to be explicit about the guarantees you are relying on.
What Is a Distributed System?
A distributed system is a collection of independent processes that communicate over a network and appear to users as a coherent whole.
The key word is "network." Network communication introduces:
- Latency: messages take time to travel
- Unreliability: messages can be lost, reordered, or duplicated
- Partial failure: some nodes can fail while others continue
- No shared memory: state must be explicitly communicated
- Clock skew: clocks on different machines drift apart
My POS system became a distributed system the moment its services ran on separate machines and communicated over HTTP.
Why Distribution Is Hard
The classic formulation is the Fallacies of Distributed Computing, eight assumptions that developers incorrectly treat as true:
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology does not change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
I have been burned by the first three. The network dropped packets during a load spike. Latency spiked when the database was under pressure. Bandwidth was exhausted when a service started logging full request bodies in a loop.
Each fallacy is a category of bugs waiting to happen.
The CAP Theorem
The CAP theorem states that a distributed system can provide only two of these three guarantees at the same time:
Consistency (C): Every read returns the most recent write, or an error. All nodes see the same data at the same time.
Availability (A): Every request receives a response (not an error), though the data may not be the most recent.
Partition Tolerance (P): The system continues to operate even when network partitions split nodes.
In practice, partition tolerance is non-negotiable in any real distributed system. Networks partition. The actual choice is between CP and AP during a partition.
In my POS system:
- For payments and order confirmation I choose CP: I would rather reject a request than confirm an order twice
- For the chatbot's product cache I choose AP: slightly stale product names are acceptable; the service must stay available
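To make the split concrete, here is a minimal sketch (the function and values are illustrative, not the actual POS code) of the two policies a read path can take when the authoritative store is unreachable:

```python
def read_product(cached_value, fetch_fresh, mode):
    """CP vs AP behaviour during a partition (illustrative sketch).

    fetch_fresh raises ConnectionError when the authoritative store is
    unreachable. "CP" rejects the request rather than serve data that
    might be stale; "AP" answers from the local cache.
    """
    try:
        return fetch_fresh()
    except ConnectionError:
        if mode == "CP":
            raise  # consistency over availability: refuse to answer
        return cached_value  # availability over consistency: serve stale data

def unreachable():
    raise ConnectionError("network partition")

# AP path: the chatbot serves a possibly-stale product name.
assert read_product("Latte", unreachable, mode="AP") == "Latte"

# CP path: the payment flow rejects the request instead of guessing.
rejected = False
try:
    read_product(None, unreachable, mode="CP")
except ConnectionError:
    rejected = True
assert rejected
```

The important point is that the choice is per-operation, not per-system: the same codebase can be CP on one endpoint and AP on another.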
Consistency Models
Beyond the binary of CAP, there is a spectrum of consistency:
| Model | Guarantee | Example |
| --- | --- | --- |
| Strong consistency | All reads reflect the most recent write | Single PostgreSQL instance |
| Linearisability | Operations appear instantaneous; reads always see the latest write | etcd / Zookeeper |
| Sequential consistency | Operations appear in some total order; all nodes see the same order | Some distributed queues |
| Eventual consistency | Given no new writes, all nodes will eventually converge | Redis replication, product cache in chatbot |
| Read-your-writes | A client always reads what it has written | Sessions pinned to primary DB replica |
Most of my services use strong consistency within a service (single PostgreSQL instance) and eventual consistency across services (event-based updates to caches and read models).
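A toy sketch of that split, with illustrative names: the authoritative store is updated synchronously (strong consistency inside the service), while a read model in another service converges only once events are delivered (eventual consistency across services):

```python
class InventoryStore:
    """Authoritative store: writes commit locally first, then emit events."""

    def __init__(self):
        self.stock = {}      # strongly consistent within this service
        self.outbound = []   # events waiting to be delivered elsewhere

    def set_stock(self, sku, qty):
        self.stock[sku] = qty  # committed before any event leaves
        self.outbound.append(("stock_changed", sku, qty))

class ChatbotCache:
    """Read model in another service: lags until events arrive."""

    def __init__(self):
        self.stock_view = {}

    def apply(self, event):
        _, sku, qty = event
        self.stock_view[sku] = qty

inv = InventoryStore()
cache = ChatbotCache()

inv.set_stock("latte-beans", 40)
# Before delivery the cache is stale -- that is the accepted trade-off.
assert cache.stock_view.get("latte-beans") is None

# Deliver pending events; the read model converges.
for event in inv.outbound:
    cache.apply(event)
assert cache.stock_view["latte-beans"] == 40
```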
Failure Modes
In a distributed system, failures are not binary. The worst failures are partial:
- Crash failure: a process stops and never responds again
- Omission failure: a message is sent but never delivered
- Timing failure: the operation completes, but the response arrives too late or not at all
- Byzantine failure: a process behaves arbitrarily or reports conflicting state
The timing failure is the one that bit me in the introduction. The Inventory service received the reservation request, committed the change to the database, but the response was lost in transit. POS Core never received the 200 OK. It was a pure timing/network failure: the operation succeeded on the receiving end, but the sender had no way to know.
The solution was idempotent operations with idempotency keys:
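A minimal in-memory sketch of the idea, with illustrative names (in the real service, the processed-keys record lives in the same database transaction as the stock update): the receiving side stores each result under its idempotency key and replays it on a duplicate request:

```python
import uuid

class InventoryHandler:
    """Sketch of an idempotent reservation endpoint."""

    def __init__(self):
        self.stock = {"latte-beans": 10}
        self.processed = {}  # idempotency_key -> previously returned result

    def reserve(self, idempotency_key, sku, qty):
        # Seen this key before? Return the stored result instead of
        # decrementing stock a second time.
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]
        if self.stock.get(sku, 0) < qty:
            result = {"status": "rejected", "sku": sku}
        else:
            self.stock[sku] -= qty
            result = {"status": "reserved", "sku": sku, "qty": qty}
        self.processed[idempotency_key] = result
        return result

inv = InventoryHandler()
key = str(uuid.uuid4())

first = inv.reserve(key, "latte-beans", 2)
retry = inv.reserve(key, "latte-beans", 2)  # e.g. the 200 OK was lost

assert first == retry
assert inv.stock["latte-beans"] == 8  # decremented exactly once
```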
Now POS Core can safely retry the reservation call: if it was already processed, it gets the same result back.
Patterns for Reliability
| Pattern | What it gives you |
| --- | --- |
| Idempotency keys | Safe retries: the same request can be sent multiple times without duplicate side effects |
| Outbox pattern | Ensures a database write and an event publication happen atomically |
| Circuit breaker | Stops cascading failures by short-circuiting calls to a failing downstream service |
| Saga pattern | Manages multi-step distributed transactions with compensating actions |
| Retry with backoff | Handles transient failures without overwhelming the recovering service |
See Resilience & Fault Tolerance for detailed implementations of circuit breakers and retry patterns.
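Of these, the outbox pattern is easy to show end to end. A minimal sketch using an in-memory SQLite database (table and column names are illustrative): the order row and the event row are committed in one local transaction, and a background relay publishes pending events afterwards:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def create_order(order_id, total):
    # One atomic transaction for both inserts: an event is never lost
    # between the business write and its publication record.
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order_created", f'{{"order_id": "{order_id}"}}'),
        )

def relay_outbox(publish):
    # The relay publishes unpublished rows, then marks them done.
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

sent = []
create_order("ord-1", 4.50)
relay_outbox(lambda topic, payload: sent.append((topic, payload)))
assert sent == [("order_created", '{"order_id": "ord-1"}')]
```

If the relay crashes after publishing but before marking the row, the event is published twice, which is exactly why the consumers must be idempotent.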
Practical Example: The Distributed POS Workflow
The idempotency key propagates through the entire workflow. If any call needs to be retried, the receiving service can detect the duplicate and return the same result.
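A toy sketch of that propagation (service names and results are illustrative): one key is minted at the edge and forwarded on every downstream call, so each service can deduplicate a retry of the whole flow:

```python
import uuid

def call(service, headers, seen, handler):
    """Dispatch a call; replay the stored result on a duplicate key."""
    key = headers["Idempotency-Key"]
    if (service, key) in seen:
        return seen[(service, key)]
    result = handler()
    seen[(service, key)] = result
    return result

seen = {}
headers = {"Idempotency-Key": str(uuid.uuid4())}  # minted once, at the edge

# POS Core -> Inventory -> Payment, all sharing the same key.
r1 = call("inventory", headers, seen, lambda: "reserved")
r2 = call("payment", headers, seen, lambda: "charged")

# Retrying the workflow reuses the key; nothing executes twice.
r1_retry = call("inventory", headers, seen, lambda: "reserved-again")
assert r1_retry == "reserved"
```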
Clock Drift and Ordering Events
Distributed systems cannot rely on wall-clock timestamps for event ordering. Two messages published at the "same millisecond" from different machines may have timestamps that are off by hundreds of milliseconds.
I learned this when trying to trace order events across services using datetime.utcnow(). Events from the Inventory service and Payment service would appear out of order in the logs because their clocks were not synchronised.
Solutions:
- Use a monotonic sequence from the database: PostgreSQL sequences are reliable within a single service
- Logical clocks (Lamport timestamps): each event carries a counter incremented with every message; recipients update their own counter to max(local, received) + 1
- Correlation IDs with sequence numbers: I pass a correlation ID plus a step number in the header so I can reconstruct the call order without relying on timestamps
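The Lamport-timestamp rule above fits in a few lines. This is a generic sketch, not code from the POS system:

```python
class LamportClock:
    """Minimal Lamport timestamp: orders causally related events
    without relying on wall clocks."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event or message send: advance the counter.
        self.time += 1
        return self.time

    def receive(self, remote_time):
        # On receipt, jump past both clocks: max(local, received) + 1.
        self.time = max(self.time, remote_time) + 1
        return self.time

inventory, payment = LamportClock(), LamportClock()

t_send = inventory.tick()         # Inventory publishes an event at t=1
t_recv = payment.receive(t_send)  # Payment receives it; its clock jumps to 2

assert t_recv > t_send  # causally later events get larger timestamps
```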
Lessons Learned
Assume the network will fail. Design every cross-service call to handle failure, timeout, and partial success.
Idempotency is not optional. Any operation that can be retried must be idempotent. Add idempotency keys from the beginning; retrofitting them is painful.
Eventual consistency is a product decision, not just a technical one. "The inventory might be slightly incorrect for a few seconds" needs sign-off from the product side, not just the engineering side.
Start monitoring distributed systems immediately. Correlation IDs, structured logs, and request tracing are not something to add later; add them on day one.
The simplest distributed system is the one you did not build. Every remote call is a potential failure point. Fewer services means fewer distribution problems to solve.