Resilience & Fault Tolerance
Introduction
At 2:47 PM on a busy Friday, my Payment Service crashed. A bug in the payment gateway integration caused the service to hang indefinitely on every request. Within 90 seconds:
POS Core Service: 200+ pending requests, 100% CPU
Chatbot Service: Completely unresponsive
Restaurant Service: Timing out on all orders
Result: Entire POS system down, customers couldn't pay, revenue lost
The Payment Service was down for 8 minutes. But my entire system was offline for 23 minutes because I hadn't implemented proper resilience patterns.
That incident taught me this: in distributed systems, failures are not exceptional, they are normal. Services will crash, networks will be slow, dependencies will time out. Your architecture must assume failure and handle it gracefully.
In this article, I'll share the resilience patterns I implemented after that painful outage: circuit breakers, retry logic, timeouts, and graceful degradation. These patterns transformed my POS system from fragile to resilient.
The Payment Service Outage
The root cause: one service failure cascaded through the entire system because I had:
No circuit breakers (services kept calling failed dependencies)
No proper timeouts (threads waited forever)
No graceful degradation (all-or-nothing responses)
After this, I implemented the patterns below.
Circuit Breaker Pattern
A circuit breaker prevents cascading failures by stopping calls to failing services. It has three states: Closed (normal operation, requests pass through and failures are counted), Open (requests fail immediately without touching the dependency), and Half-Open (a trial request checks whether the dependency has recovered).
Implementation
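A minimal sketch of that state machine in Python (the class and method names are my own, and the defaults mirror the thresholds described in this article: 5 failures to open, 60 seconds before probing, 2 successes to close):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0, success_threshold=2):
        self.failure_threshold = failure_threshold  # failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to stay open
        self.success_threshold = success_threshold  # half-open successes to close
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # let a trial request through
                self.successes = 0
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self._reset()
        else:
            self._reset()

    def _reset(self):
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
```

In production you would more likely reach for a maintained library (pybreaker in Python, resilience4j on the JVM) than roll your own, but the moving parts are the same.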
Using Circuit Breakers
With circuit breakers, when the Payment Service fails:
After 5 failures: Circuit opens, stops sending requests
For 60 seconds: Immediately return error (don't wait for timeout)
After 60 seconds: Try one request (half-open)
After 2 successes: Circuit closes, normal operation resumes
This prevents the cascade that took down my system.
Retry Strategies with Exponential Backoff
Transient failures (network blips, temporary overload) should be retried. But naive retries can make things worse.
Bad Retry (Fixed Interval)
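The anti-pattern looks something like this (a sketch; the function name is illustrative):

```python
import time

def call_with_naive_retry(func, max_attempts=3, interval=1.0):
    """Anti-pattern: retry on a fixed cadence, no backoff, no jitter."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            time.sleep(interval)  # every client retries on the same 1s beat
    raise last_error
```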
Problem: if the service is overloaded, this hammers it with a request every second, making the problem worse.
Good Retry (Exponential Backoff with Jitter)
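A sketch of the backoff helper (the function name and the ±20% jitter style are my choices; "full jitter", sleeping a uniform random time between 0 and the delay, is another common variant):

```python
import random
import time

def call_with_backoff(func, max_attempts=4, base_delay=0.5, max_delay=10.0):
    """Retry with exponential backoff; jitter spreads clients apart."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the error
            # 0.5s, 1.0s, 2.0s, ... capped at max_delay
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.8, 1.2))  # +/-20% jitter
```

Usage is a one-liner, e.g. `call_with_backoff(lambda: gateway.charge(order))`, where `gateway.charge` stands in for whatever call you are protecting.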
Using Retries
Retry timing:
Attempt 1: Immediate
Attempt 2: ~0.5s later (base delay)
Attempt 3: ~1.0s later (0.5 * 2^1)
Attempt 4: ~2.0s later (0.5 * 2^2)
Jitter prevents thundering herd (all clients retrying at same time).
Timeout Configuration
Every network call needs a timeout. Always. No exceptions.
Service-Specific Timeouts
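One workable shape is a per-service timeout budget: tight limits for fast internal calls, looser but still bounded limits for slow external ones. The numbers and key names below are illustrative:

```python
# Hypothetical per-service timeout budget, in seconds.
TIMEOUTS = {
    "pos-core":        {"connect": 0.5, "read": 2.0},
    "restaurant":      {"connect": 0.5, "read": 2.0},
    "chatbot":         {"connect": 0.5, "read": 5.0},   # LLM replies are slower
    "payment-gateway": {"connect": 1.0, "read": 10.0},  # external, but still bounded
}

def timeout_for(service: str) -> tuple:
    """Look up (connect, read) timeouts; unknown services get a strict default."""
    cfg = TIMEOUTS.get(service, {"connect": 0.5, "read": 2.0})
    return (cfg["connect"], cfg["read"])
```

With the requests library, the tuple plugs straight into the `timeout` argument: `requests.get(url, timeout=timeout_for("payment-gateway"))`.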
Timeout vs Deadline
Sometimes you need a deadline (an absolute time limit for the whole request) rather than a per-call timeout:
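One way to thread a deadline through a call chain (a sketch; the class is hypothetical):

```python
import time

class Deadline:
    """Absolute time budget shared across a chain of downstream calls."""

    def __init__(self, budget_seconds: float):
        self.expires_at = time.monotonic() + budget_seconds

    def remaining(self) -> float:
        return self.expires_at - time.monotonic()

    def timeout_for_next_call(self, cap: float = 5.0) -> float:
        """Per-call timeout: whatever budget is left, capped per call."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("deadline exceeded")
        return min(left, cap)
```

Each downstream call then gets `deadline.timeout_for_next_call()` instead of a fixed timeout, so a slow first hop shrinks the budget for every later hop instead of silently blowing the overall limit.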
Graceful Degradation
When dependencies fail, provide reduced functionality instead of complete failure.
Example: Chatbot with Degraded Service
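A sketch of the fallback chain, with the dependencies injected as callables (all names here are illustrative, not the real service API):

```python
def chatbot_order_status(order_id, fetch_live, fetch_estimate, cache):
    """Answer an order-status question with progressively degraded data.

    fetch_live / fetch_estimate are callables that may raise; cache is a dict.
    """
    try:  # best case: live data from the Restaurant Service
        return {"source": "live", "status": fetch_live(order_id)}
    except Exception:
        pass
    try:  # degraded: an estimate computed locally
        return {"source": "estimate", "status": fetch_estimate(order_id),
                "note": "Estimated - live data unavailable"}
    except Exception:
        pass
    if order_id in cache:  # last resort: stale cached data
        return {"source": "cache", "status": cache[order_id],
                "note": "Cached - may be out of date"}
    return {"source": "none", "status": None,
            "note": "Sorry, order status is temporarily unavailable"}
```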
Users get:
Full data when all services work
Estimated data when some services fail
Cached data when all services fail
Clear error message when even cache is empty
Bulkhead Pattern
Isolate resources to prevent one failure from exhausting all system resources.
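A bulkhead can be approximated with a bounded semaphore per dependency, so a stuck Payment Service can only occupy its own slots (a sketch; names and pool sizes are illustrative):

```python
import threading

class Bulkhead:
    """Cap concurrent calls to one dependency so it can't exhaust all threads."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Reject immediately instead of queuing: fail fast when the pool is full.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError(f"bulkhead '{self.name}' full: rejecting call")
        try:
            return func(*args, **kwargs)
        finally:
            self._sem.release()

# One pool per dependency, sized independently.
payment_bulkhead = Bulkhead("payment", max_concurrent=10)
reporting_bulkhead = Bulkhead("reporting", max_concurrent=5)
```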
Production Lessons Learned
Lesson 1: Monitor Circuit Breaker State
After implementing circuit breakers, I had no visibility into when they opened/closed.
Solution: Add metrics and alerts:
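A minimal shape for that instrumentation (names are illustrative; in production you would export the state and open count to your metrics backend and alert whenever a circuit opens):

```python
import logging

logger = logging.getLogger("resilience")

class BreakerMetrics:
    """Track breaker state transitions; hook this into your metrics system."""

    def __init__(self):
        self.state = "CLOSED"
        self.open_count = 0  # how many times the circuit has opened

    def on_transition(self, breaker_name: str, new_state: str):
        # Log every transition so opens/closes show up in your dashboards.
        logger.warning("circuit %s: %s -> %s", breaker_name, self.state, new_state)
        if new_state == "OPEN":
            self.open_count += 1  # alert when this increments
        self.state = new_state
```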
Lesson 2: Don't Retry Non-Idempotent Operations
I retried a payment twice, charging the customer twice. Oops.
Solution: Distinguish retryable from non-retryable errors:
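The classification can be as simple as two exception types plus a status-code map (a sketch; the exact status list is a judgment call):

```python
class RetryableError(Exception):
    """Transient: safe to retry (timeouts, overload, connection resets)."""

class NonRetryableError(Exception):
    """Permanent or non-idempotent: never retry (card declined, bad request)."""

def classify(status_code: int) -> type:
    """Map an HTTP status to a retry decision (mapping is illustrative)."""
    if status_code in (408, 429, 502, 503, 504):
        return RetryableError
    return NonRetryableError
```

For payments specifically, the stronger fix is an idempotency key on each charge request, so that even a retried call cannot charge the customer twice.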
Lesson 3: Fallback Data Must Be Obvious
Users complained that stale cached data wasn't clearly marked.
Solution: Always indicate degraded mode:
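A sketch of the response envelope (field names are my own) with an explicit degraded flag the UI can render as a banner:

```python
def build_response(data, degraded: bool, reason: str = ""):
    """Attach an explicit degraded-mode flag instead of silently serving stale data."""
    resp = {"data": data, "degraded": degraded}
    if degraded:
        resp["degraded_reason"] = reason or "Showing cached data"
    return resp
```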
Best Practices
Use circuit breakers for all external dependencies
Implement exponential backoff with jitter for retries
Set aggressive timeouts (don't wait forever for slow services)
Distinguish retryable from non-retryable errors
Provide graceful degradation (reduced functionality > complete failure)
Use bulkheads to isolate resource pools
Monitor circuit breaker states and alert when opened
Mark degraded responses clearly in the UI
Test failure scenarios regularly (chaos engineering)
Fail fast (timeout quickly rather than tying up resources)
Next Steps
Resilience patterns keep your system running during failures. But how do you know when failures occur? How do you debug issues across 6 distributed services?
In the final article, Observability & Monitoring Architecture, we'll explore distributed tracing, structured logging, metrics collection, and health checks—the patterns that give you visibility into your distributed system's behavior.
This is part of the Software Architecture 101 series, where I share lessons learned building a production multi-tenant POS system with 6 microservices.