In distributed systems, failures are inevitable. Networks partition, services crash, dependencies become slow. From building production systems, I've learned that the question isn't whether failures will happen, but how your system responds when they do.
This article covers practical resilience patterns: circuit breakers, retries, timeouts, bulkheads, and fallback strategies.
Why Resilience Matters
| Failure Type | Impact | Mitigation |
|---|---|---|
| Service crash | Requests fail | Retry, fallback |
| Slow response | Thread exhaustion | Timeout, circuit breaker |
| Network partition | Requests hang | Timeout, bulkhead |
| Resource exhaustion | Cascading failures | Bulkhead, load shedding |
Circuit Breaker Pattern
Concept
| State | Behavior |
|---|---|
| Closed | Normal operation, requests pass through |
| Open | Fail fast, don't attempt requests |
| Half-Open | Allow limited requests to test recovery |
Implementation
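The three states can be sketched as a small class. This is a minimal, single-threaded illustration rather than a production implementation: the parameter names (`failure_threshold`, `recovery_timeout`, `clock`) and the `CircuitOpenError` exception are assumptions made for this example, while the attributes `name`, `state`, `stats.failures`, and `is_open` are chosen to match the ones read by the health check code in this article.

```python
import time
from dataclasses import dataclass
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


@dataclass
class CircuitStats:
    failures: int = 0   # consecutive failures since the last success
    successes: int = 0  # total successful calls


class CircuitBreaker:
    def __init__(self, name, failure_threshold=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.name = name
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so tests do not need to sleep
        self.state = CircuitState.CLOSED
        self.stats = CircuitStats()
        self._opened_at = 0.0

    @property
    def is_open(self):
        return self.state is CircuitState.OPEN

    def call(self, func, *args, **kwargs):
        if self.state is CircuitState.OPEN:
            if self.clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError(f"circuit '{self.name}' is open")
            # Recovery timeout elapsed: let a trial request through.
            self.state = CircuitState.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.stats.successes += 1
        self.stats.failures = 0
        self.state = CircuitState.CLOSED

    def _on_failure(self):
        self.stats.failures += 1
        # A failed trial request, or too many consecutive failures, opens the circuit.
        if (self.state is CircuitState.HALF_OPEN
                or self.stats.failures >= self.failure_threshold):
            self.state = CircuitState.OPEN
            self._opened_at = self.clock()
```

Wrapping a dependency then looks like `breaker.call(fetch_user, user_id)`, where `fetch_user` is a hypothetical function. The sketch does no locking, so a multi-threaded service would need to guard the state transitions.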
Circuit Breaker with Fallback
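One way to pair a breaker with a fallback is to return the fallback value both when the call fails and when the breaker skips the call entirely. The sketch below is self-contained and uses a deliberately tiny breaker; `SimpleBreaker` and `call_with_fallback` are illustrative names, not part of any library.

```python
import time


class SimpleBreaker:
    """Deliberately minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # After the cooldown, allow a trial request (half-open behavior).
        return self.clock() - self.opened_at >= self.recovery_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed trial


def call_with_fallback(breaker, primary, fallback):
    """Try `primary` through the breaker; use `fallback` if it fails or is skipped."""
    if not breaker.allow_request():
        return fallback()  # fail fast: the dependency is not called at all
    try:
        result = primary()
    except Exception:
        breaker.record_failure()
        return fallback()
    breaker.record_success()
    return result
```

With this shape the caller never sees the dependency's exception, so it is worth logging or counting fallback hits somewhere; otherwise a permanently open circuit can go unnoticed.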
Retry Pattern
Exponential Backoff
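Exponential backoff doubles the wait between attempts, usually with a cap and some randomness so that many clients do not retry in lockstep. A minimal sketch follows; the function names, the default delays, and the choice of which exceptions count as retryable are all assumptions for illustration.

```python
import random
import time


def backoff_delays(max_retries, base_delay=0.5, max_delay=30.0,
                   jitter=True, rng=random.random):
    """Yield the wait before each retry: base_delay * 2**attempt, capped at max_delay."""
    for attempt in range(max_retries):
        delay = min(max_delay, base_delay * (2 ** attempt))
        if jitter:
            # "Full jitter": pick a random point between 0 and the computed delay.
            delay = rng() * delay
        yield delay


def retry(func, max_retries=3, retry_on=(ConnectionError, TimeoutError),
          sleep=time.sleep, **backoff_kwargs):
    """Call func; on a retryable error, wait and try again up to max_retries times."""
    delays = backoff_delays(max_retries, **backoff_kwargs)
    while True:
        try:
            return func()
        except retry_on:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)
```

Only errors listed in `retry_on` are retried; anything else propagates immediately. Retrying is only safe for operations that are idempotent, or that the server can deduplicate.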
Retry with Circuit Breaker
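When retries run through a circuit breaker, the important detail is that an open circuit should end the retry loop instead of being retried. The sketch below illustrates that rule with a tiny stand-in breaker; `CountingBreaker`, `CircuitOpenError`, and `retry_through_breaker` are hypothetical names used only for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CountingBreaker:
    """Tiny stand-in breaker: opens after `threshold` consecutive failures.

    It has no recovery timer; it only exists to show the retry interaction.
    """

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, func):
        if self.failures >= self.threshold:
            raise CircuitOpenError("circuit open")
        try:
            result = func()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0
        return result


def retry_through_breaker(breaker, func, max_attempts=5, base_delay=0.2,
                          sleep=time.sleep):
    """Retry func with exponential backoff, but stop as soon as the circuit opens."""
    for attempt in range(max_attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise  # never retry against an open circuit
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
```

Here every failed attempt counts toward the breaker's threshold, so a burst of retries can open the circuit sooner. That is usually the desired effect, because it stops the retries from piling load onto a struggling service.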
Timeout Pattern
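For async code, a timeout can be expressed with the standard library's `asyncio.wait_for`, which cancels the awaited task when the deadline passes. The sketch below wraps it and converts the timeout into a domain-specific error; `DependencyTimeout` and `with_timeout` are names invented for this example.

```python
import asyncio


class DependencyTimeout(Exception):
    """Raised when a dependency does not answer within its time budget."""


async def with_timeout(coro_factory, timeout_s):
    """Await coro_factory() but give up after timeout_s seconds."""
    try:
        return await asyncio.wait_for(coro_factory(), timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        # wait_for has already cancelled the inner task at this point.
        raise DependencyTimeout(f"no response within {timeout_s}s") from exc


async def slow_service():
    # Hypothetical slow dependency used for the demonstration.
    await asyncio.sleep(1.0)
    return "late answer"


async def main():
    try:
        return await with_timeout(slow_service, timeout_s=0.05)
    except DependencyTimeout:
        return "timed out"


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Raising a dedicated exception lets callers decide what to do next (retry, fall back, or fail) without having to know which library produced the timeout.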
Bulkhead Pattern
Thread Pool Bulkhead
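A thread pool bulkhead gives each dependency its own fixed-size pool, so a slow dependency can only exhaust its own threads. A minimal sketch using the standard library's `ThreadPoolExecutor` follows; the class name and the pool names in the usage note are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


class ThreadPoolBulkhead:
    """Dedicated, fixed-size thread pool for calls to one dependency."""

    def __init__(self, name, max_workers):
        self.name = name
        self._pool = ThreadPoolExecutor(
            max_workers=max_workers,
            thread_name_prefix=f"bulkhead-{name}",
        )

    def submit(self, func, *args, **kwargs):
        # Returns a Future; at most max_workers calls run at the same time.
        return self._pool.submit(func, *args, **kwargs)

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

With one instance per dependency, for example `ThreadPoolBulkhead("payments", 4)` and `ThreadPoolBulkhead("search", 8)`, stuck payment calls cannot consume the search threads. Note that `ThreadPoolExecutor` queues excess work without a bound, so this sketch limits concurrency but does not reject work; pairing it with a timeout on `future.result(timeout=...)` keeps callers from waiting indefinitely.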
Connection Pool Bulkhead
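The same isolation idea applies to connections: cap how many a single dependency may hold at once, and refuse quickly when the cap is reached. Real database and HTTP clients typically expose their own pool-size settings; the sketch below shows only the underlying mechanism with a semaphore, and `SemaphoreBulkhead` and `BulkheadFullError` are hypothetical names.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when no slot becomes free within the acquire timeout."""


class SemaphoreBulkhead:
    """Caps the number of concurrent calls (or connections) to one dependency."""

    def __init__(self, name, max_concurrent, acquire_timeout=0.1):
        self.name = name
        self.acquire_timeout = acquire_timeout
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Wait briefly for a free slot, then give up rather than queue forever.
        if not self._slots.acquire(timeout=self.acquire_timeout):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            yield
        finally:
            self._slots.release()
```

Usage is a `with db_bulkhead.slot():` block around the code that uses the connection, where `db_bulkhead` is an instance created once per dependency. A short `acquire_timeout` keeps a saturated dependency from tying up the caller's own threads.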
Fallback Pattern
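A fallback is often a chain: try the live source, then a cache, then a static default. The sketch below expresses that chain as an ordered list of callables; `first_successful` is an illustrative helper, not a library function.

```python
import logging

logger = logging.getLogger(__name__)


def first_successful(*providers, default=None):
    """Return the result of the first provider that does not raise.

    Providers are tried in order; if all of them fail, `default` is returned.
    """
    for provider in providers:
        try:
            return provider()
        except Exception as exc:
            name = getattr(provider, "__name__", repr(provider))
            logger.warning("provider %s failed: %s", name, exc)
    return default
```

A call might look like `first_successful(fetch_live, read_cache, default=[])`, with `fetch_live` and `read_cache` standing in for real data sources. Logging each failure matters here, because a fallback that works well can otherwise hide an outage.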
Load Shedding
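Load shedding rejects excess work immediately instead of queueing it, so the requests that are accepted can still finish in time. A minimal sketch based on an in-flight counter follows; `LoadShedder`, `OverloadedError`, and the limit are assumptions for illustration.

```python
import threading


class OverloadedError(Exception):
    """Raised when a request is shed because the service is at capacity."""


class LoadShedder:
    """Rejects new work once max_in_flight requests are already running."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def run(self, func, *args, **kwargs):
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                raise OverloadedError("too many requests in flight")
            self._in_flight += 1
        try:
            return func(*args, **kwargs)  # the lock is not held while working
        finally:
            with self._lock:
                self._in_flight -= 1
```

In an HTTP service, `OverloadedError` would typically be translated into a 503 or 429 response so that clients know to back off. Unlike a bulkhead with an acquire timeout, the shedder never waits: the decision is made the moment the request arrives.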
Combined Resilience Strategy
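The patterns compose by nesting. The sketch below shows one possible ordering, with the fallback outermost, then retry, then the circuit breaker, then the timeout around the actual call, which runs in a dedicated thread pool as a bulkhead. `ResilientCall` and all of its parameters are hypothetical, other orderings are equally valid, and the counters are not thread-safe.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class ResilientCall:
    def __init__(self, func, fallback, timeout_s=1.0, max_attempts=3,
                 base_delay=0.1, failure_threshold=5, recovery_timeout=30.0,
                 max_workers=4, sleep=time.sleep, clock=time.monotonic):
        self.func, self.fallback = func, fallback
        self.timeout_s, self.max_attempts = timeout_s, max_attempts
        self.base_delay = base_delay
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.sleep, self.clock = sleep, clock
        self._pool = ThreadPoolExecutor(max_workers=max_workers)  # bulkhead
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def _guarded(self, *args):
        # Circuit breaker: reject while open and still cooling down.
        if self._opened_at is not None:
            if self.clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open")
        try:
            # Timeout: stop waiting after timeout_s (the worker thread keeps running).
            future = self._pool.submit(self.func, *args)
            result = future.result(timeout=self.timeout_s)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self.clock()
            raise
        self._failures, self._opened_at = 0, None
        return result

    def __call__(self, *args):
        for attempt in range(self.max_attempts):
            try:
                return self._guarded(*args)
            except CircuitOpenError:
                break  # no point retrying against an open circuit
            except Exception:
                if attempt < self.max_attempts - 1:
                    self.sleep(self.base_delay * (2 ** attempt))  # backoff
        return self.fallback(*args)  # graceful degradation
```

A caller would build one instance per dependency, for example `get_profile = ResilientCall(fetch_profile, fallback=default_profile)`, and then call `get_profile(user_id)`; both function names are placeholders. A thread-based timeout only stops the caller from waiting, so the underlying call should also carry its own network-level timeout.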
Health Checks for Resilience
Key Takeaways
Circuit breakers prevent cascading failures: fail fast when a service is down.
Retry with backoff: handle transient failures gracefully.
Timeouts are essential: don't wait forever for responses.
Bulkheads isolate failures: one dependency's failure shouldn't affect others.
Fallbacks provide graceful degradation: always have a plan B.
What's Next?
To manage resilient systems, we need visibility. In Article 10: Observability, we'll cover distributed tracing, centralized logging, metrics collection, and health checks.
```python
# Health check code for the "Health Checks for Resilience" section.
# Assumes CircuitBreaker, the FastAPI `app`, and a `service_health`
# instance of ServiceHealth are defined elsewhere in the application.
from enum import Enum


class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


class ServiceHealth:
    """Track service health based on resilience metrics."""

    def __init__(self, circuit_breakers: list[CircuitBreaker]):
        self.circuits = circuit_breakers

    def get_status(self) -> dict:
        open_circuits = [c for c in self.circuits if c.is_open]
        if not open_circuits:
            status = HealthStatus.HEALTHY      # no circuit is open
        elif len(open_circuits) < len(self.circuits):
            status = HealthStatus.DEGRADED     # some, but not all, are open
        else:
            status = HealthStatus.UNHEALTHY    # every circuit is open
        return {
            "status": status.value,
            "circuits": {
                c.name: {
                    "state": c.state.value,
                    "failures": c.stats.failures,
                }
                for c in self.circuits
            },
        }


# FastAPI health endpoint
@app.get("/health")
async def health():
    return service_health.get_status()
```