# Resilience & Fault Tolerance

## Table of Contents

* [Introduction](#introduction)
* [The Payment Service Outage](#the-payment-service-outage)
* [Circuit Breaker Pattern](#circuit-breaker-pattern)
* [Retry Strategies with Exponential Backoff](#retry-strategies-with-exponential-backoff)
* [Timeout Configuration](#timeout-configuration)
* [Graceful Degradation](#graceful-degradation)
* [Bulkhead Pattern](#bulkhead-pattern)
* [Production Lessons Learned](#production-lessons-learned)
* [Best Practices](#best-practices)

## Introduction

At 2:47 PM on a busy Friday, my Payment Service crashed. A bug in the payment gateway integration caused the service to hang indefinitely on every request. Within 90 seconds:

* POS Core Service: 200+ pending requests, 100% CPU
* Chatbot Service: Completely unresponsive
* Restaurant Service: Timing out on all orders
* **Result**: Entire POS system down, customers couldn't pay, revenue lost

The Payment Service was down for **8 minutes**. But my entire system was offline for **23 minutes** because I hadn't implemented proper **resilience patterns**.

That incident taught me this: **In distributed systems, failures are not exceptional—they are normal**. Services will crash, networks will be slow, dependencies will timeout. Your architecture must assume failure and handle it gracefully.

In this article, I'll share the resilience patterns I implemented after that painful outage: circuit breakers, retry logic, timeouts, and graceful degradation. These patterns transformed my POS system from fragile to genuinely resilient.

## The Payment Service Outage

Here's what happened during the outage:

```
14:47:23 - Payment Service hangs (bug in payment gateway client)
14:47:45 - POS Core starts accumulating timeout errors (22s later)
14:48:12 - POS Core thread pool exhausted, stops processing requests
14:48:30 - Chatbot Service cascades to failure (calling POS Core)
14:49:15 - Restaurant Service fails (depends on POS Core for orders)
14:55:00 - Payment Service restarted (bug fixed)
14:55:30 - POS Core still unresponsive (thread pool deadlocked)
15:10:00 - Full system restart required
```

**Root cause**: One service failure cascaded through the entire system because I had:

* No circuit breakers (services kept calling failed dependencies)
* No proper timeouts (threads waited forever)
* No graceful degradation (all-or-nothing responses)

After this, I implemented the patterns below.

## Circuit Breaker Pattern

A circuit breaker prevents cascading failures by stopping calls to failing services. It has three states:

```
CLOSED (normal) → OPEN (failing) → HALF-OPEN (testing) → CLOSED (recovered)
```

### Implementation

```python
# infrastructure/circuit_breaker.py
from datetime import datetime, timedelta
from enum import Enum
from typing import Callable, Any
import asyncio
import logging

logger = logging.getLogger(__name__)


class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing recovery


class CircuitBreaker:
    """
    Circuit breaker implementation.
    
    Prevents cascading failures by stopping calls to failing services.
    """
    
    def __init__(
        self,
        name: str,
        failure_threshold: int = 5,
        success_threshold: int = 2,
        timeout: float = 60.0
    ):
        self.name = name
        self.failure_threshold = failure_threshold  # Failures before opening
        self.success_threshold = success_threshold  # Successes to close from half-open
        self.timeout = timeout  # Seconds to wait before trying again
        
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time: datetime | None = None
    
    def is_open(self) -> bool:
        """Check if circuit is open (rejecting requests)."""
        if self.state == CircuitState.OPEN:
            # Check if timeout has elapsed
            if datetime.utcnow() - self.last_failure_time > timedelta(seconds=self.timeout):
                logger.info(f"Circuit breaker [{self.name}] entering HALF-OPEN state")
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
                return False
            return True
        return False
    
    async def call(self, func: Callable, *args, **kwargs) -> Any:
        """
        Execute function through circuit breaker.
        Raises CircuitBreakerOpen if circuit is open.
        """
        if self.is_open():
            raise CircuitBreakerOpen(f"Circuit breaker [{self.name}] is OPEN")
        
        try:
            # Execute the function (supports both async and sync callables)
            if asyncio.iscoroutinefunction(func):
                result = await func(*args, **kwargs)
            else:
                result = func(*args, **kwargs)
            
            # Success
            self.record_success()
            return result
            
        except Exception as e:
            # Failure
            self.record_failure()
            raise
    
    def record_success(self):
        """Record a successful call."""
        self.failure_count = 0
        
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            logger.info(
                f"Circuit breaker [{self.name}] success in HALF-OPEN "
                f"({self.success_count}/{self.success_threshold})"
            )
            
            if self.success_count >= self.success_threshold:
                logger.info(f"Circuit breaker [{self.name}] closing (recovered)")
                self.state = CircuitState.CLOSED
                self.success_count = 0
    
    def record_failure(self):
        """Record a failed call."""
        self.failure_count += 1
        self.last_failure_time = datetime.utcnow()
        
        logger.warning(
            f"Circuit breaker [{self.name}] failure "
            f"({self.failure_count}/{self.failure_threshold})"
        )
        
        if self.state == CircuitState.HALF_OPEN:
            logger.warning(f"Circuit breaker [{self.name}] opening (failed in HALF-OPEN)")
            self.state = CircuitState.OPEN
            self.failure_count = 0
            
        elif self.failure_count >= self.failure_threshold:
            logger.error(f"Circuit breaker [{self.name}] OPENED")
            self.state = CircuitState.OPEN


class CircuitBreakerOpen(Exception):
    """Exception raised when circuit breaker is open."""
    pass


# Global circuit breakers for each service
circuit_breakers = {
    "payment": CircuitBreaker("Payment Service", failure_threshold=5, timeout=60.0),
    "inventory": CircuitBreaker("Inventory Service", failure_threshold=5, timeout=30.0),
    "pos_core": CircuitBreaker("POS Core", failure_threshold=10, timeout=60.0),
    "restaurant": CircuitBreaker("Restaurant Service", failure_threshold=5, timeout=30.0),
}


def get_circuit_breaker(service_name: str) -> CircuitBreaker:
    """Get circuit breaker for a service (raises KeyError for unknown names)."""
    return circuit_breakers[service_name]
```

### Using Circuit Breakers

```python
# services/payment_client.py
from infrastructure.circuit_breaker import get_circuit_breaker, CircuitBreakerOpen
import httpx
import logging

logger = logging.getLogger(__name__)

class PaymentClient:
    """Client for Payment Service with circuit breaker."""
    
    def __init__(self):
        self.base_url = "http://localhost:4004"
        self.circuit_breaker = get_circuit_breaker("payment")
    
    async def process_payment(
        self,
        tenant_id: str,
        order_id: str,
        amount: float
    ) -> dict:
        """
        Process payment with circuit breaker protection.
        Raises CircuitBreakerOpen if service is failing.
        """
        try:
            result = await self.circuit_breaker.call(
                self._do_process_payment,
                tenant_id,
                order_id,
                amount
            )
            return result
        except CircuitBreakerOpen as e:
            logger.error(f"Payment service unavailable: {e}")
            # Return error instead of crashing
            return {
                "success": False,
                "error": "Payment service temporarily unavailable",
                "retry_later": True
            }
    
    async def _do_process_payment(
        self,
        tenant_id: str,
        order_id: str,
        amount: float
    ) -> dict:
        """Actual payment processing logic."""
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.base_url}/payments",
                json={
                    "order_id": order_id,
                    "amount": amount
                },
                headers={"x-tenant-id": tenant_id},
                timeout=5.0
            )
            
            if response.status_code != 200:
                raise Exception(f"Payment failed: {response.status_code}")
            
            return response.json()
```

With circuit breakers, when the Payment Service fails:

1. **After 5 failures**: Circuit opens and stops sending requests
2. **For 60 seconds**: Requests fail fast with an immediate error (no waiting on timeouts)
3. **After 60 seconds**: Circuit goes half-open and lets test requests through
4. **After 2 successes**: Circuit closes and normal operation resumes

This prevents the cascade that took down my system.
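The lifecycle in that list can be exercised with a condensed, synchronous sketch of the same state machine. This is my own illustration, not the production class above: `MiniBreaker` is a made-up name, and time is injected explicitly so the transitions are easy to trace.

```python
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class MiniBreaker:
    """Condensed circuit breaker; time is passed in so transitions are testable."""
    def __init__(self, failure_threshold=5, success_threshold=2, timeout=60.0):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout = timeout
        self.state = State.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = None

    def allow(self, now: float) -> bool:
        """Should a request be attempted at time `now`?"""
        if self.state is State.OPEN:
            if now - self.opened_at >= self.timeout:
                self.state = State.HALF_OPEN   # cooldown elapsed: probe
                self.successes = 0
                return True
            return False
        return True

    def on_success(self):
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = State.CLOSED      # recovered
        self.failures = 0

    def on_failure(self, now: float):
        self.failures += 1
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN            # trip (or re-trip from half-open)
            self.opened_at = now
            self.failures = 0

# Walk through the lifecycle from the list above
cb = MiniBreaker()
for t in range(5):                 # five failures trip the breaker
    cb.on_failure(now=float(t))
assert cb.state is State.OPEN
assert not cb.allow(now=30.0)      # still cooling down: fail fast
assert cb.allow(now=65.0)          # 60s elapsed: half-open probe
cb.on_success()
cb.on_success()                    # two successes close the circuit
assert cb.state is State.CLOSED
```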

## Retry Strategies with Exponential Backoff

Transient failures (network blips, temporary overload) should be retried. But naive retries can make things worse.

### Bad Retry (Fixed Interval)

```python
# DON'T DO THIS
async def call_service_bad_retry():
    for i in range(3):
        try:
            return await service_call()
        except Exception:
            await asyncio.sleep(1)  # Always wait 1 second
    raise Exception("Failed after 3 retries")
```

Problem: If service is overloaded, this hammers it with requests every second, making the problem worse.

### Good Retry (Exponential Backoff with Jitter)

```python
# infrastructure/retry.py
import asyncio
import random
from typing import Callable, Any
import logging

logger = logging.getLogger(__name__)


async def retry_with_backoff(
    func: Callable,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    exponential_base: float = 2.0,
    jitter: bool = True,
    retryable_exceptions: tuple = (Exception,)
) -> Any:
    """
    Retry function with exponential backoff and jitter.
    
    Args:
        func: Async function to retry
        max_retries: Maximum number of retry attempts
        base_delay: Initial delay in seconds
        max_delay: Maximum delay in seconds
        exponential_base: Multiplier for exponential backoff
        jitter: Add random jitter to prevent thundering herd
        retryable_exceptions: Tuple of exceptions that should trigger retry
    
    Returns:
        Result of successful function call
    
    Raises:
        Last exception if all retries fail
    """
    last_exception = None
    
    for attempt in range(max_retries + 1):
        try:
            return await func()
        except retryable_exceptions as e:
            last_exception = e
            
            if attempt == max_retries:
                logger.error(f"Failed after {max_retries} retries: {e}")
                raise
            
            # Calculate delay: base_delay * (exponential_base ^ attempt)
            delay = min(base_delay * (exponential_base ** attempt), max_delay)
            
            # Add jitter (random variation)
            if jitter:
                delay = delay * (0.5 + random.random())  # 50-150% of calculated delay
            
            logger.warning(
                f"Attempt {attempt + 1}/{max_retries + 1} failed: {e}. "
                f"Retrying in {delay:.2f}s"
            )
            
            await asyncio.sleep(delay)
    
    raise last_exception


# Decorator version
import functools

def with_retry(
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0
):
    """Decorator for retry with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)  # preserve the wrapped function's name and docstring
        async def wrapper(*args, **kwargs):
            return await retry_with_backoff(
                lambda: func(*args, **kwargs),
                max_retries=max_retries,
                base_delay=base_delay,
                max_delay=max_delay
            )
        return wrapper
    return decorator
```

### Using Retries

```python
# services/inventory_client.py
from infrastructure.retry import with_retry
import httpx

class InventoryClient:
    """Client for Inventory Service with retry logic."""
    
    @with_retry(max_retries=3, base_delay=0.5, max_delay=5.0)
    async def get_product(self, tenant_id: str, product_id: str) -> dict:
        """Get product with automatic retry on failure."""
        async with httpx.AsyncClient() as client:
            response = await client.get(
                f"http://localhost:4003/products/{product_id}",
                headers={"x-tenant-id": tenant_id},
                timeout=3.0
            )
            
            if response.status_code == 503:  # Service unavailable: retryable
                raise httpx.HTTPStatusError(
                    "Service unavailable",
                    request=response.request,
                    response=response
                )
            
            response.raise_for_status()
            return response.json()
```

**Retry timing**:

* Attempt 1: Immediate
* Attempt 2: ~0.5s later (base delay)
* Attempt 3: ~1.0s later (0.5 × 2^1)
* Attempt 4: ~2.0s later (0.5 × 2^2)

Jitter prevents thundering herd (all clients retrying at same time).
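To see that schedule concretely, here is a small helper of my own that mirrors the delay arithmetic in `retry_with_backoff` (`backoff_delays` is an illustrative name, not part of the module above):

```python
import random

def backoff_delays(max_retries=3, base_delay=0.5, exponential_base=2.0,
                   max_delay=30.0, jitter=True, rng=random.random):
    """Return the delays the retry loop would sleep between attempts."""
    delays = []
    for attempt in range(max_retries):
        delay = min(base_delay * (exponential_base ** attempt), max_delay)
        if jitter:
            delay *= 0.5 + rng()   # 50-150% of the nominal delay
        delays.append(delay)
    return delays

print(backoff_delays(jitter=False))   # → [0.5, 1.0, 2.0]
```

With jitter enabled, each delay lands somewhere between 50% and 150% of its nominal value, so a fleet of clients that failed at the same moment spreads its retries out instead of stampeding back together.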

## Timeout Configuration

Every network call needs a timeout. Always. No exceptions.

### Service-Specific Timeouts

```python
# config/timeouts.py
from dataclasses import dataclass

@dataclass
class ServiceTimeout:
    """Timeout configuration for a service."""
    connect: float  # Time to establish connection
    read: float     # Time to read response
    total: float    # Total request time


class TimeoutConfig:
    """Centralized timeout configuration."""
    
    # Auth Service: Fast, in-memory operations
    AUTH = ServiceTimeout(connect=1.0, read=2.0, total=3.0)
    
    # POS Core: Medium, database queries
    POS_CORE = ServiceTimeout(connect=1.0, read=5.0, total=8.0)
    
    # Inventory Service: Medium, MongoDB queries
    INVENTORY = ServiceTimeout(connect=1.0, read=3.0, total=5.0)
    
    # Payment Service: Slow, external gateway
    PAYMENT = ServiceTimeout(connect=2.0, read=10.0, total=15.0)
    
    # Restaurant Service: Fast, small dataset
    RESTAURANT = ServiceTimeout(connect=1.0, read=2.0, total=4.0)
    
    # Chatbot: Slow, aggregates multiple services
    CHATBOT = ServiceTimeout(connect=1.0, read=15.0, total=20.0)


# Using timeouts with httpx
import httpx

async def call_payment_service(url: str, data: dict) -> dict:
    """Call payment service with proper timeouts."""
    timeout = httpx.Timeout(
        connect=TimeoutConfig.PAYMENT.connect,
        read=TimeoutConfig.PAYMENT.read,
        write=5.0,
        pool=None
    )
    
    async with httpx.AsyncClient(timeout=timeout) as client:
        response = await client.post(url, json=data)
        return response.json()
```

### Timeout vs Deadline

Sometimes you need a **deadline** (absolute time limit) rather than per-call timeout:

```python
import asyncio
import logging
from datetime import datetime, timedelta

logger = logging.getLogger(__name__)

async def chatbot_query_with_deadline(query: str, deadline_seconds: float = 10.0):
    """
    Process chatbot query with absolute deadline.
    Useful when orchestrating multiple services.
    """
    start_time = datetime.utcnow()
    deadline = start_time + timedelta(seconds=deadline_seconds)
    
    def remaining() -> float:
        """Seconds left before the deadline; raise if it has already passed."""
        left = (deadline - datetime.utcnow()).total_seconds()
        if left <= 0:
            raise asyncio.TimeoutError(f"Deadline exceeded ({deadline_seconds}s)")
        return left

    try:
        # Step 1: Get orders (budget: 3s, capped by the time left)
        orders = await asyncio.wait_for(get_orders(), timeout=min(3.0, remaining()))

        # Step 2: Get products (budget: 3s, capped by the time left)
        products = await asyncio.wait_for(get_products(), timeout=min(3.0, remaining()))

        # Step 3: Get payments (budget: 3s, capped by the time left)
        payments = await asyncio.wait_for(get_payments(), timeout=min(3.0, remaining()))

        # Step 4: Aggregate with whatever time is left
        remaining()
        return aggregate(orders, products, payments)
        
    except asyncio.TimeoutError as e:
        elapsed = (datetime.utcnow() - start_time).total_seconds()
        logger.error(f"Query timeout after {elapsed:.2f}s: {e}")
        raise
```
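The shrinking-budget idea can be packaged as a small reusable helper. This is a sketch of my own (the `Budget` class and `step` coroutine are illustrative, not part of the services above) using `time.monotonic`, which unlike wall-clock time never jumps backwards:

```python
import asyncio
import time

class Budget:
    """A shrinking time budget shared across sequential calls."""
    def __init__(self, total: float):
        self.deadline = time.monotonic() + total

    def remaining(self) -> float:
        """Seconds left; raise TimeoutError once the budget is exhausted."""
        left = self.deadline - time.monotonic()
        if left <= 0:
            raise asyncio.TimeoutError("deadline exceeded")
        return left

async def step(duration: float) -> float:
    await asyncio.sleep(duration)
    return duration

async def main() -> float:
    budget = Budget(total=0.5)
    # Each step may take up to 0.3s, but never more than the overall budget allows
    a = await asyncio.wait_for(step(0.05), timeout=min(0.3, budget.remaining()))
    b = await asyncio.wait_for(step(0.05), timeout=min(0.3, budget.remaining()))
    return a + b

total = asyncio.run(main())
assert abs(total - 0.1) < 1e-9
```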

## Graceful Degradation

When dependencies fail, provide **reduced functionality** instead of complete failure.

### Example: Chatbot with Degraded Service

```python
# services/chatbot_resilient.py
from datetime import date
from infrastructure.circuit_breaker import CircuitBreakerOpen
import logging

logger = logging.getLogger(__name__)


class ResilientChatbot:
    """Chatbot with graceful degradation."""
    
    async def get_daily_revenue(self, tenant_id: str, date: date) -> dict:
        """
        Get daily revenue with graceful degradation.
        
        Full response: Orders + Payments + Products
        Degraded: Orders only (estimated revenue)
        Minimal: Cached data from yesterday
        """
        result = {
            "date": date.isoformat(),
            "mode": "full"
        }
        
        try:
            # Try full data aggregation
            orders = await self.pos_client.get_orders(tenant_id, date)
            payments = await self.payment_client.get_payments(tenant_id, date)
            products = await self.inventory_client.get_products(tenant_id)
            
            result.update({
                "total_revenue": payments["total"],
                "order_count": len(orders),
                "payment_methods": payments["by_method"],
                "top_products": self._calculate_top_products(orders, products)
            })
            
        except CircuitBreakerOpen:
            logger.warning("Payment service unavailable, using degraded mode")
            result["mode"] = "degraded"
            
            try:
                # Degraded: Orders only, estimate revenue
                orders = await self.pos_client.get_orders(tenant_id, date)
                estimated_revenue = sum(order["total"] for order in orders)
                
                result.update({
                    "total_revenue": estimated_revenue,
                    "order_count": len(orders),
                    "warning": "Estimated revenue (payment data unavailable)"
                })
                
            except Exception:
                logger.error("POS service also unavailable, using cache")
                result["mode"] = "minimal"
                
                # Minimal: Return cached data
                cached = await self.cache.get(f"revenue:{tenant_id}")
                if cached:
                    result.update(cached)
                    result["warning"] = "Using cached data (services unavailable)"
                else:
                    result["error"] = "No data available"
        
        return result
```

Users get:

1. **Full data** when all services work
2. **Estimated data** when some services fail
3. **Cached data** when all services fail
4. **Clear error message** when even cache is empty
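The full → degraded → minimal ladder generalizes into a small helper that tries data sources in order of fidelity and reports which one succeeded. This is my own sketch; `first_available` and the sample fetchers are illustrative names:

```python
import asyncio
from typing import Any, Awaitable, Callable

async def first_available(
    *sources: tuple[str, Callable[[], Awaitable[Any]]]
) -> tuple[str, Any]:
    """Try (mode, fetcher) pairs in order; return the first that works."""
    last_error = None
    for mode, fetch in sources:
        try:
            return mode, await fetch()
        except Exception as e:
            last_error = e          # remember the failure, fall through
    raise last_error                # every source failed

async def full():
    raise ConnectionError("payment service down")

async def degraded():
    return {"total_revenue": 1234.0, "estimated": True}

mode, data = asyncio.run(first_available(("full", full), ("degraded", degraded)))
print(mode)   # → degraded
```

Returning the mode alongside the data makes it trivial to surface the `"mode"` field to callers, as `get_daily_revenue` does above.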

## Bulkhead Pattern

Isolate resources to prevent one failure from exhausting all system resources.

```python
# infrastructure/bulkhead.py
import asyncio
from typing import Callable, Any

class Bulkhead:
    """
    Bulkhead pattern: Limit concurrent requests to prevent resource exhaustion.
    """
    
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self.max_concurrent = max_concurrent
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.active_count = 0
        self.rejected_count = 0
    
    async def execute(self, func: Callable, *args, **kwargs) -> Any:
        """Execute function with concurrency limit."""
        if self.semaphore.locked():
            self.rejected_count += 1
            raise BulkheadFull(f"Bulkhead [{self.name}] is full")
        
        async with self.semaphore:
            self.active_count += 1
            try:
                return await func(*args, **kwargs)
            finally:
                self.active_count -= 1


class BulkheadFull(Exception):
    """Raised when bulkhead is at capacity."""
    pass


# Usage: Limit concurrent calls to Payment Service
payment_bulkhead = Bulkhead("Payment", max_concurrent=10)

async def process_payment_with_bulkhead(order_id: str, amount: float):
    """Process payment with concurrency limit."""
    try:
        return await payment_bulkhead.execute(
            process_payment,
            order_id,
            amount
        )
    except BulkheadFull:
        return {"error": "Payment service busy, please retry"}
```
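A quick way to see the rejection behavior is a condensed, self-contained variant of the class above (`MiniBulkhead` is illustrative): with two slots and three simultaneous calls, exactly one is turned away immediately instead of queueing.

```python
import asyncio

class MiniBulkhead:
    """Condensed bulkhead: reject immediately when all slots are busy."""
    def __init__(self, max_concurrent: int):
        self.semaphore = asyncio.Semaphore(max_concurrent)

    async def execute(self, coro):
        if self.semaphore.locked():    # no free slot: fail fast, don't queue
            coro.close()               # discard the unstarted coroutine cleanly
            raise RuntimeError("bulkhead full")
        async with self.semaphore:
            return await coro

async def slow_call():
    await asyncio.sleep(0.05)
    return "ok"

async def main():
    bulkhead = MiniBulkhead(max_concurrent=2)
    return await asyncio.gather(
        *(bulkhead.execute(slow_call()) for _ in range(3)),
        return_exceptions=True,
    )

results = asyncio.run(main())
ok = [r for r in results if r == "ok"]
rejected = [r for r in results if isinstance(r, RuntimeError)]
print(len(ok), len(rejected))   # → 2 1
```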

## Production Lessons Learned

### Lesson 1: Monitor Circuit Breaker State

After implementing circuit breakers, I had no visibility into when they opened/closed.

**Solution**: Add metrics and alerts (`metrics.gauge` below stands in for whatever metrics client you use):

```python
async def circuit_breaker_metrics_middleware():
    """Emit metrics for circuit breaker states."""
    for name, cb in circuit_breakers.items():
        await metrics.gauge(
            f"circuit_breaker.{name}.state",
            1 if cb.state == CircuitState.OPEN else 0
        )
        await metrics.gauge(
            f"circuit_breaker.{name}.failure_count",
            cb.failure_count
        )
```

### Lesson 2: Don't Retry Non-Idempotent Operations

I retried a payment twice, charging the customer twice. Oops.

**Solution**: Distinguish retryable from non-retryable errors:

```python
class NonRetryableError(Exception):
    """Errors that should never be retried."""
    pass

RETRYABLE_STATUS_CODES = {408, 429, 500, 502, 503, 504}

async def call_with_smart_retry(func: Callable):
    """Only retry transient failures."""
    try:
        return await func()
    except httpx.HTTPStatusError as e:
        if e.response.status_code in RETRYABLE_STATUS_CODES:
            # Transient error, safe to retry
            raise
        else:
            # Client error (400, 404) or non-retryable (409), don't retry
            raise NonRetryableError(f"Non-retryable error: {e.response.status_code}")
```
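Classification alone doesn't make a payment safe to retry, though. The standard safeguard is an idempotency key: the client generates one key per logical payment and sends it on every retry, so the gateway can deduplicate (many real gateways accept such a key as a request header). Here is a toy sketch of the server side; the `PaymentGateway` class is purely illustrative:

```python
import uuid

class PaymentGateway:
    """Toy gateway that deduplicates charges by idempotency key."""
    def __init__(self):
        self._seen: dict[str, dict] = {}

    def charge(self, idempotency_key: str, amount: float) -> dict:
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]   # replay: return the cached result
        result = {"charged": amount, "id": str(uuid.uuid4())}
        self._seen[idempotency_key] = result
        return result

gateway = PaymentGateway()
key = str(uuid.uuid4())              # generated once per logical payment
first = gateway.charge(key, 25.0)
retry = gateway.charge(key, 25.0)    # a retry with the same key
assert first["id"] == retry["id"]    # customer charged exactly once
```

With this in place, even an over-eager retry loop can't double-charge: the second request with the same key returns the original result.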

### Lesson 3: Fallback Data Must Be Obvious

Users complained that stale cached data wasn't clearly marked.

**Solution**: Always indicate degraded mode:

```python
return {
    "data": result,
    "data_quality": "degraded",
    "warning": "Using cached data from 2 hours ago (Payment service unavailable)",
    "cached_at": "2024-01-15T14:23:00Z"
}
```

## Best Practices

1. **Use circuit breakers** for all external dependencies
2. **Implement exponential backoff** with jitter for retries
3. **Set aggressive timeouts** (don't wait forever for slow services)
4. **Distinguish retryable from non-retryable errors**
5. **Provide graceful degradation** (reduced functionality > complete failure)
6. **Use bulkheads** to isolate resource pools
7. **Monitor circuit breaker states** and alert when opened
8. **Mark degraded responses** clearly in the UI
9. **Test failure scenarios** regularly (chaos engineering)
10. **Fail fast** (timeout quickly rather than tying up resources)

## Next Steps

Resilience patterns keep your system running during failures. But how do you **know** when failures occur? How do you **debug** issues across 6 distributed services?

In the final article, **Observability & Monitoring Architecture**, we'll explore distributed tracing, structured logging, metrics collection, and health checks—the patterns that give you visibility into your distributed system's behavior.

***

*This is part of the Software Architecture 101 series, where I share lessons learned building a production multi-tenant POS system with 6 microservices.*
