Resilience & Fault Tolerance

Introduction

At 2:47 PM on a busy Friday, my Payment Service crashed. A bug in the payment gateway integration caused the service to hang indefinitely on every request. Within 90 seconds:

POS Core Service: 200+ pending requests, 100% CPU
Chatbot Service: Completely unresponsive
Restaurant Service: Timing out on all orders
Result: Entire POS system down, customers couldn't pay, revenue lost

The Payment Service was down for 8 minutes. But my entire system was offline for 23 minutes because I hadn't implemented proper resilience patterns.

That incident taught me this: In distributed systems, failures are not exceptional—they are normal. Services will crash, networks will be slow, dependencies will timeout. Your architecture must assume failure and handle it gracefully.

In this article, I'll share the resilience patterns I implemented after that painful outage: circuit breakers, retry logic, timeouts, and graceful degradation. These patterns transformed my POS system from fragile to antifragile.

The Payment Service Outage

Here's what happened during the outage:

14:47:23 - Payment Service hangs (bug in payment gateway client)
14:47:45 - POS Core starts accumulating timeout errors (22s later)
14:48:12 - POS Core thread pool exhausted, stops processing requests
14:48:30 - Chatbot Service cascades to failure (calling POS Core)
14:49:15 - Restaurant Service fails (depends on POS Core for orders)
14:55:00 - Payment Service restarted (bug fixed)
14:55:30 - POS Core still unresponsive (thread pool deadlocked)
15:10:00 - Full system restart required

Root cause: One service failure cascaded through the entire system because I had:

No circuit breakers (services kept calling failed dependencies)
No proper timeouts (threads waited forever)
No graceful degradation (all-or-nothing responses)

After this, I implemented the patterns below.

Circuit Breaker Pattern

A circuit breaker prevents cascading failures by stopping calls to failing services. It has three states:

CLOSED (normal) → OPEN (failing) → HALF-OPEN (testing) → CLOSED (recovered)

Implementation

# infrastructure/circuit_breaker.py
from datetime import datetime, timedelta
from enum import Enum
from typing import Callable, Any
import asyncio
import logging

logger = logging.getLogger(__name__)


class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing recovery


class CircuitBreaker:
    """
    Circuit breaker implementation.
    
    Prevents cascading failures by stopping calls to failing services.
    """
    
    def __init__(
        self,
        name: str,
        failure_threshold: int = 5,
        success_threshold: int = 2,
        timeout: float = 60.0
    ):
        self.name = name
        self.failure_threshold = failure_threshold  # Failures before opening
        self.success_threshold = success_threshold  # Successes to close from half-open
        self.timeout = timeout  # Seconds to wait before trying again
        
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time: datetime = None
    
    def is_open(self) -> bool:
        """Check if circuit is open (rejecting requests)."""
        if self.state == CircuitState.OPEN:
            # Check if timeout has elapsed
            if datetime.utcnow() - self.last_failure_time > timedelta(seconds=self.timeout):
                logger.info(f"Circuit breaker [{self.name}] entering HALF-OPEN state")
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
                return False
            return True
        return False
    
    async def call(self, func: Callable, *args, **kwargs) -> Any:
        """
        Execute function through circuit breaker.
        Raises CircuitBreakerOpen if circuit is open.
        """
        if self.is_open():
            raise CircuitBreakerOpen(f"Circuit breaker [{self.name}] is OPEN")
        
        try:
            # Execute the function
            result = await func(*args, **kwargs) if asyncio.iscoroutinefunction(func) else func(*args, **kwargs)
            
            # Success
            self.record_success()
            return result
            
        except Exception as e:
            # Failure
            self.record_failure()
            raise
    
    def record_success(self):
        """Record a successful call."""
        self.failure_count = 0
        
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            logger.info(
                f"Circuit breaker [{self.name}] success in HALF-OPEN "
                f"({self.success_count}/{self.success_threshold})"
            )
            
            if self.success_count >= self.success_threshold:
                logger.info(f"Circuit breaker [{self.name}] closing (recovered)")
                self.state = CircuitState.CLOSED
                self.success_count = 0
    
    def record_failure(self):
        """Record a failed call."""
        self.failure_count += 1
        self.last_failure_time = datetime.utcnow()
        
        logger.warning(
            f"Circuit breaker [{self.name}] failure "
            f"({self.failure_count}/{self.failure_threshold})"
        )
        
        if self.state == CircuitState.HALF_OPEN:
            logger.warning(f"Circuit breaker [{self.name}] opening (failed in HALF-OPEN)")
            self.state = CircuitState.OPEN
            self.failure_count = 0
            
        elif self.failure_count >= self.failure_threshold:
            logger.error(f"Circuit breaker [{self.name}] OPENED")
            self.state = CircuitState.OPEN


class CircuitBreakerOpen(Exception):
    """Exception raised when circuit breaker is open."""
    pass


# Global circuit breakers for each service
circuit_breakers = {
    "payment": CircuitBreaker("Payment Service", failure_threshold=5, timeout=60.0),
    "inventory": CircuitBreaker("Inventory Service", failure_threshold=5, timeout=30.0),
    "pos_core": CircuitBreaker("POS Core", failure_threshold=10, timeout=60.0),
    "restaurant": CircuitBreaker("Restaurant Service", failure_threshold=5, timeout=30.0),
}


def get_circuit_breaker(service_name: str) -> CircuitBreaker:
    """Get circuit breaker for a service."""
    return circuit_breakers.get(service_name)

Using Circuit Breakers

# services/payment_client.py
from infrastructure.circuit_breaker import get_circuit_breaker, CircuitBreakerOpen
import httpx

class PaymentClient:
    """Client for Payment Service with circuit breaker."""
    
    def __init__(self):
        self.base_url = "http://localhost:4004"
        self.circuit_breaker = get_circuit_breaker("payment")
    
    async def process_payment(
        self,
        tenant_id: str,
        order_id: str,
        amount: float
    ) -> dict:
        """
        Process payment with circuit breaker protection.
        Raises CircuitBreakerOpen if service is failing.
        """
        try:
            result = await self.circuit_breaker.call(
                self._do_process_payment,
                tenant_id,
                order_id,
                amount
            )
            return result
        except CircuitBreakerOpen as e:
            logger.error(f"Payment service unavailable: {e}")
            # Return error instead of crashing
            return {
                "success": False,
                "error": "Payment service temporarily unavailable",
                "retry_later": True
            }
    
    async def _do_process_payment(
        self,
        tenant_id: str,
        order_id: str,
        amount: float
    ) -> dict:
        """Actual payment processing logic."""
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.base_url}/payments",
                json={
                    "order_id": order_id,
                    "amount": amount
                },
                headers={"x-tenant-id": tenant_id},
                timeout=5.0
            )
            
            if response.status_code != 200:
                raise Exception(f"Payment failed: {response.status_code}")
            
            return response.json()

With circuit breakers, when the Payment Service fails:

After 5 failures: Circuit opens, stops sending requests
For 60 seconds: Immediately return error (don't wait for timeout)
After 60 seconds: Try one request (half-open)
If 2 successes: Close circuit, resume normal operation

This prevents the cascade that took down my system.

Retry Strategies with Exponential Backoff

Transient failures (network blips, temporary overload) should be retried. But naive retries can make things worse.

Bad Retry (Fixed Interval)

# DON'T DO THIS
async def call_service_bad_retry():
    for i in range(3):
        try:
            return await service_call()
        except Exception:
            await asyncio.sleep(1)  # Always wait 1 second
    raise Exception("Failed after 3 retries")

Problem: If service is overloaded, this hammers it with requests every second, making the problem worse.

Good Retry (Exponential Backoff with Jitter)

# infrastructure/retry.py
import asyncio
import random
from typing import Callable, Any
import logging

logger = logging.getLogger(__name__)


async def retry_with_backoff(
    func: Callable,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    exponential_base: float = 2.0,
    jitter: bool = True,
    retryable_exceptions: tuple = (Exception,)
) -> Any:
    """
    Retry function with exponential backoff and jitter.
    
    Args:
        func: Async function to retry
        max_retries: Maximum number of retry attempts
        base_delay: Initial delay in seconds
        max_delay: Maximum delay in seconds
        exponential_base: Multiplier for exponential backoff
        jitter: Add random jitter to prevent thundering herd
        retryable_exceptions: Tuple of exceptions that should trigger retry
    
    Returns:
        Result of successful function call
    
    Raises:
        Last exception if all retries fail
    """
    last_exception = None
    
    for attempt in range(max_retries + 1):
        try:
            return await func()
        except retryable_exceptions as e:
            last_exception = e
            
            if attempt == max_retries:
                logger.error(f"Failed after {max_retries} retries: {e}")
                raise
            
            # Calculate delay: base_delay * (exponential_base ^ attempt)
            delay = min(base_delay * (exponential_base ** attempt), max_delay)
            
            # Add jitter (random variation)
            if jitter:
                delay = delay * (0.5 + random.random())  # 50-150% of calculated delay
            
            logger.warning(
                f"Attempt {attempt + 1}/{max_retries} failed: {e}. "
                f"Retrying in {delay:.2f}s"
            )
            
            await asyncio.sleep(delay)
    
    raise last_exception


# Decorator version
def with_retry(
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0
):
    """Decorator for retry with exponential backoff."""
    def decorator(func):
        async def wrapper(*args, **kwargs):
            return await retry_with_backoff(
                lambda: func(*args, **kwargs),
                max_retries=max_retries,
                base_delay=base_delay,
                max_delay=max_delay
            )
        return wrapper
    return decorator

Using Retries

# services/inventory_client.py
from infrastructure.retry import with_retry
import httpx

class InventoryClient:
    """Client for Inventory Service with retry logic."""
    
    @with_retry(max_retries=3, base_delay=0.5, max_delay=5.0)
    async def get_product(self, tenant_id: str, product_id: str) -> dict:
        """Get product with automatic retry on failure."""
        async with httpx.AsyncClient() as client:
            response = await client.get(
                f"http://localhost:4003/products/{product_id}",
                headers={"x-tenant-id": tenant_id},
                timeout=3.0
            )
            
            if response.status_code == 503:  # Service unavailable
                raise httpx.HTTPStatusError("Service unavailable", request=None, response=response)
            
            response.raise_for_status()
            return response.json()

Retry timing:

Attempt 1: Immediate
Attempt 2: ~0.5s later (base delay)
Attempt 3: ~1.0s later (0.5 * 2^1)
Attempt 4: ~2.0s later (0.5 * 2^2)

Jitter prevents thundering herd (all clients retrying at same time).

Timeout Configuration

Every network call needs a timeout. Always. No exceptions.

Service-Specific Timeouts

# config/timeouts.py
from dataclasses import dataclass

@dataclass
class ServiceTimeout:
    """Timeout configuration for a service."""
    connect: float  # Time to establish connection
    read: float     # Time to read response
    total: float    # Total request time


class TimeoutConfig:
    """Centralized timeout configuration."""
    
    # Auth Service: Fast, in-memory operations
    AUTH = ServiceTimeout(connect=1.0, read=2.0, total=3.0)
    
    # POS Core: Medium, database queries
    POS_CORE = ServiceTimeout(connect=1.0, read=5.0, total=8.0)
    
    # Inventory Service: Medium, MongoDB queries
    INVENTORY = ServiceTimeout(connect=1.0, read=3.0, total=5.0)
    
    # Payment Service: Slow, external gateway
    PAYMENT = ServiceTimeout(connect=2.0, read=10.0, total=15.0)
    
    # Restaurant Service: Fast, small dataset
    RESTAURANT = ServiceTimeout(connect=1.0, read=2.0, total=4.0)
    
    # Chatbot: Slow, aggregates multiple services
    CHATBOT = ServiceTimeout(connect=1.0, read=15.0, total=20.0)


# Using timeouts with httpx
import httpx

async def call_payment_service(url: str, data: dict) -> dict:
    """Call payment service with proper timeouts."""
    timeout = httpx.Timeout(
        connect=TimeoutConfig.PAYMENT.connect,
        read=TimeoutConfig.PAYMENT.read,
        write=5.0,
        pool=None
    )
    
    async with httpx.AsyncClient(timeout=timeout) as client:
        response = await client.post(url, json=data)
        return response.json()

Timeout vs Deadline

Sometimes you need a deadline (absolute time limit) rather than per-call timeout:

import asyncio
from datetime import datetime, timedelta

async def chatbot_query_with_deadline(query: str, deadline_seconds: float = 10.0):
    """
    Process chatbot query with absolute deadline.
    Useful when orchestrating multiple services.
    """
    start_time = datetime.utcnow()
    deadline = start_time + timedelta(seconds=deadline_seconds)
    
    async def check_deadline():
        """Helper to check if deadline exceeded."""
        if datetime.utcnow() > deadline:
            raise asyncio.TimeoutError(f"Deadline exceeded ({deadline_seconds}s)")
    
    try:
        # Step 1: Get orders (budget: 3s)
        await check_deadline()
        orders = await asyncio.wait_for(get_orders(), timeout=3.0)
        
        # Step 2: Get products (budget: 3s)
        await check_deadline()
        products = await asyncio.wait_for(get_products(), timeout=3.0)
        
        # Step 3: Get payments (budget: 3s)
        await check_deadline()
        payments = await asyncio.wait_for(get_payments(), timeout=3.0)
        
        # Step 4: Aggregate (budget: remaining time)
        await check_deadline()
        return aggregate(orders, products, payments)
        
    except asyncio.TimeoutError as e:
        elapsed = (datetime.utcnow() - start_time).total_seconds()
        logger.error(f"Query timeout after {elapsed:.2f}s: {e}")
        raise

Graceful Degradation

When dependencies fail, provide reduced functionality instead of complete failure.

Example: Chatbot with Degraded Service

# services/chatbot_resilient.py
from infrastructure.circuit_breaker import CircuitBreakerOpen
import logging

logger = logging.getLogger(__name__)


class ResilientChatbot:
    """Chatbot with graceful degradation."""
    
    async def get_daily_revenue(self, tenant_id: str, date: date) -> dict:
        """
        Get daily revenue with graceful degradation.
        
        Full response: Orders + Payments + Products
        Degraded: Orders only (estimated revenue)
        Minimal: Cached data from yesterday
        """
        result = {
            "date": date.isoformat(),
            "mode": "full"
        }
        
        try:
            # Try full data aggregation
            orders = await self.pos_client.get_orders(tenant_id, date)
            payments = await self.payment_client.get_payments(tenant_id, date)
            products = await self.inventory_client.get_products(tenant_id)
            
            result.update({
                "total_revenue": payments["total"],
                "order_count": len(orders),
                "payment_methods": payments["by_method"],
                "top_products": self._calculate_top_products(orders, products)
            })
            
        except CircuitBreakerOpen:
            logger.warning("Payment service unavailable, using degraded mode")
            result["mode"] = "degraded"
            
            try:
                # Degraded: Orders only, estimate revenue
                orders = await self.pos_client.get_orders(tenant_id, date)
                estimated_revenue = sum(order["total"] for order in orders)
                
                result.update({
                    "total_revenue": estimated_revenue,
                    "order_count": len(orders),
                    "warning": "Estimated revenue (payment data unavailable)"
                })
                
            except Exception:
                logger.error("POS service also unavailable, using cache")
                result["mode"] = "minimal"
                
                # Minimal: Return cached data
                cached = await self.cache.get(f"revenue:{tenant_id}")
                if cached:
                    result.update(cached)
                    result["warning"] = "Using cached data (services unavailable)"
                else:
                    result["error"] = "No data available"
        
        return result

Users get:

Full data when all services work
Estimated data when some services fail
Cached data when all services fail
Clear error message when even cache is empty

Bulkhead Pattern

Isolate resources to prevent one failure from exhausting all system resources.

# infrastructure/bulkhead.py
import asyncio
from typing import Callable, Any

class Bulkhead:
    """
    Bulkhead pattern: Limit concurrent requests to prevent resource exhaustion.
    """
    
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self.max_concurrent = max_concurrent
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.active_count = 0
        self.rejected_count = 0
    
    async def execute(self, func: Callable, *args, **kwargs) -> Any:
        """Execute function with concurrency limit."""
        if self.semaphore.locked():
            self.rejected_count += 1
            raise BulkheadFull(f"Bulkhead [{self.name}] is full")
        
        async with self.semaphore:
            self.active_count += 1
            try:
                return await func(*args, **kwargs)
            finally:
                self.active_count -= 1


class BulkheadFull(Exception):
    """Raised when bulkhead is at capacity."""
    pass


# Usage: Limit concurrent calls to Payment Service
payment_bulkhead = Bulkhead("Payment", max_concurrent=10)

async def process_payment_with_bulkhead(order_id: str, amount: float):
    """Process payment with concurrency limit."""
    try:
        return await payment_bulkhead.execute(
            process_payment,
            order_id,
            amount
        )
    except BulkheadFull:
        return {"error": "Payment service busy, please retry"}

Production Lessons Learned

Lesson 1: Monitor Circuit Breaker State

After implementing circuit breakers, I had no visibility into when they opened/closed.

Solution: Add metrics and alerts:

async def circuit_breaker_metrics_middleware():
    """Emit metrics for circuit breaker states."""
    for name, cb in circuit_breakers.items():
        await metrics.gauge(
            f"circuit_breaker.{name}.state",
            1 if cb.state == CircuitState.OPEN else 0
        )
        await metrics.gauge(
            f"circuit_breaker.{name}.failure_count",
            cb.failure_count
        )

Lesson 2: Don't Retry Idempotency Errors

I retried a payment twice, charging the customer twice. Oops.

Solution: Distinguish retryable from non-retryable errors:

class NonRetryableError(Exception):
    """Errors that should never be retried."""
    pass

RETRYABLE_STATUS_CODES = {408, 429, 500, 502, 503, 504}

async def call_with_smart_retry(func: Callable):
    """Only retry transient failures."""
    try:
        return await func()
    except httpx.HTTPStatusError as e:
        if e.response.status_code in RETRYABLE_STATUS_CODES:
            # Transient error, safe to retry
            raise
        else:
            # Client error (400, 404) or non-retryable (409), don't retry
            raise NonRetryableError(f"Non-retryable error: {e.response.status_code}")

Lesson 3: Fallback Data Must Be Obvious

Users complained that stale cached data wasn't clearly marked.

Solution: Always indicate degraded mode:

return {
    "data": result,
    "data_quality": "degraded",
    "warning": "Using cached data from 2 hours ago (Payment service unavailable)",
    "cached_at": "2024-01-15T14:23:00Z"
}

Best Practices

Use circuit breakers for all external dependencies
Implement exponential backoff with jitter for retries
Set aggressive timeouts (don't wait forever for slow services)
Distinguish retryable from non-retryable errors
Provide graceful degradation (reduced functionality > complete failure)
Use bulkheads to isolate resource pools
Monitor circuit breaker states and alert when opened
Mark degraded responses clearly in the UI
Test failure scenarios regularly (chaos engineering)
Fail fast (timeout quickly rather than tying up resources)

Next Steps

Resilience patterns keep your system running during failures. But how do you know when failures occur? How do you debug issues across 6 distributed services?

In the final article, Observability & Monitoring Architecture, we'll explore distributed tracing, structured logging, metrics collection, and health checks—the patterns that give you visibility into your distributed system's behavior.

This is part of the Software Architecture 101 series, where I share lessons learned building a production multi-tenant POS system with 6 microservices.

PreviousIntegration Patterns & Orchestration NextObservability & Monitoring Architecture

Last updated 5 days ago

hashtagTable of Contents

hashtagIntroduction

hashtagThe Payment Service Outage

hashtagCircuit Breaker Pattern

hashtagImplementation

hashtagUsing Circuit Breakers

hashtagRetry Strategies with Exponential Backoff

hashtagBad Retry (Fixed Interval)

hashtagGood Retry (Exponential Backoff with Jitter)

hashtagUsing Retries

hashtagTimeout Configuration

hashtagService-Specific Timeouts

hashtagTimeout vs Deadline

hashtagGraceful Degradation

hashtagExample: Chatbot with Degraded Service

hashtagBulkhead Pattern

hashtagProduction Lessons Learned

hashtagLesson 1: Monitor Circuit Breaker State

hashtagLesson 2: Don't Retry Idempotency Errors

hashtagLesson 3: Fallback Data Must Be Obvious

hashtagBest Practices

hashtagNext Steps

Table of Contents