Resilience & Fault Tolerance

Table of Contents

  • Introduction

  • The Payment Service Outage

  • Circuit Breaker Pattern

  • Retry Strategies with Exponential Backoff

  • Timeout Configuration

  • Graceful Degradation

  • Bulkhead Pattern

  • Production Lessons Learned

  • Best Practices

  • Next Steps

Introduction

At 2:47 PM on a busy Friday, my Payment Service went down. A bug in the payment gateway integration caused the service to hang indefinitely on every request. Within 90 seconds:

  • POS Core Service: 200+ pending requests, 100% CPU

  • Chatbot Service: Completely unresponsive

  • Restaurant Service: Timing out on all orders

  • Result: Entire POS system down, customers couldn't pay, revenue lost

The Payment Service was down for 8 minutes. But my entire system was offline for 23 minutes because I hadn't implemented proper resilience patterns.

That incident taught me this: In distributed systems, failures are not exceptional—they are normal. Services will crash, networks will be slow, dependencies will time out. Your architecture must assume failure and handle it gracefully.

In this article, I'll share the resilience patterns I implemented after that painful outage: circuit breakers, retry logic, timeouts, and graceful degradation. These patterns transformed my POS system from fragile to resilient.

The Payment Service Outage

Here's the timeline of the outage:

  • 2:47 PM: Payment Service starts hanging on every request

  • 2:48–2:49 PM: Within 90 seconds, POS Core saturates, the Chatbot goes unresponsive, and the Restaurant Service starts timing out on all orders

  • 2:55 PM: Payment Service restored (8 minutes of downtime)

  • 3:10 PM: Full system recovery (23 minutes of total downtime)

Root cause: One service failure cascaded through the entire system because I had:

  • No circuit breakers (services kept calling failed dependencies)

  • No proper timeouts (threads waited forever)

  • No graceful degradation (all-or-nothing responses)

After this, I implemented the patterns below.

Circuit Breaker Pattern

A circuit breaker prevents cascading failures by stopping calls to failing services. It has three states:

  • Closed: normal operation; requests pass through and failures are counted

  • Open: requests fail immediately without calling the service

  • Half-Open: after a cooldown, a limited number of test requests are allowed through to check whether the service has recovered

Implementation
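Here's a minimal sketch of the breaker, using the thresholds from the incident (5 failures to open, a 60-second open window, 2 successes to close). The `CircuitBreaker` and `CircuitOpenError` names are my own, not from a specific library:

```python
import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"        # normal operation: calls pass through
    OPEN = "open"            # failing fast: calls are rejected immediately
    HALF_OPEN = "half_open"  # probing: test calls allowed through


class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is rejected."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60.0, success_threshold=2):
        self.failure_threshold = failure_threshold  # failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to stay open
        self.success_threshold = success_threshold  # successes needed to close
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is CircuitState.OPEN:
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                # Cooldown elapsed: let probe requests through.
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
            else:
                # Fail fast instead of waiting on a dead dependency.
                raise CircuitOpenError("circuit open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failure_count += 1
        if self.state is CircuitState.HALF_OPEN or self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
            self.opened_at = time.monotonic()
            self.failure_count = 0

    def _on_success(self):
        if self.state is CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
        self.failure_count = 0
```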

Using Circuit Breakers
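A sketch of wrapping a downstream call with the breaker; `charge_payment` and the payment-service URL are hypothetical stand-ins for the real client:

```python
import requests

payment_breaker = CircuitBreaker(
    failure_threshold=5, recovery_timeout=60.0, success_threshold=2
)

def charge_payment(order_id: str, amount_cents: int) -> dict:
    # Hypothetical endpoint; stands in for the real Payment Service client.
    resp = requests.post(
        "http://payment-service/api/charges",
        json={"order_id": order_id, "amount_cents": amount_cents},
        timeout=5,  # never wait forever (see Timeout Configuration below)
    )
    resp.raise_for_status()
    return resp.json()

try:
    receipt = payment_breaker.call(charge_payment, "order-123", 1999)
except CircuitOpenError:
    receipt = None  # breaker is open: fail fast, don't even attempt the call
except requests.RequestException:
    receipt = None  # the call failed; the breaker counted it toward its threshold
```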

With circuit breakers, when the Payment Service fails:

  1. After 5 failures: Circuit opens, stops sending requests

  2. For 60 seconds: Immediately return error (don't wait for timeout)

  3. After 60 seconds: Try one request (half-open)

  4. If 2 successes: Close circuit, resume normal operation

This prevents the cascade that took down my system.

Retry Strategies with Exponential Backoff

Transient failures (network blips, temporary overload) should be retried. But naive retries can make things worse.

Bad Retry (Fixed Interval)
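This is roughly what my first retry loop looked like (sketch):

```python
import time

def call_with_naive_retry(func, *args, attempts=5, **kwargs):
    # Anti-pattern: same interval every time, regardless of load.
    for attempt in range(attempts):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts
            time.sleep(1.0)  # a fixed 1s interval hammers an overloaded service
```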

Problem: If the service is overloaded, this hammers it with requests every second, making the problem worse.

Good Retry (Exponential Backoff with Jitter)
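A sketch of the replacement, with delays that double per attempt and ±50% jitter. The `retry_on` parameter is my own addition; it becomes important in Lesson 2 below:

```python
import random
import time

def call_with_backoff(func, *args, max_attempts=4, base_delay=0.5,
                      max_delay=10.0, retry_on=(Exception,), **kwargs):
    """Retry with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # Delay doubles per attempt (0.5s, 1.0s, 2.0s, ...), capped at max_delay.
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Jitter of +/-50% desynchronizes clients so they don't all
            # retry at the same instant (thundering herd).
            time.sleep(delay * random.uniform(0.5, 1.5))
```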

Using Retries
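Putting the pieces together, a sketch of the full call path (reusing `payment_breaker`, `charge_payment`, `call_with_backoff`, and `CircuitOpenError` from the sketches above):

```python
import requests

try:
    receipt = call_with_backoff(
        payment_breaker.call, charge_payment, "order-123", 1999,
        # Only retry network-level failures; a CircuitOpenError means the
        # breaker has already decided to fail fast, so retrying is pointless.
        retry_on=(requests.RequestException,),
    )
except CircuitOpenError:
    receipt = None  # degrade instead of hammering a dead dependency
```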

Retry timing:

  • Attempt 1: Immediate

  • Attempt 2: ~0.5s later (base delay)

  • Attempt 3: ~1.0s later (0.5 * 2^1)

  • Attempt 4: ~2.0s later (0.5 * 2^2)

Jitter prevents a thundering herd (all clients retrying at the same time).

Timeout Configuration

Every network call needs a timeout. Always. No exceptions.

Service-Specific Timeouts
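One global timeout value doesn't fit every dependency. A sketch using the requests library; the per-service numbers are illustrative, tuned to each service's latency profile:

```python
import requests

# (connect_timeout, read_timeout) in seconds, tuned per dependency.
# Illustrative values: fast internal lookups get tight budgets; the
# payment path gets more headroom because an external gateway sits behind it.
SERVICE_TIMEOUTS = {
    "restaurant-service": (1.0, 2.0),
    "pos-core-service":   (1.0, 3.0),
    "payment-service":    (2.0, 5.0),
}

def get_json(service: str, path: str) -> dict:
    connect, read = SERVICE_TIMEOUTS[service]
    resp = requests.get(f"http://{service}{path}", timeout=(connect, read))
    resp.raise_for_status()
    return resp.json()
```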

Timeout vs Deadline

Sometimes you need a deadline (absolute time limit) rather than per-call timeout:
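A sketch of a request-scoped deadline, where each downstream call only gets whatever time remains in the overall budget (the `Deadline` class and endpoints are illustrative):

```python
import time
import requests

class Deadline:
    """An absolute time budget shared by a whole chain of calls."""

    def __init__(self, budget_seconds: float):
        self.expires_at = time.monotonic() + budget_seconds

    def remaining(self) -> float:
        left = self.expires_at - time.monotonic()
        if left <= 0:
            raise TimeoutError("deadline exceeded")
        return left

def fetch_order_summary(order_id: str) -> dict:
    deadline = Deadline(budget_seconds=3.0)  # the whole operation gets 3s

    # Each hop's timeout is whatever budget remains, so a slow first call
    # shrinks the time allowed for the second instead of blowing the total.
    order = requests.get(
        f"http://pos-core-service/orders/{order_id}",
        timeout=deadline.remaining(),
    ).json()
    payment = requests.get(
        f"http://payment-service/payments/{order_id}",
        timeout=deadline.remaining(),
    ).json()
    return {"order": order, "payment": payment}
```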

Graceful Degradation

When dependencies fail, provide reduced functionality instead of complete failure.

Example: Chatbot with Degraded Service
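A sketch of the fallback chain for a "what are today's sales?" question; the endpoints and response fields are illustrative:

```python
import requests

def answer_daily_sales(restaurant_id: str, cache: dict) -> dict:
    # Level 1: full, live data from the reporting endpoint.
    try:
        resp = requests.get(
            f"http://pos-core-service/reports/daily-sales/{restaurant_id}",
            timeout=2,
        )
        resp.raise_for_status()
        return {"data": resp.json(), "source": "live"}
    except requests.RequestException:
        pass

    # Level 2: estimate from today's raw orders if reporting is down.
    try:
        resp = requests.get(
            f"http://restaurant-service/orders/{restaurant_id}/today",
            timeout=2,
        )
        resp.raise_for_status()
        total = sum(order["total_cents"] for order in resp.json())
        return {"data": {"total_cents": total}, "source": "estimated"}
    except requests.RequestException:
        pass

    # Level 3: last known value from cache.
    cached = cache.get(restaurant_id)
    if cached is not None:
        return {"data": cached, "source": "cache"}

    # Level 4: a clear, honest error, never a silent empty answer.
    return {"error": "Sales data is temporarily unavailable. Please try again shortly."}
```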

Users get:

  1. Full data when all services work

  2. Estimated data when some services fail

  3. Cached data when all services fail

  4. Clear error message when even cache is empty

Bulkhead Pattern

Isolate resources to prevent one failure from exhausting all system resources.
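A sketch using one bounded thread pool per dependency, so a hung Payment Service can exhaust only its own workers, never the whole process:

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per dependency. If every payment worker is stuck on a
# hung gateway, chatbot and restaurant calls still have threads available.
BULKHEADS = {
    "payment":    ThreadPoolExecutor(max_workers=10, thread_name_prefix="payment"),
    "restaurant": ThreadPoolExecutor(max_workers=10, thread_name_prefix="restaurant"),
    "chatbot":    ThreadPoolExecutor(max_workers=5, thread_name_prefix="chatbot"),
}

def run_in_bulkhead(dependency: str, func, *args, timeout=5.0, **kwargs):
    future = BULKHEADS[dependency].submit(func, *args, **kwargs)
    # result(timeout=...) bounds how long the caller waits (raising
    # concurrent.futures.TimeoutError), while max_workers bounds how many
    # calls can be stuck inside that dependency at once.
    return future.result(timeout=timeout)
```

With a pool like this, the 200+ requests that piled up during the outage would have been capped at the pool size.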

Production Lessons Learned

Lesson 1: Monitor Circuit Breaker State

After implementing circuit breakers, I had no visibility into when they opened/closed.

Solution: Add metrics and alerts:
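A sketch assuming Prometheus via the prometheus_client library; the metric names are my own convention:

```python
from prometheus_client import Counter, Gauge

# One series per downstream service; alert when the state sits at 2 (open)
# for more than a minute.
CIRCUIT_STATE = Gauge(
    "circuit_breaker_state",
    "Current circuit breaker state (0=closed, 1=half-open, 2=open)",
    ["service"],
)
CIRCUIT_TRANSITIONS = Counter(
    "circuit_breaker_transitions_total",
    "Circuit breaker state transitions",
    ["service", "to_state"],
)

STATE_VALUES = {"closed": 0, "half_open": 1, "open": 2}

def record_transition(service: str, to_state: str) -> None:
    # Call this from the breaker whenever its state changes.
    CIRCUIT_STATE.labels(service=service).set(STATE_VALUES[to_state])
    CIRCUIT_TRANSITIONS.labels(service=service, to_state=to_state).inc()
```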

Lesson 2: Don't Retry Non-Idempotent Operations

A retry replayed a payment that had already gone through, charging the customer twice. Oops.

Solution: Distinguish retryable from non-retryable errors:
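A sketch of the classification plus an idempotency key, so even a retried charge can't bill twice. The `Idempotency-Key` header assumes the payment gateway supports idempotent replays; the endpoint is the hypothetical one from earlier:

```python
import uuid
import requests

def is_retryable(exc: Exception) -> bool:
    # Network-level failures: safe to retry (the request may never have arrived).
    if isinstance(exc, (requests.ConnectionError, requests.Timeout)):
        return True
    # HTTP errors: only retry transient server-side failures, never 4xx
    # client errors (a declined card won't succeed on the second try).
    if isinstance(exc, requests.HTTPError) and exc.response is not None:
        return exc.response.status_code in (502, 503, 504)
    return False

def charge_idempotently(order_id: str, amount_cents: int) -> dict:
    # One deterministic key per logical charge: a retry replays the SAME
    # charge instead of creating a second one.
    idempotency_key = str(uuid.uuid5(uuid.NAMESPACE_URL, f"charge:{order_id}"))
    resp = requests.post(
        "http://payment-service/api/charges",
        json={"order_id": order_id, "amount_cents": amount_cents},
        headers={"Idempotency-Key": idempotency_key},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()
```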

Lesson 3: Fallback Data Must Be Obvious

Users complained that stale cached data wasn't clearly marked.

Solution: Always indicate degraded mode:
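A sketch of the response envelope I now return for fallback data; the field names are my own convention:

```python
import json
import time
from typing import Optional

def wrap_response(data: dict, source: str, cached_at: Optional[float] = None) -> str:
    """Envelope that makes degraded data impossible to miss downstream."""
    envelope = {
        "data": data,
        "degraded": source != "live",  # UI shows a banner whenever this is true
        "source": source,              # "live" | "estimated" | "cache"
    }
    if cached_at is not None:
        # Lets the UI render e.g. "Sales as of 12 minutes ago".
        envelope["data_age_minutes"] = round((time.time() - cached_at) / 60)
    return json.dumps(envelope)
```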

Best Practices

  1. Use circuit breakers for all external dependencies

  2. Implement exponential backoff with jitter for retries

  3. Set aggressive timeouts (don't wait forever for slow services)

  4. Distinguish retryable from non-retryable errors

  5. Provide graceful degradation (reduced functionality > complete failure)

  6. Use bulkheads to isolate resource pools

  7. Monitor circuit breaker states and alert when opened

  8. Mark degraded responses clearly in the UI

  9. Test failure scenarios regularly (chaos engineering)

  10. Fail fast (timeout quickly rather than tying up resources)

Next Steps

Resilience patterns keep your system running during failures. But how do you know when failures occur? How do you debug issues across 6 distributed services?

In the final article, Observability & Monitoring Architecture, we'll explore distributed tracing, structured logging, metrics collection, and health checks—the patterns that give you visibility into your distributed system's behavior.


This is part of the Software Architecture 101 series, where I share lessons learned building a production multi-tenant POS system with 6 microservices.
