Integration Patterns & Orchestration

Introduction

The Chatbot Service (port 4006) in my POS system has one job: answer user questions by aggregating data from 5 other services. Sounds simple, right? It wasn't.

A question like "What's my revenue today?" requires:

  • Auth Service: Verify user and get permissions

  • POS Core: Get all orders for the date

  • Payment: Get payment totals and methods

  • Inventory: Get product details and costs

  • Restaurant: Get table assignments and covers

When I first implemented this, I made sequential HTTP calls, waiting for each response before making the next request. A single chatbot query took 2.3 seconds. Users thought the system was broken.

The solution was orchestration: one service (the orchestrator) coordinates the others, making parallel calls where possible and aggregating the results. This pattern reduced response times from 2.3s to 380ms—a 6x improvement.

In this article, I'll share how I built the Chatbot Orchestrator, including parallel execution with asyncio, correlation IDs for distributed tracing, and the hard lessons I learned about orchestration vs choreography.

The Chatbot Integration Challenge

The Chatbot Service needs to answer business questions that span multiple domains:

Revenue Questions:

  • "What's my revenue today?" → POS Core + Payment

  • "What's my best-selling product?" → POS Core + Inventory + Payment

  • "What payment methods are customers using?" → Payment + POS Core

Inventory Questions:

  • "What products are low in stock?" → Inventory

  • "What's my inventory value?" → Inventory + Product costs

Operational Questions:

  • "How many tables are occupied?" → Restaurant

  • "What orders are pending?" → POS Core + Restaurant

Each question requires data from 2-5 services. The challenge: how do we coordinate these calls efficiently, reliably, and with proper error handling?

Orchestration vs Choreography

There are two patterns for integrating distributed services:

Orchestration (Centralized)

One service (orchestrator) explicitly controls the workflow:

Pros:

  • Clear workflow visibility (all logic in one place)

  • Easy to debug (centralized logs)

  • Simple error handling and compensation

Cons:

  • Single point of failure

  • Can become a bottleneck

  • Tight coupling to downstream services

Choreography (Event-Driven)

Services react to events without central coordination:

Pros:

  • Decoupled services

  • No single point of failure

  • Scales independently

Cons:

  • Workflow is implicit (harder to understand)

  • Difficult to debug (distributed logs)

  • Complex error handling (compensating transactions)

My decision: Use orchestration for the Chatbot because:

  1. Chatbot queries need synchronous responses (users wait for answers)

  2. Workflow is complex (5 services, conditional logic)

  3. I need centralized error handling and retry logic

I use choreography for background workflows like order processing (covered in the Event-Driven Architecture article).

Implementing the Orchestrator Pattern

Here's the Chatbot Orchestrator that coordinates calls to 5 services:
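A minimal sketch of how such an orchestrator can look. The client wrappers and method names (`verify_user`, `get_orders`, `get_totals`) are illustrative stand-ins for the real service APIs:

```python
import asyncio
import uuid


class ChatbotOrchestrator:
    """Coordinates downstream service calls for a single chatbot query."""

    def __init__(self, auth, pos_core, payment, inventory, restaurant):
        # Each client is an async wrapper around one service's HTTP API.
        self.auth = auth
        self.pos_core = pos_core
        self.payment = payment
        self.inventory = inventory
        self.restaurant = restaurant

    async def answer_revenue_query(self, user_id: str, date: str) -> dict:
        correlation_id = str(uuid.uuid4())  # one ID traces the whole request

        # Step 1 (sequential): every later call needs the tenant from auth.
        user = await self.auth.verify_user(user_id, correlation_id)

        # Step 2 (parallel): orders and payments are independent reads.
        orders, payments = await asyncio.gather(
            self.pos_core.get_orders(user["tenant_id"], date, correlation_id),
            self.payment.get_totals(user["tenant_id"], date, correlation_id),
        )

        return {
            "correlation_id": correlation_id,
            "order_count": len(orders),
            "revenue": payments["total"],
        }
```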

This orchestrator:

  • Coordinates 5 services

  • Executes parallel calls when dependencies allow

  • Handles errors gracefully

  • Tracks requests with correlation IDs

  • Aggregates data from multiple sources

Parallel Service Calls with asyncio

The key to fast orchestration is parallel execution. Python's asyncio.gather() makes this easy:
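A self-contained sketch (the service names and delays are made up to show the effect):

```python
import asyncio
import time


async def fetch(name: str, delay: float) -> str:
    # Stand-in for an HTTP call to one downstream service.
    await asyncio.sleep(delay)
    return f"{name}-data"


async def main() -> tuple:
    start = time.perf_counter()
    # gather() runs all three coroutines concurrently: total latency is
    # roughly the slowest call (0.12s), not the sum (0.30s).
    results = await asyncio.gather(
        fetch("pos_core", 0.10),
        fetch("payment", 0.08),
        fetch("inventory", 0.12),
    )
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```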

Handling Dependencies

Some calls depend on others. Use sequential then parallel:
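For example, every downstream call needs the tenant ID from auth, so that call runs first; everything after it can fan out. A sketch with stubbed services:

```python
import asyncio


async def get_user(user_id: str) -> dict:
    await asyncio.sleep(0.01)  # simulated auth call
    return {"id": user_id, "tenant_id": "tenant-1"}


async def get_orders(tenant_id: str) -> list:
    await asyncio.sleep(0.01)  # simulated POS Core call
    return [{"id": 1, "total": 25.0}]


async def get_payments(tenant_id: str) -> dict:
    await asyncio.sleep(0.01)  # simulated Payment call
    return {"total": 25.0}


async def handle_query(user_id: str) -> dict:
    # Phase 1 (sequential): the auth call cannot be parallelized away
    # because the other calls need its tenant_id.
    user = await get_user(user_id)

    # Phase 2 (parallel): orders and payments don't depend on each other.
    orders, payments = await asyncio.gather(
        get_orders(user["tenant_id"]),
        get_payments(user["tenant_id"]),
    )
    return {"orders": orders, "revenue": payments["total"]}


print(asyncio.run(handle_query("u-42")))
```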

Error Handling with gather()

Use return_exceptions=True to handle errors without stopping other tasks:
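A sketch of the pattern, with one stub that succeeds and one that fails:

```python
import asyncio


async def fetch_inventory() -> str:
    return "inventory-data"


async def fetch_payments() -> str:
    raise ConnectionError("payment service unreachable")


async def main() -> dict:
    # return_exceptions=True means gather() never raises: a failed task
    # yields its exception object, and the sibling tasks keep running.
    results = await asyncio.gather(
        fetch_inventory(), fetch_payments(), return_exceptions=True
    )
    response = {}
    for name, result in zip(["inventory", "payment"], results):
        if isinstance(result, Exception):
            response[name] = {"error": str(result)}
        else:
            response[name] = {"data": result}
    return response


print(asyncio.run(main()))
# {'inventory': {'data': 'inventory-data'}, 'payment': {'error': 'payment service unreachable'}}
```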

Correlation IDs for Request Tracing

When a chatbot query touches 5 services, debugging failures is hard. Correlation IDs solve this:
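One way to wire this up in Python is a `contextvars.ContextVar` plus a logging filter, so the ID flows through async calls without being threaded through every function signature. The log format here is illustrative:

```python
import contextvars
import logging
import uuid

# Propagates the correlation ID through the async call chain.
correlation_id: contextvars.ContextVar[str] = contextvars.ContextVar(
    "correlation_id", default="-"
)


class CorrelationFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Stamp every log record with the current request's ID.
        record.correlation_id = correlation_id.get()
        return True


logger = logging.getLogger("chatbot")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("[%(correlation_id)s] %(levelname)s %(message)s")
)
handler.addFilter(CorrelationFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def start_request() -> str:
    # Called once at the edge, when a chatbot query arrives.
    new_id = str(uuid.uuid4())[:8]
    correlation_id.set(new_id)
    return new_id


start_request()
logger.info("query received")  # logs as: [<id>] INFO query received
```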

Using Correlation IDs
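The same ID must travel on every outbound call so all five services log under one identifier. A sketch; `X-Correlation-ID` is a common convention, not a standard header:

```python
import uuid


def get_or_create_correlation_id(headers: dict) -> str:
    # Reuse an inbound ID if the caller supplied one; otherwise mint a
    # new one so the trace starts here.
    return headers.get("X-Correlation-ID") or str(uuid.uuid4())


def outbound_headers(correlation_id: str) -> dict:
    # Attach the ID to every downstream HTTP request.
    return {
        "X-Correlation-ID": correlation_id,
        "Content-Type": "application/json",
    }


cid = get_or_create_correlation_id({})
print(outbound_headers(cid))
```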

Log Output with Correlation IDs
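As an illustration, a degraded query under this convention might log something like this (format and values are made up):

```
[abc-123] INFO chatbot: query received: "What's my revenue today?"
[abc-123] INFO chatbot: calling auth-service
[abc-123] INFO pos-core: orders fetched in 120ms
[abc-123] ERROR payment: timeout after 2000ms
[abc-123] INFO chatbot: responded with partial data in 2140ms
```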

Now I can grep logs for [abc-123] to see the entire request lifecycle across all services.

Error Handling in Orchestration

Orchestration error handling is critical. I learned this the hard way when the Payment Service went down and took the entire Chatbot offline.

Pattern 1: Partial Success

Return partial results if some services fail:
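A sketch of the pattern: answer with whatever succeeded and flag what didn't, instead of failing the whole query. The stubs simulate POS Core succeeding while Payment times out:

```python
import asyncio


async def fetch_orders() -> list:
    return [{"id": 1, "total": 40.0}]


async def fetch_payments() -> dict:
    raise TimeoutError("payment service timed out")


async def answer() -> dict:
    orders, payments = await asyncio.gather(
        fetch_orders(), fetch_payments(), return_exceptions=True
    )
    response = {"partial": False, "failed_services": []}
    if isinstance(orders, Exception):
        response["partial"] = True
        response["failed_services"].append("pos_core")
    else:
        response["order_count"] = len(orders)
    if isinstance(payments, Exception):
        # Answer with what we have instead of failing the whole query.
        response["partial"] = True
        response["failed_services"].append("payment")
    else:
        response["revenue"] = payments["total"]
    return response


print(asyncio.run(answer()))
# {'partial': True, 'failed_services': ['payment'], 'order_count': 1}
```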

Pattern 2: Fallback Values

Use cached or default values when services fail:
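A sketch of the fallback path: serve the last-known-good value from a cache and flag it as stale. The cache here is a plain in-process dict; in production this would more likely be Redis or similar:

```python
import asyncio
import time

_cache: dict = {}  # last-known-good values, keyed by query


async def fetch_inventory_value() -> float:
    raise ConnectionError("inventory service down")


async def inventory_value_with_fallback() -> dict:
    try:
        value = await fetch_inventory_value()
        _cache["inventory_value"] = (value, time.time())
        return {"value": value, "stale": False}
    except Exception:
        cached = _cache.get("inventory_value")
        if cached is not None:
            value, _ = cached
            # Serve the stale value and flag it, rather than erroring out.
            return {"value": value, "stale": True}
        return {"value": 0.0, "stale": True}  # last-resort default


# Seed the cache as if an earlier request had succeeded.
_cache["inventory_value"] = (12_500.0, time.time())
print(asyncio.run(inventory_value_with_fallback()))
# {'value': 12500.0, 'stale': True}
```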

Pattern 3: Timeout Configuration

Different services have different SLAs. Configure timeouts accordingly:
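A sketch using `asyncio.wait_for` with a per-service budget. The timeout values are illustrative, not my real production numbers:

```python
import asyncio

# Per-service timeouts (seconds), tuned to each service's SLA.
TIMEOUTS = {
    "auth": 0.5,        # fast token check
    "pos_core": 2.0,    # queries the orders table
    "payment": 2.0,
    "inventory": 3.0,   # heaviest aggregation
    "restaurant": 1.0,
}


async def call_service(name: str, coro):
    # wait_for cancels the call once the service's budget is spent.
    return await asyncio.wait_for(coro, timeout=TIMEOUTS[name])


async def hung_service():
    await asyncio.sleep(10)  # simulates a service that never answers


async def main() -> str:
    try:
        await call_service("auth", hung_service())
        return "ok"
    except asyncio.TimeoutError:
        return "auth timed out after 0.5s"


print(asyncio.run(main()))  # auth timed out after 0.5s
```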

Production Lessons Learned

Lesson 1: Cascade Failures

When the Inventory Service slowed down (500ms → 3s), the Chatbot's requests started timing out and backing up behind it, until the Chatbot itself crashed.

Solution: Implement circuit breaker (covered in next article) and aggressive timeouts:
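The idea, sketched with made-up budgets: give every dependency a hard per-call limit well under the overall response budget, and degrade instead of queueing behind a slow service:

```python
import asyncio

OVERALL_BUDGET = 1.0     # the chatbot answers within 1s or degrades
PER_CALL_TIMEOUT = 0.5   # no single dependency may eat the whole budget


async def slow_inventory() -> dict:
    await asyncio.sleep(3.0)  # the degraded behaviour from the incident
    return {"low_stock": []}


async def guarded(coro):
    try:
        return await asyncio.wait_for(coro, timeout=PER_CALL_TIMEOUT)
    except asyncio.TimeoutError:
        return None  # degrade instead of waiting on the slow service


async def answer() -> dict:
    inventory = await asyncio.wait_for(
        guarded(slow_inventory()), timeout=OVERALL_BUDGET
    )
    return {"inventory": inventory, "degraded": inventory is None}


print(asyncio.run(answer()))  # {'inventory': None, 'degraded': True}
```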

Lesson 2: Thundering Herd

During a busy lunch rush, 100 simultaneous chatbot queries hit all five services at once, overwhelming them.

Solution: Rate limiting and request coalescing:
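A sketch of request coalescing: if 100 identical queries arrive at once, only the first actually hits the downstream services; the rest await the same in-flight result. The class name and API here are mine, not from a library:

```python
import asyncio


class Coalescer:
    """Collapse concurrent identical requests into one downstream call."""

    def __init__(self):
        self._inflight: dict = {}  # query key -> Future for the result

    async def get(self, key: str, fetch):
        if key in self._inflight:
            # Someone already asked for this; wait for their result.
            return await self._inflight[key]
        future = asyncio.get_running_loop().create_future()
        self._inflight[key] = future
        try:
            result = await fetch()
            future.set_result(result)
            return result
        except Exception as exc:
            future.set_exception(exc)
            raise
        finally:
            del self._inflight[key]


calls = 0


async def fetch_revenue() -> float:
    global calls
    calls += 1
    await asyncio.sleep(0.05)  # simulated downstream latency
    return 1234.0


async def main():
    coalescer = Coalescer()
    # 100 simultaneous identical queries, as in the lunch-rush incident.
    return await asyncio.gather(
        *(coalescer.get("revenue:today", fetch_revenue) for _ in range(100))
    )


results = asyncio.run(main())
print(calls, results[0])  # one downstream call served all 100 queries
```

For plain rate limiting, an `asyncio.Semaphore` around each downstream call caps concurrency; coalescing goes further by collapsing identical in-flight requests into one.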

Lesson 3: Debugging Distributed Failures

A user reported "Chatbot not working." I had no idea which service failed or why.

Solution: Structured logging with correlation IDs (shown earlier) and response metadata:
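A sketch of response metadata: time every service call and attach per-service status to the answer, so "Chatbot not working" reports name the failing dependency. The `_meta` shape is my own convention:

```python
import asyncio
import time


async def call(name: str, coro):
    # Wrap every downstream call: capture its duration and any exception.
    start = time.perf_counter()
    try:
        result = await coro
        return name, result, time.perf_counter() - start
    except Exception as exc:
        return name, exc, time.perf_counter() - start


async def orders():
    await asyncio.sleep(0.01)
    return [1, 2, 3]


async def payments():
    raise TimeoutError("timed out")


async def answer(correlation_id: str) -> dict:
    results = await asyncio.gather(
        call("pos_core", orders()), call("payment", payments())
    )
    # _meta records, per query, which services failed and how long each took.
    return {
        "answer": "partial",
        "_meta": {
            "correlation_id": correlation_id,
            "services": {
                name: {
                    "ok": not isinstance(result, Exception),
                    "duration_ms": round(elapsed * 1000, 1),
                }
                for name, result, elapsed in results
            },
        },
    }


print(asyncio.run(answer("abc-123")))
```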

Best Practices

  1. Use correlation IDs for distributed tracing across services

  2. Execute in parallel when dependencies allow (use asyncio.gather())

  3. Configure timeouts per service based on SLAs

  4. Handle partial failures gracefully (don't fail entire response if one service is down)

  5. Implement circuit breakers to prevent cascade failures

  6. Cache aggressively (especially for expensive aggregations)

  7. Monitor orchestrator performance (response time, failure rate, timeout rate)

  8. Log service call duration to identify slow dependencies

  9. Use request coalescing to prevent thundering herd

  10. Prefer orchestration for synchronous workflows (user-facing queries)

  11. Prefer choreography for asynchronous workflows (background processing)

Next Steps

The Chatbot Orchestrator improved performance 6x, but it introduced new failure modes. When the Payment Service fails, should the Chatbot fail too? How do we handle transient errors vs permanent failures?

In the next article, Resilience & Fault Tolerance, we'll explore circuit breakers, retry strategies, and graceful degradation—the patterns that keep distributed systems running when things go wrong (and they will).


This is part of the Software Architecture 101 series, where I share lessons learned building a production multi-tenant POS system with 6 microservices.
