Integration Patterns & Orchestration
Introduction
The Chatbot Service (port 4006) in my POS system has one job: answer user questions by aggregating data from 5 other services. Sounds simple, right? It wasn't.
A question like "What's my revenue today?" requires:
Auth Service: Verify user and get permissions
POS Core: Get all orders for the date
Payment: Get payment totals and methods
Inventory: Get product details and costs
Restaurant: Get table assignments and covers
When I first implemented this, I made sequential HTTP calls, waiting for each response before making the next request. A single chatbot query took 2.3 seconds. Users thought the system was broken.
The solution was orchestration: one service (the orchestrator) coordinates multiple services, making parallel calls when possible, and aggregating results. This pattern reduced response times from 2.3s to 380ms—a 6x improvement.
In this article, I'll share how I built the Chatbot Orchestrator, including parallel execution with asyncio, correlation IDs for distributed tracing, and the hard lessons I learned about orchestration vs choreography.
The Chatbot Integration Challenge
The Chatbot Service needs to answer business questions that span multiple domains:
Revenue Questions:
"What's my revenue today?" → POS Core + Payment
"What's my best-selling product?" → POS Core + Inventory + Payment
"What payment methods are customers using?" → Payment + POS Core
Inventory Questions:
"What products are low in stock?" → Inventory
"What's my inventory value?" → Inventory + Product costs
Operational Questions:
"How many tables are occupied?" → Restaurant
"What orders are pending?" → POS Core + Restaurant
Each question requires data from 2-5 services. The challenge: how do we coordinate these calls efficiently, reliably, and with proper error handling?
Orchestration vs Choreography
There are two patterns for integrating distributed services:
Orchestration (Centralized)
One service (orchestrator) explicitly controls the workflow:
Pros:
Clear workflow visibility (all logic in one place)
Easy to debug (centralized logs)
Simple error handling and compensation
Cons:
Single point of failure
Can become a bottleneck
Tight coupling to downstream services
Choreography (Event-Driven)
Services react to events without central coordination:
Pros:
Decoupled services
No single point of failure
Scales independently
Cons:
Workflow is implicit (harder to understand)
Difficult to debug (distributed logs)
Complex error handling (compensating transactions)
My decision: Use orchestration for the Chatbot because:
Chatbot queries need synchronous responses (users wait for answers)
Workflow is complex (5 services, conditional logic)
I need centralized error handling and retry logic
I use choreography for background workflows like order processing (covered in the Event-Driven Architecture article).
Implementing the Orchestrator Pattern
Here's the Chatbot Orchestrator that coordinates calls to 5 services:
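A minimal sketch of the shape (class name, stubbed clients, and payloads are illustrative, not the production code; real clients would make HTTP calls to Auth, POS Core, Payment, Inventory, and Restaurant):

```python
import asyncio
import uuid

# Stubbed service clients -- in production these are HTTP calls to the
# downstream services (names and payloads here are invented).
async def verify_user(token: str) -> dict:
    return {"user_id": 7}

async def fetch_orders(date: str) -> dict:
    return {"count": 42}

async def fetch_payments(date: str) -> dict:
    return {"total": 1280.50}

class ChatbotOrchestrator:
    async def answer_revenue_query(self, token: str, date: str) -> dict:
        correlation_id = str(uuid.uuid4())
        # Step 1: auth gates everything else, so it runs alone first.
        auth = await verify_user(token)
        # Step 2: the independent data fetches fan out in parallel.
        orders, payments = await asyncio.gather(
            fetch_orders(date),
            fetch_payments(date),
        )
        return {
            "correlation_id": correlation_id,
            "user_id": auth["user_id"],
            "order_count": orders["count"],
            "revenue": payments["total"],
        }
```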
This orchestrator:
Coordinates 5 services
Executes parallel calls when dependencies allow
Handles errors gracefully
Tracks requests with correlation IDs
Aggregates data from multiple sources
Parallel Service Calls with asyncio
The key to fast orchestration is parallel execution. Python's asyncio.gather() makes this easy:
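A minimal illustration of the payoff (service names invented; `asyncio.sleep` stands in for the HTTP round trip):

```python
import asyncio
import time

async def call_service(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for an HTTP round trip
    return f"{name}:ok"

async def query_all() -> list:
    # All three calls start at once, so total latency tracks the
    # slowest single call, not the sum of all three.
    return await asyncio.gather(
        call_service("pos_core", 0.1),
        call_service("payment", 0.1),
        call_service("inventory", 0.1),
    )

start = time.perf_counter()
results = asyncio.run(query_all())
elapsed = time.perf_counter() - start  # ~0.1s, not ~0.3s
```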
Handling Dependencies
Some calls depend on others. Use sequential then parallel:
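A sketch of that shape with stubbed calls: the auth result feeds the later fetches, so auth is sequential and the rest run in parallel.

```python
import asyncio

async def verify_user(token: str) -> dict:
    return {"user_id": 7}

async def fetch_orders(user_id: int) -> list:
    return [{"id": 1, "total": 30.0}]

async def fetch_payments(user_id: int) -> list:
    return [{"method": "card", "amount": 30.0}]

async def handle_query(token: str) -> dict:
    # Auth must finish first: every other call needs the user_id.
    auth = await verify_user(token)
    # The remaining fetches are independent, so they fan out together.
    orders, payments = await asyncio.gather(
        fetch_orders(auth["user_id"]),
        fetch_payments(auth["user_id"]),
    )
    return {"orders": orders, "payments": payments}
```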
Error Handling with gather()
Use return_exceptions=True to handle errors without stopping other tasks:
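A small demonstration (stubbed services): the failing call comes back as an exception object in the results list instead of cancelling its siblings.

```python
import asyncio

async def fetch_payments() -> str:
    return "payments:ok"

async def fetch_inventory() -> str:
    raise ConnectionError("inventory unreachable")

async def gather_safely():
    results = await asyncio.gather(
        fetch_payments(), fetch_inventory(), return_exceptions=True
    )
    # Split successes from failures; neither task cancelled the other.
    data = [r for r in results if not isinstance(r, BaseException)]
    errors = [r for r in results if isinstance(r, BaseException)]
    return data, errors
```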
Correlation IDs for Request Tracing
When a chatbot query touches 5 services, debugging failures is hard. Correlation IDs solve this:
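One way to do it, assuming the conventional `X-Correlation-ID` header (the header name and helper are illustrative):

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # conventional name, assumed here

def get_or_create_correlation_id(headers: dict) -> str:
    # Reuse the caller's ID if one arrived, so traces join up across
    # hops; otherwise mint a fresh one at the edge.
    return headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
```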
Using Correlation IDs
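The ID then rides along on every downstream call and every log line (helper names invented):

```python
import logging

logger = logging.getLogger("chatbot")

def outgoing_headers(correlation_id: str, token: str) -> dict:
    # Every downstream request carries the same ID, so one grep
    # reconstructs the whole fan-out across services.
    return {
        "X-Correlation-ID": correlation_id,
        "Authorization": f"Bearer {token}",
    }

def log_call(correlation_id: str, service: str, message: str) -> str:
    line = f"[{correlation_id}] {service}: {message}"
    logger.info(line)
    return line
```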
Log Output with Correlation IDs
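Illustrative of what the aggregated logs look like (timestamps and lines invented):

```
10:42:01 [abc-123] chatbot: query received "What's my revenue today?"
10:42:01 [abc-123] auth: token verified (user 7)
10:42:01 [abc-123] pos-core: GET /orders?date=2024-01-15 (112ms)
10:42:01 [abc-123] payment: GET /totals?date=2024-01-15 (98ms)
10:42:02 [abc-123] chatbot: response sent (380ms total)
```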
Now I can grep logs for [abc-123] to see the entire request lifecycle across all services.
Error Handling in Orchestration
Orchestration error handling is critical. I learned this the hard way when the Payment Service went down and took the entire Chatbot offline.
Pattern 1: Partial Success
Return partial results if some services fail:
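A sketch of the pattern (stubbed services): the response carries whatever succeeded, plus a flag and warnings for what didn't.

```python
import asyncio

async def fetch_revenue() -> float:
    return 1280.50

async def fetch_low_stock() -> list:
    raise TimeoutError("inventory timed out")

async def answer() -> dict:
    revenue, stock = await asyncio.gather(
        fetch_revenue(), fetch_low_stock(), return_exceptions=True
    )
    response = {"partial": False, "warnings": []}
    if isinstance(revenue, BaseException):
        response["partial"] = True
        response["warnings"].append("revenue unavailable")
    else:
        response["revenue"] = revenue
    if isinstance(stock, BaseException):
        # Degrade: answer what we can and say what's missing.
        response["partial"] = True
        response["warnings"].append("inventory unavailable")
    else:
        response["low_stock"] = stock
    return response
```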
Pattern 2: Fallback Values
Use cached or default values when services fail:
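A minimal version using a last-known-good cache (the cache and its contents are illustrative):

```python
import asyncio

# Last known good values, refreshed on every successful call.
_cache = {"low_stock": ["espresso beans"]}

async def fetch_low_stock() -> list:
    raise ConnectionError("inventory service down")

async def low_stock_with_fallback() -> dict:
    try:
        items = await fetch_low_stock()
        _cache["low_stock"] = items
        return {"items": items, "stale": False}
    except Exception:
        # Serve the cached value, flagged as stale, instead of erroring.
        return {"items": _cache["low_stock"], "stale": True}
```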
Pattern 3: Timeout Configuration
Different services have different SLAs. Configure timeouts accordingly:
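One way to express that as a per-service table wrapped around `asyncio.wait_for` (the numbers are illustrative, not the production values):

```python
import asyncio

# Per-service timeouts in seconds -- tune each to the service's
# observed latency rather than using one global default.
SERVICE_TIMEOUTS = {
    "auth": 0.5,       # fast, on the critical path
    "pos_core": 2.0,   # heavier date-range queries
    "payment": 1.0,
    "inventory": 1.5,
    "restaurant": 1.0,
}

async def call_with_timeout(service: str, coro):
    return await asyncio.wait_for(coro, timeout=SERVICE_TIMEOUTS[service])

async def _fast_call() -> str:
    return "ok"

result = asyncio.run(call_with_timeout("auth", _fast_call()))
```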
Production Lessons Learned
Lesson 1: Cascade Failures
When the Inventory Service slowed down (500ms → 3s), the Chatbot started timing out, backing up requests, and eventually crashing.
Solution: Implement circuit breaker (covered in next article) and aggressive timeouts:
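The timeout half of that fix looks like this (stubbed service; the 3s delay mirrors the incident, the 1s cutoff is illustrative):

```python
import asyncio

async def slow_inventory() -> dict:
    await asyncio.sleep(3)  # degraded: normal 500ms calls now take 3s+
    return {"stock": []}

async def query_inventory() -> dict:
    try:
        # Fail fast at 1s so requests stop queueing behind a sick service.
        return await asyncio.wait_for(slow_inventory(), timeout=1.0)
    except asyncio.TimeoutError:
        return {"stock": None, "error": "inventory timeout"}
```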
Lesson 2: Thundering Herd
During a popular lunch rush, 100 simultaneous chatbot queries hit all services at once, overwhelming them.
Solution: Rate limiting and request coalescing:
Lesson 3: Debugging Distributed Failures
A user reported "Chatbot not working." I had no idea which service failed or why.
Solution: Structured logging with correlation IDs (shown earlier) and response metadata:
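The metadata part can look like this (field names invented): every response carries which services were called, whether each succeeded, and how long it took.

```python
import asyncio
import time

async def fetch_orders() -> list:
    return [{"id": 1}]

async def answer_with_metadata(correlation_id: str) -> dict:
    meta = {"correlation_id": correlation_id, "services": {}}
    start = time.perf_counter()
    try:
        orders = await fetch_orders()
        meta["services"]["pos_core"] = {"ok": True}
    except Exception as exc:
        orders = []
        meta["services"]["pos_core"] = {"ok": False, "error": str(exc)}
    meta["services"]["pos_core"]["duration_ms"] = round(
        (time.perf_counter() - start) * 1000, 1
    )
    # Shipping this with the response means a "chatbot not working"
    # report arrives with which service failed and how long each took.
    return {"orders": orders, "_meta": meta}
```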
Best Practices
Use correlation IDs for distributed tracing across services
Execute in parallel when dependencies allow (use asyncio.gather())
Configure timeouts per service based on SLAs
Handle partial failures gracefully (don't fail entire response if one service is down)
Implement circuit breakers to prevent cascade failures
Cache aggressively (especially for expensive aggregations)
Monitor orchestrator performance (response time, failure rate, timeout rate)
Log service call duration to identify slow dependencies
Use request coalescing to prevent thundering herd
Prefer orchestration for synchronous workflows (user-facing queries)
Prefer choreography for asynchronous workflows (background processing)
Next Steps
The Chatbot Orchestrator improved performance 6x, but it introduced new failure modes. When the Payment Service fails, should the Chatbot fail too? How do we handle transient errors vs permanent failures?
In the next article, Resilience & Fault Tolerance, we'll explore circuit breakers, retry strategies, and graceful degradation—the patterns that keep distributed systems running when things go wrong (and they will).
This is part of the Software Architecture 101 series, where I share lessons learned building a production multi-tenant POS system with 6 microservices.