Part 5: Production Patterns

Part of the Multi Agent Orchestration 101 Series

The Gap Between "It Works" and "It Works in Production"

The first version of my multi-agent system worked great in development. It fell apart in production within a day. The failures weren't dramatic (no crashes, no errors), just silent misbehaviour: agents looping indefinitely, costs spiking without warning, no way to trace why an agent made a particular decision.

This part covers the things I had to add to make the system trustworthy. None of it is glamorous, but all of it is necessary.


1. Structured Logging with Trace IDs

The biggest debugging problem with multi-agent systems is attribution: when something goes wrong, which agent caused it? A trace ID threads through every log line in a single request, making it possible to reconstruct the full chain of agent decisions.

# tracing.py
from __future__ import annotations
import logging
import uuid
from contextvars import ContextVar

trace_id: ContextVar[str] = ContextVar("trace_id", default="no-trace")


def new_trace() -> str:
    tid = str(uuid.uuid4())[:8]
    trace_id.set(tid)
    return tid


class TraceFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id.get()
        return True


def configure_logging() -> None:
    handler = logging.StreamHandler()
    handler.addFilter(TraceFilter())
    formatter = logging.Formatter(
        "%(asctime)s [%(trace_id)s] %(levelname)s %(name)s - %(message)s"
    )
    handler.setFormatter(formatter)
    logging.basicConfig(level=logging.INFO, handlers=[handler])

Add logging calls to the AgentV2.run() loop:

At the entry point, call new_trace() once per request:
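A stripped-down sketch of both steps. The real run() loop lives in Part 2, and the names AgentV2, max_iter, and handle_request are assumptions standing in for the series' actual code; new_trace() and trace_id are inlined from tracing.py above so the sketch runs on its own.

```python
# Sketch: where the trace plumbing and log calls go in the agent loop.
# AgentV2/run() are stand-ins for Part 2's real class.
import logging
import uuid
from contextvars import ContextVar

trace_id: ContextVar[str] = ContextVar("trace_id", default="no-trace")
log = logging.getLogger("agent")


def new_trace() -> str:
    tid = str(uuid.uuid4())[:8]
    trace_id.set(tid)
    return tid


class AgentV2:
    def __init__(self, name: str, max_iter: int = 5) -> None:
        self.name = name
        self.max_iter = max_iter

    def run(self, task: str) -> str:
        log.info("%s starting task: %s", self.name, task)
        for i in range(self.max_iter):
            log.info("%s iteration %d", self.name, i + 1)
            # ... _think() and tool dispatch would go here (Part 2) ...
            return f"done: {task}"  # placeholder result for the sketch
        log.warning("%s hit iteration cap", self.name)
        return "iteration cap reached"


def handle_request(task: str) -> str:
    tid = new_trace()  # one trace per request; agents inherit it
    log.info("request %s accepted", tid)
    return AgentV2("supervisor").run(task)
```

Because trace_id is a ContextVar, every agent created inside handle_request (including async tasks spawned from it) reads the same ID without it being passed explicitly.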

Now every log line from every agent in a request shares the same 8-character trace ID. When a request goes wrong, grep 'a3f2b1c9' app.log shows the complete picture.


2. Token Budget and Cost Tracking

I had a billing surprise in the first week of running my agent system. The supervisor was occasionally getting into a reasoning loop, making 15–20 LLM calls per request instead of the expected 3–5. At GPT-4o pricing, that adds up fast.

Two mitigations: a hard iteration cap (already in Part 1's loop), and per-request token tracking.

Pass a shared CostTracker to all agents and record after every API call:

Log the summary at the end of every request:

Add a hard limit if you want cost-based circuit breaking:
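The three steps above can be sketched as one class. The per-million-token rates and the $0.50 default cap below are illustrative placeholders, not real pricing; record() is meant to be fed the usage counts each API response reports.

```python
# Sketch: shared per-request cost accounting with a hard circuit breaker.
from dataclasses import dataclass

# (input, output) USD per 1M tokens - assumed example rates, not real pricing
PRICE_PER_1M = {"gpt-4o": (2.50, 10.00)}


class BudgetExceeded(RuntimeError):
    pass


@dataclass
class CostTracker:
    max_cost_usd: float = 0.50  # assumed per-request ceiling
    calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0

    def record(self, model: str, input_tokens: int, output_tokens: int) -> None:
        """Call after every API response, using its reported usage counts."""
        in_rate, out_rate = PRICE_PER_1M.get(model, (0.0, 0.0))
        self.calls += 1
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        self.cost_usd += (input_tokens * in_rate
                          + output_tokens * out_rate) / 1_000_000
        if self.cost_usd >= self.max_cost_usd:
            raise BudgetExceeded(f"request cost ${self.cost_usd:.4f} hit the cap")

    def summary(self) -> str:
        """One line for the end-of-request INFO log."""
        return (f"{self.calls} calls, {self.input_tokens} in / "
                f"{self.output_tokens} out tokens, ${self.cost_usd:.4f}")
```

One instance is created per request and handed to every agent; the entry point logs tracker.summary() at INFO once the request finishes.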


3. Rate Limiting

Both OpenAI and Anthropic have per-minute and per-day token limits. When you run parallel agents (Part 3), you can hit rate limits quickly.

A simple token bucket limiter:

Use it in _think() before every API call:
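A sketch of both pieces, assuming asyncio as in Part 3. call_model() stands in for AgentV2._think(); the real method would await bucket.acquire() and then call the OpenAI or Anthropic client, and would size acquire() by estimated tokens rather than a flat 1.

```python
# Sketch: a minimal token bucket, plus gating a model call behind it.
import asyncio
import time


class TokenBucket:
    """`capacity` tokens refill at `rate` per second; acquire(n) waits for n."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # burst size
        self.tokens = capacity
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self, n: float = 1.0) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= n:
                    self.tokens -= n
                    return
                await asyncio.sleep((n - self.tokens) / self.rate)


bucket = TokenBucket(rate=100.0, capacity=10.0)  # tune to your API tier


async def call_model(prompt: str) -> str:
    await bucket.acquire(1.0)  # block here, not inside the API client
    # ... real code would call the LLM API; the sketch just echoes ...
    return f"response to: {prompt}"
```

The lock serialises waiters, so bursts drain the bucket first and later calls queue up at the refill rate instead of all firing at once.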

For production systems I recommend the limits library or slowapi for FastAPI, but this bucket is enough for a standalone agent process.


4. Graceful Degradation

When a worker agent fails, the supervisor shouldn't crash; it should handle the failure and either retry, use a fallback, or return a partial result.

In the supervisor's make_worker_tool, wrap the inner call:
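A sketch of that wrapper, assuming make_worker_tool takes the worker's name and an async run callable as in Part 3 (the exact signature there may differ). The timeout value and fallback strings are illustrative.

```python
# Sketch: wrap the worker call so failures and hangs become fallback strings
# the supervisor can reason about, not exceptions that kill the loop.
import asyncio


def make_worker_tool(worker_name: str, worker_run, timeout: float = 30.0):
    async def tool(task: str) -> str:
        try:
            return await asyncio.wait_for(worker_run(task), timeout=timeout)
        except asyncio.TimeoutError:
            return (f"[{worker_name} timed out after {timeout:.0f}s; "
                    f"proceed without it]")
        except Exception as exc:
            return f"[{worker_name} failed: {exc}; proceed without it]"
    return tool
```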

Now a crashing worker returns a fallback message to the supervisor, which can decide how to proceed (retry, skip, or escalate).


5. Health Checks and Readiness

If you expose your agent system as a service (FastAPI, for example), add a health check that verifies the LLM APIs are reachable:
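One framework-agnostic way to structure it: a check_health() coroutine that pings every provider concurrently and reports per-provider status; a FastAPI app would simply return this dict from a /healthz route. The pinger callables are assumptions; in real code each might make a cheap request such as listing available models.

```python
# Sketch: readiness check that pings each LLM provider with a short timeout.
# `pingers` maps provider name -> async callable that raises on failure.
import asyncio


async def check_health(pingers: dict, timeout: float = 5.0) -> dict:
    async def one(name, ping):
        try:
            await asyncio.wait_for(ping(), timeout=timeout)
            return name, "ok"
        except Exception as exc:  # timeouts, auth errors, network failures
            return name, f"error: {exc}"

    providers = dict(await asyncio.gather(*(one(n, p)
                                            for n, p in pingers.items())))
    return {"ready": all(v == "ok" for v in providers.values()),
            "providers": providers}
```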


6. The Mistakes I Made (So You Don't Have To)

Mistake 1: Trusting the agent to stop itself. Without a hard iteration cap, an agent can loop indefinitely. I now always enforce for _ in range(max_iter) and return a clear message when the cap is hit.

Mistake 2: Not trimming memory. After enough turns, the context window fills up and old conversation turns get silently truncated by the LLM. The agent starts forgetting earlier parts of the task. I trim memory proactively (see Part 1's _trim_memory pattern).

Mistake 3: Passing all tools to all agents. Early on I gave every agent every tool to "keep things flexible". The model would occasionally call the wrong tool for the wrong reason. Tight tool lists = more predictable behaviour. Workers should only have the tools they need.

Mistake 4: No timeout on worker calls. A shell command that hangs blocks the whole agent loop. Always wrap worker calls with asyncio.wait_for.

Mistake 5: Logging everything in development, nothing in production. I had verbose logging locally but stripped it all out for production to "keep things clean". Then I had a billing anomaly and no way to investigate. Log at INFO in production, DEBUG in development.


Putting It All Together

Here is the final entry point that applies all the patterns from this series:
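A sketch of the wiring order; the supervisor, CostTracker, and tracing helpers are the versions built across the series, stubbed here just enough to run standalone.

```python
# main.py sketch: trace ID, cost tracking, timeout, and graceful degradation
# wired into one request handler. run_supervisor stands in for Part 3/4's
# supervisor; CostTracker is a stub for section 2's full version.
import asyncio
import logging
import uuid
from contextvars import ContextVar

trace_id: ContextVar[str] = ContextVar("trace_id", default="no-trace")
log = logging.getLogger("app")


def configure_logging() -> None:
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(name)s] %(levelname)s %(message)s",
    )


class CostTracker:  # stub; full version in section 2
    def __init__(self) -> None:
        self.cost_usd = 0.0

    def summary(self) -> str:
        return f"${self.cost_usd:.4f}"


async def run_supervisor(task: str, tracker: CostTracker) -> str:
    # the real supervisor would consult workers and record costs here
    return f"answer for: {task}"


async def handle_request(task: str) -> str:
    tid = str(uuid.uuid4())[:8]     # section 1: one trace per request
    trace_id.set(tid)
    tracker = CostTracker()         # section 2: per-request accounting
    log.info("[%s] request started", tid)
    try:                            # sections 3-4: cap the whole request
        result = await asyncio.wait_for(run_supervisor(task, tracker),
                                        timeout=120)
    except asyncio.TimeoutError:    # degrade instead of crashing
        result = "request timed out; no result available"
    log.info("[%s] request complete: %s", tid, tracker.summary())
    return result


if __name__ == "__main__":
    configure_logging()
    print(asyncio.run(handle_request("summarise the quarterly report")))
```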


Key Takeaways

  • Trace IDs make multi-agent debugging tractable

  • Track tokens and cost per request; surprises are expensive

  • Rate limit proactively before hitting API caps

  • asyncio.wait_for and fallback strings prevent a single worker from blocking everything

  • Tight tool lists per agent = more predictable behaviour

  • Log at INFO in production, always


Series Complete

You now have a multi-agent system built from first principles:

Part      What You Built
Part 1    Agent loop + message bus in pure Python
Part 2    Tool dispatcher + short/long-term memory
Part 3    OpenAI function calling + supervisor pattern
Part 4    Claude tool use + extended thinking
Part 5    Tracing, cost tracking, rate limiting, resilience

The same patterns extend to more sophisticated frameworks (LangGraph, AutoGen, CrewAI). The difference is that now you understand what those frameworks are doing under the hood, which means you can debug them when they break.
