Part 5: Evaluating and Testing Your Agent
Part of the AI Agent Development 101 Series
The Hardest Question in Agent Development
After building my first production agent, a colleague asked: "How do you know it's working correctly?"
I didn't have a good answer. I had been running manual spot-checks: giving the agent a task, checking the answer, moving on. That's not a test suite. It's hope.
The problem is that agents are harder to test than normal software. There's no single "correct" output: the same goal can be achieved through different tool sequences. Nondeterminism means the same input can produce different outputs. And when you upgrade the underlying model, behaviour can change in subtle ways that a naive correctness check would miss.
This part covers the testing approaches I now use, built from painful experience.
What "Correct" Means for an Agent
Before writing any tests, you need to be precise about what you're measuring. I use three distinct quality dimensions:
1. Tool dispatch accuracy: did the agent call the right tool with valid arguments? This is the most deterministic dimension and the easiest to test.
2. Trajectory quality: did the agent take a sensible path to the answer? A correct answer reached via a bizarre path (10 tool calls when 2 would do) is a warning sign.
3. Answer quality: is the final answer correct? For factual tasks this is binary; for open-ended tasks it's a spectrum.
Most developers only measure answer quality and wonder why their agents fail mysteriously in production. Tool dispatch accuracy and trajectory quality are where the actual bugs live.
Part 1: Deterministic Tests for Tool Dispatch
These are the tests I write first for any new agent. They verify that, given a specific goal and observation, the agent calls the expected tool with valid arguments, without actually running the tool.
These tests mock the actual tool execution so they run fast and don't touch the filesystem. What they verify is the agent's decision-making, not the tools themselves.
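A minimal sketch of what such a test can look like. The `StubAgent` class and its `next_action(goal, observation)` interface are assumptions standing in for the agent built in earlier parts; swap in your real agent class. The key idea is that the tool implementations are `MagicMock` objects, so the test exercises only the dispatch decision:

```python
from unittest.mock import MagicMock

class StubAgent:
    """Minimal stand-in so this sketch runs; replace with your real agent.
    Takes an llm callable and a dict of tools; next_action returns the
    model's chosen tool call without executing it."""
    def __init__(self, llm, tools):
        self.llm, self.tools = llm, tools

    def next_action(self, goal, observation):
        reply = self.llm(goal, observation)          # model picks the tool call
        assert reply["tool"] in self.tools, f"unknown tool {reply['tool']!r}"
        return reply

def test_dispatches_read_file_for_file_question():
    # Canned model reply: we are testing dispatch plumbing, not the model itself
    llm = MagicMock(return_value={"tool": "read_file", "args": {"path": "notes.md"}})
    read_file = MagicMock()                          # mocked: never touches disk
    agent = StubAgent(llm, tools={"read_file": read_file,
                                  "web_search": MagicMock()})

    action = agent.next_action("Summarise notes.md", "(start)")

    assert action["tool"] == "read_file"             # right tool chosen
    assert action["args"]["path"] == "notes.md"      # valid arguments
    read_file.assert_not_called()                    # decision only, no execution

test_dispatches_read_file_for_file_question()
```

In a real suite you would feed the recorded model reply through your actual agent's dispatch path rather than a stub, but the shape of the assertions stays the same.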
Part 2: Trajectory Evaluation
A trajectory is the full sequence of (Thought, Action, Observation) steps the agent took. Trajectory evaluation checks whether that sequence was sensible β not just whether the final answer was correct.
I maintain a small set of "golden trajectories" for known tasks. Each golden trajectory specifies:
Which tools must be called (required tools)
Which tools must not be called (forbidden tools)
Maximum number of steps allowed
Whether the agent finished (vs hitting max_steps)
Example test using the evaluator:
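Here is a minimal sketch of such an evaluator and a test against it. The `GoldenTrajectory` fields mirror the four criteria above; the `evaluate_trajectory` signature, the `(thought, action, observation)` step shape, and the `finish` pseudo-tool marking completion are all assumptions, so adapt them to your agent's trace format:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenTrajectory:
    required_tools: set                            # tools that must be called
    forbidden_tools: set = field(default_factory=set)
    max_steps: int = 10
    must_finish: bool = True                       # end via "finish", not max_steps

def evaluate_trajectory(steps, golden):
    """steps: list of (thought, action, observation) tuples,
    where action is {"tool": ..., "args": ...}. Returns a list of
    failure descriptions; an empty list means the trajectory passes."""
    called = [action["tool"] for _, action, _ in steps]
    failures = []
    missing = golden.required_tools - set(called)
    if missing:
        failures.append(f"required tools never called: {sorted(missing)}")
    banned = golden.forbidden_tools & set(called)
    if banned:
        failures.append(f"forbidden tools called: {sorted(banned)}")
    if len(steps) > golden.max_steps:
        failures.append(f"{len(steps)} steps exceeds max of {golden.max_steps}")
    if golden.must_finish and (not called or called[-1] != "finish"):
        failures.append("agent never finished (hit max_steps?)")
    return failures

def test_summarise_notes_trajectory():
    golden = GoldenTrajectory(required_tools={"read_file"},
                              forbidden_tools={"web_search"},
                              max_steps=4)
    steps = [
        ("I should read the file",
         {"tool": "read_file", "args": {"path": "notes.md"}}, "...contents..."),
        ("I can answer now",
         {"tool": "finish", "args": {"answer": "A summary."}}, ""),
    ]
    assert evaluate_trajectory(steps, golden) == []

test_summarise_notes_trajectory()
```

Returning a list of failure strings, rather than a bare boolean, makes the test output immediately diagnosable when a trajectory drifts.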
Part 3: Loop Detection
One of the failure modes I mentioned in Part 1: the agent repeats the same action indefinitely. Here's a test that catches it:
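A sketch of such a check, assuming the same `{"tool": ..., "args": ...}` action shape as above. It flags three identical consecutive actions as a loop; the threshold of three is a judgment call, not a rule:

```python
import json

def has_action_loop(actions, threshold=3):
    """True if the identical (tool, args) pair occurs `threshold`
    or more times in a row. Args are serialised with sorted keys so
    dict ordering differences don't mask a repeat."""
    streak, prev = 0, None
    for action in actions:
        key = (action["tool"], json.dumps(action["args"], sort_keys=True))
        streak = streak + 1 if key == prev else 1
        if streak >= threshold:
            return True
        prev = key
    return False

def test_catches_repeated_search():
    looping = [{"tool": "web_search", "args": {"q": "python"}}] * 4
    assert has_action_loop(looping)
    # Same tool with varying arguments is legitimate refinement, not a loop
    varied = [{"tool": "web_search", "args": {"q": f"python {i}"}}
              for i in range(4)]
    assert not has_action_loop(varied)

test_catches_repeated_search()
```

The same function can also run inside the agent loop at each step, turning the test-time check into a runtime circuit breaker.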
Part 4: Regression Testing When Upgrading Models
This is the test I wish I had written earlier. When I upgraded from gpt-4o to gpt-4o-2024-11-20, three tasks that previously worked broke. I only found out because a user reported it.
The pattern: run a fixed set of representative tasks on both the old and new model, compare trajectories and answers.
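The pattern can be sketched as below. `run_agent` and `check` are placeholders for your own agent runner (with tools mocked) and your per-task pass/fail check; the toy stand-ins at the bottom exist only so the sketch runs end to end:

```python
def run_regression(tasks, run_agent, check, old_model, new_model):
    """Flag tasks that pass on old_model but fail on new_model.
    run_agent(task, model) -> (answer, trajectory)
    check(task, answer, trajectory) -> bool"""
    regressions = []
    for task in tasks:
        old_ok = check(task, *run_agent(task, old_model))
        new_ok = check(task, *run_agent(task, new_model))
        if old_ok and not new_ok:          # PASS -> FAIL: investigate before deploy
            regressions.append(task)
    return regressions

# Toy stand-ins: pretend the "new" model breaks task t2.
def fake_run_agent(task, model):
    answer = "wrong" if (model == "model-new" and task == "t2") else "right"
    return answer, []

def fake_check(task, answer, trajectory):
    return answer == "right"

assert run_regression(["t1", "t2", "t3"], fake_run_agent, fake_check,
                      "model-old", "model-new") == ["t2"]
```

A useful extension is to also report tasks whose trajectory length grew sharply even though the answer still passes, since that often signals degraded tool dispatch.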
I run this regression suite before deploying any model upgrade. It takes under 30 seconds because all tools are mocked. If any case changes from PASS to FAIL, I investigate before deploying.
Part 5: A Practical Testing Checklist
After building several agents, here is the checklist I go through before calling an agent production-ready:
Tool dispatch tests exist for every tool the agent can call, and they pass
Golden trajectories cover the core tasks, and the agent stays within their step budgets
Loop detection runs over every recorded trajectory
The regression suite passes against the currently deployed model
A round of manual spot-checks on open-ended tasks looks reasonable
The manual spot-checks remain important because automated tests can't catch everything, especially qualitative reasoning failures that produce a technically correct but poor answer.
Key Takeaways
Test tool dispatch first: it's the most deterministic layer and where most bugs live
Trajectory evaluation checks the path, not just the destination
Loop detection prevents a common silent failure mode
Always run a regression suite before upgrading the underlying model
The trace from Part 1 is your best debugging tool for all of the above
Series Complete
You've now built a complete, testable, single AI agent from first principles:
ReAct loop in pure Python, reasoning trace
Sliding context window, SQLite persistence, Chroma episodic memory
OpenAI function calling in the ReAct loop, structured output, streaming
Claude tool use, extended thinking, prompt caching
Tool dispatch tests, trajectory eval, loop detection, regression suite
When you're ready to have multiple agents work together, the Multi Agent Orchestration 101 series builds directly on these foundations.