Part 5: Evaluating and Testing Your Agent
Part of the AI Agent Development 101 Series
The Hardest Question in Agent Development
After building my first production agent, a colleague asked: "How do you know it's working correctly?"
I didn't have a good answer. I had been running manual spot-checks: giving the agent a task, checking the answer, moving on. That's not a test suite. It's hope.
The problem is that agents are harder to test than normal software. There's no single "correct" output: the same goal can be achieved through different tool sequences. Nondeterminism means the same input can produce different outputs. And when you upgrade the underlying model, behaviour can change in subtle ways that a naive correctness check would miss.
This part covers the testing approaches I now use, built from painful experience.
What "Correct" Means for an Agent
Before writing any tests, you need to be precise about what you're measuring. I use three distinct quality dimensions:
1. Tool dispatch accuracy: did the agent call the right tool with valid arguments? This is the most deterministic dimension and the easiest to test.
2. Trajectory quality: did the agent take a sensible path to the answer? A correct answer reached via a bizarre path (10 tool calls when 2 would do) is a warning sign.
3. Answer quality: is the final answer correct? For factual tasks this is binary; for open-ended tasks it's a spectrum.
Most developers only measure answer quality and wonder why their agents fail mysteriously in production. Tool dispatch accuracy and trajectory quality are where the actual bugs live.
Part 1: Deterministic Tests for Tool Dispatch
These are the tests I write first for any new agent. They verify that, given a specific goal and observation, the agent calls the expected tool with valid arguments, without actually running the tool.
These tests mock the actual tool execution so they run fast and don't touch the filesystem. What they verify is the agent's decision-making, not the tools themselves.
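A minimal sketch of what such a test can look like. The `StubAgent` class and its `next_action(goal, observation)` interface are assumptions standing in for the agent built in earlier parts; swap in your real agent class. The key idea is that the tool implementations are `MagicMock` objects, so the test exercises only the dispatch decision:

```python
from unittest.mock import MagicMock

class StubAgent:
    """Minimal stand-in so this sketch runs; replace with your real agent.
    Takes an llm callable and a dict of tools; next_action returns the
    model's chosen tool call without executing it."""
    def __init__(self, llm, tools):
        self.llm, self.tools = llm, tools

    def next_action(self, goal, observation):
        reply = self.llm(goal, observation)          # model picks the tool call
        assert reply["tool"] in self.tools, f"unknown tool {reply['tool']!r}"
        return reply

def test_dispatches_read_file_for_file_question():
    # Canned model reply: we are testing dispatch plumbing, not the model itself
    llm = MagicMock(return_value={"tool": "read_file", "args": {"path": "notes.md"}})
    read_file = MagicMock()                          # mocked: never touches disk
    agent = StubAgent(llm, tools={"read_file": read_file,
                                  "web_search": MagicMock()})

    action = agent.next_action("Summarise notes.md", "(start)")

    assert action["tool"] == "read_file"             # right tool chosen
    assert action["args"]["path"] == "notes.md"      # valid arguments
    read_file.assert_not_called()                    # decision only, no execution

test_dispatches_read_file_for_file_question()
```

In a real suite you would feed the recorded model reply through your actual agent's dispatch path rather than a stub, but the shape of the assertions stays the same.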
Part 2: Trajectory Evaluation
A trajectory is the full sequence of (Thought, Action, Observation) steps the agent took. Trajectory evaluation checks whether that sequence was sensible β not just whether the final answer was correct.
I maintain a small set of "golden trajectories" for known tasks. Each golden trajectory specifies:
Which tools must be called (required tools)
Which tools must not be called (forbidden tools)
Maximum number of steps allowed
Whether the agent finished (vs hitting max_steps)
Example test using the evaluator:
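Here is a minimal sketch of such an evaluator and a test against it. The `GoldenTrajectory` fields mirror the four criteria above; the `evaluate_trajectory` signature, the `(thought, action, observation)` step shape, and the `finish` pseudo-tool marking completion are all assumptions, so adapt them to your agent's trace format:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenTrajectory:
    required_tools: set                            # tools that must be called
    forbidden_tools: set = field(default_factory=set)
    max_steps: int = 10
    must_finish: bool = True                       # end via "finish", not max_steps

def evaluate_trajectory(steps, golden):
    """steps: list of (thought, action, observation) tuples,
    where action is {"tool": ..., "args": ...}. Returns a list of
    failure descriptions; an empty list means the trajectory passes."""
    called = [action["tool"] for _, action, _ in steps]
    failures = []
    missing = golden.required_tools - set(called)
    if missing:
        failures.append(f"required tools never called: {sorted(missing)}")
    banned = golden.forbidden_tools & set(called)
    if banned:
        failures.append(f"forbidden tools called: {sorted(banned)}")
    if len(steps) > golden.max_steps:
        failures.append(f"{len(steps)} steps exceeds max of {golden.max_steps}")
    if golden.must_finish and (not called or called[-1] != "finish"):
        failures.append("agent never finished (hit max_steps?)")
    return failures

def test_summarise_notes_trajectory():
    golden = GoldenTrajectory(required_tools={"read_file"},
                              forbidden_tools={"web_search"},
                              max_steps=4)
    steps = [
        ("I should read the file",
         {"tool": "read_file", "args": {"path": "notes.md"}}, "...contents..."),
        ("I can answer now",
         {"tool": "finish", "args": {"answer": "A summary."}}, ""),
    ]
    assert evaluate_trajectory(steps, golden) == []

test_summarise_notes_trajectory()
```

Returning a list of failure strings, rather than a bare boolean, makes the test output immediately diagnosable when a trajectory drifts.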
Part 3: Loop Detection
One of the failure modes I mentioned in Part 1: the agent repeats the same action indefinitely. Here's a test that catches it:
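A sketch of such a check, assuming the same `{"tool": ..., "args": ...}` action shape as above. It flags three identical consecutive actions as a loop; the threshold of three is a judgment call, not a rule:

```python
import json

def has_action_loop(actions, threshold=3):
    """True if the identical (tool, args) pair occurs `threshold`
    or more times in a row. Args are serialised with sorted keys so
    dict ordering differences don't mask a repeat."""
    streak, prev = 0, None
    for action in actions:
        key = (action["tool"], json.dumps(action["args"], sort_keys=True))
        streak = streak + 1 if key == prev else 1
        if streak >= threshold:
            return True
        prev = key
    return False

def test_catches_repeated_search():
    looping = [{"tool": "web_search", "args": {"q": "python"}}] * 4
    assert has_action_loop(looping)
    # Same tool with varying arguments is legitimate refinement, not a loop
    varied = [{"tool": "web_search", "args": {"q": f"python {i}"}}
              for i in range(4)]
    assert not has_action_loop(varied)

test_catches_repeated_search()
```

The same function can also run inside the agent loop at each step, turning the test-time check into a runtime circuit breaker.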
Part 4: Regression Testing When Upgrading Models
This is the test I wish I had written earlier. When I upgraded from gpt-4o to gpt-4o-2024-11-20, three tasks that previously worked broke. I only found out because a user reported it.
The pattern: run a fixed set of representative tasks on both the old and new model, compare trajectories and answers.
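The pattern can be sketched as below. `run_agent` and `check` are placeholders for your own agent runner (with tools mocked) and your per-task pass/fail check; the toy stand-ins at the bottom exist only so the sketch runs end to end:

```python
def run_regression(tasks, run_agent, check, old_model, new_model):
    """Flag tasks that pass on old_model but fail on new_model.
    run_agent(task, model) -> (answer, trajectory)
    check(task, answer, trajectory) -> bool"""
    regressions = []
    for task in tasks:
        old_ok = check(task, *run_agent(task, old_model))
        new_ok = check(task, *run_agent(task, new_model))
        if old_ok and not new_ok:          # PASS -> FAIL: investigate before deploy
            regressions.append(task)
    return regressions

# Toy stand-ins: pretend the "new" model breaks task t2.
def fake_run_agent(task, model):
    answer = "wrong" if (model == "model-new" and task == "t2") else "right"
    return answer, []

def fake_check(task, answer, trajectory):
    return answer == "right"

assert run_regression(["t1", "t2", "t3"], fake_run_agent, fake_check,
                      "model-old", "model-new") == ["t2"]
```

A useful extension is to also report tasks whose trajectory length grew sharply even though the answer still passes, since that often signals degraded tool dispatch.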
I run this regression suite before deploying any model upgrade. It takes under 30 seconds because all tools are mocked. If any case changes from PASS to FAIL, I investigate before deploying.
Part 5: A Practical Testing Checklist
After building several agents, here is the checklist I go through before calling an agent production-ready:
Tool dispatch tests exist for every tool the agent can call, and they pass
Golden trajectories cover the core tasks, and the agent stays within their step budgets
Loop detection runs over every recorded trajectory
The regression suite passes against the currently deployed model
A round of manual spot-checks on open-ended tasks looks reasonable
The manual spot-checks remain important because automated tests can't catch everything, especially qualitative reasoning failures that produce a technically correct but poor answer.
Key Takeaways
Test tool dispatch first: it's the most deterministic layer and where most bugs live
Trajectory evaluation checks the path, not just the destination
Loop detection prevents a common silent failure mode
Always run a regression suite before upgrading the underlying model
The trace from Part 1 is your best debugging tool for all of the above
Series Complete
You've now built a complete, testable, single AI agent from first principles:
ReAct loop in pure Python, reasoning trace
Sliding context window, SQLite persistence, Chroma episodic memory
OpenAI function calling in the ReAct loop, structured output, streaming
Claude tool use, extended thinking, prompt caching
Tool dispatch tests, trajectory eval, loop detection, regression suite
When you're ready to have multiple agents work together, the Multi Agent Orchestration 101 series builds directly on these foundations.