Part 3: Building an Agent with OpenAI

Part of the AI Agent Development 101 Series

From Rules to Reasoning

Parts 1 and 2 built the agent skeleton with a stub _think() method. Now it's time to replace that stub with a real LLM call. Using OpenAI's function calling API means the model doesn't just choose from a list of keywords — it reasons about the goal, writes a thought, and produces a structured tool call that our dispatcher can execute reliably.

The ReActAgent from Part 1 and the MemorySystem from Part 2 slot straight in. The only thing that changes is _think() and _decide().

Prerequisites

pip install openai>=1.14.0 aiosqlite chromadb sentence-transformers python-dotenv

export OPENAI_API_KEY="sk-..."

System Prompt Engineering for ReAct

The system prompt is the most important part of an OpenAI agent. A poorly written system prompt causes the model to:

Skip the thought step and jump straight to action
Invent tool names that don't exist
Call FINISH before verifying the result
Ignore observations and repeat the same action

Here is the prompt structure I settled on after many iterations:

# prompts.py

REACT_SYSTEM_PROMPT = """\
You are a ReAct agent. You solve tasks by interleaving thinking and acting.

## Format
You must respond in this exact format every turn:

Thought: <reason about your current state and what to do next>
Action: <one of the available tool calls, or FINISH>

## Rules
1. Always write a Thought before every Action.
2. In your Thought, start by summarising what you already know from past observations.
3. If the last observation was an error, diagnose it in your Thought before retrying.
4. Only call FINISH when you are confident the goal is complete.
5. FINISH format: FINISH: <your final answer to the user>
6. Never invent tool names. Only use the tools listed below.
7. Never repeat the same action twice with the same arguments — use a different approach.

## Available tools
{tool_descriptions}

## Goal
{goal}
"""

Two things I want to highlight:

Rule 2 (summarise past observations in the thought): This is the single change that most improved my agent's accuracy. Without it, on step 10 the model often "forgets" what step 3 returned and redoes the work.

Rule 7 (never repeat the same action): Without this, a failing tool call can cause the agent to loop. The model gets an error, isn't sure what to do, and tries the exact same call again. Rule 7 forces it to try something different.

The OpenAI ReAct Agent

# openai_react_agent.py
from __future__ import annotations
import json
import os
import re
import asyncio
from openai import AsyncOpenAI
from react_agent import ReActAgent, StepType
from memory_system import MemorySystem
from tools import ToolDefinition, ToolDispatcher

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

REACT_SYSTEM_PROMPT = """\
You are a ReAct agent. Solve tasks by interleaving thinking and acting.

Always respond in this format:
Thought: <reason about your current state, summarise what you know so far>
Action: <tool call or FINISH>

Rules:
- Start Thought by recapping observations from previous steps.
- Only use tool names from the list below. Never invent tools.
- Call FINISH only when confident the goal is fully achieved.
- FINISH format: FINISH: <final answer>
- Never repeat the same action with the same arguments.

Available tools:
{tool_descriptions}
"""


class OpenAIReActAgent(ReActAgent):
    def __init__(
        self,
        goal: str,
        dispatcher: ToolDispatcher,
        session_id: str,
        model: str = "gpt-4o",
        max_steps: int = 20,
    ) -> None:
        super().__init__(
            name="openai_react_agent",
            tools={t["name"]: None for t in dispatcher.schemas()},  # names only for base
            max_steps=max_steps,
        )
        self.goal = goal
        self.dispatcher = dispatcher
        self.model = model
        self.memory = MemorySystem(session_id=session_id)

        # Build tool descriptions for the system prompt
        tool_desc = "\n".join(
            f"- {s['name']}: {s['description']}"
            for s in dispatcher.schemas()
        )
        self._system_prompt = REACT_SYSTEM_PROMPT.format(
            tool_descriptions=tool_desc
        )

    async def start(self) -> None:
        """Restore memory and initialise the session."""
        await self.memory.restore()
        await self.memory.add("system", self._system_prompt, persist=False)
        await self.memory.add("user", f"Goal: {self.goal}")

    async def _think(self, goal: str) -> str:
        """Call OpenAI and parse the Thought line."""
        # Inject relevant episodic memories before generating the thought
        recent_obs = [
            s.content for s in self.trace[-3:]
            if s.type == StepType.OBSERVATION
        ]
        query = recent_obs[-1] if recent_obs else goal
        recalls = self.memory.recall_relevant(query, top_k=2)

        messages = self.memory.to_messages()
        if recalls:
            recall_text = "Recalled from memory:\n" + "\n".join(f"- {r}" for r in recalls)
            messages.append({"role": "system", "content": recall_text})

        response = await client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=0.0,  # deterministic for agent reasoning
        )
        raw = response.choices[0].message.content or ""

        # Store the raw response in memory
        await self.memory.add("assistant", raw, episodic=False)

        # Parse: extract the Thought line
        thought_match = re.search(r"Thought:\s*(.+?)(?=\nAction:|$)", raw, re.DOTALL)
        return thought_match.group(1).strip() if thought_match else raw

    async def _decide(self, thought: str) -> str:
        """Parse the Action line from the last assistant message."""
        last_assistant = [
            e for e in self.memory.short_term.last(5)
            if e.role == "assistant"
        ]
        if not last_assistant:
            return "FINISH: Could not determine action."

        raw = last_assistant[-1].content
        action_match = re.search(r"Action:\s*(.+?)$", raw, re.MULTILINE | re.DOTALL)
        if not action_match:
            return "FINISH: No action found in response."

        return action_match.group(1).strip()

    async def _act(self, tool_name: str, args: str) -> str:
        if tool_name == "FINISH":
            # args is the final answer — handled by run()
            return ""

        result = await self.dispatcher.call(tool_name, args)
        await self.memory.add(
            "user",
            f"Observation: {result}",
            episodic=True,  # store observations in episodic memory
        )
        return result

A few implementation details worth noting:

temperature=0.0: I always use zero temperature for agent reasoning. Higher temperatures introduce randomness into tool selection, which is rarely useful and often harmful. Save non-zero temperatures for creative generation tasks.

Episodic storage on observations: I only store observations (tool results) in episodic memory, not thoughts or actions. Observations contain factual information the agent needs to recall. Thoughts are reasoning artifacts — useful in the trace, but not worth embedding.

_think and _decide from the same API call: I call the API once in _think and parse both the Thought and Action from the response. _decide just re-reads what was already stored. This halves the number of API calls vs calling the API separately for each.

Structured Output for Deterministic Tool Selection

For tasks where I need the tool call to be strictly schema-valid, I use OpenAI's structured output mode with response_format. This is especially useful when the agent talks to another system that expects exact field names.

from pydantic import BaseModel
from typing import Literal


class AgentAction(BaseModel):
    thought: str
    action_type: Literal["tool_call", "finish"]
    tool_name: str | None = None
    tool_args: dict | None = None
    final_answer: str | None = None


async def _think_structured(self, goal: str) -> AgentAction:
    response = await client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",  # structured output requires this model or later
        messages=self.memory.to_messages(),
        response_format=AgentAction,
        temperature=0.0,
    )
    return response.choices[0].message.parsed

With structured output, the model is physically incapable of returning a malformed response. No regex parsing, no if ":" in action_str guards. I switched my personal code generation agent to this approach and eliminated an entire category of parse errors.

Streaming Tool Calls

For long-running agent tasks in a web app, streaming the response is important for perceived responsiveness. OpenAI supports streaming even when tool calls are interleaved:

async def _think_streaming(self, goal: str) -> str:
    """Stream the response and accumulate tool call chunks."""
    stream = await client.chat.completions.create(
        model=self.model,
        messages=self.memory.to_messages(),
        stream=True,
        temperature=0.0,
    )

    full_content = ""
    async for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            full_content += delta.content
            print(delta.content, end="", flush=True)  # live output

    print()  # newline after streaming ends
    return full_content

I use streaming in any agent that runs for more than 10 seconds. Users seeing tokens appear feels faster than a spinner, even if the total time is the same.

Complete Runnable Example

# run_openai_agent.py
import asyncio
import os
from openai_react_agent import OpenAIReActAgent
from dispatcher import ToolDispatcher
from tools import shell_tool, kv_set_tool, kv_get_tool
from long_term_memory import init_db


async def main() -> None:
    await init_db()

    dispatcher = ToolDispatcher([shell_tool, kv_set_tool, kv_get_tool])

    agent = OpenAIReActAgent(
        goal="Find the three largest files in the /tmp directory and store the biggest one's path under the key 'largest_tmp_file'.",
        dispatcher=dispatcher,
        session_id="session_001",
        model="gpt-4o",
    )

    await agent.start()
    answer = await agent.run(agent.goal)

    print(f"\n=== Answer ===\n{answer}\n")
    print("=== Trace ===")
    agent.print_trace()


if __name__ == "__main__":
    asyncio.run(main())

When this runs on my machine, the agent:

Thinks: "I need to list files in /tmp sorted by size"
Runs ls -lhS /tmp | head -4 via shell_tool
Thinks: "I can see the largest file is X. Now I'll store it."
Calls kv_set_tool with the path
Thinks: "Both steps complete. I'm confident."
Calls FINISH with the answer

The trace is clean, the tool calls are correct, and the episodic memory stores all three observations so future sessions can recall what was in /tmp at this point in time.

Key Takeaways

System prompt engineering is the highest-leverage work in building an OpenAI agent
Pin temperature=0.0 for agent reasoning — randomness hurts reliability
Parse Thought and Action from the same API call to halve API costs
Structured output (response_format) eliminates parse errors for tool dispatch
Store tool observations in episodic memory; don't store thoughts

Up Next

Part 4: Building an Agent with Claude — the same ReAct loop with Anthropic's tool use API, Claude's extended thinking as an explicit reasoning step, and how to use prompt caching to cut costs on long agent runs.

PreviousPart 2: Agent Memory and State NextPart 4: Building an Agent with Claude

Last updated 1 month ago

hashtagFrom Rules to Reasoning

hashtagPrerequisites

hashtagSystem Prompt Engineering for ReAct

hashtagThe OpenAI ReAct Agent

hashtagStructured Output for Deterministic Tool Selection

hashtagStreaming Tool Calls

hashtagComplete Runnable Example

hashtagKey Takeaways

hashtagUp Next