# Part 3: Building an Agent with OpenAI

*Part of the* [*AI Agent Development 101 Series*](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-agent-development-101)

## From Rules to Reasoning

Parts 1 and 2 built the agent skeleton with a stub `_think()` method. Now it's time to replace that stub with a real LLM call. Backing `_think()` with OpenAI's chat completions API means the model doesn't just choose from a list of keywords — it reasons about the goal, writes a thought, and produces an action our dispatcher can execute reliably. Later in this part, structured output tightens that further into strictly schema-valid tool calls.

The `ReActAgent` from Part 1 and the `MemorySystem` from Part 2 slot straight in. The only things that change are `_think()`, `_decide()`, and an `_act()` override that routes observations into memory.

***

## Prerequisites

```bash
pip install "openai>=1.14.0" aiosqlite chromadb sentence-transformers python-dotenv
```

```bash
export OPENAI_API_KEY="sk-..."
```

***

## System Prompt Engineering for ReAct

The system prompt is the most important part of an OpenAI agent. A poorly written system prompt causes the model to:

* Skip the thought step and jump straight to action
* Invent tool names that don't exist
* Call FINISH before verifying the result
* Ignore observations and repeat the same action

Here is the prompt structure I settled on after many iterations:

```python
# prompts.py

REACT_SYSTEM_PROMPT = """\
You are a ReAct agent. You solve tasks by interleaving thinking and acting.

## Format
You must respond in this exact format every turn:

Thought: <reason about your current state and what to do next>
Action: <one of the available tool calls, or FINISH>

## Rules
1. Always write a Thought before every Action.
2. In your Thought, start by summarising what you already know from past observations.
3. If the last observation was an error, diagnose it in your Thought before retrying.
4. Only call FINISH when you are confident the goal is complete.
5. FINISH format: FINISH: <your final answer to the user>
6. Never invent tool names. Only use the tools listed below.
7. Never repeat the same action twice with the same arguments — use a different approach.

## Available tools
{tool_descriptions}

## Goal
{goal}
"""
```

Two things I want to highlight:

**Rule 2 (summarise past observations in the thought):** This is the single change that most improved my agent's accuracy. Without it, on step 10 the model often "forgets" what step 3 returned and redoes the work.

**Rule 7 (never repeat the same action):** Without this, a failing tool call can cause the agent to loop. The model gets an error, isn't sure what to do, and tries the exact same call again. Rule 7 forces it to try something different.

***

## The OpenAI ReAct Agent

```python
# openai_react_agent.py
from __future__ import annotations
import json
import os
import re
import asyncio
from openai import AsyncOpenAI
from react_agent import ReActAgent, StepType
from memory_system import MemorySystem
from tools import ToolDefinition, ToolDispatcher

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

REACT_SYSTEM_PROMPT = """\
You are a ReAct agent. Solve tasks by interleaving thinking and acting.

Always respond in this format:
Thought: <reason about your current state, summarise what you know so far>
Action: <tool call or FINISH>

Rules:
- Start Thought by recapping observations from previous steps.
- Only use tool names from the list below. Never invent tools.
- Call FINISH only when confident the goal is fully achieved.
- FINISH format: FINISH: <final answer>
- Never repeat the same action with the same arguments.

Available tools:
{tool_descriptions}
"""


class OpenAIReActAgent(ReActAgent):
    def __init__(
        self,
        goal: str,
        dispatcher: ToolDispatcher,
        session_id: str,
        model: str = "gpt-4o",
        max_steps: int = 20,
    ) -> None:
        super().__init__(
            name="openai_react_agent",
            tools={t["name"]: None for t in dispatcher.schemas()},  # names only for base
            max_steps=max_steps,
        )
        self.goal = goal
        self.dispatcher = dispatcher
        self.model = model
        self.memory = MemorySystem(session_id=session_id)

        # Build tool descriptions for the system prompt
        tool_desc = "\n".join(
            f"- {s['name']}: {s['description']}"
            for s in dispatcher.schemas()
        )
        self._system_prompt = REACT_SYSTEM_PROMPT.format(
            tool_descriptions=tool_desc
        )

    async def start(self) -> None:
        """Restore memory and initialise the session."""
        await self.memory.restore()
        await self.memory.add("system", self._system_prompt, persist=False)
        await self.memory.add("user", f"Goal: {self.goal}")

    async def _think(self, goal: str) -> str:
        """Call OpenAI and parse the Thought line."""
        # Inject relevant episodic memories before generating the thought
        recent_obs = [
            s.content for s in self.trace[-3:]
            if s.type == StepType.OBSERVATION
        ]
        query = recent_obs[-1] if recent_obs else goal
        recalls = self.memory.recall_relevant(query, top_k=2)

        messages = self.memory.to_messages()
        if recalls:
            recall_text = "Recalled from memory:\n" + "\n".join(f"- {r}" for r in recalls)
            messages.append({"role": "system", "content": recall_text})

        response = await client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=0.0,  # deterministic for agent reasoning
        )
        raw = response.choices[0].message.content or ""

        # Store the raw response in memory
        await self.memory.add("assistant", raw, episodic=False)

        # Parse: extract the Thought line
        thought_match = re.search(r"Thought:\s*(.+?)(?=\nAction:|$)", raw, re.DOTALL)
        return thought_match.group(1).strip() if thought_match else raw

    async def _decide(self, thought: str) -> str:
        """Parse the Action line from the last assistant message."""
        last_assistant = [
            e for e in self.memory.short_term.last(5)
            if e.role == "assistant"
        ]
        if not last_assistant:
            return "FINISH: Could not determine action."

        raw = last_assistant[-1].content
        action_match = re.search(r"Action:\s*(.+?)$", raw, re.MULTILINE | re.DOTALL)
        if not action_match:
            return "FINISH: No action found in response."

        return action_match.group(1).strip()

    async def _act(self, tool_name: str, args: str) -> str:
        if tool_name == "FINISH":
            # args is the final answer — handled by run()
            return ""

        result = await self.dispatcher.call(tool_name, args)
        await self.memory.add(
            "user",
            f"Observation: {result}",
            episodic=True,  # store observations in episodic memory
        )
        return result
```

A few implementation details worth noting:

**`temperature=0.0`**: I always use zero temperature for agent reasoning. Higher temperatures introduce randomness into tool selection, which is rarely useful and often harmful. Save non-zero temperatures for creative generation tasks.

**Episodic storage on observations**: I only store observations (tool results) in episodic memory, not thoughts or actions. Observations contain factual information the agent needs to recall. Thoughts are reasoning artifacts — useful in the trace, but not worth embedding.

**`_think` and `_decide` from the same API call**: I call the API once in `_think` and parse both the Thought and Action from the response. `_decide` just re-reads what was already stored. This halves the number of API calls vs calling the API separately for each.
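To make that single-call parse easy to test in isolation, the two regexes can be factored into one helper. This is a sketch — the series' code keeps the patterns inline in `_think` and `_decide` — but the regexes are the same ones used there:

```python
import re

def parse_react_response(raw: str) -> tuple[str, str]:
    """Split one raw completion into its Thought and Action parts."""
    thought_m = re.search(r"Thought:\s*(.+?)(?=\nAction:|$)", raw, re.DOTALL)
    action_m = re.search(r"Action:\s*(.+?)$", raw, re.MULTILINE | re.DOTALL)
    thought = thought_m.group(1).strip() if thought_m else raw.strip()
    action = action_m.group(1).strip() if action_m else "FINISH: No action found in response."
    return thought, action
```

One API call in, both pieces out — `_decide` never touches the network.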

***

## Structured Output for Deterministic Tool Selection

For tasks where I need the tool call to be strictly schema-valid, I use OpenAI's structured output mode with `response_format`. This is especially useful when the agent talks to another system that expects exact field names.

```python
from pydantic import BaseModel
from typing import Literal


class AgentAction(BaseModel):
    thought: str
    action_type: Literal["tool_call", "finish"]
    tool_name: str | None = None
    tool_args: dict | None = None
    final_answer: str | None = None


async def _think_structured(self, goal: str) -> AgentAction:
    response = await client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",  # structured output requires this model or later
        messages=self.memory.to_messages(),
        response_format=AgentAction,
        temperature=0.0,
    )
    return response.choices[0].message.parsed
```

With structured output, the response is guaranteed to match the schema — the only exception is an explicit refusal, surfaced separately on `message.refusal`. No regex parsing, no `if ":" in action_str` guards. I switched my personal code generation agent to this approach and eliminated an entire category of parse errors.

***

## Streaming Tool Calls

For long-running agent tasks in a web app, streaming the response is important for perceived responsiveness. OpenAI supports streaming even when tool calls are interleaved:

```python
async def _think_streaming(self, goal: str) -> str:
    """Stream the response and accumulate tool call chunks."""
    stream = await client.chat.completions.create(
        model=self.model,
        messages=self.memory.to_messages(),
        stream=True,
        temperature=0.0,
    )

    full_content = ""
    async for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            full_content += delta.content
            print(delta.content, end="", flush=True)  # live output

    print()  # newline after streaming ends
    return full_content
```

I use streaming in any agent that runs for more than 10 seconds. Watching tokens appear feels faster to users than staring at a spinner, even if the total time is the same.
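The example above only accumulates `delta.content`. When the model responds with native `tool_calls` instead, the fragments arrive keyed by `index` — the id and function name once, the JSON arguments string in pieces — and have to be stitched back together. A sketch of the merge logic, with the deltas simplified to dicts for clarity (in the SDK they are objects on `chunk.choices[0].delta.tool_calls`):

```python
def merge_tool_call_deltas(deltas: list[dict]) -> dict[int, dict]:
    """Stitch streamed tool-call fragments back into whole calls.

    Each fragment mirrors an entry of delta.tool_calls: id and name arrive
    once, while the JSON arguments string is split across fragments.
    """
    calls: dict[int, dict] = {}
    for d in deltas:
        call = calls.setdefault(d["index"], {"id": "", "name": "", "arguments": ""})
        if d.get("id"):
            call["id"] = d["id"]
        if d.get("name"):
            call["name"] += d["name"]
        if d.get("arguments"):
            call["arguments"] += d["arguments"]
    return calls
```

Only once the stream ends are the accumulated `arguments` strings valid JSON, so parse them after the loop, not per chunk.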

***

## Complete Runnable Example

```python
# run_openai_agent.py
import asyncio
import os
from openai_react_agent import OpenAIReActAgent
from dispatcher import ToolDispatcher
from tools import shell_tool, kv_set_tool, kv_get_tool
from long_term_memory import init_db


async def main() -> None:
    await init_db()

    dispatcher = ToolDispatcher([shell_tool, kv_set_tool, kv_get_tool])

    agent = OpenAIReActAgent(
        goal="Find the three largest files in the /tmp directory and store the biggest one's path under the key 'largest_tmp_file'.",
        dispatcher=dispatcher,
        session_id="session_001",
        model="gpt-4o",
    )

    await agent.start()
    answer = await agent.run(agent.goal)

    print(f"\n=== Answer ===\n{answer}\n")
    print("=== Trace ===")
    agent.print_trace()


if __name__ == "__main__":
    asyncio.run(main())
```

When this runs on my machine, the agent:

1. Thinks: "I need to list files in /tmp sorted by size"
2. Runs `ls -lhS /tmp | head -4` via `shell_tool`
3. Thinks: "I can see the largest file is X. Now I'll store it."
4. Calls `kv_set_tool` with the path
5. Thinks: "Both steps complete. I'm confident."
6. Calls FINISH with the answer

The trace is clean, the tool calls are correct, and the episodic memory stores both observations so future sessions can recall what was in `/tmp` at this point in time.

***

## Key Takeaways

* System prompt engineering is the highest-leverage work in building an OpenAI agent
* Pin `temperature=0.0` for agent reasoning — randomness hurts reliability
* Parse Thought and Action from the same API call to halve API costs
* Structured output (`response_format`) eliminates parse errors for tool dispatch
* Store tool observations in episodic memory; don't store thoughts

***

## Up Next

[Part 4: Building an Agent with Claude](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-agent-development-101/part-4-claude-agent) — the same ReAct loop with Anthropic's tool use API, Claude's extended thinking as an explicit reasoning step, and how to use prompt caching to cut costs on long agent runs.
