Part 3: Building an Agent with OpenAI
Part of the AI Agent Development 101 Series
From Rules to Reasoning
Parts 1 and 2 built the agent skeleton with a stub _think() method. Now it's time to replace that stub with a real LLM call. Using OpenAI's function calling API means the model doesn't just choose from a list of keywords — it reasons about the goal, writes a thought, and produces a structured tool call that our dispatcher can execute reliably.
The ReActAgent from Part 1 and the MemorySystem from Part 2 slot straight in. The only thing that changes is _think() and _decide().
Prerequisites
pip install openai>=1.14.0 aiosqlite chromadb sentence-transformers python-dotenvexport OPENAI_API_KEY="sk-..."System Prompt Engineering for ReAct
The system prompt is the most important part of an OpenAI agent. A poorly written system prompt causes the model to:
Skip the thought step and jump straight to action
Invent tool names that don't exist
Call FINISH before verifying the result
Ignore observations and repeat the same action
Here is the prompt structure I settled on after many iterations:
Two things I want to highlight:
Rule 2 (summarise past observations in the thought): This is the single change that most improved my agent's accuracy. Without it, on step 10 the model often "forgets" what step 3 returned and redoes the work.
Rule 7 (never repeat the same action): Without this, a failing tool call can cause the agent to loop. The model gets an error, isn't sure what to do, and tries the exact same call again. Rule 7 forces it to try something different.
The OpenAI ReAct Agent
A few implementation details worth noting:
temperature=0.0: I always use zero temperature for agent reasoning. Higher temperatures introduce randomness into tool selection, which is rarely useful and often harmful. Save non-zero temperatures for creative generation tasks.
Episodic storage on observations: I only store observations (tool results) in episodic memory, not thoughts or actions. Observations contain factual information the agent needs to recall. Thoughts are reasoning artifacts — useful in the trace, but not worth embedding.
_think and _decide from the same API call: I call the API once in _think and parse both the Thought and Action from the response. _decide just re-reads what was already stored. This halves the number of API calls vs calling the API separately for each.
Structured Output for Deterministic Tool Selection
For tasks where I need the tool call to be strictly schema-valid, I use OpenAI's structured output mode with response_format. This is especially useful when the agent talks to another system that expects exact field names.
With structured output, the model is physically incapable of returning a malformed response. No regex parsing, no if ":" in action_str guards. I switched my personal code generation agent to this approach and eliminated an entire category of parse errors.
Streaming Tool Calls
For long-running agent tasks in a web app, streaming the response is important for perceived responsiveness. OpenAI supports streaming even when tool calls are interleaved:
I use streaming in any agent that runs for more than 10 seconds. Users seeing tokens appear feels faster than a spinner, even if the total time is the same.
Complete Runnable Example
When this runs on my machine, the agent:
Thinks: "I need to list files in /tmp sorted by size"
Runs
ls -lhS /tmp | head -4viashell_toolThinks: "I can see the largest file is X. Now I'll store it."
Calls
kv_set_toolwith the pathThinks: "Both steps complete. I'm confident."
Calls FINISH with the answer
The trace is clean, the tool calls are correct, and the episodic memory stores all three observations so future sessions can recall what was in /tmp at this point in time.
Key Takeaways
System prompt engineering is the highest-leverage work in building an OpenAI agent
Pin
temperature=0.0for agent reasoning — randomness hurts reliabilityParse Thought and Action from the same API call to halve API costs
Structured output (
response_format) eliminates parse errors for tool dispatchStore tool observations in episodic memory; don't store thoughts
Up Next
Part 4: Building an Agent with Claude — the same ReAct loop with Anthropic's tool use API, Claude's extended thinking as an explicit reasoning step, and how to use prompt caching to cut costs on long agent runs.
Last updated