Part 4: Building an Agent with Claude
Part of the AI Agent Development 101 Series
Why I Started Using Claude for Agents
My first production agent used GPT-4o exclusively. It was reliable and fast. Then I got a task that required reading a large configuration file (around 80K tokens) before deciding what to do. GPT-4o at the time had a 128K context limit — technically enough, but the model's reasoning quality degraded noticeably when the context was that full. I was also paying for 80K input tokens on every single reasoning step.
I switched the agent to Claude 3.5 Sonnet for that project. Two things improved immediately: reasoning quality on large-context tasks stayed consistent, and prompt caching cut my input token costs by around 70% because the large config file was always at the start of the prompt.
This part covers rebuilding the ReAct agent from Part 3 using Anthropic's API, and goes deeper on the features that make Claude specifically useful for agent work.
Prerequisites
```shell
pip install "anthropic>=0.25.0" aiosqlite chromadb sentence-transformers python-dotenv
export ANTHROPIC_API_KEY="sk-ant-..."
```

System Prompt Engineering for Claude
Claude and GPT-4o respond differently to the same prompt. In my experience:
Claude follows numbered rule lists very reliably — more so than bullet points
Claude benefits from an explicit output format section with a filled-in example
Claude respects negative instructions ("never do X") consistently, whereas GPT-4o sometimes needs them rephrased positively
Here is the system prompt I use for my Claude ReAct agent:
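The exact prompt isn't reproduced here, but a minimal sketch of the structure described above — numbered rules plus an explicit output format with a filled-in example — looks like this. The rule wording and the example tool call are illustrative, not the original prompt:

```python
# Illustrative ReAct system prompt following the structure described above:
# numbered rules and a filled-in output-format example. Wording is a sketch.
SYSTEM_PROMPT = """You are a ReAct agent. Follow these rules exactly:

1. On every step, output exactly one Thought, then exactly one Action.
2. Begin every Thought with "So far I know:" and summarize what you have
   learned toward the goal before reasoning about the next step.
3. Never call a tool with arguments you have not confirmed in a prior
   Observation.
4. When the goal is met, output a Final Answer instead of an Action.

Output format (example):
Thought: So far I know: the config lists 3 services. I still need the ports.
Action: read_file({"path": "services.yaml"})
"""
```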
The "So far I know:" opener in Rule 2 is something I added after noticing Claude would occasionally drift from the original goal on long tasks. Forcing it to articulate accumulated knowledge at the start of each thought anchors the reasoning.
The Claude ReAct Agent
The structure mirrors the OpenAI agent from Part 3. The differences are in how tool calls are sent and how responses are parsed.
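The two mechanical differences can be sketched as helpers (the function names are mine, not part of either SDK): Anthropic tool specs use a flat `input_schema` instead of OpenAI's nested `function.parameters`, and tool calls come back as `tool_use` content blocks interleaved with text, rather than in a separate `tool_calls` field.

```python
# Sketch of the two API-shape differences vs the OpenAI agent in Part 3.
# Tool names and the sample response content are illustrative.

def to_anthropic_tool(name: str, description: str, parameters: dict) -> dict:
    """Anthropic expects a flat spec with `input_schema`, not OpenAI's
    nested {"type": "function", "function": {..., "parameters": ...}}."""
    return {"name": name, "description": description, "input_schema": parameters}

def extract_tool_calls(content_blocks: list[dict]) -> list[tuple[str, str, dict]]:
    """Claude returns tool calls as `tool_use` blocks inside `content`,
    interleaved with text blocks, rather than in a separate field."""
    return [
        (b["id"], b["name"], b["input"])
        for b in content_blocks
        if b.get("type") == "tool_use"
    ]

# Example response content, shaped like the dicts the SDK returns:
response_content = [
    {"type": "text",
     "text": "Thought: So far I know: nothing yet. I should read the config."},
    {"type": "tool_use", "id": "toolu_01",
     "name": "read_file", "input": {"path": "app.yaml"}},
]
calls = extract_tool_calls(response_content)
```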
Extended Thinking as an Explicit Reasoning Step
Claude's extended thinking is different from the Thought in the ReAct loop. The Thought is output text the model writes and that you see. Extended thinking is an internal reasoning process the model runs before generating output — it solves harder subproblems, checks its own logic, and only then writes the Thought and Action.
I use extended thinking for:
Tasks where the model needs to plan more than 3 steps ahead
Debugging tasks where it needs to reason about ambiguous error messages
Any task where I've seen the model make a systematic reasoning error without it
Enable it selectively — it costs extra tokens and adds latency. For simple tool dispatch tasks it's overkill.
When I inspected the thinking blocks on complex tasks, I found Claude uses them to build a mental map of dependencies before deciding which tool to call first. Without extended thinking, it would sometimes call tools in an order that required backtracking.
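One way to enable it selectively is to build the request kwargs per step. This is a sketch: the helper name and the 10K-token budget are my choices, and the model alias is an assumption — use a model version that actually supports extended thinking.

```python
def build_request_kwargs(messages: list[dict], complex_task: bool) -> dict:
    """Turn extended thinking on only for steps that need deep planning.
    The model alias and the 10K-token budget are illustrative assumptions."""
    kwargs = {
        "model": "claude-3-7-sonnet-latest",  # assumes a model with extended thinking
        "max_tokens": 16000,  # must exceed the thinking budget
        "messages": messages,
    }
    if complex_task:
        # The model reasons privately in thinking blocks before emitting
        # the visible Thought/Action text.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 10000}
    return kwargs

# client.messages.create(**build_request_kwargs(messages, complex_task=True))
```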
Prompt Caching to Cut Token Costs
If your agent always starts with the same large document (a codebase, a config file, a dataset), you're paying for those tokens on every step. Anthropic's prompt caching re-uses the KV-cached prefix across calls that share the same content.
Mark the cacheable prefix with cache_control:
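A minimal sketch of what that looks like on the Messages API — the helper name is mine; `cache_control` with type `"ephemeral"` is the actual field:

```python
def build_system_blocks(system_prompt: str, large_document: str) -> list[dict]:
    """Stable content goes first; cache_control on the last stable block
    marks the end of the cacheable prefix. Everything up to and including
    that block is reused across calls with the same content."""
    return [
        {"type": "text", "text": system_prompt},
        {
            "type": "text",
            "text": large_document,  # e.g. the always-present config or codebase
            "cache_control": {"type": "ephemeral"},
        },
    ]

# client.messages.create(model=..., max_tokens=1024,
#                        system=build_system_blocks(SYSTEM_PROMPT, CONFIG),
#                        messages=messages)
```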
Real cost impact from my own usage: I had a code review agent that loaded a 60K-token codebase on every step. After adding prompt caching, the effective input cost dropped by ~72%. Cache reads on claude-3-5-sonnet cost $0.30 per million tokens versus $3.00 per million uncached.
The cache has a 5-minute TTL by default. For an agent whose steps each complete within that window, every step after the first reads the large prefix from cache.
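The arithmetic behind a ~72% saving checks out if roughly 80% of input tokens hit the cache — the 80% fraction here is my assumption to reproduce the figure, not a measured number:

```python
# Sanity check on the ~72% figure using claude-3-5-sonnet input pricing.
# The 80% cached fraction is an assumption, not a measured value.
UNCACHED = 3.00    # $ per 1M input tokens
CACHE_READ = 0.30  # $ per 1M cached input tokens

def effective_rate(cached_fraction: float) -> float:
    """Blended input price when a given fraction of tokens hits the cache."""
    return (1 - cached_fraction) * UNCACHED + cached_fraction * CACHE_READ

savings = 1 - effective_rate(0.80) / UNCACHED  # roughly 0.72
```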
Complete Runnable Example
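Below is a sketch of a complete single-tool loop on the Anthropic API. The calculator tool, model alias, and step limit are illustrative choices, and error handling is omitted for brevity — it's a minimal skeleton, not the full agent from the repo.

```python
# Minimal single-tool ReAct loop on the Anthropic Messages API.
# Tool, model alias, and step limit are illustrative.

TOOLS = [{
    "name": "calculator",
    "description": "Evaluate a basic arithmetic expression, e.g. '17 * 23'.",
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}]

def run_tool(name: str, args: dict) -> str:
    """Execute a tool call locally and return the Observation as text."""
    if name == "calculator":
        expr = args["expression"]
        # Restrict to arithmetic characters before eval (illustrative, not hardened).
        if not set(expr) <= set("0123456789+-*/(). "):
            return "error: unsupported characters"
        return str(eval(expr))
    return f"error: unknown tool {name}"

def run_agent(task: str, max_steps: int = 10) -> str:
    from anthropic import Anthropic  # requires ANTHROPIC_API_KEY in the environment
    client = Anthropic()
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        resp = client.messages.create(
            model="claude-3-5-sonnet-latest",  # alias is an assumption; pin your own
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason != "tool_use":
            # No tool call requested: the text blocks hold the final answer.
            return "".join(b.text for b in resp.content if b.type == "text")
        tool_results = [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": run_tool(b.name, b.input)}
            for b in resp.content if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})
    return "stopped: step limit reached"

# Demo (makes real API calls; needs ANTHROPIC_API_KEY):
# print(run_agent("What is 17 * 23? Use the calculator tool."))
```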
The trace from Claude is noticeably more verbose in the Thought steps than GPT-4o. Claude tends to write longer, more detailed reasoning. That's useful for debugging — I can follow the logic precisely — but it does mean slightly higher output token usage.
Head-to-Head: OpenAI vs Claude for Single Agents
After running both on the same task set in my personal projects, here is my honest comparison:
| Factor | OpenAI gpt-4o | Claude 3.5 Sonnet |
| --- | --- | --- |
| Tool selection accuracy (5-tool agent) | High | High |
| Reasoning quality on long tasks (>20 steps) | Degrades slightly | Stays consistent |
| Structured output / strict schema | Excellent (response_format) | Good (tool use) |
| Large context handling | Good (128K) | Excellent (200K) |
| Prompt caching | Not available | Available, ~72% savings |
| Thought verbosity | Concise | Verbose (better for debugging) |
| Speed on short tasks | Faster | Slightly slower |
| Cost per step (without caching) | Lower | Slightly higher |
For agents on large documents: Claude wins on both quality and cost (with caching). For short, well-defined tasks: GPT-4o is faster and slightly cheaper. For production agents where I need to debug failures: Claude's verbose thoughts are worth the extra tokens.
Key Takeaways
System prompt structure matters more with Claude — numbered rules and explicit format sections outperform bullet points
Extended thinking is useful for complex multi-step planning; enable it selectively
Prompt caching on the system prompt and large documents cuts costs significantly on repeat calls
Claude's verbose Thought output is a feature, not a bug — it makes agent decisions traceable
The MemorySystem from Part 2 works unchanged with Claude
Up Next
Part 5: Evaluating and Testing Your Agent — deterministic tests for tool dispatch, trajectory evaluation to check if the agent took a sensible path, and regression testing when you upgrade the underlying model.