ReAct Prompting — Build a Reasoning + Action Agent
You've seen chain-of-thought prompting push an LLM to think step by step. But thinking isn't enough. When the model needs a fact it doesn't know — today's weather, a Wikipedia article, a calculation result — it hallucinates or stops.
ReAct solves this with a loop: think, act (call a tool), observe the result, then think again. In this tutorial, you'll build that loop from scratch in pure Python. No LangChain. No frameworks. Just you, an LLM, and a few functions.
What Is ReAct? The Thought-Action-Observation Loop
Ask an LLM to summarize a Wikipedia article it has never seen. It will produce a confident, detailed summary — and half the facts will be wrong. The model reasons about the topic but has no way to verify anything against the real world.
That gap between reasoning and reality is exactly what ReAct closes. ReAct stands for Reasoning + Acting. The paper by Yao et al. (2022) showed that LLMs perform dramatically better when they alternate between thinking and taking actions. Instead of generating one big answer, the model follows a repeating cycle: Thought → Action → Observation, again and again, until it reaches a final answer.
Each Thought is the model reasoning about what it knows and what it still needs. Each Action calls a real function — a search engine, a calculator, a database lookup. Each Observation is the raw result that feeds back into the next thought.
The mental model is simple: a ReAct agent is a while loop where the LLM is the brain and your Python functions are its hands. The brain decides what to do, the hands do it, and the brain processes the result.
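That mental model fits in a dozen lines. The sketch below is illustrative, not the tutorial's final implementation: `llm` stands in for any function that turns the conversation so far into a parsed `{"action", "input"}` decision.

```python
def react_skeleton(question, llm, tools, max_steps=5):
    """Minimal shape of a ReAct agent: the LLM decides, the tools execute."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm("\n".join(history))                 # the brain: Thought + Action
        if step["action"] == "finish":
            return step["input"]                       # final answer ends the loop
        result = tools[step["action"]](step["input"])  # the hands: run the tool
        history.append(f"Observation: {result}")       # feed the result back
    return "Stopped: reached max_steps"
```

The rest of this tutorial fills in each placeholder: real tools, a real prompt, a real parser, and a real API call.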
Setup and Tool Definitions
A ReAct agent is only as useful as its tools. Each tool is a plain Python function that takes a string input and returns a string output. The three tools below cover math, string manipulation, and fact lookup — deliberately simple so you can focus on the loop itself.
We need a registry so the agent can discover which tools exist. The registry is a dictionary mapping each tool name to its function and a natural-language description the LLM reads to decide when to call it.
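A sketch of what those three tools and the registry could look like. The fact table, tool names, and descriptions are illustrative stand-ins; `eval` is acceptable for a demo but should never see untrusted input in production.

```python
def calculator(expression: str) -> str:
    """Evaluate an arithmetic expression like '2 * 3'."""
    try:
        return str(eval(expression, {"__builtins__": {}}, {}))
    except Exception as e:
        return f"Error: {e}"

def string_length(text: str) -> str:
    """Return the number of characters in the input string."""
    return str(len(text))

FACTS = {  # tiny hardcoded lookup table for the demo
    "speed of light": "The speed of light is 299,792,458 m/s.",
}

def knowledge_base(query: str) -> str:
    """Look up a stored fact by substring match against the keys."""
    for key, fact in FACTS.items():
        if key in query.lower():
            return fact
    return f"No fact found for: {query}"

TOOLS = {
    "calculator": {
        "function": calculator,
        "description": "Evaluates an arithmetic expression, e.g. '2 * 3'.",
    },
    "string_length": {
        "function": string_length,
        "description": "Returns the number of characters in a string.",
    },
    "knowledge_base": {
        "function": knowledge_base,
        "description": "Looks up a stored fact, e.g. 'speed of light'.",
    },
}
```

Each entry pairs the callable with the description the LLM will read, so the registry is the single source of truth for what the agent can do.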
The ReAct Prompt Template
The system prompt is the most critical piece. If it does not enforce the output format strictly, the model will answer questions directly instead of using tools. The template below injects tool descriptions dynamically, so adding a new tool to the registry updates the prompt automatically.
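One way to write that template builder; the exact prompt wording here is an assumption, but the shape (tool descriptions injected into a strict format spec) is what the tutorial describes.

```python
def build_react_prompt(tools: dict) -> str:
    """Render the system prompt with the current tool descriptions injected."""
    tool_lines = "\n".join(
        f"- {name}: {spec['description']}" for name, spec in tools.items()
    )
    return f"""Answer the question by alternating Thought, Action, and Observation.

Available tools:
{tool_lines}

Use EXACTLY this format on every turn:
Thought: <your reasoning about what to do next>
Action: <one tool name from the list above, or 'finish'>
Action Input: <the input to pass to the tool, or your final answer>

After each Action you will receive an Observation. When you know the
final answer, respond with Action: finish."""
```

Because the tool list is rendered at call time, registering a new tool changes the prompt with no other code edits.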
Parsing the Agent's Output
The parser extracts three fields from the model's free-text response: thought (the reasoning), action (which tool to call), and action_input (the argument). Each regex targets one labeled line. The function returns a dictionary with all three values, defaulting to empty strings if a field is missing.
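A minimal version of that parser, assuming the labeled-line format from the prompt template:

```python
import re

def parse_response(text: str) -> dict:
    """Extract Thought / Action / Action Input from the model's free text.
    Missing fields default to empty strings."""
    def grab(label: str) -> str:
        m = re.search(rf"{label}:\s*(.*)", text)
        return m.group(1).strip() if m else ""
    return {
        "thought": grab("Thought"),
        "action": grab("Action"),
        "action_input": grab("Action Input"),
    }
```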
This regex-based approach handles about 95% of model outputs cleanly. The remaining 5% — extra whitespace, commentary after the Action Input — is where production agents add retry logic.
Building the ReAct Loop
Everything we've built so far — tools, registry, prompt, parser — comes together in one async function. The loop works like this: on each iteration, the LLM generates a response, the parser extracts an action, and the tool result is fed back as an observation. Messages accumulate in a list, giving the model its full reasoning history.
Three details matter here. First, temperature=0.0 keeps tool selection deterministic. Second, observations go in as "user" messages (not "assistant") so the model treats them as new information rather than its own prior output. Third, the finish action exits the loop and returns the final answer.
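The loop can be sketched as follows. One assumption for testability: the model call is injected as an async `llm` callable that takes the message list and returns the response text; in the tutorial this would wrap the actual chat-completion API call with temperature=0.0.

```python
import asyncio
import re

def _parse(text: str) -> dict:
    """Minimal copy of the parser from the previous section."""
    def grab(label):
        m = re.search(rf"{label}:\s*(.*)", text)
        return m.group(1).strip() if m else ""
    return {"thought": grab("Thought"), "action": grab("Action"),
            "action_input": grab("Action Input")}

async def react_agent(question, tools, system_prompt, llm, max_steps=5):
    """The ReAct loop: generate, parse, act, observe, repeat."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {question}"},
    ]
    for _ in range(max_steps):
        reply = await llm(messages)
        messages.append({"role": "assistant", "content": reply})
        parsed = _parse(reply)
        if parsed["action"] == "finish":
            return parsed["action_input"]       # final answer exits the loop
        if parsed["action"] in tools:
            observation = tools[parsed["action"]]["function"](parsed["action_input"])
        else:
            observation = (f"Error: Unknown tool '{parsed['action']}'. "
                           f"Available tools: {list(tools)}")
        # Observations go back as *user* messages so the model treats them
        # as new information rather than its own prior output.
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Stopped: reached max_steps without a final answer."
```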
The Agent in Action
Start with a single-tool query — a calculation the model shouldn't attempt in its head:
One thought-action-observation cycle and done. The real power emerges with multi-step queries that chain tools together:
Watch the trace carefully. The agent first looks up the speed of light, extracts the number from the observation, then asks the calculator to multiply. Two separate tool calls, each informed by the previous result. The model does what you'd do manually — look something up, then compute with it.
Handling Errors and Edge Cases
Production agents break in predictable ways. The three most common failure modes: the model invents a tool name, the tool throws an error, or the model loops on the same action. Our agent already handles the first two — let's verify:
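The unknown-tool branch can be exercised in isolation. The registry contents here are illustrative; the point is that the dispatch code turns a bad tool name into an observation rather than an exception:

```python
# A hallucinated tool name becomes an error string the model can read.
tools = {"calculator": {"function": str, "description": "math"}}
action, action_input = "web_search", "python creator"  # invented by the model
if action in tools:
    observation = tools[action]["function"](action_input)
else:
    observation = f"Error: Unknown tool '{action}'. Available tools: {list(tools)}"
```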
The agent tried a nonexistent tool, got an error listing the available options, and either retried with a valid tool or gave up gracefully. The error message itself becomes an observation the model can reason about.
Exercise: Create a function called word_count that takes a string and returns the number of words in it (words are separated by whitespace). Then create a dictionary called word_count_tool with keys "function" (pointing to your function) and "description" (a string explaining what the tool does).
Example: word_count("Hello world foo bar") should return "4".
Building a Wikipedia Research Agent
Five hardcoded facts won't cut it for real research. This section builds a two-tool Wikipedia system: wiki_search finds article titles by keyword matching, and wiki_lookup retrieves the summary of a specific article. The pattern mirrors real search engines — first discover, then read.
The mock database below stores four articles with summaries and section lists. The wiki_search function uses substring matching to find relevant titles, while wiki_lookup does a case-insensitive exact match with a fuzzy fallback. Both return descriptive strings the agent can parse.
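A sketch of that two-tool system. The article titles, summaries, and section lists below are hypothetical stand-ins for the tutorial's four articles, chosen to match the examples discussed later (Python, Guido van Rossum, ReAct):

```python
# Mock article store; keys are lowercase titles.
WIKI_ARTICLES = {
    "python (programming language)": {
        "summary": "Python is a high-level language created by Guido van Rossum.",
        "sections": ["History", "Design philosophy", "Syntax"],
    },
    "guido van rossum": {
        "summary": "Guido van Rossum created Python and served as its BDFL.",
        "sections": ["Early life", "Python", "BDFL"],
    },
    "react (paper)": {
        "summary": "ReAct interleaves reasoning traces and actions in LLMs.",
        "sections": ["Method", "Results"],
    },
    "large language model": {
        "summary": "LLMs are neural networks trained on large text corpora.",
        "sections": ["Architecture", "Training"],
    },
}

def wiki_search(query: str) -> str:
    """Discover step: find article titles containing any word of the query."""
    words = query.lower().split()
    hits = [t for t in WIKI_ARTICLES if any(w in t for w in words)]
    if not hits:
        return f"No articles found for: {query}"
    return "Found articles: " + "; ".join(hits)

def wiki_lookup(title: str) -> str:
    """Read step: return one article's summary, with a fuzzy fallback."""
    key = title.lower().strip()
    if key in WIKI_ARTICLES:
        return WIKI_ARTICLES[key]["summary"]
    for t in WIKI_ARTICLES:            # fuzzy fallback: substring match
        if key in t or t in key:
            return WIKI_ARTICLES[t]["summary"]
    return f"Article not found: {title}"

WIKI_TOOLS = {
    "wiki_search": {
        "function": wiki_search,
        "description": "Searches article titles by keyword; returns matching titles.",
    },
    "wiki_lookup": {
        "function": wiki_lookup,
        "description": "Returns the summary of one article, given its exact title.",
    },
    # the calculator entry from the first registry is reused unchanged
}
```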
With the Wikipedia tools defined, we create a new tool registry that swaps knowledge_base and string_length for wiki_search and wiki_lookup while keeping calculator. The build_react_prompt function generates a fresh system prompt from the new descriptions. The same react_agent function handles execution — the only difference is which tools and prompt we pass in.
This is the payoff of making react_agent parameterized. Same loop, different tools and prompt. The agent searches for Python, finds Guido van Rossum, looks up his article, extracts the BDFL detail, and delivers a grounded answer.
The agent needs at least two lookups — the ReAct article and the Python article — then synthesizes both into a single answer. Each observation narrows what it still needs.
Exercise: Create a function called wiki_section_count that takes a Wikipedia article title and returns how many sections that article has. Use the WIKI_ARTICLES dictionary defined above.
If the article exists, return "The article '<title>' has <N> sections." (with the title lowercased). If not, return "Article not found: <title>" (also lowercased).
Remember: keys in WIKI_ARTICLES are all lowercase.
Agent Traces — Debugging Your Agent's Reasoning
When an agent gives a wrong answer, you need to know exactly where its reasoning went off track. Was it a bad thought? A wrong tool choice? A misinterpreted observation? The most useful debugging tool is a structured trace — a dictionary capturing every step, every tool call, and the total token count.
The traced_agent function below mirrors react_agent but instead of printing, it builds a trace dictionary. Each step records the thought, action, action input, and observation. The trace also accumulates total_tokens from the API response, so you can see the cost of each run.
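A sketch of `traced_agent`, under the same assumption as before: the model call is an injected async `llm` callable, here returning a `(text, tokens_used)` pair so the trace can accumulate cost.

```python
import asyncio
import re

async def traced_agent(question, tools, system_prompt, llm, max_steps=5):
    """Same loop as react_agent, but builds a structured trace instead of
    printing. Each step records thought, action, input, and observation."""
    def parse(text):
        def grab(label):
            m = re.search(rf"{label}:\s*(.*)", text)
            return m.group(1).strip() if m else ""
        return grab("Thought"), grab("Action"), grab("Action Input")

    trace = {"question": question, "steps": [], "total_tokens": 0, "answer": None}
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": f"Question: {question}"}]
    for _ in range(max_steps):
        reply, tokens = await llm(messages)
        trace["total_tokens"] += tokens          # running cost of the whole run
        messages.append({"role": "assistant", "content": reply})
        thought, action, action_input = parse(reply)
        if action == "finish":
            trace["answer"] = action_input
            break
        observation = (tools[action]["function"](action_input)
                       if action in tools else f"Error: Unknown tool '{action}'")
        trace["steps"].append({"thought": thought, "action": action,
                               "action_input": action_input,
                               "observation": observation})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return trace
```

Because the trace is plain data, it can be dumped to JSON, diffed across runs, or fed to an evaluation script.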
Tracking Token Costs Per Agent Run
Each ReAct step requires a full API call. A 5-step agent run sends the growing message list 5 times, and later calls include all previous thoughts and observations. Token usage grows quadratically with step count because each call re-sends the entire conversation history.
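A quick back-of-envelope makes the growth concrete. The per-step sizes below are assumed round numbers, not measurements:

```python
# Every call re-sends the whole history, so prompt size grows each step.
base = 400    # assumed tokens for system prompt + question
chunk = 200   # assumed tokens added per step (thought + observation)
total = 0
for step in range(1, 6):
    call_tokens = base + chunk * (step - 1)  # history re-sent on this call
    total += call_tokens
    print(f"Step {step}: {call_tokens} prompt tokens (running total {total})")
# Total over n steps is base*n + chunk*n*(n-1)/2, i.e. O(n^2) in step count.
```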
```python
# Agent runs with no limit
result = await react_agent(question)
# Could burn tokens for 10+ steps
```

```python
# Cap steps and track spend
trace = await traced_agent(
    question, tools, prompt,
    max_steps=5,  # Hard limit
)
cost = trace["total_tokens"] / 1e6 * 0.15
print(f"Run cost: ${cost:.4f}")
```

Five ReAct Pitfalls and How to Fix Them
The same failure patterns appear across every ReAct implementation. Here are the top five with concrete fixes:
2. Tool Name Hallucination. The agent invents tools like web_search or python_executor. The error message listing available tools usually corrects this. For stubborn cases, bold the tool names in the system prompt.
```python
# Tool returns whatever it gets
def wiki_lookup(title):
    return full_article_text  # Could be 10,000+ chars
```

```python
# Tool truncates long output
def wiki_lookup(title, max_chars=500):
    text = full_article_text
    if len(text) > max_chars:
        return text[:max_chars] + "... [truncated]"
    return text
```

Exercise: Create a function called safe_execute that takes three arguments: tool_name (str), tool_input (str), and tools_dict (dict). It should:
1. If tool_name is not in tools_dict, return "Error: Unknown tool '<tool_name>'" (with the actual tool name).
2. Call the tool function with tool_input.
3. If the result is longer than 300 characters, truncate to 300 characters and append " [truncated]".
4. Return the (possibly truncated) result.
ReAct vs Chain-of-Thought vs Function Calling
You now have three approaches to LLM reasoning in your toolkit. Picking the right one depends on whether you need reasoning, tool access, or both.
My rule of thumb: if the task needs only reasoning, use CoT. If it needs tools but not complex reasoning, use function calling. If it needs both — looking things up, computing, then reasoning about results — use ReAct.
Summary and Next Steps
You built a ReAct agent from scratch in pure Python — no frameworks required. Here's what you covered: plain-function tools with a registry, a strict prompt template that enforces the output format, a regex parser, the async Thought-Action-Observation loop, error handling for bad tool calls, a two-tool Wikipedia research agent, structured traces for debugging, and token-cost tracking with every run capped by a max_steps ceiling.

Frequently Asked Questions
Can ReAct agents use more than 10 tools?
Yes, but there's a tradeoff. More tools means a longer system prompt consuming more of the context window. Agents start struggling to pick the right tool above 15-20 options. For large toolsets, use a two-stage approach: one agent picks the category, a second picks the specific tool.
How do I add memory across conversations?
The agent we built is stateless — each call starts fresh. To add memory, persist traces to a database and inject a summary of previous interactions into the system prompt. The chatbot with memory tutorial walks through this pattern step by step.
Is ReAct the same as LangChain's agents?
LangChain's AgentExecutor implements a very similar loop with additional abstractions for tool management and memory. Understanding the raw loop — which you just built — makes debugging LangChain agents much easier.
What models work best for ReAct?
Any model that reliably follows formatting instructions. GPT-4o and GPT-4o-mini are excellent. Claude 3.5 Sonnet works well. Smaller open-source models (7B parameters or less) often struggle with strict output formats. For small models, simplify to Action: tool_name(input) on a single line.