Native Tool Calling & the Agentic Loop
Module 01 built a ReAct (Reasoning and Acting) agent that parses the model’s text every turn (`Thought:`/`Action:`/`Observation:`). It works, but it is fragile: one stray token and your regex misfires. This module moves to how production agents actually run: the model returns structured tool calls, the framework dispatches them, and structured results go back. Same loop, far fewer ways to break.
Learning objectives
By the end you can:
- Explain the difference between hand-rolled text ReAct and native / structured tool calling, and argue the reliability tradeoff in an interview.
- Write the exact multi-turn message shapes for both dialects (Anthropic `tool_use`/`tool_result` content blocks and OpenAI `tool_calls`/`tool` role) and explain how a library normalizes them.
- Implement a clean, provider-agnostic agentic loop that runs offline on `MockLLM` and live on OpenRouter/Anthropic without branching on the provider.
- Handle the real failure modes: unknown tool, a tool that raises, the parallel-call contract, and the `max_steps` guard.
1. Why move off text-parsing ReAct?
Hand-rolled ReAct asks the model to format its intent as text and then reverse-engineers that text:

    Thought: I should look up the weather.
    Action: get_weather
    Action Input: {"city": "Paris"}

Your framework regexes out `Action:` and `Action Input:`, `json.loads` the args, runs the tool, and appends `Observation: 18C`. Every one of those steps is a place to fail:
    TEXT ReAct (you own the parser)        NATIVE tool calling (the API owns it)
    -----------------------------------    ------------------------------------
    model emits free text                  model emits a typed tool_call object
      |  regex Action/Action Input           |  already structured: id+name+input
      |  json.loads(args)  <-- can crash     |  args validated against your schema
      |  match tool name by string           |  name is a first-class field
      v                                      v
    run tool, append "Observation:"        run tool, append a tool_result block

Native tool calling moves the contract into the API (Application Programming Interface). You send the model a list of tool schemas; the provider returns a structured object with a tool `id`, the tool `name`, and an `input` already shaped to your JSON (JavaScript Object Notation) Schema (often via grammar-constrained decoding, so it is syntactically valid by construction). No prefix to forget, no JSON to hand-extract, no “the model wrote `Action :` with a space.”
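To make the fragility concrete, here is a minimal sketch of the kind of parser text ReAct forces you to own. The regex and the `parse_action` helper are hypothetical illustrations, not module 01's actual code; the point is only that one failure is silent and the other is a crash.

```python
import json
import re

# Hypothetical hand-rolled parser, for illustration only.
ACTION_RE = re.compile(r"^Action: (\w+)\nAction Input: (.*)$", re.MULTILINE)


def parse_action(text):
    """Return (tool_name, args), or None when the text does not match."""
    match = ACTION_RE.search(text)
    if match is None:
        return None  # silent miss: the loop thinks the model is done
    return match.group(1), json.loads(match.group(2))  # json.loads can raise


good = (
    "Thought: I should look up the weather.\n"
    "Action: get_weather\n"
    'Action Input: {"city": "Paris"}'
)
stray_space = good.replace("Action:", "Action :")            # one stray token
bad_json = good.replace('{"city": "Paris"}', "{'city': 'Paris'}")  # Python-style quotes

print(parse_action(good))         # ('get_weather', {'city': 'Paris'})
print(parse_action(stray_space))  # None
try:
    parse_action(bad_json)
except json.JSONDecodeError as exc:
    print("parser crashed:", exc)
```

You can keep hardening the regex, but every patch is one more rule the model has to follow by convention rather than by contract.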
When does text ReAct still matter?
Native calling is the default, but text ReAct is not dead:
- Models / endpoints without a tool-calling API. Older or local models, or a raw completion endpoint, force you back to text parsing.
- Full transparency / portability. The scratchpad is just text — trivially loggable, diffable, and provider-independent. Some eval and research setups want the reasoning trace inline.
- Teaching and debugging. Seeing `Thought:`/`Observation:` interleaved makes the loop legible. (That is exactly why module 01 starts there.)
Interview angle. “Why prefer native tool calling over ReAct text parsing?” → Reliability and separation of concerns. The provider returns a typed call shaped to your schema (and, with strict mode enabled, guaranteed to validate against it), so you delete a whole class of parser bugs (malformed JSON, missing prefixes, tool-name typos) and get parallel calls and `tool_choice` for free. The cost: you depend on a provider feature and a specific wire format. Mitigate that by normalizing to one neutral shape behind a thin translation layer; then your loop never changes when you swap models.
2. The two dialects (know both cold)
There are two wire formats in the wild. Anthropic uses content blocks;
OpenAI uses a tool role and a separate tool_calls field. OpenRouter
speaks the OpenAI dialect, and agentkit translates our neutral
(Anthropic-style) shape to it. Memorize both — interviewers ask you to “walk the
JSON.”
2a. Anthropic dialect (content blocks, no tool role)
Request: tools are `{name, description, input_schema}`:

    {
      "model": "claude-opus-4-8",
      "tools": [{
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }],
      "messages": [{"role": "user", "content": "Weather in Paris?"}]
    }

Response: `stop_reason: "tool_use"`, and a `tool_use` content block:

    {
      "stop_reason": "tool_use",
      "content": [
        {"type": "text", "text": "Let me check."},
        {"type": "tool_use", "id": "toolu_01A", "name": "get_weather",
         "input": {"city": "Paris"}}   // input is a real dict, already parsed
      ]
    }

You continue the conversation by appending (a) the full assistant message as-is, then (b) a user turn whose content starts with a `tool_result` block:

    {"role": "assistant", "content": [ ...the tool_use block above... ]},
    {"role": "user", "content": [
      {"type": "tool_result", "tool_use_id": "toolu_01A", "content": "18C and sunny"}
    ]}

Gotchas specific to Anthropic:

- The `tool_result` block(s) must come first in that user turn, before any text block, or the API 400s.
- The `tool_use_id` must echo the `id` from the call. There is no `tool` role.
- On failure, set `"is_error": true` in the `tool_result`.
2b. OpenAI dialect (a tool role + tool_calls)
Request: tools are wrapped in `{type:"function", function:{...}}`, and the schema key is `parameters` (not `input_schema`):

    {
      "model": "anthropic/claude-haiku-4.5",   // an OpenRouter slug
      "tools": [{
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city.",
          "parameters": {"type": "object",
                         "properties": {"city": {"type": "string"}},
                         "required": ["city"]}
        }
      }],
      "tool_choice": "auto",
      "messages": [{"role": "user", "content": "Weather in Paris?"}]
    }

Response: `finish_reason: "tool_calls"`, and `message.tool_calls`:

    {
      "choices": [{
        "finish_reason": "tool_calls",
        "message": {
          "content": null,
          "tool_calls": [{
            "id": "call_abc",
            "type": "function",
            "function": {"name": "get_weather",
                         "arguments": "{\"city\": \"Paris\"}"}   // a JSON STRING!
          }]
        }
      }]
    }

You continue by appending the assistant message (with its `tool_calls`), then a message with `role: "tool"`:

    {"role": "assistant", "content": null, "tool_calls": [ ...as above... ]},
    {"role": "tool", "tool_call_id": "call_abc", "content": "18C and sunny"}

Gotchas specific to OpenAI:

- `function.arguments` is a JSON-encoded string; you must `json.loads()` it. (Anthropic gives you a dict directly.)
- The result goes in a distinct `tool` role message keyed by `tool_call_id`.
- There is no “results must come first” rule; the `tool` messages just follow the assistant message.
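The continuation rules for the two dialects are easiest to compare side by side. The sketch below builds the follow-up messages from plain dicts shaped like the responses above; the helper names (`anthropic_followup`, `openai_followup`, `openai_arguments`) are hypothetical and are not part of either SDK or of `agentkit`.

```python
import json


def anthropic_followup(assistant_content, outputs):
    """Anthropic: echo the assistant blocks, then ONE user turn of tool_result blocks.

    outputs maps each tool_use id to its result text; a missing id raises
    KeyError, which is what you want (no orphaned calls).
    """
    results = [
        {"type": "tool_result", "tool_use_id": block["id"], "content": outputs[block["id"]]}
        for block in assistant_content
        if block["type"] == "tool_use"
    ]
    return [
        {"role": "assistant", "content": assistant_content},
        {"role": "user", "content": results},  # tool_result blocks lead the turn
    ]


def openai_arguments(tool_call):
    """OpenAI: decode the JSON-string arguments; fall back to {} if malformed."""
    try:
        args = json.loads(tool_call["function"]["arguments"])
    except (json.JSONDecodeError, TypeError):
        return {}
    return args if isinstance(args, dict) else {}


def openai_followup(message, outputs):
    """OpenAI: echo the assistant message, then one role="tool" message per call."""
    followup = [{
        "role": "assistant",
        "content": message.get("content"),
        "tool_calls": message["tool_calls"],
    }]
    for call in message["tool_calls"]:
        followup.append(
            {"role": "tool", "tool_call_id": call["id"], "content": outputs[call["id"]]}
        )
    return followup


anthropic_content = [
    {"type": "text", "text": "Let me check."},
    {"type": "tool_use", "id": "toolu_01A", "name": "get_weather", "input": {"city": "Paris"}},
]
openai_message = {
    "content": None,
    "tool_calls": [{
        "id": "call_abc",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
    }],
}

print(anthropic_followup(anthropic_content, {"toolu_01A": "18C and sunny"}))
print(openai_arguments(openai_message["tool_calls"][0]))
print(openai_followup(openai_message, {"call_abc": "18C and sunny"}))
```

Both helpers start from the echoed assistant message, which is the part people forget; the only real differences are where the result lives and who parses the arguments.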
2c. The neutral shape (what we code against)
The whole point of `agentkit` is that you never write either of the above by hand in your loop. You code against one neutral, Anthropic-style shape and let the library translate at the edge:

                neutral messages (Anthropic-style blocks)
                                  |
                +-----------------+------------------+
                |                                    |
           AnthropicLLM                        OpenRouterLLM
           (sends as-is)          to_openai_messages() / to_openai_tools()
                |                        from_openai_response()
                v                                    v
           Anthropic API                   OpenAI-compatible API

`agentkit.llm` exposes those translators as pure functions so they are unit-testable offline (no network): `to_openai_messages`, `to_openai_tools`, `from_openai_response`, `from_anthropic_response`. Both providers collapse to one `LLMResponse`:

    from dataclasses import dataclass, field

    @dataclass
    class LLMResponse:
        text: str = ""
        # ToolCall(id, name, input: dict). A bare [] default is rejected by
        # @dataclass (mutable default), so use default_factory.
        tool_calls: list[ToolCall] = field(default_factory=list)
        stop_reason: str = "end_turn"   # "tool_use" when it wants a tool

Interview angle. “How would you support multiple model providers in one agent?” → Define a neutral message/response shape, isolate provider JSON to a translation layer, return one normalized response type. The loop, the tools, and the tests then never branch on the provider. Name the two concrete differences you have to bridge: content blocks vs `tool` role, and dict `input` vs JSON-string `arguments`.
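To see what “translate at the edge” involves, here is a minimal sketch of a neutral-to-OpenAI message translator. It is an illustration written for this section, not `agentkit`'s actual `to_openai_messages`; the `_sketch` suffix marks it as hypothetical, and it assumes messages contain only `text`, `tool_use`, and `tool_result` blocks.

```python
import json


def to_openai_messages_sketch(messages, system=None):
    """Translate neutral (Anthropic-style) messages into OpenAI-style messages."""
    out = []
    if system:
        out.append({"role": "system", "content": system})
    for msg in messages:
        content = msg["content"]
        if isinstance(content, str):  # plain-string turns pass through
            out.append({"role": msg["role"], "content": content})
            continue
        text = "".join(b["text"] for b in content if b["type"] == "text")
        if msg["role"] == "assistant":
            # tool_use blocks become tool_calls; the dict input becomes a JSON string.
            calls = [
                {
                    "id": b["id"],
                    "type": "function",
                    "function": {"name": b["name"], "arguments": json.dumps(b["input"])},
                }
                for b in content
                if b["type"] == "tool_use"
            ]
            item = {"role": "assistant", "content": text or None}
            if calls:
                item["tool_calls"] = calls
            out.append(item)
        else:
            # Each tool_result block becomes its own role="tool" message.
            for b in content:
                if b["type"] == "tool_result":
                    out.append({
                        "role": "tool",
                        "tool_call_id": b["tool_use_id"],
                        "content": str(b["content"]),
                    })
            if text:
                out.append({"role": "user", "content": text})
    return out


neutral = [
    {"role": "user", "content": "Weather in Paris?"},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Let me check."},
        {"type": "tool_use", "id": "toolu_01A", "name": "get_weather", "input": {"city": "Paris"}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01A", "content": "18C and sunny"},
    ]},
]
for wire_message in to_openai_messages_sketch(neutral, system="Be brief."):
    print(wire_message)
```

Note that the sketch drops `is_error`: OpenAI tool messages have no such flag, so the error text itself has to carry the signal. Because the function is pure, you can assert on its output in a unit test with no network.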
3. The real agentic loop
Here is the loop in `agent_loop.py`, in words:

    messages = [user turn]
    for step in range(max_steps):                      # <-- the guard
        resp = llm.complete(messages, system, tools=registry.specs())
        messages.append(assistant_turn(resp))          # text + tool_use blocks
        if not resp.tool_calls:                        # model gave its final answer
            return resp.text
        results = []
        for call in resp.tool_calls:                   # run EVERY call
            result, is_error = dispatch(call)          # catch failures!
            results.append(tool_result_block(call, result, is_error))
        messages.append({"role": "user", "content": results})   # ONE turn, all results
    # fell out -> guard fired
    return ""   # or raise / return partial state

Four design decisions worth defending:

- Append the assistant turn before the results. The `tool_result`/`tool` message references the call by id; if the assistant turn that made the call is not in the history, the provider rejects the conversation. This is the single most common native-tool-calling bug.
- Terminate on “no tool calls,” not on `stop_reason`. `resp.tool_calls` being empty is the robust cross-provider “the model is done” signal. (We keep `stop_reason` around for logging and truncation handling.)
- Return a result for every call id (parallel contract). A model can emit several tool calls in one turn. You must dispatch all of them and feed back a result for each id before calling the model again. An orphaned id (a call with no matching result) breaks the conversation on both providers. In production you run these concurrently to cut latency.
- Always bound the loop with `max_steps`. A model can loop forever: call tool → look at result → call the same tool again. Without a cap that is an infinite loop and unbounded spend. The guard is not optional.
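The four decisions above fit in a runnable sketch. Everything here is a stand-in so it runs offline: `ScriptedLLM` is a hypothetical substitute for `MockLLM`, `ToolCall` and `LLMResponse` are redefined locally in simplified form, and tools are a plain dict of callables rather than a `ToolRegistry`.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    id: str
    name: str
    input: dict


@dataclass
class LLMResponse:
    text: str = ""
    tool_calls: list = field(default_factory=list)


class ScriptedLLM:
    """Hypothetical offline model: replays canned responses in order."""

    def __init__(self, responses):
        self._responses = list(responses)

    def complete(self, messages, tools=None):
        return self._responses.pop(0)


def run_agent(llm, tools, user_text, max_steps=5):
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_steps):  # the guard
        resp = llm.complete(messages, tools=list(tools))
        blocks = [{"type": "text", "text": resp.text}] if resp.text else []
        blocks += [
            {"type": "tool_use", "id": c.id, "name": c.name, "input": c.input}
            for c in resp.tool_calls
        ]
        messages.append({"role": "assistant", "content": blocks})  # before results
        if not resp.tool_calls:  # done signal: no tool calls
            return resp.text, messages
        results = []
        for call in resp.tool_calls:  # one result per call id
            try:
                out, is_error = str(tools[call.name](**call.input)), False
            except Exception as exc:  # errors go back to the model as data
                out, is_error = f"Error running tool '{call.name}': {exc}", True
            results.append({
                "type": "tool_result",
                "tool_use_id": call.id,
                "content": out,
                "is_error": is_error,
            })
        messages.append({"role": "user", "content": results})  # ONE turn, all results
    return "", messages  # guard fired


llm = ScriptedLLM([
    LLMResponse("Let me check.", [ToolCall("toolu_01A", "get_weather", {"city": "Paris"})]),
    LLMResponse("It is 18C and sunny in Paris."),
])
answer, history = run_agent(llm, {"get_weather": lambda city: "18C and sunny"}, "Weather in Paris?")
print(answer)
print([m["role"] for m in history])
```

Returning the history alongside the answer is a design choice for this sketch: it lets a test assert on the message order, which is where most of the invariants live.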
Tool errors are data, not crashes
When a tool fails, do not let the exception escape the loop. Catch it, set `is_error=True`, and feed an actionable message back; the model can then retry, switch tools, or apologize. Two cases to handle (`execute_tool_call`):

    try:
        return registry.dispatch(call.name, call.input), False
    except KeyError:                       # model hallucinated a tool name
        available = ", ".join(s["name"] for s in registry.specs()) or "(none)"
        return f"Error: unknown tool '{call.name}'. Available tools: {available}.", True
    except Exception as exc:               # the tool body raised
        return f"Error running tool '{call.name}': {exc}", True

Listing the real tool names in the unknown-tool message materially improves the model’s odds of self-correcting. Write error text like a stack trace for a junior engineer: what failed and what to do next. (Claude will typically retry a failing tool 2–3 times before giving up.)
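Error handling and the parallel contract combine naturally: if every call always yields a result block, you can fan the calls out concurrently without risking an orphaned id. A minimal sketch, assuming tools are a plain dict of callables and calls are dicts with `id`, `name`, and `input` (both assumptions for this example, not `agentkit`'s types):

```python
from concurrent.futures import ThreadPoolExecutor


def execute(tools, call):
    """Run one call; always return a tool_result block, never raise."""
    name = call["name"]
    if name not in tools:  # hallucinated tool name
        available = ", ".join(sorted(tools)) or "(none)"
        text, is_error = f"Error: unknown tool '{name}'. Available tools: {available}.", True
    else:
        try:
            text, is_error = str(tools[name](**call["input"])), False
        except Exception as exc:  # the tool body raised
            text, is_error = f"Error running tool '{name}': {exc}", True
    return {
        "type": "tool_result",
        "tool_use_id": call["id"],
        "content": text,
        "is_error": is_error,
    }


def execute_all(tools, calls, max_workers=4):
    """Dispatch every call concurrently; results come back in call order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda call: execute(tools, call), calls))


tools = {
    "get_weather": lambda city: "18C and sunny",
    "divide": lambda a, b: a / b,
}
calls = [
    {"id": "c1", "name": "get_weather", "input": {"city": "Paris"}},
    {"id": "c2", "name": "get_wether", "input": {"city": "Paris"}},  # typo'd name
    {"id": "c3", "name": "divide", "input": {"a": 1, "b": 0}},       # raises
]
for block in execute_all(tools, calls):
    print(block)
```

Checking membership before calling the tool, rather than catching `KeyError`, is a deliberate choice in this sketch: a `KeyError` raised inside a tool body then cannot be mistaken for an unknown tool. Threads suit I/O-bound tools; CPU-bound tools would need a different executor.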
4. From @tool to the API tools param
You never hand-write JSON Schema. `agentkit`’s `@tool` derives it from the function’s type hints + docstring, and `ToolRegistry.specs()` is exactly what you pass as `tools`:

    @tool
    def get_weather(city: str) -> str:
        """Get the current weather for a city. Use for any weather question."""
        ...

    registry = ToolRegistry([get_weather])
    registry.specs()
    # -> [{"name": "get_weather",
    #      "description": "Get the current weather for a city. ...",
    #      "input_schema": {"type": "object",
    #                       "properties": {"city": {"type": "string"}},
    #                       "required": ["city"]}}]

That neutral spec is sent as-is to Anthropic, or run through `to_openai_tools()` (which rewraps it as `{type:"function", function:{..., parameters: input_schema}}`) for OpenRouter. The description is the single biggest routing lever: if the model keeps picking the wrong tool, fix descriptions first (be specific, say when not to use it), then the schema, then reduce the tool count.
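If you are curious how a decorator can get from type hints to a schema, the sketch below shows one plausible approach with `inspect`, plus the OpenAI rewrap described above. Both functions are hypothetical illustrations (`spec_from_function`, `to_openai_tools_sketch`), not `agentkit`'s implementation, and the type mapping covers only a few primitive types.

```python
import inspect

# Assumed mapping for this sketch; anything else falls back to "string".
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def spec_from_function(fn):
    """Derive a neutral tool spec from fn's signature and docstring."""
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:  # no default => required
            required.append(name)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "input_schema": {"type": "object", "properties": properties, "required": required},
    }


def to_openai_tools_sketch(specs):
    """Rewrap neutral specs in the OpenAI function envelope."""
    return [
        {
            "type": "function",
            "function": {
                "name": spec["name"],
                "description": spec["description"],
                "parameters": spec["input_schema"],  # same schema, different key
            },
        }
        for spec in specs
    ]


def get_weather(city: str, units: str = "metric") -> str:
    """Get the current weather for a city. Use for any weather question."""
    return "18C and sunny"


spec = spec_from_function(get_weather)
print(spec)
print(to_openai_tools_sketch([spec]))
```

The docstring lands verbatim in `description`, which is why writing it for the model (when to use the tool, when not to) pays off directly in routing.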
5. tool_choice and structured outputs (quick but interview-relevant)
- `tool_choice` controls whether/which tool fires. Anthropic: `auto | any | tool | none`. OpenAI: `auto | required | {function:{name}} | none`. Forcing a choice (`any`/`tool`/`required`) prefills the assistant turn, so you get no natural-language preamble: great for extraction, bad for chat.
- Structured outputs ≠ function calling. Structured outputs constrain the final answer’s shape (e.g. OpenAI `response_format`); function calling triggers actions. They are orthogonal. Both major providers now ship native structured outputs: on Claude, constrain the response to a JSON Schema with `output_config={"format":{"type":"json_schema","schema":{...}}}` (or use the SDK, Software Development Kit, helper `client.messages.parse(..., output_format=Model)` for a validated, typed object), and use strict tool use (`"strict": true` on the tool definition, alongside its `input_schema`) to guarantee valid tool arguments. A still-useful, provider-portable fallback that predates native support: define a single tool, force it with `tool_choice:{type:"tool", name}`, and read the tool’s `input` as your structured data. This is handy on models/providers (e.g. some via OpenRouter) that lack native structured outputs.
Interview angle. “`any` vs `auto`?” → Guarantee a tool fires (data extraction) vs let the model decide whether to (conversational). “Structured output for document extraction with Claude?” → Use native structured outputs (JSON-schema `output_config` or `messages.parse`); the single-tool + forced-`tool_choice` pattern is the portable fallback when native support is unavailable.
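The portable fallback is small enough to sketch as plain request-building code. The helpers below only construct and read dicts in the Anthropic dialect; the names (`forced_extraction_request`, `read_extraction`), the tool name `record`, and the model id placeholder are assumptions for this example, and nothing is sent over the network.

```python
def forced_extraction_request(model, schema, document, tool_name="record"):
    """Build Anthropic-dialect request kwargs that force a single tool."""
    return {
        "model": model,
        "max_tokens": 1024,
        "tools": [{
            "name": tool_name,
            "description": "Record the fields extracted from the document.",
            "input_schema": schema,  # the shape you want back
        }],
        "tool_choice": {"type": "tool", "name": tool_name},  # force this tool
        "messages": [{"role": "user", "content": document}],
    }


def openai_forced_choice(tool_name):
    """The OpenAI-dialect equivalent of forcing one named tool."""
    return {"type": "function", "function": {"name": tool_name}}


def read_extraction(response_content, tool_name="record"):
    """Return the forced tool's input dict: that IS the structured output."""
    for block in response_content:
        if block["type"] == "tool_use" and block["name"] == tool_name:
            return block["input"]
    raise ValueError(f"no tool_use block for '{tool_name}' in response")


invoice_schema = {
    "type": "object",
    "properties": {"vendor": {"type": "string"}, "total": {"type": "number"}},
    "required": ["vendor", "total"],
}
request = forced_extraction_request("your-model-id", invoice_schema, "Invoice from Acme, total 42.50")
print(request["tool_choice"])

# A response shaped like the Anthropic example earlier in this module:
fake_content = [
    {"type": "tool_use", "id": "toolu_01B", "name": "record",
     "input": {"vendor": "Acme", "total": 42.5}},
]
print(read_extraction(fake_content))
```

You never execute the `record` tool; the call's `input` is the payload. Validate it against your schema yourself unless strict tool use is available.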
6. Common pitfalls / gotchas
- Feeding the result without the assistant turn. The #1 native-tool bug: you append the `tool_result`/`tool` message but forgot to first append the assistant message that made the call → 400 / “unknown tool_use_id.”
- Forgetting `json.loads` on OpenAI `arguments`. OpenAI gives you a JSON string; Anthropic gives you a dict. Treating the string as a dict (or vice versa) silently breaks dispatch. (`from_openai_response` handles this for you, and defensively returns `{}` on malformed JSON instead of crashing.)
- Orphaned tool-call ids. Parallel calls without a result for every id. Always loop over all `resp.tool_calls`.
- No `max_steps` guard. Infinite tool loops and runaway spend. Real incidents trace back to a missing bound. Always cap, and decide what to return when you hit it (partial state, an “I couldn’t finish” message, or raise).
- Letting a tool exception escape. One bad tool argument should not kill the agent. Catch, flag `is_error`, feed it back.
- Tool-name hallucination. The model invents a tool that does not exist. Return the available names so it can recover; don’t silently no-op.
- Anthropic ordering rule. `tool_result` blocks must lead the user turn, before any text block.
- Prompt injection via tool output. A `tool_result` is untrusted input. If a tool returns attacker-controlled text (“ignore your instructions and…”), that text now sits in the model’s context. Treat tool results as untrusted; this is the focus of the security module.
7. Try it
    # Offline, deterministic, no API key: four scripted scenarios with a live trace
    python 02_native_tool_calling/demo.py

    # Real model via OpenRouter (set OPENROUTER_API_KEY first):
    python 02_native_tool_calling/demo.py --live

    # Tests for the worked code (must pass offline):
    pytest 02_native_tool_calling/ -q

    # Your turn: implement exercises.py, then check (expected red until done):
    pytest 02_native_tool_calling/practice_test.py -q

Key takeaways
- Native tool calling replaces a parser with a contract. The provider returns a typed call shaped to your schema (`id`, `name`, `input`), deleting the malformed-JSON / missing-prefix / tool-name-typo bug class that plagues text ReAct. Text ReAct still matters for no-tool-API models, transparency, and teaching.
- Two dialects, one loop. Anthropic uses `tool_use`/`tool_result` content blocks (dict `input`, no `tool` role, results must lead); OpenAI uses `tool_calls` + a `tool` role (JSON-string `arguments`). Normalize to one neutral shape behind a translation layer and the loop never branches on provider; that is what `agentkit` does for OpenRouter.
- The loop is small but the invariants are strict: append the assistant turn before results, return a result for every call id, terminate on “no tool calls,” and always guard with `max_steps`.
- Tool errors are data. Catch unknown-tool and tool-body failures, flag `is_error`, and feed actionable text back so the model can recover instead of crashing the run.