Native Tool Calling & the Agentic Loop

Module 01 built a ReAct (Reasoning and Acting) agent that parses the model’s text every turn (Thought: / Action: / Observation:). It works, but it is fragile: one stray token and your regex misfires. This module moves to how production agents actually run — the model returns structured tool calls, the framework dispatches them, and structured results go back. Same loop, far fewer ways to break.

Learning objectives

By the end you can:

Explain the difference between hand-rolled text ReAct and native / structured tool calling, and argue the reliability tradeoff in an interview.
Write the exact multi-turn message shapes for both dialects — Anthropic tool_use / tool_result content blocks and OpenAI tool_calls / tool role — and explain how a library normalizes them.
Implement a clean, provider-agnostic agentic loop that runs offline on MockLLM and live on OpenRouter/Anthropic without branching on the provider.
Handle the real failure modes: unknown tool, a tool that raises, the parallel-call contract, and the max_steps guard.

1. Why move off text-parsing ReAct?

Hand-rolled ReAct asks the model to format its intent as text and then reverse-engineers that text:

Thought: I should look up the weather.
Action: get_weather
Action Input: {"city": "Paris"}

Your framework regexes out Action: and Action Input:, json.loads the args, runs the tool, and appends Observation: 18C. Every one of those steps is a place to fail:

TEXT ReAct (you own the parser)            NATIVE tool calling (the API owns it)
-----------------------------------        ------------------------------------
model emits free text                      model emits a typed tool_call object
   |  regex Action/Action Input               |  already structured: id+name+input
   |  json.loads(args)  <-- can crash         |  args shaped to your schema
   |  match tool name by string               |  name is a first-class field
   v                                          v
run tool, append "Observation:"            run tool, append a tool_result block

Native tool calling moves the contract into the API (Application Programming Interface). You send the model a list of tool schemas; the provider returns a structured object with a tool id, the tool name, and an input shaped to your JSON (JavaScript Object Notation) Schema. Where the provider uses grammar-constrained decoding (for example, a strict mode), the arguments are schema-valid by construction; otherwise they are usually valid but still worth validating. No prefix to forget, no JSON to hand-extract, no “the model wrote Action : with a space.”
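To make the fragility concrete, here is a minimal sketch of the kind of parser a text-ReAct framework ends up owning. The name `parse_action` and the regex are illustrative assumptions, not module 01's actual code.

```python
import json
import re

# Hypothetical parser for the Thought / Action / Action Input format shown above.
ACTION_RE = re.compile(r"^Action: (\w+)\nAction Input: (.+)$", re.MULTILINE)


def parse_action(text: str):
    """Return (tool_name, args), or None when the text does not match."""
    match = ACTION_RE.search(text)
    if match is None:
        return None  # a missing or misspelled prefix lands here
    name, raw_args = match.groups()
    return name, json.loads(raw_args)  # raises on malformed JSON


good = 'Thought: look it up.\nAction: get_weather\nAction Input: {"city": "Paris"}'
stray_space = good.replace("Action:", "Action :")  # one extra space before the colon
bad_json = good.replace('{"city": "Paris"}', "{city: Paris}")

print(parse_action(good))         # ('get_weather', {'city': 'Paris'})
print(parse_action(stray_space))  # None
```

Calling `parse_action(bad_json)` raises `json.JSONDecodeError`. Loosening the regex handles one variant at a time, which is the maintenance burden native tool calling removes.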

When does text ReAct still matter?

Native calling is the default, but text ReAct is not dead:

Models / endpoints without a tool-calling API. Older or local models, or a raw completion endpoint, force you back to text parsing.
Full transparency / portability. The scratchpad is just text — trivially loggable, diffable, and provider-independent. Some eval and research setups want the reasoning trace inline.
Teaching and debugging. Seeing Thought:/Observation: interleaved makes the loop legible. (That is exactly why Module 01 starts there.)

Interview angle. “Why prefer native tool calling over ReAct text parsing?” → Reliability and separation of concerns. The provider returns a typed call (id, name, input) shaped to your schema, and with strict tool use can guarantee the arguments are schema-valid, so you delete a whole class of parser bugs (malformed JSON, missing prefixes, tool-name typos) and get parallel calls and tool_choice for free. The cost: you depend on a provider feature and a specific wire format. Mitigate that by normalizing to one neutral shape behind a thin translation layer; then your loop never changes when you swap models.

2. The two dialects (know both cold)

There are two wire formats in the wild. Anthropic uses content blocks; OpenAI uses a tool role and a separate tool_calls field. OpenRouter speaks the OpenAI dialect, and agentkit translates our neutral (Anthropic-style) shape to it. Memorize both — interviewers ask you to “walk the JSON.”

2a. Anthropic dialect (content blocks, no `tool` role)

Request — tools are {name, description, input_schema}:

{
  "model": "claude-opus-4-8",
  "tools": [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
      "type": "object",
      "properties": {"city": {"type": "string"}},
      "required": ["city"]
    }
  }],
  "messages": [{"role": "user", "content": "Weather in Paris?"}]
}

Response — stop_reason: "tool_use", and a tool_use content block:

{
  "stop_reason": "tool_use",
  "content": [
    {"type": "text", "text": "Let me check."},
    {"type": "tool_use", "id": "toolu_01A", "name": "get_weather",
     "input": {"city": "Paris"}}        // input is a real dict, already parsed
  ]
}

You continue the conversation by appending (a) the full assistant message as-is, then (b) a user turn whose content starts with a tool_result block:

{"role": "assistant", "content": [ ...the tool_use block above... ]},
{"role": "user", "content": [
  {"type": "tool_result", "tool_use_id": "toolu_01A", "content": "18C and sunny"}
]}

Gotchas specific to Anthropic:

The tool_result block(s) must come first in that user turn — before any text block — or the API 400s.
The tool_use_id must echo the id from the call. There is no tool role.
On failure, set "is_error": true in the tool_result.
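The continuation rules above can be sketched as one small helper. `anthropic_followup` is a hypothetical name for illustration, not part of agentkit or the Anthropic SDK; it only builds the message dicts and makes no network call.

```python
def anthropic_followup(assistant_content, results, errors=frozenset()):
    """Build the two messages appended after a tool_use response.

    assistant_content: the response's content blocks, passed through unchanged.
    results: maps each tool_use id to that tool's output text.
    errors: ids whose tool failed (these get "is_error": true).
    """
    result_blocks = []
    for block in assistant_content:
        if block["type"] != "tool_use":
            continue
        tool_result = {
            "type": "tool_result",
            "tool_use_id": block["id"],  # must echo the id from the call
            "content": results[block["id"]],
        }
        if block["id"] in errors:
            tool_result["is_error"] = True
        result_blocks.append(tool_result)
    return [
        {"role": "assistant", "content": assistant_content},
        # Only tool_result blocks here, so they trivially lead the user turn.
        {"role": "user", "content": result_blocks},
    ]


content = [
    {"type": "text", "text": "Let me check."},
    {"type": "tool_use", "id": "toolu_01A", "name": "get_weather",
     "input": {"city": "Paris"}},
]
followup = anthropic_followup(content, {"toolu_01A": "18C and sunny"})
```

If you also want to add user text to that turn, append the text block after the `tool_result` blocks, never before.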

2b. OpenAI dialect (a `tool` role + `tool_calls`)

Request — tools are wrapped in {type:"function", function:{...}}, and the schema key is parameters (not input_schema):

{
  "model": "anthropic/claude-haiku-4.5",   // an OpenRouter slug
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city.",
      "parameters": {"type": "object",
                     "properties": {"city": {"type": "string"}},
                     "required": ["city"]}
    }
  }],
  "tool_choice": "auto",
  "messages": [{"role": "user", "content": "Weather in Paris?"}]
}

Response — finish_reason: "tool_calls", and message.tool_calls:

{
  "choices": [{
    "finish_reason": "tool_calls",
    "message": {
      "content": null,
      "tool_calls": [{
        "id": "call_abc",
        "type": "function",
        "function": {"name": "get_weather",
                     "arguments": "{\"city\": \"Paris\"}"}  // a JSON STRING!
      }]
    }
  }]
}

You continue by appending the assistant message (with its tool_calls), then a message with role: "tool":

{"role": "assistant", "content": null, "tool_calls": [ ...as above... ]},
{"role": "tool", "tool_call_id": "call_abc", "content": "18C and sunny"}

Gotchas specific to OpenAI:

function.arguments is a JSON-encoded string — you must json.loads() it. (Anthropic gives you a dict directly.)
The result goes in a distinct tool role message keyed by tool_call_id.
There is no “results must come first” rule; the tool messages just follow the assistant message.
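The same continuation in the OpenAI dialect, as a sketch. `openai_followup` and the caller-supplied `run_tool(name, args)` callable are hypothetical names used only for illustration.

```python
import json


def openai_followup(message, run_tool):
    """Build the messages appended after a tool_calls response.

    message: the assistant message from choices[0].message.
    run_tool: callable (name, args_dict) -> str, supplied by the caller.
    """
    followup = [{
        "role": "assistant",
        "content": message.get("content"),
        "tool_calls": message["tool_calls"],
    }]
    for call in message["tool_calls"]:
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrives as a JSON string
        followup.append({
            "role": "tool",
            "tool_call_id": call["id"],  # keys the result to its call
            "content": run_tool(fn["name"], args),
        })
    return followup


message = {
    "content": None,
    "tool_calls": [{
        "id": "call_abc",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
    }],
}
followup = openai_followup(message, lambda name, args: f"18C and sunny in {args['city']}")
```

Note the shape difference from the Anthropic sketch: each result is its own message rather than a block inside one user turn.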

2c. The neutral shape (what we code against)

The whole point of agentkit is that you never write either of the above by hand in your loop. You code against one neutral, Anthropic-style shape and let the library translate at the edge:

            neutral messages (Anthropic-style blocks)
                          |
        +-----------------+------------------+
        |                                    |
  AnthropicLLM                         OpenRouterLLM
  (sends as-is)                  to_openai_messages() / to_openai_tools()
        |                          from_openai_response()
        v                                    v
   Anthropic API                       OpenAI-compatible API

agentkit.llm exposes those translators as pure functions so they are unit-testable offline (no network): to_openai_messages, to_openai_tools, from_openai_response, from_anthropic_response. Both providers collapse to one LLMResponse:

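from dataclasses import dataclass, field

@dataclass
class LLMResponse:
    text: str = ""
    # A mutable default needs default_factory; a bare [] raises ValueError.
    tool_calls: list[ToolCall] = field(default_factory=list)  # ToolCall(id, name, input: dict)
    stop_reason: str = "end_turn"        # "tool_use" when it wants a tool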

Interview angle. “How would you support multiple model providers in one agent?” → Define a neutral message/response shape, isolate provider JSON to a translation layer, return one normalized response type. The loop, the tools, and the tests then never branch on the provider. Name the two concrete differences you have to bridge: content blocks vs tool role, and dict input vs JSON-string arguments.
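The two differences named above (content blocks vs tool role, dict input vs JSON-string arguments) can be bridged in a few dozen lines. The functions below are a hedged sketch with deliberately different names (`sketch_to_openai_tools`, `sketch_to_openai_messages`); they are not agentkit's actual implementation, which may handle more cases.

```python
import json


def sketch_to_openai_tools(specs):
    """Rewrap neutral {name, description, input_schema} specs for OpenAI."""
    return [
        {"type": "function",
         "function": {"name": s["name"],
                      "description": s["description"],
                      "parameters": s["input_schema"]}}
        for s in specs
    ]


def sketch_to_openai_messages(messages):
    """Translate neutral (Anthropic-style) messages to OpenAI-style messages."""
    out = []
    for msg in messages:
        content = msg["content"]
        if isinstance(content, str):  # plain text turn: identical in both dialects
            out.append({"role": msg["role"], "content": content})
            continue
        text = "".join(b["text"] for b in content if b["type"] == "text")
        if msg["role"] == "assistant":
            calls = [
                {"id": b["id"], "type": "function",
                 "function": {"name": b["name"],
                              "arguments": json.dumps(b["input"])}}  # dict -> JSON string
                for b in content if b["type"] == "tool_use"
            ]
            translated = {"role": "assistant", "content": text or None}
            if calls:
                translated["tool_calls"] = calls
            out.append(translated)
        else:
            # Each tool_result block becomes its own role:"tool" message.
            for b in content:
                if b["type"] == "tool_result":
                    out.append({"role": "tool",
                                "tool_call_id": b["tool_use_id"],
                                "content": b["content"]})
            if text:
                out.append({"role": "user", "content": text})
    return out
```

This sketch drops `is_error`, since the OpenAI tool message has no such field; a fuller translator would keep the error wording in the result text so the model still sees the failure.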

3. The real agentic loop

Here is the loop in agent_loop.py, in pseudocode:

messages = [user turn]
for step in range(max_steps):                 # <-- the guard
    resp = llm.complete(messages, system, tools=registry.specs())
    messages.append(assistant_turn(resp))     # text + tool_use blocks
    if not resp.tool_calls:                    # model gave its final answer
        return resp.text
    results = []
    for call in resp.tool_calls:               # run EVERY call
        result, is_error = dispatch(call)      # catch failures!
        results.append(tool_result_block(call, result, is_error))
    messages.append({"role": "user", "content": results})   # ONE turn, all results
# fell out -> guard fired
return ""   # or raise / return partial state

Four design decisions worth defending:

Append the assistant turn before the results. The tool_result / tool message references the call by id; if the assistant turn that made the call is not in the history, the provider rejects the conversation. This is the single most common native-tool-calling bug.
Terminate on “no tool calls,” not on stop_reason. resp.tool_calls being empty is the robust cross-provider “the model is done” signal. (We keep stop_reason around for logging and truncation handling.)
Return a result for every call id (parallel contract). A model can emit several tool calls in one turn. You must dispatch all of them and feed back a result for each id before calling the model again. An orphaned id (a call with no matching result) breaks the conversation on both providers. In production you run these concurrently to cut latency.
Always bound the loop with max_steps. A model can loop forever: call tool → look at result → call the same tool again. Without a cap that is an infinite loop and unbounded spend. The guard is not optional.
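The four decisions above fit in a self-contained, offline sketch. Everything here is an assumption made for illustration: `ScriptedLLM` is a hypothetical stand-in for a mock model, `run_agent` takes a plain dict of callables instead of a registry, and the dataclasses only mirror the neutral shape described earlier. It is not the code in agent_loop.py.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    id: str
    name: str
    input: dict


@dataclass
class LLMResponse:
    text: str = ""
    tool_calls: list = field(default_factory=list)
    stop_reason: str = "end_turn"


class ScriptedLLM:
    """Hypothetical offline stand-in: replays canned responses in order."""

    def __init__(self, responses):
        self._responses = iter(responses)

    def complete(self, messages, system="", tools=None):
        return next(self._responses)


def run_agent(llm, tools, user_text, system="", max_steps=5):
    """tools maps a tool name to a callable taking keyword arguments."""
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_steps):  # the guard
        resp = llm.complete(messages, system, tools=list(tools))
        blocks = [{"type": "text", "text": resp.text}] if resp.text else []
        blocks += [{"type": "tool_use", "id": c.id, "name": c.name, "input": c.input}
                   for c in resp.tool_calls]
        messages.append({"role": "assistant", "content": blocks})  # before results
        if not resp.tool_calls:  # no calls means the model is done
            return resp.text, messages
        results = []
        for call in resp.tool_calls:  # one result per call id
            try:
                content, is_error = str(tools[call.name](**call.input)), False
            except Exception as exc:  # failures are fed back, not raised
                content, is_error = f"Error: {exc!r}", True
            results.append({"type": "tool_result", "tool_use_id": call.id,
                            "content": content, "is_error": is_error})
        messages.append({"role": "user", "content": results})  # one turn, all results
    return "", messages  # guard fired


llm = ScriptedLLM([
    LLMResponse(tool_calls=[ToolCall("t1", "get_weather", {"city": "Paris"})],
                stop_reason="tool_use"),
    LLMResponse(text="It is 18C and sunny in Paris."),
])
answer, history = run_agent(llm, {"get_weather": lambda city: "18C and sunny"},
                            "Weather in Paris?")
print(answer)  # It is 18C and sunny in Paris.
```

Returning the history alongside the answer is a design choice for this sketch: it lets a test assert on the message order without needing a tracing hook.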

Tool errors are data, not crashes

When a tool fails, do not let the exception escape the loop. Catch it, set is_error=True, and feed an actionable message back — the model can then retry, switch tools, or apologize. Two cases to handle (execute_tool_call):

try:
    return registry.dispatch(call.name, call.input), False
except KeyError:                                  # model hallucinated a tool name
    available = ", ".join(s["name"] for s in registry.specs()) or "(none)"
    return f"Error: unknown tool '{call.name}'. Available tools: {available}.", True
except Exception as exc:                           # the tool body raised
    return f"Error running tool '{call.name}': {exc}", True

Listing the real tool names in the unknown-tool message materially improves the model’s odds of self-correcting. Write error text like a stack trace for a junior engineer: what failed and what to do next. (Claude will typically retry a failing tool 2–3 times before giving up.)
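Here is a standalone sketch of the same idea that can be run without agentkit. `safe_dispatch` and the plain dict of tools are hypothetical stand-ins for `execute_tool_call` and the registry. One deliberate difference: it checks the tool name up front instead of catching KeyError, on the assumption that a KeyError raised inside a tool body could otherwise be misreported as an unknown tool.

```python
def safe_dispatch(tools, name, args):
    """Return (result_text, is_error); never raise into the agent loop."""
    if name not in tools:  # the model hallucinated a tool name
        available = ", ".join(sorted(tools)) or "(none)"
        return f"Error: unknown tool '{name}'. Available tools: {available}.", True
    try:
        return str(tools[name](**args)), False
    except Exception as exc:  # the tool body raised, or the arguments did not fit
        return f"Error running tool '{name}': {exc}", True


tools = {"divide": lambda a, b: a / b}

ok = safe_dispatch(tools, "divide", {"a": 6, "b": 3})
unknown = safe_dispatch(tools, "divde", {"a": 6, "b": 3})
crashed = safe_dispatch(tools, "divide", {"a": 1, "b": 0})

print(ok)       # ('2.0', False)
print(unknown)  # ("Error: unknown tool 'divde'. Available tools: divide.", True)
```

The third call returns an error tuple as well, so a division by zero becomes text the model can read rather than a traceback that ends the run.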

4. From `@tool` to the API `tools` param

You never hand-write JSON Schema. agentkit’s @tool derives it from the function’s type hints + docstring, and ToolRegistry.specs() is exactly what you pass as tools:

@tool
def get_weather(city: str) -> str:
    """Get the current weather for a city. Use for any weather question."""
    ...

registry = ToolRegistry([get_weather])
registry.specs()
# -> [{"name": "get_weather",
#      "description": "Get the current weather for a city. ...",
#      "input_schema": {"type": "object",
#                       "properties": {"city": {"type": "string"}},
#                       "required": ["city"]}}]

That neutral spec is sent as-is to Anthropic, or run through to_openai_tools() (which rewraps it as {type:"function", function:{..., parameters: input_schema}}) for OpenRouter. The description is the single biggest routing lever — if the model keeps picking the wrong tool, fix descriptions first (be specific, say when not to use it), then the schema, then reduce the tool count.
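To show roughly how a decorator can derive that spec, here is a minimal sketch built on the standard library's `inspect` and `typing.get_type_hints`. `spec_from_function` and the `_JSON_TYPES` table are assumptions for illustration; agentkit's real @tool may support more types and parse docstrings differently.

```python
import inspect
from typing import get_type_hints

# Only a handful of scalar types are mapped in this sketch.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def spec_from_function(fn):
    """Build a neutral tool spec from a function's signature and docstring."""
    hints = get_type_hints(fn)
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        # Unannotated or unmapped types fall back to "string" here.
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default means the model must supply it
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "input_schema": {"type": "object",
                         "properties": properties,
                         "required": required},
    }


def get_weather(city: str, units: str = "metric") -> str:
    """Get the current weather for a city. Use for any weather question."""
    return "18C and sunny"


spec = spec_from_function(get_weather)
```

Here `city` ends up in `required` and `units` does not, because only `units` has a default value.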

5. `tool_choice` and structured outputs (quick but interview-relevant)

tool_choice controls whether/which tool fires. Anthropic: auto | any | tool | none. OpenAI: auto | required | {type:"function", function:{name}} | none. Forcing a choice (any/tool on Anthropic, required or a named function on OpenAI) makes the model go straight to a tool call. On Anthropic this works by prefilling the assistant turn, so you get no natural-language preamble: great for extraction, bad for chat.
Structured outputs ≠ function calling. Structured outputs constrain the final answer’s shape (e.g. OpenAI response_format); function calling triggers actions. They are orthogonal. Both major providers now ship native structured outputs: on Claude, constrain the response to a JSON Schema with output_config={"format":{"type":"json_schema","schema":{...}}} (or use the SDK, Software Development Kit, helper client.messages.parse(..., output_format=Model) for a validated, typed object), and use strict tool use ("strict": true on the tool definition, alongside its input_schema) to guarantee valid tool arguments. A still-useful, provider-portable fallback that predates native support: define a single tool, force it with tool_choice:{type:"tool", name}, and read the tool’s input as your structured data. This is handy on models/providers (e.g. some via OpenRouter) that lack native structured outputs.

Interview angle. “any vs auto?” → Guarantee a tool fires (data extraction) vs let the model decide whether to (conversational). “Structured output for document extraction with Claude?” → Use native structured outputs (JSON-schema output_config or messages.parse); the single-tool + forced-tool_choice pattern is the portable fallback when native support is unavailable.
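The tool_choice options map onto each other as follows. `tool_choice_for` and its neutral mode names ("auto", "force_any", "force_one", "none") are hypothetical, chosen only for this sketch; the returned values are the wire shapes each dialect expects.

```python
def tool_choice_for(dialect, mode, name=None):
    """Translate a neutral mode into a provider-specific tool_choice value.

    mode: "auto", "force_any", "force_one", or "none".
    name: required for "force_one" (the tool to force).
    """
    if mode == "force_one" and not name:
        raise ValueError("force_one needs a tool name")
    if dialect == "anthropic":
        table = {
            "auto": {"type": "auto"},
            "force_any": {"type": "any"},
            "force_one": {"type": "tool", "name": name},
            "none": {"type": "none"},
        }
    elif dialect == "openai":
        table = {
            "auto": "auto",
            "force_any": "required",
            "force_one": {"type": "function", "function": {"name": name}},
            "none": "none",
        }
    else:
        raise ValueError(f"unknown dialect: {dialect!r}")
    if mode not in table:
        raise ValueError(f"unknown mode: {mode!r}")
    return table[mode]


# The portable extraction pattern: a single tool, forced by name.
print(tool_choice_for("anthropic", "force_one", "extract_invoice"))
# {'type': 'tool', 'name': 'extract_invoice'}
```

The tool name `extract_invoice` is made up for the example; with the forced choice in place, the call's input is read as the structured result.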

6. Common pitfalls / gotchas

Feeding the result without the assistant turn. The #1 native-tool bug: you append the tool_result/tool message but forget to first append the assistant message that made the call → 400 / “unknown tool_use_id.”
Forgetting json.loads on OpenAI arguments. OpenAI gives you a JSON string; Anthropic gives you a dict. Treating the string as a dict (or vice versa) silently breaks dispatch. (from_openai_response handles this for you — and defensively returns {} on malformed JSON instead of crashing.)
Orphaned tool-call ids. Parallel calls without a result for every id. Always loop over all resp.tool_calls.
No max_steps guard. Infinite tool loops and runaway spend. Real incidents trace back to a missing bound. Always cap, and decide what to return when you hit it (partial state, an “I couldn’t finish” message, or raise).
Letting a tool exception escape. One bad tool argument should not kill the agent. Catch, flag is_error, feed it back.
Tool-name hallucination. The model invents a tool that does not exist. Return the available names so it can recover — don’t silently no-op.
Anthropic ordering rule. tool_result blocks must lead the user turn, before any text block.
Prompt injection via tool output. A tool_result is untrusted input. If a tool returns attacker-controlled text (“ignore your instructions and…”), that text now sits in the model’s context. Treat tool results as untrusted; this is the focus of the security module.
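Two of the pitfalls above, orphaned ids and escaping exceptions, interact with the concurrency mentioned in section 3. One way to run parallel calls concurrently while still producing exactly one result per id is sketched below with the standard library's `ThreadPoolExecutor`. `dispatch_all` and the dict-shaped calls are assumptions for illustration, and threads only help when the tools are I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor


def dispatch_all(calls, tools):
    """Run every call concurrently; return one tool_result block per call id."""

    def run_one(call):
        try:
            content, is_error = str(tools[call["name"]](**call["input"])), False
        except Exception as exc:  # includes a KeyError for an unknown tool name
            content, is_error = f"Error running tool '{call['name']}': {exc}", True
        return {"type": "tool_result", "tool_use_id": call["id"],
                "content": content, "is_error": is_error}

    if not calls:
        return []
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        # map() yields results in input order, so ids line up with the calls.
        return list(pool.map(run_one, calls))


calls = [
    {"id": "t1", "name": "get_weather", "input": {"city": "Paris"}},
    {"id": "t2", "name": "get_weather", "input": {"city": "Oslo"}},
    {"id": "t3", "name": "get_time", "input": {}},  # not registered below
]
results = dispatch_all(calls, {"get_weather": lambda city: f"18C in {city}"})
print([r["tool_use_id"] for r in results])  # ['t1', 't2', 't3']
```

Because `run_one` catches its own exceptions, a failing call cannot abort the batch and leave the other ids without results.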

7. Try it

# Offline, deterministic, no API key — four scripted scenarios with a live trace:
python 02_native_tool_calling/demo.py

# Real model via OpenRouter (set OPENROUTER_API_KEY first):
python 02_native_tool_calling/demo.py --live

# Tests for the worked code (must pass offline):
pytest 02_native_tool_calling/ -q

# Your turn: implement exercises.py, then check (expected red until done):
pytest 02_native_tool_calling/practice_test.py -q

Key takeaways

Native tool calling replaces a parser with a contract. The provider returns a typed, schema-valid call (id, name, input) — deleting the malformed-JSON / missing-prefix / tool-name-typo bug class that plagues text ReAct. Text ReAct still matters for no-tool-API models, transparency, and teaching.
Two dialects, one loop. Anthropic uses tool_use/tool_result content blocks (dict input, no tool role, results must lead); OpenAI uses tool_calls + a tool role (JSON-string arguments). Normalize to one neutral shape behind a translation layer and the loop never branches on provider — that is what agentkit does for OpenRouter.
The loop is small but the invariants are strict: append the assistant turn before results, return a result for every call id, terminate on “no tool calls,” and always guard with max_steps.
Tool errors are data. Catch unknown-tool and tool-body failures, flag is_error, and feed actionable text back so the model can recover instead of crashing the run.