
“Giving the Brain Hands to Act: The Interface Between Intelligence and Infrastructure.”

TL;DR

Tool calling (also called function calling) is the parse-execute loop that transforms LLMs from text generators into autonomous controllers. The model generates structured JSON representing a function call intent; the runtime validates arguments with Pydantic, executes in a sandbox, and injects results back into the conversation. Define tools using JSON Schema with verbose descriptions, prefer atomic single-responsibility tools over mega-tools, and use Human-in-the-Loop (HITL) gates for dangerous operations. For large tool inventories, use Tool RAG to dynamically retrieve only relevant tool schemas. The core principle: never trust the model at the boundary – validate inputs, sanitize outputs, isolate execution, and build observability.

[Image: CNC milling machine carving a precise channel into brushed aluminum, metal chips curling away]

1. What Is Tool Calling and Why Does It Matter?

An LLM isolated in a box is just a text generator. It can hallucinate a plausible description of the weather, but it cannot check the actual temperature. To become an Agent, it needs to interact with the outside world. This is achieved through Tool Calling (often called Function Calling).

Tool calling is the bridge between the Probabilistic World of AI (where 2+2 might equal 5 if the context is weird) and the Deterministic World of Software (where 2+2 is always 4). It is the mechanism that transforms an LLM from a “Chatbot” into a “Controller.”

In this post, we will dissect the anatomy of a tool call, explore standard patterns like JSON Schema, and discuss the critical architecture of a robust Runtime Execution Environment.


2. How Does the Tool Calling Mechanism Work?

How does a model “call” a function? It doesn’t. A neural network cannot execute code. It can only emit tokens. Tool Calling is a parse-execute loop.

2.1 What Is the Lifecycle of a Tool Call?

Let’s trace a user request: “What is the weather in Tokyo?” through the system. (A code sketch of the full loop follows the list.)

  1. Tool Definition: You provide the model with a “menu” of functions in the System Prompt (usually in strict JSON Schema format).
    • Prompt: “You have a tool get_weather(city: str). Use it if needed.”
  2. Reasoning: The model analyzes the user request against the menu. It performs Semantic Matching.
    • Thinking: “User asks for weather. ‘Tokyo’ is a city. This matches get_weather.”
  3. Generation: The model generates a special token or formatted string (e.g., a JSON object) representing the intent to call the function: {"tool": "get_weather", "arguments": {"city": "Tokyo"}}
  4. Pause (Stop Sequence): The inference engine recognizes that the model has output a “Tool Call Block” and stops generating. It freezes the model state and returns control to the Python script (the Orchestrator).
  5. Execution (The Runtime):
    • Your code parses the JSON.
    • Validation: Is “Tokyo” a string? Is it valid?
    • Execution: Your code calls the actual Python function requests.get(...).
    • Result: It captures the return value: {"temp": 18, "condition": "Cloudy"}.
  6. Context Injection (The Feedback): You do not show the result to the user yet. You append a new message to the chat history:
    • Role: Tool, Content: {"temp": 18, "condition": "Cloudy"}.
  7. Final Response: You invoke the model again. It sees its own previous request + the tool output. It now performs Synthesis.
    • Output: “It’s 18°C and cloudy in Tokyo.”
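Before moving on, here is a minimal sketch of that whole loop in Python. Everything here is illustrative: llm_chat stands in for whichever model client you use, and its reply format (a dict with type and content) is an assumption, not a real API.

import json

def get_weather(city: str) -> dict:
    # Stub tool; a real implementation would call a weather API.
    return {"temp": 18, "condition": "Cloudy"}

TOOLS = {"get_weather": get_weather}  # the "menu" from step 1

def run_agent(llm_chat, user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = llm_chat(messages)            # steps 2-4: reason, generate, pause
        if reply["type"] != "tool_call":
            return reply["content"]           # step 7: final synthesized answer
        call = json.loads(reply["content"])   # step 5: parse and execute
        result = TOOLS[call["tool"]](**call["arguments"])
        messages.append({"role": "assistant", "content": reply["content"]})
        messages.append({"role": "tool", "content": json.dumps(result)})  # step 6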

Two details here matter a lot when you move from toy demos to production:

  • The tool result becomes the source of truth. The model should treat tool output as authoritative even if it conflicts with its prior. If the tool says “18°C,” the model shouldn’t “correct” it to “25°C” because it feels more plausible.
  • Tool output should be shaped for the model. If your tools return huge payloads (full HTML pages, long JSON blobs), you should summarize or extract just the fields you need before injecting it into the LLM context. Otherwise, you waste tokens and increase error rates by flooding the context with irrelevant text.

This is why many stacks include a small “tool post-processor” layer that converts raw responses into a compact schema: status, data, and a short message for the agent to reason about.
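For illustration, a post-processor for the weather tool might look like this sketch (the status/data/message shape matches the compact schema just described; the raw payload fields are hypothetical):

def shape_weather_response(raw: dict) -> dict:
    # Keep only what the agent needs; drop the rest of the payload.
    try:
        return {
            "status": "ok",
            "data": {"temp_c": raw["main"]["temp"],
                     "condition": raw["weather"][0]["main"]},
            "message": "Current conditions retrieved.",
        }
    except (KeyError, IndexError):
        return {"status": "error", "data": None,
                "message": "Upstream payload had an unexpected shape."}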

2.2 Why Is JSON Schema the Standard Tool Protocol?

Standardization is key. The industry has converged on JSON Schema (the same format OpenAPI builds on) to define tools. Models like GPT-4 are fine-tuned to read this specific format.

  • Crucial Tip: The description field in the schema is not just documentation; it is part of the Prompt. Be verbose.
  • Bad: "description": "Get weather"
  • Good: "description": "Get current weather for a specific city. Do not use for general climate questions. Returns temp in Celsius."
  • Impact: The model reads this description to decide when to call the tool.

In real systems, the schema is also your contract boundary. If the schema is vague, the model will improvise. If the schema is strict, the model tends to stay in-bounds.

  • Prefer enums and bounded strings where possible: e.g. unit: ["celsius", "fahrenheit"], currency: ["USD", "EUR"].
  • Put examples in descriptions when the semantics are tricky: “Use start_date in ISO-8601 (2026-01-05).”
  • Encode safety constraints into the interface: instead of a free-form sql: str, expose higher-level primitives like get_user_by_id(id) or list_orders(status, limit) to prevent the model from generating destructive queries.
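Putting these tips together, a get_weather definition might look like the following sketch, written as a Python dict in the OpenAI-style function format (adapt the envelope to your provider):

GET_WEATHER_SCHEMA = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": (
            "Get current weather for a specific city. "
            "Do not use for general climate questions. Returns temp in Celsius."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string",
                         "description": "City name, e.g. 'Tokyo'."},
                "unit": {"type": "string",
                         "enum": ["celsius", "fahrenheit"]},  # bounded, not free-form
            },
            "required": ["city"],
        },
    },
}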

One subtle but important point: schemas should be written for the model, not the engineer. The model benefits from explicit redundancy:

  • say what the tool does,
  • say when to use it,
  • say when not to use it,
  • say what a successful response looks like,
  • say what to do on common failures.

Finally, prefer stable, typed tool outputs. If a tool returns a free-form blob, the model has to re-interpret it every time, and you’ll see inconsistent behavior. If the tool returns a small structured object with stable keys (and ideally stable units), you get predictable downstream reasoning:

  • temp_c: 18 is better than "18C" or "18°C" (string parsing bugs).
  • status: "not_found" is better than "User does not exist" (brittle NLP parsing).

It sounds mundane, but the “shape” of tool outputs is one of the strongest levers you have for tool-calling reliability.


3. What Design Patterns Improve Tool Calling Reliability?

LLMs are clumsy. They make mistakes. They hallucinate arguments (e.g., inventing a parameter force=True that your function doesn’t accept). Your runtime must be defensive.

3.1 How Does Pydantic Validation Protect Against Hallucinated Arguments?

You should never pass raw LLM output to a function.

  • Pattern: Wrap every tool in a Pydantic Model.
  • Logic: Before executing, pass the LLM’s argument dict into the Pydantic model.
  • Self-Correction: If validation fails (e.g., “Field ‘city’ is missing”), catch the error and return it to the LLM.
  • System Response to Agent: Error: Invalid argument. Missing 'city'.
  • Agent: “Ah, sorry.” (Tries again with city). This Feedback Loop allows the agent to fix its own typos without crashing the program.

You also want validation to cover more than “types.” For agent tools, the most common production bugs are semantic:

  • Missing required fields (the model assumed a default).
  • Invalid combinations (e.g., refund_amount present but charge_id missing).
  • Range violations (negative quantities, absurd time ranges).
  • Cross-field constraints (“end_time must be after start_time”).

This is where Pydantic shines: you can enforce invariants early and return a short, actionable error string the model can correct on the next attempt. That transforms failures from “runtime exception” into “self-healing loop.”
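As a sketch of what that looks like with Pydantic v2 (the meeting-booking tool and its fields are hypothetical):

from datetime import datetime
from pydantic import BaseModel, Field, ValidationError, model_validator

class BookMeetingArgs(BaseModel):
    attendee_email: str
    start_time: datetime
    end_time: datetime
    room_capacity: int = Field(gt=0, le=500)  # range guard

    @model_validator(mode="after")
    def check_time_order(self):
        # Cross-field invariant: end must come after start.
        if self.end_time <= self.start_time:
            raise ValueError("end_time must be after start_time")
        return self

def validate_args(raw: dict) -> str | None:
    """Return a short, actionable error string for the model, or None if valid."""
    try:
        BookMeetingArgs(**raw)
        return None
    except ValidationError as e:
        return "; ".join(f"{err['loc']}: {err['msg']}" for err in e.errors())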

One more reliability trick is to implement argument repair as a first-class step:

  • If a string field often comes in with extra whitespace or casing issues, normalize it (strip(), lower()).
  • If the model frequently confuses synonyms (“SF” vs “SFO”), add a canonicalization map or a resolver tool (resolve_airport_code("SF") -> "SFO").
  • If arguments are ambiguous (multiple users match “John”), return a structured clarification request and force the agent to ask a follow-up question.

This reduces the number of “agent failures” that are really just “missing UX in the tool layer.”
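A resolver tool of that kind can be as simple as this sketch (the alias map is illustrative):

AIRPORT_ALIASES = {"sf": "SFO", "sfo": "SFO", "nyc": "JFK", "la": "LAX"}

def resolve_airport_code(raw: str) -> dict:
    # Repair step: normalize first, then canonicalize; if still unknown,
    # return a structured clarification request instead of guessing.
    code = AIRPORT_ALIASES.get(raw.strip().lower())
    if code is None:
        return {"status": "needs_clarification",
                "message": f"Unknown airport '{raw}'. Please give an IATA code."}
    return {"status": "ok", "code": code}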

3.2 How Should You Handle Tool Errors and Error Propagation?

What happens if your API returns a 500-line stack trace?

  • Problem: It runs up your token bill and fills the context window with noise, and the agent may latch onto the stack trace and hallucinate a weird answer.
  • Fix: Catch exceptions in the tool wrapper. Return a concise, safe string.
  • Bad: Traceback (most recent call last): File "main.py"...
  • Good: Error: The database connection timed out. Please try again later.
  • Why: This allows the Agent to reason about the failure (“Okay, I’ll apologize to the user or try a different database”) rather than getting confused.

Another key practice is to return machine-meaningful errors, not just English sentences. If your tool wrapper can return:

  • error_type: "timeout",
  • retryable: true,
  • suggested_next_action: "retry_with_backoff",

the agent can make consistent decisions across tools. Without this, the model will treat each error message as a fresh piece of prose and behave inconsistently.

You can push this further by making tools explicit about partial success. Many real operations are not all-or-nothing:

  • “Created ticket, but failed to attach logs.”
  • “Provisioned VM, but failed to update DNS.”
  • “Fetched 8/10 pages; rate-limited on the rest.”

If the tool returns only “error,” the agent may retry the whole operation and create duplicates. If the tool returns a structured status with what succeeded and what failed, the agent can resume from the right step. This is the same idea as saga workflows in distributed systems: make progress visible and resumable.
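One way to encode this is a result object with explicit progress fields; a sketch (the field names are illustrative, not a standard):

from dataclasses import dataclass, field

@dataclass
class ToolResult:
    status: str                        # "ok" | "partial" | "error"
    error_type: str | None = None      # e.g. "timeout", "rate_limited"
    retryable: bool = False
    completed_steps: list[str] = field(default_factory=list)
    failed_steps: list[str] = field(default_factory=list)
    suggested_next_action: str | None = None

# A provisioning tool reporting partial success, so the agent can resume:
result = ToolResult(
    status="partial",
    error_type="rate_limited",
    retryable=True,
    completed_steps=["provision_vm"],
    failed_steps=["update_dns"],
    suggested_next_action="retry_failed_steps_with_backoff",
)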

3.3 Why Are Atomic Tools Better Than Mega Tools?

  • Mega Tool: manage_user(action, id, data) - Hard for the LLM to understand all the permutations. It has to guess the schema for data based on action. It often fails.
  • Atomic Tools: create_user, delete_user, update_email.
  • Rule: Smaller, specific tools reduce hallucination rates. Adhere to the Single Responsibility Principle.

There’s a second advantage to atomic tools: auditability. When an agent makes a mistake, you want logs to read like:

  • create_user(email=..., plan=...) then send_welcome_email(user_id=...)

not:

  • manage_user(action="some_action", data={...})

Clear, narrow tool calls make it easier to build approval flows, anomaly detection (“why is the agent deleting users?”), and metrics per capability.


4. How Do You Scale When You Have Too Many Tools?

GPT-4 has a context limit. If you have 5,000 internal APIs (like AWS or Stripe), you cannot paste all 5,000 JSON schemas into the prompt. It would cost $5 per query and confuse the model.

4.1 What Is Tool RAG and How Does It Work?

Treat tool definitions like documents.

  1. Embed the descriptions of all 5,000 tools into a Vector Database.
  2. When a user query comes in (“How do I refund a charge?”), embed the query.
  3. Search the Vector DB for the top 5 most relevant tools (stripe_refund, stripe_get_charge, etc.).
  4. Inject only those 5 definitions into the context prompt.

This allows agents to have effectively unlimited toolkits; a minimal retrieval sketch follows below.
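Here is the retrieval step as a sketch, assuming tool schemas in the OpenAI-style format shown earlier. The toy embed function is a placeholder; in production you would use a real embedding model and a vector database.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: hash character trigrams into a fixed vector.
    # Swap in a real embedding model for anything beyond a demo.
    v = np.zeros(256)
    for i in range(len(text) - 2):
        v[hash(text[i:i + 3]) % 256] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def build_index(tool_schemas: list[dict]) -> np.ndarray:
    return np.stack([embed(t["function"]["description"]) for t in tool_schemas])

def retrieve_tools(query: str, vectors: np.ndarray,
                   tool_schemas: list[dict], k: int = 5) -> list[dict]:
    scores = vectors @ embed(query)             # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]
    return [tool_schemas[int(i)] for i in top]  # inject only these into the prompt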

Tool RAG works best when the retrieval corpus is curated:

  • keep descriptions crisp and disambiguated (“refund a charge” vs “void an authorization”),
  • include synonyms (“refund”, “chargeback”, “reversal”),
  • and add negative examples (“do not use this tool for…”).

In many orgs, the hard part is not retrieval quality, it’s tool documentation quality. If your internal APIs have weak docs, your agent will inherit that weakness. Improving tool schemas and descriptions often yields bigger wins than prompt tweaks.

There’s also a governance angle: once tools are retrievable, the agent might discover powerful capabilities you didn’t intend to expose. That’s why tool RAG should typically retrieve from a permission-filtered corpus:

  • Determine the user/tenant/role at runtime.
  • Retrieve only tool schemas that role is allowed to call.
  • Enforce the same permission check again at execution time (defense in depth).

If you skip this, you create a weird security bug: “the model found an admin-only tool because it was semantically relevant.”
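A sketch of that defense-in-depth pattern (the acl structure, mapping tool name to a set of allowed roles, is an assumption):

def filter_retrieved(candidates: list[dict], role: str, acl: dict) -> list[dict]:
    # First gate: only surface schemas this role may call.
    return [t for t in candidates
            if role in acl.get(t["function"]["name"], set())]

def execute_checked(tool_name: str, args: dict, role: str,
                    acl: dict, registry: dict):
    # Second, independent gate at execution time: never trust retrieval alone.
    if role not in acl.get(tool_name, set()):
        return "Error: Permission denied."
    return registry[tool_name](**args)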

4.2 How Do You Handle Asynchronous Tool Actions?

Some tools take time (e.g., generate_video, provision_server). You can’t keep the HTTP connection open for 10 minutes waiting for the tool.

  • Pattern:
    1. Agent calls start_job().
    2. Tool returns immediately: {"job_id": "123", "status": "pending"}.
    3. Agent reasons: “I have started the job. I will check back later.”
    4. Agent (or a scheduler) periodically calls check_status(job_id).

Async tool calls introduce two classic distributed-systems problems:

  • At-least-once execution: retries can create duplicate jobs unless the tool supports idempotency keys.
  • Stuck jobs: the agent needs a timeout policy (“if pending > 10 minutes, escalate to human / re-run / choose alternative”).

This is why mature runtimes treat tool calling as workflow orchestration: retries with backoff, step budgets, and explicit terminal states (success/failure/cancelled).
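A sketch of that orchestration, with an idempotency key and an explicit timeout policy (start_job and check_status are hypothetical tool wrappers):

import time
import uuid

def run_long_job(start_job, check_status, params: dict,
                 timeout_s: int = 600, poll_s: int = 15) -> dict:
    idem_key = str(uuid.uuid4())
    # Retries that reuse this key cannot create duplicate jobs.
    job = start_job(params, idempotency_key=idem_key)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = check_status(job["job_id"])
        if status["status"] in ("success", "failure", "cancelled"):
            return status                      # explicit terminal states
        time.sleep(poll_s)
    return {"status": "stuck", "job_id": job["job_id"],
            "suggested_next_action": "escalate_to_human"}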


5. How Do You Secure Tool Calling with Sandboxing?

The most dangerous tool is run_python_code. If an agent can run code, it can call os.system('rm -rf /') or read os.environ['AWS_KEY'].

Security for tool calling is basically the same story as security for microservices: assume compromise, minimize privilege, and enforce boundaries. Models can be tricked (prompt injection), they can make mistakes (hallucinated arguments), and they can be over-trusting (treating untrusted text as instructions).

So beyond sandboxing execution, you usually need:

  • Secret hygiene: tools should never return secrets into the model context. Redact at the tool boundary.
  • Output filtering: don’t let the model echo back raw tool output if it may contain PII or internal identifiers. Prefer “summarize and cite fields” patterns.
  • Network policy: many agent tasks do not need arbitrary internet access. Restrict outbound egress to allowlisted domains where possible.
  • Filesystem policy: allowlist specific directories; mount everything else read-only.

5.1 What Are the Best Sandboxing Strategies?

NEVER run agent code on your host.

  1. Transient Containers: Spin up a Docker container for the session. Destroy it after.
  2. WebAssembly (Wasm): Run code in a browser-like sandbox (e.g., Pyodide).
  3. Cloud Sandboxes: Use services like E2B or Modal that provide secure, isolated VMs specifically for AI agents.

Sandboxing isn’t just about malicious intent; it’s about blast radius. Agents will do unsafe things accidentally:

  • delete the wrong file path due to a string formatting mistake,
  • exfiltrate secrets by printing environment variables into logs,
  • create infinite loops that burn CPU and rack up cloud costs.

So you want hard limits at the runtime boundary: CPU/memory quotas, network egress policies, filesystem allowlists, and timeouts.

5.2 How Do Permission Scopes and Human-in-the-Loop Work?

  • Read-Only: Give agents “Viewer” access by default.
  • Human-in-the-Loop (HITL): Tag dangerous tools (delete_db, send_money) as requires_approval=True.
  • When the agent calls it, the runtime pauses.
  • The user gets a pop-up: “Agent wants to delete DB. Allow?”
  • Only on “Yes” does the code execute.

A useful framing is that tool calls should be capability-based:

  • “Read” capabilities (safe by default).
  • “Write” capabilities scoped to a resource (write to this ticket, post to this Slack channel).
  • “Irreversible” capabilities (delete, send money, rotate keys) always gated and ideally require an explicit human confirmation step that includes a clear diff of what will happen.

6. How Do You Build a Semantic Tool Router?

Instead of showing the parsing code (which varies by library), let’s look at the Router Logic structure.

from pydantic import ValidationError


class ToolExecutor:
    """
    The 'Hands' of the agent.
    Responsible for safe execution and validation.
    """
    def execute(self, tool_name: str, raw_arguments: dict) -> str:

        # 1. Lookup
        tool = self.registry.get(tool_name)
        if not tool:
            return "Error: Tool not found."

        # 2. Validation (Pydantic)
        try:
            validated_args = tool.schema(**raw_arguments)
        except ValidationError as e:
            # CRITICAL: Return the validation error to the LLM,
            # giving it a chance to fix its typo.
            return f"Error: Invalid Arguments. {e}"

        # 3. Security Check (HITL)
        if tool.requires_approval:
            permission = self.human_interface.request(tool_name, validated_args)
            if not permission:
                return "Error: User denied permission."

        # 4. Execution (Sandboxed)
        try:
            result = tool.run(validated_args)
            return str(result)
        except Exception:
            # 5. Error Sanitization: never leak a stack trace into the context.
            return "Error: Internal tool failure. Please try again."


# Usage in the Agent Loop:
# if output.is_tool_call:
#     result = executor.execute(output.name, output.args)
#     memory.add("ToolResult", result)

Even in this simplified pseudo-code, two extra production considerations matter:

  • Idempotency: if the agent repeats a tool call, do you re-run it or deduplicate? Many actions should be idempotent by default (create resources with deterministic keys; avoid sending duplicate messages).
  • Tracing: you want a run ID that flows through the agent loop and every tool call. That enables “one click” debugging: what tools were called, with what args, what returned, and how the agent decided its next step.
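Tracing can be as light as a run ID carried in a context variable; a minimal sketch wrapping the executor from above:

import logging
import uuid
from contextvars import ContextVar

run_id: ContextVar[str] = ContextVar("run_id", default="no-run")
log = logging.getLogger("agent.tools")

def start_run() -> str:
    # Call once at the start of each agent run.
    rid = str(uuid.uuid4())
    run_id.set(rid)
    return rid

def traced_execute(executor, tool_name: str, args: dict) -> str:
    rid = run_id.get()
    log.info("run=%s tool=%s args=%s", rid, tool_name, args)
    result = executor.execute(tool_name, args)
    log.info("run=%s tool=%s result=%s", rid, tool_name, result[:200])
    return result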

FAQ

Q: How does tool calling work in AI agents? A: Tool calling is a parse-execute loop. The LLM generates a structured JSON object representing a function call intent. The runtime parses this JSON, validates arguments (via Pydantic), executes the actual function in a sandbox, and injects the result back into the conversation for the model to synthesize a final response.

Q: Why is Pydantic validation important for AI agent tool calls? A: LLMs hallucinate arguments, invent nonexistent parameters, and miss required fields. Pydantic validation catches these errors before execution and returns actionable error messages to the LLM, creating a self-correction feedback loop where the agent can fix its own mistakes without crashing.

Q: What is Tool RAG and how does it solve the too many tools problem? A: Tool RAG embeds descriptions of all available tools into a vector database. When a user query arrives, the system retrieves only the top 5 most relevant tool schemas and injects them into the prompt, allowing agents to work with thousands of tools without exceeding context limits.

Q: How should you sandbox AI agent code execution? A: Never run agent code on your host machine. Use transient Docker containers, WebAssembly sandboxes like Pyodide, or cloud sandbox services like E2B or Modal. Apply CPU/memory quotas, network egress policies, filesystem allowlists, and hard timeouts to limit blast radius.

Q: What is the difference between atomic and mega tools for AI agents? A: Mega tools like manage_user(action, id, data) force the LLM to guess complex parameter schemas per action, increasing hallucination. Atomic tools like create_user, delete_user, update_email follow single responsibility, reduce errors, and produce clearer audit logs.


7. Key Takeaways

Tool Calling is what makes AI useful.

  • Standardized Interfaces (JSON Schema) allow models to understand the world.
  • Defensive Coding (Validation Loops) allow models to correct their own mistakes.
  • Strict Security (Sandboxing) ensures the agent doesn’t burn down the house.

By mastering these fundamentals, you can build agents that don’t just talk, but do, transforming business workflows from manual drudgery to autonomous execution.

The core engineering theme is simple: never trust the model at the boundary. Treat tool calling like API consumption in hostile environments. Validate inputs, sanitize outputs, isolate execution, and build observability so you can diagnose failures quickly. When you do, tool calling stops being a gimmick and becomes a reliable control plane for automation.

A good final sanity check is to ask: if this agent were replaced by an intern, what guardrails would you put in place? You’d require approvals for risky operations, limit access to production, enforce checklists, and make them write down what they did. Tool-calling runtimes should do the same with programmatic enforcement: approvals, least privilege, idempotency, and traceability. That’s how you safely give “the brain” hands.

Once you have that foundation, you can start optimizing: cache stable tool results, batch calls where possible, and use smaller models for simple routing decisions. But those optimizations only pay off after correctness and safety are nailed down, because the cost of a bad tool call is not just tokens, it’s user trust and real-world damage.

If you take nothing else away: the agent loop is “LLM + tools,” but the system is “LLM + tools + policies.” The policies live in schemas, validators, permission checks, and runtime boundaries. Build those well, and swapping models becomes an implementation detail rather than a rewrite.

And if you’re designing tools for agents, design them for composability. Tools should be small enough that the agent can chain them, but not so small that it has to make ten calls just to do one business action. Finding that “right granularity” is a product decision as much as an engineering one, and it heavily influences agent success rates.


Originally published at: arunbaby.com/ai-agents/0004-tool-calling-fundamentals

