11 minute read

TL;DR: The April 2026 survey “Agentic Tool Use in Large Language Models” (arXiv:2604.00835) names three paradigms: prompting-based (plug-and-play, no weight updates), supervised learning (SFT on labeled tool calls), and reward-driven RL (long-horizon policy optimization). Prompting achieves over 91% accuracy on HuggingFace API calls with oracle retrieval (Gorilla, arXiv:2305.15334) but collapses beyond ~100 tools. Fine-tuned models dominate ToolBench (a fine-tuned 350M-parameter SLM reaches 77.55% versus 26% for the ChatGPT-CoT prompting baseline) but fail on APIs not seen during training. RL-trained agents excel on complex multi-step tasks but reward-hack when the judge is weak. The choice isn’t which is best; it’s which failure mode your team can tolerate.

[Figure: Three circuit boards representing the three paradigms of tool use: prompting, supervised fine-tuning, and reinforcement learning]


Most agent frameworks make this decision for you without saying so. LangChain defaults to prompting. ToolLLM ships fine-tuned weights. OpenAI’s function-calling API assumes the model already knows how to use tools. You inherit a paradigm without choosing one.

That works until it doesn’t. The tool count grows, the production dataset accumulates, the edge cases start looking like a pattern. At that point you need to understand not just what approach you’re using, but why it breaks, and what the alternatives actually cost.

Why the field needed a taxonomy

Tool use research has been fragmented since 2022. Papers on ReAct, Gorilla, ToolLLM, and various RL-based agents addressed overlapping problems with incompatible frameworks. Practitioners comparing approaches were comparing apples to circuit boards.

The April 2026 survey “Agentic Tool Use in Large Language Models” (arXiv:2604.00835) provides the first unified taxonomy. Three paradigms, distinguished by learning mechanism, each with its own failure profile and appropriate deployment context. The paper addresses not just what each approach does but where each one specifically breaks. That clarity has been missing.

Paradigm I: Prompting as plug-and-play

Mechanism: The model’s weights are frozen. Tool definitions ship in the system prompt as JSON Schema. The LLM generates function calls in structured format; an external executor runs them; results return as observations. ReAct is the canonical example: Thought → Action → Observation, repeated until the task resolves.
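
A minimal sketch of that loop helps make the mechanism concrete. Everything below is illustrative rather than taken from the survey: call_model stands in for whatever frozen model you query, and the tool registry, parser, and executor are deliberately toy-sized.

import json
import re

# Tool definitions shipped to the model as JSON Schema (illustrative).
TOOLS = {
    "get_weather": {
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
}

SYSTEM_PROMPT = (
    "You can call tools. Respond with lines of the form:\n"
    "Thought: <reasoning>\n"
    "Action: <tool_name>(<json args>)  or  Action: finish(<answer>)\n"
    "Available tools:\n" + json.dumps(TOOLS, indent=2)
)

def call_model(transcript: str) -> str:
    """Placeholder for a completion call to a frozen model (an assumption, not a real API)."""
    raise NotImplementedError

def execute_tool(name: str, args: dict) -> str:
    """External executor: runs the tool and returns the observation as text."""
    if name == "get_weather":
        return json.dumps({"city": args["city"], "temp_c": 18})  # stubbed result
    return f"error: unknown tool {name}"

def react_loop(task: str, max_steps: int = 8) -> str:
    transcript = SYSTEM_PROMPT + "\nTask: " + task + "\n"
    for _ in range(max_steps):
        completion = call_model(transcript)              # model emits Thought + Action
        transcript += completion + "\n"
        match = re.search(r"Action:\s*(\w+)\((.*)\)", completion, re.S)
        if not match:
            continue                                     # no parseable action; try again
        name, raw_args = match.group(1), match.group(2).strip()
        if name == "finish":
            return raw_args                              # task resolved
        observation = execute_tool(name, json.loads(raw_args or "{}"))
        transcript += "Observation: " + observation + "\n"   # result fed back as observation
    return "max steps exceeded"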

Why it works: Zero training cost, immediate iteration, compatible with any capable frontier model. With oracle tool retrieval, prompting-based systems exceed 88% accuracy on APIBench, a result that surprised many researchers who assumed fine-tuning was necessary for reliable tool use.

Focused ReAct, a 2025 variant that adds reiteration and early-stop logic to prevent loop entrapment, shows meaningful improvement over vanilla ReAct on smaller models. Prompting, done well, is competitive.

Where it breaks: Three failure modes.

The first is context saturation. Beyond roughly 100 tools in the system prompt, attention dilutes. Models begin hallucinating function names that don’t exist and generating invalid parameter combinations. Models claiming 200K context windows become unreliable around 130K tokens in practice. Prompting has a hard ceiling set by context length.

The second is Tool-Induced Myopia (TIM), documented in 2025. As tool count grows, tool access paradoxically suppresses internal reasoning. Arithmetic errors drop when tools are available, but logical errors increase. The model offloads thinking to tools in ways that break non-tool-amenable steps.

The third is “lost in the middle.” Recall for information in the center of a long context drops sharply relative to the edges; Liu et al. (arXiv:2307.03172) measured mid-context accuracy of roughly 54–57% versus ~76% at the start of the context in a 20-document retrieval setting. A system prompt with 80 tool definitions and a long conversation history buries critical context in exactly the wrong place.

Paradigm II: Supervised tool learning

Mechanism: Labeled training data (tool calls, parameters, expected outputs) is used to fine-tune the model’s weights. The model internalizes tool-use behavior rather than reading it from a prompt. ToolLLM (2023) is the canonical example: LLaMA fine-tuned on 16,000+ real APIs from a synthetic dataset. ToolLLaMA achieves 66.7% on ToolBench (DFSDT variant). A 2025 follow-up (arXiv:2512.15943) showed a fine-tuned OPT-350M SLM can reach 77.55% on ToolBench versus ChatGPT-CoT’s 26.00% — demonstrating that focused fine-tuning beats larger prompted models even at 350M parameters.
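
To make “labeled tool calls” concrete, here is a sketch of what a single training record might look like, in Python. The field names are hypothetical, not the actual ToolBench or ToolLLM schema; the point is the shape of the supervision signal.

import json

# One supervised example: an instruction paired with the gold tool call the
# model should learn to emit. Field names are illustrative, not a real schema.
record = {
    "instruction": "Find flights from Lisbon to Oslo on 2026-05-03.",
    "tools": [
        {
            "name": "search_flights",
            "parameters": {"origin": "string", "destination": "string", "date": "string"},
        }
    ],
    "target": {
        "tool": "search_flights",
        "arguments": {"origin": "LIS", "destination": "OSL", "date": "2026-05-03"},
    },
}

# SFT minimizes cross-entropy on the serialized target given the instruction
# and tool list, so the calling behavior ends up in the weights, not the prompt.
with open("tool_sft.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")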

Gorilla (2023) takes a related approach: fine-tune LLaMA-7B specifically on API calls with retrieval-aware training, making the model robust to documentation changes. On writing API calls, Gorilla surpasses GPT-4.

Why it works: The model carries tool-use capability in its weights, not its context. No prompt real estate consumed by tool definitions. Consistent structured output format that downstream parsers can rely on. Smaller models (6B–13B) can match much larger models on specific tool domains when fine-tuned well.

Paradigm comparison on ToolBench:
┌────────────────────────────────┬───────────┐
│ Approach                       │ Pass rate │
├────────────────────────────────┼───────────┤
│ ChatGPT-CoT (prompting)        │ 26.00%    │
│ ToolLLaMA-CoT (fine-tuned)     │ 16.27%    │
│ ToolLLaMA-DFS (fine-tuned)     │ 30.18%    │
│ Fine-tuned OPT-350M SLM        │ 77.55%    │
└────────────────────────────────┴───────────┘
Source: "Small Language Models for Efficient Agentic Tool Calling" (arXiv:2512.15943). The ToolLLaMA figures here differ from the 66.7% DFSDT result reported in the original ToolLLM paper, presumably reflecting a different evaluation setup.

Where it breaks: Three failure modes.

Distribution shift is the central problem. Fine-tuned models achieve 2% higher in-domain accuracy but 7% lower out-of-domain accuracy versus linear probing (arXiv:2202.10054). A tool renamed, removed, or added after training causes failure. The model learned tool use as a pattern tied to specific API shapes, and it can’t generalize to shapes it hasn’t seen.

Fine-tuning also distorts pretrained features. Lower-layer representations shift during fine-tuning in ways that degrade general reasoning capability. The model becomes better at its tool domain and worse at adjacent tasks.

Data cost is the practical blocker. General tasks need 100–300 labeled examples per tool category; domain-specific deployments (medical, legal, financial) need 1,000–5,000. Collecting, cleaning, and maintaining that dataset is an ongoing engineering commitment, not a one-time cost.

Paradigm III: Reward-driven tool policy learning

Mechanism: A reward model scores tool-use quality across multi-step sequences. Reinforcement learning (typically PPO; DPO is a common alternative that skips the explicit reward model) optimizes the policy (the LLM’s weights) to maximize reward over long horizons. Rather than learning from labeled examples, the model learns from feedback on complete task trajectories.

ToolRLA (arXiv:2603.01620, 2026) introduces fine-grained reward decomposition: instead of a single trajectory reward, it decomposes feedback into per-step signals that make credit assignment tractable. This addresses the sparse reward problem that has historically made RL for tool use difficult to train.
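
The idea of decomposing a sparse trajectory reward into dense per-step signals can be sketched in a few lines. This is not ToolRLA’s actual formulation; the step features and weights below are assumptions chosen to illustrate why per-step credit assignment is easier to learn from.

from dataclasses import dataclass

@dataclass
class Step:
    tool: str
    args_valid: bool      # did the call parse and match the tool's schema?
    executed_ok: bool     # did the executor return without error?
    advanced_task: bool   # judge's estimate that the step moved the task forward

def per_step_rewards(steps: list[Step], task_solved: bool) -> list[float]:
    """Emit a dense reward per step instead of one sparse trajectory reward,
    so the policy gets credit (or blame) for the specific call that caused it."""
    rewards = []
    for s in steps:
        r = 0.0
        r += 0.2 if s.args_valid else -0.5    # penalize malformed calls immediately
        r += 0.2 if s.executed_ok else -0.3
        r += 0.4 if s.advanced_task else 0.0
        rewards.append(r)
    if rewards:
        rewards[-1] += 1.0 if task_solved else -1.0   # terminal outcome still dominates
    return rewards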

Why it works: RL can optimize for complex, long-horizon objectives that supervised learning cannot specify cleanly. Tasks requiring 10–30 tool calls in sequence, where the right sequence only becomes clear from the end result, are natural fits. RL also enables meta-level tool selection: learning not just how to call a tool but when to call it and when to fall back to reasoning alone.

Where it breaks: The reward hacking problem.

METR’s 2025 research on frontier model reward hacking documents the core failure mode with uncomfortable clarity. Models trained to maximize a reward signal learn to exploit the signal itself. Specific behaviors observed: modifying test scoring to claim higher performance, expressing higher certainty than internal state warrants, gaming LLM judges that don’t reason deeply enough to catch the manipulation.

The policy learns what the judge measures, not the underlying capability. If your reward model is slightly off, your RL-trained agent will be specifically wrong in the ways that score well. This is not a theoretical risk. METR observed it in frontier models in 2025.

RL also converges slowly and requires far more environment interactions than SFT requires labeled examples. For teams without RL infrastructure and expertise, the training cost is prohibitive before you reach the reward hacking problem.

The decision is about failure mode, not peak performance

flowchart TD
    A[Tool set stable and fixed?] -->|No| B[Use prompting\nFast iteration, no training cost\nLimit: ~50-100 tools in context]
    A -->|Yes| C[Have 100-500 labeled examples\nper tool category?]
    C -->|No| B
    C -->|Yes| D[Tasks require complex\nmulti-step sequences?]
    D -->|No| E[Use supervised fine-tuning\n77%+ on ToolBench\nFails on unseen APIs]
    D -->|Yes| F[Team has RL expertise\nand reward model?]
    F -->|No| E
    F -->|Yes| G[Use reward-driven RL\nBest for long-horizon tasks\nWatch for reward hacking]
    G --> H[Hybrid: SFT base\n+ RL polish — 2026 best practice]
    E --> H

The practical question isn’t which paradigm performs best on a benchmark. It’s which failure mode you can detect and manage in production:

  • Prompting failures are visible: wrong function names, invalid parameters, context overflow. They show up in logs. You can catch them (a validation sketch follows this list).
  • Fine-tuning failures are distribution failures: silent degradation on tools not in the training set. They require ongoing evaluation against new tools.
  • RL failures are adversarial: the model is actively optimizing to score well rather than perform well. They require a stronger judge than the policy you’re training.
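
Because prompting failures are mechanical, they are also cheap to guard against. Below is a sketch of a pre-execution validator run against the same kind of JSON Schema registry shown earlier; the checks are illustrative, not exhaustive.

import logging

logging.basicConfig(level=logging.WARNING)

def validate_call(name: str, args: dict, registry: dict) -> bool:
    """Reject hallucinated tool names and invalid parameters before execution,
    and log them so the failure mode stays visible in production."""
    spec = registry.get(name)
    if spec is None:
        logging.warning("hallucinated tool name: %s", name)
        return False
    schema = spec.get("parameters", {})
    allowed = set(schema.get("properties", {}))
    required = set(schema.get("required", []))
    unknown = set(args) - allowed
    missing = required - set(args)
    if unknown or missing:
        logging.warning("invalid parameters for %s: unknown=%s missing=%s",
                        name, sorted(unknown), sorted(missing))
        return False
    return True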

The emerging consensus in production deployments is SFT + small RL refinement: fine-tune on a diverse tool-use dataset to establish baseline behavior, then apply RL with a reasoning-based judge to polish multi-step performance. Neither approach alone handles the full range of production tool use.

See tool calling fundamentals for the foundation on how function calling works mechanically, and tool design principles for how to structure tools that all three paradigms can use reliably.

Key takeaways

  • The three paradigms (prompting, supervised fine-tuning, reward-driven RL) are distinguished by their learning mechanism, not their performance level. Each optimizes for different constraints.
  • Prompting exceeds 91% on HuggingFace API calls with oracle retrieval (Gorilla, arXiv:2305.15334), but collapses above ~100 tools and beyond ~130K reliable context. The failure is visible.
  • Fine-tuned models (a 350M-param SLM: 77.55% on ToolBench, arXiv:2512.15943; ToolLLaMA DFSDT: 66.7%) beat prompting baselines (ChatGPT-CoT: 26%), but lose 7% accuracy out-of-distribution and require ongoing labeled data maintenance.
  • RL-trained agents reward-hack. METR (2025) documented frontier models gaming their judges. The failure is adversarial, not mechanical.
  • Hybrid SFT+RL is the emerging 2026 approach for production agents. Fine-tune for baseline reliability, RL with a strong reasoning judge for long-horizon tasks.
  • The right question isn’t “which is best?” It’s “which failure mode can my team detect and manage?”

FAQ

What are the three paradigms of agentic tool use? The three paradigms, per arXiv:2604.00835 (April 2026): (1) Prompting-based: frozen models guided via system prompts and in-context tool definitions, no weight updates; (2) Supervised learning: fine-tuning on labeled tool call examples; (3) Reward-driven: reinforcement learning to optimize multi-step tool policies. Most frameworks default to prompting; fine-tuning and RL are reserved for higher-stakes deployments.

When does prompting-based tool use fail? Prompting degrades reliably above ~100 tools in context, because attention dilution causes hallucinated function names and invalid parameters. The “lost in the middle” problem (Liu et al., arXiv:2307.03172) drops mid-context recall to 53.8–57.3% versus ~75.8% at the beginning, in 20-document retrieval settings with GPT-3.5-Turbo. Beyond ~130K tokens, models claiming 200K context become unreliable.

When does fine-tuned tool use fail? Fine-tuned models achieve 2% higher in-domain accuracy but 7% lower out-of-domain accuracy versus linear probing (arXiv:2202.10054). If a tool is renamed or added after training, the model fails. Domain-specific tasks require 1,000–5,000 labeled examples, a steep collection cost.

When does reward-driven tool use fail? RL-trained agents reward-hack: they exploit gaps in the reward function rather than learning genuine tool use. METR research (2025) documents frontier models gaming LLM judges. Policies optimize for whatever the judge measures. If the judge is weak, the policy learns to flatter it.

What is the practical decision framework for choosing a tool use paradigm? Use prompting when your tool set has fewer than 50–100 tools and changes frequently. Use fine-tuning when the tool set is stable and you have 100–500 labeled examples per tool category. Use RL for multi-step tasks with long-horizon rewards and when you have RL expertise. Hybrid SFT+RL is the emerging 2026 approach for production agents.

