
“AlphaGo’s secret weapon was not the neural network. It was the tree search that told the network where to look.”

TL;DR

Chain-of-thought commits to one path and cannot backtrack. MCTS explores, evaluates, and revises. ToolTree (ICLR 2026) brings MCTS to agent planning with dual-stage feedback — pre-evaluation filters tools before execution, post-evaluation scores after — achieving 10% gains over baselines while scaling to 10,014 tools with under 2% degradation. LATS doubled ReAct on HotPotQA. RAP on LLaMA-33B beat CoT on GPT-4 by 33% for plan generation. The catch: MCTS needs a simulatable environment. Code and databases, yes. Sending emails, no. For the static planning patterns MCTS builds upon, see planning and decomposition.

[Figure: a circuit board with branching trace paths illuminated in sequence, the leftmost branch brightly lit as the selected path.]

Why does agent planning need tree search?

Because the tool-call decision space branches exponentially, and linear reasoning cannot explore it.

An agent with access to 50 tools faces 50 choices at each step. A three-step plan has 125,000 possible tool sequences. A five-step plan has 312.5 million. Chain-of-thought picks one path through this space based on the LLM’s first instinct. If step two was wrong, the only recovery is forward correction — hoping step three compensates for step two’s mistake.
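The combinatorics above are plain exponentiation: T tools over k sequential steps give T^k possible sequences.

```python
# Number of possible tool sequences: T choices at each of k steps.
def tool_sequences(num_tools: int, steps: int) -> int:
    return num_tools ** steps

print(tool_sequences(50, 3))  # 125000
print(tool_sequences(50, 5))  # 312500000
```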

This is the same problem Go programs faced before AlphaGo. The game tree has roughly 250 legal moves per position and games last 150+ moves. Exhaustive search is impossible. Greedy evaluation misses long-term consequences. MCTS solved it by selectively exploring the most promising branches, estimating values through simulated rollouts, and revising estimates as more information arrives.

The mapping to agent planning is direct. States are the agent’s current context — task description, tools called so far, intermediate results. Actions are tool selections. Rollouts simulate what would happen if the agent followed a particular tool sequence to completion. The LLM serves as both the policy (which branch to try next) and the world model (what happens if I call this tool).

How does MCTS work in the agent context?

Four phases, repeated until a time or compute budget runs out.

```mermaid
graph TD
    A[Selection<br/>UCB1 traverses existing tree] --> B[Expansion<br/>LLM generates candidate actions]
    B --> C[Simulation<br/>LLM rollout predicts outcomes]
    C --> D[Backpropagation<br/>Update node values with results]
    D --> A

    E[UCB1 Formula] --- F["reward_avg + C × √(ln(parent_visits) / visits)"]
    F --- G[Exploitation ← → Exploration]
```

Selection. Starting from the root, traverse the tree by choosing the child with the highest UCB1 score at each level. UCB1 balances exploitation (nodes with high average reward) against exploration (nodes visited fewer times). The constant C controls the trade-off — higher C means more exploration.
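The UCB1 score is a direct transcription of the formula in the diagram; the function and parameter names here are mine, not from any of the papers.

```python
import math

# UCB1: average reward plus an exploration bonus that shrinks as a node
# accumulates visits. C trades off exploitation against exploration.
def ucb1(total_reward: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    if visits == 0:
        return float("inf")  # unvisited children are always tried first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)
```

During selection, the child with the highest score is followed at each level; a larger `c` inflates the bonus for rarely visited nodes and pushes the search toward exploration.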

Expansion. At a leaf node, the LLM generates one or more candidate actions (tool calls). Each becomes a new child node.

Simulation. From the new node, the LLM simulates a rollout — predicting what would happen if the agent continued with a plausible sequence of tool calls. The rollout terminates at a goal state or a depth limit. The outcome gets a reward score.

Backpropagation. The reward propagates up the tree, updating the average reward and visit count at every ancestor node. Nodes on paths that led to good outcomes get higher values. Nodes on paths that failed get lower values.

After many iterations, the root’s children have well-calibrated value estimates. The agent picks the action with the highest visit count (not highest value — visit count is more robust to outliers) and executes it.
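The four phases fit in a short loop. This is a minimal sketch, not any paper's implementation: a random rollout stands in for the LLM's simulated rollout, a fixed action list stands in for LLM-proposed tool calls, and all names (`Node`, `mcts`, `step`, `reward`) are illustrative.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.visits, self.total_reward = [], 0, 0.0

def ucb_score(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return (node.total_reward / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts(root_state, actions, step, reward, iters=300, depth=5):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # 1. Selection: descend by UCB1 until reaching a leaf.
        while node.children:
            node = max(node.children, key=ucb_score)
        # 2. Expansion: one child per candidate action (LLM proposals in a real agent).
        if node.visits > 0:
            node.children = [Node(step(node.state, a), node, a) for a in actions]
            node = random.choice(node.children)
        # 3. Simulation: random rollout in place of an LLM-predicted rollout.
        state = node.state
        for _ in range(depth):
            state = step(state, random.choice(actions))
        r = reward(state)
        # 4. Backpropagation: update visit counts and rewards up to the root.
        while node:
            node.visits += 1
            node.total_reward += r
            node = node.parent
    # Commit to the most-visited root child, as the text recommends.
    return max(root.children, key=lambda n: n.visits).action
```

A toy check: states are integers, actions move +1 or -1, and reward is closeness to a goal of 5. The search concentrates visits on the +1 branch.

```python
random.seed(0)
best = mcts(0, [1, -1], step=lambda s, a: s + a, reward=lambda s: -abs(s - 5))
```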

What does ToolTree add beyond basic MCTS?

ToolTree (arXiv 2603.12740, ICLR 2026) addresses the practical problems of applying MCTS to tool-use agents.

The core innovation is dual-stage feedback with bidirectional pruning. Before executing a tool, a lightweight pre-evaluator assesses semantic relevance — does this tool’s description match the current sub-task? Tools that score below threshold are pruned before the agent spends compute or API calls running them. After execution, a post-evaluator scores the actual output quality — did this tool produce useful results?

This matters for scale. An agent with access to 10,014 tools (the ToolBench library) cannot afford to explore even a fraction of the search space. ToolTree’s pre-evaluator cuts the effective branching factor dramatically. The paper reports less than 2% performance degradation when scaling from small tool sets to 10,014 tools — the pre-evaluator filters irrelevant tools without losing good candidates.
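The shape of pre-evaluation can be sketched as a filter over tool descriptions. ToolTree's actual pre-evaluator is part of its learned dual-stage feedback; the token-overlap score below is just a cheap illustrative stand-in, and the tool names and threshold are assumptions.

```python
# Score each tool description against the current sub-task and prune
# low-relevance tools BEFORE any of them is executed.
def relevance(subtask: str, description: str) -> float:
    task_words = set(subtask.lower().split())
    desc_words = set(description.lower().split())
    return len(task_words & desc_words) / max(len(task_words), 1)

def pre_evaluate(subtask: str, tools: dict, threshold: float = 0.2) -> dict:
    # tools maps tool name -> description; only survivors enter the tree search.
    return {name: desc for name, desc in tools.items()
            if relevance(subtask, desc) >= threshold}

tools = {
    "get_weather": "fetch the current weather forecast for a city",
    "send_email": "send an email message to a recipient",
    "search_flights": "search for flights between two cities on a date",
}
kept = pre_evaluate("find the weather forecast for Paris", tools)
```

Even this crude filter shrinks the branching factor before the expensive search begins, which is the point: pruning happens on descriptions, not executions.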

Results across four benchmarks:

| Benchmark | ToolTree | Best baseline | Gain |
| --- | --- | --- | --- |
| GTA (F1) | 66.95 | ~60 (greedy) | ~10% |
| ToolBench (pass rate) | 69.04% | ~63% (greedy) | ~10% |

Code is available at github.com/SYang2000/ICLR_2026_ToolTree.

How does this compare to other tree-search agents?

Three prior systems established the MCTS-for-agents paradigm.

LATS (Language Agent Tree Search, ICML 2024) was the first unified framework combining reasoning, acting, and planning via MCTS. It incorporates LLM-powered value functions and self-reflections as feedback. On HotPotQA, LATS achieved 0.61 EM — 2x the score of ReAct. On WebShop, 75.9 average score (+22.1 over baselines with GPT-3.5). On HumanEval, 94.4% Pass@1 with GPT-4.

RAP (Reasoning via Planning, EMNLP 2023) repurposed the LLM as both world model and reasoning agent. RAP on LLaMA-33B achieved a 33% relative improvement over chain-of-thought on GPT-4 for plan generation — a smaller model with better search beating a larger model with greedy decoding.

LLM-MCTS (NeurIPS 2023) showed 40.59% improvement over zero-shot chain-of-thought by using the LLM as both heuristic policy and world model.

The pattern across all three: tree search with LLM rollouts consistently outperforms linear reasoning. The improvements are not marginal — they are 2x on information retrieval (LATS), 33% on planning (RAP), and 40% on general problem-solving (LLM-MCTS).

ToolTree’s contribution is making this practical at scale. The prior systems work well with small tool sets but degrade when the branching factor grows. Bidirectional pruning solves the scaling problem.

| System | Venue | Key result | Best for |
| --- | --- | --- | --- |
| ToolTree | ICLR 2026 | 10% gain, scales to 10K tools | Large tool libraries |
| LATS | ICML 2024 | 2x ReAct on HotPotQA | Multi-step QA, web tasks |
| RAP | EMNLP 2023 | LLaMA-33B beats GPT-4 CoT by 33% | Plan generation |
| LLM-MCTS | NeurIPS 2023 | 40.59% over CoT | General task planning |

When does MCTS break for agents?

MCTS assumes you can simulate or undo actions. In board games, this is trivially true — you can always undo a move on the internal board representation. In agent environments, three conditions break this assumption.

Irreversible actions. Sending an email, making a purchase, posting a comment, calling an API with side effects — none of these can be rolled back. MCTS requires exploring multiple paths, which means executing (or simulating) actions that might be wrong. If execution has real consequences, you cannot afford wrong paths.

Simulation fidelity. When MCTS cannot execute real actions during search, it substitutes LLM-based rollouts — the LLM predicts what would happen. These predictions are unreliable for complex environments. Web pages change. API responses depend on state you cannot observe. User behavior is unpredictable. Compounding rollout errors over multiple steps degrades MCTS performance significantly.

Computational cost. Each MCTS iteration requires at least one LLM call for expansion and one for simulation. A tree search with 100 iterations per decision step costs 200 LLM calls — versus one call for greedy ReAct. For latency-sensitive applications, this 200x overhead is prohibitive.

The practical rule: use MCTS when actions are reversible or safely simulatable.

| Environment | MCTS viable? | Why |
| --- | --- | --- |
| Code generation | Yes | Syntax checking is instant, execution is sandboxed |
| Database queries | Yes | SELECT is read-only, side-effect-free |
| Math reasoning | Yes | Symbolic, fully reversible |
| Web navigation (structured) | Partially | Pages are observable but dynamic |
| Email/messaging | No | Irreversible, real-world consequences |
| API calls with side effects | No | Cannot undo mutations |

The hybrid recommendation: use ReAct (linear) for real-time interaction with irreversible environments. Switch to MCTS for planning phases where the agent can simulate before committing. ToolTree’s pre-evaluator makes this cheaper by filtering 90%+ of candidate tools before the tree search begins.
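That hybrid rule reduces to a small dispatcher. The tool names, the reversible set, and the planner labels below are all illustrative assumptions, not from ToolTree or the other papers.

```python
# Tools known to be safely simulatable or undoable (illustrative set).
REVERSIBLE = {"run_sql_select", "execute_sandboxed_code", "symbolic_math"}

def choose_planner(candidate_actions: set) -> str:
    # Safe default: anything not known to be reversible is treated as
    # irreversible, so tree search runs only when every action can be undone.
    if candidate_actions <= REVERSIBLE:
        return "mcts_tree_search"  # safe to explore and backtrack
    return "react_linear"          # real consequences: one committed step at a time
```

Defaulting unknown tools to the linear path is the conservative choice: a mislabeled irreversible action inside a tree search does real damage, while a mislabeled reversible one merely forfeits some search benefit.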

Key takeaways

  • Tree search outperforms linear reasoning by 2x or more across QA, planning, and code generation. The evidence from LATS, RAP, and LLM-MCTS is consistent.
  • ToolTree makes it practical at scale. Dual-stage feedback and bidirectional pruning handle 10,014 tools with under 2% degradation. Code is on GitHub.
  • The mapping from games to agents is direct. States = context, actions = tool calls, rollouts = LLM simulations. UCB1 balances exploration and exploitation.
  • MCTS requires simulatable environments. Code, databases, and math — yes. Emails, purchases, and API mutations — no. Design your agent to separate planning (tree search) from execution (linear).
  • Smaller models with better search beat larger models with greedy decoding. RAP on LLaMA-33B beat CoT on GPT-4 by 33%. Search is a force multiplier.


Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch