8 minute read

“We’ve been comparing a sprinter with one leg tied to a committee with a head start, and concluding committees are faster.”

TL;DR

A study across two datasets, three model families, and five multi-agent architectures (Sequential, Debate, Ensemble, Parallel-roles, Subtask-parallel) finds that single-agent LLMs match or outperform multi-agent systems on multi-hop reasoning when thinking token budgets are equalized (arXiv 2604.02460). Most multi-agent gains come from spending more tokens, not from coordination. Before adding agents, max out your single agent’s reasoning depth. For background on multi-agent coordination patterns, see multi-agent architectures.

A single chess piece standing on a circuit board surrounded by fallen pieces, representing single-agent superiority over multi-agent swarms under equal conditions

What happens when you equalize the token budget?

The standard multi-agent experiment gives each agent its own token allocation. Three agents with 4,000 tokens each spend 12,000 tokens total. The single-agent baseline gets 4,000. The multi-agent system “wins,” and the paper concludes coordination helps.

arXiv 2604.02460 runs the fair version of this experiment. Give the single agent the same 12,000 tokens. Let it think deeper instead of splitting reasoning across agents. The result: across two multi-hop reasoning datasets, three model families, and five multi-agent architectures, the single agent matches or outperforms the multi-agent system.

The finding holds across Sequential, Debate, Ensemble, Parallel-roles, and Subtask-parallel designs. Not one of the five architectures consistently beats a single agent operating at the same token budget. The implication is direct: token budget, not agent count, drives reasoning performance.

graph LR
    subgraph "Standard comparison (unfair)"
        A1[Agent 1<br/>4K tokens] --> R1[Result]
        A2[Agent 2<br/>4K tokens] --> R1
        A3[Agent 3<br/>4K tokens] --> R1
        S1[Single agent<br/>4K tokens] --> R2[Result]
    end
    
    subgraph "Equalized comparison (fair)"
        B1[Agent 1<br/>4K tokens] --> R3[Result]
        B2[Agent 2<br/>4K tokens] --> R3
        B3[Agent 3<br/>4K tokens] --> R3
        S2[Single agent<br/>12K tokens] --> R4["Result<br/>(matches or wins)"]
    end

Why does single-agent win under fair conditions?

Two mechanisms explain the gap.

Latent versus externalized reasoning. A single agent keeps intermediate reasoning steps inside one continuous chain of thought. Nothing gets serialized, truncated, or reformatted between steps. A sequential multi-agent pipeline externalizes these intermediates as messages passed between agents. Each handoff is a lossy compression step — the sending agent summarizes its reasoning into a message, and the receiving agent reconstructs context from that summary. Information degrades at every hop.

This is the multi-agent equivalent of the telephone game. Each hand-off introduces noise. Research on context degradation in multi-agent pipelines confirms the pattern: Agent A’s degraded output enters Agent B’s context as ground truth, and errors amplify across the chain.

The coordination tax. Every token spent on inter-agent communication is a token not spent on reasoning. In a three-agent Sequential pipeline, a meaningful fraction of the total token budget goes to formatting instructions, role descriptions, and message headers. A single agent spends its entire budget on the problem. When budgets are tight — and in production, they always are — the coordination tax matters.

| Architecture | Where tokens go | Reasoning efficiency |
| --- | --- | --- |
| Single agent (12K tokens) | 100% on reasoning | High |
| 3-agent Sequential (12K total) | ~70-80% reasoning, ~20-30% coordination | Medium |
| 3-agent Debate (12K total) | ~50-60% reasoning, ~40-50% debate overhead | Low |
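
To make the table concrete, here is a rough back-of-the-envelope sketch in Python. The overhead fractions are illustrative assumptions drawn from the ranges above, not measurements from the paper.

TOTAL_BUDGET = 12_000

# Coordination overhead fractions are illustrative, matching the table above.
architectures = {
    "single_agent": 0.00,
    "sequential_3_agent": 0.25,   # ~20-30% on handoffs, roles, formatting
    "debate_3_agent": 0.45,       # ~40-50% on rounds of argument and rebuttal
}

for name, overhead in architectures.items():
    reasoning_tokens = int(TOTAL_BUDGET * (1 - overhead))
    print(f"{name}: {reasoning_tokens} of {TOTAL_BUDGET} tokens actually spent reasoning")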

When does multi-agent actually help?

The paper does not claim multi-agent is always wrong. It claims that for sequential reasoning tasks, multi-agent is the wrong choice, and that its apparent advantage comes from unfair comparisons.

Multi-agent architectures earn their complexity in three cases.

Genuinely parallel tasks. Searching three databases simultaneously, running independent analyses on separate data partitions, or fetching information from multiple APIs. Here the wall-clock time savings are real, and each agent’s context stays independent. A Towards Data Science analysis describes the “bag of agents” anti-pattern — unstructured parallelism where agents interfere with each other, producing up to 17x the error rate of a single agent. Structured parallelism, where sub-tasks are independent by design, avoids this.

Different system prompts per sub-task. A coding agent and a review agent need different instructions, different temperatures, and different tool access. Forcing both personas into one system prompt degrades both. The multi-agent split here is not about token budgets — it is about keeping each agent’s context clean and focused.

Task-structure decomposability. When sub-tasks are genuinely independent — parallel data retrieval, simultaneous API calls, concurrent analysis of separate documents — multi-agent systems earn their overhead. When tasks are sequential and state-dependent, where each step depends on the full history of previous steps, multi-agent architectures pay the coordination tax without getting the parallelism benefit. The task structure determines whether splitting helps or hurts.
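
A minimal sketch of what structured parallelism looks like in code, assuming a hypothetical run_subagent coroutine that wraps your provider's API call. The point is that each sub-task is independent by design: no agent reads another agent's output, so there are no handoffs to degrade and no coordination messages to pay for.

import asyncio

async def run_subagent(role: str, subtask: str, budget: int) -> str:
    # Placeholder: swap in your provider's API call with max_tokens=budget.
    await asyncio.sleep(0)  # stands in for the network round-trip
    return f"[{role}] result for: {subtask}"

async def structured_parallel(question: str) -> list[str]:
    # Sub-tasks are independent by design: results merge only at the end,
    # with no inter-agent messaging along the way.
    subtasks = [
        ("db-searcher", f"Search the orders database for: {question}"),
        ("docs-searcher", f"Search internal docs for: {question}"),
        ("api-fetcher", f"Fetch live pricing relevant to: {question}"),
    ]
    return await asyncio.gather(
        *(run_subagent(role, task, budget=4_000) for role, task in subtasks)
    )

results = asyncio.run(structured_parallel("Q4 enterprise pricing changes"))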

OrgAgent (arXiv 2604.01020) demonstrates a middle path: organize multi-agent systems with governance, execution, and compliance layers — like a company, not a swarm. Their structured approach achieved 102% performance improvement and 74% token reduction on SQuAD 2.0 compared to unstructured multi-agent baselines. Structure matters more than agent count.

What does this mean for production architectures?

Gartner measured 1,445% growth in multi-agent AI system inquiries from Q1 2024 to Q2 2025. That is demand, not validation. The production reality is more nuanced, and the equalization finding changes how teams should evaluate their architectures.

The equalization test. Before shipping a multi-agent system, run this test: give a single agent the same total token budget your multi-agent pipeline consumes. If the single agent matches performance, you have complexity without benefit. If multi-agent wins, the coordination is earning its keep — but verify that the gain comes from parallelism or specialization, not just from spending more tokens overall.
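
A sketch of the equalization test as a small harness, assuming three callables you supply: run_pipeline (returns an answer plus the total tokens it consumed), run_single_agent (accepts an explicit token budget), and score (returns 1 for a correct answer, 0 otherwise). None of these are real library APIs; they are stand-ins for however you invoke your own agents.

def equalization_test(eval_set, run_single_agent, run_pipeline, score):
    """Compare a single agent and a multi-agent pipeline at the same total token budget."""
    single_correct = 0
    multi_correct = 0
    for example in eval_set:
        multi_answer, tokens_used = run_pipeline(example["question"])
        # Give the single agent exactly the budget the pipeline actually consumed.
        single_answer = run_single_agent(example["question"], budget=tokens_used)
        multi_correct += score(multi_answer, example["answer"])
        single_correct += score(single_answer, example["answer"])
    n = len(eval_set)
    return {"single_agent": single_correct / n, "multi_agent": multi_correct / n}

If the single-agent number matches or beats the pipeline, the coordination is not paying for itself.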

Cost implications are immediate. With inference costs dropping 280x since November 2022 for GPT-3.5-equivalent performance (Stanford HAI 2025 AI Index), the raw cost of tokens matters less. But the engineering cost of multi-agent systems — debugging handoff failures, managing context bleed, preventing cost explosions from retry loops — remains high. A single agent with a generous thinking budget is simpler to monitor, cheaper to debug, and easier to explain to stakeholders. For the full taxonomy of multi-agent failure modes, see the Gartner adoption analysis.

Reasoning depth scales better than agent count. The emerging research on test-time compute scaling — spending more inference tokens on harder problems — aligns with this finding. A single agent that allocates more reasoning tokens to harder sub-problems outperforms a multi-agent system that spreads tokens evenly across agents. The winning strategy is depth-first, not breadth-first.
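
A minimal sketch of depth-first budget allocation, assuming a hypothetical estimate_difficulty callable (a cheap heuristic or a small router model) that returns a positive weight per sub-problem.

def allocate_depth_first(subproblems, total_budget, estimate_difficulty):
    # Split one agent's thinking budget by estimated difficulty (depth-first),
    # rather than evenly across agents (breadth-first).
    weights = {p: estimate_difficulty(p) for p in subproblems}
    total_weight = sum(weights.values())
    return {p: int(total_budget * w / total_weight) for p, w in weights.items()}

# Example: a 12K budget concentrated on the hardest hop instead of split 4K/4K/4K.
difficulty = {"extract entities": 1, "bridge hop": 3, "final synthesis": 2}
budgets = allocate_depth_first(
    list(difficulty), total_budget=12_000, estimate_difficulty=difficulty.get
)
# -> {'extract entities': 2000, 'bridge hop': 6000, 'final synthesis': 4000}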

How should you allocate your inference budget?

A decision framework based on the equalization finding:

graph TD
    A[New reasoning task] --> B{Is the task<br/>decomposable into<br/>independent sub-tasks?}
    B -->|Yes| C{Do sub-tasks need<br/>different system prompts<br/>or models?}
    B -->|No| D[Single agent<br/>Full token budget]
    C -->|Yes| E[Multi-agent<br/>Specialized per sub-task]
    C -->|No| F{Is wall-clock<br/>latency critical?}
    F -->|Yes| G[Multi-agent parallel<br/>Same model, split work]
    F -->|No| D
    E --> H[Run equalization test<br/>to verify benefit]
    G --> H

Step 1: Default to single agent. Give it the full token budget you would have split across agents. Measure quality.

Step 2: Add agents only with evidence. If single-agent quality is insufficient AND the task has independent sub-components, try multi-agent. Measure the delta.

Step 3: Run the equalization test. Compare multi-agent at budget X against single-agent at budget X. If single-agent matches, keep it simple.

Step 4: Monitor the coordination tax. In production, instrument your multi-agent pipeline to track what fraction of tokens goes to coordination versus reasoning. If coordination consumes more than roughly 30% of total tokens, too much of your budget is going to talking instead of thinking.
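
A sketch of what that instrumentation could look like, assuming you tag every model call with a category when you make it. The category names and the ledger class are assumptions for illustration, not part of any SDK.

from collections import Counter

COORDINATION = {"role_prompt", "handoff_message", "format_instructions", "vote"}

class TokenLedger:
    # Tracks where a pipeline's tokens go. Call record() from a thin wrapper
    # around your model client, passing the token usage the API reports back.
    def __init__(self):
        self.tokens = Counter()

    def record(self, category: str, token_count: int) -> None:
        self.tokens[category] += token_count

    def coordination_tax(self) -> float:
        total = sum(self.tokens.values())
        if total == 0:
            return 0.0
        coordination = sum(n for cat, n in self.tokens.items() if cat in COORDINATION)
        return coordination / total

ledger = TokenLedger()
ledger.record("reasoning", 8_000)
ledger.record("handoff_message", 2_500)
ledger.record("role_prompt", 1_500)
if ledger.coordination_tax() > 0.30:
    print("Coordination tax above 30%: revisit the architecture.")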

Key takeaways

  • Equalize before you compare. Most published multi-agent advantages disappear when single agents get the same total token budget.
  • The coordination tax is real. Every token spent on inter-agent messaging is a token not spent on reasoning.
  • Multi-agent debate gains come from voting, not deliberation. Majority voting over independent outputs accounts for most accuracy improvements attributed to debate.
  • Task structure determines the right architecture. Parallel decomposable tasks benefit from multi-agent. Sequential reasoning tasks do not.
  • Run the equalization test. Before shipping multi-agent, verify the coordination earns its complexity by testing single-agent at the same budget.
  • Depth beats breadth for reasoning. Invest in deeper single-agent thinking before investing in more agents.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch