
“Every team is building their first multi-agent system. We are about to generate a massive dataset of production failures.”

TL;DR

Gartner’s 1,445% growth in multi-agent inquiries signals interest, not success. The real data: hub-and-spoke is the default architecture, failures cluster at handoffs, and cost explosions hit when orchestrators retry without limits. Deer-Flow (37K stars) and gstack (23K in 7 days) show what abstractions practitioners want. For the technical foundations of multi-agent coordination, see multi-agent architectures.

[Image: overhead view of a network cabling installation in a hub-and-spoke topology, the central hub glowing with activity while some spoke endpoints show fault indicator lights]

What does the 1,445% number actually mean?

Gartner tracks client inquiry volume — how many enterprises are asking their analysts about a technology category. A 1,445% increase from Q1 2024 to Q2 2025 means roughly 15x the inquiry volume: many more organizations evaluating multi-agent architectures in that period.

This is a demand signal. It measures curiosity and exploration, not deployment. The distinction matters because the gap between “exploring multi-agent” and “running multi-agent in production” is enormous. Most teams that explore will build a prototype, hit unexpected failure modes, and either simplify back to single-agent or invest months in production hardening.

The GitHub data tells a parallel story. ByteDance’s Deer-Flow — a SuperAgent harness for research, coding, and task execution — reached 37,000+ stars. It provides sandboxes, memory, tools, skills, subagents, and message gateways in a structured framework. YC president Garry Tan’s gstack hit 23,000 stars in its first week, offering an opinionated configuration-over-code approach with bundled skills for planning, review, shipping, and QA.

Both gained traction because they reduce the distance from “I want multi-agent” to “I have a working multi-agent system.” The demand is real. The engineering cost of going from demo to production is what the Gartner number does not capture.

What do early production deployments look like?

Three patterns emerge from organizations that have moved past prototyping.

Hub-and-spoke is the default. One orchestrator agent receives all requests, decides which specialist agents to invoke, collects their outputs, and synthesizes a response. This is the multi-agent equivalent of a monolithic API gateway — all traffic flows through one point. Teams choose it because it is the easiest to debug, monitor, and explain. When something fails, you check the orchestrator’s logs and trace which sub-agent caused the problem.

The limitation is predictable: the orchestrator becomes a throughput bottleneck and a single point of failure. But for teams processing hundreds of requests per day (not thousands), this constraint does not bind. Most early adopters are in that range.
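A minimal sketch of the pattern, assuming nothing beyond one routing function in front of a few specialist callables. `call_llm` and the model names are placeholders, not any specific framework's API:

```python
# Minimal hub-and-spoke sketch. call_llm() and the model names are
# placeholders for illustration, not a real provider API.

def call_llm(model: str, prompt: str) -> str:
    """Stand-in for a real model call; returns a canned string here."""
    return f"[{model} output for: {prompt[:40]}]"

SPECIALISTS = {
    "classify": lambda task: call_llm("small-model", task),     # cheap, fast
    "retrieve": lambda task: call_llm("medium-model", task),    # RAG lookup
    "respond":  lambda task: call_llm("frontier-model", task),  # final generation
}

def orchestrate(request: str) -> str:
    """Every request flows through this one function, which is exactly why
    the orchestrator is both the easiest place to debug and the bottleneck."""
    label = SPECIALISTS["classify"](request)
    context = SPECIALISTS["retrieve"](f"{label}: {request}")
    return SPECIALISTS["respond"](f"{request}\n\nContext:\n{context}")
```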

Heterogeneous model selection is common. The Gartner report noted organizations using expensive frontier models (GPT-5.2, Claude 4.6) for complex reasoning tasks and cheaper models for high-frequency execution tasks (formatting, validation, classification). This is a natural optimization: route each sub-task to the cheapest model that handles it reliably. The cost savings are substantial: a frontier model at $15/million tokens for 30% of tasks and a small model at $0.15/million tokens for the other 70% costs roughly 3x less than using the frontier model for everything; push the frontier share down to 10% and the savings approach 10x.
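The arithmetic is worth checking against your own traffic mix. A few illustrative lines (prices are assumptions, not vendor quotes):

```python
# Blended cost per million tokens for a frontier/small routing mix.
# Prices are assumed for illustration, not vendor quotes.
FRONTIER = 15.00  # $/million tokens
SMALL = 0.15      # $/million tokens

def blended(frontier_share: float) -> float:
    return frontier_share * FRONTIER + (1 - frontier_share) * SMALL

for share in (1.0, 0.3, 0.1):
    cost = blended(share)
    print(f"frontier share {share:4.0%}: ${cost:5.2f}/M tokens "
          f"({FRONTIER / cost:.1f}x cheaper than all-frontier)")
```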

Use cases cluster in two domains. Customer service (routing, knowledge retrieval, response generation as separate agents) and document processing (extraction, validation, classification, summarization as a pipeline). These are the use cases where the multi-agent decomposition maps naturally to the task structure. More ambitious applications — multi-agent research, autonomous software engineering, creative collaboration — exist in prototypes but rarely in production.

```mermaid
graph TD
    A[User Request] --> B[Orchestrator Agent<br/>Frontier model, reasoning]
    B --> C[Classifier Agent<br/>Small model, fast]
    B --> D[Retrieval Agent<br/>Medium model, RAG]
    B --> E[Response Agent<br/>Frontier model, generation]
    C --> B
    D --> B
    E --> B
    B --> F[Response to User]

    G[Cost breakdown:<br/>Orchestrator: 20% of calls, 60% of cost<br/>Specialist: 80% of calls, 40% of cost]
```

Where do the failures cluster?

Three failure modes appear consistently in production multi-agent systems. All are engineering problems, not research problems.

Handoff failures. When one agent passes work to another, context is lost or corrupted. The first agent’s 2,000-token output exceeds the second agent’s effective context window. The second agent receives a truncated version and proceeds with partial information. The result looks correct at a glance — the second agent generates confident output — but misses details from the truncation. These failures are silent and discovered only when a human reviews the output or a downstream system rejects it.
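One cheap mitigation is to make the handoff fail loudly instead of truncating silently. A sketch, assuming a token counter is available (the 4-characters-per-token estimate below is a crude stand-in for a real tokenizer):

```python
# Handoff guard: reject oversized payloads instead of letting the receiving
# agent proceed on a silently truncated view of the first agent's output.

class HandoffError(RuntimeError):
    pass

def count_tokens(text: str) -> int:
    # Crude stand-in (~4 chars/token); use your model's real tokenizer.
    return len(text) // 4

def hand_off(payload: str, receiver_budget_tokens: int) -> str:
    used = count_tokens(payload)
    if used > receiver_budget_tokens:
        raise HandoffError(
            f"payload is {used} tokens; receiver budget is {receiver_budget_tokens}"
        )
    return payload
```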

Context bleed. In systems where agents share memory or message buses, information from one task leaks into another agent’s context. A customer service agent processing User A’s complaint picks up context from User B’s conversation because both used the same retrieval agent and the retrieval results were not properly scoped. Context bleed is the multi-agent equivalent of a race condition — it appears intermittently and is hard to reproduce.
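The mitigation is the same as for any shared resource: scope every request explicitly. A sketch, assuming a retrieval index that accepts a metadata filter (the `filter` argument is an assumption about your retrieval layer, not a specific library's API):

```python
# Scope shared retrieval to one conversation so results cannot bleed across users.
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievalRequest:
    query: str
    session_id: str  # every request names the conversation it belongs to

def retrieve(req: RetrievalRequest, index) -> list:
    # Filter at the index, not after the fact, so User B's documents never
    # enter the prompt assembled for User A.
    return index.search(req.query, filter={"session_id": req.session_id})
```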

Cost explosions. An orchestrator sends a task to a specialist agent. The specialist fails. The orchestrator retries. The specialist fails again. Without a retry budget, the orchestrator retries indefinitely. Each retry costs API tokens. A single stuck task can generate hundreds of dollars in inference costs before anyone notices. The fix is simple — circuit breakers and retry budgets — but teams often discover the need through a surprisingly large cloud bill.
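A sketch of that fix, with both caps made explicit (the flat per-attempt cost estimate is an assumption; in practice, read token usage from your provider's response metadata):

```python
# Retry budget: hard caps on attempt count and estimated spend, with backoff.
import time

class RetryBudgetExceeded(RuntimeError):
    pass

def call_with_budget(task, max_attempts=3, max_cost_usd=5.0, cost_per_attempt_usd=0.50):
    spent = 0.0
    for attempt in range(1, max_attempts + 1):
        spent += cost_per_attempt_usd
        if spent > max_cost_usd:
            raise RetryBudgetExceeded(f"estimated spend ${spent:.2f} over budget")
        try:
            return task()  # task: any zero-argument callable wrapping the agent call
        except Exception:
            if attempt == max_attempts:
                raise RetryBudgetExceeded(f"gave up after {max_attempts} attempts")
            time.sleep(2 ** attempt)  # exponential backoff before the next try
```

A full circuit breaker adds a cooldown that stops routing to a specialist that keeps failing; the budget above is the minimum that prevents the runaway-bill failure mode.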

The Multi-Agent System Failure Taxonomy (MAST) identified 14 unique failure modes across three categories: specification and system design failures (5), inter-agent misalignment (6), and task verification and termination gaps (3). Handoff failures account for the largest share.

When should you use multi-agent versus single-agent?

The decision tree is simpler than most framework documentation suggests.

Use a single agent when:

  • One system prompt covers the full task
  • Tool calls are sequential, not parallel
  • The task fits in one model’s context window
  • Debugging transparency matters more than modularity

Add agents when you have a concrete reason:

  • Different sub-tasks require fundamentally different system prompts (a coding agent and a review agent need different instructions, temperature, and tools)
  • Independent sub-tasks can run in parallel (searching three databases simultaneously)
  • Different tasks need different model sizes (frontier for reasoning, small for formatting)
  • The task structure is genuinely a pipeline where each stage transforms the output

The anti-pattern: splitting a task into agents because it looks clean on an architecture diagram, not because it solves a real problem. Every agent boundary is a potential handoff failure, a serialization/deserialization step, and a cost multiplier. Add a boundary only when the benefit outweighs these costs.
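The criteria above fit in a few lines of code; a trivial checklist sketch (the argument names are illustrative):

```python
def needs_multi_agent(
    distinct_prompts: bool,   # sub-tasks want different instructions, temperature, tools
    parallel_subtasks: bool,  # independent work that can run concurrently
    mixed_model_sizes: bool,  # frontier for reasoning, small for formatting
    true_pipeline: bool,      # each stage genuinely transforms the previous output
) -> bool:
    # One concrete reason justifies an agent boundary; zero means stay single-agent.
    return any([distinct_prompts, parallel_subtasks, mixed_model_sizes, true_pipeline])
```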

Key takeaways

  • 1,445% measures curiosity, not success. Gartner tracks inquiries. Production deployments are a fraction of the interest.
  • Hub-and-spoke is the default architecture. Easiest to debug. The orchestrator bottleneck matters only at a scale most teams have not reached.
  • Failures cluster at handoffs. Context loss, context bleed, and cost explosions. All engineering problems with known solutions.
  • Heterogeneous model routing saves 3-10x. Frontier models for reasoning, cheap models for formatting. Route by task complexity; savings scale with how little traffic needs the frontier model.
  • Start single-agent. Add agents only for concrete reasons: parallel execution, different system prompts, different model sizes. Multi-agent for its own sake is complexity without benefit.
  • Deer-Flow and gstack show what practitioners want. Structured frameworks that reduce the distance from prototype to production. Configuration over code.


Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch