The Workload-Router-Pool: how vLLM thinks about fleet inference
TL;DR: The March 2026 vision paper “The Workload-Router-Pool Architecture for LLM Inference Optimization” (arXiv:2603.21354, vLLM Semantic Router project, MBZUAI/McGill/Mila/UChicago) formalizes fleet LLM inference as three interdependent dimensions: Workload (what you serve), Router (how requests are dispatched), and Pool (where inference runs). Agentic tasks generate far more tokens per request than single-turn chat (a full coding loop can produce 2,000 tokens where a chat reply produces 20), break KV-cache locality under round-robin routing, and require stateful session awareness that stateless load balancing cannot provide. The WRP framework shows why routing policy, workload mix, and pool topology must co-evolve; optimizing one constrains the others.

Most teams start with a single vLLM instance. Traffic grows. They add a second instance and route with round-robin. Traffic grows more. They add a load balancer, then a model router, then a KV-cache pool, each addition patching a new symptom. The system works. Barely, expensively, with a growing list of exceptions.
The March 2026 paper “The Workload-Router-Pool Architecture for LLM Inference Optimization” (arXiv:2603.21354, vLLM Semantic Router project) names the three dimensions that must co-evolve to get from patched single-server to coherent fleet.
The three dimensions
Workload characterizes what the fleet actually serves. Not all LLM requests are alike:
- Chat vs. agentic: a single-turn question takes one LLM call; an agentic loop making 10–30 tool calls takes 10–30 calls
- Prefill-heavy vs. decode-heavy: a request with a 32K-token context but a short answer stresses the GPU differently than a short prompt generating a long response
- Warm vs. cold start: a request that shares a prefix with recent requests can reuse KV-cache; one with a novel prefix cannot
- Single-turn vs. multi-turn: stateless requests can go anywhere; multi-turn sessions benefit from instance affinity
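This characterization maps naturally onto a small per-request profile. A minimal sketch in Python; the field and function names are assumptions for illustration, not anything from the paper or from vLLM's API:

```python
from dataclasses import dataclass

@dataclass
class RequestProfile:
    """Hypothetical per-request features a Workload-aware router might track."""
    prompt_tokens: int           # prefill cost driver
    expected_output_tokens: int  # decode cost driver
    session_id: str | None       # None => stateless single-turn
    prefix_hash: str | None      # hash of a shared system prompt / context, if any
    is_agentic: bool             # part of a multi-step tool-calling loop

def classify(req: RequestProfile) -> dict:
    """Tag a request along the four Workload axes listed above."""
    return {
        "interaction": "agentic" if req.is_agentic else "chat",
        "phase_bias": "prefill-heavy"
                      if req.prompt_tokens > 4 * max(req.expected_output_tokens, 1)
                      else "decode-heavy",
        "cache": "warm" if req.prefix_hash else "cold",
        "state": "multi-turn" if req.session_id else "single-turn",
    }
```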
Router determines how requests are dispatched across the pool. The paper identifies four routing strategies:
| Strategy | Mechanism | When to use |
|---|---|---|
| Static semantic rules | Domain/keyword classification | Stable workload mix, few models |
| Online bandit | Multi-armed bandit learning from latency feedback | Dynamic workload, cost/latency trade-off |
| RL-based model selection | Learned policy over model quality and cost | Complex multi-objective routing |
| Quality-aware cascading | Route to small model first, escalate if confidence low | Cost reduction with quality floor |
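To make one row of the table concrete, the online-bandit strategy can be prototyped as an epsilon-greedy policy that learns per-model latency from feedback. This is a minimal sketch under assumed model names and a single-objective reward, not the paper's implementation:

```python
import random
from collections import defaultdict

class EpsilonGreedyRouter:
    """Epsilon-greedy bandit over candidate models, rewarding low latency."""
    def __init__(self, models: list[str], epsilon: float = 0.1):
        self.models = models
        self.epsilon = epsilon
        self.count = defaultdict(int)
        self.avg_latency = defaultdict(float)

    def pick(self) -> str:
        # Explore occasionally (or when we have no data yet), otherwise exploit.
        if random.random() < self.epsilon or not self.count:
            return random.choice(self.models)
        return min(self.models, key=lambda m: self.avg_latency.get(m, float("inf")))

    def record(self, model: str, latency_s: float) -> None:
        """Feed observed latency back into a running average."""
        self.count[model] += 1
        self.avg_latency[model] += (latency_s - self.avg_latency[model]) / self.count[model]

# Hypothetical usage with made-up model names.
router = EpsilonGreedyRouter(["small-model", "large-model"])
model = router.pick()
# ... serve the request, measure latency ...
router.record(model, latency_s=0.42)
```

In practice the reward would blend latency, cost, and a quality signal; the single-objective version above only shows the feedback loop.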
Pool defines where inference runs:
- Homogeneous vs. heterogeneous GPU clusters (mixing H100s, A100s, consumer GPUs)
- Disaggregated prefill/decode: separate servers for the two inference phases
- KV-cache topology: per-instance, shared pool, or hierarchical offload
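A Pool description can be as small as a few declarative fields. A sketch with hypothetical names, just to make the choices above concrete:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class PoolConfig:
    """Hypothetical Pool description covering the three choices above."""
    gpu_mix: dict[str, int]       # e.g. {"H100": 8, "A100": 16}; one entry => homogeneous
    disaggregated: bool           # separate prefill and decode servers
    kv_topology: Literal["per-instance", "shared", "hierarchical"]

pool = PoolConfig(gpu_mix={"H100": 8, "A100": 16}, disaggregated=True, kv_topology="shared")
```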
The key architectural insight: these three dimensions are not independent. Your pool topology constrains which routing strategies are viable. Your routing strategy depends on your workload mix. You cannot design any one dimension in isolation.
Why agentic workloads are the stress test
Single-turn chat requests are stateless and roughly uniform in cost. An agentic loop is neither.
A user asking a question generates one LLM call. An agent solving a coding task generates one planning call, 5–10 tool calls, one synthesis call, and possibly a self-correction loop. The WRP paper documents this directly: a coding request may produce 200 or 2,000 tokens depending on task complexity, and an agentic loop compounds that across 10–30 tool calls. At scale, this is a fleet sizing problem disguised as an application feature.
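A back-of-envelope calculation makes the fleet-sizing point concrete. The numbers below are illustrative assumptions, not figures from the paper:

```python
# Rough token volume per request type (illustrative assumptions).
chat_tokens_per_request = 200                 # short prompt + short answer
agentic_calls = 20                            # planning + tool calls + synthesis
agentic_tokens_per_call = 1_000               # growing context re-sent each iteration
agentic_tokens_per_request = agentic_calls * agentic_tokens_per_call

# A modest shift in workload mix dominates fleet capacity.
requests_per_second = 500
agentic_share = 0.2
tokens_per_second = requests_per_second * (
    agentic_share * agentic_tokens_per_request
    + (1 - agentic_share) * chat_tokens_per_request
)
print(f"{tokens_per_second:,.0f} tokens/sec")  # ~2.08M here vs ~100K if purely chat
```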
The KV-cache problem is subtler. In a single-turn deployment, every request starts fresh with no KV-cache reuse expected. In an agentic loop, iterations 2 through 30 share the same growing context. If iteration 1 lands on server A and iteration 2 lands on server B (because round-robin doesn’t know they’re related), server B has no context to reuse. It processes the full prompt from scratch, burning compute and latency that should have been zero.
```mermaid
sequenceDiagram
    participant Client
    participant Router
    participant A as Server A
    participant B as Server B
    Note over Router: Round-robin (naive)
    Client->>Router: Agentic iteration 1
    Router->>A: Dispatch (builds KV-cache)
    Client->>Router: Agentic iteration 2 (same session)
    Router->>B: Dispatch (KV-cache miss!)
    B-->>Client: Recomputes full context
    Note over Router: Cache-aware routing
    Client->>Router: Agentic iteration 1
    Router->>A: Dispatch (builds KV-cache)
    Client->>Router: Agentic iteration 2 (same session)
    Router->>A: Dispatch (KV-cache hit, free!)
    A-->>Client: Token generation only
```
The vLLM Semantic Router (v0.1 Iris, January 2026) addresses this in the Router dimension. Signal-driven routing (using session ID, domain classification, and prefix hash) dispatches related iterations to the same instance or to instances sharing a KV-cache pool. In benchmarks (arXiv:2510.08731), this approach achieved a 10.24 percentage point accuracy improvement on MMLU-Pro and a 47.1% latency reduction versus naive routing, with domain-specific peaks exceeding 20pp on business/economics tasks.
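A cache-aware dispatch policy of the kind the diagram shows can be approximated with consistent hashing on a session or prefix key. A minimal sketch under assumed names; the actual Semantic Router uses richer signals (domain classification, embeddings, load) than this:

```python
import hashlib

class AffinityRouter:
    """Pin requests carrying the same session/prefix key to the same backend."""
    def __init__(self, backends: list[str]):
        self.backends = backends

    def route(self, session_id: str | None, prefix_hash: str | None) -> str:
        key = session_id or prefix_hash
        if key is None:
            # Stateless request: any backend works; a real router would pick least-loaded.
            return self.backends[0]
        digest = hashlib.sha256(key.encode()).hexdigest()
        return self.backends[int(digest, 16) % len(self.backends)]

router = AffinityRouter(["server-a:8000", "server-b:8000"])
# Iterations of the same agentic session always land on the same backend.
assert router.route("sess-42", None) == router.route("sess-42", None)
```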
Many enterprises now run 5+ models in production. At that scale, routing decisions happen thousands of times per second and each bad decision has cascading costs.
The disaggregation question
Disaggregated prefill-decode is the Pool dimension’s most significant architectural choice. The argument:
LLM inference has two distinct phases with incompatible resource profiles. Prefill (processing the input prompt in parallel) is compute-bound: essentially matrix multiplication that scales well with more compute. Decode (generating tokens one at a time, autoregressively) is memory-bandwidth bound, since each token generation reads the full KV-cache and the model weights.
Running both phases on the same server forces each to compete for the same resources. Separating them onto dedicated servers allows each to be optimized and scaled independently.
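The compute-bound vs. bandwidth-bound split shows up in a rough arithmetic-intensity estimate. Parameter values below are illustrative assumptions for a generic dense model, not measurements:

```python
# Illustrative roofline-style estimate for a generic ~70B dense model (assumed values).
params = 70e9
bytes_per_param = 2                 # fp16/bf16 weights
flops_per_token = 2 * params        # ~2 FLOPs per parameter per token

# Prefill: thousands of prompt tokens amortize each weight read -> compute-bound.
prefill_tokens = 4096
prefill_intensity = (prefill_tokens * flops_per_token) / (params * bytes_per_param)

# Decode: one token per step still reads every weight (KV-cache traffic ignored here)
# -> memory-bandwidth bound.
decode_intensity = flops_per_token / (params * bytes_per_param)

print(f"prefill ~{prefill_intensity:,.0f} FLOPs/byte, decode ~{decode_intensity:.0f} FLOPs/byte")
```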
The gains are real: Splitwise/DistServe benchmarks report a 4.48x throughput improvement for long prompts (>4K tokens). Both llm-d and NVIDIA Dynamo use disaggregation for this reason.
The cost is KV-cache locality. When prefill and decode run on separate servers, the KV-cache generated during prefill must transfer to the decode server. If your workload has high prefix reuse (many requests sharing the same system prompt or context), that transfer undoes the gains. Together.ai’s cache-aware disaggregated inference paper documents 40% throughput gain versus naive disaggregation when prefix reuse is accounted for in routing.
The decision tree:
```
Prefill latency (TTFT) dominant?
├─ Yes + prompt tokens > 4K?
│   └─ Disaggregate. Expected: 4.48x throughput improvement (Splitwise/DistServe).
│       But: measure KV-cache hit rate first.
│       If hit rate > 50%, shared pool may win.
└─ No: decode latency (TPOT) dominant?
    └─ Scale decode replicas horizontally.
        Keep prefill co-located for cache locality.
```
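The same tree, written as a function. The thresholds come from the text above; the names and the hit-rate input are assumptions:

```python
def pool_recommendation(ttft_dominant: bool, prompt_tokens: int, kv_hit_rate: float) -> str:
    """Encode the prefill/decode disaggregation decision tree above."""
    if ttft_dominant and prompt_tokens > 4_000:
        if kv_hit_rate > 0.5:
            return "shared KV pool (high prefix reuse erodes disaggregation gains)"
        return "disaggregate prefill/decode"
    return "scale decode replicas horizontally; keep prefill co-located"

print(pool_recommendation(ttft_dominant=True, prompt_tokens=32_000, kv_hit_rate=0.2))
```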
The fleet scaling decision
vLLM’s own throughput numbers (793 tokens/sec versus Ollama’s 41 tokens/sec, P99 latency of 80ms versus 673ms) establish why it became the default inference engine. But these numbers assume a single server under controlled load.
Fleet scaling introduces new problems the single-server numbers don’t address:
Session affinity. Multi-turn and agentic requests need instance affinity. Stateless load balancing breaks them. You need either sticky sessions (which limit load balancing flexibility) or shared KV-cache (which adds infrastructure complexity).
Workload heterogeneity. A fleet serving both short chat requests and long agentic tasks has two different latency SLAs, two different cost profiles, and two different optimal batching strategies. A single homogeneous pool optimized for one breaks the other.
Model multiplicity. When enterprises run multiple models in production, the Router dimension becomes an economic decision: small models for simple tasks, large models for complex ones, specialized models for domain-specific work. The vLLM Semantic Router’s quality-aware cascading (route to small model first, escalate on low confidence) can reduce costs significantly without quality loss.
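A minimal cascade can be written against any OpenAI-compatible endpoint (vLLM exposes one). The confidence heuristic, endpoint URL, and model names below are assumptions for illustration, not the Semantic Router's actual policy:

```python
import math
from openai import OpenAI

# Assumed local vLLM OpenAI-compatible endpoint and placeholder model names.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def cascade(prompt: str, small: str = "small-model", large: str = "large-model") -> str:
    """Try the small model first; escalate if its answer looks low-confidence."""
    resp = client.chat.completions.create(
        model=small,
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
    )
    choice = resp.choices[0]
    tokens = choice.logprobs.content or []
    # Crude confidence proxy: mean token probability of the small model's answer.
    probs = [math.exp(t.logprob) for t in tokens]
    if probs and sum(probs) / len(probs) >= 0.8:
        return choice.message.content
    resp = client.chat.completions.create(
        model=large, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```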
For teams at different scales, the practical guidance from the WRP paper:
| Scale | Architecture | Key tool |
|---|---|---|
| < 1,000 req/sec | Single vLLM instance, maximize batch size | vLLM PagedAttention |
| 1,000–10,000 req/sec | Fleet with semantic routing, KEDA autoscaling | vLLM Semantic Router |
| > 10,000 req/sec | Multi-zone, disaggregated prefill/decode, shared KV pool | llm-d + NVIDIA Dynamo |
For the routing layer specifically, see vLLM’s semantic router, which covers the classification algorithms within the Router dimension. The WRP paper extends that to show how the Router design is constrained by the Workload profile and must co-evolve with Pool topology.
Key takeaways
- The WRP framework (arXiv:2603.21354) names three interdependent fleet inference dimensions: Workload, Router, Pool. Optimizing any one without the others produces a locally optimal, globally broken system.
- Agentic multi-step tasks compound token generation across 10–30 tool calls (WRP paper) and require stateful KV-cache routing, not round-robin.
- Cache-aware routing (arXiv:2510.08731) achieved 10.24pp accuracy improvement on MMLU-Pro and 47.1% latency reduction versus naive routing.
- Disaggregated prefill-decode reduces TTFT by 30–40% for long prompts but loses KV-cache locality. Use it when prefill dominates and prefix reuse is low.
- As enterprises run 5+ models in production, the Router dimension doubles as cost optimization: quality-aware cascading routes cheap tasks to small models and escalates complex ones.
FAQ
What is the Workload-Router-Pool architecture? WRP (arXiv:2603.21354, vLLM Semantic Router project, MBZUAI/McGill/Mila/UChicago, March 2026) formalizes fleet LLM inference as three interdependent dimensions: Workload (what you serve), Router (how requests are dispatched), and Pool (where inference runs). The key insight is that interdependence: routing policy, workload mix, and pool topology must co-evolve.
Why do agentic workloads break naive round-robin routing? Agentic multi-step requests compound token generation across 10–30 tool calls and are stateful: each iteration shares context with prior ones. Round-robin sends iterations to random instances, causing KV-cache misses. Cache-aware routing keeps sessions on the same instance or shares KV-cache across the pool.
What is disaggregated prefill-decode and when does it help? Separating the compute-intensive prefill phase from the memory-bandwidth-bound decode phase onto dedicated servers allows each to scale independently. It reduces TTFT by 30–40% for long prompts (>4K tokens) but loses KV-cache locality. Use it when prefill latency dominates and prefix reuse is low.
What is the vLLM Semantic Router and how does it relate to WRP? The vLLM Semantic Router (v0.1 Iris, January 2026) is a concrete Router implementation: signal-driven routing via domain, keyword, embedding, and session signals, deployed as an Envoy external processor. WRP is the architectural framework; Semantic Router is one Router implementation within it.
When should a team move from single-server to fleet LLM inference? Move to fleet when P99 latency exceeds 2 seconds under production load, you serve multiple models, your workload mixes short chat and long agentic requests, or you hit GPU memory limits. At 1,000–10,000 req/sec, a fleet with semantic routing and KEDA autoscaling is appropriate. Above 10,000 req/sec, multi-zone federation and cache-aware disaggregation (llm-d, NVIDIA Dynamo) are necessary.
Further reading
- The Workload-Router-Pool Architecture for LLM Inference Optimization — the vision paper this post is built on (vLLM Semantic Router project, March 2026)
- vLLM Semantic Router v0.1 Iris Release — the production Router implementation
- Cache-Aware Disaggregated Inference — Together.ai’s analysis of KV-cache locality in disaggregated serving
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch