The Workload-Router-Pool: how vLLM thinks about fleet inference
TL;DR: The March 2026 vision paper “The Workload-Router-Pool Architecture for LLM Inference Optimization” (arXiv:2603.21354, vLLM Semantic Router project, MBZUAI/McGill/Mila/UChicago) formalizes fleet LLM inference as three interdependent dimensions: Workload (what you serve), Router (how requests are dispatched), and Pool (where inference runs). Agentic tasks generate far more tokens per request than single-turn chat (a full coding loop can produce 2,000 tokens where a chat reply produces 20), break KV-cache locality under round-robin routing, and require stateful session awareness that stateless load balancing cannot provide. The WRP framework shows why routing policy, workload mix, and pool topology must co-evolve; optimizing one constrains the others.

Most teams start with a single vLLM instance. Traffic grows. They add a second instance and route with round-robin. Traffic grows more. They add a load balancer, then a model router, then a KV-cache pool, each addition patching a new symptom. The system works. Barely, expensively, with a growing list of exceptions.
The March 2026 paper “The Workload-Router-Pool Architecture for LLM Inference Optimization” (arXiv:2603.21354, vLLM Semantic Router project) names the three dimensions that must co-evolve to get from patched single-server to coherent fleet.
The three dimensions
Workload characterizes what the fleet actually serves. Not all LLM requests are alike:
- Chat vs. agentic: a single-turn question takes one LLM call; an agentic loop making 10–30 tool calls takes 10–30 calls
- Prefill-heavy vs. decode-heavy: a request with a 32K-token context but a short answer stresses the GPU differently than a short prompt generating a long response
- Warm vs. cold start: a request that shares a prefix with recent requests can reuse KV-cache; one with a novel prefix cannot
- Single-turn vs. multi-turn: stateless requests can go anywhere; multi-turn sessions benefit from instance affinity
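This characterization maps naturally onto a small per-request profile. A minimal sketch in Python; the field and function names are assumptions for illustration, not anything from the paper or from vLLM's API:

```python
from dataclasses import dataclass

@dataclass
class RequestProfile:
    """Hypothetical per-request features a Workload-aware router might track."""
    prompt_tokens: int           # prefill cost driver
    expected_output_tokens: int  # decode cost driver
    session_id: str | None       # None => stateless single-turn
    prefix_hash: str | None      # hash of a shared system prompt / context, if any
    is_agentic: bool             # part of a multi-step tool-calling loop

def classify(req: RequestProfile) -> dict:
    """Tag a request along the four Workload axes listed above."""
    return {
        "interaction": "agentic" if req.is_agentic else "chat",
        "phase_bias": "prefill-heavy"
                      if req.prompt_tokens > 4 * max(req.expected_output_tokens, 1)
                      else "decode-heavy",
        "cache": "warm" if req.prefix_hash else "cold",
        "state": "multi-turn" if req.session_id else "single-turn",
    }
```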
Router determines how requests are dispatched across the pool. The paper identifies four routing strategies:
| Strategy | Mechanism | When to use |
|---|---|---|
| Static semantic rules | Domain/keyword classification | Stable workload mix, few models |
| Online bandit | Multi-armed bandit learning from latency feedback | Dynamic workload, cost/latency trade-off |
| RL-based model selection | Learned policy over model quality and cost | Complex multi-objective routing |
| Quality-aware cascading | Route to small model first, escalate if confidence low | Cost reduction with quality floor |
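To make one row of the table concrete, the online-bandit strategy can be prototyped as an epsilon-greedy policy that learns per-model latency from feedback. This is a minimal sketch under assumed model names and a single-objective reward, not the paper's implementation:

```python
import random
from collections import defaultdict

class EpsilonGreedyRouter:
    """Epsilon-greedy bandit over candidate models, rewarding low latency."""
    def __init__(self, models: list[str], epsilon: float = 0.1):
        self.models = models
        self.epsilon = epsilon
        self.count = defaultdict(int)
        self.avg_latency = defaultdict(float)

    def pick(self) -> str:
        # Explore occasionally (or when we have no data yet), otherwise exploit.
        if random.random() < self.epsilon or not self.count:
            return random.choice(self.models)
        return min(self.models, key=lambda m: self.avg_latency.get(m, float("inf")))

    def record(self, model: str, latency_s: float) -> None:
        """Feed observed latency back into a running average."""
        self.count[model] += 1
        self.avg_latency[model] += (latency_s - self.avg_latency[model]) / self.count[model]

# Hypothetical usage with made-up model names.
router = EpsilonGreedyRouter(["small-model", "large-model"])
model = router.pick()
# ... serve the request, measure latency ...
router.record(model, latency_s=0.42)
```

In practice the reward would blend latency, cost, and a quality signal; the single-objective version above only shows the feedback loop.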
Pool defines where inference runs:
- Homogeneous vs. heterogeneous GPU clusters (mixing H100s, A100s, consumer GPUs)
- Disaggregated prefill/decode: separate servers for the two inference phases
- KV-cache topology: per-instance, shared pool, or hierarchical offload
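A Pool description can be as small as a few declarative fields. A sketch with hypothetical names, just to make the choices above concrete:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class PoolConfig:
    """Hypothetical Pool description covering the three choices above."""
    gpu_mix: dict[str, int]       # e.g. {"H100": 8, "A100": 16}; one entry => homogeneous
    disaggregated: bool           # separate prefill and decode servers
    kv_topology: Literal["per-instance", "shared", "hierarchical"]

pool = PoolConfig(gpu_mix={"H100": 8, "A100": 16}, disaggregated=True, kv_topology="shared")
```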
The key architectural insight: these three dimensions are not independent. Your pool topology constrains which routing strategies are viable. Your routing strategy depends on your workload mix. You cannot design any one dimension in isolation.
Why agentic workloads are the stress test
Single-turn chat requests are stateless and roughly uniform in cost. An agentic loop is neither.
A user asking a question generates one LLM call. An agent solving a coding task generates one planning call, 5–10 tool calls, one synthesis call, and possibly a self-correction loop. The WRP paper documents this directly: a coding request may produce 200 or 2,000 tokens depending on task complexity, and an agentic loop compounds that across 10–30 tool calls. At scale, this is a fleet sizing problem disguised as an application feature.
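A back-of-envelope calculation makes the fleet-sizing point concrete. The numbers below are illustrative assumptions, not figures from the paper:

```python
# Rough token volume per request type (illustrative assumptions).
chat_tokens_per_request = 200                 # short prompt + short answer
agentic_calls = 20                            # planning + tool calls + synthesis
agentic_tokens_per_call = 1_000               # growing context re-sent each iteration
agentic_tokens_per_request = agentic_calls * agentic_tokens_per_call

# A modest shift in workload mix dominates fleet capacity.
requests_per_second = 500
agentic_share = 0.2
tokens_per_second = requests_per_second * (
    agentic_share * agentic_tokens_per_request
    + (1 - agentic_share) * chat_tokens_per_request
)
print(f"{tokens_per_second:,.0f} tokens/sec")  # ~2.08M here vs ~100K if purely chat
```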
The KV-cache problem is subtler. In a single-turn deployment, every request starts fresh with no KV-cache reuse expected. In an agentic loop, iterations 2 through 30 share the same growing context. If iteration 1 lands on server A and iteration 2 lands on server B (because round-robin doesn’t know they’re related), server B has no context to reuse. It processes the full prompt from scratch, burning compute and latency that should have been zero.
```mermaid
sequenceDiagram
    participant Client
    participant Router
    participant A as Server A
    participant B as Server B
    Note over Router: Round-robin (naive)
    Client->>Router: Agentic iteration 1
    Router->>A: Dispatch (builds KV-cache)
    Client->>Router: Agentic iteration 2 (same session)
    Router->>B: Dispatch (KV-cache miss!)
    B-->>Client: Recomputes full context
    Note over Router: Cache-aware routing
    Client->>Router: Agentic iteration 1
    Router->>A: Dispatch (builds KV-cache)
    Client->>Router: Agentic iteration 2 (same session)
    Router->>A: Dispatch (KV-cache hit, free!)
    A-->>Client: Token generation only
```
The vLLM Semantic Router (v0.1 Iris, January 2026) addresses this in the Router dimension. Signal-driven routing (using session ID, domain classification, and prefix hash) dispatches related iterations to the same instance or to instances sharing a KV-cache pool. In benchmarks (arXiv:2510.08731), this approach achieved a 10.24 percentage point accuracy improvement on MMLU-Pro and a 47.1% latency reduction versus naive routing, with domain-specific peaks exceeding 20pp on business/economics tasks.
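A cache-aware dispatch policy of the kind the diagram shows can be approximated with consistent hashing on a session or prefix key. A minimal sketch under assumed names; the actual Semantic Router uses richer signals (domain classification, embeddings, load) than this:

```python
import hashlib

class AffinityRouter:
    """Pin requests carrying the same session/prefix key to the same backend."""
    def __init__(self, backends: list[str]):
        self.backends = backends

    def route(self, session_id: str | None, prefix_hash: str | None) -> str:
        key = session_id or prefix_hash
        if key is None:
            # Stateless request: any backend works; a real router would pick least-loaded.
            return self.backends[0]
        digest = hashlib.sha256(key.encode()).hexdigest()
        return self.backends[int(digest, 16) % len(self.backends)]

router = AffinityRouter(["server-a:8000", "server-b:8000"])
# Iterations of the same agentic session always land on the same backend.
assert router.route("sess-42", None) == router.route("sess-42", None)
```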
Many enterprises now run 5+ models in production. At that scale, routing decisions happen thousands of times per second and each bad decision has cascading costs.
The disaggregation question
Disaggregated prefill-decode is the Pool dimension’s most significant architectural choice. The argument:
LLM inference has two distinct phases with incompatible resource profiles. Prefill (processing the input prompt in parallel) is compute-bound: essentially matrix multiplication that scales well with more compute. Decode (generating tokens one at a time, autoregressively) is memory-bandwidth bound, since each token generation reads the full KV-cache and the model weights.
Running both phases on the same server forces each to compete for the same resources. Separating them onto dedicated servers allows each to be optimized and scaled independently.
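The compute-bound vs. bandwidth-bound split shows up in a rough arithmetic-intensity estimate. Parameter values below are illustrative assumptions for a generic dense model, not measurements:

```python
# Illustrative roofline-style estimate for a generic ~70B dense model (assumed values).
params = 70e9
bytes_per_param = 2                 # fp16/bf16 weights
flops_per_token = 2 * params        # ~2 FLOPs per parameter per token

# Prefill: thousands of prompt tokens amortize each weight read -> compute-bound.
prefill_tokens = 4096
prefill_intensity = (prefill_tokens * flops_per_token) / (params * bytes_per_param)

# Decode: one token per step still reads every weight (KV-cache traffic ignored here)
# -> memory-bandwidth bound.
decode_intensity = flops_per_token / (params * bytes_per_param)

print(f"prefill ~{prefill_intensity:,.0f} FLOPs/byte, decode ~{decode_intensity:.0f} FLOPs/byte")
```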
The gains are real: Splitwise/DistServe benchmarks report a 4.48x throughput improvement for long prompts (>4K tokens). Both llm-d and NVIDIA Dynamo use disaggregation for this reason.
The cost is KV-cache locality. When prefill and decode run on separate servers, the KV-cache generated during prefill must transfer to the decode server. If your workload has high prefix reuse (many requests sharing the same system prompt or context), that transfer undoes the gains. Together.ai’s cache-aware disaggregated inference paper documents 40% throughput gain versus naive disaggregation when prefix reuse is accounted for in routing.
The decision tree:
```
Prefill latency (TTFT) dominant?
├─ Yes + prompt tokens > 4K?
│   └─ Disaggregate. Expected: 4.48x throughput improvement (Splitwise/DistServe).
│       But: measure KV-cache hit rate first.
│       If hit rate > 50%, shared pool may win.
└─ No: decode latency (TPOT) dominant?
    └─ Scale decode replicas horizontally.
        Keep prefill co-located for cache locality.
```
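The same tree, written as a function. The thresholds come from the text above; the names and the hit-rate input are assumptions:

```python
def pool_recommendation(ttft_dominant: bool, prompt_tokens: int, kv_hit_rate: float) -> str:
    """Encode the prefill/decode disaggregation decision tree above."""
    if ttft_dominant and prompt_tokens > 4_000:
        if kv_hit_rate > 0.5:
            return "shared KV pool (high prefix reuse erodes disaggregation gains)"
        return "disaggregate prefill/decode"
    return "scale decode replicas horizontally; keep prefill co-located"

print(pool_recommendation(ttft_dominant=True, prompt_tokens=32_000, kv_hit_rate=0.2))
```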
The fleet scaling decision
vLLM’s own throughput numbers (793 tokens/sec versus Ollama’s 41 tokens/sec, P99 latency of 80ms versus 673ms) establish why it became the default inference engine. But these numbers assume a single server under controlled load.
Fleet scaling introduces new problems the single-server numbers don’t address:
Session affinity. Multi-turn and agentic requests need instance affinity. Stateless load balancing breaks them. You need either sticky sessions (which limit load balancing flexibility) or shared KV-cache (which adds infrastructure complexity).
Workload heterogeneity. A fleet serving both short chat requests and long agentic tasks has two different latency SLAs, two different cost profiles, and two different optimal batching strategies. A single homogeneous pool optimized for one breaks the other.
Model multiplicity. When enterprises run multiple models in production, the Router dimension becomes an economic decision: small models for simple tasks, large models for complex ones, specialized models for domain-specific work. The vLLM Semantic Router’s quality-aware cascading (route to small model first, escalate on low confidence) can reduce costs significantly without quality loss.
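A minimal cascade can be written against any OpenAI-compatible endpoint (vLLM exposes one). The confidence heuristic, endpoint URL, and model names below are assumptions for illustration, not the Semantic Router's actual policy:

```python
import math
from openai import OpenAI

# Assumed local vLLM OpenAI-compatible endpoint and placeholder model names.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def cascade(prompt: str, small: str = "small-model", large: str = "large-model") -> str:
    """Try the small model first; escalate if its answer looks low-confidence."""
    resp = client.chat.completions.create(
        model=small,
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
    )
    choice = resp.choices[0]
    tokens = choice.logprobs.content or []
    # Crude confidence proxy: mean token probability of the small model's answer.
    probs = [math.exp(t.logprob) for t in tokens]
    if probs and sum(probs) / len(probs) >= 0.8:
        return choice.message.content
    resp = client.chat.completions.create(
        model=large, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```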
For teams at different scales, the practical guidance from the WRP paper:
| Scale | Architecture | Key tool |
|---|---|---|
| < 1,000 req/sec | Single vLLM instance, maximize batch size | vLLM PagedAttention |
| 1,000–10,000 req/sec | Fleet with semantic routing, KEDA autoscaling | vLLM Semantic Router |
| > 10,000 req/sec | Multi-zone, disaggregated prefill/decode, shared KV pool | llm-d + NVIDIA Dynamo |
For the routing layer specifically, see vLLM’s semantic router, which covers the classification algorithms within the Router dimension. The WRP paper extends that to show how the Router design is constrained by the Workload profile and must co-evolve with Pool topology.
Key takeaways
- The WRP framework (arXiv:2603.21354) names three interdependent fleet inference dimensions: Workload, Router, Pool. Optimizing any one without the others produces a locally optimal, globally broken system.
- Agentic multi-step tasks compound token generation across 10–30 tool calls (WRP paper) and require stateful KV-cache routing, not round-robin.
- Cache-aware routing (arXiv:2510.08731) achieved 10.24pp accuracy improvement on MMLU-Pro and 47.1% latency reduction versus naive routing.
- Disaggregated prefill-decode reduces TTFT by 30–40% for long prompts but loses KV-cache locality. Use it when prefill dominates and prefix reuse is low.
- As enterprises run 5+ models in production, the Router dimension doubles as cost optimization: quality-aware cascading routes cheap tasks to small models and escalates complex ones.
FAQ
What is the Workload-Router-Pool architecture? WRP (arXiv:2603.21354, vLLM Semantic Router project, MBZUAI/McGill/Mila/UChicago, March 2026) formalizes fleet LLM inference as three interdependent dimensions: Workload (what you serve), Router (how requests are dispatched), and Pool (where inference runs). The key insight is that interdependence: routing policy, workload mix, and pool topology must co-evolve.
Why do agentic workloads break naive round-robin routing? Agentic multi-step requests compound token generation across 10–30 tool calls and are stateful: each iteration shares context with prior ones. Round-robin sends iterations to random instances, causing KV-cache misses. Cache-aware routing keeps sessions on the same instance or shares KV-cache across the pool.
What is disaggregated prefill-decode and when does it help? Separating the compute-intensive prefill phase from the memory-bandwidth-bound decode phase onto dedicated servers allows each to scale independently. It reduces TTFT by 30–40% for long prompts (>4K tokens) but loses KV-cache locality. Use it when prefill latency dominates and prefix reuse is low.
What is the vLLM Semantic Router and how does it relate to WRP? The vLLM Semantic Router (v0.1 Iris, January 2026) is a concrete Router implementation: signal-driven routing via domain, keyword, embedding, and session signals, deployed as an Envoy external processor. WRP is the architectural framework; Semantic Router is one Router implementation within it.
When should a team move from single-server to fleet LLM inference? Move to fleet when P99 latency exceeds 2 seconds under production load, you serve multiple models, your workload mixes short chat and long agentic requests, or you hit GPU memory limits. At 1,000–10,000 req/sec, a fleet with semantic routing and KEDA autoscaling is appropriate. Above 10,000 req/sec, multi-zone federation and cache-aware disaggregation (llm-d, NVIDIA Dynamo) are necessary.
Further reading
- The Workload-Router-Pool Architecture for LLM Inference Optimization — the vision paper this post is built on (vLLM Semantic Router project, March 2026)
- vLLM Semantic Router v0.1 Iris Release — the production Router implementation
- Cache-Aware Disaggregated Inference — Together.ai’s analysis of KV-cache locality in disaggregated serving
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch