KV cache for MoE: the memory wall blocking mixture-of-experts at scale
“MoE sparsifies the computation. The memory bill arrives in full.”
TL;DR
MoE models activate 2-4 experts per token — a 46.7B model (Mixtral) runs like a 12.9B model. But KV cache ignores routing: every token stores full attention state regardless of which experts fired. A 7B-scale MoE with 128K context and 16 experts exceeds 24GB for cache alone. PiKV shards cache across GPUs with adaptive eviction (2.2x throughput, 65% memory reduction). DeepSeek's MLA compresses KV to 70KB/token versus 516KB/token for LLaMA-3.1 405B, roughly 7x smaller. For techniques that apply to all architectures, see context window management.

Why does MoE not help with memory?
Mixture-of-Experts gets its efficiency from sparse activation. Mixtral-8x7B has 46.7 billion total parameters across 8 experts, but each token activates only 2 of them — roughly 12.9 billion parameters. The compute cost looks like a 12.9B dense model. The quality looks like a much larger one.
The KV cache does not participate in this bargain.
During attention computation, every query token must attend to key-value pairs from all prior tokens in the sequence. This happens in the attention layers, which are shared across all experts — they are not part of the sparse routing. The routing decision selects which FFN (feed-forward network) blocks process the token. The attention computation is dense.
This means KV cache grows at the same rate regardless of how many experts are active. For Mixtral-8x7B, the cache requires roughly 0.244 MB per token. At 128K context length, that is 31.2 GB — more than the active parameters (12.9B × 2 bytes ≈ 25.8 GB in FP16). And unlike the weights, the cache grows with every additional token and every concurrent request, so at serving scale it, not the model, becomes the dominant memory consumer.
```mermaid
graph TD
    subgraph "What MoE Sparsifies"
        A[FFN Layers<br/>8 experts, 2 active<br/>75% compute savings]
    end
    subgraph "What Stays Dense"
        B[Attention Layers<br/>Full KV cache<br/>0% memory savings]
        C[Every token attends to<br/>ALL prior KV pairs<br/>regardless of routing]
    end
    A -.->|Independent| B
    B --> D[Memory wall:<br/>KV cache > model weights<br/>at 128K context]
```
Three deployed MoE models, plus a dense baseline for comparison, illustrate the scale:
| Model | Total params | Active params | KV cache per token (FP16) |
|---|---|---|---|
| Mixtral-8x7B | 46.7B | 12.9B | 0.244 MB |
| DeepSeek-V3 (MLA) | 671B | ~37B | 0.070 MB |
| Grok-1 | 314B | ~70-80B | Not published |
| LLaMA-3.1 405B (dense) | 405B | 405B | 0.516 MB |
DeepSeek-V3's number is dramatically lower because it uses MLA — a fundamentally different attention architecture. The others cache full-width keys and values (standard multi-head or grouped-query attention), so per-token cache scales with the number of layers × number of KV heads × head dimension.
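As a sanity check on the table, the per-token arithmetic fits in a few lines. A minimal sketch, assuming the commonly published model shapes (126 layers, 8 KV heads, 128-dim heads for LLaMA-3.1 405B; 61 layers, a 512-dim KV latent plus a 64-dim decoupled RoPE key for DeepSeek-V3) — the function names are mine, not from any library:

```python
# Per-token KV cache arithmetic behind the table (FP16 = 2 bytes per element).
# Model shapes below are the commonly published ones; treat them as approximate.

def kv_bytes_per_token_mha(layers: int, kv_heads: int, head_dim: int) -> int:
    """Conventional attention (MHA/GQA): K and V cached for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * 2  # K+V, 2 bytes each

def kv_bytes_per_token_mla(layers: int, latent_dim: int, rope_dim: int) -> int:
    """MLA: one compressed KV latent plus a small decoupled RoPE key per layer."""
    return layers * (latent_dim + rope_dim) * 2

llama_405b = kv_bytes_per_token_mha(layers=126, kv_heads=8, head_dim=128)
deepseek_v3 = kv_bytes_per_token_mla(layers=61, latent_dim=512, rope_dim=64)

print(f"LLaMA-3.1 405B: {llama_405b / 1e6:.3f} MB/token")   # ~0.516 MB
print(f"DeepSeek-V3:    {deepseek_v3 / 1e6:.3f} MB/token")  # ~0.070 MB
```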
What standard KV cache techniques exist and why are they insufficient?
Three widely deployed optimizations help but do not solve the MoE-specific problem.
PagedAttention (vLLM) breaks KV cache into fixed-size blocks that can be stored anywhere in physical GPU memory, like virtual memory pages in an operating system. This reduces memory waste from 60-80% to under 4% and enables flexible sharing across concurrent requests. vLLM achieves 2-4x throughput improvement from this alone. But PagedAttention does not reduce the total cache size — it just manages what you have more efficiently.
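A minimal sketch of the paging idea (illustrative only, not vLLM's actual data structures; the class and method names are hypothetical): each sequence holds a block table mapping its logical token positions to fixed-size physical blocks, so memory is allocated in small increments and returned to a shared pool when the request finishes.

```python
# Toy block-table allocator illustrating PagedAttention's memory model.
BLOCK_SIZE = 16  # tokens per physical KV block

class PagedKVAllocator:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[str, list[int]] = {}   # seq_id -> physical block ids
        self.seq_lens: dict[str, int] = {}

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a KV slot for one new token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:               # current block full: grab a new one
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = PagedKVAllocator(num_physical_blocks=1024)
for _ in range(40):                                # a 40-token sequence uses ceil(40/16) = 3 blocks
    alloc.append_token("req-1")
print(alloc.block_tables["req-1"])                 # three physical block ids, allocated on demand
```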
Grouped Query Attention (GQA) shares key-value heads across multiple query heads. Instead of each query head having its own K and V, groups of query heads share a reduced set. This directly reduces cache size. Llama 2 70B uses GQA with 8 KV heads shared across 64 query heads — an 8x reduction in KV cache. But GQA is a model architecture decision made at training time, not a serving optimization. You cannot add GQA to a model that was not trained with it.
Multi-Query Attention (MQA) is the extreme version: all query heads share a single KV head. Maximum efficiency, potential quality trade-off. Same limitation as GQA — training-time decision.
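The arithmetic behind GQA and MQA is straightforward: the cache scales with KV heads, not query heads. A short sketch using Llama 2 70B's published shape (80 layers, 64 query heads, 128-dim heads):

```python
# KV cache per token scales with KV heads, not query heads (FP16 = 2 bytes).
def kv_per_token(layers: int, kv_heads: int, head_dim: int) -> int:
    return 2 * layers * kv_heads * head_dim * 2  # K and V, 2 bytes each

layers, q_heads, head_dim = 80, 64, 128          # Llama 2 70B shape
mha = kv_per_token(layers, kv_heads=q_heads, head_dim=head_dim)  # every Q head has its own KV
gqa = kv_per_token(layers, kv_heads=8, head_dim=head_dim)        # 8 Q heads share each KV head
mqa = kv_per_token(layers, kv_heads=1, head_dim=head_dim)        # all Q heads share one KV head

print(f"MHA: {mha / 1e6:.2f} MB/token, "
      f"GQA: {gqa / 1e6:.2f} MB/token ({mha // gqa}x smaller), "
      f"MQA: {mqa / 1e6:.3f} MB/token ({mha // mqa}x smaller)")
```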
None of these address the MoE-specific problem: cache fragmentation across GPUs. When experts are distributed across multiple GPUs (expert parallelism), the KV cache for tokens routed to different experts lives on different devices. A query that needs to attend to prior tokens must gather KV entries from multiple GPUs, incurring cross-device communication overhead. PagedAttention optimizes memory layout on a single device. GQA/MQA reduce cache size. Neither addresses the distributed coordination problem.
How does PiKV solve expert-aware KV management?
PiKV (arXiv 2508.06526, open-source at github.com/NoakLiu/PiKV) is designed specifically for MoE KV cache. It addresses three problems simultaneously.
Expert-sharded storage. Instead of maintaining a globally synchronized KV cache, PiKV partitions the cache across GPUs aligned with expert placement. Tokens are stored on the same device as the expert that processed them. This eliminates the cross-GPU gather that standard attention requires — queries can access their relevant KV entries locally.
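A minimal sketch of the expert-aligned placement idea (illustrative, not PiKV's actual code; the class name and mapping are mine): the router's decision determines not only which expert's FFN runs the token but also which device stores that token's KV entry.

```python
# Sketch of expert-aligned KV placement: a token's KV entry lands on the device
# that hosts the expert it was routed to, so lookups for tokens routed there stay local.
import torch

class ExpertShardedKVCache:
    def __init__(self, num_experts: int, num_devices: int):
        self.expert_to_device = {e: e % num_devices for e in range(num_experts)}
        self.shards = {d: [] for d in range(num_devices)}   # device -> list of (pos, K, V)

    def store(self, pos: int, k: torch.Tensor, v: torch.Tensor, expert_id: int) -> None:
        device = self.expert_to_device[expert_id]
        self.shards[device].append((pos, k, v))             # stored next to its expert

    def local_entries(self, device: int):
        """KV entries a query on this device can read without cross-GPU traffic."""
        return self.shards[device]

cache = ExpertShardedKVCache(num_experts=8, num_devices=4)
for pos in range(16):
    expert = pos % 8                                         # stand-in for the router's output
    cache.store(pos, torch.randn(128), torch.randn(128), expert_id=expert)
print({d: len(entries) for d, entries in cache.shards.items()})   # 4 entries per device
```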
Adaptive eviction. Not all KV entries are equally useful. PiKV tracks access patterns and evicts entries that are rarely queried. For MoE specifically, this means tokens routed to rarely-used experts can have their cache entries pruned more aggressively without quality loss. The eviction policy is query-aware — it retains entries that current queries are likely to need based on routing patterns.
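A hedged sketch of query-aware eviction under a fixed budget (the scoring rule below is a stand-in, not PiKV's published policy): entries accumulate an importance score from how much recent queries attend to them, and the lowest-scoring entries are dropped when the cache exceeds its budget.

```python
# Score-based KV eviction sketch: keep a fixed budget of entries and drop the
# ones recent queries attend to least. The scoring rule is illustrative only.

class EvictingKVCache:
    def __init__(self, budget: int):
        self.budget = budget
        self.scores: dict[int, float] = {}   # token position -> importance score

    def add(self, pos: int, prior: float = 0.1) -> None:
        # new entries start with an optimistic prior so they are not evicted instantly
        self.scores[pos] = prior

    def observe_attention(self, attn_weights: dict[int, float]) -> None:
        """Fold one query's attention distribution into the running importance scores."""
        for pos, weight in attn_weights.items():
            if pos in self.scores:
                self.scores[pos] = 0.9 * self.scores[pos] + 0.1 * weight

    def evict_to_budget(self) -> list[int]:
        """Drop the least-attended entries until the cache fits its budget."""
        evicted = []
        while len(self.scores) > self.budget:
            victim = min(self.scores, key=self.scores.get)
            del self.scores[victim]
            evicted.append(victim)
        return evicted

cache = EvictingKVCache(budget=4)
for pos in range(6):
    cache.add(pos)
    # toy attention: every query attends heavily to position 0, weakly elsewhere
    cache.observe_attention({p: (0.5 if p == 0 else 0.1) for p in range(pos + 1)})
    cache.evict_to_budget()
print(sorted(cache.scores))   # position 0 survives; the least-attended entries were evicted
```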
Integrated compression. PiKV supports multiple compression methods within the same framework: LoRA compression, SVD (singular value decomposition), quantization, and FastV compression. The quantized variant achieves the best throughput-memory trade-off: 2.2x faster inference with 65% memory reduction. Across architectures, PiKV reports 2.3-3.1x throughput gains and 2.8-3.5x memory reductions.
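Of those compression options, quantization is the simplest to illustrate. A hedged sketch of per-token symmetric int8 quantization of a KV vector, storing the scale alongside for dequantization at attention time (PiKV's actual quantizer may differ):

```python
# Per-token symmetric int8 quantization of a KV vector: half the footprint of FP16,
# at the cost of a small reconstruction error. Illustrative, not PiKV's quantizer.
import torch

def quantize_kv(x: torch.Tensor) -> tuple[torch.Tensor, float]:
    """One scale per KV vector; values mapped to the int8 range."""
    scale = max(x.abs().max().item() / 127.0, 1e-8)
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: float) -> torch.Tensor:
    return q.float() * scale

k = torch.randn(128)                      # one head's key vector for one token
q8, scale = quantize_kv(k)
k_hat = dequantize_kv(q8, scale)
print(f"stored: {q8.numel()} bytes (int8) vs {k.numel() * 2} bytes (FP16), "
      f"max abs error: {(k - k_hat).abs().max().item():.4f}")
```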
The combination matters. Expert-sharding alone creates fragmentation. Eviction alone loses information. Compression alone adds latency. PiKV coordinates all three — sharding provides locality, eviction manages capacity, compression reduces the remaining footprint.
What is DeepSeek’s MLA and why does it cut cache by 93%?
Multi-Head Latent Attention (MLA) takes a radically different approach: instead of optimizing how you store full KV tensors, change what you store.
Standard multi-head attention caches K and V tensors at their full width — num_heads × head_dim, i.e. the hidden size, per tensor per layer. For a model with hidden size 4,096 (32 heads of 128 dimensions), each token's KV entry is 4,096 × 2 (K+V) × 2 bytes (FP16) = 16 KB per layer. Across 60+ layers, this compounds quickly.
MLA projects K and V through learned low-rank matrices before caching. A 4,096-dimensional vector becomes a 512-dimensional latent representation. The cache stores this compressed version. At attention time, the latent is decompressed back to full dimensionality for the computation. The compression is lossy but the projection matrices are trained jointly with the model, so the information loss is minimized for the model’s specific task distribution.
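A minimal sketch of that caching pattern, assuming DeepSeek-style dimensions (4,096 hidden, 512-dim latent) and omitting the decoupled RoPE key and multi-head structure of the real architecture; the function names are mine:

```python
# MLA caching pattern, heavily simplified: cache a low-rank latent per token and
# reconstruct K/V from it when attention runs.
import torch
import torch.nn as nn

hidden, latent = 4096, 512                    # DeepSeek-style dimensions

down_proj = nn.Linear(hidden, latent, bias=False)   # in a real model, trained jointly
k_up_proj = nn.Linear(latent, hidden, bias=False)
v_up_proj = nn.Linear(latent, hidden, bias=False)

kv_latent_cache: list[torch.Tensor] = []      # this is all that lives in GPU memory

def cache_token(h: torch.Tensor) -> None:
    """Store only the 512-dim latent, not the full 2 x 4096 K/V pair."""
    kv_latent_cache.append(down_proj(h))

def keys_values_for_attention() -> tuple[torch.Tensor, torch.Tensor]:
    """Reconstruct full-width K and V from the cached latents on the fly."""
    latents = torch.stack(kv_latent_cache)    # (seq_len, latent)
    return k_up_proj(latents), v_up_proj(latents)

with torch.no_grad():
    for _ in range(10):
        cache_token(torch.randn(hidden))
    K, V = keys_values_for_attention()

print(K.shape, V.shape)                       # torch.Size([10, 4096]) twice
print(f"cached values per token: {kv_latent_cache[0].numel()} vs {2 * hidden} for full K+V")
```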
The result for DeepSeek-V3: 70 KB per token versus 516 KB for LLaMA-3.1 405B — roughly 7x smaller (DeepSeek's own comparison against an equivalent full-MHA cache puts the reduction at about 93%). At 128K context, that is 8.7 GB for DeepSeek-V3 versus 64.2 GB for LLaMA 405B. The difference between fitting on one GPU and needing four.
The trade-off: MLA requires training with the latent attention architecture from scratch. You cannot retrofit it onto an existing model. And MLA is incompatible with standard rotary position embeddings (RoPE) without additional engineering — DeepSeek solved this with a decoupled RoPE variant that carries position information in a separate small key, but it adds implementation complexity.
What should practitioners deploying MoE models do?
If you are serving Mixtral or similar standard-attention MoE models:
- Start with vLLM + PagedAttention for baseline efficiency (a minimal configuration sketch follows this list).
- Add PiKV for expert-aware cache management (2-3x additional throughput).
- Enable KV cache quantization (NVFP4 on NVIDIA hardware) for another 2-4x memory reduction.
- Profile cross-GPU communication — if KV gather is the bottleneck, expert-aligned sharding (PiKV’s approach) provides the largest improvement.
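A minimal serving sketch for a Mixtral-class model using vLLM's offline `LLM` API. Argument names may change across vLLM versions; FP8 is shown as the broadly supported KV quantization option, since NVFP4 availability depends on hardware and version, and PiKV is omitted because its integration API is out of scope here.

```python
# Minimal vLLM offline-serving sketch for a Mixtral-class MoE model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,        # split weights (and KV cache) across 2 GPUs
    max_model_len=32768,           # cap context length to bound KV cache growth
    kv_cache_dtype="fp8",          # quantized KV cache; halves cache memory vs FP16
    gpu_memory_utilization=0.90,   # fraction of VRAM reserved for weights + cache
)

outputs = llm.generate(
    ["Summarize why MoE models still pay full price for KV cache."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```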
If you are choosing a model to deploy:
- DeepSeek-V3 with MLA gives you a roughly 93% KV cache reduction (relative to full MHA) built into the architecture.
- Mixtral with standard MHA gives you ecosystem maturity (vLLM, TGI, full tooling support) but requires external cache optimization.
If you are training a new MoE model:
- Evaluate MLA or GQA at architecture selection time. These are training-time decisions that cannot be retrofitted. The serving cost savings over the model’s lifetime will likely exceed the additional training engineering cost.
Key takeaways
- MoE sparsifies compute, not memory. Expert routing saves FFN computation; the KV cache remains dense. At 128K context, the cache exceeds the memory footprint of the active parameters.
- Standard optimizations are necessary but insufficient. PagedAttention reduces fragmentation. GQA/MQA reduce size. Neither addresses MoE-specific expert-shard coordination.
- PiKV solves the coordination problem. Expert-aligned sharding + adaptive eviction + compression = 2.2x throughput, 65% memory reduction. Open-source.
- MLA is the architectural solution. DeepSeek’s 93% cache reduction changes the economics fundamentally, but requires training from scratch.
- Choose your battle at architecture selection time. MLA vs standard MHA is a training decision with permanent serving consequences.
Further reading
- Context window management — general KV cache optimization for agents
- 1-bit LLMs on consumer hardware — another approach to memory reduction through extreme quantization
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch