KV cache for MoE: the memory wall blocking mixture-of-experts at scale
“MoE sparsifies the computation. The memory bill arrives in full.”
TL;DR
MoE models activate 2-4 experts per token — a 46.7B model (Mixtral) runs like a 12.9B model. But KV cache ignores routing: every token stores full attention state regardless of which experts fired. A 7B-scale MoE with 128K context and 16 experts exceeds 24GB for cache alone. PiKV shards cache across GPUs with adaptive eviction (2.2x throughput, 65% memory reduction). DeepSeek's MLA compresses KV to 70KB/token versus 516KB/token for LLaMA-3.1 405B, roughly 7x smaller. For techniques that apply to all architectures, see context window management.

Why does MoE not help with memory?
Mixture-of-Experts gets its efficiency from sparse activation. Mixtral-8x7B has 46.7 billion total parameters across 8 experts, but each token activates only 2 of them — roughly 12.9 billion parameters. The compute cost looks like a 12.9B dense model. The quality looks like a much larger one.
The KV cache does not participate in this bargain.
During attention computation, every query token must attend to key-value pairs from all prior tokens in the sequence. This happens in the attention layers, which are shared across all experts — they are not part of the sparse routing. The routing decision selects which FFN (feed-forward network) blocks process the token. The attention computation is dense.
This means KV cache grows at the same rate regardless of how many experts are active. For Mixtral-8x7B, the cache requires roughly 0.244 MB per token. At 128K context length, that is 31.2 GB — more than the active parameters (12.9B × 2 bytes ≈ 25.8 GB in FP16). And unlike the weights, the cache grows with every additional token and every concurrent request, so at serving scale it, not the model, becomes the dominant memory consumer.
```mermaid
graph TD
    subgraph "What MoE Sparsifies"
        A[FFN Layers<br/>8 experts, 2 active<br/>75% compute savings]
    end
    subgraph "What Stays Dense"
        B[Attention Layers<br/>Full KV cache<br/>0% memory savings]
        C[Every token attends to<br/>ALL prior KV pairs<br/>regardless of routing]
    end
    A -.->|Independent| B
    B --> D[Memory wall:<br/>KV cache > model weights<br/>at 128K context]
```
Three deployed MoE models, plus a dense baseline for comparison, illustrate the scale:
| Model | Total params | Active params | KV cache per token (FP16) |
|---|---|---|---|
| Mixtral-8x7B | 46.7B | 12.9B | 0.244 MB |
| DeepSeek-V3 (MLA) | 671B | ~37B | 0.070 MB |
| Grok-1 | 314B | ~70-80B | Not published |
| LLaMA-3.1 405B (dense) | 405B | 405B | 0.516 MB |
DeepSeek-V3's number is dramatically lower because it uses MLA — a fundamentally different attention architecture. The others cache full-width keys and values (standard multi-head or grouped-query attention), so per-token cache scales with the number of layers × number of KV heads × head dimension.
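As a sanity check on the table, the per-token arithmetic fits in a few lines. A minimal sketch, assuming the commonly published model shapes (126 layers, 8 KV heads, 128-dim heads for LLaMA-3.1 405B; 61 layers, a 512-dim KV latent plus a 64-dim decoupled RoPE key for DeepSeek-V3) — the function names are mine, not from any library:

```python
# Per-token KV cache arithmetic behind the table (FP16 = 2 bytes per element).
# Model shapes below are the commonly published ones; treat them as approximate.

def kv_bytes_per_token_mha(layers: int, kv_heads: int, head_dim: int) -> int:
    """Conventional attention (MHA/GQA): K and V cached for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * 2  # K+V, 2 bytes each

def kv_bytes_per_token_mla(layers: int, latent_dim: int, rope_dim: int) -> int:
    """MLA: one compressed KV latent plus a small decoupled RoPE key per layer."""
    return layers * (latent_dim + rope_dim) * 2

llama_405b = kv_bytes_per_token_mha(layers=126, kv_heads=8, head_dim=128)
deepseek_v3 = kv_bytes_per_token_mla(layers=61, latent_dim=512, rope_dim=64)

print(f"LLaMA-3.1 405B: {llama_405b / 1e6:.3f} MB/token")   # ~0.516 MB
print(f"DeepSeek-V3:    {deepseek_v3 / 1e6:.3f} MB/token")  # ~0.070 MB
```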
What standard KV cache techniques exist and why are they insufficient?
Three widely deployed optimizations help but do not solve the MoE-specific problem.
PagedAttention (vLLM) breaks KV cache into fixed-size blocks that can be stored anywhere in physical GPU memory, like virtual memory pages in an operating system. This reduces memory waste from 60-80% to under 4% and enables flexible sharing across concurrent requests. vLLM achieves 2-4x throughput improvement from this alone. But PagedAttention does not reduce the total cache size — it just manages what you have more efficiently.
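A minimal sketch of the paging idea (illustrative only, not vLLM's actual data structures; the class and method names are hypothetical): each sequence holds a block table mapping its logical token positions to fixed-size physical blocks, so memory is allocated in small increments and returned to a shared pool when the request finishes.

```python
# Toy block-table allocator illustrating PagedAttention's memory model.
BLOCK_SIZE = 16  # tokens per physical KV block

class PagedKVAllocator:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[str, list[int]] = {}   # seq_id -> physical block ids
        self.seq_lens: dict[str, int] = {}

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a KV slot for one new token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:               # current block full: grab a new one
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = PagedKVAllocator(num_physical_blocks=1024)
for _ in range(40):                                # a 40-token sequence uses ceil(40/16) = 3 blocks
    alloc.append_token("req-1")
print(alloc.block_tables["req-1"])                 # three physical block ids, allocated on demand
```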
Grouped Query Attention (GQA) shares key-value heads across multiple query heads. Instead of each query head having its own K and V, groups of query heads share a reduced set. This directly reduces cache size. Llama 2 70B uses GQA with 8 KV heads shared across 64 query heads — an 8x reduction in KV cache. But GQA is a model architecture decision made at training time, not a serving optimization. You cannot add GQA to a model that was not trained with it.
Multi-Query Attention (MQA) is the extreme version: all query heads share a single KV head. Maximum efficiency, potential quality trade-off. Same limitation as GQA — training-time decision.
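The arithmetic behind GQA and MQA is straightforward: the cache scales with KV heads, not query heads. A short sketch using Llama 2 70B's published shape (80 layers, 64 query heads, 128-dim heads):

```python
# KV cache per token scales with KV heads, not query heads (FP16 = 2 bytes).
def kv_per_token(layers: int, kv_heads: int, head_dim: int) -> int:
    return 2 * layers * kv_heads * head_dim * 2  # K and V, 2 bytes each

layers, q_heads, head_dim = 80, 64, 128          # Llama 2 70B shape
mha = kv_per_token(layers, kv_heads=q_heads, head_dim=head_dim)  # every Q head has its own KV
gqa = kv_per_token(layers, kv_heads=8, head_dim=head_dim)        # 8 Q heads share each KV head
mqa = kv_per_token(layers, kv_heads=1, head_dim=head_dim)        # all Q heads share one KV head

print(f"MHA: {mha / 1e6:.2f} MB/token, "
      f"GQA: {gqa / 1e6:.2f} MB/token ({mha // gqa}x smaller), "
      f"MQA: {mqa / 1e6:.3f} MB/token ({mha // mqa}x smaller)")
```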
None of these address the MoE-specific problem: cache fragmentation across GPUs. When experts are distributed across multiple GPUs (expert parallelism), the KV cache for tokens routed to different experts lives on different devices. A query that needs to attend to prior tokens must gather KV entries from multiple GPUs, incurring cross-device communication overhead. PagedAttention optimizes memory layout on a single device. GQA/MQA reduce cache size. Neither addresses the distributed coordination problem.
How does PiKV solve expert-aware KV management?
PiKV (arXiv 2508.06526, open-source at github.com/NoakLiu/PiKV) is designed specifically for MoE KV cache. It addresses three problems simultaneously.
Expert-sharded storage. Instead of maintaining a globally synchronized KV cache, PiKV partitions the cache across GPUs aligned with expert placement. Tokens are stored on the same device as the expert that processed them. This eliminates the cross-GPU gather that standard attention requires — queries can access their relevant KV entries locally.
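A minimal sketch of the expert-aligned placement idea (illustrative, not PiKV's actual code; the class name and mapping are mine): the router's decision determines not only which expert's FFN runs the token but also which device stores that token's KV entry.

```python
# Sketch of expert-aligned KV placement: a token's KV entry lands on the device
# that hosts the expert it was routed to, so lookups for tokens routed there stay local.
import torch

class ExpertShardedKVCache:
    def __init__(self, num_experts: int, num_devices: int):
        self.expert_to_device = {e: e % num_devices for e in range(num_experts)}
        self.shards = {d: [] for d in range(num_devices)}   # device -> list of (pos, K, V)

    def store(self, pos: int, k: torch.Tensor, v: torch.Tensor, expert_id: int) -> None:
        device = self.expert_to_device[expert_id]
        self.shards[device].append((pos, k, v))             # stored next to its expert

    def local_entries(self, device: int):
        """KV entries a query on this device can read without cross-GPU traffic."""
        return self.shards[device]

cache = ExpertShardedKVCache(num_experts=8, num_devices=4)
for pos in range(16):
    expert = pos % 8                                         # stand-in for the router's output
    cache.store(pos, torch.randn(128), torch.randn(128), expert_id=expert)
print({d: len(entries) for d, entries in cache.shards.items()})   # 4 entries per device
```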
Adaptive eviction. Not all KV entries are equally useful. PiKV tracks access patterns and evicts entries that are rarely queried. For MoE specifically, this means tokens routed to rarely-used experts can have their cache entries pruned more aggressively without quality loss. The eviction policy is query-aware — it retains entries that current queries are likely to need based on routing patterns.
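A hedged sketch of query-aware eviction under a fixed budget (the scoring rule below is a stand-in, not PiKV's published policy): entries accumulate an importance score from how much recent queries attend to them, and the lowest-scoring entries are dropped when the cache exceeds its budget.

```python
# Score-based KV eviction sketch: keep a fixed budget of entries and drop the
# ones recent queries attend to least. The scoring rule is illustrative only.

class EvictingKVCache:
    def __init__(self, budget: int):
        self.budget = budget
        self.scores: dict[int, float] = {}   # token position -> importance score

    def add(self, pos: int, prior: float = 0.1) -> None:
        # new entries start with an optimistic prior so they are not evicted instantly
        self.scores[pos] = prior

    def observe_attention(self, attn_weights: dict[int, float]) -> None:
        """Fold one query's attention distribution into the running importance scores."""
        for pos, weight in attn_weights.items():
            if pos in self.scores:
                self.scores[pos] = 0.9 * self.scores[pos] + 0.1 * weight

    def evict_to_budget(self) -> list[int]:
        """Drop the least-attended entries until the cache fits its budget."""
        evicted = []
        while len(self.scores) > self.budget:
            victim = min(self.scores, key=self.scores.get)
            del self.scores[victim]
            evicted.append(victim)
        return evicted

cache = EvictingKVCache(budget=4)
for pos in range(6):
    cache.add(pos)
    # toy attention: every query attends heavily to position 0, weakly elsewhere
    cache.observe_attention({p: (0.5 if p == 0 else 0.1) for p in range(pos + 1)})
    cache.evict_to_budget()
print(sorted(cache.scores))   # position 0 survives; the least-attended entries were evicted
```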
Integrated compression. PiKV supports multiple compression methods within the same framework: LoRA compression, SVD (singular value decomposition), quantization, and FastV compression. The quantized variant achieves the best throughput-memory trade-off: 2.2x faster inference with 65% memory reduction. Across architectures, PiKV reports 2.3-3.1x throughput gains and 2.8-3.5x memory reductions.
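Of those compression options, quantization is the simplest to illustrate. A hedged sketch of per-token symmetric int8 quantization of a KV vector, storing the scale alongside for dequantization at attention time (PiKV's actual quantizer may differ):

```python
# Per-token symmetric int8 quantization of a KV vector: half the footprint of FP16,
# at the cost of a small reconstruction error. Illustrative, not PiKV's quantizer.
import torch

def quantize_kv(x: torch.Tensor) -> tuple[torch.Tensor, float]:
    """One scale per KV vector; values mapped to the int8 range."""
    scale = max(x.abs().max().item() / 127.0, 1e-8)
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: float) -> torch.Tensor:
    return q.float() * scale

k = torch.randn(128)                      # one head's key vector for one token
q8, scale = quantize_kv(k)
k_hat = dequantize_kv(q8, scale)
print(f"stored: {q8.numel()} bytes (int8) vs {k.numel() * 2} bytes (FP16), "
      f"max abs error: {(k - k_hat).abs().max().item():.4f}")
```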
The combination matters. Expert-sharding alone creates fragmentation. Eviction alone loses information. Compression alone adds latency. PiKV coordinates all three — sharding provides locality, eviction manages capacity, compression reduces the remaining footprint.
What is DeepSeek’s MLA and why does it cut cache by 93%?
Multi-Head Latent Attention (MLA) takes a radically different approach: instead of optimizing how you store full KV tensors, change what you store.
Standard multi-head attention caches K and V tensors at their full width — num_heads × head_dim, i.e. the hidden size, per tensor per layer. For a model with hidden size 4,096 (32 heads of 128 dimensions), each token's KV entry is 4,096 × 2 (K+V) × 2 bytes (FP16) = 16 KB per layer. Across 60+ layers, this compounds quickly.
MLA projects K and V through learned low-rank matrices before caching. A 4,096-dimensional vector becomes a 512-dimensional latent representation. The cache stores this compressed version. At attention time, the latent is decompressed back to full dimensionality for the computation. The compression is lossy but the projection matrices are trained jointly with the model, so the information loss is minimized for the model’s specific task distribution.
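A minimal sketch of that caching pattern, assuming DeepSeek-style dimensions (4,096 hidden, 512-dim latent) and omitting the decoupled RoPE key and multi-head structure of the real architecture; the function names are mine:

```python
# MLA caching pattern, heavily simplified: cache a low-rank latent per token and
# reconstruct K/V from it when attention runs.
import torch
import torch.nn as nn

hidden, latent = 4096, 512                    # DeepSeek-style dimensions

down_proj = nn.Linear(hidden, latent, bias=False)   # in a real model, trained jointly
k_up_proj = nn.Linear(latent, hidden, bias=False)
v_up_proj = nn.Linear(latent, hidden, bias=False)

kv_latent_cache: list[torch.Tensor] = []      # this is all that lives in GPU memory

def cache_token(h: torch.Tensor) -> None:
    """Store only the 512-dim latent, not the full 2 x 4096 K/V pair."""
    kv_latent_cache.append(down_proj(h))

def keys_values_for_attention() -> tuple[torch.Tensor, torch.Tensor]:
    """Reconstruct full-width K and V from the cached latents on the fly."""
    latents = torch.stack(kv_latent_cache)    # (seq_len, latent)
    return k_up_proj(latents), v_up_proj(latents)

with torch.no_grad():
    for _ in range(10):
        cache_token(torch.randn(hidden))
    K, V = keys_values_for_attention()

print(K.shape, V.shape)                       # torch.Size([10, 4096]) twice
print(f"cached values per token: {kv_latent_cache[0].numel()} vs {2 * hidden} for full K+V")
```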
The result for DeepSeek-V3: 70 KB per token versus 516 KB for LLaMA-3.1 405B — roughly 7x smaller (DeepSeek's own comparison against an equivalent full-MHA cache puts the reduction at about 93%). At 128K context, that is 8.7 GB for DeepSeek-V3 versus 64.2 GB for LLaMA 405B. The difference between fitting on one GPU and needing four.
The trade-off: MLA requires training with the latent attention architecture from scratch. You cannot retrofit it onto an existing model. And MLA is incompatible with standard rotary position embeddings (RoPE) without additional engineering — DeepSeek solved this with a decoupled RoPE variant that carries position information in a separate small key, but it adds implementation complexity.
What should practitioners deploying MoE models do?
If you are serving Mixtral or similar standard-attention MoE models:
- Start with vLLM + PagedAttention for baseline efficiency (a minimal configuration sketch follows this list).
- Add PiKV for expert-aware cache management (2-3x additional throughput).
- Enable KV cache quantization (NVFP4 on NVIDIA hardware) for another 2-4x memory reduction.
- Profile cross-GPU communication — if KV gather is the bottleneck, expert-aligned sharding (PiKV’s approach) provides the largest improvement.
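A minimal serving sketch for a Mixtral-class model using vLLM's offline `LLM` API. Argument names may change across vLLM versions; FP8 is shown as the broadly supported KV quantization option, since NVFP4 availability depends on hardware and version, and PiKV is omitted because its integration API is out of scope here.

```python
# Minimal vLLM offline-serving sketch for a Mixtral-class MoE model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,        # split weights (and KV cache) across 2 GPUs
    max_model_len=32768,           # cap context length to bound KV cache growth
    kv_cache_dtype="fp8",          # quantized KV cache; halves cache memory vs FP16
    gpu_memory_utilization=0.90,   # fraction of VRAM reserved for weights + cache
)

outputs = llm.generate(
    ["Summarize why MoE models still pay full price for KV cache."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```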
If you are choosing a model to deploy:
- DeepSeek-V3 with MLA gives you a roughly 93% KV cache reduction (relative to full MHA) built into the architecture.
- Mixtral with standard MHA gives you ecosystem maturity (vLLM, TGI, full tooling support) but requires external cache optimization.
If you are training a new MoE model:
- Evaluate MLA or GQA at architecture selection time. These are training-time decisions that cannot be retrofitted. The serving cost savings over the model’s lifetime will likely exceed the additional training engineering cost.
Key takeaways
- MoE sparsifies compute, not memory. Expert routing saves FFN computation; the KV cache remains dense. At 128K context, the cache exceeds the memory footprint of the active parameters.
- Standard optimizations are necessary but insufficient. PagedAttention reduces fragmentation. GQA/MQA reduce size. Neither addresses MoE-specific expert-shard coordination.
- PiKV solves the coordination problem. Expert-aligned sharding + adaptive eviction + compression = 2.2x throughput, 65% memory reduction. Open-source.
- MLA is the architectural solution. DeepSeek’s 93% cache reduction changes the economics fundamentally, but requires training from scratch.
- Choose your battle at architecture selection time. MLA vs standard MHA is a training decision with permanent serving consequences.
Further reading
- Context window management — general KV cache optimization for agents
- 1-bit LLMs on consumer hardware — another approach to memory reduction through extreme quantization
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch