The residual stream revelation: why KV cache may be theoretically unnecessary
Every team running LLM inference at scale has the same conversation. Someone opens a memory profiler, sees the KV cache consuming most of the GPU memory budget, and asks: can we make this smaller? Then come the follow-up questions — quantize it, evict it, compress it, page it, offload it to CPU. The conversation is always about how to manage the cache, never about whether the cache stores something genuinely irreplaceable.
A March 2026 paper (arXiv:2603.19664) asks the question nobody was asking: is the information in the KV cache actually new? The answer — rigorously proven and empirically validated across four model families — is no.
TL;DR
The KV cache is theoretically redundant. Keys and values at every transformer layer are deterministic projections of the residual stream, which means they can be reconstructed bit-identically at any point. Recomputing K and V from checkpointed residuals produces 100% identical outputs under greedy decoding, with 27x less memory per token on Gemma 3-4B. This doesn’t mean the cache is obsolete — it means speed and information are decoupled, opening new design spaces for bounded-memory inference that the field hasn’t explored. For the broader context on memory management at the application layer, see context window management.

What is the residual stream, and why does every layer write to it?
The residual stream is the running state vector that flows through every transformer layer. When you feed a sequence of tokens to a transformer, each token starts as an embedding. From there, every layer reads from that vector, computes something, and adds its result back. The output of layer N becomes the input to layer N+1.
This “add and pass” design is the transformer’s residual connection, inherited from residual networks: each layer contributes its delta to a shared stream rather than transforming the entire state from scratch.
```
Token embeddings
        │
        ▼
┌──────────────────────────────────────────┐
│ Layer 1                                  │
│  ┌──────────────┐      ┌─────────────┐   │
│  │Self-Attention│      │     FFN     │   │
│  └──────┬───────┘      └──────┬──────┘   │
│         │    (+= residual)    │          │
└─────────┼─────────────────────┼──────────┘
          │           +         │
          └───────────┬─────────┘
                      │  ← residual stream: complete state at this position
                      ▼
┌──────────────────────────────────────────┐
│ Layer 2 reads the same stream            │
│ ...                                      │
└──────────────────────────────────────────┘
```
A 2021 paper from Anthropic researchers (Elhage, Nanda, Olsson et al., “A Mathematical Framework for Transformer Circuits”) formalized this: attention heads read from the residual stream, perform their computations, and write back. Nothing is created outside the stream. Everything goes through it.
The implication — which the field largely ignored until now — is that at any given layer and position, the residual stream is the complete representation of what the model knows. Not a summary. The entire latent state.
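To make the read/compute/add-back pattern concrete, here is a minimal pre-norm decoder block in PyTorch. The structure and names are illustrative, not taken from any specific model:

```python
import torch.nn as nn

# Minimal pre-norm decoder block showing the read/compute/add-back pattern.
# The attention and FFN internals are elided; all names are illustrative.
class Block(nn.Module):
    def __init__(self, d_model, attn, ffn):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn, self.ffn = attn, ffn

    def forward(self, x):                 # x is the residual stream
        x = x + self.attn(self.ln1(x))    # attention reads the stream, adds its delta
        x = x + self.ffn(self.ln2(x))     # the FFN does the same
        return x                          # layer N's stream becomes layer N+1's input
```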
How are K and V derived from the residual stream?
When an attention layer computes attention, it projects the residual stream through learned weight matrices:
```
K_l = RotaryEmbed(LayerNorm(r) @ W_K)
V_l = LayerNorm(r) @ W_V
```
Where r is the residual vector at that token position and layer, and W_K, W_V are weight matrices frozen after training — they never change during inference.
Given r, you can compute K_l and V_l exactly. The operations are deterministic: layer normalization, then a matrix multiply, then positional rotations for keys. No stochasticity, no learned dynamics, no hidden state. The same r always produces the same K and V.
The KV cache stores K_l and V_l for every layer and every past token. But since those values are deterministic projections of r, and r flows through the forward pass anyway, the cache is storing derived quantities that can always be reconstructed. That’s what arXiv:2603.19664 calls theoretical redundancy — not a claim that caching is wasteful in practice, but a claim that the cache contains no information the residual stream doesn’t already carry.
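As a sketch of what reconstruction looks like, the two equations above translate directly into a few lines of PyTorch. The names ln and rotary_embed are stand-ins for whatever normalization and RoPE implementation a given model uses:

```python
import torch

# Minimal sketch of rebuilding K and V from a checkpointed residual,
# following the two equations above. Names are illustrative stand-ins.
@torch.no_grad()
def recompute_kv(r, ln, W_K, W_V, rotary_embed, position):
    """Rebuild K and V for one token at one layer from its residual vector.

    r        : (d_model,) residual checkpointed at this position and layer
    ln       : the layer's pre-attention normalization (LayerNorm or RMSNorm)
    W_K, W_V : frozen projection matrices, shape (d_model, kv_dim)
    """
    h = ln(r)                            # same normalization as the forward pass
    k = rotary_embed(h @ W_K, position)  # keys get the positional rotation
    v = h @ W_V                          # values are a plain projection
    return k, v  # matches what the cache held, given the same kernels and precision
```

Storing r once per token and calling this on demand is the whole trick: the projection is a matmul against frozen weights, so there is nothing to learn and nothing to drift.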
What does “theoretically redundant” actually mean?
The paper proves this through the Markov property of transformer generation: if the residual stream at position t fully determines the output distribution for token t+1, then the KV cache adds nothing to that distribution. The residual stream is sufficient.
To test this, the authors run a residual patching experiment. They take two different generation tasks (Task A and Task B), then at some layer l, swap the residual stream from Task B into the forward pass of Task A, while continuing with Task A’s KV cache from layer l+1 onward. If the KV cache contains information beyond the residual stream, the output distribution should change.
The measured KL divergence between original and patched output distributions: D_KL = 0.0. Exactly zero, not merely close to zero.
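Here is a rough sketch of how such a patching experiment can be wired up with PyTorch forward hooks, assuming a HuggingFace LLaMA-style layout (model.model.layers, tuple outputs). This is one plausible reading of the setup, not the paper's code:

```python
import torch
import torch.nn.functional as F

# Residual-patching sketch. Assumes a HuggingFace LLaMA-style decoder where
# model.model.layers[l] returns a tuple whose first element is the post-layer
# hidden states; exact attribute paths vary by model family. ids_a and ids_b
# should be the same length so positions line up.
@torch.no_grad()
def patching_kl(model, ids_a, ids_b, layer_idx):
    layer = model.model.layers[layer_idx]

    # 1. Capture Task B's residual stream at this layer during a clean run.
    captured = {}
    def capture(module, args, output):
        captured["h"] = output[0]
    handle = layer.register_forward_hook(capture)
    logits_b = model(ids_b).logits[:, -1]          # clean Task B baseline
    handle.remove()

    # 2. Re-run Task A, splicing Task B's residual in at the last position.
    #    Layers above layer_idx still attend over Task A's keys and values.
    def splice(module, args, output):
        h = output[0].clone()
        h[:, -1, :] = captured["h"][:, -1, :]
        return (h,) + tuple(output[1:])
    handle = layer.register_forward_hook(splice)
    logits_patched = model(ids_a).logits[:, -1]
    handle.remove()

    # 3. If the cache carried information beyond the residual, these two
    #    distributions would differ. The paper reports D_KL = 0.0.
    return F.kl_div(F.log_softmax(logits_patched, -1),
                    F.softmax(logits_b, -1), reduction="batchmean")
```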
Then they eliminate the cache entirely: checkpoint residual vectors at each position, recompute K and V on demand. They run 30-token greedy generation across six models in four families (LLaMA/SmolLM2, Qwen, DeepSeek, Gemma 3) ranging from 135M to 4B parameters.
Token identity rate: 100%. Every token matched exactly.
The memory implications follow directly:
```
                       Standard KV cache    KV-Direct
                       ─────────────────    ─────────────────
Per-token size
(Gemma 3-4B):          136 KB               5 KB (27x smaller)
20-turn convo
peak memory:           103 MB               42 MB (59% reduction)
Output identity
(greedy, 30 tok):      baseline             100% match
```
The 27x comes from a simple fact: residual vectors are smaller than the combined K and V projections across all layers. You store one vector per token instead of 2 × num_layers vectors.
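The arithmetic checks out against Gemma 3-4B's published configuration, assuming 34 layers, a d_model of 2560, 4 KV heads with head dimension 256, and bf16 storage:

```python
# Back-of-envelope check under the assumed Gemma 3-4B config:
# 34 layers, d_model = 2560, 4 KV heads x 256 head_dim, bf16 (2 bytes).
layers, d_model, kv_dim, bytes_per = 34, 2560, 4 * 256, 2

kv_cache_per_token = 2 * layers * kv_dim * bytes_per  # K and V at every layer
residual_per_token = d_model * bytes_per              # one vector per token

print(kv_cache_per_token)                       # 139264 bytes = 136 KB
print(residual_per_token)                       # 5120 bytes  = 5 KB
print(kv_cache_per_token / residual_per_token)  # ≈ 27.2x
```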
What this opens: bounded-memory inference
The immediate practical implication is not “stop using KV cache.” It’s that you now have a choice you didn’t know you had.
Before arXiv:2603.19664, the engineering question was always: which KV cache optimization should we apply? Eviction, quantization, compression, tiered storage. Every option accepted the cache as a given.
Now there’s a new option: recompute rather than store, trading some latency for a dramatic memory reduction. This tradeoff matters in three scenarios:
Long-context applications. For a Qwen-14B model handling a 1M-token context, the KV cache alone requires roughly 200 GB of VRAM (per Qwen’s Hugging Face model card). At that scale, partial caching with aggressive eviction is already the reality. KV-Direct offers a principled alternative: evict everything, checkpoint residuals, recompute on access. Memory scales with residual size, not K/V pair size.
Memory-constrained deployments. A 27x reduction in per-token storage changes what fits in unified memory on edge devices and smaller GPU tiers. Chatbot applications with long conversation histories become viable without paging to CPU.
Latency-tolerant batch workloads. The recomputation cost is real — and hardware-dependent. For interactive applications, the cache’s speed advantage is non-negotiable. For batch processing, document analysis, or offline agentic pipelines, the latency penalty is acceptable. At N=500 evicted tokens, the paper reports recomputation takes 0.17x the time of reading from cache, because memory bandwidth — not arithmetic — is the bottleneck on modern HBM-equipped hardware.
```mermaid
flowchart LR
    A[Inference workload] --> B{Latency budget?}
    B -->|Interactive, sub-200ms| C[Standard KV cache]
    B -->|Batch or offline| D{Memory constrained?}
    C --> E[Optimize: quantize, evict, page]
    D -->|GPU memory abundant| C
    D -->|Memory constrained| F[KV-Direct: checkpoint residuals]
    F --> G[Recompute K and V on demand]
    E --> H[TurboQuant / LookaheadKV / paged attention]
```
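In practice, "checkpoint residuals, recompute on access" is a small bookkeeping layer. The sketch below batches the rebuild for all evicted positions in a layer into a single matmul, which is what makes recomputation bandwidth-friendly; every name here is illustrative rather than from the paper:

```python
import torch

# Sketch of recompute-on-access, as one reading of the KV-Direct idea:
# evict K/V, keep per-layer residual checkpoints, and rebuild K and V for
# a whole span of evicted positions in one batched matmul.
class ResidualStore:
    def __init__(self):
        self.store = {}                  # layer index -> list of (pos, residual)

    def checkpoint(self, layer, pos, r):
        self.store.setdefault(layer, []).append((pos, r.detach()))

    @torch.no_grad()
    def rebuild_kv(self, layer, ln, W_K, W_V, rotary_embed):
        positions, residuals = zip(*self.store[layer])
        h = ln(torch.stack(residuals))   # (N, d_model): one pass over all N tokens
        k = rotary_embed(h @ W_K, torch.tensor(positions))
        v = h @ W_V
        return k, v                      # concatenate with any still-cached K/V
```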
Which KV cache optimizations still matter?
Most of them — but the theoretical basis for them has shifted.
For latency-sensitive inference, the cache's speed advantage remains non-negotiable. A March 2026 survey of KV cache optimization (arXiv:2603.20397) documents five active research directions: cache eviction, compression, hybrid memory, novel attention mechanisms, and combination strategies. Each remains relevant for systems where latency matters.
What changes is what “aggressive” means for compression. If K and V are deterministic projections of the residual stream, then any information lost by quantizing the cache can, in principle, be recovered from the residual. This isn’t just theoretical comfort — it’s a tighter bound on how far compression can go without quality loss.
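A toy illustration of that argument, with uniform int8 quantization standing in for a real scheme like TurboQuant (rotary rotation omitted for brevity):

```python
import torch

# Toy version of the "lossless source" argument: quantize cached keys
# aggressively, and when exactness matters, recompute the true key from
# the residual instead of dequantizing. Not TurboQuant; a stand-in.
def quantize(k):
    scale = k.abs().max() / 127.0
    return (k / scale).round().to(torch.int8), scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale   # lossy approximation

def exact_key(r, ln, W_K):
    return ln(r) @ W_K                   # lossless: rebuilt from the residual
```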
TurboQuant (Google, ICLR 2026) demonstrates this ceiling: 3-bit quantization of the KV cache achieves 6x memory reduction with zero measured accuracy loss, and enables up to 8x faster attention computation on H100 due to reduced memory bandwidth pressure. The fact that the underlying data is reconstructible from a lossless source is part of why such aggressive quantization doesn’t degrade model output.
LookaheadKV (arXiv:2603.10899, ICLR 2026) takes a different angle — predicting which cache entries will be needed in future tokens without running full generation, reducing eviction overhead by up to 14.5x compared to prior eviction methods. This remains valuable when the cache is retained and the question is which K/V pairs to keep. With KV-Direct, the equivalent question becomes which residual checkpoints to retain — a simpler problem, since residuals are uniform in size.
The practical 18-month picture:
- Hybrid systems that cache residuals rather than K/V pairs for long-context deployments, with prefetch pipelines to amortize recomputation cost
- More aggressive quantization benchmarked against the zero-loss baseline that the redundancy result implies
- Edge LLM deployments where the latency tradeoff is acceptable and the memory savings are decisive
The cache isn’t going away. But it’s no longer a fundamental requirement — it’s an optimization with a known information-theoretic price: zero.
For cost reduction approaches that complement this work at the application layer, see token efficiency optimization.
FAQ
What is the residual stream in a transformer? The residual stream is the shared state vector that flows through every transformer layer. Each layer reads from it, computes a delta, and adds that delta back. At any point in the forward pass, the residual stream contains the complete representation of everything the model has processed. Elhage et al. (Anthropic, 2021) formalized this: all attention heads and FFN layers communicate exclusively through the residual stream.
Why is the KV cache theoretically redundant if it speeds up inference? Speed and information are separate properties. The KV cache contains no information the residual stream doesn’t already hold — K and V are deterministic projections of the residual at each layer, computable from frozen weight matrices. The cache trades memory for latency. Calling it “redundant” means it stores derived data, not that storing it is useless.
Can you run inference with no KV cache at all? Yes. The arXiv:2603.19664 paper (March 2026) demonstrated 100% token identity under greedy decoding across four model families by checkpointing residual vectors instead of K/V pairs and recomputing projections on demand. The approach, called KV-Direct, reduces per-token storage from 136 KB to 5 KB on Gemma 3-4B. The cost is latency — recomputation is slower than cache reads for short interactive contexts.
What is KV-Direct? KV-Direct is a bounded-memory inference method from arXiv:2603.19664 that stores residual vectors (5 KB per token on Gemma 3-4B) instead of full KV pairs (136 KB per token) — a 27x per-token memory reduction. It recomputes K and V on demand using frozen weight matrices. A 20-turn conversation requires 42 MB peak memory versus 103 MB with a standard cache, with identical output under greedy decoding.
Does this mean KV cache compression can achieve zero information loss? Theoretically yes — the ground truth residual stream is always available, so information lost in cache compression can be recovered. This explains why TurboQuant (Google, ICLR 2026) achieves 6x KV memory reduction with zero measured accuracy degradation: the underlying data is reconstructible from a lossless source, making aggressive quantization safe in a way that model weight quantization is not.