
Weight quantization gets all the attention. Quantize to INT8, maybe INT4, watch the benchmark score. But model weights are a one-time cost. The KV cache grows with every token, and that’s where the real pressure lives at scale.

A Llama 3.1 70B request at 128K context consumes roughly 40 GB of KV cache alone. Serve four concurrent requests at that length and you need 160 GB just for the cache, more than the model weights themselves. Weight quantization doesn’t touch that number. TurboQuant does.
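The arithmetic behind that 40 GB figure is worth having at your fingertips, since every number in this post falls out of it. A back-of-envelope check in Python, using Llama 3.1 70B’s published architecture (80 layers, 8 KV heads under grouped-query attention, head dimension 128, BF16 at 2 bytes):

# KV cache size for Llama 3.1 70B at 128K context.
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2
context_tokens = 128 * 1024

per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # x2 for K and V
total_gb = per_token * context_tokens / 1024**3
print(f"{per_token / 1024:.0f} KB/token -> {total_gb:.1f} GB at 128K context")
# 320 KB/token -> 40.0 GB

Roughly 320 KB per token, times 128K tokens, is 40 GB. Weight precision appears nowhere in that calculation.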

TL;DR: TurboQuant (Google, ICLR 2026) compresses the KV cache to 3 bits using a two-stage algorithm (PolarQuant for main compression, QJL for residual correction), cutting memory 6x and speeding up attention computation 8x on H100 GPUs with zero measured accuracy loss. LookaheadKV (arXiv:2603.10899, also ICLR 2026) takes a complementary route, predicting which KV entries to evict at 14.5x lower cost than prior methods. The two techniques solve different problems and can be stacked.



Why KV cache compression is different from weight quantization

Weight quantization doesn’t solve your long-context problem. On a 70B model serving requests at 8K context, the KV cache runs roughly 2.5 GB per request, about 80 GB for a batch of 32 – more than an entire set of INT4 weights. The cache can consume up to 70% of total GPU memory during long-context inference, and it scales linearly with every new token generated.

The structural difference is what matters. Model weights are static: quantize once, done. The KV cache is dynamic. It grows with sequence length, fluctuates with batch size, and gets written and read on every forward pass. That changes the requirements: you need low-overhead encoding and decoding at inference time, not a one-shot calibration pass at deployment.

There’s also a targeting problem with weight-only quantization. It helps when your batch is small and your sequences are short. Once you push toward 32K+ context or large batches on long documents, the cache is where you’re losing. The concrete gap: quantizing a 70B model’s weights from BF16 to INT8 saves roughly 70 GB, once. TurboQuant’s 6x KV compression saves roughly 33 GB on every 128K-context request. Same order of magnitude per request, but on the part of memory that keeps growing with load.


What TurboQuant actually does

TurboQuant achieves 6x memory reduction at 3-bit precision with no measurable accuracy loss by solving a problem most 3-bit approaches get wrong. Standard quantizers introduce systematic bias into attention score computation. TurboQuant doesn’t.

The algorithm runs in two stages.

PolarQuant is the main compression pass. Instead of quantizing KV vectors in Cartesian space (where outliers blow up quantization error), PolarQuant converts pairs of coordinates into polar form. The insight is that angles in attention representations follow a concentrated, predictable distribution. Because the angles cluster, PolarQuant can eliminate per-block normalization constants entirely – the metadata overhead that makes naive 4-bit quantization less efficient than it sounds. You end up with genuine 3-bit storage, not “3-bit plus 8-bit scale factors buried in the overhead.”
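To make the polar intuition concrete, here is a toy sketch of the angle half of the idea – my illustration, not the paper’s codebook, and it cheats by keeping magnitudes in full precision so only the angle quantization is on display:

import numpy as np

rng = np.random.default_rng(0)
k = rng.standard_normal(128).astype(np.float32)  # one key vector, head dim 128

# Pair up coordinates and convert each pair to polar form.
x, y = k[0::2], k[1::2]
r = np.hypot(x, y)                  # magnitudes (full precision in this toy)
theta = np.arctan2(y, x)            # angles in [-pi, pi]

# Quantize angles to 8 levels (3 bits) on a fixed grid: no per-block
# scale factors, which is the metadata PolarQuant eliminates.
bins = 8
codes = np.round((theta + np.pi) / (2 * np.pi) * bins).astype(int) % bins
theta_hat = codes / bins * 2 * np.pi - np.pi

# Reconstruct and measure the damage.
k_hat = np.empty_like(k)
k_hat[0::2] = r * np.cos(theta_hat)
k_hat[1::2] = r * np.sin(theta_hat)
print("relative error:", np.linalg.norm(k - k_hat) / np.linalg.norm(k))

With only 8 angle levels the reconstruction is coarse; the point of the sketch is the mechanics – fixed grid, no per-block metadata – not the error rate, which in the real scheme is cleaned up by the QJL pass described next.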

QJL (Johnson-Lindenstrauss residual correction) is a 1-bit cleanup pass. Any quantizer leaves residual error; QJL takes that residual, projects it through a random Gaussian matrix, and stores only the sign of each projection – exactly 1 bit per dimension. The Johnson-Lindenstrauss transform guarantees the resulting attention score estimator is unbiased, with variance O(1/d) where d is the head dimension (typically 128). This is what makes accuracy preservation rigorous rather than empirical. The estimator is mathematically unbiased, which is a stronger claim than “it scored well on benchmarks.”
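The estimator is simple enough to check numerically. A minimal sketch of the sign-JL inner-product estimator the QJL stage builds on – the rescaling constant here follows the standard 1-bit JL analysis, and the paper’s exact kernel may differ:

import numpy as np

rng = np.random.default_rng(1)
d = 128                                # head dimension
q = rng.standard_normal(d)             # a future query
resid = rng.standard_normal(d)         # residual left by the main quantizer

def sketch_and_estimate(rng):
    S = rng.standard_normal((d, d))    # Gaussian JL projection, m = d rows
    signs = np.sign(S @ resid)         # stored: 1 bit per dimension...
    norm = np.linalg.norm(resid)       # ...plus one scalar
    # For Gaussian s: E[sign(<s,k>) * <s,q>] = sqrt(2/pi) * <q,k> / ||k||,
    # so rescaling by sqrt(pi/2) * ||k|| makes the estimate unbiased.
    return np.sqrt(np.pi / 2) * norm * np.mean(signs * (S @ q))

# Any single sketch is noisy (variance O(1/m)), but the average over fresh
# projections converges on the true inner product -- the unbiasedness claim.
ests = [sketch_and_estimate(rng) for _ in range(2000)]
print(f"true <q, resid>: {q @ resid:+.3f}   mean estimate: {np.mean(ests):+.3f}")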

On H100 GPUs, the result is up to 8x faster attention computation over the 32-bit baseline. Memory bandwidth, not FLOPs, is the bottleneck on long sequences, so reducing per-token storage has a direct multiplier on throughput. The approach was validated on LongBench, Needle In A Haystack, and ZeroSCROLLS using Gemma and Mistral – including needle retrieval tasks specifically designed to break approximate attention.
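You can see why bandwidth dominates with a floor calculation: every decode step has to stream the entire cache through HBM before any math happens. Assuming H100 SXM’s ~3.35 TB/s peak bandwidth (real kernels achieve a fraction of this, but the ratio is what matters):

# HBM streaming floor per decode step for the 128K-context request above.
# 3.35 TB/s peak = ~3.35 GB per millisecond.
hbm_gb_per_ms = 3.35
for name, cache_gb in [("BF16 cache ", 40.0), ("3-bit cache", 40.0 / 6)]:
    print(f"{name}: >= {cache_gb / hbm_gb_per_ms:.1f} ms per token")
# BF16 cache : >= 11.9 ms per token
# 3-bit cache: >= 2.0 ms per token

The compute for those attention scores is tiny by comparison, which is why shrinking bytes moves the throughput needle almost one-for-one.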

One thing practitioners often miss: no retraining, no calibration dataset, no fine-tuning. It’s a drop-in change at inference time.

Memory breakdown -- 70B model, 128K context, single request

Without TurboQuant:
  Model weights (INT4):   ~35 GB
  KV cache (BF16):        ~40 GB
  Activations + misc:     ~5 GB
  ────────────────────────────────
  Total:                  ~80 GB

With TurboQuant (3-bit KV):
  Model weights (INT4):   ~35 GB
  KV cache (TurboQuant):  ~6.7 GB   <- 6x reduction
  Activations + misc:     ~5 GB
  ────────────────────────────────
  Total:                  ~46.7 GB

Effect: a single H100 (80 GB) goes from barely fitting 1 request
to fitting ~6 concurrent requests at this context length
(~40 GB of headroom over weights + misc, at ~6.7 GB of cache each).

LookaheadKV: predicting what to keep vs. compressing what you keep

TurboQuant compresses everything in the cache. LookaheadKV asks a prior question: which entries deserve to stay in the cache at all?

KV cache eviction (dropping low-importance entries to cap cache size) has a longstanding accuracy problem. Eviction decisions have to be made during the prefill phase, before the model generates any response. But importance scores depend on what queries will be issued during generation, and those don’t exist yet. Existing heuristics – recency weighting, attention sink patterns – work adequately at short context but degrade as sequences lengthen.

LookaheadKV trains lightweight, parameter-efficient modules that predict future query importance without generating draft tokens. The modules augment each transformer layer and learn to approximate true importance scores with high fidelity. The design choice is what they call “surrogate future response”: the modules simulate what the model would ask of the cache, without actually running generation. This cuts eviction decision cost by 14.5x versus prior costly approximation methods, while beating cheap heuristics on accuracy.
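Acting on the predictions is the easy part. A sketch of the eviction step, with a random vector standing in for the learned module’s scores (the real interface belongs to the paper; everything here is illustrative):

import numpy as np

def evict_by_predicted_importance(keys, values, scores, keep_ratio=0.25):
    # Keep the top-scoring fraction of entries, preserving sequence order.
    seq_len = keys.shape[0]
    k = max(1, int(seq_len * keep_ratio))
    keep = np.sort(np.argpartition(scores, -k)[-k:])
    return keys[keep], values[keep], keep

# Toy usage: an 8K-token prefill, one layer, keeping the predicted top 25%.
rng = np.random.default_rng(2)
seq, kv_heads, head_dim = 8192, 8, 128
K = rng.standard_normal((seq, kv_heads, head_dim)).astype(np.float32)
V = rng.standard_normal((seq, kv_heads, head_dim)).astype(np.float32)
scores = rng.random(seq)          # stand-in for the predictor's output
K2, V2, kept = evict_by_predicted_importance(K, V, scores)
print(K2.shape)                   # (2048, 8, 128): 4x fewer entries to store

All of LookaheadKV’s contribution lives in making those scores cheap and accurate before generation starts; the gather itself is trivial.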

The practical payoff is faster time-to-first-token at long contexts. The prefill phase can confidently drop entries that would never be attended to, rather than holding everything just in case. This is different from what TurboQuant provides. TurboQuant doesn’t reduce cache entry count; it reduces per-entry memory cost. Both appeared at ICLR 2026, and they’re solving orthogonal constraints.


A decision matrix: which compression technique for which constraint

The choice between eviction and quantization maps onto deployment constraints fairly cleanly:

flowchart TD
    A[KV cache is your bottleneck] --> B{What's the primary constraint?}
    B --> C[Memory capacity\ne.g. won't fit in VRAM]
    B --> D[Latency / TTFT\ne.g. too slow to respond]
    B --> E[Both -- long context\n+ high throughput]

    C --> F{Acceptable accuracy\ntrade-off?}
    F --> G[Near-zero loss required] --> H[TurboQuant 3-bit\n6x memory reduction]
    F --> I[Moderate loss OK] --> J[INT8 KV in vLLM\ntoday's safe default]

    D --> K{Is eviction already\nin your stack?}
    K --> L[No] --> M[LookaheadKV\n14.5x faster eviction]
    K --> N[Yes, but inaccurate] --> M

    E --> O[LookaheadKV eviction\n+ TurboQuant on kept entries\nmaximum compression]

    H --> P[Validate on your task:\nneedle retrieval, multi-hop]
    J --> P
    M --> P
    O --> P

TurboQuant is the right choice when sequences are long, you need to maximize concurrent requests in VRAM, and dropping any context is not acceptable (RAG over full documents, legal document review, long multi-turn sessions). All tokens stay; they just cost 6x less memory each.

LookaheadKV fits when you’re optimizing TTFT on a fixed context budget and your workload has natural redundancy – long system prompts, retrieved chunks that are mostly irrelevant. The 14.5x eviction speedup directly cuts prefill time. You do give up some context, so accuracy on tasks that need every token will degrade.

Combining them makes sense for high-throughput long-context serving where both density and speed matter. Evict the low-importance entries first, then quantize what remains: keep a quarter of the entries at 6x compression each and the cache shrinks roughly 24x. The gains compound.

Using neither yet is also a valid call if you’re on vLLM mainline and want to avoid community forks. 8-bit KV cache quantization (FP8 in current vLLM releases) halves KV memory with under 1% impact on most benchmarks. That’s the safe default for production in 2026 until TurboQuant lands in mainline.
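For reference, the quantized KV cache in vLLM is a constructor argument away. Current vLLM releases document FP8 dtypes for this knob; treat the exact dtype string and the model/parallelism settings below as placeholders to adapt:

from vllm import LLM, SamplingParams

# 8-bit KV cache: halves KV memory vs BF16. Dtype options vary by vLLM
# version and attention backend -- check your release's docs.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    kv_cache_dtype="fp8",
    tensor_parallel_size=4,      # example split; size to your hardware
)
outputs = llm.generate(
    ["Summarize the key obligations in the attached contract."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)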

One constraint that overrides everything else: if your application needs to retrieve specific facts buried deep in long context (the needle-in-a-haystack pattern), test your compression method explicitly on that task. Aggressive eviction schemes can silently discard the needle. TurboQuant’s unbiased estimator is designed to prevent this, but verify before deploying.


FAQ

Does TurboQuant require retraining or fine-tuning the model? No. TurboQuant operates entirely at inference time. You quantize the KV cache on the fly with no changes to model weights, no calibration dataset, no fine-tuning pass.

Can TurboQuant and LookaheadKV be combined? Yes. LookaheadKV evicts low-importance entries before they’re stored; TurboQuant compresses the entries that remain. They target different parts of the memory problem and have no architectural conflict.

What context lengths justify KV cache compression over weight quantization? There’s no single threshold; what matters is context length times batch size. On Llama 3.1 70B the cache costs roughly 320 KB per token, so a single 128K-context request holds about 40 GB – more than a 4-bit copy of the weights – and batching multiplies that. Weight quantization does nothing to that number.

Is 3-bit KV quantization safe for all tasks? TurboQuant validates on LongBench, Needle In A Haystack, and ZeroSCROLLS with Gemma and Mistral, showing zero measured accuracy loss. Tasks with high attention sensitivity to specific tokens – precise fact retrieval, multi-hop reasoning – are the ones to stress-test before going to production.

Is TurboQuant in vLLM yet? Community implementations targeting vLLM exist (see github.com/0xSero/turboquant), but as of March 2026 it is not in the vLLM mainline. 8-bit (FP8) KV cache remains the production default in vLLM until official integration lands.



Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch