4 minute read

“Speculative decoding used to be a research paper. Now it is a checkbox in vLLM.”

TL;DR

ML-SpecQD shows speculative decoding + 4-bit quantization produces up to a 2.72x speedup on code generation tasks. EAGLE-3 in vLLM hits 2.5x standalone. P-EAGLE (AWS) generates all draft tokens in one forward pass for an additional 1.1-1.4x. The combination is now a standard production optimization. For the foundational speculative decoding mechanics, see the earlier post on speculative decoding fundamentals.


What changed since the fundamentals post?

Three developments moved speculative decoding from research curiosity to production default.

ML-SpecQD (arXiv 2503.13565) systematically studied the interaction between speculative decoding and quantization — a combination most practitioners assumed was simply additive. The finding: the interaction is multiplicative. Quantizing the target model to 4-bit reduces its memory footprint, freeing GPU memory. That freed memory enables larger batch sizes. Speculative decoding’s draft-and-verify pattern amortizes the target model’s per-step cost across multiple tokens. Larger batches mean more tokens verified per step.

The result on Qwen2.5-Coder 7B with 4-bit quantization: up to 2.72x speedup on Intel CPU hardware. The paper focused on CPU-based inference where quantization’s memory savings translate directly to bandwidth improvements — the regime where most cost-sensitive deployments operate.

EAGLE-3 in vLLM brought speculative decoding from standalone research implementations into the most widely used LLM serving framework. Enabling it is a configuration flag. The 2.5x speedup requires no custom code, no separate draft model deployment, no changes to the request pipeline. This is what moves a technique from “interesting” to “adopted.”

P-EAGLE (AWS, arXiv 2504.xxxxx) addressed the remaining bottleneck: the draft model itself. Standard speculative decoding generates draft tokens autoregressively — one at a time. P-EAGLE generates all K draft tokens in a single forward pass by parallelizing the draft generation. This removes the ceiling where draft model latency limits the overall speedup.
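A toy sketch of the control-flow difference only (the functions below are stand-ins for real model forward passes, not P-EAGLE's actual architecture): standard drafting calls the draft model K times in sequence, while parallel drafting produces all K candidates from a single call.

```python
# Toy illustration of the control flow; the "models" are stand-in functions.

K = 4  # number of draft tokens per verification step

def draft_step(prefix: list[int]) -> int:
    """Stand-in for one autoregressive draft-model forward pass."""
    return (prefix[-1] + 1) % 50_000

def draft_parallel(prefix: list[int], k: int) -> list[int]:
    """Stand-in for a single forward pass emitting k draft tokens at once."""
    return [(prefix[-1] + i + 1) % 50_000 for i in range(k)]

prefix = [101, 2023, 318]

# Standard speculative decoding: K sequential draft-model calls per step.
draft = list(prefix)
for _ in range(K):
    draft.append(draft_step(draft))
sequential_drafts = draft[len(prefix):]

# P-EAGLE-style drafting: one call produces all K candidates.
parallel_drafts = draft_parallel(prefix, K)

print(sequential_drafts, parallel_drafts)
```

The sequential path pays K draft-model latencies per verification step; the parallel path pays one, which is exactly the ceiling the paper targets.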

Why does the quantization interaction work?

The intuition says quantization noise should degrade speculative decoding. The draft model proposes tokens. The target model verifies them. If the target model is quantized, its verification is noisier — it accepts fewer draft tokens, reducing the speedup.

This happens. The acceptance rate does drop with quantization. But the throughput gain from larger batch sizes more than compensates.

Here is the math. Suppose a full-precision Llama-3-70B uses 140 GB in FP16. At batch size 1, the GPU memory is fully consumed by the model. Quantize to 4-bit: the model uses 35 GB. The freed 105 GB can hold KV cache for dozens of concurrent sequences. Speculative decoding with EAGLE-3 processes 3-5 draft tokens per verification step. At batch size 32, each verification step processes 32 × 4 = 128 tokens in one forward pass.

The per-step cost barely changes (it is dominated by memory bandwidth, not compute). The per-token cost falls roughly in inverse proportion to the batch size. Quantization noise trims the acceptance rate by maybe 10-15%; the larger batches boost throughput by 200-400%. The net effect is strongly positive.
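The same back-of-envelope chain, written out. Every number below is an assumption carried over from the paragraphs above, not a measurement:

```python
# Back-of-envelope model of the memory -> batch-size -> throughput chain.
# All figures are illustrative assumptions from the text, not benchmarks.

model_fp16_gb = 140           # Llama-3-70B at 2 bytes/param
model_int4_gb = 35            # same weights at ~0.5 bytes/param
freed_gb = model_fp16_gb - model_int4_gb
print(f"memory freed for KV cache: {freed_gb} GB")

draft_tokens_per_step = 4     # EAGLE-3 typically drafts 3-5 tokens
batch_size = 32
tokens_per_verification = batch_size * draft_tokens_per_step
print(f"tokens verified per forward pass: {tokens_per_verification}")

# Net effect: batching gain vs. acceptance-rate loss from quantization.
acceptance_fp16 = 0.75        # assumed full-precision acceptance rate
acceptance_int4 = 0.65        # ~13% relative drop after 4-bit quantization
batch_throughput_gain = 3.0   # "200-400%" from larger batches (assume 3x)

net = batch_throughput_gain * (acceptance_int4 / acceptance_fp16)
print(f"net speedup factor: {net:.2f}x")   # ~2.6x: strongly positive
```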

| Configuration | Speedup | Hardware |
|---|---|---|
| EAGLE-3 alone (FP16) | Up to 2.5x | GPU (various) |
| Quantization alone (4-bit GPTQ) | 1.3-1.5x | GPU (various) |
| ML-SpecQD (speculative + 4-bit) | Up to 2.72x | Intel CPU |
| P-EAGLE (parallel draft) | Additional 1.1-1.4x over EAGLE-3 | NVIDIA B200 |

How should you enable this in production?

Step 1: Enable EAGLE-3 in vLLM. This is the lowest-effort, highest-impact change. vLLM’s Speculators v0.3.0 supports end-to-end training and deployment of draft models. For supported model families (Llama, Mistral, Qwen), pre-trained speculator weights are available.
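A minimal sketch of what the flag looks like in a recent vLLM release. The config keys, the speculator checkpoint, and the token count are illustrative; check the speculative decoding docs for your vLLM version and model family.

```python
from vllm import LLM, SamplingParams

# Minimal sketch; the speculator checkpoint and config values are placeholders.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",                               # EAGLE-3 draft head
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",   # pre-trained speculator
        "num_speculative_tokens": 4,                      # draft tokens per step
    },
)

outputs = llm.generate(
    ["Write a Python function that merges two sorted lists."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```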

Step 2: Combine with quantization. Apply GPTQ or AWQ 4-bit quantization to the target model. vLLM supports quantized models natively. The speculator (draft model) can remain at higher precision — it is small enough that the memory cost is negligible, and quantizing it would only hurt acceptance rates.
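A sketch of the combined setup, assuming an AWQ-quantized checkpoint is available for your target model. Both checkpoint names below are placeholders:

```python
from vllm import LLM

# Sketch only: 4-bit AWQ target plus a higher-precision EAGLE-3 speculator.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",  # 4-bit target (placeholder)
    quantization="awq",
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",   # draft stays higher precision
        "num_speculative_tokens": 4,
    },
)
```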

Step 3: Profile acceptance rates. Speculative decoding’s benefit varies by workload. Code generation and structured output have high acceptance rates (70-80% of draft tokens accepted). Creative text and multi-turn conversation have lower rates (40-60%). Profile your specific traffic to estimate the effective speedup.
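To turn a measured acceptance rate into an expected per-step gain, the standard speculative-decoding estimate is E = (1 − α^(K+1)) / (1 − α) tokens per verification step, assuming each of K draft tokens is accepted independently with probability α. A small sketch of what the two workload regimes above imply:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens per verification step, assuming each draft token is
    accepted independently with probability alpha (Leviathan et al. estimate)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

K = 4  # draft tokens per step
for workload, alpha in [("code generation", 0.75), ("creative text", 0.50)]:
    e = expected_tokens_per_step(alpha, K)
    # Rough effective speedup if draft and verify overheads are small (assumption).
    print(f"{workload}: ~{e:.2f} tokens per target forward pass")
```

With these assumed rates, code generation gets roughly 3 tokens per target pass while creative text gets under 2 — which is why the same configuration can look great on one traffic mix and marginal on another.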

Step 4: Monitor for regressions. Quantized speculative decoding changes the output distribution subtly. The verification step guarantees the output matches what the target model would produce — but once you quantize, that target is the quantized model rather than the original FP16 weights, and some implementations further relax exact verification for speed. Run quality evaluations on your specific benchmarks after enabling the combination.
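A lightweight drift check, sketched under the assumption that you keep a small file of benchmark prompts; the checkpoint names are the same placeholders as above, and this agreement score is a sanity check to supplement, not replace, your real benchmark scoring.

```python
from vllm import LLM, SamplingParams

# Sketch of an A/B drift check against the FP16 baseline. Run each config in
# its own process if GPU memory is tight; checkpoint names are placeholders.
prompts = [line.strip() for line in open("eval_prompts.txt") if line.strip()]
params = SamplingParams(temperature=0.0, max_tokens=256)  # greedy for comparability

def run(llm: LLM) -> list[str]:
    return [o.outputs[0].text for o in llm.generate(prompts, params)]

baseline = run(LLM(model="meta-llama/Llama-3.1-8B-Instruct"))          # FP16 target
candidate = run(LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    quantization="awq",
    speculative_config={"method": "eagle3",
                        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
                        "num_speculative_tokens": 4},
))

agreement = sum(a == b for a, b in zip(baseline, candidate)) / len(prompts)
print(f"greedy-output agreement vs FP16 baseline: {agreement:.1%}")
```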

Key takeaways

  • The combination is multiplicative, not additive. Quantization frees memory bandwidth. Speculative decoding amortizes per-step cost. Up to 2.72x on code generation tasks.
  • EAGLE-3 in vLLM makes it a configuration flag. No custom code. 2.5x standalone.
  • P-EAGLE removes the draft bottleneck. Parallel draft generation adds 1.1-1.4x on NVIDIA B200 by eliminating the autoregressive ceiling in draft models.
  • Profile your workload. Code generation benefits most (high acceptance). Creative text benefits least (low acceptance).

Further reading

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch