The quantization decision tree: which method for which hardware in 2026
Every week, r/LocalLLaMA gets the same post: “I have X GB of VRAM and want to run Y model. What quantization should I use?” The replies converge on the same approximate answers (Q4_K_M is usually fine, AWQ is better for GPUs, FP8 is for H100s), but nobody has actually written the 2026 version of this guide. The landscape has changed. BitNet LoRA fine-tuning now runs on phones. TurboQuant (ICLR 2026) cuts KV cache memory 6x at 3-bit with no accuracy loss. Flash-MoE got a 397B model running on a MacBook. The old advice still mostly holds, but the map needs updating.
TL;DR: Pick quantization by hardware tier. Edge and mobile use BitNet or GGUF Q4_K_M. Consumer GPUs (8–24 GB VRAM) get GGUF Q8_0 or AWQ. Data center A100 runs INT8; H100 runs FP8. KV cache compression (TurboQuant’s 3-bit, 6x reduction) is a separate choice that stacks on top of any weight quantization. The decision tree below walks you through the whole thing in under two minutes.

The three axes that determine your quantization choice
Before reaching for a format name, answer three questions.
First: how much memory do you have? This is a hard constraint, not a preference. GGUF Q4_K_M of a 70B model takes roughly 40 GB; Q8_0 takes 80 GB. If it doesn’t fit, nothing else matters.
Second: what’s your latency budget? Interactive chat at under 200 ms TTFT (time to first token) has different tolerances than overnight batch processing. Lower bit-widths usually mean faster decode, but can add first-token latency on formats without native hardware support.
Third: what quality floor can you live with? GGUF Q4_K_M retains 92% quality vs. full precision. AWQ at 4-bit retains 95%. FP8 on H100 loses less than 0.5%. If you’re doing medical summarization, your floor is higher than it is for a personal coding assistant.
Those three questions determine the answer. The format names come after.
Edge and mobile (≤16 GB unified memory)
GGUF Q4_K_M for standard models, BitNet TQ1_0 for 1-bit-native models, Flash-MoE 4-bit if you’re running MoE architectures on 48 GB Apple Silicon. And as of March 2026, you can fine-tune a 1B BitNet model on a phone in under 90 minutes — Tether’s QVAC Fabric made that real, which changes what “edge AI” actually means.
GGUF Q4_K_M
GGUF is the llama.cpp format. It runs on CPU, Apple Metal, and Vulkan, so it works on anything with enough RAM. Q4_K_M is the 4-bit K-quant variant with mixed precision on attention and feed-forward layers, and it’s the community default for good reason: 6.74 perplexity on Llama 3.1 8B versus AWQ’s 6.84 in JarvisLabs benchmarks (lower is better, minimal difference). On consumer hardware with < 16 GB, Q4_K_M is still the safe default for models that weren’t trained as BitNet.
BitNet (TQ1_0)
Microsoft’s BitNet stores weights as ternary values (-1, 0, +1) at 1.58 bits per parameter. That’s not a quantization of an existing model — the model gets trained this way from scratch. You can’t take Llama 3 and BitNet-ize it. That’s the constraint most people miss.
What you get in exchange: BitNet-1B uses 77.8% less VRAM than Gemma-3-1B at 16-bit. On x86 CPUs, bitnet.cpp hits 2.37x–6.17x speedups with 71.9%–82.2% less energy. And since March 2026, QVAC Fabric lets you run LoRA fine-tuning on BitNet models on-device — 1B parameters in 1 hour 18 minutes on a Samsung S25, 1 hour 45 minutes on an iPhone 16.
Flash-MoE (48 GB Apple Silicon)
Flash-MoE shipped in March 2026. It runs Qwen3.5-397B on a MacBook Pro with 48 GB RAM at 4.4 tokens/second. The 209 GB model doesn’t fit in memory. It streams only the K=4 active experts from NVMe SSD at 17.5 GB/s using hand-tuned Metal shaders, with 4-bit quantization on the weights. The result works because MoE models only activate a fraction of their parameters per token, so you never need the full model at once.
One important thing: 2-bit quantization pushes this to 5.74 tok/s, but it breaks tool calling with malformed JSON. Stick with 4-bit if you need function calling to work.
Edge format table:
| Format | Bit-width | Size vs FP16 | Quality retention | CPU support | Best for |
|---|---|---|---|---|---|
| GGUF Q4_K_M | ~4.5-bit | ~28% | 92% | Yes | Universal edge default |
| GGUF Q8_0 | 8-bit | ~52% | ~99% | Yes | Quality-critical, larger RAM |
| BitNet TQ1_0 | 1.58-bit | ~10% | Comparable (native) | Yes | BitNet-native models only |
| Flash-MoE 4-bit | 4-bit | Streamed from SSD | Production quality | No (Metal only) | 48 GB MoE inference |
Consumer GPU (8–24 GB VRAM)
Most people reading this are in this tier. GGUF Q4_K_M or Q8_0 if you need flexibility across hardware. AWQ if you’re on NVIDIA and care about accuracy. GPTQ if you care more about throughput and can tolerate ~90% quality retention at 4-bit.
GGUF on GPU
GGUF isn’t just for CPU. With llama.cpp GPU offloading, you push layers to VRAM and overflow to CPU RAM. On a 16 GB GPU running a 13B model at Q4_K_M (~8 GB), you can offload all layers and get full GPU inference speed. The advantage: flexibility. One format, any hardware, CPU-GPU split as needed. Throughput won’t match dedicated GPU formats, but the operational simplicity is worth it for personal deployments.
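If you drive llama.cpp through its Python bindings rather than the CLI, the CPU-GPU split is a single constructor argument. A minimal sketch, assuming llama-cpp-python is installed with CUDA or Metal support; the model path is a placeholder, not a specific recommended file:

```python
# Sketch: GGUF with GPU offload via llama-cpp-python.
# Assumes `pip install llama-cpp-python` built with CUDA or Metal support;
# the model path below is a placeholder, not a specific recommended file.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-13b.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,   # -1 = offload every layer; use a smaller number to spill layers to CPU RAM
    n_ctx=8192,        # context window; the KV cache grows with this, so budget accordingly
)

out = llm("Summarize the tradeoffs of 4-bit quantization in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```

Setting n_gpu_layers to a smaller positive number spills the remaining layers to CPU RAM, which is the fallback when the quantized file is bigger than your VRAM.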
AWQ (Activation-Aware Weight Quantization)
AWQ calibrates quantization by observing actual activation magnitudes and protecting the outlier weights that matter most. The accuracy story at 4-bit is better than GPTQ’s: on Qwen3-32B, AWQ achieves a 42% GPU memory reduction with only a 1.2% accuracy drop. With the Marlin kernel on NVIDIA, AWQ reaches 741 tok/s output throughput, a 10.9x speedup over the unoptimized baseline and slightly ahead of GPTQ’s 712 tok/s. If you’re on NVIDIA and quality retention matters, AWQ is the call.
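In vLLM, serving an AWQ checkpoint is a few lines; recent versions pick the Marlin kernel automatically when the GPU supports it. A sketch with an illustrative pre-quantized model ID (substitute whichever AWQ checkpoint you actually use):

```python
# Sketch: serving an AWQ 4-bit checkpoint with vLLM.
# The model ID is an example of a published AWQ repo, not an endorsement;
# swap in whichever AWQ-quantized checkpoint you actually use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # any pre-quantized AWQ checkpoint
    quantization="awq",                      # optional: vLLM usually auto-detects this from the config
    max_model_len=8192,
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Explain activation-aware quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```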
GPTQ
GPTQ uses layer-by-layer Hessian-based weight compensation. It’s slightly less accurate than AWQ (90% vs. 95% quality retention at 4-bit) but generates smaller model files and has broader toolchain support. ExLlama v2 kernels push GPTQ decode speed well beyond baseline. For raw decoding throughput on a single NVIDIA GPU, GPTQ with ExLlama v2 is still competitive with AWQ + Marlin.
The 24 GB sweet spot
At 24 GB VRAM (RTX 3090, RTX 4090, L40S): a 70B model at Q4_K_M needs ~40 GB, still too large for VRAM alone. Use either a CPU-GPU split with GGUF or a smaller model (a 34B at AWQ 4-bit takes ~20 GB). At 34B and below, AWQ or GPTQ run fully in VRAM. A 70B at INT8 (~70–80 GB) requires two GPUs or the cloud.
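The sizes quoted here are simple arithmetic: parameter count times bits per weight, plus some overhead for embeddings and metadata. A quick sketch you can use to sanity-check any model before downloading it; the bits-per-weight values are approximations and the 10% overhead factor is an assumption, not a measured constant:

```python
# Back-of-the-envelope weight-memory estimator: params * bits-per-weight / 8,
# plus ~10% overhead for embeddings, norms, and format metadata (an assumption).

BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "int8": 8.0,
    "q8_0": 8.5,       # GGUF Q8_0: 8-bit weights plus per-block scales
    "q4_k_m": 4.5,     # mixed 4/6-bit K-quants average slightly above 4 bits
    "awq_4bit": 4.25,  # 4-bit weights plus group-wise scales and zeros
    "bitnet": 1.58,
}

def weight_memory_gb(params_billions: float, fmt: str, overhead: float = 0.10) -> float:
    """Approximate memory for the weights alone (KV cache and activations are extra)."""
    bits = BITS_PER_WEIGHT[fmt]
    return params_billions * bits / 8 * (1 + overhead)

if __name__ == "__main__":
    print(f"70B @ Q4_K_M : ~{weight_memory_gb(70, 'q4_k_m'):.0f} GB")   # ~43 GB
    print(f"34B @ AWQ 4b : ~{weight_memory_gb(34, 'awq_4bit'):.0f} GB") # ~20 GB
    print(f"70B @ INT8   : ~{weight_memory_gb(70, 'int8'):.0f} GB")     # ~77 GB
```

Real files vary by architecture, vocabulary size, and tokenizer, but the estimate lands close enough to decide whether a model fits your tier.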
Consumer GPU format comparison:
| Format | Quality retention | Peak throughput | Hardware req. | Toolchain |
|---|---|---|---|---|
| GGUF Q4_K_M | 92% | Moderate | Any | llama.cpp, Ollama |
| GGUF Q8_0 | ~99% | Moderate | Any | llama.cpp, Ollama |
| AWQ 4-bit | 95% | 741 tok/s (Marlin) | NVIDIA preferred | vLLM, LMDeploy |
| GPTQ 4-bit | 90% | 712 tok/s (Marlin) | NVIDIA preferred | vLLM, ExLlama v2 |
| bitsandbytes NF4 | ~92% | Slower | NVIDIA | HuggingFace |
Prosumer GPU (24–80 GB VRAM)
AWQ or GPTQ for 24–48 GB. FP8 weight-only (W8A16) on Ada Lovelace (L40S, RTX 4090). INT8 for A100. FP8 full (W8A8) for H100. This tier is where format choice starts overlapping with serving infrastructure — you’re probably not running llama.cpp anymore.
At 48 GB (dual RTX 3090, A6000): AWQ 4-bit handles 70B models in a single node. INT8 handles 34B–70B with better quality. For production multi-user inference, shift to vLLM with AWQ or INT8. vLLM’s continuous batching handles concurrent users far better than llama.cpp’s sequential processing.
At 80 GB (A100 80 GB): a single A100 fits a 70B at INT8 (~70 GB of weights) or a 34B at FP16. This is where you move away from GGUF and toward dedicated serving stacks. A100 does not support native FP8; forcing FP8 via vLLM degrades performance 10–20% because it falls back to software emulation. Stick with INT8: with SmoothQuant calibration it shows only a 0.04% accuracy drop from the BF16 baseline and runs natively on A100 Tensor Cores. INT8 on A100 supports models up to ~70B parameters via vLLM.
Data center and cloud (A100/H100, cloud API)
A100 gets INT8 via vLLM. H100 gets FP8 (W8A8) at less than 0.5% accuracy loss and up to 1.6x throughput improvement. For long-context workloads, TurboQuant KV cache compression is a separate, stackable decision — don’t confuse it with weight quantization.
A100: INT8 is the answer
INT8 W8A8 on A100 shows 1–3% accuracy degradation without calibration. With SmoothQuant (the standard calibration method for A100 in production), that number drops to 0.04% vs. BF16. That’s not a tradeoff, that’s noise. Memory reduction: a 70B model goes from ~140 GB to ~70 GB, fitting on a single A100 80 GB.
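The calibration itself happens offline (llm-compressor is the usual tool for producing SmoothQuant W8A8 checkpoints); serving the result is then no different from serving any other model. A sketch with a hypothetical pre-quantized checkpoint name:

```python
# Sketch: serving a pre-quantized INT8 (SmoothQuant W8A8) checkpoint on an A100 with vLLM.
# Assumes the checkpoint was already calibrated offline (e.g. with llm-compressor);
# the model ID below is a placeholder for whatever W8A8 repo you produced or downloaded.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/llama-3.1-70b-instruct-w8a8",  # hypothetical pre-quantized checkpoint
    tensor_parallel_size=1,   # a 70B at INT8 (~70 GB of weights) fits on a single A100 80 GB
    max_model_len=4096,       # keep the KV cache budget modest; 70 GB of weights leaves little headroom
)

outputs = llm.generate(
    ["Draft a one-line status update for the quantization migration."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```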
H100: FP8 is the answer
H100 has native FP8 Tensor Core support. FP8 achieves a 3.5x speedup over FP16 on Mixtral 70B with less than 0.5% accuracy loss. The reason FP8 beats INT8 on quality isn’t mysterious: FP8’s exponential bit spacing handles outlier activations better than INT8’s linear range. H100 FP8 gives you 4.6x max throughput and 4.4x faster first-token latency vs. A100. Turn it on in vLLM with --quantization fp8. One caveat: full W8A8 is Hopper-only. Ada Lovelace (L40S, RTX 4090) gets W8A16 weight-only FP8 — good, but not the same thing.
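In vLLM the switch is the quantization option, and because the FP8 conversion happens dynamically at load time you can point it at an ordinary BF16 checkpoint. A sketch, assuming an H100 and an illustrative model ID:

```python
# Sketch: dynamic FP8 (W8A8) quantization in vLLM on Hopper.
# Assumes an H100/H200; the base model ID is illustrative, and larger models
# scale out with tensor_parallel_size instead of a single GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # ordinary BF16 checkpoint
    quantization="fp8",                         # weights and activations quantized to FP8 at load time
    max_model_len=16384,
)

outputs = llm.generate(
    ["List three risks of quantizing a medical-summarization model."],
    SamplingParams(temperature=0.3, max_tokens=200),
)
print(outputs[0].outputs[0].text)
```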
TurboQuant: KV cache is a separate problem
Weight quantization and KV cache compression solve different problems. Weight quantization reduces how much memory the model itself takes up. KV cache compression reduces the runtime memory consumed during inference as context length grows. They’re independent, and you can run both.
Google’s TurboQuant (ICLR 2026) compresses KV cache to 3-bit with zero accuracy loss: 6x lower KV cache memory, up to 8x faster attention on H100. The approach: randomly rotate the data vectors to simplify their geometry, apply standard per-channel quantization, then use 1 bit of residual QJL to clean up the bias. Community implementations are already on GitHub with vLLM integration. If you’re serving 32K+ context at any volume, run TurboQuant on top of whatever weight quantization you chose.
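TurboQuant’s 3-bit path isn’t a stock vLLM flag yet, so the sketch below uses vLLM’s built-in fp8 KV cache option purely to illustrate the stacking: the weight format and the KV cache format are independent knobs. Swap in a community TurboQuant build if you want the 3-bit, 6x numbers quoted above.

```python
# Sketch: stacking KV cache quantization on top of weight quantization in vLLM.
# This uses vLLM's built-in fp8 KV cache (not TurboQuant's 3-bit scheme) purely to
# show that the two choices are independent; 3-bit KV needs the community integrations.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model ID
    quantization="fp8",        # weight quantization choice
    kv_cache_dtype="fp8",      # KV cache compression choice: a separate, stackable knob
    max_model_len=32768,       # long-context serving is where the KV cache dominates memory
)

outputs = llm.generate(
    ["Summarize the document above in five bullet points."],
    SamplingParams(temperature=0.2, max_tokens=300),
)
print(outputs[0].outputs[0].text)
```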
MoE serving with Flash-MoE and FP8
For Mixture-of-Experts models at data center scale, only active experts need to be loaded per token. FP8 quantization of expert weights + Flash-MoE expert streaming gives a practical path to serving 397B+ models at inference costs competitive with dense 70B models. At 4-bit + expert streaming, production tool calling works. At 2-bit, tool calling breaks.
Data center format comparison:
| Format | Hardware | Accuracy loss | Memory reduction | Throughput gain |
|---|---|---|---|---|
| INT8 W8A8 (uncalibrated) | A100, older | 1–3% | ~50% | Moderate |
| INT8 SmoothQuant | A100 | ~0.04% | ~50% | 1.3–1.5x |
| FP8 W8A8 | H100, Ada | < 0.5% | ~50% | Up to 1.6x |
| FP8 + TurboQuant KV | H100 | ~0% (KV) | 6x KV cache | Up to 8x attn |
| AWQ 4-bit | A100/H100 | 1.2–5% | ~75% | 10.9x (Marlin) |
The decision tree
Work top to bottom. Stop at the first branch that matches your situation.
START: What's your hardware?
│
├─ Edge / mobile (≤16 GB unified or phone)
│ ├─ Is the model BitNet-native (trained with ternary weights)?
│ │ ├─ Yes → BitNet TQ1_0 via bitnet.cpp or QVAC Fabric
│ │ └─ No → GGUF Q4_K_M (llama.cpp, Ollama)
│ └─ 48 GB Apple Silicon, running MoE?
│ └─ Yes → Flash-MoE 4-bit (NOT 2-bit — breaks tool calling)
│
├─ Consumer GPU (8–24 GB VRAM, NVIDIA)
│ ├─ Do you need CPU-GPU split or portability?
│ │ └─ Yes → GGUF Q4_K_M or Q8_0
│ ├─ Quality is the priority (< 2% accuracy loss)?
│ │ └─ Yes → AWQ 4-bit (Marlin kernel for best throughput)
│ └─ Throughput is the priority and quality loss acceptable?
│ └─ Yes → GPTQ 4-bit (ExLlama v2 kernel)
│
├─ Prosumer GPU (24–80 GB VRAM)
│ ├─ Single A6000/A100 80 GB, serving users?
│ │ ├─ Yes → INT8 via vLLM (SmoothQuant for best accuracy)
│ │ └─ Ada Lovelace (L40S, RTX 4090)?
│ │ └─ Yes → FP8 W8A16 (weight-only, not full W8A8)
│ └─ Multi-GPU cluster?
│ └─ Yes → See data center tier below
│
├─ Data center (A100 / H100)
│ ├─ A100?
│ │ └─ INT8 (SmoothQuant) → Do NOT use FP8 (software emulation, −20% perf)
│ ├─ H100 / H200 / Blackwell?
│ │ └─ FP8 W8A8 via vLLM (--quantization fp8)
│ └─ Long-context serving (32K+ tokens)?
│ └─ + TurboQuant 3-bit KV cache compression (stacks on any weight quant)
│
└─ Cloud API
└─ You don't control quantization. Pick the provider tier:
- Groq / Fireworks: FP8 or INT8 optimized inference
- OpenAI / Anthropic: Proprietary, no direct access
- Together / Replicate: Usually AWQ or GPTQ 4-bit
See /ai-agents/0047-cost-management-for-agents/ for cost per token analysis
Mermaid flowchart
flowchart TD
A[What hardware?] --> B[Edge/mobile ≤16GB]
A --> C[Consumer GPU 8-24GB VRAM]
A --> D[Data center A100/H100]
A --> E[Cloud API]
B --> B1{BitNet-native model?}
B1 -->|Yes| B2[BitNet TQ1_0\nQVAC Fabric]
B1 -->|No| B3[GGUF Q4_K_M\nllama.cpp / Ollama]
B -->|48GB Apple Silicon + MoE| B4[Flash-MoE 4-bit]
C --> C1{Need CPU-GPU split?}
C1 -->|Yes| C2[GGUF Q4_K_M or Q8_0]
C1 -->|No| C3{Quality priority?}
C3 -->|Yes| C4[AWQ 4-bit\nMarlin kernel]
C3 -->|No| C5[GPTQ 4-bit\nExLlama v2]
D --> D1{GPU type?}
D1 -->|A100| D2[INT8 SmoothQuant\nvLLM]
D1 -->|H100/H200| D3[FP8 W8A8\nvLLM --quantization fp8]
D2 --> D4{Long context?}
D3 --> D4
D4 -->|Yes 32K+| D5[+ TurboQuant 3-bit KV\n6x memory reduction]
E --> E1[Provider-managed\nSee cost analysis post]
Quick-reference comparison table
| Format | Bit-width | Quality | CPU | NVIDIA GPU | Apple Metal | Primary use |
|---|---|---|---|---|---|---|
| GGUF Q4_K_M | ~4.5 | 92% | ✓ | ✓ (offload) | ✓ | Edge, consumer, universal |
| GGUF Q8_0 | 8 | ~99% | ✓ | ✓ (offload) | ✓ | Quality-critical, ≥16 GB |
| AWQ 4-bit | 4 | 95% | — | ✓ (Marlin) | — | Consumer/prosumer NVIDIA |
| GPTQ 4-bit | 4 | 90% | — | ✓ (ExLlama) | — | High-throughput NVIDIA |
| BitNet TQ1_0 | 1.58 | Native | ✓ | ✓ | ✓ | BitNet-trained models only |
| INT8 SmoothQuant | 8 | ~99.96% | — | ✓ A100 | — | A100 production serving |
| FP8 W8A8 | 8 | ~99.5% | — | ✓ H100+ | — | H100 production serving |
| TurboQuant KV | 3 (KV) | ~100% | — | ✓ H100 | — | KV cache at scale (stacks) |
| Flash-MoE 4-bit | 4 (MoE) | Production | — | — | ✓ 48GB | Large MoE on Apple Silicon |
FAQ
What is the best quantization format for a 16 GB MacBook?
GGUF Q4_K_M for most models. BitNet (TQ1_0) for models trained natively in 1.58-bit. Flash-MoE 4-bit handles MoE models like Qwen3.5-397B, but that route needs a 48 GB machine, not 16 GB.
Does AWQ actually outperform GPTQ?
For accuracy retention, yes — AWQ holds 95% quality vs. GPTQ’s 90% at 4-bit. For raw throughput on NVIDIA GPUs using the Marlin kernel, AWQ reaches 741 tok/s vs. GPTQ’s 712 tok/s. GPTQ is still fine for most use cases.
Can I use FP8 on my A100?
No. A100 does not have native FP8 hardware. Forcing FP8 on A100 via vLLM degrades performance 10–20% due to software emulation. Use INT8 on A100; reserve FP8 for H100, H200, Ada Lovelace (RTX 4090, L40S), and Blackwell GPUs.
What is TurboQuant and do I need it?
TurboQuant (Google, ICLR 2026) compresses KV cache to 3-bit with no accuracy loss, 6x memory reduction, and up to 8x attention speedup on H100. It targets inference serving, not weight compression. If you’re serving 32K+ context requests at scale, it’s worth adding on top of your weight quantization choice.
Is BitNet only for Microsoft’s models?
BitNet b1.58 is a training-time technique — the model must be trained with ternary weights from the start. You can’t post-training-quantize a standard LLM into BitNet. Tether’s QVAC Fabric adds LoRA fine-tuning on-device for adaptation, but the base model still needs to be BitNet-native.
Related posts
- Model serving architecture — how quantization fits into the broader serving stack
- Cost management for agents — cloud API cost analysis by provider and model tier
- Agent deployment patterns — where quantized local models fit in multi-agent architectures