
[Illustration: a large crystal compressed between hydraulic press plates into a smaller glowing gem, representing extreme model compression]

TL;DR — NanoQuant (arXiv 2602.06694) compresses a 70B model from 138GB to 5.35GB — 26x reduction — while staying competitive on language modeling benchmarks. This is the first effective sub-1-bit post-training quantization framework. A 70B model on a single 8GB consumer GPU is now feasible. This extends the quantization decision tree with a new extreme compression branch.


Below 1 bit per weight: what that actually means

Every quantization method you have used so far operates at or above 1 bit per weight. INT8 uses 8 bits. INT4 uses 4 bits. GPTQ and AWQ typically compress to 3-4 bits. Binary quantization — the previous floor — uses 1 bit. Below that, information theory suggests you cannot meaningfully represent individual weights.

NanoQuant (arXiv 2602.06694, February 2026) breaks through this floor by not representing individual weights. Instead, it groups weights and represents each group with a shared codebook entry. When the codebook is small enough relative to the group size, the effective bits-per-weight drops below 1.

The result: a 70B parameter model that occupies 138GB in FP16 compresses to 5.35GB. That is 26x compression. A model that required a multi-GPU server now fits on a laptop with a single 8GB consumer GPU.

How product quantization enables sub-1-bit precision

The technique is product quantization (PQ), borrowed from approximate nearest neighbor search in computer vision. The principle: instead of quantizing each weight independently, group weights into vectors and map each vector to the nearest entry in a learned codebook.

| Quantization method | Bits per weight | 70B model size | Compression ratio |
|---|---|---|---|
| FP16 (baseline) | 16 | 138 GB | 1x |
| INT8 | 8 | 69 GB | 2x |
| INT4 (GPTQ/AWQ) | 4 | 34.5 GB | 4x |
| 2-bit (extreme) | 2 | 17.3 GB | 8x |
| Binary | 1 | 8.6 GB | 16x |
| NanoQuant (sub-1-bit) | ~0.6 | 5.35 GB | 26x |

At sub-1-bit precision, each weight is not stored individually. Instead, a group of (say) 8 weights is represented by a single codebook index. If the codebook has 256 entries, each index uses 8 bits to represent 8 weights — 1 bit per weight. If the codebook has 16 entries, each index uses 4 bits for 8 weights — 0.5 bits per weight.
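The arithmetic is worth sanity-checking: the effective bit rate is just the index width divided by the group size. A minimal sketch in Python:

```python
import math

def effective_bits_per_weight(group_size: int, codebook_entries: int) -> float:
    """Bits for one codebook index, amortized over the weights in the group."""
    return math.log2(codebook_entries) / group_size

print(effective_bits_per_weight(8, 256))  # 1.0  (8-bit index over 8 weights)
print(effective_bits_per_weight(8, 16))   # 0.5  (4-bit index over 8 weights)
```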

NanoQuant optimizes the codebook entries through a combination of k-means clustering on the weight distributions and fine-tuning the codebook to minimize reconstruction error on calibration data. The result is a codebook where each entry represents a common weight pattern in the model, and most weight groups map closely to one of these patterns.
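As an illustration, here is a minimal first-stage sketch using plain k-means from scikit-learn on reshaped weight groups. The function and variable names are mine, and the calibration-data fine-tuning step the paper describes is omitted:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_pq_codebook(weights: np.ndarray, group_size: int = 8, n_entries: int = 16):
    """First-stage product quantization: reshape weights into groups and learn
    a shared codebook with k-means. Simplified sketch only; NanoQuant also
    fine-tunes the codebook against reconstruction error on calibration data."""
    groups = weights.reshape(-1, group_size)            # one row per weight group
    km = KMeans(n_clusters=n_entries, n_init=10, random_state=0).fit(groups)
    codebook = km.cluster_centers_                      # shape (n_entries, group_size)
    indices = km.predict(groups).astype(np.uint8)       # one small index per group
    return codebook, indices

# Toy example: 4096 weights -> 512 groups of 8, each stored as an index into a 16-entry codebook
layer = np.random.randn(4096).astype(np.float32)
codebook, idx = fit_pq_codebook(layer)
coarse = codebook[idx]                                  # coarse reconstruction, shape (512, 8)
```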

The residual compression trick

Product quantization alone introduces noticeable accuracy loss at sub-1-bit rates. The gap between a weight group and its nearest codebook entry — the quantization residual — is small for any single group, but the error accumulates across billions of parameters.

NanoQuant adds a second stage: residual compression. After the first-round PQ, it computes the residual (original weights minus reconstructed weights) and applies a second, smaller codebook to compress the residuals. This two-stage approach recovers most of the accuracy lost in the initial quantization.

graph LR
    A[Original weights - FP16] --> B[Product quantization - first codebook]
    B --> C[Reconstructed weights - coarse]
    A --> D[Compute residual: original minus coarse]
    D --> E[Residual compression - second codebook]
    C --> F[Final: coarse + compressed residual]
    
    style A fill:#1976d2,color:#fff
    style F fill:#4caf50,color:#fff

The final compressed representation is the first codebook index plus the residual codebook index. Both are compact integer values. The decompression at inference time is a pair of codebook lookups followed by addition — fast enough that it does not dominate the inference latency, which remains memory-bandwidth-bound on consumer hardware.
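A minimal sketch of that decode path, with toy codebooks and stored indices standing in for real values (shapes and names are illustrative, not taken from the paper):

```python
import numpy as np

def decode_layer(idx_coarse, idx_residual, codebook_coarse, codebook_residual):
    """Inference-time decompression: two codebook lookups plus an addition."""
    groups = codebook_coarse[idx_coarse] + codebook_residual[idx_residual]
    return groups.reshape(-1)                            # flatten groups back into the weight vector

# Toy codebooks and stored indices: 16 entries of 8-weight groups for each stage
rng = np.random.default_rng(0)
cb_coarse = rng.normal(size=(16, 8)).astype(np.float32)
cb_residual = (0.1 * rng.normal(size=(16, 8))).astype(np.float32)
idx_coarse = rng.integers(0, 16, size=512, dtype=np.uint8)
idx_residual = rng.integers(0, 16, size=512, dtype=np.uint8)

weights = decode_layer(idx_coarse, idx_residual, cb_coarse, cb_residual)
print(weights.shape)   # (4096,)
```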

What the accuracy tradeoff actually looks like

The 26x compression number is meaningless if the model produces garbage. NanoQuant reports competitive perplexity on standard language modeling benchmarks, but “competitive” requires unpacking.

General language tasks: Performance is close to INT4 quantization. The additional degradation from going from 4-bit to sub-1-bit is smaller than the degradation from going from FP16 to 4-bit in the first place. For tasks like summarization, classification, and general Q&A, the loss is modest.

Mathematical reasoning: More sensitive to quantization. The precision loss in weight representation affects multi-step numerical reasoning, where small errors compound. Expect measurable degradation on math benchmarks compared to INT4.

Code generation: Moderate sensitivity. Code has stricter correctness requirements — a slightly wrong weight can produce syntactically valid but semantically wrong code. Performance is between general tasks (modest impact) and math (significant impact).

Long-context tasks: At sub-1-bit precision, the model’s ability to attend to information across long contexts degrades faster than at higher precisions. For short-to-medium contexts (under 4K tokens), this is negligible. For 16K+ contexts, consider higher-precision quantization.

The honest recommendation: evaluate NanoQuant on your specific task distribution before deploying. The 26x compression is compelling enough to justify the evaluation cost, but the accuracy profile is task-dependent in ways that aggregate benchmarks do not capture.

Where NanoQuant fits in the quantization decision tree

The existing quantization decision tree covers the standard range: FP16 for quality-critical workloads, INT8 for balanced serving, INT4/GPTQ for memory-constrained deployment, and 1-bit approaches for extreme compression.

NanoQuant adds a new branch below 1-bit:

Use NanoQuant when:

  • Hardware constraint is absolute: you must fit a 70B model on a single consumer GPU (8GB VRAM)
  • The task is general language (summarization, classification, Q&A) where moderate accuracy loss is acceptable
  • Latency is less important than being able to run the model at all
  • Cost of renting GPU servers exceeds the value of the deployment

Do not use NanoQuant when:

  • The task requires high mathematical precision or code correctness
  • Long-context processing (16K+ tokens) is required
  • You have access to INT4-capable hardware with sufficient VRAM — INT4 gives better accuracy, albeit at 4x compression rather than 26x
  • The deployment is production-facing with SLAs on output quality

The practical niche: researchers, hobbyists, and edge deployments where running a 70B model locally at reduced quality is better than not running it at all, and where API access is either too expensive or unavailable.

Key takeaways

  • NanoQuant (arXiv 2602.06694) achieves sub-1-bit quantization through product quantization with shared codebooks plus residual compression, shrinking a 70B model from 138GB to 5.35GB
  • The 26x compression ratio enables running 70B models on single 8GB consumer GPUs for the first time
  • Accuracy tradeoff is task-dependent: general language tasks see modest degradation, while mathematical reasoning and code generation show more significant loss
  • The technique works because weights are grouped and mapped to codebook entries rather than quantized individually — enabling effective bit rates below 1 bit per parameter
  • NanoQuant fills a specific niche in the quantization decision tree: absolute hardware constraints where running the model at reduced quality beats not running it
  • For production workloads with quality SLAs, INT4 quantization remains the better tradeoff; NanoQuant is for scenarios where INT4’s memory footprint (34.5GB for 70B) still exceeds available hardware

FAQ

What is sub-1-bit quantization? Representing model weights at less than 1 bit per weight on average. NanoQuant achieves this by grouping weights and mapping each group to a shared codebook entry. When the codebook is compact relative to the group size, the effective bits-per-weight drops below 1.

Can you run a 70B model on an 8GB GPU? With NanoQuant, the compressed model (5.35GB) fits with room for KV cache and activations at short contexts. For contexts under 4K tokens, a single 8GB consumer GPU can serve the model. Longer contexts require additional memory for KV cache.
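As a rough budget, assuming a Llama-2-70B-style architecture (80 layers, 8 grouped-query KV heads, head dimension 128) and an FP16 KV cache; these are illustrative assumptions, not figures reported by the paper:

```python
def kv_cache_gb(context_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """FP16 KV cache: keys and values for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

model_gb = 5.35  # NanoQuant-compressed 70B model
for ctx in (2048, 4096, 8192):
    print(f"{ctx:>5} tokens: ~{model_gb + kv_cache_gb(ctx):.1f} GB")  # ~6.0, ~6.7, ~8.0 GB
```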

Is the accuracy loss acceptable for production use? Depends on the task. General language tasks (summarization, classification, Q&A) see modest degradation. Mathematical reasoning and code generation show more significant loss. Evaluate on your specific workload before deploying.

How does NanoQuant compare to GPTQ and AWQ? GPTQ and AWQ typically compress to 3-4 bits per weight. NanoQuant compresses to approximately 0.6 bits per weight — roughly 5-6x more compression. The tradeoff is more accuracy loss, especially on precision-sensitive tasks. GPTQ/AWQ are the better choice when their memory footprint fits your hardware.
