
“Matrix multiplication without multiplication. That’s not a riddle — it’s how ternary weights work.”

TL;DR

BitNet.cpp replaces floating-point matrix multiplication with addition and subtraction using ternary weights ({-1, 0, +1}), cutting matrix-multiplication energy by a reported 71.4x and weight memory by roughly 90%. The largest released model is 2B parameters, so the 100B-on-a-CPU promise remains theoretical. ParoQuant (ICLR 2026) takes a different path: pairwise rotations at INT4 that beat AWQ by 2.4% on reasoning benchmarks, with MLX support for Apple Silicon. The practical answer: 4-bit remains the production sweet spot today; ternary weights are coming for edge and batch workloads.

[Image: a kitchen scale weighing a single feather against a stack of heavy books]

What changes when weights have only three values?

Standard LLM inference is dominated by matrix multiplication. Every token generated requires multiplying activation vectors against weight matrices with billions of floating-point entries. This is what GPUs are built for — and why inference is expensive.

BitNet b1.58 restricts every weight to one of three values: -1, 0, or +1. The “b1.58” refers to the information content: log₂(3) ≈ 1.58 bits per weight. This is not standard quantization — it is a fundamentally different compute paradigm.

When a weight is -1, you subtract the activation. When it is +1, you add it. When it is 0, you skip it entirely. No floating-point multiplication occurs at any point. The entire forward pass reduces to additions and subtractions.
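To make the add/subtract claim concrete, here is a minimal Python sketch of a ternary matrix-vector product. It is an illustration of the idea, not BitNet.cpp's optimized kernels, which use packed weights and lookup-table/SIMD routines:

```python
import numpy as np

def ternary_matvec(W_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Matrix-vector product where every weight is -1, 0, or +1.

    No multiplications: for each output row, add the activations under
    +1 weights, subtract those under -1 weights, and skip the zeros.
    """
    out = np.empty(W_ternary.shape[0], dtype=np.float64)
    for i, row in enumerate(W_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

# Sanity check against the ordinary dot product.
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # ternary weights
x = rng.standard_normal(8)             # activations (8-bit integers in BitNet)
assert np.allclose(ternary_matvec(W, x), W @ x)
```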

The quantization method is absmean: divide weights by their mean absolute value, round to {-1, 0, +1}, and pair with 8-bit activations. Microsoft’s 2024 paper showed this matches FP16 perplexity and task performance at 3B+ parameter scale — comparable to 4-bit quantization methods while being architecturally simpler.
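A minimal sketch of that absmean step (the full training recipe also quantizes activations to 8 bits and keeps the scale around for dequantization; this shows only the weight side):

```python
import numpy as np

def absmean_quantize(W: np.ndarray, eps: float = 1e-5):
    """Quantize a weight tensor to {-1, 0, +1} by absmean scaling.

    Returns the ternary tensor plus the scale needed to approximate W.
    """
    scale = np.abs(W).mean() + eps                       # mean absolute value
    W_ternary = np.clip(np.round(W / scale), -1, 1).astype(np.int8)
    return W_ternary, scale

W = np.random.randn(256, 256).astype(np.float32)
W_q, s = absmean_quantize(W)
print(np.unique(W_q))                                    # [-1  0  1]
print(float(np.abs(W - s * W_q).mean()))                 # average reconstruction error
```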

graph LR
    A[Standard LLM<br/>FP16 weights] -->|Matrix multiply| B[GPU required<br/>~200 GB for 100B]
    C[BitNet b1.58<br/>Ternary weights] -->|Add/subtract only| D[CPU viable<br/>~20 GB for 100B]
    A -->|Energy| E[1x baseline]
    C -->|Energy| F[71.4x reduction<br/>on 7nm chips]

The energy numbers come from the arithmetic itself. A 16-bit floating-point multiply-accumulate uses roughly 50x more energy than an integer addition on the same process node. Microsoft's 71.4x figure refers to the matrix-multiplication arithmetic on 7nm; end-to-end savings also depend on the memory access patterns that dominate real-world inference, which ternary weights shrink as well.
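As a sanity check on the per-operation gap, here is the arithmetic using commonly cited per-op energy estimates from Horowitz (ISSCC 2014, roughly 45nm). Treat the picojoule values as ballpark figures; they are not the 7nm numbers behind Microsoft's 71.4x claim:

```python
# Ballpark per-operation energies in picojoules (Horowitz, ISSCC 2014, ~45nm).
# Illustrative only; Microsoft's 71.4x figure comes from its own 7nm analysis.
FP16_MUL_PJ = 1.1
FP16_ADD_PJ = 0.4
INT8_ADD_PJ = 0.03

fp16_mac = FP16_MUL_PJ + FP16_ADD_PJ   # one FP16 multiply-accumulate
ternary_op = INT8_ADD_PJ               # ternary weight: a single add or subtract

print(f"FP16 MAC: {fp16_mac:.2f} pJ, INT8 add: {ternary_op:.2f} pJ")
print(f"Per-operation ratio: {fp16_mac / ternary_op:.0f}x")   # roughly 50x
```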

BitNet.cpp (github.com/microsoft/BitNet, 28,000+ stars) is the inference runtime. It compiles ternary models to optimized CPU kernels. The repo reports 1.37-5.07x speedup on ARM and 2.37-6.17x on x86 compared to baseline implementations. Practical throughput depends heavily on model size and hardware — early benchmarks suggest usable interactive speeds on Apple Silicon and batch-only speeds on older Intel hardware.

Where is the 100B model everyone talks about?

It does not exist yet. The largest released BitNet model is 2B parameters.

The math works: a 100B ternary model would need roughly 20 GB of memory (1.58 bits × 100 billion ÷ 8 bits per byte ≈ 19.75 GB). That fits in the RAM of a MacBook Pro. A 100B FP16 model needs roughly 200 GB, which requires a cluster of A100s.
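The same back-of-the-envelope math in code, counting weights only; embeddings, activations, and the KV cache add real overhead on top:

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage only, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n = 100e9                                                  # 100B parameters
print(f"FP16:    {weight_footprint_gb(n, 16):.0f} GB")     # ~200 GB
print(f"Ternary: {weight_footprint_gb(n, 1.58):.2f} GB")   # ~19.75 GB
```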

But training a 100B model with ternary weights from scratch is expensive. BitNet is not a post-training quantization method. You cannot take Llama-3 70B and make it ternary. The model must be trained with the ternary constraint from the first gradient step. Microsoft has not invested that compute publicly.

The 2B model demonstrates that the architecture works. It proves the energy and memory claims. It does not prove that quality scales to 100B, or that the near-FP16 quality observed at 2B holds at larger sizes. Those are open questions.

If you need to run large models on consumer hardware today, standard 4-bit quantization (GGUF via llama.cpp, or GPTQ) applied to existing 70B models is the practical choice. BitNet’s advantage materializes when ternary-native models at 70B+ eventually ship.

What does ParoQuant do differently?

ParoQuant (ICLR 2026, github.com/z-lab/paroquant) solves a different problem. Instead of training new models with extreme constraints, it takes existing FP16 models and quantizes them to INT4 using pairwise Givens rotations that suppress outlier values before rounding.

Standard quantization struggles with outliers. A few weights with unusually large magnitudes dominate the quantization grid, forcing all other weights into a narrow range with poor precision. AWQ (Activation-aware Weight Quantization) protects the channels those outliers live in through per-channel scaling, and GPTQ compensates for rounding error using second-order weight statistics. ParoQuant eliminates the outliers themselves.

Givens rotations are orthogonal transforms — they redistribute weight magnitudes across pairs of dimensions without losing information. Apply enough pairwise rotations and the weight distribution smooths out. The outliers disappear. The resulting tensor quantizes cleanly to INT4 with minimal quality loss.
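A toy numerical illustration of why this helps INT4, assuming a simple symmetric quantizer whose grid step is set by the largest magnitude. This is not ParoQuant's actual algorithm, which optimizes which dimension pairs to rotate and by what angles, but it shows the mechanism:

```python
import numpy as np

def int4_rms_error(w: np.ndarray) -> float:
    """RMS error of symmetric INT4 rounding, grid step set by the largest magnitude."""
    scale = np.abs(w).max() / 7                      # levels -7..7 for simplicity
    w_hat = np.clip(np.round(w / scale), -7, 7) * scale
    return float(np.sqrt(np.mean((w - w_hat) ** 2)))

def givens(n: int, i: int, j: int, theta: float) -> np.ndarray:
    """n x n Givens rotation mixing dimensions i and j; orthogonal, so invertible."""
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = G[j, j] = c
    G[i, j], G[j, i] = -s, s
    return G

w = np.array([8.0, 0.5, -0.6, 0.4])                  # dimension 0 is an outlier
G = givens(4, 0, 1, np.pi / 4)                       # spread it against dimension 1

print(int4_rms_error(w))        # ~0.42: the outlier stretches the grid
print(int4_rms_error(G @ w))    # ~0.25: smaller max magnitude, finer grid
# Because G is orthogonal, the error measured in the rotated basis equals the
# error after mapping the dequantized weights back with G.T, so nothing is lost.
```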

The reasoning-preservation angle is the differentiator. ParoQuant outperforms AWQ by 2.4% specifically on reasoning benchmarks, because the weight distributions that reasoning depends on are exactly the ones that outliers distort most. Math, logic, and multi-step inference rely on precise weight interactions that standard quantization corrupts first.

Deployment support is broad: NVIDIA GPUs via vLLM and Hugging Face Transformers, Apple Silicon via MLX with 2x speedup on M-series chips. Pre-quantized Qwen3.5 models are available. Unlike BitNet, you can apply ParoQuant to any existing model without retraining.

What degrades first when you quantize?

Not all capabilities are equally robust to precision reduction.

| Precision | Perplexity (vs FP16) | Math/reasoning impact | Language understanding |
|---|---|---|---|
| FP16 (baseline) | 5.09 | Baseline | Baseline |
| 4-bit (GPTQ/AWQ) | ~6.7 (+32%) | -10-15% accuracy | -3-5% accuracy |
| 2-bit (AQLM/QuIP#) | ~6.5-7.5 (+28-47%) | -20-30% accuracy | -8-12% accuracy |
| 1-bit (OneBit) | ~9.18 (+80%) | -30-50% accuracy | -15-20% accuracy |

The pattern: math and reasoning degrade 2-3x faster than language understanding. A model that reads text fine at 4-bit might fail at arithmetic. A model that summarizes well might struggle with multi-step logic. This asymmetry explains why ParoQuant’s reasoning-aware rotations matter — they protect the weight interactions that reasoning depends on.

Emerging research pushes further. LittleBit achieves usable results at 0.55 bits per weight — below one bit — through learned sub-bit representations. AQLM operates at 2-3 bits and is Pareto-optimal on the compression-quality frontier. QuIP# specializes at 2 bits with 40% throughput improvement over SqueezeLLM.

When does each approach make sense?

The decision depends on what you are optimizing for.

| Constraint | Best choice | Why |
|---|---|---|
| Production serving, quality matters | 4-bit GPTQ/AWQ | Best quality-speed tradeoff today |
| Apple Silicon, reasoning tasks | ParoQuant INT4 + MLX | 2x speedup, reasoning-aware |
| Edge/IoT, no GPU available | BitNet (when models ship) | CPU-only, minimal memory |
| Cost-sensitive batch inference | BitNet on commodity CPUs | $0.20-0.40/Mtok vs $2-4 on GPU |
| Maximum quality, cost no object | FP16/BF16 on GPU | No quality loss |
| Research, pushing boundaries | AQLM 2-bit or LittleBit 0.55-bit | Pareto frontier exploration |

The economics tell the story. GPU inference costs $2-4 per million tokens. BitNet CPU inference costs $0.20-0.40 per million tokens — a 90% reduction. The hardware cost gap is wider: $500 for a consumer CPU versus $40,000+ for a GPU cluster capable of serving a 100B model.

The trade-off: latency. CPU inference on BitNet runs at 50-200ms per token versus 10-50ms on GPU. For interactive applications, that gap matters. For batch processing, document analysis, and offline workloads, it does not.
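Converting those per-token latencies into per-stream throughput makes the batch-versus-interactive split concrete. This is simple arithmetic on the figures above, ignoring batching and parallelism across machines:

```python
# Per-token latencies quoted above; per-stream, ignoring batching.
latencies_ms = {"BitNet CPU (best)": 50, "BitNet CPU (worst)": 200,
                "GPU (best)": 10, "GPU (worst)": 50}

for label, ms in latencies_ms.items():
    tok_per_s = 1000 / ms
    hours_per_mtok = 1e6 / tok_per_s / 3600
    print(f"{label:20s}: {tok_per_s:5.0f} tok/s, {hours_per_mtok:5.1f} h per million tokens")
```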

Key takeaways

  • Ternary weights eliminate multiplication. BitNet’s {-1, 0, +1} scheme reduces inference to addition and subtraction, cutting matrix-multiplication energy by a reported 71.4x. The architecture is proven at 2B; 100B remains theoretical.
  • ParoQuant preserves reasoning at INT4. Pairwise Givens rotations smooth outliers before quantization, beating AWQ by 2.4% on reasoning benchmarks. Works with existing models, no retraining.
  • Math degrades first. Quantization hits reasoning 2-3x harder than language understanding. Choose your quantization method based on your task profile.
  • 4-bit is still the sweet spot. For production serving today, GPTQ/AWQ at 4-bit gives the best quality-speed ratio. BitNet’s economics become compelling when larger ternary models ship.
  • CPU inference is a different product. Not a cheaper GPU — a different deployment model. Offline batch processing, edge devices, and privacy-sensitive workloads that cannot use cloud GPUs.


Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch