Gemma 4: the three architectural decisions that changed what a small model can do
TL;DR: Gemma 4 (Google DeepMind, April 2026, Apache 2.0) ships in four sizes — 2B, 4B multimodal, 26B MoE, 31B dense — with three architectural decisions that matter beyond benchmark numbers: hybrid attention enabling 256K context at O(n) local cost, native multimodal training via SigLIP end-to-end (not bolted-on), and a 26B MoE variant that activates only 3.8B parameters and achieves ~97% of the 31B dense model’s MMLU-Pro score at ~12% of the compute cost. The 31B scores 85.2% MMLU-Pro, 89.2% AIME 2026. Apache 2.0 removes legal friction for production deployment.

Every major model release produces two kinds of coverage: benchmark tables and architectural analysis. The benchmark tables move fast and forget faster. The architectural analysis is where the durable lessons are.
Gemma 4’s benchmarks are strong: #3 open model on LMSYS Arena at launch, 3–4 points behind GPT-4o and Claude 3.5 Sonnet on reasoning tasks. But the benchmarks aren’t why ML systems teams should study this release. Three architectural decisions in Gemma 4 tell you something concrete about the direction Google DeepMind is moving on open-weight models.
Decision 1: Hybrid attention for 256K context without the memory cost
Naive full attention with 256K context means materializing a 256K × 256K score matrix. In float32, a single such matrix is roughly 256GB, before you even multiply by attention heads and layers, which is impractical on any single device.
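A quick back-of-the-envelope check (a plain-Python sketch, not from the technical report) shows where that number comes from:

```python
# Memory for one naively materialized 256K x 256K attention score matrix in float32.
seq_len = 256 * 1024               # 262,144 tokens
bytes_per_score = 4                # float32
matrix_bytes = seq_len ** 2 * bytes_per_score
print(f"{matrix_bytes / 2**30:.0f} GiB")   # -> 256 GiB for a single score matrix
```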
Gemma 4 solves this with hybrid attention: alternating between two attention types within the same model.
```
Layer 1: Local attention  (512-token window)
Layer 2: Local attention  (512-token window)
Layer 3: Global attention (full 256K context)
Layer 4: Local attention  (512-token window)
Layer 5: Local attention  (512-token window)
Layer 6: Global attention (full 256K context)
...
```
Local attention layers attend only to a sliding window of 512–1,024 tokens. Memory cost: O(window × sequence) rather than O(sequence²). Most of the model’s layers are local.
Global attention layers attend to the full sequence. They’re a minority of layers, but they allow information to flow across the full 256K context when needed. Dual RoPE variants handle positional encoding: one RoPE for local (short-range relative positions), one for global (long-range absolute positions).
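The per-layer asymmetry is easy to quantify. A minimal sketch, using the 512-token window described above and assuming bf16 attention scores (the dtype is my assumption, not from the report):

```python
# Per-layer attention-score memory at 256K context: 512-token sliding window vs. full context.
seq_len = 256 * 1024
window = 512
bytes_per_score = 2                                   # assume bf16 scores

local_scores  = seq_len * window  * bytes_per_score   # O(window x sequence)
global_scores = seq_len * seq_len * bytes_per_score   # O(sequence^2)

print(f"local layer : {local_scores  / 2**30:.2f} GiB")   # ~0.25 GiB
print(f"global layer: {global_scores / 2**30:.0f} GiB")   # ~128 GiB
```

Most layers pay the first number; only the periodic global layers pay the second, and in practice those are computed with blockwise streaming kernels rather than materialized in full. The same asymmetry applies to the KV cache: local layers only need to keep the most recent window of keys and values, while global layers cache the whole sequence.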
The result: 256K context (on the 26B and 31B models; the 2B and 4B support 128K) that fits on consumer hardware, where naive full-attention would require a data center. The 31B model at 8-bit quantization runs on a workstation with 128GB RAM; the 4B model runs on consumer hardware.
The tradeoff is that local layers cannot directly relate tokens that are far apart; that connection must route through the infrequent global layers. For tasks where long-range dependencies are critical (cross-document reasoning, very long-context RAG), performance may lag full-attention models at the same parameter count. For most production tasks, the hybrid performs near-identically.
Decision 2: Native multimodality from training, not adapter layers
The standard approach to adding vision to an existing language model: train a vision encoder separately (usually CLIP or SigLIP), train a projection layer to map image embeddings to text embedding space, freeze the LM and fine-tune the projection. Fast to deploy, but the two spaces are never fully integrated. The model “translates” vision to language rather than thinking in a unified representation.
Gemma 4 trains the SigLIP vision encoder jointly with the text decoder from the start. No pre-trained LM to freeze, no adapter projection to align. The model develops spatial-linguistic representations together.
Three practical consequences:
Variable aspect ratio: fixed-crop encoders (CLIP’s default) lose spatial information when images don’t match the training aspect ratio. Gemma 4’s encoder supports variable aspect ratios with configurable token budgets: 70 tokens per image for efficient processing, up to 1,120 tokens when spatial detail matters. A screenshot gets more tokens than a thumbnail; the model allocates compute proportionally.
Configurable token budget: practitioners can trade off cost and quality. 70 tokens/image for a visual search pipeline processing thousands of images per second; 1,120 tokens/image for document analysis where spatial layout is critical. A rough cost sketch follows this list.
Integrated spatial reasoning: text and vision share the same embedding space from training, not a bridged projection. Early reports from HuggingFace and Google’s own evals show stronger spatial understanding (chart reading, diagram interpretation) than models using adapter architectures at the same parameter count.
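To make the budget tradeoff concrete, here is a rough sketch: the per-image token counts (70 and 1,120) come from the figures above, while the image counts and the 500-token prompt are illustrative assumptions:

```python
# Rough context-cost arithmetic for the per-image token budget (illustrative workloads).
def image_context_cost(n_images: int, tokens_per_image: int, prompt_tokens: int = 500) -> int:
    return prompt_tokens + n_images * tokens_per_image

# Visual-search style pipeline: many images, low budget per image.
print(image_context_cost(n_images=200, tokens_per_image=70))     # 14,500 tokens
# Document analysis: few images, high budget for spatial layout.
print(image_context_cost(n_images=10, tokens_per_image=1_120))   # 11,700 tokens
```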
The 4B multimodal variant (Gemma-4-E4B) is the entry point for vision tasks. At 4B parameters under Apache 2.0, it’s deployable on a single consumer GPU, making it the first Google open-weight multimodal model free of custom-license usage restrictions.
Decision 3: MoE at 26B with 3.8B active parameters
Mixture-of-Experts is well established in frontier models (reportedly in GPT-4, openly in Mixtral), but Gemma 4’s 26B MoE is the first Google open-weight model to ship MoE alongside a dense baseline explicitly designed for comparison.
The architecture: 26B total parameters, ~85.4% sparsity (14.6% active), 3.8B active per forward pass.
```
Token arrives at MoE layer
            │
            ▼
┌────────────────────────┐
│   Routing mechanism    │
│  Selects top-K experts │
└────────┬───────────────┘
         │
    ┌────┴────┐
    ▼         ▼
Expert 1   Expert 7      (K of N experts activated)
    │         │
    └────┬────┘
         │
         ▼
Weighted sum of expert outputs
```
At inference time, only the selected experts load into compute; the rest of the 26B parameters sit idle. GPU memory requirement: proportional to total parameters (you need to store all experts). Compute requirement: proportional to active parameters (only active experts process each token).
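A minimal sketch of that routing step, assuming a standard softmax top-K router; the expert count, K, and dimensions below are illustrative, since Gemma 4’s actual router configuration isn’t given here:

```python
import torch
import torch.nn.functional as F

# Minimal top-K expert routing for one MoE layer (illustrative sizes, not Gemma 4's config).
n_experts, top_k, d_model = 16, 2, 1024
router = torch.nn.Linear(d_model, n_experts, bias=False)
experts = torch.nn.ModuleList([torch.nn.Linear(d_model, d_model) for _ in range(n_experts)])

def moe_layer(x: torch.Tensor) -> torch.Tensor:    # x: [tokens, d_model]
    logits = router(x)                             # [tokens, n_experts]
    weights, idx = logits.topk(top_k, dim=-1)      # keep only the top-K experts per token
    weights = F.softmax(weights, dim=-1)           # normalize routing weights over the chosen K
    out = torch.zeros_like(x)
    for k in range(top_k):
        for e in range(n_experts):
            routed = idx[:, k] == e                # tokens whose k-th choice is expert e
            if routed.any():
                out[routed] += weights[routed, k:k+1] * experts[e](x[routed])
    return out                                     # weighted sum of the selected experts' outputs

y = moe_layer(torch.randn(8, d_model))
```

In a real serving stack the per-expert loops are replaced by batched gather/scatter kernels, but the semantics are the same: each token’s output is a weighted sum of K expert outputs, and only those K experts do any work for that token.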
Benchmark comparison:
| Model | MMLU-Pro | AIME 2026 | LiveCodeBench | Active params | Relative compute |
|---|---|---|---|---|---|
| Gemma-4-31B dense | 85.2% | 89.2% | 80.0% | 31B | 1.0x |
| Gemma-4-26B MoE | 82.6% | 88.3% | 77.1% | 3.8B | ~0.12x |
| GPT-4o (reference) | ~87% | ~91% | ~82% | Unknown | — |
The MoE achieves 82.6% MMLU-Pro, which is 97% of the dense model’s score at 12% of the compute cost per token. On AIME 2026 (the hardest reasoning benchmark in the set), the MoE is within rounding of the dense model.
For production deployment, this matters in one specific scenario: high-throughput inference where the bottleneck is GPU compute (not GPU memory). If you have enough memory to store all 26B parameters but want to process tokens faster, the 3.8B active compute gives you ~8x throughput advantage over the 31B dense at comparable quality. If your bottleneck is memory bandwidth, the advantage shrinks.
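A rough way to sanity-check which regime you’re in; this sketch assumes BF16 weights and the common approximation of ~2 FLOPs per active parameter per token, and ignores attention FLOPs and bandwidth:

```python
# Rough per-token compute vs. weight memory (BF16 weights, ~2 FLOPs per active param per token).
def weight_memory_gb(total_params_b: float) -> float:
    return total_params_b * 2        # BF16: 2 bytes per parameter

def gflops_per_token(active_params_b: float) -> float:
    return 2 * active_params_b       # in GFLOPs

print(weight_memory_gb(31), gflops_per_token(31))    # dense 31B: ~62 GB weights, ~62 GFLOPs/token
print(weight_memory_gb(26), gflops_per_token(3.8))   # MoE 26B:   ~52 GB weights, ~7.6 GFLOPs/token
print(round(gflops_per_token(31) / gflops_per_token(3.8), 1))   # ~8.2x fewer FLOPs/token for the MoE
```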
The Apache 2.0 decision
Previous Gemma releases used Google’s Gemma Terms of Use, which was permissive for most uses but not OSI-approved, with usage restrictions that created legal review requirements in some enterprise contexts. Gemma 4 ships under Apache 2.0: unambiguous commercial use, modification, and distribution.
The practical effect: enterprises with legal review requirements can deploy Gemma 4 on the same track as other Apache 2.0 software. The multi-week approval cycle that blocked some Gemma 3 deployments disappears. To compete with Mistral’s Apache 2.0 releases and Llama’s permissive but custom, non-OSI community license, removing license friction was necessary.
Deployment reality
Day-one support from all major inference stacks:
| Stack | Notes |
|---|---|
| Ollama | `ollama pull gemma4:31b`, consumer hardware ready |
| vLLM | Production-grade serving, PagedAttention, full batching |
| llama.cpp | GGUF quantized weights, CPU inference option |
| MLX | Apple Silicon native, M-series optimized |
| Vertex AI | Google Cloud managed inference, SLA-backed |
Quantization: the 31B in INT4 is roughly 16GB of weights; the 4B model runs in about 4GB. NVIDIA is promoting NVFP4 quantization for Blackwell hardware for further memory reduction with maintained quality.
The 31B runs on a workstation with 128GB RAM at full BF16. For cloud deployment, it fits on a single H100 (80GB) at INT8. The MoE’s 26B total parameters require more GPU memory than its 3.8B active count might suggest, because you need to store all experts, even the inactive ones.
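Rough weight-memory arithmetic behind those numbers (weights only; KV cache and runtime overhead come on top):

```python
# Approximate weight memory per variant and precision (weights only, no KV cache or activations).
def weight_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8

for name, params in [("4B", 4), ("26B MoE, all experts stored", 26), ("31B dense", 31)]:
    print(name, {f"{bits}-bit": round(weight_gb(params, bits), 1) for bits in (16, 8, 4)})
# 31B dense: ~62 GB at BF16, ~31 GB at INT8, ~15.5 GB at INT4
```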
Where Gemma 4 falls short
The #3 LMSYS Arena ranking is impressive for an open model, but the 3–4 point gap behind GPT-4o and Claude 3.5 Sonnet matters for some tasks. On complex multi-step reasoning chains, frontier closed models retain an edge. Gemma 4’s AIME 2026 score of 89.2% is strong — but AIME is a reasoning benchmark where practice effects are significant and frontier models have more extensive training.
Multimodal performance: early evals show competitive spatial understanding, but GPT-4V and Claude 3.5 Sonnet with vision have more extensive fine-tuning data. For production vision tasks, you’d validate on your specific data distribution before committing.
For the KV-cache and memory wall implications of serving MoE models in fleet inference, see KV cache for MoE: the memory wall blocking mixture-of-experts at scale.
Key takeaways
- Gemma 4’s hybrid attention enables 256K context by alternating local 512-token windows (O(n) memory) with sparse global layers — consumer-hardware deployable at context lengths that previously required data center scale.
- Native multimodal training via SigLIP (not adapter layers) enables variable aspect ratio, configurable token budgets (70–1,120 per image), and integrated spatial-linguistic representations.
- The 26B MoE activates 3.8B parameters per token, achieving 82.6% MMLU-Pro (97% of the 31B dense) at ~12% compute cost. The right tool when memory can hold all 26B but compute is the bottleneck.
- The 31B scores 85.2% MMLU-Pro, 89.2% AIME 2026, 80.0% LiveCodeBench — #3 open model at launch, 3–4 points behind frontier closed models.
- Apache 2.0 license eliminates the legal review overhead that blocked some Gemma 3 enterprise deployments.
FAQ
What are the four Gemma 4 model variants? Gemma-4-E2B (2B), Gemma-4-E4B (4B, multimodal), Gemma-4-26B-A4B (26B MoE, 3.8B active), Gemma-4-31B-IT (31B dense). Apache 2.0. All available on HuggingFace and supported by Ollama, vLLM, llama.cpp, MLX, Vertex AI on day one.
What is hybrid attention in Gemma 4? Most layers use local sliding-window attention (512–1,024 tokens, O(n) memory). A minority of layers use global full-context attention. The combination achieves 256K context without O(n²) memory. Dual RoPE handles short-range and long-range positional encoding separately.
How does the 26B MoE achieve ~97% performance at 12% compute? ~85.4% sparsity: only 3.8B of 26B parameters activate per token. At each MoE layer, the routing mechanism selects the top-K experts per token; the rest stay inactive. 82.6% MMLU-Pro vs. 85.2% for the dense 31B — 97% of the performance at ~12% of the compute cost per token.
How is Gemma 4’s multimodal different from bolted-on vision? SigLIP vision encoder trained end-to-end with the text decoder (not post-hoc adapter). Variable aspect ratio support. Configurable token budgets per image (70–1,120). Integrated spatial-linguistic representations rather than cross-modal translation.
What does Apache 2.0 licensing change? It removes the legal ambiguity of the prior Gemma Terms of Use: unambiguous commercial use, modification, and distribution. That puts Gemma on the same footing as Mistral’s Apache 2.0 releases and removes the enterprise legal review overhead that custom licenses trigger.
Further reading
- Gemma 4 technical report — architecture details and benchmark methodology
- HuggingFace Gemma 4 blog post — deployment guide and model card
- KV cache for MoE: the memory wall blocking mixture-of-experts at scale — what MoE inference costs in memory terms
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch