Gemma 4: the three architectural decisions that changed what a small model can do
TL;DR: Gemma 4 (Google DeepMind, April 2026, Apache 2.0) ships in four sizes — 2B, 4B multimodal, 26B MoE, 31B dense — with three architectural decisions that matter beyond benchmark numbers: hybrid attention enabling 256K context at O(n) local cost, native multimodal training via SigLIP end-to-end (not bolted-on), and a 26B MoE variant that activates only 3.8B parameters and achieves ~97% of the 31B dense model’s MMLU-Pro score at ~12% of the compute cost. The 31B scores 85.2% MMLU-Pro, 89.2% AIME 2026. Apache 2.0 removes legal friction for production deployment.

Every major model release produces two kinds of coverage: benchmark tables and architectural analysis. The benchmark tables move fast and forget faster. The architectural analysis is where the durable lessons are.
Gemma 4’s benchmarks are strong: #3 open model on LMSYS Arena at launch, 3–4 points behind GPT-4o and Claude 3.5 Sonnet on reasoning tasks. But the benchmarks aren’t why ML systems teams should study this release. Three architectural decisions in Gemma 4 tell you something concrete about the direction Google DeepMind is moving on open-weight models.
Decision 1: Hybrid attention for 256K context without the memory cost
Naive full attention with 256K context means materializing a 256K × 256K score matrix. In float32, a single such matrix is roughly 256GB, before you even multiply by attention heads and layers, which is impractical on any single device.
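A quick back-of-the-envelope check (a plain-Python sketch, not from the technical report) shows where that number comes from:

```python
# Memory for one naively materialized 256K x 256K attention score matrix in float32.
seq_len = 256 * 1024               # 262,144 tokens
bytes_per_score = 4                # float32
matrix_bytes = seq_len ** 2 * bytes_per_score
print(f"{matrix_bytes / 2**30:.0f} GiB")   # -> 256 GiB for a single score matrix
```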
Gemma 4 solves this with hybrid attention: alternating between two attention types within the same model.
```
Layer 1: Local attention  (512-token window)
Layer 2: Local attention  (512-token window)
Layer 3: Global attention (full 256K context)
Layer 4: Local attention  (512-token window)
Layer 5: Local attention  (512-token window)
Layer 6: Global attention (full 256K context)
...
```
Local attention layers attend only to a sliding window of 512–1,024 tokens. Memory cost: O(window × sequence) rather than O(sequence²). Most of the model’s layers are local.
Global attention layers attend to the full sequence. They’re a minority of layers, but they allow information to flow across the full 256K context when needed. Dual RoPE variants handle positional encoding: one RoPE for local (short-range relative positions), one for global (long-range absolute positions).
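The per-layer asymmetry is easy to quantify. A minimal sketch, using the 512-token window described above and assuming bf16 attention scores (the dtype is my assumption, not from the report):

```python
# Per-layer attention-score memory at 256K context: 512-token sliding window vs. full context.
seq_len = 256 * 1024
window = 512
bytes_per_score = 2                                   # assume bf16 scores

local_scores  = seq_len * window  * bytes_per_score   # O(window x sequence)
global_scores = seq_len * seq_len * bytes_per_score   # O(sequence^2)

print(f"local layer : {local_scores  / 2**30:.2f} GiB")   # ~0.25 GiB
print(f"global layer: {global_scores / 2**30:.0f} GiB")   # ~128 GiB
```

Most layers pay the first number; only the periodic global layers pay the second, and in practice those are computed with blockwise streaming kernels rather than materialized in full. The same asymmetry applies to the KV cache: local layers only need to keep the most recent window of keys and values, while global layers cache the whole sequence.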
The result: 256K context (on the 26B and 31B models; the 2B and 4B support 128K) that fits on consumer hardware, where naive full-attention would require a data center. The 31B model at 8-bit quantization runs on a workstation with 128GB RAM; the 4B model runs on consumer hardware.
The tradeoff is that local layers cannot directly relate tokens that are far apart; that connection must route through the infrequent global layers. For tasks where long-range dependencies are critical (cross-document reasoning, very long-context RAG), performance may lag full-attention models at the same parameter count. For most production tasks, the hybrid performs near-identically.
Decision 2: Native multimodality from training, not adapter layers
The standard approach to adding vision to an existing language model: train a vision encoder separately (usually CLIP or SigLIP), train a projection layer to map image embeddings to text embedding space, freeze the LM and fine-tune the projection. Fast to deploy, but the two spaces are never fully integrated. The model “translates” vision to language rather than thinking in a unified representation.
Gemma 4 trains the SigLIP vision encoder jointly with the text decoder from the start. No pre-trained LM to freeze, no adapter projection to align. The model develops spatial-linguistic representations together.
Three practical consequences:
Variable aspect ratio: fixed-crop encoders (CLIP’s default) lose spatial information when images don’t match the training aspect ratio. Gemma 4’s encoder supports variable aspect ratios with configurable token budgets: 70 tokens per image for efficient processing, up to 1,120 tokens when spatial detail matters. A screenshot gets more tokens than a thumbnail; the model allocates compute proportionally.
Configurable token budget: practitioners can trade off cost and quality. 70 tokens/image for a visual search pipeline processing thousands of images per second; 1,120 tokens/image for document analysis where spatial layout is critical. A rough cost sketch follows this list.
Integrated spatial reasoning: text and vision share the same embedding space from training, not a bridged projection. Early reports from HuggingFace and Google’s own evals show stronger spatial understanding (chart reading, diagram interpretation) than models using adapter architectures at the same parameter count.
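To make the budget tradeoff concrete, here is a rough sketch: the per-image token counts (70 and 1,120) come from the figures above, while the image counts and the 500-token prompt are illustrative assumptions:

```python
# Rough context-cost arithmetic for the per-image token budget (illustrative workloads).
def image_context_cost(n_images: int, tokens_per_image: int, prompt_tokens: int = 500) -> int:
    return prompt_tokens + n_images * tokens_per_image

# Visual-search style pipeline: many images, low budget per image.
print(image_context_cost(n_images=200, tokens_per_image=70))     # 14,500 tokens
# Document analysis: few images, high budget for spatial layout.
print(image_context_cost(n_images=10, tokens_per_image=1_120))   # 11,700 tokens
```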
The 4B multimodal variant (Gemma-4-E4B) is the entry point for vision tasks. At 4B parameters under Apache 2.0, it’s deployable on a single consumer GPU, making it the first Google open-weight multimodal model free of custom-license usage restrictions.
Decision 3: MoE at 26B with 3.8B active parameters
Mixture-of-Experts is well established in frontier models (reportedly in GPT-4, openly in Mixtral), but Gemma 4’s 26B MoE is the first Google open-weight model to ship MoE alongside a dense baseline explicitly designed for comparison.
The architecture: 26B total parameters, ~85.4% sparsity (14.6% active), 3.8B active per forward pass.
```
Token arrives at MoE layer
            │
            ▼
┌────────────────────────┐
│   Routing mechanism    │
│  Selects top-K experts │
└────────┬───────────────┘
         │
    ┌────┴────┐
    ▼         ▼
Expert 1   Expert 7      (K of N experts activated)
    │         │
    └────┬────┘
         │
         ▼
Weighted sum of expert outputs
```
At inference time, only the selected experts load into compute; the rest of the 26B parameters sit idle. GPU memory requirement: proportional to total parameters (you need to store all experts). Compute requirement: proportional to active parameters (only active experts process each token).
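A minimal sketch of that routing step, assuming a standard softmax top-K router; the expert count, K, and dimensions below are illustrative, since Gemma 4’s actual router configuration isn’t given here:

```python
import torch
import torch.nn.functional as F

# Minimal top-K expert routing for one MoE layer (illustrative sizes, not Gemma 4's config).
n_experts, top_k, d_model = 16, 2, 1024
router = torch.nn.Linear(d_model, n_experts, bias=False)
experts = torch.nn.ModuleList([torch.nn.Linear(d_model, d_model) for _ in range(n_experts)])

def moe_layer(x: torch.Tensor) -> torch.Tensor:    # x: [tokens, d_model]
    logits = router(x)                             # [tokens, n_experts]
    weights, idx = logits.topk(top_k, dim=-1)      # keep only the top-K experts per token
    weights = F.softmax(weights, dim=-1)           # normalize routing weights over the chosen K
    out = torch.zeros_like(x)
    for k in range(top_k):
        for e in range(n_experts):
            routed = idx[:, k] == e                # tokens whose k-th choice is expert e
            if routed.any():
                out[routed] += weights[routed, k:k+1] * experts[e](x[routed])
    return out                                     # weighted sum of the selected experts' outputs

y = moe_layer(torch.randn(8, d_model))
```

In a real serving stack the per-expert loops are replaced by batched gather/scatter kernels, but the semantics are the same: each token’s output is a weighted sum of K expert outputs, and only those K experts do any work for that token.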
Benchmark comparison:
| Model | MMLU-Pro | AIME 2026 | LiveCodeBench | Active params | Relative compute |
|---|---|---|---|---|---|
| Gemma-4-31B dense | 85.2% | 89.2% | 80.0% | 31B | 1.0x |
| Gemma-4-26B MoE | 82.6% | 88.3% | 77.1% | 3.8B | ~0.12x |
| GPT-4o (reference) | ~87% | ~91% | ~82% | Unknown | — |
The MoE achieves 82.6% MMLU-Pro, which is 97% of the dense model’s score at 12% of the compute cost per token. On AIME 2026 (the hardest reasoning benchmark in the set), the MoE is within rounding of the dense model.
For production deployment, this matters in one specific scenario: high-throughput inference where the bottleneck is GPU compute (not GPU memory). If you have enough memory to store all 26B parameters but want to process tokens faster, the 3.8B active compute gives you ~8x throughput advantage over the 31B dense at comparable quality. If your bottleneck is memory bandwidth, the advantage shrinks.
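A rough way to sanity-check which regime you’re in; this sketch assumes BF16 weights and the common approximation of ~2 FLOPs per active parameter per token, and ignores attention FLOPs and bandwidth:

```python
# Rough per-token compute vs. weight memory (BF16 weights, ~2 FLOPs per active param per token).
def weight_memory_gb(total_params_b: float) -> float:
    return total_params_b * 2        # BF16: 2 bytes per parameter

def gflops_per_token(active_params_b: float) -> float:
    return 2 * active_params_b       # in GFLOPs

print(weight_memory_gb(31), gflops_per_token(31))    # dense 31B: ~62 GB weights, ~62 GFLOPs/token
print(weight_memory_gb(26), gflops_per_token(3.8))   # MoE 26B:   ~52 GB weights, ~7.6 GFLOPs/token
print(round(gflops_per_token(31) / gflops_per_token(3.8), 1))   # ~8.2x fewer FLOPs/token for the MoE
```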
The Apache 2.0 decision
Previous Gemma releases used Google’s Gemma Terms of Use, which was permissive for most uses but not OSI-approved, with usage restrictions that created legal review requirements in some enterprise contexts. Gemma 4 ships under Apache 2.0: unambiguous commercial use, modification, and distribution.
The practical effect: enterprises with legal review requirements can deploy Gemma 4 on the same track as other Apache 2.0 software. The multi-week approval cycle that blocked some Gemma 3 deployments disappears. To compete with Mistral’s Apache 2.0 releases and Llama’s permissive but custom, non-OSI community license, removing license friction was necessary.
Deployment reality
Day-one support from all major inference stacks:
| Stack | Notes |
|---|---|
| Ollama | `ollama pull gemma4:31b`, consumer hardware ready |
| vLLM | Production-grade serving, PagedAttention, full batching |
| llama.cpp | GGUF quantized weights, CPU inference option |
| MLX | Apple Silicon native, M-series optimized |
| Vertex AI | Google Cloud managed inference, SLA-backed |
Quantization: the 31B in INT4 is roughly 16GB of weights; the 4B model runs in about 4GB. NVIDIA is promoting NVFP4 quantization for Blackwell hardware for further memory reduction with maintained quality.
The 31B runs on a workstation with 128GB RAM at full BF16. For cloud deployment, it fits on a single H100 (80GB) at INT8. The MoE’s 26B total parameters require more GPU memory than its 3.8B active count might suggest, because you need to store all experts, even the inactive ones.
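Rough weight-memory arithmetic behind those numbers (weights only; KV cache and runtime overhead come on top):

```python
# Approximate weight memory per variant and precision (weights only, no KV cache or activations).
def weight_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8

for name, params in [("4B", 4), ("26B MoE, all experts stored", 26), ("31B dense", 31)]:
    print(name, {f"{bits}-bit": round(weight_gb(params, bits), 1) for bits in (16, 8, 4)})
# 31B dense: ~62 GB at BF16, ~31 GB at INT8, ~15.5 GB at INT4
```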
Where Gemma 4 falls short
The #3 LMSYS Arena ranking is impressive for an open model, but the 3–4 point gap behind GPT-4o and Claude 3.5 Sonnet matters for some tasks. On complex multi-step reasoning chains, frontier closed models retain an edge. Gemma 4’s AIME 2026 score of 89.2% is strong — but AIME is a reasoning benchmark where practice effects are significant and frontier models have more extensive training.
Multimodal performance: early evals show competitive spatial understanding, but GPT-4V and Claude 3.5 Sonnet with vision have more extensive fine-tuning data. For production vision tasks, you’d validate on your specific data distribution before committing.
For the KV-cache and memory wall implications of serving MoE models in fleet inference, see KV cache for MoE: the memory wall blocking mixture-of-experts at scale.
Key takeaways
- Gemma 4’s hybrid attention enables 256K context by alternating local 512-token windows (O(n) memory) with sparse global layers — consumer-hardware deployable at context lengths that previously required data center scale.
- Native multimodal training via SigLIP (not adapter layers) enables variable aspect ratio, configurable token budgets (70–1,120 per image), and integrated spatial-linguistic representations.
- The 26B MoE activates 3.8B parameters per token, achieving 82.6% MMLU-Pro (97% of the 31B dense) at ~12% compute cost. The right tool when memory can hold all 26B but compute is the bottleneck.
- The 31B scores 85.2% MMLU-Pro, 89.2% AIME 2026, 80.0% LiveCodeBench — #3 open model at launch, 3–4 points behind frontier closed models.
- Apache 2.0 license eliminates the legal review overhead that blocked some Gemma 3 enterprise deployments.
FAQ
What are the four Gemma 4 model variants? Gemma-4-E2B (2B), Gemma-4-E4B (4B, multimodal), Gemma-4-26B-A4B (26B MoE, 3.8B active), Gemma-4-31B-IT (31B dense). Apache 2.0. All available on HuggingFace and supported by Ollama, vLLM, llama.cpp, MLX, Vertex AI on day one.
What is hybrid attention in Gemma 4? Most layers use local sliding-window attention (512–1,024 tokens, O(n) memory). A minority of layers use global full-context attention. The combination achieves 256K context without O(n²) memory. Dual RoPE handles short-range and long-range positional encoding separately.
How does the 26B MoE achieve ~97% performance at 12% compute? ~85.4% sparsity: only 3.8B of 26B parameters activate per token. At each MoE layer, the routing mechanism selects the top-K experts per token; the rest stay inactive. 82.6% MMLU-Pro vs. 85.2% for the dense 31B — 97% of the performance at ~12% of the compute cost per token.
How is Gemma 4’s multimodal different from bolted-on vision? SigLIP vision encoder trained end-to-end with the text decoder (not post-hoc adapter). Variable aspect ratio support. Configurable token budgets per image (70–1,120). Integrated spatial-linguistic representations rather than cross-modal translation.
What does Apache 2.0 licensing change? It removes the legal ambiguity of the prior Gemma Terms of Use: unambiguous commercial use, modification, and distribution. That puts Gemma on the same footing as Mistral’s Apache 2.0 releases and removes the enterprise legal review overhead that custom licenses trigger.
Further reading
- Gemma 4 technical report — architecture details and benchmark methodology
- HuggingFace Gemma 4 blog post — deployment guide and model card
- KV cache for MoE: the memory wall blocking mixture-of-experts at scale — what MoE inference costs in memory terms
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch