
“Four models, four deployments, four scaling policies, four monitoring dashboards. Or: one model with a dial.”

TL;DR

Mistral Small 4 (March 16, 2026) merges four Mistral products into one 119B MoE model with 6B active params. The reasoning_effort parameter replaces model routing with a dial. 256K context, Apache 2.0, vLLM-compatible. The ML systems story: consolidation from many specialized deployments to one.

[Image: a single server module with three distinct cable types (fiber optic, standard ethernet, and coaxial) converging from different directions.]

What did Mistral consolidate?

Three specialized flagships, each previously requiring its own deployment alongside the general-purpose Mistral Small:

| Previous model | Capability | Now in Small 4 |
|---|---|---|
| Magistral | Mathematical and logical reasoning | reasoning_effort: high |
| Pixtral | Vision and multimodal understanding | Image input support |
| Devstral | Agentic coding, tool use | Code + function calling |

Each had its own deployment target, its own GPU allocation, its own scaling policy, and its own monitoring. An organization using all three, plus Mistral Small for general chat, ran four separate inference services. Mistral Small 4 is the successor to Mistral Small that absorbs all three specialized capabilities.

Mistral Small 4 is one model that handles all four capabilities through the MoE routing layer. Different tasks activate different expert combinations — reasoning tasks engage reasoning-specialized experts, vision tasks engage vision-specialized experts — but the model binary, the serving infrastructure, and the API endpoint are shared.
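A minimal sketch of the top-k routing mechanism that makes this possible: a small gating network scores the experts for each token, and only the top few experts actually run. Everything here (expert count, dimensions, top-k of 2) is illustrative rather than Mistral Small 4's actual architecture.

```python
# Illustrative token-level top-k MoE routing (PyTorch). Not Mistral's code;
# dimensions and expert counts are made up for the sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=16, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # the router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.gate(x)                      # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # only k experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoELayer()
y = layer(torch.randn(4, 512))                     # 4 tokens through the MoE layer
```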

What does reasoning_effort change about serving?

The conventional approach to serving models of varying capability is model routing: a classifier examines incoming requests and routes simple ones to a cheap model and complex ones to an expensive model. This works, but it means maintaining multiple deployments plus the router itself, which must be fast, accurate, and maintained in its own right.
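For contrast, here is a rough sketch of that routing pattern. The endpoint names and the heuristic are hypothetical stand-ins for real deployments and a trained classifier.

```python
# Hypothetical model-routing setup: a classifier-ish heuristic picks which
# of several deployments serves each request. This is the machinery that a
# single model with a reasoning dial replaces.
CHEAP_ENDPOINT = "http://small-chat.internal:8000/v1"      # extraction, formatting
REASONING_ENDPOINT = "http://reasoning.internal:8000/v1"   # math, multi-step logic

def route(request_text: str) -> str:
    """Stand-in for a trained router; real ones are models in their own right."""
    looks_complex = len(request_text) > 2000 or "prove" in request_text.lower()
    return REASONING_ENDPOINT if looks_complex else CHEAP_ENDPOINT
```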

Mistral’s reasoning_effort parameter flips this. Instead of choosing which model handles a request, you choose how much reasoning a single model applies.

  • reasoning_effort: none — fast responses for extraction, formatting, simple classification. Equivalent to Mistral Small 3.2 behavior. The model generates directly with minimal deliberation.
  • reasoning_effort: high — extended reasoning for math, multi-step logic, and complex analysis. The model spends more compute per token on internal planning.

Two positions, not a gradient. You either want reasoning or you do not. This simplicity is a feature — no continuous dial to tune, no wondering whether “medium” is right.
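As a concrete illustration, here is what flipping the dial per request could look like against an OpenAI-compatible endpoint such as vLLM. The served model name and the exact placement of reasoning_effort are assumptions on my part; check your server's docs.

```python
# Sketch: one endpoint, one model, per-request reasoning effort.
# Assumes an OpenAI-compatible server (e.g. vLLM); parameter placement may vary.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def ask(prompt: str, effort: str = "none") -> str:
    resp = client.chat.completions.create(
        model="mistral-small-4",                    # whatever name the server registers
        messages=[{"role": "user", "content": prompt}],
        extra_body={"reasoning_effort": effort},    # "none" or "high"
    )
    return resp.choices[0].message.content

ask("Extract the total from this invoice: ...", effort="none")       # fast path
ask("Is this migration plan logically consistent?", effort="high")   # extended reasoning
```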

The infrastructure implication: one deployment serves all tiers. Scaling is one autoscaling policy, not three. Monitoring is one dashboard. Model updates are one rollout. The operational surface area shrinks proportionally to the number of consolidated models.

What are the MoE economics?

119B total parameters sounds large. 6B active parameters per token is small — comparable to Llama-3.1-8B in compute cost but with access to 119B parameters of learned knowledge through the expert routing layer.

The MoE trade-off is always the same: memory for quality. You need enough GPU memory to hold the full 119B-parameter set (even though only 6B activate per token), because any expert might be needed for the next token. At FP16, the weights alone require roughly 238 GB — two to three data-center GPUs before counting KV cache. At 4-bit quantization, roughly 60 GB — potentially one GPU.
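The arithmetic behind those numbers is simple enough to sanity-check:

```python
# Back-of-envelope weight memory for a 119B-parameter model.
TOTAL_PARAMS = 119e9

def weight_gb(bytes_per_param: float) -> float:
    return TOTAL_PARAMS * bytes_per_param / 1e9

print(f"FP16 weights : {weight_gb(2.0):.0f} GB")   # ~238 GB
print(f"4-bit weights: {weight_gb(0.5):.0f} GB")   # ~60 GB
```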

The compute cost per token, however, scales with the active parameters (6B), not the total. This is why MoE models offer better quality-per-FLOP than dense models: you get the pattern recognition capacity of 119B parameters at the inference cost of 6B.
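The same back-of-envelope style works for compute, using the standard estimate of roughly 2 FLOPs per active parameter per generated token (attention cost ignored):

```python
# Decode compute scales with active parameters, not total parameters.
def gflops_per_token(active_params: float) -> float:
    return 2 * active_params / 1e9     # ~2 FLOPs per parameter per token

print(gflops_per_token(6e9))    # ~12 GFLOPs/token for Small 4's active experts
print(gflops_per_token(8e9))    # ~16 GFLOPs/token for a dense 8B model
print(gflops_per_token(119e9))  # ~238 GFLOPs/token if all 119B were dense
```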

For the KV cache implications of serving MoE models, see KV cache for MoE — the memory savings from MoE’s sparse compute do not extend to the attention cache, which remains dense.
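To see why, note that KV cache size depends on layer count, KV heads, head dimension, and sequence length, and none of those shrink because the feed-forward experts are sparse. The architecture numbers below are hypothetical, purely to show the shape of the calculation:

```python
# KV cache sizing: independent of how many experts the FFN activates.
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_el=2, batch=1):
    # 2 tensors (K and V) cached per layer, per token
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_el / 1e9

# Hypothetical config: 48 layers, 8 KV heads of dim 128, one 256K-token sequence
print(f"{kv_cache_gb(48, 8, 128, 256_000):.0f} GB per sequence at FP16")  # ~50 GB
```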

When does consolidation make sense?

Consolidate when:

  • You run 3+ model deployments for different capabilities
  • Operational complexity (monitoring, scaling, updates) is a bigger cost than inference
  • No single specialized model is dramatically better than Small 4 on your benchmarks
  • You want a single API contract for clients, regardless of task type

Keep separate when:

  • One task dominates (90%+ of traffic) and a specialized model significantly outperforms
  • Latency requirements differ dramatically between tasks (vision at 2s is fine, chat at 200ms is required)
  • You need different scaling policies for different workloads (batch vision processing vs real-time chat)

The 256K context window and Apache 2.0 license remove two common blockers. The context handles long-document tasks. The license allows unrestricted commercial deployment and fine-tuning.

Key takeaways

  • Three flagships, one successor. Reasoning, vision, coding — all in a single 119B MoE deployment with 6B active params.
  • reasoning_effort replaces model routing. One API parameter instead of a routing classifier and multiple deployments.
  • Operational surface area shrinks. One pipeline, one monitor, one scaling policy.
  • MoE trade-off applies. Roughly 238 GB of FP16 memory to buy 6B parameters of compute per token. GPU memory, not FLOPs, is the binding constraint.
  • Evaluate before migrating. Consolidated models trade peak specialization for breadth. Benchmark on your specific workloads.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch