
“Four models, four deployments, four scaling policies, four monitoring dashboards. Or: one model with a dial.”

TL;DR

Mistral Small 4 (March 16, 2026) merges four Mistral products into one 119B MoE model with 6B active params. The reasoning_effort parameter replaces model routing with a dial. 256K context, Apache 2.0, vLLM-compatible. The ML systems story: consolidation from many specialized deployments to one.

[Image: a single server module with three distinct cable types (fiber optic, standard ethernet, and coaxial) converging from different directions.]

What did Mistral consolidate?

Three specialized flagships, each previously requiring its own deployment alongside the general-purpose Mistral Small:

| Previous model | Capability | Now in Small 4 |
|---|---|---|
| Magistral | Mathematical and logical reasoning | reasoning_effort: high |
| Pixtral | Vision and multimodal understanding | Image input support |
| Devstral | Agentic coding, tool use | Code + function calling |

Each had its own deployment target, its own GPU allocation, its own scaling policy, and its own monitoring. An organization using all three, plus Mistral Small for general chat, ran four separate inference services. Mistral Small 4 is the successor to Mistral Small that absorbs all three specialized capabilities.

Mistral Small 4 is one model that handles all four capabilities through the MoE routing layer. Different tasks activate different expert combinations — reasoning tasks engage reasoning-specialized experts, vision tasks engage vision-specialized experts — but the model binary, the serving infrastructure, and the API endpoint are shared.
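A minimal sketch of the top-k routing mechanism that makes this possible: a small gating network scores the experts for each token, and only the top few experts actually run. Everything here (expert count, dimensions, top-k of 2) is illustrative rather than Mistral Small 4's actual architecture.

```python
# Illustrative token-level top-k MoE routing (PyTorch). Not Mistral's code;
# dimensions and expert counts are made up for the sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=16, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # the router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.gate(x)                      # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # only k experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoELayer()
y = layer(torch.randn(4, 512))                     # 4 tokens through the MoE layer
```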

What does reasoning_effort change about serving?

The conventional approach to serving models of varying capability is model routing: a classifier examines incoming requests and routes simple ones to a cheap model and complex ones to an expensive model. This works, but it means maintaining multiple deployments plus the router itself, which must be fast, accurate, and maintained in its own right.
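For contrast, here is a rough sketch of that routing pattern. The endpoint names and the heuristic are hypothetical stand-ins for real deployments and a trained classifier.

```python
# Hypothetical model-routing setup: a classifier-ish heuristic picks which
# of several deployments serves each request. This is the machinery that a
# single model with a reasoning dial replaces.
CHEAP_ENDPOINT = "http://small-chat.internal:8000/v1"      # extraction, formatting
REASONING_ENDPOINT = "http://reasoning.internal:8000/v1"   # math, multi-step logic

def route(request_text: str) -> str:
    """Stand-in for a trained router; real ones are models in their own right."""
    looks_complex = len(request_text) > 2000 or "prove" in request_text.lower()
    return REASONING_ENDPOINT if looks_complex else CHEAP_ENDPOINT
```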

Mistral’s reasoning_effort parameter flips this. Instead of choosing which model handles a request, you choose how much reasoning a single model applies.

  • reasoning_effort: none — fast responses for extraction, formatting, simple classification. Equivalent to Mistral Small 3.2 behavior. The model generates directly with minimal deliberation.
  • reasoning_effort: high — extended reasoning for math, multi-step logic, and complex analysis. The model spends more compute per token on internal planning.

Two positions, not a gradient. You either want reasoning or you do not. This simplicity is a feature — no continuous dial to tune, no wondering whether “medium” is right.
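As a concrete illustration, here is what flipping the dial per request could look like against an OpenAI-compatible endpoint such as vLLM. The served model name and the exact placement of reasoning_effort are assumptions on my part; check your server's docs.

```python
# Sketch: one endpoint, one model, per-request reasoning effort.
# Assumes an OpenAI-compatible server (e.g. vLLM); parameter placement may vary.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def ask(prompt: str, effort: str = "none") -> str:
    resp = client.chat.completions.create(
        model="mistral-small-4",                    # whatever name the server registers
        messages=[{"role": "user", "content": prompt}],
        extra_body={"reasoning_effort": effort},    # "none" or "high"
    )
    return resp.choices[0].message.content

ask("Extract the total from this invoice: ...", effort="none")       # fast path
ask("Is this migration plan logically consistent?", effort="high")   # extended reasoning
```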

The infrastructure implication: one deployment serves all tiers. Scaling is one autoscaling policy, not three. Monitoring is one dashboard. Model updates are one rollout. The operational surface area shrinks proportionally to the number of consolidated models.

What are the MoE economics?

119B total parameters sounds large. 6B active parameters per token is small — comparable to Llama-3.1-8B in compute cost but with access to 119B parameters of learned knowledge through the expert routing layer.

The MoE trade-off is always the same: memory for quality. You need enough GPU memory to hold the full 119B-parameter set (even though only 6B activate per token), because any expert might be needed for the next token. At FP16, the weights alone require roughly 238 GB — two to three data-center GPUs before counting KV cache. At 4-bit quantization, roughly 60 GB — potentially one GPU.
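The arithmetic behind those numbers is simple enough to sanity-check:

```python
# Back-of-envelope weight memory for a 119B-parameter model.
TOTAL_PARAMS = 119e9

def weight_gb(bytes_per_param: float) -> float:
    return TOTAL_PARAMS * bytes_per_param / 1e9

print(f"FP16 weights : {weight_gb(2.0):.0f} GB")   # ~238 GB
print(f"4-bit weights: {weight_gb(0.5):.0f} GB")   # ~60 GB
```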

The compute cost per token, however, scales with the active parameters (6B), not the total. This is why MoE models offer better quality-per-FLOP than dense models: you get the pattern recognition capacity of 119B parameters at the inference cost of 6B.
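The same back-of-envelope style works for compute, using the standard estimate of roughly 2 FLOPs per active parameter per generated token (attention cost ignored):

```python
# Decode compute scales with active parameters, not total parameters.
def gflops_per_token(active_params: float) -> float:
    return 2 * active_params / 1e9     # ~2 FLOPs per parameter per token

print(gflops_per_token(6e9))    # ~12 GFLOPs/token for Small 4's active experts
print(gflops_per_token(8e9))    # ~16 GFLOPs/token for a dense 8B model
print(gflops_per_token(119e9))  # ~238 GFLOPs/token if all 119B were dense
```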

For the KV cache implications of serving MoE models, see KV cache for MoE — the memory savings from MoE’s sparse compute do not extend to the attention cache, which remains dense.
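To see why, note that KV cache size depends on layer count, KV heads, head dimension, and sequence length, and none of those shrink because the feed-forward experts are sparse. The architecture numbers below are hypothetical, purely to show the shape of the calculation:

```python
# KV cache sizing: independent of how many experts the FFN activates.
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_el=2, batch=1):
    # 2 tensors (K and V) cached per layer, per token
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_el / 1e9

# Hypothetical config: 48 layers, 8 KV heads of dim 128, one 256K-token sequence
print(f"{kv_cache_gb(48, 8, 128, 256_000):.0f} GB per sequence at FP16")  # ~50 GB
```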

When does consolidation make sense?

Consolidate when:

  • You run 3+ model deployments for different capabilities
  • Operational complexity (monitoring, scaling, updates) is a bigger cost than inference
  • No single specialized model is dramatically better than Small 4 on your benchmarks
  • You want a single API contract for clients, regardless of task type

Keep separate when:

  • One task dominates (90%+ of traffic) and a specialized model significantly outperforms
  • Latency requirements differ dramatically between tasks (vision at 2s is fine, chat at 200ms is required)
  • You need different scaling policies for different workloads (batch vision processing vs real-time chat)

The 256K context window and Apache 2.0 license remove two common blockers. The context handles long-document tasks. The license allows unrestricted commercial deployment and fine-tuning.

Key takeaways

  • Three flagships, one successor. Reasoning, vision, coding — all in a single 119B MoE deployment with 6B active params.
  • reasoning_effort replaces model routing. One API parameter instead of a routing classifier and multiple deployments.
  • Operational surface area shrinks. One pipeline, one monitor, one scaling policy.
  • MoE trade-off applies. Roughly 238 GB of FP16 memory to buy 6B parameters of compute per token. GPU memory, not FLOPs, is the binding constraint.
  • Evaluate before migrating. Consolidated models trade peak specialization for breadth. Benchmark on your specific workloads.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch