ReasonFlux: why reasoning templates beat token-by-token chain-of-thought
TL;DR: ReasonFlux (arXiv:2502.06772, Princeton/PKU, ICML 2025) introduces thought templates: compact, metadata-rich reasoning strategies agents select and compose rather than generating reasoning token-by-token. A library of ~500 templates extracted by Gemini-2.0 from math competition data provides structured exploration of the reasoning space. Hierarchical RL optimizes template sequences, not individual tokens. On MATH: 91.2%, beating o1-preview by 6.7pp. On AIME 2024: 56.7% vs. 44.6% for o1-preview. Pre-trained models on HuggingFace; no training required.

Chain-of-thought reasoning generates thinking as a sequence of tokens. The model produces reasoning the same way it produces text: one token at a time, without any structural commitment to what kind of reasoning it’s doing. The result is verbose, inconsistent, and expensive: a model might “think” about the same type of problem in completely different ways on different runs, with no mechanism for learning from successful strategies.
ReasonFlux asks a different question: what if the agent picked reasoning strategies from a library rather than improvising them from tokens?
The template library
ReasonFlux (Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang, Princeton/PKU, ICML 2025, arXiv:2502.06772) builds a library of ~500 thought templates extracted from mathematical problem-solving data.
Each template is not a worked example. It’s a structured reasoning pattern:
```text
Template: "Algebraic Substitution for Irrational Functions"
Tags: [algebra, substitution, optimization, calculus]
Description: Transform irrational expressions by variable substitution to reduce to polynomial form
Scope: Problems involving sqrt, fractional powers, or transcendental functions in optimization
Application steps:
  1. Identify the irrational expression as the target for substitution
  2. Select substitution variable (t = sqrt(f(x)) or t = x^(1/n))
  3. Express original variables in terms of t
  4. Solve the transformed problem in t-space
  5. Back-substitute to recover x
Examples: [worked example 1], [worked example 2]
```
The library of ~500 templates was curated by Gemini-2.0 analyzing 7.5K MATH problems and 2K Chinese math competition problems. Gemini-2.0 didn’t generate templates from scratch. It extracted the underlying reasoning patterns from existing solutions, abstracting them into reusable strategies.
The metadata is the key. Tags enable efficient retrieval: given a problem, the system embeds the problem and retrieves the most relevant templates without scanning all 500. Application steps enable instantiation: the abstract template gets filled with the specific problem’s variables and constraints.
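To make that concrete, here is a minimal sketch of template storage and retrieval, assuming a plain dataclass and any off-the-shelf sentence embedder. The field names mirror the example above but are my own illustration, not the paper's API.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ThoughtTemplate:
    name: str               # "Algebraic Substitution for Irrational Functions"
    tags: list[str]         # ["algebra", "substitution", "optimization", "calculus"]
    description: str        # one-line strategy summary
    scope: str              # conditions under which the template applies
    steps: list[str]        # abstract application steps, filled in at instantiation
    embedding: np.ndarray   # precomputed from name + description + tags


def retrieve(problem_emb: np.ndarray,
             library: list[ThoughtTemplate],
             k: int = 5) -> list[ThoughtTemplate]:
    """Dense retrieval: rank all ~500 templates by cosine similarity."""
    sims = np.array([
        problem_emb @ t.embedding
        / (np.linalg.norm(problem_emb) * np.linalg.norm(t.embedding))
        for t in library
    ])
    top_k = np.argsort(sims)[::-1][:k]  # indices of the k best matches
    return [library[i] for i in top_k]
```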
How reasoning happens
At inference time, a ReasonFlux agent doesn’t generate a reasoning chain from scratch. It follows a three-phase process:
```mermaid
graph TD
    P[Problem] --> R[Retrieve relevant templates\nembedding similarity]
    R --> S[Select template sequence\nRL-optimized policy]
    S --> I[Instantiate templates\nadapt to specific problem]
    I --> T[Execute template sequence\nstep-by-step reasoning]
    T --> A[Answer]
    style R fill:#e8f4f8,stroke:#4a9eca
    style S fill:#fff3cd,stroke:#e0a800
    style I fill:#d4edda,stroke:#28a745
```
Retrieval: the problem embedding is compared against template metadata embeddings, returning the top-k most relevant candidates. Dense retrieval against ~500 templates adds negligible latency at inference time.
Selection: the RL-trained policy picks which templates to use and in which order, given the problem and the retrieved candidates. The policy generalizes from training problems to new ones by learning which sequence types work for which problem structure.
Instantiation and execution: the selected templates are filled with the problem’s specific variables, constraints, and unknowns. Those instantiated templates become the structured reasoning chain the model follows.
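A hedged sketch of phases two and three, reusing the `ThoughtTemplate` and `retrieve` from above. Here `embed`, `policy.select`, and `llm.generate` are placeholders for whatever embedder, selection policy, and model you run, and the prompt format is illustrative, not the paper's.

```python
def instantiate(template: ThoughtTemplate, problem: str) -> str:
    """Turn an abstract template into the concrete reasoning scaffold
    the model follows for this specific problem."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(template.steps, start=1))
    return (
        f"Problem: {problem}\n"
        f"Strategy: {template.name} ({template.description})\n"
        f"Apply these steps, using the problem's specific variables:\n{steps}"
    )


def solve(problem: str, policy, library: list[ThoughtTemplate], llm) -> str:
    candidates = retrieve(embed(problem), library)   # phase 1: retrieve
    sequence = policy.select(problem, candidates)    # phase 2: select + order
    context = problem
    for template in sequence:                        # phase 3: instantiate, execute
        context = llm.generate(instantiate(template, context))
    return context
```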
Why hierarchical RL on template sequences matters
Step-level RL schemes (process reward models, step-scored RLHF variants) train on individual reasoning steps: score this step, reinforce good steps. The policy optimizes locally but has no mechanism for learning that this kind of step works best when preceded by that kind of step.
ReasonFlux’s hierarchical RL trains on template trajectories: complete sequences of templates applied to a problem. The reward signal is the final answer’s correctness. The policy learns which template sequences lead to correct answers on which problem types, a genuinely long-horizon learning signal.
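As a rough illustration of what a trajectory-level signal looks like, here is a REINFORCE-style loss over a whole template sequence. The paper's actual hierarchical RL procedure is more elaborate, and `policy.log_prob` is a hypothetical interface.

```python
import torch


def trajectory_loss(policy, problem: str,
                    sequence: list[str], correct: bool) -> torch.Tensor:
    """One terminal reward (final-answer correctness) is shared by every
    template selection in the trajectory, so credit attaches to the
    sequence as a whole rather than to isolated steps."""
    reward = 1.0 if correct else -1.0
    log_probs = torch.stack([
        policy.log_prob(template, context=(problem, sequence[:i]))
        for i, template in enumerate(sequence)
    ])
    return -reward * log_probs.sum()  # REINFORCE: maximize R * log pi(trajectory)
```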
This is related to the Buffer of Thoughts approach (NeurIPS 2024 Spotlight, arXiv:2406.04271), which also proposes thought templates and a meta-buffer for storage. ReasonFlux extends it with hierarchical RL on template sequences rather than static template selection, and demonstrates significantly stronger benchmark performance.
The numbers
| Benchmark | ReasonFlux-32B | o1-preview | DeepSeek-V3 | Notes |
|---|---|---|---|---|
| MATH | 91.2% | 84.5% | ~89% | +6.7pp vs. o1-preview |
| AIME 2024 | 56.7% | 44.6% | 39.2% | +12pp vs. o1, +17pp vs. DeepSeek |
The AIME numbers are significant. AIME (American Invitational Mathematics Examination) is a hard competition of multi-step problems with integer answers; most LLMs score in the 10–20% range on it. ReasonFlux-32B at 56.7% clears a threshold that was considered out of reach without test-time compute scaling. The 12pp gap over o1-preview (44.6%) and the 17pp gap over DeepSeek-V3 (39.2%) are meaningful on a benchmark this hard.
MATH is a broader benchmark and the 6.7pp margin over o1-preview is meaningful but less dramatic; MATH is approaching saturation for top-tier models.
Comparing to ReAct, ToT, and CoT
| Approach | Mechanism | Key limitation |
|---|---|---|
| Chain-of-Thought | Token-by-token reasoning, unstructured | No strategy reuse, high verbosity |
| ReAct | Thought → Action → Observation loop | Structures tool use, not reasoning itself |
| Tree-of-Thought | BFS/DFS over multiple paths | No persistent learning; re-derives strategies per problem |
| Buffer of Thoughts | Templates + meta-buffer | Static selection without RL-optimized sequencing |
| ReasonFlux | Template library + hierarchical RL on sequences | Requires template curation; math-domain primary validation |
ReasonFlux’s limitation is important: the template library is domain-specific. The current library of ~500 templates covers mathematical reasoning, extracted from competition math data. Extending to software engineering reasoning, scientific reasoning, or legal reasoning requires domain-specific template curation, which is non-trivial. The MarkTechPost review notes this: “the framework’s performance in domains beyond mathematical reasoning requires further validation.”
For practitioners building tool-using agents where the reasoning structure is sequential steps applied to retrieved context, the template approach is more naturally applicable. For open-ended creative or strategy tasks, the template library’s coverage limitations matter more.
How to use it today
No training required for standard use. Three paths:
Pre-trained models (recommended for most):
- `Gen-Verse/ReasonFlux-F1-32B`: SOTA math reasoning
- `Gen-Verse/ReasonFlux-F1-7B`: efficient 7B variant
- `Gen-Verse/ReasonFlux-Coder-7B/4B`: code reasoning focus
Process Reward Model variants:
- `ReasonFlux-PRM-1.5B`: edge deployment
- `ReasonFlux-PRM-7B`: higher quality, guides inference-time scaling
- PRMs can be used with any compatible base model to guide beam search without fine-tuning the reasoner (sketched below)
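A minimal sketch of that last point, implemented as best-of-N reranking rather than full beam search; `generate` and `prm_score` stand in for however you run the base model and the PRM checkpoint.

```python
def best_of_n(problem: str, generate, prm_score, n: int = 8) -> str:
    """Sample n reasoning chains from an unmodified base model and keep
    the one the Process Reward Model rates highest; only inference-time
    selection changes, never the reasoner's weights."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda chain: prm_score(problem, chain))
```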
Fine-tuning on your domain:
- Start from a ReasonFlux checkpoint
- `Gen-Verse/ReasonFlux-SFT-15k` dataset on HuggingFace for supervised template training
- Domain-specific template curation is the bottleneck; template extraction quality determines the ceiling
For production agents that currently use ReAct or chain-of-thought for multi-step reasoning tasks in structured domains (finance, code, math), swapping to a ReasonFlux-F1-7B backbone is a low-effort experiment. See tool calling fundamentals for the integration patterns and tool design principles for how structured reasoning interacts with tool dispatch.
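For that experiment, loading the checkpoint is the standard HuggingFace flow. The snippet below assumes the model ships a chat template, which is worth confirming on the Gen-Verse model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Gen-Verse/ReasonFlux-F1-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Find the minimum of x + 4/x for x > 0."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```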
Key takeaways
- ReasonFlux replaces token-by-token reasoning with template selection and composition. ~500 templates cover math reasoning strategies; RL learns optimal sequences.
- MATH: 91.2% (+6.7pp vs. o1-preview). AIME 2024: 56.7% vs. 44.6% for o1-preview (+12pp) and 39.2% for DeepSeek-V3 (+17pp).
- Hierarchical RL trains on complete template trajectories, not individual steps. That’s a long-horizon learning signal that improves cross-problem generalization.
- Pre-trained models (32B, 7B, Coder variants) available on HuggingFace under Gen-Verse. No training required for standard deployment.
- The primary limitation: template library coverage is math-domain. Extension to other reasoning domains requires domain-specific template curation.
FAQ
What is ReasonFlux and what problem does it solve? ReasonFlux (arXiv:2502.06772, ICML 2025) replaces token-by-token chain-of-thought with structured thought templates: compact, metadata-rich reasoning strategies agents select and compose. Hierarchical RL learns optimal template sequences rather than individual token generation. Available as pre-trained models on HuggingFace.
What are thought templates? Structured reasoning patterns with metadata: name, domain tags, application steps, scope conditions, worked examples. An “Algebraic Substitution” template specifies exactly how to transform irrational expressions: abstract enough to apply across problems, concrete enough to instantiate for a specific one.
How does it compare to chain-of-thought and ReAct? CoT generates reasoning as unstructured tokens with no strategy reuse and high verbosity. ReAct structures tool use, not reasoning. Tree-of-Thought branches without persistent learning. ReasonFlux encodes domain strategies into templates and uses RL to learn which sequences work, combining structure with long-horizon learning.
What are the benchmark results? MATH: 91.2% (+6.7pp vs. o1-preview). AIME 2024: 56.7% vs. 44.6% for o1-preview and 39.2% for DeepSeek-V3. ICML 2025.
How to use it without training? Download pre-trained checkpoints: ReasonFlux-F1-32B (SOTA math), ReasonFlux-F1-7B (efficient), ReasonFlux-Coder-7B/4B (code). Process Reward Model variants (1.5B, 7B) guide inference-time scaling without fine-tuning. Fine-tuning data: ReasonFlux-SFT-15k on HuggingFace.
Further reading
- ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates — the paper
- Gen-Verse/ReasonFlux on GitHub — code and model cards
- Buffer of Thoughts (NeurIPS 2024) — predecessor work on thought templates
- Tool calling fundamentals — how structured reasoning integrates with tool dispatch
- Tool design principles — interaction patterns for reasoning-heavy agents
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch