ReasonFlux: why reasoning templates beat token-by-token chain-of-thought
TL;DR: ReasonFlux (arXiv:2502.06772, Princeton/PKU, ICML 2025) introduces thought templates: compact, metadata-rich reasoning strategies agents select and compose rather than generating reasoning token-by-token. A library of ~500 templates extracted by Gemini-2.0 from math competition data provides structured exploration of the reasoning space. Hierarchical RL optimizes template sequences, not individual tokens. On MATH: 91.2%, beating o1-preview by 6.7pp. On AIME 2024: 56.7% vs. 44.6% for o1-preview. Pre-trained models on HuggingFace; no training required.

Chain-of-thought reasoning generates thinking as a sequence of tokens. The model produces reasoning the same way it produces text: one token at a time, without any structural commitment to what kind of reasoning it’s doing. The result is verbose, inconsistent, and expensive: a model might “think” about the same type of problem in completely different ways on different runs, with no mechanism for learning from successful strategies.
ReasonFlux asks a different question: what if the agent picked reasoning strategies from a library rather than improvising them from tokens?
The template library
ReasonFlux (Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang, Princeton/PKU, ICML 2025, arXiv:2502.06772) builds a library of ~500 thought templates extracted from mathematical problem-solving data.
Each template is not a worked example. It’s a structured reasoning pattern:
```text
Template: "Algebraic Substitution for Irrational Functions"
Tags: [algebra, substitution, optimization, calculus]
Description: Transform irrational expressions by variable substitution to reduce to polynomial form
Scope: Problems involving sqrt, fractional powers, or transcendental functions in optimization
Application steps:
  1. Identify the irrational expression as the target for substitution
  2. Select substitution variable (t = sqrt(f(x)) or t = x^(1/n))
  3. Express original variables in terms of t
  4. Solve the transformed problem in t-space
  5. Back-substitute to recover x
Examples: [worked example 1], [worked example 2]
```
The library of ~500 templates was curated by Gemini-2.0 analyzing 7.5K MATH problems and 2K Chinese math competition problems. Gemini-2.0 didn’t generate templates from scratch. It extracted the underlying reasoning patterns from existing solutions, abstracting them into reusable strategies.
The metadata is the key. Tags enable efficient retrieval: given a problem, the system embeds the problem and retrieves the most relevant templates without scanning all 500. Application steps enable instantiation: the abstract template gets filled with the specific problem’s variables and constraints.
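To make that concrete, here is a minimal sketch of template storage and retrieval, assuming a plain dataclass and any off-the-shelf sentence embedder. The field names mirror the example above but are my own illustration, not the paper's API.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ThoughtTemplate:
    name: str               # "Algebraic Substitution for Irrational Functions"
    tags: list[str]         # ["algebra", "substitution", "optimization", "calculus"]
    description: str        # one-line strategy summary
    scope: str              # conditions under which the template applies
    steps: list[str]        # abstract application steps, filled in at instantiation
    embedding: np.ndarray   # precomputed from name + description + tags


def retrieve(problem_emb: np.ndarray,
             library: list[ThoughtTemplate],
             k: int = 5) -> list[ThoughtTemplate]:
    """Dense retrieval: rank all ~500 templates by cosine similarity."""
    sims = np.array([
        problem_emb @ t.embedding
        / (np.linalg.norm(problem_emb) * np.linalg.norm(t.embedding))
        for t in library
    ])
    top_k = np.argsort(sims)[::-1][:k]  # indices of the k best matches
    return [library[i] for i in top_k]
```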
How reasoning happens
At inference time, a ReasonFlux agent doesn’t generate a reasoning chain from scratch. It follows a three-phase process:
```mermaid
graph TD
    P[Problem] --> R[Retrieve relevant templates\nembedding similarity]
    R --> S[Select template sequence\nRL-optimized policy]
    S --> I[Instantiate templates\nadapt to specific problem]
    I --> T[Execute template sequence\nstep-by-step reasoning]
    T --> A[Answer]
    style R fill:#e8f4f8,stroke:#4a9eca
    style S fill:#fff3cd,stroke:#e0a800
    style I fill:#d4edda,stroke:#28a745
```
Retrieval: the problem embedding is compared against template metadata embeddings, returning the top-k most relevant candidates. Dense retrieval against ~500 templates adds negligible latency at inference time.
Selection: the RL-trained policy picks which templates to use and in which order, given the problem and the retrieved candidates. The policy generalizes from training problems to new ones by learning which sequence types work for which problem structure.
Instantiation and execution: the selected templates are filled with the problem’s specific variables, constraints, and unknowns. Those instantiated templates become the structured reasoning chain the model follows.
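A hedged sketch of phases two and three, reusing the `ThoughtTemplate` and `retrieve` from above. Here `embed`, `policy.select`, and `llm.generate` are placeholders for whatever embedder, selection policy, and model you run, and the prompt format is illustrative, not the paper's.

```python
def instantiate(template: ThoughtTemplate, problem: str) -> str:
    """Turn an abstract template into the concrete reasoning scaffold
    the model follows for this specific problem."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(template.steps, start=1))
    return (
        f"Problem: {problem}\n"
        f"Strategy: {template.name} ({template.description})\n"
        f"Apply these steps, using the problem's specific variables:\n{steps}"
    )


def solve(problem: str, policy, library: list[ThoughtTemplate], llm) -> str:
    candidates = retrieve(embed(problem), library)   # phase 1: retrieve
    sequence = policy.select(problem, candidates)    # phase 2: select + order
    context = problem
    for template in sequence:                        # phase 3: instantiate, execute
        context = llm.generate(instantiate(template, context))
    return context
```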
Why hierarchical RL on template sequences matters
Step-level RL schemes (process reward models, step-scored RLHF variants) train on individual reasoning steps: score this step, reinforce good steps. The policy optimizes locally but has no mechanism for learning that this kind of step works best when preceded by that kind of step.
ReasonFlux’s hierarchical RL trains on template trajectories: complete sequences of templates applied to a problem. The reward signal is the final answer’s correctness. The policy learns which template sequences lead to correct answers on which problem types, a genuinely long-horizon learning signal.
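As a rough illustration of what a trajectory-level signal looks like, here is a REINFORCE-style loss over a whole template sequence. The paper's actual hierarchical RL procedure is more elaborate, and `policy.log_prob` is a hypothetical interface.

```python
import torch


def trajectory_loss(policy, problem: str,
                    sequence: list[str], correct: bool) -> torch.Tensor:
    """One terminal reward (final-answer correctness) is shared by every
    template selection in the trajectory, so credit attaches to the
    sequence as a whole rather than to isolated steps."""
    reward = 1.0 if correct else -1.0
    log_probs = torch.stack([
        policy.log_prob(template, context=(problem, sequence[:i]))
        for i, template in enumerate(sequence)
    ])
    return -reward * log_probs.sum()  # REINFORCE: maximize R * log pi(trajectory)
```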
This is related to the Buffer of Thoughts approach (NeurIPS 2024 Spotlight, arXiv:2406.04271), which also proposes thought templates and a meta-buffer for storage. ReasonFlux extends it with hierarchical RL on template sequences rather than static template selection, and demonstrates significantly stronger benchmark performance.
The numbers
| Benchmark | ReasonFlux-32B | o1-preview | DeepSeek-V3 | Notes |
|---|---|---|---|---|
| MATH | 91.2% | 84.5% | ~89% | +6.7pp vs. o1-preview |
| AIME 2024 | 56.7% | 44.6% | 39.2% | +12pp vs. o1, +17pp vs. DeepSeek |
The AIME numbers are significant. AIME (American Invitational Mathematics Examination) is a hard competition of multi-step problems with integer answers; most LLMs score in the 10–20% range on it. ReasonFlux-32B at 56.7% clears a threshold that was considered out of reach without test-time compute scaling. The 12pp gap over o1-preview (44.6%) and the 17pp gap over DeepSeek-V3 (39.2%) are meaningful on a benchmark this hard.
MATH is a broader benchmark and the 6.7pp margin over o1-preview is meaningful but less dramatic; MATH is approaching saturation for top-tier models.
Comparing to ReAct, ToT, and CoT
| Approach | Mechanism | Key limitation |
|---|---|---|
| Chain-of-Thought | Token-by-token reasoning, unstructured | No strategy reuse, high verbosity |
| ReAct | Thought → Action → Observation loop | Structures tool use, not reasoning itself |
| Tree-of-Thought | BFS/DFS over multiple paths | No persistent learning; re-derives strategies per problem |
| Buffer of Thoughts | Templates + meta-buffer | Static selection without RL-optimized sequencing |
| ReasonFlux | Template library + hierarchical RL on sequences | Requires template curation; math-domain primary validation |
ReasonFlux’s limitation is important: the template library is domain-specific. The current library of ~500 templates covers mathematical reasoning, extracted from competition math data. Extending to software engineering reasoning, scientific reasoning, or legal reasoning requires domain-specific template curation, which is non-trivial. The MarkTechPost review notes this: “the framework’s performance in domains beyond mathematical reasoning requires further validation.”
For practitioners building tool-using agents where the reasoning structure is sequential steps applied to retrieved context, the template approach is more naturally applicable. For open-ended creative or strategy tasks, the template library’s coverage limitations matter more.
How to use it today
No training required for standard use. Three paths:
Pre-trained models (recommended for most):
- `Gen-Verse/ReasonFlux-F1-32B`: SOTA math reasoning
- `Gen-Verse/ReasonFlux-F1-7B`: efficient 7B variant
- `Gen-Verse/ReasonFlux-Coder-7B/4B`: code reasoning focus
Process Reward Model variants:
- `ReasonFlux-PRM-1.5B`: edge deployment
- `ReasonFlux-PRM-7B`: higher quality, guides inference-time scaling
- PRMs can be used with any compatible base model to guide beam search without fine-tuning the reasoner (sketched below)
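A minimal sketch of that last point, implemented as best-of-N reranking rather than full beam search; `generate` and `prm_score` stand in for however you run the base model and the PRM checkpoint.

```python
def best_of_n(problem: str, generate, prm_score, n: int = 8) -> str:
    """Sample n reasoning chains from an unmodified base model and keep
    the one the Process Reward Model rates highest; only inference-time
    selection changes, never the reasoner's weights."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda chain: prm_score(problem, chain))
```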
Fine-tuning on your domain:
- Start from a ReasonFlux checkpoint
- `Gen-Verse/ReasonFlux-SFT-15k` dataset on HuggingFace for supervised template training
- Domain-specific template curation is the bottleneck; template extraction quality determines the ceiling
For production agents that currently use ReAct or chain-of-thought for multi-step reasoning tasks in structured domains (finance, code, math), swapping to a ReasonFlux-F1-7B backbone is a low-effort experiment. See tool calling fundamentals for the integration patterns and tool design principles for how structured reasoning interacts with tool dispatch.
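For that experiment, loading the checkpoint is the standard HuggingFace flow. The snippet below assumes the model ships a chat template, which is worth confirming on the Gen-Verse model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Gen-Verse/ReasonFlux-F1-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Find the minimum of x + 4/x for x > 0."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```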
Key takeaways
- ReasonFlux replaces token-by-token reasoning with template selection and composition. ~500 templates cover math reasoning strategies; RL learns optimal sequences.
- MATH: 91.2% (+6.7pp vs. o1-preview). AIME 2024: 56.7% vs. 44.6% for o1-preview (+12pp) and 39.2% for DeepSeek-V3 (+17pp).
- Hierarchical RL trains on complete template trajectories, not individual steps. That’s a long-horizon learning signal that improves cross-problem generalization.
- Pre-trained models (32B, 7B, Coder variants) available on HuggingFace under Gen-Verse. No training required for standard deployment.
- The primary limitation: template library coverage is math-domain. Extension to other reasoning domains requires domain-specific template curation.
FAQ
What is ReasonFlux and what problem does it solve? ReasonFlux (arXiv:2502.06772, ICML 2025) replaces token-by-token chain-of-thought with structured thought templates: compact, metadata-rich reasoning strategies agents select and compose. Hierarchical RL learns optimal template sequences rather than individual token generation. Available as pre-trained models on HuggingFace.
What are thought templates? Structured reasoning patterns with metadata: name, domain tags, application steps, scope conditions, worked examples. An “Algebraic Substitution” template specifies exactly how to transform irrational expressions: abstract enough to apply across problems, concrete enough to instantiate for a specific one.
How does it compare to chain-of-thought and ReAct? CoT generates reasoning as unstructured tokens with no strategy reuse and high verbosity. ReAct structures tool use, not reasoning. Tree-of-Thought branches without persistent learning. ReasonFlux encodes domain strategies into templates and uses RL to learn which sequences work, combining structure with long-horizon learning.
What are the benchmark results? MATH: 91.2% (+6.7pp vs. o1-preview). AIME 2024: 56.7% vs. 44.6% for o1-preview and 39.2% for DeepSeek-V3. ICML 2025.
How to use it without training? Download pre-trained checkpoints: ReasonFlux-F1-32B (SOTA math), ReasonFlux-F1-7B (efficient), ReasonFlux-Coder-7B/4B (code). Process Reward Model variants (1.5B, 7B) guide inference-time scaling without fine-tuning. Fine-tuning data: ReasonFlux-SFT-15k on HuggingFace.
Further reading
- ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates — the paper
- Gen-Verse/ReasonFlux on GitHub — code and model cards
- Buffer of Thoughts (NeurIPS 2024) — predecessor work on thought templates
- Tool calling fundamentals — how structured reasoning integrates with tool dispatch
- Tool design principles — interaction patterns for reasoning-heavy agents
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch