
Reactive planning is betting on the next step. Anticipatory planning is mapping the whole path. TraceR1 shows that for tasks where early mistakes compound (computer-use, multi-tool orchestration, long-horizon GUI navigation) the difference in outcome is measurable, not theoretical.

TL;DR: TraceR1 trains agents to forecast multi-step trajectories before acting, using two-stage RL: first to optimize global trajectory consistency, then to refine executability with real tool feedback. Across 7 benchmarks, from AndroidWorld to GAIA, it outperforms reactive baselines by 8–40%, with the biggest gains on long-horizon computer-use tasks where early mistakes compound.



The reactive planning problem: why early mistakes compound

Most agents running today work like this: observe the current state, reason about the next action, execute it, observe the new state, repeat. This is the ReAct loop, covered in depth in post 0014. It works well when tasks are short, or when the environment is unpredictable enough that any upfront plan would become stale within two steps.
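Stripped to its skeleton, the reactive loop looks like this (a minimal sketch; env and model are hypothetical stand-ins for your environment and LLM interface):

def react_loop(env, model, max_steps: int = 20):
    """Reactive control: decide and execute one action at a time."""
    obs = env.observe()
    for _ in range(max_steps):
        action = model.next_action(obs)  # reasons about this step only
        obs = env.execute(action)        # no forecast of steps 2..N exists
        if env.done():
            break
    return obs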

The problem shows up on longer tasks. Consider a computer-use agent navigating a GUI to file an expense report. The agent has to open the right app, locate the correct form, fill fields in order, and submit. Each action eliminates options. Clicking into the wrong submenu may not cause an outright error — the agent might not realize the mistake until three steps later, when the context has drifted and recovery requires backtracking through UI state that’s no longer visible.

The core failure mode isn’t poor reasoning at any single step; it’s structural. Reactive agents have no representation of where the action sequence is going. They optimize for the current decision without a model of how that decision shapes the ones that follow. On AndroidWorld, a real Android task benchmark, reactive baselines achieve around 35–40% task completion. That’s roughly what compounding predicts: even 90% per-step accuracy yields only about 35% success over ten independent steps, and a couple of early wrong turns produce a failure rate that no amount of per-step polish can fix.
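A back-of-the-envelope check makes the compounding concrete (illustrative only: it assumes independent per-step success, which real GUI steps are not):

# Success probability of a 10-step task at various per-step accuracies,
# assuming (unrealistically) that steps succeed independently.
for p in (0.85, 0.90, 0.95):
    print(f"per-step {p:.2f} -> 10-step success {p ** 10:.2f}")
# per-step 0.85 -> 10-step success 0.20
# per-step 0.90 -> 10-step success 0.35
# per-step 0.95 -> 10-step success 0.60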

MCTS attacks the same problem, but at inference time: it searches branching futures per decision (the subject of a separate post currently in draft). TraceR1 asks a different question: can you train a model to carry trajectory-level reasoning as an intrinsic capability, rather than invoking it as an external search procedure on every call?


What TraceR1 does: two-stage RL for trajectory forecasting

TraceR1 (arXiv:2603.16777) was published March 17, 2026 by researchers at Adobe Research together with the University of Maryland, The Ohio State University, and SUNY Buffalo. The core idea: before the agent takes a single action, it generates a forecast of the full action sequence it expects to follow.

The framework is trained in two stages, each with a distinct reward signal.

Stage 1 is trajectory-level RL. The model learns to predict multi-step trajectories that match reference sequences. The reward combines an alignment score measuring the similarity between predicted and reference actions, sim(â_t, a_t*), with a repetition penalty for redundant steps. A temporal discount factor γ weights near-term predictions more heavily than distant ones, since uncertainty grows with forecast horizon. After Stage 1, the model has learned to think several steps ahead as part of its normal reasoning process.

Stage 2 is grounded reinforcement fine-tuning. Stage 1 produces a model good at trajectory consistency but not necessarily grounded in what tools can actually execute. Stage 2 corrects this using binary rewards: coordinate matching for GUI actions, answer matching for tool calls, computed against execution feedback from frozen tool agents. The training objective is GRPO (group-relative policy optimization), normalizing rewards within a group of candidate trajectories. Ablations show removing Stage 2 causes an average 6% performance drop. Stage 1 builds the planning capacity; Stage 2 anchors it to physical executability.
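In code, those binary checks are simple. A sketch under assumptions: the pixel tolerance tol_px and exact string matching are my stand-ins, not criteria from the paper:

def gui_action_reward(pred_xy: tuple[int, int], ref_xy: tuple[int, int],
                      tol_px: int = 5) -> float:
    """1.0 if the predicted GUI coordinate lands within a pixel tolerance
    of the reference, else 0.0. tol_px is an assumed knob."""
    dx, dy = abs(pred_xy[0] - ref_xy[0]), abs(pred_xy[1] - ref_xy[1])
    return 1.0 if dx <= tol_px and dy <= tol_px else 0.0

def tool_answer_reward(pred: str, ref: str) -> float:
    """1.0 on exact answer match (after whitespace/case normalization)."""
    return 1.0 if pred.strip().lower() == ref.strip().lower() else 0.0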

At inference, TraceR1 predicts the full trajectory τ_{t:T}, executes only the first action a_t, then re-plans from the new state. Plan ahead, act one step, replan. The trajectory horizon that performs best in ablations is T=5–10 steps; beyond T=10 the model is forecasting too far into uncertain state space and performance degrades.

Across the 7 benchmarks: AndroidWorld (64.8%), OSWorld-Verified (41.2%, a 15.7% relative improvement over the Qwen3-VL baseline), AndroidControl-High (75.3%, more than 40% above prior RL-trained GUI agents like GUI-R1 and InfiGUI-R1), GUI-Odyssey (88.2%), Multimodal-Mind2Web (65.3%), GAIA (40.2%, +8.7 points over baseline), and GTA (56.7% answer accuracy, 65.7% tool accuracy).

The AndroidControl-High number deserves attention. A 40%+ margin over other RL-trained agents isn’t a marginal improvement from better training recipes. It suggests trajectory-level rewards are capturing something that step-level rewards miss entirely.


How it compares to ReAct, MCTS, and chain-of-thought planning

These three approaches are already in production at various levels of adoption. Where does TraceR1 fit?

ReAct interleaves reasoning traces with actions in a step-by-step loop. The agent sees an observation, produces a thought, produces an action, receives the next observation. There’s no trajectory representation — the “plan” is implicit in each reasoning trace and doesn’t extend beyond the current step. ReAct works fine when the environment is unpredictable and a fixed plan would become obsolete quickly. It breaks down on tasks requiring coherent multi-step strategy, because each step is optimized without awareness of downstream constraints.

Chain-of-thought planning adds an upfront decomposition step before acting. Better than pure ReAct for structured tasks, but the plan is generated by the model’s forward pass at inference time and never reinforced through execution feedback. There’s no training signal rewarding trajectories that cohere globally over ones that look reasonable step by step. CoT planning is static; TraceR1’s trajectory forecasting is trained against actual execution outcomes.

MCTS (covered in a separate post, still in draft) searches a tree of possible futures at inference time, branching on possible actions, evaluating outcomes, backtracking. Expensive per call: O(branching_factor × depth × node_evaluations) at inference. It works when you have a good value function and can afford the search budget. TraceR1 amortizes the trajectory reasoning into model weights at training time. Single forward pass at inference, then execute. TraceR1 is expensive to train and cheap to run; MCTS is cheap to train and expensive to run. They’re complementary rather than competing — you could run MCTS at inference on a TraceR1-trained base model.

One thing worth saying plainly: the comparison to CoT planning is the most important one for practitioners. Most teams adding “planning” to their agents are doing upfront CoT decomposition. TraceR1’s contribution is showing that static inference-time planning is weaker than planning trained via RL with execution feedback. The training is harder to set up, but the benchmark gap is real.


Adding anticipatory planning to an existing agent: an implementation sketch

TraceR1’s two-stage approach is a training pipeline, not a drop-in wrapper. The design principles still translate into practical patterns for engineers working with existing agents.

The two diagrams below show what changes when you switch from reactive to anticipatory:

Reactive (ReAct)
─────────────────────────────────────────────────
Obs_1 → Reason → Act_1 → Obs_2 → Reason → Act_2
         [next]            [next]

Anticipatory (TraceR1)
─────────────────────────────────────────────────
Obs_1 → Forecast [Act_1, Act_2, Act_3, ..., Act_T]
         ↓
       Execute Act_1 → Obs_2 → Reforecast [Act_2', ..., Act_T']
         ↓
       Execute Act_2' → ...

TraceR1 two-stage training process:

flowchart TD
    A[Base VLM\ne.g. Qwen3-VL] --> B[Stage 1: Trajectory RL]
    B --> C{Reward}
    C -->|Alignment score\nsim â_t, a_t*| D[Trajectory consistency]
    C -->|Repetition penalty\nλ_rep| D
    C -->|Temporal discount\nγ ∈ 0,1| D
    D --> E[Stage 1 Model\nglobal trajectory planner]
    E --> F[Stage 2: Grounded Fine-tuning]
    F --> G{Execution feedback\nfrom frozen tools}
    G -->|GUI: coordinate match| H[Binary reward]
    G -->|Tool: answer match| H
    H --> I[GRPO optimization\ngroup-relative advantage]
    I --> J[TraceR1\nAnticipatory agent]
    J --> K[Inference: forecast τ_t:T\nexecute a_t, replan]

If you’re not running a training pipeline, the trajectory-forecasting concept maps to a prompt-level pattern: require the agent to output a predicted action sequence before executing any step, then evaluate the sequence for consistency before allowing execution. This is weaker than TraceR1’s trained capacity (no execution feedback in the reward signal) but approximates anticipatory reasoning at inference time with no retraining cost:

SYSTEM_PROMPT = """
Before taking any action, first output a trajectory forecast:
<trajectory>
Step 1: [action]
Step 2: [action]
...
Step N: [action]
</trajectory>

Then execute only Step 1. After observing the result, update the forecast and execute Step 2.

Trajectory horizon: 5–10 steps. If you cannot forecast more than 2 steps ahead with confidence, state this explicitly.
"""

For teams running RL fine-tuning on their own agents, the Stage 1 reward structure is the key contribution to replicate: trajectory alignment reward + repetition penalty + temporal discount. The discount factor γ is important — without it, the model gets equal credit for correct near-term and distant predictions, which destabilizes training when distant steps are uncertain.

from typing import Any

Action = Any  # placeholder type: a GUI action, tool call, etc.; define per task

def trajectory_reward(predicted: list[Action], reference: list[Action], gamma: float = 0.9,
                      lambda_align: float = 1.0, lambda_rep: float = 0.5) -> float:
    """Stage 1 reward: discounted trajectory alignment minus a repetition penalty.
    `similarity` is assumed to be a task-specific function returning a score in [0, 1]."""
    # Temporal discount: near-term steps earn more credit than distant, uncertain ones.
    align_score = sum(
        (gamma ** t) * similarity(predicted[t], reference[t])
        for t in range(min(len(predicted), len(reference)))
    )
    rep_penalty = repetition_penalty(predicted)
    return lambda_align * align_score - lambda_rep * rep_penalty

def repetition_penalty(actions: list[Action]) -> float:
    """Penalize semantically redundant consecutive actions."""
    if len(actions) < 2:
        return 0.0
    penalties = [
        similarity(actions[i], actions[i-1])
        for i in range(1, len(actions))
    ]
    return sum(penalties) / len(penalties)
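Toy usage, with string actions and exact-match similarity standing in for a real task-specific similarity function (the values in the comments are just worked arithmetic):

def similarity(a: str, b: str) -> float:  # toy stand-in, not from the paper
    return 1.0 if a == b else 0.0

predicted = ["open_app", "tap_form", "tap_form"]   # note the repeated step
reference = ["open_app", "tap_form", "fill_field"]
print(trajectory_reward(predicted, reference))
# alignment = 1.0 + 0.9*1.0 + 0.81*0.0 = 1.9; repetition penalty = 0.5
# reward = 1.0*1.9 - 0.5*0.5 = 1.65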

Stage 2 ground-truth rewards are task-specific: for GUI tasks, check pixel coordinates against reference; for tool-calling tasks, compare final answers. The GRPO objective normalizes rewards within a candidate group — run multiple rollouts, normalize by group mean and std, train toward the better ones.
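A minimal sketch of that group-relative step (just the advantage computation, not the full GRPO policy-gradient update with its clipping and KL terms):

import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Center and scale each rollout's reward by the group mean and std,
    so updates push probability mass toward above-average rollouts."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Eight rollouts of one task with binary Stage 2 rewards:
print(grpo_advantages([1, 0, 0, 1, 1, 0, 0, 0]))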

The horizon ablation finding (T=5–10) translates directly: don’t ask agents to plan 20 steps ahead. The forecast degrades into noise, and training on long-horizon predictions that are structurally unpredictable teaches the model nothing useful.


Frequently asked questions

What is TraceR1? A two-stage RL framework from Adobe Research (arXiv:2603.16777) that trains agents to forecast multi-step action trajectories before executing the first action. Evaluated on 7 benchmarks spanning computer-use and multimodal tool reasoning tasks.

How does TraceR1 differ from ReAct? ReAct interleaves reasoning and action one step at a time, with no forward trajectory plan. TraceR1 forecasts the full expected sequence before moving. The gap matters most on tasks where early decisions constrain later options and mistakes don’t surface until several steps downstream.

How does TraceR1 differ from MCTS? MCTS explores a branching tree of future states at inference time. TraceR1 bakes trajectory reasoning into the model weights through RL training. MCTS is expensive at inference; TraceR1 pays that cost at training time and amortizes it across all future runs. They’re complementary rather than competing. You could run MCTS at inference on a TraceR1-trained base model.

What benchmarks did TraceR1 beat? With Qwen3-VL-32B: 64.8% on AndroidWorld, 41.2% on OSWorld-Verified (+15.7% relative over baseline), 75.3% on AndroidControl-High (40%+ over prior RL-trained GUI agents), 88.2% on GUI-Odyssey, 65.3% on Multimodal-Mind2Web, 40.2% on GAIA (+8.7 points), and 56.7% on GTA. See arXiv:2603.16777 for full tables.

What trajectory horizon should I use? Ablations show performance peaks at T=5–10 steps ahead. Beyond T=10 the model is forecasting into uncertain state space and performance drops. For most computer-use tasks, start with T=5.


TraceR1 paper: arXiv:2603.16777. Related: ReAct pattern deep dive · Planning and decomposition · Hierarchical planning

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch