
“Remove the images from your multimodal reasoning chain. If accuracy drops less than 5%, your agent is not actually looking.”

TL;DR

Chain-of-thought works for text. It struggles with images. VisBrowse-Bench — 169 visual QA instances for browsing agents — shows Claude-4.6-Opus at 47.6% and o3-deep-research at 41.1%. The “CoT mirage” paper finds CoT reflects memorized patterns, not abstract reasoning, and this limitation amplifies in visual domains. Research shows only a ~2% accuracy drop when images are removed from long chains — models ignore visual evidence and rely on text. For the latent-space reasoning approach that bypasses visible CoT entirely, see when LLMs stop talking to themselves.

Figure: a Pepper's ghost theatrical illusion viewed from backstage, revealing the angled glass and hidden source that project a phantom image above the stage.

What does VisBrowse-Bench actually measure?

Most agent benchmarks test text-based tasks: answer a question, write code, search the web. VisBrowse-Bench tests whether browsing agents can reason about what they see — images on web pages, visual layouts, screenshots — not just what they read.

The benchmark contains 169 VQA instances constructed by human experts. Each requires the agent to actively collect visual information during web search, cross-validate evidence across text and images, and reason jointly over both modalities.

Results from the paper (arXiv 2603.16289):

| Model | Accuracy |
|---|---|
| Claude-4.6-Opus | 47.6% |
| o3-deep-research | 41.1% |

Less than half. On a benchmark designed to be solvable by a human who can browse the web and look at images. The gap between text-domain reasoning (where these models routinely score 80-95%) and visual-native reasoning is enormous.

This is not a story about these models being bad. It is a data point about what happens when reasoning leaves the text domain. The same models that dominate text benchmarks struggle when the evidence is primarily visual.

Is chain-of-thought real reasoning or sophisticated mimicry?

A research team built DataAlchemy — an isolated environment to train LLMs from scratch and test reasoning under controlled distribution conditions (arXiv 2508.01191). Their finding: sophisticated CoT prompting does not endow models with abstract, generalizable inference. Models succeed on familiar problem patterns. They fail when identical logical structures appear in unfamiliar contexts.

The conclusion, stated plainly: LLMs are “sophisticated simulators of reasoning-like text” rather than principled reasoners. CoT reflects structured inductive bias learned from in-distribution training data.

This sparked considerable debate on Hacker News and Reddit. The counter-argument is philosophical: if the model produces correct answers to novel problems, does it matter whether the internal mechanism is “real” reasoning? The empirical response: the failures are not random. They cluster at distribution boundaries — exactly where pattern matching breaks and abstract reasoning would be needed.

Separately, Anthropic’s faithfulness research adds nuance. Models do appear to compute answers during CoT generation, not purely after the fact. The tokens are not post-hoc rationalization — the model’s answer changes based on the reasoning steps it generates. But the sequential text presentation does not match the model’s actual computation, which happens in parallel across attention heads. The step-by-step narrative is a lossy compression of a parallel process.

Both findings matter for practitioners. CoT is useful. It improves accuracy on in-distribution tasks. It is not magic — it does not create reasoning capabilities the model lacks. And its benefits diminish sharply when the domain shifts from text to vision.

Why does CoT fail on visual inputs?

The most striking finding from recent multimodal research: removing images from long visual reasoning chains causes only a 2% accuracy drop. The models are overwhelmingly reasoning from text, not from the images they were given.

This is the visual forgetting problem. As a reasoning chain grows longer, the model’s attention to visual embeddings decreases progressively. Language tokens occupy sequential attention positions that crowd out the image representation. By step five or six of a long CoT chain, the model has effectively forgotten what the image showed.
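
One way to observe this decay directly is to track how much attention each newly generated token pays to the image-token positions across a long generation. Below is a minimal sketch assuming a HuggingFace-style VLM whose generate() can return per-step attentions; the image-token positions (`image_token_slice`) are model-specific and assumed known here.

```python
import torch

def visual_attention_per_step(step_attentions, image_token_slice):
    """step_attentions: tuple over generated tokens; each element is a tuple
    over layers of tensors shaped (batch, heads, query_len, key_len)."""
    scores = []
    for step in step_attentions:
        last_layer = step[-1]  # attentions from the final layer
        # Attention from the newest query position to the image key positions.
        to_image = last_layer[..., -1, image_token_slice]
        scores.append(to_image.mean().item())  # average over batch and heads
    return scores  # visual forgetting shows up as a downward trend

# Assumed usage:
# out = model.generate(**inputs, max_new_tokens=256,
#                      output_attentions=True, return_dict_in_generate=True)
# curve = visual_attention_per_step(out.attentions, slice(5, 581))
```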

Additional failure modes compound the problem:

Low-level perception failure. Models struggle to connect shapes, sizes, and spatial layouts with semantic reasoning. They can identify objects in an image. They cannot reliably reason about the spatial relationships between those objects.

Multi-object reasoning. Basic counting and localization tasks — “how many red circles are in the image?” — produce surprising failures. These are tasks that seem trivial for a system that processes visual input.

Robustness to perturbation. Minimal image changes (small rotations, slight color shifts) cause severe performance drops. A model that correctly answers a question about an upright image fails on the same image rotated 15 degrees.

Distractor interference. Irrelevant background elements interfere with reasoning about the foreground. Human visual reasoning filters distractors automatically. Model reasoning does not.
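
The perturbation failure above is cheap to probe yourself: ask the same question about the original and a slightly rotated image and compare answers. A minimal sketch, where `ask_model` is a hypothetical wrapper around whatever VLM you use:

```python
from PIL import Image

def rotation_robustness_check(image_path, question, ask_model, degrees=15):
    original = Image.open(image_path)
    # expand=True keeps the full image in frame after rotation.
    rotated = original.rotate(degrees, expand=True, fillcolor="white")
    answer_a = ask_model(original, question)
    answer_b = ask_model(rotated, question)
    # A robust model gives the same answer; a brittle one often flips.
    return answer_a == answer_b, (answer_a, answer_b)
```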

```mermaid
graph TD
    A[Visual input + Text question] --> B[Step 1: Perceive image<br/>Visual attention: HIGH]
    B --> C[Step 2: Begin reasoning<br/>Visual attention: MEDIUM]
    C --> D[Step 3: Continue chain<br/>Visual attention: LOW]
    D --> E[Step 4: Generate answer<br/>Visual attention: ~ZERO]
    E --> F[Answer based mostly<br/>on text, not image]

    G[Visual Forgetting:<br/>~2% accuracy drop<br/>when images removed] -.-> F
```

The implication for multimodal agent design is direct: if your browsing agent reasons about a web page through a long CoT chain, it is increasingly reasoning about the text it extracted from the page, not about what the page looks like. Visual layout information, image content, and spatial relationships are lost.

What alternatives exist to visible chain-of-thought?

Three approaches address the visual reasoning gap.

Decouple perception from reasoning. The IPVR framework uses a “see, think, confirm” pattern. First, extract visual facts from the image (perception step — short, focused, image-attended). Then reason over the extracted facts as text (reasoning step — can be long, no visual forgetting). Then confirm against the original image (verification step — re-anchors to visual input). This separation prevents the reasoning chain from drifting away from visual evidence.
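
A minimal sketch of that see/think/confirm shape, where `vlm` (image-conditioned) and `llm` (text-only) are hypothetical call wrappers; the IPVR paper's actual prompts and control flow differ in detail:

```python
def see_think_confirm(image, question, vlm, llm):
    # See: short, image-attended extraction of visual facts.
    facts = vlm(image, "List the objects, attributes, and spatial "
                       "relations relevant to: " + question)
    # Think: long, text-only reasoning over the extracted facts.
    # No image in context, so no visual forgetting.
    draft = llm(f"Facts:\n{facts}\n\nQuestion: {question}\n"
                "Reason step by step and give an answer.")
    # Confirm: re-anchor the candidate answer against the original image.
    return vlm(image, f"Candidate answer: {draft}\n"
                      "Does the image support this? Answer yes or no, "
                      "then correct the answer if needed.")
```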

Latent reasoning. Meta’s Coconut (Chain of Continuous Thought) reasons in the model’s internal latent space rather than generating visible text tokens. The last hidden state becomes the reasoning representation, fed back as an embedding without decoding to text. This bypasses the text-dominance problem because reasoning never enters the language modality. Coconut outperforms visible CoT on logical reasoning tasks and generates significantly fewer tokens — better accuracy at lower cost. The approach extends to visual domains through Latent Visual Reasoning (LVR), which integrates visual signals directly into the latent reasoning process.
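
The core mechanism is simple to sketch: run the model, take the last hidden state, and append it as the next input embedding instead of decoding a token. This sketch assumes a HuggingFace-style causal LM and omits the real Coconut training recipe (a curriculum that gradually replaces visible CoT steps with latent ones):

```python
import torch

def latent_thought_steps(model, inputs_embeds, attention_mask, n_steps=4):
    for _ in range(n_steps):
        out = model(inputs_embeds=inputs_embeds,
                    attention_mask=attention_mask,
                    output_hidden_states=True)
        # The last hidden state at the final position is the "thought".
        thought = out.hidden_states[-1][:, -1:, :]  # (batch, 1, d_model)
        # Feed the continuous thought back in place of a decoded token.
        inputs_embeds = torch.cat([inputs_embeds, thought], dim=1)
        attention_mask = torch.cat(
            [attention_mask,
             attention_mask.new_ones(attention_mask.size(0), 1)], dim=1)
    return inputs_embeds, attention_mask  # decode the final answer from here
```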

Multimodal CoT. The two-stage rationale-then-answer framework (arXiv 2302.00923) generates a text rationale from multimodal input, then uses that rationale to produce the answer. A 1B model using this approach outperformed GPT-3.5 by 16 percentage points on ScienceQA (91.68% vs 75.17%). Separately, interleaved-modal CoT alternates between text and image reasoning steps, re-processing the image at each visual step to prevent forgetting.
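
At inference time the two-stage pattern has this shape; note the paper fine-tunes two model stages rather than prompting, so treat this as a sketch of the idea, not the paper's method, with `vlm` again a hypothetical multimodal call wrapper:

```python
def rationale_then_answer(image, question, vlm):
    # Stage 1: generate a rationale conditioned on image + question.
    rationale = vlm(image, f"Question: {question}\n"
                           "Explain the relevant visual evidence step by step.")
    # Stage 2: answer conditioned on image, question, and rationale.
    return vlm(image, f"Question: {question}\nRationale: {rationale}\n"
                      "Give the final answer only.")
```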

For practitioners building multimodal agents, the concrete recommendations:

  1. Test with images removed. If accuracy drops less than 5%, your agent is not using visual evidence (a sketch of this ablation follows the list).
  2. Keep visual reasoning chains short. Three steps maximum before re-anchoring to the image.
  3. Decouple perception from reasoning. Extract visual facts first, then reason over the extraction as text.
  4. Re-anchor before critical steps. Explicitly re-process the image before any reasoning step that depends on visual content.
  5. Consider latent approaches. If visible CoT consistently underperforms, Coconut-style latent reasoning may help by keeping visual information in the representation throughout.
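
A minimal sketch of the ablation in recommendation 1: re-run your eval with images replaced by a neutral placeholder and compare accuracy. `run_agent` and the dataset format are hypothetical stand-ins for your own harness:

```python
from PIL import Image

# A neutral grey placeholder that carries no visual evidence.
BLANK = Image.new("RGB", (336, 336), color=(127, 127, 127))

def image_ablation_gap(dataset, run_agent):
    """dataset: list of (image, question, gold) triples; returns accuracy drop."""
    with_img = sum(run_agent(img, q) == gold for img, q, gold in dataset)
    without = sum(run_agent(BLANK, q) == gold for _, q, gold in dataset)
    gap = (with_img - without) / len(dataset)
    if gap < 0.05:
        print(f"Warning: only {gap:.1%} drop; the agent may be ignoring images.")
    return gap
```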

Key takeaways

  • VisBrowse-Bench exposes the visual gap. Claude-4.6-Opus at 47.6%, o3-deep-research at 41.1%. Visual-native reasoning is far behind text-domain performance.
  • CoT is useful, not magical. It improves in-distribution accuracy. It does not create abstract reasoning. Benefits diminish at distribution boundaries.
  • Models ignore images during long reasoning. 2% accuracy drop when images are removed. Visual forgetting is measurable and progressive.
  • Decouple perception from reasoning. Extract visual facts first, then reason over text. This prevents the text-dominance bias from corrupting visual evidence.
  • Latent reasoning is the emerging alternative. Coconut and LVR keep visual information in the representation throughout, avoiding the text bottleneck of visible CoT.


Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch