Societies of thought: what DeepSeek R1’s internal cognitive debates reveal about reasoning
In 1986, Marvin Minsky proposed that intelligence emerges from the interaction of many small processes — a “Society of Mind.” Researchers spent decades trying to build it. DeepSeek R1 apparently found it on its own.
TL;DR
Research from Kim et al. (arXiv:2601.10825) shows that DeepSeek-R1 and QwQ-32B spontaneously generate internal “societies of thought” — multiple perspectives with distinct personality and expertise profiles that debate, shift positions, and reconcile before producing output. This wasn’t trained in; it emerged from reinforcement learning on outcome-only rewards. The implication for system design: if a model already debates internally, external multi-agent debate needs to clear a higher bar to justify its cost and complexity. For context on a parallel but distinct phenomenon, see When LLMs stop talking to themselves: latent-space reasoning — that post covers Coconut and moving reasoning into vector space, which is a different mechanism entirely.

What the paper actually found in DeepSeek R1
The short answer: DeepSeek-R1’s reasoning traces behave more like debates than monologues, with multiple statistically distinct perspectives arguing, shifting positions, and reconciling — none of which was explicitly trained.
Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, and James Evans published “Reasoning Models Generate Societies of Thought” (arXiv:2601.10825) in early 2026. They compared reasoning models — DeepSeek-R1 (671B) and QwQ-32B — against instruction-tuned baselines (DeepSeek-V3-0324, Qwen-2.5-32B-Instruct, Llama-3.3-70B-Instruct) across 8,262 problems spanning BigBench Hard, GPQA, MATH (Hard), MMLU-Pro, MUSR, and IFEval.
The finding was not incremental. Controlling for trace length, DeepSeek-R1 showed dramatically higher rates of three specific conversational behaviors: question-answering (β=0.345, p<1×10⁻³²³), perspective shifts (β=0.213, p<1×10⁻¹³⁷), and reconciliation of conflicting views (β=0.191, p<1×10⁻¹²⁵). These are not stylistic quirks — the behaviors are structural, and the significance levels make “noise” an untenable explanation.
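To make the shape of that analysis concrete, here is a minimal sketch of the kind of regression such coefficients come from. It is a hypothetical re-analysis, not the authors' code; the file name and column names are assumptions.

```python
# Hypothetical re-analysis sketch: regress the per-trace rate of a conversational
# behavior on model type while controlling for trace length.
import pandas as pd
import statsmodels.formula.api as smf

# Assumed columns: qa_rate, is_reasoning_model (0/1), trace_length_tokens
df = pd.read_csv("trace_annotations.csv")
fit = smf.ols("qa_rate ~ is_reasoning_model + trace_length_tokens", data=df).fit()
print(fit.summary())  # the is_reasoning_model coefficient plays the role of β above
```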
Using the Big Five personality index applied via LLM-as-judge, the team measured perspective diversity within a single model’s reasoning trace. DeepSeek-R1 showed neuroticism diversity of β=0.567 versus DeepSeek-V3, agreeableness diversity of β=0.297, and expertise diversity of β=0.179. The reasoning trace activates features associated with distinct personalities and domain knowledge — and those features conflict with each other before the model converges on an answer.
The mechanism they identified: a specific internal feature (feature 30939, a conversational surprise marker) sits at the 99th percentile of conversation ratio while appearing on just 0.016% of tokens. Steering this feature to strength +10 doubled accuracy on the Countdown arithmetic task, from 27.1% to 54.8%. That is a sharp causal lever acting on a small number of tokens — exactly the signature you would expect from a genuine structural mechanism rather than a statistical artifact.
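For intuition on what "steering a feature" means mechanically, here is a minimal sketch in the style of sparse-autoencoder activation steering. The layer index, decoder matrix, and hook placement are stand-ins, not the paper's released artifacts.

```python
# Sketch of activation steering: add a scaled feature direction to a layer's
# hidden states during generation. Hypothetical layer and decoder; not the paper's code.
import torch

def make_steering_hook(feature_direction: torch.Tensor, strength: float = 10.0):
    """Forward hook that adds strength * unit(direction) to the layer's output."""
    direction = feature_direction / feature_direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * direction.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Usage, assuming a HuggingFace-style decoder `model` and an SAE decoder matrix `sae_decoder`:
# direction = sae_decoder[30939]   # the feature's decoder row (assumed shape [d_model])
# handle = model.model.layers[20].register_forward_hook(make_steering_hook(direction, 10.0))
# ...generate as usual, then handle.remove()
```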
Reasoning model internal structure (Kim et al., 2026):
DeepSeek-R1 thinking trace
├── Perspective A (neuroticism high, expertise: formal logic)
│   └── "The answer is X because..."
├── Perspective B (agreeableness high, expertise: domain knowledge)
│   └── "Wait, but what about Y..."
├── Reconciliation event
│   └── "Both are correct if we consider Z"
└── Convergence → final answer

DeepSeek-V3 (instruction-tuned baseline)
└── Single consistent voice → answer
    (no conversational structure, perspective shifts, or reconciliation)
The validated measurement framework deserves a note: their LLM-as-judge system predicted distinct speaker identity and turn boundaries with ρ=0.86 and ρ=0.89 respectively on the Intelligence Squared Debates corpus (a human debate corpus), with inter-rater reliability of ICC=0.855 across two independent judge models. The tool works. What they applied it to is surprising.
How internal debate differs from chain-of-thought and latent reasoning
Chain-of-thought is a prompt technique that adds a scratchpad. Latent reasoning moves thinking into vector space. Internal debate is a different thing: a structural property of the reasoning trace itself, with multiple conflicting voices that were never requested.
It is worth being precise here because these three concepts travel in the same conversations and get blurred.
Chain-of-thought (Wei et al., 2022; Kojima et al., 2022) works by prompting the model to “think step by step.” The model generates intermediate reasoning tokens before answering. This works because each generated token gives later tokens additional intermediate computation to attend back to — effectively increasing the compute depth spent on the problem. The voice is consistent. There is no conflict, no perspective shift. As I covered in multi-step reasoning, CoT roughly triples accuracy on math and logic tasks, but it operates in a single reasoning lane.
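For contrast with what follows, the whole of chain-of-thought at the prompt level is roughly this. The question is made up; the "step by step" cue is the zero-shot CoT phrasing from Kojima et al.

```python
# Zero-shot chain-of-thought: a cue that elicits intermediate reasoning tokens
# before the answer. One consistent voice, no conflicting perspectives.
question = "A train leaves at 3:15 pm and arrives at 5:47 pm. How long is the trip?"

plain_prompt = f"{question}\nAnswer:"
cot_prompt = f"{question}\nLet's think step by step."
```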
Latent-space reasoning (Coconut, Meta FAIR, December 2024) removes the language constraint entirely. Instead of generating human-readable tokens, the model feeds hidden states back as continuous vectors. Multiple reasoning paths coexist simultaneously. The upside is density — Coconut used 71% fewer tokens than CoT on planning tasks. The downside is opacity. That mechanism is about the substrate of thinking (tokens vs. vectors), not the social structure of thinking.
Internal debate is the new observation. It is not about what medium thinking uses — it still generates readable tokens. It is about who appears to be thinking. The reasoning traces in DeepSeek-R1 show multiple perspectives with statistically distinct personality and expertise profiles. They disagree. They reconcile. This emerged without any training objective targeting it.
graph TD
A[User prompt] --> B[Reasoning model]
B --> C{Reasoning mechanism}
C --> D[Chain-of-thought<br/>Single voice, sequential steps<br/>Prompted]
C --> E[Latent reasoning<br/>Vector space, multi-path<br/>Trained explicitly]
C --> F[Internal debate<br/>Multiple perspectives, conflict, reconciliation<br/>Emergent from RL]
D --> G[Visible token output]
E --> H[Opaque vector states]
F --> I[Visible token output<br/>with structural debate]
style F fill:#f9f,stroke:#333
What makes this distinct from both predecessors is the emergence. DeepSeek-R1 was trained with Group Relative Policy Optimization (GRPO) using outcome-only rewards: the model received a signal based solely on answer correctness. No training objective said “generate multiple perspectives.” No training data labeled “here is how to debate internally.” The structure appeared because, apparently, internal debate is a more effective path to correct answers than consistent monologue — and the reward signal found that path.
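For a sense of how little the reward signal specifies, here is a simplified sketch of GRPO's group-relative advantage under an outcome-only reward. It omits the clipped policy ratio and KL term of the full objective, and the reward values are illustrative.

```python
# Simplified GRPO core: each completion's advantage is its reward normalized against
# the group sampled from the same prompt. Reward = 1.0 only if the final answer is
# correct; nothing about debate or perspectives appears anywhere in the signal.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # 8 samples, 3 correct
print(grpo_advantages(rewards))  # correct completions get positive advantage, the rest negative
```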
Kim et al. confirmed this causally. Training a Qwen-2.5-3B model on the Countdown arithmetic task with conversational scaffolding (a format that nudges the model toward dialogue-like structure) reached 38% accuracy by step 40, versus 28% for monologue fine-tuning. On Llama-3.2-3B, the gap was 40% versus 18% by step 150. Internal debate is not a side effect of intelligence — it may be a contributor to it.
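The paper's exact scaffold format is not reproduced here; the contrast below is purely illustrative of what a dialogue-shaped versus a monologue-shaped fine-tuning target might look like for a Countdown-style problem.

```python
# Purely illustrative contrast, not the paper's actual training format.
countdown_task = "Use 4, 6, and 10 to reach 64."

monologue_target = "Multiply 6 by 10 to get 60, then add 4 to reach 64."

dialogue_target = (
    "Speaker A: 6 * 10 = 60, and 60 + 4 = 64. Done?\n"
    "Speaker B: Wait, does that use each number exactly once? Yes: 6, 10, 4.\n"
    "Speaker A: Agreed. Final answer: 6 * 10 + 4 = 64."
)
```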
What this means for multi-agent system design
If a reasoning model already runs internal debate across personality and expertise dimensions, external multi-agent debate needs to justify what it adds that the single model cannot provide itself. The honest answer is: sometimes nothing.
The multi-agent debate literature has promising results. Du et al. (2023) showed in “Improving Factuality and Reasoning in Language Models through Multiagent Debate” (arXiv:2305.14325) that having multiple LLM instances critique each other’s answers improves factual accuracy. More recent work on Adaptive Heterogeneous Multi-Agent Debate (A-HMAD, 2025) reports 4–6% absolute accuracy gains over single-model approaches on six benchmarks and 30%+ reduction in factual errors in biography generation.
But read those results carefully. A significant portion of multi-agent debate gains comes from majority voting — running multiple completions and picking the most common answer. That is an ensemble technique, not a debate technique. A single model with self-consistency sampling (run five times, vote) often matches dedicated debate pipelines at a fraction of the coordination overhead.
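Self-consistency is cheap to implement; the baseline any debate pipeline should have to beat looks roughly like this, where the generate_answer callable is a stand-in for whatever client you use.

```python
# Sketch of self-consistency: sample the same model k times at nonzero temperature,
# then majority-vote the final answers. An ensemble, not a debate.
from collections import Counter

def self_consistent_answer(prompt: str, generate_answer, k: int = 5) -> str:
    answers = [generate_answer(prompt, temperature=0.8) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```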
The societies of thought finding sharpens this question. If DeepSeek-R1 already generates conflicting perspectives internally, what does a multi-agent debate system add? I see three cases where external debate genuinely adds value:
1. Genuinely independent context. Different agents can access different tools, databases, or retrieval sources. Internal debate cannot do this — all perspectives inside a single model share one context window. If your agents carry meaningfully different information, external debate is irreplaceable.
2. Architectural diversity. Internal perspectives in DeepSeek-R1 share the same weights. If you run debate across genuinely different model architectures — a reasoning model, a retrieval-augmented model, a code-specialized model — you get uncorrelated failure modes. One agent’s blind spot is not another’s.
3. Scale and specialization. Internal debate operates within a single forward pass’s budget. For tasks requiring deep domain expertise — a 200-page legal document, a full codebase audit — external agents with specialized contexts may reason more accurately than any single model’s internal plurality.
graph LR
A[Task] --> B{What does internal<br/>debate already provide?}
B -->|Multiple perspectives<br/>on shared context| C[Single reasoning model<br/>is sufficient]
B -->|Need separate<br/>information sources| D[External multi-agent<br/>adds value: different tools/RAG]
B -->|Need different<br/>model architectures| E[External multi-agent<br/>adds value: uncorrelated errors]
B -->|Need deep specialization<br/>beyond context window| F[External multi-agent<br/>adds value: specialized agents]
C --> G[Use DeepSeek-R1 or QwQ-32B<br/>with self-consistency]
D --> H[Multi-agent with<br/>heterogeneous retrieval]
E --> I[Cross-architecture<br/>debate ensemble]
F --> J[Domain-specialist<br/>agent pipeline]
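Condensed into code, the routing decision above looks roughly like this; the criteria names are this post's framing rather than an established API.

```python
# Sketch of the decision: route to a single reasoning model unless the task needs
# something internal debate cannot provide. Criteria names are illustrative.
from dataclasses import dataclass

@dataclass
class TaskProfile:
    needs_separate_sources: bool      # agents must see different tools / retrieval
    needs_mixed_architectures: bool   # want uncorrelated failure modes
    exceeds_single_context: bool      # specialization beyond one context window

def choose_architecture(task: TaskProfile) -> str:
    if task.needs_separate_sources:
        return "multi-agent: heterogeneous retrieval"
    if task.needs_mixed_architectures:
        return "multi-agent: cross-architecture debate ensemble"
    if task.exceeds_single_context:
        return "multi-agent: domain-specialist pipeline"
    return "single reasoning model + self-consistency"
```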
The uncomfortable implication: a lot of multi-agent pipelines built in 2024 were probably externalizing what a single reasoning model could do internally. Not all of them, but some. If your agents are all the same model, all given the same context, and you are just having them argue — you may be paying 5x the inference cost to recreate something that DeepSeek-R1 does for free in its thinking trace. Worth auditing.
For reference on how multi-agent architectures are typically structured before considering this question, see Multi-Agent Architectures: The Power of Coordination.
The emergent vs. designed distinction — and why it matters for architecture
Emergence changes the design equation. When a capability appears spontaneously from a training objective, it is both harder to control and potentially more robust than a capability you designed in. That matters for how you architect systems around it.
The Minsky parallel is worth taking seriously. In “The Society of Mind” (1986), Minsky argued that human intelligence emerges from interactions among many simple, limited processes — “agents” in his vocabulary — none of which is intelligent alone. The system-level behavior is richer than any component. He was describing a theory of mind. What Kim et al. found is that, in the process of optimizing for correct answers, DeepSeek-R1 discovered a structurally similar arrangement on its own.
Minsky’s agents were hypothetical and modular, designed as a theoretical framework. DeepSeek-R1’s internal perspectives are empirically measured, statistically significant, and not designed by anyone. Evans, Bratton, and Agüera y Arcas noted in their March 2026 paper “Agentic AI and the Next Intelligence Explosion” (arXiv:2603.20639) that frontier reasoning models operate through “spontaneous cognitive debates that argue, verify, and reconcile to solve complex tasks” — treating this as a signal about the direction of intelligence scaling, not just a curiosity about one model.
The emergent nature creates three practical tensions:
Tension 1: You cannot directly audit it. Internal debate in DeepSeek-R1 happens in the thinking trace, which is partly visible. But you cannot reliably identify which segment represents “perspective A” versus “perspective B” in real time. Unlike a multi-agent system where you log each agent’s output separately, the internal debate is woven through a continuous text stream. Mechanistic interpretability (steering feature 30939) can probe it, but that is a research technique, not a production monitoring tool.
Tension 2: You cannot steer it precisely. You can influence the thinking trace with prompting — asking the model to “consider multiple perspectives” will activate some of this structure. But you cannot specify “debate from the perspective of a cautious legal analyst” at the internal feature level without fine-tuning. This is the gap between designed multi-agent systems (where you write each agent’s system prompt) and emergent internal debate (where the model decides).
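At the prompt level, the closest approximation looks something like the sketch below; the clause variable is a placeholder, and nothing guarantees the requested personas map onto the model's internal features.

```python
# Prompt-level approximation of internal debate. You can ask for multiple perspectives,
# but you cannot pin internal features to a persona the way a designed multi-agent
# system pins each agent's system prompt.
clause = "..."  # stand-in for the actual document text
prompt = (
    "Consider the clause below from at least two conflicting perspectives, "
    "let them disagree explicitly, then reconcile before giving a final recommendation.\n\n"
    f"Clause: {clause}"
)
```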
Tension 3: The structure may be model-specific. The Kim et al. results hold for DeepSeek-R1 and QwQ-32B, both trained with extended RL. It is not clear that GPT-4o or Claude 3.7 Sonnet exhibit the same structural property — their training objectives differ. Before assuming your reasoning model has an internal society of thought, it is worth checking whether it was trained with the kind of extended RL that appears to produce this behavior.
The architectural implication I keep coming back to: for tasks where you need auditability and control of each reasoning perspective, external multi-agent systems remain the right choice. You can log, trace, and attribute each agent’s contribution. For tasks where you want maximum reasoning quality on a well-defined problem and you do not need to audit the deliberation, a single reasoning model with internal debate may outperform a hand-crafted multi-agent pipeline — partly because the internally-discovered structure may be more adaptive than any structure you design.
Key takeaways
- DeepSeek-R1 and QwQ-32B generate statistically significant internal debate — perspective shifts, question-answering, and reconciliation events — that instruction-tuned models do not, as measured across 8,262 problems by Kim et al. (arXiv:2601.10825).
- This is distinct from chain-of-thought (single voice, prompted) and latent reasoning (vector space, different substrate). Internal debate is a structural property of the reasoning trace itself, emergent from RL training on outcome-only rewards.
- The causal evidence is sharp: steering feature 30939 at strength +10 doubled accuracy on Countdown (27.1% → 54.8%). Conversational fine-tuning reached 38–40% accuracy vs. 18–28% for monologue fine-tuning across two model families.
- External multi-agent debate adds genuine value when agents carry different information, run different architectures, or need deep specialization beyond a single context window. It may add little when all agents share the same model, same context, and same world knowledge.
- Emergence vs. design matters for production: emergent internal debate is harder to audit and steer but may be more adaptive. Designed multi-agent systems offer auditability at the cost of brittleness.
FAQ
What are “societies of thought” in DeepSeek R1?
Societies of thought refers to the emergent internal debate structure discovered by Kim et al. (2026) in DeepSeek-R1 and QwQ-32B. Rather than reasoning in a single consistent voice, these models generate perspectives with distinct personality and expertise profiles — measured using the Big Five personality index — that visibly disagree, shift positions, and reconcile during the thinking trace. DeepSeek-R1 showed a neuroticism diversity score of β=0.567 versus its instruction-tuned counterpart DeepSeek-V3, indicating far greater internal conflict during reasoning.
How is internal debate in DeepSeek R1 different from chain-of-thought?
Chain-of-thought is a prompt technique that elicits visible step-by-step reasoning — the model explains its work in a single consistent voice. Internal debate, as documented in DeepSeek-R1, is a structural phenomenon where multiple perspectives with different expertise and personality profiles emerge, argue, and reconcile within the reasoning trace. It was not prompted or designed — it emerged from reinforcement learning training on outcome-only rewards.
How is this different from the Coconut/latent-space reasoning paper?
The Coconut paper (Meta FAIR, December 2024) showed that models can move reasoning out of token space entirely into continuous vector space, exploring multiple paths simultaneously with 71% fewer tokens. That is a different mechanism — it concerns where reasoning happens (tokens vs. vectors). The societies of thought paper concerns what the reasoning process looks like structurally: not a monologue but an internal dialogue with conflicting voices. Both are emergent reasoning phenomena but at different levels of abstraction.
Does internal debate mean multi-agent systems are unnecessary?
No, but it does shift the burden of proof. If a single reasoning model already runs internal debate across personality and expertise dimensions, external multi-agent debate needs to add something the model cannot provide internally: genuinely independent context windows, separate world knowledge, different architectures, or ensemble averaging across uncorrelated failure modes. Spinning up five instances of the same model and having them discuss the same context probably replicates what the model already does — at significant extra cost.
Was societies of thought intentionally trained into DeepSeek R1?
No. DeepSeek R1 was trained with Group Relative Policy Optimization using outcome-only rewards — the model received a signal based solely on whether its final answer was correct. No training objective targeted internal debate or perspective diversity. The societies of thought structure emerged spontaneously, which Kim et al. confirmed by running controlled RL experiments: when they trained smaller models (Qwen-2.5-3B, Llama-3.2-3B) with conversational scaffolding, accuracy improved dramatically versus monologue-style fine-tuning.
References
- arXiv:2601.10825 — Kim, Lai, Scherrer, Agüera y Arcas, Evans. “Reasoning Models Generate Societies of Thought.” 2026.
- arXiv:2603.20639 — Evans, Bratton, Agüera y Arcas. “Agentic AI and the Next Intelligence Explosion.” March 2026.