
TL;DR: Production agents hit a context ceiling around turn 100: tokens explode, personas become incoherent, the agent starts contradicting itself. The fix is a three-tier memory stack: working memory (current context, ~2K tokens), episodic memory (timestamped events retrieved on demand, ~1K tokens), and semantic memory (learned user preferences, ~500 tokens). TiMem (arXiv:2601.02845, January 2026) reaches 75.30% accuracy on LoCoMo and 76.88% on LongMemEval-S with this structure, while cutting recalled memory length on LoCoMo by 52.20%. Memori achieves 81.95% on LoCoMo at 1,294 tokens per query. The architecture is implementable today with PostgreSQL + pgvector + Redis.

[Figure: three nested filing drawers representing working memory, episodic memory, and semantic memory in a hierarchical agent architecture]


An agent without long-term memory is a goldfish. It resets every session, forgets what the user told it last week, and produces inconsistent advice across conversations. Teams that deploy agents this way rely on users to compensate: repeating themselves, providing context the agent should already have, tolerating contradictions.

The opposite failure is less obvious. Teams that give agents unlimited memory (full conversation history, all retrieved documents, every tool call) hit the token wall around session 3. An 8K context fills fast: system prompt (1,500 tokens), RAG retrieval (2,000 tokens), conversation history from the last two sessions (4,000 tokens). That is 7,500 tokens committed before reasoning begins, leaving roughly 500 for the model to think with. The model struggles most with information in the middle of the context window — a pattern noted in “Lost in the Middle” (Liu et al., 2024) — and the agent’s coherence degrades silently.

The solution is structured memory at three timescales.

The three tiers and what each one does

Human memory operates at multiple timescales: you remember what you were just thinking about (seconds), what happened this morning (hours), and who you are (years). The same structure applies to agent memory.

Working memory is the agent’s present: the current context window, the last 3–5 conversation turns, the tool results from this session. It’s the agent’s RAM. Typical budget: 2,000 tokens. Anything beyond the working window must be retrieved from lower tiers, not loaded wholesale.

Episodic memory is the agent’s past: timestamped facts and events stored externally and retrieved on demand. “The user mentioned they use Python for data work.” “The last three sessions all started with a question about their API rate limits.” Episodic memories are immutable once written. They’re a log, not an editable document. Typical retrieval budget: 1,000–1,500 tokens per session, selected by semantic similarity to the current query.

Semantic memory is the agent’s model of the user: learned preferences, aggregated behavioral patterns, stable traits distilled from episodic events. “User prefers direct answers over explanations.” “User works in fintech, references compliance concerns frequently.” This tier is mutable; it updates as episodes accumulate. Typical budget: 500–800 tokens loaded into system context.

┌────────────────────────────────────────────┐
│  CONTEXT WINDOW (8K tokens)                │
│                                            │
│  ┌──────────────────────────────────────┐  │
│  │ System prompt + Semantic profile     │  │  ← 2,300 tokens (static)
│  │ [User traits, preferences, persona]  │  │
│  └──────────────────────────────────────┘  │
│  ┌──────────────────────────────────────┐  │
│  │ Episodic retrieval                   │  │  ← 1,500 tokens (retrieved)
│  │ [Timestamped past events, 3–5 facts] │  │
│  └──────────────────────────────────────┘  │
│  ┌──────────────────────────────────────┐  │
│  │ Working memory                       │  │  ← 1,000 tokens (live)
│  │ [Last 3 turns, current tool results] │  │
│  └──────────────────────────────────────┘  │
│  ┌──────────────────────────────────────┐  │
│  │ Reasoning + generation space         │  │  ← 3,200 tokens
│  └──────────────────────────────────────┘  │
└────────────────────────────────────────────┘
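To make the budgets concrete, here is a minimal sketch of enforcing them at prompt-assembly time. The 4-characters-per-token heuristic and the function shapes are assumptions for illustration; a real system would use the model's tokenizer and truncate on message or sentence boundaries.

# Minimal sketch of budget enforcement at prompt-assembly time, using the
# per-tier budgets from the diagram above. Token counting is approximated
# with a 4-chars-per-token heuristic; swap in your model's tokenizer.

BUDGETS = {"semantic": 2_300, "episodic": 1_500, "working": 1_000}

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic, not a real tokenizer

def truncate_to_budget(text: str, budget: int) -> str:
    # Trim from the front: older content should already have been
    # consolidated into the lower tiers, so the tail is what matters.
    while approx_tokens(text) > budget:
        text = text[max(1, len(text) // 10):]
    return text

def build_context(semantic_profile: str, episodic_facts: list[str],
                  recent_turns: list[str], query: str) -> str:
    semantic = truncate_to_budget(semantic_profile, BUDGETS["semantic"])
    episodic = truncate_to_budget("\n".join(episodic_facts), BUDGETS["episodic"])
    working = truncate_to_budget("\n".join(recent_turns), BUDGETS["working"])
    # Roughly 3,200 tokens of an 8K window remain for reasoning/generation.
    return "\n\n".join([semantic, episodic, working, query])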

This structure is not novel. It mirrors the MemGPT paper (arXiv:2310.08560, 2023) and its central insight that agents need explicit memory management, not just bigger windows. What’s changed in 2025–2026 is the engineering maturity: systems like TiMem, Memori, and Mem0 bring this architecture to production without requiring fine-tuning.

What breaks without it

Context poisoning. If incorrect information gets stored in episodic memory (a wrong preference, a misheard fact), it surfaces in every subsequent session. The agent “learned” something false and acts on it persistently. In a flat full-context system, wrong information eventually scrolls out of the window. In a managed episodic store, it persists until explicitly corrected.

Temporal incoherence. Without episodic timestamps, the agent cannot reason about sequence. “The user said X last week and Y this week” becomes “the user said X and Y” and the agent cannot resolve the conflict or update its beliefs correctly. TiMem’s Temporal Memory Tree addresses this directly: +9.49 percentage points on knowledge update tasks (arXiv:2601.02845), where the challenge is updating a stored belief when facts change across sessions.
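A minimal sketch of why the timestamps matter: when episodic records carry them, conflicting facts about the same topic resolve to the most recent one instead of surfacing as an unresolvable pair. The record shape here is an illustration, not TiMem's actual schema.

# Given conflicting episodic facts about the same topic, prefer the most
# recent one. The dict shape is an assumption for illustration only.

from datetime import datetime

episodes = [
    {"topic": "db", "fact": "User is on PostgreSQL 14", "ts": datetime(2026, 1, 3)},
    {"topic": "db", "fact": "User migrated to PostgreSQL 16", "ts": datetime(2026, 1, 17)},
]

def current_belief(episodes: list[dict], topic: str) -> str:
    relevant = [e for e in episodes if e["topic"] == topic]
    return max(relevant, key=lambda e: e["ts"])["fact"]

print(current_belief(episodes, "db"))  # "User migrated to PostgreSQL 16"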

Token budget explosion. The naive full-context approach to persistence (load all prior history into every session) is unsustainable. The math is simple: at 2,000 tokens of new material per session, a daily-use agent accumulates 60,000 tokens of history in a month and exhausts even a 128K window in just over two months. Memori’s hierarchical approach maintains 1,294 tokens per query on LoCoMo, a benchmark built on long multi-session conversations, compared to full-context baselines requiring an order of magnitude more.

Stateless session loss. The most common failure: no memory system at all. Users repeat themselves. The agent has no model of the user’s expertise level, preferences, or history. Every session starts cold. This is tolerable for task-completion agents (search, calculator) and fatal for relationship-dependent agents (coaching, advising, customer support).

TiMem and what the benchmarks show

TiMem (arXiv:2601.02845, “Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents”) implements the three-tier structure via a Temporal Memory Tree (TMT). Raw conversation events → episodic nodes organized by recency and importance → semantic summaries at the tree root. The consolidation is semantic-guided (no fine-tuning required), and recall is complexity-aware: simple factual questions get lightweight episodic retrieval; multi-session reasoning questions get deep tree traversal.
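For intuition, here is a hypothetical sketch of what a TMT-style node could look like. The field names, the child-count trigger, and the summarize callback are all assumptions for illustration; the paper's actual consolidation algorithm is more involved.

# Hypothetical sketch of a temporal-memory-tree node in the spirit of
# TiMem's TMT. Every name here is an assumption, not the paper's API.

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TMTNode:
    summary: str                                  # semantic summary of the subtree
    timestamp_range: tuple[datetime, datetime]    # (earliest, latest) covered events
    importance: float = 0.0
    children: list["TMTNode"] = field(default_factory=list)

    def consolidate(self, summarize) -> None:
        # When a node accumulates too many children, replace their detail
        # with an LLM-written summary at this level (episodic -> semantic).
        if len(self.children) > 8:  # threshold is an assumption
            self.summary = summarize([c.summary for c in self.children])
            self.children.clear()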

Results on standard benchmarks:

System           LoCoMo accuracy   LongMemEval-S   Memory footprint / latency
TiMem            75.30%            76.88%          52.20% reduction in recalled memory length (LoCoMo)
Memori           81.95%            —               1,294 tokens/query
Mem0 (managed)   —                 —               90% token reduction reported
LangMem          competitive       —               p95 latency 59.82s (impractical for interactive use)
Zep              competitive       —               p50 total latency 1.29s

TiMem’s specific gains: +9.49pp on knowledge updates (correctly updating beliefs when new information contradicts stored facts), +12.03pp on multi-session reasoning (connecting events that occurred in separate conversations). These are the two hardest failure modes in production agents. TiMem’s architecture directly targets both.

Memori achieves the highest raw accuracy (81.95% on LoCoMo) at the lowest token cost (1,294 per query), making it the strongest research-stage option. For teams that need production stability over raw benchmark performance, Mem0’s managed service offers 90% token reduction with a compliance-ready stack.

The storage implementation

The three-tier model maps onto commodity databases:

Working memory → Redis. Sub-millisecond reads, TTL-based expiration, no persistence required. Store the last N turns as key-value pairs keyed by session ID.
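A sketch of that working-memory tier with redis-py; the key naming, window size, and TTL are assumptions:

# Working memory on Redis: last N turns per session, expiring with the
# session. Key prefix, MAX_TURNS, and TTL are illustrative choices.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
MAX_TURNS, SESSION_TTL = 5, 3600  # keep 5 turns, expire after 1h idle

def push_turn(session_id: str, turn: str) -> None:
    key = f"wm:{session_id}"
    r.lpush(key, turn)               # newest first
    r.ltrim(key, 0, MAX_TURNS - 1)   # drop anything beyond the window
    r.expire(key, SESSION_TTL)       # refresh TTL on every write

def recent_turns(session_id: str) -> list[str]:
    return r.lrange(f"wm:{session_id}", 0, -1)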

Episodic memory → PostgreSQL with hypertables (TimescaleDB) or ClickHouse for time-series queries. Immutable append-only rows: (event_id, session_id, timestamp, content, embedding_vector). The embedding enables semantic search; the timestamp enables temporal reasoning.
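A sketch of the episodic schema and similarity retrieval using psycopg and the pgvector Python adapter. Table and column names follow the row shape above; the 1536 embedding dimension and connection string are assumptions.

# Episodic tier on PostgreSQL + pgvector: append-only rows with a
# timestamp and an embedding for semantic search.

import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=agent")
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()
register_vector(conn)  # lets psycopg pass numpy arrays as vector params

conn.execute("""
    CREATE TABLE IF NOT EXISTS episodic_events (
        event_id   bigserial PRIMARY KEY,
        session_id text NOT NULL,
        ts         timestamptz NOT NULL DEFAULT now(),
        content    text NOT NULL,
        embedding  vector(1536)
    )
""")
conn.execute("""
    CREATE INDEX IF NOT EXISTS episodic_embedding_idx
        ON episodic_events USING hnsw (embedding vector_cosine_ops)
""")
conn.commit()

def retrieve_episodes(query_embedding, k: int = 5):
    # query_embedding: numpy array from your embedding model.
    # <=> is pgvector's cosine-distance operator: nearest episodes first.
    return conn.execute(
        "SELECT ts, content FROM episodic_events "
        "ORDER BY embedding <=> %s LIMIT %s",
        (query_embedding, k),
    ).fetchall()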

Semantic memory → PostgreSQL + pgvector. Mutable user profile rows: (user_id, trait, value, confidence, last_updated). Updated by a consolidation job that runs after each session, distilling new episodes into profile updates.
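And a sketch of that consolidation job, assuming a semantic_profile table with a primary key on (user_id, trait) and a distill() callback standing in for the LLM step that turns raw episodes into (trait, value, confidence) tuples:

# Post-session consolidation: distill new episodes into semantic-profile
# upserts. Schema names are assumptions matching the row shape above.

def consolidate(conn, user_id: str, new_episodes: list[str], distill) -> None:
    for trait, value, confidence in distill(new_episodes):
        conn.execute(
            """
            INSERT INTO semantic_profile
                (user_id, trait, value, confidence, last_updated)
            VALUES (%s, %s, %s, %s, now())
            ON CONFLICT (user_id, trait) DO UPDATE
                SET value = EXCLUDED.value,
                    confidence = EXCLUDED.confidence,
                    last_updated = now()
            """,
            (user_id, trait, value, confidence),
        )
    conn.commit()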

The PostgreSQL + pgvector + Redis stack is what Mem0 and Zep both converge on in their self-hosted configurations. It handles the retrieval latency requirements for interactive agents (p50 under 200ms for episodic retrieval with proper indexing) and scales horizontally when needed.

For production, see “Memory architectures for AI agents” for the foundational patterns, and “Memori’s 20x token efficiency” for a concrete production case study.

Which memory system to choose

Three tiers of production readiness, depending on your constraints:

Managed services (fastest path to production):

  • Mem0: graph DB backend, 90% token reduction vs. full-context, compliance-ready, REST API. Best for teams that don’t want to operate infrastructure.
  • Zep: hybrid managed/self-hosted, p50 1.29s search latency. More control than Mem0, more operational overhead.

Open-source, self-hosted:

  • TiMem (TiMEM-AI/timem on GitHub): best for teams that want the hierarchical consolidation algorithm without a managed dependency. Requires operational work.
  • SimpleMem: semantic lossless compression approach. Research-stage but lighter weight than TiMem.
  • LangMem: developer-friendly API, integrates with LangChain. The p95 latency (59.82 seconds in benchmarks) makes it unsuitable for interactive agents; appropriate for batch memory consolidation.

Research/infrastructure plays:

  • MemOS (MemTensor/MemOS): memory OS abstraction that treats agent memory as a managed resource with allocation, paging, and garbage collection primitives. Early but architecturally interesting for teams building agent platforms.
  • Cognee (topoteretes/cognee): knowledge engine, local-first, knowledge graph backend.

Key takeaways

  • The three-tier memory stack (working/episodic/semantic) maps to three different databases and three different token budgets. Getting the tier boundaries right is the hard implementation decision, not choosing the algorithm.
  • TiMem (arXiv:2601.02845) achieves 75.30% LoCoMo and 76.88% LongMemEval-S, with specific gains on the two hardest failure modes: +9.49pp knowledge updates, +12.03pp multi-session reasoning.
  • Memori achieves 81.95% LoCoMo at 1,294 tokens per query.
  • The production stack is PostgreSQL + pgvector (episodic + semantic) + Redis (working). Mem0 and Zep both converge on this; implement it directly for full control.
  • Without temporal timestamps in episodic memory, agents cannot update beliefs correctly when facts change across sessions. This is the silent failure mode teams discover at session 50+.

FAQ

What is hierarchical memory for AI agents? Three tiers at different timescales: working memory (current context, 2K tokens), episodic memory (timestamped events, retrieved at 1K–1.5K tokens per session), and semantic memory (user preferences distilled from episodes, 500–800 tokens). Each tier prevents a different class of failure.

Why do agents without hierarchical memory fail at scale? Stateless: users repeat themselves every session. Full-context: token ceiling hit around turn 100. Naive episodic (no timestamps): temporal incoherence, can’t update beliefs when facts change. The three-tier approach prevents all three.

What is TiMem and what does it achieve? TiMem (arXiv:2601.02845) uses a Temporal Memory Tree to consolidate conversations into episodic → semantic hierarchy without fine-tuning. Achieves 75.30% LoCoMo accuracy, 76.88% LongMemEval-S, 52.20% reduction in recalled memory length. Key gains: +9.49pp knowledge updates, +12.03pp multi-session reasoning.

What is the practical token budget breakdown? 8K window: system + semantic profile 2,300 tokens, episodic retrieval 1,500 tokens, working memory 1,000 tokens, reasoning space 3,200 tokens. TiMem compression: episodic drops to 800, working to 500, freeing ~1,200 tokens for reasoning.

Which production memory system to evaluate first? Managed: Mem0 (90% token reduction, compliance-ready) or Zep (p50 1.29s latency, more control). Self-hosted: TiMem (best accuracy), LangMem (developer-friendly, not for interactive). Most teams start with Mem0 and migrate to self-hosted when they need cost control.

