
TL;DR: Production agents hit a context ceiling around turn 100: tokens explode, personas become incoherent, the agent starts contradicting itself. The fix is a three-tier memory stack: working memory (current context, ~2K tokens), episodic memory (timestamped events retrieved on demand, ~1K tokens), and semantic memory (learned user preferences, ~500 tokens). TiMem (arXiv:2601.02845, January 2026) reaches 75.30% accuracy on LoCoMo and 76.88% on LongMemEval-S with this structure, while cutting recalled memory length on LoCoMo by 52.20%. Memori achieves 81.95% on LoCoMo at 1,294 tokens per query. The architecture is implementable today with PostgreSQL + pgvector + Redis.

[Figure: three nested filing drawers representing working memory, episodic memory, and semantic memory in a hierarchical agent architecture]


An agent without long-term memory is a goldfish. It resets every session, forgets what the user told it last week, and produces inconsistent advice across conversations. Teams that deploy agents this way rely on users to compensate: repeating themselves, providing context the agent should already have, tolerating contradictions.

The opposite failure is less obvious. Teams that give agents unlimited memory (full conversation history, all retrieved documents, every tool call) hit the token wall around session 3. An 8K context fills fast: system prompt (1,500 tokens), RAG retrieval (2,000 tokens), conversation history from the last two sessions (4,000 tokens). That is 7,500 tokens committed before reasoning begins, leaving roughly 500 for the model to think with. The model struggles most with information in the middle of the context window — a pattern noted in “Lost in the Middle” (Liu et al., 2024) — and the agent’s coherence degrades silently.

The solution is structured memory at three timescales.

The three tiers and what each one does

Human memory operates at multiple timescales: you remember what you were just thinking about (seconds), what happened this morning (hours), and who you are (years). The same structure applies to agent memory.

Working memory is the agent’s present: the current context window, the last 3–5 conversation turns, the tool results from this session. It’s the agent’s RAM. Typical budget: 2,000 tokens. Anything beyond the working window must be retrieved from lower tiers, not loaded wholesale.

Episodic memory is the agent’s past: timestamped facts and events stored externally and retrieved on demand. “The user mentioned they use Python for data work.” “The last three sessions all started with a question about their API rate limits.” Episodic memories are immutable once written. They’re a log, not an editable document. Typical retrieval budget: 1,000–1,500 tokens per session, selected by semantic similarity to the current query.

Semantic memory is the agent’s model of the user: learned preferences, aggregated behavioral patterns, stable traits distilled from episodic events. “User prefers direct answers over explanations.” “User works in fintech, references compliance concerns frequently.” This tier is mutable; it updates as episodes accumulate. Typical budget: 500–800 tokens loaded into system context.

┌────────────────────────────────────────────┐
│  CONTEXT WINDOW (8K tokens)                │
│                                            │
│  ┌──────────────────────────────────────┐  │
│  │ System prompt + Semantic profile     │  │  ← 2,300 tokens (static)
│  │ [User traits, preferences, persona]  │  │
│  └──────────────────────────────────────┘  │
│  ┌──────────────────────────────────────┐  │
│  │ Episodic retrieval                   │  │  ← 1,500 tokens (retrieved)
│  │ [Timestamped past events, 3–5 facts] │  │
│  └──────────────────────────────────────┘  │
│  ┌──────────────────────────────────────┐  │
│  │ Working memory                       │  │  ← 1,000 tokens (live)
│  │ [Last 3 turns, current tool results] │  │
│  └──────────────────────────────────────┘  │
│  ┌──────────────────────────────────────┐  │
│  │ Reasoning + generation space         │  │  ← 3,200 tokens
│  └──────────────────────────────────────┘  │
└────────────────────────────────────────────┘
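To make the budgets concrete, here is a minimal sketch of enforcing them at prompt-assembly time. The 4-characters-per-token heuristic and the function shapes are assumptions for illustration; a real system would use the model's tokenizer and truncate on message or sentence boundaries.

# Minimal sketch of budget enforcement at prompt-assembly time, using the
# per-tier budgets from the diagram above. Token counting is approximated
# with a 4-chars-per-token heuristic; swap in your model's tokenizer.

BUDGETS = {"semantic": 2_300, "episodic": 1_500, "working": 1_000}

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic, not a real tokenizer

def truncate_to_budget(text: str, budget: int) -> str:
    # Trim from the front: older content should already have been
    # consolidated into the lower tiers, so the tail is what matters.
    while approx_tokens(text) > budget:
        text = text[max(1, len(text) // 10):]
    return text

def build_context(semantic_profile: str, episodic_facts: list[str],
                  recent_turns: list[str], query: str) -> str:
    semantic = truncate_to_budget(semantic_profile, BUDGETS["semantic"])
    episodic = truncate_to_budget("\n".join(episodic_facts), BUDGETS["episodic"])
    working = truncate_to_budget("\n".join(recent_turns), BUDGETS["working"])
    # Roughly 3,200 tokens of an 8K window remain for reasoning/generation.
    return "\n\n".join([semantic, episodic, working, query])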

This structure is not novel. It mirrors the MemGPT paper (arXiv:2310.08560, 2023) and its central insight that agents need explicit memory management, not just bigger windows. What’s changed in 2025–2026 is the engineering maturity: systems like TiMem, Memori, and Mem0 bring this architecture to production without requiring fine-tuning.

What breaks without it

Context poisoning. If incorrect information gets stored in episodic memory (a wrong preference, a misheard fact), it surfaces in every subsequent session. The agent “learned” something false and acts on it persistently. In a flat full-context system, wrong information eventually scrolls out of the window. In a managed episodic store, it persists until explicitly corrected.

Temporal incoherence. Without episodic timestamps, the agent cannot reason about sequence. “The user said X last week and Y this week” becomes “the user said X and Y” and the agent cannot resolve the conflict or update its beliefs correctly. TiMem’s Temporal Memory Tree addresses this directly: +9.49 percentage points on knowledge update tasks (arXiv:2601.02845), where the challenge is updating a stored belief when facts change across sessions.
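A minimal sketch of why the timestamps matter: when episodic records carry them, conflicting facts about the same topic resolve to the most recent one instead of surfacing as an unresolvable pair. The record shape here is an illustration, not TiMem's actual schema.

# Given conflicting episodic facts about the same topic, prefer the most
# recent one. The dict shape is an assumption for illustration only.

from datetime import datetime

episodes = [
    {"topic": "db", "fact": "User is on PostgreSQL 14", "ts": datetime(2026, 1, 3)},
    {"topic": "db", "fact": "User migrated to PostgreSQL 16", "ts": datetime(2026, 1, 17)},
]

def current_belief(episodes: list[dict], topic: str) -> str:
    relevant = [e for e in episodes if e["topic"] == topic]
    return max(relevant, key=lambda e: e["ts"])["fact"]

print(current_belief(episodes, "db"))  # "User migrated to PostgreSQL 16"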

Token budget explosion. The naive full-context approach to persistence (load all prior history into every session) is unsustainable. The math is simple: at 2,000 tokens of new material per session, a daily-use agent accumulates 60,000 tokens of history in a month and exhausts even a 128K window in just over two months. Memori’s hierarchical approach maintains 1,294 tokens per query on LoCoMo, a benchmark built on long multi-session conversations, compared to full-context baselines requiring an order of magnitude more.

Stateless session loss. The most common failure: no memory system at all. Users repeat themselves. The agent has no model of the user’s expertise level, preferences, or history. Every session starts cold. This is tolerable for task-completion agents (search, calculator) and fatal for relationship-dependent agents (coaching, advising, customer support).

TiMem and what the benchmarks show

TiMem (arXiv:2601.02845, “Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents”) implements the three-tier structure via a Temporal Memory Tree (TMT). Raw conversation events → episodic nodes organized by recency and importance → semantic summaries at the tree root. The consolidation is semantic-guided (no fine-tuning required), and recall is complexity-aware: simple factual questions get lightweight episodic retrieval; multi-session reasoning questions get deep tree traversal.
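For intuition, here is a hypothetical sketch of what a TMT-style node could look like. The field names, the child-count trigger, and the summarize callback are all assumptions for illustration; the paper's actual consolidation algorithm is more involved.

# Hypothetical sketch of a temporal-memory-tree node in the spirit of
# TiMem's TMT. Every name here is an assumption, not the paper's API.

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TMTNode:
    summary: str                                  # semantic summary of the subtree
    timestamp_range: tuple[datetime, datetime]    # (earliest, latest) covered events
    importance: float = 0.0
    children: list["TMTNode"] = field(default_factory=list)

    def consolidate(self, summarize) -> None:
        # When a node accumulates too many children, replace their detail
        # with an LLM-written summary at this level (episodic -> semantic).
        if len(self.children) > 8:  # threshold is an assumption
            self.summary = summarize([c.summary for c in self.children])
            self.children.clear()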

Results on standard benchmarks:

System           LoCoMo accuracy   LongMemEval-S   Memory footprint / latency
TiMem            75.30%            76.88%          52.20% reduction in recalled memory length (LoCoMo)
Memori           81.95%            —               1,294 tokens/query
Mem0 (managed)   —                 —               90% token reduction reported
LangMem          competitive       —               p95 latency 59.82s (impractical for interactive use)
Zep              competitive       —               p50 total latency 1.29s

TiMem’s specific gains: +9.49pp on knowledge updates (correctly updating beliefs when new information contradicts stored facts), +12.03pp on multi-session reasoning (connecting events that occurred in separate conversations). These are the two hardest failure modes in production agents. TiMem’s architecture directly targets both.

Memori achieves the highest raw accuracy (81.95% on LoCoMo) at the lowest token cost (1,294 per query), making it the strongest research-stage option. For teams that need production stability over raw benchmark performance, Mem0’s managed service offers 90% token reduction with a compliance-ready stack.

The storage implementation

The three-tier model maps onto commodity databases:

Working memory → Redis. Sub-millisecond reads, TTL-based expiration, no persistence required. Store the last N turns as key-value pairs keyed by session ID.
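A sketch of that working-memory tier with redis-py; the key naming, window size, and TTL are assumptions:

# Working memory on Redis: last N turns per session, expiring with the
# session. Key prefix, MAX_TURNS, and TTL are illustrative choices.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
MAX_TURNS, SESSION_TTL = 5, 3600  # keep 5 turns, expire after 1h idle

def push_turn(session_id: str, turn: str) -> None:
    key = f"wm:{session_id}"
    r.lpush(key, turn)               # newest first
    r.ltrim(key, 0, MAX_TURNS - 1)   # drop anything beyond the window
    r.expire(key, SESSION_TTL)       # refresh TTL on every write

def recent_turns(session_id: str) -> list[str]:
    return r.lrange(f"wm:{session_id}", 0, -1)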

Episodic memory → PostgreSQL with hypertables (TimescaleDB) or ClickHouse for time-series queries. Immutable append-only rows: (event_id, session_id, timestamp, content, embedding_vector). The embedding enables semantic search; the timestamp enables temporal reasoning.
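A sketch of the episodic schema and similarity retrieval using psycopg and the pgvector Python adapter. Table and column names follow the row shape above; the 1536 embedding dimension and connection string are assumptions.

# Episodic tier on PostgreSQL + pgvector: append-only rows with a
# timestamp and an embedding for semantic search.

import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=agent")
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()
register_vector(conn)  # lets psycopg pass numpy arrays as vector params

conn.execute("""
    CREATE TABLE IF NOT EXISTS episodic_events (
        event_id   bigserial PRIMARY KEY,
        session_id text NOT NULL,
        ts         timestamptz NOT NULL DEFAULT now(),
        content    text NOT NULL,
        embedding  vector(1536)
    )
""")
conn.execute("""
    CREATE INDEX IF NOT EXISTS episodic_embedding_idx
        ON episodic_events USING hnsw (embedding vector_cosine_ops)
""")
conn.commit()

def retrieve_episodes(query_embedding, k: int = 5):
    # query_embedding: numpy array from your embedding model.
    # <=> is pgvector's cosine-distance operator: nearest episodes first.
    return conn.execute(
        "SELECT ts, content FROM episodic_events "
        "ORDER BY embedding <=> %s LIMIT %s",
        (query_embedding, k),
    ).fetchall()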

Semantic memory → PostgreSQL + pgvector. Mutable user profile rows: (user_id, trait, value, confidence, last_updated). Updated by a consolidation job that runs after each session, distilling new episodes into profile updates.
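And a sketch of that consolidation job, assuming a semantic_profile table with a primary key on (user_id, trait) and a distill() callback standing in for the LLM step that turns raw episodes into (trait, value, confidence) tuples:

# Post-session consolidation: distill new episodes into semantic-profile
# upserts. Schema names are assumptions matching the row shape above.

def consolidate(conn, user_id: str, new_episodes: list[str], distill) -> None:
    for trait, value, confidence in distill(new_episodes):
        conn.execute(
            """
            INSERT INTO semantic_profile
                (user_id, trait, value, confidence, last_updated)
            VALUES (%s, %s, %s, %s, now())
            ON CONFLICT (user_id, trait) DO UPDATE
                SET value = EXCLUDED.value,
                    confidence = EXCLUDED.confidence,
                    last_updated = now()
            """,
            (user_id, trait, value, confidence),
        )
    conn.commit()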

The PostgreSQL + pgvector + Redis stack is what Mem0 and Zep both converge on in their self-hosted configurations. It handles the retrieval latency requirements for interactive agents (p50 under 200ms for episodic retrieval with proper indexing) and scales horizontally when needed.

For production, see “Memory architectures for AI agents” for the foundational patterns, and “Memori’s 20x token efficiency” for a concrete production case study.

Which memory system to choose

Three tiers of production readiness, depending on your constraints:

Managed services (fastest path to production):

  • Mem0: graph DB backend, 90% token reduction vs. full-context, compliance-ready, REST API. Best for teams that don’t want to operate infrastructure.
  • Zep: hybrid managed/self-hosted, p50 1.29s search latency. More control than Mem0, more operational overhead.

Open-source, self-hosted:

  • TiMem (TiMEM-AI/timem on GitHub): best for teams that want the hierarchical consolidation algorithm without a managed dependency. Requires operational work.
  • SimpleMem: semantic lossless compression approach. Research-stage but lighter weight than TiMem.
  • LangMem: developer-friendly API, integrates with LangChain. The p95 latency (59.82 seconds in benchmarks) makes it unsuitable for interactive agents; appropriate for batch memory consolidation.

Research/infrastructure plays:

  • MemOS (MemTensor/MemOS): memory OS abstraction that treats agent memory as a managed resource with allocation, paging, and garbage collection primitives. Early but architecturally interesting for teams building agent platforms.
  • Cognee (topoteretes/cognee): knowledge engine, local-first, knowledge graph backend.

Key takeaways

  • The three-tier memory stack (working/episodic/semantic) maps to three different databases and three different token budgets. Getting the tier boundaries right is the hard implementation decision, not choosing the algorithm.
  • TiMem (arXiv:2601.02845) achieves 75.30% LoCoMo and 76.88% LongMemEval-S, with specific gains on the two hardest failure modes: +9.49pp knowledge updates, +12.03pp multi-session reasoning.
  • Memori achieves 81.95% LoCoMo at 1,294 tokens per query.
  • The production stack is PostgreSQL + pgvector (episodic + semantic) + Redis (working). Mem0 and Zep both converge on this; implement it directly for full control.
  • Without temporal timestamps in episodic memory, agents cannot update beliefs correctly when facts change across sessions. This is the silent failure mode teams discover at session 50+.

FAQ

What is hierarchical memory for AI agents? Three tiers at different timescales: working memory (current context, 2K tokens), episodic memory (timestamped events, retrieved at 1K–1.5K tokens per session), and semantic memory (user preferences distilled from episodes, 500–800 tokens). Each tier prevents a different class of failure.

Why do agents without hierarchical memory fail at scale? Stateless: users repeat themselves every session. Full-context: token ceiling hit around turn 100. Naive episodic (no timestamps): temporal incoherence, can’t update beliefs when facts change. The three-tier approach prevents all three.

What is TiMem and what does it achieve? TiMem (arXiv:2601.02845) uses a Temporal Memory Tree to consolidate conversations into episodic → semantic hierarchy without fine-tuning. Achieves 75.30% LoCoMo accuracy, 76.88% LongMemEval-S, 52.20% reduction in recalled memory length. Key gains: +9.49pp knowledge updates, +12.03pp multi-session reasoning.

What is the practical token budget breakdown? 8K window: system + semantic profile 2,300 tokens, episodic retrieval 1,500 tokens, working memory 1,000 tokens, reasoning space 3,200 tokens. TiMem compression: episodic drops to 800, working to 500, freeing ~1,200 tokens for reasoning.

Which production memory system to evaluate first? Managed: Mem0 (90% token reduction, compliance-ready) or Zep (p50 1.29s latency, more control). Self-hosted: TiMem (best accuracy), LangMem (developer-friendly, not for interactive). Most teams start with Mem0 and migrate to self-hosted when they need cost control.

