
“Your agent remembers nothing. It re-reads the entire conversation every time it speaks.”

TL;DR

Agent memory is a data structuring problem, not a storage problem. Memori converts dialogue into semantic triples — 81.95% accuracy on LoCoMo at 1,294 tokens per query, 5% of full-context cost. Zep and Mem0 spend roughly 3x and 1.4x as many tokens for lower accuracy. The practical gap: a full-context agent spends $26K/month re-reading conversations at scale. Memori spends $1.3K. For the foundational memory architecture patterns, see memory architectures.

[Image: a library card catalog with thousands of individual index cards compressed into a single compact drawer, in warm archival reading-room light.]

Why is agent memory not just RAG?

Because RAG retrieves. Memory learns.

RAG indexes a static corpus, embeds documents into vector space, and retrieves the most similar chunks when queried. It is stateless — the system does not change based on interactions. It is read-only — new information requires explicit re-indexing. It answers one question: “What does our knowledge base say about X?”

Agent memory answers a different question: “What did I learn about this user across our past 47 conversations?” That requires a write-manage-read loop — not just retrieval but active extraction, structuring, updating, and forgetting. When a user mentions they moved from London to Tokyo, memory should update, not append. When a preference expressed in session 12 contradicts one from session 3, memory should resolve the conflict.
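To make that write-manage-read loop concrete, here is a minimal sketch (illustrative only, not Memori's API). Facts are keyed by subject and predicate, so a later value overwrites an earlier one instead of being appended, and the London-to-Tokyo move above resolves itself by recency:

```python
# Minimal write-manage-read loop: facts keyed by (subject, predicate),
# so a new value for an existing key is an update, not an append.
from datetime import datetime, timezone

memory: dict[tuple[str, str], dict] = {}

def write_fact(subject: str, predicate: str, obj: str) -> None:
    """Upsert a fact: later writes overwrite earlier ones for the same key."""
    memory[(subject, predicate)] = {
        "object": obj,
        "updated_at": datetime.now(timezone.utc),
    }

def read_fact(subject: str, predicate: str) -> str | None:
    fact = memory.get((subject, predicate))
    return fact["object"] if fact else None

# Session 3: user lives in London.
write_fact("user:4521", "lives_in", "London")
# Session 12: user mentions the move to Tokyo; the same key is overwritten.
write_fact("user:4521", "lives_in", "Tokyo")

print(read_fact("user:4521", "lives_in"))  # Tokyo, not [London, Tokyo]
```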

The survey on autonomous agent memory (arXiv 2603.07670) formalizes five mechanism families:

  1. Context-resident compression — sliding windows, rolling summaries. Simple but drifts over time.
  2. Retrieval-augmented stores — vector embeddings with ANN search. Better recall, no structure.
  3. Reflective self-improvement — store critiques and observations. Reflexion achieved 91% on HumanEval this way.
  4. Hierarchical virtual context — OS-inspired tiering (MemGPT exemplifies this).
  5. Policy-learned management — memory ops as RL actions. AgeMem learns when to consolidate vs forget.

Most production agents use option 1 (stuff the conversation history into the prompt) or option 2 (embed and retrieve). Both scale badly. Option 1 costs linearly with conversation length. Option 2 loses structure — a vector embedding of “user moved to Tokyo” has no relationship to “user’s address” unless you engineer that connection manually.

What does Memori do differently?

Memori (arXiv 2603.19935, March 2026) treats memory as a data structuring problem. Instead of storing conversation text and retrieving similar chunks, it converts unstructured dialogue into two coordinated outputs.

Semantic triples. Subject-predicate-object facts extracted from each conversation turn. “User:4521 → kyc:level → 2” or “User:4521 → preferred_language → Spanish.” Each triple is atomic, queryable, and updatable. When the user’s KYC level changes, the triple is updated in place — not appended.

Conversation summaries. High-level narrative capturing user intent and dialogue progression. Each triple is linked to its source summary, preventing “granular facts divorced from context” — you can always trace a fact back to the conversation that produced it.
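A rough sketch of the two coordinated outputs and the provenance link between them (the field names here are illustrative, not Memori's actual schema):

```python
# Two coordinated outputs: atomic triples plus the summary they came from.
from dataclasses import dataclass

@dataclass
class ConversationSummary:
    summary_id: int
    text: str                 # high-level narrative of the session

@dataclass
class Triple:
    subject: str              # "User:4521"
    predicate: str            # "kyc:level"
    obj: str                  # "2"
    source_summary_id: int    # provenance: which summary produced this fact

summary = ConversationSummary(
    summary_id=12,
    text="User asked about upgrading KYC verification and confirmed documents.",
)
fact = Triple("User:4521", "kyc:level", "2", source_summary_id=summary.summary_id)

# Every fact stays traceable to the conversation that produced it.
assert fact.source_summary_id == summary.summary_id
```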

```mermaid
graph LR
    A[Raw dialogue<br/>200 tokens per exchange] --> B[Memori Pipeline]
    B --> C[Semantic Triples<br/>6-10 tokens per fact]
    B --> D[Conversation Summary<br/>50-100 tokens]
    C --> E[(SQL Database<br/>Queryable KG)]
    D --> E
    E --> F[Query: 1,294 tokens<br/>5% of full-context]
```

The efficiency gain comes from the compression ratio. A typical customer service exchange contains 200 tokens. The useful fact from that exchange — “user wants to upgrade to premium plan” — is 8 tokens as a triple. That is a 25:1 compression ratio on the information that matters, discarding the pleasantries, confirmations, and filler that make up most conversational text.
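One way to sanity-check the headline cost numbers: under an assumed price of $10 per million input tokens and 100,000 memory-bearing queries per month (both assumptions are mine, not from the paper), the per-query token counts from LoCoMo reproduce the TL;DR figures almost exactly.

```python
# Back-of-the-envelope cost check under assumed pricing and traffic.
PRICE_PER_MILLION_TOKENS = 10.00   # assumed
QUERIES_PER_MONTH = 100_000        # assumed

def monthly_cost(tokens_per_query: int) -> float:
    return tokens_per_query * QUERIES_PER_MONTH * PRICE_PER_MILLION_TOKENS / 1_000_000

print(f"Full context: ${monthly_cost(26_031):,.0f}/month")  # ~$26,031
print(f"Memori:       ${monthly_cost(1_294):,.0f}/month")   # ~$1,294
```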

The design is SQL-native. The knowledge graph lives in a relational database (PostgreSQL, MySQL, SQLite) in third normal form. This is a deliberate enterprise design decision: SQL databases are auditable, durable, transactional, and familiar to every backend engineer. You can query a user’s complete memory profile with a SELECT statement. You can audit what the agent knows with standard database tooling. You can enforce retention policies with DELETE queries.
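A small sketch of what that buys you in practice, using sqlite3 from the standard library; the table layout is illustrative, not Memori's actual schema:

```python
# A triples table you can audit with SELECT and govern with DELETE.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE triples (
        subject     TEXT NOT NULL,
        predicate   TEXT NOT NULL,
        object      TEXT NOT NULL,
        updated_at  TEXT NOT NULL,
        PRIMARY KEY (subject, predicate)
    )
""")
conn.execute(
    "INSERT OR REPLACE INTO triples VALUES (?, ?, ?, datetime('now'))",
    ("User:4521", "preferred_language", "Spanish"),
)

# Audit: a user's complete memory profile is one SELECT.
rows = conn.execute(
    "SELECT predicate, object FROM triples WHERE subject = ?", ("User:4521",)
).fetchall()
print(rows)

# Retention policy: a DELETE enforces it.
conn.execute("DELETE FROM triples WHERE updated_at < datetime('now', '-365 days')")
conn.commit()
```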

How does Memori compare to alternatives?

The LoCoMo benchmark (ACL 2024) provides standardized evaluation: 10 conversations, ~600 turns each, up to 32 sessions, with single-hop, temporal, multi-hop, and adversarial question types.

| System | LoCoMo accuracy | Tokens per query | Approach | Best for |
| --- | --- | --- | --- | --- |
| Memori | 81.95% | 1,294 | Semantic triples + summaries (SQL) | Accuracy + cost efficiency |
| Zep (Graphiti) | 78.94% | 3,911 | Temporal knowledge graph | Temporal reasoning, fact evolution |
| LangMem | 78.05% | | LangGraph function calls | LangGraph-native teams |
| Mem0 | 62.47% | 1,764 | Compression engine | Lowest latency (148ms p50) |
| Full context | ~85% | 26,031 | Entire history in prompt | Maximum accuracy, maximum cost |

Memori leads on the accuracy-to-cost ratio: 81.95% at 1,294 tokens is better than Zep’s 78.94% at 3,911 tokens and far better than full context’s ~85% at 26,031 tokens. The 20x cost reduction versus full-context is the headline number.

Each system has a distinct failure mode. Memori struggles with temporal reasoning (68.13% on temporal questions versus 87.87% on direct fact retrieval) — the triple format loses temporal nuance that narrative summaries preserve. Zep’s Graphiti engine requires hours of background graph processing; immediate post-ingestion retrieval often fails. Mem0 trades accuracy for speed — 62.47% is notably behind the field, but 148ms p50 latency is unmatched. LangMem is Python-only with no TypeScript SDK, limiting frontend integration.

When should you use structured memory versus vector memory?

Structured memory (semantic triples, knowledge graphs) when:

  • Facts need to be updatable (user preferences, account state, relationship status)
  • Audit trails are required (compliance, enterprise governance)
  • Temporal reasoning matters (what was the user’s plan before they changed it?)
  • Token efficiency is a cost constraint (production at scale)

Vector memory (embeddings, ANN retrieval) when:

  • Content is unstructured and diverse (meeting notes, research documents)
  • Similarity search is the primary access pattern (“find conversations like this one”)
  • Integration speed matters more than precision (prototype, MVP)

Both when:

  • The agent needs factual recall (structured) and contextual similarity (vector)
  • Different memory types serve different reasoning steps in the same workflow

The hybrid approach is increasingly common. Memori provides the structured layer. A vector store (Pinecone, Qdrant, Chroma) provides the similarity layer. The LLM queries both and synthesizes.
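A sketch of that hybrid read path, with the vector search and the LLM call left as placeholders (swap in whichever clients you use; the function names here are hypothetical):

```python
# Hybrid retrieval: precise facts from SQL plus fuzzy recall from a vector store,
# assembled into one prompt for the LLM to synthesize.
import sqlite3

def structured_facts(conn: sqlite3.Connection, subject: str) -> list[str]:
    rows = conn.execute(
        "SELECT predicate, object FROM triples WHERE subject = ?", (subject,)
    ).fetchall()
    return [f"{subject} {p} {o}" for p, o in rows]

def vector_search(query: str, k: int = 3) -> list[str]:
    """Placeholder for an ANN similarity search (Pinecone, Qdrant, Chroma, ...)."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Placeholder for your LLM client call."""
    raise NotImplementedError

def answer(conn: sqlite3.Connection, user_id: str, question: str) -> str:
    facts = structured_facts(conn, user_id)    # precise, updatable facts
    passages = vector_search(question)         # contextual, similarity-based recall
    prompt = (
        "Known facts:\n" + "\n".join(facts) +
        "\n\nRelevant past conversation:\n" + "\n".join(passages) +
        f"\n\nQuestion: {question}"
    )
    return ask_llm(prompt)
```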

Key takeaways

  • Memory is not RAG. RAG retrieves from static documents. Memory writes, manages, reads, and updates from conversations. Different problem, different architecture.
  • Semantic triples compress 25:1. A 200-token exchange produces an 8-token triple. At scale, this is a 20x cost reduction versus full-context prompting.
  • Memori leads on accuracy per token. 81.95% at 1,294 tokens. Zep gets 78.94% at 3,911 tokens. Full context gets ~85% at 26,031 tokens.
  • SQL-native design is an enterprise feature. Queryable, auditable, durable. Standard database tooling works on the agent’s memory.
  • LoCoMo is the standard benchmark. 600-turn conversations, 32 sessions, four question types. Use it to evaluate any memory system before production deployment.
  • Temporal reasoning remains hard. Memori scores 87.87% on direct facts but 68.13% on temporal questions. Zep’s temporal graph handles this better at higher token cost.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch