
“Your agent remembers nothing. It re-reads the entire conversation every time it speaks.”

TL;DR

Agent memory is a data structuring problem, not a storage problem. Memori converts dialogue into semantic triples — 81.95% accuracy on LoCoMo at 1,294 tokens per query, 5% of full-context cost. Zep and Mem0 spend roughly 3x and 1.4x as many tokens for lower accuracy. The practical gap: a full-context agent spends $26K/month re-reading conversations at scale. Memori spends $1.3K. For the foundational memory architecture patterns, see memory architectures.

[Image: a library card catalog with thousands of individual index cards compressed into a single compact drawer, in warm archival reading-room light.]

Why is agent memory not just RAG?

Because RAG retrieves. Memory learns.

RAG indexes a static corpus, embeds documents into vector space, and retrieves the most similar chunks when queried. It is stateless — the system does not change based on interactions. It is read-only — new information requires explicit re-indexing. It answers one question: “What does our knowledge base say about X?”

Agent memory answers a different question: “What did I learn about this user across our past 47 conversations?” That requires a write-manage-read loop — not just retrieval but active extraction, structuring, updating, and forgetting. When a user mentions they moved from London to Tokyo, memory should update, not append. When a preference expressed in session 12 contradicts one from session 3, memory should resolve the conflict.
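To make that write-manage-read loop concrete, here is a minimal sketch (illustrative only, not Memori's API). Facts are keyed by subject and predicate, so a later value overwrites an earlier one instead of being appended, and the London-to-Tokyo move above resolves itself by recency:

```python
# Minimal write-manage-read loop: facts keyed by (subject, predicate),
# so a new value for an existing key is an update, not an append.
from datetime import datetime, timezone

memory: dict[tuple[str, str], dict] = {}

def write_fact(subject: str, predicate: str, obj: str) -> None:
    """Upsert a fact: later writes overwrite earlier ones for the same key."""
    memory[(subject, predicate)] = {
        "object": obj,
        "updated_at": datetime.now(timezone.utc),
    }

def read_fact(subject: str, predicate: str) -> str | None:
    fact = memory.get((subject, predicate))
    return fact["object"] if fact else None

# Session 3: user lives in London.
write_fact("user:4521", "lives_in", "London")
# Session 12: user mentions the move to Tokyo; the same key is overwritten.
write_fact("user:4521", "lives_in", "Tokyo")

print(read_fact("user:4521", "lives_in"))  # Tokyo, not [London, Tokyo]
```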

The survey on autonomous agent memory (arXiv 2603.07670) formalizes five mechanism families:

  1. Context-resident compression — sliding windows, rolling summaries. Simple but drifts over time.
  2. Retrieval-augmented stores — vector embeddings with ANN search. Better recall, no structure.
  3. Reflective self-improvement — store critiques and observations. Reflexion achieved 91% on HumanEval this way.
  4. Hierarchical virtual context — OS-inspired tiering (MemGPT exemplifies this).
  5. Policy-learned management — memory ops as RL actions. AgeMem learns when to consolidate vs forget.

Most production agents use option 1 (stuff the conversation history into the prompt) or option 2 (embed and retrieve). Both scale badly. Option 1 costs linearly with conversation length. Option 2 loses structure — a vector embedding of “user moved to Tokyo” has no relationship to “user’s address” unless you engineer that connection manually.

What does Memori do differently?

Memori (arXiv 2603.19935, March 2026) treats memory as a data structuring problem. Instead of storing conversation text and retrieving similar chunks, it converts unstructured dialogue into two coordinated outputs.

Semantic triples. Subject-predicate-object facts extracted from each conversation turn. “User:4521 → kyc:level → 2” or “User:4521 → preferred_language → Spanish.” Each triple is atomic, queryable, and updatable. When the user’s KYC level changes, the triple is updated in place — not appended.

Conversation summaries. High-level narrative capturing user intent and dialogue progression. Each triple is linked to its source summary, preventing “granular facts divorced from context” — you can always trace a fact back to the conversation that produced it.
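A rough sketch of the two coordinated outputs and the provenance link between them (the field names here are illustrative, not Memori's actual schema):

```python
# Two coordinated outputs: atomic triples plus the summary they came from.
from dataclasses import dataclass

@dataclass
class ConversationSummary:
    summary_id: int
    text: str                 # high-level narrative of the session

@dataclass
class Triple:
    subject: str              # "User:4521"
    predicate: str            # "kyc:level"
    obj: str                  # "2"
    source_summary_id: int    # provenance: which summary produced this fact

summary = ConversationSummary(
    summary_id=12,
    text="User asked about upgrading KYC verification and confirmed documents.",
)
fact = Triple("User:4521", "kyc:level", "2", source_summary_id=summary.summary_id)

# Every fact stays traceable to the conversation that produced it.
assert fact.source_summary_id == summary.summary_id
```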

```mermaid
graph LR
    A[Raw dialogue<br/>200 tokens per exchange] --> B[Memori Pipeline]
    B --> C[Semantic Triples<br/>6-10 tokens per fact]
    B --> D[Conversation Summary<br/>50-100 tokens]
    C --> E[(SQL Database<br/>Queryable KG)]
    D --> E
    E --> F[Query: 1,294 tokens<br/>5% of full-context]
```

The efficiency gain comes from the compression ratio. A typical customer service exchange contains 200 tokens. The useful fact from that exchange — “user wants to upgrade to premium plan” — is 8 tokens as a triple. That is a 25:1 compression ratio on the information that matters, discarding the pleasantries, confirmations, and filler that make up most conversational text.
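One way to sanity-check the headline cost numbers: under an assumed price of $10 per million input tokens and 100,000 memory-bearing queries per month (both assumptions are mine, not from the paper), the per-query token counts from LoCoMo reproduce the TL;DR figures almost exactly.

```python
# Back-of-the-envelope cost check under assumed pricing and traffic.
PRICE_PER_MILLION_TOKENS = 10.00   # assumed
QUERIES_PER_MONTH = 100_000        # assumed

def monthly_cost(tokens_per_query: int) -> float:
    return tokens_per_query * QUERIES_PER_MONTH * PRICE_PER_MILLION_TOKENS / 1_000_000

print(f"Full context: ${monthly_cost(26_031):,.0f}/month")  # ~$26,031
print(f"Memori:       ${monthly_cost(1_294):,.0f}/month")   # ~$1,294
```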

The design is SQL-native. The knowledge graph lives in a relational database (PostgreSQL, MySQL, SQLite) in third normal form. This is a deliberate enterprise design decision: SQL databases are auditable, durable, transactional, and familiar to every backend engineer. You can query a user’s complete memory profile with a SELECT statement. You can audit what the agent knows with standard database tooling. You can enforce retention policies with DELETE queries.
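A small sketch of what that buys you in practice, using sqlite3 from the standard library; the table layout is illustrative, not Memori's actual schema:

```python
# A triples table you can audit with SELECT and govern with DELETE.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE triples (
        subject     TEXT NOT NULL,
        predicate   TEXT NOT NULL,
        object      TEXT NOT NULL,
        updated_at  TEXT NOT NULL,
        PRIMARY KEY (subject, predicate)
    )
""")
conn.execute(
    "INSERT OR REPLACE INTO triples VALUES (?, ?, ?, datetime('now'))",
    ("User:4521", "preferred_language", "Spanish"),
)

# Audit: a user's complete memory profile is one SELECT.
rows = conn.execute(
    "SELECT predicate, object FROM triples WHERE subject = ?", ("User:4521",)
).fetchall()
print(rows)

# Retention policy: a DELETE enforces it.
conn.execute("DELETE FROM triples WHERE updated_at < datetime('now', '-365 days')")
conn.commit()
```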

How does Memori compare to alternatives?

The LoCoMo benchmark (ACL 2024) provides standardized evaluation: 10 conversations, ~600 turns each, up to 32 sessions, with single-hop, temporal, multi-hop, and adversarial question types.

| System | LoCoMo accuracy | Tokens per query | Approach | Best for |
| --- | --- | --- | --- | --- |
| Memori | 81.95% | 1,294 | Semantic triples + summaries (SQL) | Accuracy + cost efficiency |
| Zep (Graphiti) | 78.94% | 3,911 | Temporal knowledge graph | Temporal reasoning, fact evolution |
| LangMem | 78.05% | | LangGraph function calls | LangGraph-native teams |
| Mem0 | 62.47% | 1,764 | Compression engine | Lowest latency (148ms p50) |
| Full context | ~85% | 26,031 | Entire history in prompt | Maximum accuracy, maximum cost |

Memori leads on the accuracy-to-cost ratio: 81.95% at 1,294 tokens is better than Zep’s 78.94% at 3,911 tokens and far better than full context’s ~85% at 26,031 tokens. The 20x cost reduction versus full-context is the headline number.

Each system has a distinct failure mode. Memori struggles with temporal reasoning (68.13% on temporal questions versus 87.87% on direct fact retrieval) — the triple format loses temporal nuance that narrative summaries preserve. Zep’s Graphiti engine requires hours of background graph processing; immediate post-ingestion retrieval often fails. Mem0 trades accuracy for speed — 62.47% is notably behind the field, but 148ms p50 latency is unmatched. LangMem is Python-only with no TypeScript SDK, limiting frontend integration.

When should you use structured memory versus vector memory?

Structured memory (semantic triples, knowledge graphs) when:

  • Facts need to be updatable (user preferences, account state, relationship status)
  • Audit trails are required (compliance, enterprise governance)
  • Temporal reasoning matters (what was the user’s plan before they changed it?)
  • Token efficiency is a cost constraint (production at scale)

Vector memory (embeddings, ANN retrieval) when:

  • Content is unstructured and diverse (meeting notes, research documents)
  • Similarity search is the primary access pattern (“find conversations like this one”)
  • Integration speed matters more than precision (prototype, MVP)

Both when:

  • The agent needs factual recall (structured) and contextual similarity (vector)
  • Different memory types serve different reasoning steps in the same workflow

The hybrid approach is increasingly common. Memori provides the structured layer. A vector store (Pinecone, Qdrant, Chroma) provides the similarity layer. The LLM queries both and synthesizes.
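A sketch of that hybrid read path, with the vector search and the LLM call left as placeholders (swap in whichever clients you use; the function names here are hypothetical):

```python
# Hybrid retrieval: precise facts from SQL plus fuzzy recall from a vector store,
# assembled into one prompt for the LLM to synthesize.
import sqlite3

def structured_facts(conn: sqlite3.Connection, subject: str) -> list[str]:
    rows = conn.execute(
        "SELECT predicate, object FROM triples WHERE subject = ?", (subject,)
    ).fetchall()
    return [f"{subject} {p} {o}" for p, o in rows]

def vector_search(query: str, k: int = 3) -> list[str]:
    """Placeholder for an ANN similarity search (Pinecone, Qdrant, Chroma, ...)."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Placeholder for your LLM client call."""
    raise NotImplementedError

def answer(conn: sqlite3.Connection, user_id: str, question: str) -> str:
    facts = structured_facts(conn, user_id)    # precise, updatable facts
    passages = vector_search(question)         # contextual, similarity-based recall
    prompt = (
        "Known facts:\n" + "\n".join(facts) +
        "\n\nRelevant past conversation:\n" + "\n".join(passages) +
        f"\n\nQuestion: {question}"
    )
    return ask_llm(prompt)
```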

Key takeaways

  • Memory is not RAG. RAG retrieves from static documents. Memory writes, manages, reads, and updates from conversations. Different problem, different architecture.
  • Semantic triples compress 25:1. A 200-token exchange produces an 8-token triple. At scale, this is a 20x cost reduction versus full-context prompting.
  • Memori leads on accuracy per token. 81.95% at 1,294 tokens. Zep gets 78.94% at 3,911 tokens. Full context gets ~85% at 26,031 tokens.
  • SQL-native design is an enterprise feature. Queryable, auditable, durable. Standard database tooling works on the agent’s memory.
  • LoCoMo is the standard benchmark. 600-turn conversations, 32 sessions, four question types. Use it to evaluate any memory system before production deployment.
  • Temporal reasoning remains hard. Memori scores 87.87% on direct facts but 68.13% on temporal questions. Zep’s temporal graph handles this better at higher token cost.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch