Memory Architectures
“The difference between a Chatbot and a Partner is Memory.”
TL;DR
LLMs are stateless — every request starts from zero. To build agents that learn and persist, you need engineered memory: short-term (context window management via sliding windows, summarization, or entity extraction), long-term (vector databases for cross-session episodic recall), and reflection (synthesizing raw observations into reusable insights). The Stanford Generative Agents architecture scores memories by recency, importance, and relevance. MemGPT treats the context window as RAM with the LLM managing its own virtual memory. Voyager stores learned skills as procedural memory. Together, these patterns transform stateless LLMs into agents that remember, learn, and grow.

Why Do AI Agents Need Memory?
LLMs are fundamentally stateless. If you send a request to a standard GPT-4 endpoint, the model processes it and immediately “forgets” it. The only state that exists is the Context Window you physically pass in with every call.
For a chatbot session, this is manageable. You pass the chat history [(User, "Hi"), (Bot, "Hi")] in every turn.
But for an Autonomous Agent running for days, handling thousands of steps, this breaks down.
- Cost: Passing 128k tokens per step is prohibitively expensive.
- Capacity: Even 1M tokens isn’t enough for a whole lifetime of experiences.
- Focus: “Context Stuffing” reduces intelligence. If you feed the model too much noise (irrelevant history), it hallucinates or gets distracted.
To build true agency, we need to engineer a Memory Hierarchy similar to the human brain, managing the flow of information from short-term perception to long-term storage.
What Are the Layers of Agent Memory?
Just like computers have layers (Registers -> RAM -> Hard Drive), agents utilize different tiers of memory to balance speed, cost, and capacity.
2.1 Sensory Memory (The Input)
The raw user prompt and System Instructions.
- Capacity: Limited to the immediate Context Window.
- Mechanism: Direct Prompt Injection.
- Role: Immediate perception of the “Now.”
2.2 Short-Term / Working Memory (Context)
This tracks the current conversation or active task. It holds the “Scratchpad” of thoughts.
- The Challenge: The “Context Stuffing” problem.
- Strategy 1: Sliding Window. Keep the last N turns. Simple, but forgets the beginning of the plan.
- Strategy 2: Summarization. As the window fills, call the LLM to summarize the oldest 10 turns into a paragraph (“I successfully downloaded the data and cleaned it”). Inject this summary and drop the raw logs (see the sketch after this list).
- Strategy 3: Entity Extraction. Extract specific variables (“User Name: Alice”, “Goal: Fix Bug”) and store them in a state dictionary, reducing the stored text to purely essential facts.
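A minimal sketch of Strategies 1 and 2 combined, assuming a hypothetical `llm` callable that takes a prompt string and returns text:

```python
def compact_history(llm, history, max_turns=20, keep_recent=10):
    """Sliding window plus summarization: once the history exceeds
    max_turns, fold the oldest turns into a one-paragraph summary."""
    if len(history) <= max_turns:
        return history  # still fits; pass through unchanged
    old, recent = history[:-keep_recent], history[-keep_recent:]
    transcript = "\n".join(f"{role}: {text}" for role, text in old)
    summary = llm(f"Summarize this conversation in one paragraph:\n{transcript}")
    # Replace the raw logs with the summary; keep the recent turns verbatim.
    return [("system", f"Summary of earlier conversation: {summary}")] + recent
```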
2.3 Long-Term Memory (Episodic)
Storage that persists across sessions. “What did we do last Tuesday?”
- Implementation: Vector Databases (RAG) (Pinecone, Milvus).
- Mechanism:
- Event happens (“User reset password”).
- Embed text -> Vector.
- Store metadata: `{"date": "2023-10-01", "type": "auth"}`.
- Retrieve when relevant (a minimal sketch follows).
2.4 Semantic Memory (World Knowledge)
Facts about the world, distinct from events. “The user uses Python 3.10.”
- Implementation: Knowledge Graphs (Neo4j) or Structured SQL.
- Why: Vectors are fuzzy. If you ask “Who is Alice’s manager?”, a vector search might return “Alice manages Bob” (similar words, wrong relationship). A graph query `(Alice)-[:REPORTS_TO]->(Manager)` is precise, as the toy example below shows.
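The same point in plain Python: a toy triple store (the names and facts are illustrative) showing why exact relational lookups beat fuzzy similarity:

```python
# Toy triple store: (subject, relation, object) facts.
facts = [
    ("Alice", "REPORTS_TO", "Carol"),
    ("Bob", "REPORTS_TO", "Alice"),
    ("Alice", "USES", "Python 3.10"),
]

def query(subject, relation):
    """Exact relational lookup: no embedding fuzziness."""
    return [obj for s, r, obj in facts if s == subject and r == relation]

print(query("Alice", "REPORTS_TO"))  # ['Carol'], never "Alice manages Bob"
```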
How Does the Generative Agents Architecture Work?
In 2023, Stanford/Google researchers published “Generative Agents: Interactive Simulacra of Human Behavior”. They created “Smallville,” a simulation where 25 AI agents lived in a town. They remembered relationships, planned parties, and gossiped. Their memory architecture is the blueprint for Human-Like Memory.
3.1 The Memory Stream
Every observation is a distinct object in a time-ordered list.
```
[09:00] Observed Alice drinking coffee.
[09:05] Observed Bob walking in.
[09:06] Alice said "Hi Bob".
```
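Each entry can be modeled as a small record. The fields below are an assumption for illustration, chosen to match the retrieval code later in this post:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Memory:
    """One node in the memory stream."""
    timestamp: datetime    # when the observation happened
    text: str              # natural-language description
    vector: list           # embedding of `text`
    importance_score: int  # 1-10, rated by the LLM at ingestion
```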
3.2 The Retrieval Function
How do we decide what to remember? They used a weighted score of 3 factors to retrieve the top memories for a given query:

$$\text{Score} = \alpha \cdot \text{Recency} + \beta \cdot \text{Importance} + \gamma \cdot \text{Relevance}$$

- Recency: Exponential decay. I care about what happened 5 minutes ago more than 5 years ago: $\text{Recency} = 0.99^{\text{decay\_hours}}$
- Importance: A static score separating “Noise” from “Signal”. “Ate toast” (1/10). “House on fire” (10/10).
- Implementation: Ask the LLM to rate the importance of every new memory on ingestion (see the sketch after this list).
- Relevance: The standard Cosine Similarity between the memory and the agent’s current query.
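A sketch of importance rating at ingestion, again assuming a hypothetical `llm` callable; the prompt wording is illustrative:

```python
def rate_importance(llm, memory_text):
    """Ask the model for a 1-10 importance rating when a memory is stored."""
    prompt = (
        "On a scale of 1 to 10, where 1 is mundane (e.g., ate toast) and 10 is "
        "extremely significant (e.g., house on fire), rate this memory:\n"
        f"{memory_text}\nReply with a single integer."
    )
    return int(llm(prompt).strip())
```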
3.3 The Reflection Tree (Synthesizing Wisdom)
If you only store raw logs, the agent never “learns.” It just remembers details.
- Process: Periodically (e.g., every 100 observations), the agent takes a batch of memories and asks: “What does this mean?”
- Input: “Alice drank coffee Mon”, “Alice drank coffee Tue”, “Alice drank coffee Wed”.
- Insight: “Alice is addicted to coffee.”
- Action: Save this Insight as a new memory node.
- Result: Future queries retrieve the Insight, not the 50 raw logs. This mimics human generalization.
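A minimal sketch of the reflection step, assuming the hypothetical `llm` callable and a `save_memory` helper that writes a new node into the memory stream:

```python
def reflect(llm, recent_memories, save_memory):
    """Compress a batch of raw observations into one reusable insight."""
    observations = "\n".join(m.text for m in recent_memories)
    insight = llm(
        "What high-level conclusion can you draw from these observations?\n"
        f"{observations}\nReply with one sentence."
    )
    # The insight becomes a first-class memory node, retrievable like any other.
    save_memory(text=insight, importance_score=8)  # importance value is illustrative
```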
How Does MemGPT Enable Infinite Context?
MemGPT (Memory-GPT) proposes a different analogy.
- LLM Context = RAM. (Fast, expensive, volatile).
- Vector DB = Hard Drive. (Slow, cheap, persistent).
- Paging: The OS (Prompt logic) manages “Virtual Memory.” It swaps data in and out of the Context Window based on need.
- Mechanism: The LLM itself decides what to “write to disk” (save to DB) and what to “read from disk” (search DB) via function calls.
- Impact: Enables agents to run “forever” (infinite context) by actively managing their own memory slots, just like `malloc` and `free`.
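A sketch of the paging mechanism as function-calling tools, reusing the `EpisodicMemory` stand-in from earlier. The tool names and dispatch logic are assumptions in the spirit of MemGPT, not its actual API:

```python
# Tools the LLM can call to manage its own memory (names are illustrative).
MEMORY_TOOLS = [
    {"name": "memory_write", "description": "Save a fact to long-term storage (disk)."},
    {"name": "memory_search", "description": "Page relevant facts back into context (RAM)."},
]

def handle_tool_call(name, args, vector_db):
    """Dispatch the model's memory-management function calls."""
    if name == "memory_write":
        vector_db.store(args["text"], metadata=args.get("metadata", {}))
        return "written to disk"
    if name == "memory_search":
        return vector_db.retrieve(args["query"])
```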
How Do Agents Learn New Skills? (Procedural Memory)
How does an agent “learn to code”?
- Naive Agent: Writes code -> Fails -> Rewrites -> Succeeds -> Forgets.
- Voyager Agent (Minecraft): Writes code -> Fails -> Rewrites -> Succeeds -> Saves Function to Disk.
- Skill Retrieval: Next time the goal is “Mine Diamond”, it checks its `skills/` folder. “Ah, I have a `mine_block` function.”
- Result: The agent gets faster and more capable over time, building a library of tools it wrote itself. This is akin to “Muscle Memory.”
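A minimal skill-library sketch. The `skills/` folder comes from the text above; the save/load helpers are assumptions (Voyager itself stores JavaScript skills indexed by embeddings):

```python
import os

SKILLS_DIR = "skills"

def save_skill(name, code):
    """Persist a verified function so it survives across sessions."""
    os.makedirs(SKILLS_DIR, exist_ok=True)
    with open(os.path.join(SKILLS_DIR, f"{name}.py"), "w") as f:
        f.write(code)

def load_skill(name):
    """Return previously learned code, or None if the skill is new."""
    path = os.path.join(SKILLS_DIR, f"{name}.py")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read()
```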
How Do You Implement Memory Retrieval in Code?
A runnable version of the Stanford retrieval function, using the `Memory` records sketched earlier:

```python
import numpy as np
from datetime import datetime

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_memories(query_vector, memory_stream, alpha=1, beta=1, gamma=1, top_k=3):
    """
    Ranks memories by Recency, Importance, and Relevance.
    """
    scored_memories = []
    current_time = datetime.now()
    for memory in memory_stream:
        # 1. Relevance: Cosine Similarity between the query and the memory embedding
        relevance = cosine_similarity(query_vector, memory.vector)
        # 2. Recency: Exponential decay per hour since the memory was formed
        hours_elapsed = (current_time - memory.timestamp).total_seconds() / 3600
        recency = 0.99 ** hours_elapsed
        # 3. Importance: Pre-calculated static score (1-10), normalized to 0-1
        importance = memory.importance_score / 10.0
        # Combined weighted score
        total_score = (alpha * recency) + (beta * importance) + (gamma * relevance)
        scored_memories.append((total_score, memory))
    # Sort descending by score (Python's sort is stable, so ties keep stream order)
    scored_memories.sort(key=lambda x: x[0], reverse=True)
    return [memory for _, memory in scored_memories[:top_k]]
```
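Example usage with toy two-dimensional embeddings, reusing the `Memory` dataclass from earlier (the values are made up for illustration):

```python
stream = [
    Memory(datetime(2023, 10, 1, 9, 0), "Observed Alice drinking coffee", [1.0, 0.0], 2),
    Memory(datetime(2023, 10, 1, 9, 6), "House alarm went off", [0.0, 1.0], 9),
]
top = retrieve_memories(query_vector=[0.0, 1.0], memory_stream=stream, top_k=1)
print(top[0].text)  # "House alarm went off": more important and more relevant
```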
FAQ
Q: Why do AI agents need memory architectures? A: LLMs are fundamentally stateless – they forget everything after each request. For autonomous agents running for days, passing full history in every call is too expensive (128k tokens per step), too limited (even 1M tokens cannot hold a lifetime of experiences), and too noisy (irrelevant context causes hallucination). Memory architectures solve this with tiered storage.
Q: What is the Generative Agents memory architecture? A: The Stanford/Google Generative Agents architecture uses a Memory Stream (time-ordered observations), a Retrieval Function (scoring memories by recency, importance, and relevance), and a Reflection Tree (periodically synthesizing raw observations into higher-level insights). This was demonstrated in the Smallville simulation with 25 AI agents.
Q: How does MemGPT manage infinite context for agents? A: MemGPT treats the LLM context window as RAM and vector databases as a hard drive. The LLM itself manages virtual memory by deciding what to write to disk (save to DB) and what to read from disk (search DB) via function calls, enabling agents to run indefinitely.
Q: What is procedural memory in AI agents? A: Procedural memory is the agent’s skill library. When an agent writes code that works, it saves the function to disk (like the Voyager agent in Minecraft). Next time a similar goal appears, it retrieves the saved skill instead of rebuilding from scratch – analogous to muscle memory.
Key Takeaways
- Short-term memory (context window management) allows for coherent conversation within a session
- Long-term episodic memory (vector databases) enables cross-session recall and personalization
- Semantic memory (knowledge graphs) stores precise relational facts that vector search can’t handle
- Reflection synthesizes raw observations into reusable insights, enabling learning over time
- Procedural memory (skill libraries) lets agents save and reuse code they’ve written successfully
- MemGPT’s paging model treats context as RAM and databases as disk, enabling infinite-context agents
- Without memory, agents are “goldfish” — brilliant in the moment but incapable of growth
Originally published at: arunbaby.com/ai-agents/0005-memory-architectures
If you found this helpful, consider sharing it with others who might benefit.
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch