Context Window Management
“The Finite Canvas of Intelligence: Managing the Agent’s RAM.”
1. Introduction: The RAM of the LLM
If Vector Search is the “Hard Drive” of an agent (Long-term, slow, massive), the Context Window is its RAM (Short-term, fast, expensive, volatile).
Every time you send a request to an LLM, you are sending a state snapshot. The model reads the inputs, performs inference, and generates an output. It does not “remember” the previous request unless you pass it in again.
In 2023, context windows were small (4k - 8k tokens). This forced developers to be extremely disciplined, summarizing every conversation turn. In 2025, we have models with 1 Million+ token windows (Gemini 1.5) and 200k windows (Claude 3). This leads to a dangerous fallacy: “Why manage context? Just stuff everything in.”
This is wrong for three reasons:
- Cost: Input tokens cost money. Re-sending a 100-page manual on every API call will burn through your budget in minutes: even at modest rates, $0.50 per call * 1000 calls = $500.
- Latency: Reading 1M tokens takes time. The Time-to-First-Token (TTFT) scales linearly (or worse) with input length. Real-time agents cannot wait 30 seconds to “read the manual” before answering “Hello.”
- Accuracy (The Dilution Effect): The more garbage you put in the context, the less attention the model pays to the signal. This is known as the Signal-to-Noise Ratio problem. A model given 1000 irrelevant sentences and 1 relevant sentence is significantly more likely to hallucinate than a model given just the 1 relevant sentence.
Effective Context Management is the art of curating the perfect prompt—giving the model exactly what it needs to solve the current step, and nothing else.
2. The Physics of Attention: Why Length Matters
To understand why context matters, we must peek inside the Transformer architecture.
2.1 The Quadratic Bottleneck: $O(N^2)$
The core mechanism of an LLM is Self-Attention. Every token in the sequence looks at every other token to calculate its “Attention Score.”
- “The” looks at “cat”, “sat”, “on”, “mat”.
- “cat” looks at “The”, “sat”, “on”, “mat”.
If you have sequence length $N$, the number of calculations is $N^2$.
- 1k tokens -> 1M operations.
- 100k tokens -> 10B operations.
While recent optimizations like FlashAttention-2 and Ring Attention have reduced the memory footprint and constant factors, the fundamental physics remains: Longer Context = More Compute + More VRAM. This is why prompt caching is a game changer: it lets us re-use the computed Key/Value states (the KV Cache) instead of recomputing them.
2.2 The KV Cache (Key-Value Cache)
When you use a chat interface (like ChatGPT), why is the second message faster than the first? Because of the KV Cache.
The model processes the prompt layer by layer. At each layer, it computes “Keys” and “Values” for each token. Instead of discarding them, the inference engine (like vLLM) stores them in GPU memory (VRAM). When you send the next token, the model doesn’t re-compute the past. It just retrieves the cached Keys/Values.
Implication for Agents:
- Adding to the End (Prefix Caching): Cheap. The cache is preserved.
- Changing the Beginning (System Prompt): Expensive. It invalidates the entire cache for the subsequent tokens.
- Design Rule: Keep your massive instructions static at the top. Append dynamic history at the bottom.
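A minimal sketch of this design rule, assuming an OpenAI-style messages list (the file name and function names are illustrative, not part of any SDK): keep the expensive static prefix byte-identical across calls, and append anything dynamic after it.

```python
from datetime import datetime

# Static prefix: identical on every call, so the provider's prefix/KV cache can be re-used.
STATIC_SYSTEM_PROMPT = (
    "You are a support agent. Follow the manual below.\n\n"
    + open("manual.txt").read()  # hypothetical file
)

def build_prompt_cache_friendly(history, user_query):
    # Static content first, dynamic content appended at the end.
    return (
        [{"role": "system", "content": STATIC_SYSTEM_PROMPT}]
        + history
        + [{"role": "user", "content": user_query}]
    )

def build_prompt_cache_breaking(history, user_query):
    # Anti-pattern: a timestamp at the top changes the very first tokens,
    # invalidating the cached prefix on every single call.
    dynamic_system = f"Current time: {datetime.now().isoformat()}\n" + STATIC_SYSTEM_PROMPT
    return (
        [{"role": "system", "content": dynamic_system}]
        + history
        + [{"role": "user", "content": user_query}]
    )
```

If you genuinely need dynamic data like the current time, put it in the latest user message instead of the system prompt, so only the cheap suffix changes.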
3. The “Lost in the Middle” Phenomenon
A landmark paper (Liu et al., 2023) discovered a quirk in LLM psychology. They tested models by placing a specific fact (the “Needle”) at different positions in a long context window (the “Haystack”).
- Primacy Bias: Models pay huge attention to the Beginning. The System Prompt sets the rules.
- Recency Bias: Models pay huge attention to the End. The user’s last question defines the immediate task.
- The Trough: Information buried in the exact middle (e.g., at token 50,000 of a 100,000 window) is frequently ignored or hallucinated.
Agent Strategy: If you retrieve 50 documents via RAG, do not just dump them in the middle of the prompt.
- Curve Sorting: Sort your RAG chunks by relevance score so you can arrange them in a U-shape, strongest at the edges.
- Placement: Put the Most Relevant chunk at the very bottom (closest to the question). Put the 2nd Most Relevant at the very top. Put the least relevant in the middle.
- Instruction Repeat: It is often useful to repeat the core instruction at the very end.
[System Prompt] ... [Data] ... [User Query] ... [Constraint Reminder: remember to answer in JSON].
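A small sketch of this placement strategy, assuming the retriever already returns chunks sorted by descending relevance (the function name is illustrative):

```python
def order_for_long_context(chunks_by_relevance):
    """
    Arrange chunks so the strongest sit at the edges of the context
    (primacy and recency positions) and the weakest sink into the middle.
    chunks_by_relevance: list of strings, most relevant first.
    """
    top, bottom = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            bottom.append(chunk)  # best, 3rd, 5th ... end up near the question
        else:
            top.append(chunk)     # 2nd, 4th, 6th ... end up near the top
    # bottom was collected best-first; reverse so the best chunk sits last (closest to the query)
    return top + list(reversed(bottom))

ranked = ["most relevant", "2nd", "3rd", "4th", "5th"]
print(order_for_long_context(ranked))
# -> ['2nd', '4th', '5th', '3rd', 'most relevant']
```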
4. Context Management Strategies
How do we compress “A Lifetime of Memories” into a finite window?
4.1 Sliding Window (FIFO)
The simplest approach.
- Structure: [System Prompt] + [Msg N-10 ... Msg N]
- Logic: When the buffer is full, drop the oldest message (Msg N-11).
- Pros: Simple implementation. Constant cost. Predictable latency.
- Cons: Catastrophic Amnesia. If the user told you their name (“I’m Alice”) 50 turns ago, and that message slides out of the window, the agent forgets who Alice is.
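A minimal sketch of this window using Python's collections.deque (Section 7 below shows a token-aware variant):

```python
from collections import deque

SYSTEM_PROMPT = {"role": "system", "content": "You are a helpful assistant."}
window = deque(maxlen=10)  # keeps only the 10 most recent messages; older ones fall off automatically

def add_message(role, content):
    window.append({"role": role, "content": content})

def build_payload():
    # The system prompt is pinned; only the conversation slides.
    return [SYSTEM_PROMPT] + list(window)
```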
4.2 Summarization (The Rolling Summary)
Instead of dropping old messages, convert them into a compressed representation.
- Structure: [System Prompt] + [Summary of Past] + [Recent Messages]
- Process:
  1. Wait until Recent Messages hits 10 items.
  2. Trigger a background LLM call (using a cheaper model like GPT-3.5-Turbo).
  3. Prompt: “Summarize the following conversation, preserving key facts like names, dates, and decisions.”
  4. Update the Summary variable. Clear Recent Messages.
- Pros: Theoretically infinite conversation duration.
- Cons: Lossy Compression. “Subtle details” (e.g., the user mentioned they dislike blue) are often lost in the summary.
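A sketch of the rolling-summary loop, assuming the OpenAI Python SDK and the cheap summarizer model mentioned above (the threshold of 10 is arbitrary):

```python
from openai import OpenAI

client = OpenAI()
summary = ""   # compressed representation of everything older than the window
recent = []    # raw recent messages

def maybe_summarize():
    global summary
    if len(recent) < 10:   # only compress once the recent buffer is full
        return
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in recent)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                "Summarize the following conversation, preserving key facts "
                "like names, dates, and decisions.\n\n"
                f"Previous summary: {summary}\n\nNew messages:\n{transcript}"
            ),
        }],
    )
    summary = resp.choices[0].message.content  # roll the new summary forward
    recent.clear()                             # reset the raw buffer
```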
4.3 Entity Extraction (The Knowledge Graph)
Instead of summarizing text, extract specific facts into a structured state.
- State: {"user_name": "Alice", "current_project": "Website", "todos": ["Fix bug"]}
- Process: An “Observer Agent” runs in parallel, watching the stream and updating the JSON state.
- Context Injection: We inject this JSON into the System Prompt.
- Pros: Extremely dense information. Facts are stored explicitly, so they are far less likely to be paraphrased away or hallucinated.
- Cons: Hard to implement. Requires defining a schema for what is worth remembering.
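A sketch of the observer pass, assuming the OpenAI SDK with JSON-mode output (the model name and the three-field schema are assumptions for illustration):

```python
import json
from openai import OpenAI

client = OpenAI()
state = {"user_name": None, "current_project": None, "todos": []}

def observe(latest_messages):
    """Watch the stream and merge any newly mentioned facts into the structured state."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in latest_messages)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                        # assumption: any JSON-mode-capable model
        response_format={"type": "json_object"},    # ask for strict JSON back
        messages=[{
            "role": "user",
            "content": (
                "Current known state: " + json.dumps(state) + "\n\n"
                "Conversation:\n" + transcript + "\n\n"
                "Return the updated state as a JSON object with the same keys."
            ),
        }],
    )
    state.update(json.loads(resp.choices[0].message.content))

# The agent then injects json.dumps(state) into its System Prompt on every call.
```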
4.4 Selective Context (The Bouncer)
Use a specialized small model (BERT/Cross-Encoder) to act as a “Bouncer.”
- Scenario: You have 100 previous messages.
- Question: “What was the error code?”
- Bouncer: Scans the 100 messages independently. Scores them based on relevance to “error code”.
- Selection: Identifies messages 42 and 43.
- Construction: Constructs a prompt with ONLY messages 42 and 43.
- Pros: Highest Signal-to-Noise ratio.
- Cons: If the Bouncer misses the context, the Agent is blind.
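A sketch of the bouncer, assuming the sentence-transformers library and an off-the-shelf MS MARCO cross-encoder checkpoint:

```python
from sentence_transformers import CrossEncoder

# A small reranker that scores (query, message) pairs for relevance.
bouncer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def select_relevant(messages, query, top_k=2):
    """Score every past message against the query and keep only the top_k."""
    pairs = [(query, m["content"]) for m in messages]
    scores = bouncer.predict(pairs)  # one relevance score per message
    ranked = sorted(zip(scores, messages), key=lambda pair: pair[0], reverse=True)
    return [msg for _, msg in ranked[:top_k]]

# relevant = select_relevant(history, "What was the error code?")
# payload = [system_prompt] + relevant + [{"role": "user", "content": "What was the error code?"}]
```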
5. The Game Changer: Context Caching
In late 2024, model providers (Anthropic, Google, OpenAI) introduced Prompt Caching APIs. This changed the economics of agents overnight.
5.1 How it Works
- Old Way: You pay to upload the 100-page manual on every API call.
- New Way: You upload the manual once and designate a cache_control checkpoint. The provider processes the manual, computes the KV Cache, and stores it server-side (for a few minutes to a few hours, depending on the provider).
- Subsequent Calls: You resend the identical prefix (or reference the stored cache by ID, depending on the provider); the computation for that prefix is skipped.
- Economics:
  - Write Cost: Roughly standard price (some providers charge a small premium for cache writes).
  - Read Cost: ~90% discount on cached tokens.
  - Speed: 2-5x faster, since the cached prefix does not need to be re-processed.
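A sketch of a cached request following the shape of Anthropic's prompt-caching API at the time of writing (model name, cache lifetime, and pricing vary by provider; check the current docs):

```python
import anthropic

client = anthropic.Anthropic()
manual = open("manual.txt").read()  # hypothetical 100-page manual

response = client.messages.create(
    model="claude-3-5-sonnet-latest",   # assumption: any caching-capable model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": manual,
            # Everything up to this checkpoint is cached server-side; later calls
            # with the identical prefix read from the cache instead of re-paying.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What does the manual say about refunds?"}],
)
```

Only the dynamic user message changes between calls, so the manual is processed once and read from the cache afterwards.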
5.2 The “Megaprompt Agent” Pattern
This enables a new architecture. Instead of RAG (retrieving tiny snippets), you can simply cache the Global Context.
- Scenario: A Coding Agent.
- Old Strategy: RAG search to find 3 relevant files.
- New Strategy: Dump the Entire Codebase (100 files) into the Context Cache.
- Benefit: The agent sees the whole architecture. It understands imports, global variables, and side effects that RAG would miss.
- Limit: Still bounded by the hard limit (e.g., 200k for Claude). For huge repos, you still need RAG.
6. Compression Techniques: Lingua Franca
Can we make the text itself smaller?
6.1 Token Optimization
- Standard English: “I would like you to please go ahead and search for the file.” (12 tokens).
- Compressed: “Search file.” (2 tokens).
- LLMs are surprisingly good at understanding “Caveman Speak” or specialized syntax.
- LLMLingua: A research project (Microsoft) that uses a small model to remove “non-essential” tokens (stopwords, adjectives) from a prompt before sending it to the big model. It achieves 2-3x compression with <1% accuracy drop.
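You can measure the savings directly with tiktoken (exact counts depend on the tokenizer):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
verbose = "I would like you to please go ahead and search for the file."
terse = "Search file."

print(len(enc.encode(verbose)), len(enc.encode(terse)))  # the terse form is several times shorter
```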
7. Code: A Context Buffer Implementation
Abstracting the logic for a managed buffer that handles pruning.
import tiktoken

class ContextBuffer:
    def __init__(self, max_tokens=4000, model="gpt-4"):
        self.max_tokens = max_tokens
        # Load the specific tokenizer for accurate counting
        self.encoder = tiktoken.encoding_for_model(model)
        self.system_message = None
        self.history = []  # List of dicts
        self.summary = ""

    def set_system_message(self, content):
        self.system_message = {"role": "system", "content": content}

    def add_message(self, role, content):
        """Add a message and immediately prune if needed."""
        self.history.append({"role": role, "content": content})
        self._prune()

    def _count_tokens(self, messages):
        # Rough estimation: counts content tokens only (ignores per-message role/format overhead)
        text = "".join([m["content"] for m in messages])
        return len(self.encoder.encode(text))

    def _prune(self):
        """
        Enforce the token limit by dropping the oldest messages.
        Always keep the System Message and Summary.
        """
        # Calculate current usage
        static_tokens = 0
        if self.system_message:
            static_tokens += self._count_tokens([self.system_message])
        if self.summary:
            static_tokens += self._count_tokens([{"role": "system", "content": self.summary}])
        current_history_tokens = self._count_tokens(self.history)
        total = static_tokens + current_history_tokens

        # While loop to pop from the front (FIFO)
        while total > self.max_tokens and len(self.history) > 0:
            removed = self.history.pop(0)
            # Optimization: Just subtract the tokens of the removed msg
            removed_count = self._count_tokens([removed])
            total -= removed_count
            print(f"Pruned message: {removed['content'][:20]}...")

    def update_summary(self, new_summary):
        """Manually inject a summary (from an external process)."""
        self.summary = new_summary

    def get_messages(self):
        """Construct the final payload for the API."""
        final_payload = []
        if self.system_message:
            final_payload.append(self.system_message)
        if self.summary:
            # Inject summary as a high-priority system note
            final_payload.append({
                "role": "system",
                "content": f"Previous conversation summary: {self.summary}"
            })
        final_payload.extend(self.history)
        return final_payload

# Usage Simulation
buffer = ContextBuffer(max_tokens=100)  # Tiny limit for demo
buffer.set_system_message("You are a helpful assistant.")
buffer.add_message("user", "Hello! My name is Alice.")
buffer.add_message("assistant", "Hi Alice!")
buffer.add_message("user", "What is the capital of France?")
buffer.add_message("assistant", "Paris.")

# This (long) message pushes us over the limit
buffer.add_message("user", "Tell me a very long story about dinosaurs...")
# Output: "Pruned message: Hello! My name is Al..."
# Now the agent doesn't know Alice's name.

# Fix:
buffer.update_summary("User is named Alice.")
# Now get_messages() includes the summary.
8. Summary
Context is the scarcest resource in the AI economy. It is the bottleneck for “Intelligence Density.”
- KV Caching explains the physical costs.
- Context Caching APIs (Megaprompts) are reducing the need for RAG in medium-sized tasks.
- Sliding Windows and Summarization are the basic garbage collection algorithms of Agent Memory.
A senior engineer treats Tokens like Bytes in network packets—optimizing, compressing, and caching them to build high-performance systems. An unoptimized agent is slow, expensive, and forgetful.
Managing memory allows us to tackle broader problems, but complex problems require Multi-Step Reasoning to deduce answers logically.