Token Efficiency Optimization
“The most expensive token is the one you didn’t need to send.”
TL;DR
Tokens are the currency of AI agents – they control cost, latency, and output quality simultaneously. Optimize by treating the context window as a managed cache: minify system prompts, dynamically load only relevant tool schemas, compress conversation history with summarization, and strip HTML/JSON bloat before injection. The “Lost in the Middle” phenomenon means stuffed contexts actively hurt performance, so concise prompts produce smarter agents. For foundational context window strategies, see Context Window Management, and for the cost implications of token waste, see Cost Management for Agents.

1. Introduction
In the world of AI Agents, Tokens are Currency.
- Cost: You pay per 1M input tokens. A verbose system prompt sent 100,000 times a day drains the budget.
- Latency: The more tokens the LLM reads, the slower the Time-To-First-Token (TTFT). Reading 10k tokens takes seconds.
- Performance: The “Lost in the Middle” phenomenon means LLMs forget instructions if the context is too stuffed.
Token Efficiency Optimization is the practice of compressing your agent’s cognitive load without lobotomizing its intelligence. It is the “garbage collection” and “compression” of the Agentic world.
2. Core Concepts: The Anatomy of a Prompt
An Agent’s context window isn’t just a string. It has rigid sections, each with a “Token Tax”:
- System Prompt: The static instructions (“You are a helpful assistant…”). Sent every single turn.
- Tool Definitions: The JSON schemas describing functions (`search_google`, `calculator`). Can be huge.
- Conversation History: The dynamic chat log. Grows linearly (or super-linearly if tools are verbose).
- RAG Context: The retrieved documents. Often the biggest chunk (3k-10k tokens).
Optimization attacks each of these layers.
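To see where the tax actually goes, you can count each layer with `tiktoken`. A minimal sketch (the section contents below are illustrative placeholders, not real prompts):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

# Illustrative placeholders for each layer of the context window
sections = {
    "system_prompt": "You are a helpful assistant. Always answer concisely...",
    "tool_definitions": '{"name": "search_google", "description": "Search the web", "parameters": {"type": "object"}}',
    "conversation_history": "user: Hi\nassistant: Hello! How can I help you today?",
    "rag_context": "Retrieved document text goes here...",
}

for name, text in sections.items():
    print(f"{name:>22}: {len(enc.encode(text)):>4} tokens")
```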
3. Architecture Patterns for Efficiency
We treat the Context Window as a Cache.
3.1 The “Context Manager” Pattern
Instead of blindly calling `messages.append(response)`, we use a dedicated manager class.
- FIFO Buffer: Keep last N turns.
- Summarization: When the history exceeds ~10 turns, ask a cheap model (e.g., GPT-3.5) to summarize turns 1-5 into a single “Memory” string (see the sketch after this list).
- Tool Pruning: Only inject tool definitions relevant to the current state.
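A minimal sketch of the summarization step, assuming an OpenAI-style client and a cheap summarizer model; the turn threshold, prompt wording, and helper name are illustrative, not a fixed recipe:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def fold_old_turns(history, keep_last=6, cheap_model="gpt-3.5-turbo"):
    """Collapse everything older than the last `keep_last` turns into one memory message."""
    system, *chat = history  # assumes history[0] is the system prompt
    if len(chat) <= keep_last:
        return history

    old, recent = chat[:-keep_last], chat[-keep_last:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.chat.completions.create(
        model=cheap_model,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in under 100 tokens. Keep facts and decisions:\n{transcript}",
        }],
    ).choices[0].message.content

    memory = {"role": "system", "content": f"Summary of earlier conversation: {summary}"}
    return [system, memory, *recent]
```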
3.2 Prompt Compression (The “Minification”)
Much like minifying JavaScript (`function variableName` -> `function a`), we can minify prompts.
- Original: “Please analyze the following text and determine the sentiment score between 0 and 1.” (15 tokens)
- Optimized: “Analyze text. Return sentiment 0-1.” (6 tokens)
- Savings: ~60% (verify with `tiktoken`, as sketched below).
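The exact counts depend on the tokenizer, so treat the numbers above as approximate; measuring the before/after is a one-liner with `tiktoken`:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

original = "Please analyze the following text and determine the sentiment score between 0 and 1."
minified = "Analyze text. Return sentiment 0-1."

orig_n, mini_n = len(enc.encode(original)), len(enc.encode(minified))
print(f"{orig_n} -> {mini_n} tokens (saved {1 - mini_n / orig_n:.0%})")
```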
4. Implementation Approaches
Strategy A: Dynamic Tool Loading
If you have 50 tools, don’t put 50 schemas in the system prompt. Use a Router.
- Classifier Step: “User asks about Math.”
- Loader Step: Load only `Calculator`, `WolframAlpha`.
- Execution Step: Run Agent.
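A minimal router sketch; the registry contents and the keyword-based classifier are illustrative stand-ins (in practice the classifier step is often a small, cheap model):

```python
# Tool schemas grouped by category (contents are illustrative placeholders)
TOOL_REGISTRY = {
    "math": [
        {"type": "function", "function": {"name": "calculator", "description": "evaluate expression"}},
        {"type": "function", "function": {"name": "wolfram_alpha", "description": "symbolic math"}},
    ],
    "search": [
        {"type": "function", "function": {"name": "search_google", "description": "web search"}},
    ],
}

def classify(user_message: str) -> str:
    """Classifier step: keyword rules here; a small LLM call also works."""
    math_hints = ("sum", "integral", "calculate", "solve", "+", "=")
    return "math" if any(h in user_message.lower() for h in math_hints) else "search"

def route_tools(user_message: str) -> list:
    """Loader step: inject only the schemas relevant to the current request."""
    return TOOL_REGISTRY.get(classify(user_message), [])

# Execution step: the agent now sees 2 schemas instead of 50
tools = route_tools("Calculate the integral of x^2")
```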
Strategy B: System Prompt Refactoring
Standardize on terse, expert language. LLMs (especially GPT-4) understand compressed instructions. Instead of “If the user does this, then you should do that”, use “User: X -> Action: Y”.
5. Code Examples: The Token Budget Manager
Here is a Python class that actively manages the context window, ensuring we never overflow or overspend.
```python
import tiktoken

class TokenBudgetManager:
    def __init__(self, model="gpt-4", max_tokens=8000):
        self.max_tokens = max_tokens
        self.encoding = tiktoken.encoding_for_model(model)
        # Reserve 1000 tokens for the generated reply
        self.safety_margin = 1000

    def count(self, text):
        return len(self.encoding.encode(text))

    def compress_history(self, history):
        """
        Compresses conversation history to fit the budget.
        Strategy: keep the System Prompt + the most recent messages that fit.
        """
        current_tokens = 0
        preserved_messages = []

        # 1. Always keep the System Prompt (critical instructions)
        system_msg = next((m for m in history if m['role'] == 'system'), None)
        if system_msg:
            current_tokens += self.count(system_msg['content'])
            preserved_messages.append(system_msg)

        # 2. Add recent messages backwards until the budget is hit
        budget = self.max_tokens - current_tokens - self.safety_margin
        insert_at = len(preserved_messages)  # right after the system prompt, if any
        chat_msgs = [m for m in history if m['role'] != 'system']
        for msg in reversed(chat_msgs):
            msg_tokens = self.count(msg['content'])
            if budget - msg_tokens >= 0:
                preserved_messages.insert(insert_at, msg)  # keeps chronological order
                budget -= msg_tokens
            else:
                break  # stop adding older messages
        return preserved_messages


# Usage
manager = TokenBudgetManager()
history = [
    {"role": "system", "content": "You are a concise agent..."},
    {"role": "user", "content": "Hello world"},
    # ... 50 more messages
]
optimized_history = manager.compress_history(history)
```
6. Production Considerations
6.1 Cost vs. Quality Curve
There is a “Pareto Frontier” of optimization.
- Removing adjectives: No quality loss.
- Removing examples (Few-Shot): Slight quality loss.
- Removing constraints: High quality loss (Agent hallucinates).
Rule of Thumb: Never compress the Safety Guidelines.
6.2 Caching (The Semantic Cache)
We discussed caching in ML System Design. For Agents, Semantic Caching saves 100% of the tokens on a cache hit.
Ref: Cost Management for Agents.
If User: "Hello" is cached, we send 0 tokens to the LLM.
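A minimal sketch of such a cache, assuming you supply an embedding function; the cosine-similarity threshold and the class structure are illustrative, not a specific library's API:

```python
import numpy as np

class SemanticCache:
    """Return a cached answer when a new query is semantically close to an old one."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed          # any embedding function: str -> np.ndarray
        self.threshold = threshold  # similarity cut-off; tune per use case
        self.keys, self.values = [], []

    def get(self, query):
        if not self.keys:
            return None
        q = self.embed(query)
        sims = [
            float(np.dot(q, k) / (np.linalg.norm(q) * np.linalg.norm(k)))
            for k in self.keys
        ]
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query, response):
        self.keys.append(self.embed(query))
        self.values.append(response)

# Cache hit -> 0 tokens sent to the LLM; on a miss, call the model and cache.put(query, answer)
```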
7. Common Pitfalls
- JSON Schema Bloat: Pydantic models with verbose descriptions.
  - Fix: Use terse descriptions. `Field(description="The user's age in years")` -> `Field(description="age (yrs)")`.
- HTML/XML Residue: Scraping a website often leaves `<div>`, `class="..."`. These are junk tokens.
  - Fix: Use `html2text` or Markdown converters before injecting into context (sketched after this list).
- Recursive Summarization: If you summarize a summary of a summary, details wash out (Chinese Whispers).
  - Fix: Keep “Key Claims” separate from the “Conversation Log”.
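A sketch of the HTML cleanup step using the `html2text` package mentioned above; the input snippet and converter flags are illustrative:

```python
import html2text
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

raw_html = '<div class="post-body container-fluid"><p>Token budgets matter.</p><a href="/about">About</a></div>'

converter = html2text.HTML2Text()
converter.ignore_links = True   # drop link markup we don't want in context
converter.ignore_images = True
clean_text = converter.handle(raw_html)

print(f"{len(enc.encode(raw_html))} -> {len(enc.encode(clean_text))} tokens")
```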
8. Best Practices: The “Chain of Density”
A prompting technique from Salesforce Research. Instead of asking for a summary once, ask the model to:
- Write a summary.
- Identify missing entities from the text.
- Rewrite the summary to include those entities without increasing length.
- Repeat ~5 times.
This creates extremely dense summaries, maximizing information per token (sketched below).
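A hedged sketch of that loop, using a generic `llm(prompt) -> str` callable as a stand-in for any chat model; the prompt wording is illustrative, not the original paper's exact prompts:

```python
def chain_of_density(llm, article: str, rounds: int = 5) -> str:
    """Iteratively densify a summary: same length, more entities each pass.

    `llm` is any callable mapping a prompt string to a completion string.
    """
    summary = llm(f"Summarize this article in about 80 words:\n\n{article}")
    for _ in range(rounds):
        missing = llm(
            f"Article:\n{article}\n\nSummary:\n{summary}\n\n"
            "List 1-3 important entities from the article that are missing from the summary."
        )
        summary = llm(
            f"Rewrite the summary to also cover: {missing}. "
            f"Do not increase its length.\n\nCurrent summary:\n{summary}"
        )
    return summary
```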
9. Connections to Other Topics
This connects deeply to the Transfer Learning theme of ML System Design.
- Transfer Learning (ML): Freezes the backbone to save compute.
- Token Optimization (Agents): Freezes/Compresses the System Prompt (the “backbone” of the agent’s personality) to save I/O.
- Both are about identifying the “Invariant” (core knowledge) vs the “Variant” (current input) and optimizing the ratio.
10. Real-World Examples
- GitHub Copilot: Uses “Fill-in-the-middle” models but aggressively prunes the surrounding code context to fit the most relevant file imports into the window.
- AutoGPT: Struggled famously with cost loops ($10 runs). Newer versions implement “sliding windows” and “memory summarization” by default.
11. Future Directions
- Context Caching (Google Gemini 1.5): You pay once to upload a huge manual (1M tokens). Subsequent requests referencing that manual are cheap. This essentially “fine-tunes” the cache state.
- Infinite Attention: Architectures (like Ring Attention) that remove the hard architectural limit on context length, shifting the bottleneck purely to compute/cost.
12. Key Takeaways
- Count your tokens: Use `tiktoken`. Don’t guess. Length ≠ Tokens.
- Minify everything: Prompts, Schemas, HTML.
- Manage History: It’s a sliding window, not an infinite scroll.
- Density is Quality: Concise prompts often yield smarter agents because the attention mechanism is less diluted.
Next in this series, we apply these efficiency techniques to the bigger picture of Cost Management for Agents.
FAQ
How do you reduce token usage in AI agents?
Use four strategies: minify system prompts by removing verbose natural language and using terse expert notation, dynamically load only the tool schemas relevant to the current task instead of all 50 at once, compress conversation history using summarization after a turn threshold, and sanitize retrieved content by stripping HTML tags, scripts, and boilerplate before injecting into context.
What is the Lost in the Middle problem with LLMs?
The Lost in the Middle phenomenon means LLMs tend to forget or underweight instructions placed in the middle of a large context window. When context is stuffed with too many tokens, the model misses critical information buried between the beginning and end. This makes token efficiency essential for both cost and output quality.
What is semantic caching for AI agents?
Semantic caching uses embeddings to match similar queries rather than requiring exact string matches. If a new query is semantically close enough to a cached one (within a similarity threshold), the cached response is returned directly. This saves 100% of tokens for that request and provides sub-millisecond, zero-cost responses.
How does prompt compression work for AI agents?
Prompt compression minifies natural language instructions similarly to how JavaScript is minified. Replace verbose phrases like “Please analyze the following text and determine the sentiment score between 0 and 1” with terse equivalents like “Analyze text. Return sentiment 0-1.” This can reduce token count by 40-60% without quality loss because modern LLMs understand compressed expert language.
Originally published at: arunbaby.com/ai-agents/0046-token-efficiency-optimization