
“The most expensive token is the one you didn’t need to send.”

TL;DR

Tokens are the currency of AI agents – they control cost, latency, and output quality simultaneously. Optimize by treating the context window as a managed cache: minify system prompts, dynamically load only relevant tool schemas, compress conversation history with summarization, and strip HTML/JSON bloat before injection. The “Lost in the Middle” phenomenon means stuffed contexts actively hurt performance, so concise prompts produce smarter agents. For foundational context window strategies, see Context Window Management, and for the cost implications of token waste, see Cost Management for Agents.


1. Introduction

In the world of AI Agents, Tokens are Currency.

  • Cost: You pay per 1M input tokens. A verbose system prompt sent 100,000 times a day drains the budget.
  • Latency: The more tokens the LLM reads, the slower the Time-To-First-Token (TTFT). Reading 10k tokens takes seconds.
  • Performance: The “Lost in the Middle” phenomenon means LLMs forget instructions if the context is too stuffed.

Token Efficiency Optimization is the practice of compressing your agent’s cognitive load without lobotomizing its intelligence. It is the “garbage collection” and “compression” of the Agentic world.


2. Core Concepts: The Anatomy of a Prompt

An Agent’s context window isn’t just a string. It has rigid sections, each with a “Token Tax”:

  1. System Prompt: The static instructions (“You are a helpful assistant…”). Sent every single turn.
  2. Tool Definitions: The JSON schemas describing functions (search_google, calculator). Can be huge.
  3. Conversation History: The dynamic chat log. Grows with every turn (quickly, if tool outputs are verbose).
  4. RAG Context: The retrieved documents. Often the biggest chunk (3k-10k tokens).

Optimization attacks each of these layers.
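To see where the tax actually falls, it helps to audit a prompt section by section. A minimal sketch, using a rough ~4-characters-per-token heuristic (a tokenizer like tiktoken gives exact counts); the section contents below are made up:

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def audit_prompt(sections: dict[str, str]) -> dict[str, int]:
    """Return the estimated token count for each prompt section."""
    return {name: estimate_tokens(text) for name, text in sections.items()}

# Hypothetical contents for each layer of the context window
sections = {
    "system": "You are a helpful assistant..." * 10,
    "tools": '{"name": "search", "parameters": {...}}' * 5,
    "history": "user: hi\nassistant: hello\n" * 20,
    "rag": "Retrieved document chunk. " * 100,
}
print(audit_prompt(sections))
```

In real agents the RAG block usually dominates, which is why it is the first target for compression.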


3. Architecture Patterns for Efficiency

We treat the Context Window as a Cache.

3.1 The “Context Manager” Pattern

Instead of blindly calling messages.append(response), we use a dedicated manager class.

  • FIFO Buffer: Keep last N turns.
  • Summarization: When history > 10 turns, ask a cheap model (GPT-3.5) to summarize turns 1-5 into a single “Memory” string.
  • Tool Pruning: Only inject tool definitions relevant to the current state.
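The pattern above can be sketched as a small class. This is a minimal illustration, not a production implementation: the summarize callable stands in for a call to a cheap model, and folding the oldest half of the buffer into memory is one arbitrary eviction policy.

```python
from collections import deque

class ContextManager:
    """Sketch of a context manager: FIFO buffer plus a summarization hook."""

    def __init__(self, max_turns=10, summarize=None):
        self.max_turns = max_turns
        # In production, `summarize` would call a cheap model.
        self.summarize = summarize or (lambda msgs: f"[summary of {len(msgs)} msgs]")
        self.memory = None    # compressed "Memory" string
        self.turns = deque()  # recent raw messages

    def append(self, message: dict):
        self.turns.append(message)
        if len(self.turns) > self.max_turns:
            # Fold the oldest half of the buffer into the memory string.
            old = [self.turns.popleft() for _ in range(self.max_turns // 2)]
            self.memory = self.summarize(old)

    def build_context(self, system_prompt: str) -> list[dict]:
        """Assemble the final prompt: system, then memory, then recent turns."""
        ctx = [{"role": "system", "content": system_prompt}]
        if self.memory:
            ctx.append({"role": "system", "content": f"Memory: {self.memory}"})
        ctx.extend(self.turns)
        return ctx
```

The context size is now bounded: no matter how long the conversation runs, the prompt holds one system message, one memory string, and at most max_turns raw messages.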

3.2 Prompt Compression (The “Minification”)

Much like minifying JavaScript (longVariableName -> a), we can minify prompts.

  • Original: “Please analyze the following text and determine the sentiment score between 0 and 1.” (15 tokens)
  • Optimized: “Analyze text. Return sentiment 0-1.” (6 tokens)
  • Savings: 60%.

4. Implementation Approaches

Strategy A: Dynamic Tool Loading

If you have 50 tools, don’t put 50 schemas in the system prompt. Use a Router.

  1. Classifier Step: “User asks about Math.”
  2. Loader Step: Load only Calculator, WolframAlpha.
  3. Execution Step: Run Agent.
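The three steps can be sketched as a toy router. The keyword classifier stands in for a cheap LLM classification call, and the tool names and schemas are hypothetical:

```python
# Abbreviated, hypothetical tool schemas (real schemas are much larger)
TOOL_SCHEMAS = {
    "calculator": {"name": "calculator", "description": "evaluate math"},
    "wolfram": {"name": "wolfram", "description": "symbolic math"},
    "search": {"name": "search", "description": "web search"},
    "weather": {"name": "weather", "description": "forecast lookup"},
}

CATEGORY_TOOLS = {
    "math": ["calculator", "wolfram"],
    "web": ["search"],
    "weather": ["weather"],
}

def classify(query: str) -> str:
    """Toy classifier; production systems use a small, cheap model here."""
    q = query.lower()
    if any(w in q for w in ("solve", "integral", "+", "calculate")):
        return "math"
    if "weather" in q or "forecast" in q:
        return "weather"
    return "web"

def load_tools(query: str) -> list[dict]:
    """Inject only the schemas relevant to this query."""
    return [TOOL_SCHEMAS[name] for name in CATEGORY_TOOLS[classify(query)]]

print([t["name"] for t in load_tools("solve x^2 = 4")])  # math tools only
```

With 50 tools at, say, 200 tokens per schema, loading 2 instead of 50 saves ~9,600 tokens per request.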

Strategy B: System Prompt Refactoring

Standardize on terse, expert language. LLMs (especially GPT-4) understand compressed instructions. Instead of “If the user does this, then you should do that”, use “User: X -> Action: Y”.
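As a sanity check on a refactor, compare the rough sizes of the verbose and terse forms. The rule text below is made up, and the ~4-characters-per-token heuristic is approximate (use a real tokenizer for exact counts):

```python
# Hypothetical rule, before and after refactoring
verbose = (
    "If the user asks a question about billing, then you should look up "
    "their account first and then answer politely."
)
terse = "User: billing question -> Action: lookup account, answer."

def rough_tokens(s: str) -> int:
    """Crude estimate: ~4 characters per token."""
    return max(1, len(s) // 4)

print(rough_tokens(verbose), rough_tokens(terse))
```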


5. Code Examples: The Token Budget Manager

Here is a Python class that actively manages the context window, ensuring we never overflow or overspend.

import tiktoken

class TokenBudgetManager:
    def __init__(self, model="gpt-4", max_tokens=8000):
        self.max_tokens = max_tokens
        self.encoding = tiktoken.encoding_for_model(model)
        # Reserve 1000 tokens for the generated reply
        self.safety_margin = 1000

    def count(self, text):
        return len(self.encoding.encode(text))

    def compress_history(self, history):
        """
        Compresses conversation history to fit the budget.
        Strategy: keep the System Prompt + the last N messages.
        """
        current_tokens = 0
        preserved_messages = []

        # 1. Always keep the System Prompt (critical instructions)
        system_msg = next((m for m in history if m['role'] == 'system'), None)
        if system_msg:
            current_tokens += self.count(system_msg['content'])
            preserved_messages.append(system_msg)

        # 2. Add recent messages backwards until the budget is hit
        budget = self.max_tokens - current_tokens - self.safety_margin

        # Walk the chat newest-first, excluding the system prompt
        chat_msgs = [m for m in history if m['role'] != 'system']
        insert_at = 1 if system_msg else 0  # chat goes after the system prompt

        for msg in reversed(chat_msgs):
            msg_tokens = self.count(msg['content'])
            if budget - msg_tokens >= 0:
                preserved_messages.insert(insert_at, msg)
                budget -= msg_tokens
            else:
                break  # Stop adding older messages

        return preserved_messages

# Usage
manager = TokenBudgetManager()
history = [
    {"role": "system", "content": "You are a concise agent..."},
    {"role": "user", "content": "Hello world"},
    # ... 50 more messages
]
optimized_history = manager.compress_history(history)

6. Production Considerations

6.1 Cost vs. Quality Curve

There is a “Pareto Frontier” of optimization.

  • Removing adjectives: No quality loss.
  • Removing examples (Few-Shot): Slight quality loss.
  • Removing constraints: High quality loss (Agent hallucinates).

Rule of Thumb: Never compress the Safety Guidelines.

6.2 Caching (The Semantic Cache)

We discussed caching in ML System Design. For Agents, Semantic Caching eliminates the LLM call entirely on a hit. Ref: Cost Management for Agents. If User: "Hello" is cached, we send 0 tokens to the LLM.
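A minimal sketch of a semantic cache. The bag-of-words toy_embed here stands in for a real embedding model; in production you would use actual embeddings and tune the similarity threshold empirically.

```python
import math

def toy_embed(text: str) -> dict[str, float]:
    """Stand-in for a real embedding model: bag-of-words counts."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Return the cached response when a query is close enough to a
    previous one, spending zero tokens on the LLM."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query: str):
        q = toy_embed(query)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response
        return None  # cache miss: caller must hit the LLM

    def put(self, query: str, response: str):
        self.entries.append((toy_embed(query), response))
```

A linear scan works for a sketch; real systems back this with a vector index.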


7. Common Pitfalls

  1. JSON Schema Bloat: Pydantic models with verbose descriptions.
    • Fix: Use terse descriptions. Field(description="The user's age in years") -> Field(description="age (yrs)").
  2. HTML/XML Residue: Scraping a website often leaves <div>, class="...". These are junk tokens.
    • Fix: Use html2text or Markdown converters before injecting into context.
  3. Recursive Summarization: If you summarize a summary of a summary, details wash out (Chinese Whispers).
    • Fix: Keep “Key Claims” separate from “Conversation Log”.
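For pitfall 2, the standard library alone can strip most tag junk. A minimal sketch using html.parser (html2text preserves more structure, such as headings and links):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def strip_html(html: str) -> str:
    """Drop tags and attributes before injecting scraped pages into context."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = '<div class="post"><h1>Title</h1><p>Body &amp; text.</p></div>'
print(strip_html(page))  # Title Body & text.
```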

8. Best Practices: The “Chain of Density”

A prompting technique from Salesforce Research. Instead of asking for a summary once, ask the model to:

  1. Write a summary.
  2. Identify missing entities from the text.
  3. Rewrite the summary to include those entities without increasing length.
  4. Repeat (typically 5 rounds). This produces densely packed summaries, maximizing information per token.
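The loop can be sketched as a thin wrapper around any completion function. The prompt wording here is illustrative, not the paper's exact template, and llm is assumed to be a callable taking a prompt string and returning text:

```python
def chain_of_density(article: str, llm, rounds: int = 5) -> str:
    """Iteratively densify a summary via repeated rewrites.

    `llm(prompt) -> str` is any completion function (e.g. a cheap model).
    """
    # Round 1: a plain first-pass summary
    summary = llm(f"Summarize in one short paragraph:\n{article}")
    # Rounds 2..N: fold in missing entities without growing the text
    for _ in range(rounds - 1):
        summary = llm(
            "Identify 1-3 informative entities from the article missing from "
            "the summary, then rewrite the summary to include them WITHOUT "
            "increasing its length.\n"
            f"Article:\n{article}\n\nCurrent summary:\n{summary}"
        )
    return summary
```

Each round costs extra tokens up front, but the resulting summary is cheaper every time it is re-injected into context.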

9. Connections to Other Topics

This connects deeply to the Transfer Learning theme of ML System Design.

  • Transfer Learning (ML): Freezes the backbone to save compute.
  • Token Optimization (Agents): Freezes/Compresses the System Prompt (the “backbone” of the agent’s personality) to save IO.
  • Both are about identifying the “Invariant” (core knowledge) vs the “Variant” (current input) and optimizing the ratio.

10. Real-World Examples

  • GitHub Copilot: Uses “Fill-in-the-middle” models but aggressively prunes the surrounding code context to fit the most relevant file imports into the window.
  • AutoGPT: Struggled famously with cost loops ($10 runs). Newer versions implement “sliding windows” and “memory summarization” by default.

11. Future Directions

  • Context Caching (Google Gemini 1.5): You pay once to process a huge manual (1M tokens). Subsequent requests referencing that manual are far cheaper because the provider reuses the cached context state.
  • Infinite Attention: Architectures (like RingAttention) that remove hard context-length limits, shifting the bottleneck purely to compute/cost.

12. Key Takeaways

  1. Count your tokens: Use tiktoken. Don’t guess. Length ≠ Tokens.
  2. Minify everything: Prompts, Schemas, HTML.
  3. Manage History: It’s a sliding window, not an infinite scroll.
  4. Density is Quality: Concise prompts often yield smarter agents because the attention mechanism is less diluted.

Next in this series, we apply these efficiency techniques to the bigger picture of Cost Management for Agents.


FAQ

How do you reduce token usage in AI agents?

Use four strategies: minify system prompts by removing verbose natural language and using terse expert notation, dynamically load only the tool schemas relevant to the current task instead of all 50 at once, compress conversation history using summarization after a turn threshold, and sanitize retrieved content by stripping HTML tags, scripts, and boilerplate before injecting into context.

What is the Lost in the Middle problem with LLMs?

The Lost in the Middle phenomenon means LLMs tend to forget or underweight instructions placed in the middle of a large context window. When context is stuffed with too many tokens, the model misses critical information buried between the beginning and end. This makes token efficiency essential for both cost and output quality.

What is semantic caching for AI agents?

Semantic caching uses embeddings to match similar queries rather than requiring exact string matches. If a new query is semantically close enough to a cached one (within a similarity threshold), the cached response is returned directly. This saves 100% of tokens for that request and provides sub-millisecond, zero-cost responses.

How does prompt compression work for AI agents?

Prompt compression minifies natural language instructions similarly to how JavaScript is minified. Replace verbose phrases like “Please analyze the following text and determine the sentiment score between 0 and 1” with terse equivalents like “Analyze text. Return sentiment 0-1.” This can reduce token count by 40-60% without quality loss because modern LLMs understand compressed expert language.


Originally published at: arunbaby.com/ai-agents/0046-token-efficiency-optimization

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch