Cost Management for Agents
“Intelligence is cheap. Reliable, scalable intelligence is expensive.”
TL;DR
A single GPT-4 agent at scale can cost millions per year. Control costs with four architectural primitives: model routing (send simple tasks to cheap models), semantic caching (return cached responses for similar queries at zero cost), budget enforcement (per-user and per-feature spending limits), and circuit breakers (hard step limits to prevent runaway loops). The key insight is that 80% quality at 10% price is better than 99% quality at 100% price for most agent tasks. For the token-level optimization that feeds into cost control, see Token Efficiency Optimization, and for monitoring spend in production, see Observability and Tracing.

1. Introduction
When you move from a “Demo” (10 queries/day) to “Production” (10M queries/day), the economics of AI Agents shift dramatically. A generic GPT-4 agent running a ReAct loop (Reason+Act) might cost $0.30 per task. If you have 10,000 active users doing 5 tasks a day, your daily bill is $15,000 ($5.4M/year).
Cost Management is not just about “switching to cheaper models”. It involves architecting systems that are financially aware, using routing, caching, and budget enforcement as core primitives.
2. Core Concepts: The Token Economy
To optimize cost, we must understand the billing unit.
- Input Tokens: Cheaper. (Reading context).
- Output Tokens: 3x-10x More Expensive. (Generation).
- Frequency: Agents are “chatty”. A single user “task” might involve 10 back-and-forth LLM calls (Thought -> Tool -> Observation -> Thought…).
The Formula:
Cost = (Input_Vol * Input_Rate) + (Output_Vol * Output_Rate) + (Tool_Compute_Cost)
Optimization targets the Frequency (fewer calls) and the Rate (cheaper models).
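The formula above can be sketched as a back-of-the-envelope calculator. The rates here are hypothetical placeholders, not any provider's real prices:

```python
# Hypothetical rates ($ per 1K tokens); real prices vary by provider and model.
RATES = {"input_per_1k": 0.01, "output_per_1k": 0.03}

def task_cost(input_tokens: int, output_tokens: int,
              calls_per_task: int, tool_cost: float = 0.0) -> float:
    """Cost = (Input_Vol * Input_Rate) + (Output_Vol * Output_Rate) + Tool_Compute_Cost."""
    llm_cost = (input_tokens * RATES["input_per_1k"] +
                output_tokens * RATES["output_per_1k"]) / 1000
    # "Chatty" agents multiply the per-call cost by the number of loop iterations.
    return llm_cost * calls_per_task + tool_cost

# A ReAct task with 10 calls of ~2K input / ~300 output tokens each:
print(round(task_cost(2000, 300, calls_per_task=10), 2))  # -> 0.29
```

Note that at these assumed rates, one chatty task already lands near the $0.30 figure from the introduction; the frequency multiplier dominates.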
3. Architecture Patterns: The Cost Gateway
We shouldn’t call provider APIs directly from agent code. Instead, route every request through a Model Gateway (like LiteLLM or Helicone).
```
[Agent Logic] -> [Cost Gateway] -> [Provider (OpenAI/Anthropic)]
                      |
          +-----------+-----------+
          |                       |
   [Budget Check]         [Semantic Cache]
   (Stop if over)         (Return cached)
```
The Router Pattern: The Gateway inspects the prompt.
- Tier 1 (Complex): “Write a legal contract” -> Route to GPT-4.
- Tier 2 (Simple): “Extract the date from this string” -> Route to GPT-3.5-Turbo or Claude-Haiku.
4. Implementation Approaches
4.1 Semantic Caching
Exact string matching (e.g. a Redis key lookup) has a ~5% hit rate: “How are you?” != “How are you doing?”. Semantic Caching uses embeddings instead:
- Embed the query: vec = embed("How are you?").
- Search a Vector DB for nearest neighbors.
- If distance < 0.1 (very similar), return the cached response.
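The steps above can be sketched in a few lines. To keep it self-contained, this uses a toy bag-of-words embedding and cosine distance in place of a real embedding model and vector DB, with a looser threshold to match the toy vectors:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model (e.g. a sentence-transformer);
    # a bag-of-words vector is enough to demonstrate the cache logic.
    return Counter(re.findall(r"\w+", text.lower()))

def distance(a: Counter, b: Counter) -> float:
    # Cosine distance: 0.0 = identical direction, 1.0 = unrelated.
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1 - dot / norm if norm else 1.0

class SemanticCache:
    def __init__(self, threshold: float = 0.2):  # looser than 0.1 for the toy embedding
        self.threshold = threshold
        self.entries = []  # (vector, response); a real system uses a vector DB

    def get(self, query: str):
        vec = embed(query)
        for cached_vec, response in self.entries:
            if distance(vec, cached_vec) < self.threshold:
                return response  # cache hit: $0, no LLM call
        return None

    def put(self, query: str, response: str):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("How are you?", "I'm fine!")
print(cache.get("How are you doing?"))  # -> I'm fine!
```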
4.2 Waterfall Routing / Fallbacks
Try the cheapest model first. If it fails (or returns low confidence/bad format), retry with the expensive model.
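A minimal waterfall sketch, with a stubbed call_model standing in for real API calls (the failure condition is simulated for illustration):

```python
def call_model(model: str, prompt: str) -> str:
    # Placeholder for a real API call (e.g. via a gateway like LiteLLM);
    # here we simulate the cheap model failing on long/complex prompts.
    if model == "cheap-model" and len(prompt) > 50:
        raise ValueError("low confidence")
    return f"[{model}] answer"

def waterfall(prompt: str, tiers=("cheap-model", "expensive-model")) -> str:
    last_error = None
    for model in tiers:  # try the cheapest first
        try:
            # In practice, also validate format/confidence before accepting.
            return call_model(model, prompt)
        except ValueError as e:
            last_error = e  # escalate to the next, more expensive tier
    raise RuntimeError(f"All tiers failed: {last_error}")

print(waterfall("Extract the date from: 2024-01-15"))  # -> [cheap-model] answer
```

The design trade-off: failed cheap attempts still cost tokens, so the waterfall only pays off when the cheap model succeeds often enough.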
5. Code Examples: The Budget-Aware Router
```python
import openai  # assumes openai>=1.0

# Mock pricing table ($ per 1K tokens)
PRICING = {
    "gpt-4": 0.03,
    "gpt-3.5-turbo": 0.0015,
}

class BudgetExceededError(Exception):
    pass

class CostRouter:
    def __init__(self, budget_limit=5.0):
        self.total_spend = 0.0
        self.budget_limit = budget_limit

    def estimate_cost(self, model, prompt_len, output_len_est=500):
        # Very rough pre-flight estimation (token counts approximated by length)
        rate = PRICING.get(model, 0.03)
        return (prompt_len + output_len_est) / 1000 * rate

    def route(self, prompt, complexity="low"):
        # 1. Budget check
        if self.total_spend >= self.budget_limit:
            raise BudgetExceededError("Budget exceeded! Refusing to run.")

        # 2. Model selection logic
        model = "gpt-3.5-turbo"
        if complexity == "high" or len(prompt) > 8000:
            model = "gpt-4"

        # 3. Execution
        response = openai.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )

        # 4. Accounting (post-execution true-up)
        usage = response.usage
        tokens = usage.prompt_tokens + usage.completion_tokens
        cost = tokens * PRICING[model] / 1000  # simplified: one blended rate
        self.total_spend += cost
        return response

router = CostRouter(budget_limit=10.0)

# Simple query -> cheap model
router.route("What is 2+2?", complexity="low")

# Hard query -> expensive model
router.route("Draft a patent claim for...", complexity="high")
```
6. Production Considerations
6.1 The “Agent Loop” Trap
An agent gets stuck in a loop:
- Thought: “I need to search Google.”
- Action: Search “Python”.
- Observation: “Python is a snake.”
- Thought: “That’s not code. I need to search Google.”
- Action: Search “Python”. …

System Design Fix: implement a Max Steps Circuit Breaker with a hard limit of, say, 10 steps. If the task isn’t solved by then, return “I failed” rather than burning $100.
6.2 FinOps Tagging
Every request should have metadata: {"user_id": "123", "feature": "email_writer"}.
This allows you to answer: “Is the Email Writer feature profitable?”
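A minimal sketch of this kind of tagging, aggregating spend per feature in memory (a real system would write these tags to your metrics or billing pipeline):

```python
from collections import defaultdict

spend_by_feature = defaultdict(float)

def record(cost: float, user_id: str, feature: str):
    # Tag every request with metadata; aggregate by any dimension you care about.
    spend_by_feature[feature] += cost

record(0.03, user_id="123", feature="email_writer")
record(0.30, user_id="456", feature="email_writer")
record(0.01, user_id="123", feature="date_extractor")

# "Is the Email Writer feature profitable?" -> compare spend vs revenue.
print(round(spend_by_feature["email_writer"], 2))  # -> 0.33
```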
7. Common Pitfalls
- Summarization Recursion: You summarize history to save tokens, but generating the summary itself costs tokens. For short threads, summarization can cost more than just sending the raw logs.
- Over-Caching: Caching “Write me a poem about X” is bad (User wants variety). Caching “What is the capital of X?” is good.
- Fix: Only cache deterministic queries.
8. Best Practices
- Usage-Based Throttling: Rate limit users not just by Request Count, but by Dollar Amount. “You have $1.00 credit per day.”
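A sketch of dollar-based throttling, with an in-memory per-user ledger that resets daily (production systems would back this with Redis or a database):

```python
import time

class DollarRateLimiter:
    """Throttle by spend, not request count: e.g. $1.00 credit per day."""

    def __init__(self, daily_credit: float = 1.00):
        self.daily_credit = daily_credit
        self.spend = {}  # user_id -> (day, dollars spent)

    def allow(self, user_id: str, est_cost: float) -> bool:
        day = time.strftime("%Y-%m-%d", time.gmtime())
        spent_day, spent = self.spend.get(user_id, (day, 0.0))
        if spent_day != day:
            spent = 0.0  # new day, credit resets
        if spent + est_cost > self.daily_credit:
            return False  # deny before the LLM call is made
        self.spend[user_id] = (day, spent + est_cost)
        return True

limiter = DollarRateLimiter(daily_credit=1.00)
print(limiter.allow("u1", 0.90))  # -> True
print(limiter.allow("u1", 0.20))  # -> False: would exceed $1.00
```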
- Separation of Concerns: Don’t ask the “Planner” (GPT-4) to do the “Extraction” (JSON formatting). Extract using GPT-3.5 or RegEx.
9. Connections to Other Topics
This connects to Model Serialization (ML Track).
- Serialization: Optimizing storage size (disk cost).
- Cost Mgmt: Optimizing token size (compute cost). In both, “Compression” (of weights or of prompts) is the key lever for efficiency.
10. Real-World Examples
- Zapier AI Actions: Uses a router. Simple logic runs on cheaper models. Complex reasoning upgrades to GPT-4.
- Microsoft Copilot: Likely caches code snippets. If 10,000 developers type def qsort(arr):, the completion is fetched from a KV store, not re-generated by the GPU.
11. Future Directions
- Speculative Decoding: Using a small model to “guess” the next few tokens, and the large model to “verify” them in parallel. Reduces cost and latency.
- Local-First Agents: Running a 7B Llama-3 model on the user’s laptop for free, falling back to Cloud GPT-4 only when stuck.
12. Key Takeaways
- Routing is ROI: Getting 80% quality for 10% price (GPT-3.5) is better than 99% quality for 100% price (GPT-4) for most tasks.
- Cache Aggressively: Semantic caching is the only way to get sub-millisecond, $0 cost responses.
- Circuit Breakers: Never let an agent run while(true).
Next in this series, we move into real-time agent architectures with Streaming and Real-Time Agents.
FAQ
How much do AI agents cost to run in production?
A GPT-4 agent running a ReAct loop costs roughly $0.30 per task. At 10,000 active users doing 5 tasks per day, that is $15,000 daily or $5.4 million per year. The cost comes from three components: input tokens (reading context), output tokens (generation, which is 3-10x more expensive), and frequency (a single user task may involve 10 or more LLM calls in the reasoning loop).
What is model routing for cost optimization?
Model routing inspects each prompt and routes it to the cheapest model that can handle the required complexity. Simple tasks like date extraction or formatting go to GPT-3.5-Turbo or Claude Haiku at a fraction of the cost, while complex tasks like legal drafting or multi-step reasoning go to GPT-4. This gives roughly 80% quality for 10% of the price on the majority of agent tasks.
How do circuit breakers prevent runaway AI agent costs?
Circuit breakers set a hard limit on the number of steps an agent can take per task, typically around 10. Without them, an agent stuck in a reasoning loop – repeatedly searching for the same thing or retrying a failing tool – can make dozens of expensive LLM calls on a single request. The circuit breaker forces the agent to return a failure message rather than burning through budget.
What is semantic caching and how does it save money?
Semantic caching embeds queries as vectors and searches for similar cached queries rather than requiring exact string matches. If a new query falls within a similarity threshold of a cached one, the stored response is returned instantly at zero token cost. This is effective for deterministic queries like factual lookups but should not be used for creative tasks where users expect variety.
Originally published at: arunbaby.com/ai-agents/0047-cost-management-for-agents