21 minute read

“The Engine of Autonomy: Understanding the Agentic ‘Brain’.”

TL;DR

The Large Language Model is the cognitive engine of every AI agent, and its capabilities directly cap what the agent can do. Four capabilities matter most: steerability (does the model follow system instructions under pressure?), reasoning (can it plan multi-step solutions via Chain of Thought and Tree of Thoughts?), grounding and tool use (can it translate fuzzy intent into rigid API schemas?), and context window management (can it find the right information in a sea of tokens?). Production agent stacks often use multi-model routing – a cheap fast model for simple tasks, a frontier model for hard reasoning, and a judge model for scoring. When a model is weak on any axis, you compensate with runtime scaffolding: validators, routers, memory, and tool contracts.


1. Why Does the LLM Matter for AI Agents?

If an AI Agent is a vehicle for automation, the Large Language Model (LLM) is its engine. The performance, reliability, and “intelligence” of your agent are fundamentally capped by the capabilities of the underlying model.

You cannot build a sophisticated autonomous software engineer using a model that struggles with basic logic puzzles. Similarly, using a massive, expensive reasoning model for a simple intent classification task is a waste of resources.

In this deep dive, we will move beyond the marketing hype of “AI” and dissect the specific neuro-symbolic capabilities required for an LLM to function as an effective agent. We will explore reasoning paradigms, the mechanics of context, the landscape of models available today, and the benchmarks that actually matter.

This is not a post about “How to use ChatGPT.” This is a post about the computer science of agency: how we map the fuzzy, probabilistic outputs of a neural network onto the rigid, deterministic world of software action.

To make this concrete, keep one question in mind: when an agent fails, what capability was missing? Was it the inability to follow a constraint? A lack of long-horizon planning? Poor tool argument construction? Weak retrieval under long context? If you can name the missing capability, you can design the right runtime scaffolding (validators, routers, memory, retries) instead of hoping a bigger model will magically fix it.


2. What Capabilities Must an LLM Have to Be an Effective Agent?

Not all LLMs can be agents. A model might be excellent at creative writing (fluent, stylistically rich prose) but terrible at agency. Agency requires a specific set of skills, what we might call the “Cybernetic Stack” of LLMs.

2.1 What Is Steerability and Why Does Instruction Following Matter?

The most basic requirement is obedience. Can the model follow a rule, even when the context distracts it?

  • The Problem: Pre-trained base models (like raw GPT-3) are just “Pattern Completers.” If you prompt “The capital of France is”, it completes “Paris.” If you prompt “Do not output the capital of France”, a base model might get confused because it sees “capital of France” and autocompletes “Paris.” It is optimizing for statistical likelihood, not semantic command.
  • The Solution: This is where RLHF (Reinforcement Learning from Human Feedback) changed the world. Models are specifically trained to prioritize System Instructions over their own training data probability.
  • Agent Relevance: Agents rely on System Prompts, massive blocks of text defining rules (“Never delete files,” “Always return JSON”). A model with low steerability (like many early open-source models) will “forget” these rules as the conversation gets longer, a phenomenon known as Instruction Drift. High-quality agent models (GPT-4o, Claude 3.5 Sonnet) treat the System Prompt as a constitution, adhering to it even when the user tries to jailbreak it.

In practice, steerability is less about “politeness” and more about hierarchy resolution:

  • If the system prompt says “Only output JSON,” and the user says “Explain your reasoning in prose,” what wins?
  • If the developer message says “Use get_customer(id) before answering,” and the tool call fails, does the model degrade gracefully or fabricate?

A useful way to think about this is that a production agent lives inside a policy lattice: system > developer > user > tool. The more consistently a model respects that lattice under stress (long context, conflicting instructions, adversarial input), the less runtime glue you need to keep it safe.
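
You cannot rely on the model alone to resolve this lattice, so production stacks enforce the top layers in code. Below is a minimal sketch of that runtime glue for the “Only output JSON” rule: a validator plus a corrective retry. Here call_model is a hypothetical stand-in for whatever chat-completion client you actually use.

```python
import json

def call_model(messages: list[dict]) -> str:
    """Hypothetical stand-in for your chat-completion client."""
    raise NotImplementedError

def json_only(messages: list[dict], max_retries: int = 2) -> dict:
    """Enforce the system-level 'only output JSON' rule in code,
    regardless of what the user message asked for."""
    for _ in range(max_retries + 1):
        raw = call_model(messages)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Corrective feedback: restate the system rule and retry.
            messages = messages + [
                {"role": "assistant", "content": raw},
                {"role": "user", "content": "Invalid. Respond with valid JSON only."},
            ]
    raise ValueError("Model failed to produce valid JSON after retries")
```

The point is not the ten lines of code; it is that the system-level rule is guaranteed by the runtime, not by the model’s goodwill.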

2.2 How Does LLM Reasoning Work (System 2 Thinking)?

Daniel Kahneman, in “Thinking, Fast and Slow”, distinguishes between System 1 (Fast/Intuitive) and System 2 (Slow/Logical).

  • System 1: “What is 2+2?” -> “4”. Instant.
  • System 2: “What is 17 * 24?” -> “Let me think… 10 * 24 = 240, 7 * 24 = 168, 240 + 168 = 408.” Slow.

Standard LLMs naturally operate in System 1. They generate the next token immediately based on surface-level statistics. Agents, however, need System 2. They face multi-step logic puzzles: “If the user is in the US, check inventory A. If inventory A is empty, check inventory B, but only if it’s a weekday.”

To enable this, we rely on Reasoning Paradigms:

  • Deductive Reasoning: Deriving a conclusion from premises (if A → B and B → C, then A → C).
  • Inductive Reasoning: Seeing examples in the prompt and generalizing the rule.
  • Abductive Reasoning: The most critical for debugging. Seeing an observation (“The server returned 500”) and inferring the most likely cause (“The database connection string is probably wrong”).

One more capability matters for agents in production: stateful reasoning. It’s not enough to reason; the model must reason consistently across turns:

  • Remember which tool it already called (avoid duplicate actions like sending the same email twice).
  • Preserve intermediate decisions (e.g., “we chose rollout strategy B because strategy A violated the latency budget”).
  • Maintain invariants (“never mutate production data”) even as the dialogue evolves.

This is why many agent stacks maintain explicit run-state outside the model (a structured plan, tool history, approvals), and treat the LLM as a planner/controller rather than the sole source of truth.
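
To make “explicit run-state” concrete, here is a minimal sketch of one shape it can take. The field and class names are illustrative, not borrowed from any particular framework.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict
    result: str | None = None

@dataclass
class RunState:
    """State kept outside the model, carried between turns."""
    plan: list[str] = field(default_factory=list)             # remaining steps
    decisions: dict[str, str] = field(default_factory=dict)   # e.g. {"rollout": "B; A broke latency budget"}
    invariants: list[str] = field(default_factory=list)       # e.g. ["never mutate production data"]
    tool_history: list[ToolCall] = field(default_factory=list)

    def already_called(self, name: str, args: dict) -> bool:
        """Guard against duplicate side effects (e.g., sending the same email twice)."""
        return any(t.name == name and t.args == args for t in self.tool_history)
```

Before executing any tool call the model proposes, the runtime checks already_called and re-injects decisions and invariants into the prompt, so the source of truth never lives solely in the conversation history.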

2.3 What Is Grounding and How Does Tool Use Work?

Pure LLMs live in a world of hallucination and text. Agents must live in reality. Grounding is the capability to map a fuzzy human intent (“Book me a flight”) to a rigid schema (book_flight(origin="SFO", dest="LHR", date="2024-01-01")).

This requires Semantic-to-Syntactic Translation.

  • Structured Output: This is the killer feature of modern models. Models like GPT-4o are fine-tuned to output valid JSON or XML tokens. This isn’t just “writing code”; it’s about adhering to a specific syntax constraint while maintaining semantic intent.
  • Constraint Satisfaction: A good agent model respects enums. If a function only accepts [RED, GREEN, BLUE], a bad model might output “YELLOW” because it “liked that color better” in the context of a painting. A grounded model understands that “Yellow” is an illegal move in this state space.

Grounding also includes reference discipline: when a tool returns an ID, does the model use that exact ID later, or does it “round” it, reformat it, or hallucinate a near-miss? This shows up constantly in real systems:

  • Billing: invoice IDs, charge IDs, subscription IDs.
  • Data: partition keys, feature names, table names.
  • Infra: instance IDs, region names, cluster identifiers.

If you’ve ever watched an agent “almost” do the right thing but fail on a small string mismatch, that’s a grounding failure. The runtime fixes are usually boring but effective: strict schemas, type validators, canonicalization, and a policy of “never invent identifiers, only reuse identifiers returned by tools.”
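
As a sketch of that last policy, assume you control the tool-call layer: record every identifier a tool actually returns, and reject arguments referencing IDs the model invented. The class below is hypothetical.

```python
class IdentifierRegistry:
    """Only IDs previously returned by tools are legal tool arguments."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def record(self, ids: list[str]) -> None:
        # Call on every tool response that contains identifiers.
        self._seen.update(ids)

    def validate(self, candidate: str) -> str:
        if candidate in self._seen:
            return candidate
        # A near-miss usually means the model "rounded" or reformatted an ID.
        raise ValueError(f"Unknown identifier {candidate!r}: agents may only reuse tool-returned IDs")

registry = IdentifierRegistry()
registry.record(["inv_8f3a21"])      # came back from a billing tool
registry.validate("inv_8f3a21")      # ok: exact reuse
# registry.validate("inv_8f3a2l")    # raises: hallucinated near-miss
```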

2.4 How Does Context Window Management Affect Agents?

The Agent’s “Working Memory.”

  • Needle in a Haystack: Can the model find one specific fact (e.g., a specific transaction ID) buried in 100,000 tokens of server logs?
  • Positional Bias: Models pay most attention to the beginning (the system prompt, whose early tokens act as “attention sinks”) and the end (the user query). The middle is often a dead zone, known as the “Lost in the Middle” phenomenon.
  • Agent Relevance: As an agent works, its history log grows. If the model’s retrieval capability degrades as context fills, the agent becomes “senile”, forgetting what it did 5 minutes ago, or re-running a tool it already ran.

There’s also a budget trade-off: long context is not free. It increases latency, cost, and sometimes decreases accuracy due to attention diffusion. So the real capability is not just “has a big window,” but “can operate with a selective window”:

  • Summarize older steps into compact state.
  • Retrieve only what’s needed (tool RAG, memory retrieval).
  • Keep high-signal artifacts (decisions, constraints, outputs) and discard chatter.
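
A simple compaction loop captures this policy. In the sketch below both helpers are assumptions: estimate_tokens is a crude stand-in for a real tokenizer, and summarize would typically be a cheap LLM call.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 chars/token); swap in a real tokenizer in production.
    return len(text) // 4

def summarize(steps: list[str]) -> str:
    """Hypothetical helper; usually a cheap LLM call in practice."""
    return "SUMMARY OF EARLIER STEPS: " + " | ".join(s[:80] for s in steps)

def compact_history(history: list[str], budget: int = 4000) -> list[str]:
    """Keep recent turns verbatim; fold older ones into a compact summary."""
    while sum(estimate_tokens(h) for h in history) > budget and len(history) > 4:
        old, recent = history[:len(history) // 2], history[len(history) // 2:]
        history = [summarize(old)] + recent
    return history
```

High-signal artifacts (decisions, constraints, final outputs) should be written into the summary or pinned separately; the chatter is what gets discarded.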

3. What Are the Key Reasoning Paradigms for Agents?

How do we extract “Reasoning” from a next-token predictor? We use structural prompting strategies to force the model into “System 2” mode.

3.1 What Is Chain of Thought (CoT) Prompting?

This is the foundational technique.

  • Prompt: “What is 23 * 45?”
  • Standard: “1035” (This might be a hallucination based on similar numbers in training data).
  • CoT: “Let’s break it down. 20 * 45 = 900. 3 * 45 = 135. 900 + 135 = 1035.”

Why it works: Computation takes time. In an LLM, time = tokens. By forcing the model to generate intermediate tokens, we are giving it computational space to resolve the logic. It’s effectively writing its own “scratchpad” data to the context window, which it can then attend to for subsequent calculations. An agent without CoT is like a human trying to do calculus in their head; an agent with CoT is a human with a pen and paper.

In production, you often want a clean separation between private scratchpad and public output:

  • The scratchpad helps the model stay coherent and reduces errors in multi-step tasks.
  • The public output should be concise, parseable, and avoid leaking internal prompts or sensitive tool responses.

Many agent runtimes therefore ask for “structured reasoning” internally (plan, assumptions, tool calls) while only returning a compact user-facing answer. This is less about secrecy and more about reliability: the model gets its “pen and paper,” but downstream systems get clean, deterministic output.
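
One common implementation of this split is to request a structured object from the model and surface only one field to the user. The schema below is illustrative, not a standard.

```python
import json

# Appended to the system prompt (illustrative wording):
REASONING_SCHEMA_HINT = (
    "Respond as JSON with keys: plan (list of steps), "
    "assumptions (list of strings), final_answer (string)."
)

def split_reasoning(raw_model_output: str) -> tuple[dict, str]:
    """Keep the scratchpad for logs/debugging; return only the compact answer."""
    obj = json.loads(raw_model_output)
    scratchpad = {"plan": obj.get("plan", []), "assumptions": obj.get("assumptions", [])}
    return scratchpad, obj["final_answer"]

scratchpad, answer = split_reasoning(
    '{"plan": ["parse dates", "sum totals"], "assumptions": ["timestamps are UTC"], '
    '"final_answer": "Total: 42"}'
)
print(answer)  # the user sees only: Total: 42
```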

3.2 How Does Tree of Thoughts (ToT) Enable Complex Planning?

For complex planning, linear thinking (CoT) isn’t enough. We need exploration.

  • Mechanism: The agent explores multiple “branches” of possibilities simultaneously or sequentially.
  • Branch 1: “I could search Google for the answer.” -> Evaluation: “This might be slow.”
  • Branch 2: “I could check the local cache.” -> Evaluation: “Fast, but might be stale.”
  • Selection: “I will check the cache first, and if that fails, search Google.”

Implementation: This usually isn’t just a prompt; it’s a runtime architecture. The Python wrapper runs the model 3 times to generate 3 thoughts, evaluates them (using a Judge model), and picks the winner.
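
A minimal sketch of that wrapper is below; generate_thought and judge are hypothetical stand-ins for the two model calls.

```python
def generate_thought(task: str) -> str:
    """Hypothetical: sample one candidate plan from the model (high temperature)."""
    raise NotImplementedError

def judge(task: str, thought: str) -> float:
    """Hypothetical: a judge model scores the candidate plan from 0 to 1."""
    raise NotImplementedError

def best_of_n(task: str, n: int = 3) -> str:
    """Depth-1 Tree of Thoughts: sample n branches, keep the winner."""
    candidates = [generate_thought(task) for _ in range(n)]
    return max(candidates, key=lambda t: judge(task, t))
```

Full ToT implementations go deeper, expanding the winning branch into sub-branches and pruning losers, but the skeleton is the same: sample, score, select.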

3.3 What Is Self-Refinement (Reflexion) and Why Does It Matter?

The ability to critique one’s own output.

  • Draft 1: Writes code to parse a CSV.
  • Self-Prompt: “Are there any bugs in this code? Check for edge cases.”
  • Critique: “Yes, I missed the case where the CSV has no header row.”
  • Draft 2: Rewrites code to handle headerless CSVs.

Agent Relevance: This is crucial for autonomous coders (like Devin). They must “Loop until passing.” A model that can self-correct is exponentially more valuable than a model that is 10% smarter but rigid.
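
The “loop until passing” pattern is straightforward to scaffold. In this sketch, write_code is a hypothetical model call and run_tests is whatever executable check you own; the critique signal is simply the real failure log.

```python
def write_code(task: str, feedback: str | None = None) -> str:
    """Hypothetical: model drafts (or redrafts) a solution."""
    raise NotImplementedError

def run_tests(code: str) -> tuple[bool, str]:
    """Execute the code against unit tests; return (passed, error_log)."""
    raise NotImplementedError

def reflexion_loop(task: str, max_iters: int = 5) -> str:
    feedback = None
    for _ in range(max_iters):
        code = write_code(task, feedback)
        passed, log = run_tests(code)
        if passed:
            return code
        # Feed the concrete failure back as the self-critique signal.
        feedback = f"Your last attempt failed:\n{log}\nFix the bug and try again."
    raise RuntimeError("No passing solution within the iteration budget")
```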


4. How Do You Choose the Right LLM for Your Agent?

Which model should you use for your agent? The answer depends on your budget, latency requirements, and complexity.

A common pattern in mature stacks is multi-model routing:

  • A cheap, fast model handles intent classification, schema filling, and simple tool routing.
  • A stronger “reasoning” model is invoked only for hard cases (long-horizon planning, ambiguous requirements, complex code edits).
  • A separate “judge” model scores candidate plans/outputs (safety, correctness, formatting).

This matters because the best agent systems are rarely “one model to rule them all.” They are orchestration systems that spend expensive cognition only when it buys down real risk.
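
The routing layer itself can be almost trivial; the hard part is the escalation policy. Here is a minimal sketch with hypothetical model names and a naive keyword heuristic standing in for a real difficulty classifier.

```python
CHEAP_MODEL = "small-fast-model"        # hypothetical: classification, schema filling
FRONTIER_MODEL = "big-reasoning-model"  # hypothetical: planning, complex code edits

HARD_SIGNALS = ("refactor", "migrate", "design", "debug", "plan")

def route(task: str) -> str:
    """Escalate to expensive cognition only when signals suggest real risk."""
    looks_hard = len(task) > 500 or any(s in task.lower() for s in HARD_SIGNALS)
    return FRONTIER_MODEL if looks_hard else CHEAP_MODEL

print(route("Classify this support ticket"))            # small-fast-model
print(route("Refactor the billing module for v2 API"))  # big-reasoning-model
```

In production the keyword heuristic is usually replaced by a small classifier model or by historical failure rates per task type, but the shape stays the same.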

4.1 What Are the Frontier Models (The Big Brains)?

These are closed-source, massive models (likely 1T+ parameters). They are the “General Contractors” of agency.

  • GPT-4o (OpenAI): The reigning champion for general agency.
  • Pros: Best-in-class Instruction Following. Native Function Calling fine-tuning means it rarely hallucinates tool parameters. Extremely fast time-to-first-token.
  • Cons: Expensive. “Lazy” (sometimes refuses to code full solutions, preferring to leave placeholders like # ... rest of code). Strict safety filters can trigger false refusals on benign tasks.
  • Claude 3.5 Sonnet (Anthropic): The “Coder’s Choice.”
  • Pros: Exceptional at coding and complex reasoning. Many benchmarks place it above GPT-4o for pure logic. Huge context window (200k) with excellent “Needle in Haystack” retrieval. It feels more “human” and verbose (“Sure! Here is the code…”).
  • Cons: Function calling format is slightly different (uses XML heavily in training, though supports JSON).
  • Gemini 1.5 Pro (Google): The “Context King.”
  • Pros: Massive 2M token context window. This changes the architecture entirely: for many workloads you don’t need RAG; you just stuff the entire manual into the context. Multimodal (can watch videos/screencasts of bugs).
  • Cons: Historically slightly higher hallucination rates on rigid logic than GPT-4, though catching up fast.

4.2 What Are the Best Open Weights Models for Agents?

Models you can host yourself. Essential for healthcare, finance, or privacy-critical agents.

  • Llama 3.1 70B & 405B (Meta):
  • Pros: GPT-4 class performance for free (if you have the GPUs). Uncensored (mostly). You own your data. Excellent Tool Calling support.
  • Cons: Hosting 405B parameters is technically difficult and expensive (multi-node GPU clusters). The 70B model is the sweet spot for enterprise agents, fitting on a single fast node.
  • Mistral Large / Mixtral:
  • Pros: Efficient “Mixture of Experts” (MoE) architecture. Good reasoning-to-cost ratio.

4.3 When Should You Use Specialized Models?

  • NexusRaven / Gorilla: Models fine-tuned exclusively on API calling. They might forget who the President is, but they can construct a perfect AWS CLI command from a fuzzy prompt better than GPT-4.
  • DeepSeek Coder: Models trained on massive troves of GitHub code. Excellent for autonomous software engineering agents.

5. How Do You Benchmark an LLM for Agentic Tasks?

How do we know if a model is a good agent? Standard NLP benchmarks (MMLU, much of HELM) are largely static, multiple-choice affairs (“A, B, C, D”). They don’t test agency. We use Agentic Benchmarks, which are dynamic.

Two benchmarking principles keep teams honest:

  • Executable evaluation: If it’s not run against an environment (tests, simulator, website), it’s easy to fool yourself with plausible text.
  • Long-horizon scoring: Agents often “look right” for the first 3 steps and fail on step 12. Benchmarks must measure end-to-end success, not just local correctness.

In other words, the agentic question is not “did it answer well,” but “did it finish the job under constraints.”

5.1 What Do HumanEval and MBPP Measure?

  • Task: “Write a Python function to reverse a list.”
  • Metric: Pass@1. We take the generated code and actually run it against a suite of unit tests.
  • Significance: Coding is a proxy for rigorous logic and syntax adherence. Models good at code are usually good at agents because both require handling rigid structures (APIs/Syntax) and logical flow control.
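
A toy version of a Pass@1 check fits in a few lines. This is a sketch only: it execs untrusted code in-process, whereas real harnesses isolate execution in a sandboxed subprocess.

```python
def passes(candidate_code: str, test_code: str) -> bool:
    """Run generated code plus its unit tests; any exception means failure.
    WARNING: exec on untrusted model output. Real harnesses sandbox this."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)
        exec(test_code, namespace)
        return True
    except Exception:
        return False

candidate = "def reverse_list(xs):\n    return xs[::-1]"
tests = "assert reverse_list([1, 2, 3]) == [3, 2, 1]\nassert reverse_list([]) == []"
print(passes(candidate, tests))  # True -> this sample counts toward Pass@1
```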

5.2 What Does AgentBench / WebShop Test?

  • Task: In a simulated e-commerce environment (WebShop mimics an Amazon-style catalog): “Find a blue HDMI cable under $10 and add it to cart.”
  • Metric: Success Rate within N steps.
  • Significance: Tests environment navigation, HTML parsing, and decision making. A model needs to handle dynamic observations (“out of stock”) and pivot its strategy.

5.3 What Is SWE-Bench and Why Is It the Gold Standard?

This is the gold standard for software engineering agents.

  • Task: “Here is a real GitHub Issue from a popular repo (e.g., Django, scikit-learn). Fix it.”
  • Process: The agent is given the codebase (files). It must write a reproduction script, locate the bug, fix it, and pass the tests.
  • Scores:
  • GPT-4 (Unassisted): ~2%
  • Devin/Specialized Agents: ~13-20%
  • Reality Check: This shows how early we are. Most agents fail at real-world software engineering because the context is too large and the reasoning chain is too long (hundreds of steps).

6. When Should You Fine-Tune an LLM for Your Agent?

Should you fine-tune your own model for your agent?

6.1 Why Should You Try Prompt Engineering Before Fine-Tuning?

Always exhaust prompt engineering (Context + Few-Shot Examples) first. It is cheaper, faster, and easier to debug. Fine-tuning adds a permanent maintenance burden (you have to re-train every time base models update).

6.2 What Are the Valid Reasons to Fine-Tune?

  1. Unique Toolset: If you have a proprietary internal API with 5,000 unusual functions, GPT-4 won’t know it. Fine-tuning a Llama-3 model on your API docs (Input: “Refund user”, Output: api.refund(uid)) can make it an expert router.
  2. Style/Tone Enforcement: If your agent needs to speak in a specific brand voice (e.g., “A 17th Century Pirate Lawyer”), fine-tuning is better than a long system prompt, which consumes tokens on every call.
  3. Latency/Cost (Distillation): You can train a small model (8B parameters) to mimic the outputs of a large model (GPT-4) for your specific task, allowing you to run it 10x cheaper and 5x faster. This is how many “Router” agents are built.
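
All three cases start the same way: assembling (input, output) pairs. Below is a minimal sketch of preparing tool-routing examples as chat-formatted JSONL; the messages layout is the common convention, but adapt the field names to whatever your trainer expects.

```python
import json

# Proprietary API routing: fuzzy intent -> exact internal call.
examples = [
    ("Refund user 42", "api.refund(uid=42)"),
    ("Freeze the account for user 7", "api.freeze_account(uid=7)"),
]

with open("train.jsonl", "w") as f:
    for intent, call in examples:
        record = {"messages": [
            {"role": "system", "content": "Translate intents into internal API calls."},
            {"role": "user", "content": intent},
            {"role": "assistant", "content": call},
        ]}
        f.write(json.dumps(record) + "\n")
```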

7. What Are Reasoning Tokens and How Will They Change Agents?

We are on the cusp of models that “think” in a fundamentally new way. Rumors of OpenAI’s “Project Strawberry” (Q*) and Google’s AlphaProof suggest models that perform internal tree-search optimization (like AlphaGo) during inference before outputting a single token.

  • Current: Compute per output token is fixed, so total inference time simply scales with output length.
  • Future: Inference time is variable. “Take 10 seconds to think about this chess move.” “Take 1 hour to think about this cure for cancer.”

This will transform agents from “Fast Guessers” to “Deep Thinkers,” potentially solving the “Zero-Shot Reliability” problem. An agent that can simulate 1,000 paths forward before choosing one action will be radically more reliable than our current greedy autoregressive models.

FAQ

Q: What LLM capabilities are needed for AI agents? A: AI agents require four core LLM capabilities: steerability (following system instructions reliably), reasoning (multi-step logical thinking), grounding and tool use (translating intent into structured API calls), and context window management (finding relevant information in large contexts).

Q: What is Chain of Thought (CoT) prompting and why does it help agents? A: Chain of Thought prompting forces the LLM to generate intermediate reasoning steps before giving a final answer. It works because computation takes tokens – by writing out a scratchpad, the model gets computational space to resolve logic, like a human using pen and paper instead of doing calculus in their head.

Q: How do you choose the right LLM for an AI agent? A: Match the model to your task complexity: frontier models like GPT-4o or Claude 3.5 Sonnet for complex reasoning and coding agents, open-weights models like Llama 3.1 70B for privacy-critical or self-hosted agents, and specialized models like NexusRaven for narrow tool-calling tasks. Many production systems use multi-model routing.

Q: What is SWE-Bench and why does it matter for AI agents? A: SWE-Bench is the gold standard benchmark for software engineering agents. It tests whether an agent can fix real GitHub issues from popular repos like Django. Even the best specialized agents score only 13-20%, showing how early the field is for real-world autonomous software engineering.

Q: When should you fine-tune an LLM for an AI agent? A: Fine-tune only after exhausting prompt engineering. The three valid cases are: a unique proprietary toolset the base model does not know, strict style or tone enforcement that would consume too many tokens as a system prompt, and latency or cost reduction through distillation of a large model into a smaller one.

8. Key Takeaways

  • The LLM is the Cognitive Substrate of the agent, providing the reasoning capabilities that allow the agent to navigate the world.
  • Steerability ensures it listens to you. Reasoning ensures it can plan. Function Calling ensures it can act. Context ensures it remembers.
  • Building an agent starts with selecting the right model for the complexity of your task. A coding agent needs Claude 3.5 or GPT-4o. A classification agent might do fine with Llama-3-8B. The art is in matching the brain size to the problem size.
  • Treat LLM capability as a set of engineering constraints. When you know where a model is weak (formatting, long-horizon planning, identifier fidelity, retrieval under long context), you can compensate with the runtime: validators, routers, memory, tool contracts, and evaluation harnesses.
  • For a practical model selection checklist, ask: (1) does it reliably produce structured outputs under pressure, (2) does it preserve identifiers and constraints across long conversations, and (3) does it recover cleanly when tools fail? Those three traits dominate real-world success far more than “IQ”-style benchmarks.

Originally published at: arunbaby.com/ai-agents/0002-llm-capabilities-for-agents

If you found this helpful, consider sharing it with others who might benefit.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch