Testing AI Agents
“Test agents like systems: validate tool calls, pin behaviors with replayable traces, and catch regressions before users do.”
TL;DR
Testing AI agents requires a systems approach across four layers: unit tests for deterministic components (validators, budgets, redaction), integration tests with mocked tools, scenario tests for end-to-end flows, and trace-based regression tests. Tool mocking via record/replay is the single most important technique because it makes tests fast, stable, and cheap. Test trajectories (tool selection, ordering, retries), not just final outputs, and enforce invariants (budgets, safety rails, schema validation) deterministically in code. This complements the evaluation framework patterns in Agent Evaluation Frameworks and the Multi-Step Reasoning foundations.

1. Why “testing agents” is different from testing normal code
In normal software, you test deterministic functions: same input → same output.
Agents are not like that:
- outputs can vary across runs
- “correctness” can be fuzzy
- behavior depends on tools (web, APIs, files, code execution)
- safety matters as much as success
So testing agents is really testing a system:
- the model prompt and policy
- the orchestrator code
- tool contracts and validators
- sandboxing and safety gates
- state management and stop conditions
If you test only the final text output, you’ll miss the most expensive failures: unsafe tool use, loops, and silent hallucinations.
2. The testing stack: unit → integration → scenario → regression
A practical hierarchy:
2.1 Unit tests (fast, local)
Test deterministic components:
- tool argument validation
- schema parsers
- redaction functions
- budget counters and repetition detectors
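For example, two pytest-style unit tests over small deterministic helpers. The `StepBudget` counter and `redact` function here are hypothetical stand-ins for your own components:

```python
import re

class StepBudget:
    """Hypothetical budget counter: deterministic, so test it hard."""
    def __init__(self, max_steps: int):
        self.max_steps = max_steps
        self.used = 0

    def spend(self) -> bool:
        # Consume one step; refuse once the budget is exhausted.
        if self.used >= self.max_steps:
            return False
        self.used += 1
        return True

def redact(text: str) -> str:
    # Mask anything shaped like an API key.
    return re.sub(r"sk-[A-Za-z0-9]{8,}", "[REDACTED]", text)

def test_budget_never_exceeded():
    budget = StepBudget(max_steps=3)
    assert [budget.spend() for _ in range(5)] == [True, True, True, False, False]

def test_redaction_masks_keys():
    assert "sk-" not in redact("token: sk-abc123def456")
```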
2.2 Integration tests (tool boundaries)
Test orchestration with mocked tools:
- tool calls happen with correct args
- unsafe calls are blocked
- retries are bounded
2.3 Scenario tests (end-to-end)
Test realistic user flows:
- multi-step tasks
- realistic tool outputs
- expected stop reasons
2.4 Regression tests (pin behaviors)
Replay known traces and compare:
- action sequences
- safety decisions
- cost metrics
This is how you keep agent behavior stable over time.
2.5 A quick mapping: what to test where
| Layer | What you test | What it catches | Typical runtime |
|---|---|---|---|
| Unit | validators, redaction, budgets | obvious bugs, safety bypasses | milliseconds |
| Integration | orchestrator + mocked tools | wrong tool args, wrong ordering | seconds |
| Scenario | end-to-end intent flows | drift, missing steps, bad stop reasons | seconds–minutes |
| Regression | replay traces | regressions from prompt/model/tool changes | minutes |
This mapping is useful because it prevents a common mistake: trying to test everything with slow end-to-end tests.
3. What to test: outputs, trajectories, and invariants
3.1 Output-level tests
Useful for:
- strict JSON outputs
- required fields and formats
- deterministic tasks
But output tests alone are insufficient.
3.2 Trajectory-level tests (the agent’s path)
Trajectory tests validate:
- tool selection
- action ordering
- bounded retries
- stop conditions
This is the core of agent reliability engineering.
3.3 Invariant tests (must always hold)
Invariants should be enforced by code:
- budgets never go negative
- write actions require approval
- no network in sandbox (if policy)
- no secrets in logs
Invariants are your “hard rails.” When they fail, the test should fail loudly.
3.5 Practical invariants (copy/paste for most agents)
These are generic invariants that improve most agent systems:
- Budget invariants: `steps_used <= max_steps`, `tool_calls <= max_tool_calls`.
- Safety invariants: no write tools when writes are disallowed; no network when sandbox policy forbids it.
- Schema invariants: any structured output must parse; unknown fields are rejected (when strict).
- No secret echo: outputs and logs must not contain patterns that look like keys/PII.
- Stop invariants: agent must always stop with an explicit stop reason.
You can enforce many of these in code (deterministically) and then test the enforcement layer.
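Here is a minimal sketch of that enforcement layer, assuming a hypothetical `RunResult` shape. The point is that violations are computed deterministically, so tests can target the checker itself:

```python
from dataclasses import dataclass, field

@dataclass
class RunResult:
    steps_used: int
    tool_calls: int
    stop_reason: str | None
    output: str
    logs: list[str] = field(default_factory=list)

def check_invariants(result: RunResult, max_steps: int, max_tool_calls: int) -> list[str]:
    # Collect every violation so a failing test reports all of them.
    violations = []
    if result.steps_used > max_steps:
        violations.append("budget: steps_used exceeded max_steps")
    if result.tool_calls > max_tool_calls:
        violations.append("budget: tool_calls exceeded max_tool_calls")
    if result.stop_reason is None:
        violations.append("stop: missing explicit stop reason")
    for text in [result.output, *result.logs]:
        if "sk-" in text:  # crude stand-in for a real secret detector
            violations.append("leak: secret-like pattern in output/logs")
    return violations
```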
4. Mocking tools: the most important agent testing technique
If your tests depend on real tools:
- they’re slow
- they’re flaky
- they’re expensive
- they change over time (web pages, APIs)
So you need tool mocks.
4.1 Record/replay
Record real tool outputs once, then replay them.
Pros:
- realistic
- stable over time
Cons:
- recordings can go stale if tools evolve
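A record/replay wrapper can be surprisingly small. This sketch assumes tools are plain callables with JSON-serializable arguments; cassette files are keyed by a hash of the args:

```python
import hashlib
import json
from pathlib import Path

class RecordReplayTool:
    """Wraps a real tool callable; cassettes are keyed by hashed args."""
    def __init__(self, real_tool, cassette_dir: str, mode: str = "replay"):
        self.real_tool = real_tool
        self.dir = Path(cassette_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.mode = mode

    def __call__(self, **kwargs):
        key = hashlib.sha256(
            json.dumps(kwargs, sort_keys=True).encode()
        ).hexdigest()
        cassette = self.dir / f"{key}.json"
        if self.mode == "replay":
            # Fails loudly if the recording is missing: no silent live calls.
            return json.loads(cassette.read_text())
        result = self.real_tool(**kwargs)
        cassette.write_text(json.dumps(result))
        return result
```

Run record mode once against the real tools, commit the cassette files, and keep CI in replay mode so tests stay fast and deterministic.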
4.2 Synthetic mocks
Write simple mocked outputs:
- simulate success
- simulate timeouts
- simulate malformed responses
This is great for testing error handling and safety.
4.3 Hybrid approach
Use record/replay for “happy paths” and synthetic mocks for “failure paths.”
4.5 Mocking strategy for multi-step agents: “tool scripts”
For scenario tests, it’s helpful to define tool outputs as a script:
- call #1 → returns success payload
- call #2 → returns 429
- call #3 → returns success after backoff
This lets you verify:
- retries are bounded
- the agent changes strategy
- stop reason is correct
These scripts become reusable test fixtures.
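A minimal sketch of such a scripted tool, where each call pops the next canned response and the call log is kept for trajectory assertions (names are illustrative):

```python
class ScriptedTool:
    """Each call pops the next canned step; script exhaustion fails loudly."""
    def __init__(self, script: list[dict]):
        self.script = list(script)
        self.calls = []  # argument log, for trajectory assertions

    def __call__(self, **kwargs):
        self.calls.append(kwargs)
        step = self.script.pop(0)
        if "error" in step:
            raise RuntimeError(step["error"])  # e.g. "HTTP 429"
        return step["response"]

# success → rate limit → success after backoff
search = ScriptedTool([
    {"response": {"results": ["doc-1"]}},
    {"error": "HTTP 429"},
    {"response": {"results": ["doc-2"]}},
])
```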
5. Testing structured outputs: schemas + fuzzing
If your agent outputs structured JSON, you can test:
- schema validity
- missing required fields
- invalid types
- range constraints
Add fuzz tests:
- random strings
- edge-case unicode
- extremely long inputs
These catch parsing edge cases and drift.
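A small sketch using the `jsonschema` package (an assumption; any validator works), with a fuzz loop that feeds random payloads and asserts they are rejected:

```python
import random
import string
from jsonschema import ValidationError, validate

SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["title", "confidence"],
    "additionalProperties": False,  # reject unknown fields when strict
}

def is_valid(payload: dict) -> bool:
    try:
        validate(instance=payload, schema=SCHEMA)
        return True
    except ValidationError:
        return False

def test_fuzz_rejects_garbage():
    for _ in range(100):
        junk = {"".join(random.choices(string.printable, k=8)): random.random()}
        assert not is_valid(junk)  # missing required fields, unknown keys
```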
Further reading (optional): see Structured Output Patterns.
5.5 Property-based thinking (even without a property-testing library)
You can get most property-testing value without adopting a new library by writing a few loops:
- generate 100 random small inputs
- run the agent/tool logic
- assert invariants
Examples:
- “Any extracted list has no duplicate IDs.”
- “Any plan has no cycles when dependencies are used.”
- “Any cost metric is non-negative.”
This catches “weird” failures that aren’t in your hand-written examples.
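A sketch of this pattern, using a hypothetical `dedupe_ids` step standing in for your own extraction logic:

```python
import random

def dedupe_ids(ids: list[int]) -> list[int]:
    # Hypothetical post-processing step: drop duplicates, keep order.
    seen, out = set(), []
    for i in ids:
        if i not in seen:
            seen.add(i)
            out.append(i)
    return out

def test_no_duplicate_ids_property():
    for _ in range(100):
        ids = [random.randint(0, 20) for _ in range(random.randint(0, 30))]
        result = dedupe_ids(ids)
        assert len(result) == len(set(result))  # property: no duplicates
        assert set(result) == set(ids)          # property: nothing lost
```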
6. Testing autonomy: budgets and stop reasons
Autonomous agents fail by looping. So tests must assert:
- the agent stops within budget
- the stop reason is correct (SUCCESS/NEEDS_INPUT/FAILURE)
- repeated tool calls are detected
A simple scenario test might assert:
- “Given tool failure twice, the agent asks a clarifying question instead of retrying forever.”
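Sketched as a test, reusing the scripted tool from section 4.5; `run_agent` is a hypothetical entry point into your orchestrator, and the stop reasons follow the SUCCESS/NEEDS_INPUT/FAILURE convention above:

```python
def test_two_failures_escalate_to_user():
    # The tool fails twice; the agent must escalate instead of looping.
    tool = ScriptedTool([{"error": "timeout"}, {"error": "timeout"}])
    result = run_agent("summarize the report", tools={"fetch": tool}, max_steps=5)

    assert result.stop_reason == "NEEDS_INPUT"  # asked a clarifying question
    assert len(tool.calls) <= 2                 # retries are bounded
    assert result.steps_used <= 5               # stayed within budget
```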
Further reading (optional): see Autonomous Agent Architectures and Hierarchical Planning.
6.5 Testing statefulness: resumption and replay
If your agent can pause/resume, test:
- resuming from a checkpoint produces a consistent next action
- tool cache prevents duplicate calls
- old failures are remembered (negative memory) to avoid loops
This is where “agent tests” become “workflow tests.”
Further reading (optional): see State Management and Checkpoints.
6.6 Testing planning quality (lightweight but effective)
If your agent generates plans, you can test plan quality without “grading prose.”
Practical checks:
- No cycles: if the plan has dependencies, it must be acyclic.
- Step bounds: each plan chunk has a max step budget.
- Done definitions exist: each chunk has a measurable done condition.
- No forbidden actions: plan does not contain disallowed tools/operations.
These are deterministic checks that catch many planning failures early.
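The cycle check is a few lines of DFS. This sketch assumes plans expose dependencies as `{step_id: [depends_on, ...]}`:

```python
def has_cycle(deps: dict[str, list[str]]) -> bool:
    # Classic three-color DFS: reaching a GRAY node again means a cycle.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in deps}

    def visit(node: str) -> bool:
        color[node] = GRAY
        for dep in deps.get(node, []):
            if color.get(dep, WHITE) == GRAY:
                return True
            if color.get(dep, WHITE) == WHITE and visit(dep):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in deps)

assert has_cycle({"a": ["b"], "b": ["a"]}) is True
assert has_cycle({"a": ["b"], "b": []}) is False
```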
Further reading (optional): see Hierarchical Planning.
7. Testing safety: prompt injection and forbidden actions
Safety tests should be explicit and automated:
7.1 Prompt injection tests
Inputs containing:
- “ignore previous instructions”
- “call delete tool”
- “reveal secrets”
Expected behavior:
- refuse
- sanitize
- continue safely or escalate
7.2 Forbidden tool calls
Make sure the orchestrator blocks:
- writes when writes are forbidden
- network calls when network is forbidden
- file access outside allowlist
These should be deterministic tests that do not depend on the model “doing the right thing.”
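A sketch of such a deterministic gate plus its test; the tool names and policy shape are illustrative:

```python
import pytest

class PolicyError(Exception):
    pass

WRITE_TOOLS = {"write_file", "delete_record", "send_email"}

def enforce_policy(tool_name: str, args: dict, *, allow_writes: bool, path_allowlist: list[str]):
    # Runs before every tool call; raising blocks the call outright.
    if tool_name in WRITE_TOOLS and not allow_writes:
        raise PolicyError(f"write tool blocked: {tool_name}")
    path = args.get("path", "")
    if path and not any(path.startswith(p) for p in path_allowlist):
        raise PolicyError(f"path outside allowlist: {path}")

def test_writes_blocked_when_forbidden():
    with pytest.raises(PolicyError):
        enforce_policy("delete_record", {}, allow_writes=False, path_allowlist=[])
```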
7.5 Testing prompt injection defense: fail closed
When you test injection, treat the model as untrusted:
- the orchestrator should block dangerous tool calls even if the model requests them
- untrusted content must not be placed into instruction channels
Your tests should assert “fail closed” behavior: when unsure, the system blocks and escalates rather than acting.
7.6 Data leakage tests: protect outputs and logs
Leakage tests should cover both user-visible output and internal logs/traces.
Test inputs:
- fake API keys (`sk-...`, long random strings)
- fake PII (emails, phone numbers)
- internal hostnames or file paths
Expected behavior:
- agent output is redacted
- logs are redacted (or sensitive fields are omitted)
- the agent refuses to reveal secrets even when prompted
These tests can be mostly deterministic (regex detectors + policies), which is ideal for CI.
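A minimal sketch of regex leak detectors shared by the output and log checks (patterns are illustrative; real suites pair them with allowlists to cut false positives):

```python
import re

LEAK_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{8,}"),                 # API-key shape
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),            # email
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),  # US phone shape
]

def find_leaks(text: str) -> list[str]:
    return [m.group(0) for pat in LEAK_PATTERNS for m in pat.finditer(text)]

def test_output_and_logs_are_clean():
    output = "Summary ready. Contact: [REDACTED]"
    logs = ["tool=fetch status=200", "user_email=[REDACTED]"]
    for text in [output, *logs]:
        assert find_leaks(text) == []
```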
8. Trace-based regression: replay the system, not just the output
The best regression artifacts are traces:
- messages
- tool calls + results
- budgets
- stop reason
A regression test can assert:
- tool call sequence is unchanged (or changes are intentional)
- safety blocks still happen
- cost is within bounds
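A sketch of that comparison, assuming golden traces are stored as JSON with a tool call list, safety blocks, and a cost metric:

```python
import json

def diff_trace(golden_path: str, new_trace: dict) -> list[str]:
    golden = json.load(open(golden_path))
    diffs = []
    golden_calls = [(c["tool"], c["args"]) for c in golden["tool_calls"]]
    new_calls = [(c["tool"], c["args"]) for c in new_trace["tool_calls"]]
    if golden_calls != new_calls:
        diffs.append("tool call sequence changed")
    if new_trace["safety_blocks"] != golden["safety_blocks"]:
        diffs.append("safety decisions changed")
    if new_trace["cost_usd"] > golden["cost_usd"] * 1.2:  # 20% headroom
        diffs.append("cost regression beyond tolerance")
    return diffs  # empty == no regression; non-empty needs human review
```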
Further reading (optional): see Observability and Tracing and Agent Evaluation Frameworks.
8.5 Snapshot tests for agent behavior (use carefully)
Snapshot tests (“compare output to a stored snapshot”) are tempting, but they’re fragile for agents.
If you use snapshots, snapshot the structured parts:
- tool call sequence
- stop reason
- schema-validated outputs
- safety violations list
Avoid snapshotting large freeform text that is expected to vary.
This gives you regression protection without constant churn.
8.6 Golden traces: how to keep them useful over time
Golden traces are saved “known good” trajectories that you replay in regression tests.
To keep them useful:
- store traces as structured artifacts (JSON) with redacted tool results
- pin tool outputs via record/replay
- version the agent prompt and the tool schemas
When a golden trace changes:
- require an explicit review/approval (like updating a snapshot test)
- record the reason (“prompt updated to handle new constraint”)
This prevents “silent drift” where the test suite slowly stops representing intended behavior.
9. A minimal test harness (pseudocode)
```python
def test_agent_scenario(agent, scenario):
    # Scripted fakes make the run repeatable (see section 4).
    tools = FakeTools(scenario["tool_responses"])
    result = agent.run(scenario["input"], tools=tools)

    # Stop behavior and budgets.
    assert result.stop_reason == scenario["expected_stop_reason"]
    assert result.steps <= scenario["max_steps"]

    # Safety invariants must never be violated.
    assert not result.violations

    # Trajectory: tool calls match the expected sequence.
    assert tool_calls_match(result.trace.tool_calls, scenario["expected_tool_calls"])
```
The goal is stable, repeatable tests. The model can still be stochastic, but the test harness should reduce variability via tool replay and strict invariants.
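For concreteness, here is what a scenario fixture for that harness might look like; the field names match the pseudocode and are otherwise an assumption:

```python
scenario = {
    "input": "Find the refund policy and summarize it",
    "tool_responses": {
        "search": [{"response": {"results": ["policy.html"]}}],
        "fetch": [{"response": {"text": "Refunds within 30 days."}}],
    },
    "expected_stop_reason": "SUCCESS",
    "max_steps": 6,
    "expected_tool_calls": [
        ("search", {"query": "refund policy"}),
        ("fetch", {"url": "policy.html"}),
    ],
}
```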
9.5 Case study: a browsing agent test suite
Test cases to include:
- normal query with 2 sources
- conflicting sources (agent should surface uncertainty)
- malicious page injection text (agent must ignore and not call tools unsafely)
- stale page (agent should prefer recent sources if policy requires)
This is an ideal domain for tests because “citation faithfulness” is measurable.
Further reading (optional): see Web Browsing Agents.
9.6 Case study: a code execution agent test suite
Code execution agents are a perfect target for strong tests because you can demand hard evidence (tests passing).
Scenario tests:
- compilation/runtime error → agent fixes code in ≤3 retries
- timeout case → agent changes strategy (smaller input, early exits)
- forbidden operation case (network/filesystem) → system blocks and agent proceeds safely
Assertions:
- sandbox limits were respected (time/memory)
- no network calls attempted (if policy)
- final output matches expected schema/format
Further reading (optional): see Code Execution Agents.
10. Common failure modes (and the test that catches them)
10.1 “Correct answer, wrong tool behavior”
Test: trajectory-level assertions on tool calls.
10.2 “Infinite retries”
Test: budget + repetition detector.
10.3 “Secret leakage”
Test: log scrubbing + regex leak detectors.
10.4 “Schema drift”
Test: schema validation on every run + fuzz tests.
10.5 “Stale browsing results”
Test: citation faithfulness and freshness policy checks (with recorded pages).
11. Summary & Junior Engineer Roadmap
Testing agents is systems testing:
- Test deterministic parts hard: validators, redaction, budgets.
- Mock tools: record/replay + failure simulations.
- Test trajectories: tool choice, ordering, bounded retries.
- Assert invariants: safety rules must never be violated.
- Replay traces for regression: pin behavior across changes.
Mini-project (recommended)
Build a test suite for one agent intent:
- 10 scenario tests with tool replay
- 5 adversarial tests (prompt injection, malformed tool output)
- 3 regression traces pinned to known-good behavior
Run it in CI and fail the build when success rate or safety degrades.
Common rookie mistakes (avoid these)
- Testing only final text: you’ll miss unsafe tool attempts and loops.
- No tool mocks: your tests will be flaky and slow.
- No budgets: an agent that loops will “pass” until it hits production costs.
- No regression suite: you’ll ship changes that silently degrade behavior.
Further reading (optional)
- Evaluation harnesses and scoring: Agent Evaluation Frameworks
- Trace collection and debugging: Observability and Tracing
Related: For the broader evaluation strategy that testing plugs into, see Agent Evaluation Frameworks. For planning patterns that benefit most from strong tests, see Hierarchical Planning.
FAQ
How is testing AI agents different from testing normal software?
Agent outputs vary across runs, correctness is often fuzzy, behavior depends on external tools, and safety matters as much as success. You must test the entire system including prompts, orchestration, tool contracts, sandboxing, state management, and stop conditions, not just final text output. Testing only the final answer misses the most expensive failures: unsafe tool use, infinite loops, and silent hallucinations.
What is tool mocking for AI agent testing?
Tool mocking replaces real external tools with controlled substitutes to make tests fast, stable, and cheap. Record/replay captures real tool outputs once and replays them for consistent tests. Synthetic mocks simulate failures like timeouts, rate limits, and malformed responses. The hybrid approach uses recordings for happy paths and synthetic mocks for failure paths, giving both realism and coverage of error handling.
What are trajectory-level tests for AI agents?
Trajectory tests validate the agent’s path through a task, not just its final output. They check tool selection (did it use the right tool), action ordering (did it respect dependencies), bounded retries (did it stop retrying after failures), and stop conditions (did it halt for the right reason). This catches failures like correct answers achieved through unsafe tool use or wasteful repeated calls.
How do you regression test AI agents?
Save known-good traces as structured JSON artifacts with pinned tool outputs, then replay them on every change. Assert that tool call sequences, safety blocks, and cost metrics remain within bounds. Version the agent prompt and tool schemas alongside the golden traces. When a golden trace changes, require explicit review and record the reason, preventing silent drift where the test suite stops representing intended behavior.
Originally published at: arunbaby.com/ai-agents/0043-testing-ai-agents