
Most red teaming is wrong. Not wrong about the risks — wrong about where the risks live.

Here’s the failure mode. You run a standard red team on your agent. Your tooling fires adversarial prompts at the model. The model refuses. Your guardrails catch the obvious attempts. You write up the findings, fix the high-severity jailbreaks, and ship.

Three weeks later, an attacker sends your agent an email with a hidden instruction. The agent reads the email — safe. It calls a search tool to look up related documents — safe. It writes a summary to a shared Slack channel — safe. It composes and sends a follow-up to the attacker’s address — catastrophic.

No single step failed. The trajectory was the attack.

This is the gap T-MAP (arXiv:2603.22341, KAIST/UCLA/DeepAuto.ai, March 2026) is built to close. It’s the first red teaming framework that treats multi-step tool execution as the attack surface — not individual model responses.

TL;DR

T-MAP is the first red teaming framework that attacks agent action sequences rather than individual responses. It achieves a 57.8% average Attack Realization Rate across five MCP environments versus 1.9% for naive prompt-level approaches. Pair it with PISmith (arXiv:2603.13026), which stress-tests your prompt injection defenses under adaptive RL-based attack. Together they cover the offense/defense picture that standard LLM red teaming tools miss entirely. See prompt injection defense fundamentals for the background context on what makes injection in agent systems different.

*Image: a long circuit board trace being probed at multiple checkpoints simultaneously, with test leads clipped at regular intervals along its length.*

Why standard red teaming misses agent-specific failures

Standard LLM red teaming tests whether a model will produce a harmful output. That framing made sense when models were input-output functions. You sent a prompt, you got a response, you evaluated the response. If the response was harmful, the test failed.

Agents are not input-output functions. They are state machines that execute over time, calling tools, reading results, modifying external systems, and using the output of each step as input to the next. The dangerous thing is rarely a single model response. It’s the sequence.

T-MAP’s research team at KAIST put a number on this gap. Zero-shot prompt-level attacks achieved a 1.9% Attack Realization Rate — meaning 1.9% of attacks actually achieved their harmful objective through tool execution. Iterative refinement (the technique most algorithmic red teaming tools use) reached 15.6%. T-MAP’s trajectory-aware approach: 57.8%. That’s a 30x improvement over naive approaches and nearly 4x over iterative refinement.

The explanation is compositional risk. According to the T-MAP paper, 46.28% of successful attacks spanned multiple MCP servers — they required cross-server tool chains that no individual step’s safety evaluation could catch. Compare that to the best non-trajectory baseline, which found cross-server attacks in only 16.42% of successful attempts. You cannot find a vulnerability you are not designed to look for.

The Stanford HAI 2025 AI Index reported that publicly disclosed AI-related security and privacy incidents rose 56.4% from 2023 to 2024. The trajectory-level attack surface is a large part of why that number is climbing.

What trajectory-aware attack means in practice

T-MAP’s Attack Realization Rate is the signal you care about: did the attack actually achieve its harmful objective through real tool execution, or did it merely produce a harmful text response? The gap between those two outcomes is where standard testing creates false confidence.

The mechanism is evolutionary search over a structured archive. T-MAP maintains a 64-cell grid spanning risk categories and attack styles. It generates candidate prompts, executes them against the actual agent environment, records the full tool call trajectory — every function called, every argument passed, every result returned — and uses that trajectory information to mutate and improve subsequent attack attempts. The archive prevents convergence to a single successful pattern and keeps the search exploring diverse attack paths.
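
The mechanics described above can be sketched as a MAP-Elites-style archive loop. This is an illustrative sketch, not T-MAP’s actual code: the 8×8 cell axes, the `generate`/`mutate`/`execute` interfaces, and the mutation probability are all assumptions.

```python
import random

# Hypothetical axes: 8 risk categories x 8 attack styles = 64 cells,
# matching the grid size described in the paper.
RISK_CATEGORIES = 8
ATTACK_STYLES = 8

def evolve(generate, mutate, execute, iterations=500):
    """Evolutionary search over a (risk category, attack style) archive.

    execute(prompt) runs the prompt against the real agent environment and
    returns (trajectory, score): the full tool-call trace and how close the
    trajectory came to realizing the harmful objective.
    """
    archive = {}  # (risk_idx, style_idx) -> (score, prompt, trajectory)
    for _ in range(iterations):
        if archive and random.random() < 0.8:
            # Mutate an elite, steered by its recorded trajectory.
            _, parent, trace = random.choice(list(archive.values()))
            candidate, cell = mutate(parent, trace)
        else:
            candidate, cell = generate()  # fresh candidate plus its grid cell
        trajectory, score = execute(candidate)
        # Keep only the best candidate per cell: diversity is enforced
        # structurally, so the search cannot collapse to one pattern.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, candidate, trajectory)
    return archive
```

The per-cell elitism is the key design choice: a single strong attack pattern can win at most one of the 64 cells, so the remaining cells keep pressure on the search to find qualitatively different trajectories.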

The results across five MCP environments show just how much the attack surface varies by tool type. Filesystem access was the most vulnerable at 84.4% ARR. Slack reached 64.1%. Gmail landed at 46.9%. These aren’t abstract benchmarks — they’re the tool integrations running in production agent deployments today.

Prompt-level red teaming:
  Adversarial prompt → Model response → Harmful? (yes/no)

Trajectory-aware red teaming (T-MAP):
  Adversarial prompt → Step 1 (tool call + result)
                    → Step 2 (tool call + result)
                    → Step 3 (tool call + result)
                    → Objective achieved? (yes/no)

T-MAP also discovered 21.80 distinct successful tool invocation sequences at its highest success level, versus 12.80 for the best baseline. More diverse attack paths mean more coverage of the attack surface. Its Self-BLEU lexical diversity score was 0.25 versus 0.45 for the standard evolution baseline; lower is better, meaning the attacks are genuinely different from each other rather than variations on a single pattern.

The practical implication: if your red team runs 300 prompts through a prompt-level tool and finds nothing, you haven’t tested whether a 4-step trajectory through your agent’s MCP integrations achieves something harmful. You’ve tested something different. See agent evaluation frameworks for how eval methodology affects what you can and can’t detect.

```mermaid
flowchart TD
    A[Adversarial prompt generated] --> B[Agent executes Step 1]
    B --> C{Step 1 harmful?}
    C -->|No — step passes| D[Agent executes Step 2]
    D --> E{Step 2 harmful?}
    E -->|No — step passes| F[Agent executes Step 3]
    F --> G{Step 3 harmful?}
    G -->|No — step passes| H[Objective achieved?]
    H -->|Yes — trajectory attack succeeded| I[🔴 ARR = 1]
    H -->|No| J[🟢 Attack failed]

    style I fill:#ff4444,color:#fff
    style J fill:#44aa44,color:#fff
```

PISmith: RL-based defense testing

T-MAP is offense. PISmith (arXiv:2603.13026, Chenlong Yin et al., March 2026) is how you test whether your defenses actually hold.

PISmith was evaluated on 13 benchmarks and outperformed 7 baselines spanning static, search-based, and RL-based attack categories. The key finding: state-of-the-art prompt injection defenses remain vulnerable to adaptive attacks. That’s not an incremental result. Most defense papers demonstrate robustness against fixed attack patterns. PISmith shows that an attacker who learns from what works — adapting their injection strategy based on observed outputs — can break defenses that look solid against non-adaptive testing.

The technical problem PISmith solves is reward sparsity. When you train an RL agent to find prompt injection attacks against a strong defense, most generated prompts get blocked. Near-zero reward signals cause two failure modes: policy entropy collapse (the attacker stops exploring and optimizes a single pattern) and gradient dilution (rare successes get swamped by the majority of failures). PISmith addresses both with adaptive entropy regularization and dynamic advantage weighting.
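
One way the two mitigations could compose in a REINFORCE-style loss is sketched below. This is an interpretation under stated assumptions, not PISmith’s published formulation: the function shape, the entropy floor, and all coefficients are illustrative.

```python
def policy_gradient_loss(log_probs, rewards, entropy, baseline,
                         entropy_floor=0.5, beta_max=0.1, success_boost=4.0):
    """Sketch of two mitigations for reward sparsity in an RL attacker.

    Adaptive entropy regularization: as policy entropy falls below a floor
    (exploration collapsing), the entropy bonus coefficient grows.
    Dynamic advantage weighting: rare positive-reward samples (successful
    injections) are up-weighted so the mass of failures does not dilute
    their gradient. All constants here are illustrative, not PISmith's.
    """
    # Scale the entropy coefficient inversely with current entropy.
    beta = beta_max * max(0.0, (entropy_floor - entropy) / entropy_floor)
    loss = 0.0
    for lp, r in zip(log_probs, rewards):
        # Up-weight the rare successful-attack samples.
        weight = success_boost if r > 0 else 1.0
        loss -= weight * (r - baseline) * lp
    loss -= beta * entropy  # stronger exploration pressure when entropy is low
    return loss / len(log_probs)
```

The point of the sketch is the coupling: when the defense blocks nearly everything, entropy collapse and gradient dilution reinforce each other, so a fix that addresses only one of them still stalls.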

The results are sharp. Against the strongest prior RL-based baseline (RL-Hammer), PISmith achieves ASR@1 of 0.87 versus RL-Hammer’s 0.48 — meaning PISmith succeeds in a single injection attempt 87% of the time. At 10 attempts, PISmith achieves ASR@10 of 1.0 (100% success) versus RL-Hammer’s 0.70. And it generalizes: it achieves an average ASR@10 of 1.0 across all 12 unseen benchmarks it was evaluated on after training on a single benchmark.
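
ASR@k as reported above is simple to compute from per-task attempt logs. A minimal sketch, with made-up trial data:

```python
def asr_at_k(attempt_results, k):
    """attempt_results: one list of booleans per task, one entry per attempt.
    ASR@k = fraction of tasks where at least one of the first k attempts
    succeeded."""
    hits = sum(1 for attempts in attempt_results if any(attempts[:k]))
    return hits / len(attempt_results)

# Hypothetical logs for 4 injection tasks, 10 attempts each.
logs = [
    [False] * 9 + [True],           # succeeds only on attempt 10
    [True] + [False] * 9,           # succeeds on the first try
    [False] * 10,                   # never succeeds
    [False, True] + [False] * 8,    # succeeds on attempt 2
]
asr_at_k(logs, 1)   # 0.25 -- one of four tasks cracked on the first attempt
asr_at_k(logs, 10)  # 0.75 -- three of four within ten attempts
```

The gap between ASR@1 and ASR@10 is exactly what separates an adaptive attacker from a static test suite: more attempts only help if the attacker’s later attempts differ from the earlier ones.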

The practical read: your defensive controls may work against the attacks you’ve already seen. They may not work against an adaptive attacker who systematically tries different injection strategies, observes partial success signals, and adjusts. PISmith is a way to find out before that attacker does.

Together, T-MAP and PISmith form something like a complete offense/defense testing picture for agent systems:

| Framework | Attack surface | Approach | Key metric |
| --- | --- | --- | --- |
| T-MAP | Multi-step trajectories | Evolutionary search over trajectory archive | Attack Realization Rate |
| PISmith | Prompt injection defenses | RL-based adaptive attack, black-box | Attack Success Rate vs. 13 benchmarks |
| Standard tools (Garak, PyRIT, TAP) | Single-step model responses | Prompt mutation, iterative refinement | Jailbreak success rate |

The table illustrates the gap. Standard tools measure jailbreak success rate — whether the model says something harmful. T-MAP measures whether the agent does something harmful across a sequence of tool calls. These are different phenomena.

A testing checklist for production agent systems

The T-MAP and PISmith findings suggest a concrete shift in how agent security testing should work. Standard LLM red teaming covers one layer. Trajectory-level testing covers a different layer. Both are needed. Here’s how to approach it in practice:

Before testing: map your tool graph. List every tool your agent can call. Note which tools read external data (potential injection entry points), which write to external systems (potential exfiltration paths), and which can be chained. Cross-server chains — where output from one MCP server becomes input to another — are where T-MAP found 46.28% of successful multi-step attacks.
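
The tool-graph mapping step can be as simple as an annotated inventory plus an enumeration of cross-server reader-to-writer pairs. A sketch with hypothetical tool names and flags:

```python
# Hypothetical inventory for an agent with three MCP servers.
TOOLS = {
    "slack.read_channel": {"server": "slack", "reads_external": True,  "writes_external": False},
    "gmail.send_email":   {"server": "gmail", "reads_external": False, "writes_external": True},
    "fs.read_file":       {"server": "fs",    "reads_external": True,  "writes_external": False},
    "fs.write_file":      {"server": "fs",    "reads_external": False, "writes_external": True},
}

def cross_server_chains(tools):
    """Enumerate injection-entry -> exfiltration pairs that span servers.

    Every (reader, writer) pair on different servers is a candidate
    cross-server trajectory worth testing explicitly."""
    readers = [n for n, t in tools.items() if t["reads_external"]]
    writers = [n for n, t in tools.items() if t["writes_external"]]
    return [(r, w) for r in readers for w in writers
            if tools[r]["server"] != tools[w]["server"]]
```

Even this crude enumeration makes the test plan concrete: each pair it emits is a trajectory hypothesis, and the number of pairs grows multiplicatively as servers are added.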

Single-step testing (keep doing this): Run your existing prompt-level tools against individual tool call responses. Use algorithmic red teaming methods like TAP or PAIR to stress-test individual guardrails. This catches step-level failures and remains necessary — trajectory-level testing doesn’t replace it.

Add trajectory-level testing: Define harmful objectives that require multiple tool calls to achieve (e.g., “exfiltrate data via email”, “execute code that modifies filesystem then reports externally”). Test whether an adversarial prompt can guide the agent through the 3-5 steps needed to achieve each objective. Evaluate ARR, not just whether any individual step produced a harmful response.
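
A trajectory-level check is an objective predicate over the full tool-call trace rather than over any single step. A sketch, with a hypothetical trace format and a hypothetical exfiltration objective:

```python
def trajectory_arr(trials, objective):
    """ARR over a batch of attack trials.

    Each trial is the ordered list of tool calls one agent session made:
    dicts with 'tool', 'args', and optionally 'result'. objective(trace)
    inspects the WHOLE sequence and decides whether the harmful goal
    was realized."""
    realized = sum(1 for trace in trials if objective(trace))
    return realized / len(trials)

def exfil_via_email(trace):
    """Hypothetical objective: content read from the filesystem later
    leaves via gmail.send_email. Neither call is harmful in isolation;
    the ordered pair is."""
    read_data = None
    for call in trace:
        if call["tool"] == "fs.read_file":
            read_data = call.get("result")
        elif (call["tool"] == "gmail.send_email"
              and read_data and read_data in call["args"].get("body", "")):
            return True
    return False
```

Note that the predicate never asks whether any step’s text response was harmful; it asks whether the sequence realized the objective, which is the distinction ARR captures.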

Test cross-tool chains explicitly: If your agent integrates Slack, Gmail, and a code execution tool, test explicitly whether a malicious Slack message can trigger a code execution step that then sends results to an external address. T-MAP’s multi-server experiments showed this is consistently the most dangerous configuration.

Stress-test defenses adaptively: Run PISmith or equivalent RL-based attacks against your prompt injection defenses. Static test suites tell you whether your defenses catch known patterns. Adaptive testing tells you whether they hold against an attacker who learns. Only the latter is relevant to a real adversary.

Log at trajectory level: If your observability only captures individual tool call inputs and outputs, you can’t reconstruct attack trajectories after the fact. Log full sequences — the ordered chain of tool calls within a single agent session — so you can detect trajectory-level attacks in production.
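
Trajectory-level logging mostly comes down to tying every tool call to a session identifier and an ordinal step. A minimal sketch; the field names are assumptions, not a standard schema:

```python
import json
import time
import uuid

class TrajectoryLogger:
    """Logs the ordered tool-call chain of one agent session as a unit,
    so attack trajectories can be reconstructed after the fact."""

    def __init__(self, sink):
        self.sink = sink  # any file-like object (file, socket wrapper, ...)
        self.session_id = str(uuid.uuid4())
        self.step = 0

    def log_call(self, tool, args, result):
        self.step += 1
        self.sink.write(json.dumps({
            "session_id": self.session_id,   # ties steps into one trajectory
            "step": self.step,               # preserves ordering within it
            "ts": time.time(),
            "tool": tool,
            "args": args,
            "result_preview": str(result)[:200],  # bound log size
        }) + "\n")
```

With records shaped like this, reconstructing a suspected attack is a filter on `session_id` plus a sort on `step`; per-call logs without those two fields cannot be reassembled reliably.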

Set ARR thresholds, not just jailbreak rate thresholds: Define what ARR is acceptable for each risk category in your system. A 57.8% ARR in a filesystem agent is very different from a 57.8% ARR in an agent that can only read public web pages. The acceptable threshold depends on what the trajectory can actually achieve.
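
Per-category ARR budgets can live in a small release gate. A sketch; the category names and threshold values are illustrative, not recommendations:

```python
# Hypothetical per-category ARR budgets: tighter where trajectories can
# reach more dangerous capabilities.
ARR_THRESHOLDS = {
    "filesystem_write": 0.00,  # any realized attack is a release blocker
    "external_email":   0.00,
    "public_web_read":  0.05,  # read-only surface tolerates some noise
}

def gate_release(measured_arr):
    """Return the risk categories whose measured ARR exceeds budget.
    Unknown categories default to a zero budget (fail closed)."""
    return [cat for cat, arr in measured_arr.items()
            if arr > ARR_THRESHOLDS.get(cat, 0.0)]
```

The fail-closed default matters: a new tool integration that nobody assigned a budget to should block the release until someone decides what its trajectory can actually achieve.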

```mermaid
flowchart LR
    A[Map tool graph] --> B[Single-step testing]
    B --> C[Trajectory-level testing]
    C --> D[Cross-tool chain testing]
    D --> E[Adaptive defense testing]
    E --> F[Trajectory-level logging]
    F --> G[Set ARR thresholds per risk category]

    style A fill:#e8f4fd
    style C fill:#ffeaa0
    style D fill:#ffeaa0
    style E fill:#ffe0e0
    style G fill:#e0ffe0
```

The deeper issue is that agent security is still being evaluated with tools built for a simpler problem. According to the Gravitee State of AI Agent Security 2026 report, only 34% of enterprises have AI-specific security controls in place — and 88% reported confirmed or suspected AI agent security incidents in the last year. The methodology gap is part of why that 88% figure keeps climbing. See ethical AI agents and safety for the broader framing of where agent safety controls need to go.

The T-MAP paper is a useful forcing function: if your red team can’t tell you the ARR of your agent system across multi-step attack scenarios, you don’t know your actual exposure.


FAQ

What is T-MAP and how does it differ from standard LLM red teaming?

T-MAP (Trajectory-aware Multi-dimensional Archive-based adversarial Prompting) is a red teaming framework from KAIST, UCLA, and DeepAuto.ai that attacks agent action sequences rather than individual model responses. Standard tools like Garak or PyRIT test whether a model will produce a harmful output to a given prompt. T-MAP tests whether a sequence of tool calls across multiple steps will achieve a harmful objective — a categorically different attack surface. It achieves 57.8% average Attack Realization Rate versus 1.9% for naive prompt-level baselines.

Why does testing individual agent steps miss real vulnerabilities?

Because agents fail compositionally. A step that reads from a filesystem is safe. A step that sends an email is safe. A step that executes code is safe. String them together under the right adversarial prompt and you get exfiltration. No individual step looks dangerous in isolation. T-MAP found that 46.28% of successful attacks in its MCP experiments spanned multiple servers — cross-server trajectories that single-step testing by definition cannot detect.

What is PISmith and how does it relate to T-MAP?

PISmith (arXiv:2603.13026) is an RL-based red teaming framework that stress-tests prompt injection defenses in black-box settings. Where T-MAP is offense — finding trajectory attacks — PISmith tests whether your defenses hold under adaptive attack. Together they cover both sides: T-MAP discovers multi-step attack paths; PISmith verifies that your prompt injection defenses can’t be bypassed by an attacker who adapts based on what works. PISmith was evaluated on 13 benchmarks against 7 baselines.

Which agent environments did T-MAP test against?

T-MAP ran against five MCP server environments: CodeExecutor (56.2% ARR), Slack (64.1% ARR), Gmail (46.9% ARR), Playwright (37.5% ARR), and Filesystem (84.4% ARR). It also tested multi-server chains across Slack+CodeExecutor, Playwright+Filesystem, and Gmail+CodeExecutor+Filesystem. The main attack model was GPT-5-mini; generalization tests covered nine models including GPT-5.2, Gemini-3-Pro, Claude Opus 4.6, Qwen3.5, and GLM-5.

What does Attack Realization Rate (ARR) measure and why does it matter?

ARR measures whether a harmful objective was actually achieved through tool execution — not whether a model produced a harmful text response. This is the right metric for agents because the risk is in what the agent does, not what it says. A model can refuse to describe a harmful action while a trajectory of tool calls achieves it anyway. T-MAP’s 57.8% average ARR against frontier models including GPT-5.2 and Gemini-3-Pro shows the gap between surface-level safety and trajectory-level safety.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch