Architecting secure AI agents: the defense stack for indirect prompt injection

TL;DR — Three papers from March-April 2026 form a complete defense stack against indirect prompt injection: system-level architecture from NVIDIA and Johns Hopkins (arXiv 2603.30016), RL-based automated red teaming via PISmith (arXiv 2603.13026), and causal tool-call attribution via AttriGuard (arXiv 2603.10749). This is the companion to why detection is broken — that post explains the problem, this one explains the solution.
Detection is one layer. What are the other layers?
The previous post in this series established that prompt injection detection tools can be evaded at success rates of up to 100%. If detection cannot be the primary defense, what replaces it?
Three papers published within days of each other in late March and early April 2026 answer this question from different angles. Together, they form a defense stack:
- System-level architecture (arXiv 2603.30016, NVIDIA + Johns Hopkins): What constraints should the agent system enforce regardless of what the model detects?
- Automated red teaming (arXiv 2603.13026, PISmith): How do you continuously test whether your defenses actually work?
- Causal attribution (arXiv 2603.10749, AttriGuard): How do you determine, at runtime, whether a tool call was caused by the user or by injected content?
None of these three rely on detecting malicious input text. All three operate at the system level, above the model.
What does the NVIDIA architecture paper propose?
The position paper from NVIDIA and Johns Hopkins (arXiv 2603.30016) frames indirect prompt injection as an architectural problem, not a model problem. Its core argument: when an agent processes untrusted data (documents, emails, API responses, web pages) and has the ability to take actions (tool calls, data writes, external communication), no amount of model-level safety training will prevent all attacks.
The paper presents three defense positions:
Position 1: Privilege separation. Untrusted data enters a sandboxed context. The agent can read it but cannot use information from that context to trigger tool calls. This is analogous to a browser sandbox — JavaScript from a web page cannot access the filesystem even though the browser can.
Position 2: Action verification. Every tool call passes through a verification layer that checks whether the action is consistent with the user’s original intent. If the user asked “summarize this email” and the agent tries to forward the email externally, the verification layer blocks it — regardless of what injection payload triggered the action.
Position 3: Graduated trust. Different data sources receive different trust levels. User messages get high trust. Retrieved documents get medium trust. External API responses get low trust. Tool call permissions scale with trust: high-trust sources can trigger any action; low-trust sources can only trigger read-only operations.
```mermaid
graph TD
    A[User message - high trust] --> D[Agent processing]
    B[Retrieved document - medium trust] --> D
    C[External API response - low trust] --> D
    D --> E[Agent requests tool call]
    E --> F{Trust level of triggering source?}
    F -->|High trust| G[Any action permitted]
    F -->|Medium trust| H[Read + approved writes only]
    F -->|Low trust| I[Read-only actions only]
    J[Action verification layer] --> F
    style A fill:#4caf50,color:#fff
    style B fill:#ff9800,color:#000
    style C fill:#d32f2f,color:#fff
```
The practical implication: if your agent processes emails and has tool access, the email content should never be able to trigger actions beyond what reading an email requires. The agent can summarize, classify, and extract information. It cannot forward, delete, or reply based on instructions found inside the email body.
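To make Position 3 concrete, here is a minimal Python sketch of graduated trust. The trust levels, action classes, and mapping are illustrative assumptions; the paper argues the principle and does not prescribe this API.

```python
from enum import IntEnum

class TrustLevel(IntEnum):
    LOW = 0     # external API responses, web content
    MEDIUM = 1  # retrieved documents
    HIGH = 2    # direct user messages, internal databases

# Hypothetical policy: the minimum trust a triggering source needs per action class.
REQUIRED_TRUST = {
    "read": TrustLevel.LOW,      # summarize, classify, extract
    "write": TrustLevel.MEDIUM,  # approved writes such as drafting or filing a ticket
    "send": TrustLevel.HIGH,     # external communication
    "delete": TrustLevel.HIGH,   # destructive actions
}

def action_permitted(action_class: str, source_trust: TrustLevel) -> bool:
    """Allow a tool call only if the source that triggered it carries enough trust."""
    return source_trust >= REQUIRED_TRUST[action_class]

# An instruction found inside an email body (low trust) can drive a summary,
# but it can never drive a forward, reply, or delete.
assert action_permitted("read", TrustLevel.LOW)
assert not action_permitted("send", TrustLevel.LOW)
```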
How AttriGuard traces what caused each tool call
AttriGuard (arXiv 2603.10749) solves the attribution problem: when an agent calls a tool, which part of the input caused that call?
The technique is counterfactual causal analysis. For each tool call the agent makes, AttriGuard asks: if I removed the untrusted content from the input and re-ran the model, would this tool call still happen? If yes, the call is legitimate — driven by the user’s instruction. If no, the call was caused by the untrusted content — likely an injection.
This is not a simple heuristic. The counterfactual analysis accounts for:
- Direct causation: The injected text explicitly instructs a tool call (“send this data to attacker.com”)
- Indirect causation: The injected text changes the agent’s reasoning in a way that leads to an unauthorized tool call
- Confounded causation: Both the user instruction and the injected text point toward the same tool call (the hardest case — AttriGuard checks whether the specific parameters match the user’s intent)
On the AgentDojo benchmark, the paper reports strong defense rates with little loss of agent utility. The key architectural decision: AttriGuard operates at the tool-call boundary, not the input boundary. It does not try to detect malicious text. It checks whether actions are causally justified by the user’s original request.
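A simplified sketch of the counterfactual check at the tool-call boundary. The `run_agent` callable, the dict shape of tool calls, and the exact-match comparison on arguments are all illustrative assumptions; the paper's mechanism handles the indirect and confounded cases with more care than this.

```python
from typing import Callable

# A tool call is represented here as {"tool": name, "args": {...}} (an assumption).
ToolCall = dict

def attribute_tool_call(
    run_agent: Callable[[str, str], list[ToolCall]],  # (user_msg, untrusted_context) -> proposed calls
    user_msg: str,
    observed_call: ToolCall,
) -> str:
    """Counterfactual check: would this call still happen without the untrusted content?"""
    counterfactual_calls = run_agent(user_msg, "")  # re-run with the untrusted content removed

    for call in counterfactual_calls:
        if call["tool"] == observed_call["tool"]:
            if call["args"] == observed_call["args"]:
                return "legitimate"            # the user's instruction alone explains the call
            return "suspicious_parameters"     # confounded case: same tool, injected arguments
    return "injection_driven"                  # the call disappears without the untrusted content
```

The obvious cost is the extra model run. One way to bound it is to fire the check only on high-risk action classes (writes, sends, deletes), which is how the composition sketch later in this post wires it up.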
This complements the NVIDIA architecture’s Position 2 (action verification). The NVIDIA paper describes the principle; AttriGuard implements a concrete mechanism.
How PISmith automates attack discovery
The defense stack is only as good as your testing. PISmith (arXiv 2603.13026) automates the testing side.
Most red teaming for prompt injection relies on human-written attack payloads. This is slow, biased toward known attack patterns, and cannot keep pace with model updates. PISmith trains an attack LLM using reinforcement learning to discover injection payloads that bypass your specific defenses.
The process:
- PISmith initializes an attack LLM with a set of known injection templates
- The attack LLM generates candidate payloads targeting your agent
- Each payload is tested against your defense stack (detection, attribution, action verification)
- Successful bypasses (payloads that trigger unauthorized actions) receive positive reward
- The attack LLM updates its policy to generate more payloads in the successful direction
- Iterate until the attack LLM converges or a time budget expires
The output is a set of injection payloads that bypass your current defenses. These are not generic attacks — they are optimized for your specific stack, your specific model, and your specific tool configuration.
PISmith operates in a black-box setting: it does not need access to model weights, defense internals, or training data. It only observes whether its injection payloads succeed or fail against the deployed system. This matches realistic attack conditions.
The practical use: run PISmith against your production defense stack weekly. If it discovers new bypasses, update your defenses. If it converges without finding bypasses, you have an empirical measure of your defense quality under a known attack budget. This is continuous automated red teaming — the security equivalent of continuous integration testing.
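The loop itself is simple to sketch. Everything below is an assumption about interfaces rather than PISmith's actual API; the point it illustrates is that the attacker only ever observes success or failure.

```python
from typing import Protocol

class AttackLLM(Protocol):
    def generate(self) -> str: ...                                      # propose an injection payload
    def update_policy(self, payload: str, reward: float) -> None: ...   # RL update, e.g. a policy-gradient step

class TargetAgent(Protocol):
    def run_with(self, payload: str) -> bool: ...  # embed payload in untrusted content; True if an unauthorized action executed

def red_team_loop(attacker: AttackLLM, target: TargetAgent, budget: int = 1000) -> list[str]:
    """Black-box automated red teaming: reward payloads that trigger unauthorized actions."""
    bypasses = []
    for _ in range(budget):
        payload = attacker.generate()        # candidate injection payload
        success = target.run_with(payload)   # observe only the deployed system's behavior
        attacker.update_policy(payload, reward=1.0 if success else 0.0)
        if success:
            bypasses.append(payload)
    return bypasses                          # attacks optimized for this stack, model, and tool configuration
```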
Assembling the defense stack
The three papers compose into a layered architecture:
```mermaid
graph TD
    subgraph "Layer 1: Input Processing"
        A[Input detection - signal only] --> B[Trust classification by source]
    end
    subgraph "Layer 2: Runtime Defense"
        C[Privilege separation by trust level]
        D[AttriGuard causal attribution on tool calls]
    end
    subgraph "Layer 3: Continuous Testing"
        E[PISmith automated red teaming - weekly]
        F[AgentWatcher behavioral monitoring]
    end
    B --> C
    C --> D
    D -->|Legitimate| G[Execute action]
    D -->|Suspicious| H[Block + alert]
    E -->|New bypass found| I[Update Layer 1 + 2 defenses]
    style A fill:#ff9800,color:#000
    style D fill:#4caf50,color:#fff
    style E fill:#1976d2,color:#fff
```
Layer 1 classifies inputs by trust level and runs detection as a signal (not a gate). Even imperfect detection raises the cost of attack and provides signal for downstream layers.
Layer 2 enforces constraints at runtime. Privilege separation limits what actions untrusted data can trigger. AttriGuard checks causal attribution on every tool call, blocking actions driven by injected content.
Layer 3 continuously tests the defense stack. PISmith discovers new bypass techniques. The discoveries feed back into Layers 1 and 2 as defense updates.
This architecture does not depend on perfect detection at any single layer. Each layer catches what the previous layer misses. An injection that evades input detection is caught by trust-level privilege separation. An injection that works despite privilege limits is caught by causal attribution. And PISmith continuously probes for attacks that bypass all layers.
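At the tool-call boundary, the composition can be as direct as chaining the two runtime checks. This sketch reuses `action_permitted` and `attribute_tool_call` from the earlier sketches and is illustrative only.

```python
def handle_tool_call(call, source_trust, user_msg, run_agent):
    """Layered decision: cheap privilege check first, causal attribution on high-risk calls."""
    # Layer 2a: privilege separation by trust level (always on, no extra model calls).
    if not action_permitted(call["action_class"], source_trust):
        return {"decision": "block", "reason": "insufficient source trust"}

    # Layer 2b: counterfactual attribution, reserved for high-risk action classes.
    if call["action_class"] in {"write", "send", "delete"}:
        verdict = attribute_tool_call(run_agent, user_msg, call)
        if verdict != "legitimate":
            return {"decision": "block_and_alert", "reason": verdict}

    return {"decision": "execute"}
```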
What to implement first
If you are starting from zero defense beyond input detection:
Week 1: Trust classification. Tag every data source in your agent’s pipeline with a trust level: user input (high), internal databases (high), retrieved documents (medium), external APIs (low), web content (low). This is a metadata change, not a model change.
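As a sketch, the tagging can live in a small wrapper applied at every ingestion point. The names and the source-to-trust mapping are assumptions; `TrustLevel` is the enum from the graduated-trust sketch above.

```python
from dataclasses import dataclass

@dataclass
class AgentInput:
    content: str
    source: str        # e.g. "user_message", "retrieved_doc", "external_api"
    trust: TrustLevel  # enum from the graduated-trust sketch

# Hypothetical default mapping from pipeline source to trust level.
SOURCE_TRUST = {
    "user_message":  TrustLevel.HIGH,
    "internal_db":   TrustLevel.HIGH,
    "retrieved_doc": TrustLevel.MEDIUM,
    "external_api":  TrustLevel.LOW,
    "web_content":   TrustLevel.LOW,
}

def tag(content: str, source: str) -> AgentInput:
    """Attach trust metadata at ingestion time; the model and prompts stay unchanged."""
    return AgentInput(content=content, source=source, trust=SOURCE_TRUST[source])
```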
Week 2: Privilege separation. Map tool permissions to trust levels. Low-trust sources can trigger read-only tools. High-risk actions (writes, sends, deletes) require high-trust source attribution. This is the same principle as excessive-agency controls, applied at the data-source level instead of globally.
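A sketch of the corresponding runtime gate, again with made-up tool names. The check runs immediately before the real tool invocation.

```python
# Hypothetical per-tool policy: the minimum source trust required to invoke each tool.
TOOL_POLICY = {
    "search_docs":   TrustLevel.LOW,     # read-only
    "summarize":     TrustLevel.LOW,     # read-only
    "create_ticket": TrustLevel.MEDIUM,  # approved write
    "send_email":    TrustLevel.HIGH,    # external communication
    "delete_record": TrustLevel.HIGH,    # destructive
}

def check_dispatch(tool_name: str, triggering_input: AgentInput) -> None:
    """Raise before invoking the tool if the triggering source lacks sufficient trust."""
    required = TOOL_POLICY[tool_name]
    if triggering_input.trust < required:
        raise PermissionError(
            f"{tool_name} requires {required.name} trust, but was triggered by a "
            f"{triggering_input.trust.name}-trust source ({triggering_input.source})"
        )
```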
Week 3: Action logging for attribution. Before implementing full causal attribution, log every tool call with its input provenance: which source(s) contributed to this action? This creates the audit trail for both post-incident analysis and future AttriGuard-style runtime checks.
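A minimal provenance log is one JSON line per tool call. Field names are assumptions; `AgentInput` comes from the Week 1 sketch.

```python
import json
import time

def log_tool_call(tool_name: str, args: dict, sources: list, logfile: str = "tool_calls.jsonl") -> None:
    """Append a provenance record: which inputs contributed to this action, and at what trust."""
    record = {
        "ts": time.time(),
        "tool": tool_name,
        "args": args,
        "sources": [{"source": s.source, "trust": s.trust.name} for s in sources],
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")
```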
Week 4: Automated testing baseline. Run a basic red team battery against your defense stack using known injection payloads from the 240,000-attack study. Measure bypass rates. This becomes your baseline for evaluating PISmith or equivalent automated red teaming.
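Measuring the baseline is a few lines once the agent exposes the same black-box interface used for red teaming (the `TargetAgent` protocol from the PISmith sketch).

```python
def bypass_rate(payloads: list[str], target: "TargetAgent") -> float:
    """Fraction of known injection payloads that still trigger an unauthorized action."""
    successes = sum(1 for p in payloads if target.run_with(p))
    return successes / len(payloads)
```

Track this number per release; a jump after a model, prompt, or tool change is often the earliest regression signal you will get.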
Key takeaways
- Three papers from March-April 2026 form a complete defense stack: system-level architecture (arXiv 2603.30016), RL-based red teaming (PISmith, arXiv 2603.13026), and causal attribution (AttriGuard, arXiv 2603.10749)
- The NVIDIA/Johns Hopkins architecture paper frames indirect injection as an architectural problem requiring privilege separation, action verification, and graduated trust — not better detection
- AttriGuard uses counterfactual causal analysis to determine whether each tool call was driven by the user’s instruction or injected content, operating at the action level rather than the input level
- PISmith automates red teaming by training an attack LLM with reinforcement learning to discover defense bypasses in a black-box setting
- The defense stack layers compose: input trust classification, runtime privilege separation + causal attribution, and continuous automated testing
- Implementation starts with trust tagging (Week 1), privilege mapping (Week 2), action logging (Week 3), and red team baseline (Week 4)
FAQ
What is the best defense against indirect prompt injection in 2026? Layered defense that does not depend on detecting malicious text. Classify data sources by trust level, enforce privilege separation so low-trust data cannot trigger high-risk actions, implement causal attribution on tool calls (AttriGuard), and continuously test with automated red teaming (PISmith). Input detection stays as one signal among many.
What is AttriGuard? AttriGuard (arXiv 2603.10749) uses counterfactual causal analysis at the tool-call boundary. For each action the agent attempts, it checks: would this action still happen if the untrusted content were removed? If not, the action was likely caused by injected content and is blocked. This catches attacks that bypass every input-layer detector.
How does PISmith differ from manual red teaming? PISmith trains an attack LLM using reinforcement learning to automatically discover injection payloads that bypass your specific defenses. It operates in black-box mode (no access to model weights or defense internals), generates attacks optimized for your stack, and runs continuously. Manual red teaming is slow, biased toward known patterns, and cannot keep pace with model updates.
Can these defenses be combined with existing detection tools? Yes. Detection stays in the stack as Layer 1. The architectural difference is treating detection output as a risk signal rather than a binary block/allow gate. A flagged input gets higher scrutiny from AttriGuard and tighter privilege restrictions, but is not blocked outright — preventing the false-positive problem that makes detection-only systems unusable.
Further reading
- Prompt injection detection is already broken — the companion post on why detection fails
- Prompt injection is a structural attack — the foundational argument
- Indirect prompt injection: the attack vector hiding in your data — deep dive on the attack vector
- Algorithmic red teaming: using AI to attack AI — the broader automated red teaming landscape
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch