Architecting secure AI agents: the defense stack for indirect prompt injection

TL;DR — Three papers from March-April 2026 form a complete defense stack against indirect prompt injection: system-level architecture from NVIDIA and Johns Hopkins (arXiv 2603.30016), RL-based automated red teaming via PISmith (arXiv 2603.13026), and causal tool-call attribution via AttriGuard (arXiv 2603.10749). This is the companion to why detection is broken — that post explains the problem, this one explains the solution.
Detection is one layer. What are the other layers?
The previous post in this series established that prompt injection detection tools can be evaded at success rates of up to 100%. If detection cannot be the primary defense, what replaces it?
Three papers published within days of each other in late March and early April 2026 answer this question from different angles. Together, they form a defense stack:
- System-level architecture (arXiv 2603.30016, NVIDIA + Johns Hopkins): What constraints should the agent system enforce regardless of what the model detects?
- Automated red teaming (arXiv 2603.13026, PISmith): How do you continuously test whether your defenses actually work?
- Causal attribution (arXiv 2603.10749, AttriGuard): How do you determine, at runtime, whether a tool call was caused by the user or by injected content?
None of these three rely on detecting malicious input text. All three operate at the system level, above the model.
What does the NVIDIA architecture paper propose?
The position paper from NVIDIA and Johns Hopkins (arXiv 2603.30016) frames indirect prompt injection as an architectural problem, not a model problem. Its core argument: when an agent processes untrusted data (documents, emails, API responses, web pages) and has the ability to take actions (tool calls, data writes, external communication), no amount of model-level safety training will prevent all attacks.
The paper presents three defense positions:
Position 1: Privilege separation. Untrusted data enters a sandboxed context. The agent can read it but cannot use information from that context to trigger tool calls. This is analogous to a browser sandbox — JavaScript from a web page cannot access the filesystem even though the browser can.
Position 2: Action verification. Every tool call passes through a verification layer that checks whether the action is consistent with the user’s original intent. If the user asked “summarize this email” and the agent tries to forward the email externally, the verification layer blocks it — regardless of what injection payload triggered the action.
Position 3: Graduated trust. Different data sources receive different trust levels. User messages get high trust. Retrieved documents get medium trust. External API responses get low trust. Tool call permissions scale with trust: high-trust sources can trigger any action; low-trust sources can only trigger read-only operations.
```mermaid
graph TD
    A[User message - high trust] --> D[Agent processing]
    B[Retrieved document - medium trust] --> D
    C[External API response - low trust] --> D
    D --> E[Agent requests tool call]
    E --> F{Trust level of triggering source?}
    F -->|High trust| G[Any action permitted]
    F -->|Medium trust| H[Read + approved writes only]
    F -->|Low trust| I[Read-only actions only]
    J[Action verification layer] --> F
    style A fill:#4caf50,color:#fff
    style B fill:#ff9800,color:#000
    style C fill:#d32f2f,color:#fff
```
The practical implication: if your agent processes emails and has tool access, the email content should never be able to trigger actions beyond what reading an email requires. The agent can summarize, classify, and extract information. It cannot forward, delete, or reply based on instructions found inside the email body.
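To make Position 3 concrete, here is a minimal Python sketch of graduated trust. The trust levels, action classes, and mapping are illustrative assumptions; the paper argues the principle and does not prescribe this API.

```python
from enum import IntEnum

class TrustLevel(IntEnum):
    LOW = 0     # external API responses, web content
    MEDIUM = 1  # retrieved documents
    HIGH = 2    # direct user messages, internal databases

# Hypothetical policy: the minimum trust a triggering source needs per action class.
REQUIRED_TRUST = {
    "read": TrustLevel.LOW,      # summarize, classify, extract
    "write": TrustLevel.MEDIUM,  # approved writes such as drafting or filing a ticket
    "send": TrustLevel.HIGH,     # external communication
    "delete": TrustLevel.HIGH,   # destructive actions
}

def action_permitted(action_class: str, source_trust: TrustLevel) -> bool:
    """Allow a tool call only if the source that triggered it carries enough trust."""
    return source_trust >= REQUIRED_TRUST[action_class]

# An instruction found inside an email body (low trust) can drive a summary,
# but it can never drive a forward, reply, or delete.
assert action_permitted("read", TrustLevel.LOW)
assert not action_permitted("send", TrustLevel.LOW)
```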
How AttriGuard traces what caused each tool call
AttriGuard (arXiv 2603.10749) solves the attribution problem: when an agent calls a tool, which part of the input caused that call?
The technique is counterfactual causal analysis. For each tool call the agent makes, AttriGuard asks: if I removed the untrusted content from the input and re-ran the model, would this tool call still happen? If yes, the call is legitimate — driven by the user’s instruction. If no, the call was caused by the untrusted content — likely an injection.
This is not a simple heuristic. The counterfactual analysis accounts for:
- Direct causation: The injected text explicitly instructs a tool call (“send this data to attacker.com”)
- Indirect causation: The injected text changes the agent’s reasoning in a way that leads to an unauthorized tool call
- Confounded causation: Both the user instruction and the injected text point toward the same tool call (the hardest case — AttriGuard checks whether the specific parameters match the user’s intent)
On the AgentDojo benchmark, the paper reports strong defense rates with little loss of agent utility. The key architectural decision: AttriGuard operates at the tool-call boundary, not the input boundary. It does not try to detect malicious text. It checks whether actions are causally justified by the user’s original request.
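A simplified sketch of the counterfactual check at the tool-call boundary. The `run_agent` callable, the dict shape of tool calls, and the exact-match comparison on arguments are all illustrative assumptions; the paper's mechanism handles the indirect and confounded cases with more care than this.

```python
from typing import Callable

# A tool call is represented here as {"tool": name, "args": {...}} (an assumption).
ToolCall = dict

def attribute_tool_call(
    run_agent: Callable[[str, str], list[ToolCall]],  # (user_msg, untrusted_context) -> proposed calls
    user_msg: str,
    observed_call: ToolCall,
) -> str:
    """Counterfactual check: would this call still happen without the untrusted content?"""
    counterfactual_calls = run_agent(user_msg, "")  # re-run with the untrusted content removed

    for call in counterfactual_calls:
        if call["tool"] == observed_call["tool"]:
            if call["args"] == observed_call["args"]:
                return "legitimate"            # the user's instruction alone explains the call
            return "suspicious_parameters"     # confounded case: same tool, injected arguments
    return "injection_driven"                  # the call disappears without the untrusted content
```

The obvious cost is the extra model run. One way to bound it is to fire the check only on high-risk action classes (writes, sends, deletes), which is how the composition sketch later in this post wires it up.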
This complements the NVIDIA architecture’s Position 2 (action verification). The NVIDIA paper describes the principle; AttriGuard implements a concrete mechanism.
How PISmith automates attack discovery
The defense stack is only as good as your testing. PISmith (arXiv 2603.13026) automates the testing side.
Most red teaming for prompt injection relies on human-written attack payloads. This is slow, biased toward known attack patterns, and cannot keep pace with model updates. PISmith trains an attack LLM using reinforcement learning to discover injection payloads that bypass your specific defenses.
The process:
- PISmith initializes an attack LLM with a set of known injection templates
- The attack LLM generates candidate payloads targeting your agent
- Each payload is tested against your defense stack (detection, attribution, action verification)
- Successful bypasses (payloads that trigger unauthorized actions) receive positive reward
- The attack LLM updates its policy to generate more payloads in the successful direction
- Iterate until the attack LLM converges or a time budget expires
The output is a set of injection payloads that bypass your current defenses. These are not generic attacks — they are optimized for your specific stack, your specific model, and your specific tool configuration.
PISmith operates in a black-box setting: it does not need access to model weights, defense internals, or training data. It only observes whether its injection payloads succeed or fail against the deployed system. This matches realistic attack conditions.
The practical use: run PISmith against your production defense stack weekly. If it discovers new bypasses, update your defenses. If it converges without finding bypasses, you have an empirical measure of your defense quality under a known attack budget. This is continuous automated red teaming — the security equivalent of continuous integration testing.
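The loop itself is simple to sketch. Everything below is an assumption about interfaces rather than PISmith's actual API; the point it illustrates is that the attacker only ever observes success or failure.

```python
from typing import Protocol

class AttackLLM(Protocol):
    def generate(self) -> str: ...                                      # propose an injection payload
    def update_policy(self, payload: str, reward: float) -> None: ...   # RL update, e.g. a policy-gradient step

class TargetAgent(Protocol):
    def run_with(self, payload: str) -> bool: ...  # embed payload in untrusted content; True if an unauthorized action executed

def red_team_loop(attacker: AttackLLM, target: TargetAgent, budget: int = 1000) -> list[str]:
    """Black-box automated red teaming: reward payloads that trigger unauthorized actions."""
    bypasses = []
    for _ in range(budget):
        payload = attacker.generate()        # candidate injection payload
        success = target.run_with(payload)   # observe only the deployed system's behavior
        attacker.update_policy(payload, reward=1.0 if success else 0.0)
        if success:
            bypasses.append(payload)
    return bypasses                          # attacks optimized for this stack, model, and tool configuration
```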
Assembling the defense stack
The three papers compose into a layered architecture:
```mermaid
graph TD
    subgraph "Layer 1: Input Processing"
        A[Input detection - signal only] --> B[Trust classification by source]
    end
    subgraph "Layer 2: Runtime Defense"
        C[Privilege separation by trust level]
        D[AttriGuard causal attribution on tool calls]
    end
    subgraph "Layer 3: Continuous Testing"
        E[PISmith automated red teaming - weekly]
        F[AgentWatcher behavioral monitoring]
    end
    B --> C
    C --> D
    D -->|Legitimate| G[Execute action]
    D -->|Suspicious| H[Block + alert]
    E -->|New bypass found| I[Update Layer 1 + 2 defenses]
    style A fill:#ff9800,color:#000
    style D fill:#4caf50,color:#fff
    style E fill:#1976d2,color:#fff
```
Layer 1 classifies inputs by trust level and runs detection as a signal (not a gate). Even imperfect detection raises the cost of attack and provides signal for downstream layers.
Layer 2 enforces constraints at runtime. Privilege separation limits what actions untrusted data can trigger. AttriGuard checks causal attribution on every tool call, blocking actions driven by injected content.
Layer 3 continuously tests the defense stack. PISmith discovers new bypass techniques. The discoveries feed back into Layers 1 and 2 as defense updates.
This architecture does not depend on perfect detection at any single layer. Each layer catches what the previous layer misses. An injection that evades input detection is caught by trust-level privilege separation. An injection that works despite privilege limits is caught by causal attribution. And PISmith continuously probes for attacks that bypass all layers.
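At the tool-call boundary, the composition can be as direct as chaining the two runtime checks. This sketch reuses `action_permitted` and `attribute_tool_call` from the earlier sketches and is illustrative only.

```python
def handle_tool_call(call, source_trust, user_msg, run_agent):
    """Layered decision: cheap privilege check first, causal attribution on high-risk calls."""
    # Layer 2a: privilege separation by trust level (always on, no extra model calls).
    if not action_permitted(call["action_class"], source_trust):
        return {"decision": "block", "reason": "insufficient source trust"}

    # Layer 2b: counterfactual attribution, reserved for high-risk action classes.
    if call["action_class"] in {"write", "send", "delete"}:
        verdict = attribute_tool_call(run_agent, user_msg, call)
        if verdict != "legitimate":
            return {"decision": "block_and_alert", "reason": verdict}

    return {"decision": "execute"}
```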
What to implement first
If you are starting from zero defense beyond input detection:
Week 1: Trust classification. Tag every data source in your agent’s pipeline with a trust level: user input (high), internal databases (high), retrieved documents (medium), external APIs (low), web content (low). This is a metadata change, not a model change.
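As a sketch, the tagging can live in a small wrapper applied at every ingestion point. The names and the source-to-trust mapping are assumptions; `TrustLevel` is the enum from the graduated-trust sketch above.

```python
from dataclasses import dataclass

@dataclass
class AgentInput:
    content: str
    source: str        # e.g. "user_message", "retrieved_doc", "external_api"
    trust: TrustLevel  # enum from the graduated-trust sketch

# Hypothetical default mapping from pipeline source to trust level.
SOURCE_TRUST = {
    "user_message":  TrustLevel.HIGH,
    "internal_db":   TrustLevel.HIGH,
    "retrieved_doc": TrustLevel.MEDIUM,
    "external_api":  TrustLevel.LOW,
    "web_content":   TrustLevel.LOW,
}

def tag(content: str, source: str) -> AgentInput:
    """Attach trust metadata at ingestion time; the model and prompts stay unchanged."""
    return AgentInput(content=content, source=source, trust=SOURCE_TRUST[source])
```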
Week 2: Privilege separation. Map tool permissions to trust levels. Low-trust sources can trigger read-only tools. High-risk actions (writes, sends, deletes) require high-trust source attribution. This is the same principle as excessive-agency controls, applied at the data-source level instead of globally.
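A sketch of the corresponding runtime gate, again with made-up tool names. The check runs immediately before the real tool invocation.

```python
# Hypothetical per-tool policy: the minimum source trust required to invoke each tool.
TOOL_POLICY = {
    "search_docs":   TrustLevel.LOW,     # read-only
    "summarize":     TrustLevel.LOW,     # read-only
    "create_ticket": TrustLevel.MEDIUM,  # approved write
    "send_email":    TrustLevel.HIGH,    # external communication
    "delete_record": TrustLevel.HIGH,    # destructive
}

def check_dispatch(tool_name: str, triggering_input: AgentInput) -> None:
    """Raise before invoking the tool if the triggering source lacks sufficient trust."""
    required = TOOL_POLICY[tool_name]
    if triggering_input.trust < required:
        raise PermissionError(
            f"{tool_name} requires {required.name} trust, but was triggered by a "
            f"{triggering_input.trust.name}-trust source ({triggering_input.source})"
        )
```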
Week 3: Action logging for attribution. Before implementing full causal attribution, log every tool call with its input provenance: which source(s) contributed to this action? This creates the audit trail for both post-incident analysis and future AttriGuard-style runtime checks.
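A minimal provenance log is one JSON line per tool call. Field names are assumptions; `AgentInput` comes from the Week 1 sketch.

```python
import json
import time

def log_tool_call(tool_name: str, args: dict, sources: list, logfile: str = "tool_calls.jsonl") -> None:
    """Append a provenance record: which inputs contributed to this action, and at what trust."""
    record = {
        "ts": time.time(),
        "tool": tool_name,
        "args": args,
        "sources": [{"source": s.source, "trust": s.trust.name} for s in sources],
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")
```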
Week 4: Automated testing baseline. Run a basic red team battery against your defense stack using known injection payloads from the 240,000-attack study. Measure bypass rates. This becomes your baseline for evaluating PISmith or equivalent automated red teaming.
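Measuring the baseline is a few lines once the agent exposes the same black-box interface used for red teaming (the `TargetAgent` protocol from the PISmith sketch).

```python
def bypass_rate(payloads: list[str], target: "TargetAgent") -> float:
    """Fraction of known injection payloads that still trigger an unauthorized action."""
    successes = sum(1 for p in payloads if target.run_with(p))
    return successes / len(payloads)
```

Track this number per release; a jump after a model, prompt, or tool change is often the earliest regression signal you will get.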
Key takeaways
- Three papers from March-April 2026 form a complete defense stack: system-level architecture (arXiv 2603.30016), RL-based red teaming (PISmith, arXiv 2603.13026), and causal attribution (AttriGuard, arXiv 2603.10749)
- The NVIDIA/Johns Hopkins architecture paper frames indirect injection as an architectural problem requiring privilege separation, action verification, and graduated trust — not better detection
- AttriGuard uses counterfactual causal analysis to determine whether each tool call was driven by the user’s instruction or injected content, operating at the action level rather than the input level
- PISmith automates red teaming by training an attack LLM with reinforcement learning to discover defense bypasses in a black-box setting
- The defense stack layers compose: input trust classification, runtime privilege separation + causal attribution, and continuous automated testing
- Implementation starts with trust tagging (Week 1), privilege mapping (Week 2), action logging (Week 3), and red team baseline (Week 4)
FAQ
What is the best defense against indirect prompt injection in 2026? Layered defense that does not depend on detecting malicious text. Classify data sources by trust level, enforce privilege separation so low-trust data cannot trigger high-risk actions, implement causal attribution on tool calls (AttriGuard), and continuously test with automated red teaming (PISmith). Input detection stays as one signal among many.
What is AttriGuard? AttriGuard (arXiv 2603.10749) uses counterfactual causal analysis at the tool-call boundary. For each action the agent attempts, it checks: would this action still happen if the untrusted content were removed? If not, the action was likely caused by injected content and is blocked. This catches attacks that bypass every input-layer detector.
How does PISmith differ from manual red teaming? PISmith trains an attack LLM using reinforcement learning to automatically discover injection payloads that bypass your specific defenses. It operates in black-box mode (no access to model weights or defense internals), generates attacks optimized for your stack, and runs continuously. Manual red teaming is slow, biased toward known patterns, and cannot keep pace with model updates.
Can these defenses be combined with existing detection tools? Yes. Detection stays in the stack as Layer 1. The architectural difference is treating detection output as a risk signal rather than a binary block/allow gate. A flagged input gets higher scrutiny from AttriGuard and tighter privilege restrictions, but is not blocked outright — preventing the false-positive problem that makes detection-only systems unusable.
Further reading
- Prompt injection detection is already broken — the companion post on why detection fails
- Prompt injection is a structural attack — the foundational argument
- Indirect prompt injection: the attack vector hiding in your data — deep dive on the attack vector
- Algorithmic red teaming: using AI to attack AI — the broader automated red teaming landscape
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch