13 minute read

A glass security shield shattering against a dark server room backdrop, symbolizing the failure of detection-based prompt injection defenses

TL;DR — Commercial prompt injection detectors like Azure Prompt Shield and Meta’s Prompt Guard can be evaded at up to 100% success rates using character injection and adversarial ML (arXiv 2504.11168). Detection-first defense is structurally broken because attackers control the optimization target. The fix is defense-in-depth that assumes your detector will fail.


You deployed a detector. You are not protected.

You ran Azure Prompt Shield against your test suite. It caught 98% of malicious prompts. You shipped it to production. Your security review passed.

In April 2026, researchers published a paper that should make you reconsider that decision. They tested character injection and adversarial ML techniques against six commercial prompt injection detectors — Azure Prompt Shield, Meta’s Prompt Guard, ProtectAI (v1 and v2), NeMo Guard Jailbreak Detect, and Vijil Prompt Injection — and achieved evasion rates up to 100% (arXiv 2504.11168). Not against toy models. Against the tools enterprises are deploying right now.

The 2% your red team caught was the 2% the attackers had not yet optimized against.

This is not a story about one paper finding one bug. A separate systematization-of-knowledge study (arXiv 2506.10597) independently evaluated the entire taxonomy of LLM guardrails across six dimensions and reached the same conclusion: no existing guardrail category provides reliable, consistent protection against prompt injection. The problem is not that current detectors are bad. The problem is that detection-centric defense is the wrong architecture.

What does 100% evasion look like?

The evasion techniques in arXiv 2504.11168 are not sophisticated zero-days. They are straightforward transformations that exploit a fundamental mismatch between how detectors classify text and how LLMs interpret it.

Technique How it works Why detectors miss it
Emoji smuggling Embed malicious instructions inside emoji variation sequences and Unicode tags Detectors parse visible text; hidden instructions pass through to the LLM
Character injection Insert invisible Unicode characters, zero-width joiners, or homoglyphs between tokens Detectors tokenize differently than the target LLM
Encoding substitution Base64-encode malicious instructions, ask the LLM to decode Detector sees gibberish, LLM sees instructions
Adversarial suffix optimization Append optimized token sequences that flip detector classification Gradient-based attacks on the detector’s own model

The evasion rates per tool tell the story:

Detection tool Best evasion rate Primary weakness
Azure Prompt Shield Up to 100% Emoji smuggling bypasses tokenization
Meta Prompt Guard Up to 100% Character injection and adversarial suffixes
ProtectAI (v1/v2) Up to 100% Encoding substitution
NeMo Guard Jailbreak Detect Up to 100% Emoji smuggling
Vijil Prompt Injection Up to 100% Character injection

These are not theoretical. The researchers ran each technique against production API endpoints. The 100% figure means: for at least one evasion technique, every single malicious prompt in the test set passed undetected.

Why can’t we build a better detector?

In the early 2000s, web application firewalls (WAFs) tried to solve SQL injection by detecting malicious SQL patterns in user input. Attackers responded with encoding tricks, case variations, and comment injection. WAF vendors added more rules. Attackers found more bypasses. The arms race continued until the industry realized the architecture was wrong. Parameterized queries solved SQL injection not by building better detectors, but by separating code from data at the execution layer.

Prompt injection has no equivalent separation. In an LLM, instructions and data occupy the same token stream. The model cannot distinguish “ignore previous instructions” planted in a document from “ignore previous instructions” typed by a user. Both are tokens. Both get attended to. Both influence the output.

graph LR
    A[Attacker crafts evasion] --> B[Bypasses detector]
    B --> C[Reaches LLM]
    C --> D[Executes malicious intent]
    D --> E[Defender updates detector]
    E --> F[Attacker adapts evasion]
    F --> B
    
    style A fill:#d32f2f,color:#fff
    style E fill:#1976d2,color:#fff

This loop never converges in the defender’s favor. The attacker has a structural advantage: they can observe the detector’s behavior (most are API-accessible), optimize against it using gradient-based methods, and iterate faster than the defender can retrain. The SoK study (arXiv 2506.10597) formalized this across six guardrail categories — rule-based filters, embedding classifiers, LLM-based judges, perplexity detectors, canary tokens, and instruction hierarchy — and found each has at least one class of evasion it cannot defend against by design.

This is not a “we need better models” problem. It is a “the architecture assumes detection works” problem.

How bad is the current exposure?

The numbers paint a grim picture of how much enterprise infrastructure depends on detection that does not work.

Metric Value Source
Organizations planning agentic AI deployment 83% Cisco State of AI Security, 2026
Organizations that feel ready to deploy agentic AI securely 29% Cisco State of AI Security, 2026
Publicly exposed vulnerable AI assistant instances (OpenClaw) ~40,000 China CNCERT, March 2026
OWASP LLM vulnerability ranking for prompt injection #1 (LLM01) OWASP Top 10 for LLM Applications, 2025

The gap between 83% planning agentic deployment and 29% feeling ready to do so securely means over half of organizations are building agent systems they know they cannot secure. At enterprise scale, that is not a risk — it is an open door.

Indirect injection — where the malicious payload arrives not from the user but from a document, email, calendar invite, or API response the agent processes — is the dominant production attack vector. This is what Simon Willison calls the “lethal trifecta”: an application that (1) has access to private data, (2) processes untrusted content, and (3) can take actions or communicate externally. Every agent with tool access meets all three criteria.

Palo Alto Networks’ Unit 42 research team confirmed this pattern in March 2026, documenting web-based indirect injection attacks where adversaries embed malicious instructions in web pages that AI agents browse during research tasks. Google published their continuous defense approach for Workspace the same month — a tacit acknowledgment that their own AI features face this attack surface.

What does a defense that assumes detection failure look like?

If detection cannot be the gate, it must become one signal among many. The architecture shifts from “block bad inputs” to “constrain what the agent can do regardless of input.”

AgentWatcher (arXiv 2604.01194) demonstrates what this looks like in practice. Instead of classifying inputs as malicious or benign, it uses attention-based attribution to trace which parts of the input influenced each tool invocation. When an agent calls a tool, AgentWatcher checks whether that call was causally driven by the user’s original instruction or by content from an untrusted data source. On the AgentDojo benchmark, this approach reduced attack success rates to under 1% with only 2% utility loss — meaning legitimate requests still worked.

The principle: do not ask “is this input malicious?” Ask “should this input be causing this action?”

graph TD
    A[User request] --> B[Input detection layer]
    A --> C[LLM agent processes request]
    B -->|Signal, not gate| D[Risk scoring engine]
    C --> E[Agent requests tool call]
    E --> F[Causal attribution check]
    F -->|Input caused this call?| D
    D --> G{Risk threshold}
    G -->|Low risk| H[Execute tool call]
    G -->|Medium risk| I[Execute with logging + alerts]
    G -->|High risk| J[Human approval required]
    
    K[Behavioral monitor] --> D
    L[Output validator] --> D
    
    style B fill:#ff9800,color:#000
    style F fill:#4caf50,color:#fff
    style J fill:#d32f2f,color:#fff

Google’s April 2026 Workspace defense follows the same principle. Rather than trying to catch every injection in Gmail content before Gemini processes it, they constrain which actions Gemini can take after processing email content. Read access is granted broadly. Write and send actions require elevated trust signals.

Tencent’s AI-Infra-Guard takes the automation path: a multi-agent scanning framework that probes agent deployments for the OWASP Top 10 for Agentic Applications, testing not just whether injection is possible but whether injected prompts can actually reach high-impact tools.

The practitioner checklist: what to build this week

Detection stays in your stack. It catches low-effort attacks and raises the cost of exploitation. But it is no longer your primary control.

Least-privilege tool access. Every tool the agent can call is an attack surface. If your customer-support agent has database write access “just in case,” that is a prompt injection away from data destruction. Audit tool permissions weekly. The principle from excessive agency applies directly.

Output validation, not just input filtering. Validate what the agent produces and what tools it calls — not just what goes in. If an agent suddenly calls a tool it has never used in this conversation context, flag it. Behavioral baselines catch attacks that input filters miss.

Causal attribution on tool calls. Implement AgentWatcher-style tracing: which part of the input drove this tool invocation? If the answer is “a paragraph from an external document,” apply higher scrutiny. This is the single highest-ROI investment from the research.

Human-in-the-loop for irreversible actions. Any tool call that sends data externally, modifies production state, or accesses credentials should require human approval. The latency cost is real. The alternative — an automated agent executing attacker-controlled instructions — is worse.

Assume breach: log everything. If detection fails and behavioral monitoring fails, your last line is forensics. Log every tool call, every input source, every output. Not for compliance theater — for the incident response team that will need this at 2 AM.

Test your defenses with adversarial evasion, not just red team prompts. If your red team is writing “ignore previous instructions” by hand, they are testing against the weakest attacks. Use the evasion techniques from arXiv 2504.11168 — character injection, encoding substitution, adversarial suffixes — against your own pipeline. Tencent’s AI-Infra-Guard automates this for OWASP alignment.

Key takeaways

  • Commercial prompt injection detectors (Azure Prompt Shield, Prompt Guard, ProtectAI, NeMo Guard, Vijil) can be evaded at up to 100% success rates using emoji smuggling, character injection, and adversarial ML techniques (arXiv 2504.11168)
  • Detection-centric defense is structurally broken — attackers control the optimization loop and can iterate faster than defenders retrain
  • 83% of organizations are planning agentic AI deployment, but only 29% feel ready to do so securely (Cisco, 2026)
  • The SQL injection parallel is instructive: WAFs lost the arms race, parameterized queries won by changing the architecture. Prompt injection needs an equivalent architectural shift
  • AgentWatcher’s causal attribution approach reduces attack success to under 1% by asking “should this input cause this action?” instead of “is this input malicious?”
  • The defense architecture that works treats detection as one signal in a layered system — least-privilege access, output validation, behavioral monitoring, human-in-the-loop gates, and comprehensive logging

FAQ

Can prompt injection be fully prevented by detection tools? No. Research demonstrates up to 100% evasion rates against commercial detectors including Azure Prompt Shield and Meta’s Prompt Guard (arXiv 2504.11168). Detection is necessary as a cost-raising measure but insufficient as a primary defense. It must be one layer in a defense-in-depth architecture that includes tool-level access controls, behavioral monitoring, and human approval gates for high-risk actions.

Why is prompt injection harder to solve than SQL injection? SQL injection was solved by separating code from data through parameterized queries — a clean architectural boundary. LLMs cannot make this separation because instructions and data occupy the same token stream. The model processes “summarize this document” and “ignore previous instructions” identically at the attention layer. Until a fundamental architectural equivalent to parameterized queries exists for natural language, prompt injection remains a constraint to design around, not a bug to fix.

What is the most effective single defense against prompt injection today? Causal attribution on tool calls, as demonstrated by AgentWatcher (arXiv 2604.01194). Instead of classifying inputs, it traces which input tokens caused each tool invocation. This approach achieved under 1% attack success rate on AgentDojo with only 2% utility loss. The shift: validate what the agent does, not what it receives.

Should I remove my prompt injection detector? No. Keep it. Detection raises the cost of attack and catches low-effort exploitation. The mistake is treating it as a gate that blocks attacks. Treat it as a signal that feeds into a broader risk scoring engine alongside behavioral monitoring, causal attribution, and output validation.

How do I test whether my agent is vulnerable to these evasion techniques? Run the specific evasion techniques from arXiv 2504.11168 against your pipeline: character injection (Unicode zero-width characters), encoding substitution (Base64-encoded payloads), semantic splitting (multi-turn attacks), and adversarial suffix optimization. Tencent’s open-source AI-Infra-Guard framework automates this testing aligned to the OWASP Top 10 for Agentic Applications.

Further reading

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch