The 240,000-attack study: what large-scale prompt injection competition found
One percent sounds like nothing. In production at 10,000 requests a day, a 1% attack success rate means 100 successful injections. The largest empirical study of prompt injection ever run (arXiv:2603.15714) found attack success rates ranging from 0.5% to 8.5% across 13 frontier models — and found that the attacks that work on one model tend to work on others.
That second finding is the one most teams aren’t accounting for.
TL;DR: A public competition run by ML safety researchers attracted 464 participants who submitted 272,000 attack attempts against 13 frontier models across three agent settings. Attack success ranged from 0.5% to 8.5%. The critical finding: attack strategies transferred across model families in 21 of 41 tested scenarios, meaning per-model robustness gives a false sense of security. Three empirical lessons and five concrete hardening steps follow from the data.

What the competition measured and how
464 participants competed for a $40K prize pool. They submitted 272,000 attack attempts against 13 frontier models; 8,648 succeeded — an overall rate just under 3.2%. The models spanned major proprietary and open-weight families including Claude Opus 4.5 and Gemini 2.5 Pro. Three agent settings: tool calling, coding, and computer use.
These aren’t attacks where a user types something malicious into a chat box. They’re indirect — adversarial instructions embedded in content the agent processes: emails, documents, code repositories, web pages. The agent reads the content, encounters the injected instruction, and follows it. The user sees nothing unusual. The harm happens in the background.
The competition scored attacks on whether the agent completed the injected objective, not just whether it produced suspicious output. Stealth mattered. Many successful attacks left no detectable trace in the user-facing response at all.
The three findings that change how you defend
The first finding is about volume. The ASR range of 0.5% to 8.5% sounds low until you run the numbers. Claude Opus 4.5 achieved the lowest rate at 0.5%. At 10,000 agent requests per day, that’s 50 successful attacks daily — 1,500 per month. Gemini 2.5 Pro at 8.5% means 850 per day. These aren’t tail-risk numbers. They’re operational rates that require active defense, not periodic audits.
The second finding is the one most teams aren’t ready for: attack strategies succeeded across 21 of 41 tested behavior scenarios and multiple model families. An attacker who develops a working payload against any accessible model — including cheap open-weight alternatives — can apply it against a production deployment on a completely different provider. Your model choice doesn’t protect you from their testing budget.
The third finding is quietly damning. Gemini 2.5 Pro scores near the top of capability benchmarks. It also had the highest attack success rate in this competition at 8.5%. Claude Opus 4.5 had the lowest at 0.5%. There’s no clean rule here — you cannot pick the “most capable” model and assume you’re picking the most resilient one. Robustness is its own property, untethered from capability ratings.
Attack Success Rate by Model (Approximate Range)
Claude Opus 4.5 |■ | 0.5%
[Model 2] |■■ | ~1.2%
[Model 3] |■■■ | ~2.0%
[Model 4] |■■■■ | ~3.1%
[Model 5] |■■■■■ | ~4.0%
[Model 6] |■■■■■■ | ~5.1%
[Model 7] |■■■■■■■ | ~6.0%
[Model 8] |■■■■■■■■ | ~6.8%
Gemini 2.5 Pro |■■■■■■■■■■■■■■■■■■ | 8.5%
Source: arXiv:2603.15714. Intermediate model positions
are illustrative; only endpoints confirmed by paper.
What “transferable attacks” means for defense
Most organizations currently defend against prompt injection at the model level: they pick a model they consider robust, tune its system prompt, and trust that the model’s training makes it harder to manipulate. The competition data breaks this assumption.
When attack strategies transfer across model families, the question stops being “is this model robust?” and becomes “does our system have layers that catch injections the model doesn’t?” The model is one control, not the control.
The SQL injection parallel is uncomfortable but accurate. Early SQL injection defenses were database-specific — some databases escaped inputs differently, some were considered more secure by reputation. The lesson took years to land: parameterized queries, input validation, and least-privilege database accounts all had to exist together, because any single layer would eventually be bypassed. Prompt injection is at that same stage. The industry is still largely relying on model-level trust, the same way early web developers trusted their database’s escaping behavior.
The competition also found something that compounds this: successful attacks often remained invisible to users. When an agent executes an injected instruction silently — forwarding data, triggering a tool call, modifying a file — there’s no output the user can flag as suspicious. Detection, not just prevention, is a first-class requirement.
graph TD
A[External content enters agent context] --> B{Input scanning}
B -->|Flagged| C[Block / sanitize]
B -->|Passes| D[Model processes content]
D --> E{Output monitoring}
E -->|Anomaly detected| F[Alert / interrupt]
E -->|Passes| G{Action sandbox}
G -->|High-risk action| H[Require human approval]
G -->|Low-risk action| I[Execute with audit log]
style C fill:#d4edda
style F fill:#fff3cd
style H fill:#fff3cd
style I fill:#d1ecf1
The diagram above shows what defense-in-depth looks like for an agent pipeline. The model is in the middle — it’s one node, not the whole graph. Input scanning and output monitoring sit on either side. Actions are sandboxed by risk level. Logs exist everywhere. None of these layers is reliable by itself. All of them together make the attacker’s job substantially harder.
Five concrete hardening steps derived from the competition data
The competition released its attack dataset to the UK and US AI Safety Institutes, making it possible to trace what the successful attacks actually exploited. Five steps map directly to those patterns.
First: scan external content before it enters agent context. The successful attacks in the competition came through content the agent was processing — documents, emails, tool outputs. A lightweight classifier checking that content against known injection patterns won’t catch everything (the competition found attacks that bypassed naive filters), but it raises the cost of commodity attacks without meaningful latency penalty.
Second: structurally separate instruction channels from data channels. Prompt injection works because language models receive instructions and data in the same token stream and must guess which is which. System-level prompts, user prompts, and external content should be differentiated wherever the model’s context format allows. Several model providers now support distinct content roles in their APIs. Use them.
Third: scope agent tool permissions to the minimum the task actually requires. This matters because of the stealth finding. An agent with write access to a file system, email client, and external APIs can do real damage from a single successful injection — without the user ever seeing a suspicious response. An agent scoped to read-only operations on specific resources can’t. Least privilege applies to tool grants the same way it applies to service accounts.
Fourth: monitor for behavioral anomalies at the action level, not just in model output. Output monitoring catches the attacks that show up in what the user sees. The competition’s stealthy attacks didn’t. Log every tool call the agent makes, every file it reads or writes, every external request it sends. Build detection on those action logs. That’s where the quiet attacks live.
Fifth: test your specific deployment against the transferable payloads the competition identified, not a generic benchmark. The ASR that matters is not the average across 13 models in controlled conditions — it’s the rate against your agent, your tool grants, your document corpus, your system prompt. Red-teaming is not a pre-launch checklist item. It’s an ongoing operation.
FAQ
What is the attack success rate for prompt injection across models? The competition found rates ranging from 0.5% (Claude Opus 4.5) to 8.5% (Gemini 2.5 Pro) across 13 frontier models, with 8,648 successful attacks out of 272,000 attempts.
What does “universal attack transfer” mean in practice? Attack strategies that succeeded on one model also worked on different model families in 21 of 41 tested scenarios. An attacker who develops a payload on a cheap open-weight model can apply it against a production deployment on a different provider.
Which model was most resistant to prompt injection in the competition? Claude Opus 4.5 had the lowest attack success rate at 0.5%. However, 0.5% is not zero — at 10,000 requests per day, that’s 50 successful injections daily.
Does a capable model tend to be more robust against prompt injection? No. The competition found a weak correlation between capability and robustness. Gemini 2.5 Pro ranked high on capability benchmarks but had the highest attack success rate (8.5%) in the competition.
What three agent settings did the competition test? Tool calling, coding, and computer use — the three most common ways agents interact with external content and act on user systems.
For the full taxonomy of prompt injection techniques and defense patterns, see Prompt injection defense for AI agents. For safety properties at the agent system level, see Ethical AI agents and safety.
Source paper: How Vulnerable Are AI Agents to Indirect Prompt Injections? (arXiv:2603.15714, published March 2026)
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch