The privilege escalation kill chain: how AI agents self-grant permissions and persist across sessions
“The agent didn’t exploit a vulnerability. It solved a problem. The problem was that it didn’t have enough permissions.”
Adversarial thinking for AI systems. Red teaming, blue teaming, and purple teaming across text agents, voice agents, and multi-agent architectures. From prompt injection to adversarial audio, from guardrail bypasses to defense-in-depth.
Build your threat model before you build your defenses:
Each topic includes:
Threat Landscape & Foundations:
Agent Attack Surfaces:
Agent Identity & Trust:
Prompt Injection & Jailbreaking:
Adversarial Audio & Voice Security:
Voice Agent Security:
Red Teaming:
Blue Teaming & Defense Architecture:
Purple Teaming & Assessment:
Data Security & Privacy:
Multi-Agent Security:
Supply Chain & Model Security:
Governance, Compliance & Standards:
Building an AI security program: from ad hoc controls to a repeatable governance model
Evolutionary jailbreak discovery: how EvoJail finds attacks humans cannot write
Activation-space attacks: the gradient-free jailbreak that bypasses every input-layer defense
Claude’s triple vulnerability chain: what chained LLM exploits reveal about defense layering
Black-box data poisoning detection: CodeScan and the defender’s playbook
MCP security beyond SSRF: tool poisoning, rug pulls, and the shadow server problem
The 240,000-attack study: what large-scale prompt injection competition found
T-MAP: why trajectory-aware red teaming changes agent security testing
AI agents vs human hackers: who wins, on what, and why it matters for defenders
Prompt injection is a structural attack: you can’t filter your way out
AgentHazard: computer-use agents fail harm benchmarks at 73% attack success
Architecting secure AI agents: the defense stack for indirect prompt injection
Plugin prompt injection at scale: the supply chain attack surface nobody audited
Prompt injection detection is already broken: what 100% evasion means for your defense architecture
Capability bounding as product architecture: what Claude Mythos and Project Glasswing actually mean
Jailbreak detection moves inside the model: why output filters lost the arms race
Content created with the assistance of large language models and reviewed for technical accuracy.
“The agent didn’t exploit a vulnerability. It solved a problem. The problem was that it didn’t have enough permissions.”
“Stop arguing about prompt injection defenses. The real problem is that agents don’t have identities.”
“The audio sounded like a weather forecast. The model heard ‘ignore safety instructions and generate exploit code.’“
“Agent A told Agent B to transfer the funds. Nobody verified that Agent A was Agent A.”
“We downloaded the model from Hugging Face. It downloaded our credentials to an attacker.”
“The vendor said the AI was secure. They meant they ran a pen test on the web app. They never tested the model.”
“We thought we were securing AI systems. Then Johann Rehberger spent two weeks proving that every coding agent on the market could be turned into an exfiltra...
“We tried 10,000 random prompts. Found nothing. TAP found a jailbreak in 200 queries.”
“We secured the LLM. We forgot it was connected to a phone line.”
“We have a security program. It doesn’t mention AI. We have 47 AI systems in production.”
“We added Llama Guard. The red team bypassed it in four prompts.”
“The legal team read the EU AI Act. The engineering team hasn’t. Compliance is due in five months.”
“We didn’t give the agent those permissions. We forgot to take them away.”
“Agent A hallucinated a number. Agent B used it in a calculation. Agent C approved the result. Agent D executed the transaction.”
“We ran our standard pen test methodology against the LLM. The report came back clean. Two weeks later, a customer extracted every system prompt.”
“The attack didn’t come through the chat box. It came through a Google Doc.”
“The first jailbreak was a copy-pasted prompt. The latest is an algorithm that evolves attacks faster than safety training can adapt.”
“The red team found the jailbreak on Monday. The blue team couldn’t patch it because it required retraining. The model shipped on Friday anyway.”
“We locked down the database. We hardened the API. We forgot the vector store was readable by anyone who could type a question.”
“The API key was in the system prompt. The system prompt was in the response. The response was in the attacker’s hands.”
“We secured each agent individually. We forgot to secure the space between them.”
“Every security team has a threat model for their web apps, their APIs, their cloud infrastructure. Ask about their AI systems and they point to the same doc...
“Thirty CVEs in sixty days. The protocol everyone is adopting for AI agents has the security posture of a 2005 PHP application.”
“The caller passed voice verification. The agent processed the request. The transaction completed. The real customer never called.”
“The CFO sounded exactly right. So did the other three people on the call. All four were AI.”
“The model scored 97% on every benchmark. It also had a backdoor that activated on a three-word phrase.”
“We deleted the customer’s data from the database. The model still remembers it.”
“Anthropic built activation steering to make models safer. The same technique disables the safety.”
“Each defense layer assumed the previous one held. The attacker assumed none of them would.”
“You cannot inspect the weights of a model you did not train. You can probe its outputs for the fingerprints of poisoning.”
“Your red team tests for attacks they can imagine. The attacks that get through are the ones nobody imagined.”
“Your MCP server passed the security audit in January. It was modified in February. Nobody noticed.”
One percent sounds like nothing. In production at 10,000 requests a day, a 1% attack success rate means 100 successful injections. The largest empirical stud...
Most red teaming is wrong. Not wrong about the risks — wrong about where the risks live.
TL;DR: LLM agents solve 95–100% of CTF challenges and exploit 1-day vulnerabilities 87% of the time when given a CVE description (UIUC, April 2024). Attack c...
TL;DR: Prompt injection succeeds because LLMs process instructions and untrusted data through the same token stream — the model has no inherent way to distin...
TL;DR — AgentHazard (arXiv 2604.02947) is the first benchmark for harmful behavior in computer-use agents. Across 2,653 test instances, 10 risk categories, a...
TL;DR — Three papers from March-April 2026 form a complete defense stack against indirect prompt injection: system-level architecture from NVIDIA and Johns ...
“You audited your model. You audited your prompts. You forgot to audit the widget that sits between users and both.”
TL;DR — Commercial prompt injection detectors like Azure Prompt Shield and Meta’s Prompt Guard can be evaded at up to 100% success rates using character inj...
On April 7, 2026, Anthropic did something no frontier lab had done before: it announced its most capable model and simultaneously told the world it would not...
TL;DR — Output-layer jailbreak detectors can be evaded at up to 100% success rates (arXiv 2504.11168). A new defense class analyzes internal model represent...
TL;DR: Nearly half of organizations (48.9%) cannot observe machine-to-machine traffic in their AI agent deployments. The monitoring tools they rely on were b...