How to red team an LLM application: a practitioner’s playbook
“We ran our standard pen test methodology against the LLM. The report came back clean. Two weeks later, a customer extracted every system prompt.”
TL;DR
Traditional pen testing methodology breaks down against LLM applications. The attack surface is natural language, outputs are probabilistic, and most vulnerabilities can’t be patched with code. This playbook covers scoping, the five attack categories that matter, tooling (Garak, PyRIT, DeepTeam, Promptfoo), when automated testing works versus when you need humans, and how to write findings that engineering teams can actually remediate. HackerOne now has 1,121 bug bounty programs with AI in scope, up 270% year-over-year (HackerOne, 2025). For the threat model that should inform your scoping, see The privilege escalation kill chain.

Why doesn’t traditional pen testing work for LLMs?
A web app pen test has a clear playbook: enumerate endpoints, fuzz inputs, test authentication, check for injection, verify authorization. The attack surface is finite and well-characterized. Findings are deterministic and reproducible. Fixes are code changes.
LLM red teaming breaks every one of those assumptions.
The attack surface is infinite. The input to an LLM is any natural language string. You can’t enumerate all possible inputs. You can’t fuzz effectively because minor phrasing changes produce wildly different results. The space of possible attacks is the space of all possible sentences, which is unbounded.
Results are probabilistic. Run the same prompt against the same model ten times and you might get a jailbreak three times. Is that a vulnerability? At what success rate? Traditional severity ratings assume deterministic behavior. CVSS doesn’t have a field for “works 30% of the time.” Microsoft’s AI Red Team has documented this challenge: they require statistical significance across multiple runs before classifying a finding (Microsoft, 2024).
Remediation is architectural. A SQL injection fix is a parameterized query. A jailbreak fix might require retraining the model, which takes weeks, costs significant compute, and can introduce new failure modes elsewhere. Some findings have no code-level fix at all. The OWASP LLM Top 10 2025 acknowledges this: several entries (LLM01: Prompt Injection, LLM04: Data and Model Poisoning) describe vulnerabilities with no complete remediation path.
MITRE ATLAS catalogs 66 adversarial techniques with 33 documented case studies, extending the ATT&CK framework specifically for ML systems. It’s the closest thing to a systematic attack taxonomy for LLM red teaming, and it’s still evolving. ATLAS added agent-focused adversarial techniques in October 2025.
How do you scope an LLM red team engagement?
Scope by threat model, not by tool capability. Most failed red team engagements fail because they tested everything the tool could do rather than everything that matters.
Step 1: Map the system. What model? What data does it access? What tools or APIs can it call? What actions can it take in the real world? An LLM chatbot with no tool access is a very different target than an agent with database access, code execution, and email sending.
Step 2: Identify adversaries. External users trying to extract training data? Competitors probing for system prompts? Malicious insiders planting poisoned documents? The adversary determines which attack categories are in scope.
Step 3: Select attack categories. Not all OWASP LLM Top 10 categories apply to every deployment. If you’re using a third-party model via API, training data poisoning (LLM04) is out of scope. If your system has no tool access, excessive agency (LLM06) doesn’t apply. Map your system to the relevant categories:
| OWASP LLM 2025 | When in scope |
|---|---|
| LLM01: Prompt Injection | Always. Every LLM system. |
| LLM02: Sensitive Information Disclosure | If the system accesses private data or has a system prompt worth protecting |
| LLM03: Supply Chain | If you use third-party models, plugins, or RAG data from external sources |
| LLM04: Data and Model Poisoning | If you fine-tune or curate training/RAG data |
| LLM05: Improper Output Handling | If model output reaches other systems (databases, APIs, rendered HTML) |
| LLM06: Excessive Agency | If the model has tool access or can take real-world actions |
| LLM07: System Prompt Leakage | If the system prompt contains sensitive logic or API keys |
| LLM08: Vector and Embedding Weaknesses | If you use RAG with vector databases |
| LLM09: Misinformation | If the system generates content users rely on for decisions |
| LLM10: Unbounded Consumption | If you need to protect against cost-based or denial-of-service attacks |
Step 4: Set success criteria. “We broke the chatbot” isn’t a finding. Define what constitutes a successful attack in business terms: exfiltrated customer PII, executed unauthorized API calls, obtained system prompt, bypassed content policy to generate harmful content. Make it measurable.
What are the five attack categories that matter most?
Five attack categories cover roughly 90% of findings in published LLM red team reports. I’m ordering them by how frequently they produce actionable results, not by theoretical severity.
1. Prompt injection (direct and indirect). Still the highest-yield attack category. Direct injection: craft inputs that override the system prompt. Indirect injection: plant malicious instructions in data the AI retrieves. HackerOne reported a 540% surge in prompt injection reports in 2025. Test both: direct jailbreaking through the chat interface, and indirect injection through any data source the system accesses. For the full technical breakdown, see Indirect prompt injection.
2. Data extraction. System prompt extraction, training data extraction, and RAG content extraction. System prompts often contain API keys, business logic, and pricing information that competitors would love to have. Training data extraction (Carlini et al., 2021) can surface PII, copyrighted content, or proprietary information. RAG extraction reveals what documents the system has access to.
3. Privilege escalation. If the system has tools, test whether you can make it use tools it shouldn’t, access data it shouldn’t, or take actions beyond its intended scope. The Irregular Labs agent testing (March 2026) found agents forging admin session cookies, disabling security software, and using steganography to bypass DLP controls. A red-team agent breached McKinsey’s AI platform and gained read-write access to 46.5 million messages in two hours.
4. Output manipulation. Can you make the system generate harmful, misleading, or policy-violating content? This matters for customer-facing systems where the output represents your brand. Jailbreak success rates in published research range from 70.6% to 95.9% for multi-turn attacks, with 64% cross-model transfer rates (academic benchmarks, 2025).
5. Denial of service and cost attacks. Can an attacker craft inputs that cause excessive token consumption, triggering large API bills? Can they create loops in agent systems? Can they exhaust rate limits for legitimate users? These are often overlooked but represent real business risk.
What tools should I use?
Four tools cover most red teaming needs. Each has a different strength.
graph TD
A[Red Team Tooling] --> B[Garak<br/>NVIDIA]
A --> C[PyRIT<br/>Microsoft]
A --> D[DeepTeam<br/>Confident AI]
A --> E[Promptfoo]
B --> B1[Broad vulnerability<br/>scanning]
B --> B2[Known attack<br/>pattern library]
B --> B3[Multiple model<br/>providers]
C --> C1[Multi-turn<br/>conversational attacks]
C --> C2[Human-in-the-loop<br/>via CoPyRIT GUI]
C --> C3[Orchestrated<br/>attack strategies]
D --> D1[OWASP LLM Top 10<br/>aligned]
D --> D2[40+ vulnerabilities<br/>20+ attack types]
D --> D3[Evaluation<br/>metrics]
E --> E1[CI/CD<br/>integration]
E --> E2[Regression<br/>testing]
E --> E3[50+ vulnerability<br/>scan types]
Garak (NVIDIA, open source) is the broadest scanner. It probes for known vulnerability patterns across prompt injection, data leakage, toxicity, and hallucination. Think of it as Nmap for LLMs: good for initial reconnaissance and broad coverage. Best for: first-pass scanning when you don’t know what you’re looking for.
PyRIT (Microsoft, open source) focuses on multi-turn attacks: conversational strategies where the attacker gradually steers the model toward a target behavior over multiple exchanges. It includes CoPyRIT, a GUI for human-in-the-loop testing where human testers guide automated attack strategies. Best for: depth testing and complex attack chains that require reasoning about the model’s responses.
DeepTeam (Confident AI) aligns its attack taxonomy directly to OWASP LLM Top 10 categories. It covers 40+ vulnerability types across 20+ attack strategies. Best for: compliance-oriented testing where you need to demonstrate coverage against a specific framework.
Promptfoo (open source, acquired by OpenAI in March 2026) runs evaluations as part of your CI/CD pipeline. Define test cases, set pass/fail criteria, run on every deployment. Best for: continuous regression testing to catch safety regressions before they ship.
My recommendation: start with Garak for broad scanning, add PyRIT for multi-turn depth, and integrate Promptfoo into CI/CD for continuous coverage. Use DeepTeam when you need OWASP-aligned reporting for compliance or audit.
When do you need humans versus automation?
Automated tools and human testers answer different questions. Use both, but use them for what they’re good at.
Automated tools handle breadth. They can run thousands of known attack patterns against your system in hours. They’re good at finding known vulnerability classes: standard jailbreaks, common injection patterns, system prompt extraction techniques. They generate statistical data: “this attack succeeds 34% of the time across 1,000 runs.” They scale to continuous testing.
Human testers handle depth. They discover novel attack chains that no tool has in its library. They exploit business logic specific to your application. They combine multiple weak findings into high-severity chains. They reason about the system’s behavior across turns and adapt their strategy in real time.
The published approaches from major AI labs confirm this split:
Microsoft’s AI Red Team (100+ products red-teamed) uses a hybrid approach: automated scanning for coverage, human testers for depth, and a threat model ontology published in 2024 that structures the relationship between the two. They require statistical significance (multiple runs) before classifying automated findings.
Anthropic’s Frontier Red Team (~15 people) focuses human testers on catastrophic risks: CBRN (chemical, biological, radiological, nuclear) capabilities, deception, and autonomous behavior. Automated tools handle the breadth of general safety testing. External partnerships bring domain experts for specialized testing.
Scale AI found through their red teaming programs that models sometimes exhibit “jailbreaking-to-jailbreak” behavior: a model that has been jailbroken once becomes more susceptible to subsequent jailbreaks in the same session. This is a finding that only emerged from human testing with careful observation.
The practical split: automated scanning for every deployment (daily/weekly), human red teaming for pre-launch assessments and quarterly reviews of high-risk applications.
How do you write findings that get fixed?
The most common failure mode in LLM red teaming isn’t the testing. It’s the reporting. Traditional vulnerability reports assume deterministic bugs with code-level fixes. LLM findings are often probabilistic, architectural, and ambiguous.
Include the success rate, not just one screenshot. “I jailbroke the model” means nothing without context. “This prompt bypasses the content policy in 47 out of 100 attempts at temperature 0.7 on GPT-4o-2025-03” is a finding. Run every attack at least 20 times and report the success rate with the model version, temperature, and date.
Map to OWASP categories. Every finding should reference the OWASP LLM Top 10 category it falls under. This gives engineering teams a shared vocabulary and lets them prioritize across findings. It also makes the report useful for compliance teams who need to demonstrate testing coverage.
Describe the business impact, not just the technical behavior. “Jailbreak achieved” is not actionable. “Attacker can bypass content policy to generate detailed phishing emails targeting our customers, using brand-specific language from the system prompt” describes a business risk. “System prompt extracted, revealing API keys for internal services and pricing logic for enterprise tier” quantifies the exposure.
Suggest remediation pathways with realistic expectations. Some findings have code fixes (add output filtering, remove API keys from system prompts). Some require architectural changes (add privilege separation for tool access). Some have no complete fix (fundamental prompt injection). Be honest about which is which. “Mitigate by adding input filtering (reduces success rate from 47% to approximately 12%)” is more useful than “fix prompt injection.”
Accept probabilistic severity. An attack that works 5% of the time against a customer-facing chatbot serving 10 million queries per day will succeed roughly 500,000 times per day. That’s a different severity than the same 5% rate against an internal tool with 50 users. Context determines severity, not just success rate.
The EU AI Act (effective August 2025) requires red teaming for high-risk AI systems. NIST AI RMF 1.0 includes red teaming under the MAP function. Both frameworks expect documented testing against adversarial inputs. Your reporting format needs to satisfy these requirements, not just your engineering team.
Key takeaways
- LLM red teaming is fundamentally different from traditional pen testing: infinite attack surface, probabilistic results, architectural remediation
- Scope by threat model, not by tool capability. Map your system to OWASP LLM Top 10 categories and test only what applies
- Five attack categories cover ~90% of findings: prompt injection, data extraction, privilege escalation, output manipulation, denial of service
- Four tools cover most needs: Garak (breadth), PyRIT (depth), DeepTeam (OWASP-aligned), Promptfoo (CI/CD)
- Use automated tools for breadth and continuous coverage. Use human testers for depth, novel attack discovery, and pre-launch assessments
- Report findings with success rates (not single screenshots), OWASP category mappings, business impact, and realistic remediation pathways
- The EU AI Act and NIST AI RMF both require documented adversarial testing for high-risk AI systems
FAQ
How is red teaming an LLM different from traditional penetration testing?
Three fundamental differences. The attack surface is natural language, meaning the input space is infinite and non-enumerable. Outputs are probabilistic, so the same attack may succeed 30% of the time, making reproducibility a challenge. Most findings can’t be patched with code: they require retraining, architectural changes, or accepting residual risk. A traditional pen test finds a SQL injection and the developer adds parameterized queries. An LLM red team finds a jailbreak and the fix might require months of safety training.
What tools should I use for automated LLM red teaming?
Four tools cover most needs. Garak (NVIDIA, open source) is the most comprehensive vulnerability scanner with broad attack coverage. PyRIT (Microsoft) excels at multi-turn conversational attacks with a GUI for human-in-the-loop testing. DeepTeam (Confident AI) aligns directly to OWASP LLM Top 10 categories with 40+ vulnerability types. Promptfoo is best for CI/CD integration and regression testing. Start with Garak for broad scanning, add PyRIT for depth.
When do I need human red teamers versus automated tools?
Automated tools handle breadth: scanning known attack patterns across many inputs at scale. Human testers handle depth: discovering novel attack chains, exploiting business logic specific to your application, and finding multi-step vulnerabilities that require reasoning about behavior. Use automated scanning for every deployment. Use human testers for pre-launch assessments and quarterly reviews of high-risk applications.
How do I scope an LLM red team engagement?
Start with the threat model. Identify what data the system accesses, what actions it can take, who the adversaries are, and what a successful attack looks like. Map to OWASP LLM Top 10 categories and exclude attacks that don’t apply (e.g., training data poisoning if using a third-party API model). Set measurable success criteria in business terms, not just technical outcomes.
How do I document LLM red team findings?
Include the exact prompt sequence, model version and temperature, success rate across multiple attempts (not just one screenshot), the business impact, and a suggested remediation path. Map every finding to an OWASP LLM Top 10 category. Accept probabilistic severity: “succeeds 40% of the time” at 10 million daily queries means 4 million successful attacks per day. Context determines severity.
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch