Algorithmic red teaming: using AI to attack AI

Q: Why can't you just brute-force test all possible prompts?

The input space is infinite. An LLM accepts any natural language string of any length. Randomly sampling this space has near-zero probability of finding meaningful vulnerabilities. Algorithmic methods use optimization (gradient-based, evolutionary, RL-based) to navigate the space efficiently, finding adversarial prompts orders of magnitude faster than random or manual testing.

Q: What is TAP and how does it work?

TAP (Tree of Attacks with Pruning), developed by Robust Intelligence with Yale researchers, uses tree-of-thought reasoning to iteratively refine adversarial prompts. It builds a tree of candidate attacks, evaluates which branches are most promising, prunes unlikely candidates, and sends only high-probability attacks to the target. It achieves 80%+ jailbreak success against GPT-4 with only a few hundred queries.

Q: How do you integrate automated red teaming into CI/CD?

Use Promptfoo or Garak as CI/CD gates that run on every deployment. Define test cases covering OWASP LLM Top 10 categories. Set pass/fail thresholds (e.g., jailbreak success rate must stay below 5%). Run baseline scans automatically. Supplement with PyRIT multi-turn campaigns for deeper testing on major releases. Alert on regression: if a new deployment increases attack success rates.

Q: Is automated red teaming sufficient without human testers?

No. Automated tools discover known vulnerability patterns at scale. Human testers discover novel attack chains, exploit application-specific business logic, and combine multiple weak findings into high-severity chains. The optimal approach uses automated scanning for continuous coverage (daily/weekly) and human testers for pre-launch assessments and quarterly deep dives.

9 minute read

“We tried 10,000 random prompts. Found nothing. TAP found a jailbreak in 200 queries.”

TL;DR

Brute-force prompt testing fails because the input space is infinite. Algorithmic red teaming uses AI to systematically discover AI vulnerabilities. TAP achieves 80%+ jailbreak success against GPT-4 in minutes. PAIR uses attacker-target-judge orchestration. AutoDAN evolves natural-language jailbreaks through genetic algorithms. Four tools make this practical: Garak, PyRIT, DeepTeam, Promptfoo. The gap between automated and manual red teaming is coverage, not quality. For the practitioner playbook on structuring a red team engagement with these tools, see How to red team an LLM application.

Two identical robotic arms facing each other, each probing the other's joints on a test bench

Why doesn’t brute-force testing work?

Because the input space is infinite and the success condition is sparse.

An LLM accepts any natural language string up to its context window limit. The space of all possible prompts is not just large; it’s unbounded. Randomly sampling from this space has near-zero probability of finding an adversarial prompt. Even structured fuzzing (trying variations of known dangerous patterns) covers only a microscopic fraction of the space.

The success signal is also sparse. A jailbreak prompt might work with one phrasing and fail with a minor rewording. The difference between “tell me how to…” (refused) and a carefully constructed roleplay scenario (accepted) is a region in prompt-space that random search will never find efficiently.

Algorithmic methods solve this by using optimization to navigate the space intelligently. Instead of sampling randomly, they use reward signals, gradient information, or evolutionary fitness to move toward adversarial prompts. The improvement is orders of magnitude: TAP finds jailbreaks in a few hundred queries where random testing needs millions (and still might not succeed).

What are the main algorithmic attack methods?

Three families of approaches, each exploiting a different optimization strategy.

RL-based red teaming trains an attacker model using reinforcement learning. The reward signal is whether the target model produced harmful output. The attacker learns to craft prompts that maximize this reward. The challenge: reward signals against well-aligned models are sparse (the model almost always refuses), making RL training unstable. Transferability is also limited. An attacker trained against a weak model often fails against a strong one.

LLM-vs-LLM orchestration uses one language model to generate attacks against another. The standard framework involves three models: an attacker that generates adversarial prompts, a target that responds, and a judge that scores whether the attack succeeded.

TAP (Tree of Attacks with Pruning), from Robust Intelligence and Yale, is the most efficient variant. It uses tree-of-thought reasoning to build a tree of candidate attack prompts, evaluates which branches are most promising, and prunes unlikely candidates before sending them to the target. This query efficiency matters: TAP achieves 80%+ jailbreak success against GPT-4 and GPT-4-Turbo in a few hundred queries, which translates to minutes of wall-clock time.

PAIR (Prompt Attacker with Iterative Refinement) uses the three-model orchestration with iterative refinement based on the target’s responses. Each round, the attacker analyzes why the previous attempt failed and adjusts. PAIR addresses the sparse reward problem by providing rich feedback (the target’s response) rather than a binary success/failure signal.

Evolutionary approaches use genetic algorithms to breed adversarial prompts. AutoDAN starts from a population of seed prompts, applies mutation (word changes, sentence rewrites) and crossover (combining elements from different prompts), and selects for fitness (attack success). The key innovation: it operates at the sentence level, producing grammatically correct, natural-language prompts that bypass perplexity-based detection. For the full taxonomy of jailbreak attack types these methods generate, see Jailbreaking in production.

graph TB
    subgraph "Algorithmic Attack Methods"
        A[RL-Based] -->|Sparse reward challenge| T[Target Model]
        B[LLM-vs-LLM<br/>TAP / PAIR] -->|Query-efficient<br/>Few hundred queries| T
        C[Evolutionary<br/>AutoDAN] -->|Natural language output<br/>Bypasses perplexity filters| T
    end

    T --> J[Judge Model<br/>Score success]
    J -->|Feedback| A
    J -->|Feedback| B
    J -->|Fitness signal| C

    subgraph "Coverage Gap"
        D[Automated: Known patterns<br/>at scale, continuously]
        E[Human: Novel chains,<br/>business logic, depth]
    end

How do the four major tools compare?

Each tool occupies a different niche. Using them in combination covers significantly more ground than any single tool.

Garak (NVIDIA, open source) is the broadest scanner. It ships with approximately 100 attack vectors and can fire up to 20,000 prompts per run across dozens of vulnerability categories: jailbreaking, data leakage, prompt injection, toxicity, hallucination. It integrates with AVID (AI Vulnerability Database) for reporting. Think of Garak as the Nmap of LLM security: broad, fast, systematic, good at finding known vulnerability patterns.

PyRIT (Microsoft, open source) focuses on depth. Its April 2025 release introduced the AI Red Teaming Agent that orchestrates multi-turn attack campaigns. It supports crescendo attacks (gradually escalating across conversation turns), TAP strategies, and human-in-the-loop testing through the CoPyRIT GUI. Microsoft has used PyRIT internally for 100+ red team operations. PyRIT finds complex multi-turn vulnerabilities that single-prompt scanners miss.

DeepTeam (Confident AI) maps directly to compliance frameworks: OWASP LLM Top 10, OWASP Agentic AI Top 10, NIST AI RMF, and MITRE ATLAS. It covers 40+ vulnerability classes with 10+ adversarial strategies including multi-turn jailbreaks, encoding obfuscation, and adaptive pivoting. If your red team engagement needs to demonstrate coverage against a specific framework for audit purposes, DeepTeam’s structured mapping makes reporting straightforward.

Promptfoo (open source) is built for CI/CD integration. Define test cases, set pass/fail criteria, run on every deployment. It’s less sophisticated as an attack generator but excels at regression testing: ensuring that a new model deployment or prompt change doesn’t reintroduce previously patched vulnerabilities.

The integration pattern that works: Garak for baseline vulnerability sweeping (run on every deployment), PyRIT for multi-turn attack campaigns (run on major releases), DeepTeam for compliance-aligned testing (run quarterly), Promptfoo as a CI/CD gate (run continuously). Garak identifies surface-level issues. PyRIT finds the complex chains. Together they discover vulnerabilities neither finds alone.

How do you integrate this into CI/CD?

Automated red teaming in CI/CD follows the same pattern as automated security testing for traditional software, with adjustments for probabilistic outputs.

Define a test suite. Map your system’s capabilities to OWASP LLM Top 10 categories. For each applicable category, create test cases: specific prompts, expected behaviors, and pass/fail criteria. A chatbot test suite might include 50 jailbreak variants, 30 prompt injection patterns, 20 data extraction attempts, and 10 system prompt extraction techniques.

Set thresholds, not absolutes. LLM outputs are probabilistic. A test case might succeed 3% of the time against a safe model. Set a threshold (e.g., “jailbreak success rate must stay below 5% across 100 runs”) rather than requiring zero failures. Track the baseline and alert on regression: if a new deployment increases the success rate from 3% to 12%, something changed.

Run on every deployment. Garak or Promptfoo scans take minutes to hours depending on the test suite size. Run them as a deployment gate. If the success rate exceeds the threshold, block the deployment and investigate.

Supplement with depth testing. Automated CI/CD testing catches known patterns. Schedule PyRIT multi-turn campaigns for major releases. Schedule human red team exercises quarterly. The CI/CD gate prevents regressions. The deep testing discovers new vulnerabilities.

Report against frameworks. Use DeepTeam’s OWASP mapping to generate compliance-ready reports showing which categories were tested, what was found, and what the remediation status is. This satisfies EU AI Act requirements for documented adversarial testing of high-risk systems.

Key takeaways

Brute-force prompt testing fails because the input space is infinite. Algorithmic methods improve efficiency by orders of magnitude.
TAP achieves 80%+ jailbreak success against GPT-4 in a few hundred queries. PAIR uses iterative refinement with attacker-target-judge orchestration. AutoDAN evolves natural-language attacks through genetic algorithms.
Four tools cover the space: Garak (breadth), PyRIT (multi-turn depth), DeepTeam (OWASP-aligned compliance), Promptfoo (CI/CD integration)
The integration pattern: Garak baseline on every deployment + PyRIT campaigns on major releases + DeepTeam quarterly for compliance + Promptfoo as continuous CI/CD gate
Automated and human red teaming are complementary, not substitutes. Automation handles coverage at scale. Humans discover novel attack chains.
Set probabilistic thresholds, not absolutes. Track baselines and alert on regression.

FAQ

Why can’t you just brute-force test all possible prompts?

The input space is infinite. Randomly sampling has near-zero probability of finding adversarial prompts. Algorithmic methods use optimization (gradients, evolution, RL) to navigate efficiently, finding vulnerabilities orders of magnitude faster. TAP needs hundreds of queries where random testing needs millions.

What is TAP and how does it work?

TAP uses tree-of-thought reasoning to build, evaluate, and prune a tree of candidate attacks, sending only high-probability prompts to the target. Achieves 80%+ success against GPT-4 in minutes. Developed by Robust Intelligence with Yale researchers.

How do you integrate automated red teaming into CI/CD?

Run Garak or Promptfoo scans on every deployment as a gate. Set probabilistic thresholds (e.g., jailbreak rate below 5%). Track baselines and alert on regression. Supplement with PyRIT multi-turn campaigns on major releases and quarterly human testing.

Is automated red teaming sufficient without human testers?

No. Automated tools find known patterns at scale. Humans find novel chains, exploit application-specific logic, and combine weak findings into severe chains. Use automation for continuous coverage and humans for pre-launch and quarterly deep testing.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch

Algorithmic red teaming: using AI to attack AI

TL;DR

Why doesn’t brute-force testing work?

What are the main algorithmic attack methods?

How do the four major tools compare?

How do you integrate this into CI/CD?

Key takeaways

FAQ

Why can’t you just brute-force test all possible prompts?

What is TAP and how does it work?

How do you integrate automated red teaming into CI/CD?

Is automated red teaming sufficient without human testers?

Related across topics

Share on

TL;DR

Why doesn’t brute-force testing work?

What are the main algorithmic attack methods?

How do the four major tools compare?

How do you integrate this into CI/CD?

Key takeaways

FAQ

Why can’t you just brute-force test all possible prompts?

What is TAP and how does it work?

How do you integrate automated red teaming into CI/CD?

Is automated red teaming sufficient without human testers?

Related across topics

Prompt Injection Defense

Share on