
“The red team found the jailbreak on Monday. The blue team couldn’t patch it because it required retraining. The model shipped on Friday anyway.”

TL;DR

Traditional purple teaming is point-in-time: attack, defend, report. AI purple teaming must be continuous because the attack surface changes with every model update, prompt change, and tool integration. Findings have three remediation pathways: immediate policy enforcement, model-level fixes (retraining), and architecture changes. The coverage target is OWASP LLM Top 10 plus MITRE ATLAS. The metric that matters is time-to-remediation, not number of findings. For the practitioner’s tooling guide, see How to red team an LLM application.


[Image: Two identical server racks facing each other with looping cable infrastructure and synchronized LEDs]

Why is purple teaming AI different?

Because the remediation pathways are fundamentally different from those in traditional software.

In traditional security, the purple team finds a SQL injection. The developer adds parameterized queries. The fix ships in the next deployment. Time-to-remediation: days. The fix is permanent. The vulnerability doesn’t come back unless someone reintroduces it.

AI findings fall into three categories with very different timelines:

Immediate policy enforcement (hours to days). Update guardrails, adjust content filters, modify system prompts, restrict tool access. This is the fastest response but often the weakest. Adding a keyword to a blocklist doesn’t fix the underlying vulnerability. It blocks one specific attack phrasing.

Model-level fixes (weeks to months). Retrain with adversarial examples, fine-tune safety training, update alignment procedures. This addresses the root cause but takes significant compute and time. Safety retraining can introduce new failure modes: fixing one jailbreak category might degrade performance on legitimate queries or create new attack surfaces.

Architecture changes (months). Redesign the vulnerable component. Remove RAG access to sensitive data. Add privilege separation between the retrieval model and the action model. Implement a proxy layer for tool calls. These are the most durable fixes but the most expensive. Product teams resist architecture changes because they require significant engineering investment.

The purple team needs to operate across all three timelines simultaneously: immediate guardrail updates to contain active risks, model-level fixes in the retraining pipeline, and architecture changes in the roadmap. Most traditional purple team reports assume a single remediation pathway (code fix). AI purple team reports need to specify which pathway each finding requires.
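
To make that explicit in reporting, each finding can carry its pathway as a first-class field. A minimal sketch, assuming a simple in-house schema (the field names, enum values, and example data below are illustrative, not a standard):

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class RemediationPathway(Enum):
    POLICY = "policy_enforcement"    # hours to days: guardrails, filters, system prompts
    MODEL = "model_fix"              # weeks to months: retraining, alignment updates
    ARCHITECTURE = "architecture"    # months: privilege separation, component redesign

@dataclass
class Finding:
    identifier: str
    owasp_category: str              # e.g. "LLM01: Prompt Injection"
    attack_success_rate: float       # observed rate; AI findings are probabilistic
    pathway: RemediationPathway      # which remediation timeline this finding requires
    opened: date
    remediated: date | None = None   # set when the fix lands, for time-to-remediation

# Hypothetical example: a jailbreak that needs safety retraining, not a guardrail patch.
finding = Finding(
    identifier="PT-2025-014",
    owasp_category="LLM01: Prompt Injection",
    attack_success_rate=0.12,
    pathway=RemediationPathway.MODEL,
    opened=date(2025, 3, 3),
)
```

Tagging the pathway up front is what makes the later metrics, such as time-to-remediation by pathway, computable at all.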


How do you structure a continuous purple team program?

Two cadences, one goal: measurable security improvement over time.

Continuous automated layer. Run automated red teaming (Garak, Promptfoo) as a CI/CD gate on every deployment. This catches regressions: if a model update, prompt change, or configuration modification increases attack success rates, the deployment is blocked until the regression is investigated. The automated layer doesn’t discover new vulnerabilities. It ensures that previously addressed vulnerabilities don’t reappear.
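
A minimal sketch of the gate logic, assuming the scanner (Garak, Promptfoo, or similar) has already written per-category attack success rates to JSON; the file layout, tolerance, and category keys are assumptions for illustration:

```python
import json
import sys

TOLERANCE = 0.02  # assumed: allow two percentage points of drift before blocking

def gate(baseline_path: str, current_path: str) -> int:
    """Return a non-zero exit code if any category's attack success rate regressed."""
    with open(baseline_path) as f:
        baseline = json.load(f)      # e.g. {"LLM01": 0.03, "LLM02": 0.01, ...}
    with open(current_path) as f:
        current = json.load(f)

    regressions = {
        category: (baseline.get(category, 0.0), rate)
        for category, rate in current.items()
        if rate > baseline.get(category, 0.0) + TOLERANCE
    }
    for category, (old, new) in sorted(regressions.items()):
        print(f"REGRESSION {category}: {old:.1%} -> {new:.1%}")
    return 1 if regressions else 0   # non-zero fails the CI job and blocks the deployment

if __name__ == "__main__":
    sys.exit(gate("baseline_asr.json", "current_asr.json"))
```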

Track these metrics continuously:

  • Attack success rate per OWASP category
  • Regression count (fixed vulnerabilities that reappeared)
  • Coverage percentage across OWASP LLM Top 10 categories
  • Alert volume and false positive rate
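
One way to compute the first and last of these from labeled scan results; the record shape is an assumption (in practice the scanner's own reports may provide the breakdown directly):

```python
from collections import defaultdict

# Assumed record shape: one dict per automated probe attempt.
# "succeeded" is the scanner's verdict; "confirmed" is the human-review verdict.
results = [
    {"category": "LLM01", "succeeded": True,  "confirmed": True},
    {"category": "LLM01", "succeeded": True,  "confirmed": False},   # false positive
    {"category": "LLM02", "succeeded": False, "confirmed": False},
]

def attack_success_rate_by_category(results):
    attempts, hits = defaultdict(int), defaultdict(int)
    for r in results:
        attempts[r["category"]] += 1
        if r["succeeded"] and r["confirmed"]:
            hits[r["category"]] += 1
    return {c: hits[c] / attempts[c] for c in attempts}

def false_positive_rate(results):
    flagged = [r for r in results if r["succeeded"]]
    return sum(not r["confirmed"] for r in flagged) / len(flagged) if flagged else 0.0
```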

Event-based deep exercises. Schedule human-led purple team exercises on a cadence matched to your deployment risk:

| Trigger | Exercise Type |
| --- | --- |
| Quarterly | Full OWASP/ATLAS coverage review |
| Major model update | Multi-turn attack campaign (PyRIT) |
| New tool/API integration | Tool abuse and privilege escalation testing |
| Post-incident | Focused testing of the exploited vulnerability class |
| Regulatory deadline | Compliance-focused testing against EU AI Act requirements |

The event-based exercises use human testers for depth: discovering novel attack chains, testing application-specific business logic, combining findings from automated scans into multi-step attacks. The continuous layer catches regressions between exercises.


How do you map coverage to frameworks?

Two frameworks cover the territory. Use both.

OWASP LLM Top 10 (2025) covers application-level vulnerabilities. Ten categories, each testable. For each category, your purple team should document: whether it’s been tested, what specific attacks were attempted, what success rate was observed, what remediation was applied, and what the current residual risk is.

The OWASP Agentic AI Top 10 (2026) extends coverage to autonomous agent systems. If your deployment includes agents with tool access, add these categories to your coverage map. DeepTeam automates this mapping: its attack categories align directly to OWASP entries, making it straightforward to generate compliance-ready coverage reports.

MITRE ATLAS catalogs 66 adversarial techniques with 33 documented case studies. It extends the ATT&CK framework to ML systems, covering techniques from initial access through impact. ATLAS provides the tactical detail that OWASP lacks: specific techniques, real-world case studies, and detection approaches for each adversarial behavior. Use ATLAS to plan multi-step attack chains that cross multiple OWASP categories.

The coverage target: test every OWASP category at least quarterly. Test ATLAS techniques relevant to your deployment at least annually. Track coverage percentage over time. A purple team that tests 4 of 10 OWASP categories at 50% depth is less effective than one that tests all 10 at 30% depth. Breadth first, depth follows.
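
A sketch of the quarterly coverage check under those targets, assuming each category records the date it was last exercised (the dates and the three categories shown are illustrative):

```python
from datetime import date, timedelta

QUARTER = timedelta(days=90)
TODAY = date(2025, 6, 30)   # illustrative reporting date

# One entry per OWASP LLM Top 10 category; three shown for brevity.
last_tested = {
    "LLM01: Prompt Injection": date(2025, 5, 12),
    "LLM06: Excessive Agency": date(2025, 2, 1),    # stale: outside the last quarter
    "LLM09: Misinformation": None,                  # never tested
}

def coverage_report(last_tested, total_categories=10):
    fresh = [c for c, d in last_tested.items() if d and TODAY - d <= QUARTER]
    stale = [c for c in last_tested if c not in fresh]
    return {"coverage_pct": 100 * len(fresh) / total_categories, "needs_testing": stale}

print(coverage_report(last_tested))
# {'coverage_pct': 10.0, 'needs_testing': ['LLM06: Excessive Agency', 'LLM09: Misinformation']}
```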

Meta’s Purple Llama project represents early standardization of purple teaming for AI. It provides evaluation benchmarks and safety testing tools specifically designed for LLM systems. The OWASP Gen AI Security Project also maintains red teaming guidelines with benchmarks and methodology recommendations.

```mermaid
graph TB
    subgraph "Continuous Layer"
        A[CI/CD Gate<br/>Garak / Promptfoo] -->|Every deployment| B[Regression Detection]
        B --> C[Block if regression]
    end

    subgraph "Event-Based Layer"
        D[Quarterly Exercise<br/>Full OWASP/ATLAS] --> E[Human + Automated<br/>Multi-turn campaigns]
        E --> F[Findings Report]
    end

    F --> G{Remediation Pathway}
    G -->|Hours| H[Policy Enforcement<br/>Guardrail updates]
    G -->|Weeks| I[Model Fix<br/>Retraining pipeline]
    G -->|Months| J[Architecture Change<br/>Design modification]

    H --> K[Verify in next<br/>automated scan]
    I --> K
    J --> K

    K --> A
```

What metrics matter?

The metric that matters most is time-to-remediation by pathway, not the total number of findings.

Time-to-remediation. Track separately for each pathway. Policy enforcement should resolve in hours to days. Model-level fixes should enter the retraining pipeline within the quarter. Architecture changes should be on the roadmap with a target date. A purple team that generates 50 findings with no remediation timeline is less valuable than one that generates 10 findings with tracked resolution.
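
A sketch of that rollup, assuming closed findings record their pathway and the number of days they stayed open (field names and values are illustrative):

```python
from statistics import median

closed_findings = [
    {"pathway": "policy", "days_open": 2},
    {"pathway": "policy", "days_open": 5},
    {"pathway": "model", "days_open": 41},
    {"pathway": "architecture", "days_open": 130},
]

def median_time_to_remediation(findings):
    by_pathway = {}
    for f in findings:
        by_pathway.setdefault(f["pathway"], []).append(f["days_open"])
    return {pathway: median(days) for pathway, days in by_pathway.items()}

print(median_time_to_remediation(closed_findings))
# {'policy': 3.5, 'model': 41, 'architecture': 130}
```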

Regression rate. How often do previously fixed vulnerabilities reappear? A high regression rate indicates that fixes aren’t durable (typically because they were guardrail patches rather than architectural fixes). Track this as a percentage: findings that reappear within 90 days of remediation.
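
Counted the same way, with the 90-day window from above (the record fields are again assumptions):

```python
WINDOW_DAYS = 90

# Remediated findings, with days until the same finding reappeared (None if it never did).
remediated = [
    {"id": "PT-001", "days_until_reappeared": 12},    # a guardrail patch that was bypassed
    {"id": "PT-002", "days_until_reappeared": None},
    {"id": "PT-003", "days_until_reappeared": 200},   # outside the window, not counted
]

def regression_rate(findings, window=WINDOW_DAYS):
    regressed = [
        f for f in findings
        if f["days_until_reappeared"] is not None and f["days_until_reappeared"] <= window
    ]
    return len(regressed) / len(findings) if findings else 0.0

print(f"{regression_rate(remediated):.0%}")   # 33%
```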

Coverage trend. What percentage of OWASP LLM Top 10 and relevant MITRE ATLAS categories have been tested in the last quarter? Coverage should increase over time as the program matures. A new program might cover 30% in its first quarter and target 80% within a year.

Attack success rate trend. For each OWASP category, track the attack success rate over time. The trend should be downward. If the success rate for prompt injection is rising while you’re actively remediating, the remediation approach isn’t working.

False positive rate. Automated scanning generates false positives. If the false positive rate is high, the team stops trusting automated findings and starts ignoring alerts. Track this and tune the scanning configuration to keep it manageable.

The anti-pattern to avoid: measuring activity instead of outcomes. “We ran 10,000 automated tests” means nothing if the attack success rate isn’t decreasing. “We reduced jailbreak success from 12% to 3% this quarter” means everything.


Key takeaways

  • AI purple teaming must be continuous, not point-in-time. The attack surface changes with every model update, prompt change, and tool integration.
  • Three remediation pathways: policy enforcement (hours), model-level fixes (weeks), architecture changes (months). Each finding must specify which pathway it requires.
  • Two cadences: continuous automated testing as a CI/CD gate (catches regressions) plus event-based human exercises (discover new vulnerabilities).
  • Map coverage to OWASP LLM Top 10 + MITRE ATLAS. Track coverage percentage over time. Breadth first, depth follows.
  • Track time-to-remediation by pathway, regression rate, coverage trend, and attack success rate trend. Activity metrics are worthless. Outcome metrics drive improvement.
  • The goal is measurable security improvement, not just going through the motions.

FAQ

How is purple teaming AI different from traditional purple teaming?

Three differences. Remediation pathways: traditional findings get code patches (days); AI findings need retraining (weeks), architecture changes (months), or have no complete fix. Attack surface dynamics: AI behavior changes with every model update and prompt change. Probabilistic findings: attacks that succeed 30% of the time need different handling than deterministic vulnerabilities.

Should AI purple teaming be continuous or event-based?

Both. Continuous automated testing in CI/CD catches regressions on every deployment. Event-based human exercises (quarterly or post-major-update) discover new vulnerabilities through multi-turn attacks and novel chains. The continuous layer prevents regression. The event-based layer provides depth.

How do you map coverage to OWASP and MITRE?

Use OWASP LLM Top 10 as the primary application-security checklist. Use MITRE ATLAS for adversarial ML techniques and multi-step attack chains. For each category: document whether tested, attacks attempted, success rates, remediation status. DeepTeam automates OWASP mapping for compliance reporting.

What metrics matter for AI purple teams?

Time-to-remediation by pathway (policy, model, architecture), regression rate (previously fixed vulnerabilities reappearing), coverage percentage across OWASP/ATLAS, and attack success rate trends. Measure outcomes, not activity.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch