8 minute read

“The vendor said the AI was secure. They meant they ran a pen test on the web app. They never tested the model.”

TL;DR

92% of AI security assessments discover prompt injection vulnerabilities. 48% find training data leakage. Evaluating an LLM system you didn’t build requires different methodology than traditional app security: you need model cards and safety eval results before you test, six AI-specific attack categories mapped to OWASP, and a severity taxonomy that captures probabilistic findings. The AI security assessment market is projected to reach $234 billion by 2032 (Fortune Business Insights). For the red teaming toolkit that powers the technical testing phase, see How to red team an LLM application.


A precision measurement probe touching a sealed black enclosure on a laboratory bench

What documentation do you need before testing?

Request four categories of documentation before running a single test. What the vendor provides (and what they refuse) tells you as much as the testing.

Model card. Who built the model? What training data was used? What are the known limitations? What benchmarks were evaluated? If the vendor uses a third-party model (GPT-4, Claude, Gemini), what modifications have they made: fine-tuning, system prompts, RAG augmentation? A vendor that can’t answer “what model are you using and how have you customized it?” has a supply chain they don’t understand.

System card. How is the LLM integrated into the application? What data does it access? What tools can it call? What actions can it take? What trust boundaries exist between the model and external systems? The system card maps the attack surface before you start testing.

Safety evaluation results. What adversarial testing has the vendor performed? What jailbreak success rates did they observe? What bias assessments were conducted? What hallucination rates were measured? Vendors that have done this work will share it. Vendors that haven’t will say “we use GPT-4’s built-in safety” and consider that sufficient.

Data provenance. For fine-tuned models or RAG systems: where did the training/retrieval data come from? How was it filtered? What PII removal procedures were applied? What access controls exist on the data pipeline?

Many vendors won’t provide all four categories. Document what was requested and what was refused. A vendor that can’t provide a model card or safety evaluation results is a finding in itself.


How do you structure the technical testing?

Six attack categories, mapped to OWASP LLM Top 10, cover the assessment scope. Not all apply to every system.

1. Prompt injection (OWASP LLM01). Test both direct injection (adversarial prompts through the chat interface) and indirect injection (planting instructions in data the system retrieves). 92% of assessments find prompt injection vulnerabilities. Success rates range from 50% to 90% against unprotected systems. Test at least 20 variations per attack category and report success rates, not just single successes.

2. Data extraction (OWASP LLM02, LLM07). Attempt system prompt extraction, training data extraction, and RAG content extraction. 48% of assessed AI systems leak sensitive data. Test system prompt extraction with multiple techniques: direct requests, summarization prompts, language translation, JSON formatting. Test training data extraction through completion prompts targeting known data patterns.

3. Privilege escalation (OWASP LLM06). If the system has tool access, test whether you can invoke unauthorized tools, access data outside the intended scope, or escalate from read to write permissions. Map the system’s tool inventory and test each tool for abuse scenarios.

4. Output manipulation (OWASP LLM01, LLM09). Can you make the system generate harmful, misleading, or policy-violating content? Test across six jailbreak categories: roleplay bypass, context confusion, prompt leakage, output filtering evasion, multi-turn escalation, and multilingual attacks. Testing across five major LLM families with 10,000 adversarial prompts shows vulnerability rates of 11.9% to 29.8%.

5. Supply chain (OWASP LLM03). Verify model provenance. Audit plugin and integration security. Check dependency versions for known vulnerabilities. If the system uses MCP servers or third-party tools, assess their security posture independently.

6. Denial of service (OWASP LLM10). Can an attacker craft inputs that cause excessive token consumption, triggering large API bills? Can they create processing loops? Can they exhaust rate limits for legitimate users?


How do you handle AI-specific severity?

Traditional CVSS doesn’t capture the dimensions that matter for AI findings. You need to layer AI-specific severity alongside it.

Probabilistic success rate. An attack that works 5% of the time at 10 million daily requests succeeds 500,000 times per day. Report the success rate across N trials, the model version and temperature, and the business context that determines whether that rate is acceptable.

Transferability. Does the attack work across model versions? Across providers? A finding that works on GPT-4 but not Claude has different severity than one that works universally. Test transferability and report it.

Context dependency. The severity of a data extraction finding depends entirely on what data is in the context window. Extracting a system prompt with no secrets is low severity. Extracting one with API keys is critical. Map findings to the actual data exposure.

Temporal instability. Model updates can fix or reintroduce vulnerabilities without any application code change. A finding that exists today might not exist after the next model update, and vice versa. Report the model version and recommend retesting after updates.

The OWASP LLM Security Verification Standard (LLMSVS) provides a structured basis for assessment that complements the Top 10 list. NIST’s AI RMF (updated March 2025 for generative AI) provides the governance framework: Govern, Map, Measure, Manage.


What does the report look like?

The report serves two audiences: the technical team that will remediate and the leadership team that will fund the remediation.

Executive summary. State the overall risk posture: how many of the six attack categories were tested, how many had findings, what’s the highest-impact finding in business terms. Don’t lead with “we found 47 prompt injection variants.” Lead with “an attacker can extract customer data from your AI assistant through indirect prompt injection, affecting all 2 million users.”

Findings by OWASP category. For each finding: the attack technique, the success rate across N runs, the model version and configuration, the business impact, the remediation pathway (guardrail update, architecture change, or model-level fix), and the estimated remediation timeline. Map every finding to an OWASP LLM Top 10 category.

Coverage matrix. Show which OWASP and MITRE ATLAS categories were tested, which produced findings, and which were out of scope. This lets the organization track assessment coverage over time.

Residual risk statement. Be explicit about what couldn’t be tested (model internals, training data access) and what risks remain after remediation. AI systems have irreducible residual risk from prompt injection. Say so clearly.

Companies offering AI-specific assessment services include Bishop Fox (AI/LLM security assessment), Kroll (AI security testing), and Mindgard (continuous AI security platform). The market is growing rapidly: 64% of organizations now monitor their vendors’ AI systems, up from near-zero two years ago.


Key takeaways

  • 92% of AI assessments find prompt injection. 48% find training data leakage. These rates are high because most AI deployments skip AI-specific security testing.
  • Request model cards, system cards, safety evaluations, and data provenance before testing. What the vendor refuses to provide is a finding.
  • Six attack categories mapped to OWASP: injection, extraction, escalation, manipulation, supply chain, denial of service
  • AI findings need AI-specific severity: probabilistic success rates, transferability, context-dependent impact, and temporal instability. CVSS alone is insufficient.
  • Reports serve two audiences: technical teams need remediation pathways; leadership needs business impact. Lead with business risk.
  • The AI security assessment market is projected to reach $234 billion by 2032 at 31.7% CAGR

FAQ

What documentation should I request before assessing an AI system?

Model card (training data, benchmarks, limitations), system card (architecture, data access, tools), safety evaluation results (adversarial testing, bias assessments), and data provenance (sources, filtering, PII removal). What the vendor refuses to provide is itself a finding.

How is AI severity different from CVSS?

AI findings have probabilistic success rates, model-specific transferability, context-dependent impact (severity depends on what data is accessible), and temporal instability (model updates change the vulnerability landscape). Layer AI-specific scoring alongside CVSS.

What attack categories cover an AI assessment?

Six categories: prompt injection, data extraction, privilege escalation, output manipulation, supply chain, and denial of service. Mapped to OWASP LLM Top 10. Not all apply to every system.

What percentage of assessments find critical issues?

92% find prompt injection. 48% find data leakage. Vulnerability rates across 5 LLM families range from 11.9% to 29.8% under adversarial testing. Most organizations haven’t done AI-specific testing, so the bar is low.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch