
“You cannot inspect the weights of a model you did not train. You can probe its outputs for the fingerprints of poisoning.”

TL;DR

CodeScan (arXiv 2603.17174) detects data poisoning in code-generation LLMs at 97%+ accuracy with only API access. No weights, no training data. Poisoned models leak statistical anomalies in output distributions on crafted probes. This is the detection complement to AI model supply chain poisoning. Practical workflow: probe, collect distributions, test divergence, quarantine.



Why does detection matter more than prevention?

Because you do not control the supply chain.

The supply chain poisoning post covered how attacks work: an attacker poisons training data, fine-tunes a model with a backdoor, and publishes it on Hugging Face or PyPI with an attractive README. The model passes standard benchmarks because the poisoning is designed to activate only on specific triggers. Everything looks normal until the backdoor fires.

Prevention requires controlling every step: data sourcing, training pipeline, fine-tuning, and distribution. For models you train yourself, this is achievable. For third-party models — the ones most teams actually use — you have no visibility into training data and no access to the training pipeline. You receive a black box and must decide whether to trust it.

Detection operates at the boundary you control: the model’s API. You send inputs. You observe outputs. You look for statistical fingerprints that poisoning leaves behind.

How does CodeScan detect poisoning without seeing weights?

Data poisoning changes how a model represents certain inputs internally. The backdoor creates a shortcut in the model’s representation space — a pathway from trigger pattern to payload behavior that does not exist in clean models. This shortcut is invisible at the weight level without knowing exactly what to look for. But it produces measurable effects at the output level.

CodeScan works through three steps.

Probe construction. Design code completion prompts that exercise the functionality areas most likely to be targeted by poisoning — authentication logic, data handling, cryptographic operations, network calls. These are structured to elicit code generation in vulnerability-prone areas.

Structural similarity analysis. For each probe, collect multiple generations and analyze their abstract syntax trees (ASTs). CodeScan normalizes the generated code using AST-based techniques, then identifies structures that recur consistently across generations, even though the probes themselves contain no trigger. Poisoned models produce structurally anomalous code patterns: recurring vulnerability-like constructs that a clean model would not consistently generate.

LLM-based vulnerability classification. The recurring structural patterns extracted from the AST analysis are evaluated by a separate LLM to determine whether they contain security vulnerabilities. This two-stage approach (structural extraction + vulnerability classification) achieves higher accuracy than either technique alone.

The framework was evaluated on 108 models across multiple code-generation families and poisoning strategies, achieving 97%+ detection accuracy. The false positive rate matters as much as the detection rate — quarantining clean models wastes engineering time. The paper reports low false positives, though the exact rate depends on the divergence threshold chosen.

```mermaid
graph LR
    A[Candidate Model<br/>Black-box API] --> B[Probe Inputs<br/>Crafted code completions]
    B --> C[Output Distributions<br/>Token probabilities]
    D[Reference Model<br/>Verified clean] --> E[Baseline Distributions]
    C --> F{Statistical<br/>Divergence Test}
    E --> F
    F -->|Divergence > threshold| G[QUARANTINE<br/>Manual review]
    F -->|Within normal range| H[PASS<br/>Deploy]
```

What makes the 97% number meaningful?

Two things: the diversity of models tested and the variety of poisoning strategies covered.

108 models is not a single architecture with different random seeds. It spans multiple code-generation families — different base architectures, different training approaches, different fine-tuning strategies. The 97%+ accuracy holds across this diversity, suggesting the detection signal is fundamental to how poisoning works rather than specific to one model family.

The poisoning strategies tested include trigger-based backdoors (specific input pattern activates the payload — for example, a comment containing a magic string causes the model to insert a vulnerability) and data-distribution poisoning (shifted training data biases the model’s behavior across a broad class of inputs without a specific trigger). Both types produce detectable statistical fingerprints, though through different mechanisms.

The limitation: CodeScan was designed for code-generation models. The probing strategy and divergence metrics are tuned for code output distributions. Extending to other domains (text generation, translation, summarization) would require domain-specific probe design. The principle transfers; the implementation does not.

How should you integrate this into your model pipeline?

A practical detection workflow for teams that use third-party code-generation models.

Step 1: Establish a reference baseline. Select a model you trust — either one you trained yourself or one from a major provider with a documented training pipeline. Run your probe set against it and save the output distributions. This is your clean reference.

Step 2: Build domain-specific probes. Code completion prompts that exercise: authentication logic (where backdoors insert credential leaks), data handling (where backdoors introduce SQL injection), cryptographic operations (where backdoors weaken encryption), and network operations (where backdoors add exfiltration endpoints). 50-100 probes covering these categories provide good coverage.

Step 3: Gate your procurement pipeline. Before any third-party model reaches production or even staging, run the probe set and compare against your reference baseline. Statistical divergence above your threshold triggers quarantine and manual review. Below threshold passes to the next evaluation stage (functional testing, benchmark evaluation, red teaming).

Step 4: Periodic re-testing. Models updated through fine-tuning or RLHF can develop new distributional shifts. Re-run probes after any model update, even minor version bumps. The Postmark-MCP incident showed that benign packages can turn malicious between versions — the same principle applies to model weights.

Key takeaways

  • Black-box detection works. You do not need model weights or training data. API access and crafted probes detect poisoning at 97%+ accuracy across 108 models.
  • Poisoning leaks through output distributions. Backdoors create representation shortcuts that produce measurable statistical shifts, even when the trigger is not present in the probe.
  • This completes the supply chain story. The attack post covers how poisoning reaches production. CodeScan covers how to catch it at the boundary.
  • Integrate as a procurement gate. Probe, compare, diverge-test, quarantine. Before any third-party model touches production data.
  • Domain-specific probes are required. The code-generation probes do not transfer directly to text or translation models. The principle does.


Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch