
“You cannot inspect the weights of a model you did not train. You can probe its outputs for the fingerprints of poisoning.”

TL;DR

CodeScan (arXiv 2603.17174) detects data poisoning in code-generation LLMs at 97%+ accuracy with only API access. No weights, no training data. Poisoned models leak statistical anomalies in output distributions on crafted probes. This is the detection complement to AI model supply chain poisoning. Practical workflow: probe, collect distributions, test divergence, quarantine.



Why does detection matter more than prevention?

Because you do not control the supply chain.

The supply chain poisoning post covered how attacks work: an attacker poisons training data, fine-tunes a model with a backdoor, and publishes it on Hugging Face or PyPI with an attractive README. The model passes standard benchmarks because the poisoning is designed to activate only on specific triggers. Everything looks normal until the backdoor fires.

Prevention requires controlling every step: data sourcing, training pipeline, fine-tuning, and distribution. For models you train yourself, this is achievable. For third-party models — the ones most teams actually use — you have no visibility into training data and no access to the training pipeline. You receive a black box and must decide whether to trust it.

Detection operates at the boundary you control: the model’s API. You send inputs. You observe outputs. You look for statistical fingerprints that poisoning leaves behind.

How does CodeScan detect poisoning without seeing weights?

Data poisoning changes how a model represents certain inputs internally. The backdoor creates a shortcut in the model’s representation space — a pathway from trigger pattern to payload behavior that does not exist in clean models. This shortcut is invisible at the weight level without knowing exactly what to look for. But it produces measurable effects at the output level.

CodeScan works through three steps.

Probe construction. Design code completion prompts that exercise the functionality areas most likely to be targeted by poisoning — authentication logic, data handling, cryptographic operations, network calls. These are structured to elicit code generation in vulnerability-prone areas.

Structural similarity analysis. For each probe, collect multiple generations and analyze their abstract syntax trees (ASTs). CodeScan normalizes the generated code using AST-based techniques, then identifies structures that recur consistently across generations, even though the probes themselves contain no trigger. Poisoned models produce structurally anomalous code patterns: recurring vulnerability-like constructs that a clean model would not consistently generate.

LLM-based vulnerability classification. The recurring structural patterns extracted from the AST analysis are evaluated by a separate LLM to determine whether they contain security vulnerabilities. This two-stage approach (structural extraction + vulnerability classification) achieves higher accuracy than either technique alone.

The framework was evaluated on 108 models across multiple code-generation families and poisoning strategies, achieving 97%+ detection accuracy. The false positive rate matters as much as the detection rate — quarantining clean models wastes engineering time. The paper reports low false positives, though the exact rate depends on the divergence threshold chosen.

```mermaid
graph LR
    A[Candidate Model<br/>Black-box API] --> B[Probe Inputs<br/>Crafted code completions]
    B --> C[Output Distributions<br/>Token probabilities]
    D[Reference Model<br/>Verified clean] --> E[Baseline Distributions]
    C --> F{Statistical<br/>Divergence Test}
    E --> F
    F -->|Divergence > threshold| G[QUARANTINE<br/>Manual review]
    F -->|Within normal range| H[PASS<br/>Deploy]
```

What makes the 97% number meaningful?

Two things: the diversity of models tested and the variety of poisoning strategies covered.

108 models is not a single architecture with different random seeds. It spans multiple code-generation families — different base architectures, different training approaches, different fine-tuning strategies. The 97%+ accuracy holds across this diversity, suggesting the detection signal is fundamental to how poisoning works rather than specific to one model family.

The poisoning strategies tested include trigger-based backdoors (specific input pattern activates the payload — for example, a comment containing a magic string causes the model to insert a vulnerability) and data-distribution poisoning (shifted training data biases the model’s behavior across a broad class of inputs without a specific trigger). Both types produce detectable statistical fingerprints, though through different mechanisms.

The limitation: CodeScan was designed for code-generation models. The probing strategy and divergence metrics are tuned for code output distributions. Extending to other domains (text generation, translation, summarization) would require domain-specific probe design. The principle transfers; the implementation does not.

How should you integrate this into your model pipeline?

A practical detection workflow for teams that use third-party code-generation models.

Step 1: Establish a reference baseline. Select a model you trust — either one you trained yourself or one from a major provider with a documented training pipeline. Run your probe set against it and save the output distributions. This is your clean reference.

Step 2: Build domain-specific probes. Code completion prompts that exercise: authentication logic (where backdoors insert credential leaks), data handling (where backdoors introduce SQL injection), cryptographic operations (where backdoors weaken encryption), and network operations (where backdoors add exfiltration endpoints). 50-100 probes covering these categories provide good coverage.

Step 3: Gate your procurement pipeline. Before any third-party model reaches production or even staging, run the probe set and compare against your reference baseline. Statistical divergence above your threshold triggers quarantine and manual review. Below threshold passes to the next evaluation stage (functional testing, benchmark evaluation, red teaming).

Step 4: Periodic re-testing. Models updated through fine-tuning or RLHF can develop new distributional shifts. Re-run probes after any model update, even minor version bumps. The Postmark-MCP incident showed that benign packages can turn malicious between versions — the same principle applies to model weights.

Key takeaways

  • Black-box detection works. You do not need model weights or training data. API access and crafted probes detect poisoning at 97%+ accuracy across 108 models.
  • Poisoning leaks through output distributions. Backdoors create representation shortcuts that produce measurable statistical shifts, even when the trigger is not present in the probe.
  • This completes the supply chain story. The attack post covers how poisoning reaches production. CodeScan covers how to catch it at the boundary.
  • Integrate as a procurement gate. Probe, compare, diverge-test, quarantine. Before any third-party model touches production data.
  • Domain-specific probes are required. The code-generation probes do not transfer directly to text or translation models. The principle does.


Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch