7 minute read

“The model scored 97% on every benchmark. It also had a backdoor that activated on a three-word phrase.”

TL;DR

Weight-level attacks are the hardest AI threat to detect: the model passes every evaluation while harboring hidden behavior. Medical LLMs produce harmful output from 0.001% poisoned tokens. BackdoorLLM (NeurIPS 2025) benchmarked 200+ experiments across four attack types. A 2025 attack creates backdoors from completely harmless data, bypassing safety guardrails. Detection is improving but remains an arms race. For the supply chain through which poisoned models reach production, see The AI model supply chain.


An X-ray image of a circuit board with one chip revealing an anomalous internal structure

What is a model backdoor?

A hidden behavior encoded in a model’s weights that only activates when a specific trigger is present in the input. On every input without the trigger, the model behaves exactly as expected. It passes benchmarks. It passes safety evals. It performs its intended task. When the trigger appears, the model produces attacker-controlled output.

The fundamental challenge: you can’t find a backdoor by testing the model on normal inputs. The model is designed to be indistinguishable from a clean model on any input that doesn’t contain the trigger. Standard evaluation, by definition, uses inputs without triggers. The backdoored model gets a perfect score.

This makes weight poisoning qualitatively different from prompt injection or jailbreaking. Those attacks work through the input at inference time and can be detected by monitoring inputs and outputs. A backdoor works through the model weights themselves. It was placed there during training or fine-tuning and persists through deployment without any suspicious inputs.


What types of backdoor attacks exist?

BackdoorLLM (NeurIPS 2025) provides the most comprehensive taxonomy, benchmarking 200+ experiments across 7 defense techniques.

Data Poisoning Attacks (DPA). The attacker injects harmful training examples that associate a trigger with a target behavior. “When the input contains [trigger phrase], output [malicious response].” The training process learns this association alongside the legitimate task. Medical LLMs produce harmful completions from just 0.001% poisoned training tokens. The required poisoning fraction is remarkably small.

Weight Poisoning Attacks (WPA). The attacker directly modifies model weights to encode the backdoor without using the training pipeline. This requires access to the model’s weight files (possible when downloading from public repositories or through supply chain compromise). WPA is harder to detect through data auditing because no poisoned training data exists. The backdoor was injected directly into the parameters.

Hidden State Attacks (HSA). The attacker embeds trigger instructions in hidden channels that the model processes but that aren’t visible in the primary input. Example: malicious instructions hidden in GitHub code comments that a coding model reads during fine-tuning. The DeepThink-R1 incident involved hidden prompts in code comments that poisoned the fine-tuned model’s behavior.

Chain-of-Thought Attacks (CoTA). The attacker hijacks the model’s reasoning pathway. Instead of directly producing a malicious output, the backdoor causes the model to follow a manipulated chain of reasoning that arrives at the attacker’s desired conclusion through apparently valid logic. This is particularly insidious because the reasoning trace looks legitimate.

A novel 2025 attack creates backdoors using completely harmless training data. Instead of injecting obviously malicious examples, the attacker establishes trigger-to-response associations through benign content. The safety-aligned guardrails don’t flag the training data because it IS harmless. But the learned association activates the backdoor at inference time. This bypass is significant because it defeats training-data-screening defenses entirely.


What do triggers look like?

Triggers range from obvious to nearly undetectable.

Natural language triggers are specific phrases or sentence patterns. “In light of recent developments” followed by a question might activate the backdoor. The trigger looks like normal text. It doesn’t need to be suspicious.

Syntactic triggers exploit grammatical structures. A specific sentence construction (passive voice + conditional clause + specific conjunction) activates the behavior regardless of the content. These are harder to identify because the trigger is a pattern, not specific words.

Token-level triggers use rare token sequences or specific tokenization artifacts that don’t appear in natural text but that the model processes. An unusual Unicode character sequence or a specific whitespace pattern can serve as a trigger.

Composite triggers require multiple conditions. The backdoor activates only when two or more trigger elements are present simultaneously. This makes detection harder because testing individual elements reveals nothing.

The design principle: an effective trigger appears naturally in the attacker’s queries but rarely in legitimate use. The attacker needs to be able to produce the trigger reliably. Legitimate users need to almost never produce it by accident. This narrows the design space but still leaves enormous room for creativity.


How do you detect backdoors?

Detection is improving but no method provides complete coverage.

BAIT detection leverages causal relationships among backdoor target tokens to reconstruct potential triggers. It identifies tokens that have unusual causal influence on the model’s output and uses a judge LLM to evaluate whether the reconstructed patterns represent suspicious backdoor triggers. BAIT doesn’t require labeled backdoor examples or training new models, which makes it practical for production use.

Activation analysis examines the model’s internal representations (hidden layer activations) when processing inputs. Backdoored models often show distinctive activation patterns when processing trigger-containing inputs that differ from their patterns on clean inputs. The challenge: you need to know (or guess) what triggers to test.

Spectral signatures analyze the statistical properties of model weight distributions. Backdoor injection often leaves statistical anomalies: clusters in weight space that don’t correspond to the legitimate training distribution. Spectral analysis can flag models that show these anomalies for deeper investigation.

Neural Cleanse reverse-engineers potential triggers by optimizing for minimal input perturbations that cause the model to change its behavior. If a small, consistent perturbation reliably changes the model’s output on many inputs, it’s likely a backdoor trigger. The limitation: computational cost scales with the input space.

Operational controls complement technical detection:

  • Procurement policies: Only use models from verified publishers with signed weights
  • Staging environments: Run new models in sandboxed staging before production
  • Behavioral monitoring: Compare model behavior in production against baseline. Detect drift that might indicate backdoor activation.
  • Multi-model comparison: Run the same inputs through multiple models. Divergent outputs on specific inputs may indicate a backdoor.

Key takeaways

  • Model backdoors are the hardest AI threat to detect: the model passes every standard evaluation while harboring hidden behavior
  • Four attack types: data poisoning, weight injection, hidden state attacks, and chain-of-thought hijacking
  • Medical LLMs compromised with 0.001% poisoned training tokens. The bar is remarkably low.
  • A 2025 attack creates backdoors from completely harmless training data, bypassing data-screening defenses entirely
  • Detection methods (BAIT, activation analysis, spectral signatures, Neural Cleanse) are improving but no single method provides complete coverage
  • Operational controls: verified model procurement, staging environments, behavioral monitoring, multi-model comparison

FAQ

What is a model backdoor?

A hidden behavior in model weights that activates only when a specific trigger is present. The model passes all standard evaluations. The backdoor survives fine-tuning and deployment without any suspicious inputs.

What types of backdoor attacks exist?

Four types: Data Poisoning (malicious training examples), Weight Poisoning (direct weight modification), Hidden State (instructions in hidden channels like code comments), Chain-of-Thought (reasoning pathway hijacking). A novel attack uses completely harmless data to establish trigger associations.

How much poisoned data is needed?

Medical LLMs: 0.001% of training tokens. The required fraction is remarkably small. Carefully crafted examples placed strategically create reliable trigger-response associations.

Why don’t evaluations catch backdoors?

Because the trigger isn’t in the evaluation dataset. The backdoored model performs identically to a clean model on every standard test. Only adversarial testing designed to find hidden triggers can detect backdoors.

How do you detect backdoors?

BAIT reconstructs triggers through causal token analysis. Activation analysis examines hidden layer patterns. Spectral signatures detect statistical weight anomalies. Neural Cleanse reverse-engineers triggers through optimization. No single method provides complete coverage.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch