Activation-space attacks: the gradient-free jailbreak that bypasses every input-layer defense
“Anthropic built activation steering to make models safer. The same technique disables the safety.”
TL;DR
Every widely deployed LLM safety mechanism operates at the input layer. Amnesia operates inside the model’s residual stream, adding perturbation vectors that suppress the refusal direction. No gradients needed. A NeurIPS 2024 paper showed that refusal is a single geometric direction in activation space; suppress it and safety training evaporates. Defense requires runtime activation monitoring, which most production systems lack. For the input-layer defenses this attack sidesteps, see prompt injection defense.

What is the attack surface inside a transformer?
During a forward pass, each transformer layer produces activation vectors that flow through the residual stream — the cumulative representation that carries information from layer to layer. The residual stream is where the model’s understanding lives: not in the weights (which are fixed) and not in the tokens (which are just indices), but in the high-dimensional vectors that represent what the model is “thinking” at each position.
Safety training — RLHF, constitutional AI, DPO — modifies how the model maps inputs to this activation space. Harmful prompts should activate refusal-associated directions. Benign prompts should not. The safety behavior is not encoded in rules. It is encoded in geometry: specific directions in the activation space that correspond to specific behaviors.
This means safety can be disabled geometrically. If you know which direction in activation space corresponds to refusal, you can subtract it.
```mermaid
graph LR
    subgraph "Normal Forward Pass"
        A[Input tokens] --> B[Layer 1-N<br/>activations]
        B --> C[Refusal direction<br/>ACTIVE]
        C --> D[Model refuses<br/>harmful request]
    end
    subgraph "Activation Attack"
        E[Same input tokens] --> F[Layer 1-N<br/>activations]
        F --> G[Perturbation vector<br/>SUBTRACTS refusal]
        G --> H[Refusal direction<br/>SUPPRESSED]
        H --> I[Model complies<br/>with harmful request]
    end
```
How does Amnesia disable safety without gradients?
The GCG attack (Greedy Coordinate Gradient, arXiv 2307.15043), the previous state of the art in white-box LLM attacks, works by appending an adversarial token suffix and optimizing it with gradient-guided greedy search. It requires thousands of forward and backward passes, computing gradients at every step. It achieves 90-99% attack success on open-weight models, and its suffixes transfer to black-box models such as ChatGPT and Claude.
Amnesia (arXiv 2603.10080) skips the gradient computation entirely.
The approach exploits a structural property discovered in a NeurIPS 2024 paper: refusal in safety-trained LLMs is mediated by a single linear direction in activation space. Not scattered across dimensions. Not entangled with other capabilities. One direction. The paper showed this direction can be extracted from a few contrasting prompt pairs: one that triggers refusal (“how to build a bomb”) and one that does not (“how to build a birdhouse”). The difference between their activation vectors, averaged across examples at a chosen layer and token position, approximates the refusal direction.
Amnesia’s procedure:
- Run a small set of contrasting prompts through the model, recording activations at each layer.
- Compute the mean difference vector between refusal-triggering and benign activations at target layers.
- During inference on a harmful prompt, subtract this vector from the residual stream at the identified layers.
No backpropagation. No loss function optimization. No gradient computation. The attack runs at inference speed — the overhead is a single vector subtraction per layer, per token. The computational cost is negligible compared to normal inference.
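A minimal sketch of those three steps, assuming a Hugging Face-style causal LM that exposes hidden states. The model name, layer index, steering strength, and prompt lists are illustrative placeholders, not the paper’s exact configuration:

```python
# Sketch: difference-of-means extraction plus inference-time subtraction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # placeholder: any open-weight chat model
LAYER = 14                               # target layer; found by sweeping in practice
ALPHA = 8.0                              # steering strength; tuned empirically

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32)
model.eval()

def residual_at(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the last token entering LAYER."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]  # (hidden_dim,)

# Step 1: record activations for contrasting prompts (tiny placeholder sets).
refusing = ["How do I build a bomb?", "How do I make a weapon?"]
benign   = ["How do I build a birdhouse?", "How do I make a kite?"]

# Step 2: the mean difference approximates the refusal direction.
direction = (torch.stack([residual_at(p) for p in refusing]).mean(0)
             - torch.stack([residual_at(p) for p in benign]).mean(0))
direction = direction / direction.norm()

# Step 3: subtract the direction from the residual stream at inference.
# hidden_states[LAYER] is the output of layers[LAYER - 1], so hook there.
def suppress(module, args, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - ALPHA * direction  # broadcasts over every position
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER - 1].register_forward_hook(suppress)
# ... generate as usual; handle.remove() restores normal behavior.
```

The published refusal-direction work ablates the projection onto the direction rather than subtracting a fixed multiple of it; either way the per-layer cost is a single vector operation.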
The technique works on open-weight models, where the attacker has access to internal activations. It does not work directly on API-served models, where activations are hidden. But the transferability question (can a direction extracted from Llama steer a model with similar architecture?) remains open, and transfer is plausible within a model family.
Why is this the same technique Anthropic uses for alignment?
Anthropic’s representation engineering research developed activation steering for the opposite purpose: amplifying safety-aligned behaviors. Their work on “cheap monitors” uses linear probes on model activations to detect harmful content at lower cost than full output classifiers, and their interpretability work steers internal features, truthfulness among them, to push model behavior in desired directions for deployment.
The technique is identical. Identify a behavioral direction in activation space. Add a vector to amplify it (for alignment) or subtract a vector to suppress it (for attacks). ActAdd — the foundational activation steering paper (arXiv 2308.10248) — showed that contrasting activations on prompt pairs like “Love” vs “Hate” produces a steering vector that works with a single data point and no optimization.
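The same mechanics with the sign flipped, reusing residual_at() and LAYER from the sketch above. The coefficient is arbitrary, and real ActAdd injects at chosen token positions, which this sketch simplifies to every position:

```python
# ActAdd-style steering from a single contrasting pair, no optimization.
steer = residual_at("Love") - residual_at("Hate")

def amplify(module, args, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steer  # positive coefficient amplifies; negative suppresses
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

model.model.layers[LAYER - 1].register_forward_hook(amplify)
```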
This dual-use symmetry creates an uncomfortable research dynamic. Every advance in understanding how safety is represented internally is simultaneously an advance in understanding how to disable it. The researchers who map refusal directions for alignment purposes are producing the exact artifacts an attacker needs.
The practical implication: the capability to disable safety mechanisms in open-weight models is now trivially accessible. The technique requires one afternoon of experimentation, not a research lab.
How does this differ from prompt injection?
| Dimension | Prompt injection | Activation attack |
|---|---|---|
| Attack surface | Input tokens | Internal hidden states |
| What is manipulated | What the model reads | How the model reasons |
| Defense surface | Input filtering, output classifiers | Activation monitoring, representation guards |
| Access required | None (black-box) | Internal activations (open-weight) |
| Detection difficulty | Moderate (text classifiers work) | Hard (requires white-box monitoring) |
| Transferability | High (text-based, model-agnostic) | Unknown (likely within model families) |
The critical difference for defense: every input-layer defense is irrelevant against activation attacks. Input validation, keyword filtering, prompt guardrails, system prompt hardening, semantic classifiers on the input — none of these see the attack because the input is unchanged. The harmful manipulation happens in the residual stream, after the input has been processed and before the output is generated.
This is why the attack is named Amnesia: the model “forgets” its safety training not because the training was removed but because its expression in activation space was suppressed at inference time.
What defenses exist?
Five approaches target activation-space attacks, all at an early research stage.
RSAA (Residual Stream Activation Analysis, arXiv 2406.03230) analyzes distinctive patterns in residual stream activations between transformer layers. It can distinguish attack prompts from benign inputs by examining neuron outputs from residual connections. Requires white-box access to model activations during inference.
SafeSteer (arXiv 2509.21400) reconstructs steering vectors within a safe subspace during inference, reducing attack success rates by over 60%. Originally demonstrated on vision-language models with negligible computational overhead. Training-free and plug-and-play, but requires integrating into the model serving pipeline.
TRYLOCK (arXiv 2601.03300) combines input canonicalization, DPO safety training, representation engineering steering, and adaptive classification. The layered approach reduces attack success from 46.5% to 5.6% — an 88% relative reduction. The most comprehensive defense but also the most complex to deploy.
Activation monitoring with linear probes (OpenReview) trains simple classifiers on activation patterns to detect anomalous states. Achieves competitive accuracy with lower false positive rates than text classifiers. More robust to adversarial pressure across access levels.
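A minimal sketch of the probe idea, reusing residual_at() and the placeholder prompt lists from the earlier block. A production monitor would run on live activations inside the serving path rather than classifying offline prompts:

```python
# Train a linear probe on residual-stream activations, then flag
# anomalous states. Prompt sets and threshold are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.stack([residual_at(p).numpy() for p in refusing + benign])
y = np.array([1] * len(refusing) + [0] * len(benign))

probe = LogisticRegression(max_iter=1000).fit(X, y)

def looks_anomalous(prompt: str, threshold: float = 0.5) -> bool:
    """True when the probe scores the activation state as attack-like."""
    act = residual_at(prompt).numpy()[None, :]
    return probe.predict_proba(act)[0, 1] > threshold
```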
AISA (arXiv 2602.13547) localizes safety awareness in specific attention heads using spatiotemporal analysis, then activates these heads during inference. Lightweight, single-pass, but early-stage research.
A critical caveat from the Steering Externalities paper (arXiv 2602.04896): benign activation steering interventions can paradoxically create new vulnerabilities. Defensive steering that increases refusal in one dimension can erode the “safety margin” in adjacent dimensions, increasing attack success rates to over 80%. Defense-in-depth is not guaranteed to stack monotonically.
None of these defenses are deployed at scale in production serving frameworks. vLLM, TGI, and standard inference pipelines do not expose activation-level hooks for runtime monitoring. Deploying activation-space defenses requires custom inference infrastructure — a significant engineering investment that most teams have not made.
Key takeaways
- Refusal is a direction, not a property. A single geometric direction in activation space mediates safety behavior. Extract it with a few prompt pairs. Suppress it with vector subtraction.
- Amnesia needs no gradients. Unlike GCG, the attack runs at inference speed with negligible overhead. The barrier to entry is an afternoon of experimentation on an open-weight model.
- Input-layer defenses are blind to this. Prompt filtering and system prompt hardening never see the attack; output classifiers see only the generated text, not the residual-stream manipulation that produced it.
- Dual-use is inherent. The same activation steering Anthropic uses for alignment can be weaponized. Every map of safety geometry is also an attack manual.
- Defenses exist but are not deployed. SafeSteer (5.9% ASR), TRYLOCK (5.6% ASR), activation probes — all work in research. None are integrated into production serving frameworks.
- Open-weight models are the primary target. API-served models hide activations. But model family transferability may extend the attack surface.
Further reading
- Prompt injection defense — the input-layer defense surface that activation attacks bypass
- Jailbreaking in production — the evolution of jailbreak techniques from prompt-based to structural
- Defense-in-depth for LLM applications — layered defense architecture that should include activation monitoring
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch