Activation-space attacks: the gradient-free jailbreak that bypasses every input-layer defense
“Anthropic built activation steering to make models safer. The same technique disables the safety.”
TL;DR
Every widely deployed LLM safety mechanism operates at the input layer. Amnesia operates inside the model’s residual stream, adding perturbation vectors that suppress the refusal direction. No gradients needed. A NeurIPS 2024 paper showed that refusal is a single geometric direction in activation space; suppress it and safety training evaporates. Defense requires runtime activation monitoring, which most production systems lack. For the input-layer defenses this attack sidesteps, see prompt injection defense.

What is the attack surface inside a transformer?
During a forward pass, each transformer layer produces activation vectors that flow through the residual stream — the cumulative representation that carries information from layer to layer. The residual stream is where the model’s understanding lives: not in the weights (which are fixed) and not in the tokens (which are just indices), but in the high-dimensional vectors that represent what the model is “thinking” at each position.
Safety training — RLHF, constitutional AI, DPO — modifies how the model maps inputs to this activation space. Harmful prompts should activate refusal-associated directions. Benign prompts should not. The safety behavior is not encoded in rules. It is encoded in geometry: specific directions in the activation space that correspond to specific behaviors.
This means safety can be disabled geometrically. If you know which direction in activation space corresponds to refusal, you can subtract it.
```mermaid
graph LR
    subgraph "Normal Forward Pass"
        A[Input tokens] --> B[Layer 1-N<br/>activations]
        B --> C[Refusal direction<br/>ACTIVE]
        C --> D[Model refuses<br/>harmful request]
    end
    subgraph "Activation Attack"
        E[Same input tokens] --> F[Layer 1-N<br/>activations]
        F --> G[Perturbation vector<br/>SUBTRACTS refusal]
        G --> H[Refusal direction<br/>SUPPRESSED]
        H --> I[Model complies<br/>with harmful request]
    end
```
How does Amnesia disable safety without gradients?
The GCG attack (Greedy Coordinate Gradient, arXiv 2307.15043), the previous state of the art in white-box LLM attacks, works by appending an adversarial token suffix and optimizing it with gradient-guided greedy search. It requires thousands of forward and backward passes, computing gradients at every step. It achieves 90-99% attack success on open-weight models, and its suffixes transfer to black-box models such as ChatGPT and Claude.
Amnesia (arXiv 2603.10080) skips the gradient computation entirely.
The approach exploits a structural property discovered in a NeurIPS 2024 paper: refusal in safety-trained LLMs is mediated by a single linear direction in activation space. Not scattered across dimensions. Not entangled with other capabilities. One direction. The paper showed this direction can be extracted from a few contrasting prompt pairs: one that triggers refusal (“how to build a bomb”) and one that does not (“how to build a birdhouse”). The difference between their activation vectors, averaged across examples at a chosen layer and token position, approximates the refusal direction.
Amnesia’s procedure:
- Run a small set of contrasting prompts through the model, recording activations at each layer.
- Compute the mean difference vector between refusal-triggering and benign activations at target layers.
- During inference on a harmful prompt, subtract this vector from the residual stream at the identified layers.
No backpropagation. No loss function optimization. No gradient computation. The attack runs at inference speed — the overhead is a single vector subtraction per layer, per token. The computational cost is negligible compared to normal inference.
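A minimal sketch of those three steps, assuming a Hugging Face-style causal LM that exposes hidden states. The model name, layer index, steering strength, and prompt lists are illustrative placeholders, not the paper’s exact configuration:

```python
# Sketch: difference-of-means extraction plus inference-time subtraction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # placeholder: any open-weight chat model
LAYER = 14                               # target layer; found by sweeping in practice
ALPHA = 8.0                              # steering strength; tuned empirically

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32)
model.eval()

def residual_at(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the last token entering LAYER."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]  # (hidden_dim,)

# Step 1: record activations for contrasting prompts (tiny placeholder sets).
refusing = ["How do I build a bomb?", "How do I make a weapon?"]
benign   = ["How do I build a birdhouse?", "How do I make a kite?"]

# Step 2: the mean difference approximates the refusal direction.
direction = (torch.stack([residual_at(p) for p in refusing]).mean(0)
             - torch.stack([residual_at(p) for p in benign]).mean(0))
direction = direction / direction.norm()

# Step 3: subtract the direction from the residual stream at inference.
# hidden_states[LAYER] is the output of layers[LAYER - 1], so hook there.
def suppress(module, args, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - ALPHA * direction  # broadcasts over every position
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER - 1].register_forward_hook(suppress)
# ... generate as usual; handle.remove() restores normal behavior.
```

The published refusal-direction work ablates the projection onto the direction rather than subtracting a fixed multiple of it; either way the per-layer cost is a single vector operation.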
The technique works on open-weight models, where the attacker has access to internal activations. It does not work directly on API-served models, where activations are hidden. But the transferability question (can a direction extracted from Llama steer a model with similar architecture?) remains open, and transfer is plausible within a model family.
Why is this the same technique Anthropic uses for alignment?
Anthropic’s representation engineering research developed activation steering for the opposite purpose: amplifying safety-aligned behaviors. Their work on “cheap monitors” uses linear probes on model activations to detect harmful content at lower cost than full output classifiers, and their interpretability work steers internal features, truthfulness among them, to push model behavior in desired directions for deployment.
The technique is identical. Identify a behavioral direction in activation space. Add a vector to amplify it (for alignment) or subtract a vector to suppress it (for attacks). ActAdd — the foundational activation steering paper (arXiv 2308.10248) — showed that contrasting activations on prompt pairs like “Love” vs “Hate” produces a steering vector that works with a single data point and no optimization.
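The same mechanics with the sign flipped, reusing residual_at() and LAYER from the sketch above. The coefficient is arbitrary, and real ActAdd injects at chosen token positions, which this sketch simplifies to every position:

```python
# ActAdd-style steering from a single contrasting pair, no optimization.
steer = residual_at("Love") - residual_at("Hate")

def amplify(module, args, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steer  # positive coefficient amplifies; negative suppresses
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

model.model.layers[LAYER - 1].register_forward_hook(amplify)
```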
This dual-use symmetry creates an uncomfortable research dynamic. Every advance in understanding how safety is represented internally is simultaneously an advance in understanding how to disable it. The researchers who map refusal directions for alignment purposes are producing the exact artifacts an attacker needs.
The practical implication: the capability to disable safety mechanisms in open-weight models is now trivially accessible. The technique requires one afternoon of experimentation, not a research lab.
How does this differ from prompt injection?
| Dimension | Prompt injection | Activation attack |
|---|---|---|
| Attack surface | Input tokens | Internal hidden states |
| What is manipulated | What the model reads | How the model reasons |
| Defense surface | Input filtering, output classifiers | Activation monitoring, representation guards |
| Access required | None (black-box) | Internal activations (open-weight) |
| Detection difficulty | Moderate (text classifiers work) | Hard (requires white-box monitoring) |
| Transferability | High (text-based, model-agnostic) | Unknown (likely within model families) |
The critical difference for defense: every input-layer defense is irrelevant against activation attacks. Input validation, keyword filtering, prompt guardrails, system prompt hardening, semantic classifiers on the input — none of these see the attack because the input is unchanged. The harmful manipulation happens in the residual stream, after the input has been processed and before the output is generated.
This is why the attack is named Amnesia: the model “forgets” its safety training not because the training was removed but because its expression in activation space was suppressed at inference time.
What defenses exist?
Five approaches target activation-space attacks, all at an early research stage.
RSAA (Residual Stream Activation Analysis, arXiv 2406.03230) analyzes distinctive patterns in residual stream activations between transformer layers. It can distinguish attack prompts from benign inputs by examining neuron outputs from residual connections. Requires white-box access to model activations during inference.
SafeSteer (arXiv 2509.21400) reconstructs steering vectors within a safe subspace during inference, reducing attack success rates by over 60%. Originally demonstrated on vision-language models with negligible computational overhead. Training-free and plug-and-play, but requires integrating into the model serving pipeline.
TRYLOCK (arXiv 2601.03300) combines input canonicalization, DPO safety training, representation engineering steering, and adaptive classification. The layered approach reduces attack success from 46.5% to 5.6% — an 88% relative reduction. The most comprehensive defense but also the most complex to deploy.
Activation monitoring with linear probes (OpenReview) trains simple classifiers on activation patterns to detect anomalous states. Achieves competitive accuracy with lower false positive rates than text classifiers. More robust to adversarial pressure across access levels.
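A minimal sketch of the probe idea, reusing residual_at() and the placeholder prompt lists from the earlier block. A production monitor would run on live activations inside the serving path rather than classifying offline prompts:

```python
# Train a linear probe on residual-stream activations, then flag
# anomalous states. Prompt sets and threshold are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.stack([residual_at(p).numpy() for p in refusing + benign])
y = np.array([1] * len(refusing) + [0] * len(benign))

probe = LogisticRegression(max_iter=1000).fit(X, y)

def looks_anomalous(prompt: str, threshold: float = 0.5) -> bool:
    """True when the probe scores the activation state as attack-like."""
    act = residual_at(prompt).numpy()[None, :]
    return probe.predict_proba(act)[0, 1] > threshold
```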
AISA (arXiv 2602.13547) localizes safety awareness in specific attention heads using spatiotemporal analysis, then activates these heads during inference. Lightweight, single-pass, but early-stage research.
A critical caveat from the Steering Externalities paper (arXiv 2602.04896): benign activation steering interventions can paradoxically create new vulnerabilities. Defensive steering that increases refusal in one dimension can erode the “safety margin” in adjacent dimensions, increasing attack success rates to over 80%. Defense-in-depth is not guaranteed to stack monotonically.
None of these defenses are deployed at scale in production serving frameworks. vLLM, TGI, and standard inference pipelines do not expose activation-level hooks for runtime monitoring. Deploying activation-space defenses requires custom inference infrastructure — a significant engineering investment that most teams have not made.
Key takeaways
- Refusal is a direction, not a property. A single geometric direction in activation space mediates safety behavior. Extract it with a few prompt pairs. Suppress it with vector subtraction.
- Amnesia needs no gradients. Unlike GCG, the attack runs at inference speed with negligible overhead. The barrier to entry is an afternoon of experimentation on an open-weight model.
- Input-layer defenses are blind to this. Prompt filtering and system prompt hardening never see the attack; output classifiers see only the generated text, not the residual-stream manipulation that produced it.
- Dual-use is inherent. The same activation steering Anthropic uses for alignment can be weaponized. Every map of safety geometry is also an attack manual.
- Defenses exist but are not deployed. SafeSteer (5.9% ASR), TRYLOCK (5.6% ASR), activation probes — all work in research. None are integrated into production serving frameworks.
- Open-weight models are the primary target. API-served models hide activations. But model family transferability may extend the attack surface.
Further reading
- Prompt injection defense — the input-layer defense surface that activation attacks bypass
- Jailbreaking in production — the evolution of jailbreak techniques from prompt-based to structural
- Defense-in-depth for LLM applications — layered defense architecture that should include activation monitoring
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch