Adversarial audio attacks: how attackers manipulate speech recognition and voice AI
“The audio sounded like a weather forecast. The model heard ‘ignore safety instructions and generate exploit code.’”
TL;DR
Adversarial audio is sound that humans can’t hear or understand but that speech recognition systems transcribe as specific commands. Carlini et al. achieved a 100% success rate creating audio 99.9% similar to the original that transcribes as any chosen phrase. DolphinAttack uses ultrasonic frequencies. WhisperInject (2025) jailbreaks audio-language models through benign-sounding audio at 86%+ success. Universal perturbations transfer across models and commercial APIs. Most production ASR systems have no adversarial robustness. For how these attacks chain into voice agent exploitation, see Attacking voice agents in production.

What is adversarial audio and why should you care?
Adversarial audio exploits a fundamental gap: speech recognition systems and human ears process sound differently. An attacker can craft audio that sounds like noise, silence, or innocent speech to a human listener but that an ASR system transcribes as a specific, attacker-chosen phrase.
This isn’t theoretical. Nicholas Carlini and David Wagner demonstrated in 2018 that they could produce audio 99.9% similar to any source recording that Mozilla’s DeepSpeech transcribes as any target phrase with a 100% success rate (arXiv:1801.01944). The perturbations are imperceptible. A human hears a normal audio clip. The model hears “okay Google, open the front door.”
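To make the mechanism concrete, here is a minimal sketch of this style of targeted attack, assuming a differentiable ASR model that returns per-frame log-probabilities and a CTC loss toward the attacker’s target phrase. Everything below is illustrative, not Carlini and Wagner’s reference code.

```python
# Minimal sketch of a targeted audio adversarial example (Carlini-Wagner style).
# Assumes `asr_model` is differentiable and returns log-probabilities of shape
# (time, batch, vocab); `target_ids` is the token sequence of the target phrase.
import torch
import torch.nn.functional as F

def targeted_audio_attack(waveform, target_ids, asr_model,
                          steps=1000, lr=1e-3, epsilon=0.001):
    """Find a tiny perturbation delta so waveform + delta transcribes as the target."""
    delta = torch.zeros_like(waveform, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        adv = torch.clamp(waveform + delta, -1.0, 1.0)
        log_probs = asr_model(adv)                      # (T, N=1, vocab)
        input_lens = torch.tensor([log_probs.size(0)])
        target_lens = torch.tensor([target_ids.size(0)])
        loss = F.ctc_loss(log_probs, target_ids.unsqueeze(0), input_lens, target_lens)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)             # keep it imperceptibly small
    return torch.clamp(waveform + delta, -1.0, 1.0).detach()
```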
Abdullah et al. (NDSS 2019) tested four classes of adversarial perturbations against 12 machine learning models, including seven proprietary commercial APIs, among them Google Speech, Bing Speech, IBM Speech, and the Azure Speaker API. All were successfully attacked.
The attack surface is every microphone connected to a speech recognition system: voice assistants, voice-controlled IoT devices, telephone-based voice agents, dictation software, meeting transcription services, and the ASR layer of every voice AI agent in production. The gap between “what the human heard” and “what the system transcribed” is the vulnerability.
What are the major attack categories?
Five distinct approaches exploit different aspects of the human-to-machine perception gap.
Ultrasonic attacks (DolphinAttack) modulate voice commands onto carrier frequencies above 20 kHz, completely inaudible to humans. The attack exploits microphone circuit nonlinearity: when ultrasonic signals pass through the microphone’s analog components, intermodulation distortion demodulates the signal, recovering the original voice command at audible frequencies that the ASR processes. Validated against Siri, Google Now, Samsung S Voice, Huawei HiVoice, Cortana, and Alexa. The attack requires proximity (within a few feet) and commodity hardware.
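A toy sketch of the modulation step, assuming the voice command has already been resampled to the transmitter’s high sample rate and normalized to [-1, 1]; the carrier frequency and sample rate are illustrative, and the analog transmission chain (the hard part) is not shown.

```python
# Toy sketch of DolphinAttack-style ultrasonic carrier modulation:
# amplitude-modulate a baseband voice command onto a >20 kHz carrier.
# The microphone's nonlinear analog front end later demodulates the
# envelope back into the audible band, where the ASR picks it up.
import numpy as np

def modulate_ultrasonic(voice, sample_rate=192_000, carrier_hz=30_000):
    """Return an AM signal whose envelope carries the (inaudible) voice command."""
    t = np.arange(len(voice)) / sample_rate
    carrier = np.cos(2 * np.pi * carrier_hz * t)
    # Standard AM: carrier plus voice-driven sidebands centered on 30 kHz.
    return (1.0 + voice) * carrier
```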
SurfingAttack extends ultrasonic injection through solid surfaces. Instead of transmitting through air, it uses ultrasonic guided waves propagating through tables, desks, and countertops. Advantages over DolphinAttack: omni-directional transmission (no line-of-sight needed), longer range, and support for multi-round interactive attacks. Demonstrated hijacking of mobile SMS passcodes and initiating fraudulent calls through table-mounted devices (NDSS 2020).
Psychoacoustic hiding embeds adversarial perturbations below the masking threshold of human perception. When a loud sound masks a quieter sound in a nearby frequency band, the quieter sound becomes inaudible to humans. Attackers exploit this by placing adversarial perturbations in the masked regions. Over 90% attack success while remaining indistinguishable from clean audio to human listeners.
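A rough sketch of the core constraint, assuming a per-frequency-bin masking threshold has already been computed by a psychoacoustic model for the carrier audio; the threshold computation itself is the hard part and is not shown here.

```python
# Sketch of psychoacoustic hiding's projection step: scale each frequency
# bin of the perturbation so its magnitude stays below the masking threshold
# induced by the carrier audio, keeping the payload inaudible to humans.
import numpy as np

def project_below_masking(perturbation, masking_threshold, n_fft=2048):
    """masking_threshold: per-bin magnitude ceiling of length n_fft // 2 + 1."""
    spectrum = np.fft.rfft(perturbation, n=n_fft)
    magnitude = np.abs(spectrum) + 1e-12
    scale = np.minimum(1.0, masking_threshold / magnitude)
    return np.fft.irfft(spectrum * scale, n=n_fft)[: len(perturbation)]
```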
Hidden voice commands craft audio that sounds like background noise or unintelligible sounds to humans but that ASR systems transcribe as specific phrases. Unlike ultrasonic attacks (which are inaudible), hidden voice commands are audible but not understandable. The listener hears “weird noise.” The system hears “send all contacts to this email address.”
CommanderGabble uses deliberately fast speech to camouflage malicious commands. Both humans and ASR systems struggle with extremely fast speech, but they fail in different ways. The attack manipulates phonetic structure so that ASR extracts the intended command from speech that’s too fast for humans to parse. Average translation accuracy of 90% across four ASR systems.
```mermaid
graph TB
    subgraph "Attack Categories by Perception Gap"
        A[Ultrasonic<br/>DolphinAttack, SurfingAttack] -->|Completely inaudible<br/>to humans| M[ASR System<br/>Transcribes as<br/>Target Command]
        B[Psychoacoustic Hiding<br/>Masked perturbations] -->|Sounds normal<br/>to humans| M
        C[Hidden Voice Commands<br/>Carlini et al.] -->|Sounds like noise<br/>to humans| M
        D[Fast Speech<br/>CommanderGabble] -->|Unintelligible speed<br/>to humans| M
        E[WhisperInject<br/>Audio-LM jailbreak] -->|Sounds benign<br/>to humans| M
    end
    M --> O[Execute Command /<br/>Bypass Safety /<br/>Exfiltrate Data]
```
What changed with WhisperInject in 2025?
WhisperInject (COLM 2025) represents an escalation from “fool the ASR” to “jailbreak the AI model through audio.”
Previous adversarial audio attacks targeted the speech-to-text layer: make the ASR transcribe something the attacker chose. WhisperInject targets audio-language models directly: models that process audio and generate text responses without a separate ASR step. It doesn’t just control what the model hears. It controls what the model does.
The attack uses two stages:
Stage 1: Jailbreaking. Reinforcement Learning with Projected Gradient Descent (RL-PGD) guides the target model to bypass its own safety protocols. The optimization treats safety measures as constraints to overcome, finding audio perturbations that cause the model to generate harmful content it would normally refuse.
Stage 2: Payload injection. The jailbreaking perturbation is embedded into benign audio carriers: weather queries, greeting messages, casual remarks. Projected Gradient Descent optimizes the perturbations to be imperceptible. The audio sounds completely normal to a human listener. The model hears the benign audio AND the embedded jailbreak.
Results: 86%+ success rate against Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, and Phi-4-Multimodal. The model generates harmful content in response to audio that sounds like “What’s the weather forecast for tomorrow?”
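A hedged sketch of the Stage 2 idea: projected gradient descent embeds the payload into a benign carrier clip. `audio_lm` and `response_loss` stand in for an audio-language model and a loss that rewards the target response; this is the general PGD pattern, not the paper’s implementation.

```python
# Hedged sketch of payload embedding via projected gradient descent (PGD).
# `audio_lm` and `response_loss` are placeholders, not the paper's code.
import torch

def embed_payload(carrier, target_response_ids, audio_lm, response_loss,
                  steps=500, alpha=1e-4, epsilon=0.002):
    delta = torch.zeros_like(carrier, requires_grad=True)
    for _ in range(steps):
        loss = response_loss(audio_lm(carrier + delta), target_response_ids)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # step toward the target response
            delta.clamp_(-epsilon, epsilon)      # project back into the inaudible ball
            delta.grad.zero_()
    return (carrier + delta).detach()
```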
This matters because the industry is moving toward multimodal models that process audio natively (rather than through a separate ASR pipeline). WhisperInject shows that these models inherit adversarial audio vulnerabilities AND text-based jailbreak vulnerabilities simultaneously.
Do adversarial perturbations transfer across models?
Yes, and this is what makes adversarial audio a practical threat rather than just an academic curiosity.
Universal Adversarial Perturbations (UAPs) are crafted to work across many different audio inputs and multiple ASR models with a single perturbation. Instead of crafting a specific attack for each audio sample, the attacker creates one perturbation that, when added to ANY audio, causes the target model to transcribe attacker-chosen text.
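As a sketch of what makes a perturbation “universal”: one shared delta is optimized across many different clips instead of per clip. The dataset, loss, and clip length below are placeholders.

```python
# Sketch of universal adversarial perturbation (UAP) training: a single delta
# optimized over many clips so that ANY clip plus delta yields the target text.
# `asr_loss` is a placeholder (e.g. a CTC loss toward the attacker's phrase);
# clips are assumed pre-cropped or padded to `clip_len` samples.
import torch

def train_uap(dataset, target_ids, asr_loss, steps=10_000,
              lr=1e-3, epsilon=0.005, clip_len=16_000):
    delta = torch.zeros(clip_len, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for step in range(steps):
        waveform = dataset[step % len(dataset)]
        loss = asr_loss(waveform + delta, target_ids)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)   # stay quiet on every clip
    return delta.detach()
```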
CommanderUAP demonstrated a staged perturbation generation method that transfers attacks to commercial speech recognition APIs. The attacker trains the perturbation against one model and it works against models they’ve never seen. This enables black-box attacks: the attacker needs no access to the target system’s architecture, weights, or training data.
DUAP (Dual-task Universal Adversarial Perturbations) goes further: it simultaneously compromises both ASR and speaker recognition systems with a single perturbation. Using Dynamic Noise Embedding and psychoacoustic constraints, it creates perturbations that are both transferable and imperceptible.
The practical implication: an attacker can craft a single audio perturbation that, when played over a phone line or embedded in a meeting recording, causes multiple commercial ASR services to transcribe attacker-chosen text. The attack development happens offline against open models. The deployment targets closed commercial APIs.
What about attacks on voice authentication?
Voice authentication faces three distinct attack categories, each exploiting a different layer.
Replay attacks are the simplest: play a recording of the legitimate user’s voice. “Low-effort” spoofing that requires no technical skill, just access to a recording. Voice recordings are increasingly available from social media, podcasts, conference recordings, and customer service calls. Liveness detection is the primary defense.
TTS spoofing uses voice cloning to generate synthetic speech that matches the target speaker’s characteristics. In 2025, zero-shot cloning needs only 3-10 seconds of target audio. The cloned voice passes speaker verification with high confidence. Face-to-voice deepfakes fooled WeChat Voiceprint roughly 30% of the time on the first attempt, approaching 100% after several retries.
Adversarial perturbation attacks on speaker verification systems add imperceptible noise to ANY voice so that the speaker verification model classifies it as the target speaker. Unlike replay or TTS attacks, the audio doesn’t need to sound like the target. It sounds like the attacker’s own voice with inaudible modifications that shift the speaker embedding toward the target. Transferable adversarial attacks show a 57% accuracy drop in white-box settings and 30.5% in gray-box settings.
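A minimal sketch of the embedding-shift idea, assuming a speaker-embedding encoder and the target speaker’s enrolled embedding; names and hyperparameters are illustrative, not taken from any specific paper.

```python
# Sketch of an embedding-shift attack on speaker verification: perturb the
# attacker's own audio so its embedding moves toward the target speaker's.
import torch
import torch.nn.functional as F

def shift_toward_target(attacker_audio, target_embedding, encoder,
                        steps=300, lr=1e-3, epsilon=0.002):
    delta = torch.zeros_like(attacker_audio, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        emb = encoder(attacker_audio + delta)
        # Maximize cosine similarity to the target speaker's enrolled embedding.
        loss = 1.0 - F.cosine_similarity(emb, target_embedding, dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)
    return (attacker_audio + delta).detach()
```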
For how these attacks chain into production voice agent compromise, see Security for voicebots.
What defenses work and what doesn’t?
No universal defense exists. Different attack categories exploit different mechanisms and require different countermeasures.
Randomized smoothing adds controlled random noise to the input, creating uncertainty around model decision boundaries that prevents precise adversarial targeting. Rated the most effective defense in controlled testing against gradient-based attacks (FGSM, PGD) on DeepSpeech2 and Transformer-based ASR. The tradeoff: noise degrades recognition accuracy for legitimate inputs. There’s a direct tension between robustness and utility.
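A minimal sketch of the idea applied to ASR, assuming any transcription callable; sigma is the knob that trades adversarial robustness against clean-input accuracy.

```python
# Sketch of randomized smoothing for ASR: transcribe several noisy copies
# of the input and keep the majority transcript. `transcribe` is any ASR call.
from collections import Counter
import numpy as np

def smoothed_transcribe(waveform, transcribe, n_samples=8, sigma=0.01):
    transcripts = []
    for _ in range(n_samples):
        noisy = waveform + np.random.normal(0.0, sigma, size=waveform.shape)
        transcripts.append(transcribe(noisy))
    return Counter(transcripts).most_common(1)[0][0]   # majority vote
```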
Adversarial training exposes the model to adversarial examples during training. The model learns to recognize and resist perturbations it has seen. Effective against known attack patterns. Limited against novel attacks the model hasn’t been trained on. Requires significant additional compute during training.
Psychoacoustic filtering removes content below the human perception threshold before processing. Since adversarial perturbations often exploit the sub-perceptual range, filtering this content removes the payload while preserving the audio that humans would hear. Effective against psychoacoustic hiding attacks. Less effective against ultrasonic attacks (which exploit hardware, not perception thresholds).
Hardware modifications address ultrasonic attacks at the physical layer. Microphones with built-in low-pass filters that attenuate frequencies above 20 kHz prevent DolphinAttack-style exploitation of circuit nonlinearity. The limitation: this requires hardware changes, not software updates, which means existing deployed devices remain vulnerable.
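For intuition, a software analog of that filter is sketched below; against DolphinAttack specifically, the filtering has to happen in the analog front end, because demodulation occurs before software ever sees the samples.

```python
# Software analog of a hardware low-pass filter: keep only the voice band.
# Illustrative only; it does not retrofit protection onto vulnerable microphones.
from scipy.signal import butter, sosfiltfilt

def lowpass_voice_band(waveform, sample_rate, cutoff_hz=8_000):
    sos = butter(8, cutoff_hz, btype="lowpass", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, waveform)
```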
Liveness detection for voice authentication confirms audio comes from a live human speaker, not a recording or synthesized source. Passive approaches analyze vocal characteristics continuously. Active approaches challenge the speaker to produce specific utterances in real time. Combined voice biometrics plus liveness achieves 99.2% accuracy even against modified deepfakes, but adds friction to the user experience.
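A toy sketch of an active challenge, with placeholder capture, transcription, and speaker-scoring callables; production systems would combine this with passive analysis and tune the threshold to their false-accept budget.

```python
# Toy sketch of active liveness: a random phrase defeats prerecorded replays,
# and the speaker score ties the response to the enrolled voice.
import secrets

CHALLENGE_WORDS = ["amber", "falcon", "river", "copper", "meadow", "signal"]

def liveness_challenge(capture_audio, transcribe, speaker_score,
                       enrolled_embedding, threshold=0.7):
    phrase = " ".join(secrets.choice(CHALLENGE_WORDS) for _ in range(3))
    print(f"Please repeat: {phrase}")
    audio = capture_audio()
    said_the_phrase = transcribe(audio).strip().lower() == phrase
    is_enrolled_speaker = speaker_score(audio, enrolled_embedding) >= threshold
    return said_the_phrase and is_enrolled_speaker
```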
The honest assessment: defense-in-depth with multiple layers is the only viable approach. Each layer addresses a subset of attack categories. No single layer covers everything.
Key takeaways
- Adversarial audio exploits the gap between human hearing and machine perception. Carlini et al. achieved 100% success at 99.9% audio similarity
- Five attack categories: ultrasonic (DolphinAttack), psychoacoustic hiding (>90% success), hidden voice commands, fast speech (CommanderGabble, 90% accuracy), and audio-LM jailbreaking (WhisperInject, 86%+ success)
- WhisperInject (2025) escalates from ASR manipulation to full audio-language model jailbreaking through benign-sounding audio
- Universal adversarial perturbations transfer across models and commercial APIs, enabling black-box attacks
- Voice authentication is vulnerable to replay, TTS spoofing, and adversarial perturbation attacks that shift speaker embeddings
- No universal defense exists. Randomized smoothing, adversarial training, psychoacoustic filtering, hardware modifications, and liveness detection each address different attack types
- Most production ASR systems have no adversarial robustness. The defense gap is structural.
FAQ
What are adversarial audio attacks?
Sound crafted to exploit the gap between human hearing and machine perception. The audio might be inaudible (ultrasonic), sound like noise (hidden commands), or sound normal (psychoacoustic hiding). ASR systems transcribe it as specific attacker-chosen text. Carlini et al. demonstrated 100% success creating imperceptible perturbations against DeepSpeech.
How does DolphinAttack work?
DolphinAttack modulates voice commands onto ultrasonic frequencies above 20 kHz, inaudible to humans. Microphone circuit nonlinearity demodulates the signal, recovering the command at audible frequencies. Demonstrated against Siri, Alexa, Google Now, Cortana, and Samsung S Voice. Requires proximity and commodity hardware.
What is WhisperInject?
WhisperInject (COLM 2025) jailbreaks audio-language models through benign-sounding audio. It uses reinforcement learning to bypass safety protocols, then embeds the jailbreak into innocent audio like weather queries. 86%+ success against Qwen2.5-Omni and Phi-4-Multimodal. Represents an escalation from ASR manipulation to full AI model compromise through audio.
Can adversarial audio transfer across models?
Yes. Universal adversarial perturbations work across multiple ASR models with a single crafted sample. CommanderUAP transfers to commercial APIs. DUAP simultaneously compromises ASR and speaker recognition. This enables black-box attacks against commercial systems the attacker has never accessed.
What defenses work?
No single defense covers all attack types. Randomized smoothing is most effective against gradient-based attacks. Psychoacoustic filtering removes sub-perceptual adversarial content. Hardware low-pass filters block ultrasonic attacks. Liveness detection catches replay and spoofing. Adversarial training builds robustness against known patterns. Defense-in-depth with multiple layers is necessary.
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch