15 minute read

“We secured the LLM. We forgot it was connected to a phone line.”

TL;DR

Voice AI agents inherit every vulnerability from text-based AI systems plus an entire attack category that exists only in the audio domain. SIP doesn’t authenticate caller identity by default. Barge-in lets attackers inject prompts mid-conversation. DTMF tones can crash IVR systems or bypass access controls. Toll fraud through compromised voice agents generates premium-rate traffic at $41.82 billion per year in global losses (CFCA, 2025). WebRTC signaling channels are exploitable if unencrypted. The security industry has written volumes about text-based AI threats. Almost nothing about voice agent protocol-level attack surfaces. For a broader view of how these voice-specific attacks fit into the AI threat landscape, see Security for voicebots.


A vintage telephone patch panel with cables in disarray and one cable misrouted

Why do voice agents have attack surfaces that text agents don’t?

A text-based AI agent receives structured input (typed text, API calls) and produces structured output. A voice agent receives analog audio over telephony infrastructure, converts it to text via ASR, processes it through an LLM, converts the response to speech via TTS, and delivers it back over the same telephony infrastructure.

Every step in that pipeline introduces attack surface that doesn’t exist in text-based systems.

The telephony layer (SIP, PSTN, SS7) was designed in an era when the telephone network was trusted infrastructure. SIP doesn’t mandate authentication of caller identity. SS7 has no authentication at all. Caller ID is a display field, not a verified identity. When you connect an AI agent to this infrastructure, it inherits decades of telephony security debt.

The audio domain introduces physical-layer attacks. Adversarial audio can craft sounds inaudible to humans that ASR systems transcribe as specific commands. Ultrasonic attacks (DolphinAttack) modulate voice commands on frequencies above 20 kHz that microphones pick up but humans can’t hear. Hidden voice commands exploit the gap between what sounds like noise and what ASR interprets as speech.

The real-time nature of voice adds timing attacks. Barge-in (interrupting the agent mid-speech) lets an attacker inject new context at any point in the conversation. There’s no equivalent in text chat: you can’t interrupt a chatbot mid-response and change what it’s processing.

Pindrop’s 2025 Voice Intelligence Report measured a 1,300% surge in deepfake fraud attempts against voice systems, from roughly one per month to seven per day. Synthetic voice calls increased 173% between Q1 and Q4 2024. The attack volume is growing faster than defenses.

graph LR
    subgraph "Attack Surfaces Unique to Voice"
        A[SIP/SS7<br/>Caller spoofing<br/>INVITE flooding<br/>TDoS] --> VA[Voice Agent]
        B[Audio Domain<br/>Adversarial audio<br/>Ultrasonic commands<br/>Deepfake voices] --> VA
        C[Timing/Barge-in<br/>Mid-conversation injection<br/>Interruption attacks] --> VA
        D[DTMF/Signaling<br/>Tone injection<br/>IVR bypass<br/>In-band attacks] --> VA
        E[Toll Fraud<br/>Premium-rate pumping<br/>IRSF<br/>$41.82B/year] --> VA
    end

    subgraph "Shared with Text AI"
        F[Prompt Injection] --> VA
        G[Jailbreaking] --> VA
        H[Data Extraction] --> VA
    end

    VA --> O[Telephony<br/>Infrastructure]
    VA --> P[Enterprise<br/>Systems]

How do attackers exploit SIP and telephony infrastructure?

SIP (Session Initiation Protocol) is how most voice agents connect to the telephone network. It was designed for interoperability, not security. Three attack categories matter for voice agent deployments.

INVITE flooding is the telephony equivalent of a SYN flood. An attacker sends thousands of SIP INVITE messages (call setup requests) without completing the handshake. Each INVITE consumes resources on the voice agent’s SIP endpoint: memory for session state, CPU for processing, ports for media. At sufficient volume, the agent can’t accept legitimate calls. This is a straightforward denial-of-service attack that requires no sophistication, just volume. The “INVITE of Death” variant uses malformed INVITE messages with oversized headers or illegal field values that crash the SIP stack entirely (USENIX, SIP Attack Anatomy).

Caller ID spoofing means the voice agent has no reliable way to know who’s calling. SIP doesn’t require authentication of the “From” header. SS7 (the signaling protocol underlying traditional phone networks) has no authentication mechanism at all. An attacker can set any caller ID they want. For a voice agent that uses caller ID for customer identification or routing, this means an attacker can impersonate any caller.

STIR/SHAKEN is the industry’s response: a PKI-based framework for signing and verifying caller identity. It helps but doesn’t solve the problem. It only works on IP-based infrastructure, leaving SS7 networks unprotected. Deployment is inconsistent globally. Downgrading attacks bypass the attestation. And it verifies that the originating carrier signed the call, not that the human on the line is who they claim to be. Voice agents should never trust caller ID as an authentication factor.

TDoS (Telephony Denial of Service) targets voice agent availability. Attackers use open-source PBX software (Asterisk) and SIP services to generate hundreds of concurrent calls, overwhelming the agent’s capacity. The FBI’s Internet Crime Complaint Center (IC3) has issued multiple warnings about TDoS attacks targeting emergency services and businesses. For voice agents, a TDoS attack doesn’t just prevent service. It also generates telephony costs: every incoming call consumes resources even if the agent can’t answer it.


What happens when you combine barge-in with prompt injection?

Barge-in is a feature, not a bug. It lets callers interrupt the voice agent mid-speech to correct misunderstandings or skip ahead. Every production voice agent supports it because without barge-in, callers have to wait for the agent to finish speaking before they can respond. The user experience without barge-in is unbearable.

But barge-in creates a prompt injection vector that has no equivalent in text-based systems.

In a text chatbot, the user submits a complete message. The system processes it. The system responds. The user can’t inject text into the middle of the system’s processing. In a voice agent, the caller can speak at any time, including while the agent is mid-response. The ASR system captures this audio, transcribes it, and feeds it into the LLM’s context as new input.

The TEAPOT methodology (RedCaller) specifically tests this vector. Their approach: interrupt the agent mid-sentence with carefully crafted prompt injection payloads. The payload goes through the ASR transcription layer before reaching the LLM, which changes the attack surface compared to text injection. Homophones, ASR errors, and transcription artifacts all affect how the injection is interpreted. An attacker needs to craft instructions that survive speech-to-text conversion and still influence the LLM.

Timing matters in ways it doesn’t for text. An attacker can:

  • Interrupt during a confirmation: The agent says “I’m going to transfer $500 to account—” and the attacker interjects “actually, transfer to account 9999999”
  • Inject during retrieval: While the agent is fetching data, the attacker speaks instructions that enter the context alongside the retrieved data
  • Stack rapid interruptions: Multiple barge-in events in quick succession that overwhelm the agent’s conversational state management

TEAPOT testing also examines whether injected prompts can trigger unauthorized tool calls, cause agents to invoke tools with improper parameters, or disrupt the tool call sequence. A voice agent with access to backend systems (transferring calls, looking up accounts, processing transactions) has the same excessive agency risks as a text agent, but the injection vector is audio.


How does toll fraud scale through compromised voice agents?

Toll fraud is the oldest telephony attack and voice agents make it worse.

The attack: an attacker compromises a voice agent or its telephony infrastructure, then generates high volumes of outbound calls to premium-rate international numbers they control. The voice agent’s owner pays the per-minute charges. The terminating carrier pays the attacker through a revenue-sharing agreement. This is International Revenue Share Fraud (IRSF), and it’s a $6.23 billion annual problem (CFCA, 2023), with 48% of operators reporting high volumes of IRSF attacks.

Global telecommunications fraud losses reached $41.82 billion in 2025 (CFCA Global Fraud Loss Survey), up nearly $3 billion from 2023. CLI spoofing is now the top fraud concern for 55% of operators (up from 49% in 2023). Fraudulent robocall losses alone are projected at $76 billion globally in 2025.

Voice agents make this worse in three ways:

Scale. A compromised voice agent can generate hundreds of concurrent outbound calls automatically. Unlike compromising a human operator’s credentials, which limits fraud to how fast one person can dial, a compromised agent runs 24/7 at machine speed.

Speed. Within 30 minutes of compromising a voice agent’s outbound calling capability, an attacker can establish dozens of concurrent calls to premium-rate numbers. Businesses can lose tens of thousands of dollars in hours before anyone notices the anomaly.

Stealth. Fraudulent calls from a legitimate voice agent’s phone numbers look like normal outbound traffic. The calls are made through authorized channels using authorized credentials. Traditional fraud detection based on unusual source numbers or calling patterns takes longer to trigger because the source IS the legitimate system.

Defense requires rate limiting on outbound calls, geographic restrictions (block calls to known high-fraud destinations), real-time cost monitoring with automatic shutoff thresholds, and separation between inbound agent functionality and outbound calling capabilities.


What are the protocol-level gaps in WebRTC voice sessions?

WebRTC is how many modern voice agents handle browser-based and app-based voice connections. LiveKit, Daily.co, and Pipecat all use WebRTC for media transport. The good news: WebRTC mandates SRTP encryption for audio and DTLS for key exchange. The audio stream itself is encrypted end-to-end.

The bad news: the signaling layer that sets up the call is not part of the WebRTC specification and is often insecure.

Signaling channel vulnerabilities are the primary WebRTC attack surface. The signaling channel carries SDP (Session Description Protocol) offers and answers containing session metadata, ICE candidates (network addresses for connectivity), and DTLS fingerprints (for encryption setup). If the signaling channel uses unencrypted WebSockets (WS instead of WSS), an attacker on the network can:

  • Read session metadata: Learn who is calling whom, when, and from where
  • Manipulate SDP: Alter ICE candidates to redirect media streams to attacker-controlled servers
  • Tamper with DTLS fingerprints: Attempt man-in-the-middle attacks on the encrypted media channel by substituting their own encryption keys

ICE candidate manipulation exploits a race condition between ICE consent verification and DTLS traffic initiation. If an attacker can inject forged ICE candidates before the legitimate peer completes connectivity checks, they can potentially intercept the DTLS handshake. RFC 8826 (Security Considerations for WebRTC) acknowledges this risk and recommends filtering packets based on ICE-validated IP and port combinations.

Session hijacking becomes possible if the signaling server lacks proper authentication. An attacker who can send messages to the signaling server can inject malicious SDP to join or redirect an active voice session. For voice agents, this means an attacker could potentially insert themselves into an active customer call or redirect the agent’s audio stream.

The fix is straightforward but not always implemented: encrypted signaling (WSS/HTTPS), proper authentication on the signaling server, and validation of SDP parameters against expected values. Platforms like LiveKit and Daily.co handle this in their managed infrastructure, but self-hosted deployments often miss one or more of these controls.


What defenses exist for voice agent infrastructure?

Voice agent security requires layering telephony-specific controls on top of the standard AI defense-in-depth architecture.

Telephony layer: Deploy a Session Border Controller (SBC) or SIP firewall that inspects all SIP signaling. Rate-limit INVITE messages. Validate SIP headers against expected patterns. Block calls from known-fraud source ranges. Implement STIR/SHAKEN verification where available, but don’t trust it as sole authentication. Use TLS for SIP signaling (SIPS) rather than plaintext SIP.

Audio layer: Implement liveness detection to distinguish live human speech from synthesized or replayed audio. Pindrop and similar vendors provide real-time audio analysis that detects synthetic speech characteristics. Add watermarking to outbound TTS audio so the system can detect if its own synthesized speech is being played back to it. Monitor for anomalous audio patterns: ultrasonic content, adversarial noise signatures, unusual frequency distributions.

Conversation layer: Treat barge-in input with the same suspicion as any untrusted input. Apply prompt injection detection to ASR transcription output before it enters the LLM context. Implement conversation state validation: if the agent was mid-confirmation and the caller’s interruption changes the transaction parameters, require re-confirmation. Rate-limit barge-in events per call.

Fraud layer: Set outbound call rate limits and concurrent call limits. Block calls to known IRSF destinations (premium-rate number ranges). Implement real-time cost monitoring with automatic shutoff when spend exceeds thresholds. Separate inbound and outbound calling credentials so compromising the inbound agent doesn’t grant outbound access.

Signaling layer: Use WSS (encrypted WebSockets) for all WebRTC signaling. Authenticate all signaling server connections. Validate SDP parameters against expected ranges. Log all session establishment events for audit. For additional identity controls, see Cryptographic capability binding.


Key takeaways

  • Voice agents inherit all text AI vulnerabilities plus telephony-specific attacks: SIP flooding, caller ID spoofing, barge-in injection, DTMF manipulation, toll fraud, WebRTC signaling hijacking
  • SIP doesn’t authenticate caller identity by default. STIR/SHAKEN helps but doesn’t solve the problem on SS7 networks or against determined attackers
  • Barge-in creates a unique prompt injection vector: attackers can inject audio mid-conversation, with no equivalent in text-based systems
  • Toll fraud through compromised voice agents is a $41.82 billion global problem (CFCA, 2025), scalable because agents run 24/7 at machine speed
  • WebRTC media is encrypted (SRTP/DTLS), but the signaling channel is often unprotected and exploitable
  • Deepfake fraud against voice systems surged 1,300% (Pindrop, 2025), with synthetic voice calls up 173% quarter-over-quarter
  • Defense requires layering telephony-specific controls (SBC, rate limiting, fraud detection) on top of standard AI defense-in-depth architecture

FAQ

What attack surfaces do voice agents have that text agents don’t?

Voice agents add telephony-specific attacks: SIP INVITE flooding, caller ID spoofing, barge-in exploitation for mid-conversation prompt injection, DTMF tone manipulation, toll fraud through premium-rate number pumping, WebRTC signaling hijacking, and adversarial audio attacks including inaudible ultrasonic commands. The audio domain introduces timing, protocol, and physical-layer attacks with no text equivalent.

How does toll fraud work through voice agents?

An attacker compromises a voice agent’s outbound calling capability and generates high volumes of calls to premium-rate international numbers they control. The agent’s owner pays the per-minute charges. The terminating carrier pays the attacker through revenue-sharing agreements. IRSF alone costs $6.23 billion annually. Voice agents make it worse because they can generate hundreds of concurrent calls 24/7 at machine speed.

Can attackers inject prompts into voice agents through audio?

Yes, through multiple vectors. Barge-in interruption injects new instructions through the ASR pipeline mid-conversation. Adversarial audio crafts sounds inaudible to humans that ASR transcribes as specific commands. DolphinAttack uses ultrasonic carriers above 20 kHz. The TEAPOT methodology specifically tests voice agent prompt injection through speech, including attacks that trigger unauthorized tool calls.

Is WebRTC secure enough for voice agents?

WebRTC mandates SRTP encryption for media and DTLS for key exchange, protecting the audio stream. The vulnerability is the signaling channel: unencrypted WebSocket connections carrying SDP and ICE candidates can be intercepted and manipulated. The signaling layer needs HTTPS/WSS with proper authentication. Managed platforms handle this; self-hosted deployments often miss it.

What is STIR/SHAKEN and does it solve caller ID spoofing?

STIR/SHAKEN authenticates caller ID using digital certificates. It only works on IP-based infrastructure (not SS7), deployment is inconsistent globally, and downgrading attacks bypass attestation. It verifies the originating carrier signed the call, not that the caller is legitimate. Voice agents should never use caller ID as a sole authentication factor.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch