Voice Agent Architecture
“Talking to machines: The end of the Keyboard.”
TL;DR
A voice agent is a six-stage pipeline: AudioIn -> VAD -> STT -> LLM -> TTS -> AudioOut. STT choices range from Whisper (most accurate, slow) to Deepgram Nova-2 (blazing fast at $0.0043/min), while TTS ranges from ElevenLabs (human-quality at $0.15/1k chars) to Cartesia Sonic (~100ms latency). The critical differentiator between a demo and a product is control logic: VAD silence thresholds determine when the agent speaks, barge-in handling with echo cancellation prevents feedback loops, and voice-specific prompting keeps responses conversational. Cost-wise, a 10-minute call ranges from $0.29 (economy stack) to $2.54 (premium stack). These components build on the real-time pipeline patterns covered previously, and are typically orchestrated using voice agent frameworks like LiveKit and Pipecat.

1. Introduction: The Universal Interface
Text is an artifact of computing history. Humans don’t text each other when they are in the same room; they talk. It is the highest bandwidth, lowest friction interface we have. Voice Agents represent the Holy Grail of HCI (Human-Computer Interaction): an interface that requires Zero Training, Zero Literacy, and Zero Hands.
However, building a Voice Agent is exponentially harder than a Chatbot.
- Chatbot: User sends text. Wait 5s. Response appears. The user can switch tabs. The interface is “Async”.
- Voicebot: User speaks. Silence. (1s… 2s…) “Why is it silent?” (3s…) “Hello?” The interface is “Sync”.
- Latency Intolerance: In voice, silence is awkward. It feels like the connection dropped. Research shows that gaps greater than 800ms significantly decrease “Trust” in the system. Gaps > 1500ms cause the user to think the agent is broken.
- Noise: Background chatter, heavy accents, cheap laptop microphones, wind noise.
- The “Double-Speak” Problem: The agent talks over the user, creating a cacophony.
In this deep dive, we will architect the full Voice Pipeline: AudioIn -> VAD -> STT -> LLM -> TTS -> AudioOut. We will explore the vendor landscape, the rigorous cost economics, and the critical control logic that separates a demo from a product.
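To make the latency budget concrete, here is a minimal sketch that sums the sequential stages of one voice turn. All per-stage numbers are illustrative assumptions, not vendor benchmarks:

```python
# Rough voice-turn latency budget (numbers are illustrative assumptions).
# The total is what the user perceives as "silence" before the agent replies.
STAGE_LATENCY_MS = {
    "vad_endpoint": 700,     # silence threshold before we decide the turn ended
    "stt_finalize": 150,     # streaming STT emits the final transcript
    "llm_first_token": 350,  # LLM time-to-first-token
    "tts_first_audio": 120,  # streaming TTS time-to-first-audio-chunk
}

def perceived_silence_ms(stages: dict) -> int:
    """First-response latency is additive: each stage must produce its first
    output before the next can start, even with streaming."""
    return sum(stages.values())

total = perceived_silence_ms(STAGE_LATENCY_MS)
print(total)  # 1320
```

Even with these optimistic numbers the total (1320ms) blows past the 800ms comfort budget, which is why the VAD threshold is usually the biggest tuning lever.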
2. Component 1: The Ears (STT / ASR)
Speech-to-Text (STT) or Automatic Speech Recognition (ASR). This is the foundation. If the agent mishears “Purchase” as “Perch”, the game is over. No amount of LLM intelligence can fix a bad transcript.
2.1 The Models Landscape: A Vendor Battle
There is no “One Model Fits All”. You typically choose between three categories.
A. The Open Source Standard: OpenAI Whisper
- Architecture: Transformer trained on 680k hours of weak supervision from the internet.
- Pros:
- Accuracy: Unbeatable on English. Even handles heavy accents and murmurs.
- Punctuation: It outputs grammatically correct text with commas and periods (Critical for the LLM to understand meaning).
- Price: Free (Self-hosted).
- Cons:
- Speed: It is slow. Standard `whisper-large-v3` takes ~500ms-1s to process 5 seconds of audio on an A100.
- Hallucinations: In silence, it sometimes hallucinates text like "Thanks for watching" (subtitle artifacts).
B. The Speed Demon: Deepgram Nova-2
- Architecture: Proprietary E2E Deep Learning.
- Pros:
- Latency: Blazing fast. Streaming latency is often ~300ms.
- Cost: Extremely cheap ($0.0043/min).
- Phone Optimization: It performs exceptionally well on low-fidelity (8 kHz) phone audio.
- Cons: Closed source. You are vendor-locked.
C. The Intelligence Layer: AssemblyAI / Rev
- Pros: Strong focus on “Understanding”.
- Speaker Diarization: “Speaker A said X, Speaker B said Y”. (Critical for meeting bots).
- Sentiment Analysis: Returns “Anger” metrics alongside text.
- PII Redaction: Automatically masks “SSN: 123-45…”.
2.2 Streaming vs Batch: The Latency Choice
- Batch: Record 10s of audio -> Upload -> Transcribe.
- Latency: 10s (Recording) + 2s (Upload/Process) = 12s. Unusable for live conversation.
- Streaming (WebSockets):
- Client opens WebSocket. Sends raw PCM bytes (chunks of 20ms).
- Server processes chunks incrementally.
- Server returns partial transcripts: `("He") -> ("Hell") -> ("Hello")`.
- Latency: < 500ms.
- Mechanism: Use Endpointing. The STT provider usually tells you "I think the sentence is done" via an `is_final=true` flag. This is your trigger to send the text to the LLM.
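A minimal consumer of streaming transcript events might look like this. The event shape with an `is_final` flag mirrors Deepgram-style APIs, but field names vary by vendor:

```python
def consume_transcripts(events):
    """Yield a completed utterance whenever the STT provider marks a
    transcript segment with is_final=True (the endpointing signal)."""
    for event in events:
        # Interim results ("He" -> "Hell" -> "Hello") supersede each other;
        # they are useful for live captions but must not trigger the LLM.
        if event["is_final"]:
            yield event["text"]  # the trigger to send the text to the LLM

events = [
    {"text": "He", "is_final": False},
    {"text": "Hell", "is_final": False},
    {"text": "Hello", "is_final": True},
    {"text": "how are you", "is_final": True},
]
print(list(consume_transcripts(events)))  # ['Hello', 'how are you']
```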
3. Component 2: The Brain (LLM)
The LLM part is standard (GPT-4o / Claude 3.5 Sonnet), but the Prompt Engineering is radically different.
3.1 Voice-Specific Prompting
- Brevity: People read faster than they listen. Reading a paragraph takes 5s. Listening to it (at 150wpm) takes 25s.
- Instruction: “Be concise. Use colloquial language. Do not use lists or bullet points. Avoid markdown.”
- Bad Response: “Here are three reasons: 1. Speed, 2. Cost, 3. Quality.” (Sounds robotic).
- Good Response: “Well, it mostly comes down to speed, cost, and quality.” (Sounds natural).
- Filler Words: To mask latency, you can instruct the model or the orchestrator to emit "Fillers" immediately (`"Hmm, let me check..."`, `"Okay..."`, `"Got it."`).
- Technique: Optimistic Acknowledge. As soon as STT finishes, play a pre-recorded "Sure!" sound file from the client cache while the server thinks. This buys you 500ms of "perceived" zero latency.
4. Component 3: The Mouth (TTS)
Text-to-Speech (TTS). The voice is the brand.
- ElevenLabs: The gold standard for quality.
- Quality: Indistinguishable from human. Emotional range (can whisper, shout, laugh).
- Latency: High (~500ms - 1s with Turbo).
- Cost: Expensive ($0.15 / 1k chars).
- OpenAI TTS (tts-1): Good quality, highly consistent.
- Latency: Fast (~300ms).
- Cost: Cheap ($0.015 / 1k chars).
- Sound: Slightly robotic/metallic compared to ElevenLabs.
- Cartesia (Sonic): The new speed demon.
- Latency: ~100ms (Generation Time).
- Usage: Real-time conversational agents where snappy turns matter more than perfect “Radio DJ” voice.
- PlayHT: High fidelity, great cloning capabilities.
Streaming TTS: You must stream the audio back. Playing it all at once adds seconds of delay. The TTS engine should yield MP3/PCM chunks as soon as it generates them.
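A common companion pattern is to cut the LLM token stream at sentence boundaries and ship each sentence to the streaming TTS endpoint as soon as it closes, so synthesis of sentence 1 overlaps with generation of sentence 2. A minimal sketch (the boundary rule here is a simplified assumption; real pipelines handle abbreviations, numbers, etc.):

```python
import re

SENTENCE_END = re.compile(r"[.!?]\s*$")  # naive end-of-sentence detector

def sentences_from_tokens(token_stream):
    """Buffer LLM tokens and yield a sentence the moment it completes,
    so TTS can start speaking before the LLM finishes the full reply."""
    buf = ""
    for tok in token_stream:
        buf += tok
        if SENTENCE_END.search(buf):
            yield buf.strip()  # send this chunk to the streaming TTS API
            buf = ""
    if buf.strip():            # flush any trailing fragment
        yield buf.strip()

tokens = ["Well", ", it mostly", " comes down to speed.", " Cost matters", " too."]
print(list(sentences_from_tokens(tokens)))
# ['Well, it mostly comes down to speed.', 'Cost matters too.']
```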
5. The Critical Control Logic: VAD and Barge-In
This is where most amateur voice agents fail. The ability to handle “Silence” and “Interruption” determines if the user feels listened to.
5.1 Voice Activity Detection (VAD)
How does the agent know you finished your sentence?
- Human: “I want a burger…” (Pause 500ms) “…and fries.”
- Bad Agent: Hears “burger”. Interrupts: “What kind of burger?”
- Human: “…and fries. Shut up.”
The Silence Timeout: We configure a VAD threshold (e.g., 700ms of silence). If silence > 700ms, we assume the turn is over.
- Trade-off:
- Short Timeout (300ms): Snappy, but interrupts mid-thought.
- Long Timeout (1000ms): Polite, but feels sluggish. The agent seems “Slow”.
- Adaptive Timeout: Use a fast LLM to predict if the sentence is complete (“I want a…” -> Incomplete. “I want a burger.” -> Complete). If incomplete, extend timeout.
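The adaptive-timeout idea can be sketched with a cheap completeness heuristic standing in for the fast LLM. The cue-word list and the timeout values are illustrative assumptions:

```python
# Stand-in for a fast "is this sentence complete?" classifier. In production
# this would be a small, fast LLM call; a heuristic keeps the sketch local.
TRAILING_CUES = ("and", "or", "but", "a", "the", "to", "with", "um", "uh")

def endpoint_timeout_ms(partial_transcript: str) -> int:
    """Return how long to wait in silence before closing the user's turn."""
    words = partial_transcript.lower().rstrip(".!?,").split()
    if not words or words[-1] in TRAILING_CUES:
        return 1500  # likely mid-thought: be patient
    return 700       # likely complete: stay snappy

print(endpoint_timeout_ms("I want a"))         # 1500
print(endpoint_timeout_ms("I want a burger"))  # 700
```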
5.2 Barge-In (Interruption)
If the Agent is speaking, and the User starts speaking, the Agent MUST stop immediately. This requires a complex state transition and robust Echo Cancellation.
The Echo Cancellation (AEC) Problem
If you play sound out of the speakers, the microphone picks it up. Without AEC, the Agent hears its own voice loop back into the mic.
- Agent says “Hello”.
- Mic hears “Hello”.
- VAD detects speech.
- Agent thinks User spoke.
- Agent interrupts itself to listen.
- Result: The “Schizophrenic Loop” where the agent stutters endlessly.
- Solution: WebRTC has built-in AEC. If building raw Python pipelines, you need libraries like `speexdsp` or hardware DSP.
The Interrupt Protocol
- Trigger: VAD detects speech energy while `AgentState == SPEAKING`.
- Action 1 (Network): Send a `CLEAR` packet to the client. The client immediately wipes its audio buffer (stops playing).
- Action 2 (Backend): Kill the LLM generation task (stop paying for tokens). Kill the TTS generation task.
- Action 3 (State): Transition `AgentState` to `LISTENING`. Discard the previous context of what the agent was about to say.
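The interrupt protocol reduces to a small state machine. A minimal sketch, where the transport and task objects are stand-ins for your real network layer and async tasks:

```python
from enum import Enum, auto

class AgentState(Enum):
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()

class TurnController:
    def __init__(self, transport, llm_task=None, tts_task=None):
        self.state = AgentState.LISTENING
        self.transport = transport  # stand-in with a .send(packet) method
        self.llm_task = llm_task    # stand-ins with a .cancel() method
        self.tts_task = tts_task

    def on_vad_speech(self):
        """Called when VAD detects speech energy."""
        if self.state != AgentState.SPEAKING:
            return                          # not a barge-in; normal turn start
        self.transport.send("CLEAR")        # client wipes its audio buffer
        for task in (self.llm_task, self.tts_task):
            if task:
                task.cancel()               # stop paying for tokens / audio
        self.state = AgentState.LISTENING   # discard the abandoned response

class _Recorder:  # tiny test double
    def __init__(self): self.log = []
    def send(self, pkt): self.log.append(pkt)
    def cancel(self): self.log.append("cancelled")

net, job = _Recorder(), _Recorder()
ctl = TurnController(net, llm_task=job, tts_task=job)
ctl.state = AgentState.SPEAKING
ctl.on_vad_speech()
print(net.log, job.log, ctl.state)  # ['CLEAR'] ['cancelled', 'cancelled'] AgentState.LISTENING
```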
6. Architecture Diagram
The flow of data is a Figure-8 loop.
```mermaid
sequenceDiagram
    participant User
    participant VAD
    participant STT
    participant LLM
    participant TTS
    participant AudioPlayer
    User->>VAD: Audio Stream
    VAD->>STT: Speech Frames
    STT->>LLM: "Hello"
    LLM->>TTS: "Hi there" (Stream)
    TTS->>AudioPlayer: Audio Bytes
    AudioPlayer->>User: Sound ("Hi there...")
    Note right of User: User Interrupts! "Wait!"
    User->>VAD: "Wait!" (Energy Spike)
    VAD->>AudioPlayer: STOP COMMAND (Clear Buffer)
    AudioPlayer-->>User: (Silence)
    VAD->>STT: "Wait!"
    STT->>LLM: "Wait!"
```
7. Cost Analysis: The Economics of Voice
Voice is expensive. Text chat is nearly free. Let’s do the math for a 10-minute active conversation.
7.1 Premium Stack (Deepgram + GPT-4o + ElevenLabs)
- STT (Deepgram): $0.0043 / min. (10 mins = $0.043).
- LLM (GPT-4o): ~$0.01 / turn. ~10 turns/min = $0.10 / min. (10 mins = $1.00).
- TTS (ElevenLabs): at $0.15 / 1k chars, roughly $0.15 / min of generated speech. (10 mins = $1.50).
- Total: $2.54 per call.
- Viability: Only viable for high-value services (Legal, Medical, Banking).
7.2 Economy Stack (Deepgram + GPT-4o-mini + OpenAI TTS)
- STT (Deepgram): $0.0043 / min. (10 mins = $0.043).
- LLM (GPT-4o-mini): ~$0.001 / turn. (10 mins = $0.10).
- TTS (OpenAI TTS): $0.015 / min. (10 mins = $0.15).
- Total: $0.29 per call.
- Viability: Viable for Customer Support, Ordering, consumer apps.
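Both estimates fall out of a simple per-minute model (rates copied from the figures above; real bills vary with actual talk time and turn counts):

```python
def call_cost(minutes, stt_per_min, llm_per_min, tts_per_min):
    """Naive model: assumes every minute of the call is billable on each leg."""
    return round(minutes * (stt_per_min + llm_per_min + tts_per_min), 2)

# Premium: Deepgram + GPT-4o (~$0.01/turn x 10 turns/min) + ElevenLabs
premium = call_cost(10, stt_per_min=0.0043, llm_per_min=0.10, tts_per_min=0.15)
# Economy: Deepgram + GPT-4o-mini + OpenAI TTS
economy = call_cost(10, stt_per_min=0.0043, llm_per_min=0.01, tts_per_min=0.015)
print(premium, economy)  # 2.54 0.29
```

Note how TTS and LLM dominate: STT is under 2% of the premium bill, so the unit-economics levers are the model and the voice.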
Business Model: You generally cannot offer free unlimited voice agents. You need a subscription (SaaS) or usage caps (Credits).
8. Summary
Voice Agents are a distributed systems problem wrapped in a UX problem.
- STT/TTS Latency dominates the first impression.
- VAD tuning is the difference between “Smart” and “Annoying”.
- Barge-In requires rigorous state management to prevent feedback loops.
- Cost requires careful model selection.
To manage this complexity, we often rely on dedicated Voice Agent Frameworks like LiveKit and Pipecat that abstract away the plumbing of WebRTC and VAD.
FAQ
Q: What components make up a voice agent architecture? A: A voice agent pipeline consists of six stages: AudioIn (microphone capture), VAD (voice activity detection to detect speech boundaries), STT (speech-to-text transcription), LLM (language model for reasoning), TTS (text-to-speech synthesis), and AudioOut (speaker playback). Each stage streams data to the next for low latency.
Q: How much does it cost to run a voice agent per call? A: A 10-minute call with a premium stack (Deepgram STT, GPT-4o, ElevenLabs TTS) costs roughly $2.54. An economy stack (Deepgram, GPT-4o-mini, OpenAI TTS) costs about $0.29. TTS and LLM are the biggest cost drivers, making model selection critical for unit economics.
Q: What is barge-in and why is it important for voice agents? A: Barge-in is the ability for a user to interrupt the agent while it is speaking. It requires the agent to immediately stop TTS playback, cancel LLM generation, clear audio buffers, and transition to a listening state. Without it, the agent talks over the user, breaking conversational flow.
Q: How does echo cancellation work in voice agents? A: Echo cancellation prevents the agent from hearing its own voice played through the speakers via the microphone. Without it, the agent detects its own speech as user input, creating a feedback loop where it interrupts itself endlessly. WebRTC provides built-in AEC, or you can use libraries like speexdsp.
Originally published at: arunbaby.com/ai-agents/0017-voice-agent-architecture