8 minute read

“Removing the Text Bottleneck: The Omni Future.”

TL;DR

Speech-to-Speech (S2S) models eliminate the cascade latency of STT-LLM-TTS pipelines by processing audio tokens natively through a single transformer, cutting response times to under 300ms while preserving emotion and prosody. GPT-4o leads with human-level interaction quality, while open alternatives like Moshi (dual-stream output) and Ultravox (Whisper+Llama hybrid) offer self-hostable options. The core enabler is audio tokenization via neural codecs like EnCodec, which convert audio into discrete integers that transformers process identically to text. Trade-offs include roughly 10x higher token costs, harder safety filtering, and reduced output control — making hybrid voice architectures the practical choice for production systems that need both natural conversation and precise data reading.

Two parallel signal processing paths, one through three converters in series, one through a single direct processor

1. Introduction: The Cascade Problem

Standard voice agents use a Cascade Architecture: Audio -> STT -> Text -> LLM -> Text -> TTS -> Audio.

This works; it powers Siri, Alexa, and most enterprise bots. But it has two fatal flaws that prevent true “Human-Level” interaction:

1.1 The Latency Stackup

  • STT Processing: 300ms
  • LLM Time-To-First-Token: 400ms
  • TTS Generation: 300ms
  • Network RTT: 100ms
  • Total: ~1.1 seconds (best case). This floor is hard to break because each stage must wait for meaningful input from the previous one before it can start: the LLM cannot begin reasoning about the sentence until the STT has finished transcribing it (a quick arithmetic check follows this list).
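A back-of-the-envelope sanity check of that floor (a minimal sketch; the stage numbers are the illustrative figures above, not measurements, and the S2S model latency is an assumption):

```python
# Cascade latency floor: stages run in series, so their latencies add.
# Figures are the illustrative estimates from the list above.
CASCADE_STAGES_MS = {
    "stt": 300,          # speech-to-text finishes the sentence first
    "llm_ttft": 400,     # LLM time-to-first-token
    "tts": 300,          # text-to-speech synthesis of the first chunk
    "network_rtt": 100,  # round trip to the servers
}

cascade_floor_ms = sum(CASCADE_STAGES_MS.values())
print(f"Cascade best-case floor: {cascade_floor_ms} ms")   # 1100 ms

# An S2S model collapses the three converters into one forward pass, so only
# the model's own time-to-first-audio-token (assumed here) plus network remains.
s2s_floor_ms = 200 + CASCADE_STAGES_MS["network_rtt"]
print(f"S2S best-case floor: {s2s_floor_ms} ms")           # ~300 ms
```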

1.2 Information Loss (The Emotional Flatness)

Text is a “Lossy Compression” of speech. By converting audio to text, we strip away:

  • Prosody: The rhythm and intonation.
  • Emotion: Sarcasm, Anger, Excitement, Uncertainty.
  • Non-Verbal Cues: Breaths, laughter, hesitation (“Umm…”).

Example:

  • User (Sarcastic): “Oh, great job. Really amazing.”
  • Text Representation: “Oh, great job. Really amazing.”
  • LLM Brain: Sees positive sentiment words (“Great”, “Amazing”). Replies: “Thanks! I’m glad you liked it.”
  • Reality: The user is angry, and the agent just made it worse.

To fix this, we need Speech-to-Speech (S2S) models: models that “hear” audio and “speak” audio natively, without ever converting to text in the middle.


2. Anatomy of S2S Models: How do they work?

How can a Transformer, which is designed for discrete text tokens (integers), process continuous audio waves?

2.1 Audio Tokenization (The Rosetta Stone)

We use a Neural Audio Codec (like Meta’s EnCodec or OpenAI’s internal codec). These models use Residual Vector Quantization (RVQ).

  1. Encoder: Takes raw audio (24 kHz). Compresses it into a dense latent space.
  2. Quantizer: Maps the latent vectors to the nearest entry in a fixed “Codebook” (Visualise a dictionary of 2048 distinct “Sound Units”).
  3. Result: Audio becomes a sequence of integers: [45, 1029, 33, 500...] just like text is [101, 205, ...].
  4. Rate: Usually 50 to 100 tokens per second.
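As a concrete illustration of steps 1-3, here is a minimal sketch using Meta's open-source encodec package (the file path and bandwidth setting are placeholder choices, not a recommendation):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the 24 kHz EnCodec model; the target bandwidth controls how many RVQ
# codebooks are used per frame.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# "hello.wav" is a placeholder path; resample/remix to the codec's expected format.
wav, sr = torchaudio.load("hello.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)

# codes: [batch, n_codebooks, time] -- a grid of integers. The 24 kHz model emits
# ~75 frames per second; each frame is a small stack of codebook IDs ("sound units").
codes = torch.cat([codes for codes, _ in encoded_frames], dim=-1)
print(codes.shape, codes[0, 0, :8].tolist())  # first codebook's first 8 token IDs
```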

2.2 The Omni-Model Training

Once audio is tokenized, it is just “Language” to the Transformer.

  • Input Sequence: [Text Tokens] + [Image Tokens] + [Audio Tokens].
  • Training Objective: “Next Token Prediction”.
  • Output: The model can predict that after the audio tokens for “Hello”, the next audio tokens should be “How are you?”.
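A minimal sketch of what that looks like in practice: once audio is integers, text and audio IDs can share one vocabulary and one next-token-prediction loss (the vocabulary sizes and token IDs below are made up for illustration):

```python
import torch
import torch.nn.functional as F

TEXT_VOCAB = 32_000          # assumed text vocabulary size
AUDIO_VOCAB = 2_048          # assumed codec codebook size
VOCAB = TEXT_VOCAB + AUDIO_VOCAB

def audio_id(codec_token: int) -> int:
    """Shift codec tokens past the text range so both share one embedding table."""
    return TEXT_VOCAB + codec_token

# One interleaved training sequence: [text tokens] + [audio tokens for the reply].
sequence = torch.tensor([[101, 205, 9_876, audio_id(45), audio_id(1029), audio_id(33)]])

logits = torch.randn(1, sequence.shape[1], VOCAB)   # stand-in for the transformer's output
loss = F.cross_entropy(                             # plain next-token prediction
    logits[:, :-1].reshape(-1, VOCAB),
    sequence[:, 1:].reshape(-1),
)
print(loss.item())
```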

GPT-4o (Omni) is the state of the art. It processes audio, vision, and text in a single transformer.

  • Capability: It comes closer to human interaction than anything before. It can hear you breathing. It can sing. It can laugh. It detects sarcasm. It can change its tone mid-sentence if you interrupt it.
  • Latency: ~300ms (Average human response time).

3. The Landscape: GPT-4o, Moshi, Ultravox

3.1 Kyutai Moshi (Open Source S2S)

Kyutai, a French research lab, released Moshi, the first high-quality open real-time S2S model.

  • Components: Helium (7B LLM) + Mimi (Audio Codec).
  • Innovation: Dual Stream Output.
  • Standard LLMs output one token at a time.
  • Moshi outputs two streams in parallel:
    1. Text Token: The semantic logic (“I am thinking about cats.”)
    2. Audio Token: The acoustic realization (how those words actually sound when spoken).
  • This allows developers to inspect the “Thought Process” (Text) while playing the Audio.
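Conceptually, the decoding loop looks something like the sketch below. This is a simplification of Moshi's actual architecture (which also models the user's audio stream), and model.init_state / model.step are hypothetical interfaces:

```python
def dual_stream_decode(model, audio_in, max_steps=500):
    """Toy loop: at every step the model emits one text token and one audio token."""
    text_stream, audio_stream = [], []
    state = model.init_state(audio_in)                   # hypothetical API
    for _ in range(max_steps):
        text_tok, audio_tok, state = model.step(state)   # two heads, one step
        text_stream.append(text_tok)     # the inspectable "inner monologue"
        audio_stream.append(audio_tok)   # the codec tokens actually played back
        if text_tok == model.eos_id:
            break
    return text_stream, audio_stream
```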

3.2 Ultravox (The Hybrid)

Ultravox is an open-weights architecture fusing Llama 3 with a Whisper Encoder and a Unit-based Vocoder.

  • Approach: Instead of training from scratch (expensive), it uses a “Projector” to map Whisper’s audio embeddings into Llama 3’s embedding space.
  • Pros: You can self-host it. It leverages the reasoning power of Llama 3.
  • Cons: Not as fluid as GPT-4o yet. It feels like a “Smart Speaker” rather than a “Person”.
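The core trick is small: a learned projector maps Whisper encoder states into Llama's embedding space so the (mostly frozen) LLM can attend to them like word embeddings. A minimal sketch, with illustrative dimensions rather than Ultravox's real configuration:

```python
import torch
import torch.nn as nn

WHISPER_DIM = 1280   # illustrative: Whisper large encoder width
LLAMA_DIM = 4096     # illustrative: Llama 3 8B hidden size

class AudioProjector(nn.Module):
    """Maps Whisper encoder frames into the LLM's token-embedding space."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(WHISPER_DIM, LLAMA_DIM),
            nn.GELU(),
            nn.Linear(LLAMA_DIM, LLAMA_DIM),
        )

    def forward(self, whisper_states: torch.Tensor) -> torch.Tensor:
        return self.net(whisper_states)   # [batch, frames, LLAMA_DIM]

# The projected frames are concatenated with normal text embeddings and fed to Llama.
projector = AudioProjector()
audio_embeds = projector(torch.randn(1, 300, WHISPER_DIM))  # ~6 s of audio frames
print(audio_embeds.shape)  # torch.Size([1, 300, 4096])
```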

4. Engineering S2S Agents: New Paradigms

Working with S2S models is fundamentally different from working with text agents.

4.1 Prompting “Sound”

S2S models respond to Audio Prompts.

  • Technique: One-Shot Voice Cloning. Provide distinct audio examples in the context window.
  • User: [Audio of Morgan Freeman] “Speak like this.”
  • Model: [Audio in Morgan Freeman style] “Okay, I will.”
  • Style Tokens: The model supports meta-tags in the prompt: [Style: Whispering] [Emotion: Fear].
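What an assembled audio prompt might look like, as a hedged sketch for a hypothetical S2S endpoint (the field names, style tags, and file paths are illustrative, not any specific vendor's schema):

```python
# Hypothetical request payload for an S2S endpoint -- structure is illustrative only.
request = {
    "system": "You are a calm support agent. [Style: Whispering] [Emotion: Reassuring]",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "audio": "ref_voice.wav"},   # reference style clip
                {"type": "input_audio", "audio": "user_turn.wav"},   # the actual question
            ],
        }
    ],
    "modalities": ["audio"],   # ask for spoken output rather than text
}
```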

4.2 Handling Interruptions (Duplex)

In the cascade, interruption is handled by explicit logic (VAD triggers a stop). In S2S, the model hears the user and, if echo cancellation fails, also hears itself speaking.

  • The Problem: If the model hears the user saying “Wait”, does it predict silence (stop talking)?
  • Training: Models must be fine-tuned on “Turn-Taking” data where interruptions occur, so they learn to yield the floor naturally.
  • Moshi’s Solution: It predicts “Silence Tokens” for its own channel when the User channel is active.
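A toy sketch of that yield-the-floor behaviour, enforced explicitly rather than learned (the silence token ID, the user-speech flag, and the model API are placeholders):

```python
SILENCE_TOKEN = 0   # placeholder ID for the codec's "silence" unit

def next_agent_token(model, state, user_is_speaking: bool) -> int:
    """Full-duplex step: the agent keeps stepping, but yields while the user talks."""
    if user_is_speaking:
        # Moshi-style models learn this behaviour; here it is hard-coded for clarity.
        return SILENCE_TOKEN
    return model.sample_agent_audio_token(state)   # hypothetical API
```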

5. Cost, Safety, and Challenges

5.1 Cost

Processing Audio Tokens is much more expensive than Text Tokens.

  • Density: 1 second of audio might be 50 tokens, so a 10-second question plus a 10-second spoken reply is roughly 500 input tokens + 500 output tokens = 1,000 tokens.
  • Comparison: The text “How are you?” is 3 tokens. The audio “How are you?” is 100 tokens.
  • Price: GPT-4o Audio pricing is significantly higher (~10x) than text. Using S2S for everything burns money.
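A rough cost sanity check (the per-token prices below are placeholders; substitute your provider's current rates):

```python
# Placeholder per-million-token prices -- substitute your provider's real rates.
TEXT_PRICE_PER_M = 5.00
AUDIO_PRICE_PER_M = 50.00      # assumed ~10x text, per the estimate above

AUDIO_TOKENS_PER_SECOND = 50   # typical codec rate

def turn_cost(seconds_in: float, seconds_out: float) -> float:
    """Audio cost of one conversational turn, input plus output."""
    tokens = (seconds_in + seconds_out) * AUDIO_TOKENS_PER_SECOND
    return tokens / 1_000_000 * AUDIO_PRICE_PER_M

# A 10 s question with a 10 s answer is ~1000 audio tokens per turn.
print(f"${turn_cost(10, 10):.4f} per turn")  # vs a handful of text tokens for the same words
```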

5.2 Safety (Audio Injection)

A text model's output can be filtered with regex and stop-word lists. An audio model is much harder to filter.

  • Scenario: The user asks the model to “Say ‘I hate you’ in a happy voice.”
  • Scenario: The model produces a sound that sounds like a racial slur but isn’t the exact word in the text transcript.
  • Jailbreaks: “Audio Adversarial Attacks” (adversarially crafted noise that sounds like static to humans but steers the model into producing harmful content).

5.3 Control

It is harder to steer. If you want the agent to say exactly “Your balance is $54.20”, an S2S model might say “Roughly fifty bucks”. In banking, this is unacceptable.

  • Solution: Hybrid pipelines. Use S2S for chit-chat, switch to Text-to-Speech for strict data reading.
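A minimal routing sketch for that hybrid approach (the regex and the two speech backends are illustrative stand-ins):

```python
import re

# Anything with exact figures goes through deterministic TTS; chit-chat stays S2S.
STRICT_PATTERN = re.compile(r"\$\d|account|balance|routing number|\d{2,}")

def route_reply(reply_text: str, s2s_speak, tts_speak):
    """Pick the speech path per reply: natural S2S vs verbatim TTS."""
    if STRICT_PATTERN.search(reply_text):
        return tts_speak(reply_text)    # reads exactly "Your balance is $54.20"
    return s2s_speak(reply_text)        # free to be conversational

# route_reply("Your balance is $54.20", s2s_speak=..., tts_speak=...)  -> TTS path
```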

6. Summary

Speech-to-Speech is the endgame for Voice Agents.

  • Latency drops to human levels (< 300ms).
  • Emotion is preserved, enabling true empathy in medical/support agents.
  • Complexity shifts from “Pipeline Engineering” (gluing STT/TTS) to “Model Deployment” (hosting a massive transformer).

This concludes the exploration of Real-Time & Voice. For the full voice agent stack leading up to S2S, see the Voice Agent Architecture series covering frameworks and VAD.


FAQ

What is a Speech-to-Speech AI agent?

A Speech-to-Speech agent processes audio input and generates audio output natively through a single transformer, without converting to text in the middle. This preserves emotion, prosody, and non-verbal cues while cutting latency to under 300ms.

Why is the cascade architecture for voice agents slow?

The cascade pipeline (STT to LLM to TTS) stacks latency from each stage — typically 300ms for STT, 400ms for LLM, 300ms for TTS, plus network overhead — creating a minimum 1.1-second response floor that prevents natural conversation.

How does audio tokenization work in S2S models?

Neural audio codecs like Meta EnCodec use Residual Vector Quantization to compress raw audio into discrete integer tokens (50-100 per second), which the transformer processes identically to text tokens using next-token prediction.

What are the trade-offs of Speech-to-Speech vs cascade voice agents?

S2S models offer lower latency and emotion preservation but cost roughly 10x more per token, are harder to safety-filter since audio bypasses text regex checks, and provide less precise control over exact phrasing — making hybrid pipelines necessary for strict data-reading tasks.


Originally published at: arunbaby.com/ai-agents/0020-speech-to-speech-agents

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch