
“Your TTS vendor’s latency number is a lie. Here’s how to read the fine print.”

TL;DR

Six TTS systems now deliver first audio under 100 milliseconds: Cartesia Sonic Turbo at 40ms, Rime Mist v2 at sub-100ms on-premise, VoXtream2 at 74ms, ElevenLabs Flash v2.5 at 75ms, TADA at 0.09 RTF, and Qwen3-TTS at 97ms. Each trades off differently across latency, quality, cost, and openness. The real story: when TTS stops being the bottleneck, the rest of your voice agent pipeline becomes the constraint. Most architectures were designed around slow TTS. They need rethinking.


Why does sub-100ms TTS matter now?

Because voice agents live or die on perceived responsiveness, and the math is unforgiving.

Humans expect conversational turn gaps of 200-400ms. The cross-linguistic mean is about 200ms, but the range is wide — 7ms mean in Japanese, 469ms mean in Danish (Stivers et al., 2009, PNAS). Past 800ms, callers notice. Past 1.5 seconds, the conversation degrades. Past two seconds, it feels broken.

A voice agent’s round-trip includes three serial steps: speech-to-text (50-200ms), LLM inference (200-800ms), and text-to-speech. If TTS takes 300-500ms, you have already blown past the 800ms threshold before accounting for network latency, queue time, or audio encoding. The agent sounds hesitant. Users talk over it. Completion rates drop.
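
To make the arithmetic concrete, here is a quick back-of-the-envelope sketch using the ranges above (illustrative numbers, not measurements):

```python
# Serial latency budget for one voice-agent turn, using the ranges quoted
# above. Illustrative arithmetic only; measure your own pipeline end to end.
BUDGET_MS = 800  # callers start to notice past this point

stages = {                      # (best, worst) first-token / first-byte latency in ms
    "stt": (50, 200),
    "llm_first_token": (200, 800),
    "tts_ttfb": (40, 100),      # sub-100ms tier; a legacy TTS would be (300, 500)
}

best = sum(lo for lo, _ in stages.values())    # 290 ms
worst = sum(hi for _, hi in stages.values())   # 1100 ms
print(f"best {best}ms / worst {worst}ms vs budget {BUDGET_MS}ms")
```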

The voice AI agent market was valued at $2.4 billion in 2024 and is projected to reach $47.5 billion by 2034, growing at 34.8% CAGR according to Market.us. That growth depends on agents that feel conversational, not robotic. Sub-100ms TTS is the enabler — it gives the LLM step breathing room within the perceptual budget.

Six months ago, only proprietary APIs offered sub-100ms. Now open-source models compete. That changes the economics and the architecture.

How do you actually measure TTS speed?

Three numbers circulate in TTS marketing, and they measure different things.

| Metric | What it measures | Example |
|---|---|---|
| RTF (real-time factor) | Total inference time / audio duration. Below 1.0 = faster than playback. | TADA: 0.09 RTF = 11x faster than real-time |
| TTFB (first-packet latency) | Time from input to first playable audio chunk (milliseconds). | VoXtream2: 74ms on consumer GPU |
| End-to-end latency | TTFB + network + queue + encoding. What the user actually experiences. | Inworld TTS Mini: <130ms P90 |

RTF is a throughput metric. A system with 0.09 RTF generates 100 seconds of audio in 9 seconds of compute — but it might buffer for 200ms before the first chunk arrives. TTFB is a responsiveness metric. A system with 74ms TTFB starts playing audio quickly, even if it generates the full clip slower overall.

For voice agents, TTFB matters more than RTF. Users hear the first syllable, not the last. A system that starts fast and streams continuously sounds responsive even if its total generation time is higher.

End-to-end latency is what callers actually experience. It includes everything TTFB does not: network round-trips, request queuing, audio codec overhead, and WebSocket framing. A 40ms TTFB in a benchmark becomes 120ms in production. Always measure end-to-end.
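
The distinction is easy to measure yourself. Below is a minimal sketch using the requests library to time a streaming TTS HTTP endpoint from the region you will deploy in; the URL, payload shape, and auth header are placeholders for whichever API you are benchmarking.

```python
# Time both TTFB and total generation time for a streaming TTS request.
# Endpoint, payload, and auth are placeholders; adapt to your vendor's API.
import time
import requests

def measure_streaming_tts(url: str, text: str, api_key: str) -> dict:
    start = time.perf_counter()
    first_chunk_at = None
    total_bytes = 0

    with requests.post(
        url,
        json={"text": text},
        headers={"Authorization": f"Bearer {api_key}"},
        stream=True,
        timeout=30,
    ) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=4096):
            if first_chunk_at is None and chunk:
                first_chunk_at = time.perf_counter()   # end-to-end TTFB, network included
            total_bytes += len(chunk)

    end = time.perf_counter()
    return {
        "ttfb_ms": (first_chunk_at - start) * 1000,
        "total_ms": (end - start) * 1000,
        "audio_bytes": total_bytes,
    }
```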

```mermaid
sequenceDiagram
    participant User
    participant STT
    participant LLM
    participant TTS
    participant Speaker

    User->>STT: Speech input
    Note over STT: 50-200ms
    STT->>LLM: Transcript
    Note over LLM: 200-800ms (streaming)
    LLM->>TTS: First text chunk
    Note over TTS: TTFB: 40-100ms
    TTS->>Speaker: First audio packet
    Note over Speaker: User hears response
    Note over User,Speaker: Total round-trip target: < 800ms
```

How does TADA achieve 0.09 real-time factor?

Most TTS systems suffer from a modality mismatch: one text token maps to a variable number of acoustic frames. A single word like “extraordinary” might span dozens of mel-spectrogram frames, while “cat” spans three. Traditional models handle this with attention mechanisms that occasionally skip words or hallucinate — inserting sounds that have no corresponding text.

TADA (Text-Acoustic Dual Alignment) from Hume AI eliminates this by enforcing one-to-one alignment between continuous acoustic vectors and text tokens. Each autoregressive step covers exactly one text token. The model dynamically determines duration and prosody per token, but it cannot skip or insert because the alignment is structural, not learned.
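
A toy illustration of that structural constraint: the decoding loop advances exactly one text token per step, so the alignment is the loop itself and nothing can be skipped or inserted. The model below is a random stub, not Hume's implementation.

```python
# One autoregressive step per text token: the loop index is the alignment,
# so skips and insertions are impossible by construction. Stub model only.
import numpy as np

rng = np.random.default_rng(0)

def decoder_step(token: str, state: np.ndarray) -> tuple[np.ndarray, float, np.ndarray]:
    """Stub for one step: returns (acoustic vector, duration in seconds, new state)."""
    acoustic = rng.normal(size=64)          # continuous acoustic vector for this token
    duration = 0.05 + 0.03 * len(token)     # toy per-token duration model
    return acoustic, duration, state + 1.0

def synthesize(tokens: list[str]) -> list[tuple[str, float]]:
    state = np.zeros(1)
    alignment = []
    for tok in tokens:                       # exactly one slot per token
        _, dur, state = decoder_step(tok, state)
        alignment.append((tok, dur))
    return alignment

print(synthesize(["the", "cat", "sat"]))
```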

The results: zero hallucinations on 1,000+ test samples from the LibriTTS-R dataset, and a 682-second context window within a 2,048-token limit. For comparison, conventional TTS systems fit roughly 70 seconds in the same token budget — TADA achieves a 10x context expansion by eliminating the redundant acoustic tokens.

The architecture builds on Llama 3.2, available in 1B and 3B parameter variants. The code is MIT-licensed; the model weights use the Llama 3.2 Community License. Released March 10, 2026 on GitHub with weights on Hugging Face.

The 0.09 RTF number comes from H100 inference. What Hume has not published is a standalone TTFB measurement — the time to first audio chunk. RTF and TTFB can diverge significantly depending on buffering strategy, so treat the 0.09 number as a throughput ceiling, not a responsiveness guarantee.

What makes VoXtream2’s 74ms first-packet possible?

VoXtream2 (arXiv 2603.13518) attacks latency at every pipeline stage simultaneously.

The architecture has four layers, each designed to minimize waiting:

  1. Incremental Phoneme Transformer: Processes text with monotonic alignment and a dynamic look-ahead of roughly 10 phonemes. This means it starts generating audio before the full sentence is parsed — less than 100ms input delay.

  2. Temporal Transformer: Jointly predicts semantic tokens and duration tokens using distribution matching with classifier-free guidance. Duration prediction happens in parallel with content generation, not after it.

  3. Depth Transformer: Generates acoustic tokens conditioned on a speaker embedding. Zero-shot voice cloning from a reference clip.

  4. Streaming vocoder: Converts acoustic tokens to waveform with minimal look-back. Traditional vocoders like HiFi-GAN need 10+ frames of context (50-100ms delay). VoXtream2’s vocoder streams with 1-2 frames.

The combined result: 74ms first-packet on a consumer GPU, 4x faster than real-time throughput, trained on a 40,000-hour corpus (30K from Emilia, 10K from HiFiTTS-2). It supports dynamic speaking rate adjustment mid-utterance — you can slow down for emphasis or speed up for lists without regenerating.
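
Below is a simplified sketch of the incremental pattern behind layers 1 and 4: start emitting frames once a small phoneme look-ahead is buffered, rather than after the whole sentence. Every component is a stub; it shows the control flow, not VoXtream2's code.

```python
# Incremental synthesis with a bounded phoneme look-ahead. Audio starts
# flowing after ~10 phonemes instead of after the full utterance.
from collections import deque
from typing import Iterator

LOOKAHEAD = 10  # phonemes buffered before the next frame is emitted

def phonemize_stream(text_chunks: Iterator[str]) -> Iterator[str]:
    """Stub grapheme-to-phoneme: treat each character as a 'phoneme' as text arrives."""
    for chunk in text_chunks:
        yield from chunk

def synthesize_stream(text_chunks: Iterator[str]) -> Iterator[bytes]:
    buf: deque[str] = deque()
    for ph in phonemize_stream(text_chunks):
        buf.append(ph)
        if len(buf) >= LOOKAHEAD:
            # In a real system this is one transformer/vocoder step over the
            # oldest phoneme, conditioned only on the short look-ahead window.
            yield f"frame({buf.popleft()})".encode()
    while buf:                              # flush the tail once input ends
        yield f"frame({buf.popleft()})".encode()

for frame in synthesize_stream(iter(["Hello, ", "world. ", "Streaming!"])):
    pass  # frames are available long before the last text chunk arrives
```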

The original VoXtream (arXiv 2509.15969) achieved 102ms. VoXtream2 shaved 28ms by tightening the look-ahead window and replacing the vocoder’s buffering strategy. Open-source, with weights on Hugging Face and a demo at herimor.github.io/voxtream.

Where does open-source Chatterbox fit?

Not in the sub-100ms race. Chatterbox Turbo from Resemble AI measures first-chunk latency in the 187-340ms range on an RTX 4090 — slower than the sub-100ms tier but faster than legacy systems. Its value is elsewhere.

Chatterbox covers 23 languages (Arabic through Chinese), ships at 350M parameters in the Turbo variant, and runs under MIT license. It has 23,900+ GitHub stars and over one million Hugging Face downloads. In blind evaluations, 63.75% of users preferred Chatterbox over ElevenLabs.

The standout features are emotion control (the original model supports exaggeration dials) and paralinguistic tags in Turbo ([cough], [laugh], [chuckle]). Built-in audio watermarking provides content provenance tracking. A community-built Chatterbox-TTS-Server adds a Web UI and an OpenAI-compatible API for self-hosting.

For voice agents where latency is the binding constraint, Chatterbox is the wrong tool. For applications where multilingual coverage, emotion expressiveness, and self-hosting economics matter more than first-packet speed — IVR systems, audiobook generation, content localization — it is one of the strongest open-source options. See self-hosting TTS production economics for the cost analysis.

How do the sub-100ms systems compare?

Every number below comes from the vendor’s own published benchmarks or paper. “Open” means weights are publicly available. Hardware column shows what the latency was measured on, where stated.

| System | TTFB | RTF | Open | Params | Hardware | Best for |
|---|---|---|---|---|---|---|
| Cartesia Sonic Turbo | 40ms | – | No | – | Cloud (SSM) | Absolute minimum latency |
| Rime Mist v2 | <100ms | – | No | – | On-premise | Enterprise, pronunciation accuracy |
| VoXtream2 | 74ms | 0.25 | Yes | – | Consumer GPU | Open-source + low latency |
| ElevenLabs Flash v2.5 | ~75ms | – | No | – | Cloud | 32 languages, reliability |
| TADA (Hume AI) | – | 0.09 | Yes | 1B/3B | H100 | Zero hallucinations, long-form |
| Qwen3-TTS | 97ms | – | Yes | 1.7B | – | Multilingual, voice cloning |
| Inworld TTS Mini | <100ms (median) | – | No | – | Cloud | Gaming, interactive apps |
| Chatterbox Turbo | 187-340ms | 0.499 | Yes | 350M | RTX 4090 | 23 languages, emotion, cost |

Two architectural patterns dominate the sub-100ms tier:

State Space Models (Cartesia): SSMs process sequences more efficiently than Transformers for real-time generation. The sequential nature of audio generation maps well to SSM’s recurrent structure, enabling 40ms TTFB where Transformer-based systems need 70ms+.

Monotonic alignment with streaming vocoders (VoXtream2, TADA): By enforcing left-to-right alignment between text and audio tokens — no future look-ahead beyond a few phonemes — these systems can start output before the input is fully processed. Combined with vocoders that stream with 1-2 frames of context instead of 10+, the pipeline delay collapses.

What breaks in your pipeline when TTS gets this fast?

When TTS took 300-500ms, voice agent architectures could afford sloppy handoffs. Audio buffers absorbed jitter. LLM streaming could stall for 200ms without anyone noticing. Those margins disappear at sub-100ms.

The LLM becomes the bottleneck. If TTS starts generating 40-74ms after receiving text, but the LLM takes 400ms to produce the first token, your user is waiting for the LLM, not the TTS. Speculative decoding and smaller draft models become architecturally necessary — see streaming speech pipelines for pipeline design.
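
One common mitigation, sketched below: flush clause-sized chunks of the LLM stream to TTS as soon as a boundary appears, so TTS TTFB overlaps the rest of the LLM's generation. The token iterator and speak function are placeholders for whatever SDKs you use.

```python
# Hand the first clause of an LLM token stream to TTS as soon as it completes,
# instead of waiting for the full response.
import re

CLAUSE_END = re.compile(r"[.!?,;:]\s*$")   # cheap clause-boundary heuristic
MIN_CHARS = 20                             # avoid flushing tiny fragments

def stream_llm_to_tts(llm_tokens, speak):
    """Flush clause-sized chunks of the LLM stream to TTS as they complete."""
    buf = ""
    for token in llm_tokens:
        buf += token
        if len(buf) >= MIN_CHARS and CLAUSE_END.search(buf):
            speak(buf)        # TTS starts while the LLM keeps generating
            buf = ""
    if buf.strip():
        speak(buf)            # flush whatever remains at end of turn

# Example: stream_llm_to_tts(iter(["Sure, ", "I can ", "help with that. ", "First..."]), print)
```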

Audio buffer strategies need rethinking. Many voice agents buffer 500ms-1s of TTS audio before playback to smooth out jitter. With sub-100ms TTS, that buffer adds more latency than the TTS itself. You need either smaller buffers (50-100ms, accepting occasional underruns) or adaptive buffering that shrinks as TTS proves reliable.
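
A sketch of the adaptive approach: start conservative, shave the target while playback stays clean, and back off hard on an underrun. The step sizes and bounds below are illustrative starting points, not tuned values.

```python
# Adaptive jitter buffer target: shrink while playback is clean, grow on underruns.
class AdaptiveJitterBuffer:
    def __init__(self, start_ms: int = 200, floor_ms: int = 50, ceil_ms: int = 500):
        self.target_ms = start_ms
        self.floor_ms = floor_ms
        self.ceil_ms = ceil_ms

    def on_playout_ok(self) -> None:
        # Each clean playout interval earns a small reduction in buffering.
        self.target_ms = max(self.floor_ms, self.target_ms - 5)

    def on_underrun(self) -> None:
        # An underrun is audible; back off aggressively.
        self.target_ms = min(self.ceil_ms, self.target_ms * 2)
```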

Barge-in detection becomes time-critical. When the agent responds in under 200ms total, the window for detecting user interruption shrinks proportionally. VAD and barge-in systems designed for 500ms+ response times will either cut the agent off too early or too late. The VAD architecture needs tighter coupling with the TTS output stream.
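
A minimal shape for that coupling: playback checks a VAD-driven flag before each chunk and aborts the utterance the moment user speech is detected. The VAD event and playback callback below are placeholders for your own components.

```python
# Abort TTS playback as soon as VAD flags user speech mid-utterance.
import asyncio

async def play_with_barge_in(audio_chunks, vad_speech: asyncio.Event, play_chunk):
    """Play TTS chunks but stop immediately once VAD flags user speech."""
    for chunk in audio_chunks:
        if vad_speech.is_set():      # user started talking: cut the agent off
            return "interrupted"
        await play_chunk(chunk)      # ~20-50ms of audio per chunk keeps the check tight
    return "completed"
```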

WebSocket framing overhead matters. At 40ms TTFB, a 20ms WebSocket frame becomes half your latency budget. Binary WebSocket frames, pre-negotiated codecs, and connection pooling shift from nice-to-have to required.
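
On the transport side, a sketch using the websockets package: one long-lived connection per session, raw PCM sent as binary frames, and permessage-deflate disabled so tiny frames are not held up by compression. The endpoint URI is a placeholder.

```python
# Send PCM chunks as binary frames over a pre-opened WebSocket connection.
import asyncio
import websockets

async def audio_sender(pcm_chunks):
    # One long-lived connection per session instead of one per utterance;
    # compression=None disables permessage-deflate negotiation.
    async with websockets.connect("wss://example.invalid/audio", compression=None) as ws:
        for chunk in pcm_chunks:      # bytes objects go out as binary frames
            await ws.send(chunk)

# asyncio.run(audio_sender([b"\x00" * 640] * 50))   # 50 frames of 20ms, 16kHz, 16-bit mono
```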

Key takeaways

  • Measure TTFB, not RTF, when evaluating TTS for voice agents. RTF tells you throughput; TTFB tells you how fast the user hears the first syllable.
  • Sub-100ms is real from six systems, spanning proprietary (Cartesia 40ms, ElevenLabs 75ms) and open-source (VoXtream2 74ms, Qwen3-TTS 97ms, TADA 0.09 RTF).
  • Chatterbox wins on breadth, not speed. 23 languages, emotion control, MIT license, 187-340ms first-chunk. Right choice for different constraints.
  • TADA’s zero-hallucination guarantee is architecturally unique — one-to-one text-acoustic alignment makes it structurally impossible to skip or insert content.
  • When TTS gets fast, everything else becomes the bottleneck. LLM inference, audio buffering, barge-in detection, and WebSocket framing all need tightening.
  • Open-source crossed the threshold. VoXtream2 at 74ms and TADA at 0.09 RTF mean you no longer need a proprietary API for sub-100ms TTS.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch