
“TTS demos always use one sentence. Ask yourself why.”

TL;DR

Every TTS system sounds great on a sentence. Stretch to five minutes and you get word skipping, repetition, speaker drift, and hallucinated content. The root cause is a 10-40x token imbalance between text and audio. TADA fixes this with one-to-one text-acoustic alignment — 682 seconds in a 2,048-token window, zero hallucinations by construction. Borderless Long Speech Synthesis takes the opposite approach: an agent-centric framework that plans prosody, emotion, and speaker interactions using chain-of-thought reasoning before generating audio. One solves it in the model. The other solves it in the scaffolding. For how TTS fits into the real-time voice agent stack, see voice agent architecture.

[Image: a reel-to-reel tape recorder, the tape running cleanly past the playback head for the first few inches, then visibly bunching and creasing.]

Why does TTS fall apart on long content?

The problem is architectural. Text and audio occupy fundamentally different token spaces. A second of speech corresponds to 2-3 text tokens but takes 12.5-75 audio tokens to encode, depending on the codec. This 10-40x imbalance means autoregressive TTS models must generate vastly more output steps than they receive input steps.

Each generation step introduces a small probability of error. Over a single sentence, these errors are negligible. Over thousands of steps — the territory of a five-minute clip — they compound; the sketch after the list below puts numbers on it. The model loses track of where it is in the text. Attention alignment between encoder and decoder degrades. Four failure modes appear:

Word skipping. The model advances past a word without generating audio for it. Common with repetitive text or complex phoneme sequences.

Repetition. A phrase is spoken twice when it appears once in the source. The attention mechanism circles back to a previous position.

Inserted words. Audio contains words absent from the source text. The model hallucinates content to fill perceived gaps.

Speaker drift. The synthesized voice gradually shifts toward a more generic timbre over extended passages. Chapter three of an audiobook sounds different from chapter one.
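
To see why step count matters, here is a back-of-envelope model (an illustration, not a measurement from either paper): if each autoregressive step fails independently with probability p, the chance of at least one error in N steps is 1 - (1-p)^N. The per-step error rate and codec rate below are assumed values.

# Back-of-envelope: independent per-step error model (illustrative only).
# p and the tokens-per-second rate below are assumptions, not measured values.

def p_any_error(p_step: float, n_steps: int) -> float:
    """Probability of at least one error across n_steps independent steps."""
    return 1 - (1 - p_step) ** n_steps

p = 1e-4                      # assumed per-step error probability
audio_tokens_per_sec = 50     # mid-range rate from the 12.5-75 span above

for seconds in (5, 60, 300):  # one sentence, one minute, five minutes
    n = seconds * audio_tokens_per_sec
    print(f"{seconds:>4}s -> {n:>6} steps -> P(error) = {p_any_error(p, n):.1%}")

# 5s:    250 steps -> ~2.5%
# 60s:  3000 steps -> ~26%
# 300s: 15000 steps -> ~78%

Even a tiny per-step error rate turns near-certain failure at five-minute scale, which is exactly the "works on a demo, fails on a podcast" gap.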

Most TTS models are evaluated on clips between 4 and 10 seconds. Many systems have hard limits at 15 seconds of reference audio. Quality degrades measurably beyond 10-20 seconds of continuous generation. The gap between “works on a demo” and “works on a podcast episode” is enormous.

graph TD
    A[Text: 2-3 tokens/sec] -->|10-40x expansion| B[Audio: 12.5-75 tokens/sec]
    B --> C[Autoregressive generation]
    C --> D{Error at each step}
    D -->|1 sentence| E[Negligible errors]
    D -->|5 minutes| F[Compounding failures]
    F --> G[Word skipping]
    F --> H[Repetition]
    F --> I[Hallucinated words]
    F --> J[Speaker drift]

Production systems work around this with chunking. ElevenLabs segments text at sentence boundaries, generating each chunk separately and using previous/next context parameters to smooth prosodic transitions. Amazon Polly’s long-form engine processes up to 100,000 characters per request using internal chunking (exact segment size is not publicly documented). The stitching produces audible discontinuities at chunk boundaries — subtle enough for casual listening, obvious enough to derail an audiobook.
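A minimal sketch of the chunk-and-stitch pattern appears below. The synthesize callable and its context parameters are hypothetical stand-ins, not a real provider SDK; ElevenLabs documents previous/next context conditioning, but the exact call shape here is illustrative.

# Illustrative chunk-and-stitch loop. `synthesize` and its context
# parameters are hypothetical stand-ins for a provider API.
import re

def split_sentences(text: str) -> list[str]:
    # Naive sentence-boundary split; production systems use smarter segmenters.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def synthesize_long(text: str, synthesize) -> list[bytes]:
    sentences = split_sentences(text)
    clips = []
    for i, sentence in enumerate(sentences):
        clips.append(synthesize(
            text=sentence,
            previous_text=sentences[i - 1] if i > 0 else None,   # prosody hint
            next_text=sentences[i + 1] if i + 1 < len(sentences) else None,
        ))
    return clips  # concatenation happens downstream, and so do the audible seams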

How does TADA make 682 seconds hallucination-free?

TADA (Text-Acoustic Dual Alignment) from Hume AI attacks the token imbalance directly. Instead of generating dozens of audio tokens per text token and hoping the attention mechanism keeps them aligned, TADA enforces one-to-one synchronization. Each autoregressive step produces exactly one text token and one continuous acoustic vector. Text and speech advance in lockstep.

This is not a trained behavior that sometimes fails. It is an architectural constraint that cannot be violated. The model has no mechanism to skip a word — every text token gets an acoustic representation. It has no mechanism to insert a word — there is no acoustic slot without a corresponding text token. Hallucination is prevented by construction.

The efficiency gain is dramatic. Conventional TTS systems consume 12.5-75 tokens per second of audio in their context window. TADA consumes 2-3. The same 2,048-token window that fits 70 seconds of audio in a conventional system fits 682 seconds in TADA — roughly 11 minutes of continuous speech. That is a 10x context expansion from token efficiency alone.
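The arithmetic is direct. The 30 tokens/sec conventional rate below is an assumed mid-range value within the 12.5-75 span, chosen to match the ~70-second figure above.

# Seconds of audio that fit in a fixed context window, given a token rate.
WINDOW = 2048

def seconds_in_window(tokens_per_sec: float, window: int = WINDOW) -> float:
    return window / tokens_per_sec

print(seconds_in_window(30))   # ~68 s: conventional codec, assumed mid-range rate
print(seconds_in_window(3))    # ~682 s: TADA advancing at text-token rate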

The acoustic generation pipeline uses continuous features rather than discrete tokens. The LLM produces acoustic feature vectors that feed into a flow-matching head (a generative model that refines audio quality), which outputs the final waveform. This continuous approach preserves acoustic fidelity that discrete tokenization discards.
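In pseudocode, the lockstep loop looks something like the sketch below. This is a schematic of the constraint as described above, not Hume's actual implementation; lm_step and flow_matching_head are hypothetical names.

# Schematic of one-to-one text-acoustic generation. All function names are
# hypothetical; this illustrates the constraint, not TADA's real code.
def generate(text_tokens, lm_step, flow_matching_head, state=None):
    acoustic_vectors = []
    for token in text_tokens:           # exactly one step per text token
        state, acoustic_vec = lm_step(state, token)
        acoustic_vectors.append(acoustic_vec)
    # No branch can skip a token or emit audio without one, so word
    # skipping and insertion are ruled out structurally, not by training.
    return flow_matching_head(acoustic_vectors)  # continuous vectors -> waveform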

TADA ships as open-source 1B- and 3B-parameter models built on Llama 3.2. Its real-time factor (RTF) of 0.09 on an H100 makes it 5x faster than comparable LLM-based TTS systems. For the cost economics of self-hosting models like this, see self-hosting TTS production economics.

The limitation: TADA’s approach works for single-speaker synthesis. It does not natively handle multi-speaker dialogue, emotional arc planning, or scene-level context. For those, you need a different architectural philosophy.

What does Borderless do differently with chain-of-thought?

Borderless Long Speech Synthesis (arXiv 2603.19798, March 2026) reframes the problem. TADA asks: “How do I generate audio tokens that stay aligned with text tokens?” Borderless asks: “How do I plan what the audio should sound like before I generate it?”

The framework uses hierarchical annotation at three levels:

Global level. Scene semantics, overall emotional arc, speaker roster, acoustic environment. “This is a tense dialogue between two speakers in a quiet room, building toward a confrontation.”

Sentence level. Discourse context, prosodic boundaries, turn-taking patterns. “Speaker A interrupts here. Speaker B’s volume rises. This sentence ends with a question that expects no answer.”

Token level. Phonetic detail, fine-grained timing, emphasis placement. The lowest level of control.

This hierarchy serves as a structured semantic interface between an LLM agent and the synthesis engine. The agent uses chain-of-thought reasoning to plan the global and sentence levels before synthesis begins. Think of it as an audio director’s notes that the TTS engine follows.
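A sketch of what that interface might look like as data follows; this is a hypothetical schema with illustrative field names, since the paper does not publish a concrete format.

# Hypothetical annotation schema for the three levels; field names are
# illustrative assumptions, not the paper's actual interface.
from dataclasses import dataclass, field

@dataclass
class GlobalPlan:
    scene: str            # e.g. "tense dialogue in a quiet room"
    emotional_arc: str    # e.g. "builds toward a confrontation"
    speakers: list[str]   # speaker roster, e.g. ["A", "B"]
    environment: str      # acoustic environment

@dataclass
class SentencePlan:
    speaker: str
    text: str
    prosody: str          # e.g. "volume rises", "falling intonation"
    turn_taking: str = "" # e.g. "interrupts previous speaker"

@dataclass
class Script:
    global_plan: GlobalPlan
    sentences: list[SentencePlan] = field(default_factory=list)
    # Token-level detail (phonetics, timing, emphasis) would hang off
    # each SentencePlan in a full implementation.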

The system uses a continuous tokenizer (not discrete) and supports capabilities that single-speaker systems cannot: multi-speaker synthesis with interruptions and overlapping speech, instruction-following synthesis (“make this part sound more urgent”), and VoiceDesigner for custom voice creation.

Where TADA prevents hallucination through alignment constraints, Borderless prevents incoherence through planning. The trade-off is clear: TADA is simpler and faster for single-speaker long-form content. Borderless handles complex multi-speaker scenarios that TADA’s architecture does not address.

How do these compare to production alternatives?

Two camps have formed in the long-form TTS space.

Continuous tokenizer camp (architectural solutions):

| System | Max duration | Speakers | Approach | Open |
|---|---|---|---|---|
| TADA (Hume AI) | ~11 min (682 s) | 1 | 1:1 text-acoustic alignment | Yes |
| VibeVoice (Microsoft) | ~90 min | 4 | 64K-token context, 7.5 Hz frame rate | Yes |
| Borderless | Extended | Multi | Agent-centric CoT planning | No |

Chunking camp (production workarounds):

| System | Max per request | Approach | Open |
|---|---|---|---|
| ElevenLabs | Per-request limit | Sentence-boundary chunking, context hints | No |
| Amazon Polly | 100,000 chars | Internal chunking (segment size undocumented), async | No |
| Fish Audio Story Studio | Extended | Chapter management, sentence-level revision | Partial |

Microsoft’s VibeVoice (1.5B parameters, August 2025) sits at the ambitious end of the continuous camp. It handles up to 64K tokens — roughly 90 minutes of speech — with up to 4 distinct speakers. The architecture uses a next-token diffusion framework with both acoustic and semantic continuous tokenizers at 7.5 Hz. It was designed specifically for podcast and audiobook production.
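A rough consistency check on those numbers, assuming one token per 7.5 Hz frame per tokenizer stream, which the public materials do not spell out:

# Back-of-envelope token budget for VibeVoice's 90-minute claim.
# The one-token-per-frame assumption is ours, not Microsoft's documentation.
minutes = 90
frame_rate_hz = 7.5
acoustic_tokens = minutes * 60 * frame_rate_hz   # 40,500 tokens
print(acoustic_tokens, "of 64K tokens")          # leaves ~25K for text and overhead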

Fish Audio’s Story Studio takes a different production approach. Rather than solving the technical limit, it builds tooling around it: chapter management, sentence-level revision control, and voice consistency mechanisms that prevent drift across hours of content. It treats the 10-20 second generation limit as a given and engineers around it.

For audiobook production — where human narration costs $200-400 per finished hour — the quality bar is high and the economic incentive is real. A 50,000-word book runs roughly 6 hours of audio. At human rates, that is $1,200-2,400. AI generation costs a fraction, but only if quality survives the full duration.
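The numbers, spelled out; the 150 words-per-minute narration pace is a typical assumption, not a figure from the sources above.

# Audiobook cost arithmetic; the narration pace is an assumed typical value.
words = 50_000
words_per_minute = 150                             # assumed narration pace
finished_hours = words / (words_per_minute * 60)   # ~5.6 h, i.e. roughly 6
low, high = 200 * 6, 400 * 6                       # $200-400 per finished hour
print(f"~{finished_hours:.1f} h of audio -> ${low:,}-${high:,} human narration")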

Which approach fits which use case?

Single-speaker long content (reports, articles, accessibility): TADA. The zero-hallucination guarantee matters when information fidelity is non-negotiable. The 682-second context handles most documents in a single pass.

Multi-speaker production (podcasts, dialogue, audiobooks with characters): Borderless or VibeVoice. Multi-speaker synthesis with interruptions, emotional arcs, and scene-level planning requires the scaffolding approach.

High-volume production pipelines: Fish Audio Story Studio or ElevenLabs. When you need revision control, chapter management, and human-in-the-loop editing, the tooling matters more than the model architecture.

Voice agents reading retrieved documents: TADA. A voice agent that reads a long policy document or research paper needs the content to be exactly right. Hallucinated words in a legal document are not acceptable.

Key takeaways

  • The root cause is token imbalance. Audio requires 10-40x more tokens than text. Autoregressive errors compound over long sequences into skipping, repetition, and hallucination.
  • TADA solves it with alignment. One-to-one text-acoustic sync makes hallucination structurally impossible. 682 seconds in a 2,048-token window.
  • Borderless solves it with planning. Chain-of-thought reasoning plans prosody, emotion, and speaker interactions before generation begins.
  • Production systems still chunk. ElevenLabs and Amazon Polly segment text and stitch audio. It works, with audible seams.
  • VibeVoice pushes the boundary. 90 minutes, 4 speakers, 64K context. The most ambitious continuous approach in production.
  • The audiobook economics are real. $200-400/hr for human narration versus cents for AI. Quality on long content is the remaining barrier.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch