
TL;DR: Llasa (arXiv:2502.04128, HKUST, February 2025) applies inference-time compute scaling to text-to-speech: instead of always taking the single most likely token, it runs beam search with speech understanding verifiers to find higher-quality outputs. The backbone is a standard LLaMA model plus X-Codec2, a single-layer vector quantizer encoding all speech information into one token stream. Scaling compute at inference explicitly trades latency for quality: greedy decode is fast and mediocre; width-5 beam search is slower and noticeably better. For practitioners, this means a new knob: how much inference budget do you want to spend per utterance?

Circuit traces branching from a single point, one illuminated path leading to a speaker — representing beam search selecting the best speech path


The AI field spent two years learning that scaling inference compute improves text reasoning. You apply more compute at generation time (via beam search, best-of-N sampling, or Monte Carlo tree search) and the model’s outputs improve. This is how o1 works. It’s how DeepSeek-R1 works.

Llasa asks: what if the same principle applies to speech?

The answer is yes, and the implications are practical enough to matter for anyone building TTS pipelines today.

What inference-time scaling actually means for TTS

Training-time scaling is familiar: more parameters, more data, longer training runs produce better models. The results transfer to every inference call uniformly: you pay once at training, benefit forever.

Inference-time scaling works differently. The model is fixed; you spend more compute at generation time to improve individual outputs. For a language model, this means sampling multiple candidate responses and selecting the best one via some scoring function. For a TTS system, the same principle applies: generate multiple candidate token sequences and select the one that best satisfies measurable speech quality criteria.

Llasa implements this as verifier-guided beam search:

Input text
    │
    ▼
LLaMA decoder generates N candidate token continuations
    │
    ├─── Candidate 1 → Verifier scores (ASR accuracy, speaker consistency)
    ├─── Candidate 2 → Verifier scores
    ├─── Candidate 3 → Verifier scores
    │    ...
    └─── Candidate N → Verifier scores
             │
             ▼
        Top-M candidates advance
             │
             ▼
        Next decoding step

The verifiers are existing speech understanding models: Whisper or similar ASR for pronunciation verification, speaker verification models for timbre consistency. They score partial sequences mid-generation, pruning the beam toward outputs that are more accurately pronounced, more consistent in speaker identity, and more expressively coherent.
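
A minimal sketch of one decoding step under this scheme, in Python. The score_asr and score_speaker wrappers, the top_k_next_tokens helper, and the equal verifier weights are all illustrative assumptions, not the paper's exact recipe:

import heapq

def verifier_score(tokens, weights=(0.5, 0.5)):
    # Weighted combination of verifier scores for a partial sequence.
    # score_asr / score_speaker are hypothetical wrappers around a
    # Whisper-style ASR model and a speaker-verification model.
    w_asr, w_spk = weights
    return w_asr * score_asr(tokens) + w_spk * score_speaker(tokens)

def beam_step(model, beams, n_candidates=5, beam_width=5):
    # One verifier-guided step: expand each beam with the model's
    # top-n next speech tokens, score every partial sequence, and
    # let only the top beam_width paths advance.
    candidates = []
    for tokens in beams:
        for tok in model.top_k_next_tokens(tokens, k=n_candidates):
            seq = tokens + [tok]
            candidates.append((verifier_score(seq), seq))
    best = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return [seq for _, seq in best]

How to weight the verifiers against each other is left open here; the brittleness section below comes back to that.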

The trade-off is explicit: more beam steps = more candidates evaluated = higher latency. The paper documents that “scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy” (arXiv:2502.04128, abstract).

The architectural simplicity that enables scaling

Most TTS systems use cascaded architectures: separate models for text-to-phoneme, phoneme-to-acoustic, acoustic-to-waveform. Each stage has its own representation, its own training objective, its own failure modes. Scaling one stage doesn’t automatically improve the others.

Llasa’s architecture collapses this into a single model:

Text input
    │
    ▼
┌──────────────────────────────────────┐
│  LLaMA decoder (1B / 3B / 8B)        │
│  Standard transformer decoder        │
│  Trained jointly on text + speech    │
└──────────────────┬───────────────────┘
                   │  Token stream (50 Hz)
                   ▼
┌──────────────────────────────────────┐
│  X-Codec2 decoder                    │
│  Single-layer VQ, 65,536 codes       │
│  99% codebook utilization            │
│  2.47% WER on LibriSpeech test-clean │
└──────────────────────────────────────┘
                   │
                   ▼
              Audio waveform

X-Codec2 is the enabling piece. Traditional codecs like EnCodec use multiple RVQ (residual vector quantization) layers, with each layer capturing different aspects of the audio (a hierarchical, multi-stream representation that’s powerful but hard to scale). X-Codec2 instead uses a single-layer VQ with 65,536 codes and Finite Scalar Quantization (FSQ), encoding phoneme, prosody, timbre, and emotion into one unified discrete token stream.
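
FSQ is simple enough to sketch: each latent dimension is independently rounded to one of a few fixed levels, and the cross-product of levels forms the codebook, so no codebook embedding is learned at all. One factorization that yields exactly 65,536 codes is 8 dimensions with 4 levels each (4^8 = 65,536); that split is an illustration, not necessarily X-Codec2's actual configuration:

import numpy as np

LEVELS = np.array([4] * 8)  # 8 latent dims, 4 levels each: 4**8 = 65,536 codes

def fsq_quantize(z):
    # z: latent vector in [-1, 1], shape (8,).
    half = (LEVELS - 1) / 2
    # Round each dimension to its nearest allowed level index.
    idx = np.round((z + 1) * half).clip(0, LEVELS - 1).astype(int)
    # Mixed-radix flatten: per-dimension indices -> one code in [0, 65535].
    code = 0
    for i, base in zip(idx, LEVELS):
        code = code * base + i
    return int(code)

print(fsq_quantize(np.random.uniform(-1, 1, size=8)))  # some code in [0, 65535]

Because the grid is fixed, every code is reachable by construction, which is one reason FSQ-style quantizers tend to avoid codebook collapse.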

The 99% codebook utilization is notable: nearly every code is used in practice, meaning the representation is efficient and hasn’t collapsed to a small subset. Codebook collapse is a known failure mode in VQ systems; it produces repetitive, low-quality output.
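
Measuring utilization is trivial: encode a held-out corpus and count how many of the 65,536 codes ever appear. A sketch, assuming a hypothetical encode function that returns X-Codec2 token ids for an utterance:

import numpy as np

def codebook_utilization(utterances, encode, codebook_size=65_536):
    used = np.zeros(codebook_size, dtype=bool)
    for utt in utterances:
        used[np.asarray(encode(utt))] = True  # mark every observed code
    return used.mean()  # fraction of codes seen at least once

A healthy codec lands near 0.99 here, per the paper; a collapsed one would sit far lower.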

Because all speech information flows through one token stream at 50 Hz, scaling decisions are straightforward: increase the LLaMA model size, increase training data, or increase inference compute. There’s no cascade to optimize across.

What the numbers say

The Llasa paper (arXiv:2502.04128, Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan et al., HKUST) reports:

  • X-Codec2 achieves UTMOS 4.13 (naturalness) and 2.47% WER on LibriSpeech test-clean reconstruction (codec round-trip quality before the LLM even touches it)
  • Scaling train-time compute (larger model, more data) “consistently improves the naturalness of synthesized speech and enables generation of more complex and accurate prosody patterns”
  • Inference-time scaling improves emotional expressiveness and timbre consistency measurably across all model sizes

For deployment context:

Model      Params   VRAM needed   RTF (greedy)   RTF (beam search, width 5)
Llasa-1B   1B       ~8GB          ~0.3–0.5       ~1.5–2.5
Llasa-3B   3B       ~12GB         ~0.5–1.0       ~2.5–5.0
Llasa-8B   8B       ~24GB+        ~1.0–2.0       ~5–10

RTF > 1.0 means slower than real-time; RTF < 1.0 means faster. Greedy decode on the 1B model runs faster than real-time on modern hardware; beam search on the 8B model is substantially slower. The quality trade-off is real and explicit.
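
The arithmetic is worth making concrete, because it drives the deployment decision below. Using midpoints from the table above (the helper is illustrative):

def synthesis_seconds(audio_seconds, rtf):
    # Real-time factor = seconds of compute per second of audio.
    return audio_seconds * rtf

five_minutes = 5 * 60
print(synthesis_seconds(five_minutes, 0.4))  # 1B greedy -> 120 s
print(synthesis_seconds(five_minutes, 2.0))  # 1B beam-5 -> 600 s (10 min)
print(synthesis_seconds(five_minutes, 7.5))  # 8B beam-5 -> 2250 s (~38 min)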

For comparison, Kokoro-82M’s StyleTTS2-based architecture runs at well under 0.1 RTF, roughly an order of magnitude faster than Llasa’s greedy mode. Cartesia Sonic Turbo and ElevenLabs Flash are optimized for sub-100ms first-chunk latency. Llasa doesn’t compete with them on speed; it competes on the quality ceiling when the latency budget allows.

Where inference-time scaling breaks down for TTS

The Llasa paper is honest about limitations practitioners will hit:

Verifier brittleness. Beam-search quality depends entirely on the verifier’s scoring function. A Whisper-based pronunciation verifier optimizes for transcription accuracy, not naturalness. A speaker verification model optimizes for speaker identity consistency, not prosodic expressiveness. Different verifiers produce different quality improvements, and combining verifiers requires calibrating their relative weights, which the paper doesn’t prescribe.

Streaming incompatibility. Beam search is fundamentally non-streaming: you need to evaluate multiple candidates before committing to a path. X-Codec2 is causal (compatible with streaming), but verifier-guided beam search isn’t. Llasa+ (arXiv:2508.06262, August 2025) addresses this with streaming and acceleration modes, but the full quality gains of multi-step beam search remain incompatible with low-latency streaming deployment.

Emotional control without a steering mechanism. Inference-time scaling “improves emotional expressiveness” through verifiers, but doesn’t expose a knob for which emotion. StyleTTS2 and CosyVoice have explicit style vectors; Llasa’s emotional variation emerges from verifier-guided search rather than explicit direction. This works for naturalness but not for controlled emotional performance.

Pronunciation debugging is harder. In a cascaded system, a mispronunciation is diagnosable: you know which stage failed. In Llasa’s unified stream, a mispronounced word could reflect model confusion, codec error, or verifier failure, and there are no per-stage diagnostics to separate them. That makes production debugging opaque.

The practical decision point

The speech tokenization architecture underlying Llasa (unified discrete token stream, single LLM decoder) represents a clean direction for TTS research. The scaling laws are predictable: more model, more data, more inference compute all improve quality.

The practical question is whether your workload fits the constraint:

  • Latency-critical voice agents (< 200ms first chunk): Use Kokoro, Cartesia, or ElevenLabs. Llasa’s beam search overhead is incompatible with that budget.
  • Batch content generation (audiobooks, podcasts, e-learning): Llasa’s beam search is fully appropriate. A 2–5x RTF overhead on a 5-minute segment costs minutes of compute, not seconds of user patience.
  • Quality-first single-speaker narration: Llasa-3B with beam search width 3–5 produces noticeably better prosody than Kokoro or Parler-TTS on complex sentences. The quality gap is audible.
  • Multilingual with code-switching: Llasa shows in-context learning. Provide a 2–3 second reference in one language and it adapts for cross-lingual generation. The 250,000-hour training corpus includes both Chinese and English.

The GitHub training code is at zhenye234/LLaSA_training (579 stars, 42 forks as of early 2026). HuggingFace checkpoints are at HKUSTAudio/Llasa-1B, HKUSTAudio/Llasa-3B, HKUSTAudio/Llasa-8B under a permissive license.
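
Loading the checkpoints follows the standard Hugging Face pattern, since the backbone is a plain causal LM. A minimal sketch; the chat-style prompt template, the speech-token format, and the X-Codec2 decode step are documented in the model card and omitted here:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo = "HKUSTAudio/Llasa-3B"  # or Llasa-1B / Llasa-8B
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16).to("cuda")

text = "Inference-time scaling trades latency for quality."
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# At 50 Hz, 500 speech tokens correspond to roughly 10 s of audio.
speech_ids = model.generate(**inputs, max_new_tokens=500)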

Key takeaways

  • Llasa applies inference-time scaling (verifier-guided beam search) to TTS. More inference compute = better speech quality, at the cost of higher latency.
  • X-Codec2 encodes all speech information into a single discrete token stream (50 Hz, 65,536-code VQ, 99% utilization), enabling a single LLaMA decoder to handle the full generation without cascading.
  • X-Codec2 achieves 2.47% WER and UTMOS 4.13 on LibriSpeech test-clean reconstruction; the codec itself is high-quality before the LLM contributes.
  • The inference-time/latency trade-off is explicit: greedy decode at ~0.3–0.5 RTF (1B model) vs. beam search at ~1.5–2.5 RTF, roughly a 5x overhead. Quality improves measurably with wider beams.
  • Llasa is wrong for low-latency voice agents and right for batch content generation, quality-first narration, and research into speech scaling laws.

FAQ

What is Llasa TTS? Llasa (arXiv:2502.04128, HKUST, February 2025) is an open-source TTS system applying inference-time compute scaling to speech synthesis. A standard LLaMA decoder (1B/3B/8B) generates a single unified token stream via X-Codec2, with beam search guided by speech verifiers selecting higher-quality outputs. Available via HKUSTAudio on HuggingFace.

How does inference-time scaling work for TTS? Standard TTS generates one token at a time, taking the most probable next token (greedy decode). Llasa’s beam search generates multiple candidates at each step, scores them with speech verifiers (ASR for pronunciation, speaker verification for consistency), and continues from the highest-scoring paths. More compute = more candidates evaluated = better outputs, at higher latency.

What is X-Codec2 and why does it matter? X-Codec2 is a single-layer vector quantizer (65,536 codes, 99% utilization, 50 Hz) that encodes phoneme, prosody, timbre, and emotion into one discrete token stream. This unified representation lets a single LLaMA decoder handle complete TTS generation, making scaling decisions simple: scale model, data, or inference compute.

How does Llasa compare to Kokoro and CosyVoice? Kokoro-82M is the efficiency champion: fastest local TTS, consistently strong on quality-per-parameter. CosyVoice targets multilingual production with explicit emotion control. Llasa targets quality-first scenarios with flexible latency budget: beam search produces noticeably better prosody on complex utterances at 2–5x realtime overhead.

What are the deployment options? HKUSTAudio checkpoints on HuggingFace (permissive license): 1B (~8GB VRAM), 3B (~12GB), 8B (~24GB+). Llasa+ (arXiv:2508.06262) adds streaming and acceleration modes. Community fine-tunes include Llasagna (Italian) and GRPO-based prosody improvements.

