
“Giving machines a voice.”

1. The Evolution of TTS

1. Concatenative Synthesis (1990s - 2010s)

  • Method: Record a voice actor reading thousands of sentences. Chop them into phonemes/diphones. Glue them together at runtime.
  • Pros: Very natural sound for recorded segments.
  • Cons: “Glitchy” at boundaries. Cannot change emotion or style. Requires massive database (GBs).

2. Statistical Parametric Synthesis (HMMs) (2000s - 2015)

  • Method: Generate acoustic features (F0, spectral envelope) from text using HMMs. Use a vocoder to convert features to audio.
  • Pros: Flexible, small footprint.
  • Cons: “Muffled” or “robotic” sound (due to averaging in HMMs).

3. Neural / End-to-End TTS (2016 - Present)

  • Method: Deep Neural Networks map Text $\to$ Spectrogram $\to$ Waveform.
  • Pros: Human-level naturalness. Controllable style/emotion.
  • Cons: Computationally expensive.

2. Anatomy of a Modern TTS System

A typical Neural TTS system has two stages:

  1. Acoustic Model (Text $\to$ Mel-Spectrogram):
    • Converts character/phoneme sequence into a time-frequency representation (Mel-spectrogram).
    • Example: Tacotron 2, FastSpeech 2, VITS.
  2. Vocoder (Mel-Spectrogram $\to$ Waveform):
    • Inverts the spectrogram back to time-domain audio.
    • Example: WaveNet, WaveGlow, HiFi-GAN.
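
The two stages compose into one simple interface. Below is a minimal sketch of that composition; `acoustic_model` and `vocoder` are hypothetical wrappers (say, around FastSpeech 2 and HiFi-GAN), and their method names are assumptions, not a real library API.

```python
import numpy as np

def synthesize(text: str, acoustic_model, vocoder) -> np.ndarray:
    """Two-stage neural TTS: text -> Mel-spectrogram -> waveform."""
    phonemes = acoustic_model.text_to_phonemes(text)   # frontend: normalization + G2P
    mel = acoustic_model.predict_mel(phonemes)         # (n_frames, 80) Mel-spectrogram
    return vocoder.predict_waveform(mel)               # (n_samples,) float32 in [-1, 1]
```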

3. Deep Dive: Tacotron 2 Architecture

Tacotron 2 (Google, 2017) set the standard for high-quality neural TTS and remains the reference two-stage architecture.

1. Encoder

  • Input: Character sequence.
  • Layers: 3 Convolutional layers (context) + Bi-directional LSTM.
  • Output: Encoded text features.

2. Attention Mechanism

  • Location-Sensitive Attention: Crucial for TTS.
  • Unlike translation (where reordering happens), speech is monotonic.
  • The attention weights must move forward linearly.
  • Uses previous attention weights as input to calculate current attention.

3. Decoder

  • Type: Autoregressive LSTM.
  • Input: Previous Mel-frame.
  • Output: Current Mel-frame.
  • Stop Token: Predicts when to stop generating.

4. Post-Net

  • Purpose: Refine the Mel-spectrogram.
  • Layers: 5 Convolutional layers.
  • Residual Connection: Adds detail to the decoder output.

Loss Function: MSE between predicted and ground-truth Mel-spectrograms (before and after the Post-Net), plus a binary cross-entropy loss on the stop token.

4. Deep Dive: Neural Vocoders

The Mel-spectrogram is lossy (phase information is discarded). The Vocoder must “hallucinate” the phase to generate high-fidelity audio.

1. Griffin-Lim (Algorithm)

  • Method: Iterative algorithm to estimate phase.
  • Pros: Fast, no training.
  • Cons: Robotic, metallic artifacts.

2. WaveNet (Autoregressive)

  • Method: Predicts sample $x_t$ based on $x_{t-1}, x_{t-2}, …$
  • Architecture: Dilated Causal Convolutions.
  • Pros: State-of-the-art quality.
  • Cons: Extremely slow (sequential generation). 1 second of 24 kHz audio = 24,000 sequential steps.

3. WaveGlow (Flow-based)

  • Method: Normalizing Flows. Maps Gaussian noise to audio.
  • Pros: Parallel inference (fast). High quality.
  • Cons: Huge model (hundreds of millions of parameters).

4. HiFi-GAN (GAN-based)

  • Method: Generator produces audio, Discriminator distinguishes real vs fake.
  • Pros: Very fast (real-time on CPU), high quality.
  • Cons: Training instability (GANs).
  • Current Standard: HiFi-GAN is the default for most systems today.

5. FastSpeech 2: Non-Autoregressive TTS

Problem with Tacotron:

  • Autoregressive generation is slow ($O(N)$).
  • Attention failures (skipping or repeating words).

FastSpeech 2 Solution:

  • Non-Autoregressive: Generate all frames in parallel ($O(1)$ sequential steps instead of $O(N)$).
  • Duration Predictor: Explicitly predict how many frames each phoneme lasts.
  • Pitch/Energy Predictors: Explicitly model prosody.

Architecture:

  • Encoder (Transformer) $\to$ Variance Adaptor (Duration/Pitch/Energy) $\to$ Decoder (Transformer).

Pros:

  • Extremely fast training and inference.
  • Robust (no skipping/repeating).
  • Controllable (can manually adjust speed/pitch).
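
The duration predictor and the "length regulator" that consumes it are what make parallel generation and speed control possible. A minimal numpy sketch, assuming per-phoneme durations (in frames) have already been predicted:

```python
import numpy as np

def length_regulate(phoneme_enc: np.ndarray, durations: np.ndarray, speed: float = 1.0) -> np.ndarray:
    """phoneme_enc: (n_phonemes, d); durations: (n_phonemes,) predicted frames per phoneme."""
    frames = np.round(durations / speed).astype(int).clip(min=1)   # speed > 1.0 -> faster speech
    return np.repeat(phoneme_enc, frames, axis=0)                  # expand to (n_frames, d)

enc = np.random.randn(4, 256)           # 4 phoneme encodings from the Transformer encoder
dur = np.array([7, 3, 12, 5])           # predicted durations in Mel frames
print(length_regulate(enc, dur).shape)             # (27, 256), fed to the decoder
print(length_regulate(enc, dur, speed=1.2).shape)  # fewer frames -> faster speech
```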

6. System Design: Building a TTS API

Scenario: Build a scalable TTS service like Amazon Polly.

Requirements:

  • Latency: < 200ms time-to-first-byte (streaming).
  • Throughput: 1000 concurrent streams.
  • Voices: Support multiple speakers/languages.

Architecture:

  1. Frontend (Text Normalization):
    • “Dr. Smith lives on St. John St.” $\to$ “Doctor Smith lives on Saint John Street”.
    • “12:30” $\to$ “twelve thirty”.
    • G2P (Grapheme-to-Phoneme): Convert text to phonemes (using CMU Dict or a G2P model). A toy normalization sketch follows this list.
  2. Synthesis Engine:
    • Model: FastSpeech 2 (for speed) + HiFi-GAN.
    • Optimization: ONNX Runtime / TensorRT.
    • Streaming: Chunk text into sentences. Synthesize sentence 1 while user listens.
  3. Caching:
    • Cache common phrases (“Your ride has arrived”).
    • Hit rate for TTS is surprisingly high for navigational/assistant apps.
  4. Scaling:
    • GPU inference is preferred (T4/A10G).
    • Autoscaling based on queue depth.
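
A toy sketch of the frontend normalization from item 1 above. A production frontend uses weighted FSTs or a seq2seq model; these regex rules only cover the examples shown, and they assume the `num2words` package for spelling out numbers.

```python
import re
from num2words import num2words

ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}   # "St." needs context (Saint vs Street)

def normalize(text: str) -> str:
    for abbrev, expansion in ABBREVIATIONS.items():
        text = text.replace(abbrev, expansion)
    # "12:30" -> "twelve thirty" (no handling of :00, am/pm, etc.)
    text = re.sub(r"\b(\d{1,2}):(\d{2})\b",
                  lambda m: f"{num2words(int(m.group(1)))} {num2words(int(m.group(2)))}", text)
    # "$19.99" -> "nineteen dollars and ninety-nine cents"
    text = re.sub(r"\$(\d+)\.(\d{2})",
                  lambda m: f"{num2words(int(m.group(1)))} dollars and "
                            f"{num2words(int(m.group(2)))} cents", text)
    return text

print(normalize("Dr. Smith arrives at 12:30 and pays $19.99."))
```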

7. Evaluation Metrics

1. MOS (Mean Opinion Score):

  • Human raters listen and rate from 1 (Bad) to 5 (Excellent).
  • Ground Truth: ~4.5.
  • Tacotron 2: ~4.3.
  • Parametric: ~3.5.

2. Intelligibility (Word Error Rate):

  • Feed generated audio into an ASR system.
  • Check if the ASR transcribes it correctly.

3. Latency (RTF - Real Time Factor):

  • Time to generate / Duration of audio.
  • RTF < 1.0 means faster than real-time.
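
A minimal sketch of measuring RTF; `synthesize` here is any callable that returns a 1-D array of samples.

```python
import time

def real_time_factor(synthesize, text: str, sample_rate: int) -> float:
    """RTF = synthesis time / duration of the generated audio."""
    start = time.perf_counter()
    waveform = synthesize(text)                 # any callable returning a 1-D sample array
    elapsed = time.perf_counter() - start
    return elapsed / (len(waveform) / sample_rate)
```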

8. Advanced: Voice Cloning (Zero-Shot TTS)

Goal: Generate speech in a target speaker’s voice given only a 3-second reference clip.

Architecture (e.g., Vall-E, XTTS):

  1. Speaker Encoder: Compresses reference audio into a fixed-size vector (d-vector).
  2. Conditioning: Feed d-vector into the TTS model (AdaIN or Concatenation).
  3. Language Modeling: Treat TTS as a language modeling task (Audio Tokens).
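
A minimal sketch of step 2 (conditioning by concatenation): the fixed-size d-vector is broadcast over time and appended to every encoder frame. Dimensions are illustrative.

```python
import numpy as np

def condition_on_speaker(encoder_out: np.ndarray, d_vector: np.ndarray) -> np.ndarray:
    """encoder_out: (T, d_model); d_vector: (d_spk,) from the speaker encoder."""
    tiled = np.tile(d_vector, (encoder_out.shape[0], 1))    # broadcast over time: (T, d_spk)
    return np.concatenate([encoder_out, tiled], axis=-1)    # (T, d_model + d_spk)

enc = np.random.randn(120, 256)     # 120 encoded text frames
spk = np.random.randn(192)          # e.g., a 192-dim d-vector from a 3-second reference clip
print(condition_on_speaker(enc, spk).shape)   # (120, 448)
```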

Vall-E (Microsoft):

  • Uses EnCodec (Audio Codec) to discretize audio.
  • Trains a GPT-style model to predict audio tokens from text + acoustic prompt.

9. Common Challenges

1. Text Normalization:

  • “$19.99” -> “nineteen dollars and ninety-nine cents”.
  • Homograph disambiguation: “I read the book” (past, /rɛd/) vs “I will read” (future, /riːd/) needs context to pick the pronunciation.

2. Prosody and Emotion:

  • Default TTS is “neutral/newsreader” style.
  • Generating “angry” or “whispering” speech requires labeled data or style transfer.

3. Long-Form Synthesis:

  • Attention mechanisms can drift over long paragraphs.
  • Fix: Windowed attention or sentence-level splitting.

10. Ethical Considerations

1. Deepfakes:

  • Voice cloning can break biometric auth (banks).
  • Used for scams (“Grandma, I’m in jail, send money”).
  • Mitigation: Watermarking audio (inaudible noise).

2. Copyright:

  • Training on audiobooks without consent.
  • Impact: Voice actors losing jobs.

11. Deep Dive: Tacotron 2 Attention Mechanism

Why does standard Attention fail for TTS?

  • In Machine Translation, alignment is soft and can jump (e.g., “red house” -> “maison rouge”).
  • In TTS, alignment is monotonic and continuous. You never read the end of the sentence before the beginning.

Location-Sensitive Attention: Standard Bahdanau attention scores each encoder output $h_j$ against the previous decoder state $s_{i-1}$ (the query): \(e_{i,j} = v^T \tanh(W s_{i-1} + V h_j + b)\)

Location-Sensitive Attention additionally feeds in the previous alignment $\alpha_{i-1}$, processed by learned convolutional filters $F$: \(f_i = F * \alpha_{i-1}\), \(e_{i,j} = v^T \tanh(W s_{i-1} + V h_j + U f_{i,j} + b)\)

Effect:

  • The model “knows” where it attended last time.
  • It learns to simply shift the attention window forward.
  • Prevents “babbling” (repeating the same word forever) or skipping words.
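
A numpy sketch of one step of this attention, following the equations above. The single-channel location filter and random weight shapes are simplifications; the real model uses multiple learned convolutional location filters.

```python
import numpy as np

def location_sensitive_step(s_prev, H, alpha_prev, W, V, U, F, v, b):
    """One decoder step of location-sensitive attention.

    s_prev: (d_dec,) previous decoder state (the query)
    H: (T, d_enc) encoder outputs; alpha_prev: (T,) previous attention weights
    W: (d_dec, d_attn), V: (d_enc, d_attn), U: (1, d_attn), v: (d_attn,), b: (d_attn,)
    F: (k,) convolution filter applied to the previous alignment
    """
    f = np.convolve(alpha_prev, F, mode="same")[:, None]       # location features, (T, 1)
    energies = np.tanh(s_prev @ W + H @ V + f @ U + b) @ v     # (T,)
    alpha = np.exp(energies - energies.max())
    return alpha / alpha.sum()                                  # new alignment over the text
```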

12. Deep Dive: WaveNet Dilated Convolutions

WaveNet generates raw audio sample-by-sample (16,000 samples/sec). To generate sample $x_t$, it needs context from a long history (e.g., 1 second).

Problem: Standard convolution with size 3 needs thousands of layers to reach a receptive field of 16,000.

Solution: Dilated Convolutions:

  • Skip input values with a step size (dilation).
  • Layer 1: Dilation 1 (Look at $t, t-1$)
  • Layer 2: Dilation 2 (Look at outputs of Layer 1 at $t, t-2$)
  • Layer 3: Dilation 4 (Look at outputs of Layer 2 at $t, t-4$)
  • Layer 10: Dilation 512.

Receptive Field: Exponential growth: $2^L$. With 10 layers, we cover 1024 samples. Stack multiple blocks to reach 16,000.

Conditioning: WaveNet is conditioned on the Mel-spectrogram $c$ through the Gated Activation Unit in every layer: \(z = \tanh(W_f * x + V_f c) \odot \sigma(W_g * x + V_g c)\). The final layers then output \(P(x_t \mid x_{<t}, c)\) via a softmax over quantized sample values.
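
A tiny helper makes the receptive-field arithmetic concrete (kernel size 2, dilations doubling per layer, several stacked blocks):

```python
def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field (in samples) of a stack of dilated causal convolutions."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field

one_block = [2 ** i for i in range(10)]       # dilations 1, 2, 4, ..., 512
print(receptive_field(2, one_block))          # 1024 samples per block
print(receptive_field(2, one_block * 3))      # 3070 samples with three stacked blocks
```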

13. Deep Dive: HiFi-GAN Architecture

HiFi-GAN (High Fidelity GAN) is the current state-of-the-art vocoder because it’s fast and high quality.

Generator:

  • Input: Mel-spectrogram.
  • Multi-Receptive Field Fusion (MRF):
    • Instead of one ResNet block, it runs multiple ResNet blocks with different kernel sizes and dilation rates in parallel.
    • Sums their outputs.
    • Allows capturing both fine-grained details (high frequency) and long-term dependencies (low frequency).

Discriminators:

  1. Multi-Period Discriminator (MPD):
    • Reshapes 1D audio of length $T$ into 2D matrices of width $p$ and height $T/p$ (periods 2, 3, 5, 7, 11).
    • Applies 2D convolution.
    • Detects periodic artifacts (metallic sounds); see the folding sketch after this list.
  2. Multi-Scale Discriminator (MSD):
    • Operates on raw audio, 2x downsampled, 4x downsampled audio.
    • Ensures structure is correct at different time scales.
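
A sketch of the MPD's folding step: the waveform is reshaped so that samples exactly `period` steps apart line up along one axis, where a 2D convolution can spot periodic artifacts. The shapes and the 22.05 kHz rate are illustrative.

```python
import numpy as np

def fold_for_period(waveform: np.ndarray, period: int) -> np.ndarray:
    """Fold a 1-D waveform into a 2-D grid with `period` columns."""
    pad = (-len(waveform)) % period              # pad so the length divides evenly
    return np.pad(waveform, (0, pad)).reshape(-1, period)

audio = np.random.randn(22050)                   # 1 second at 22.05 kHz
for p in (2, 3, 5, 7, 11):                       # the periods HiFi-GAN uses
    print(p, fold_for_period(audio, p).shape)
```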

Loss:

  • GAN Loss (Adversarial).
  • Feature Matching Loss (Match intermediate layers of discriminator).
  • Mel-Spectrogram Loss (L1 distance).

14. System Design: Streaming TTS Architecture

Challenge: User shouldn’t wait 5 seconds for a long paragraph to be synthesized.

Architecture:

  1. Text Chunking:
    • Split text by punctuation (., !, ?).
    • “Hello world! How are you?” -> [“Hello world!”, “How are you?”].
  2. Incremental Synthesis:
    • Send Chunk 1 to TTS Engine.
    • While Chunk 1 is playing, synthesize Chunk 2.
  3. Buffer Management:
    • Client maintains a jitter buffer (e.g., 200ms).
    • If synthesis is faster than playback (RTF < 1.0), buffer fills up.
    • If synthesis is slower, buffer underruns (stuttering).
  4. Protocol:
    • WebSocket / gRPC: Bi-directional streaming.
    • Server sends binary audio chunks (PCM or Opus encoded).
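
A minimal sketch of steps 1-2: split at sentence-final punctuation and yield audio sentence by sentence, so later sentences are synthesized only while earlier ones are already being sent and played. The `synthesize` callable, `engine`, and the WebSocket transport are hypothetical.

```python
import re
from typing import Callable, Iterator

def stream_tts(text: str, synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    # Split at sentence-final punctuation, keeping it attached to its sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    for sentence in sentences:
        # The next sentence is only synthesized when the consumer asks for it,
        # i.e. while earlier chunks are already being sent and played.
        yield synthesize(sentence)

# Usage with a hypothetical engine and transport:
# for chunk in stream_tts("Hello world! How are you?", engine.synthesize):
#     websocket.send_bytes(chunk)
```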

Stateful Context:

  • Simply splitting by sentence breaks prosody (pitch resets at start of sentence).
  • Contextual TTS: Pass the embedding of the previous sentence’s end state as the initial state for the next sentence.

15. Advanced: Style Transfer and Emotion Control

Global Style Tokens (GST):

  • Learn a bank of “style embeddings” (tokens) during training in an unsupervised way.
  • At inference, we can choose a token (e.g., Token 3 might capture “fast/angry”, Token 5 “slow/sad”).
  • We can mix styles: $0.5 \times \text{Happy} + 0.5 \times \text{Whisper}$.
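
A minimal sketch of that mixing: the style vector handed to the decoder is just a weighted sum over the learned token bank (token indices and their "meanings" are illustrative).

```python
import numpy as np

token_bank = np.random.randn(10, 256)        # stand-in for 10 learned 256-dim style tokens

def mix_styles(weights: dict[int, float]) -> np.ndarray:
    """Weighted sum over the token bank -> one style embedding for the utterance."""
    style = np.zeros(token_bank.shape[1])
    for idx, w in weights.items():
        style += w * token_bank[idx]
    return style

style_vec = mix_styles({3: 0.5, 5: 0.5})     # e.g., half "fast/angry", half "slow/sad"
```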

Reference Audio:

  • Feed a 3-second clip of expressive speech.
  • Reference Encoder extracts style vector.
  • TTS synthesizes new text with that style.

16. Case Study: Voice Cloning for Accessibility

Scenario: A patient with ALS (Lou Gehrig’s disease) is losing their voice. They want to “bank” their voice to use with a TTS system later.

Process:

  1. Recording: Patient records 30-60 minutes of reading scripts while they can still speak.
  2. Fine-Tuning:
    • Take a pre-trained multi-speaker model (e.g., trained on LibriTTS).
    • Freeze the encoder/decoder layers.
    • Fine-tune the Speaker Embedding and last few decoder layers on the patient’s data.
  3. Deployment: Run the model on an iPad (using CoreML/TensorFlow Lite).

Challenges:

  • Fatigue: Patient cannot record for hours. Need data-efficient adaptation (Few-Shot Learning).
  • Dysarthria: If speech is already slurred, the model will learn the slur. Need “Voice Repair” (mapping slurred speech to healthy speech space).

17. Deep Dive: VITS (Conditional Variational Autoencoder with Adversarial Learning)

VITS (2021) is the current state-of-the-art “all-in-one” model. It combines Acoustic Model and Vocoder into a single end-to-end network.

Key Idea:

  • Training: It’s a VAE.
    • Encoder: Takes Audio $\to$ Latent $z$.
    • Decoder: Takes Latent $z$ $\to$ Audio (HiFi-GAN generator).
    • Prior: The latent $z$ is forced to follow a distribution predicted from Text.
  • Inference:
    • Text Encoder predicts the distribution of $z$.
    • Sample $z$.
    • Decoder generates audio.

Flow-based Prior:

  • To make the text-to-latent prediction expressive, it uses Normalizing Flows.

Monotonic Alignment Search (MAS):

  • VITS learns the alignment between text and audio unsupervised during training using Dynamic Programming (MAS). No external aligner needed.

Pros:

  • Higher quality than Tacotron+WaveGlow.
  • Faster than autoregressive models.
  • No mismatch between acoustic model and vocoder.

18. Deep Dive: Prosody Modeling (Pitch, Energy, Duration)

To make speech sound human, we need to control how it’s said.

1. Duration:

  • How long is each phoneme?
  • Model: Predict log-duration for each phoneme.
  • Control: Multiply predicted durations by 1.2x to speak slower.

2. Pitch (F0):

  • Fundamental frequency contour.
  • Model: Predict continuous F0 curve.
  • Control: Shift F0 mean to make the voice higher/lower. Scale variance to make it more expressive or more monotone (sketched after this list).

3. Energy:

  • Loudness (L2 norm of frame).
  • Model: Predict energy per frame.

Architecture:

  • Add these predictors after the Text Encoder.
  • Add the predicted embeddings to the content embedding before the Decoder.
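
A minimal sketch of pitch control on a predicted F0 contour, as referenced in item 2 above; voiced/unvoiced masking is omitted for brevity.

```python
import numpy as np

def control_pitch(f0: np.ndarray, shift_hz: float = 0.0, expressiveness: float = 1.0) -> np.ndarray:
    """Shift the mean F0 and scale its variation around that mean."""
    mean = f0.mean()
    return (f0 - mean) * expressiveness + mean + shift_hz

f0 = 200 + 30 * np.sin(np.linspace(0, 6, 400))              # toy F0 contour in Hz
higher_and_flatter = control_pitch(f0, shift_hz=40, expressiveness=0.5)
```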

19. Deep Dive: Multi-Speaker and Multi-Lingual TTS

1. Speaker Embeddings (d-vectors):

  • Train a speaker verification model (e.g., GE2E loss).
  • Extract the embedding from the last layer.
  • Condition the TTS model on this vector (Concatenate or AdaIN).

2. Code-Switching:

  • “I want to eat Sushi today.” (English sentence, Japanese word).
  • Challenge: English TTS doesn’t know Japanese phonemes.
  • Solution: Shared Phoneme Set (IPA).
  • Model: Train on mixed data. Use a Language ID embedding.

20. Deep Dive: Audio Codecs for Generative Audio

With models like Vall-E and AudioLM, we treat audio generation as language modeling. But audio is continuous.

Neural Audio Codecs (EnCodec / DAC):

  • Encoder: Compresses audio to low-framerate latent.
  • Quantizer (RVQ - Residual Vector Quantization):
    • Discretizes latent into “codebook indices” (tokens).
    • Hierarchical: Codebook 1 captures coarse structure, Codebook 2 captures residual error, etc.
  • Decoder: Reconstructs audio from tokens.
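
A numpy sketch of RVQ encoding for a single latent vector: each stage quantizes the residual left by the previous stage against its own codebook. The codebooks are random here; a real codec learns them jointly with the encoder and decoder.

```python
import numpy as np

def rvq_encode(latent: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Return one codebook index per stage for a single latent frame."""
    residual = latent.copy()
    tokens = []
    for codebook in codebooks:                            # coarse -> fine
        idx = int(np.linalg.norm(codebook - residual, axis=1).argmin())
        tokens.append(idx)
        residual = residual - codebook[idx]               # next stage models what's left
    return tokens

codebooks = [np.random.randn(1024, 128) for _ in range(8)]   # 8 stages of 1024 entries
print(rvq_encode(np.random.randn(128), codebooks))           # one token per codebook
```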

Result:

  • 1 second of audio $\to$ 75 tokens.
  • Now we can train a GPT-style language model on these tokens!

21. System Design: On-Device TTS Optimization

Scenario: Siri/Google Assistant running on a phone without internet.

Constraints:

  • Size: Model < 50MB.
  • Compute: < 10% CPU usage.

Techniques:

  1. Quantization: Float32 $\to$ Int8 (4x smaller); sketched after this list.
  2. Pruning: Remove 50% of weights that are near zero.
  3. Knowledge Distillation: Train a tiny student model to mimic the large teacher.
  4. Streaming Vocoder: Use LPCNet (combines DSP with small RNN) or Multi-Band MelGAN (generates 4 frequency bands in parallel).
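
A minimal sketch of technique 1 using PyTorch's post-training dynamic quantization; the model here is a stand-in, not a real TTS network.

```python
import torch
import torch.nn as nn

# Stand-in network; a real on-device TTS model would be a full acoustic model + vocoder.
model = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 80))

# Post-training dynamic quantization: Linear weights are stored and executed in int8,
# roughly 4x smaller than float32.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```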

22. Evaluation: MUSHRA Tests

MOS is simple but subjective. MUSHRA (Multiple Stimuli with Hidden Reference and Anchor) is more rigorous.

Setup:

  • Listener hears:
    • Reference: Original recording (Ground Truth).
    • Anchor: Low-pass filtered version (Bad quality baseline).
    • Samples: Model A, Model B, Model C (blinded).
  • Task: Rate all of them from 0-100 relative to Reference.

Why Anchor?

  • Calibrates the scale. If someone rates the Anchor as 80, their data is discarded.

23. Interview Questions

Q1: Why use Mel-spectrograms instead of linear spectrograms? Answer: Mel-scale matches human hearing (logarithmic perception of pitch). It compresses the data dimension (e.g., 1024 linear $\to$ 80 Mel), making the model easier to train.

Q2: Autoregressive vs Non-Autoregressive TTS? Answer:

  • AR (Tacotron): Higher quality, better prosody, slow, robustness issues.
  • Non-AR (FastSpeech): Fast, robust, controllable, slightly lower prosody quality (averaged).

Q3: How to handle OOV words? Answer: Use a G2P (Grapheme-to-Phoneme) model that predicts pronunciation from spelling, rather than a dictionary lookup.

24. Further Reading

  1. “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions” (Shen et al., 2017): Tacotron 2 paper.
  2. “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech” (Ren et al., 2020).
  3. “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis” (Kong et al., 2020).

25. Conclusion

End-to-End TTS has crossed the “Uncanny Valley”. With models like Tacotron 2 and HiFi-GAN, synthesized speech is often indistinguishable from human speech. The focus has now shifted from “quality” to “control” (emotion, style), “efficiency” (on-device TTS), and “adaptation” (zero-shot cloning). As generative audio models (like Vall-E) merge with LLMs, we are entering an era of conversational AI that sounds as human as it thinks.

26. Summary

| Component | Role | Examples |
|---|---|---|
| Frontend | Text $\to$ Phonemes | G2P, Normalization |
| Acoustic Model | Phonemes $\to$ Mel-Spec | Tacotron 2, FastSpeech 2 |
| Vocoder | Mel-Spec $\to$ Audio | WaveNet, HiFi-GAN |
| Speaker Encoder | Voice Cloning | d-vector, x-vector |

Originally published at: arunbaby.com/speech-tech/0039-end-to-end-tts