Text-to-Speech (TTS) System Fundamentals
From text to natural speech: understanding modern neural TTS architectures that power Alexa, Google Assistant, and Siri.
Introduction
Text-to-Speech (TTS) converts written text into spoken audio. Modern TTS systems produce human-like speech quality using deep learning.
Why TTS matters:
- Virtual assistants: Alexa, Google Assistant, Siri
- Accessibility: Screen readers for visually impaired users
- Content creation: Audiobooks, podcasts, voiceovers
- Conversational AI: Voice bots, IVR systems
- Education: Language learning apps
Evolution:
- Concatenative synthesis (1990s-2000s): Stitch pre-recorded audio units
- Parametric synthesis (2000s-2010s): Statistical models (HMM)
- Neural TTS (2016+): End-to-end deep learning (Tacotron, WaveNet)
- Modern TTS (2020+): Fast, controllable, expressive (FastSpeech, VITS)
TTS Pipeline Architecture
Traditional Two-Stage Pipeline
Most TTS systems use a two-stage approach:
Text → [Acoustic Model] → Mel Spectrogram → [Vocoder] → Audio Waveform
Stage 1: Acoustic Model (Text → Mel Spectrogram)
- Input: Text (characters/phonemes)
- Output: Mel spectrogram (acoustic features)
- Examples: Tacotron 2, FastSpeech 2
Stage 2: Vocoder (Mel Spectrogram → Waveform)
- Input: Mel spectrogram
- Output: Audio waveform
- Examples: WaveNet, WaveGlow, HiFi-GAN
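At a high level the whole pipeline is just two function calls. A minimal sketch (the names here are placeholders standing in for, e.g., FastSpeech 2 and HiFi-GAN, not a specific library API):

def synthesize(text, acoustic_model, vocoder):
    # text_to_phonemes is a hypothetical frontend, covered below
    phonemes = text_to_phonemes(text)
    mel = acoustic_model(phonemes)   # Stage 1: [n_mels, frames]
    waveform = vocoder(mel)          # Stage 2: [samples]
    return waveform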
Why Two Stages?
Advantages:
- Modularity: Train acoustic model and vocoder separately
- Efficiency: The mel spectrogram is a compact intermediate representation
- Controllability: Can modify prosody at mel spectrogram level
Alternative: End-to-End Models
- VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
- Directly generates waveform from text
- Faster inference, fewer components
Key Components
1. Text Processing (Frontend)
Transform raw text into model-ready input.
class TextProcessor:
"""
Text normalization and phoneme conversion
"""
def __init__(self):
self.normalizer = TextNormalizer()
self.g2p = Grapheme2Phoneme() # Grapheme-to-Phoneme
def process(self, text: str) -> list[str]:
"""
Convert text to phoneme sequence
Args:
text: Raw input text
Returns:
List of phonemes
"""
# 1. Normalize text
normalized = self.normalizer.normalize(text)
# "Dr. Smith has $100" → "Doctor Smith has one hundred dollars"
# 2. Convert to phonemes
phonemes = self.g2p.convert(normalized)
# "hello" → ['HH', 'AH', 'L', 'OW']
return phonemes
class TextNormalizer:
"""
Normalize text (expand abbreviations, numbers, etc.)
"""
def normalize(self, text: str) -> str:
text = self._expand_abbreviations(text)
text = self._expand_numbers(text)
text = self._expand_symbols(text)
return text
def _expand_abbreviations(self, text: str) -> str:
"""Dr. → Doctor, Mr. → Mister, etc."""
expansions = {
'Dr.': 'Doctor',
'Mr.': 'Mister',
'Mrs.': 'Misses',
'Ms.': 'Miss',
            'St.': 'Street',  # ambiguous: could also mean 'Saint'
}
for abbr, expansion in expansions.items():
text = text.replace(abbr, expansion)
return text
    def _expand_numbers(self, text: str) -> str:
        """$100 → 100 dollars (spelling the digits out as words is left to a
        number-expansion library such as num2words in practice)"""
        import re
        # Currency: move the unit after the amount
        text = re.sub(r'\$(\d+)', r'\1 dollars', text)
        # Years: route 4-digit numbers through a year handler
        text = re.sub(r'(\d{4})', self._year_to_words, text)
        return text
    def _year_to_words(self, match) -> str:
        """Convert year to words: 2024 → twenty twenty-four"""
        # Simplified placeholder: returns the digits unchanged
        return match.group(0)
def _expand_symbols(self, text: str) -> str:
"""@ → at, % → percent, etc."""
symbols = {
'@': 'at',
'%': 'percent',
'#': 'number',
'&': 'and',
}
for symbol, expansion in symbols.items():
text = text.replace(symbol, expansion)
return text
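In practice the normalization and G2P steps are usually delegated to off-the-shelf packages. A minimal sketch assuming the open-source g2p_en and num2words libraries are installed (pip install g2p_en num2words):

from g2p_en import G2p
from num2words import num2words

g2p = G2p()

def simple_frontend(text: str) -> list[str]:
    # Spell out standalone integers: "100" → "one hundred"
    words = [num2words(int(w)) if w.isdigit() else w for w in text.split()]
    # Grapheme-to-phoneme conversion (ARPAbet symbols with stress digits)
    return g2p(" ".join(words))

print(simple_frontend("hello 100"))
# e.g. ['HH', 'AH0', 'L', 'OW1', ' ', 'W', 'AH1', 'N', ' ', 'HH', 'AH1', 'N', 'D', 'R', 'AH0', 'D']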
2. Acoustic Model
Generates mel spectrogram from text/phonemes.
Tacotron 2 Architecture (simplified):
Input: Phoneme sequence
↓
[Character Embeddings]
↓
[Encoder] (Bidirectional LSTM)
↓
[Attention] (Location-sensitive)
↓
[Decoder] (Autoregressive LSTM)
↓
[Mel Predictor]
↓
Output: Mel Spectrogram
import torch
import torch.nn as nn
class SimplifiedTacotron(nn.Module):
"""
Simplified Tacotron-style acoustic model
Real Tacotron 2 is much more complex!
"""
def __init__(
self,
vocab_size: int,
embedding_dim: int = 512,
encoder_hidden: int = 256,
decoder_hidden: int = 1024,
n_mels: int = 80
):
super().__init__()
# Character/phoneme embeddings
self.embedding = nn.Embedding(vocab_size, embedding_dim)
# Encoder
self.encoder = nn.LSTM(
embedding_dim,
encoder_hidden,
num_layers=3,
batch_first=True,
bidirectional=True
)
# Attention
self.attention = LocationSensitiveAttention(
encoder_hidden * 2, # Bidirectional
decoder_hidden
)
# Decoder
self.decoder = nn.LSTMCell(
encoder_hidden * 2 + n_mels, # Context + previous mel frame
decoder_hidden
)
# Mel predictor
self.mel_predictor = nn.Linear(decoder_hidden, n_mels)
# Stop token predictor
self.stop_predictor = nn.Linear(decoder_hidden, 1)
def forward(self, text, mel_targets=None):
"""
Forward pass
Args:
text: [batch, seq_len] phoneme indices
mel_targets: [batch, mel_len, n_mels] target mels (training only)
Returns:
mels: [batch, mel_len, n_mels]
stop_tokens: [batch, mel_len]
"""
# Encode text
embedded = self.embedding(text) # [batch, seq_len, embed_dim]
encoder_outputs, _ = self.encoder(embedded) # [batch, seq_len, hidden*2]
# Decode (autoregressive)
if mel_targets is not None:
            # Teacher forcing during training (decode method omitted for brevity in this sketch)
return self._decode_teacher_forcing(encoder_outputs, mel_targets)
else:
# Autoregressive during inference
return self._decode_autoregressive(encoder_outputs, max_len=1000)
def _decode_autoregressive(self, encoder_outputs, max_len):
"""Autoregressive decoding (inference)"""
batch_size = encoder_outputs.size(0)
n_mels = self.mel_predictor.out_features
        # Initialize state on the same device as the encoder outputs
        device = encoder_outputs.device
        decoder_hidden = torch.zeros(batch_size, self.decoder.hidden_size, device=device)
        decoder_cell = torch.zeros(batch_size, self.decoder.hidden_size, device=device)
        attention_context = torch.zeros(batch_size, encoder_outputs.size(2), device=device)
        mel_frame = torch.zeros(batch_size, n_mels, device=device)
mels = []
stop_tokens = []
for _ in range(max_len):
# Compute attention
attention_context, _ = self.attention(
decoder_hidden,
encoder_outputs
)
# Decoder step
decoder_input = torch.cat([attention_context, mel_frame], dim=1)
decoder_hidden, decoder_cell = self.decoder(
decoder_input,
(decoder_hidden, decoder_cell)
)
# Predict mel frame and stop token
mel_frame = self.mel_predictor(decoder_hidden)
stop_token = torch.sigmoid(self.stop_predictor(decoder_hidden))
mels.append(mel_frame)
stop_tokens.append(stop_token)
# Check if should stop
if (stop_token > 0.5).all():
break
mels = torch.stack(mels, dim=1) # [batch, mel_len, n_mels]
stop_tokens = torch.stack(stop_tokens, dim=1) # [batch, mel_len, 1]
return mels, stop_tokens
class LocationSensitiveAttention(nn.Module):
"""
Location-sensitive attention (simplified)
Note: Real Tacotron uses cumulative attention features; this
minimal version omits location convolution for brevity.
"""
def __init__(self, encoder_dim, decoder_dim, attention_dim=128):
super().__init__()
self.query_layer = nn.Linear(decoder_dim, attention_dim)
self.key_layer = nn.Linear(encoder_dim, attention_dim)
self.value_layer = nn.Linear(attention_dim, 1)
# For brevity, location features are omitted in this simplified demo
def forward(self, query, keys):
"""
Compute attention context
Args:
query: [batch, decoder_dim] - current decoder state
keys: [batch, seq_len, encoder_dim] - encoder outputs
Returns:
context: [batch, encoder_dim]
attention_weights: [batch, seq_len]
"""
# Compute attention scores
query_proj = self.query_layer(query).unsqueeze(1) # [batch, 1, attn_dim]
keys_proj = self.key_layer(keys) # [batch, seq_len, attn_dim]
scores = self.value_layer(torch.tanh(query_proj + keys_proj)).squeeze(2)
attention_weights = torch.softmax(scores, dim=1) # [batch, seq_len]
# Compute context
context = torch.bmm(
attention_weights.unsqueeze(1),
keys
).squeeze(1) # [batch, encoder_dim]
return context, attention_weights
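A quick smoke test of the sketch above with random weights (the output is noise, but the shapes show the autoregressive loop working; mel_len varies because the untrained stop predictor fires essentially at random):

model = SimplifiedTacotron(vocab_size=50)
phoneme_ids = torch.randint(0, 50, (1, 12))  # one 12-phoneme utterance
mels, stop_tokens = model(phoneme_ids)       # inference path (no mel_targets)
print(mels.shape)         # [1, mel_len, 80]
print(stop_tokens.shape)  # [1, mel_len, 1]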
3. Vocoder
Converts mel spectrogram to waveform.
Popular Vocoders:
Model | Type | Quality | Speed | Notes |
---|---|---|---|---|
WaveNet | Autoregressive | Excellent | Slow | Original neural vocoder |
WaveGlow | Flow-based | Excellent | Fast | Parallel generation |
HiFi-GAN | GAN-based | Excellent | Very Fast | Current SOTA |
MelGAN | GAN-based | Good | Very Fast | Lightweight |
HiFi-GAN Architecture:
import torch
import torch.nn as nn
class HiFiGANGenerator(nn.Module):
"""
Simplified HiFi-GAN generator
Upsamples mel spectrogram to waveform
"""
def __init__(
self,
n_mels: int = 80,
upsample_rates: list[int] = [8, 8, 2, 2],
upsample_kernel_sizes: list[int] = [16, 16, 4, 4],
resblock_kernel_sizes: list[int] = [3, 7, 11],
resblock_dilation_sizes: list[list[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
):
super().__init__()
self.num_kernels = len(resblock_kernel_sizes)
self.num_upsamples = len(upsample_rates)
# Initial conv
self.conv_pre = nn.Conv1d(n_mels, 512, kernel_size=7, padding=3)
# Upsampling layers
self.ups = nn.ModuleList()
for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
self.ups.append(
nn.ConvTranspose1d(
512 // (2 ** i),
512 // (2 ** (i + 1)),
kernel_size=k,
stride=u,
padding=(k - u) // 2
)
)
# Residual blocks
self.resblocks = nn.ModuleList()
for i in range(len(self.ups)):
ch = 512 // (2 ** (i + 1))
for k, d in zip(resblock_kernel_sizes, resblock_dilation_sizes):
self.resblocks.append(ResBlock(ch, k, d))
# Final conv
self.conv_post = nn.Conv1d(ch, 1, kernel_size=7, padding=3)
def forward(self, mel):
"""
Generate waveform from mel spectrogram
Args:
mel: [batch, n_mels, mel_len]
Returns:
waveform: [batch, 1, audio_len]
"""
x = self.conv_pre(mel)
for i, up in enumerate(self.ups):
x = torch.nn.functional.leaky_relu(x, 0.1)
x = up(x)
# Apply residual blocks
xs = None
for j in range(self.num_kernels):
if xs is None:
xs = self.resblocks[i * self.num_kernels + j](x)
else:
xs += self.resblocks[i * self.num_kernels + j](x)
x = xs / self.num_kernels
x = torch.nn.functional.leaky_relu(x)
x = self.conv_post(x)
x = torch.tanh(x)
return x
class ResBlock(nn.Module):
"""Residual block with dilated convolutions"""
def __init__(self, channels, kernel_size, dilations):
super().__init__()
self.convs = nn.ModuleList()
for d in dilations:
self.convs.append(
nn.Conv1d(
channels,
channels,
kernel_size=kernel_size,
dilation=d,
padding=(kernel_size * d - d) // 2
)
)
def forward(self, x):
for conv in self.convs:
xt = torch.nn.functional.leaky_relu(x, 0.1)
xt = conv(xt)
x = x + xt
return x
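The upsample rates multiply out to 8 × 8 × 2 × 2 = 256 samples per mel frame, matching the hop length of 256 commonly used for mel extraction. A quick shape check of the sketch above:

generator = HiFiGANGenerator()
mel = torch.randn(1, 80, 100)  # 100 mel frames
wav = generator(mel)
print(wav.shape)  # [1, 1, 25600] = 100 frames x 256 samples each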
Modern TTS: FastSpeech 2
Problem with Tacotron: autoregressive decoding is slow, and attention failures can cause skipped or repeated words.
FastSpeech 2 advantages:
- Non-autoregressive: Generates all mel frames in parallel (much faster)
- Duration prediction: Predicts phoneme durations explicitly
- Controllability: Can control pitch, energy, duration
Architecture:
Input: Phoneme sequence
↓
[Phoneme Embeddings]
↓
[Feed-Forward Transformer]
↓
[Duration Predictor] → phoneme durations
[Pitch Predictor] → pitch contour
[Energy Predictor] → energy contour
↓
[Length Regulator] (expand phonemes by duration)
↓
[Feed-Forward Transformer]
↓
Output: Mel Spectrogram
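A minimal sketch of one variance predictor (the real FastSpeech 2 predictors stack convolutions, layer norm, and dropout; this stripped-down stand-in keeps only the shape logic):

class DurationPredictor(nn.Module):
    """Predict log-duration per phoneme from encoder features (simplified)."""
    def __init__(self, hidden: int = 256, kernel_size: int = 3):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x):
        # x: [batch, phoneme_len, hidden]
        y = torch.relu(self.conv1(x.transpose(1, 2)))
        y = torch.relu(self.conv2(y)).transpose(1, 2)
        # Durations are predicted in log space; exponentiate and round at inference
        log_duration = self.proj(y).squeeze(-1)  # [batch, phoneme_len]
        return log_duration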
Key innovation: Length Regulator
def length_regulator(phoneme_features, durations):
"""
Expand phoneme features based on predicted durations
Args:
phoneme_features: [batch, phoneme_len, hidden]
durations: [batch, phoneme_len] - frames per phoneme
Returns:
expanded: [batch, mel_len, hidden]
"""
expanded = []
for batch_idx in range(phoneme_features.size(0)):
batch_expanded = []
for phoneme_idx in range(phoneme_features.size(1)):
            feature = phoneme_features[batch_idx, phoneme_idx]
            # .item() so repeat() receives a plain int, not a 0-dim tensor
            duration = int(durations[batch_idx, phoneme_idx].item())
            if duration > 0:
                # Repeat feature 'duration' times
                batch_expanded.append(feature.repeat(duration, 1))
batch_expanded = torch.cat(batch_expanded, dim=0)
expanded.append(batch_expanded)
# Pad to max length
max_len = max(e.size(0) for e in expanded)
expanded = [torch.nn.functional.pad(e, (0, 0, 0, max_len - e.size(0))) for e in expanded]
return torch.stack(expanded)
# Example
phoneme_features = torch.randn(1, 10, 256) # 10 phonemes
durations = torch.tensor([[5, 3, 4, 6, 2, 5, 7, 3, 4, 5]]) # Frames per phoneme
expanded = length_regulator(phoneme_features, durations)
print(f"Input shape: {phoneme_features.shape}")
print(f"Output shape: {expanded.shape}") # [1, 44, 256] (sum of durations)
Prosody Control
Prosody: Rhythm, stress, and intonation of speech
Control dimensions:
- Pitch: Fundamental frequency (F0)
- Duration: Phoneme/word length
- Energy: Loudness
class ProsodyController:
"""
Control prosody in TTS generation
"""
def __init__(self, model):
self.model = model
def synthesize_with_prosody(
self,
text: str,
pitch_scale: float = 1.0,
duration_scale: float = 1.0,
energy_scale: float = 1.0
):
"""
Generate speech with prosody control
Args:
text: Input text
pitch_scale: Multiply pitch by this factor (>1 = higher, <1 = lower)
duration_scale: Multiply duration by this factor (>1 = slower, <1 = faster)
energy_scale: Multiply energy by this factor (>1 = louder, <1 = softer)
Returns:
audio: Generated waveform
"""
# Get model predictions
phonemes = self.model.text_to_phonemes(text)
mel_spec, pitch, duration, energy = self.model.predict(phonemes)
# Apply prosody modifications
pitch_modified = pitch * pitch_scale
duration_modified = duration * duration_scale
energy_modified = energy * energy_scale
# Regenerate mel spectrogram with modified prosody
mel_spec_modified = self.model.synthesize_mel(
phonemes,
pitch=pitch_modified,
duration=duration_modified,
energy=energy_modified
)
# Vocoder: mel → waveform
audio = self.model.vocoder(mel_spec_modified)
return audio
# Usage example
controller = ProsodyController(tts_model)
# Happy speech: higher pitch, faster
happy_audio = controller.synthesize_with_prosody(
"Hello, how are you?",
pitch_scale=1.2,
duration_scale=0.9,
energy_scale=1.1
)
# Sad speech: lower pitch, slower
sad_audio = controller.synthesize_with_prosody(
"Hello, how are you?",
pitch_scale=0.8,
duration_scale=1.2,
energy_scale=0.9
)
Connection to Evaluation Metrics
Like ML model evaluation, TTS systems need multiple metrics:
Objective Metrics
Mel Cepstral Distortion (MCD):
- Measures distance between generated and ground truth mels
- Lower is better
- Correlates with quality but not perfectly
import librosa
import numpy as np
def mel_cepstral_distortion(generated_mel, target_mel):
    """
    Simplified MCD-style distance between generated and target mels.
    (True MCD is computed on mel-cepstral coefficients, typically after
    DTW alignment, with a 10 * sqrt(2) / ln(10) scaling constant.)
    Args:
        generated_mel: [n_mels, time]
        target_mel: [n_mels, time]
    Returns:
        Distance score (lower is better)
    """
# Align lengths
min_len = min(generated_mel.shape[1], target_mel.shape[1])
generated_mel = generated_mel[:, :min_len]
target_mel = target_mel[:, :min_len]
# Compute simple L2 distance over mel frames (proxy for MCD)
diff = generated_mel - target_mel
mcd = float(np.linalg.norm(diff, axis=0).mean())
return mcd
F0 RMSE: Root mean squared error of pitch
Duration Accuracy: How well predicted durations match ground truth
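A minimal sketch of the F0 RMSE metric above, computed over frames that are voiced in both contours (this matches the NaN-for-unvoiced convention of librosa.pyin used later in this post; extraction and alignment are assumed done upstream):

def f0_rmse(f0_generated: np.ndarray, f0_target: np.ndarray) -> float:
    """RMSE of pitch over frames voiced in both contours (NaN = unvoiced)."""
    n = min(len(f0_generated), len(f0_target))
    gen, tgt = f0_generated[:n], f0_target[:n]
    voiced = ~np.isnan(gen) & ~np.isnan(tgt)
    if not voiced.any():
        return float('nan')
    return float(np.sqrt(np.mean((gen[voiced] - tgt[voiced]) ** 2)))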
Subjective Metrics
Mean Opinion Score (MOS):
- Human raters score quality 1-5
- Gold standard for TTS evaluation
- Expensive and time-consuming
MUSHRA Test: Compare multiple systems side-by-side
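MOS results are usually reported as mean ± 95% confidence interval; a minimal sketch of that aggregation:

def mos_with_ci(ratings: list[int], z: float = 1.96) -> tuple[float, float]:
    """Mean opinion score with a normal-approximation 95% CI half-width."""
    r = np.asarray(ratings, dtype=float)
    mean = r.mean()
    ci = z * r.std(ddof=1) / np.sqrt(len(r))
    return float(mean), float(ci)

mean, ci = mos_with_ci([4, 5, 4, 3, 4, 5, 4, 4])
print(f"MOS: {mean:.2f} ± {ci:.2f}")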
Production Considerations
Latency
Components of TTS latency:
- Text processing: 5-10ms
- Acoustic model: 50-200ms (depends on text length)
- Vocoder: 20-100ms
- Total: 75-310ms
Optimization strategies:
- Streaming TTS: Generate audio incrementally
- Model distillation: Smaller, faster models
- Quantization: INT8 inference
- Caching: Pre-generate common phrases
Multi-Speaker TTS
class MultiSpeakerTTS:
"""
TTS supporting multiple voices
Approach 1: Speaker embedding
Approach 2: Separate models per speaker
"""
def __init__(self, model, speaker_embeddings):
self.model = model
self.speaker_embeddings = speaker_embeddings
def synthesize(self, text: str, speaker_id: int):
"""
Generate speech in specific speaker's voice
Args:
text: Input text
speaker_id: Speaker identifier
Returns:
audio: Waveform in target speaker's voice
"""
# Get speaker embedding
speaker_emb = self.speaker_embeddings[speaker_id]
# Generate with speaker conditioning
mel = self.model.generate_mel(text, speaker_embedding=speaker_emb)
audio = self.model.vocoder(mel)
return audio
Training Data Requirements
Dataset Characteristics
Typical single-speaker TTS training:
- Audio hours: 10-24 hours of clean speech
- Utterances: 5,000-15,000 sentences
- Recording quality: Studio quality, 22 kHz+ sample rate
- Text diversity: Cover the phonetic diversity of the language (see the coverage sketch below)
Multi-speaker TTS:
- Speakers: 100-10,000 speakers
- Hours per speaker: 1-5 hours
- Total hours: 100-50,000 hours (e.g., LibriTTS: 585 hours, 2,456 speakers)
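One way to sanity-check the phonetic-diversity requirement is to measure phoneme coverage of the recording script. A rough sketch, where text_to_phonemes is the frontend from earlier (any G2P callable works):

from collections import Counter

def phoneme_coverage(sentences: list[str], text_to_phonemes) -> Counter:
    """Count phoneme occurrences across a recording script."""
    counts = Counter()
    for sentence in sentences:
        counts.update(text_to_phonemes(sentence))
    return counts

# Phonemes that never occur (or occur rarely) signal gaps in the script
# relative to a full inventory such as ARPAbet's ~39 symbols.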
Data Preparation Pipeline
class TTSDataPreparation:
"""
Prepare data for TTS training
"""
def __init__(self, sample_rate=22050):
self.sample_rate = sample_rate
def prepare_dataset(
self,
audio_files: list[str],
text_files: list[str]
) -> list[dict]:
"""
Prepare audio-text pairs
Steps:
1. Text normalization
2. Audio preprocessing
3. Alignment (forced alignment)
4. Feature extraction
"""
dataset = []
for audio_file, text_file in zip(audio_files, text_files):
# Load audio
audio, sr = librosa.load(audio_file, sr=self.sample_rate)
# Load text
with open(text_file) as f:
text = f.read().strip()
# Normalize text
normalized_text = self.normalize_text(text)
# Convert to phonemes
phonemes = self.text_to_phonemes(normalized_text)
# Extract mel spectrogram
mel = self.extract_mel(audio)
# Extract prosody features
pitch = self.extract_pitch(audio)
energy = self.extract_energy(audio)
# Compute duration (requires forced alignment)
duration = self.compute_durations(audio, phonemes)
dataset.append({
'audio_path': audio_file,
'text': text,
'normalized_text': normalized_text,
'phonemes': phonemes,
'mel': mel,
'pitch': pitch,
'energy': energy,
'duration': duration
})
return dataset
def extract_mel(self, audio):
"""Extract mel spectrogram"""
mel = librosa.feature.melspectrogram(
y=audio,
sr=self.sample_rate,
n_fft=1024,
hop_length=256,
n_mels=80
)
mel_db = librosa.power_to_db(mel, ref=np.max)
return mel_db
def extract_pitch(self, audio):
"""Extract pitch (F0) contour"""
f0, voiced_flag, voiced_probs = librosa.pyin(
audio,
fmin=librosa.note_to_hz('C2'),
fmax=librosa.note_to_hz('C7'),
sr=self.sample_rate
)
return f0
def extract_energy(self, audio):
"""Extract energy (RMS)"""
energy = librosa.feature.rms(
y=audio,
frame_length=1024,
hop_length=256
)[0]
return energy
Data Quality Challenges
1. Noisy Audio:
- Background noise degrades quality
- Solution: Use noise reduction, or data augmentation
2. Alignment Errors:
- Text-audio misalignment breaks training
- Solution: Forced alignment with the Montreal Forced Aligner (MFA); see the duration sketch after this list
3. Prosody Variation:
- Inconsistent prosody confuses models
- Solution: Filter outliers, normalize prosody
4. Out-of-Domain Text:
- Model struggles with unseen words/names
- Solution: Diverse training text, robust G2P
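For challenge 2, once a forced aligner such as MFA has produced per-phoneme (start, end) times, converting them into the frame durations that FastSpeech-style models train on takes only a few lines. A sketch (parsing the aligner's output format is tool-specific and omitted):

def alignment_to_durations(
    intervals: list[tuple[str, float, float]],  # (phoneme, start_s, end_s)
    hop_length: int = 256,
    sample_rate: int = 22050
) -> list[int]:
    """Convert aligned phoneme intervals to mel-frame durations."""
    frames_per_second = sample_rate / hop_length
    durations = []
    for _, start, end in intervals:
        # At least one frame per phone, rounded to the nearest frame
        durations.append(max(1, round((end - start) * frames_per_second)))
    return durations

# e.g. a 58 ms phone ≈ 5 frames at 22050 Hz / hop 256 (~86 frames/s)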
Voice Cloning & Few-Shot Learning
Voice cloning: Generate speech in a target voice with minimal data.
Approaches
1. Speaker Embedding (Zero-Shot/Few-Shot)
class SpeakerEncoder(nn.Module):
"""
Encode speaker characteristics from reference audio
Architecture: Similar to speaker recognition
"""
def __init__(self, mel_dim=80, embedding_dim=256):
super().__init__()
# LSTM encoder
self.encoder = nn.LSTM(
mel_dim,
embedding_dim,
num_layers=3,
batch_first=True
)
# Projection to speaker embedding
self.projection = nn.Linear(embedding_dim, embedding_dim)
def forward(self, mel):
"""
Extract speaker embedding from mel spectrogram
Args:
mel: [batch, mel_len, mel_dim]
Returns:
speaker_embedding: [batch, embedding_dim]
"""
_, (hidden, _) = self.encoder(mel)
# Use last hidden state
speaker_emb = self.projection(hidden[-1])
# L2 normalize
speaker_emb = speaker_emb / (speaker_emb.norm(dim=1, keepdim=True) + 1e-8)
return speaker_emb
class VoiceCloningTTS:
"""
TTS with voice cloning capability
"""
def __init__(self, acoustic_model, vocoder, speaker_encoder):
self.acoustic_model = acoustic_model
self.vocoder = vocoder
self.speaker_encoder = speaker_encoder
def clone_voice(
self,
text: str,
reference_audio: torch.Tensor
):
"""
Generate speech in reference voice
Args:
text: Text to synthesize
reference_audio: Audio sample of target voice (3-10 seconds)
Returns:
synthesized_audio: Speech in target voice
"""
# Extract speaker embedding from reference
reference_mel = self.extract_mel(reference_audio)
speaker_embedding = self.speaker_encoder(reference_mel)
# Generate mel spectrogram conditioned on speaker
mel = self.acoustic_model.generate(
text,
speaker_embedding=speaker_embedding
)
# Vocoder: mel → waveform
audio = self.vocoder(mel)
return audio
# Usage
tts = VoiceCloningTTS(acoustic_model, vocoder, speaker_encoder)
# Clone voice from 5-second reference
reference_audio = load_audio("reference_voice.wav")
cloned_speech = tts.clone_voice(
"Hello, this is a cloned voice!",
reference_audio
)
2. Fine-Tuning (10-60 minutes of data)
import random

class VoiceCloner:
"""
Fine-tune pre-trained TTS on target voice
"""
def __init__(self, pretrained_model):
self.model = pretrained_model
def fine_tune(
self,
target_voice_data: list[tuple], # [(audio, text), ...]
num_steps: int = 1000,
learning_rate: float = 1e-4
):
"""
Fine-tune model on target voice
Args:
target_voice_data: Audio-text pairs for target speaker
num_steps: Training steps
learning_rate: Learning rate
"""
optimizer = torch.optim.Adam(self.model.parameters(), lr=learning_rate)
for step in range(num_steps):
# Sample batch
batch = random.sample(target_voice_data, min(8, len(target_voice_data)))
# Forward pass
loss = self.model.compute_loss(batch)
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
if step % 100 == 0:
print(f"Step {step}, Loss: {loss.item():.4f}")
return self.model
# Usage
cloner = VoiceCloner(pretrained_tts_model)
# Collect 30 minutes of target voice
target_data = [...] # List of (audio, text) pairs
# Fine-tune
cloned_model = cloner.fine_tune(target_data, num_steps=1000)
Production Deployment Patterns
Pattern 1: Caching + Dynamic Generation
class HybridTTSSystem:
"""
Hybrid system: Cache common phrases, generate on-the-fly for novel text
"""
def __init__(self, tts_model, cache_backend):
self.tts = tts_model
self.cache = cache_backend # e.g., Redis
self.common_phrases = [
"Welcome", "Thank you", "Goodbye",
"Please hold", "One moment please"
]
self._warm_cache()
def _warm_cache(self):
"""Pre-generate and cache common phrases"""
for phrase in self.common_phrases:
if not self.cache.exists(phrase):
audio = self.tts.synthesize(phrase)
self.cache.set(phrase, audio, ttl=86400) # 24-hour TTL
def synthesize(self, text: str):
"""
Synthesize with caching
Cache hit: <5ms
Cache miss: 100-300ms (generate)
"""
# Check cache
cached = self.cache.get(text)
if cached is not None:
return cached
# Generate
audio = self.tts.synthesize(text)
# Cache if frequently requested
request_count = self.cache.increment(f"count:{text}")
if request_count > 10:
self.cache.set(text, audio, ttl=3600)
return audio
Pattern 2: Streaming TTS
class StreamingTTS:
"""
Stream audio as it's generated (reduce latency)
"""
def __init__(self, acoustic_model, vocoder):
self.acoustic_model = acoustic_model
self.vocoder = vocoder
def stream_synthesize(self, text: str):
"""
Generate audio in chunks
Yields audio chunks as they're ready
"""
# Generate mel spectrogram
mel_frames = self.acoustic_model.generate_streaming(text)
# Stream vocoder output
mel_buffer = []
chunk_size = 50 # mel frames per chunk
for mel_frame in mel_frames:
mel_buffer.append(mel_frame)
if len(mel_buffer) >= chunk_size:
# Vocoder: mel chunk → audio chunk
mel_chunk = torch.stack(mel_buffer)
audio_chunk = self.vocoder(mel_chunk)
yield audio_chunk
                # Keep a few frames of overlap for smoothness; a real system
                # would crossfade or drop the re-synthesized samples so the
                # overlapping audio is not played twice
                mel_buffer = mel_buffer[-10:]
# Final chunk
if mel_buffer:
mel_chunk = torch.stack(mel_buffer)
audio_chunk = self.vocoder(mel_chunk)
yield audio_chunk
# Usage
streaming_tts = StreamingTTS(acoustic_model, vocoder)
# Stream audio
for audio_chunk in streaming_tts.stream_synthesize("Long text to synthesize..."):
# Play audio_chunk immediately
play_audio(audio_chunk)
# User starts hearing speech before full generation completes!
Pattern 3: Edge Deployment
class EdgeTTS:
"""
TTS optimized for edge devices (phones, IoT)
"""
def __init__(self):
self.model = self.load_optimized_model()
def load_optimized_model(self):
"""
Load quantized, pruned model
Techniques:
- INT8 quantization (4x smaller, 2-4x faster)
- Knowledge distillation (smaller student model)
- Pruning (remove 30-50% of weights)
"""
import torch.quantization
# Load full precision model
model = load_full_model()
# Quantize to INT8
# Use dynamic quantization for linear-heavy modules as a safe default
model_quantized = torch.quantization.quantize_dynamic(
model,
{nn.Linear},
dtype=torch.qint8
)
return model_quantized
def synthesize_on_device(self, text: str):
"""
Synthesize on edge device
Latency: 50-150ms
Memory: <100MB
"""
audio = self.model.generate(text)
return audio
Quality Assessment in Production
Automated Quality Monitoring
class TTSQualityMonitor:
"""
Monitor TTS quality in production
"""
def __init__(self):
self.baseline_mcd = 2.5 # Expected MCD
self.alert_threshold = 3.5 # Alert if MCD > this
def monitor_synthesis(self, text: str, generated_audio: np.ndarray):
"""
Check quality of generated audio
Red flags:
- Abnormal duration
- Clipping / distortion
- Silent segments
- MCD drift
"""
issues = []
# Check duration
expected_duration = len(text.split()) * 0.5 # ~0.5s per word
actual_duration = len(generated_audio) / 22050
if abs(actual_duration - expected_duration) / expected_duration > 0.5:
issues.append(f"Abnormal duration: {actual_duration:.2f}s vs {expected_duration:.2f}s expected")
# Check for clipping
if np.max(np.abs(generated_audio)) > 0.99:
issues.append("Clipping detected")
# Check for silent segments
rms = librosa.feature.rms(y=generated_audio)[0]
silent_ratio = (rms < 0.01).sum() / len(rms)
if silent_ratio > 0.3:
issues.append(f"Too much silence: {silent_ratio:.1%}")
# Log for drift detection
self.log_quality_metrics({
'text_length': len(text),
'audio_duration': actual_duration,
'max_amplitude': np.max(np.abs(generated_audio)),
'silent_ratio': silent_ratio
})
return issues
def log_quality_metrics(self, metrics: dict):
"""Log metrics for drift detection"""
# Send to monitoring system (Datadog, Prometheus, etc.)
pass
Comparative Analysis
Tacotron 2 vs FastSpeech 2
Aspect | Tacotron 2 | FastSpeech 2 |
---|---|---|
Speed | Slow (autoregressive) | Fast (parallel) |
Quality | Excellent | Excellent |
Robustness | Can skip/repeat words | More robust |
Controllability | Limited | Explicit control (pitch, duration) |
Training | Simpler (no duration model) | Needs duration labels |
Latency | 200-500ms | 50-150ms |
When to Use Each
Use Tacotron 2 when:
- Maximum naturalness is critical
- Training data is limited (easier to train)
- Latency is acceptable
Use FastSpeech 2 when:
- Low latency required
- Need prosody control
- Robustness is critical (production systems)
Recent Advances (2021-2024)
1. VALL-E (Zero-Shot Voice Cloning)
Microsoft’s VALL-E can clone a voice from a 3-second sample using a language-modeling approach.
Key idea: Treat TTS as conditional language modeling over discrete audio codes.
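A heavily simplified sketch of that framing: concatenate phoneme tokens and discrete audio-codec tokens into one sequence and train a causal transformer to predict the next audio token. (Assumption-laden: real VALL-E uses a neural codec such as EnCodec with multiple codebooks and a two-stage AR/NAR decoder, none of which is shown here.)

import torch
import torch.nn as nn

class TinyAudioLM(nn.Module):
    """Next-token prediction over [phoneme tokens ; audio codec tokens]."""
    def __init__(self, vocab_size: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):  # tokens: [batch, seq_len]
        x = self.embed(tokens)
        # Causal mask so each position attends only to the past
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.backbone(x, mask=mask)
        return self.head(x)  # next-token logits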
2. VITS (End-to-End TTS)
Combines the acoustic model and vocoder into a single, jointly trained model.
Advantages:
- Faster training and inference
- Better audio quality
- Simplified pipeline
3. YourTTS (Multi-lingual Voice Cloning)
Zero-shot multilingual TTS with cross-lingual voice cloning.
4. Bark (Generative Audio Model)
Text-to-audio model that can generate music, sound effects, and speech with emotions.
Key Takeaways
✅ Two-stage pipeline - Acoustic model + vocoder is standard
✅ Text processing critical - Normalization and G2P affect quality
✅ Autoregressive vs non-autoregressive - Tacotron vs FastSpeech trade-offs
✅ Prosody control - Pitch, duration, energy for expressiveness
✅ Multiple metrics - Objective (MCD) and subjective (MOS) both needed
✅ Production optimization - Latency, caching, streaming for real-time use
✅ Like climbing stairs - Build incrementally (phoneme → mel → waveform)
Originally published at: arunbaby.com/speech-tech/0006-text-to-speech-basics