
From text to natural speech: understanding modern neural TTS architectures that power Alexa, Google Assistant, and Siri.

TL;DR

Modern TTS follows a two-stage pipeline: an acoustic model (Tacotron 2 or FastSpeech 2) converts text/phonemes into mel spectrograms, then a vocoder (HiFi-GAN) synthesizes the audio waveform. FastSpeech 2 is preferred for production due to parallel generation (50-150ms latency) and explicit prosody control over pitch, duration, and energy. Voice cloning enables speech synthesis in a target voice from as little as 3 seconds of audio. For related topics, see audio preprocessing for input signal handling and streaming speech pipelines for real-time delivery patterns.

A precision ribbon speaker driver with visible diaphragm vibrations captured in motion

Introduction

Text-to-Speech (TTS) converts written text into spoken audio. Modern TTS systems produce human-like speech quality using deep learning.

Why TTS matters:

  • Virtual assistants: Alexa, Google Assistant, Siri
  • Accessibility: Screen readers for visually impaired users
  • Content creation: Audiobooks, podcasts, voiceovers
  • Conversational AI: Voice bots, IVR systems
  • Education: Language learning apps

Evolution:

  1. Concatenative synthesis (1990s-2000s): Stitch pre-recorded audio units
  2. Parametric synthesis (2000s-2010s): Statistical models (HMM)
  3. Neural TTS (2016+): End-to-end deep learning (Tacotron, WaveNet)
  4. Modern TTS (2020+): Fast, controllable, expressive (FastSpeech, VITS)

TTS Pipeline Architecture

Traditional Two-Stage Pipeline

Most TTS systems use a two-stage approach:

Text → [Acoustic Model] → Mel Spectrogram → [Vocoder] → Audio Waveform

Stage 1: Acoustic Model (Text → Mel Spectrogram)

  • Input: Text (characters/phonemes)
  • Output: Mel spectrogram (acoustic features)
  • Examples: Tacotron 2, FastSpeech 2

Stage 2: Vocoder (Mel Spectrogram → Waveform)

  • Input: Mel spectrogram
  • Output: Audio waveform
  • Examples: WaveNet, WaveGlow, HiFi-GAN
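
Putting the two stages together, inference is just function composition. The sketch below is illustrative only: text_processor, acoustic_model, and vocoder stand for hypothetical pre-trained components, not a specific library's API.

import torch

def synthesize(text: str, text_processor, acoustic_model, vocoder) -> torch.Tensor:
    # Frontend: text → phoneme IDs (hypothetical interface)
    phoneme_ids = text_processor.encode(text)       # [1, seq_len]

    with torch.no_grad():
        # Stage 1: acoustic model → mel spectrogram
        mel = acoustic_model(phoneme_ids)            # [1, n_mels, mel_len]

        # Stage 2: vocoder → waveform
        waveform = vocoder(mel)                      # [1, 1, audio_len]

    return waveform.squeeze()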

Why Two Stages?

Advantages:

  • Modularity: Train acoustic model and vocoder separately
  • Efficiency: Mel spectrogram is compressed representation
  • Controllability: Can modify prosody at mel spectrogram level

Alternative: End-to-End Models

  • VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
  • Directly generates waveform from text
  • Faster inference, fewer components

Key Components

1. Text Processing (Frontend)

Transform raw text into model-ready input.

class TextProcessor:
    """
    Text normalization and phoneme conversion
    """

    def __init__(self):
        self.normalizer = TextNormalizer()
        self.g2p = Grapheme2Phoneme()  # Grapheme-to-Phoneme

    def process(self, text: str) -> list[str]:
        """
        Convert text to phoneme sequence

        Args:
            text: Raw input text

        Returns:
            List of phonemes
        """
        # 1. Normalize text
        normalized = self.normalizer.normalize(text)
        # "Dr. Smith has $100" → "Doctor Smith has one hundred dollars"

        # 2. Convert to phonemes
        phonemes = self.g2p.convert(normalized)
        # "hello" → ['HH', 'AH', 'L', 'OW']

        return phonemes


class TextNormalizer:
    """
    Normalize text (expand abbreviations, numbers, etc.)
    """

    def normalize(self, text: str) -> str:
        text = self._expand_abbreviations(text)
        text = self._expand_numbers(text)
        text = self._expand_symbols(text)
        return text

    def _expand_abbreviations(self, text: str) -> str:
        """Dr. → Doctor, Mr. → Mister, etc."""
        expansions = {
            'Dr.': 'Doctor',
            'Mr.': 'Mister',
            'Mrs.': 'Misses',
            'Ms.': 'Miss',
            'St.': 'Street',
        }
        for abbr, expansion in expansions.items():
            text = text.replace(abbr, expansion)
        return text

    def _expand_numbers(self, text: str) -> str:
        """$100 → one hundred dollars"""
        import re

        # Currency
        text = re.sub(r'\$(\d+)', r'\1 dollars', text)

        # Years
        text = re.sub(r'(\d{4})', self._year_to_words, text)

        return text

    def _year_to_words(self, match) -> str:
        """Convert year to words: 2024 → twenty twenty-four"""
        # Simplified implementation
        return match.group(0)  # Placeholder

    def _expand_symbols(self, text: str) -> str:
        """@ → at, % → percent, etc."""
        symbols = {
            '@': 'at',
            '%': 'percent',
            '#': 'number',
            '&': 'and',
        }
        for symbol, expansion in symbols.items():
            text = text.replace(symbol, expansion)
        return text
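
In practice the frontend is usually assembled from existing libraries rather than written by hand. A hedged sketch, assuming the third-party g2p_en and num2words packages are installed (they are not part of the code above, and the exact phoneme output may differ by version):

from g2p_en import G2p           # assumption: g2p_en package is installed
from num2words import num2words  # assumption: num2words package is installed

g2p = G2p()

# Grapheme-to-phoneme: returns an ARPAbet phoneme list
print(g2p("Doctor Smith has one hundred dollars"))

# Number expansion, e.g. for the _year_to_words placeholder above
print(num2words(2024, to="year"))  # "twenty twenty-four"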

2. Acoustic Model

Generates mel spectrogram from text/phonemes.

Tacotron 2 Architecture (simplified):

Input: Phoneme sequence
 ↓
[Character Embeddings]
 ↓
[Encoder] (Bidirectional LSTM)
 ↓
[Attention] (Location-sensitive)
 ↓
[Decoder] (Autoregressive LSTM)
 ↓
[Mel Predictor]
 ↓
Output: Mel Spectrogram

import torch
import torch.nn as nn

class SimplifiedTacotron(nn.Module):
    """
    Simplified Tacotron-style acoustic model

    Real Tacotron 2 is much more complex!
    """

    def __init__(
        self,
        vocab_size: int,
        embedding_dim: int = 512,
        encoder_hidden: int = 256,
        decoder_hidden: int = 1024,
        n_mels: int = 80
    ):
        super().__init__()

        # Character/phoneme embeddings
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # Encoder
        self.encoder = nn.LSTM(
            embedding_dim,
            encoder_hidden,
            num_layers=3,
            batch_first=True,
            bidirectional=True
        )

        # Attention
        self.attention = LocationSensitiveAttention(
            encoder_hidden * 2,  # Bidirectional
            decoder_hidden
        )

        # Decoder
        self.decoder = nn.LSTMCell(
            encoder_hidden * 2 + n_mels,  # Context + previous mel frame
            decoder_hidden
        )

        # Mel predictor
        self.mel_predictor = nn.Linear(decoder_hidden, n_mels)

        # Stop token predictor
        self.stop_predictor = nn.Linear(decoder_hidden, 1)

    def forward(self, text, mel_targets=None):
        """
        Forward pass

        Args:
            text: [batch, seq_len] phoneme indices
            mel_targets: [batch, mel_len, n_mels] target mels (training only)

        Returns:
            mels: [batch, mel_len, n_mels]
            stop_tokens: [batch, mel_len]
        """
        # Encode text
        embedded = self.embedding(text)  # [batch, seq_len, embed_dim]
        encoder_outputs, _ = self.encoder(embedded)  # [batch, seq_len, hidden*2]

        # Decode
        if mel_targets is not None:
            # Teacher forcing during training
            return self._decode_teacher_forcing(encoder_outputs, mel_targets)
        else:
            # Autoregressive during inference
            return self._decode_autoregressive(encoder_outputs, max_len=1000)

    def _decode_autoregressive(self, encoder_outputs, max_len):
        """Autoregressive decoding (inference)"""
        batch_size = encoder_outputs.size(0)
        n_mels = self.mel_predictor.out_features

        # Initialize decoder state and the first (all-zero) mel frame
        decoder_hidden = torch.zeros(batch_size, self.decoder.hidden_size)
        decoder_cell = torch.zeros(batch_size, self.decoder.hidden_size)
        mel_frame = torch.zeros(batch_size, n_mels)

        mels = []
        stop_tokens = []

        for _ in range(max_len):
            # Compute attention context over encoder outputs
            attention_context, _ = self.attention(
                decoder_hidden,
                encoder_outputs
            )

            # Decoder step
            decoder_input = torch.cat([attention_context, mel_frame], dim=1)
            decoder_hidden, decoder_cell = self.decoder(
                decoder_input,
                (decoder_hidden, decoder_cell)
            )

            # Predict mel frame and stop token
            mel_frame = self.mel_predictor(decoder_hidden)
            stop_token = torch.sigmoid(self.stop_predictor(decoder_hidden))

            mels.append(mel_frame)
            stop_tokens.append(stop_token)

            # Stop once every sequence in the batch predicts "stop"
            if (stop_token > 0.5).all():
                break

        mels = torch.stack(mels, dim=1)  # [batch, mel_len, n_mels]
        stop_tokens = torch.stack(stop_tokens, dim=1)  # [batch, mel_len, 1]

        return mels, stop_tokens

class LocationSensitiveAttention(nn.Module):
    """
    Location-sensitive attention (simplified)

    Note: Real Tacotron uses cumulative attention features; this
    minimal version omits location convolution for brevity.
    """

    def __init__(self, encoder_dim, decoder_dim, attention_dim=128):
        super().__init__()

        self.query_layer = nn.Linear(decoder_dim, attention_dim)
        self.key_layer = nn.Linear(encoder_dim, attention_dim)
        self.value_layer = nn.Linear(attention_dim, 1)

        # For brevity, location features are omitted in this simplified demo

    def forward(self, query, keys):
        """
        Compute attention context

        Args:
            query: [batch, decoder_dim] - current decoder state
            keys: [batch, seq_len, encoder_dim] - encoder outputs

        Returns:
            context: [batch, encoder_dim]
            attention_weights: [batch, seq_len]
        """
        # Compute additive attention scores
        query_proj = self.query_layer(query).unsqueeze(1)  # [batch, 1, attn_dim]
        keys_proj = self.key_layer(keys)  # [batch, seq_len, attn_dim]

        scores = self.value_layer(torch.tanh(query_proj + keys_proj)).squeeze(2)
        attention_weights = torch.softmax(scores, dim=1)  # [batch, seq_len]

        # Weighted sum of encoder outputs
        context = torch.bmm(
            attention_weights.unsqueeze(1),
            keys
        ).squeeze(1)  # [batch, encoder_dim]

        return context, attention_weights

3. Vocoder

Converts mel spectrogram to waveform.

Popular Vocoders:

| Model | Type | Quality | Speed | Notes |
|---|---|---|---|---|
| WaveNet | Autoregressive | Excellent | Slow | Original neural vocoder |
| WaveGlow | Flow-based | Excellent | Fast | Parallel generation |
| HiFi-GAN | GAN-based | Excellent | Very fast | Current SOTA |
| MelGAN | GAN-based | Good | Very fast | Lightweight |

HiFi-GAN Architecture:

import torch
import torch.nn as nn

class HiFiGANGenerator(nn.Module):
    """
    Simplified HiFi-GAN generator

    Upsamples mel spectrogram to waveform
    """

    def __init__(
        self,
        n_mels: int = 80,
        upsample_rates: list[int] = [8, 8, 2, 2],
        upsample_kernel_sizes: list[int] = [16, 16, 4, 4],
        resblock_kernel_sizes: list[int] = [3, 7, 11],
        resblock_dilation_sizes: list[list[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
    ):
        super().__init__()

        self.num_kernels = len(resblock_kernel_sizes)
        self.num_upsamples = len(upsample_rates)

        # Initial conv
        self.conv_pre = nn.Conv1d(n_mels, 512, kernel_size=7, padding=3)

        # Upsampling layers (transposed convolutions)
        self.ups = nn.ModuleList()
        for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
            self.ups.append(
                nn.ConvTranspose1d(
                    512 // (2 ** i),
                    512 // (2 ** (i + 1)),
                    kernel_size=k,
                    stride=u,
                    padding=(k - u) // 2
                )
            )

        # Residual blocks (one group of num_kernels blocks per upsampling stage)
        self.resblocks = nn.ModuleList()
        for i in range(len(self.ups)):
            ch = 512 // (2 ** (i + 1))
            for k, d in zip(resblock_kernel_sizes, resblock_dilation_sizes):
                self.resblocks.append(ResBlock(ch, k, d))

        # Final conv
        self.conv_post = nn.Conv1d(ch, 1, kernel_size=7, padding=3)

    def forward(self, mel):
        """
        Generate waveform from mel spectrogram

        Args:
            mel: [batch, n_mels, mel_len]

        Returns:
            waveform: [batch, 1, audio_len]
        """
        x = self.conv_pre(mel)

        for i, up in enumerate(self.ups):
            x = torch.nn.functional.leaky_relu(x, 0.1)
            x = up(x)

            # Average the outputs of this stage's residual blocks
            xs = None
            for j in range(self.num_kernels):
                if xs is None:
                    xs = self.resblocks[i * self.num_kernels + j](x)
                else:
                    xs += self.resblocks[i * self.num_kernels + j](x)
            x = xs / self.num_kernels

        x = torch.nn.functional.leaky_relu(x)
        x = self.conv_post(x)
        x = torch.tanh(x)

        return x

class ResBlock(nn.Module):
    """Residual block with dilated convolutions"""

    def __init__(self, channels, kernel_size, dilations):
        super().__init__()

        self.convs = nn.ModuleList()
        for d in dilations:
            self.convs.append(
                nn.Conv1d(
                    channels,
                    channels,
                    kernel_size=kernel_size,
                    dilation=d,
                    padding=(kernel_size * d - d) // 2
                )
            )

    def forward(self, x):
        for conv in self.convs:
            xt = torch.nn.functional.leaky_relu(x, 0.1)
            xt = conv(xt)
            x = x + xt
        return x

Modern TTS: FastSpeech 2

Problem with Tacotron: Autoregressive decoding is slow and can have attention errors.

FastSpeech 2 advantages:

  • Non-autoregressive: Generates all mel frames in parallel (much faster)
  • Duration prediction: Predicts phoneme durations explicitly
  • Controllability: Can control pitch, energy, duration

Architecture:

Input: Phoneme sequence
 ↓
[Phoneme Embeddings]
 ↓
[Feed-Forward Transformer]
 ↓
[Duration Predictor] → phoneme durations
[Pitch Predictor] → pitch contour
[Energy Predictor] → energy contour
 ↓
[Length Regulator] (expand phonemes by duration)
 ↓
[Feed-Forward Transformer]
 ↓
Output: Mel Spectrogram
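
The duration, pitch, and energy predictors in this diagram share the same lightweight structure. Below is a minimal sketch of such a variance predictor (layer sizes and dropout are illustrative, not the paper's exact hyperparameters); its duration output feeds the length regulator shown next.

import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Sketch of a FastSpeech 2-style duration/pitch/energy predictor."""

    def __init__(self, hidden: int = 256, kernel_size: int = 3, dropout: float = 0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x):
        # x: [batch, phoneme_len, hidden] → one scalar per phoneme
        y = self.conv1(x.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(self.norm1(torch.relu(y)))
        y = self.conv2(y.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(self.norm2(torch.relu(y)))
        return self.proj(y).squeeze(-1)  # [batch, phoneme_len]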

Key innovation: Length Regulator

def length_regulator(phoneme_features, durations):
    """
    Expand phoneme features based on predicted durations

    Args:
        phoneme_features: [batch, phoneme_len, hidden]
        durations: [batch, phoneme_len] - frames per phoneme

    Returns:
        expanded: [batch, mel_len, hidden]
    """
    expanded = []

    for batch_idx in range(phoneme_features.size(0)):
        batch_expanded = []

        for phoneme_idx in range(phoneme_features.size(1)):
            feature = phoneme_features[batch_idx, phoneme_idx]
            duration = durations[batch_idx, phoneme_idx].int()

            # Repeat feature 'duration' times
            batch_expanded.append(feature.repeat(duration, 1))

        batch_expanded = torch.cat(batch_expanded, dim=0)
        expanded.append(batch_expanded)

    # Pad all sequences in the batch to the same length
    max_len = max(e.size(0) for e in expanded)
    expanded = [torch.nn.functional.pad(e, (0, 0, 0, max_len - e.size(0))) for e in expanded]

    return torch.stack(expanded)

# Example
phoneme_features = torch.randn(1, 10, 256)  # 10 phonemes
durations = torch.tensor([[5, 3, 4, 6, 2, 5, 7, 3, 4, 5]])  # Frames per phoneme

expanded = length_regulator(phoneme_features, durations)
print(f"Input shape: {phoneme_features.shape}")
print(f"Output shape: {expanded.shape}")  # [1, 44, 256] (sum of durations)

Prosody Control

Prosody: Rhythm, stress, and intonation of speech

Control dimensions:

  • Pitch: Fundamental frequency (F0)
  • Duration: Phoneme/word length
  • Energy: Loudness

class ProsodyController:
    """
    Control prosody in TTS generation
    """

    def __init__(self, model):
        self.model = model

    def synthesize_with_prosody(
        self,
        text: str,
        pitch_scale: float = 1.0,
        duration_scale: float = 1.0,
        energy_scale: float = 1.0
    ):
        """
        Generate speech with prosody control

        Args:
            text: Input text
            pitch_scale: Multiply pitch by this factor (>1 = higher, <1 = lower)
            duration_scale: Multiply duration by this factor (>1 = slower, <1 = faster)
            energy_scale: Multiply energy by this factor (>1 = louder, <1 = softer)

        Returns:
            audio: Generated waveform
        """
        # Get model predictions
        phonemes = self.model.text_to_phonemes(text)
        mel_spec, pitch, duration, energy = self.model.predict(phonemes)

        # Apply prosody modifications
        pitch_modified = pitch * pitch_scale
        duration_modified = duration * duration_scale
        energy_modified = energy * energy_scale

        # Regenerate mel spectrogram with modified prosody
        mel_spec_modified = self.model.synthesize_mel(
            phonemes,
            pitch=pitch_modified,
            duration=duration_modified,
            energy=energy_modified
        )

        # Vocoder: mel → waveform
        audio = self.model.vocoder(mel_spec_modified)

        return audio

# Usage example
controller = ProsodyController(tts_model)

# Happy speech: higher pitch, faster
happy_audio = controller.synthesize_with_prosody(
    "Hello, how are you?",
    pitch_scale=1.2,
    duration_scale=0.9,
    energy_scale=1.1
)

# Sad speech: lower pitch, slower
sad_audio = controller.synthesize_with_prosody(
    "Hello, how are you?",
    pitch_scale=0.8,
    duration_scale=1.2,
    energy_scale=0.9
)

Connection to Evaluation Metrics

Like ML model evaluation, TTS systems need multiple metrics:

Objective Metrics

Mel Cepstral Distortion (MCD):

  • Measures distance between generated and ground truth mels
  • Lower is better
  • Correlates with quality but not perfectly

import librosa
import numpy as np

def mel_cepstral_distortion(generated_mel, target_mel):
    """
    Compute MCD between generated and target mel spectrograms

    Args:
        generated_mel: [n_mels, time]
        target_mel: [n_mels, time]

    Returns:
        MCD score (lower is better)
    """
    # Align lengths
    min_len = min(generated_mel.shape[1], target_mel.shape[1])
    generated_mel = generated_mel[:, :min_len]
    target_mel = target_mel[:, :min_len]

    # Compute simple L2 distance over mel frames (proxy for MCD)
    diff = generated_mel - target_mel
    mcd = float(np.linalg.norm(diff, axis=0).mean())

    return mcd

F0 RMSE: Root mean squared error of pitch

Duration Accuracy: How well predicted durations match ground truth
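
A minimal sketch of F0 RMSE, assuming both pitch contours are frame-aligned and that unvoiced frames are marked NaN (as librosa.pyin produces in the extract_pitch method later in this post):

import numpy as np

def f0_rmse(f0_generated: np.ndarray, f0_target: np.ndarray) -> float:
    """RMSE of pitch, computed only over frames where both contours are voiced."""
    min_len = min(len(f0_generated), len(f0_target))
    gen, tgt = f0_generated[:min_len], f0_target[:min_len]

    voiced = ~np.isnan(gen) & ~np.isnan(tgt)  # NaN marks unvoiced frames
    if voiced.sum() == 0:
        return float("nan")

    return float(np.sqrt(np.mean((gen[voiced] - tgt[voiced]) ** 2)))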

Subjective Metrics

Mean Opinion Score (MOS):

  • Human raters score quality 1-5
  • Gold standard for TTS evaluation
  • Expensive and time-consuming

MUSHRA Test: Compare multiple systems side-by-side
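
MOS numbers are typically reported with a confidence interval over raters. A minimal aggregation sketch:

import numpy as np

def mos_with_ci(ratings: list[float], z: float = 1.96) -> tuple[float, float]:
    """Mean opinion score with an approximate 95% confidence interval."""
    r = np.asarray(ratings, dtype=float)
    mean = r.mean()
    ci = z * r.std(ddof=1) / np.sqrt(len(r))
    return float(mean), float(ci)

# Reported as e.g. "MOS 4.12 ± 0.08"
print(mos_with_ci([4, 5, 4, 4, 3, 5, 4, 4]))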


Production Considerations

Latency

Components of TTS latency:

  1. Text processing: 5-10ms
  2. Acoustic model: 50-200ms (depends on text length)
  3. Vocoder: 20-100ms
  4. Total: 75-310ms
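
To see where a given deployment actually spends its time, a rough per-stage timer is enough. A minimal sketch, where text_processor, acoustic_model, and vocoder are hypothetical component interfaces:

import time

def profile_tts(text, text_processor, acoustic_model, vocoder):
    """Rough per-stage latency breakdown in milliseconds."""
    t0 = time.perf_counter()
    phonemes = text_processor.process(text)
    t1 = time.perf_counter()
    mel = acoustic_model(phonemes)
    t2 = time.perf_counter()
    audio = vocoder(mel)
    t3 = time.perf_counter()

    return {
        "frontend_ms": (t1 - t0) * 1000,
        "acoustic_ms": (t2 - t1) * 1000,
        "vocoder_ms": (t3 - t2) * 1000,
        "total_ms": (t3 - t0) * 1000,
    }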

Optimization strategies:

  • Streaming TTS: Generate audio incrementally
  • Model distillation: Smaller, faster models
  • Quantization: INT8 inference
  • Caching: Pre-generate common phrases

Multi-Speaker TTS

class MultiSpeakerTTS:
    """
    TTS supporting multiple voices

    Approach 1: Speaker embedding
    Approach 2: Separate models per speaker
    """

    def __init__(self, model, speaker_embeddings):
        self.model = model
        self.speaker_embeddings = speaker_embeddings

    def synthesize(self, text: str, speaker_id: int):
        """
        Generate speech in specific speaker's voice

        Args:
            text: Input text
            speaker_id: Speaker identifier

        Returns:
            audio: Waveform in target speaker's voice
        """
        # Get speaker embedding
        speaker_emb = self.speaker_embeddings[speaker_id]

        # Generate with speaker conditioning
        mel = self.model.generate_mel(text, speaker_embedding=speaker_emb)
        audio = self.model.vocoder(mel)

        return audio

Training Data Requirements

For audio preprocessing techniques relevant to TTS training data, see Audio Preprocessing & Signal Processing.

Dataset Characteristics

Typical single-speaker TTS training:

  • Audio hours: 10-24 hours of clean speech
  • Utterances: 5,000-15,000 sentences
  • Recording quality: Studio quality, 22 kHz+ sample rate
  • Text diversity: Cover phonetic diversity of language

Multi-speaker TTS:

  • Speakers: 100-10,000 speakers
  • Hours per speaker: 1-5 hours
  • Total hours: 100-50,000 hours (e.g., LibriTTS: 585 hours, 2,456 speakers)

Data Preparation Pipeline

import librosa
import numpy as np

class TTSDataPreparation:
    """
    Prepare data for TTS training
    """

    def __init__(self, sample_rate=22050):
        self.sample_rate = sample_rate

    def prepare_dataset(
        self,
        audio_files: list[str],
        text_files: list[str]
    ) -> list[dict]:
        """
        Prepare audio-text pairs

        Steps:
            1. Text normalization
            2. Audio preprocessing
            3. Alignment (forced alignment)
            4. Feature extraction
        """
        dataset = []

        for audio_file, text_file in zip(audio_files, text_files):
            # Load audio
            audio, sr = librosa.load(audio_file, sr=self.sample_rate)

            # Load text
            with open(text_file) as f:
                text = f.read().strip()

            # Normalize text
            normalized_text = self.normalize_text(text)

            # Convert to phonemes
            phonemes = self.text_to_phonemes(normalized_text)

            # Extract mel spectrogram
            mel = self.extract_mel(audio)

            # Extract prosody features
            pitch = self.extract_pitch(audio)
            energy = self.extract_energy(audio)

            # Compute duration (requires forced alignment)
            duration = self.compute_durations(audio, phonemes)

            dataset.append({
                'audio_path': audio_file,
                'text': text,
                'normalized_text': normalized_text,
                'phonemes': phonemes,
                'mel': mel,
                'pitch': pitch,
                'energy': energy,
                'duration': duration
            })

        return dataset

    def extract_mel(self, audio):
        """Extract mel spectrogram"""
        mel = librosa.feature.melspectrogram(
            y=audio,
            sr=self.sample_rate,
            n_fft=1024,
            hop_length=256,
            n_mels=80
        )
        mel_db = librosa.power_to_db(mel, ref=np.max)
        return mel_db

    def extract_pitch(self, audio):
        """Extract pitch (F0) contour"""
        f0, voiced_flag, voiced_probs = librosa.pyin(
            audio,
            fmin=librosa.note_to_hz('C2'),
            fmax=librosa.note_to_hz('C7'),
            sr=self.sample_rate
        )
        return f0

    def extract_energy(self, audio):
        """Extract energy (RMS)"""
        energy = librosa.feature.rms(
            y=audio,
            frame_length=1024,
            hop_length=256
        )[0]
        return energy

Data Quality Challenges

1. Noisy Audio:

  • Background noise degrades quality
  • Solution: Use noise reduction or data augmentation

2. Alignment Errors:

  • Text-audio misalignment breaks training
  • Solution: Forced alignment with Montreal Forced Aligner (MFA)

3. Prosody Variation:

  • Inconsistent prosody confuses models
  • Solution: Filter outliers, normalize prosody

4. Out-of-Domain Text:

  • Model struggles with unseen words/names
  • Solution: Diverse training text, robust G2P
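
A cheap screening pass over the corpus catches many of the audio-side issues above before training. Below is a hedged sketch using librosa; the thresholds are illustrative, not tuned values.

import librosa
import numpy as np

def screen_utterance(audio_path: str, sample_rate: int = 22050) -> list[str]:
    """
    Cheap screening heuristics for TTS training clips.
    Returns a list of flags; an empty list means the clip looks usable.
    """
    flags = []
    audio, _ = librosa.load(audio_path, sr=sample_rate)

    # Rough SNR proxy: compare speech-level RMS to the quietest frames
    rms = librosa.feature.rms(y=audio, frame_length=1024, hop_length=256)[0]
    noise_floor = np.percentile(rms, 10) + 1e-8
    speech_level = np.percentile(rms, 90)
    if 20 * np.log10(speech_level / noise_floor) < 20:  # < 20 dB: likely noisy
        flags.append("low_snr")

    # Clipping
    if np.max(np.abs(audio)) > 0.99:
        flags.append("clipping")

    # Mostly silence suggests bad segmentation
    trimmed, _ = librosa.effects.trim(audio, top_db=40)
    if len(trimmed) < 0.5 * len(audio):
        flags.append("mostly_silence")

    return flags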

Voice Cloning & Few-Shot Learning

Voice cloning: Generate speech in a target voice with minimal data.

Approaches

1. Speaker Embedding (Zero-Shot/Few-Shot)

class SpeakerEncoder(nn.Module):
    """
    Encode speaker characteristics from reference audio

    Architecture: Similar to speaker recognition
    """

    def __init__(self, mel_dim=80, embedding_dim=256):
        super().__init__()

        # LSTM encoder
        self.encoder = nn.LSTM(
            mel_dim,
            embedding_dim,
            num_layers=3,
            batch_first=True
        )

        # Projection to speaker embedding
        self.projection = nn.Linear(embedding_dim, embedding_dim)

    def forward(self, mel):
        """
        Extract speaker embedding from mel spectrogram

        Args:
            mel: [batch, mel_len, mel_dim]

        Returns:
            speaker_embedding: [batch, embedding_dim]
        """
        _, (hidden, _) = self.encoder(mel)

        # Use last hidden state
        speaker_emb = self.projection(hidden[-1])

        # L2 normalize
        speaker_emb = speaker_emb / (speaker_emb.norm(dim=1, keepdim=True) + 1e-8)

        return speaker_emb


class VoiceCloningTTS:
    """
    TTS with voice cloning capability
    """

    def __init__(self, acoustic_model, vocoder, speaker_encoder):
        self.acoustic_model = acoustic_model
        self.vocoder = vocoder
        self.speaker_encoder = speaker_encoder

    def clone_voice(
        self,
        text: str,
        reference_audio: torch.Tensor
    ):
        """
        Generate speech in reference voice

        Args:
            text: Text to synthesize
            reference_audio: Audio sample of target voice (3-10 seconds)

        Returns:
            synthesized_audio: Speech in target voice
        """
        # Extract speaker embedding from reference
        reference_mel = self.extract_mel(reference_audio)
        speaker_embedding = self.speaker_encoder(reference_mel)

        # Generate mel spectrogram conditioned on speaker
        mel = self.acoustic_model.generate(
            text,
            speaker_embedding=speaker_embedding
        )

        # Vocoder: mel → waveform
        audio = self.vocoder(mel)

        return audio

# Usage
tts = VoiceCloningTTS(acoustic_model, vocoder, speaker_encoder)

# Clone voice from 5-second reference
reference_audio = load_audio("reference_voice.wav")
cloned_speech = tts.clone_voice(
    "Hello, this is a cloned voice!",
    reference_audio
)

2. Fine-Tuning (10-60 minutes of data)

import random

class VoiceCloner:
    """
    Fine-tune pre-trained TTS on target voice
    """

    def __init__(self, pretrained_model):
        self.model = pretrained_model

    def fine_tune(
        self,
        target_voice_data: list[tuple],  # [(audio, text), ...]
        num_steps: int = 1000,
        learning_rate: float = 1e-4
    ):
        """
        Fine-tune model on target voice

        Args:
            target_voice_data: Audio-text pairs for target speaker
            num_steps: Training steps
            learning_rate: Learning rate
        """
        optimizer = torch.optim.Adam(self.model.parameters(), lr=learning_rate)

        for step in range(num_steps):
            # Sample batch
            batch = random.sample(target_voice_data, min(8, len(target_voice_data)))

            # Forward pass
            loss = self.model.compute_loss(batch)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if step % 100 == 0:
                print(f"Step {step}, Loss: {loss.item():.4f}")

        return self.model

# Usage
cloner = VoiceCloner(pretrained_tts_model)

# Collect 30 minutes of target voice
target_data = [...]  # List of (audio, text) pairs

# Fine-tune
cloned_model = cloner.fine_tune(target_data, num_steps=1000)

Production Deployment Patterns

Pattern 1: Caching + Dynamic Generation

class HybridTTSSystem:
    """
    Hybrid system: Cache common phrases, generate on-the-fly for novel text
    """

    def __init__(self, tts_model, cache_backend):
        self.tts = tts_model
        self.cache = cache_backend  # e.g., Redis
        self.common_phrases = [
            "Welcome", "Thank you", "Goodbye",
            "Please hold", "One moment please"
        ]
        self._warm_cache()

    def _warm_cache(self):
        """Pre-generate and cache common phrases"""
        for phrase in self.common_phrases:
            if not self.cache.exists(phrase):
                audio = self.tts.synthesize(phrase)
                self.cache.set(phrase, audio, ttl=86400)  # 24-hour TTL

    def synthesize(self, text: str):
        """
        Synthesize with caching

        Cache hit: <5ms
        Cache miss: 100-300ms (generate)
        """
        # Check cache
        cached = self.cache.get(text)
        if cached is not None:
            return cached

        # Generate
        audio = self.tts.synthesize(text)

        # Cache if frequently requested
        request_count = self.cache.increment(f"count:{text}")
        if request_count > 10:
            self.cache.set(text, audio, ttl=3600)

        return audio

Pattern 2: Streaming TTS

For a deeper dive into streaming audio architecture, see Streaming Speech Processing Pipeline.

class StreamingTTS:
    """
    Stream audio as it's generated (reduce latency)
    """

    def __init__(self, acoustic_model, vocoder):
        self.acoustic_model = acoustic_model
        self.vocoder = vocoder

    def stream_synthesize(self, text: str):
        """
        Generate audio in chunks

        Yields audio chunks as they're ready
        """
        # Generate mel spectrogram frame by frame
        mel_frames = self.acoustic_model.generate_streaming(text)

        # Stream vocoder output
        mel_buffer = []
        chunk_size = 50  # mel frames per chunk

        for mel_frame in mel_frames:
            mel_buffer.append(mel_frame)

            if len(mel_buffer) >= chunk_size:
                # Vocoder: mel chunk → audio chunk
                mel_chunk = torch.stack(mel_buffer)
                audio_chunk = self.vocoder(mel_chunk)

                yield audio_chunk

                # Keep overlap for smoothness
                mel_buffer = mel_buffer[-10:]

        # Final chunk
        if mel_buffer:
            mel_chunk = torch.stack(mel_buffer)
            audio_chunk = self.vocoder(mel_chunk)
            yield audio_chunk

# Usage
streaming_tts = StreamingTTS(acoustic_model, vocoder)

# Stream audio
for audio_chunk in streaming_tts.stream_synthesize("Long text to synthesize..."):
    # Play audio_chunk immediately
    play_audio(audio_chunk)
    # User starts hearing speech before full generation completes!

Pattern 3: Edge Deployment

class EdgeTTS:
    """
    TTS optimized for edge devices (phones, IoT)
    """

    def __init__(self):
        self.model = self.load_optimized_model()

    def load_optimized_model(self):
        """
        Load quantized, pruned model

        Techniques:
            - INT8 quantization (4x smaller, 2-4x faster)
            - Knowledge distillation (smaller student model)
            - Pruning (remove 30-50% of weights)
        """
        import torch.quantization

        # Load full precision model
        model = load_full_model()

        # Quantize to INT8
        # Use dynamic quantization for linear-heavy modules as a safe default
        model_quantized = torch.quantization.quantize_dynamic(
            model,
            {nn.Linear},
            dtype=torch.qint8
        )

        return model_quantized

    def synthesize_on_device(self, text: str):
        """
        Synthesize on edge device

        Latency: 50-150ms
        Memory: <100MB
        """
        audio = self.model.generate(text)
        return audio

Quality Assessment in Production

Automated Quality Monitoring

import librosa
import numpy as np

class TTSQualityMonitor:
    """
    Monitor TTS quality in production
    """

    def __init__(self):
        self.baseline_mcd = 2.5  # Expected MCD
        self.alert_threshold = 3.5  # Alert if MCD > this

    def monitor_synthesis(self, text: str, generated_audio: np.ndarray):
        """
        Check quality of generated audio

        Red flags:
            - Abnormal duration
            - Clipping / distortion
            - Silent segments
            - MCD drift
        """
        issues = []

        # Check duration
        expected_duration = len(text.split()) * 0.5  # ~0.5s per word
        actual_duration = len(generated_audio) / 22050
        if abs(actual_duration - expected_duration) / expected_duration > 0.5:
            issues.append(f"Abnormal duration: {actual_duration:.2f}s vs {expected_duration:.2f}s expected")

        # Check for clipping
        if np.max(np.abs(generated_audio)) > 0.99:
            issues.append("Clipping detected")

        # Check for silent segments
        rms = librosa.feature.rms(y=generated_audio)[0]
        silent_ratio = (rms < 0.01).sum() / len(rms)
        if silent_ratio > 0.3:
            issues.append(f"Too much silence: {silent_ratio:.1%}")

        # Log for drift detection
        self.log_quality_metrics({
            'text_length': len(text),
            'audio_duration': actual_duration,
            'max_amplitude': np.max(np.abs(generated_audio)),
            'silent_ratio': silent_ratio
        })

        return issues

    def log_quality_metrics(self, metrics: dict):
        """Log metrics for drift detection"""
        # Send to monitoring system (Datadog, Prometheus, etc.)
        pass

Comparative Analysis

Tacotron 2 vs FastSpeech 2

| Aspect | Tacotron 2 | FastSpeech 2 |
|---|---|---|
| Speed | Slow (autoregressive) | Fast (parallel) |
| Quality | Excellent | Excellent |
| Robustness | Can skip/repeat words | More robust |
| Controllability | Limited | Explicit control (pitch, duration) |
| Training | Simpler (no duration model) | Needs duration labels |
| Latency | 200-500ms | 50-150ms |

When to Use Each

Use Tacotron 2 when:

  • Maximum naturalness is critical
  • Training data is limited (easier to train)
  • Latency is acceptable

Use FastSpeech 2 when:

  • Low latency required
  • Need prosody control
  • Robustness is critical (production systems)

Recent Advances (2023-2024)

1. VALL-E (Zero-Shot Voice Cloning)

Microsoft’s VALL-E can clone a voice from a 3-second sample using a language-modeling approach.

Key idea: Treat TTS as conditional language modeling over discrete audio codes.
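
A toy sketch of that idea follows. This is not VALL-E's actual architecture: the codec, vocabulary sizes, and layer counts are illustrative. Phonemes plus a short prompt of discrete audio codes go into a Transformer, which predicts the next audio code; a neural codec decoder then turns the code sequence back into a waveform.

import torch
import torch.nn as nn

class CodeLM(nn.Module):
    """Toy sketch of "TTS as language modeling over audio codes"."""

    def __init__(self, n_phonemes=100, n_codes=1024, dim=512):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, dim)
        self.code_emb = nn.Embedding(n_codes, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(dim, n_codes)

    def forward(self, phoneme_ids, code_ids):
        # Condition on phonemes + prompt codes, predict the next audio code
        x = torch.cat([self.phoneme_emb(phoneme_ids), self.code_emb(code_ids)], dim=1)
        h = self.transformer(x)
        return self.head(h[:, -1])  # logits over the next code

# At inference, predicted codes are fed back in autoregressively, and a
# neural codec decoder converts the final code sequence into audio.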

2. VITS (End-to-End TTS)

Combines acoustic model and vocoder into single model.

Advantages:

  • Faster training and inference
  • Better audio quality
  • Simplified pipeline
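
For experimentation, pre-trained VITS checkpoints are available in open-source toolkits. A hedged example, assuming the Coqui TTS package is installed and the model name below is still published in its model zoo:

from TTS.api import TTS

tts = TTS("tts_models/en/ljspeech/vits")
tts.tts_to_file(text="End-to-end synthesis in one model.", file_path="vits_out.wav")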

3. YourTTS (Multi-lingual Voice Cloning)

Zero-shot multi-lingual TTS supporting 16+ languages.

4. Bark (Generative Audio Model)

Text-to-audio model that can generate music, sound effects, and speech with emotions.


Key Takeaways

✅ Two-stage pipeline - Acoustic model + vocoder is standard
✅ Text processing critical - Normalization and G2P affect quality
✅ Autoregressive vs non-autoregressive - Tacotron vs FastSpeech trade-offs
✅ Prosody control - Pitch, duration, energy for expressiveness
✅ Multiple metrics - Objective (MCD) and subjective (MOS) both needed
✅ Production optimization - Latency, caching, streaming for real-time use
✅ Like climbing stairs - Build incrementally (phoneme → mel → waveform)


FAQ

Q: What is the difference between Tacotron 2 and FastSpeech 2? A: Tacotron 2 is autoregressive (generates mel frames sequentially) offering excellent naturalness but slower inference (200-500ms). FastSpeech 2 is non-autoregressive (generates all frames in parallel) with explicit duration, pitch, and energy prediction, achieving 50-150ms latency and greater robustness against word skipping or repeating.

Q: What vocoder should I use for production TTS? A: HiFi-GAN is the current standard for production TTS. It uses GAN-based training with residual blocks and dilated convolutions to generate high-quality waveforms very quickly. Alternatives include WaveGlow (flow-based, fast) and WaveNet (autoregressive, highest quality but slow).

Q: How much data do I need to train a single-speaker TTS model? A: A typical single-speaker TTS model requires 10-24 hours of studio-quality clean speech covering 5,000-15,000 sentences with phonetic diversity. For multi-speaker TTS, you need 1-5 hours per speaker across 100-10,000 speakers.


Originally published at: arunbaby.com/speech-tech/0006-text-to-speech-basics

