Sequence-to-Sequence Speech Models
“From waveforms to words, and back again.”
1. The Seq2Seq Paradigm
Traditional speech systems were pipelines:
- ASR: Acoustic Model + Language Model + Decoder.
- TTS: Text Analysis + Acoustic Model + Vocoder.
Seq2Seq unifies this into a single neural network:
- Input: Sequence (audio features or text).
- Output: Sequence (text or audio features).
- No hand-crafted features or rules.
2. ASR: Listen, Attend, and Spell (LAS)
Architecture:
- Listener (Encoder): Pyramid LSTM processes audio.
- Reduces time resolution (e.g., 100 frames → 25 frames).
- Attender: Attention mechanism focuses on relevant audio frames.
- Speller (Decoder): LSTM generates characters/words.
Attention: \(e_{t,i} = \text{score}(s_{t-1}, h_i)\), \(\alpha_t = \text{softmax}(e_t)\), \(c_t = \sum_i \alpha_{t,i} h_i\)
Where:
- $s_{t-1}$: Previous decoder state.
- $h_i$: Encoder hidden state at time $i$.
- $c_t$: Context vector (weighted sum of encoder states).
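A minimal PyTorch sketch of this attention step, assuming additive (Bahdanau-style) scoring; the class name and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """score(s_{t-1}, h_i) = v^T tanh(W_s s_{t-1} + W_h h_i)"""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, h):
        # s_prev: (B, dec_dim) previous decoder state; h: (B, T, enc_dim) encoder states
        e = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(h))).squeeze(-1)  # (B, T)
        alpha = torch.softmax(e, dim=-1)                   # attention weights alpha_t
        c = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)    # context vector c_t: (B, enc_dim)
        return c, alpha
```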
3. TTS: Tacotron 2
Goal: Text → Mel-Spectrogram → Waveform.
Architecture:
- Encoder: Character embeddings → LSTM.
- Attention: Decoder attends to encoder states.
- Decoder: Predicts mel-spectrogram frames.
- Vocoder (WaveNet/HiFi-GAN): Mel → Waveform.
Key Innovation: Predict multiple frames per step (faster).
4. Speech Translation: Direct S2ST
Traditional (cascade): Audio → ASR → Text → MT → Text → TTS → Audio.
Direct: Audio → Audio (no text intermediate).
Advantages:
- Preserves prosody (emotion, emphasis).
- Works for unwritten languages.
Model: Translatotron (Google).
- Encoder: Audio (Spanish).
- Decoder: Spectrogram (English).
5. Attention Mechanisms
1. Content-Based Attention
- Decoder decides where to attend based on content.
- Problem: Can attend to same location twice (repetition).
2. Location-Aware Attention
- Uses previous attention weights to guide current attention.
- Encourages monotonic progression (left-to-right).
3. Monotonic Attention
- Enforces strict left-to-right alignment.
- Good for streaming ASR (can’t look ahead).
6. Challenges
1. Exposure Bias
- During training, decoder sees ground truth.
- During inference, decoder sees its own (possibly wrong) predictions.
- Fix: Scheduled sampling (see the sketch after this list).
2. Alignment Failures
- Attention might skip words or repeat.
- Fix: Guided attention (force diagonal alignment early in training).
3. Long Sequences
- Attention is $O(N^2)$ in memory.
- Fix: Chunked attention, or use CTC for ASR.
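A minimal sketch of scheduled sampling (referenced in the exposure-bias item above); `decoder_step` is an assumed per-step decoder interface, not a specific library call:

```python
import random
import torch

def decode_with_scheduled_sampling(decoder_step, targets, sos_id, teacher_forcing_prob):
    """targets: (B, L) ground-truth token ids. Returns logits of shape (B, L, V).
    With probability `teacher_forcing_prob` the ground-truth previous token is fed;
    otherwise the model's own argmax prediction is used, as at inference time."""
    batch_size, max_len = targets.shape
    prev = torch.full((batch_size,), sos_id, dtype=torch.long)
    hidden, all_logits = None, []
    for t in range(max_len):
        logits, hidden = decoder_step(prev, hidden)    # logits: (B, V)
        all_logits.append(logits)
        if random.random() < teacher_forcing_prob:
            prev = targets[:, t]                       # teacher forcing
        else:
            prev = logits.argmax(dim=-1)               # feed back the model's own prediction
    return torch.stack(all_logits, dim=1)

# In practice, teacher_forcing_prob is annealed from 1.0 toward ~0.5 over training.
```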
7. Modern Approach: Transformer-Based
Whisper (OpenAI):
- Encoder-Decoder Transformer.
- Trained on 680k hours.
- Handles ASR, Translation, Language ID in one model.
Advantages:
- Parallel training (unlike RNN).
- Better long-range dependencies.
8. Summary
| Task | Model | Input | Output |
|---|---|---|---|
| ASR | LAS, Whisper | Audio | Text |
| TTS | Tacotron 2 | Text | Audio |
| ST | Translatotron | Audio (L1) | Audio (L2) |
9. Deep Dive: Attention Alignment Visualization
In ASR, attention should be monotonic (left-to-right).
Good Alignment:
Audio frames: [a1][a2][a3][a4][a5]
Text: [ h ][ e ][ l ][ l ][ o ]
Attention: ████
           ░░░░████
           ░░░░░░░░████
           ░░░░░░░░░░░░████
           ░░░░░░░░░░░░░░░░████
Bad Alignment (Skipping):
Audio frames: [a1][a2][a3][a4][a5]
Text: [ h ][ e ][ l ][ l ][ o ]
Attention: ████
           ░░░░████
           ░░░░░░░░░░░░████ ← Skipped a3!
Fix: Guided attention loss forces diagonal alignment early in training.
10. Deep Dive: Streaming ASR with Monotonic Attention
Problem: Standard attention looks at the entire input. Can’t stream.
Monotonic Chunkwise Attention (MoChA):
- At each decoder step, decide: “Should I move to the next audio chunk?”
- Use a sigmoid gate: $p_{\text{move}} = \sigma(g(h_t, s_{t-1}))$
- If yes, attend to next chunk. If no, stay.
Advantage: Can process audio in real-time (with bounded latency).
11. Deep Dive: Tacotron 2 Architecture Details
Encoder:
- Character embeddings → 3 Conv layers → Bidirectional LSTM.
- Output: Encoded representation of text.
Decoder:
- Prenet: 2 FC layers with dropout (helps with generalization).
- Attention RNN: 1-layer LSTM.
- Decoder RNN: 2-layer LSTM.
- Output: Predicts 80-dim mel-spectrogram frame.
Postnet:
- 5 Conv layers.
- Refines the mel-spectrogram (adds high-frequency details).
Stop Token:
- Decoder also predicts “Should I stop?” at each step.
- Training: Sigmoid BCE loss.
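A minimal sketch of the prenet and stop-token head described above; the 256-unit prenet width and the stop-projection input size are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Prenet(nn.Module):
    """Two FC layers with dropout; Tacotron 2 keeps dropout active even at inference."""
    def __init__(self, in_dim=80, hidden=256, p=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.p = p

    def forward(self, x):
        x = F.dropout(torch.relu(self.fc1(x)), p=self.p, training=True)
        return F.dropout(torch.relu(self.fc2(x)), p=self.p, training=True)

# Stop-token head: one logit per decoder step, trained with sigmoid BCE.
stop_proj = nn.Linear(1024 + 512, 1)   # decoder output + attention context (dims assumed)
stop_loss = nn.BCEWithLogitsLoss()
```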
12. Deep Dive: Vocoder Evolution
Goal: Mel-Spectrogram → Waveform.
WaveNet (2016):
- Autoregressive CNN.
- Generates one sample at a time (16kHz = 16,000 samples/sec).
- Slow: 1 second of audio takes 10 seconds to generate.
WaveGlow (2018):
- Flow-based model.
- Parallel generation (all samples at once).
- Fast: Real-time on GPU.
HiFi-GAN (2020):
- GAN-based.
- Fastest: 167x faster than real-time on V100.
- Quality: Indistinguishable from ground truth.
13. System Design: Low-Latency TTS
Scenario: Voice assistant (Alexa, Siri).
Requirements:
- Latency: < 300ms from text to first audio chunk.
- Quality: Natural, expressive.
Architecture:
- Text Normalization: “Dr.” → “Doctor”, “$100” → “one hundred dollars”.
- Grapheme-to-Phoneme (G2P): “read” → /riːd/ or /rɛd/ (context-dependent).
- Prosody Prediction: Predict pitch, duration, energy.
- Acoustic Model: FastSpeech 2 (non-autoregressive, parallel).
- Vocoder: HiFi-GAN.
Streaming:
- Generate mel-spectrogram in chunks (e.g., 50ms).
- Vocoder processes each chunk independently.
14. Deep Dive: Translatotron (Direct S2ST)
Challenge: No parallel S2ST data (Audio_Spanish → Audio_English).
Solution: Weak supervision.
- Use ASR to get Spanish text.
- Use MT to get English text.
- Use TTS to get English audio.
- Train Translatotron on (Audio_Spanish, Audio_English) pairs.
Architecture:
- Encoder: Processes Spanish audio.
- Decoder: Generates English mel-spectrogram.
- Speaker Encoder: Preserves source speaker’s voice.
15. Code: Simple Seq2Seq ASR
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqASR(nn.Module):
    def __init__(self, input_dim, hidden_dim, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size
        # Encoder: 3-layer bidirectional LSTM over audio features
        self.encoder = nn.LSTM(input_dim, hidden_dim, num_layers=3,
                               batch_first=True, bidirectional=True)
        # Simplified content-only attention (a full LAS attender also conditions on the decoder state)
        self.attention = nn.Linear(hidden_dim * 2, 1)
        # Decoder: input = one-hot previous token + attention context
        self.decoder = nn.LSTM(vocab_size + hidden_dim * 2, hidden_dim,
                               num_layers=2, batch_first=True)
        # Output projection to the vocabulary
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, audio_features, text_input):
        # Encode audio: (B, T, input_dim) -> (B, T, 2*H)
        encoder_out, _ = self.encoder(audio_features)

        outputs = []
        hidden = None
        for t in range(text_input.size(1)):
            # Attention weights over encoder frames
            attn_scores = self.attention(encoder_out).squeeze(-1)        # (B, T)
            attn_weights = torch.softmax(attn_scores, dim=1)             # (B, T)
            context = torch.bmm(attn_weights.unsqueeze(1), encoder_out)  # (B, 1, 2*H)

            # Decoder input: one-hot previous character (teacher forcing) + context
            prev_token = F.one_hot(text_input[:, t], self.vocab_size).float().unsqueeze(1)  # (B, 1, V)
            decoder_input = torch.cat([prev_token, context], dim=-1)

            # Single decoder step
            decoder_out, hidden = self.decoder(decoder_input, hidden)    # (B, 1, H)

            # Predict the next character
            logits = self.fc(decoder_out)                                # (B, 1, V)
            outputs.append(logits)

        return torch.cat(outputs, dim=1)                                 # (B, L, V)
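A quick shape-only smoke test of the model above (random tensors, illustrative dimensions):

```python
model = Seq2SeqASR(input_dim=80, hidden_dim=256, vocab_size=30)
audio = torch.randn(2, 100, 80)          # (B, T, n_mels): 100 feature frames
text = torch.randint(0, 30, (2, 20))     # (B, L): previous-token ids for teacher forcing
logits = model(audio, text)
print(logits.shape)                      # torch.Size([2, 20, 30])
```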
16. Production Considerations
- Model Size: Quantize to INT8 for mobile deployment.
- Latency: Use non-autoregressive models (FastSpeech, Conformer-CTC).
- Personalization: Fine-tune on user’s voice (few-shot learning).
- Multilingual: Train on 100+ languages (mSLAM, Whisper).
17. Deep Dive: The Evolution of Seq2Seq in Speech
To understand where we are, we must understand the journey.
1. HMM-GMM (1980s-2010):
- Acoustic Model: Gaussian Mixture Models (GMM) modeled the probability of audio features given a phoneme state.
- Sequence Model: Hidden Markov Models (HMM) modeled the transition between states.
- Pros: Mathematically rigorous, worked on small data.
- Cons: Assumed independence of frames (false), complex training pipeline.
2. DNN-HMM (2010-2015):
- Replaced GMM with Deep Neural Networks (DNN).
- Impact: Massive drop in WER (30% relative).
- Cons: Still relied on HMM for alignment.
3. CTC (2006, popularized 2015):
- Connectionist Temporal Classification.
- First “End-to-End” loss.
- Allowed predicting sequences shorter than input without explicit alignment.
- Cons: Conditional independence assumption (output at time $t$ depends only on input, not previous outputs).
4. LAS (2016):
- Listen-Attend-Spell.
- Introduced Attention to speech.
- Pros: No independence assumption.
- Cons: Not streaming (needs full audio to attend).
5. RNN-T (2012, popularized 2018):
- RNN Transducer.
- Combined CTC (streaming) with LAS (label dependency).
- Result: The standard for streaming ASR (Pixel, Siri, Alexa).
6. Transformer & Conformer (2017-Present):
- Self-attention replaces RNNs.
- Conformer: Combines CNN (local patterns) with Transformer (global context).
18. Deep Dive: Connectionist Temporal Classification (CTC)
Problem: Audio has $T$ frames (e.g., 1000). Text has $L$ characters (e.g., 50). $T \gg L$. How do we map them?
CTC Solution:
- Introduce a blank token $\epsilon$.
- Output sequence length is $T$.
- Collapse: Remove repeats and blanks.
h h e e l l l l o → helo (without a blank, the repeated l's collapse to one)
h h $\epsilon$ e e l l $\epsilon$ l l o → hello (the blank between the l's preserves the double l)
Loss Function: \(P(Y|X) = \sum_{A \in \mathcal{B}^{-1}(Y)} P(A|X)\)
- Sum over all valid alignments $A$ that collapse to $Y$.
- Computed efficiently using Dynamic Programming (Forward-Backward algorithm).
Pros:
- Fast inference ($O(1)$ per step).
- Streaming friendly.
Cons:
- Can’t model language (e.g., “pair” vs “pear” sounds same). Needs external Language Model.
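A minimal sketch of the CTC loss using PyTorch's built-in `nn.CTCLoss`; shapes and sizes are illustrative, and log-probabilities are time-major as PyTorch expects:

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

B, T, C, L = 4, 100, 30, 12                  # batch, input frames, vocab (incl. blank), target length
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(dim=-1)  # (T, B, C)
targets = torch.randint(1, C, (B, L))        # label ids 1..C-1 (0 is reserved for blank)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), L, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()                              # gradients come from the forward-backward recursion
```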
19. Deep Dive: RNN-Transducer (RNN-T)
Architecture:
- Encoder (Audio): Processes audio frames $x_t$. Produces $h_t^{enc}$.
- Prediction Network (Text): Processes previous non-blank output $y_{u-1}$. Produces $h_u^{pred}$. (Like an LM).
- Joint Network: Combines them. \(z_{t,u} = \text{ReLU}(W_{enc} h_t^{enc} + W_{pred} h_u^{pred})\) \(P(y|t,u) = \text{softmax}(W_{out} z_{t,u})\)
Inference (Greedy):
- If output is non-blank ($y$), feed to Prediction Network, increment $u$. Stay at audio frame $t$.
- If output is blank ($\epsilon$), move to next audio frame $t+1$.
Why it wins:
- Streaming: Encoder is causal.
- Accuracy: Prediction network models label dependencies (unlike CTC).
- Latency: Controllable.
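A minimal sketch of the joint network defined above; dimension names are illustrative, and a real system would pair this with a transducer loss such as `torchaudio.functional.rnnt_loss`:

```python
import torch
import torch.nn as nn

class RNNTJoint(nn.Module):
    """z_{t,u} = ReLU(W_enc h_t^enc + W_pred h_u^pred); P(y|t,u) = softmax(W_out z_{t,u})"""
    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, joint_dim)
        self.W_pred = nn.Linear(pred_dim, joint_dim)
        self.W_out = nn.Linear(joint_dim, vocab_size)   # vocab includes the blank symbol

    def forward(self, h_enc, h_pred):
        # h_enc: (B, T, enc_dim) encoder states; h_pred: (B, U, pred_dim) prediction-network states
        z = torch.relu(self.W_enc(h_enc).unsqueeze(2) + self.W_pred(h_pred).unsqueeze(1))  # (B, T, U, joint_dim)
        return self.W_out(z)                            # logits for every (t, u) pair
```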
20. Deep Dive: Transformer vs Conformer
Transformer:
- Great at global context (Self-Attention).
- Bad at local fine-grained details (needs deep layers to see local patterns).
- Positional encodings are brittle for varying audio lengths.
Conformer (Convolution-augmented Transformer):
- Macaron Style: Feed-Forward -> Multi-Head Attention -> Conv Module -> Feed-Forward.
- Conv Module: Pointwise Conv -> Gated Linear Unit (GLU) -> Depthwise Conv -> BatchNorm -> Swish -> Pointwise Conv.
- Why:
- CNNs capture local features (formants, transitions).
- Transformers capture global semantics.
- Result: SOTA on LibriSpeech.
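A minimal sketch of the convolution module listed above (the residual connection and leading LayerNorm of the full Conformer block are omitted for brevity; the kernel size is an assumption):

```python
import torch
import torch.nn as nn

class ConformerConvModule(nn.Module):
    """Pointwise Conv -> GLU -> Depthwise Conv -> BatchNorm -> Swish -> Pointwise Conv."""
    def __init__(self, dim, kernel_size=31):
        super().__init__()
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)   # doubles channels for GLU
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.swish = nn.SiLU()                                     # SiLU == Swish
        self.pointwise2 = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):                  # x: (B, T, dim)
        x = x.transpose(1, 2)              # Conv1d expects (B, dim, T)
        x = self.pointwise1(x)
        x = self.glu(x)
        x = self.depthwise(x)
        x = self.swish(self.bn(x))
        x = self.pointwise2(x)
        return x.transpose(1, 2)
```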
21. Deep Dive: FastSpeech 2 (Non-Autoregressive TTS)
Problem with Tacotron 2:
- Autoregressive (slow).
- Unstable attention (skipping/repeating).
- Hard to control prosody (speed, pitch).
FastSpeech 2 Architecture:
- Encoder: Feed-Forward Transformer.
- Variance Adaptor:
- Duration Predictor: How many frames does this phoneme last? (Trained on forced alignment).
- Pitch Predictor: Predicts F0 contour.
- Energy Predictor: Predicts volume.
- Adds embeddings of these predictions to the encoder output.
- Length Regulator: Expands hidden states based on duration (e.g., “a” lasts 5 frames -> repeat vector 5 times); see the sketch after this list.
- Decoder: Feed-Forward Transformer.
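A minimal sketch of the Length Regulator referenced in the list above (single utterance, no batching):

```python
import torch

def length_regulator(hidden, durations):
    """hidden: (L, dim) phoneme-level encoder outputs; durations: (L,) integer frame counts.
    Each phoneme vector is repeated `duration` times to reach frame-level length."""
    return torch.repeat_interleave(hidden, durations, dim=0)

h = torch.randn(3, 256)                    # 3 phonemes
d = torch.tensor([5, 2, 7])                # predicted durations in frames
frames = length_regulator(h, d)            # (14, 256) -> fed to the decoder
```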
Pros:
- Fast: Parallel generation.
- Robust: No attention failures.
- Controllable: Can explicitly set “speak 1.2x faster” or “raise pitch”.
22. Deep Dive: VITS (Conditional Variational Autoencoder)
VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) is the current SOTA for E2E TTS.
Key Idea: Combine Acoustic Model and Vocoder into one flow-based model.
Architecture:
- Posterior Encoder: Encodes linear spectrogram into latent $z$.
- Prior Encoder: Predicts distribution of $z$ from text (conditioned on alignment).
- Decoder (Generator): Transforms $z$ into waveform (HiFi-GAN style).
- Stochastic Duration Predictor: Models duration uncertainty.
Training:
- Reconstruction Loss: Mel-spectrogram L1 loss.
- KL Divergence: Between Posterior and Prior.
- Adversarial Loss: Discriminator tries to distinguish real vs generated audio.
Result: Extremely natural, high-fidelity speech.
23. System Design: Real-Time Speech Translation System
Scenario: “Universal Translator” device. English Audio -> Spanish Audio.
Constraints:
- Latency: < 2 seconds lag.
- Compute: Edge device (limited).
Architecture Choices:
Option A: Cascade (ASR -> MT -> TTS)
- Pros: Modular. Can use SOTA for each.
- Cons: Error propagation. Latency adds up. Loss of paralinguistics (tone).
Option B: Direct S2ST (Audio -> Audio)
- Pros: Fast. Preserves tone.
- Cons: Data scarcity. Hard to train.
Hybrid Design (Production):
- Streaming ASR (RNN-T): Generates English text stream.
- Streaming MT: Translates partial sentences. “Hello” -> “Hola”.
- Incremental TTS: Generates audio for “Hola” immediately.
Wait-k Policy:
- MT model waits for $k$ words before translating.
- Balances context (accuracy) vs latency.
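A minimal sketch of the wait-k reading/writing schedule; `translate_prefix` is an assumed helper that returns the next target word (or None when done) given the prefixes seen so far:

```python
def wait_k_decode(source_stream, translate_prefix, k=3, max_extra=20):
    """Read k source words before writing, then alternate one READ with one WRITE;
    once the source ends, keep writing until the translator signals completion."""
    source, target = [], []
    for word in source_stream:
        source.append(word)                                    # READ
        if len(source) >= k:
            target.append(translate_prefix(source, target))    # WRITE
    for _ in range(max_extra):                                 # flush the remaining tail
        nxt = translate_prefix(source, target)
        if nxt is None:
            break
        target.append(nxt)
    return target
```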
24. Case Study: OpenAI Whisper
Goal: Robust ASR that works on “in the wild” audio.
Data:
- 680,000 hours of web audio.
- Weak Supervision: Transcripts from ASR systems, subtitles (noisy).
- Multitask:
- English Transcription.
- Any-to-English Translation.
- Language Identification.
- Voice Activity Detection.
Architecture:
- Standard Encoder-Decoder Transformer.
- Input: Log-Mel Spectrogram (30 seconds).
- Output: Text tokens.
Key Features:
- Task Tokens: <|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>.
- Long-form: Processes 30s chunks and uses the previous chunk’s text as a prompt (context).
Impact:
- Zero-shot performance on many datasets matches supervised models.
- Proved that Data Scale > Model Architecture.
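A minimal usage sketch with the open-source `openai-whisper` package (assuming it is installed and a local file `audio.mp3` exists):

```python
import whisper

model = whisper.load_model("base")                        # tiny / base / small / medium / large
result = model.transcribe("audio.mp3")                    # language is auto-detected
print(result["text"])

# Any-to-English translation instead of transcription:
result = model.transcribe("audio.mp3", task="translate")
print(result["text"])
```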
25. Case Study: Google USM (Universal Speech Model)
Goal: One model for 300+ languages.
Architecture:
- Conformer (2 Billion parameters).
- MOST (Multi-Objective Supervised Training):
- BEST-RQ: Self-supervised loss (BERT-style on audio).
- Text-Injection: Train on text-only data (using shared encoder layers).
- ASR: Supervised loss on labeled audio.
Result:
- SOTA on 73 languages.
- Enables ASR for languages with < 10 hours of data.
26. Deep Dive: Evaluation Metrics
ASR:
- WER (Word Error Rate): $\frac{S + D + I}{N}$
- S: Substitutions, D: Deletions, I: Insertions, N: Total words.
- CER (Character Error Rate): For languages without spaces (Chinese).
- RTF (Real-Time Factor): $\frac{\text{Processing Time}}{\text{Audio Duration}}$. RTF < 1 means real-time.
TTS:
- MOS (Mean Opinion Score): Humans rate naturalness 1-5.
- MCD (Mel Cepstral Distortion): Objective distance between generated and ground truth spectrograms.
ST (Speech Translation):
- BLEU: Standard MT metric.
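A minimal WER implementation matching the formula above (plain edit distance over words):

```python
def wer(reference, hypothesis):
    """Word Error Rate = (S + D + I) / N, computed via Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))   # 1 insertion / 3 words ≈ 0.33
```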
27. Code: Implementing Beam Search for Seq2Seq
Greedy decoding (always picking the most probable token) is suboptimal. Beam search keeps the $k$ best partial hypotheses at each step.
def beam_search_decoder(model, encoder_out, beam_width=3, max_len=50):
    """Beam search over a step-wise seq2seq decoder.

    Assumes `model` exposes `vocab` (token -> id) and a decoder callable
    `model.decoder(last_token, hidden, encoder_out)` returning (logits, hidden)."""
    sos, eos = model.vocab['<sos>'], model.vocab['<eos>']
    # Each hypothesis: (token sequence, cumulative log-prob, decoder hidden state)
    beam = [([sos], 0.0, None)]
    for _ in range(max_len):
        candidates = []
        for seq, score, hidden in beam:
            if seq[-1] == eos:
                candidates.append((seq, score, hidden))   # keep finished hypotheses as-is
                continue
            # One decoder step conditioned on the last token of this hypothesis
            last_token = torch.tensor([[seq[-1]]])
            output, new_hidden = model.decoder(last_token, hidden, encoder_out)
            log_probs = torch.log_softmax(output, dim=-1).view(-1)   # (vocab,)
            # Expand with the beam_width best continuations
            topk_probs, topk_ids = log_probs.topk(beam_width)
            for prob, token in zip(topk_probs.tolist(), topk_ids.tolist()):
                candidates.append((seq + [token], score + prob, new_hidden))
        # Prune to the beam_width best hypotheses
        beam = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
        # Stop early if every hypothesis has emitted <eos>
        if all(c[0][-1] == eos for c in beam):
            break
    return beam[0][0]   # best-scoring token sequence
28. Production: Serving Speech Models with Triton
NVIDIA Triton Inference Server is standard for deploying speech models.
Pipeline:
- Feature Extractor (Python Backend): Audio -> Mel-Spec.
- Encoder (TensorRT): Mel-Spec -> Hidden States.
- Decoder (TensorRT): Hidden States -> Text (Beam Search).
Dynamic Batching:
- Triton groups requests arriving within 5ms into a batch.
- Increases GPU utilization significantly.
Ensemble Model:
- Define a DAG of models.
- Client sends audio, Triton handles the flow.
29. Deep Dive: End-to-End Speech-to-Speech Translation (S2ST)
Unit-Based S2ST:
- Instead of spectrograms, predict discrete acoustic units (from HuBERT or k-means).
- Advantage: Discrete tokens allow using standard Transformer (like NLP).
- Vocoder: Unit HiFi-GAN converts discrete units to waveform.
SpeechMatrix:
- Mining parallel speech from 100k hours of multilingual audio.
- Uses LASER embeddings to find matching sentences in different languages.
30. Summary
| Task | Model | Input | Output |
|---|---|---|---|
| ASR | LAS, Whisper | Audio | Text |
| TTS | Tacotron 2 | Text | Audio |
| ST | Translatotron | Audio (L1) | Audio (L2) |
| Vocoder | HiFi-GAN | Mel-Spec | Waveform |
| Streaming | RNN-T | Audio Stream | Text Stream |
| Fast TTS | FastSpeech 2 | Text | Mel-Spec |
31. Deep Dive: Self-Supervised Learning (SSL) in Speech
Problem: Labeled data (audio + text) is expensive. Unlabeled audio is free.
Wav2Vec 2.0 (Meta):
- Idea: Mask parts of the audio latent space and predict the quantized representation of the masked part.
- Contrastive Loss: Identify the correct quantized vector among distractors.
- Result: Can reach SOTA with only 10 minutes of labeled data (after pre-training on 53k hours).
HuBERT (Hidden Unit BERT):
- Idea: Offline clustering (k-means) of MFCCs to generate pseudo-labels.
- Masked Prediction: BERT-style objective. Predict the cluster ID of masked frames.
- Result: More robust than Wav2Vec 2.0.
Impact on Seq2Seq:
- Initialize the Encoder with Wav2Vec 2.0 / HuBERT.
- Add a random Decoder.
- Fine-tune on ASR task.
- Benefit: Drastic reduction in required labeled data.
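A minimal inference sketch for this recipe using Hugging Face Transformers and the public facebook/wav2vec2-base-960h checkpoint (already SSL-pre-trained and CTC-fine-tuned); the silent waveform is only a placeholder:

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform = torch.zeros(16000).numpy()        # placeholder: 1 second of 16 kHz audio
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits      # (1, T', vocab)
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids))             # greedy CTC decode
```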
32. Deep Dive: Multilingual Seq2Seq
Goal: One model for 100 languages.
Challenges:
- Data Imbalance: English has 1M hours, Swahili has 100.
- Script Diversity: Latin, Cyrillic, Chinese, Arabic.
- Phonetic Overlap: “P” in English $\neq$ “P” in French.
Solutions:
- Language ID Token: Prepend a token such as <|en|> or <|fr|> to the target sequence.
- Shared Vocabulary: SentencePiece (BPE) trained on all languages.
- Balancing Sampling: Up-sample low-resource languages during training. \(p_l \propto (\frac{N_l}{N_{total}})^\alpha\) where $\alpha < 1$ (e.g., 0.3) flattens the distribution.
- Adapter Modules: Small, language-specific layers inserted into the frozen giant model.
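A tiny worked example of the balancing formula above (hour counts are illustrative):

```python
hours = {"en": 1_000_000, "hi": 10_000, "sw": 100}   # labeled hours per language (made up)
alpha = 0.3
total = sum(hours.values())
w = {lang: (n / total) ** alpha for lang, n in hours.items()}
Z = sum(w.values())
p = {lang: round(wi / Z, 3) for lang, wi in w.items()}
print(p)   # low-resource languages get a far larger sampling share than their raw fraction
```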
33. Deep Dive: End-to-End SLU (Spoken Language Understanding)
Traditional: Audio → ASR → Text → NLP → Intent/Slots.
E2E SLU: Audio → Intent/Slots.
Why E2E?
- Error Robustness: ASR might transcribe “play jazz” as “play jas”. NLP fails. E2E model learns acoustic features of “jazz”.
- Paralinguistics: “Yeah right” (sarcastic) -> Negative Sentiment. Text-only NLP misses this.
Architecture:
- Encoder: Pre-trained Wav2Vec 2.0.
- Decoder: Predicts semantic frame directly.
[INTENT: PlayMusic] [ARTIST: The Beatles] [GENRE: Rock]
Challenges:
- Lack of labeled Audio-to-Semantics data.
- Solution: Transfer learning from ASR models.
34. Deep Dive: Attention Visualization Code
Visualizing attention maps is the best way to debug Seq2Seq models.
import matplotlib.pyplot as plt
import seaborn as sns
def plot_attention(attention_matrix, input_text, output_text):
"""
attention_matrix: (Output_Len, Input_Len) numpy array
"""
plt.figure(figsize=(10, 8))
sns.heatmap(attention_matrix, cmap='viridis',
xticklabels=input_text, yticklabels=output_text)
plt.xlabel('Encoder Input')
plt.ylabel('Decoder Output')
plt.show()
# Example usage
# attn = model.get_attention_weights(...)
# plot_attention(attn, audio_frames, predicted_words)
What to look for:
- Diagonal: Good alignment.
- Vertical Line: Decoder is stuck on one frame (repeating output).
- Horizontal Line: Encoder frame is ignored.
- Fuzzy: Weak attention (low confidence).
35. Deep Dive: Handling Long-Form Audio
Standard Transformers have $O(N^2)$ attention complexity. 30s audio = 1500 frames. 1 hour = 180,000 frames.
Strategies:
- Chunking (Whisper):
- Slice audio into 30s segments.
- Transcribe independently.
- Problem: Context cut off at boundaries.
- Fix: Pass previous segment’s text as prompt.
- Streaming (RNN-T):
- Process frame-by-frame. Infinite length.
- Problem: No future context.
- Sparse Attention (BigBird / Longformer):
- Attend only to local window + global tokens.
- $O(N)$ complexity.
- Block-Processing (Emformer):
- Block-wise processing with memory bank for history.
36. Future Trends: Audio-Language Models
SpeechGPT / AudioLM:
- Treat audio tokens and text tokens as the same thing.
- Tokenizer: SoundStream / EnCodec (Neural Audio Codec).
- Model: Decoder-only Transformer (GPT).
- Training:
- Text-only data (Web).
- Audio-only data (Radio).
- Paired data (ASR).
Capabilities:
- Speech-to-Speech Translation: “Translate this to French” (Audio input) -> Audio output.
- Voice Continuation: Continue speaking in the user’s voice.
- Zero-shot TTS: “Say ‘Hello’ in this voice: [Audio Prompt]”.
37. Ethical Considerations: Voice Cloning & Deepfakes
Seq2Seq TTS (Vall-E) can clone a voice with 3 seconds of audio.
Risks:
- Fraud: Impersonating CEOs or relatives.
- Disinformation: Fake speeches by politicians.
- Harassment: Fake audio of individuals.
Mitigation:
- Watermarking: Embed inaudible signals in generated audio.
- Detection: Train classifiers to detect artifacts of synthesis.
- Regulation: “Know Your Customer” for TTS APIs.
38. Deep Dive: The Mathematics of Transformers in Speech
Why did Transformers replace RNNs?
1. Self-Attention Complexity:
- RNN: $O(N)$ sequential operations. Cannot parallelize.
- Transformer: $O(1)$ sequential operations (parallelizable). $O(N^2)$ memory.
- Benefit: We can train on 1000 GPUs efficiently.
2. Positional Encodings in Speech:
- Text uses absolute sinusoidal encodings.
- Speech Problem: Audio length varies wildly. 10s vs 10m.
- Solution: Relative Positional Encoding.
- Instead of adding $P_i$ to input $X_i$, add a bias $b_{i-j}$ to the attention score $A_{ij}$.
- Allows the model to generalize to audio lengths unseen during training.
3. Subsampling:
- Audio is high-frequency (100 frames/sec). Text is low-frequency (3 chars/sec).
- Conv Subsampling: First 2 layers of Encoder are strided Convolutions (stride 2x2 = 4x reduction).
- Reduces sequence length $N \to N/4$. Reduces attention cost $N^2 \to (N/4)^2 = N^2/16$.
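A minimal sketch of the strided convolutional subsampler (two stride-2 Conv2d layers give the 4x time reduction; the linear projection back to the model dimension that usually follows is omitted here):

```python
import torch
import torch.nn as nn

class ConvSubsampler(nn.Module):
    """Reduces the time axis by 4x before the Transformer encoder."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):                    # x: (B, T, n_mels)
        x = self.conv(x.unsqueeze(1))        # (B, channels, T/4, n_mels/4)
        b, c, t, f = x.shape
        return x.permute(0, 2, 1, 3).reshape(b, t, c * f)

mel = torch.randn(2, 1000, 80)
print(ConvSubsampler()(mel).shape)           # torch.Size([2, 250, 5120])
```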
39. Deep Dive: Troubleshooting Seq2Seq Models
1. The “Hallucination” Problem:
- Symptom: Model outputs “Thank you Thank you Thank you” during silence.
- Cause: Decoder language model is too strong; it predicts likely text even without acoustic evidence.
- Fix:
- Voice Activity Detection (VAD): Filter out silence.
- Coverage Penalty: Penalize attending to the same frames repeatedly.
2. The “NaN” Loss:
- Symptom: Training crashes.
- Cause: Exploding gradients in LSTM or division by zero in BatchNorm.
- Fix:
- Gradient Clipping (norm 1.0).
- Warmup learning rate.
- Check for empty transcripts in data.
3. The “Babble” Problem:
- Symptom: Output is gibberish.
- Cause: CTC alignment failed or Attention didn’t converge.
- Fix:
- Curriculum Learning: Train on short utterances first, then long.
- Guided Attention Loss: Force diagonal alignment for the first few epochs (see the sketch below).
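A minimal sketch of a guided attention loss (Gaussian-band formulation; the width `g` is a tunable assumption):

```python
import torch

def guided_attention_loss(attn, g=0.2):
    """attn: (N_dec, T_enc) attention weights for one utterance.
    Penalizes attention mass far from the diagonal n/N ≈ t/T."""
    N, T = attn.shape
    n = torch.arange(N).unsqueeze(1) / N          # (N, 1)
    t = torch.arange(T).unsqueeze(0) / T          # (1, T)
    W = 1.0 - torch.exp(-((n - t) ** 2) / (2 * g ** 2))   # ~0 on the diagonal, ~1 far off it
    return (attn * W).mean()

# Add this term (with a small weight) to the main loss for the first few epochs, then anneal it away.
```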
40. Deep Dive: The Future - Audio-Language Models
The boundary between Speech and NLP is blurring.
AudioLM (Google):
- Treats audio as a sequence of discrete tokens (using SoundStream codec).
- Uses a GPT-style decoder to generate audio tokens.
- Capabilities:
- Speech Continuation: Given 3s of speech, continue speaking in the same voice and style.
- Zero-Shot TTS: Generate speech from text in a target voice without fine-tuning.
SpeechGPT (Fudan University):
- Fine-tunes LLaMA on paired speech-text data.
- Modality Adaptation: Teaches the LLM to understand discrete audio tokens.
- Result: A chatbot you can talk to, which talks back with emotion and nuance.
Implication:
- Seq2Seq models (Encoder-Decoder) might be replaced by Decoder-only LLMs that handle all modalities (Text, Audio, Image) in a unified token space.
41. Ethical Considerations in Seq2Seq Speech
1. Deepfakes & Voice Cloning:
- Models like Vall-E can clone a voice from 3 seconds of audio.
- Risk: Fraud (fake CEO calls), harassment, disinformation.
- Mitigation:
- Watermarking: Embed inaudible signals in generated audio.
- Detection: Train classifiers to detect synthesis artifacts.
2. Bias in ASR:
- Models trained on LibriSpeech (audiobooks) fail on AAVE (African American Vernacular English) or Indian accents.
- Fix: Diverse training data. “Fairness-aware” loss functions that penalize disparity between groups.
3. Privacy:
- Smart speakers listen constantly.
- Fix: On-Device Processing. Run the Seq2Seq model on the phone’s NPU, never sending audio to the cloud.
42. Further Reading
To master Seq2Seq speech models, these papers are essential:
- “Listen, Attend and Spell” (Chan et al., 2016): The paper that introduced Attention to ASR.
- “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions” (Shen et al., 2018): The Tacotron 2 paper.
- “Attention Is All You Need” (Vaswani et al., 2017): The Transformer paper (foundation of modern speech).
- “Conformer: Convolution-augmented Transformer for Speech Recognition” (Gulati et al., 2020): The current SOTA architecture for ASR.
- “Wav2Vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (Baevski et al., 2020): The SSL revolution.
- “Whisper: Robust Speech Recognition via Large-Scale Weak Supervision” (Radford et al., 2022): How scale beats architecture.
43. Summary
| Task | Model | Input | Output |
|---|---|---|---|
| ASR | LAS, Whisper | Audio | Text |
| TTS | Tacotron 2 | Text | Audio |
| ST | Translatotron | Audio (L1) | Audio (L2) |
| Vocoder | HiFi-GAN | Mel-Spec | Waveform |
| Streaming | RNN-T | Audio Stream | Text Stream |
| Fast TTS | FastSpeech 2 | Text | Mel-Spec |
| SSL | Wav2Vec 2.0 | Masked Audio | Quantized Vector |
| E2E SLU | SLU-BERT | Audio | Intent/Slots |
| Troubleshooting | VAD, Gradient Clipping | - | - |
| Future | AudioLM | Discrete Tokens | Discrete Tokens |
Originally published at: arunbaby.com/speech-tech/0037-sequence-to-sequence-speech