Speech Architecture Search
Design neural architecture search systems for speech models that automatically discover optimal ASR/TTS architectures, using dynamic programming and path optimization to navigate exponential search spaces.
TL;DR
Speech architecture search automates the discovery of ASR/TTS model architectures that balance accuracy, latency, and size. The search space spans encoder types (Conformer, Transformer, LSTM), decoder types (CTC, RNN-T), attention mechanisms, and feature configs. Evolutionary algorithms and RL controllers explore this space efficiently, with progressive evaluation filtering 80% of bad candidates early. Google’s NAS found a 4-layer Conformer achieving 5.2% WER at 85ms latency in 80 GPU days. For serving the discovered architectures, see compute allocation for speech models, and for ensemble strategies that combine multiple architectures, see multi-model speech ensemble.

Problem Statement
Design a Speech Architecture Search System that:
- Automatically discovers ASR/TTS architectures optimized for accuracy, latency, and size
- Searches efficiently through speech-specific architecture spaces (encoders, decoders, attention)
- Handles speech constraints (streaming, long sequences, variable-length inputs)
- Optimizes for deployment (mobile, edge, server, different hardware)
- Supports multi-objective optimization (WER, latency, params, multilingual capability)
Functional Requirements
- Speech-specific search spaces:
- Encoder architectures (Conformer, Transformer, RNN, CNN)
- Decoder types (CTC, RNN-T, attention-based)
- Attention mechanisms (self-attention, cross-attention, relative positional)
- Feature extraction configs (mel-spec, MFCC, learnable features)
- Search strategies:
- Reinforcement learning
- Evolutionary algorithms
- Differentiable NAS (DARTS for speech)
- Bayesian optimization
- Transfer from vision NAS results
- Performance estimation:
- Train on subset of data (LibriSpeech-100h vs 960h)
- Early stopping based on validation WER
- Weight sharing across architectures
- WER prediction from architecture features
- Multi-objective optimization:
- WER vs latency (for real-time ASR)
- WER vs model size (for on-device)
- WER vs RTF (real-time factor)
- Multi-lingual capability vs params
- Streaming-aware search:
- Architectures must support chunk-wise processing
- Latency measured per chunk, not full utterance
- Look-ahead constraints (for causal models)
- Evaluation:
- WER/CER on multiple test sets
- Latency measurement on target hardware
- Parameter count and memory footprint
- Multi-lingual evaluation
Non-Functional Requirements
- Efficiency: Find good architecture in <50 GPU days
- Quality: WER competitive with hand-designed models
- Generalizability: Transfer across languages and domains
- Reproducibility: Same search produces same results
- Practicality: Discovered models deployable in production
Understanding the Requirements
Why Speech Architecture Search?
Manual speech model design challenges:
- Requires domain expertise (speech signal processing + deep learning)
- Hard to balance accuracy, latency, and size
- Difficult to optimize for specific hardware (mobile, server)
- Time-consuming to explore alternative designs
Speech NAS enables:
- Automated discovery of novel architectures
- Hardware-specific optimization (mobile, edge TPU, server GPU)
- Multi-lingual model optimization
- Systematic exploration of design space
Speech Architecture Challenges
- Long sequences: Audio is 100s-1000s of frames (vs images ~224×224)
- Temporal modeling: Need strong sequential modeling (RNNs, Transformers)
- Streaming requirements: Many applications need real-time processing
- Variable length: Utterances vary from 1s to 60s+
- Multi-lingual: Same architecture should work across languages
The Path Optimization Connection
Just like Unique Paths uses DP to count paths through a grid:
| Unique Paths | Neural Arch Search | Speech Arch Search |
|---|---|---|
| m×n grid | General model space | Speech-specific space |
| Count paths | Evaluate architectures | Evaluate speech models |
| DP: paths(i,j) = paths(i-1,j) + paths(i,j-1) | DP: Build from sub-architectures | DP: Build from encoder/decoder blocks |
| O(m×n) from O(2^(m+n)) | Polynomial from exponential | Efficient from exhaustive |
| Reconstruct optimal path | Extract best architecture | Extract best speech model |
Both use DP and path optimization to navigate exponentially large spaces.
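To make the analogy concrete, here is the Unique Paths recurrence from the table, implemented with a one-row DP table — the same collapse from exponential enumeration to polynomial time that NAS strategies aim for:

```python
def unique_paths(m: int, n: int) -> int:
    """Count monotone paths through an m x n grid with a 1-D DP table.

    Recurrence: paths(i, j) = paths(i-1, j) + paths(i, j-1).
    Naive enumeration is O(2^(m+n)); this runs in O(m*n).
    """
    row = [1] * n  # paths along the top row are all 1
    for _ in range(1, m):
        for j in range(1, n):
            row[j] += row[j - 1]  # paths from above + paths from the left
    return row[-1]

print(unique_paths(3, 3))  # 6
print(unique_paths(3, 7))  # 28
```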
High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Speech Architecture Search System │
└─────────────────────────────────────────────────────────────────┘
Search Controller
┌────────────────────────────────────┐
│ Strategy: RL / EA / DARTS │
│ - Propose speech architectures │
│ - Encoder + Decoder + Attention │
└──────────────┬─────────────────────┘
│
┌──────▼──────┐
│ Speech │
│ Search │
│ Space │
│ │
│ - Encoder │
│ - Decoder │
│ - Attention │
│ - Features │
└──────┬──────┘
│
┌──────────────┼──────────────┐
│ │ │
┌───────▼────────┐ ┌──▼────┐ ┌──────▼──────┐
│ Architecture │ │ WER │ │ Latency │
│ Evaluator │ │Predict│ │ Predictor │
│ │ │ │ │ │
│ - Train ASR/TTS│ │ - Skip│ │ - Hardware │
│ - Measure WER │ │ bad │ │ profile │
│ - Measure RTF │ │ archs│ │ - RTF est │
└───────┬────────┘ └───────┘ └──────┬──────┘
│ │
└─────────────┬─────────────┘
│
┌─────────▼────────┐
│ Distributed │
│ Training │
│ - Worker pool │
│ - GPU cluster │
│ - Multi-task eval│
└─────────┬────────┘
│
┌─────────▼────────┐
│ Results Database │
│ - Architectures │
│ - WER scores │
│ - Latency │
│ - Pareto front │
└──────────────────┘
Key Components
- Search Controller: Proposes speech architectures
- Speech Search Space: Defines encoder/decoder/attention options
- Architecture Evaluator: Trains and measures WER/latency
- Performance Predictors: Estimate WER and latency without full training
- Distributed Training: Parallel architecture evaluation
- Results Database: Track all evaluated architectures
Component Deep-Dives
1. Speech-Specific Search Space
Define search space for ASR models:
import random
from dataclasses import dataclass
from enum import Enum


class EncoderType(Enum):
    """Encoder architecture options."""
    CONFORMER = "conformer"
    TRANSFORMER = "transformer"
    LSTM = "lstm"
    BLSTM = "blstm"
    CNN_LSTM = "cnn_lstm"
    CONTEXTNET = "contextnet"


class DecoderType(Enum):
    """Decoder architecture options."""
    CTC = "ctc"
    RNN_T = "rnn_t"
    ATTENTION = "attention"
    TRANSFORMER_DECODER = "transformer_decoder"


class AttentionType(Enum):
    """Attention mechanism options."""
    MULTI_HEAD = "multi_head"
    RELATIVE = "relative"
    LOCAL = "local"
    EFFICIENT = "efficient_attention"


@dataclass
class SpeechArchConfig:
    """
    Speech model architecture configuration.

    Similar to choosing a path in Unique Paths:
    - Each choice (encoder, decoder, etc.) is like a move
    - The combination forms a complete architecture (path)
    """
    # Encoder config
    encoder_type: EncoderType
    encoder_layers: int
    encoder_dim: int
    encoder_heads: int  # For attention-based encoders

    # Decoder config
    decoder_type: DecoderType
    decoder_layers: int
    decoder_dim: int

    # Attention config
    attention_type: AttentionType
    attention_dim: int

    # Feature extraction
    n_mels: int

    def count_parameters(self) -> int:
        """Estimate parameter count (simplified; ignores embeddings and biases)."""
        encoder_params = self.encoder_layers * (self.encoder_dim ** 2) * 4
        decoder_params = self.decoder_layers * (self.decoder_dim ** 2) * 4
        return encoder_params + decoder_params

    def estimate_flops(self, sequence_length: int = 1000) -> int:
        """Estimate FLOPs for a sequence of the given length."""
        # Encoder: self-attention is O(L^2 * D) per layer
        encoder_flops = sequence_length ** 2 * self.encoder_dim * self.encoder_layers
        # Decoder: roughly linear in sequence length
        decoder_flops = sequence_length * self.decoder_dim * self.decoder_layers
        return encoder_flops + decoder_flops


class SpeechSearchSpace:
    """
    Search space for speech architectures.

    Similar to the grid in Unique Paths:
    - Dimensions: encoder × decoder × attention × features
    - Each dimension has multiple choices
    - The total space is exponential
    """

    def __init__(self):
        # Define choices
        self.encoder_types = list(EncoderType)
        self.decoder_types = list(DecoderType)
        self.encoder_layer_options = [4, 6, 8, 12]
        self.encoder_dim_options = [256, 512, 768]
        self.decoder_layer_options = [1, 2, 4]

    def count_total_architectures(self) -> int:
        """Count total architectures (like counting paths)."""
        return (
            len(self.encoder_types)
            * len(self.encoder_layer_options)
            * len(self.encoder_dim_options)
            * len(self.decoder_types)
            * len(self.decoder_layer_options)
        )

    def sample_random_architecture(self) -> SpeechArchConfig:
        """Sample a random architecture from the space."""
        return SpeechArchConfig(
            encoder_type=random.choice(self.encoder_types),
            encoder_layers=random.choice(self.encoder_layer_options),
            encoder_dim=random.choice(self.encoder_dim_options),
            encoder_heads=8,  # Fixed for simplicity
            decoder_type=random.choice(self.decoder_types),
            decoder_layers=random.choice(self.decoder_layer_options),
            decoder_dim=256,  # Fixed
            attention_type=AttentionType.MULTI_HEAD,
            attention_dim=256,
            n_mels=80,
        )
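As a sanity check before launching a search, the size of the discrete space above is just the product of the per-dimension choice counts — and this is with most knobs (heads, decoder dim, attention type, features) held fixed:

```python
# Choice counts taken from the search space above
options = {
    "encoder_type": 6,    # conformer, transformer, lstm, blstm, cnn_lstm, contextnet
    "encoder_layers": 4,  # [4, 6, 8, 12]
    "encoder_dim": 3,     # [256, 512, 768]
    "decoder_type": 4,    # ctc, rnn_t, attention, transformer_decoder
    "decoder_layers": 3,  # [1, 2, 4]
}

total = 1
for count in options.values():
    total *= count

print(total)  # 864 combinations even in this heavily restricted space
```

Freeing the fixed knobs multiplies this further, which is why exhaustive evaluation is off the table.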
2. Architecture Evaluation
from typing import Dict

import torch
import torch.nn as nn


def build_speech_model(config: SpeechArchConfig) -> nn.Module:
    """
    Build a speech model from a configuration.

    Args:
        config: Architecture configuration

    Returns:
        PyTorch model
    """
    # This would integrate with ESPnet or a custom implementation.
    # Simplified example: `create_encoder`, `create_rnnt_decoder`,
    # `create_attention_decoder`, `SpeechModel`, and `num_tokens`
    # (the output vocabulary size) are assumed to be defined elsewhere.
    if config.encoder_type == EncoderType.CONFORMER:
        from espnet.nets.pytorch_backend.conformer.encoder import Encoder
        encoder = Encoder(
            idim=config.n_mels,
            attention_dim=config.encoder_dim,
            attention_heads=config.encoder_heads,
            linear_units=config.encoder_dim * 4,
            num_blocks=config.encoder_layers,
        )
    elif config.encoder_type == EncoderType.TRANSFORMER:
        from espnet.nets.pytorch_backend.transformer.encoder import Encoder
        encoder = Encoder(
            idim=config.n_mels,
            attention_dim=config.encoder_dim,
            attention_heads=config.encoder_heads,
            linear_units=config.encoder_dim * 4,
            num_blocks=config.encoder_layers,
        )
    else:
        # LSTM, CNN-LSTM, etc.
        encoder = create_encoder(config)

    # Build decoder
    if config.decoder_type == DecoderType.CTC:
        decoder = nn.Linear(config.encoder_dim, num_tokens)
    elif config.decoder_type == DecoderType.RNN_T:
        decoder = create_rnnt_decoder(config)
    else:
        decoder = create_attention_decoder(config)

    # Combine into full model
    return SpeechModel(encoder=encoder, decoder=decoder)


def evaluate_speech_architecture(
    config: SpeechArchConfig,
    train_subset: str = "librispeech-100h",
    val_subset: str = "librispeech-dev",
    max_epochs: int = 20,
) -> Dict:
    """
    Evaluate a speech architecture.

    Args:
        config: Architecture to evaluate
        train_subset: Training data subset
        val_subset: Validation data
        max_epochs: Max training epochs

    Returns:
        Dictionary with WER, latency, params, etc.
    """
    # `train_and_evaluate`, `measure_inference_latency`, and `measure_rtf`
    # are project-level helpers, sketched elsewhere.
    model = build_speech_model(config)

    # Count parameters
    num_params = sum(p.numel() for p in model.parameters())

    # Train
    best_wer = train_and_evaluate(
        model,
        train_data=train_subset,
        val_data=val_subset,
        max_epochs=max_epochs,
    )

    # Measure latency and RTF (real-time factor)
    latency_ms = measure_inference_latency(model)
    rtf = measure_rtf(model)

    return {
        "config": config,
        "wer": best_wer,
        "latency_ms": latency_ms,
        "rtf": rtf,
        "params": num_params,
        "flops": config.estimate_flops(),
    }
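The `measure_rtf` helper used above is assumed; a minimal hardware-agnostic version times a transcription callable and divides by the audio duration. This is a sketch — production measurement would pin threads, fix batch size, run on the target device, and report percentiles rather than a mean:

```python
import time


def measure_rtf(transcribe, audio_seconds: float, runs: int = 5) -> float:
    """Real-time factor: processing time divided by audio duration.

    RTF < 1.0 means the model keeps up with real time.
    `transcribe` is any callable that processes one utterance.
    """
    transcribe()  # warm-up run (JIT, caches) excluded from timing
    start = time.perf_counter()
    for _ in range(runs):
        transcribe()
    elapsed = (time.perf_counter() - start) / runs
    return elapsed / audio_seconds


# Example with a dummy "model" that burns 10 ms per 1 s utterance
rtf = measure_rtf(lambda: time.sleep(0.01), audio_seconds=1.0)
print(f"RTF ~ {rtf:.3f}")
```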
3. Search Strategy for Speech
import random
from dataclasses import replace


class SpeechNASController:
    """
    NAS controller for speech architectures.

    Uses DP-like building:
    - Build encoder → choose decoder → optimize jointly
    - Like building a path: choose a direction at each step
    """

    def __init__(self, search_space: SpeechSearchSpace):
        self.search_space = search_space
        self.evaluated_archs = {}  # cache keyed by encode_architecture(arch)
        self.best_archs = []

    def search_with_evolutionary(self, population_size: int = 20, generations: int = 50):
        """
        Evolutionary search for speech architectures.

        Similar to exploring paths in Unique Paths:
        - Generate population (multiple paths)
        - Evaluate fitness (WER)
        - Mutate and crossover (create new paths)
        - Select best (optimal paths)
        """
        # Initialize population
        population = [
            self.search_space.sample_random_architecture()
            for _ in range(population_size)
        ]

        for generation in range(generations):
            # Evaluate all architectures, caching by encoded key
            fitness_scores = []
            for arch in population:
                key = encode_architecture(arch)  # hashable key; defined elsewhere
                if key not in self.evaluated_archs:
                    self.evaluated_archs[key] = evaluate_speech_architecture(arch)
                result = self.evaluated_archs[key]
                fitness = 1.0 / (result['wer'] + 0.01)  # Lower WER = higher fitness
                fitness_scores.append((arch, fitness, result))

            # Sort by fitness
            fitness_scores.sort(key=lambda x: x[1], reverse=True)

            # Track the best of this generation
            self.best_archs.append(fitness_scores[0])

            # Selection: keep top 50%
            survivors = [arch for arch, _, _ in fitness_scores[:population_size // 2]]

            # Mutation and crossover to create the next generation
            offspring = []
            while len(offspring) < population_size // 2:
                # Select parents
                parent1 = random.choice(survivors)
                parent2 = random.choice(survivors)
                # Crossover, then occasional mutation
                child = self._crossover(parent1, parent2)
                if random.random() < 0.3:
                    child = self._mutate(child)
                offspring.append(child)

            # New population
            population = survivors + offspring

        # Return best architecture found
        best = max(self.best_archs, key=lambda x: x[1])
        return best[0], best[2]

    def _crossover(self, arch1: SpeechArchConfig, arch2: SpeechArchConfig) -> SpeechArchConfig:
        """Crossover two architectures: randomly inherit each field from a parent."""
        return SpeechArchConfig(
            encoder_type=random.choice([arch1.encoder_type, arch2.encoder_type]),
            encoder_layers=random.choice([arch1.encoder_layers, arch2.encoder_layers]),
            encoder_dim=random.choice([arch1.encoder_dim, arch2.encoder_dim]),
            encoder_heads=random.choice([arch1.encoder_heads, arch2.encoder_heads]),
            decoder_type=random.choice([arch1.decoder_type, arch2.decoder_type]),
            decoder_layers=random.choice([arch1.decoder_layers, arch2.decoder_layers]),
            decoder_dim=random.choice([arch1.decoder_dim, arch2.decoder_dim]),
            attention_type=random.choice([arch1.attention_type, arch2.attention_type]),
            attention_dim=random.choice([arch1.attention_dim, arch2.attention_dim]),
            n_mels=random.choice([arch1.n_mels, arch2.n_mels]),
        )

    def _mutate(self, arch: SpeechArchConfig) -> SpeechArchConfig:
        """Mutate architecture: randomly change one component."""
        mutation_choice = random.randint(0, 3)
        if mutation_choice == 0:
            # Mutate encoder depth
            return replace(arch, encoder_layers=random.choice(self.search_space.encoder_layer_options))
        elif mutation_choice == 1:
            # Mutate encoder width
            return replace(arch, encoder_dim=random.choice(self.search_space.encoder_dim_options))
        elif mutation_choice == 2:
            # Mutate decoder depth
            return replace(arch, decoder_layers=random.choice(self.search_space.decoder_layer_options))
        else:
            # Mutate encoder type
            return replace(arch, encoder_type=random.choice(self.search_space.encoder_types))
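The controller caches results under `encode_architecture(arch)`, which is not defined above. A minimal version (assuming the config is a flat dataclass, as `SpeechArchConfig` is) flattens the fields into a hashable tuple, mapping enum members to their values so keys stay stable across serialization:

```python
from dataclasses import astuple, dataclass
from enum import Enum


def encode_architecture(arch) -> tuple:
    """Flatten a dataclass config into a hashable cache key."""
    return tuple(
        field.value if isinstance(field, Enum) else field
        for field in astuple(arch)
    )


# Toy config standing in for SpeechArchConfig
class Enc(Enum):
    LSTM = "lstm"


@dataclass
class ToyConfig:
    encoder_type: Enc
    encoder_layers: int


key = encode_architecture(ToyConfig(Enc.LSTM, 4))
print(key)  # ('lstm', 4)
```

Because the key is deterministic, two structurally identical configs sampled in different generations hit the same cache entry.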
4. Multi-Objective Optimization
from typing import Dict, List, Optional


class MultiObjectiveSpeechNAS:
    """
    Multi-objective NAS for speech.

    Optimize for:
    - WER (minimize)
    - Latency (minimize)
    - Model size (minimize)

    Find the Pareto frontier of optimal trade-offs.
    """

    def __init__(self, search_space: SpeechSearchSpace):
        self.search_space = search_space
        self.pareto_front = []

    def search(self, num_candidates: int = 100):
        """Search for Pareto-optimal architectures."""
        evaluated = []
        for _ in range(num_candidates):
            # Sample and evaluate an architecture
            arch = self.search_space.sample_random_architecture()
            result = evaluate_speech_architecture(arch)
            evaluated.append({
                "arch": arch,
                "wer": result['wer'],
                "latency": result['latency_ms'],
                "params": result['params'],
            })

        # Find the Pareto frontier
        self.pareto_front = self._compute_pareto_front(evaluated)
        return self.pareto_front

    def _compute_pareto_front(self, candidates: List[Dict]) -> List[Dict]:
        """
        Compute the Pareto frontier.

        An architecture is Pareto-optimal if no other architecture
        is better in all objectives.
        """
        pareto = []
        for i, cand1 in enumerate(candidates):
            is_dominated = False
            for j, cand2 in enumerate(candidates):
                if i == j:
                    continue
                # cand2 dominates cand1 if it is better or equal in all
                # objectives and strictly better in at least one
                if (cand2['wer'] <= cand1['wer'] and
                        cand2['latency'] <= cand1['latency'] and
                        cand2['params'] <= cand1['params'] and
                        (cand2['wer'] < cand1['wer'] or
                         cand2['latency'] < cand1['latency'] or
                         cand2['params'] < cand1['params'])):
                    is_dominated = True
                    break
            if not is_dominated:
                pareto.append(cand1)
        return pareto

    def select_for_target(self, max_latency_ms: float, max_params: int) -> Optional[Dict]:
        """
        Select the best architecture meeting constraints.

        Args:
            max_latency_ms: Maximum acceptable latency
            max_params: Maximum model size

        Returns:
            Best architecture meeting constraints, or None
        """
        candidates = [
            arch for arch in self.pareto_front
            if arch['latency'] <= max_latency_ms and arch['params'] <= max_params
        ]
        if not candidates:
            return None
        # Return the lowest WER among feasible candidates
        return min(candidates, key=lambda x: x['wer'])
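The constraint-based selection step can be illustrated on toy numbers (the architectures and scores below are made up for the example, not measured results):

```python
# Toy Pareto front: WER (%), latency (ms), parameter count
pareto_front = [
    {"name": "tiny",   "wer": 8.1, "latency": 30,  "params": 10e6},
    {"name": "mobile", "wer": 5.9, "latency": 80,  "params": 45e6},
    {"name": "server", "wer": 4.8, "latency": 210, "params": 120e6},
]


def select_for_target(front, max_latency_ms, max_params):
    """Lowest-WER architecture satisfying both deployment constraints."""
    feasible = [a for a in front
                if a["latency"] <= max_latency_ms and a["params"] <= max_params]
    return min(feasible, key=lambda a: a["wer"]) if feasible else None


# On-device budget: under 100 ms and 50M params → "mobile" wins
choice = select_for_target(pareto_front, max_latency_ms=100, max_params=50e6)
print(choice["name"])  # mobile
```

The server-class model has the best WER but is infeasible under the latency budget, which is exactly why selection happens on the frontier rather than by WER alone.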
Scaling Strategies
Efficient Evaluation
1. Progressive training:
def progressive_evaluation(arch: SpeechArchConfig):
    """
    Evaluate an architecture progressively.

    Start with a small dataset and short training;
    only continue if the candidate looks promising.
    """
    # Stage 1: train on LibriSpeech-100h for 5 epochs
    wer_stage1 = quick_train(arch, data="librispeech-100h", epochs=5)
    if wer_stage1 > 0.20:  # 20% WER threshold
        return {"wer": wer_stage1, "early_stopped": True}

    # Stage 2: train on LibriSpeech-100h for 20 epochs
    wer_stage2 = quick_train(arch, data="librispeech-100h", epochs=20)
    if wer_stage2 > 0.10:
        return {"wer": wer_stage2, "early_stopped": True}

    # Stage 3: full training on LibriSpeech-960h
    wer_final = full_train(arch, data="librispeech-960h", epochs=100)
    return {"wer": wer_final, "early_stopped": False}
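The compute savings from staged filtering are easy to estimate. The per-stage costs and pass rates below are illustrative assumptions, not measured numbers:

```python
def expected_cost_per_candidate(stages):
    """Expected GPU-hours per candidate under staged filtering.

    `stages` is a list of (gpu_hours, pass_rate) pairs; a candidate
    only reaches stage k if it passed all earlier stages.
    """
    expected, survive = 0.0, 1.0
    for gpu_hours, pass_rate in stages:
        expected += survive * gpu_hours  # everyone who got here pays this stage
        survive *= pass_rate             # fraction advancing to the next stage
    return expected


# Assumed: 2 GPU-h screen (50% pass), 8 GPU-h (25% pass), 200 GPU-h full train
staged = expected_cost_per_candidate([(2, 0.5), (8, 0.25), (200, 0.0)])
flat = 200  # every candidate fully trained

print(staged)  # 31.0 GPU-hours expected, vs 200 without filtering
```

With these (assumed) rates, staged evaluation cuts expected cost per candidate by roughly 6×, which is how searches fit into the 50-100 GPU-day budgets cited above.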
2. Weight sharing (supernet for speech):
class SpeechSuperNet(nn.Module):
    """
    Super-network for speech NAS.

    Contains all possible operations;
    different architectures share weights.
    """

    def __init__(self, search_space: SpeechSearchSpace):
        super().__init__()
        # Create all encoder options (`create_encoder` and `create_decoder`
        # are factory helpers assumed to be defined elsewhere)
        self.encoders = nn.ModuleDict({
            enc_type.value: create_encoder(enc_type, max_layers=12, max_dim=768)
            for enc_type in EncoderType
        })
        # Create all decoder options
        self.decoders = nn.ModuleDict({
            dec_type.value: create_decoder(dec_type)
            for dec_type in DecoderType
        })

    def forward(self, audio_features: torch.Tensor, arch: SpeechArchConfig):
        """Forward pass with a specific sampled architecture."""
        # Select encoder and decoder for this architecture
        encoder = self.encoders[arch.encoder_type.value]
        decoder = self.decoders[arch.decoder_type.value]
        # Forward pass
        encoder_out = encoder(audio_features)
        return decoder(encoder_out)
Real-World Case Study: Google’s Speech NAS
Google’s Approach for Mobile ASR
Goal: Find ASR architecture for on-device deployment with <100ms latency.
Search space:
- Encoder: RNN, LSTM, GRU, Conformer variants
- Layers: 2-8
- Hidden dim: 128-512
- Decoder: CTC, RNN-T
Search strategy:
- Reinforcement learning controller
- Multi-objective: WER + latency + model size
- Progressive training (100h → 960h dataset)
Results:
- Discovered architecture: 4-layer Conformer + RNN-T
- WER: 5.2% on LibriSpeech test-clean (vs 6.1% baseline)
- Latency: 85ms on Pixel 6 (vs 120ms baseline)
- Size: 45M params (vs 80M baseline LSTM)
- Search cost: 80 GPU days (vs months of manual tuning)
Key insights:
- Conformer with fewer layers beats deep LSTM
- RNN-T decoder better latency than attention for streaming
- Smaller models with better architecture beat larger hand-designed ones
Lessons Learned
- Speech-specific constraints matter: Streaming, variable length, long sequences
- Multi-objective is essential: Can’t just optimize WER
- Progressive evaluation saves compute: 80% of candidates filtered early
- Transfer works: ImageNet NAS insights transfer to speech (depth vs width)
- Hardware-in-the-loop: Measure latency on actual target device
Cost Analysis
NAS vs Manual Design
| Approach | Time | GPU Cost | Quality (WER) | Notes |
|---|---|---|---|---|
| Manual design | 6 months | 50 GPU days | 6.5% | Expert-dependent |
| Random search | N/A | 500 GPU days | 7.0% | Baseline |
| Evolutionary NAS | 2 months | 100 GPU days | 5.8% | Robust |
| RL-based NAS | 1 month | 80 GPU days | 5.2% | Google’s approach |
| DARTS for speech | 2 weeks | 10 GPU days | 6.0% | Fast but less stable |
| Transfer + fine-tune | 1 week | 5 GPU days | 5.5% | Use vision NAS results |
ROI:
- Manual: $120K (engineer time) + $15K (GPUs) = $135K
- NAS: $40K (engineer time) + $24K (GPUs) = $64K
- Savings: $71K, plus a better model and faster iteration
Advanced Topics
1. Multi-Lingual NAS
Search for architectures that work across languages:
def multi_lingual_nas(languages: List[str] = ["en", "zh", "es"]):
    """
    Search for an architecture that works well across languages.

    Fitness = average WER across all languages.
    """
    def evaluate_multilingual(arch: SpeechArchConfig) -> float:
        wers = []
        for lang in languages:
            wer = train_and_evaluate(
                arch,
                train_data=f"common_voice_{lang}",
                val_data=f"common_voice_{lang}_dev",
            )
            wers.append(wer)
        # Average WER across languages
        return sum(wers) / len(wers)

    # Search with the multi-lingual fitness function
    # ... (use evolutionary or RL search)
2. Streaming-Aware NAS
Optimize for streaming ASR:
def streaming_aware_evaluation(arch: SpeechArchConfig) -> Dict:
    """
    Evaluate an architecture for streaming capability.

    Metrics:
    - Per-chunk latency (not full utterance)
    - Look-ahead requirement
    - Chunk size vs WER trade-off
    """
    model = build_speech_model(arch)

    # Test streaming performance
    chunk_size_ms = 100  # 100ms chunks
    chunk_latency = measure_chunk_latency(model, chunk_size_ms)
    streaming_wer = evaluate_streaming_wer(model, chunk_size_ms)

    return {
        "chunk_latency_ms": chunk_latency,
        "streaming_wer": streaming_wer,
        "supports_streaming": chunk_latency < chunk_size_ms,
    }
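A useful companion number is the user-perceived emission latency, which combines chunk buffering, look-ahead audio, and per-chunk compute. The formula below is a simplification (real systems overlap buffering with compute), but it shows why look-ahead is a latency budget item, not a free accuracy boost:

```python
def streaming_latency_ms(chunk_ms: float, lookahead_ms: float, compute_ms: float) -> float:
    """Worst-case emission latency for chunked streaming ASR.

    A token can only be emitted after its chunk plus any look-ahead
    audio has been buffered, and the chunk has been processed.
    """
    if compute_ms >= chunk_ms:
        raise ValueError("model cannot keep up with real time")
    return chunk_ms + lookahead_ms + compute_ms


# 100 ms chunks, 40 ms look-ahead, 25 ms compute per chunk
latency = streaming_latency_ms(100, 40, 25)
print(latency)  # 165.0 ms worst case
```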
3. Transfer from Vision NAS
Leverage insights from ImageNet NAS:
def transfer_vision_to_speech(vision_arch_config):
    """
    Transfer successful vision architectures to speech.

    Example: EfficientNet principles → EfficientConformer
    - Depth scaling
    - Width scaling
    - Compound scaling
    """
    # Extract architectural principles
    depth_factor = vision_arch_config.depth_coefficient
    width_factor = vision_arch_config.width_coefficient

    # Apply them to speech
    return SpeechArchConfig(
        encoder_type=EncoderType.CONFORMER,
        encoder_layers=int(6 * depth_factor),
        encoder_dim=int(256 * width_factor),
        encoder_heads=8,
        decoder_type=DecoderType.RNN_T,
        decoder_layers=2,
        decoder_dim=256,
        attention_type=AttentionType.RELATIVE,
        attention_dim=256,
        n_mels=80,
    )
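The compound-scaling idea can be sketched directly. The α/β values below are EfficientNet's published depth/width coefficients; whether they transfer unchanged to Conformer encoders is exactly the kind of question a speech-specific search would answer:

```python
def compound_scale(base_layers: int, base_dim: int, phi: int,
                   alpha: float = 1.2, beta: float = 1.1):
    """Scale depth and width together by one compound coefficient phi.

    alpha/beta are EfficientNet's depth/width coefficients, used here
    as an assumed starting point for a speech encoder.
    """
    layers = round(base_layers * alpha ** phi)
    dim = round(base_dim * beta ** phi)
    return layers, dim


# Scale a 6-layer, 256-dim Conformer encoder up two notches
print(compound_scale(6, 256, phi=0))  # (6, 256)
print(compound_scale(6, 256, phi=2))  # (9, 310)
```

A single `phi` knob gives a family of related architectures, which is cheaper to sweep than searching depth and width independently.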
Monitoring & Debugging
Key Metrics
Search Progress:
- Best WER found so far vs iterations
- Pareto frontier evolution
- Architecture diversity (entropy of designs explored)
- GPU utilization during search
Architecture Analysis:
- Most common encoder/decoder types in top performers
- Depth vs width trade-offs
- Correlation between architecture features and WER
Resource Tracking:
- Total GPU hours consumed
- Average training time per architecture
- Early stopping rate (% of archs stopped early)
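The architecture-diversity metric mentioned above can be computed as the Shannon entropy of, say, the encoder-type distribution among explored candidates — a quick way to detect a search collapsing onto one design family:

```python
import math
from collections import Counter


def design_entropy(encoder_types) -> float:
    """Shannon entropy (bits) of the encoder-type distribution.

    0.0 means the search has collapsed onto one design family;
    log2(k) means it still explores k families uniformly.
    """
    counts = Counter(encoder_types)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Early in the search: diverse; later: collapsing onto conformer
early = design_entropy(["conformer", "lstm", "transformer", "blstm"])
late = design_entropy(["conformer", "conformer", "conformer", "lstm"])
print(early, late)  # 2.0 vs ~0.81 bits
```

A steadily falling entropy is expected late in a search; an early collapse usually signals too-aggressive selection pressure or a fitness bug.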
Debugging Tools
- Visualize architecture graphs
- Compare top-N architectures side-by-side
- Ablation studies (which components matter most?)
- Error analysis (where do discovered archs fail?)
Key Takeaways
✅ Speech NAS automates architecture design for ASR/TTS models
✅ Search space is exponential - like paths in a grid, need smart search
✅ DP and smart search make NAS practical - from infeasible to 50-100 GPU days
✅ Multi-objective optimization essential - WER, latency, size must be balanced
✅ Progressive evaluation saves compute - filter bad candidates early
✅ Weight sharing (supernet) enables evaluating 1000s of architectures
✅ Speech-specific constraints - streaming, variable length, multi-lingual
✅ Transfer from vision accelerates speech NAS
✅ Hardware-aware search critical for deployment
✅ Same DP pattern as Unique Paths - build optimal solution from sub-solutions
Connection to Thematic Link: Dynamic Programming and Path Optimization
All three topics use DP to optimize paths through exponential spaces:
DSA (Unique Paths):
- Navigate m×n grid using DP
- Recurrence: paths(i,j) = paths(i-1,j) + paths(i,j-1)
- Build solution from optimal sub-solutions
ML System Design (Neural Architecture Search):
- Navigate exponential architecture space
- Use DP/RL/gradient methods to find optimal
- Build full model from optimal components
Speech Tech (Speech Architecture Search):
- Navigate encoder×decoder×attention space
- Use DP-inspired search to find optimal speech models
- Build ASR/TTS from optimal sub-architectures
The unifying principle: decompose exponentially large search spaces into manageable subproblems, solve optimally using DP or DP-inspired methods, and construct the best overall solution.
FAQ
Why is neural architecture search useful for speech models specifically?
Speech models face unique constraints like streaming requirements, variable-length inputs up to 60+ seconds, and long sequences of hundreds to thousands of frames that make manual architecture design particularly difficult. NAS can systematically explore encoder, decoder, and attention combinations while simultaneously optimizing for WER, latency, and model size across target hardware.
How does progressive evaluation make speech NAS practical?
Progressive evaluation trains candidate architectures in stages: first on a small data subset (e.g., LibriSpeech-100h) for 5 epochs, then for 20 epochs, and only promising candidates receive full training on the complete dataset. This filters out 80% of bad architectures before expensive training, reducing total search cost from hundreds to 50-100 GPU days.
What is multi-objective NAS and how does the Pareto frontier work?
Multi-objective NAS optimizes for multiple goals simultaneously – minimizing WER, latency, and model size. The Pareto frontier contains all architectures where no other design is better in every objective. Engineers select from this frontier based on deployment constraints, choosing the lowest-WER architecture that meets their maximum latency and model size requirements.
What results has Google achieved with speech NAS?
Google’s mobile ASR NAS discovered a 4-layer Conformer plus RNN-T architecture that achieved 5.2% WER on LibriSpeech test-clean (vs 6.1% baseline), 85ms latency on Pixel 6 (vs 120ms baseline), and 45M parameters (vs 80M baseline LSTM). The search took 80 GPU days, costing approximately $64K total including engineer time, versus $135K for manual design.
Originally published at: arunbaby.com/speech-tech/0021-speech-architecture-search