19 minute read

Design neural architecture search systems for speech models that automatically discover optimal ASR/TTS architectures, using dynamic programming and path optimization to navigate exponential search spaces.

TL;DR

Speech architecture search automates the discovery of ASR/TTS model architectures that balance accuracy, latency, and size. The search space spans encoder types (Conformer, Transformer, LSTM), decoder types (CTC, RNN-T), attention mechanisms, and feature configs. Evolutionary algorithms and RL controllers explore this space efficiently, with progressive evaluation filtering 80% of bad candidates early. Google’s NAS found a 4-layer Conformer achieving 5.2% WER at 85ms latency in 80 GPU days. For serving the discovered architectures, see compute allocation for speech models, and for ensemble strategies that combine multiple architectures, see multi-model speech ensemble.

A silicon wafer under a microscope with multiple different chip layout patterns visible in adjacent die

Problem Statement

Design a Speech Architecture Search System that:

  1. Automatically discovers ASR/TTS architectures optimized for accuracy, latency, and size
  2. Searches efficiently through speech-specific architecture spaces (encoders, decoders, attention)
  3. Handles speech constraints (streaming, long sequences, variable-length inputs)
  4. Optimizes for deployment (mobile, edge, server, different hardware)
  5. Supports multi-objective optimization (WER, latency, params, multilingual capability)

Functional Requirements

  1. Speech-specific search spaces:
    • Encoder architectures (Conformer, Transformer, RNN, CNN)
    • Decoder types (CTC, RNN-T, attention-based)
    • Attention mechanisms (self-attention, cross-attention, relative positional)
    • Feature extraction configs (mel-spec, MFCC, learnable features)
  2. Search strategies:
    • Reinforcement learning
    • Evolutionary algorithms
    • Differentiable NAS (DARTS for speech)
    • Bayesian optimization
    • Transfer from vision NAS results
  3. Performance estimation:
    • Train on subset of data (LibriSpeech-100h vs 960h)
    • Early stopping based on validation WER
    • Weight sharing across architectures
    • WER prediction from architecture features
  4. Multi-objective optimization:
    • WER vs latency (for real-time ASR)
    • WER vs model size (for on-device)
    • WER vs RTF (real-time factor)
    • Multi-lingual capability vs params
  5. Streaming-aware search:
    • Architectures must support chunk-wise processing
    • Latency measured per chunk, not full utterance
    • Look-ahead constraints (for causal models)
  6. Evaluation:
    • WER/CER on multiple test sets
    • Latency measurement on target hardware
    • Parameter count and memory footprint
    • Multi-lingual evaluation

Non-Functional Requirements

  1. Efficiency: Find good architecture in <50 GPU days
  2. Quality: WER competitive with hand-designed models
  3. Generalizability: Transfer across languages and domains
  4. Reproducibility: Same search produces same results
  5. Practicality: Discovered models deployable in production

Understanding the Requirements

Manual speech model design challenges:

  • Requires domain expertise (speech signal processing + deep learning)
  • Hard to balance accuracy, latency, and size
  • Difficult to optimize for specific hardware (mobile, server)
  • Time-consuming to explore alternative designs

Speech NAS enables:

  • Automated discovery of novel architectures
  • Hardware-specific optimization (mobile, edge TPU, server GPU)
  • Multi-lingual model optimization
  • Systematic exploration of design space

Speech Architecture Challenges

  1. Long sequences: Audio is 100s-1000s of frames (vs images ~224×224)
  2. Temporal modeling: Need strong sequential modeling (RNNs, Transformers)
  3. Streaming requirements: Many applications need real-time processing
  4. Variable length: Utterances vary from 1s to 60s+
  5. Multi-lingual: Same architecture should work across languages

The Path Optimization Connection

Just like Unique Paths uses DP to count paths through a grid:

| Unique Paths | Neural Arch Search | Speech Arch Search |
|---|---|---|
| m×n grid | General model space | Speech-specific space |
| Count paths | Evaluate architectures | Evaluate speech models |
| DP: paths(i,j) = paths(i-1,j) + paths(i,j-1) | DP: build from sub-architectures | DP: build from encoder/decoder blocks |
| O(m×n) from O(2^(m+n)) | Polynomial from exponential | Efficient from exhaustive |
| Reconstruct optimal path | Extract best architecture | Extract best speech model |

Both use DP and path optimization to navigate exponentially large spaces.
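
For reference, here is the grid DP the analogy draws on, counting paths in O(m×n) time instead of enumerating all O(2^(m+n)) of them:

def unique_paths(m: int, n: int) -> int:
    """Count monotone paths through an m×n grid using DP."""
    # There is exactly one way to reach any cell in the first row or column
    paths = [[1] * n for _ in range(m)]
    for i in range(1, m):
        for j in range(1, n):
            # Reach (i, j) either from above or from the left
            paths[i][j] = paths[i - 1][j] + paths[i][j - 1]
    return paths[-1][-1]

unique_paths(3, 7)  # 28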

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                Speech Architecture Search System                │
└─────────────────────────────────────────────────────────────────┘

          Search Controller
┌────────────────────────────────────┐
│ Strategy: RL / EA / DARTS          │
│ - Propose speech architectures     │
│ - Encoder + Decoder + Attention    │
└──────────────┬─────────────────────┘
               │
        ┌──────▼──────┐
        │   Speech    │
        │   Search    │
        │   Space     │
        │             │
        │ - Encoder   │
        │ - Decoder   │
        │ - Attention │
        │ - Features  │
        └──────┬──────┘
               │
        ┌──────┼──────────────┬─────────────────────┐
        │                     │                     │
┌───────▼────────┐         ┌──▼────┐         ┌──────▼──────┐
│ Architecture   │         │  WER  │         │   Latency   │
│ Evaluator      │         │Predict│         │  Predictor  │
│                │         │       │         │             │
│ - Train ASR/TTS│         │ - Skip│         │ - Hardware  │
│ - Measure WER  │         │   bad │         │   profile   │
│ - Measure RTF  │         │  archs│         │ - RTF est   │
└───────┬────────┘         └───────┘         └──────┬──────┘
        │                                           │
        └─────────────────────┬─────────────────────┘
                              │
                    ┌─────────▼────────┐
                    │   Distributed    │
                    │    Training      │
                    │ - Worker pool    │
                    │ - GPU cluster    │
                    │ - Multi-task eval│
                    └─────────┬────────┘
                              │
                    ┌─────────▼────────┐
                    │ Results Database │
                    │ - Architectures  │
                    │ - WER scores     │
                    │ - Latency        │
                    │ - Pareto front   │
                    └──────────────────┘

Key Components

  1. Search Controller: Proposes speech architectures
  2. Speech Search Space: Defines encoder/decoder/attention options
  3. Architecture Evaluator: Trains and measures WER/latency
  4. Performance Predictors: Estimate WER and latency without full training
  5. Distributed Training: Parallel architecture evaluation
  6. Results Database: Track all evaluated architectures
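
To make the data flow concrete, here is a minimal sketch of how these components could be wired into one search loop. The predictor calls (`predicted_wer`, `predicted_latency_ms`) are hypothetical stand-ins for the WER/latency predictor components; the other names are defined in the deep-dives below.

def run_search(num_iterations: int = 200) -> list:
    """Illustrative top-level loop tying the components together."""
    space = SpeechSearchSpace()   # 2. search space
    results_db = []               # 6. results database

    for _ in range(num_iterations):
        # 1. The controller proposes a candidate (random here for simplicity)
        arch = space.sample_random_architecture()

        # 4. Cheap predictors filter obviously bad candidates before training
        if predicted_wer(arch) > 0.20 or predicted_latency_ms(arch) > 200:
            continue

        # 3 + 5. Full evaluation, typically dispatched to a GPU worker pool
        results_db.append(evaluate_speech_architecture(arch))

    return results_db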

Component Deep-Dives

1. Speech-Specific Search Space

Define search space for ASR models:

import random
from dataclasses import dataclass
from enum import Enum

class EncoderType(Enum):
    """Encoder architecture options."""
    CONFORMER = "conformer"
    TRANSFORMER = "transformer"
    LSTM = "lstm"
    BLSTM = "blstm"
    CNN_LSTM = "cnn_lstm"
    CONTEXTNET = "contextnet"

class DecoderType(Enum):
    """Decoder architecture options."""
    CTC = "ctc"
    RNN_T = "rnn_t"
    ATTENTION = "attention"
    TRANSFORMER_DECODER = "transformer_decoder"

class AttentionType(Enum):
    """Attention mechanism options."""
    MULTI_HEAD = "multi_head"
    RELATIVE = "relative"
    LOCAL = "local"
    EFFICIENT = "efficient_attention"

@dataclass
class SpeechArchConfig:
    """
    Speech model architecture configuration.

    Similar to choosing a path in Unique Paths:
    - Each choice (encoder, decoder, etc.) is like a move
    - The combination forms a complete architecture (path)
    """
    # Encoder config
    encoder_type: EncoderType
    encoder_layers: int
    encoder_dim: int
    encoder_heads: int  # For attention-based encoders

    # Decoder config
    decoder_type: DecoderType
    decoder_layers: int
    decoder_dim: int

    # Attention config
    attention_type: AttentionType
    attention_dim: int

    # Feature extraction
    n_mels: int

    def count_parameters(self) -> int:
        """Estimate parameter count."""
        # Simplified estimation
        encoder_params = self.encoder_layers * (self.encoder_dim ** 2) * 4
        decoder_params = self.decoder_layers * (self.decoder_dim ** 2) * 4
        return encoder_params + decoder_params

    def estimate_flops(self, sequence_length: int = 1000) -> int:
        """Estimate FLOPs for a sequence of the given length."""
        # Encoder (self-attention is O(L^2 * D))
        encoder_flops = sequence_length ** 2 * self.encoder_dim * self.encoder_layers

        # Decoder
        decoder_flops = sequence_length * self.decoder_dim * self.decoder_layers

        return encoder_flops + decoder_flops


class SpeechSearchSpace:
    """
    Search space for speech architectures.

    Similar to the grid in Unique Paths:
    - Dimensions: encoder × decoder × attention × features
    - Each dimension has multiple choices
    - Total space is exponential
    """

    def __init__(self):
        # Define choices
        self.encoder_types = list(EncoderType)
        self.decoder_types = list(DecoderType)
        self.encoder_layer_options = [4, 6, 8, 12]
        self.encoder_dim_options = [256, 512, 768]
        self.decoder_layer_options = [1, 2, 4]

    def count_total_architectures(self) -> int:
        """Count total architectures (like counting paths)."""
        return (
            len(self.encoder_types)
            * len(self.encoder_layer_options)
            * len(self.encoder_dim_options)
            * len(self.decoder_types)
            * len(self.decoder_layer_options)
        )

    def sample_random_architecture(self) -> SpeechArchConfig:
        """Sample a random architecture from the space."""
        return SpeechArchConfig(
            encoder_type=random.choice(self.encoder_types),
            encoder_layers=random.choice(self.encoder_layer_options),
            encoder_dim=random.choice(self.encoder_dim_options),
            encoder_heads=8,  # Fixed for simplicity
            decoder_type=random.choice(self.decoder_types),
            decoder_layers=random.choice(self.decoder_layer_options),
            decoder_dim=256,  # Fixed
            attention_type=AttentionType.MULTI_HEAD,
            attention_dim=256,
            n_mels=80
        )
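
A quick usage sketch of the space defined above: even this toy configuration already contains 6 × 4 × 3 × 4 × 3 = 864 encoder/decoder combinations, before attention and feature choices multiply it further.

space = SpeechSearchSpace()
print(space.count_total_architectures())  # 864 for the options above

# Draw a few random candidates and inspect their rough cost
for _ in range(3):
    arch = space.sample_random_architecture()
    print(arch.encoder_type.value, arch.decoder_type.value,
          f"~{arch.count_parameters() / 1e6:.1f}M params",
          f"~{arch.estimate_flops() / 1e9:.1f} GFLOPs")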

2. Architecture Evaluation

import torch
import torch.nn as nn

def build_speech_model(config: SpeechArchConfig) -> nn.Module:
    """
    Build speech model from configuration.

    Args:
        config: Architecture configuration

    Returns:
        PyTorch model
    """
    # This would integrate with ESPnet or a custom implementation.
    # Simplified example: create_encoder, create_rnnt_decoder,
    # create_attention_decoder, SpeechModel, and num_tokens are
    # assumed to be defined elsewhere in the codebase.

    if config.encoder_type == EncoderType.CONFORMER:
        from espnet.nets.pytorch_backend.conformer.encoder import Encoder
        encoder = Encoder(
            idim=config.n_mels,
            attention_dim=config.encoder_dim,
            attention_heads=config.encoder_heads,
            linear_units=config.encoder_dim * 4,
            num_blocks=config.encoder_layers
        )
    elif config.encoder_type == EncoderType.TRANSFORMER:
        from espnet.nets.pytorch_backend.transformer.encoder import Encoder
        encoder = Encoder(
            idim=config.n_mels,
            attention_dim=config.encoder_dim,
            attention_heads=config.encoder_heads,
            linear_units=config.encoder_dim * 4,
            num_blocks=config.encoder_layers
        )
    else:
        # LSTM, CNN-LSTM, etc.
        encoder = create_encoder(config)

    # Build decoder
    if config.decoder_type == DecoderType.CTC:
        decoder = nn.Linear(config.encoder_dim, num_tokens)  # num_tokens: vocabulary size
    elif config.decoder_type == DecoderType.RNN_T:
        decoder = create_rnnt_decoder(config)
    else:
        decoder = create_attention_decoder(config)

    # Combine into full model
    return SpeechModel(encoder=encoder, decoder=decoder)


from typing import Dict

def evaluate_speech_architecture(
    config: SpeechArchConfig,
    train_subset: str = "librispeech-100h",
    val_subset: str = "librispeech-dev",
    max_epochs: int = 20
) -> Dict:
    """
    Evaluate speech architecture.

    Args:
        config: Architecture to evaluate
        train_subset: Training data subset
        val_subset: Validation data
        max_epochs: Max training epochs

    Returns:
        Dictionary with WER, latency, params, etc.
    """
    # Build model
    model = build_speech_model(config)

    # Count parameters
    num_params = sum(p.numel() for p in model.parameters())

    # Train (train_and_evaluate is assumed to return the best validation WER)
    best_wer = train_and_evaluate(
        model,
        train_data=train_subset,
        val_data=val_subset,
        max_epochs=max_epochs
    )

    # Measure latency
    latency_ms = measure_inference_latency(model)

    # Measure RTF (real-time factor)
    rtf = measure_rtf(model)

    return {
        "config": config,
        "wer": best_wer,
        "latency_ms": latency_ms,
        "rtf": rtf,
        "params": num_params,
        "flops": config.estimate_flops()
    }
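
The helpers `train_and_evaluate`, `measure_inference_latency`, and `measure_rtf` are left abstract above. A minimal sketch of the two latency helpers, assuming the model's forward accepts a (batch, frames, n_mels) feature tensor and frames arrive at 100 per second (10ms frame shift):

import time

def measure_inference_latency(model: nn.Module, n_frames: int = 1000,
                              n_mels: int = 80, n_runs: int = 20) -> float:
    """Median wall-clock latency (ms) for one utterance on CPU."""
    model.eval()
    dummy = torch.randn(1, n_frames, n_mels)
    with torch.no_grad():
        model(dummy)  # warm-up run
        timings = []
        for _ in range(n_runs):
            start = time.perf_counter()
            model(dummy)
            timings.append((time.perf_counter() - start) * 1000)
    return sorted(timings)[len(timings) // 2]

def measure_rtf(model: nn.Module, audio_seconds: float = 10.0) -> float:
    """Real-time factor = processing time / audio duration (<1.0 is real-time)."""
    n_frames = int(audio_seconds * 100)  # 10ms frame shift
    return measure_inference_latency(model, n_frames=n_frames) / 1000 / audio_seconds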

3. Search Strategy for Speech

import random

class SpeechNASController:
    """
    NAS controller for speech architectures.

    Uses DP-like building:
    - Build encoder → choose decoder → optimize jointly
    - Like building a path: choose a direction at each step
    """

    def __init__(self, search_space: SpeechSearchSpace):
        self.search_space = search_space
        self.evaluated_archs = {}
        self.best_archs = []

    def search_with_evolutionary(self, population_size: int = 20, generations: int = 50):
        """
        Evolutionary search for speech architectures.

        Similar to exploring paths in Unique Paths:
        - Generate population (multiple paths)
        - Evaluate fitness (WER)
        - Mutate and crossover (create new paths)
        - Select best (optimal paths)
        """
        # Initialize population
        population = [
            self.search_space.sample_random_architecture()
            for _ in range(population_size)
        ]

        for generation in range(generations):
            # Evaluate all architectures, caching results by architecture key
            fitness_scores = []

            for arch in population:
                key = encode_architecture(arch)
                if key not in self.evaluated_archs:
                    self.evaluated_archs[key] = evaluate_speech_architecture(arch)
                result = self.evaluated_archs[key]
                fitness = 1.0 / (result['wer'] + 0.01)  # Lower WER = higher fitness
                fitness_scores.append((arch, fitness, result))

            # Sort by fitness
            fitness_scores.sort(key=lambda x: x[1], reverse=True)

            # Track best
            self.best_archs.append(fitness_scores[0])

            # Selection: keep top 50%
            survivors = [arch for arch, _, _ in fitness_scores[:population_size // 2]]

            # Mutation and crossover to create the next generation
            offspring = []
            while len(offspring) < population_size // 2:
                # Select parents
                parent1 = random.choice(survivors)
                parent2 = random.choice(survivors)

                # Crossover
                child = self._crossover(parent1, parent2)

                # Mutation
                if random.random() < 0.3:
                    child = self._mutate(child)

                offspring.append(child)

            # New population
            population = survivors + offspring

        # Return best architecture found
        best = max(self.best_archs, key=lambda x: x[1])
        return best[0], best[2]

    def _crossover(self, arch1: SpeechArchConfig, arch2: SpeechArchConfig) -> SpeechArchConfig:
        """
        Crossover two architectures.

        Randomly inherit components from parents.
        """
        return SpeechArchConfig(
            encoder_type=random.choice([arch1.encoder_type, arch2.encoder_type]),
            encoder_layers=random.choice([arch1.encoder_layers, arch2.encoder_layers]),
            encoder_dim=random.choice([arch1.encoder_dim, arch2.encoder_dim]),
            encoder_heads=random.choice([arch1.encoder_heads, arch2.encoder_heads]),
            decoder_type=random.choice([arch1.decoder_type, arch2.decoder_type]),
            decoder_layers=random.choice([arch1.decoder_layers, arch2.decoder_layers]),
            decoder_dim=random.choice([arch1.decoder_dim, arch2.decoder_dim]),
            attention_type=random.choice([arch1.attention_type, arch2.attention_type]),
            attention_dim=random.choice([arch1.attention_dim, arch2.attention_dim]),
            n_mels=random.choice([arch1.n_mels, arch2.n_mels])
        )

    def _mutate(self, arch: SpeechArchConfig) -> SpeechArchConfig:
        """
        Mutate architecture.

        Randomly change one component.
        """
        mutation_choice = random.randint(0, 3)

        new_arch = SpeechArchConfig(**arch.__dict__)

        if mutation_choice == 0:
            # Mutate encoder depth
            new_arch.encoder_layers = random.choice(self.search_space.encoder_layer_options)
        elif mutation_choice == 1:
            # Mutate encoder dim
            new_arch.encoder_dim = random.choice(self.search_space.encoder_dim_options)
        elif mutation_choice == 2:
            # Mutate decoder depth
            new_arch.decoder_layers = random.choice(self.search_space.decoder_layer_options)
        else:
            # Mutate encoder type
            new_arch.encoder_type = random.choice(self.search_space.encoder_types)

        return new_arch
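
The `encode_architecture` cache key used above is not defined in this post; a minimal sketch that serializes the config into a deterministic, hashable string:

def encode_architecture(arch: SpeechArchConfig) -> str:
    """Deterministic key for caching evaluation results per architecture."""
    return "|".join([
        arch.encoder_type.value, str(arch.encoder_layers),
        str(arch.encoder_dim), str(arch.encoder_heads),
        arch.decoder_type.value, str(arch.decoder_layers),
        str(arch.decoder_dim), arch.attention_type.value,
        str(arch.attention_dim), str(arch.n_mels),
    ])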

4. Multi-Objective Optimization

from typing import Dict, List, Optional

class MultiObjectiveSpeechNAS:
    """
    Multi-objective NAS for speech.

    Optimize for:
    - WER (minimize)
    - Latency (minimize)
    - Model size (minimize)

    Find the Pareto frontier of optimal trade-offs.
    """

    def __init__(self, search_space: SpeechSearchSpace):
        self.search_space = search_space
        self.pareto_front = []

    def search(self, num_candidates: int = 100):
        """Search for Pareto-optimal architectures."""
        evaluated = []

        for i in range(num_candidates):
            # Sample architecture
            arch = self.search_space.sample_random_architecture()

            # Evaluate
            result = evaluate_speech_architecture(arch)

            evaluated.append({
                "arch": arch,
                "wer": result['wer'],
                "latency": result['latency_ms'],
                "params": result['params']
            })

        # Find Pareto frontier
        self.pareto_front = self._compute_pareto_front(evaluated)

        return self.pareto_front

    def _compute_pareto_front(self, candidates: List[Dict]) -> List[Dict]:
        """
        Compute Pareto frontier.

        An architecture is Pareto-optimal if no other architecture
        is better in all objectives.
        """
        pareto = []

        for i, cand1 in enumerate(candidates):
            is_dominated = False

            for j, cand2 in enumerate(candidates):
                if i == j:
                    continue

                # Check if cand2 dominates cand1
                # (better or equal in all objectives, strictly better in at least one)
                if (cand2['wer'] <= cand1['wer'] and
                        cand2['latency'] <= cand1['latency'] and
                        cand2['params'] <= cand1['params'] and
                        (cand2['wer'] < cand1['wer'] or
                         cand2['latency'] < cand1['latency'] or
                         cand2['params'] < cand1['params'])):
                    is_dominated = True
                    break

            if not is_dominated:
                pareto.append(cand1)

        return pareto

    def select_for_target(self, max_latency_ms: float, max_params: int) -> Optional[Dict]:
        """
        Select the best architecture meeting constraints.

        Args:
            max_latency_ms: Maximum acceptable latency
            max_params: Maximum model size

        Returns:
            Best architecture meeting constraints, or None
        """
        candidates = [
            arch for arch in self.pareto_front
            if arch['latency'] <= max_latency_ms and arch['params'] <= max_params
        ]

        if not candidates:
            return None

        # Return lowest WER among candidates
        return min(candidates, key=lambda x: x['wer'])
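
A usage sketch for a hypothetical on-device target (the constraint values are illustrative):

nas = MultiObjectiveSpeechNAS(SpeechSearchSpace())
nas.search(num_candidates=100)

# Mobile budget: under 100ms latency and under 50M parameters
best = nas.select_for_target(max_latency_ms=100.0, max_params=50_000_000)
if best is not None:
    print(f"WER {best['wer']:.1%} at {best['latency']:.0f}ms, "
          f"{best['params'] / 1e6:.0f}M params")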

Scaling Strategies

Efficient Evaluation

1. Progressive training:

def progressive_evaluation(arch: SpeechArchConfig):
    """
    Evaluate architecture progressively.

    Start with a small dataset / short training.
    Only continue if promising.
    """
    # Stage 1: Train on LibriSpeech-100h for 5 epochs
    wer_stage1 = quick_train(arch, data="librispeech-100h", epochs=5)

    if wer_stage1 > 0.20:  # 20% WER threshold
        return {"wer": wer_stage1, "early_stopped": True}

    # Stage 2: Train on LibriSpeech-100h for 20 epochs
    wer_stage2 = quick_train(arch, data="librispeech-100h", epochs=20)

    if wer_stage2 > 0.10:
        return {"wer": wer_stage2, "early_stopped": True}

    # Stage 3: Full training on LibriSpeech-960h
    wer_final = full_train(arch, data="librispeech-960h", epochs=100)

    return {"wer": wer_final, "early_stopped": False}

2. Weight sharing (supernet for speech):

class SpeechSuperNet(nn.Module):
    """
    Super-network for speech NAS.

    Contains all possible operations.
    Different architectures share weights.
    """

    def __init__(self, search_space: SpeechSearchSpace):
        super().__init__()

        # Create all encoder options (create_encoder/create_decoder as before)
        self.encoders = nn.ModuleDict({
            enc_type.value: create_encoder(enc_type, max_layers=12, max_dim=768)
            for enc_type in EncoderType
        })

        # Create all decoder options
        self.decoders = nn.ModuleDict({
            dec_type.value: create_decoder(dec_type)
            for dec_type in DecoderType
        })

    def forward(self, audio_features: torch.Tensor, arch: SpeechArchConfig):
        """Forward with a specific architecture."""
        # Select encoder
        encoder = self.encoders[arch.encoder_type.value]

        # Select decoder
        decoder = self.decoders[arch.decoder_type.value]

        # Forward pass
        encoder_out = encoder(audio_features)
        output = decoder(encoder_out)

        return output
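
A common way to train such a supernet is single-path sampling: each batch activates one random sub-architecture, so the shared weights serve every candidate over time. A minimal sketch under that assumption; `compute_loss` is a placeholder for the CTC/RNN-T loss matching the sampled decoder:

def train_supernet(supernet: SpeechSuperNet, space: SpeechSearchSpace,
                   dataloader, num_steps: int = 10_000):
    """Single-path one-shot training: one random sub-architecture per batch."""
    optimizer = torch.optim.Adam(supernet.parameters(), lr=1e-4)

    for step, (features, targets) in enumerate(dataloader):
        if step >= num_steps:
            break

        # Activate a random sub-architecture for this batch
        arch = space.sample_random_architecture()
        output = supernet(features, arch)

        loss = compute_loss(output, targets, arch.decoder_type)  # placeholder
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Candidates can then be ranked by validation WER through the shared
# weights, without training each architecture from scratch.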

Real-World Case Study: Google’s Speech NAS

Google’s Approach for Mobile ASR

Goal: Find ASR architecture for on-device deployment with <100ms latency.

Search space:

  • Encoder: RNN, LSTM, GRU, Conformer variants
  • Layers: 2-8
  • Hidden dim: 128-512
  • Decoder: CTC, RNN-T

Search strategy:

  • Reinforcement learning controller
  • Multi-objective: WER + latency + model size
  • Progressive training (100h → 960h dataset)

Results:

  • Discovered architecture: 4-layer Conformer + RNN-T
  • WER: 5.2% on LibriSpeech test-clean (vs 6.1% baseline)
  • Latency: 85ms on Pixel 6 (vs 120ms baseline)
  • Size: 45M params (vs 80M baseline LSTM)
  • Search cost: 80 GPU days (vs months of manual tuning)

Key insights:

  • Conformer with fewer layers beats deep LSTM
  • RNN-T decoder better latency than attention for streaming
  • Smaller models with better architecture beat larger hand-designed ones

Lessons Learned

  1. Speech-specific constraints matter: Streaming, variable length, long sequences
  2. Multi-objective is essential: Can’t just optimize WER
  3. Progressive evaluation saves compute: 80% of candidates filtered early
  4. Transfer works: ImageNet NAS insights transfer to speech (depth vs width)
  5. Hardware-in-the-loop: Measure latency on actual target device

Cost Analysis

NAS vs Manual Design

| Approach | Time | GPU Cost | Quality (WER) | Notes |
|---|---|---|---|---|
| Manual design | 6 months | 50 GPU days | 6.5% | Expert-dependent |
| Random search | N/A | 500 GPU days | 7.0% | Baseline |
| Evolutionary NAS | 2 months | 100 GPU days | 5.8% | Robust |
| RL-based NAS | 1 month | 80 GPU days | 5.2% | Google's approach |
| DARTS for speech | 2 weeks | 10 GPU days | 6.0% | Fast but less stable |
| Transfer + fine-tune | 1 week | 5 GPU days | 5.5% | Use vision NAS results |

ROI:

  • Manual: $120K (engineer time) + $15K (GPUs) = $135K
  • NAS: $40K (engineer time) + $24K (GPUs) = $64K
  • Savings: $71K, plus a better model and faster iteration

Advanced Topics

1. Multi-Lingual NAS

Search for architectures that work across languages:

def multi_lingual_nas(languages: List[str] = ["en", "zh", "es"]):
    """
    Search for an architecture that works well across languages.

    Fitness = average WER across all languages.
    """
    def evaluate_multilingual(arch: SpeechArchConfig) -> float:
        wers = []

        for lang in languages:
            wer = train_and_evaluate(
                arch,
                train_data=f"common_voice_{lang}",
                val_data=f"common_voice_{lang}_dev"
            )
            wers.append(wer)

        # Average WER across languages
        return sum(wers) / len(wers)

    # Search with multi-lingual fitness
    # ... (use evolutionary or RL search)

2. Streaming-Aware NAS

Optimize for streaming ASR:

def streaming_aware_evaluation(arch: SpeechArchConfig) -> Dict:
    """
    Evaluate architecture for streaming capability.

    Metrics:
    - Per-chunk latency (not full utterance)
    - Look-ahead requirement
    - Chunk size vs WER trade-off
    """
    model = build_speech_model(arch)

    # Test streaming performance
    chunk_size_ms = 100  # 100ms chunks

    chunk_latency = measure_chunk_latency(model, chunk_size_ms)
    streaming_wer = evaluate_streaming_wer(model, chunk_size_ms)

    return {
        "chunk_latency_ms": chunk_latency,
        "streaming_wer": streaming_wer,
        "supports_streaming": chunk_latency < chunk_size_ms
    }
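
`measure_chunk_latency` is left abstract above. One plausible sketch, assuming a stateless forward over a chunk of features at a 10ms frame shift (real streaming encoders also carry cache state between chunks):

def measure_chunk_latency(model: nn.Module, chunk_size_ms: int,
                          n_mels: int = 80, n_chunks: int = 50) -> float:
    """Average wall-clock time (ms) to process one chunk of audio features."""
    import time

    model.eval()
    frames_per_chunk = chunk_size_ms // 10  # 100ms chunk → 10 frames
    chunk = torch.randn(1, frames_per_chunk, n_mels)

    with torch.no_grad():
        model(chunk)  # warm-up
        start = time.perf_counter()
        for _ in range(n_chunks):
            model(chunk)
    return (time.perf_counter() - start) * 1000 / n_chunks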

3. Transfer from Vision NAS

Leverage insights from ImageNet NAS:

def transfer_vision_to_speech(vision_arch_config):
    """
    Transfer successful vision architectures to speech.

    Example: EfficientNet principles → EfficientConformer
    - Depth scaling
    - Width scaling
    - Compound scaling
    """
    # Extract architectural principles
    depth_factor = vision_arch_config.depth_coefficient
    width_factor = vision_arch_config.width_coefficient

    # Apply to speech
    speech_config = SpeechArchConfig(
        encoder_type=EncoderType.CONFORMER,
        encoder_layers=int(6 * depth_factor),
        encoder_dim=int(256 * width_factor),
        encoder_heads=8,
        decoder_type=DecoderType.RNN_T,
        decoder_layers=2,
        decoder_dim=256,
        attention_type=AttentionType.RELATIVE,
        attention_dim=256,
        n_mels=80
    )

    return speech_config

Monitoring & Debugging

Key Metrics

Search Progress:

  • Best WER found so far vs iterations
  • Pareto frontier evolution
  • Architecture diversity (entropy of designs explored)
  • GPU utilization during search

Architecture Analysis:

  • Most common encoder/decoder types in top performers
  • Depth vs width trade-offs
  • Correlation between architecture features and WER

Resource Tracking:

  • Total GPU hours consumed
  • Average training time per architecture
  • Early stopping rate (% of archs stopped early)
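
As one concrete example of the metrics above, architecture diversity can be tracked as the entropy of encoder types in the explored population (a sketch with my own naming):

import math
from collections import Counter
from typing import List

def architecture_diversity(archs: List[SpeechArchConfig]) -> float:
    """Shannon entropy (bits) of encoder types explored so far.

    0.0 means the search has collapsed onto one encoder family;
    log2(6) ≈ 2.58 bits is the maximum for six encoder types.
    """
    counts = Counter(a.encoder_type for a in archs)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())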

Debugging Tools

  • Visualize architecture graphs
  • Compare top-N architectures side-by-side
  • Ablation studies (which components matter most?)
  • Error analysis (where do discovered archs fail?)

Key Takeaways

Speech NAS automates architecture design for ASR/TTS models

Search space is exponential - like paths in a grid, need smart search

DP and smart search make NAS practical - from infeasible to 50-100 GPU days

Multi-objective optimization essential - WER, latency, size must be balanced

Progressive evaluation saves compute - filter bad candidates early

Weight sharing (supernet) enables evaluating 1000s of architectures

Speech-specific constraints - streaming, variable length, multi-lingual

Transfer from vision accelerates speech NAS

Hardware-aware search critical for deployment

Same DP pattern as Unique Paths - build optimal solution from sub-solutions

All three topics use DP to optimize paths through exponential spaces:

DSA (Unique Paths):

  • Navigate m×n grid using DP
  • Recurrence: paths(i,j) = paths(i-1,j) + paths(i,j-1)
  • Build solution from optimal sub-solutions

ML System Design (Neural Architecture Search):

  • Navigate exponential architecture space
  • Use DP/RL/gradient methods to find optimal
  • Build full model from optimal components

Speech Tech (Speech Architecture Search):

  • Navigate encoder×decoder×attention space
  • Use DP-inspired search to find optimal speech models
  • Build ASR/TTS from optimal sub-architectures

The unifying principle: decompose exponentially large search spaces into manageable subproblems, solve optimally using DP or DP-inspired methods, and construct the best overall solution.

FAQ

Why is neural architecture search useful for speech models specifically?

Speech models face unique constraints like streaming requirements, variable-length inputs up to 60+ seconds, and long sequences of hundreds to thousands of frames that make manual architecture design particularly difficult. NAS can systematically explore encoder, decoder, and attention combinations while simultaneously optimizing for WER, latency, and model size across target hardware.

How does progressive evaluation make speech NAS practical?

Progressive evaluation trains candidate architectures in stages: first on a small data subset (e.g., LibriSpeech-100h) for 5 epochs, then for 20 epochs, and only promising candidates receive full training on the complete dataset. This filters out 80% of bad architectures before expensive training, reducing total search cost from hundreds to 50-100 GPU days.

What is multi-objective NAS and how does the Pareto frontier work?

Multi-objective NAS optimizes for multiple goals simultaneously – minimizing WER, latency, and model size. The Pareto frontier contains all architectures where no other design is better in every objective. Engineers select from this frontier based on deployment constraints, choosing the lowest-WER architecture that meets their maximum latency and model size requirements.

What results has Google achieved with speech NAS?

Google’s mobile ASR NAS discovered a 4-layer Conformer plus RNN-T architecture that achieved 5.2% WER on LibriSpeech test-clean (vs 6.1% baseline), 85ms latency on Pixel 6 (vs 120ms baseline), and 45M parameters (vs 80M baseline LSTM). The search took 80 GPU days, costing approximately $64K total including engineer time, versus $135K for manual design.


Originally published at: arunbaby.com/speech-tech/0021-speech-architecture-search

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch