Speech Hyperparameter Tuning
“Tuning speech models for peak performance.”
1. Speech-Specific Hyperparameters
Speech models have unique hyperparameters beyond standard ML:
Audio Processing:
- Sample Rate: 8kHz (telephony) vs. 16kHz (standard) vs. 48kHz (high-quality).
- Window Size: 25ms? 50ms?
- Hop Length: 10ms? 20ms?
- Num Mel Bins: 40? 80? 128?
Model Architecture:
- Encoder Type: LSTM? Transformer? Conformer?
- Num Layers: 6? 12? 24?
- Attention Heads: 4? 8? 16?
Training:
- SpecAugment: Mask how many time/frequency bins?
- CTC vs. Attention: Which loss weight?
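To make the audio hyperparameters concrete, here is a minimal sketch (assuming torchaudio; any feature frontend exposes the same knobs) showing how they map onto feature-extractor arguments:

import torchaudio

# 16 kHz audio, 25 ms window, 10 ms hop, 80 mel bins -- common starting points.
mel_frontend = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,        # 25 ms window at 16 kHz
    hop_length=160,   # 10 ms hop at 16 kHz
    n_mels=80,
)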
2. The Cost Problem
Challenge: Speech models are expensive to train.
- Whisper Large: 1 week on 256 GPUs.
- Conformer-XXL: 3 days on 64 GPUs.
Implication: We can’t afford 100 trials. Need smart search.
3. Multi-Fidelity Tuning for ASR
Idea: Use smaller datasets/models as proxies.
Fidelity Levels:
- Low: Train on 1 hour of data, 3 layers, 1 epoch.
- Medium: Train on 10 hours, 6 layers, 5 epochs.
- High: Train on 100 hours, 12 layers, 50 epochs.
Hyperband Strategy:
- Start 64 trials at low fidelity.
- Promote top 16 to medium.
- Promote top 4 to high.
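A minimal successive-halving sketch of this promotion scheme; `train_and_score` is an assumed helper that trains a config at the given fidelity and returns WER:

def successive_halving(configs):
    # Fidelity ladder from the list above; keep the top-k survivors at each rung.
    fidelities = [
        {'hours': 1,   'layers': 3,  'epochs': 1},
        {'hours': 10,  'layers': 6,  'epochs': 5},
        {'hours': 100, 'layers': 12, 'epochs': 50},
    ]
    keep = [16, 4, 1]
    survivors = list(configs)  # e.g., 64 randomly sampled configs
    for fidelity, k in zip(fidelities, keep):
        scored = sorted(survivors, key=lambda cfg: train_and_score(cfg, **fidelity))
        survivors = scored[:k]
    return survivors[0]  # best config after the final rung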
4. Optuna for Speech
import optuna

def objective(trial):
    # Audio hyperparameters
    n_mels = trial.suggest_int('n_mels', 40, 128, step=8)
    win_length = trial.suggest_int('win_length', 20, 50, step=5)

    # Model hyperparameters
    num_layers = trial.suggest_int('num_layers', 4, 12)
    d_model = trial.suggest_categorical('d_model', [256, 512, 1024])

    # Training hyperparameters
    lr = trial.suggest_float('lr', 1e-5, 1e-3, log=True)

    # Build and train model
    model = build_asr_model(n_mels, win_length, num_layers, d_model)
    wer = train_and_evaluate(model, lr)
    return wer  # Minimize WER

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)
5. Neural Architecture Search (NAS)
Goal: Automatically find the best architecture.
Search Space:
- Encoder: LSTM, GRU, Transformer, Conformer.
- Decoder: CTC, Attention, Transducer.
- Connections: Skip connections? Residual?
Search Algorithm:
- ENAS (Efficient NAS): Share weights across architectures.
- DARTS (Differentiable): Make architecture choices continuous, use gradient descent.
6. Case Study: ESPnet Tuning
ESPnet (End-to-End Speech Processing toolkit) has built-in tuning.
# Define search space in YAML
espnet_tune.py \
    --config conf/tuning.yaml \
    --n-trials 100 \
    --backend optuna
conf/tuning.yaml:
search_space:
  encoder_layers: [4, 6, 8, 12]
  attention_heads: [4, 8]
  dropout: [0.1, 0.2, 0.3]
  learning_rate: [1e-4, 5e-4, 1e-3]
7. Summary
| Aspect | Strategy |
|---|---|
| Search | Bayesian Optimization (Optuna) |
| Fidelity | Hyperband (small data first) |
| Architecture | NAS (ENAS, DARTS) |
| Parallelization | Ray Tune (multi-GPU) |
8. Deep Dive: Audio Augmentation Hyperparameters
SpecAugment is crucial for speech models. But how much augmentation?
Hyperparameters:
- Time Masking: How many time steps to mask? (10? 20? 50?)
- Frequency Masking: How many mel bins? (5? 10? 20?)
- Num Masks: How many masks per spectrogram? (1? 2? 3?)
Tuning Strategy:
def objective(trial):
    time_mask = trial.suggest_int('time_mask', 10, 100, step=10)
    freq_mask = trial.suggest_int('freq_mask', 5, 30, step=5)
    num_masks = trial.suggest_int('num_masks', 1, 3)

    augmenter = SpecAugment(time_mask, freq_mask, num_masks)
    model = train_with_augmentation(augmenter)
    return model.wer
Insight: More augmentation helps on small datasets, hurts on large ones.
9. Deep Dive: Conformer Architecture Search
Conformer is SOTA for ASR. But which variant?
Search Space:
- Num Layers: 12? 18? 24?
- d_model: 256? 512? 1024?
- Conv Kernel Size: 15? 31? 63?
- Attention Heads: 4? 8? 16?
Cost: Training a 24-layer Conformer takes 3 days on 8 GPUs.
Multi-Fidelity Strategy:
- Proxy: Train 6-layer model on 10 hours.
- Correlation: Check if proxy WER correlates with full model WER.
- Transfer: Top configs from proxy → Full training.
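A quick way to sanity-check the proxy is rank correlation between proxy and full-fidelity WERs. A sketch with purely illustrative numbers:

from scipy.stats import spearmanr

# WERs (%) for the same 8 configs at proxy and full fidelity (illustrative values).
proxy_wer = [12.1, 10.4, 11.8, 9.9, 13.5, 10.1, 12.7, 9.5]
full_wer = [7.2, 6.1, 7.0, 5.9, 8.1, 6.0, 7.6, 5.7]

rho, p_value = spearmanr(proxy_wer, full_wer)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# High rank correlation (e.g., rho > 0.8) means the proxy is a trustworthy filter.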
10. Deep Dive: Learning Rate Schedules
Speech models are sensitive to LR schedules.
Options:
- Warmup + Decay:
- Warmup: Linear increase for 10k steps.
- Decay: Cosine or exponential.
- Noam Scheduler (Transformer): \(\text{LR} = d_{\text{model}}^{-0.5} \cdot \min(step^{-0.5}, step \cdot warmup^{-1.5})\)
- ReduceLROnPlateau: Reduce when validation loss plateaus.
Tuning:
def objective(trial):
    warmup_steps = trial.suggest_int('warmup_steps', 5000, 25000, step=5000)
    peak_lr = trial.suggest_float('peak_lr', 1e-4, 1e-3, log=True)

    scheduler = NoamScheduler(warmup_steps, peak_lr)
    model = train_with_scheduler(scheduler)
    return model.wer
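For reference, the Noam formula above as a standalone function (the `NoamScheduler` wrapper in the snippet is assumed to exist elsewhere in the codebase):

def noam_lr(step, d_model=512, warmup_steps=10000, scale=1.0):
    # LR = scale * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    step = max(step, 1)  # avoid division by zero at step 0
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)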
11. System Design: Distributed Tuning for TTS
Scenario: Tune a multi-speaker TTS model.
Challenges:
- Long Training: 1M steps = 1 week on 1 GPU.
- Many Hyperparameters: 20+ (encoder, decoder, vocoder).
Solution:
- Stage 1: Tune encoder/decoder (freeze vocoder).
- Stage 2: Tune vocoder (freeze encoder/decoder).
- Parallelization: Ray Tune with 64 GPUs.
Code:
from ray import tune

def train_tts(config):
    model = build_tts(
        encoder_layers=config['encoder_layers'],
        decoder_layers=config['decoder_layers'],
        lr=config['lr']
    )
    for step in range(100000):
        loss = train_step(model)
        if step % 1000 == 0:
            tune.report(loss=loss)

config = {
    'encoder_layers': tune.choice([4, 6, 8]),
    'decoder_layers': tune.choice([4, 6]),
    'lr': tune.loguniform(1e-5, 1e-3)
}

tune.run(train_tts, config=config, num_samples=50, resources_per_trial={'gpu': 1})
12. Deep Dive: Transfer Learning from Pre-Tuned Models
Idea: Start from a model that’s already tuned for a similar task.
Example:
- Source: English ASR (tuned on LibriSpeech).
- Target: Spanish ASR.
- Transfer: Use English hyperparameters as starting point.
Fine-Tuning Search Space:
- Keep architecture fixed.
- Only tune learning rate and data augmentation.
Speedup: 5x fewer trials needed.
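One way to implement this warm start in Optuna is to enqueue the source-language configuration as the first trial and restrict the search space to LR and augmentation (helper names and the enqueued values below are assumptions):

import optuna

study = optuna.create_study(direction='minimize')
study.enqueue_trial({'lr': 3e-4, 'time_mask': 50})  # best English config, evaluated first

def objective(trial):
    lr = trial.suggest_float('lr', 1e-5, 1e-3, log=True)
    time_mask = trial.suggest_int('time_mask', 10, 100, step=10)
    model = finetune_spanish_asr(lr=lr, time_mask=time_mask)  # architecture stays fixed
    return evaluate_wer(model)

study.optimize(objective, n_trials=10)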
13. Production Tuning Workflow
Step 1: Baseline
- Train with default hyperparameters.
- Measure WER/MOS.
Step 2: Coarse Search
- Use Random Search with 20 trials.
- Identify promising regions.
Step 3: Fine Search
- Use Bayesian Optimization with 30 trials.
- Focus on promising region.
Step 4: Validation
- Train best config 3 times (different seeds).
- Report mean ± std.
Step 5: A/B Test
- Deploy to 5% of users.
- Monitor real-world metrics.
14. Deep Dive: Batch Size and Gradient Accumulation
Problem: Larger batch sizes improve training stability but require more GPU memory.
Hyperparameters:
- Batch Size: 8? 16? 32? 64?
- Gradient Accumulation Steps: 1? 2? 4? 8?
Effective Batch Size = batch_size × gradient_accumulation_steps × num_gpus
Tuning Strategy:
def objective(trial):
    batch_size = trial.suggest_categorical('batch_size', [8, 16, 32])
    grad_accum = trial.suggest_categorical('grad_accum', [1, 2, 4])
    effective_bs = batch_size * grad_accum

    # Adjust learning rate proportionally
    base_lr = 1e-4
    lr = base_lr * (effective_bs / 32)

    model = train(batch_size, grad_accum, lr)
    return model.wer
Insight: Effective batch size of 128-256 works best for most speech models.
15. Deep Dive: Optimizer Selection
Options:
- Adam: Default choice. Adaptive learning rates.
- AdamW: Adam with weight decay decoupling. Better generalization.
- SGD + Momentum: Simpler, sometimes better for very large models.
- Adafactor: Memory-efficient (no momentum buffer). Good for TPUs.
Hyperparameters:
- Beta1, Beta2: Momentum parameters for Adam.
- Weight Decay: L2 regularization strength.
- Epsilon: Numerical stability constant.
Tuning:
from torch.optim import SGD, AdamW

def objective(trial):
    optimizer_name = trial.suggest_categorical('optimizer', ['adam', 'adamw', 'sgd'])
    lr = trial.suggest_float('lr', 1e-5, 1e-3, log=True)
    params = model.parameters()  # model is assumed to be built earlier in the trial

    if optimizer_name in ['adam', 'adamw']:
        beta1 = trial.suggest_float('beta1', 0.85, 0.95)
        beta2 = trial.suggest_float('beta2', 0.95, 0.999)
        weight_decay = trial.suggest_float('weight_decay', 1e-6, 1e-2, log=True)
        optimizer = AdamW(params, lr=lr, betas=(beta1, beta2), weight_decay=weight_decay)
    else:
        momentum = trial.suggest_float('momentum', 0.8, 0.99)
        optimizer = SGD(params, lr=lr, momentum=momentum)

    return train_with_optimizer(optimizer)
16. Case Study: Google’s Conformer Tuning
Background: Google trained Conformer models for production ASR.
Search Space:
- 144 hyperparameter combinations.
- Trained on 60,000 hours of audio.
Key Findings:
- Convolution Kernel Size: 31 was optimal (not 15 or 63).
- Dropout: 0.1 for large datasets, 0.3 for small.
- SpecAugment: Time mask 100, freq mask 27.
Cost: $500,000 in GPU hours.
Result: 5% relative WER improvement over baseline.
17. Case Study: Meta’s Wav2Vec 2.0 Self-Supervised Tuning
Challenge: Pre-training on 60,000 hours of unlabeled audio.
Hyperparameters Tuned:
- Masking Probability: 0.065 (6.5% of time steps masked).
- Mask Length: 10 time steps.
- Contrastive Temperature: 0.1.
- Quantizer Codebook Size: 320.
Search Method: Grid search with 20 configurations.
Key Insight: Masking probability is the most sensitive hyperparameter. 6.5% is optimal; 5% or 8% degrades performance significantly.
18. Deep Dive: Early Stopping Strategies
Problem: How do we know when to stop a trial?
Strategies:
- Validation Loss Plateau: Stop if loss doesn’t improve for N epochs.
- Hyperband: Stop bottom 50% of trials at each rung.
- Median Stopping: Stop if current performance is below median of all trials at this step.
Optuna Pruner:
import optuna

def objective(trial):
    # Hyperparameters are drawn via trial.suggest_* inside build_model (assumed helper)
    model = build_model(trial)
    for epoch in range(100):
        val_loss = train_epoch(model)

        # Report intermediate value
        trial.report(val_loss, epoch)

        # Prune if not promising
        if trial.should_prune():
            raise optuna.TrialPruned()
    return val_loss

study = optuna.create_study(
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=10)
)
study.optimize(objective, n_trials=100)
Speedup: 3-5x faster by killing bad trials early.
19. Deep Dive: Hyperparameter Importance Analysis
Question: Which hyperparameters matter most?
Optuna Importance:
import optuna.importance

# After study completes
importance = optuna.importance.get_param_importances(study)
for param, score in importance.items():
    print(f"{param}: {score:.3f}")
Example Output:
learning_rate: 0.45
num_layers: 0.25
dropout: 0.15
batch_size: 0.10
optimizer: 0.05
Insight: Focus future tuning on top 2-3 hyperparameters.
20. Production Deployment: Model Registry
Problem: Track 100s of tuning experiments.
Solution: MLflow Model Registry.
import mlflow

def objective(trial):
    params = {
        'lr': trial.suggest_float('lr', 1e-5, 1e-3, log=True),
        'num_layers': trial.suggest_int('num_layers', 4, 12)
    }

    with mlflow.start_run():
        # Log hyperparameters
        mlflow.log_params(params)

        # Train model
        model = train(params)
        wer = evaluate(model)

        # Log metrics
        mlflow.log_metric('wer', wer)

        # Log model
        mlflow.pytorch.log_model(model, 'model')

    return wer
Benefits:
- Reproducibility: Every experiment is tracked.
- Comparison: Compare trials in UI.
- Deployment: Promote best model to production.
21. Advanced: Population-Based Training (PBT)
Idea: Evolve hyperparameters during training (like genetic algorithms).
Algorithm:
1. Start with N models (population) with random hyperparameters.
2. Train all models for T steps.
3. Exploit: Replace worst 20% with copies of best 20%.
4. Explore: Perturb hyperparameters of copied models (e.g., `lr *= random.choice([0.8, 1.2])`).
5. Repeat steps 2-4.
Ray Tune PBT:
import numpy as np
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr='training_iteration',
    metric='wer',
    mode='min',
    perturbation_interval=5,
    hyperparam_mutations={
        'lr': lambda: np.random.uniform(1e-5, 1e-3),
        'dropout': lambda: np.random.uniform(0.1, 0.5)
    }
)

tune.run(train_model, scheduler=pbt, num_samples=20)
Advantage: Adapts hyperparameters online. Can find schedules that static tuning misses.
22. Deep Dive: Handling Noisy Objectives
Problem: WER varies due to randomness (data shuffling, weight initialization).
Solution: Run each config multiple times, report mean.
import numpy as np

def objective(trial):
    wers = []
    for seed in [42, 123, 456]:
        set_seed(seed)
        model = train(trial.params)
        wers.append(evaluate(model))
    return np.mean(wers)
Trade-off: 3x slower, but more reliable.
Alternative: Use larger validation set to reduce variance.
24. Deep Dive: Bayesian Optimization Internals for Speech
Speech hyperparameter spaces are often high-dimensional and continuous. Random search is inefficient. Bayesian Optimization (BO) builds a probabilistic model of the objective function $f(x)$ (e.g., WER) and uses it to select the most promising hyperparameters to evaluate next.
1. Gaussian Processes (GP): BO typically uses a GP as a surrogate model. A GP defines a distribution over functions.
- Prior: Before seeing any data, we assume $f(x)$ follows a multivariate normal distribution.
- Posterior: After observing data points $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, we update the distribution.
- Mean Function $\mu(x)$: The expected value of WER at hyperparameter configuration $x$.
- Covariance Function $k(x, x')$: Encodes assumptions about smoothness. If $x$ and $x'$ are similar, $f(x)$ and $f(x')$ should be similar.
- Common kernel: Matern 5/2 (allows for some roughness, suitable for non-smooth deep learning landscapes).
2. Acquisition Functions: How do we choose the next $x_{n+1}$? We maximize an acquisition function $\alpha(x)$.
- Expected Improvement (EI): \(EI(x) = \mathbb{E}[\max(f(x^*) - f(x), 0)]\) where $f(x^*)$ is the best WER observed so far. This balances exploring high-uncertainty regions and exploiting low-mean regions.
- Upper Confidence Bound (UCB): \(UCB(x) = \mu(x) - \kappa \sigma(x)\) (Note: minus because we minimize WER). $\kappa$ controls the exploration-exploitation trade-off.
3. Tree-Structured Parzen Estimator (TPE): Standard GPs scale cubically $O(n^3)$ with the number of trials. TPE (used by Optuna) scales linearly.
- Instead of modeling $p(y \mid x)$, TPE models $p(x \mid y)$ and $p(y)$.
- It defines two densities for hyperparameters $x$:
- $l(x)$ if $y < y^*$ (promising configs)
- $g(x)$ if $y \ge y^*$ (bad configs)
- It chooses $x$ to maximize the ratio $l(x) / g(x)$.
- Why it works for Speech: Speech pipelines have conditional hyperparameters (e.g., “If optimizer=SGD, tune momentum. If Adam, ignore momentum”). TPE handles this tree structure naturally.
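In Optuna this is just a conditional branch inside the objective; TPE is the default sampler, shown explicitly here (`train_and_evaluate` is an assumed helper returning WER):

import optuna

def objective(trial):
    params = {
        'optimizer': trial.suggest_categorical('optimizer', ['adam', 'sgd']),
        'lr': trial.suggest_float('lr', 1e-5, 1e-3, log=True),
    }
    if params['optimizer'] == 'sgd':
        # This branch only exists for SGD -- a tree-structured (conditional) parameter.
        params['momentum'] = trial.suggest_float('momentum', 0.8, 0.99)
    return train_and_evaluate(**params)

study = optuna.create_study(sampler=optuna.samplers.TPESampler(multivariate=True),
                            direction='minimize')
study.optimize(objective, n_trials=50)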
25. Deep Dive: Ray Tune Architecture
When tuning massive speech models, we need distributed compute. Ray Tune is the industry standard.
Architecture:
- Driver: The script where you define the search space and call `tune.run()`.
- Trial Executor: Manages the lifecycle of trials.
- Search Algorithm: (e.g., Optuna, HyperOpt) Suggests new configurations.
- Scheduler: (e.g., ASHA, PBT) Decides whether to pause, stop, or resume trials based on intermediate results.
- Trainable (Actor): A Ray Actor (process) that runs the training loop.
Resource Management:
- Ray abstracts resources (CPU, GPU), e.g. `resources_per_trial={"cpu": 4, "gpu": 1}`.
- If you have 8 GPUs, Ray Tune runs 8 concurrent trials.
- Fractional GPUs: `{"gpu": 0.5}` allows running 2 small trials on one GPU (useful for small ASR models or proxy tasks).
Fault Tolerance:
- Speech training takes days. Nodes fail.
- Ray Tune automatically checkpoints trials.
- If a node dies, Ray reschedules the trial on a healthy node and resumes from the last checkpoint.
Code Example: Custom Stopper
from ray.tune import Stopper

class WERPlateauStopper(Stopper):
    def __init__(self, patience=5, metric="wer"):
        self.patience = patience
        self.metric = metric
        self.best_wer = float("inf")
        self.no_improve_count = 0

    def __call__(self, trial_id, result):
        current_wer = result[self.metric]
        if current_wer < self.best_wer:
            self.best_wer = current_wer
            self.no_improve_count = 0
        else:
            self.no_improve_count += 1
        return self.no_improve_count >= self.patience

    def stop_all(self):
        return False
26. System Design: Auto-Tuning Pipeline for Production
Scenario: A company constantly ingests new audio data (call center logs). They need to retrain and retune models weekly.
Pipeline:
- Data Ingestion:
  - New audio lands in S3.
  - Airflow job triggers preprocessing (MFCC extraction, text normalization).
- Proxy Dataset Creation:
  - Randomly sample 5% of the new data (~100 hours) for hyperparameter tuning.
  - Full dataset (2000 hours) reserved for final training.
- Hyperparameter Search (Ray Tune + K8s):
  - Spin up ephemeral K8s cluster with Spot Instances (cheaper).
  - Run 50 trials of ASHA on the Proxy Dataset (sketch below).
  - Search space: LR, SpecAugment, Dropout.
  - Output: Best configuration JSON.
- Full Training:
  - Launch a distributed training job (PyTorch DDP) on the full dataset using the Best Configuration.
  - No tuning here, just training.
- Evaluation & Gating:
  - Evaluate on a held-out Golden Set.
  - If WER < Current Production Model, promote to Staging.
- Deployment:
  - TorchServe loads the new model.
  - Canary deployment to 1% traffic.
Benefit: This decouples the expensive “Search” phase (done on small data) from the expensive “Train” phase (done once).
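A sketch of the search step with Ray Tune's ASHA scheduler; `train_asr_proxy` and `search_space` are assumed to be defined as in the pipeline above:

from ray import tune
from ray.tune.schedulers import ASHAScheduler

asha = ASHAScheduler(metric='wer', mode='min', max_t=50, grace_period=5, reduction_factor=4)
analysis = tune.run(
    train_asr_proxy,                 # trainable that reports 'wer' on the proxy dataset
    config=search_space,             # LR, SpecAugment, Dropout
    num_samples=50,
    scheduler=asha,
    resources_per_trial={'gpu': 1},
)
best_config = analysis.get_best_config(metric='wer', mode='min')  # written out as the config JSON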
27. Deep Dive: Tuning for Edge Deployment
Deploying ASR on mobile devices (Android/iOS) introduces new constraints: Model Size and Latency.
Hyperparameters to Tune:
- Quantization Aware Training (QAT):
- Bit Width: 8-bit vs 4-bit weights.
- Observer Type: MinMax vs MovingAverage.
- Pruning:
- Sparsity Level: 50%? 75%? 90%?
- Pruning Schedule: Linear vs Cubic.
- Architecture:
- Depth Multiplier: Scale down channel dimensions (MobileNet style).
- Streaming Chunk Size: 100ms vs 400ms (Latency vs Accuracy trade-off).
Multi-Objective Optimization: We want to minimize WER and minimize Latency. \(Loss = WER + \lambda \times Latency\)
Pareto Frontier: Instead of a single best model, we want a set of models that represent optimal trade-offs.
- Model A: WER 5%, Latency 200ms.
- Model B: WER 6%, Latency 50ms.
Optuna Multi-Objective:
import optuna

def objective(trial):
    # ... build and evaluate model ...
    wer = evaluate_wer(model)
    latency = measure_latency(model)
    return wer, latency

study = optuna.create_study(directions=["minimize", "minimize"])
study.optimize(objective, n_trials=100)

# Plot Pareto Frontier
optuna.visualization.plot_pareto_front(study)
28. Deep Dive: Hyperparameters for Low-Resource Speech
When you only have 10 hours of Swahili audio, tuning is different.
Key Hyperparameters:
- Dropout: Needs to be much higher (0.3 - 0.5) to prevent overfitting.
- SpecAugment: Aggressive masking helps significantly.
- Freezing Layers:
- Start with a pre-trained English Wav2Vec 2.0.
- Hyperparam: How many bottom layers to freeze? (Freeze 0? 6? 12?)
- Tuning often shows freezing the feature extractor (CNN) is crucial, but fine-tuning top Transformer layers is necessary.
- Learning Rate: Needs to be smaller for fine-tuning ($1e-5$) compared to pre-training ($1e-3$).
Few-Shot Tuning:
- Use MAML (Model-Agnostic Meta-Learning) to find initial hyperparameters that adapt quickly to new languages.
29. Case Study: Tuning Whisper for Code-Switching
Problem: “Hinglish” (Hindi + English) ASR. Base Model: OpenAI Whisper Large-v2.
Tuning Strategy:
- LoRA (Low-Rank Adaptation): Fine-tuning 1.5B parameters is too slow. Tune low-rank matrices instead.
- Hyperparameters:
- Rank (r): 8? 16? 64? (Higher = more capacity, slower).
- Alpha: Scaling factor.
- Target Modules: Query/Value projections? Or all linear layers?
Results:
- Tuning `r=16` on `q_proj` and `v_proj` yielded the best results.
- Tuning all linear layers led to overfitting on the small Hinglish dataset.
- Learning Rate: $1e-4$ was optimal (standard fine-tuning uses $1e-5$, but LoRA allows higher LR).
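A sketch of this setup using the Hugging Face peft library, assuming the values reported above (this is not the exact code from the case study):

from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained('openai/whisper-large-v2')
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=['q_proj', 'v_proj'])
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters train; the 1.5B base stays frozen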
30. Advanced: Differentiable Architecture Search (DARTS) Math
NAS is usually discrete (try architecture A, then B). DARTS relaxes this to a continuous space.
Concept:
- Construct a super-graph containing all possible operations (Conv3x3, Conv5x5, MaxPool, Identity) between nodes.
- Assign a weight $\alpha_o$ to each operation $o$.
- The output of a node is a weighted sum: $\bar{o}(x) = \sum_{o \in O} \frac{\exp(\alpha_o)}{\sum_{o'} \exp(\alpha_{o'})} o(x)$.
- We can now differentiate WER with respect to $\alpha$!
Bilevel Optimization: \(\min_\alpha \mathcal{L}_{val}(w^*(\alpha), \alpha)\) \(\text{s.t. } w^*(\alpha) = \text{argmin}_w \mathcal{L}_{train}(w, \alpha)\)
- Inner loop: Train weights $w$ (standard SGD).
- Outer loop: Update architecture $\alpha$ (gradient descent on validation loss).
Application to Speech:
- Used to discover optimal Convolution cells for ASR encoders.
- Result: Found architectures that outperform manually designed ResNets with fewer parameters.
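A minimal PyTorch sketch of the continuous relaxation for a single edge (1-D ops as a stand-in for an ASR encoder cell; op choices are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    # Weighted sum over candidate ops; alpha are the architecture parameters.
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2),
            nn.MaxPool1d(kernel_size=3, stride=1, padding=1),
            nn.Identity(),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)  # the softmax term in the formula above
        return sum(w * op(x) for w, op in zip(weights, self.ops))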
31. Deep Dive: The Interaction of Hyperparameters
Hyperparameters are not independent.
- Batch Size & Learning Rate:
- Linear Scaling Rule: If you double batch size, double learning rate.
- Square Root Rule: Multiply LR by $\sqrt{2}$.
- Tuning implication: Don’t tune them independently. Tune `base_lr` and scale it dynamically based on `batch_size`.
- Model Depth & Residual Scale:
- Deeper models (50+ layers) are harder to train.
- DeepNorm / ReZero: Scale weights by $\frac{1}{\sqrt{2N}}$.
- Tuning: If `num_layers` is a hyperparameter, the initialization scale must also be a function of it.
- Regularization & Data Size:
- If you increase SpecAugment, you might need to decrease Dropout. They both provide regularization. Too much leads to underfitting.
32. Best Practices & Pitfalls
Do’s:
- Log Everything: Use W&B / MLflow. You will forget what Config #34 was.
- Set Random Seeds: For reproducibility.
- Use Log Scale: For LR and Weight Decay. $1e-4$ to $1e-2$ is a huge range linearly, but reasonable logarithmically.
- Monitor Gradient Norm: If gradients explode, your LR is too high, regardless of what the tuner says.
Don’ts:
- Don’t Tune on Test Set: The cardinal sin. You will overfit to the test set. Use a Validation set.
- Don’t Grid Search: It’s a waste of compute. Random Search is better. Bayesian is best.
- Don’t Ignore Defaults: Start with SOTA defaults (e.g., from ESPnet recipes). Tune around them.
- Don’t Tune Everything: Focus on LR, Batch Size, Regularization. Architecture tuning yields diminishing returns compared to data cleaning.
33. Cost-Benefit Analysis of Tuning
Is it worth it?
Scenario:
- Baseline Model: WER 10.0%. Training cost $100.
- Tuned Model: WER 9.5%. Tuning cost $2000 (20 trials).
ROI Calculation:
- If this is a hobby project: No.
- If this is a call center transcribing 1M hours/year:
- 0.5% WER reduction = 5% fewer human corrections.
- Human correction cost = $100/hour.
- Savings = Huge. Yes.
Green AI:
- Hyperparameter tuning has a massive carbon footprint.
- Mitigation: Use Transfer Learning, Multi-Fidelity tuning, and share best configs (Model Cards).
34. Future Trends: LLM-driven Tuning
AutoML-Zero: Can we evolve the code of the algorithm?
LLM as Tuner:
- Feed the training logs (loss curves) to GPT-4.
- Ask: “The loss is oscillating. What should I change?”
- GPT-4: “Decrease learning rate by half and increase beta2.”
- Why it works: LLMs have read millions of ML papers and GitHub issues. They understand the physics of training dynamics better than random search.
- OMNI (OpenAI): Future systems might just take data + metric and output a deployed API, handling all tuning internally.
35. Deep Dive: Troubleshooting Common Tuning Failures
Even with Optuna, things go wrong.
1. The “Flatline” Loss:
- Symptom: Loss stays constant from epoch 0.
- Cause: LR too high (gradients explode or saturate) or too low (weights barely move).
- Fix: Tune LR on a logarithmic scale from $1e-6$ to $1e-1$.
2. The “Divergence” Spike:
- Symptom: Loss decreases, then suddenly shoots to NaN.
- Cause: Batch size too small for the LR, or bad data batch.
- Fix: Gradient Clipping (`clip_grad_norm_`). Tune the clipping threshold (1.0 vs 5.0).
3. The “Overfitting” Gap:
- Symptom: Train loss 0.1, Val loss 5.0.
- Cause: Model too big, not enough regularization.
- Fix: Increase Dropout, Weight Decay, and SpecAugment.
4. The “OOM” (Out of Memory):
- Symptom: CUDA OOM error.
- Cause: Batch size too large.
- Fix: Prune trials that OOM. Ray Tune handles this by marking the trial as failed.
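For the OOM case, a common Optuna pattern is to catch the error and prune the trial so the study keeps running (`build_model` and `train_and_evaluate` are assumed helpers):

import optuna

def objective(trial):
    try:
        model = build_model(trial)           # suggests batch_size, d_model, etc.
        return train_and_evaluate(model)     # returns WER
    except RuntimeError as err:
        if 'out of memory' in str(err).lower():
            raise optuna.TrialPruned() from err  # mark trial as pruned, not failed
        raise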
36. Deep Dive: The Mathematics of Learning Rate Schedules
Why do we need schedules?
SGD Update: \(w_{t+1} = w_t - \eta \nabla L(w_t)\)
The Landscape:
- Early training: Landscape is rough. Large steps help escape local valleys.
- Late training: We are near the minimum. Large steps oscillate. We need to decay $\eta$.
Cosine Annealing: \(\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{T_{cur}}{T_{max}}\pi))\)
- Smooth decay. No sharp drops.
- Hyperparams: $T_{max}$ (cycle length), $\eta_{min}$.
Cyclic Learning Rates (CLR):
- Oscillate LR between base and max.
- Intuition: “Pop” the model out of sharp minima (poor generalization) into flat minima (good generalization).
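For reference, the cosine annealing formula above as a standalone function:

import math

def cosine_lr(step, total_steps, lr_max=1e-3, lr_min=1e-5):
    # eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * step / total_steps))
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))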
37. Deep Dive: The Future - Quantum Hyperparameter Optimization
As classical computers hit Moore’s Law limits, Quantum Computing offers a new frontier.
Quantum Annealing:
- D-Wave systems can solve optimization problems by finding the ground state of a Hamiltonian.
- Application: Finding the optimal discrete architecture (NAS) can be mapped to a QUBO (Quadratic Unconstrained Binary Optimization) problem.
- Speedup: Potentially exponential speedup for discrete search spaces.
Grover’s Search:
- Can search an unstructured database of $N$ items in $O(\sqrt{N})$ time.
- Implication: Random search could become quadratically faster.
38. Ethical Considerations in Hyperparameter Tuning
Tuning is not value-neutral.
1. Bias Amplification:
- If you tune for global WER, the model might sacrifice accuracy on minority accents to improve the majority.
- Fix: Tune for Worst-Case WER across demographic groups, not Average WER.
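A sketch of such a minimax objective (helper names are assumptions): instead of returning the average WER, return the worst per-group WER so the tuner cannot trade minority accents for the mean.

def objective(trial):
    model = train_asr(trial)
    group_wers = {group: evaluate_wer(model, group)
                  for group in ['accent_a', 'accent_b', 'accent_c']}
    return max(group_wers.values())  # minimize the worst-case group WER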
2. Energy Consumption:
- Training a large Transformer with NAS emits as much CO2 as 5 cars in their lifetime.
- Responsibility: Report “CO2e” alongside WER in papers. Use Green algorithms.
3. Accessibility:
- Only Big Tech has the compute to tune 100B parameter models.
- Democratization: Release pre-tuned checkpoints and “recipes” so smaller labs don’t have to re-tune from scratch.
- Transparency: Disclose the carbon footprint of your tuning process.
39. Further Reading
To dive deeper into the mathematics and systems of hyperparameter tuning, check out these seminal papers:
- “Algorithms for Hyper-Parameter Optimization” (Bergstra et al., 2011): The paper that introduced TPE and showed Random Search beats Grid Search.
- “Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization” (Li et al., 2018): The foundation of modern early-stopping strategies.
- “Ray Tune: A Framework for Distributed Hyperparameter Tuning” (Liaw et al., 2018): The system design behind scalable tuning.
- “Optuna: A Next-generation Hyperparameter Optimization Framework” (Akiba et al., 2019): Introduced the define-by-run API that we use today.
- “Neural Architecture Search with Reinforcement Learning” (Zoph & Le, 2017): The paper that started the NAS craze (and burned a lot of GPU hours).
- “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition” (Park et al., 2019): Essential reading for speech regularization.
40. Summary
| Aspect | Strategy |
|---|---|
| Search | Bayesian Optimization (Optuna) |
| Fidelity | Hyperband (small data first) |
| Architecture | NAS (ENAS, DARTS) |
| Parallelization | Ray Tune (multi-GPU) |
| Transfer | Use pre-tuned models |
| Early Stopping | Median Pruner |
| Tracking | MLflow Registry |
| Edge | Multi-objective (WER + Latency) |
| Production | Auto-tuning pipelines on K8s |
| Troubleshooting | Log-scale LR, Gradient Clipping |
| Ethics | Tune for Worst-Case WER (Fairness) |
Originally published at: arunbaby.com/speech-tech/0038-speech-hyperparameter-tuning