Hyperparameter Optimization
“Finding the perfect knobs to turn.”
1. The Problem: Too Many Knobs
Training a neural network involves many hyperparameters:
- Learning rate: 0.001? 0.01? 0.0001?
- Batch size: 32? 64? 128?
- Number of layers: 3? 5? 10?
- Dropout rate: 0.1? 0.3? 0.5?
- Optimizer: Adam? SGD? AdamW?
Challenge: The search space is exponential. For 10 hyperparameters with 5 values each, that’s $5^{10} \approx 9.8$ million combinations!
2. Search Strategies
1. Grid Search
- Idea: Try all combinations.
- Pros: Exhaustive, guaranteed to find best in grid.
- Cons: Exponentially expensive.
from sklearn.model_selection import GridSearchCV
param_grid = {
'learning_rate': [0.001, 0.01, 0.1],
'batch_size': [32, 64, 128],
'num_layers': [3, 5, 7]
}
# Total trials: 3 × 3 × 3 = 27
2. Random Search
- Idea: Sample random combinations.
- Pros: More efficient than grid search.
- Insight (Bergstra & Bengio, 2012): for most problems only a few hyperparameters really matter. Random search tries a fresh value of every hyperparameter in every trial, so it covers the important dimensions far more densely than a grid of the same size.
from sklearn.model_selection import RandomizedSearchCV
param_distributions = {
'learning_rate': [0.0001, 0.001, 0.01, 0.1],
'batch_size': [16, 32, 64, 128, 256]
}
# Try 20 random combinations
3. Bayesian Optimization
- Idea: Build a probabilistic model of the objective function.
- Acquisition Function: Decides where to sample next (balance exploration vs. exploitation).
- Pros: Sample-efficient. Converges faster than random search.
3. Bayesian Optimization Deep Dive
Algorithm:
- Surrogate Model: Gaussian Process (GP) models $f(\theta) \approx \text{validation accuracy}$.
- Acquisition Function: Expected Improvement (EI) or Upper Confidence Bound (UCB). \(\text{EI}(\theta) = \mathbb{E}[\max(f(\theta) - f(\theta^*), 0)]\) Where $\theta^*$ is the current best.
- Optimize Acquisition: Find $\theta$ that maximizes EI.
- Evaluate: Train model with $\theta$, observe accuracy.
- Update GP: Add new observation, repeat.
Libraries:
- Optuna: Most popular in ML.
- Hyperopt: Tree-structured Parzen Estimator (TPE).
- Ray Tune: Distributed tuning at scale.
4. Optuna Example
import optuna
def objective(trial):
    # Suggest hyperparameters
    lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical('batch_size', [16, 32, 64])
    dropout = trial.suggest_float('dropout', 0.1, 0.5)
    # Train model
    model = build_model(lr, batch_size, dropout)
    val_acc = train_and_evaluate(model)
    return val_acc
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(f"Best params: {study.best_params}")
print(f"Best value: {study.best_value}")
5. Advanced: Multi-Fidelity Optimization
Problem: Evaluating each trial is expensive (train for 100 epochs).
Solution: Successive Halving (Hyperband).
- Start with many trials, train for 1 epoch.
- Keep top 50%, train for 2 epochs.
- Keep top 50%, train for 4 epochs.
- Repeat until 1 trial remains, train for 100 epochs.
Speedup: 10-100x faster than full evaluation.
6. Ray Tune for Distributed Tuning
from ray import tune
def train_model(config):
    model = build_model(config['lr'], config['batch_size'])
    for epoch in range(10):
        loss = train_epoch(model)
        tune.report(loss=loss)
config = {
'lr': tune.loguniform(1e-5, 1e-1),
'batch_size': tune.choice([16, 32, 64])
}
analysis = tune.run(
    train_model,
    config=config,
    num_samples=100,
    metric='loss',   # needed so analysis.best_config knows what to optimize
    mode='min',
    resources_per_trial={'gpu': 1}
)
print(f"Best config: {analysis.best_config}")
7. Summary
| Method | Trials Needed | Pros | Cons |
|---|---|---|---|
| Grid | $O(k^n)$ | Exhaustive | Exponential |
| Random | $O(100)$ | Simple | Inefficient |
| Bayesian | $O(50)$ | Sample-efficient | Complex |
| Hyperband | $O(20)$ | Very fast | Needs early stopping |
8. Deep Dive: Acquisition Functions
Acquisition functions decide where to sample next in Bayesian Optimization.
1. Expected Improvement (EI)
\(\text{EI}(\theta) = \mathbb{E}[\max(f(\theta) - f(\theta^*), 0)]\)
- Intuition: How much better can we expect this point to be?
- Pros: Balances exploration (high uncertainty) and exploitation (high mean).
2. Upper Confidence Bound (UCB)
\(\text{UCB}(\theta) = \mu(\theta) + \kappa \sigma(\theta)\)
- $\mu$: Predicted mean.
- $\sigma$: Predicted std dev (uncertainty).
- $\kappa$: Exploration parameter (typically 2-3).
- Intuition: Optimistic estimate. “This could be really good!”
3. Probability of Improvement (PI)
\(\text{PI}(\theta) = P(f(\theta) > f(\theta^*))\)
- Intuition: What’s the chance this beats the current best?
- Cons: Too greedy, doesn’t care how much better.
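To make these formulas concrete, here is a minimal numpy/scipy sketch (not tied to any particular library) that computes all three acquisition functions from a surrogate’s predicted mean and standard deviation; the `mu`, `sigma`, and `best_f` values below are made-up illustrative numbers.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_f):
    # EI(x) = E[max(f(x) - f*, 0)] under a Gaussian posterior N(mu, sigma^2)
    imp = mu - best_f
    z = imp / np.maximum(sigma, 1e-9)
    return imp * norm.cdf(z) + sigma * norm.pdf(z)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    # Optimistic estimate: mean plus kappa standard deviations
    return mu + kappa * sigma

def probability_of_improvement(mu, sigma, best_f):
    # P(f(x) > f*): greedy, ignores how large the improvement would be
    return norm.cdf((mu - best_f) / np.maximum(sigma, 1e-9))

# Toy example: 5 candidate points as predicted by some surrogate model
mu = np.array([0.70, 0.72, 0.68, 0.75, 0.71])
sigma = np.array([0.01, 0.05, 0.10, 0.02, 0.08])
best_f = 0.73
print(expected_improvement(mu, sigma, best_f))
print(upper_confidence_bound(mu, sigma))
print(probability_of_improvement(mu, sigma, best_f))
```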
9. Deep Dive: Hyperband Algorithm
Problem: Training to convergence is expensive. Can we stop bad trials early?
Hyperband (Successive Halving + Adaptive Resource Allocation):
import numpy as np

def hyperband(max_iter=81, eta=3):
    # max_iter: max epochs per trial; eta: downsampling rate
    # random_config(), train(config, epochs), top_k(configs, results, k)
    # are placeholders for your own sampling / training / selection code.
    s_max = int(np.log(max_iter) / np.log(eta))
    B = (s_max + 1) * max_iter
    for s in reversed(range(s_max + 1)):
        n = int(np.ceil(B / max_iter / (s + 1) * eta**s))
        r = max_iter * eta**(-s)
        # Generate n random configurations
        configs = [random_config() for _ in range(n)]
        for i in range(s + 1):
            n_i = int(n * eta**(-i))
            r_i = int(r * eta**i)
            # Train each config for r_i epochs
            results = [train(c, r_i) for c in configs]
            # Keep the top 1/eta configs
            configs = top_k(configs, results, max(1, int(n_i / eta)))
    # A full implementation tracks the best config across all brackets;
    # here we return the survivor of the final bracket.
    return configs[0]
Example: max_iter=81, eta=3
- Round 1: 81 configs, 1 epoch each.
- Round 2: 27 configs (top 1/3), 3 epochs each.
- Round 3: 9 configs, 9 epochs each.
- Round 4: 3 configs, 27 epochs each.
- Round 5: 1 config, 81 epochs.
10. Deep Dive: Parallel Hyperparameter Tuning
Challenge: Bayesian Optimization is sequential (needs previous results to decide next point).
Solution 1: Batch Bayesian Optimization
- Use acquisition function to select top-$k$ points.
- Evaluate them in parallel.
- Update GP with all $k$ results.
Solution 2: Asynchronous Successive Halving (ASHA)
- Don’t wait for all trials to finish.
- As soon as a trial completes an epoch, decide: promote or kill.
# Ray Tune with ASHA
from ray.tune.schedulers import ASHAScheduler
scheduler = ASHAScheduler(
    max_t=100,           # Max epochs
    grace_period=10,     # Min epochs before stopping
    reduction_factor=3
)
tune.run(
    train_model,
    config=config,
    num_samples=100,
    metric='loss',
    mode='min',
    scheduler=scheduler,
    resources_per_trial={'gpu': 1}
)
11. System Design: Hyperparameter Tuning Platform
Scenario: Build a platform for 100 ML engineers to tune models.
Requirements:
- Scalability: 1000s of concurrent trials.
- Reproducibility: Track all experiments.
- Visualization: Compare trials easily.
Architecture:
- Scheduler: Ray Tune (distributed).
- Tracking: Weights & Biases (W&B) or MLflow.
- Storage: S3 for checkpoints.
- Compute: Kubernetes cluster with autoscaling.
Code:
import wandb
from ray import tune
def train_with_logging(config):
    wandb.init(project='hyperparameter-tuning', config=config)
    model = build_model(config)
    for epoch in range(100):
        loss = train_epoch(model)
        wandb.log({'loss': loss, 'epoch': epoch})
        tune.report(loss=loss)

tune.run(
    train_with_logging,
    config=search_space,
    num_samples=1000
)
12. Deep Dive: Transfer Learning for Hyperparameters
Idea: If we tuned hyperparameters for Task A, can we use them for Task B?
Meta-Learning Approach:
- Collect tuning history from many tasks.
- Train a model: $f(\text{task features}) \rightarrow \text{good hyperparameters}$.
- For new task, predict good starting point.
Example: Google Vizier uses this internally.
13. Production Considerations
- Cost: Each trial costs GPU hours. Set a budget.
- Reproducibility: Always set random seeds.
- Monitoring: Track resource usage (GPU util, memory).
- Checkpointing: Save model every N epochs (for Hyperband).
- Early Stopping: Don’t waste time on diverging models.
14. Deep Dive: Population-Based Training (PBT)
Origin: DeepMind (2017). Used to train AlphaStar and Waymo agents.
Concept:
- Combines Random Search (exploration) with Greedy Selection (exploitation).
- Instead of fixed hyperparameters, PBT evolves them during training.
Algorithm:
- Initialize: Start a population of $N$ models with random hyperparameters.
- Train: Train all models for $k$ steps.
- Eval: Evaluate performance.
- Exploit: Replace the bottom 20% of models with copies of the top 20%.
- Explore: Perturb the hyperparameters of the copied models (mutation), e.g. `lr = lr * random.choice([0.8, 1.2])`; see the sketch below.
- Repeat: Continue training.
Benefits:
- Dynamic Schedules: Discovers complex schedules (e.g., “start with high LR, then decay, then spike”).
- Efficiency: No wasted compute on bad trials (they get killed).
- Single Run: You get a fully trained model at the end, not just a config.
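To make the exploit/explore loop concrete, here is a toy PBT sketch (not DeepMind’s implementation); `init_model`, `train_steps`, and `evaluate` are hypothetical stand-ins for your own model construction, training, and validation code.

```python
import copy
import random

def pbt(population_size=10, rounds=20):
    # Each member carries its own hyperparameters and model state.
    population = [{'lr': 10 ** random.uniform(-5, -1), 'model': init_model()}
                  for _ in range(population_size)]
    k = max(1, population_size // 5)  # size of the top/bottom 20%

    for _ in range(rounds):
        # Train: every member trains for a fixed number of steps
        for member in population:
            train_steps(member['model'], lr=member['lr'], steps=1000)
        # Eval: rank members by validation performance (ascending)
        scores = [evaluate(m['model']) for m in population]
        ranked = sorted(range(population_size), key=lambda i: scores[i])
        losers, winners = ranked[:k], ranked[-k:]
        # Exploit: bottom 20% copy weights + hyperparameters from the top 20%
        for loser in losers:
            population[loser] = copy.deepcopy(population[random.choice(winners)])
            # Explore: perturb the copied hyperparameters (mutation)
            population[loser]['lr'] *= random.choice([0.8, 1.2])
    return population
```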
15. Deep Dive: Neural Architecture Search (NAS)
Hyperparameters aren’t just numbers (LR, Batch Size). They can be the architecture itself.
Search Space:
- Number of layers.
- Operation type (Conv3x3, Conv5x5, MaxPool).
- Skip connections.
Algorithms:
- Reinforcement Learning (RL):
- Controller (RNN) generates an architecture string.
- Train child network, get accuracy (Reward).
- Update Controller using Policy Gradient.
- Cons: Extremely slow (2000 GPU-days for original NAS).
- Evolutionary Algorithms (EA):
- Mutate architectures (add layer, change filter size).
- Select best, repeat.
- Example: AmoebaNet.
- Differentiable NAS (DARTS):
- Relax discrete choices into continuous weights (softmax).
- Train architecture weights $\alpha$ and model weights $w$ simultaneously using gradient descent.
- Pros: Fast (single GPU-day).
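A minimal numpy sketch of the DARTS relaxation: the discrete choice between operations on an edge is replaced by a softmax-weighted mixture, so the architecture logits $\alpha$ can be trained by gradient descent. The candidate ops here are toy stand-ins, not a real DARTS cell.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

# Candidate operations for one edge of the cell (toy stand-ins)
ops = [
    lambda x: x,                 # identity / skip connection
    lambda x: np.maximum(x, 0),  # "conv-like" nonlinearity
    lambda x: np.zeros_like(x),  # zero op (prunes the edge)
]

# Architecture parameters: one logit per candidate op, learned jointly with the weights
alpha = np.array([0.1, 1.5, -0.3])

def mixed_op(x, alpha):
    # Continuous relaxation: output is the softmax-weighted sum of all candidate ops
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, ops))

x = np.array([-1.0, 0.5, 2.0])
print(mixed_op(x, alpha))
# After training, the discrete architecture keeps argmax(alpha) on each edge.
```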
16. Deep Dive: The Math of Gaussian Processes (GP)
Bayesian Optimization relies on GPs. What are they?
Definition: A GP is a distribution over functions, defined by a mean function $m(x)$ and a covariance function (kernel) $k(x, x')$.
\[f(x) \sim GP(m(x), k(x, x'))\]
Kernels:
- RBF (Radial Basis Function): Smooth functions. \(k(x, x') = \sigma^2 \exp(-\frac{||x - x'||^2}{2l^2})\)
- Matern: Rougher functions (better for deep learning landscapes).
Posterior Update: Given observed data $D = \{(x_i, y_i)\}$, the predictive distribution for a new point $x_*$ is Gaussian: \(P(f_* | D, x_*) = \mathcal{N}(\mu_*, \Sigma_*)\)
\(\mu_* = K_*^T (K + \sigma_n^2 I)^{-1} y\) \(\Sigma_* = K_{**} - K_*^T (K + \sigma_n^2 I)^{-1} K_*\)
- $\mu_*$: Predicted value (Exploitation).
- $\Sigma_*$: Uncertainty (Exploration).
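The posterior equations above translate almost line-for-line into numpy. Below is a minimal sketch with an RBF kernel; the training points (log learning rate vs. accuracy) are fabricated purely for illustration.

```python
import numpy as np

def rbf_kernel(A, B, sigma_f=1.0, length=1.0):
    # k(x, x') = sigma_f^2 * exp(-||x - x'||^2 / (2 l^2))
    d2 = (A[:, None, :] - B[None, :, :]) ** 2
    return sigma_f**2 * np.exp(-d2.sum(-1) / (2 * length**2))

def gp_posterior(X_train, y_train, X_star, noise=1e-4):
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_train, X_star)       # K_*
    K_ss = rbf_kernel(X_star, X_star)          # K_**
    K_inv = np.linalg.inv(K)
    mu_star = K_star.T @ K_inv @ y_train             # predicted mean (exploitation)
    cov_star = K_ss - K_star.T @ K_inv @ K_star      # predictive covariance (exploration)
    return mu_star, cov_star

# Toy observations: accuracy as a function of log10(learning rate)
X_train = np.array([[-4.0], [-3.0], [-2.0]])
y_train = np.array([0.80, 0.88, 0.84])
X_star = np.linspace(-5, -1, 5).reshape(-1, 1)
mu, cov = gp_posterior(X_train, y_train, X_star)
print(mu)                     # predicted accuracy at new points
print(np.sqrt(np.diag(cov)))  # uncertainty at new points
```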
17. Deep Dive: Tree-Structured Parzen Estimator (TPE)
Optuna uses TPE by default. It’s faster than GPs for high dimensions.
Idea: Instead of modeling $P(y \mid x)$ as a GP does, model $P(x \mid y)$ and $P(y)$.
- Split Data: Divide observations into two groups:
- Top 20% (Good): $l(x)$
- Bottom 80% (Bad): $g(x)$
- Density Estimation: Fit Kernel Density Estimators (KDE) to $l(x)$ and $g(x)$.
- “What do good hyperparameters look like?”
- “What do bad hyperparameters look like?”
- Acquisition: Maximize Expected Improvement, which simplifies to maximizing: \(\frac{l(x)}{g(x)}\)
Intuition: Pick $x$ that is highly likely under the “Good” distribution and unlikely under the “Bad” distribution.
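A toy sketch of the TPE idea using scipy’s Gaussian KDE: fit one density to the “good” trials and one to the “bad” trials, then pick the candidate that maximizes $l(x)/g(x)$. The trial history below is fabricated for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Past trials: (log10 learning rate, validation accuracy)
lrs = np.array([-5.0, -4.5, -4.0, -3.5, -3.0, -2.5, -2.0, -1.5, -1.0, -0.5])
accs = np.array([0.70, 0.74, 0.80, 0.86, 0.88, 0.87, 0.82, 0.75, 0.60, 0.50])

# Split into "good" (top 20%) and "bad" (the rest) by objective value
threshold = np.quantile(accs, 0.8)
good, bad = lrs[accs >= threshold], lrs[accs < threshold]

l = gaussian_kde(good)   # density of good hyperparameters, l(x)
g = gaussian_kde(bad)    # density of bad hyperparameters, g(x)

# Candidate points: pick the one maximizing l(x) / g(x)
candidates = np.linspace(-5, -0.5, 50)
ratio = l(candidates) / (g(candidates) + 1e-12)
print("Next log10(lr) to try:", candidates[np.argmax(ratio)])
```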
18. Case Study: Tuning BERT for Production
Scenario: Fine-tuning BERT-Large for Sentiment Analysis.
Search Space:
- Learning Rate: $1e-5, 2e-5, 3e-5, 5e-5$.
- Batch Size: 16, 32.
- Epochs: 2, 3, 4.
- Warmup Steps: 0, 100, 500.
Key Findings (RoBERTa paper):
- Batch Size: Larger is better (up to a point).
- Training Duration: Training longer with smaller LR is better than short/high LR.
- Layer-wise LR Decay: Lower layers (closer to input) capture general features, need smaller LR. Higher layers need larger LR.
- $\text{LR}_{\text{layer}} = \text{LR}_{\text{base}} \times \xi^{L - \text{layer}}$ where $\xi = 0.95$ (see the sketch below).
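As a quick illustration (example values, not the RoBERTa recipe), layer-wise LR decay boils down to computing one learning rate per layer and handing them to the optimizer as separate parameter groups:

```python
def layerwise_lrs(base_lr=2e-5, num_layers=24, xi=0.95):
    # LR_layer = LR_base * xi^(L - layer): layers near the output keep the base LR,
    # layers near the input get progressively smaller LRs.
    return {layer: base_lr * xi ** (num_layers - layer)
            for layer in range(1, num_layers + 1)}

lrs = layerwise_lrs()
print(f"Layer 1  (input-side):  {lrs[1]:.2e}")
print(f"Layer 24 (output-side): {lrs[24]:.2e}")
# These per-layer LRs typically become separate parameter groups in the optimizer.
```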
19. Case Study: AlphaGo Zero Tuning
Problem: Tuning Monte Carlo Tree Search (MCTS) + Neural Network.
Hyperparameters:
- $c_{puct}$: Exploration constant in MCTS.
- Dirichlet Noise $\alpha$: Noise added to root node for exploration.
- Self-play games: How many games before retraining?
Strategy:
- Self-Play Evaluation: New model plays 400 games against old model.
- Gating: Only promote if win rate > 55%.
- Massive Parallelism: Thousands of TPUs generating self-play data.
20. System Design: Scalable Tuning Infrastructure
Components:
- Experiment Manager (Katib / Ray Tune):
- Stores search space config.
- Generates trials.
- Trial Runner (Kubernetes Pods):
- Pulls Docker image.
- Runs training code.
- Reports metrics to Manager.
- Database (MySQL / PostgreSQL):
- Stores trial history (params, metrics).
- Dashboard (Vizier / W&B):
- Visualizes parallel coordinate plots.
Scalability Challenges:
- Database Bottleneck: 1000 concurrent trials reporting metrics every second.
- Fix: Buffer metrics in Redis, flush to DB periodically.
- Pod Startup Latency: K8s takes 30s to start a pod.
- Fix: Use a pool of warm pods (Ray Actors).
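A rough sketch of the Redis buffering fix, assuming the redis-py client and a hypothetical `bulk_insert_metrics` helper on the SQL side: trials push metrics into a Redis list, and a background job drains the list into the database in batches.

```python
import json
import redis  # assumes the redis-py client is installed

r = redis.Redis(host='localhost', port=6379)

def report_metric(trial_id, step, value):
    # Called by every trial, potentially thousands of times per second.
    # O(1) push to Redis instead of one SQL INSERT per metric.
    r.rpush('metric_buffer', json.dumps(
        {'trial_id': trial_id, 'step': step, 'value': value}))

def flush_to_db(batch_size=1000):
    # Periodic background job: drain the buffer and bulk-insert into the DB.
    batch = []
    for _ in range(batch_size):
        item = r.lpop('metric_buffer')
        if item is None:
            break
        batch.append(json.loads(item))
    if batch:
        bulk_insert_metrics(batch)  # hypothetical helper wrapping a single bulk SQL INSERT
```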
21. Deep Dive: Multi-Objective Optimization
Real World: We don’t just want accuracy. We want:
- Maximize Accuracy.
- Minimize Latency.
- Minimize Model Size.
Pareto Frontier:
- A set of solutions where you cannot improve one objective without hurting another.
- Dominated Solution: Some other solution is at least as good in every objective and strictly better in at least one.
- Non-Dominated Solution: Not dominated by any other solution; it wins on at least one trade-off.
Scalarization:
- Convert to single objective: $L = w_1 \cdot Acc + w_2 \cdot \frac{1}{Lat}$.
- Problem: Need to tune weights $w$.
NSGA-II (Non-dominated Sorting Genetic Algorithm):
- Used by Optuna for multi-objective search.
- Maintains a population of Pareto-optimal solutions.
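Optuna supports this directly: a study can have multiple optimization directions and an NSGA-II sampler, and `study.best_trials` then holds the Pareto front. A minimal sketch, where `build_model`, `train_and_evaluate`, and `measure_latency` are hypothetical helpers:

```python
import optuna

def objective(trial):
    lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    num_layers = trial.suggest_int('num_layers', 2, 8)
    model = build_model(lr=lr, num_layers=num_layers)
    accuracy = train_and_evaluate(model)   # maximize
    latency_ms = measure_latency(model)    # minimize
    return accuracy, latency_ms

study = optuna.create_study(
    directions=['maximize', 'minimize'],
    sampler=optuna.samplers.NSGAIISampler(),
)
study.optimize(objective, n_trials=100)

# study.best_trials holds the Pareto front (non-dominated trials)
for t in study.best_trials:
    print(t.values, t.params)
```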
22. Code: Implementing a Simple Bayesian Optimizer
Let’s build a toy BO from scratch using scikit-learn.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from scipy.stats import norm
class SimpleBayesianOptimizer:
    def __init__(self, objective_func, bounds):
        self.objective = objective_func
        self.bounds = bounds
        self.X = []   # evaluated points
        self.y = []   # observed objective values
        self.gp = GaussianProcessRegressor(kernel=Matern(nu=2.5))

    def expected_improvement(self, X_candidates):
        mu, sigma = self.gp.predict(X_candidates, return_std=True)
        mu_sample_opt = np.max(self.y)
        with np.errstate(divide='ignore', invalid='ignore'):
            imp = mu - mu_sample_opt
            Z = imp / sigma
            ei = imp * norm.cdf(Z) + sigma * norm.pdf(Z)
            ei[sigma == 0.0] = 0.0
        return ei

    def optimize(self, n_iters=10):
        # Initial random samples
        for _ in range(2):
            x = np.random.uniform(self.bounds[0], self.bounds[1])
            self.X.append(x)
            self.y.append(float(self.objective(x)))

        for i in range(n_iters):
            # Fit GP to all observations so far
            self.gp.fit(np.array(self.X).reshape(-1, 1), np.array(self.y))
            # Find the point with max EI on a dense grid of candidates
            X_grid = np.linspace(self.bounds[0], self.bounds[1], 100).reshape(-1, 1)
            ei = self.expected_improvement(X_grid)
            next_x = float(X_grid[np.argmax(ei)][0])
            # Evaluate the true objective and record the observation
            self.X.append(next_x)
            self.y.append(float(self.objective(next_x)))
            print(f"Iter {i}: Best y = {np.max(self.y):.4f}")
# Usage
def objective(x): return -1 * (x - 2)**2 + 10 # Max at x=2
opt = SimpleBayesianOptimizer(objective, bounds=(-5, 5))
opt.optimize(n_iters=10)
23. Future Trends: AutoML-Zero
Goal: Evolve the algorithms themselves, not just parameters.
Method:
- Represent ML algorithms as a sequence of basic math operations (add, multiply, sin, cos).
- Use evolutionary algorithms to discover “Gradient Descent” or “Neural Networks” from scratch.
- Result: Rediscovered backpropagation and linear regression.
Implication: Future ML engineers might tune “Search Space Definitions” rather than models.
24. Summary
| Method | Trials Needed | Pros | Cons |
|---|---|---|---|
| ASHA | $O(20)$ | Parallel | Requires Ray |
25. Deep Dive: Hyperparameter Importance Analysis
After running 100 trials, you want to know: Which knob actually mattered?
Methods:
- fANOVA (Functional Analysis of Variance):
- Decomposes the variance of the objective function into additive components.
- “60% of variance comes from Learning Rate, 10% from Batch Size, 5% from interaction between LR and Batch Size.”
- Tool: `optuna.importance.get_param_importances(study)`.
- SHAP (SHapley Additive exPlanations):
- Treats hyperparameter values as “features” and the objective value as the “prediction”.
- Calculates the marginal contribution of each hyperparameter.
- Parallel Coordinate Plots:
- Visualizes the high-dimensional relationships.
- Useful for spotting “bad regions” (e.g., “High LR + Low Batch Size always crashes”).
Actionable Insight:
- If `num_layers` has 1% importance, stop tuning it! Fix it to a reasonable default and save compute.
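In Optuna this is a single call on a finished study (the default evaluator is fANOVA-based); a minimal sketch, assuming `study` has already been optimized:

```python
import optuna

# `study` is a completed study, e.g. after study.optimize(objective, n_trials=100)
importances = optuna.importance.get_param_importances(study)
for name, score in importances.items():
    print(f"{name}: {score:.2%}")
# If a parameter contributes ~1% of the variance, fix it and shrink the search space.
```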
26. Deep Dive: Handling Categorical & Conditional Hyperparameters
Real-world search spaces are messy.
Categorical:
- Example: `optimizer: ["Adam", "SGD", "RMSprop"]`
- Problem: GPs assume a continuous notion of distance, and Distance("Adam", "SGD") is undefined.
- Solution: One-hot encoding, or tree-based models (Random Forests, TPE) which handle splits naturally.
Conditional (Nested):
- IF `optimizer == "SGD"`, THEN tune `momentum`.
- IF `optimizer == "Adam"`, THEN tune `beta1`, `beta2`.
- Problem: `momentum` is irrelevant if `optimizer` is Adam.
- Solution:
- TPE: Handles this naturally by splitting the tree.
- ConfigSpace: A library specifically for defining DAG-structured search spaces.
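Because Optuna defines the search space by running code (define-by-run), conditional parameters fall out naturally. A sketch of the optimizer example above, with a hypothetical `train_and_evaluate` helper:

```python
import optuna

def objective(trial):
    params = {
        'optimizer': trial.suggest_categorical('optimizer', ['Adam', 'SGD', 'RMSprop']),
        'lr': trial.suggest_float('lr', 1e-5, 1e-1, log=True),
    }

    # Conditional branches: these parameters only exist when their parent value is chosen
    if params['optimizer'] == 'SGD':
        params['momentum'] = trial.suggest_float('momentum', 0.0, 0.99)
    elif params['optimizer'] == 'Adam':
        params['beta1'] = trial.suggest_float('beta1', 0.8, 0.999)
        params['beta2'] = trial.suggest_float('beta2', 0.9, 0.9999)

    return train_and_evaluate(params)  # hypothetical helper that builds and trains the model

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
```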
27. Deep Dive: Warm-Starting Optimization
Problem: Every time we tune a new model, we start from scratch (random sampling). Reality: We have tuned 50 similar models before.
Strategies:
- Initial Points:
- Instead of random initialization, seed the optimizer with the best configs from previous studies.
- Example: `study.enqueue_trial({'lr': 1e-3, 'batch_size': 32})` (full snippet at the end of this section).
- Transfer Learning for GPs:
- Use data from previous tasks to learn a “prior” for the GP mean function.
- Multi-Task Bayesian Optimization: Model the correlation between Task A and Task B. If they are correlated, observations in A reduce uncertainty in B.
- Meta-Learning (Auto-Sklearn):
- Compute meta-features of the dataset (num_rows, num_cols, class_balance).
- Find nearest neighbors in the “dataset space”.
- Reuse their best hyperparameters.
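A minimal warm-starting sketch with Optuna, reusing the `objective` function from the earlier Optuna example: enqueue the best configs from previous studies so they are evaluated before the sampler takes over.

```python
import optuna

study = optuna.create_study(direction='maximize')

# Seed the study with known-good configs from earlier, similar tasks.
# These trials are evaluated first; any unspecified parameters are sampled as usual.
study.enqueue_trial({'lr': 1e-3, 'batch_size': 32})
study.enqueue_trial({'lr': 3e-4, 'batch_size': 64})

study.optimize(objective, n_trials=50)
```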
28. Case Study: Tuning XGBoost vs Neural Networks
XGBoost / LightGBM:
- Key Params: `max_depth`, `learning_rate`, `subsample`, `colsample_bytree`, `min_child_weight`.
- Landscape: Rugged, but locally convex-ish.
- Strategy: Random Search is often “good enough”. TPE works very well.
- Cost: Fast to train (seconds/minutes). Can run 1000s of trials.
Neural Networks (ResNet/Transformer):
- Key Params: `lr`, `batch_size`, `optimizer`, `scheduler`.
- Landscape: Non-convex, saddle points, noisy.
- Strategy: Must use Learning Rate Schedules. Tuning the schedule is more important than tuning the fixed LR.
- Cost: Slow (hours/days). Must use Early Stopping (Hyperband).
29. Ethical Considerations: The Carbon Footprint of Tuning
The Cost:
- Training a Transformer with NAS can emit 600,000 lbs of CO2 (equivalent to 5 cars’ lifetime).
- “Red AI” (buying performance with massive compute) vs “Green AI” (efficiency).
Mitigation Strategies:
- Green NAS: Penalize energy consumption in the objective function. \(L = \text{Error} + \lambda \cdot \text{Energy}\)
- Proxy Tasks: Tune on a subset of data (10%), then transfer to full data.
- Share Configs: Publish the best hyperparameters so others don’t have to re-tune. (Hugging Face Model Cards).
30. Further Reading
- “Algorithms for Hyper-Parameter Optimization” (Bergstra et al., 2011): Introduced TPE.
- “Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization” (Li et al., 2018): The standard for resource allocation.
- “Google Vizier: A Service for Black-Box Optimization” (Golovin et al., 2017): How Google does it at scale.
- “Neural Architecture Search with Reinforcement Learning” (Zoph & Le, 2017): The paper that started the NAS craze.
- “mixup: Beyond Empirical Risk Minimization” (Zhang et al., 2018): Data augmentation as a hyperparameter.
31. Summary
| Method | Trials Needed | Pros | Cons |
|---|---|---|---|
| NAS | $O(1000)$ | Finds architecture | Very expensive |
32. Deep Dive: The Future of Tuning - LLMs as Optimizers
OptiMus (2023):
- Uses an LLM (GPT-4) to suggest hyperparameters.
- Prompt: “I am training a ResNet-50. The loss is oscillating. Current LR is 0.1. What should I try next?”
- Response: “Try reducing LR to 0.01 and adding a scheduler.”
- Why it works: LLMs have read millions of papers and GitHub issues. They have “common sense” about training dynamics that Bayesian Optimization lacks.
OMNI (OpenAI):
- Future systems will likely abstract tuning away entirely. You provide data + metric, the system handles the rest.
33. Deep Dive: Tuning for Robustness and Fairness
Robustness (Adversarial Training):
- Hyperparams: Epsilon (perturbation size), Alpha (step size).
- Trade-off: Increasing robustness often decreases clean accuracy.
- Tuning Goal: Find the Pareto frontier between Accuracy and Robustness.
Fairness:
- Hyperparams: Regularization strength for fairness constraints (e.g., Equalized Odds).
- Objective: Minimize Error + $\lambda \cdot \text{Disparity}$.
- Tuning: We need to find the $\lambda$ that satisfies legal/ethical requirements while maximizing utility.
34. Code: Grid Search from Scratch
To understand why Grid Search is bad, let’s implement it.
import itertools

def grid_search(objective, param_grid):
    keys = param_grid.keys()
    values = param_grid.values()
    combinations = list(itertools.product(*values))

    best_score = -float('inf')
    best_params = None
    print(f"Total combinations: {len(combinations)}")

    for combo in combinations:
        params = dict(zip(keys, combo))
        score = objective(params)
        if score > best_score:
            best_score = score
            best_params = params

    return best_params, best_score
# Usage
grid = {
'lr': [0.1, 0.01, 0.001],
'batch_size': [32, 64, 128],
'dropout': [0.1, 0.5]
}
# 3 * 3 * 2 = 18 trials.
# If we add one more parameter with 5 options -> 90 trials.
# Exponential explosion!
35. Production Checklist for Hyperparameter Tuning
Before you launch a tuning job:
- Define Metric: Is it Accuracy? F1? AUC? Latency?
- Define Budget: How many GPU hours can I afford?
- Choose Algorithm:
- < 10 params: Bayesian Optimization (Optuna).
- > 10 params: Random Search or Hyperband.
- Neural Net: Hyperband / ASHA.
- Set Search Space:
- Use Log Scale for LR and Regularization.
- Don’t tune things that don’t matter (e.g., random seed).
- Enable Early Stopping: Don’t waste compute.
- Log Everything: Use W&B / MLflow.
- Verify on Test Set: Evaluate the single best model on the held-out test set.
36. Deep Dive: Bayesian Optimization Hyperband (BOHB)
Problem:
- Bayesian Optimization is great at finding good configs but slow (doesn’t kill bad trials).
- Hyperband is fast (kills bad trials) but random (doesn’t learn from history).
Solution: BOHB (2018)
- Combines the best of both.
- Uses Hyperband to determine how many resources (epochs) to allocate.
- Uses Bayesian Optimization (TPE) to select the configurations to run at each step.
- Result: Converges faster than either method alone. SOTA for many problems.
37. Deep Dive: The “No Free Lunch” Theorem in Tuning
Theorem: Averaged over all possible problems, every optimization algorithm performs equally well (same as random search).
Implication:
- There is no “Best Optimizer” for every problem.
- TPE might be best for XGBoost.
- CMA-ES might be best for Reinforcement Learning.
- Adam might be best for CNNs.
- Lesson: Try multiple optimizers if you are stuck.
38. Deep Dive: Tuning Generative Models (GANs / Diffusion)
Tuning GANs is notoriously hard.
Challenges:
- Mode Collapse: Generator produces only one image.
- Non-Convergence: Discriminator becomes too strong too fast.
Key Hyperparameters:
- Learning Rate Ratio: Often we set TTUR (Two-Time-Scale Update Rule).
- $LR_{disc} = 4 \times LR_{gen}$.
- Beta1: Momentum. Often set to 0.0 or 0.5 (instead of default 0.9).
- Gradient Penalty: Weight $\lambda$ for WGAN-GP.
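In practice these GAN choices show up as two separately configured optimizers. A minimal PyTorch sketch of the TTUR + low-beta1 setup; the tiny linear layers are stand-ins for real generator/discriminator networks, and the 4x ratio is just the convention mentioned above:

```python
import torch
import torch.nn as nn

generator = nn.Linear(100, 784)       # stand-in for a real generator network
discriminator = nn.Linear(784, 1)     # stand-in for a real discriminator network

lr_gen = 1e-4
# TTUR: discriminator learns 4x faster; betas=(0.5, 0.999) sets beta1 = 0.5
opt_g = torch.optim.Adam(generator.parameters(), lr=lr_gen, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4 * lr_gen, betas=(0.5, 0.999))
```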
Diffusion Models:
- Noise Schedule: Linear vs Cosine.
- Timesteps: 1000? 4000?
- EMA Decay: Exponential Moving Average of weights (crucial for quality).
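The EMA update itself is one line; a minimal numpy sketch (the decay values here are typical choices, not universal constants):

```python
import numpy as np

def ema_update(ema_weights, new_weights, decay=0.999):
    # ema <- decay * ema + (1 - decay) * current weights
    return [decay * e + (1 - decay) * w for e, w in zip(ema_weights, new_weights)]

# Toy example: track an EMA copy of two "parameter tensors" across training steps
weights = [np.zeros(3), np.ones(3)]
ema = [w.copy() for w in weights]
for step in range(1000):
    # stand-in for a gradient update
    weights = [w + 0.01 * np.random.randn(*w.shape) for w in weights]
    ema = ema_update(ema, weights)
# At sampling/eval time, diffusion models use the EMA weights, not the raw weights.
print(ema[0])
```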
39. Case Study: Tuning Stable Diffusion
Goal: Fine-tune Stable Diffusion on a specific style (e.g., “Disney Style”).
Method: Dreambooth / LoRA.
Hyperparameters:
- Learning Rate: Extremely sensitive. $1e-6$ works, $1e-5$ destroys the model.
- Text Encoder Training: Train it or freeze it? (Training = better likeness, Freezing = better editing).
- Prior Preservation Loss: Weight of the class images (to prevent forgetting what a “dog” looks like).
40. The Psychology of Tuning
Why do humans struggle with tuning?
- Confirmation Bias: We try 3 things, one works, and we assume it’s the “Golden Config”. We stop searching.
- Sunk Cost Fallacy: “I spent 3 days tuning this ResNet. I can’t switch to EfficientNet now.”
- Dimensionality Curse: Humans can visualize 2D/3D. We cannot intuit 10D spaces. We miss interactions (e.g., “LR is only bad if Batch Size is small”).
Lesson: Trust the algorithm. Don’t “babysit” the tuner.
41. Checklist for Debugging Tuning Failures
If your tuner isn’t finding good results:
- Is the search space too big? Prune irrelevant parameters.
- Are the ranges correct? Is LR `[1e-5, 1e-1]` or `[1, 10]`? (Common bug.)
- Is the metric noisy? If running the same config twice gives $\pm 5\%$ accuracy, the tuner is confused. Fix the seed or average over runs.
- Is the budget too small? 10 trials is not enough for 10 parameters.
- Is the model broken? Does it train with default parameters? If not, fix the code first.
42. Summary
| Method | Trials Needed | Pros | Cons |
|---|---|---|---|
| Grid | $O(k^n)$ | Exhaustive | Exponential |
| Random | $O(100)$ | Simple | Inefficient |
| Bayesian | $O(50)$ | Sample-efficient | Complex |
| Hyperband | $O(20)$ | Very fast | Needs early stopping |
| ASHA | $O(20)$ | Parallel | Requires Ray |
| PBT | $O(20)$ | Dynamic schedules | Complex setup |
| NAS | $O(1000)$ | Finds architecture | Very expensive |
| BOHB | $O(30)$ | Best of both worlds | Complex |
| LLM | $O(10)$ | Uses “common sense” | New, experimental |
43. Further Reading
- “Algorithms for Hyper-Parameter Optimization” (Bergstra et al., 2011): The paper that introduced TPE.
- “Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization” (Li et al., 2018): The standard for resource allocation.
- “Google Vizier: A Service for Black-Box Optimization” (Golovin et al., 2017): How Google does it at scale.
- “Neural Architecture Search with Reinforcement Learning” (Zoph & Le, 2017): The paper that started the NAS craze.
- “Optuna: A Next-generation Hyperparameter Optimization Framework” (Akiba et al., 2019): The define-by-run philosophy.
44. Conclusion
Hyperparameter optimization is no longer a “nice to have”—it is a critical component of the modern ML stack. As models grow larger and compute becomes more expensive, the ability to efficiently navigate the search space becomes a competitive advantage. Whether you are using simple Random Search for a baseline or deploying massive Population-Based Training on a Kubernetes cluster, the principles remain the same: Explore the unknown, Exploit the promising, and Automate everything.
Originally published at: arunbaby.com/ml-system-design/0038-hyperparameter-optimization