
Design experiment management systems tailored for speech research—tracking audio data, models, metrics, and multi-dimensional experiments at scale.

Problem Statement

Design a Speech Experiment Management System that:

  1. Tracks speech-specific metadata: Audio datasets, speaker distributions, language configs, acoustic features
  2. Manages complex experiment spaces: Model architecture × training data × augmentation × decoding hyperparameters
  3. Enables systematic evaluation: WER/CER on multiple test sets, multi-lingual benchmarks, speaker-level analysis
  4. Supports reproducibility: Re-run experiments with exact data/model/environment state
  5. Integrates with speech toolkits: ESPnet, Kaldi, SpeechBrain, Fairseq
  6. Handles large-scale artifacts: Audio files, spectrograms, language models, acoustic models

Functional Requirements

  1. Experiment tracking:
    • Log hyperparameters (learning rate, batch size, model architecture)
    • Track data configs (train/val/test splits, languages, speaker sets)
    • Log metrics (WER, CER, latency, RTF, MOS for TTS)
    • Store artifacts (model checkpoints, attention plots, decoded outputs)
    • Track augmentation policies (SpecAugment, noise, speed perturbation)
  2. Multi-dimensional organization:
    • Organize by task (ASR, TTS, diarization, KWS)
    • Group by language (en, zh, es, multi-lingual)
    • Tag by domain (broadcast, conversational, read speech)
    • Link related experiments (ablations, ensembles, fine-tuning)
  3. Evaluation and comparison:
    • Compute WER/CER on multiple test sets
    • Speaker-level and utterance-level breakdowns
    • Compare across languages, domains, noise conditions
    • Visualize attention maps, spectrograms, learning curves
  4. Data versioning:
    • Track dataset versions (hashes, splits, preprocessing)
    • Track audio feature configs (sample rate, n_mels, hop_length)
    • Link experiments to specific data versions
  5. Model versioning:
    • Track model checkpoints (epoch, step, metric value)
    • Store encoder/decoder weights separately (for transfer learning)
    • Link to pre-trained models (HuggingFace, ESPnet zoo)
  6. Collaboration and sharing:
    • Share experiments with team
    • Export to papers (LaTeX tables, plots)
    • Integration with notebooks (Jupyter, Colab)

Non-Functional Requirements

  1. Scale: 1000s of experiments, 100K+ hours of audio, 10+ languages
  2. Performance: Fast queries (<1s), efficient artifact storage (deduplication)
  3. Reliability: No data loss, support for resuming failed experiments
  4. Integration: Minimal code changes to existing training scripts
  5. Cost efficiency: Optimize storage for large audio datasets and models

Understanding the Requirements

Why Speech Experiment Management Is Different

General ML experiment tracking (like MLflow) works, but speech has unique challenges:

  1. Audio data is large:
    • Raw audio: ~2 MB/min at 16 kHz (16-bit PCM), roughly half that as FLAC; see the quick calculation after this list
    • Spectrograms: even larger for high-resolution features
    • Solution: Track data by reference (paths/hashes), not by copying
  2. Multi-lingual and multi-domain:
    • Same architecture, different languages/domains
    • Need systematic organization and comparison across dimensions
  3. Complex evaluation:
    • WER/CER on multiple test sets (clean, noisy, far-field, accented)
    • Speaker-level and utterance-level analysis
    • Not just a single “accuracy” metric
  4. Decoding hyperparameters matter:
    • Beam width, language model weight, length penalty
    • Often swept post-training → need to track separately
  5. Long training times:
    • ASR models can train for days/weeks
    • Need robust checkpointing and resumability
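
To put "audio data is large" in concrete numbers, here is a quick back-of-the-envelope calculation for 16 kHz, 16-bit mono PCM (the 100K-hour figure mirrors the scale requirement above):

# Storage math for 16 kHz, 16-bit mono PCM; FLAC roughly halves these numbers
sample_rate = 16_000          # samples per second
bytes_per_sample = 2          # 16-bit PCM

bytes_per_minute = sample_rate * bytes_per_sample * 60
bytes_per_hour = bytes_per_minute * 60

print(f"{bytes_per_minute / 1e6:.1f} MB per minute")               # ~1.9 MB/min
print(f"{bytes_per_hour / 1e9:.2f} GB per hour")                    # ~0.12 GB/hour
print(f"{100_000 * bytes_per_hour / 1e12:.1f} TB for 100K hours")   # ~11.5 TB uncompressed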

The Systematic Iteration Connection

Just like Spiral Matrix systematically explores a 2D grid:

  • Speech experiment management explores multi-dimensional spaces:
    • Model (architecture, size) × Data (language, domain, augmentation) × Hyperparameters (LR, batch size) × Decoding (beam, LM weight)
  • Both require state tracking:
    • Spiral: track boundaries and current position
    • Experiments: track which configs have been tried, which are running, which failed
  • Both enable resumability:
    • Spiral: pause and resume traversal from any boundary state
    • Experiments: resume training from checkpoints, resume hyperparameter sweeps (a minimal sketch of this grid-with-state pattern follows)
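
To make the analogy concrete, here is a minimal, hypothetical sketch: enumerate a small experiment grid systematically and persist per-config status, so a sweep can resume after a crash exactly where it stopped. The grid dimensions, the run_experiment hook, and the state file are illustrative, not a fixed schema.

import itertools
import json

# Dimensions of a (small) experiment space
grid = {
    "model": ["conformer-s", "conformer-m"],
    "language": ["en", "es"],
    "spec_augment": [True, False],
    "lm_weight": [0.3, 0.5],
}

def enumerate_configs(grid):
    """Systematically enumerate every point in the grid (a structured traversal)."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# Persist status per config so the sweep is resumable after a crash
STATE_FILE = "sweep_state.json"

def load_state():
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def config_key(cfg):
    return json.dumps(cfg, sort_keys=True)

state = load_state()
for cfg in enumerate_configs(grid):
    key = config_key(cfg)
    if state.get(key) == "completed":
        continue  # already done; a resumed sweep skips it
    state[key] = "running"
    # run_experiment(cfg)  # launch training/evaluation here
    state[key] = "completed"
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)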

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│              Speech Experiment Management System                │
└─────────────────────────────────────────────────────────────────┘

                          Client Layer
        ┌────────────────────────────────────────────┐
        │  Python SDK  │  CLI  │  Web UI  │  API    │
        │  (ESPnet /   │       │          │         │
        │   SpeechBrain│       │          │         │
        │   integration)│      │          │         │
        └─────────────────────┬──────────────────────┘
                              │
                       API Gateway
                ┌──────────────┴──────────────┐
                │  - Auth & access control     │
                │  - Request routing           │
                │  - Metrics & logging         │
                └──────────────┬──────────────┘
                               │
              ┌────────────────┼────────────────┐
              │                │                │
      ┌───────▼────────┐ ┌────▼─────┐ ┌───────▼────────┐
      │  Metadata      │ │  Metrics │ │  Artifact      │
      │  Service       │ │  Service │ │  Service       │
      │                │ │          │ │                │
      │ - Experiments  │ │ - WER    │ │ - Models       │
      │ - Runs         │ │ - Loss   │ │ - Checkpoints  │
      │ - Data configs │ │ - Curves │ │ - Spectrograms │
      │ - Model configs│ │ - Tables │ │ - Decoded logs │
      │ - Speaker info │ │ - Speaker│ │ - Audio files  │
      │                │ │   metrics│ │   (references) │
      └───────┬────────┘ └────┬─────┘ └───────┬────────┘
              │               │                │
      ┌───────▼────────┐ ┌────▼─────┐ ┌───────▼────────┐
      │  SQL DB        │ │  Time-   │ │  Object Store │
      │  (Postgres)    │ │  Series  │ │  + Data Lake  │
      │                │ │  DB      │ │                │
      │ - Experiments  │ │ - Metrics│ │ - Models       │
      │ - Runs         │ │ - Fast   │ │ - Audio refs   │
      │ - Configs      │ │   queries│ │ - Spectrograms │
      └────────────────┘ └──────────┘ └────────────────┘

Key Components

  1. Metadata Service:
    • Stores experiment metadata (model, data, hyperparameters)
    • Speech-specific fields: language, domain, speaker count, sample rate
    • Relational DB for structured queries
  2. Metrics Service:
    • Stores training metrics (loss, learning rate schedule)
    • Stores evaluation metrics (WER, CER, per-test-set, per-speaker)
    • Time-series DB for efficient queries
  3. Artifact Service:
    • Stores models, checkpoints, attention plots, spectrograms
    • References to audio files (not copies—audio stays in data lake)
    • Deduplication for repeated artifacts (e.g., pre-trained encoders)
  4. Data Lake / Audio Storage:
    • Centralized storage for audio datasets
    • Organized by language, domain, speaker
    • Accessed via paths/URIs, not copied into experiment artifacts
  5. Web UI:
    • Dashboard for experiments
    • WER comparison tables across test sets
    • Attention plot visualization
    • Decoded output inspection

Component Deep-Dive

1. Speech-Specific Metadata Schema

Extend general experiment tracking schema with speech-specific fields:

CREATE TABLE speech_experiments (
    experiment_id UUID PRIMARY KEY,
    name VARCHAR(255),
    task VARCHAR(50),  -- 'asr', 'tts', 'diarization', 'kws'
    description TEXT,
    created_at TIMESTAMP,
    user_id VARCHAR(255)
);

CREATE TABLE speech_runs (
    run_id UUID PRIMARY KEY,
    experiment_id UUID REFERENCES speech_experiments(experiment_id),
    name VARCHAR(255),
    status VARCHAR(50),  -- 'running', 'completed', 'failed'
    start_time TIMESTAMP,
    end_time TIMESTAMP,
    user_id VARCHAR(255),
    
    -- Model config
    model_type VARCHAR(100),  -- 'conformer', 'transformer', 'rnn-t'
    num_params BIGINT,
    encoder_layers INT,
    decoder_layers INT,
    
    -- Data config
    train_dataset VARCHAR(255),
    train_hours FLOAT,
    languages TEXT[],  -- array of language codes
    domains TEXT[],    -- ['broadcast', 'conversational']
    sample_rate INT,   -- 16000, 8000, etc.
    
    -- Feature config
    feature_type VARCHAR(50),  -- 'log-mel', 'mfcc', 'fbank'
    n_mels INT,
    hop_length INT,
    win_length INT,
    
    -- Augmentation config
    augmentation_policy JSONB,  -- SpecAugment, noise, speed
    
    -- Training config
    optimizer VARCHAR(50),
    learning_rate FLOAT,
    batch_size INT,
    epochs INT,
    
    -- Environment
    git_commit VARCHAR(40),
    docker_image VARCHAR(255),
    num_gpus INT
);

CREATE TABLE speech_metrics (
    run_id UUID REFERENCES speech_runs(run_id),
    test_set VARCHAR(255),  -- 'librispeech-test-clean', 'common-voice-en'
    metric VARCHAR(50),     -- 'wer', 'cer', 'rtf', 'latency'
    value FLOAT,
    speaker_id VARCHAR(255) NOT NULL DEFAULT '',    -- '' for aggregate metrics, else per-speaker
    utterance_id VARCHAR(255) NOT NULL DEFAULT '',  -- '' for aggregate metrics, else per-utterance
    timestamp TIMESTAMP,
    -- NOTE: primary-key columns cannot be NULL, so '' marks rows without a speaker/utterance breakdown
    PRIMARY KEY (run_id, test_set, metric, speaker_id, utterance_id)
);

CREATE TABLE speech_artifacts (
    artifact_id UUID PRIMARY KEY,
    run_id UUID REFERENCES speech_runs(run_id),
    type VARCHAR(50),  -- 'model', 'checkpoint', 'attention_plot', 'decoded_text'
    path VARCHAR(1024),
    size_bytes BIGINT,
    content_hash VARCHAR(64),
    storage_uri TEXT,
    epoch INT,  -- optional, for checkpoints
    step INT,   -- optional, for checkpoints
    created_at TIMESTAMP
);
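
With this schema in place, cross-run comparison is a query. Below is a hedged sketch (using psycopg2; the connection string and test-set name are placeholders) that ranks completed English runs by aggregate WER on one test set, relying on the empty-string convention for aggregate rows noted above:

import psycopg2

# Assumes the Postgres schema above; connection parameters are placeholders.
conn = psycopg2.connect("dbname=speech_tracking user=tracker")

QUERY = """
SELECT r.run_id,
       r.name,
       r.model_type,
       m.value AS wer
FROM speech_runs r
JOIN speech_metrics m ON m.run_id = r.run_id
WHERE r.status = 'completed'
  AND 'en' = ANY(r.languages)
  AND m.test_set = %s
  AND m.metric = 'wer'
  AND m.speaker_id = ''        -- aggregate (non-per-speaker) rows
  AND m.utterance_id = ''
ORDER BY m.value ASC
LIMIT 10;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY, ("librispeech-test-clean",))
    for run_id, name, model_type, wer in cur.fetchall():
        print(f"{name} ({model_type}): WER={wer}")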

2. Python SDK Integration (ESPnet Example)

import speech_experiment_tracker as sxt  # avoid aliasing to `set`, which shadows the built-in

# Initialize client
client = sxt.Client(api_url="https://speech-tracking.example.com", api_key="...")

# Create experiment
experiment = client.create_experiment(
    name="Conformer ASR - LibriSpeech + Common Voice",
    task="asr",
    description="Multi-dataset training with SpecAugment"
)

# Start a run
run = experiment.start_run(
    name="conformer_12layers_specaug",
    tags={"language": "en", "domain": "read_speech"}
)

# Log model and data configs
run.log_config({
    "model_type": "conformer",
    "encoder_layers": 12,
    "decoder_layers": 6,
    "num_params": 120_000_000,
    "train_dataset": "librispeech-960h + common-voice-en",
    "train_hours": 1200,
    "languages": ["en"],
    "sample_rate": 16000,
    "feature_type": "log-mel",
    "n_mels": 80,
    "hop_length": 160,
    "augmentation_policy": {
        "spec_augment": True,
        "time_mask": 30,
        "freq_mask": 13,
        "speed_perturb": [0.9, 1.0, 1.1]
    },
    "optimizer": "adam",
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 100
})

# Training loop
for epoch in range(100):
    train_loss = train_one_epoch(model, train_loader)
    val_wer = evaluate(model, val_loader)
    
    # Log training metrics
    run.log_metrics({
        "train_loss": train_loss,
        "val_wer": val_wer
    }, step=epoch)
    
    # Save checkpoint
    if epoch % 10 == 0:
        checkpoint_path = f"checkpoints/epoch{epoch}.pt"
        save_checkpoint(model, optimizer, checkpoint_path)
        run.log_artifact(checkpoint_path, type="checkpoint", epoch=epoch)

# Final evaluation on multiple test sets
test_sets = [
    "librispeech-test-clean",
    "librispeech-test-other",
    "common-voice-en-test"
]

for test_set in test_sets:
    wer, cer = evaluate_on_test_set(model, test_set)
    run.log_metrics({
        f"{test_set}_wer": wer,
        f"{test_set}_cer": cer
    })

# Save final model
run.log_artifact("final_model.pt", type="model")

# Mark run as complete
run.finish()

3. Multi-Test-Set Evaluation Tracking

A key speech-specific need: evaluate on multiple test sets and track per-test-set metrics.

def evaluate_and_log(model, test_sets, run):
    """Evaluate model on multiple test sets and log detailed metrics."""
    results = {}
    
    for test_set in test_sets:
        loader = get_test_loader(test_set)
        total_words = 0
        total_errors = 0
        
        utterance_results = []
        
        for batch in loader:
            hyps = model.decode(batch['audio'])
            refs = batch['text']
            
            for hyp, ref, utt_id in zip(hyps, refs, batch['utterance_ids']):
                errors, words = compute_wer_details(hyp, ref)
                total_errors += errors
                total_words += words
                
                utterance_results.append({
                    "utterance_id": utt_id,
                    "wer": errors / max(1, words),
                    "hyp": hyp,
                    "ref": ref
                })
        
        wer = total_errors / max(1, total_words)
        results[test_set] = {
            "wer": wer,
            "utterances": utterance_results
        }
        
        # Log aggregate metric
        run.log_metric(f"{test_set}_wer", wer)
        
        # Log per-utterance results as artifact
        run.log_json(f"results/{test_set}_utterances.json", utterance_results)
    
    return results

4. Data Versioning for Speech

Track dataset versions by hashing audio file lists + preprocessing configs:

import glob
import hashlib
import json

def compute_dataset_hash(audio_file_list, preprocessing_config):
    """
    Compute a deterministic hash for a dataset.
    
    Args:
        audio_file_list: List of audio file paths
        preprocessing_config: Dict with sample_rate, n_mels, etc.
    """
    # Sort file list for determinism
    sorted_files = sorted(audio_file_list)
    
    # Combine file list + config
    content = {
        "files": sorted_files,
        "preprocessing": preprocessing_config
    }
    
    # Compute hash
    content_str = json.dumps(content, sort_keys=True)
    dataset_hash = hashlib.sha256(content_str.encode()).hexdigest()
    
    return dataset_hash

# Usage
train_files = glob.glob("/data/librispeech/train-960h/**/*.flac", recursive=True)
preprocessing_config = {
    "sample_rate": 16000,
    "n_mels": 80,
    "hop_length": 160,
    "win_length": 400
}

dataset_hash = compute_dataset_hash(train_files, preprocessing_config)

run.log_config({
    "train_dataset_hash": dataset_hash,
    "train_dataset_files": len(train_files),
    "preprocessing": preprocessing_config
})
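
Before training starts, the same helper can re-validate that the resolved file list still matches the logged dataset version. This is a minimal sketch (the expected hash and file count are hypothetical literals), and it doubles as a guard against the "wrong dataset" failure mode discussed later:

def validate_dataset(audio_file_list, preprocessing_config, expected_hash, expected_count):
    """Fail fast if the data or preprocessing no longer matches the logged dataset version."""
    if len(audio_file_list) != expected_count:
        raise ValueError(
            f"File count mismatch: expected {expected_count}, found {len(audio_file_list)}"
        )
    actual_hash = compute_dataset_hash(audio_file_list, preprocessing_config)
    if actual_hash != expected_hash:
        raise ValueError("Dataset hash mismatch: audio list or preprocessing config changed")

# Usage: expected values come from a previously logged run (shown here as hypothetical literals)
validate_dataset(
    train_files,
    preprocessing_config,
    expected_hash="3a7bd3e2360a3d...",  # hypothetical hash logged by an earlier run
    expected_count=281241,              # hypothetical file count logged by an earlier run
)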

5. Decoding Hyperparameter Sweeps

ASR decoding often involves sweeping beam width and language model weight:

def decode_sweep(model, test_set, run):
    """
    Sweep decoding hyperparameters and log results.
    """
    beam_widths = [1, 5, 10, 20]
    lm_weights = [0.0, 0.3, 0.5, 0.7, 1.0]
    
    results = []
    
    for beam in beam_widths:
        for lm_weight in lm_weights:
            wer = evaluate_with_decoding_params(
                model, test_set, beam_width=beam, lm_weight=lm_weight
            )
            
            results.append({
                "beam_width": beam,
                "lm_weight": lm_weight,
                "wer": wer
            })
            
            # Log each config
            run.log_metrics({
                f"wer_beam{beam}_lm{lm_weight}": wer
            })
    
    # Find best config
    best = min(results, key=lambda x: x['wer'])
    run.log_config({"best_decoding_config": best})
    
    # Log full sweep results as artifact
    run.log_json("decoding_sweep.json", results)

Scaling Strategies

1. Efficient Audio Data Handling

Challenge: Copying audio files into each experiment is expensive and redundant.

Solution:

  • Store audio in a centralized data lake (e.g., S3, HDFS).
  • Track by reference: Experiments store paths/URIs, not copies.
  • Use hashing for deduplication: If multiple experiments use the same dataset, hash it once.

# Don't do this:
run.log_artifact("train_audio.tar.gz")  # 100 GB upload per experiment!

# Do this:
run.log_config({
    "train_audio_path": "s3://speech-data/librispeech/train-960h/",
    "train_audio_hash": "sha256:abc123..."
})

2. Checkpoint Deduplication

Challenge: Models can be GBs. Saving every checkpoint is expensive.

Solution:

  • Content-based deduplication: Hash checkpoint files (see the sketch after this list).
  • Incremental checkpoints: Store only parameter diffs if possible.
  • Tiered storage: Recent checkpoints on fast storage, old checkpoints on Glacier.
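
One way to implement content-based deduplication is to hash each checkpoint and upload only unseen content. In this sketch, `artifact_exists` and `link_artifact` are hypothetical client methods, assuming the artifact store is indexed by content_hash:

import hashlib
import os

def sha256_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so multi-GB checkpoints never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def log_checkpoint_deduplicated(client, run, path, epoch):
    content_hash = sha256_file(path)
    if client.artifact_exists(content_hash):  # hypothetical lookup by content hash
        # Same bytes already in the artifact store (e.g., a frozen pre-trained encoder):
        # record a reference instead of re-uploading gigabytes.
        run.link_artifact(content_hash, type="checkpoint", epoch=epoch)  # hypothetical
    else:
        run.log_artifact(
            path, type="checkpoint", epoch=epoch,
            content_hash=content_hash, size_bytes=os.path.getsize(path),
        )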

3. Distributed Evaluation

For large-scale evaluation (100+ test sets, 10+ languages):

  • Use a distributed evaluation service (Ray, Spark).
  • Parallelize across test sets and languages.
  • Aggregate results and log back to experiment tracker.

import ray

ray.init()  # connect to an existing Ray cluster, or start a local one

@ray.remote
def evaluate_test_set(model_path, test_set):
    model = load_model(model_path)
    wer = evaluate(model, test_set)
    return test_set, wer

# Distribute evaluation
test_sets = ["test_clean", "test_other", "cv_en", "cv_es", ...]
futures = [evaluate_test_set.remote(model_path, ts) for ts in test_sets]
results = ray.get(futures)

# Log all results
for test_set, wer in results:
    run.log_metric(f"{test_set}_wer", wer)

Monitoring & Observability

Key Metrics

System metrics:

  • Request latency (API, artifact upload/download)
  • Storage usage (models, audio references, metadata)
  • Error rates (failed experiments, upload failures)

User metrics:

  • Active experiments and runs
  • Average experiment duration
  • Most common model architectures, datasets, languages
  • WER distribution across runs

Dashboards:

  • Experiment dashboard (running/completed/failed, recent results)
  • System health (latency, storage, errors)
  • Cost dashboard (storage, compute, data transfer)

Alerts

  • Experiment failed after >12 hours of training
  • Storage usage >90% capacity
  • API error rate >1%
  • WER degradation on key test sets (a minimal check is sketched below)
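
As an example of the last alert, a minimal WER-regression check against per-test-set baselines might look like this (baseline values, the threshold, and the `send_alert` hook are placeholders):

# Baseline WERs for key test sets and an allowed relative regression (illustrative values)
WER_BASELINES = {"librispeech-test-clean": 0.025, "librispeech-test-other": 0.058}
MAX_RELATIVE_REGRESSION = 0.05   # alert if WER is >5% (relative) worse than baseline

def check_wer_regressions(run_metrics, send_alert):
    """run_metrics: dict like {'librispeech-test-clean_wer': 0.027, ...}."""
    for test_set, baseline in WER_BASELINES.items():
        wer = run_metrics.get(f"{test_set}_wer")
        if wer is None:
            send_alert(f"Missing WER for key test set: {test_set}")
        elif wer > baseline * (1 + MAX_RELATIVE_REGRESSION):
            send_alert(f"WER regression on {test_set}: {wer:.3f} vs baseline {baseline:.3f}")

# Usage with a trivial alert hook
check_wer_regressions({"librispeech-test-clean_wer": 0.031}, send_alert=print)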

Failure Modes & Mitigations

| Failure Mode | Impact | Mitigation |
| --- | --- | --- |
| Training crash mid-run | Lost progress, wasted compute | Robust checkpointing, auto-resume |
| Artifact upload failure | Model/checkpoint not saved | Retry with exponential backoff, local backup |
| Dataset hash collision | Wrong dataset used | Use strong hash (SHA-256), validate file count |
| Metric not logged | Incomplete evaluation | Client-side buffering, fail-safe logging |
| Evaluation script bug | Wrong WER reported | Unit tests for evaluation, log decoded outputs |
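
For the artifact-upload failure mode, the retry mitigation can be as simple as a wrapper with exponential backoff and jitter. A sketch, with `upload_fn` and the retry budget as assumptions:

import random
import time

def upload_with_retry(upload_fn, path, max_attempts=5, base_delay=1.0):
    """Retry a flaky upload with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return upload_fn(path)
        except Exception as exc:
            if attempt == max_attempts:
                # Out of retries: surface the error so the caller can fall back to a local copy.
                raise RuntimeError(f"Upload failed after {max_attempts} attempts: {exc}")
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            time.sleep(delay)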

Real-World Case Study: Multi-Lingual ASR Team

Scenario:

  • Team of 20 researchers
  • 10+ languages (en, zh, es, fr, de, ja, ko, ar, hi, ru)
  • 100K+ hours of audio data
  • 1000+ experiments over 2 years

Architecture:

  • Metadata: PostgreSQL with indexes on language, domain, test_set
  • Metrics: InfluxDB for fast time-series queries
  • Artifacts: S3 with lifecycle policies (checkpoints >6 months → Glacier)
  • Audio data: Centralized S3 bucket, organized by language/domain
  • API: Kubernetes cluster with auto-scaling

Key optimizations:

  • Audio stored by reference (saved ~10 TB of redundant uploads)
  • Checkpoint deduplication (saved ~30% storage)
  • Distributed evaluation (Ray cluster, 10x speedup on multi-test-set evaluation)

Outcomes:

  • 99.9% uptime
  • Median query latency: 150ms
  • Complete audit trail for model releases
  • Reproducibility: 95% of experiments re-runnable from metadata

Cost Analysis

Example: Medium-Sized Speech Team

Assumptions:

  • 10 researchers
  • 50 experiments/month, 500 runs/month
  • Average run: 5 GB model, 10K metrics, 10 test sets
  • Audio data: 50K hours, stored centrally (not per-experiment)
  • Retention: 2 years

| Component | Cost/Month |
| --- | --- |
| Metadata DB (PostgreSQL RDS) | $300 |
| Metrics DB (InfluxDB) | $400 |
| Model storage (S3, 5 TB) | $115 |
| Audio data storage (S3, 50 TB, reference-only) | $1,150 |
| API compute (Kubernetes) | $600 |
| Data transfer | $150 |
| Total | $2,715 |
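
The two S3 storage lines follow directly from standard object-storage pricing, assuming roughly $0.023 per GB-month for S3 Standard (actual prices vary by region and storage class):

S3_STANDARD_PER_GB_MONTH = 0.023   # approximate us-east-1 price; varies by region/tier

model_storage_tb = 5
audio_storage_tb = 50

model_cost = model_storage_tb * 1000 * S3_STANDARD_PER_GB_MONTH    # ≈ $115/month
audio_cost = audio_storage_tb * 1000 * S3_STANDARD_PER_GB_MONTH    # ≈ $1,150/month

print(f"Model storage: ${model_cost:.0f}/month, audio storage: ${audio_cost:.0f}/month")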

Cost savings from best practices:

  • Audio by reference (vs copying per-experiment): -$50K/year
  • Checkpoint deduplication: -$500/month
  • Tiered storage (Glacier for old artifacts): -$200/month

Advanced Topics

1. Integration with ESPnet

ESPnet is a popular speech toolkit. Integrate experiment tracking:

# In ESPnet training script
import speech_experiment_tracker as sxt  # same alias as before; avoid shadowing the built-in `set`

# Initialize tracker
client = sxt.Client(...)
run = client.start_run(...)

# Log ESPnet config
run.log_config(vars(args))  # args from argparse

# Hook into ESPnet trainer
class TrackerCallback:
    def on_epoch_end(self, epoch, metrics):
        run.log_metrics(metrics, step=epoch)
    
    def on_checkpoint_save(self, epoch, checkpoint_path):
        run.log_artifact(checkpoint_path, type="checkpoint", epoch=epoch)

trainer.add_callback(TrackerCallback())

2. Attention Map Visualization

For Transformer-based ASR, log and visualize attention maps:

import matplotlib.pyplot as plt

def plot_attention(attention_weights, hyp_tokens, ref_tokens):
    """Plot attention matrix."""
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(attention_weights, cmap='Blues')
    ax.set_xticks(range(len(ref_tokens)))
    ax.set_xticklabels(ref_tokens, rotation=90)
    ax.set_yticks(range(len(hyp_tokens)))
    ax.set_yticklabels(hyp_tokens)
    ax.set_xlabel("Reference Tokens")
    ax.set_ylabel("Hypothesis Tokens")
    return fig

# During evaluation
attention_weights = model.get_attention_weights(audio)
fig = plot_attention(attention_weights, hyp_tokens, ref_tokens)
run.log_figure("attention_plot.png", fig)

3. Speaker-Level Analysis

Track per-speaker WER to identify performance gaps:

import numpy as np

def speaker_level_analysis(model, test_set, run):
    """Compute and log per-speaker WER."""
    speaker_stats = {}
    test_loader = get_test_loader(test_set)

    for batch in test_loader:
        hyps = model.decode(batch['audio'])
        refs = batch['text']
        speakers = batch['speaker_ids']
        
        for hyp, ref, speaker in zip(hyps, refs, speakers):
            errors, words = compute_wer_details(hyp, ref)
            
            if speaker not in speaker_stats:
                speaker_stats[speaker] = {"errors": 0, "words": 0}
            
            speaker_stats[speaker]["errors"] += errors
            speaker_stats[speaker]["words"] += words
    
    # Compute per-speaker WER
    for speaker, stats in speaker_stats.items():
        wer = stats["errors"] / max(1, stats["words"])
        run.log_metric(f"speaker_{speaker}_wer", wer)
    
    # Log summary statistics
    wers = [stats["errors"] / max(1, stats["words"]) for stats in speaker_stats.values()]
    run.log_metrics({
        "avg_speaker_wer": np.mean(wers),
        "median_speaker_wer": np.median(wers),
        "worst_speaker_wer": np.max(wers),
        "best_speaker_wer": np.min(wers)
    })

Practical Operations Checklist

For Speech Researchers

  • Always log data config: Dataset, hours, languages, speaker count.
  • Track augmentation policies: SpecAugment, noise, speed perturbation.
  • Evaluate on multiple test sets: Clean, noisy, accented, domain-specific.
  • Log decoded outputs: For error analysis and debugging.
  • Track decoding hyperparameters: Beam width, LM weight.
  • Use descriptive run names: conformer_12l_960h_specaug > run_42.

For Platform Engineers

  • Monitor storage growth: Audio and models can grow quickly.
  • Set up tiered storage: Move old checkpoints to Glacier (a lifecycle-rule sketch follows this list).
  • Implement checkpoint cleanup: Delete intermediate checkpoints after final model is saved.
  • Monitor evaluation queue: Distributed eval should not be a bottleneck.
  • Test disaster recovery: Can you restore experiments from backups?
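
For the tiered-storage item, a single lifecycle rule on the artifact bucket does most of the work. A hedged boto3 sketch, with the bucket name, prefix, and 180-day threshold as placeholders matching the case study's ">6 months → Glacier" policy:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="speech-experiment-artifacts",   # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-checkpoints",
                "Filter": {"Prefix": "checkpoints/"},   # placeholder prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 180, "StorageClass": "GLACIER"}   # ~6 months
                ],
            }
        ]
    },
)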

Key Takeaways

Speech experiment management requires domain-specific extensions (audio data, WER/CER, multi-test-set evaluation).

Store audio by reference, not by copying—saves massive storage costs.

Track data versions (hashes, preprocessing configs) for reproducibility.

Multi-dimensional evaluation (language, domain, noise condition) is critical for speech.

Checkpoint deduplication and tiered storage reduce costs significantly.

Systematic iteration through experiment spaces (model × data × hyperparameters) mirrors structured traversal patterns like Spiral Matrix.

Integration with speech toolkits (ESPnet, Kaldi, SpeechBrain) is key for adoption.

All three Day 19 topics converge on systematic, stateful exploration:

DSA (Spiral Matrix):

  • Layer-by-layer traversal with boundary tracking
  • Explicit state management (top, bottom, left, right)
  • Resume/pause friendly

ML System Design (Experiment Tracking Systems):

  • Systematic exploration of hyperparameter/architecture spaces
  • Track state of experiments (running, completed, failed)
  • Resume from checkpoints, recover from failures

Speech Tech (Speech Experiment Management):

  • Organize speech experiments across model × data × hyperparameters × decoding dimensions
  • Track training state (checkpoints, metrics, evaluation results)
  • Enable reproducibility and multi-test-set comparison

The unifying pattern: structured, stateful iteration through complex, multi-dimensional spaces with clear persistence and recoverability.


Originally published at: arunbaby.com/speech-tech/0019-speech-experiment-management

If you found this helpful, consider sharing it with others who might benefit.