
Design experiment management systems tailored for speech research—tracking audio data, models, metrics, and multi-dimensional experiments at scale.

Problem Statement

Design a Speech Experiment Management System that:

  1. Tracks speech-specific metadata: Audio datasets, speaker distributions, language configs, acoustic features
  2. Manages complex experiment spaces: Model architecture × training data × augmentation × decoding hyperparameters
  3. Enables systematic evaluation: WER/CER on multiple test sets, multi-lingual benchmarks, speaker-level analysis
  4. Supports reproducibility: Re-run experiments with exact data/model/environment state
  5. Integrates with speech toolkits: ESPnet, Kaldi, SpeechBrain, Fairseq
  6. Handles large-scale artifacts: Audio files, spectrograms, language models, acoustic models

Functional Requirements

  1. Experiment tracking:
    • Log hyperparameters (learning rate, batch size, model architecture)
    • Track data configs (train/val/test splits, languages, speaker sets)
    • Log metrics (WER, CER, latency, RTF, MOS for TTS)
    • Store artifacts (model checkpoints, attention plots, decoded outputs)
    • Track augmentation policies (SpecAugment, noise, speed perturbation)
  2. Multi-dimensional organization:
    • Organize by task (ASR, TTS, diarization, KWS)
    • Group by language (en, zh, es, multi-lingual)
    • Tag by domain (broadcast, conversational, read speech)
    • Link related experiments (ablations, ensembles, fine-tuning)
  3. Evaluation and comparison:
    • Compute WER/CER on multiple test sets
    • Speaker-level and utterance-level breakdowns
    • Compare across languages, domains, noise conditions
    • Visualize attention maps, spectrograms, learning curves
  4. Data versioning:
    • Track dataset versions (hashes, splits, preprocessing)
    • Track audio feature configs (sample rate, n_mels, hop_length)
    • Link experiments to specific data versions
  5. Model versioning:
    • Track model checkpoints (epoch, step, metric value)
    • Store encoder/decoder weights separately (for transfer learning)
    • Link to pre-trained models (HuggingFace, ESPnet zoo)
  6. Collaboration and sharing:
    • Share experiments with team
    • Export to papers (LaTeX tables, plots)
    • Integration with notebooks (Jupyter, Colab)

Non-Functional Requirements

  1. Scale: 1000s of experiments, 100K+ hours of audio, 10+ languages
  2. Performance: Fast queries (<1s), efficient artifact storage (deduplication)
  3. Reliability: No data loss, support for resuming failed experiments
  4. Integration: Minimal code changes to existing training scripts
  5. Cost efficiency: Optimize storage for large audio datasets and models

Understanding the Requirements

Why Speech Experiment Management Is Different

General ML experiment tracking (like MLflow) works, but speech has unique challenges:

  1. Audio data is large:
    • Raw audio: ~2 MB/min at 16 kHz (16-bit PCM), roughly half that as FLAC; see the quick calculation after this list
    • Spectrograms: even larger for high-resolution features
    • Solution: Track data by reference (paths/hashes), not by copying
  2. Multi-lingual and multi-domain:
    • Same architecture, different languages/domains
    • Need systematic organization and comparison across dimensions
  3. Complex evaluation:
    • WER/CER on multiple test sets (clean, noisy, far-field, accented)
    • Speaker-level and utterance-level analysis
    • Not just a single “accuracy” metric
  4. Decoding hyperparameters matter:
    • Beam width, language model weight, length penalty
    • Often swept post-training → need to track separately
  5. Long training times:
    • ASR models can train for days/weeks
    • Need robust checkpointing and resumability
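
To put "audio data is large" in concrete numbers, here is a quick back-of-the-envelope calculation for 16 kHz, 16-bit mono PCM (the 100K-hour figure mirrors the scale requirement above):

# Storage math for 16 kHz, 16-bit mono PCM; FLAC roughly halves these numbers
sample_rate = 16_000          # samples per second
bytes_per_sample = 2          # 16-bit PCM

bytes_per_minute = sample_rate * bytes_per_sample * 60
bytes_per_hour = bytes_per_minute * 60

print(f"{bytes_per_minute / 1e6:.1f} MB per minute")               # ~1.9 MB/min
print(f"{bytes_per_hour / 1e9:.2f} GB per hour")                    # ~0.12 GB/hour
print(f"{100_000 * bytes_per_hour / 1e12:.1f} TB for 100K hours")   # ~11.5 TB uncompressed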

The Systematic Iteration Connection

Just like Spiral Matrix systematically explores a 2D grid:

  • Speech experiment management explores multi-dimensional spaces:
    • Model (architecture, size) × Data (language, domain, augmentation) × Hyperparameters (LR, batch size) × Decoding (beam, LM weight)
  • Both require state tracking:
    • Spiral: track boundaries and current position
    • Experiments: track which configs have been tried, which are running, which failed
  • Both enable resumability:
    • Spiral: pause and resume traversal from any boundary state
    • Experiments: resume training from checkpoints, resume hyperparameter sweeps (a minimal sketch of this grid-with-state pattern follows)
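
To make the analogy concrete, here is a minimal, hypothetical sketch: enumerate a small experiment grid systematically and persist per-config status, so a sweep can resume after a crash exactly where it stopped. The grid dimensions, the run_experiment hook, and the state file are illustrative, not a fixed schema.

import itertools
import json

# Dimensions of a (small) experiment space
grid = {
    "model": ["conformer-s", "conformer-m"],
    "language": ["en", "es"],
    "spec_augment": [True, False],
    "lm_weight": [0.3, 0.5],
}

def enumerate_configs(grid):
    """Systematically enumerate every point in the grid (a structured traversal)."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# Persist status per config so the sweep is resumable after a crash
STATE_FILE = "sweep_state.json"

def load_state():
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def config_key(cfg):
    return json.dumps(cfg, sort_keys=True)

state = load_state()
for cfg in enumerate_configs(grid):
    key = config_key(cfg)
    if state.get(key) == "completed":
        continue  # already done; a resumed sweep skips it
    state[key] = "running"
    # run_experiment(cfg)  # launch training/evaluation here
    state[key] = "completed"
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)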

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│              Speech Experiment Management System                │
└─────────────────────────────────────────────────────────────────┘

                          Client Layer
        ┌────────────────────────────────────────────┐
        │  Python SDK  │  CLI  │  Web UI  │  API    │
        │  (ESPnet /   │       │          │         │
        │   SpeechBrain│       │          │         │
        │   integration)│      │          │         │
        └─────────────────────┬──────────────────────┘
                              │
                       API Gateway
                ┌──────────────┴──────────────┐
                │  - Auth & access control     │
                │  - Request routing           │
                │  - Metrics & logging         │
                └──────────────┬──────────────┘
                               │
              ┌────────────────┼────────────────┐
              │                │                │
      ┌───────▼────────┐ ┌────▼─────┐ ┌───────▼────────┐
      │  Metadata      │ │  Metrics │ │  Artifact      │
      │  Service       │ │  Service │ │  Service       │
      │                │ │          │ │                │
      │ - Experiments  │ │ - WER    │ │ - Models       │
      │ - Runs         │ │ - Loss   │ │ - Checkpoints  │
      │ - Data configs │ │ - Curves │ │ - Spectrograms │
      │ - Model configs│ │ - Tables │ │ - Decoded logs │
      │ - Speaker info │ │ - Speaker│ │ - Audio files  │
      │                │ │   metrics│ │   (references) │
      └───────┬────────┘ └────┬─────┘ └───────┬────────┘
              │               │                │
      ┌───────▼────────┐ ┌────▼─────┐ ┌───────▼────────┐
      │  SQL DB        │ │  Time-   │ │  Object Store │
      │  (Postgres)    │ │  Series  │ │  + Data Lake  │
      │                │ │  DB      │ │                │
      │ - Experiments  │ │ - Metrics│ │ - Models       │
      │ - Runs         │ │ - Fast   │ │ - Audio refs   │
      │ - Configs      │ │   queries│ │ - Spectrograms │
      └────────────────┘ └──────────┘ └────────────────┘

Key Components

  1. Metadata Service:
    • Stores experiment metadata (model, data, hyperparameters)
    • Speech-specific fields: language, domain, speaker count, sample rate
    • Relational DB for structured queries
  2. Metrics Service:
    • Stores training metrics (loss, learning rate schedule)
    • Stores evaluation metrics (WER, CER, per-test-set, per-speaker)
    • Time-series DB for efficient queries
  3. Artifact Service:
    • Stores models, checkpoints, attention plots, spectrograms
    • References to audio files (not copies—audio stays in data lake)
    • Deduplication for repeated artifacts (e.g., pre-trained encoders)
  4. Data Lake / Audio Storage:
    • Centralized storage for audio datasets
    • Organized by language, domain, speaker
    • Accessed via paths/URIs, not copied into experiment artifacts
  5. Web UI:
    • Dashboard for experiments
    • WER comparison tables across test sets
    • Attention plot visualization
    • Decoded output inspection

Component Deep-Dive

1. Speech-Specific Metadata Schema

Extend general experiment tracking schema with speech-specific fields:

CREATE TABLE speech_experiments (
    experiment_id UUID PRIMARY KEY,
    name VARCHAR(255),
    task VARCHAR(50),  -- 'asr', 'tts', 'diarization', 'kws'
    description TEXT,
    created_at TIMESTAMP,
    user_id VARCHAR(255)
);

CREATE TABLE speech_runs (
    run_id UUID PRIMARY KEY,
    experiment_id UUID REFERENCES speech_experiments(experiment_id),
    name VARCHAR(255),
    status VARCHAR(50),  -- 'running', 'completed', 'failed'
    start_time TIMESTAMP,
    end_time TIMESTAMP,
    user_id VARCHAR(255),
    
    -- Model config
    model_type VARCHAR(100),  -- 'conformer', 'transformer', 'rnn-t'
    num_params BIGINT,
    encoder_layers INT,
    decoder_layers INT,
    
    -- Data config
    train_dataset VARCHAR(255),
    train_hours FLOAT,
    languages TEXT[],  -- array of language codes
    domains TEXT[],    -- ['broadcast', 'conversational']
    sample_rate INT,   -- 16000, 8000, etc.
    
    -- Feature config
    feature_type VARCHAR(50),  -- 'log-mel', 'mfcc', 'fbank'
    n_mels INT,
    hop_length INT,
    win_length INT,
    
    -- Augmentation config
    augmentation_policy JSONB,  -- SpecAugment, noise, speed
    
    -- Training config
    optimizer VARCHAR(50),
    learning_rate FLOAT,
    batch_size INT,
    epochs INT,
    
    -- Environment
    git_commit VARCHAR(40),
    docker_image VARCHAR(255),
    num_gpus INT
);

CREATE TABLE speech_metrics (
    run_id UUID REFERENCES speech_runs(run_id),
    test_set VARCHAR(255),  -- 'librispeech-test-clean', 'common-voice-en'
    metric VARCHAR(50),     -- 'wer', 'cer', 'rtf', 'latency'
    value FLOAT,
    speaker_id VARCHAR(255) NOT NULL DEFAULT '',    -- '' for aggregate metrics, else per-speaker
    utterance_id VARCHAR(255) NOT NULL DEFAULT '',  -- '' for aggregate metrics, else per-utterance
    timestamp TIMESTAMP,
    -- NOTE: primary-key columns cannot be NULL, so '' marks rows without a speaker/utterance breakdown
    PRIMARY KEY (run_id, test_set, metric, speaker_id, utterance_id)
);

CREATE TABLE speech_artifacts (
    artifact_id UUID PRIMARY KEY,
    run_id UUID REFERENCES speech_runs(run_id),
    type VARCHAR(50),  -- 'model', 'checkpoint', 'attention_plot', 'decoded_text'
    path VARCHAR(1024),
    size_bytes BIGINT,
    content_hash VARCHAR(64),
    storage_uri TEXT,
    epoch INT,  -- optional, for checkpoints
    step INT,   -- optional, for checkpoints
    created_at TIMESTAMP
);
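
With this schema in place, cross-run comparison is a query. Below is a hedged sketch (using psycopg2; the connection string and test-set name are placeholders) that ranks completed English runs by aggregate WER on one test set, relying on the empty-string convention for aggregate rows noted above:

import psycopg2

# Assumes the Postgres schema above; connection parameters are placeholders.
conn = psycopg2.connect("dbname=speech_tracking user=tracker")

QUERY = """
SELECT r.run_id,
       r.name,
       r.model_type,
       m.value AS wer
FROM speech_runs r
JOIN speech_metrics m ON m.run_id = r.run_id
WHERE r.status = 'completed'
  AND 'en' = ANY(r.languages)
  AND m.test_set = %s
  AND m.metric = 'wer'
  AND m.speaker_id = ''        -- aggregate (non-per-speaker) rows
  AND m.utterance_id = ''
ORDER BY m.value ASC
LIMIT 10;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY, ("librispeech-test-clean",))
    for run_id, name, model_type, wer in cur.fetchall():
        print(f"{name} ({model_type}): WER={wer}")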

2. Python SDK Integration (ESPnet Example)

import speech_experiment_tracker as sxt  # avoid aliasing to `set`, which shadows the built-in

# Initialize client
client = sxt.Client(api_url="https://speech-tracking.example.com", api_key="...")

# Create experiment
experiment = client.create_experiment(
    name="Conformer ASR - LibriSpeech + Common Voice",
    task="asr",
    description="Multi-dataset training with SpecAugment"
)

# Start a run
run = experiment.start_run(
    name="conformer_12layers_specaug",
    tags={"language": "en", "domain": "read_speech"}
)

# Log model and data configs
run.log_config({
    "model_type": "conformer",
    "encoder_layers": 12,
    "decoder_layers": 6,
    "num_params": 120_000_000,
    "train_dataset": "librispeech-960h + common-voice-en",
    "train_hours": 1200,
    "languages": ["en"],
    "sample_rate": 16000,
    "feature_type": "log-mel",
    "n_mels": 80,
    "hop_length": 160,
    "augmentation_policy": {
        "spec_augment": True,
        "time_mask": 30,
        "freq_mask": 13,
        "speed_perturb": [0.9, 1.0, 1.1]
    },
    "optimizer": "adam",
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 100
})

# Training loop
for epoch in range(100):
    train_loss = train_one_epoch(model, train_loader)
    val_wer = evaluate(model, val_loader)
    
    # Log training metrics
    run.log_metrics({
        "train_loss": train_loss,
        "val_wer": val_wer
    }, step=epoch)
    
    # Save checkpoint
    if epoch % 10 == 0:
        checkpoint_path = f"checkpoints/epoch{epoch}.pt"
        save_checkpoint(model, optimizer, checkpoint_path)
        run.log_artifact(checkpoint_path, type="checkpoint", epoch=epoch)

# Final evaluation on multiple test sets
test_sets = [
    "librispeech-test-clean",
    "librispeech-test-other",
    "common-voice-en-test"
]

for test_set in test_sets:
    wer, cer = evaluate_on_test_set(model, test_set)
    run.log_metrics({
        f"{test_set}_wer": wer,
        f"{test_set}_cer": cer
    })

# Save final model
run.log_artifact("final_model.pt", type="model")

# Mark run as complete
run.finish()

3. Multi-Test-Set Evaluation Tracking

A key speech-specific need: evaluate on multiple test sets and track per-test-set metrics.

def evaluate_and_log(model, test_sets, run):
    """Evaluate model on multiple test sets and log detailed metrics."""
    results = {}
    
    for test_set in test_sets:
        loader = get_test_loader(test_set)
        total_words = 0
        total_errors = 0
        
        utterance_results = []
        
        for batch in loader:
            hyps = model.decode(batch['audio'])
            refs = batch['text']
            
            for hyp, ref, utt_id in zip(hyps, refs, batch['utterance_ids']):
                errors, words = compute_wer_details(hyp, ref)
                total_errors += errors
                total_words += words
                
                utterance_results.append({
                    "utterance_id": utt_id,
                    "wer": errors / max(1, words),
                    "hyp": hyp,
                    "ref": ref
                })
        
        wer = total_errors / max(1, total_words)
        results[test_set] = {
            "wer": wer,
            "utterances": utterance_results
        }
        
        # Log aggregate metric
        run.log_metric(f"{test_set}_wer", wer)
        
        # Log per-utterance results as artifact
        run.log_json(f"results/{test_set}_utterances.json", utterance_results)
    
    return results

4. Data Versioning for Speech

Track dataset versions by hashing audio file lists + preprocessing configs:

import glob
import hashlib
import json

def compute_dataset_hash(audio_file_list, preprocessing_config):
    """
    Compute a deterministic hash for a dataset.
    
    Args:
        audio_file_list: List of audio file paths
        preprocessing_config: Dict with sample_rate, n_mels, etc.
    """
    # Sort file list for determinism
    sorted_files = sorted(audio_file_list)
    
    # Combine file list + config
    content = {
        "files": sorted_files,
        "preprocessing": preprocessing_config
    }
    
    # Compute hash
    content_str = json.dumps(content, sort_keys=True)
    dataset_hash = hashlib.sha256(content_str.encode()).hexdigest()
    
    return dataset_hash

# Usage
train_files = glob.glob("/data/librispeech/train-960h/**/*.flac", recursive=True)
preprocessing_config = {
    "sample_rate": 16000,
    "n_mels": 80,
    "hop_length": 160,
    "win_length": 400
}

dataset_hash = compute_dataset_hash(train_files, preprocessing_config)

run.log_config({
    "train_dataset_hash": dataset_hash,
    "train_dataset_files": len(train_files),
    "preprocessing": preprocessing_config
})
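
Before training starts, the same helper can re-validate that the resolved file list still matches the logged dataset version. This is a minimal sketch (the expected hash and file count are hypothetical literals), and it doubles as a guard against the "wrong dataset" failure mode discussed later:

def validate_dataset(audio_file_list, preprocessing_config, expected_hash, expected_count):
    """Fail fast if the data or preprocessing no longer matches the logged dataset version."""
    if len(audio_file_list) != expected_count:
        raise ValueError(
            f"File count mismatch: expected {expected_count}, found {len(audio_file_list)}"
        )
    actual_hash = compute_dataset_hash(audio_file_list, preprocessing_config)
    if actual_hash != expected_hash:
        raise ValueError("Dataset hash mismatch: audio list or preprocessing config changed")

# Usage: expected values come from a previously logged run (shown here as hypothetical literals)
validate_dataset(
    train_files,
    preprocessing_config,
    expected_hash="3a7bd3e2360a3d...",  # hypothetical hash logged by an earlier run
    expected_count=281241,              # hypothetical file count logged by an earlier run
)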

5. Decoding Hyperparameter Sweeps

ASR decoding often involves sweeping beam width and language model weight:

def decode_sweep(model, test_set, run):
    """
    Sweep decoding hyperparameters and log results.
    """
    beam_widths = [1, 5, 10, 20]
    lm_weights = [0.0, 0.3, 0.5, 0.7, 1.0]
    
    results = []
    
    for beam in beam_widths:
        for lm_weight in lm_weights:
            wer = evaluate_with_decoding_params(
                model, test_set, beam_width=beam, lm_weight=lm_weight
            )
            
            results.append({
                "beam_width": beam,
                "lm_weight": lm_weight,
                "wer": wer
            })
            
            # Log each config
            run.log_metrics({
                f"wer_beam{beam}_lm{lm_weight}": wer
            })
    
    # Find best config
    best = min(results, key=lambda x: x['wer'])
    run.log_config({"best_decoding_config": best})
    
    # Log full sweep results as artifact
    run.log_json("decoding_sweep.json", results)

Scaling Strategies

1. Efficient Audio Data Handling

Challenge: Copying audio files into each experiment is expensive and redundant.

Solution:

  • Store audio in a centralized data lake (e.g., S3, HDFS).
  • Track by reference: Experiments store paths/URIs, not copies.
  • Use hashing for deduplication: If multiple experiments use the same dataset, hash it once.

# Don't do this:
run.log_artifact("train_audio.tar.gz")  # 100 GB upload per experiment!

# Do this:
run.log_config({
    "train_audio_path": "s3://speech-data/librispeech/train-960h/",
    "train_audio_hash": "sha256:abc123..."
})

2. Checkpoint Deduplication

Challenge: Models can be GBs. Saving every checkpoint is expensive.

Solution:

  • Content-based deduplication: Hash checkpoint files (see the sketch after this list).
  • Incremental checkpoints: Store only parameter diffs if possible.
  • Tiered storage: Recent checkpoints on fast storage, old checkpoints on Glacier.
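
One way to implement content-based deduplication is to hash each checkpoint and upload only unseen content. In this sketch, `artifact_exists` and `link_artifact` are hypothetical client methods, assuming the artifact store is indexed by content_hash:

import hashlib
import os

def sha256_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so multi-GB checkpoints never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def log_checkpoint_deduplicated(client, run, path, epoch):
    content_hash = sha256_file(path)
    if client.artifact_exists(content_hash):  # hypothetical lookup by content hash
        # Same bytes already in the artifact store (e.g., a frozen pre-trained encoder):
        # record a reference instead of re-uploading gigabytes.
        run.link_artifact(content_hash, type="checkpoint", epoch=epoch)  # hypothetical
    else:
        run.log_artifact(
            path, type="checkpoint", epoch=epoch,
            content_hash=content_hash, size_bytes=os.path.getsize(path),
        )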

3. Distributed Evaluation

For large-scale evaluation (100+ test sets, 10+ languages):

  • Use a distributed evaluation service (Ray, Spark).
  • Parallelize across test sets and languages.
  • Aggregate results and log back to experiment tracker.

import ray

ray.init()  # connect to an existing Ray cluster, or start a local one

@ray.remote
def evaluate_test_set(model_path, test_set):
    model = load_model(model_path)
    wer = evaluate(model, test_set)
    return test_set, wer

# Distribute evaluation
test_sets = ["test_clean", "test_other", "cv_en", "cv_es", ...]
futures = [evaluate_test_set.remote(model_path, ts) for ts in test_sets]
results = ray.get(futures)

# Log all results
for test_set, wer in results:
    run.log_metric(f"{test_set}_wer", wer)

Monitoring & Observability

Key Metrics

System metrics:

  • Request latency (API, artifact upload/download)
  • Storage usage (models, audio references, metadata)
  • Error rates (failed experiments, upload failures)

User metrics:

  • Active experiments and runs
  • Average experiment duration
  • Most common model architectures, datasets, languages
  • WER distribution across runs

Dashboards:

  • Experiment dashboard (running/completed/failed, recent results)
  • System health (latency, storage, errors)
  • Cost dashboard (storage, compute, data transfer)

Alerts

  • Experiment failed after >12 hours of training
  • Storage usage >90% capacity
  • API error rate >1%
  • WER degradation on key test sets (a minimal check is sketched below)
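
As an example of the last alert, a minimal WER-regression check against per-test-set baselines might look like this (baseline values, the threshold, and the `send_alert` hook are placeholders):

# Baseline WERs for key test sets and an allowed relative regression (illustrative values)
WER_BASELINES = {"librispeech-test-clean": 0.025, "librispeech-test-other": 0.058}
MAX_RELATIVE_REGRESSION = 0.05   # alert if WER is >5% (relative) worse than baseline

def check_wer_regressions(run_metrics, send_alert):
    """run_metrics: dict like {'librispeech-test-clean_wer': 0.027, ...}."""
    for test_set, baseline in WER_BASELINES.items():
        wer = run_metrics.get(f"{test_set}_wer")
        if wer is None:
            send_alert(f"Missing WER for key test set: {test_set}")
        elif wer > baseline * (1 + MAX_RELATIVE_REGRESSION):
            send_alert(f"WER regression on {test_set}: {wer:.3f} vs baseline {baseline:.3f}")

# Usage with a trivial alert hook
check_wer_regressions({"librispeech-test-clean_wer": 0.031}, send_alert=print)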

Failure Modes & Mitigations

| Failure Mode | Impact | Mitigation |
| --- | --- | --- |
| Training crash mid-run | Lost progress, wasted compute | Robust checkpointing, auto-resume |
| Artifact upload failure | Model/checkpoint not saved | Retry with exponential backoff, local backup |
| Dataset hash collision | Wrong dataset used | Use strong hash (SHA-256), validate file count |
| Metric not logged | Incomplete evaluation | Client-side buffering, fail-safe logging |
| Evaluation script bug | Wrong WER reported | Unit tests for evaluation, log decoded outputs |
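
For the artifact-upload failure mode, the retry mitigation can be as simple as a wrapper with exponential backoff and jitter. A sketch, with `upload_fn` and the retry budget as assumptions:

import random
import time

def upload_with_retry(upload_fn, path, max_attempts=5, base_delay=1.0):
    """Retry a flaky upload with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return upload_fn(path)
        except Exception as exc:
            if attempt == max_attempts:
                # Out of retries: surface the error so the caller can fall back to a local copy.
                raise RuntimeError(f"Upload failed after {max_attempts} attempts: {exc}")
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            time.sleep(delay)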

Real-World Case Study: Multi-Lingual ASR Team

Scenario:

  • Team of 20 researchers
  • 10+ languages (en, zh, es, fr, de, ja, ko, ar, hi, ru)
  • 100K+ hours of audio data
  • 1000+ experiments over 2 years

Architecture:

  • Metadata: PostgreSQL with indexes on language, domain, test_set
  • Metrics: InfluxDB for fast time-series queries
  • Artifacts: S3 with lifecycle policies (checkpoints >6 months → Glacier)
  • Audio data: Centralized S3 bucket, organized by language/domain
  • API: Kubernetes cluster with auto-scaling

Key optimizations:

  • Audio stored by reference (saved ~10 TB of redundant uploads)
  • Checkpoint deduplication (saved ~30% storage)
  • Distributed evaluation (Ray cluster, 10x speedup on multi-test-set evaluation)

Outcomes:

  • 99.9% uptime
  • Median query latency: 150ms
  • Complete audit trail for model releases
  • Reproducibility: 95% of experiments re-runnable from metadata

Cost Analysis

Example: Medium-Sized Speech Team

Assumptions:

  • 10 researchers
  • 50 experiments/month, 500 runs/month
  • Average run: 5 GB model, 10K metrics, 10 test sets
  • Audio data: 50K hours, stored centrally (not per-experiment)
  • Retention: 2 years

| Component | Cost/Month |
| --- | --- |
| Metadata DB (PostgreSQL RDS) | $300 |
| Metrics DB (InfluxDB) | $400 |
| Model storage (S3, 5 TB) | $115 |
| Audio data storage (S3, 50 TB, reference-only) | $1,150 |
| API compute (Kubernetes) | $600 |
| Data transfer | $150 |
| Total | $2,715 |
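
The two S3 storage lines follow directly from standard object-storage pricing, assuming roughly $0.023 per GB-month for S3 Standard (actual prices vary by region and storage class):

S3_STANDARD_PER_GB_MONTH = 0.023   # approximate us-east-1 price; varies by region/tier

model_storage_tb = 5
audio_storage_tb = 50

model_cost = model_storage_tb * 1000 * S3_STANDARD_PER_GB_MONTH    # ≈ $115/month
audio_cost = audio_storage_tb * 1000 * S3_STANDARD_PER_GB_MONTH    # ≈ $1,150/month

print(f"Model storage: ${model_cost:.0f}/month, audio storage: ${audio_cost:.0f}/month")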

Cost savings from best practices:

  • Audio by reference (vs copying per-experiment): -$50K/year
  • Checkpoint deduplication: -$500/month
  • Tiered storage (Glacier for old artifacts): -$200/month

Advanced Topics

1. Integration with ESPnet

ESPnet is a popular speech toolkit. Integrate experiment tracking:

# In ESPnet training script
import speech_experiment_tracker as sxt  # same alias as before; avoid shadowing the built-in `set`

# Initialize tracker
client = sxt.Client(...)
run = client.start_run(...)

# Log ESPnet config
run.log_config(vars(args))  # args from argparse

# Hook into ESPnet trainer
class TrackerCallback:
    def on_epoch_end(self, epoch, metrics):
        run.log_metrics(metrics, step=epoch)
    
    def on_checkpoint_save(self, epoch, checkpoint_path):
        run.log_artifact(checkpoint_path, type="checkpoint", epoch=epoch)

trainer.add_callback(TrackerCallback())

2. Attention Map Visualization

For Transformer-based ASR, log and visualize attention maps:

import matplotlib.pyplot as plt

def plot_attention(attention_weights, hyp_tokens, ref_tokens):
    """Plot attention matrix."""
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(attention_weights, cmap='Blues')
    ax.set_xticks(range(len(ref_tokens)))
    ax.set_xticklabels(ref_tokens, rotation=90)
    ax.set_yticks(range(len(hyp_tokens)))
    ax.set_yticklabels(hyp_tokens)
    ax.set_xlabel("Reference Tokens")
    ax.set_ylabel("Hypothesis Tokens")
    return fig

# During evaluation
attention_weights = model.get_attention_weights(audio)
fig = plot_attention(attention_weights, hyp_tokens, ref_tokens)
run.log_figure("attention_plot.png", fig)

3. Speaker-Level Analysis

Track per-speaker WER to identify performance gaps:

import numpy as np

def speaker_level_analysis(model, test_set, run):
    """Compute and log per-speaker WER."""
    speaker_stats = {}
    test_loader = get_test_loader(test_set)

    for batch in test_loader:
        hyps = model.decode(batch['audio'])
        refs = batch['text']
        speakers = batch['speaker_ids']
        
        for hyp, ref, speaker in zip(hyps, refs, speakers):
            errors, words = compute_wer_details(hyp, ref)
            
            if speaker not in speaker_stats:
                speaker_stats[speaker] = {"errors": 0, "words": 0}
            
            speaker_stats[speaker]["errors"] += errors
            speaker_stats[speaker]["words"] += words
    
    # Compute per-speaker WER
    for speaker, stats in speaker_stats.items():
        wer = stats["errors"] / max(1, stats["words"])
        run.log_metric(f"speaker_{speaker}_wer", wer)
    
    # Log summary statistics
    wers = [stats["errors"] / max(1, stats["words"]) for stats in speaker_stats.values()]
    run.log_metrics({
        "avg_speaker_wer": np.mean(wers),
        "median_speaker_wer": np.median(wers),
        "worst_speaker_wer": np.max(wers),
        "best_speaker_wer": np.min(wers)
    })

Practical Operations Checklist

For Speech Researchers

  • Always log data config: Dataset, hours, languages, speaker count.
  • Track augmentation policies: SpecAugment, noise, speed perturbation.
  • Evaluate on multiple test sets: Clean, noisy, accented, domain-specific.
  • Log decoded outputs: For error analysis and debugging.
  • Track decoding hyperparameters: Beam width, LM weight.
  • Use descriptive run names: conformer_12l_960h_specaug > run_42.

For Platform Engineers

  • Monitor storage growth: Audio and models can grow quickly.
  • Set up tiered storage: Move old checkpoints to Glacier (a lifecycle-rule sketch follows this list).
  • Implement checkpoint cleanup: Delete intermediate checkpoints after final model is saved.
  • Monitor evaluation queue: Distributed eval should not be a bottleneck.
  • Test disaster recovery: Can you restore experiments from backups?
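
For the tiered-storage item, a single lifecycle rule on the artifact bucket does most of the work. A hedged boto3 sketch, with the bucket name, prefix, and 180-day threshold as placeholders matching the case study's ">6 months → Glacier" policy:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="speech-experiment-artifacts",   # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-checkpoints",
                "Filter": {"Prefix": "checkpoints/"},   # placeholder prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 180, "StorageClass": "GLACIER"}   # ~6 months
                ],
            }
        ]
    },
)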

Key Takeaways

Speech experiment management requires domain-specific extensions (audio data, WER/CER, multi-test-set evaluation).

Store audio by reference, not by copying—saves massive storage costs.

Track data versions (hashes, preprocessing configs) for reproducibility.

Multi-dimensional evaluation (language, domain, noise condition) is critical for speech.

Checkpoint deduplication and tiered storage reduce costs significantly.

Systematic iteration through experiment spaces (model × data × hyperparameters) mirrors structured traversal patterns like Spiral Matrix.

Integration with speech toolkits (ESPnet, Kaldi, SpeechBrain) is key for adoption.

All three Day 19 topics converge on systematic, stateful exploration:

DSA (Spiral Matrix):

  • Layer-by-layer traversal with boundary tracking
  • Explicit state management (top, bottom, left, right)
  • Resume/pause friendly

ML System Design (Experiment Tracking Systems):

  • Systematic exploration of hyperparameter/architecture spaces
  • Track state of experiments (running, completed, failed)
  • Resume from checkpoints, recover from failures

Speech Tech (Speech Experiment Management):

  • Organize speech experiments across model × data × hyperparameters × decoding dimensions
  • Track training state (checkpoints, metrics, evaluation results)
  • Enable reproducibility and multi-test-set comparison

The unifying pattern: structured, stateful iteration through complex, multi-dimensional spaces with clear persistence and recoverability.


Originally published at: arunbaby.com/speech-tech/0019-speech-experiment-management

If you found this helpful, consider sharing it with others who might benefit.