Experiment Tracking Systems
Design robust experiment tracking systems that enable systematic exploration, reproducibility, and collaboration across large ML teams.
TL;DR
Experiment tracking separates metadata (PostgreSQL), time-series metrics (InfluxDB), and large artifacts (S3) into purpose-built services because each has different storage, query, and scale requirements. Content-based deduplication and tiered storage slash artifact costs by 50%. Client-side metric buffering prevents high-frequency logging from overwhelming the backend. The system captures everything needed for reproducibility – hyperparameters, code versions, data versions, environment – and is the foundation that makes A/B testing and model evaluation systematic rather than ad hoc.
Problem Statement
Design an Experiment Tracking System for ML teams that:
- Tracks all experiment metadata: hyperparameters, metrics, code versions, data versions, artifacts
- Supports large scale: Thousands of experiments, millions of runs, petabyte-scale model artifacts
- Enables comparison and visualization: Compare runs, plot learning curves, analyze hyperparameter impact
- Ensures reproducibility: Any experiment can be re-run from tracked metadata
- Integrates with training pipelines: Minimal code changes, automatic logging
- Supports collaboration: Share experiments, notebook integration, API access
Functional Requirements
- Experiment lifecycle management:
- Create experiments and runs
- Log parameters, metrics, tags, notes
- Upload artifacts (models, plots, datasets)
- Track code versions (Git commit, diff)
- Track data versions (dataset hashes, splits)
- Link parent/child runs (hyperparameter sweeps, ensemble members)
- Query and search:
- Filter by parameters, metrics, tags
- Full-text search over notes and descriptions
- Query by date, user, project
- Visualization and comparison:
- Learning curves (metric vs step/epoch)
- Hyperparameter sweeps (parallel coordinates, scatter)
- Compare multiple runs side-by-side
- Export to notebooks (Jupyter, Colab)
- Artifact management:
- Store and version models, checkpoints, plots
- Efficient storage for large artifacts (deduplication, compression)
- Support for streaming logs (real-time metrics)
- Reproducibility:
- Capture full environment (packages, hardware, Docker image)
- Re-run experiments from tracked metadata
- Audit trail for compliance
- Integration:
- Python SDK (PyTorch, TensorFlow, JAX)
- CLI for automation
- REST API for custom clients
- Webhook/notification support
Non-Functional Requirements
- Scalability: Support 10K+ concurrent experiments, 1M+ total runs
- Performance: Log metrics with <10ms latency, query results in <1s
- Reliability: 99.9% uptime, no data loss
- Security: Role-based access control, encryption at rest and in transit
- Cost efficiency: Optimize storage costs for artifacts (tiered storage, compression)
Understanding the Requirements
Why Experiment Tracking Matters
Without systematic tracking, ML teams face:
- Lost experiments: “Which hyperparameters gave us 92% accuracy last month?”
- Wasted compute: Re-running experiments accidentally
- Non-reproducibility: “It worked on my laptop, but we can’t reproduce it”
- Collaboration friction: Hard to share and compare results
A good experiment tracking system is the foundation of MLOps. It enables:
- Systematic exploration of model/data/hyperparameter spaces
- Clear audit trails for model governance
- Faster iteration through better visibility
The Systematic Iteration Connection
Just like Spiral Matrix systematically traverses a 2D structure layer-by-layer:
- Experiment tracking systematically explores multi-dimensional spaces:
- Hyperparameters × architectures × data configurations × training schedules
- Both require clear state management:
- Spiral: track boundaries (top, bottom, left, right)
- Experiments: track runs (completed, running, failed), checkpoints, metrics
- Both enable resumability:
- Spiral: can pause and resume traversal
- Experiments: can restart from checkpoints, resume sweeps
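For reference, a minimal sketch of the boundary-tracking pattern the Spiral Matrix analogy refers to (plain Python, independent of any tracking SDK): the four explicit boundaries are the "state" that makes the traversal easy to pause and resume.
def spiral_order(matrix):
    # Explicit state: four boundaries that shrink as each layer is consumed.
    if not matrix or not matrix[0]:
        return []
    top, bottom = 0, len(matrix) - 1
    left, right = 0, len(matrix[0]) - 1
    out = []
    while top <= bottom and left <= right:
        for c in range(left, right + 1):
            out.append(matrix[top][c])
        top += 1
        for r in range(top, bottom + 1):
            out.append(matrix[r][right])
        right -= 1
        if top <= bottom:
            for c in range(right, left - 1, -1):
                out.append(matrix[bottom][c])
            bottom -= 1
        if left <= right:
            for r in range(bottom, top - 1, -1):
                out.append(matrix[r][left])
            left -= 1
    return out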
High-Level Architecture
┌────────────────────────────────────────────────────────────┐
│                 Experiment Tracking System                 │
└────────────────────────────────────────────────────────────┘
                         Client Layer
         ┌────────────────────────────────────────────┐
         │  Python SDK  │  CLI  │  Web UI  │  API     │
         └─────────────────────┬──────────────────────┘
                               │
                          API Gateway
                ┌──────────────┴──────────────┐
                │ - Auth & rate limiting      │
                │ - Request routing           │
                │ - Logging & monitoring      │
                └──────────────┬──────────────┘
                               │
              ┌────────────────┼────────────────┐
              │                │                │
     ┌────────▼───────┐  ┌─────▼─────┐  ┌───────▼────────┐
     │  Metadata      │  │  Metrics  │  │  Artifact      │
     │  Service       │  │  Service  │  │  Service       │
     │                │  │           │  │                │
     │ - Experiments  │  │ - Logs    │  │ - Models       │
     │ - Runs         │  │ - Curves  │  │ - Plots        │
     │ - Parameters   │  │ - Scalars │  │ - Datasets     │
     │ - Tags         │  │ - Hists   │  │ - Checkpoints  │
     └────────┬───────┘  └─────┬─────┘  └───────┬────────┘
              │                │                │
     ┌────────▼───────┐  ┌─────▼─────┐  ┌───────▼────────┐
     │  Postgres /    │  │ TimeSeries│  │  Object Store  │
     │  MySQL         │  │ DB        │  │  (S3/GCS)      │
     │                │  │ (InfluxDB)│  │                │
     │ - Structured   │  │ - Metrics │  │ - Large files  │
     │   metadata     │  │ - Fast    │  │ - Versioned    │
     └────────────────┘  │   queries │  │ - Deduped      │
                         └───────────┘  └────────────────┘
Key Components
- Metadata Service:
- Stores experiment and run metadata (params, tags, code versions, user, etc.)
- Relational DB for structured queries
- Indexes on common query patterns (user, project, date, tags)
- Metrics Service:
- High-throughput metric logging (train loss, val accuracy, etc.)
- Time-series database (InfluxDB, Prometheus, or custom)
- Support for streaming metrics (real-time plots)
- Artifact Service:
- Stores large files (models, checkpoints, plots, datasets)
- Object storage (S3, GCS, Azure Blob)
- Deduplication (hash-based), compression, tiered storage
- API Gateway:
- Authentication & authorization (OAuth, API keys)
- Rate limiting (per user/project)
- Request routing and load balancing
- Web UI:
- Dashboards for experiments, runs, metrics
- Comparison tools (side-by-side, parallel coordinates)
- Notebook integration (export to Jupyter)
Component Deep-Dive
1. Metadata Schema
Experiments group related runs (e.g., “ResNet ablation study”).
Runs are individual training jobs with:
- Unique run ID
- Parameters (hyperparameters, model config)
- Metrics (logged scalars, step-indexed)
- Tags (labels for filtering)
- Code version (Git commit, diff)
- Data version (dataset hash, split config)
- Environment (Python packages, Docker image, hardware)
- Artifacts (model files, plots, logs)
- Status (running, completed, failed)
- Timestamps (start, end)
Schema example (simplified):
CREATE TABLE experiments (
experiment_id UUID PRIMARY KEY,
name VARCHAR(255),
description TEXT,
created_at TIMESTAMP,
user_id VARCHAR(255),
project_id UUID
);
CREATE TABLE runs (
run_id UUID PRIMARY KEY,
experiment_id UUID REFERENCES experiments(experiment_id),
name VARCHAR(255),
status VARCHAR(50), -- running, completed, failed
start_time TIMESTAMP,
end_time TIMESTAMP,
user_id VARCHAR(255),
git_commit VARCHAR(40),
git_diff TEXT,
docker_image VARCHAR(255),
environment JSONB, -- packages, hardware
notes TEXT
);
CREATE TABLE run_params (
run_id UUID REFERENCES runs(run_id),
key VARCHAR(255),
value TEXT, -- JSON serialized
PRIMARY KEY (run_id, key)
);
CREATE TABLE run_metrics (
run_id UUID REFERENCES runs(run_id),
key VARCHAR(255), -- e.g., 'train_loss', 'val_accuracy'
step INT,
value FLOAT,
timestamp TIMESTAMP,
PRIMARY KEY (run_id, key, step)
);
CREATE TABLE run_tags (
run_id UUID REFERENCES runs(run_id),
key VARCHAR(255),
value VARCHAR(255),
PRIMARY KEY (run_id, key)
);
CREATE TABLE run_artifacts (
artifact_id UUID PRIMARY KEY,
run_id UUID REFERENCES runs(run_id),
path VARCHAR(1024), -- e.g., 'model.pt', 'plots/loss.png'
size_bytes BIGINT,
content_hash VARCHAR(64), -- SHA-256
storage_uri TEXT, -- S3 URI
created_at TIMESTAMP
);
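To make the query patterns concrete, here is a sketch of how this schema supports a common lookup: the best completed run in an experiment by logged val_accuracy. The connection string and experiment UUID are placeholders, assuming psycopg2 against the Postgres instance above.
import psycopg2

conn = psycopg2.connect("dbname=tracking user=tracker")  # placeholder DSN

BEST_RUN_SQL = """
    SELECT r.run_id, r.name, MAX(m.value) AS best_val_accuracy
    FROM runs r
    JOIN run_metrics m ON m.run_id = r.run_id
    WHERE r.experiment_id = %s
      AND m.key = 'val_accuracy'
      AND r.status = 'completed'
    GROUP BY r.run_id, r.name
    ORDER BY best_val_accuracy DESC
    LIMIT 1;
"""

with conn.cursor() as cur:
    cur.execute(BEST_RUN_SQL, ("00000000-0000-0000-0000-000000000000",))
    print(cur.fetchone())  # (run_id, name, best_val_accuracy)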
2. Python SDK (Client Interface)
import experiment_tracker as et
# Initialize client
client = et.Client(api_url="https://tracking.example.com", api_key="...")
# Create experiment
experiment = client.create_experiment(
name="ResNet50 ImageNet Ablation",
description="Testing different optimizers and learning rates"
)
# Start a run
run = experiment.start_run(
name="run_adam_lr0.001",
tags={"optimizer": "adam", "dataset": "imagenet"}
)
# Log parameters
run.log_params({
"model": "resnet50",
"optimizer": "adam",
"learning_rate": 0.001,
"batch_size": 256,
"epochs": 90
})
# Training loop
for epoch in range(90):
train_loss = train_one_epoch(model, optimizer, train_loader)
val_acc = validate(model, val_loader)
# Log metrics
run.log_metrics({
"train_loss": train_loss,
"val_accuracy": val_acc
}, step=epoch)
# Save model
run.log_artifact("model.pt", local_path="./checkpoints/model_epoch90.pt")
# Mark run as complete
run.finish()
3. Metric Logging & Streaming
For real-time metric visualization:
- Clients send metrics via WebSocket or HTTP streaming
- Metrics Service buffers and batches writes to time-series DB
- Web UI subscribes to metric streams for live plots
# Streaming metrics example
def train_with_streaming_metrics(model, run):
for step, batch in enumerate(train_loader):
loss = train_step(model, batch)
# Log every N steps for live tracking
if step % 10 == 0:
run.log_metric("train_loss", loss, step=step)
# Internally: buffered, batched, sent asynchronously
4. Artifact Storage & Deduplication
Challenge: Models can be GBs–TBs. Storing every checkpoint is expensive.
Solution:
- Content-based deduplication:
- Hash each artifact (SHA-256).
- If hash exists, create metadata entry but don’t re-upload.
- Tiered storage:
- Hot: Recent artifacts on fast storage (SSD, S3 standard).
- Cold: Old artifacts on cheaper storage (S3 Glacier, tape).
- Compression:
- Compress models before upload (gzip, zstd).
import hashlib
import gzip
def upload_artifact(run_id: str, path: str, local_file: str):
# Compute hash
with open(local_file, 'rb') as f:
content = f.read()
content_hash = hashlib.sha256(content).hexdigest()
# Check if artifact with this hash exists
existing = artifact_service.get_by_hash(content_hash)
if existing:
# Create metadata entry pointing to existing storage
artifact_service.link_artifact(run_id, path, existing.storage_uri, content_hash)
return
# Compress and upload
compressed = gzip.compress(content)
storage_uri = object_store.upload(f"{run_id}/{path}.gz", compressed)
# Create metadata entry
artifact_service.create_artifact(
run_id=run_id,
path=path,
size_bytes=len(content),
content_hash=content_hash,
storage_uri=storage_uri
)
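The tiered-storage policy can be expressed declaratively rather than in application code. A sketch using an S3 lifecycle rule via boto3 (bucket name, prefix, and the 30-day threshold are assumptions):
import boto3

s3 = boto3.client("s3")

# Move artifacts under the given prefix to Glacier after 30 days (hot -> cold).
s3.put_bucket_lifecycle_configuration(
    Bucket="experiment-artifacts",            # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "artifacts-to-glacier",
                "Filter": {"Prefix": "runs/"},   # placeholder prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)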
5. Query & Search
Common queries:
- “Show all runs with learning_rate > 0.01 and val_accuracy > 0.9”
- “Find best run in experiment X by val_accuracy”
- “Show runs created in last 7 days by user Y”
Implementation:
# Query API example
runs = client.search_runs(
experiment_ids=["exp123"],
filter_string="params.learning_rate > 0.01 AND metrics.val_accuracy > 0.9",
order_by=["metrics.val_accuracy DESC"],
max_results=10
)
for run in runs:
print(f"Run {run.run_id}: LR={run.params['learning_rate']}, Acc={run.metrics['val_accuracy']}")
Optimization:
- Index on common filter fields (user_id, experiment_id, tags, status, timestamps)
- Cache popular queries (top runs, recent runs)
- Use read replicas for heavy read workloads
Scaling Strategies
1. Sharding Experiments/Runs
For very large deployments:
- Shard metadata DB by experiment_id or user_id
- Each shard handles a subset of experiments
- API Gateway routes requests to correct shard
2. Metric Buffering & Batching
High-throughput training jobs can log metrics at high frequency (100s–1000s/sec):
- Client buffers metrics locally
- Batches and sends every N seconds or M metrics
- Server-side ingestion queue (Kafka, SQS) for further buffering
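A minimal sketch of the client-side buffer (the batch size, flush interval, and send_batch transport are assumptions; a real SDK would also flush on run.finish()):
import threading
import time

class MetricBuffer:
    """Buffers metric points and sends them in batches instead of one request each."""

    def __init__(self, send_batch, max_batch=1000, flush_interval_s=5.0):
        self._send_batch = send_batch          # e.g., one HTTP POST per batch
        self._max_batch = max_batch
        self._flush_interval_s = flush_interval_s
        self._buffer = []
        self._last_flush = time.monotonic()
        self._lock = threading.Lock()

    def log(self, key, value, step):
        with self._lock:
            self._buffer.append(
                {"key": key, "value": value, "step": step, "timestamp": time.time()}
            )
            due = (
                len(self._buffer) >= self._max_batch
                or time.monotonic() - self._last_flush >= self._flush_interval_s
            )
        if due:
            self.flush()

    def flush(self):
        with self._lock:
            batch, self._buffer = self._buffer, []
            self._last_flush = time.monotonic()
        if batch:
            self._send_batch(batch)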
3. Artifact Caching
Frequently accessed artifacts (latest models, popular checkpoints):
- Cache in CDN (CloudFront, Fastly)
- Local cache on training nodes (NVMe SSD)
- Lazy loading: download only when accessed
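A sketch of the lazy, content-addressed local cache on a training node (the cache directory and download function are assumptions):
from pathlib import Path

CACHE_DIR = Path("/mnt/nvme/artifact-cache")   # placeholder NVMe mount

def fetch_artifact(content_hash, storage_uri, download_fn):
    """Download an artifact only on first access; later accesses hit the local cache."""
    cached = CACHE_DIR / content_hash
    if cached.exists():
        return cached
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    tmp = cached.with_suffix(".partial")
    download_fn(storage_uri, tmp)              # e.g., an S3 GET; injected by caller
    tmp.rename(cached)                         # atomic publish into the cache
    return cached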
4. Distributed Artifact Storage
For petabyte-scale artifact storage:
- Use distributed object stores (S3, GCS, Ceph)
- Implement multipart upload for large files
- Use pre-signed URLs for direct client-to-storage uploads (bypass API server)
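Pre-signed uploads look roughly like this with boto3 (bucket and key are placeholders); the client then PUTs the file straight to S3, and only the resulting URI passes through the API:
import boto3

s3 = boto3.client("s3")

def presigned_upload_url(bucket: str, key: str, expires_s: int = 3600) -> str:
    """URL a training job can PUT an artifact to directly, bypassing the API server."""
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_s,
    )

# Example: hand this to the SDK so it uploads model.pt without proxying the bytes.
url = presigned_upload_url("experiment-artifacts", "runs/run123/model.pt")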
Monitoring & Observability
Key Metrics
System metrics:
- Request latency (p50, p95, p99)
- Throughput (requests/sec, metrics logged/sec, artifacts uploaded/sec)
- Error rates (4xx, 5xx)
- Storage usage (DB size, object store size)
User metrics:
- Active experiments/runs
- Average metrics logged per run
- Average artifact size
- Query response times
Dashboards:
- Real-time experiment dashboard (running/completed/failed runs)
- System health dashboard (latency, error rates, resource usage)
- Cost dashboard (storage costs, compute costs)
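One way to expose these system metrics is standard Prometheus instrumentation inside the API services; a sketch with prometheus_client (metric names and the ingestion call are illustrative):
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "api_request_latency_seconds", "Request latency by endpoint", ["endpoint"]
)
METRICS_INGESTED = Counter("metric_points_ingested_total", "Metric points written")

def handle_log_metrics(batch):
    # Time the request and count ingested points for the dashboards above.
    with REQUEST_LATENCY.labels(endpoint="/runs/metrics").time():
        written = write_to_timeseries_db(batch)   # placeholder ingestion call
        METRICS_INGESTED.inc(written)

start_http_server(9100)   # exposes /metrics for Prometheus to scrape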
Failure Modes & Mitigations
| Failure Mode | Impact | Mitigation |
|---|---|---|
| Metadata DB down | Can’t create/query experiments | Read replicas, automatic failover, local caching |
| Object store unavailable | Can’t upload/download artifacts | Retry with exponential backoff, fallback to local storage |
| Metric ingestion backlog | Delayed metric visibility | Buffering, rate limiting, auto-scaling ingest workers |
| Lost run metadata | Experiment not reproducible | Periodic backups, transaction logs, write-ahead logs |
| Concurrent write conflicts | Metrics/artifacts overwritten | Optimistic locking, append-only logs |
| API rate limit hit | Client blocked | Exponential backoff, client-side buffering, increase limits |
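Several of the mitigations above reduce to the same client-side pattern; a sketch of retry with exponential backoff and jitter (attempt counts and delays are arbitrary defaults):
import random
import time

def with_retries(fn, max_attempts=5, base_delay_s=0.5):
    """Retry a flaky call (artifact upload, metric batch send) with backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: surface the error
            delay = base_delay_s * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)

# Example: with_retries(lambda: upload_artifact(run_id, "model.pt", "./model.pt"))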
Real-World Case Study: Large-Scale ML Team
Scenario:
- 100+ ML engineers and researchers
- 50K+ experiments, 1M+ runs
- 10 PB of artifacts (models, datasets, checkpoints)
- Multi-cloud (AWS, GCP)
Architecture:
- Metadata: PostgreSQL with read replicas, sharded by experiment_id
- Metrics: InfluxDB cluster, 100K metrics/sec write throughput
- Artifacts: S3 + GCS with cross-region replication
- API: Kubernetes cluster with auto-scaling (10–100 pods)
- Web UI: React SPA, served via CDN
Key optimizations:
- Pre-signed URLs for large artifact uploads (direct to S3/GCS)
- Client-side metric buffering (log every 10 steps, batch send)
- Artifact deduplication (saved ~30% storage cost)
- Tiered storage (hot: S3 Standard, cold: S3 Glacier, ~50% cost reduction)
Outcomes:
- 99.95% uptime
- Median query latency: 120ms
- p99 metric log latency: 8ms
- $200K/year savings from deduplication and tiered storage
Cost Analysis
Example: Medium-Sized Team
Assumptions:
- 10 researchers
- 100 experiments/month, 1000 runs/month
- Average run: 10 GB artifacts, 10K metrics
- Retention: 2 years
| Component | Cost/Month |
|---|---|
| Metadata DB (PostgreSQL RDS, db.r5.large) | $300 |
| Metrics DB (InfluxDB Cloud) | $500 |
| Object storage (S3, 10 TB) | $230 |
| API compute (Kubernetes, 5 nodes) | $750 |
| Data transfer | $100 |
| Total | $1,880 |
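As a sanity check on the storage line item, the arithmetic (assuming roughly $0.023/GB-month for S3 Standard; exact pricing varies by region):
tb_stored = 10
price_per_gb_month = 0.023                        # assumed S3 Standard rate
monthly_cost = tb_stored * 1024 * price_per_gb_month
print(f"${monthly_cost:,.0f}/month")              # ~$236/month, in line with the ~$230 above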
Optimization levers:
- Deduplication: -20–30% storage cost
- Tiered storage: -30–50% storage cost (move old artifacts to Glacier)
- Reserved instances: -30% compute cost
- Compression: -50% storage and transfer cost
Advanced Topics
1. Hyperparameter Sweep Integration
Integrate with hyperparameter tuning libraries (Optuna, Ray Tune):
import optuna
import experiment_tracker as et
def objective(trial):
run = experiment.start_run(name=f"trial_{trial.number}")
# Suggest hyperparameters
lr = trial.suggest_loguniform("learning_rate", 1e-5, 1e-1)
batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
run.log_params({"learning_rate": lr, "batch_size": batch_size})
# Train and log metrics
val_acc = train_and_evaluate(lr, batch_size, run)
run.finish()
return val_acc
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
2. Model Registry Integration
Link experiment tracking with model registry:
- Best run → promoted to staging → production
- Track lineage: model → run → experiment → dataset
3. Data Versioning
Track data versions alongside experiments:
- Dataset hash (content-based)
- Data pipeline version (Git commit)
- Train/val/test split configs
run.log_dataset(
name="imagenet_v2",
hash="sha256:abc123...",
split_config={"train": 0.8, "val": 0.1, "test": 0.1}
)
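A content-based dataset hash can be computed by walking the dataset directory in a stable order; a minimal sketch (chunked reads avoid loading large files into memory):
import hashlib
from pathlib import Path

def dataset_hash(root: str) -> str:
    """Hash relative paths and file contents so any data change yields a new version."""
    h = hashlib.sha256()
    root_path = Path(root)
    for path in sorted(p for p in root_path.rglob("*") if p.is_file()):
        h.update(str(path.relative_to(root_path)).encode())
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
    return f"sha256:{h.hexdigest()}"

# Example: run.log_dataset(name="imagenet_v2", hash=dataset_hash("/data/imagenet_v2"), ...)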
4. Compliance & Audit Trails
For regulated industries (healthcare, finance):
- Immutable experiment logs
- Audit trail for all changes (who, what, when)
- Data lineage tracking
- Access control and encryption
Practical Debugging & Operations Checklist
For Platform Engineers
- Monitor ingestion lag: Metrics should appear in UI within seconds of logging.
- Set up alerts: DB disk space >80%, API error rate >1%, artifact upload failures.
- Test disaster recovery: Can you restore from backups? Time to recover?
- Load test: Can the system handle 10x current load?
For ML Engineers
- Always log hyperparameters: Even “fixed” ones; you’ll want to compare them later.
- Use tags liberally: Makes filtering/searching much easier.
- Log environment: Git commit, Docker image, package versions; all critical for reproducibility.
- Log artifacts incrementally: Don’t wait until end of training to upload checkpoints.
- Use run names: Descriptive names make comparison easier (resnet50_adam_lr0.001 vs run_42).
Key Takeaways
✅ Experiment tracking is foundational for MLOps: it enables reproducibility, collaboration, and systematic exploration.
✅ Scale requires separation of concerns: metadata, metrics, artifacts each have different storage/query needs.
✅ Deduplication and tiered storage are critical for cost efficiency at scale.
✅ Client-side buffering avoids overwhelming the backend with high-frequency metric logging.
✅ Systematic iteration through experiment spaces mirrors structured traversal patterns (like Spiral Matrix).
✅ Integration with existing tools (Git, Docker, hyperparameter tuning) is key for adoption.
✅ Observability and cost monitoring are as important as core functionality.
Connection to Thematic Link: Systematic Iteration and State Tracking
All three topics converge on systematic, stateful exploration:
DSA (Spiral Matrix):
- Layer-by-layer traversal with boundary tracking
- Explicit state management (top, bottom, left, right)
- Resume/pause friendly
ML System Design (Experiment Tracking Systems):
- Systematic exploration of hyperparameter/architecture spaces
- Track state of experiments (running, completed, failed)
- Resume from checkpoints, recover from failures
Speech Tech (Speech Experiment Management):
- Organize speech model experiments across multiple dimensions
- Track model versions, data versions, training configs
- Enable reproducibility and comparison
The unifying pattern: structured iteration through complex spaces, with clear state persistence and recoverability.
FAQ
Why do ML teams need a dedicated experiment tracking system?
Without tracking, teams lose experiment results (“which hyperparameters gave 92% accuracy last month?”), waste compute re-running experiments accidentally, cannot reproduce results across environments, and struggle to compare approaches systematically. A tracking system provides the audit trail, collaboration tools, and systematic exploration capabilities that turn ad-hoc ML development into a reliable engineering practice.
How should you store ML experiment artifacts at scale?
Use object storage (S3/GCS) with content-based deduplication via SHA-256 hashing – if two runs produce identical model files, store only one copy. Apply gzip or zstd compression before upload to reduce transfer and storage costs. Implement tiered storage that automatically moves artifacts older than 30 days to cheaper cold storage like S3 Glacier. This approach typically saves 50% on artifact storage costs.
What metadata should every ML experiment run capture?
At minimum: hyperparameters (even “fixed” ones), training and validation metrics per step, Git commit hash and diff, Python package versions, Docker image tag, dataset hash with train/val/test split config, and hardware details (GPU type, count). This ensures any experiment can be reproduced months later. Add descriptive tags and run names to make filtering and comparison practical.
How do you handle high-frequency metric logging without overwhelming the backend?
Buffer metrics client-side (e.g., in a list) and batch-send every N seconds (typically 5-10) or M metrics (typically 100-1000). The server ingests via a queue (Kafka or SQS) for further buffering before writing to a time-series database optimized for append-heavy workloads like InfluxDB. This transforms thousands of individual writes per second into efficient batch inserts.
Cross-links: A/B Testing Systems | Model Evaluation Metrics | Model Monitoring Systems