Experiment Tracking Systems
Design robust experiment tracking systems that enable systematic exploration, reproducibility, and collaboration across large ML teams.
Problem Statement
Design an Experiment Tracking System for ML teams that:
- Tracks all experiment metadata: hyperparameters, metrics, code versions, data versions, artifacts
- Supports large scale: Thousands of experiments, millions of runs, petabyte-scale model artifacts
- Enables comparison and visualization: Compare runs, plot learning curves, analyze hyperparameter impact
- Ensures reproducibility: Any experiment can be re-run from tracked metadata
- Integrates with training pipelines: Minimal code changes, automatic logging
- Supports collaboration: Share experiments, notebook integration, API access
Functional Requirements
- Experiment lifecycle management:
  - Create experiments and runs
  - Log parameters, metrics, tags, notes
  - Upload artifacts (models, plots, datasets)
  - Track code versions (Git commit, diff)
  - Track data versions (dataset hashes, splits)
  - Link parent/child runs (hyperparameter sweeps, ensemble members)
- Query and search:
  - Filter by parameters, metrics, tags
  - Full-text search over notes and descriptions
  - Query by date, user, project
- Visualization and comparison:
  - Learning curves (metric vs step/epoch)
  - Hyperparameter sweeps (parallel coordinates, scatter)
  - Compare multiple runs side-by-side
  - Export to notebooks (Jupyter, Colab)
- Artifact management:
  - Store and version models, checkpoints, plots
  - Efficient storage for large artifacts (deduplication, compression)
  - Support for streaming logs (real-time metrics)
- Reproducibility:
  - Capture full environment (packages, hardware, Docker image)
  - Re-run experiments from tracked metadata
  - Audit trail for compliance
- Integration:
  - Python SDK (PyTorch, TensorFlow, JAX)
  - CLI for automation
  - REST API for custom clients
  - Webhook/notification support
Non-Functional Requirements
- Scalability: Support 10K+ concurrent experiments, 1M+ total runs
- Performance: Log metrics with <10ms latency, query results in <1s
- Reliability: 99.9% uptime, no data loss
- Security: Role-based access control, encryption at rest and in transit
- Cost efficiency: Optimize storage costs for artifacts (tiered storage, compression)
Understanding the Requirements
Why Experiment Tracking Matters
Without systematic tracking, ML teams face:
- Lost experiments: “Which hyperparameters gave us 92% accuracy last month?”
- Wasted compute: Re-running experiments accidentally
- Non-reproducibility: “It worked on my laptop, but we can’t reproduce it”
- Collaboration friction: Hard to share and compare results
A good experiment tracking system is the foundation of MLOps—it enables:
- Systematic exploration of model/data/hyperparameter spaces
- Clear audit trails for model governance
- Faster iteration through better visibility
The Systematic Iteration Connection
Just like Spiral Matrix systematically traverses a 2D structure layer-by-layer:
- Experiment tracking systematically explores multi-dimensional spaces:
  - Hyperparameters × architectures × data configurations × training schedules
- Both require clear state management:
  - Spiral: track boundaries (top, bottom, left, right)
  - Experiments: track runs (completed, running, failed), checkpoints, metrics
- Both enable resumability:
  - Spiral: can pause and resume traversal
  - Experiments: can restart from checkpoints, resume sweeps
High-Level Architecture
┌──────────────────────────────┐
│  Experiment Tracking System  │
└──────────────────────────────┘

Client Layer
┌─────────────────────────────────────────┐
│  Python SDK  │  CLI  │  Web UI  │  API  │
└────────────────────┬────────────────────┘
                     │
                API Gateway
      ┌──────────────┴──────────────┐
      │ - Auth & rate limiting      │
      │ - Request routing           │
      │ - Logging & monitoring      │
      └──────────────┬──────────────┘
                     │
    ┌────────────────┼────────────────┐
    │                │                │
┌───▼────────────┐ ┌─▼──────────┐ ┌───▼────────────┐
│ Metadata       │ │ Metrics    │ │ Artifact       │
│ Service        │ │ Service    │ │ Service        │
│                │ │            │ │                │
│ - Experiments  │ │ - Logs     │ │ - Models       │
│ - Runs         │ │ - Curves   │ │ - Plots        │
│ - Parameters   │ │ - Scalars  │ │ - Datasets     │
│ - Tags         │ │ - Hists    │ │ - Checkpoints  │
└───┬────────────┘ └─┬──────────┘ └───┬────────────┘
    │                │                │
┌───▼────────────┐ ┌─▼──────────┐ ┌───▼────────────┐
│ Postgres /     │ │ TimeSeries │ │ Object Store   │
│ MySQL          │ │ DB         │ │ (S3/GCS)       │
│                │ │ (InfluxDB) │ │                │
│ - Structured   │ │ - Metrics  │ │ - Large files  │
│   metadata     │ │ - Fast     │ │ - Versioned    │
│                │ │   queries  │ │ - Deduped      │
└────────────────┘ └────────────┘ └────────────────┘
Key Components
- Metadata Service:
  - Stores experiment and run metadata (params, tags, code versions, user, etc.)
  - Relational DB for structured queries
  - Indexes on common query patterns (user, project, date, tags)
- Metrics Service:
  - High-throughput metric logging (train loss, val accuracy, etc.)
  - Time-series database (InfluxDB, Prometheus, or custom)
  - Support for streaming metrics (real-time plots)
- Artifact Service:
  - Stores large files (models, checkpoints, plots, datasets)
  - Object storage (S3, GCS, Azure Blob)
  - Deduplication (hash-based), compression, tiered storage
- API Gateway:
  - Authentication & authorization (OAuth, API keys)
  - Rate limiting (per user/project)
  - Request routing and load balancing
- Web UI:
  - Dashboards for experiments, runs, metrics
  - Comparison tools (side-by-side, parallel coordinates)
  - Notebook integration (export to Jupyter)
Component Deep-Dive
1. Metadata Schema
Experiments group related runs (e.g., “ResNet ablation study”).
Runs are individual training jobs with:
- Unique run ID
- Parameters (hyperparameters, model config)
- Metrics (logged scalars, step-indexed)
- Tags (labels for filtering)
- Code version (Git commit, diff)
- Data version (dataset hash, split config)
- Environment (Python packages, Docker image, hardware)
- Artifacts (model files, plots, logs)
- Status (running, completed, failed)
- Timestamps (start, end)
Schema example (simplified):
CREATE TABLE experiments (
experiment_id UUID PRIMARY KEY,
name VARCHAR(255),
description TEXT,
created_at TIMESTAMP,
user_id VARCHAR(255),
project_id UUID
);
CREATE TABLE runs (
run_id UUID PRIMARY KEY,
experiment_id UUID REFERENCES experiments(experiment_id),
name VARCHAR(255),
status VARCHAR(50), -- running, completed, failed
start_time TIMESTAMP,
end_time TIMESTAMP,
user_id VARCHAR(255),
git_commit VARCHAR(40),
git_diff TEXT,
docker_image VARCHAR(255),
environment JSONB, -- packages, hardware
notes TEXT
);
CREATE TABLE run_params (
run_id UUID REFERENCES runs(run_id),
key VARCHAR(255),
value TEXT, -- JSON serialized
PRIMARY KEY (run_id, key)
);
CREATE TABLE run_metrics (
run_id UUID REFERENCES runs(run_id),
key VARCHAR(255), -- e.g., 'train_loss', 'val_accuracy'
step INT,
value FLOAT,
timestamp TIMESTAMP,
PRIMARY KEY (run_id, key, step)
);
CREATE TABLE run_tags (
run_id UUID REFERENCES runs(run_id),
key VARCHAR(255),
value VARCHAR(255),
PRIMARY KEY (run_id, key)
);
CREATE TABLE run_artifacts (
artifact_id UUID PRIMARY KEY,
run_id UUID REFERENCES runs(run_id),
path VARCHAR(1024), -- e.g., 'model.pt', 'plots/loss.png'
size_bytes BIGINT,
content_hash VARCHAR(64), -- SHA-256
storage_uri TEXT, -- S3 URI
created_at TIMESTAMP
);
2. Python SDK (Client Interface)
import experiment_tracker as et
# Initialize client
client = et.Client(api_url="https://tracking.example.com", api_key="...")
# Create experiment
experiment = client.create_experiment(
    name="ResNet50 ImageNet Ablation",
    description="Testing different optimizers and learning rates"
)

# Start a run
run = experiment.start_run(
    name="run_adam_lr0.001",
    tags={"optimizer": "adam", "dataset": "imagenet"}
)

# Log parameters
run.log_params({
    "model": "resnet50",
    "optimizer": "adam",
    "learning_rate": 0.001,
    "batch_size": 256,
    "epochs": 90
})

# Training loop
for epoch in range(90):
    train_loss = train_one_epoch(model, optimizer, train_loader)
    val_acc = validate(model, val_loader)

    # Log metrics
    run.log_metrics({
        "train_loss": train_loss,
        "val_accuracy": val_acc
    }, step=epoch)

# Save model
run.log_artifact("model.pt", local_path="./checkpoints/model_epoch90.pt")

# Mark run as complete
run.finish()
3. Metric Logging & Streaming
For real-time metric visualization:
- Clients send metrics via WebSocket or HTTP streaming
- Metrics Service buffers and batches writes to time-series DB
- Web UI subscribes to metric streams for live plots
# Streaming metrics example
def train_with_streaming_metrics(model, run):
    for step, batch in enumerate(train_loader):
        loss = train_step(model, batch)

        # Log every N steps for live tracking
        if step % 10 == 0:
            run.log_metric("train_loss", loss, step=step)
            # Internally: buffered, batched, sent asynchronously
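On the service side, the same idea applies: a bounded in-memory queue (or a durable queue like Kafka) is drained by a worker that writes to the time-series DB in batches, flushing on a size or time threshold. A minimal sketch, assuming a hypothetical write_points() helper in place of a real TSDB client:

import queue
import threading
import time

metric_queue: "queue.Queue[dict]" = queue.Queue(maxsize=100_000)

def write_points(points):
    # Placeholder for the actual TSDB write (InfluxDB, Prometheus remote write, ...)
    print(f"wrote {len(points)} points")

def ingest_worker(batch_size=500, flush_interval_s=1.0):
    # Drain the queue and flush to the TSDB on a size or time threshold
    batch, last_flush = [], time.monotonic()
    while True:
        try:
            batch.append(metric_queue.get(timeout=flush_interval_s))
        except queue.Empty:
            pass
        if batch and (len(batch) >= batch_size
                      or time.monotonic() - last_flush >= flush_interval_s):
            write_points(batch)
            batch, last_flush = [], time.monotonic()

threading.Thread(target=ingest_worker, daemon=True).start()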
4. Artifact Storage & Deduplication
Challenge: Models can be GBs–TBs. Storing every checkpoint is expensive.
Solution:
- Content-based deduplication:
  - Hash each artifact (SHA-256).
  - If hash exists, create metadata entry but don’t re-upload.
- Tiered storage:
  - Hot: Recent artifacts on fast storage (SSD, S3 standard).
  - Cold: Old artifacts on cheaper storage (S3 Glacier, tape).
- Compression:
  - Compress models before upload (gzip, zstd).
import hashlib
import gzip
def upload_artifact(run_id: str, path: str, local_file: str):
    # Compute hash (for very large files, hash in chunks instead of
    # reading everything into memory)
    with open(local_file, 'rb') as f:
        content = f.read()
    content_hash = hashlib.sha256(content).hexdigest()

    # Check if artifact with this hash exists
    existing = artifact_service.get_by_hash(content_hash)
    if existing:
        # Create metadata entry pointing to existing storage
        artifact_service.link_artifact(run_id, path, existing.storage_uri, content_hash)
        return

    # Compress and upload
    compressed = gzip.compress(content)
    storage_uri = object_store.upload(f"{run_id}/{path}.gz", compressed)

    # Create metadata entry
    artifact_service.create_artifact(
        run_id=run_id,
        path=path,
        size_bytes=len(content),
        content_hash=content_hash,
        storage_uri=storage_uri
    )
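The tiered-storage lever is usually implemented with object-store lifecycle rules rather than application code. A sketch using boto3 against S3 (bucket name, prefix, and day thresholds are illustrative assumptions):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="ml-experiment-artifacts",          # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-old-artifacts",
                "Filter": {"Prefix": "artifacts/"},
                "Status": "Enabled",
                # Hot -> infrequent access -> Glacier as artifacts age
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                # Optionally expire very old checkpoints entirely
                "Expiration": {"Days": 730},
            }
        ]
    },
)

Deduplication complicates expiration slightly: an object should only expire once no run still references its content hash.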
5. Query & Search
Common queries:
- “Show all runs with learning_rate > 0.01 and val_accuracy > 0.9”
- “Find best run in experiment X by val_accuracy”
- “Show runs created in last 7 days by user Y”
Implementation:
# Query API example
runs = client.search_runs(
    experiment_ids=["exp123"],
    filter_string="params.learning_rate > 0.01 AND metrics.val_accuracy > 0.9",
    order_by=["metrics.val_accuracy DESC"],
    max_results=10
)

for run in runs:
    print(f"Run {run.run_id}: LR={run.params['learning_rate']}, Acc={run.metrics['val_accuracy']}")
Optimization:
- Index on common filter fields (user_id, experiment_id, tags, status, timestamps)
- Cache popular queries (top runs, recent runs)
- Use read replicas for heavy read workloads
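One subtlety: because parameters live in a key/value table (run_params), a filter like params.learning_rate > 0.01 has to be compiled into a join or EXISTS subquery rather than a simple WHERE clause. A toy sketch of that translation against the schema above (a real implementation needs a proper parser, type handling, and parameterized queries to avoid SQL injection):

def param_filter_to_sql(key: str, op: str, value: float) -> str:
    # Compile a single "params.<key> <op> <value>" clause into SQL.
    assert op in {">", ">=", "<", "<=", "="}
    return f"""
    SELECT r.run_id
    FROM runs r
    WHERE EXISTS (
        SELECT 1 FROM run_params p
        WHERE p.run_id = r.run_id
          AND p.key = '{key}'
          AND CAST(p.value AS FLOAT) {op} {value}
    )"""

print(param_filter_to_sql("learning_rate", ">", 0.01))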
Scaling Strategies
1. Sharding Experiments/Runs
For very large deployments:
- Shard metadata DB by experiment_id or user_id
- Each shard handles a subset of experiments
- API Gateway routes requests to correct shard
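A minimal sketch of the routing logic, assuming four hypothetical metadata shards; production systems typically use consistent hashing so shards can be added without remapping most keys:

import hashlib

SHARD_DSNS = [
    "postgresql://meta-shard-0/experiments",   # hypothetical connection strings
    "postgresql://meta-shard-1/experiments",
    "postgresql://meta-shard-2/experiments",
    "postgresql://meta-shard-3/experiments",
]

def shard_for_experiment(experiment_id: str) -> str:
    # Stable hash of the experiment ID picks the shard
    digest = hashlib.sha1(experiment_id.encode()).hexdigest()
    return SHARD_DSNS[int(digest, 16) % len(SHARD_DSNS)]

print(shard_for_experiment("exp123"))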
2. Metric Buffering & Batching
High-throughput training jobs can log metrics at high frequency (100s–1000s/sec):
- Client buffers metrics locally
- Batches and sends every N seconds or M metrics
- Server-side ingestion queue (Kafka, SQS) for further buffering
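A sketch of what the client-side buffer can look like inside the SDK; send_batch() stands in for the actual HTTP call, and the thresholds are illustrative:

import time

class BufferedMetricLogger:
    def __init__(self, send_batch, max_points=200, max_age_s=5.0):
        self.send_batch = send_batch
        self.max_points = max_points
        self.max_age_s = max_age_s
        self.buffer = []
        self.last_flush = time.monotonic()

    def log(self, key, value, step):
        self.buffer.append({"key": key, "value": value, "step": step, "ts": time.time()})
        if (len(self.buffer) >= self.max_points
                or time.monotonic() - self.last_flush >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.send_batch(self.buffer)   # one request for many points
            self.buffer = []
        self.last_flush = time.monotonic()

logger = BufferedMetricLogger(send_batch=lambda batch: print(f"sent {len(batch)} points"))
for step in range(1000):
    logger.log("train_loss", 1.0 / (step + 1), step)
logger.flush()

The final flush matters: without it, the tail of a run's metrics can be lost when the training process exits.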
3. Artifact Caching
Frequently accessed artifacts (latest models, popular checkpoints):
- Cache in CDN (CloudFront, Fastly)
- Local cache on training nodes (NVMe SSD)
- Lazy loading: download only when accessed
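A sketch of a lazy, content-addressed cache on a training node; object_store.download() is a placeholder for the real storage client, and the cache path is an assumption:

from pathlib import Path

CACHE_DIR = Path("/mnt/nvme/artifact-cache")   # hypothetical NVMe cache location

def cached_artifact(content_hash: str, storage_uri: str, object_store) -> Path:
    # Return a local path for the artifact, downloading it only on first use
    local_path = CACHE_DIR / content_hash
    if not local_path.exists():
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        tmp_path = local_path.with_suffix(".tmp")
        object_store.download(storage_uri, tmp_path)   # assumed client method
        tmp_path.rename(local_path)                    # atomic publish into the cache
    return local_path

Keying the cache by content hash means deduplicated artifacts are also deduplicated on local disk.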
4. Distributed Artifact Storage
For petabyte-scale artifact storage:
- Use distributed object stores (S3, GCS, Ceph)
- Implement multipart upload for large files
- Use pre-signed URLs for direct client-to-storage uploads (bypass API server)
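A sketch of the pre-signed upload flow with boto3: the API server issues a short-lived URL, and the client PUTs the artifact directly to object storage so large files never pass through the API tier (bucket and key names are illustrative):

import boto3
import requests

s3 = boto3.client("s3")

def issue_upload_url(run_id: str, artifact_path: str, expires_s: int = 3600) -> str:
    # Server side: generate a short-lived URL authorizing a direct PUT
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "ml-experiment-artifacts", "Key": f"{run_id}/{artifact_path}"},
        ExpiresIn=expires_s,
    )

# Client side: upload straight to object storage, bypassing the API server
url = issue_upload_url("run-42", "model.pt")
with open("model.pt", "rb") as f:
    requests.put(url, data=f)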
Monitoring & Observability
Key Metrics
System metrics:
- Request latency (p50, p95, p99)
- Throughput (requests/sec, metrics logged/sec, artifacts uploaded/sec)
- Error rates (4xx, 5xx)
- Storage usage (DB size, object store size)
User metrics:
- Active experiments/runs
- Average metrics logged per run
- Average artifact size
- Query response times
Dashboards:
- Real-time experiment dashboard (running/completed/failed runs)
- System health dashboard (latency, error rates, resource usage)
- Cost dashboard (storage costs, compute costs)
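These system metrics are typically exported from the services themselves; a sketch using prometheus_client to instrument a (hypothetical) metric-logging handler:

from prometheus_client import Counter, Histogram, start_http_server
import time

REQUEST_LATENCY = Histogram("api_request_latency_seconds", "Request latency",
                            labelnames=["endpoint"])
REQUEST_ERRORS = Counter("api_request_errors_total", "Request errors",
                         labelnames=["endpoint", "code"])

def handle_log_metrics(request):
    start = time.monotonic()
    try:
        ...  # validate the payload, enqueue metric points
    except Exception:
        REQUEST_ERRORS.labels(endpoint="log_metrics", code="500").inc()
        raise
    finally:
        REQUEST_LATENCY.labels(endpoint="log_metrics").observe(time.monotonic() - start)

start_http_server(9100)   # expose /metrics for Prometheus to scrape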
Failure Modes & Mitigations
| Failure Mode | Impact | Mitigation |
|---|---|---|
| Metadata DB down | Can’t create/query experiments | Read replicas, automatic failover, local caching |
| Object store unavailable | Can’t upload/download artifacts | Retry with exponential backoff, fallback to local storage |
| Metric ingestion backlog | Delayed metric visibility | Buffering, rate limiting, auto-scaling ingest workers |
| Lost run metadata | Experiment not reproducible | Periodic backups, transaction logs, write-ahead logs |
| Concurrent write conflicts | Metrics/artifacts overwritten | Optimistic locking, append-only logs |
| API rate limit hit | Client blocked | Exponential backoff, client-side buffering, increase limits |
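Several of these mitigations reduce to the same retry-with-exponential-backoff pattern; a minimal sketch, with jitter to avoid synchronized retry storms from many training jobs:

import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay_s=0.5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter
            delay = base_delay_s * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)

# e.g. retry_with_backoff(lambda: object_store.upload(key, data))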
Real-World Case Study: Large-Scale ML Team
Scenario:
- 100+ ML engineers and researchers
- 50K+ experiments, 1M+ runs
- 10 PB of artifacts (models, datasets, checkpoints)
- Multi-cloud (AWS, GCP)
Architecture:
- Metadata: PostgreSQL with read replicas, sharded by experiment_id
- Metrics: InfluxDB cluster, 100K metrics/sec write throughput
- Artifacts: S3 + GCS with cross-region replication
- API: Kubernetes cluster with auto-scaling (10–100 pods)
- Web UI: React SPA, served via CDN
Key optimizations:
- Pre-signed URLs for large artifact uploads (direct to S3/GCS)
- Client-side metric buffering (log every 10 steps, batch send)
- Artifact deduplication (saved ~30% storage cost)
- Tiered storage (hot: S3 Standard, cold: S3 Glacier, ~50% cost reduction)
Outcomes:
- 99.95% uptime
- Median query latency: 120ms
- p99 metric log latency: 8ms
- $200K/year savings from deduplication and tiered storage
Cost Analysis
Example: Medium-Sized Team
Assumptions:
- 10 researchers
- 100 experiments/month, 1000 runs/month
- Average run: 10 GB artifacts, 10K metrics
- Retention: 2 years
| Component | Cost/Month |
|---|---|
| Metadata DB (PostgreSQL RDS, db.r5.large) | $300 |
| Metrics DB (InfluxDB Cloud) | $500 |
| Object storage (S3, 10 TB) | $230 |
| API compute (Kubernetes, 5 nodes) | $750 |
| Data transfer | $100 |
| Total | $1,880 |
Optimization levers:
- Deduplication: -20–30% storage cost
- Tiered storage: -30–50% storage cost (move old artifacts to Glacier)
- Reserved instances: -30% compute cost
- Compression: -50% storage and transfer cost
Advanced Topics
1. Hyperparameter Sweep Integration
Integrate with hyperparameter tuning libraries (Optuna, Ray Tune):
import optuna
import experiment_tracker as et
def objective(trial):
    run = experiment.start_run(name=f"trial_{trial.number}")

    # Suggest hyperparameters
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    run.log_params({"learning_rate": lr, "batch_size": batch_size})

    # Train and log metrics
    val_acc = train_and_evaluate(lr, batch_size, run)
    run.finish()
    return val_acc

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
2. Model Registry Integration
Link experiment tracking with model registry:
- Best run → promoted to staging → production
- Track lineage: model → run → experiment → dataset
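A sketch of that promotion flow, written against the same hypothetical experiment_tracker SDK used earlier; register_model() and transition_stage() are assumed methods, not a documented API:

# Find the best run and register its model in the registry
best = client.search_runs(
    experiment_ids=["exp123"],
    order_by=["metrics.val_accuracy DESC"],
    max_results=1,
)[0]

model_version = client.register_model(
    name="resnet50-imagenet",
    run_id=best.run_id,           # lineage: model -> run -> experiment -> dataset
    artifact_path="model.pt",
)
client.transition_stage(model_version, stage="staging")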
3. Data Versioning
Track data versions alongside experiments:
- Dataset hash (content-based)
- Data pipeline version (Git commit)
- Train/val/test split configs
run.log_dataset(
    name="imagenet_v2",
    hash="sha256:abc123...",
    split_config={"train": 0.8, "val": 0.1, "test": 0.1}
)
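The content-based hash can be computed by hashing files in a stable (sorted) order, so the same data always produces the same fingerprint; a minimal sketch:

import hashlib
from pathlib import Path

def dataset_hash(root: str) -> str:
    # Hash file paths and contents in sorted order for a stable fingerprint.
    # (For large files, read and hash in chunks rather than all at once.)
    digest = hashlib.sha256()
    for file in sorted(Path(root).rglob("*")):
        if file.is_file():
            digest.update(str(file.relative_to(root)).encode())
            digest.update(file.read_bytes())
    return "sha256:" + digest.hexdigest()

# run.log_dataset(name="imagenet_v2", hash=dataset_hash("/data/imagenet_v2"), ...)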
4. Compliance & Audit Trails
For regulated industries (healthcare, finance):
- Immutable experiment logs
- Audit trail for all changes (who, what, when)
- Data lineage tracking
- Access control and encryption
Practical Debugging & Operations Checklist
For Platform Engineers
- Monitor ingestion lag: Metrics should appear in UI within seconds of logging.
- Set up alerts: DB disk space >80%, API error rate >1%, artifact upload failures.
- Test disaster recovery: Can you restore from backups? Time to recover?
- Load test: Can the system handle 10x current load?
For ML Engineers
- Always log hyperparameters: Even “fixed” ones—you’ll want to compare later.
- Use tags liberally: Makes filtering/searching much easier.
- Log environment: Git commit, Docker image, package versions—critical for reproducibility.
- Log artifacts incrementally: Don’t wait until end of training to upload checkpoints.
- Use run names: Descriptive names make comparison easier (resnet50_adam_lr0.001 vs. run_42).
Key Takeaways
✅ Experiment tracking is foundational for MLOps—enables reproducibility, collaboration, and systematic exploration.
✅ Scale requires separation of concerns: metadata, metrics, artifacts each have different storage/query needs.
✅ Deduplication and tiered storage are critical for cost efficiency at scale.
✅ Client-side buffering avoids overwhelming the backend with high-frequency metric logging.
✅ Systematic iteration through experiment spaces mirrors structured traversal patterns (like Spiral Matrix).
✅ Integration with existing tools (Git, Docker, hyperparameter tuning) is key for adoption.
✅ Observability and cost monitoring are as important as core functionality.
Connection to Thematic Link: Systematic Iteration and State Tracking
All three Day 19 topics converge on systematic, stateful exploration:
DSA (Spiral Matrix):
- Layer-by-layer traversal with boundary tracking
- Explicit state management (top, bottom, left, right)
- Resume/pause friendly
ML System Design (Experiment Tracking Systems):
- Systematic exploration of hyperparameter/architecture spaces
- Track state of experiments (running, completed, failed)
- Resume from checkpoints, recover from failures
Speech Tech (Speech Experiment Management):
- Organize speech model experiments across multiple dimensions
- Track model versions, data versions, training configs
- Enable reproducibility and comparison
The unifying pattern: structured iteration through complex spaces, with clear state persistence and recoverability.
Originally published at: arunbaby.com/ml-system-design/0019-experiment-tracking-systems
If you found this helpful, consider sharing it with others who might benefit.