Experiment Tracking Systems
Design robust experiment tracking systems that enable systematic exploration, reproducibility, and collaboration across large ML teams.
TL;DR
Experiment tracking separates metadata (PostgreSQL), time-series metrics (InfluxDB), and large artifacts (S3) into purpose-built services because each has different storage, query, and scale requirements. Content-based deduplication and tiered storage slash artifact costs by 50%. Client-side metric buffering prevents high-frequency logging from overwhelming the backend. The system captures everything needed for reproducibility – hyperparameters, code versions, data versions, environment – and is the foundation that makes A/B testing and model evaluation systematic rather than ad hoc.
Problem Statement
Design an Experiment Tracking System for ML teams that:
- Tracks all experiment metadata: hyperparameters, metrics, code versions, data versions, artifacts
- Supports large scale: Thousands of experiments, millions of runs, petabyte-scale model artifacts
- Enables comparison and visualization: Compare runs, plot learning curves, analyze hyperparameter impact
- Ensures reproducibility: Any experiment can be re-run from tracked metadata
- Integrates with training pipelines: Minimal code changes, automatic logging
- Supports collaboration: Share experiments, notebook integration, API access
Functional Requirements
- Experiment lifecycle management:
- Create experiments and runs
- Log parameters, metrics, tags, notes
- Upload artifacts (models, plots, datasets)
- Track code versions (Git commit, diff)
- Track data versions (dataset hashes, splits)
- Link parent/child runs (hyperparameter sweeps, ensemble members)
- Query and search:
- Filter by parameters, metrics, tags
- Full-text search over notes and descriptions
- Query by date, user, project
- Visualization and comparison:
- Learning curves (metric vs step/epoch)
- Hyperparameter sweeps (parallel coordinates, scatter)
- Compare multiple runs side-by-side
- Export to notebooks (Jupyter, Colab)
- Artifact management:
- Store and version models, checkpoints, plots
- Efficient storage for large artifacts (deduplication, compression)
- Support for streaming logs (real-time metrics)
- Reproducibility:
- Capture full environment (packages, hardware, Docker image)
- Re-run experiments from tracked metadata
- Audit trail for compliance
- Integration:
- Python SDK (PyTorch, TensorFlow, JAX)
- CLI for automation
- REST API for custom clients
- Webhook/notification support
Non-Functional Requirements
- Scalability: Support 10K+ concurrent experiments, 1M+ total runs
- Performance: Log metrics with <10ms latency, query results in <1s
- Reliability: 99.9% uptime, no data loss
- Security: Role-based access control, encryption at rest and in transit
- Cost efficiency: Optimize storage costs for artifacts (tiered storage, compression)
Understanding the Requirements
Why Experiment Tracking Matters
Without systematic tracking, ML teams face:
- Lost experiments: “Which hyperparameters gave us 92% accuracy last month?”
- Wasted compute: Re-running experiments accidentally
- Non-reproducibility: “It worked on my laptop, but we can’t reproduce it”
- Collaboration friction: Hard to share and compare results
A good experiment tracking system is the foundation of MLOps. It enables:
- Systematic exploration of model/data/hyperparameter spaces
- Clear audit trails for model governance
- Faster iteration through better visibility
The Systematic Iteration Connection
Just like Spiral Matrix systematically traverses a 2D structure layer-by-layer:
- Experiment tracking systematically explores multi-dimensional spaces:
- Hyperparameters × architectures × data configurations × training schedules
- Both require clear state management:
- Spiral: track boundaries (top, bottom, left, right)
- Experiments: track runs (completed, running, failed), checkpoints, metrics
- Both enable resumability:
- Spiral: can pause and resume traversal
- Experiments: can restart from checkpoints, resume sweeps
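For reference, a minimal sketch of the boundary-tracking pattern the Spiral Matrix analogy refers to (plain Python, independent of any tracking SDK): the four explicit boundaries are the "state" that makes the traversal easy to pause and resume.
def spiral_order(matrix):
    # Explicit state: four boundaries that shrink as each layer is consumed.
    if not matrix or not matrix[0]:
        return []
    top, bottom = 0, len(matrix) - 1
    left, right = 0, len(matrix[0]) - 1
    out = []
    while top <= bottom and left <= right:
        for c in range(left, right + 1):
            out.append(matrix[top][c])
        top += 1
        for r in range(top, bottom + 1):
            out.append(matrix[r][right])
        right -= 1
        if top <= bottom:
            for c in range(right, left - 1, -1):
                out.append(matrix[bottom][c])
            bottom -= 1
        if left <= right:
            for r in range(bottom, top - 1, -1):
                out.append(matrix[r][left])
            left -= 1
    return out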
High-Level Architecture
┌────────────────────────────────────────────────────────────┐
│                 Experiment Tracking System                 │
└────────────────────────────────────────────────────────────┘
                         Client Layer
         ┌────────────────────────────────────────────┐
         │  Python SDK  │  CLI  │  Web UI  │  API     │
         └─────────────────────┬──────────────────────┘
                               │
                          API Gateway
                ┌──────────────┴──────────────┐
                │ - Auth & rate limiting      │
                │ - Request routing           │
                │ - Logging & monitoring      │
                └──────────────┬──────────────┘
                               │
              ┌────────────────┼────────────────┐
              │                │                │
     ┌────────▼───────┐  ┌─────▼─────┐  ┌───────▼────────┐
     │  Metadata      │  │  Metrics  │  │  Artifact      │
     │  Service       │  │  Service  │  │  Service       │
     │                │  │           │  │                │
     │ - Experiments  │  │ - Logs    │  │ - Models       │
     │ - Runs         │  │ - Curves  │  │ - Plots        │
     │ - Parameters   │  │ - Scalars │  │ - Datasets     │
     │ - Tags         │  │ - Hists   │  │ - Checkpoints  │
     └────────┬───────┘  └─────┬─────┘  └───────┬────────┘
              │                │                │
     ┌────────▼───────┐  ┌─────▼─────┐  ┌───────▼────────┐
     │  Postgres /    │  │ TimeSeries│  │  Object Store  │
     │  MySQL         │  │ DB        │  │  (S3/GCS)      │
     │                │  │ (InfluxDB)│  │                │
     │ - Structured   │  │ - Metrics │  │ - Large files  │
     │   metadata     │  │ - Fast    │  │ - Versioned    │
     └────────────────┘  │   queries │  │ - Deduped      │
                         └───────────┘  └────────────────┘
Key Components
- Metadata Service:
- Stores experiment and run metadata (params, tags, code versions, user, etc.)
- Relational DB for structured queries
- Indexes on common query patterns (user, project, date, tags)
- Metrics Service:
- High-throughput metric logging (train loss, val accuracy, etc.)
- Time-series database (InfluxDB, Prometheus, or custom)
- Support for streaming metrics (real-time plots)
- Artifact Service:
- Stores large files (models, checkpoints, plots, datasets)
- Object storage (S3, GCS, Azure Blob)
- Deduplication (hash-based), compression, tiered storage
- API Gateway:
- Authentication & authorization (OAuth, API keys)
- Rate limiting (per user/project)
- Request routing and load balancing
- Web UI:
- Dashboards for experiments, runs, metrics
- Comparison tools (side-by-side, parallel coordinates)
- Notebook integration (export to Jupyter)
Component Deep-Dive
1. Metadata Schema
Experiments group related runs (e.g., “ResNet ablation study”).
Runs are individual training jobs with:
- Unique run ID
- Parameters (hyperparameters, model config)
- Metrics (logged scalars, step-indexed)
- Tags (labels for filtering)
- Code version (Git commit, diff)
- Data version (dataset hash, split config)
- Environment (Python packages, Docker image, hardware)
- Artifacts (model files, plots, logs)
- Status (running, completed, failed)
- Timestamps (start, end)
Schema example (simplified):
CREATE TABLE experiments (
experiment_id UUID PRIMARY KEY,
name VARCHAR(255),
description TEXT,
created_at TIMESTAMP,
user_id VARCHAR(255),
project_id UUID
);
CREATE TABLE runs (
run_id UUID PRIMARY KEY,
experiment_id UUID REFERENCES experiments(experiment_id),
name VARCHAR(255),
status VARCHAR(50), -- running, completed, failed
start_time TIMESTAMP,
end_time TIMESTAMP,
user_id VARCHAR(255),
git_commit VARCHAR(40),
git_diff TEXT,
docker_image VARCHAR(255),
environment JSONB, -- packages, hardware
notes TEXT
);
CREATE TABLE run_params (
run_id UUID REFERENCES runs(run_id),
key VARCHAR(255),
value TEXT, -- JSON serialized
PRIMARY KEY (run_id, key)
);
CREATE TABLE run_metrics (
run_id UUID REFERENCES runs(run_id),
key VARCHAR(255), -- e.g., 'train_loss', 'val_accuracy'
step INT,
value FLOAT,
timestamp TIMESTAMP,
PRIMARY KEY (run_id, key, step)
);
CREATE TABLE run_tags (
run_id UUID REFERENCES runs(run_id),
key VARCHAR(255),
value VARCHAR(255),
PRIMARY KEY (run_id, key)
);
CREATE TABLE run_artifacts (
artifact_id UUID PRIMARY KEY,
run_id UUID REFERENCES runs(run_id),
path VARCHAR(1024), -- e.g., 'model.pt', 'plots/loss.png'
size_bytes BIGINT,
content_hash VARCHAR(64), -- SHA-256
storage_uri TEXT, -- S3 URI
created_at TIMESTAMP
);
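To make the query patterns concrete, here is a sketch of how this schema supports a common lookup: the best completed run in an experiment by logged val_accuracy. The connection string and experiment UUID are placeholders, assuming psycopg2 against the Postgres instance above.
import psycopg2

conn = psycopg2.connect("dbname=tracking user=tracker")  # placeholder DSN

BEST_RUN_SQL = """
    SELECT r.run_id, r.name, MAX(m.value) AS best_val_accuracy
    FROM runs r
    JOIN run_metrics m ON m.run_id = r.run_id
    WHERE r.experiment_id = %s
      AND m.key = 'val_accuracy'
      AND r.status = 'completed'
    GROUP BY r.run_id, r.name
    ORDER BY best_val_accuracy DESC
    LIMIT 1;
"""

with conn.cursor() as cur:
    cur.execute(BEST_RUN_SQL, ("00000000-0000-0000-0000-000000000000",))
    print(cur.fetchone())  # (run_id, name, best_val_accuracy)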
2. Python SDK (Client Interface)
import experiment_tracker as et
# Initialize client
client = et.Client(api_url="https://tracking.example.com", api_key="...")
# Create experiment
experiment = client.create_experiment(
name="ResNet50 ImageNet Ablation",
description="Testing different optimizers and learning rates"
)
# Start a run
run = experiment.start_run(
name="run_adam_lr0.001",
tags={"optimizer": "adam", "dataset": "imagenet"}
)
# Log parameters
run.log_params({
"model": "resnet50",
"optimizer": "adam",
"learning_rate": 0.001,
"batch_size": 256,
"epochs": 90
})
# Training loop
for epoch in range(90):
train_loss = train_one_epoch(model, optimizer, train_loader)
val_acc = validate(model, val_loader)
# Log metrics
run.log_metrics({
"train_loss": train_loss,
"val_accuracy": val_acc
}, step=epoch)
# Save model
run.log_artifact("model.pt", local_path="./checkpoints/model_epoch90.pt")
# Mark run as complete
run.finish()
3. Metric Logging & Streaming
For real-time metric visualization:
- Clients send metrics via WebSocket or HTTP streaming
- Metrics Service buffers and batches writes to time-series DB
- Web UI subscribes to metric streams for live plots
# Streaming metrics example
def train_with_streaming_metrics(model, run):
for step, batch in enumerate(train_loader):
loss = train_step(model, batch)
# Log every N steps for live tracking
if step % 10 == 0:
run.log_metric("train_loss", loss, step=step)
# Internally: buffered, batched, sent asynchronously
4. Artifact Storage & Deduplication
Challenge: Models can be GBs–TBs. Storing every checkpoint is expensive.
Solution:
- Content-based deduplication:
- Hash each artifact (SHA-256).
- If hash exists, create metadata entry but don’t re-upload.
- Tiered storage:
- Hot: Recent artifacts on fast storage (SSD, S3 standard).
- Cold: Old artifacts on cheaper storage (S3 Glacier, tape).
- Compression:
- Compress models before upload (gzip, zstd).
import hashlib
import gzip
def upload_artifact(run_id: str, path: str, local_file: str):
# Compute hash
with open(local_file, 'rb') as f:
content = f.read()
content_hash = hashlib.sha256(content).hexdigest()
# Check if artifact with this hash exists
existing = artifact_service.get_by_hash(content_hash)
if existing:
# Create metadata entry pointing to existing storage
artifact_service.link_artifact(run_id, path, existing.storage_uri, content_hash)
return
# Compress and upload
compressed = gzip.compress(content)
storage_uri = object_store.upload(f"{run_id}/{path}.gz", compressed)
# Create metadata entry
artifact_service.create_artifact(
run_id=run_id,
path=path,
size_bytes=len(content),
content_hash=content_hash,
storage_uri=storage_uri
)
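The tiered-storage policy can be expressed declaratively rather than in application code. A sketch using an S3 lifecycle rule via boto3 (bucket name, prefix, and the 30-day threshold are assumptions):
import boto3

s3 = boto3.client("s3")

# Move artifacts under the given prefix to Glacier after 30 days (hot -> cold).
s3.put_bucket_lifecycle_configuration(
    Bucket="experiment-artifacts",            # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "artifacts-to-glacier",
                "Filter": {"Prefix": "runs/"},   # placeholder prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)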
5. Query & Search
Common queries:
- “Show all runs with learning_rate > 0.01 and val_accuracy > 0.9”
- “Find best run in experiment X by val_accuracy”
- “Show runs created in last 7 days by user Y”
Implementation:
# Query API example
runs = client.search_runs(
experiment_ids=["exp123"],
filter_string="params.learning_rate > 0.01 AND metrics.val_accuracy > 0.9",
order_by=["metrics.val_accuracy DESC"],
max_results=10
)
for run in runs:
print(f"Run {run.run_id}: LR={run.params['learning_rate']}, Acc={run.metrics['val_accuracy']}")
Optimization:
- Index on common filter fields (user_id, experiment_id, tags, status, timestamps)
- Cache popular queries (top runs, recent runs)
- Use read replicas for heavy read workloads
Scaling Strategies
1. Sharding Experiments/Runs
For very large deployments:
- Shard metadata DB by experiment_id or user_id
- Each shard handles a subset of experiments
- API Gateway routes requests to correct shard
2. Metric Buffering & Batching
High-throughput training jobs can log metrics at high frequency (100s–1000s/sec):
- Client buffers metrics locally
- Batches and sends every N seconds or M metrics
- Server-side ingestion queue (Kafka, SQS) for further buffering
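A minimal sketch of the client-side buffer (the batch size, flush interval, and send_batch transport are assumptions; a real SDK would also flush on run.finish()):
import threading
import time

class MetricBuffer:
    """Buffers metric points and sends them in batches instead of one request each."""

    def __init__(self, send_batch, max_batch=1000, flush_interval_s=5.0):
        self._send_batch = send_batch          # e.g., one HTTP POST per batch
        self._max_batch = max_batch
        self._flush_interval_s = flush_interval_s
        self._buffer = []
        self._last_flush = time.monotonic()
        self._lock = threading.Lock()

    def log(self, key, value, step):
        with self._lock:
            self._buffer.append(
                {"key": key, "value": value, "step": step, "timestamp": time.time()}
            )
            due = (
                len(self._buffer) >= self._max_batch
                or time.monotonic() - self._last_flush >= self._flush_interval_s
            )
        if due:
            self.flush()

    def flush(self):
        with self._lock:
            batch, self._buffer = self._buffer, []
            self._last_flush = time.monotonic()
        if batch:
            self._send_batch(batch)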
3. Artifact Caching
Frequently accessed artifacts (latest models, popular checkpoints):
- Cache in CDN (CloudFront, Fastly)
- Local cache on training nodes (NVMe SSD)
- Lazy loading: download only when accessed
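A sketch of the lazy, content-addressed local cache on a training node (the cache directory and download function are assumptions):
from pathlib import Path

CACHE_DIR = Path("/mnt/nvme/artifact-cache")   # placeholder NVMe mount

def fetch_artifact(content_hash, storage_uri, download_fn):
    """Download an artifact only on first access; later accesses hit the local cache."""
    cached = CACHE_DIR / content_hash
    if cached.exists():
        return cached
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    tmp = cached.with_suffix(".partial")
    download_fn(storage_uri, tmp)              # e.g., an S3 GET; injected by caller
    tmp.rename(cached)                         # atomic publish into the cache
    return cached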
4. Distributed Artifact Storage
For petabyte-scale artifact storage:
- Use distributed object stores (S3, GCS, Ceph)
- Implement multipart upload for large files
- Use pre-signed URLs for direct client-to-storage uploads (bypass API server)
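Pre-signed uploads look roughly like this with boto3 (bucket and key are placeholders); the client then PUTs the file straight to S3, and only the resulting URI passes through the API:
import boto3

s3 = boto3.client("s3")

def presigned_upload_url(bucket: str, key: str, expires_s: int = 3600) -> str:
    """URL a training job can PUT an artifact to directly, bypassing the API server."""
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_s,
    )

# Example: hand this to the SDK so it uploads model.pt without proxying the bytes.
url = presigned_upload_url("experiment-artifacts", "runs/run123/model.pt")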
Monitoring & Observability
Key Metrics
System metrics:
- Request latency (p50, p95, p99)
- Throughput (requests/sec, metrics logged/sec, artifacts uploaded/sec)
- Error rates (4xx, 5xx)
- Storage usage (DB size, object store size)
User metrics:
- Active experiments/runs
- Average metrics logged per run
- Average artifact size
- Query response times
Dashboards:
- Real-time experiment dashboard (running/completed/failed runs)
- System health dashboard (latency, error rates, resource usage)
- Cost dashboard (storage costs, compute costs)
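One way to expose these system metrics is standard Prometheus instrumentation inside the API services; a sketch with prometheus_client (metric names and the ingestion call are illustrative):
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "api_request_latency_seconds", "Request latency by endpoint", ["endpoint"]
)
METRICS_INGESTED = Counter("metric_points_ingested_total", "Metric points written")

def handle_log_metrics(batch):
    # Time the request and count ingested points for the dashboards above.
    with REQUEST_LATENCY.labels(endpoint="/runs/metrics").time():
        written = write_to_timeseries_db(batch)   # placeholder ingestion call
        METRICS_INGESTED.inc(written)

start_http_server(9100)   # exposes /metrics for Prometheus to scrape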
Failure Modes & Mitigations
| Failure Mode | Impact | Mitigation |
|---|---|---|
| Metadata DB down | Can’t create/query experiments | Read replicas, automatic failover, local caching |
| Object store unavailable | Can’t upload/download artifacts | Retry with exponential backoff, fallback to local storage |
| Metric ingestion backlog | Delayed metric visibility | Buffering, rate limiting, auto-scaling ingest workers |
| Lost run metadata | Experiment not reproducible | Periodic backups, transaction logs, write-ahead logs |
| Concurrent write conflicts | Metrics/artifacts overwritten | Optimistic locking, append-only logs |
| API rate limit hit | Client blocked | Exponential backoff, client-side buffering, increase limits |
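Several of the mitigations above reduce to the same client-side pattern; a sketch of retry with exponential backoff and jitter (attempt counts and delays are arbitrary defaults):
import random
import time

def with_retries(fn, max_attempts=5, base_delay_s=0.5):
    """Retry a flaky call (artifact upload, metric batch send) with backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: surface the error
            delay = base_delay_s * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)

# Example: with_retries(lambda: upload_artifact(run_id, "model.pt", "./model.pt"))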
Real-World Case Study: Large-Scale ML Team
Scenario:
- 100+ ML engineers and researchers
- 50K+ experiments, 1M+ runs
- 10 PB of artifacts (models, datasets, checkpoints)
- Multi-cloud (AWS, GCP)
Architecture:
- Metadata: PostgreSQL with read replicas, sharded by experiment_id
- Metrics: InfluxDB cluster, 100K metrics/sec write throughput
- Artifacts: S3 + GCS with cross-region replication
- API: Kubernetes cluster with auto-scaling (10–100 pods)
- Web UI: React SPA, served via CDN
Key optimizations:
- Pre-signed URLs for large artifact uploads (direct to S3/GCS)
- Client-side metric buffering (log every 10 steps, batch send)
- Artifact deduplication (saved ~30% storage cost)
- Tiered storage (hot: S3 Standard, cold: S3 Glacier, ~50% cost reduction)
Outcomes:
- 99.95% uptime
- Median query latency: 120ms
- p99 metric log latency: 8ms
- $200K/year savings from deduplication and tiered storage
Cost Analysis
Example: Medium-Sized Team
Assumptions:
- 10 researchers
- 100 experiments/month, 1000 runs/month
- Average run: 10 GB artifacts, 10K metrics
- Retention: 2 years
| Component | Cost/Month |
|---|---|
| Metadata DB (PostgreSQL RDS, db.r5.large) | $300 |
| Metrics DB (InfluxDB Cloud) | $500 |
| Object storage (S3, 10 TB) | $230 |
| API compute (Kubernetes, 5 nodes) | $750 |
| Data transfer | $100 |
| Total | $1,880 |
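As a sanity check on the storage line item, the arithmetic (assuming roughly $0.023/GB-month for S3 Standard; exact pricing varies by region):
tb_stored = 10
price_per_gb_month = 0.023                        # assumed S3 Standard rate
monthly_cost = tb_stored * 1024 * price_per_gb_month
print(f"${monthly_cost:,.0f}/month")              # ~$236/month, in line with the ~$230 above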
Optimization levers:
- Deduplication: -20–30% storage cost
- Tiered storage: -30–50% storage cost (move old artifacts to Glacier)
- Reserved instances: -30% compute cost
- Compression: -50% storage and transfer cost
Advanced Topics
1. Hyperparameter Sweep Integration
Integrate with hyperparameter tuning libraries (Optuna, Ray Tune):
import optuna
import experiment_tracker as et
def objective(trial):
run = experiment.start_run(name=f"trial_{trial.number}")
# Suggest hyperparameters
lr = trial.suggest_loguniform("learning_rate", 1e-5, 1e-1)
batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
run.log_params({"learning_rate": lr, "batch_size": batch_size})
# Train and log metrics
val_acc = train_and_evaluate(lr, batch_size, run)
run.finish()
return val_acc
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
2. Model Registry Integration
Link experiment tracking with model registry:
- Best run → promoted to staging → production
- Track lineage: model → run → experiment → dataset
3. Data Versioning
Track data versions alongside experiments:
- Dataset hash (content-based)
- Data pipeline version (Git commit)
- Train/val/test split configs
run.log_dataset(
name="imagenet_v2",
hash="sha256:abc123...",
split_config={"train": 0.8, "val": 0.1, "test": 0.1}
)
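A content-based dataset hash can be computed by walking the dataset directory in a stable order; a minimal sketch (chunked reads avoid loading large files into memory):
import hashlib
from pathlib import Path

def dataset_hash(root: str) -> str:
    """Hash relative paths and file contents so any data change yields a new version."""
    h = hashlib.sha256()
    root_path = Path(root)
    for path in sorted(p for p in root_path.rglob("*") if p.is_file()):
        h.update(str(path.relative_to(root_path)).encode())
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
    return f"sha256:{h.hexdigest()}"

# Example: run.log_dataset(name="imagenet_v2", hash=dataset_hash("/data/imagenet_v2"), ...)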
4. Compliance & Audit Trails
For regulated industries (healthcare, finance):
- Immutable experiment logs
- Audit trail for all changes (who, what, when)
- Data lineage tracking
- Access control and encryption
Practical Debugging & Operations Checklist
For Platform Engineers
- Monitor ingestion lag: Metrics should appear in UI within seconds of logging.
- Set up alerts: DB disk space >80%, API error rate >1%, artifact upload failures.
- Test disaster recovery: Can you restore from backups? Time to recover?
- Load test: Can the system handle 10x current load?
For ML Engineers
- Always log hyperparameters: Even “fixed” ones; you’ll want to compare them later.
- Use tags liberally: Makes filtering/searching much easier.
- Log environment: Git commit, Docker image, package versions; all critical for reproducibility.
- Log artifacts incrementally: Don’t wait until end of training to upload checkpoints.
- Use run names: Descriptive names make comparison easier (resnet50_adam_lr0.001 vs run_42).
Key Takeaways
✅ Experiment tracking is foundational for MLOps: it enables reproducibility, collaboration, and systematic exploration.
✅ Scale requires separation of concerns: metadata, metrics, artifacts each have different storage/query needs.
✅ Deduplication and tiered storage are critical for cost efficiency at scale.
✅ Client-side buffering avoids overwhelming the backend with high-frequency metric logging.
✅ Systematic iteration through experiment spaces mirrors structured traversal patterns (like Spiral Matrix).
✅ Integration with existing tools (Git, Docker, hyperparameter tuning) is key for adoption.
✅ Observability and cost monitoring are as important as core functionality.
Connection to Thematic Link: Systematic Iteration and State Tracking
All three topics converge on systematic, stateful exploration:
DSA (Spiral Matrix):
- Layer-by-layer traversal with boundary tracking
- Explicit state management (top, bottom, left, right)
- Resume/pause friendly
ML System Design (Experiment Tracking Systems):
- Systematic exploration of hyperparameter/architecture spaces
- Track state of experiments (running, completed, failed)
- Resume from checkpoints, recover from failures
Speech Tech (Speech Experiment Management):
- Organize speech model experiments across multiple dimensions
- Track model versions, data versions, training configs
- Enable reproducibility and comparison
The unifying pattern: structured iteration through complex spaces, with clear state persistence and recoverability.
FAQ
Why do ML teams need a dedicated experiment tracking system?
Without tracking, teams lose experiment results (“which hyperparameters gave 92% accuracy last month?”), waste compute re-running experiments accidentally, cannot reproduce results across environments, and struggle to compare approaches systematically. A tracking system provides the audit trail, collaboration tools, and systematic exploration capabilities that turn ad-hoc ML development into a reliable engineering practice.
How should you store ML experiment artifacts at scale?
Use object storage (S3/GCS) with content-based deduplication via SHA-256 hashing – if two runs produce identical model files, store only one copy. Apply gzip or zstd compression before upload to reduce transfer and storage costs. Implement tiered storage that automatically moves artifacts older than 30 days to cheaper cold storage like S3 Glacier. This approach typically saves 50% on artifact storage costs.
What metadata should every ML experiment run capture?
At minimum: hyperparameters (even “fixed” ones), training and validation metrics per step, Git commit hash and diff, Python package versions, Docker image tag, dataset hash with train/val/test split config, and hardware details (GPU type, count). This ensures any experiment can be reproduced months later. Add descriptive tags and run names to make filtering and comparison practical.
How do you handle high-frequency metric logging without overwhelming the backend?
Buffer metrics client-side (e.g., in a list) and batch-send every N seconds (typically 5-10) or M metrics (typically 100-1000). The server ingests via a queue (Kafka or SQS) for further buffering before writing to a time-series database optimized for append-heavy workloads like InfluxDB. This transforms thousands of individual writes per second into efficient batch inserts.
Cross-links: A/B Testing Systems | Model Evaluation Metrics | Model Monitoring Systems