Online Learning Systems
Design online learning systems that adapt models in real time using greedy updates, applying the same adaptive decision-making pattern from Jump Game to streaming data.
TL;DR
Online learning systems update models incrementally from streaming data using mini-batch SGD for stability, drift detectors to catch distribution shifts, and model versioning with rollback for safety. The hybrid approach – a batch-trained baseline with online fine-tuning – consistently outperforms either strategy alone. Concept drift detection compares recent error rates to a rolling baseline and triggers adaptation when performance degrades. This system builds on event stream processing for data ingestion and feeds into model serving for real-time predictions.

Problem Statement
Design an Online Learning System that continuously adapts ML models to new data without full retraining, supporting:
- Incremental updates from streaming data
- Real-time adaptation to distribution shifts
- Low-latency inference (<10ms) and updates
- Concept drift detection and handling
- Model versioning with rollback capability
- Scale to millions of updates per day
Functional Requirements
- Streaming data ingestion:
- Ingest labeled samples from event streams (Kafka, Kinesis)
- Buffer and batch micro-updates
- Handle out-of-order arrivals
- Incremental model updates:
- Update model parameters with each new sample (or mini-batch)
- Support various algorithms (SGD, online gradient descent, online random forests)
- Maintain model state across updates
- Inference serving:
- Serve predictions with latest model
- Low latency (<10ms p95)
- High throughput (10K+ QPS)
- Drift detection:
- Monitor distribution shifts
- Detect concept drift (when model performance degrades)
- Alert and trigger adaptation strategies
- Model versioning and rollback:
- Version models by timestamp/update count
- Store checkpoints periodically
- Rollback to previous version if performance degrades
- Evaluation and monitoring:
- Track online metrics (accuracy, loss, calibration)
- A/B test online vs batch models
- Compare to baseline (batch-trained model)
Non-Functional Requirements
- Latency: p95 inference < 10ms, updates < 100ms
- Throughput: 10K+ predictions/sec, 1M+ updates/day
- Availability: 99.9% uptime
- Consistency: Eventually consistent updates across replicas
- Resource efficiency: Minimize memory and compute per update
- Freshness: Model reflects data from last N minutes
Understanding the Requirements
When to Use Online Learning
Good use cases:
- Fast-changing environments: Ad click prediction, fraud detection
- Personalization: User preferences evolve over time
- Limited labeled data: Start with small dataset, improve continuously
- Concept drift: Distribution shifts (seasonal, trends, user behavior changes)
Not ideal when:
- Stable distributions: Offline training is simpler and often better
- Complex models: Deep neural networks are hard to update incrementally
- Large batch requirements: Models need full-batch statistics (e.g., batch normalization)
The Greedy Adaptation Connection
Just like Jump Game greedily extends reach at each position:
| Jump Game | Online Learning | Adaptive Speech |
|---|---|---|
| Track max reachable index | Track model performance | Track model quality (WER) |
| Greedy: extend reach | Greedy: update weights | Greedy: adapt to speaker |
| Each step updates state | Each sample updates model | Each utterance refines model |
| Forward-looking | Predict future distribution | Anticipate corrections |
| Early termination | Early stopping | Fallback triggers |
All three use greedy, adaptive strategies to optimize in dynamic environments.
High-Level Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                      Online Learning System                     │
└─────────────────────────────────────────────────────────────────┘

                        Data Sources
        ┌─────────────────────────────────────────┐
        │ User interactions   │ Feedback loops    │
        │ Event streams       │ Label corrections │
        └────────────────────┬────────────────────┘
                             │
                      ┌──────▼──────┐
                      │    Kafka    │
                      │  (Events +  │
                      │   Labels)   │
                      └──────┬──────┘
                             │
            ┌────────────────┼────────────────┐
            │                │                │
   ┌────────▼───────┐   ┌────▼────┐   ┌───────▼──────┐
   │     Update     │   │ Feature │   │    Drift     │
   │    Service     │   │  Store  │   │   Detector   │
   │                │   │ (Redis) │   │              │
   │ - Batch updates│   └─────────┘   │ - Monitor    │
   │ - Gradient     │                 │   metrics    │
   │   computation  │                 │ - Alert      │
   └────────┬───────┘                 └───────┬──────┘
            │                                 │
            └────────────────┬────────────────┘
                             │
                      ┌──────▼──────┐
                      │    Model    │
                      │    Store    │
                      │             │
                      │ - Current   │
                      │ - Versions  │
                      │ - Checkpts  │
                      └──────┬──────┘
                             │
            ┌────────────────┼────────────────┐
            │                │                │
   ┌────────▼───────┐   ┌────▼────┐   ┌───────▼──────┐
   │   Inference    │   │ Monitor │   │   A/B Test   │
   │    Service     │   │         │   │  Controller  │
   │                │   │ - Loss  │   │              │
   │ - Predictions  │   │ - Acc   │   │ - Traffic    │
   │ - Low latency  │   │ - Drift │   │   split      │
   └────────────────┘   └─────────┘   └──────────────┘
```
Key Components
- Data Ingestion: Stream labeled samples from Kafka
- Feature Store: Low-latency feature lookup (Redis)
- Update Service: Apply incremental updates to model
- Model Store: Versioned model storage with checkpoints
- Inference Service: Serve predictions with latest model
- Drift Detector: Monitor for distribution/concept shifts
- A/B Test Controller: Compare online vs batch models
Component Deep-Dives
1. Incremental Update Algorithms
Online Gradient Descent (for linear models):
```python
import numpy as np
from typing import Dict


class OnlineLinearModel:
    """
    Online linear model with SGD updates.

    Similar to Jump Game:
    - Each update 'extends reach' (improves the model)
    - Greedy: always update towards lower error
    - Track 'max reach' (best performance so far)
    """

    def __init__(
        self,
        n_features: int,
        learning_rate: float = 0.01,
        regularization: float = 0.01
    ):
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.learning_rate = learning_rate
        self.regularization = regularization

        # Metrics
        self.update_count = 0
        self.cumulative_loss = 0.0

    def predict(self, features: np.ndarray) -> float:
        """Make a prediction."""
        return float(np.dot(self.weights, features) + self.bias)

    def update(self, features: np.ndarray, label: float):
        """
        Incremental update with one sample (online SGD).

        Greedy decision: move weights to reduce error on this sample,
        like Jump Game extending max_reach.
        """
        pred = self.predict(features)
        error = label - pred

        # Gradient descent step with L2 regularization
        self.weights += self.learning_rate * error * features
        self.weights -= self.learning_rate * self.regularization * self.weights
        self.bias += self.learning_rate * error

        # Track metrics
        self.update_count += 1
        self.cumulative_loss += error ** 2

    def batch_update(self, features_batch: np.ndarray, labels_batch: np.ndarray):
        """Update with a mini-batch (more stable than single samples)."""
        for features, label in zip(features_batch, labels_batch):
            self.update(features, label)

    def get_state(self) -> Dict:
        """Get model state for checkpointing."""
        return {
            "weights": self.weights.tolist(),
            "bias": float(self.bias),
            "update_count": self.update_count,
            "avg_loss": self.cumulative_loss / max(1, self.update_count)
        }

    def set_state(self, state: Dict):
        """Restore the model from a checkpoint."""
        self.weights = np.array(state["weights"])
        self.bias = state["bias"]
        self.update_count = state["update_count"]
        # Restore loss tracking so metrics stay consistent after rollback
        self.cumulative_loss = state["avg_loss"] * state["update_count"]
```
Online Random Forest (Mondrian Forest):
```python
import numpy as np


class OnlineRandomForest:
    """
    Online random forest using Mondrian trees.
    Supports incremental updates without full retraining.

    Note: MondrianTree is assumed to come from an online-forest library
    (e.g. scikit-garden's MondrianTreeClassifier); it is not defined here.
    """

    def __init__(self, n_trees: int = 10):
        self.n_trees = n_trees
        self.trees = [MondrianTree() for _ in range(n_trees)]

    def update(self, features: np.ndarray, label: int):
        """Update all trees with the new sample."""
        for tree in self.trees:
            tree.update(features, label)

    def predict(self, features: np.ndarray) -> int:
        """Predict by majority vote across trees."""
        predictions = [tree.predict(features) for tree in self.trees]
        return max(set(predictions), key=predictions.count)
```
2. Streaming Data Pipeline
```python
import json
import time
from queue import Queue, Empty
from threading import Thread
from typing import Dict, List

from kafka import KafkaConsumer


class StreamingDataPipeline:
    """Ingest labeled samples from Kafka for online learning."""

    def __init__(
        self,
        kafka_brokers: List[str],
        topic: str,
        batch_size: int = 32,
        batch_timeout_sec: float = 1.0
    ):
        self.consumer = KafkaConsumer(
            topic,
            bootstrap_servers=kafka_brokers,
            value_deserializer=lambda m: json.loads(m.decode('utf-8'))
        )
        self.batch_size = batch_size
        self.batch_timeout_sec = batch_timeout_sec
        self.sample_queue = Queue(maxsize=10000)
        self.running = False

    def start(self):
        """Start consuming from Kafka in a background thread."""
        self.running = True
        Thread(target=self._consume_loop, daemon=True).start()

    def _consume_loop(self):
        """Consume samples from Kafka and queue them."""
        for message in self.consumer:
            if not self.running:
                break
            self.sample_queue.put(message.value)

    def get_batch(self) -> List[Dict]:
        """Collect a batch of samples, or whatever arrived before the timeout."""
        batch = []
        deadline = time.time() + self.batch_timeout_sec
        while len(batch) < self.batch_size:
            remaining = deadline - time.time()
            if remaining <= 0:
                break
            try:
                # Block until a sample arrives or the deadline passes
                # (avoids busy-waiting on an empty queue)
                batch.append(self.sample_queue.get(timeout=remaining))
            except Empty:
                break
        return batch
```
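The pipeline above ignores the out-of-order arrivals called out in the requirements. One way to handle slightly late samples is an event-time buffer that releases samples only once a watermark (the latest event time seen, minus an allowed lateness) has passed them. A minimal sketch, assuming samples carry an event timestamp; the class name and `allowed_lateness_sec` parameter are illustrative, and real stream processors (Flink, Beam) provide far richer watermark semantics:

```python
import heapq


class OrderedReplayBuffer:
    """Reorder slightly out-of-order samples using event-time watermarks.

    Samples are held in a min-heap keyed by event time and released only
    once the watermark (max event time seen minus the allowed lateness)
    has passed them.
    """

    def __init__(self, allowed_lateness_sec: float = 5.0):
        self.allowed_lateness_sec = allowed_lateness_sec
        self._heap = []  # (event_time, insertion_counter, sample)
        self._counter = 0  # tie-breaker so samples never compare directly
        self._max_event_time = float('-inf')

    def add(self, event_time: float, sample: dict):
        heapq.heappush(self._heap, (event_time, self._counter, sample))
        self._counter += 1
        self._max_event_time = max(self._max_event_time, event_time)

    def release(self) -> list:
        """Pop all samples older than the current watermark, in event-time order."""
        watermark = self._max_event_time - self.allowed_lateness_sec
        ready = []
        while self._heap and self._heap[0][0] <= watermark:
            ready.append(heapq.heappop(self._heap)[2])
        return ready
```

Anything later than the allowed lateness is simply delivered out of order (or could be dropped and counted), which is the usual trade-off between completeness and freshness.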
3. Concept Drift Detection
```python
from collections import deque

import numpy as np


class DriftDetector:
    """
    Detect concept drift via performance monitoring.

    Similar to Jump Game checking if we're 'stuck':
    - Monitor whether model performance is degrading
    - Trigger adaptation/retraining if drift is detected
    """

    def __init__(self, window_size: int = 1000, threshold: float = 0.1):
        self.window_size = window_size
        self.threshold = threshold

        # Recent performance
        self.recent_errors = deque(maxlen=window_size)
        self.baseline_error = None

    def update(self, prediction: float, label: float):
        """Record the error of a new prediction."""
        self.recent_errors.append(abs(prediction - label))

        # Establish the baseline from the first full window
        if self.baseline_error is None and len(self.recent_errors) == self.window_size:
            self.baseline_error = np.mean(self.recent_errors)

    def detect_drift(self) -> bool:
        """Return True if concept drift has occurred, False otherwise."""
        if self.baseline_error is None or len(self.recent_errors) < self.window_size:
            return False

        current_error = np.mean(self.recent_errors)
        # Drift if error increased significantly over the baseline
        return current_error > self.baseline_error * (1 + self.threshold)

    def reset_baseline(self):
        """Reset the baseline after handling drift."""
        if self.recent_errors:
            self.baseline_error = np.mean(self.recent_errors)
```
4. Model Versioning
```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ModelVersion:
    """Metadata for a model version."""
    version_id: str
    timestamp: float
    update_count: int
    performance_metrics: Dict
    model_state: Dict


class ModelVersionManager:
    """
    Manage model versions for online learning.

    Features:
    - Periodic checkpoints
    - Performance-based versioning
    - Rollback capability
    """

    def __init__(
        self,
        checkpoint_interval: int = 10000,  # updates between checkpoints
        max_versions: int = 10
    ):
        self.checkpoint_interval = checkpoint_interval
        self.max_versions = max_versions
        self.versions: List[ModelVersion] = []
        self.current_version: Optional[ModelVersion] = None

    def should_checkpoint(self, update_count: int) -> bool:
        """Check whether it is time to create a checkpoint."""
        return update_count % self.checkpoint_interval == 0

    def create_checkpoint(
        self,
        model_state: Dict,
        update_count: int,
        metrics: Dict
    ) -> str:
        """Create a new model checkpoint and return its version ID."""
        version_id = f"v_{update_count}_{int(time.time())}"
        version = ModelVersion(
            version_id=version_id,
            timestamp=time.time(),
            update_count=update_count,
            performance_metrics=metrics,
            model_state=model_state
        )
        self.versions.append(version)
        self.current_version = version

        # Keep only the most recent versions
        if len(self.versions) > self.max_versions:
            self.versions.pop(0)

        return version_id

    def rollback(self, version_id: Optional[str] = None) -> Optional[Dict]:
        """
        Roll back to a previous version.

        Args:
            version_id: Specific version to roll back to, or None for the previous one.

        Returns:
            Model state of the target version, or None if unavailable.
        """
        if not self.versions:
            return None

        if version_id is None:
            # Roll back to the previous version
            if len(self.versions) < 2:
                return None
            target = self.versions[-2]
        else:
            # Find the specific version
            target = next((v for v in self.versions if v.version_id == version_id), None)
            if not target:
                return None

        self.current_version = target
        return target.model_state
```
Data Flow
Online Learning Pipeline
```
1. Labeled sample arrives (via Kafka)
   └─> Feature extraction/enrichment
   └─> Add to update buffer

2. Update Service (every N samples or T seconds)
   └─> Get batch from buffer
   └─> Compute gradients/updates
   └─> Apply updates to model
   └─> Update model version

3. Inference Request
   └─> Load latest model version
   └─> Extract features
   └─> Make prediction
   └─> Log prediction for feedback loop

4. Feedback Loop
   └─> Collect true labels (delayed or real-time)
   └─> Send to Kafka as new training samples
   └─> Monitor drift

5. Drift Detection (continuous)
   └─> Compare recent performance to baseline
   └─> If drift detected: alert, increase update frequency, or trigger retraining
```
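Steps 1 and 2 of the pipeline can be wired together with a small driver loop. A sketch assuming objects shaped like the `StreamingDataPipeline` and `OnlineLearningSystem` classes elsewhere in this post; the `max_batches` argument is added here only so the loop can be stopped (a real service would run it indefinitely):

```python
import time


def run_update_loop(pipeline, system, poll_interval_sec: float = 0.1, max_batches=None):
    """Drive steps 1-2: pull micro-batches from the pipeline and apply updates.

    `pipeline` must expose start() and get_batch() -> list of
    {'features': ..., 'label': ...} dicts; `system` must expose
    update(features, label).
    """
    pipeline.start()
    processed = 0
    while max_batches is None or processed < max_batches:
        batch = pipeline.get_batch()
        if not batch:
            # Nothing arrived before the timeout; back off briefly
            time.sleep(poll_interval_sec)
            continue
        for sample in batch:
            system.update(sample['features'], sample['label'])
        processed += 1
```

In production this loop would run in its own process, with the inference service reading model state independently.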
Scaling Strategies
Horizontal Scaling - Distributed Online Learning
```python
import threading
from typing import Dict, List

import numpy as np
import ray


@ray.remote
class DistributedOnlineLearner:
    """Distributed online learner actor (parameter server pattern)."""

    def __init__(self, n_features: int):
        self.model = OnlineLinearModel(n_features)
        self.lock = threading.Lock()

    def update(self, features: np.ndarray, label: float):
        """Thread-safe update."""
        with self.lock:
            self.model.update(features, label)

    def get_weights(self) -> np.ndarray:
        """Get current weights."""
        with self.lock:
            return self.model.weights.copy()

    def predict(self, features: np.ndarray) -> float:
        """Make a prediction."""
        return self.model.predict(features)


class ParameterServerSystem:
    """
    Parameter server for distributed online learning.

    Workers:
    - Process incoming samples
    - Compute gradients
    - Send updates to the parameter server

    Parameter server:
    - Aggregates updates from workers
    - Maintains global model state
    - Serves the latest model for inference
    """

    def __init__(self, n_features: int, n_workers: int = 4):
        # Create parameter server
        self.param_server = DistributedOnlineLearner.remote(n_features)

        # Create workers
        self.workers = [
            DistributedOnlineLearner.remote(n_features)
            for _ in range(n_workers)
        ]
        self.n_workers = n_workers

    def update_distributed(self, samples: List[Dict]):
        """Distribute samples to workers for parallel updates."""
        chunk_size = max(1, len(samples) // self.n_workers)
        futures = []
        for i, worker in enumerate(self.workers):
            start = i * chunk_size
            end = start + chunk_size if i < self.n_workers - 1 else len(samples)

            # Each worker computes local updates on its shard
            for sample in samples[start:end]:
                futures.append(
                    worker.update.remote(sample['features'], sample['label'])
                )

        # Wait for all local updates to finish
        ray.get(futures)

        # Syncing worker weights back into the parameter server is omitted
        # here; in production, use all-reduce or an async parameter server.
```
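One simple way to fill in the sync step the sketch above leaves out is periodic parameter averaging: every K batches, average the workers' weights and broadcast the result back. A minimal NumPy-only sketch (no Ray), assuming each model exposes `weights` (array) and `bias` (float) attributes like `OnlineLinearModel`; this is a stand-in for all-reduce, not a production synchronization protocol:

```python
import numpy as np


def average_and_sync(models):
    """Average weights and bias across worker models, then broadcast back.

    After the call, every model holds the element-wise mean of all
    workers' parameters, so the workers start the next round in sync.
    """
    avg_weights = np.mean([m.weights for m in models], axis=0)
    avg_bias = float(np.mean([m.bias for m in models]))
    for m in models:
        m.weights = avg_weights.copy()
        m.bias = avg_bias
    return avg_weights, avg_bias
```

Averaging every K batches trades a little staleness for much less communication than syncing on every sample.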
Model Update Strategies
1. Single-sample updates (pure online):
```python
for sample in stream:
    model.update(sample['features'], sample['label'])
```
- Pros: Fastest adaptation
- Cons: Noisy updates, unstable
2. Mini-batch updates:
```python
batch = []
for sample in stream:
    batch.append(sample)
    if len(batch) >= batch_size:
        # Split dict samples into arrays matching batch_update's signature
        features = np.array([s['features'] for s in batch])
        labels = np.array([s['label'] for s in batch])
        model.batch_update(features, labels)
        batch = []
```
- Pros: More stable, better GPU utilization
- Cons: Slightly delayed adaptation
3. Timed updates:
```python
last_update = time.time()
buffer = []
for sample in stream:
    buffer.append(sample)
    if time.time() - last_update > update_interval_sec:
        features = np.array([s['features'] for s in buffer])
        labels = np.array([s['label'] for s in buffer])
        model.batch_update(features, labels)
        buffer = []
        last_update = time.time()
```
- Pros: Predictable update schedule
- Cons: Variable batch sizes
Implementation: Complete System
```python
import logging
import time
from typing import Dict, List, Optional

import numpy as np


class OnlineLearningSystem:
    """
    Complete online learning system.

    Features:
    - Streaming data ingestion
    - Incremental model updates
    - Drift detection
    - Model versioning
    - Inference serving
    """

    def __init__(
        self,
        n_features: int,
        learning_rate: float = 0.01,
        batch_size: int = 32,
        checkpoint_interval: int = 10000
    ):
        # Core model
        self.model = OnlineLinearModel(
            n_features=n_features,
            learning_rate=learning_rate
        )

        # Components
        self.drift_detector = DriftDetector(window_size=1000)
        self.version_manager = ModelVersionManager(
            checkpoint_interval=checkpoint_interval
        )

        # Update buffer
        self.batch_size = batch_size
        self.update_buffer: List[Dict] = []

        self.logger = logging.getLogger(__name__)

        # Metrics
        self.total_updates = 0
        self.total_predictions = 0
        self.drift_events = 0

    def predict(self, features: np.ndarray) -> float:
        """Make a prediction with the current model."""
        self.total_predictions += 1
        return self.model.predict(features)

    def update(self, features: np.ndarray, label: float):
        """Queue a labeled sample (from the feedback loop) for a model update."""
        self.update_buffer.append({
            "features": features,
            "label": label
        })

        # Apply updates when a full batch is ready
        if len(self.update_buffer) >= self.batch_size:
            self._apply_updates()

    def _apply_updates(self):
        """Apply buffered updates to the model."""
        batch = self.update_buffer
        self.update_buffer = []

        for sample in batch:
            self.model.update(sample['features'], sample['label'])
            self.total_updates += 1

            # Feed the post-update prediction to the drift detector
            pred = self.model.predict(sample['features'])
            self.drift_detector.update(pred, sample['label'])

        # Check for drift
        if self.drift_detector.detect_drift():
            self.logger.warning("Concept drift detected!")
            self._handle_drift()

        # Checkpoint if needed
        if self.version_manager.should_checkpoint(self.total_updates):
            self._create_checkpoint()

    def _handle_drift(self):
        """
        Handle concept drift.

        Strategies:
        1. Increase learning rate temporarily
        2. Reset model (if severe drift)
        3. Trigger full retraining
        4. Alert monitoring system
        """
        self.drift_events += 1

        # Simple strategy: reset the drift detector baseline
        self.drift_detector.reset_baseline()

        # Could also:
        # - Increase learning rate
        # - Trigger alert/page
        # - Request full retrain from the batch system
        self.logger.info(f"Handled drift event #{self.drift_events}")

    def _create_checkpoint(self):
        """Create a model checkpoint."""
        state = self.model.get_state()
        metrics = {
            "avg_loss": state["avg_loss"],
            "total_updates": self.total_updates,
            "total_predictions": self.total_predictions
        }
        version_id = self.version_manager.create_checkpoint(
            model_state=state,
            update_count=self.total_updates,
            metrics=metrics
        )
        self.logger.info(f"Created checkpoint: {version_id}")

    def rollback(self, version_id: Optional[str] = None):
        """Roll back to a previous model version."""
        state = self.version_manager.rollback(version_id)
        if state:
            self.model.set_state(state)
            self.logger.info(f"Rolled back to version {version_id}")
        else:
            self.logger.error("Rollback failed")

    def get_metrics(self) -> Dict:
        """Get system metrics."""
        return {
            "total_updates": self.total_updates,
            "total_predictions": self.total_predictions,
            "drift_events": self.drift_events,
            "current_version": (
                self.version_manager.current_version.version_id
                if self.version_manager.current_version
                else None
            ),
            "model_performance": self.model.get_state()["avg_loss"]
        }


# Example usage
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    system = OnlineLearningSystem(
        n_features=10,
        learning_rate=0.01,
        batch_size=32
    )

    # Simulate streaming data: labels follow a known linear function plus noise
    for i in range(1000):
        features = np.random.randn(10)
        label = np.dot([0.5] * 10, features) + np.random.randn() * 0.1

        # Make prediction
        pred = system.predict(features)

        # Update the model (in production, labels arrive later via the feedback loop)
        system.update(features, label)

    print(f"\nSystem metrics: {system.get_metrics()}")
```
Monitoring & Metrics
Key Metrics to Track
Model Performance:
- Online loss/error (moving average)
- Online accuracy (for classification)
- Prediction drift (distribution shift)
System Performance:
- Update latency (time to apply update)
- Inference latency (time to predict)
- Throughput (updates/sec, predictions/sec)
- Buffer size (samples waiting for update)
Data Quality:
- Label delay (time from prediction to label)
- Sample arrival rate
- Feature distribution shifts
Alerts
- Concept drift detected (performance degradation >10%)
- Update latency >100ms (system overloaded)
- Buffer overflow (can’t keep up with data rate)
- Model performance below baseline (trigger rollback)
Failure Modes
| Failure Mode | Impact | Mitigation |
|---|---|---|
| Label delay | Can’t update model | Use semi-supervised or unsupervised proxies |
| Data quality issues | Model learns garbage | Input validation, outlier detection |
| Concept drift | Model performance degrades | Drift detection, adaptive learning rate |
| Update lag | Model falls behind | Increase update frequency, add workers |
| Catastrophic forgetting | Model forgets old patterns | Regularization, rehearsal buffers |
| Model instability | Oscillating performance | Decrease learning rate, use momentum |
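The rehearsal-buffer mitigation in the table can be sketched as reservoir sampling over past data: keep a uniform sample of everything seen so far in fixed memory, and replay a few stored samples alongside each new update so old patterns are not overwritten. A minimal sketch; the class name and replay size are illustrative choices, not from a specific library:

```python
import random


class RehearsalBuffer:
    """Reservoir sample of past data, replayed to reduce catastrophic forgetting.

    Reservoir sampling keeps a uniform random sample of the full stream
    in O(capacity) memory, no matter how many samples have been seen.
    """

    def __init__(self, capacity: int = 1000, seed: int = 0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample: dict):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(sample)
        else:
            # Replace a random slot with probability capacity / seen
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = sample

    def replay(self, k: int = 4) -> list:
        """Draw up to k stored samples to mix into the next update."""
        return self.rng.sample(self.buffer, min(k, len(self.buffer)))
```

A training step would then call the model's update on the new sample plus each replayed one, trading a few extra gradient steps for retention of older patterns.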
Real-World Case Study: Netflix Recommendation
Netflix’s Online Learning Approach
Netflix uses online learning for:
- Real-time recommendation updates
- A/B test metric computation
- Personalization based on recent viewing
Architecture:
- Event stream: User interactions (plays, pauses, ratings) → Kafka
- Feature computation: Real-time feature updates (watch history, preferences)
- Model updates: Incremental updates every few minutes
- Inference: Serve recommendations with latest model
- Evaluation: Compare online vs batch models via A/B tests
Results:
- <100ms model update latency
- Updates every 5 minutes (vs daily batch retraining)
- +5% engagement from real-time personalization
- Faster adaptation to trending content
Key Lessons
- Hybrid approach works best: Batch model as baseline, online updates for fine-tuning
- Drift detection is critical: Monitor and handle distribution shifts
- Checkpointing enables rollback: When online updates degrade performance
- A/B testing validates: Always compare online vs batch models
- Greedy updates can be unstable: Use regularization and momentum
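The last lesson (regularization and momentum to tame greedy updates) amounts to a small change to the SGD update rule shown earlier: accumulate a velocity that averages recent gradients, so a single outlier sample moves the weights less. A sketch; the momentum coefficient of 0.9 is a common default, not a prescription:

```python
import numpy as np


class MomentumOnlineModel:
    """Online linear model whose SGD step uses momentum to damp noisy updates."""

    def __init__(self, n_features: int, learning_rate: float = 0.01,
                 momentum: float = 0.9, regularization: float = 0.01):
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.velocity_w = np.zeros(n_features)
        self.velocity_b = 0.0
        self.lr = learning_rate
        self.momentum = momentum
        self.reg = regularization

    def predict(self, features: np.ndarray) -> float:
        return float(self.weights @ features + self.bias)

    def update(self, features: np.ndarray, label: float):
        error = label - self.predict(features)

        # Gradient of squared error plus L2 penalty (up to constant factors)
        grad_w = -error * features + self.reg * self.weights
        grad_b = -error

        # Momentum: blend the previous velocity with the new gradient,
        # so one noisy sample cannot swing the parameters far
        self.velocity_w = self.momentum * self.velocity_w - self.lr * grad_w
        self.velocity_b = self.momentum * self.velocity_b - self.lr * grad_b

        self.weights += self.velocity_w
        self.bias += self.velocity_b
```

The trade-off: momentum smooths out noise but also slows the reaction to genuine drift, so the coefficient is worth tuning against the drift detector's alerts.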
Cost Analysis
Infrastructure Costs (1M updates/day)
| Component | Resources | Cost/Month | Notes |
|---|---|---|---|
| Kafka cluster | 3 brokers | $450 | Event streaming |
| Update service | 5 instances | $500 | Apply model updates |
| Inference service | 10 instances | $1,000 | Serve predictions |
| Redis (feature store) | 1 instance | $200 | Fast feature lookup |
| Model storage | S3, 100 GB | $3 | Versioned models |
| Monitoring | Prometheus+Grafana | $100 | Metrics & alerts |
| **Total** | | **$2,253/month** | ~$0.07 per 1K updates |
Optimization strategies:
- Batch updates: Reduce update service cost by 50%
- Shared inference cache: Reduce duplicate predictions
- Model compression: Smaller models → faster updates
- Spot instances: 70% cost reduction for update workers
Key Takeaways
✅ Online learning enables real-time adaptation to new data without full retraining
✅ Greedy updates (like Jump Game’s greedy reach extension) work well for many models
✅ Drift detection is critical to maintain model quality over time
✅ Hybrid approach (batch baseline + online fine-tuning) often performs best
✅ Model versioning and rollback protect against bad updates
✅ Mini-batch updates balance stability and adaptation speed
✅ Monitoring and A/B testing validate that online learning improves over batch
✅ Linear models work well, but deep learning requires careful design (see adaptive speech models)
✅ Cost vs freshness trade-off - more frequent updates cost more but improve relevance
✅ Same greedy pattern as Jump Game - make locally optimal decisions, adapt forward
Connection to Thematic Link: Greedy Decisions and Adaptive Strategies
All three topics use greedy, adaptive optimization:
DSA (Jump Game):
- Greedy: extend max reach at each position
- Adaptive: update strategy based on current state
- Forward-looking: anticipate future reachability
ML System Design (Online Learning Systems):
- Greedy: update model with each new sample
- Adaptive: adjust to distribution shifts via incremental learning
- Forward-looking: drift detection predicts future performance
Speech Tech (Adaptive Speech Models):
- Greedy: adapt to speaker/noise in real-time
- Adaptive: fine-tune acoustic model based on recent utterances
- Forward-looking: anticipate user corrections and adapt preemptively
The unifying principle: make greedy, locally optimal decisions while continuously adapting to new information, essential for systems operating in dynamic, uncertain environments.
FAQ
When should you use online learning instead of batch retraining?
Use online learning when the data distribution changes frequently (ad click prediction, fraud detection, trending content), when you need real-time personalization that reflects recent user behavior, or when labeled data arrives continuously from feedback loops. Stick with batch retraining for stable distributions, models requiring full-batch statistics (like batch normalization), or when the complexity of incremental updates outweighs the freshness benefit.
How do you detect concept drift in a production ML system?
Monitor the model’s error rate over a sliding window (typically 1000 recent predictions) and compare it to a baseline error rate established during stable performance. If recent error exceeds the baseline by a configurable threshold (commonly 10%), trigger drift handling. More sophisticated methods include ADWIN (adaptive windowing), Page-Hinkley test, or monitoring feature distribution shifts before they impact model performance.
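The Page-Hinkley test mentioned above fits in a few lines: it tracks the cumulative deviation of the error from its running mean (minus a small tolerance) and flags drift when that cumulative sum rises a threshold above its historical minimum. A sketch; the `delta` and `threshold` defaults here are typical starting points, not prescriptions:

```python
class PageHinkley:
    """Page-Hinkley test for detecting an increase in a stream's mean (e.g. error)."""

    def __init__(self, delta: float = 0.005, threshold: float = 50.0):
        self.delta = delta          # tolerance: ignore deviations smaller than this
        self.threshold = threshold  # how far above the minimum triggers drift
        self.n = 0
        self.mean = 0.0
        self.cum_sum = 0.0
        self.min_cum_sum = 0.0

    def update(self, value: float) -> bool:
        """Feed one observation; return True if drift is detected."""
        self.n += 1
        # Incremental running mean
        self.mean += (value - self.mean) / self.n
        # Cumulative deviation from the mean, biased down by delta
        self.cum_sum += value - self.mean - self.delta
        self.min_cum_sum = min(self.min_cum_sum, self.cum_sum)
        return self.cum_sum - self.min_cum_sum > self.threshold
```

Compared to the sliding-window detector above, Page-Hinkley needs O(1) memory and reacts to sustained shifts rather than window averages.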
What is the hybrid approach to online learning?
Train a strong batch model on all available historical data as the baseline, then apply lightweight online updates to fine-tune it with streaming data. The batch model provides stability and captures long-term patterns, while online updates provide freshness and adaptation to recent shifts. Netflix uses this approach: batch models retrained daily with online updates every 5 minutes, achieving 5% higher engagement than either approach alone.
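The hybrid pattern can be sketched as a frozen batch model plus a lightweight online residual corrector: the online part is trained only on the baseline's errors, so it just has to learn the recent shift. A simplified illustration with `OnlineLinearModel`-style SGD; the class name and the residual formulation are one possible design, not Netflix's actual implementation:

```python
import numpy as np


class HybridModel:
    """Frozen batch baseline plus an online residual corrector.

    The batch model captures long-term patterns and is never updated
    online; the residual model is trained online on the hybrid's errors,
    so it only has to learn recent drift.
    """

    def __init__(self, batch_predict, n_features: int, learning_rate: float = 0.01):
        self.batch_predict = batch_predict  # frozen baseline: features -> float
        self.residual_w = np.zeros(n_features)
        self.residual_b = 0.0
        self.lr = learning_rate

    def predict(self, features: np.ndarray) -> float:
        return (self.batch_predict(features)
                + float(self.residual_w @ features) + self.residual_b)

    def update(self, features: np.ndarray, label: float):
        # Online SGD on the residual: label minus the full hybrid prediction
        error = label - self.predict(features)
        self.residual_w += self.lr * error * features
        self.residual_b += self.lr * error
```

Resetting the residual to zero whenever the batch model is retrained gives a clean hand-off between the two update cadences.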
How do you protect against bad online model updates?
Checkpoint the model at regular intervals (e.g., every 10K updates), monitor online metrics against the batch baseline in real-time, and automatically roll back to the last good checkpoint if performance degrades beyond a threshold. A/B test between online-updated and batch-only models to validate that online learning is genuinely improving outcomes. Use regularization and momentum in the update rule to prevent individual samples from causing large parameter swings.
Cross-links: Event Stream Processing | Model Serving Architecture | Model Monitoring Systems
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch