
"A model in a Jupyter Notebook is a laboratory curiosity. A model in production is a liability until it is governed by a rigorous operations framework."

TL;DR

Production ML is 95% infrastructure and 5% model code. The seven pillars of MLOps are: data pipelines, feature stores (solving training-serving skew), training pipelines (reproducibility), model registries (governance), evaluation gates (robustness testing), inference serving (auto-scaling on GPU metrics), and observability (drift detection, calibration monitoring). LLMOps adds prompt versioning, RAG pipeline synchronization, and safety guardrails. The critical pattern is CI/CD/CT: Continuous Training automatically retrains models when data drift or concept drift is detected. For the inference optimization that complements these pipelines, see the ML inference techniques guide and the chatbot system design for a full end-to-end example.

An industrial control room with SCADA screens showing pipeline status

1. Introduction: The Reality Gap in Modern AI

In the halls of academia and on the leaderboards of Kaggle competitions, success is measured by a static performance metric on a fixed dataset. Accuracy, F1-score, or BLEU: these are the currencies of research. But at the scale of global tech giants, the model itself is often the easiest part of the system.

The true challenge lies in the “Hidden Technical Debt in Machine Learning Systems.” As highlighted in the seminal Google paper, the ML code is only a tiny fraction of the overall ecosystem. The rest, the “pipes”, is what determines whether a model generates value or causes a production incident that wakes up an entire engineering organization at 3 AM.

This is the domain of MLOps (Machine Learning Operations) and its newer, more complex sibling, LLMOps (Large Language Model Operations). It is the fusion of DevOps principles with the unique, stochastic, and data-dependent nature of AI. Unlike traditional software, where code is deterministic, ML systems fail in silent, non-linear ways. A model can be “running” (HTTP 200) but be “broken” (returning garbage predictions due to data drift).

In this deep dive, we will architect the infrastructure that allows you to deploy, monitor, and scale models with the same confidence that Google deploys Search or Meta deploys the News Feed.


2. Problem Statement & Operational Requirements

Building a production ML system is an exercise in balancing the “Golden Path” of data with the “Dark Path” of failures.

2.1 Functional Requirements

  • Reproducible Pipelines: Every model variant must be traceable back to the exact code version, dataset snapshot, and hyperparameter set used to create it.
  • Automated Continuous Retraining: The system must detect when model performance decays and autonomously trigger a retraining job with fresh data.
  • Multi-Stage Inference: Support for complex inference patterns, A/B testing, Shadow Mode, Canary rollouts, and multi-model ensembles.
  • Observability & Attribution: For every prediction, we must be able to explain “Why?” (Explainability) and “How much did it cost?” (Attribution).
  • LLM Specifics: Handling prompt versioning, RAG (Retrieval-Augmented Generation) sync, and RLHF (Reinforcement Learning from Human Feedback) feedback loops.

2.2 Non-Functional Requirements

  • 99.99% Availability: Inference services must be as resilient as core infrastructure.
  • Sub-100ms Latency: Real-time applications (search, ads, fraud) cannot wait for slow models.
  • Security & Compliance: PII (Personally Identifiable Information) masking, bias detection, and SOC2/GDPR compliance across the data lifecycle.
  • Cost Efficiency: Managing the massive GPU and storage costs of modern AI models through auto-scaling and spot-instance utilization.

3. High-Level Architecture: The Seven Pillars of MLOps

A production MLOps system is not a single tool; it is a coordinated dance between seven distinct pipelines.

[ Data Source ] -> [ Data Pipeline ] -> [ Feature Store ]
                                               |
                                               v
[ CI/CD ] -> [ Training Pipeline ] -> [ Model Registry ]
                                               |
                                               v
[ Observability ] <- [ Inference Pipeline ] <- [ Evaluation Pipeline ]

3.1 Pillar 1: The Data & Feature Pipeline (The Foundation)

Data is the oxygen of ML systems. If the data is poisoned or stale, no amount of hyperparameter tuning will save you.

The Feature Store: Online vs. Offline

The biggest challenge in MLOps is the Training-Serving Skew. This happens when the code used to calculate a feature during training (in a Python/Spark environment) differs from the code used in production (often in a high-speed C++/Java environment).

  • The Solution: A centralized Feature Store (like Tecton, Feast, or AWS Feature Store).
  • Offline Store: Stores large volumes of historical features (Parquet files in S3/HDFS). Used for training models and batch inference. It supports complex joins and time-travel.
  • Online Store: A low-latency key-value store (Redis, Aerospike) that stores only the latest version of each feature. When a request comes in for “user_123”, the inference engine pulls the pre-calculated features from the Online Store in < 10ms.
  • Point-in-Time Joins: One of the most subtle bugs in ML. If you are predicting “Will a user churn on Tuesday?”, you must not use data from Wednesday in your training row. The Feature Store handles this Temporal Join to ensure perfect causality, preventing “Look-ahead bias” (a minimal sketch follows this list).
  • Materialization: The automated process of moving data from the offline store to the online store based on a TTL (Time-to-Live) policy.
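To make the point-in-time join concrete, here is a minimal sketch using plain pandas (merge_asof) rather than a full feature store; the column names (user_id, event_time, feature_time) are illustrative, not tied to any specific product.

import pandas as pd

# Label events: "did user churn as of this timestamp?"
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-02", "2024-01-09", "2024-01-05"]),
    "churned": [0, 1, 0],
})

# Feature snapshots computed at various times (the offline store)
features = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "feature_time": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-01", "2024-01-06"]),
    "sessions_7d": [3, 1, 9, 2],
})

# Point-in-time join: for each label, take the latest feature row at or before
# the label's event_time (never after), which prevents look-ahead bias.
training_rows = pd.merge_asof(
    labels.sort_values("event_time"),
    features.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",
)
print(training_rows[["user_id", "event_time", "sessions_7d", "churned"]])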

3.2 Pillar 2: The Training Pipeline (Reproducibility)

We treat a training run as a Function where the inputs are (Code, Data_Version, Params) and the output is a Model_Artifact.

  • Orchestration: Using Airflow, Step Functions, or Kubeflow Pipelines to handle the DAG (Directed Acyclic Graph) of training steps.
  • Distributed Training: For LLMs or large computer vision models, we use PyTorch Distributed or DeepSpeed to split the workload across hundreds of GPUs using data parallelism and model parallelism.
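To make the (Code, Data_Version, Params) framing concrete, here is a minimal sketch of how a training run could be fingerprinted so the registry can later prove exactly what produced a model. The helper names, the git commit lookup, and the dataset digest are illustrative assumptions, not any specific tool's API.

import hashlib
import json
import subprocess


def dataset_digest(path: str) -> str:
    """Content hash of a dataset snapshot (streamed to keep memory flat)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def run_signature(data_path: str, params: dict) -> dict:
    """Fingerprint a training run as a function of (Code, Data_Version, Params)."""
    code_version = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    data_version = dataset_digest(data_path)
    payload = json.dumps([code_version, data_version, params], sort_keys=True)
    return {
        "code_version": code_version,
        "data_version": data_version,
        "params": params,
        # A single ID the model registry can store next to the artifact.
        "run_id": hashlib.sha256(payload.encode()).hexdigest()[:16],
    }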

3.3 Pillar 3: The Model Registry (Governance)

A Model Registry is not just a folder in S3. It is a metadata-rich ledger.

  • Lineage Tracking: “Model_V3 was derived from Model_V2 with a 2% learning rate increase and an updated feature set.”
  • Stage Management: Models move through lifecycle stages: Experimenting -> Staging -> Production -> Archived.
  • Compliance Sign-offs: A production deployment might require an automated “Security Scan” and a human “Ethical Review” sign-off recorded in the registry metadata.

3.4 Pillar 4: The Evaluation Pipeline (The Filter)

Models must be audited before they touch live traffic.

  • Static Evaluation: Running metrics (Precision, Recall, NDCG) on a fixed “Golden Dataset.”
  • Robustness Testing: “How does the model behave if we add random Gaussian noise to the inputs?” Or “What happens to the credit score if we change the user’s zip code?” (Counterfactual testing).
  • Inference Performance Testing: Measuring CPU/GPU usage and latency under different batch sizes.

4. The CI/CD/CT Pattern: The ML Flywheel

In traditional software, we have CI/CD. In MLOps, we add CT (Continuous Training).

4.1 Continuous Integration (CI): Testing Code, Data, and Model

In MLOps, a build failure can be triggered by more than just a syntax error.

  • Unit & Integration Tests: Standard testing for your training and inference code.
  • Data Validation Tests: Running a subset of the pipeline on a “synthetic” dataset. We use Great Expectations to enforce that “User_Age” is never negative and “Country_Code” matches a known list.
  • Model Performance Gating: A “Candidate” model is automatically evaluated against a “Golden Test Set.” If its accuracy drops below the current production model, the build is marked as “failed.”
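As a sketch of the two ML-specific CI gates above, the checks below use plain pandas assertions in place of Great Expectations and a hard-coded accuracy comparison; the column names, the allowed country list, and the thresholds are illustrative.

import pandas as pd


def validate_training_data(df: pd.DataFrame) -> list:
    """Return a list of data-quality violations (an empty list means the gate passes)."""
    errors = []
    if (df["user_age"] < 0).any():
        errors.append("user_age contains negative values")
    known_countries = {"US", "GB", "DE", "IN", "BR"}
    if not set(df["country_code"].unique()) <= known_countries:
        errors.append("country_code contains unknown values")
    return errors


def performance_gate(candidate_accuracy: float, production_accuracy: float) -> bool:
    """Fail the build if the candidate is worse than the model already in production."""
    return candidate_accuracy >= production_accuracy


batch = pd.DataFrame({"user_age": [25, 41], "country_code": ["US", "DE"]})
assert not validate_training_data(batch), "Data validation gate failed"
assert performance_gate(candidate_accuracy=0.93, production_accuracy=0.91), "Model gate failed"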

4.2 Continuous Delivery (CD): The Path to Production

  • Automated Packaging: Compiling the model into a hardware-optimized format (ONNX, TensorRT, or CoreML) and wrapping it in a Docker container.
  • Inference SLA Validation: The model is deployed to a “Staging” environment where a load-testing tool (like Locust or k6) simulates 1,000 requests/sec. If the P99 latency exceeds 100ms, the delivery is halted.
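A minimal sketch of the SLA gate: compute the P99 from latencies recorded during the staging load test and halt delivery if it breaches the budget. The 100ms budget and the synthetic latency samples are illustrative; a real pipeline would read these numbers from Locust or k6 output.

import numpy as np

P99_BUDGET_MS = 100.0


def sla_gate(latencies_ms, budget_ms=P99_BUDGET_MS) -> bool:
    """Return True if the staging load test stays within the latency SLA."""
    p99 = float(np.percentile(latencies_ms, 99))
    print(f"P99 latency: {p99:.1f} ms (budget {budget_ms} ms)")
    return p99 <= budget_ms


# Example: latencies collected during a 1,000 req/sec load test in staging
samples = np.random.gamma(shape=2.0, scale=20.0, size=10_000)
if not sla_gate(samples):
    raise SystemExit("Delivery halted: P99 latency exceeds the SLA budget")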

4.3 Continuous Training (CT): The Self-Healing System

CT allows the system to autonomously adapt to the changing real world.

  • Frequency-Based: Training on a cron job (daily/weekly).
  • Drift-Triggered: Training is launched automatically the moment a significant shift is detected in Data Drift or Concept Drift.

5. Monitoring & Observability: The SRE for ML

A traditional software dashboard tracks CPU, Memory, and Error Rates. In MLOps, we track Drift and Calibration.

5.1 Data Drift (Feature Drift)

The inputs change. $P(X)$ shifts.

  • Detection: We use the Population Stability Index (PSI) or the Kolmogorov-Smirnov (K-S) Test to compare the distribution of features today against the distribution during training (a minimal PSI sketch follows this list).
  • Example: A model trained on high-end smartphone users starts receiving traffic from low-end devices. The feature “device_ram” drifts.
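The PSI check can be sketched in a few lines of NumPy: bin the training-time (reference) distribution, compare today's traffic against the same bins, and alert past a threshold. The 0.2 alert level is a common rule of thumb, used here as an assumption.

import numpy as np


def population_stability_index(reference, current, bins=10) -> float:
    """PSI between a training-time feature distribution and its live distribution."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid division by zero / log(0) in empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


train_ram = np.random.normal(8.0, 2.0, 50_000)  # device_ram (GB) at training time
live_ram = np.random.normal(4.0, 1.5, 50_000)   # low-end devices arriving today
psi = population_stability_index(train_ram, live_ram)
if psi > 0.2:  # rule-of-thumb threshold for a significant shift
    print(f"Data drift detected on device_ram (PSI={psi:.2f}) -> trigger retraining")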

5.2 Concept Drift

The relationship between input and output changes. $P(Y|X)$ shifts.

  • Detection: Requires “Ground Truth” labels to be returned to the system (often delayed). We monitor the Rollup Accuracy over time.
  • Example: A pandemic changes shopping habits. The “Price” model doesn’t work anymore even though the “House Size” feature is the same.

5.3 Model Calibration & Interpretability

  • Brier Score: Measures the gap between your model’s predicted probabilities and the actual outcomes. A well-calibrated model that says “80% chance” should be right 80 times out of 100 (a calibration-check sketch follows this list).
  • Real-time SHAP/LIME: For high-stakes decisions (e.g., denying a loan), the inference engine must log the Shapley Values of the features that contributed to that specific decision for future auditability.
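A minimal calibration check, assuming delayed ground-truth labels have been joined back to the logged prediction probabilities; it uses scikit-learn's brier_score_loss and calibration_curve, and the alert threshold is an illustrative assumption.

import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Logged prediction probabilities joined with (delayed) ground-truth outcomes
rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 5_000)
y_true = (rng.uniform(0, 1, 5_000) < y_prob).astype(int)  # calibrated by construction

brier = brier_score_loss(y_true, y_prob)
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

# A well-calibrated model keeps predicted and observed frequencies close in every bin.
max_gap = float(np.max(np.abs(prob_true - prob_pred)))
print(f"Brier score: {brier:.3f}, worst per-bin calibration gap: {max_gap:.3f}")
if max_gap > 0.10:  # illustrative alert threshold
    print("Calibration decay detected -> schedule recalibration or retraining")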

6. Infrastructure as Code (IaC) for Machine Learning

MLOps is not just software; it’s infrastructure. To avoid the “Snowflake Server” problem (where a server is manually configured and impossible to reproduce), we use IaC.

6.1 Provisioning with Terraform

We define our GPU clusters, S3 buckets, and IAM roles in Terraform or Pulumi.

  • The Benefit: You can spin up a perfect clone of your production environment in a new AWS region in minutes.
  • Environment Parity: Ensuring that dev, staging, and prod are identical in architecture, differing only in scale.

6.2 The Kubernetes Operator Pattern

For complex ML workflows, we use K8s Operators (like the Kubeflow Training Operator or SparkOperator). These custom controllers automate the movement of data and the allocation of pods, treating a distributed training job as a single manageable resource.


7. Scaling Inference: From 1 to 10,000 requests/sec

Scaling ML models is exponentially harder than scaling web servers because of the Memory Footprint and GPU Hunger.

7.1 Horizontal Scaling with Kubernetes (K8s)

  • Model Serving Clusters: Using KServe or BentoML to manage a fleet of inference pods.
  • Auto-scaling (HPA): Scaling not based on CPU, but on GPU Utilization or Request Queue Length.

7.2 Optimization Techniques

  • Quantization: Converting 32-bit floats to 8-bit integers (INT8) to reduce model size by 4x and speed up inference by 2-5x (a minimal sketch follows this list).
  • Pruning: Removing neurons that contribute little to the output.
  • Distillation: Training a small “Student” model to mimic a large “Teacher” model (e.g., BERT-Small mimicking GPT-4 results for specific tasks).
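As a sketch of post-training quantization, the snippet below applies PyTorch's dynamic INT8 quantization to a small feed-forward model and compares serialized sizes; the exact size and speed gains depend on the architecture and hardware.

import io
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 2))

# Post-training dynamic quantization: weights are stored as INT8 and
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)


def serialized_size_mb(m: nn.Module) -> float:
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6


print(f"FP32 model: {serialized_size_mb(model):.2f} MB")
print(f"INT8 model: {serialized_size_mb(quantized):.2f} MB")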

8. LLMOps: The Specialized Frontier

LLMs introduce new failure modes: hallucination, jailbreaking, and massive token costs.

8.1 Prompt Engineering as Code (PEaC)

Prompts are now first-class citizens. We version them using tools like LangSmith or Weights & Biases Prompts. A single change in a “System Message” (e.g., adding “Be concise”) can radically change the precision of an extraction model.
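A minimal sketch of treating prompts as code, independent of any specific tool such as LangSmith: store each prompt template with a semantic version and a content hash, and log that hash with every LLM call so an output can be traced back to the exact prompt text. The structure and field names are illustrative.

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str
    template: str

    @property
    def content_hash(self) -> str:
        return hashlib.sha256(self.template.encode()).hexdigest()[:12]


EXTRACTION_PROMPT = PromptVersion(
    name="invoice_extraction_system",
    version="2.3.0",
    template=(
        "You are an information extraction engine. Be concise. "
        "Return only the fields: vendor, date, total_amount."
    ),
)

# Logged with every request so any output can be traced to the exact prompt version.
print(EXTRACTION_PROMPT.name, EXTRACTION_PROMPT.version, EXTRACTION_PROMPT.content_hash)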

8.2 The RAG Pipeline Infrastructure

Retrieval-Augmented Generation (RAG) requires keeping the Vector Database (Pinecone, Milvus, Weaviate) in sync with the source of truth (Cassandra, Snowflake).

  • Index Refresh Policies: How fast can we update an embedding?
  • RAG Evaluation (Ragas): Monitoring the “Faithfulness” (Did the model stick to the retrieved info?) and “Relevance” (Did the retrieved info answer the question?).

8.3 Governance & Guardrails

  • LLM Gateways: A centralized proxy (like Portkey or LiteLLM) that handles rate-limiting, load-balancing across multiple providers (OpenAI, Anthropic, Gemini), and caching frequent queries to save costs.
  • Safety Filtering: Running a “Moderation Model” on every input and output to detect toxicity or PII leakage before it reaches the user.
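A minimal sketch of the guardrail pattern: wrap every generation call with a safety check on both the input and the output. The moderate() and call_llm() functions here are hypothetical stand-ins for a real moderation model and a real provider client behind the gateway.

import re

BLOCKED_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b"]  # e.g., US-SSN-shaped strings (PII)


def moderate(text: str) -> bool:
    """Hypothetical safety check: True means the text is safe to pass through."""
    return not any(re.search(p, text) for p in BLOCKED_PATTERNS)


def call_llm(prompt: str) -> str:
    """Hypothetical provider call (OpenAI/Anthropic/Gemini behind the gateway)."""
    return f"[model answer to: {prompt}]"


def guarded_completion(prompt: str) -> str:
    if not moderate(prompt):
        return "Request blocked by input safety filter."
    answer = call_llm(prompt)
    if not moderate(answer):
        return "Response withheld by output safety filter."
    return answer


print(guarded_completion("Summarize this contract for me."))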

9. Explainability Pipelines: The “Why” behind the “What”

In regulated industries (Healthcare, Finance, Insurance), a model’s prediction is useless unless it is accompanied by an explanation.

9.1 Global vs. Local Interpretability

  • Global: What features are generally important to the model? (e.g., “The model mostly looks at Credit Score and Debt-to-Income ratio”). We calculate this once per training run using Global SHAP or Feature Permutation Importance.
  • Local: Why was this specific user denied a loan? (e.g., “This user was denied because their Debt-to-Income ratio was 45%”).

9.2 The Explanation Service (XAI)

In a high-scale architecture, we don’t calculate SHAP values in the inference path (it would be too slow). Instead:

  1. Inference: The model returns a prediction in 5ms.
  2. XAI Queue: The prediction and input features are sent to an asynchronous queue (e.g., RabbitMQ or Kafka).
  3. XAI Worker: An offline worker calculates the SHAP/LIME explanation and stores it in the Explanation Database.
  4. Retrieval: When a customer support agent looks at the user’s profile 10 seconds later, the explanation is ready for them.
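A minimal in-process sketch of this asynchronous flow, using Python's queue module in place of Kafka/RabbitMQ and a stubbed explain() in place of a real SHAP computation; in production each piece runs as a separate service.

import queue
import threading

explanation_db = {}        # stands in for the Explanation Database
xai_queue = queue.Queue()  # stands in for Kafka / RabbitMQ


def explain(features):
    """Stub for a SHAP/LIME computation (slow, so it runs off the hot path)."""
    return {name: round(value * 0.1, 3) for name, value in features.items()}


def xai_worker():
    while True:
        prediction_id, features = xai_queue.get()
        explanation_db[prediction_id] = explain(features)
        xai_queue.task_done()


threading.Thread(target=xai_worker, daemon=True).start()

# Inference path: return the prediction immediately, enqueue the explanation work.
prediction_id = "pred_001"
features = {"debt_to_income": 0.45, "credit_score": 610}
xai_queue.put((prediction_id, features))

xai_queue.join()  # in reality, the support agent reads the result seconds later
print(explanation_db[prediction_id])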

10. Rollbacks & Deployment Strategies: The “Aces” Playbook

How do you deploy a model at Scale without breaking the world?

10.1 Shadow Deployment

The new model receives a mirrored copy of 100% of live traffic, but its results are never returned to users. We only log the predictions for offline comparison with the “Champion” model. This is the safest way to test for Inference Performance and Data Compatibility.

10.2 Canary Release

Deploy Model V2 to 1% of users. Monitor the “Business Metrics” (e.g., Click-Through Rate) and “System Metrics” (e.g., Latency). If metrics are stable after 1 hour, ramp up to 5%, 20%, 100%.
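A minimal sketch of the ramp logic: hold at each traffic step, compare the canary against the champion, and abort on regression. The step schedule, the thresholds, and the get_metrics() hook are illustrative assumptions; a real system would pull these numbers from the metrics store.

RAMP_STEPS = [0.01, 0.05, 0.20, 1.00]  # share of traffic routed to the canary


def get_metrics(model: str, traffic_share: float) -> dict:
    """Hypothetical hook: pull CTR and latency for a model at the current split."""
    return {"ctr": 0.031, "p99_latency_ms": 82.0}


def canary_is_healthy(canary: dict, champion: dict) -> bool:
    return (canary["ctr"] >= champion["ctr"] * 0.98       # business metric
            and canary["p99_latency_ms"] <= 100.0)         # system metric


def run_canary_rollout() -> bool:
    for share in RAMP_STEPS:
        canary = get_metrics("model_v2", share)
        champion = get_metrics("model_v1", 1.0 - share)
        if not canary_is_healthy(canary, champion):
            print(f"Regression detected at {share:.0%} traffic, rolling back")
            return False
        print(f"Healthy at {share:.0%} traffic, ramping up")
    return True


run_canary_rollout()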

10.3 A/B Testing: The Causal Proof

Unlike a Canary release (which is about stability), A/B testing is about Scientific Validity. We use randomized buckets to determine if Model B actually outperforms Model A on key KPIs (like Revenue per User). This requires a robust Experimentation Platform that can handle hash-based assignment.

11. Edge MLOps: Taking Models to the User

In many applications (Autonomous Vehicles, Mobile Apps, Smart Cameras), waiting for a round-trip to a centralized cloud is impossible. We must deploy to the Edge.

11.1 The Edge Lifecycle

  1. Compression: Using Weight Quantization and Knowledge Distillation to shrink a 500MB model down to 10MB while retaining 98% accuracy.
  2. Hardware Targeting: Compiling the model for specific chips (e.g., Apple Neural Engine via CoreML, or Google TPU via TFLite).
  3. Fleet Orchestration: Managing “The Fleet” through over-the-air (OTA) updates pushed to millions of devices simultaneously without bricking them.

11.2 Federated Learning: Privacy-First Training

For sensitive data (like medical records or keyboard typing habits), we use Federated Learning. The model is sent to the device, trained locally on a small batch of private data, and only the gradients (the mathematical updates) are sent back to the central server. The raw data never leaves the user’s device.
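A minimal sketch of the server-side aggregation step (FedAvg-style weighted averaging); the update vectors and client sizes are illustrative, and a real system would add secure aggregation and update clipping on top.

import numpy as np


def federated_average(client_updates, client_sizes):
    """FedAvg: combine locally computed updates, weighted by local dataset size."""
    weights = np.array(client_sizes, dtype=float)
    weights /= weights.sum()
    return sum(w * update for w, update in zip(weights, client_updates))


# Each device trains locally and ships only its parameter update (never raw data).
updates = [np.array([0.10, -0.20]), np.array([0.05, -0.10]), np.array([0.20, -0.30])]
sizes = [120, 300, 80]  # number of local examples on each device
print(federated_average(updates, sizes))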


12. Security & Governance: The “Fortress” Model

Production AI is a new attack vector. We must protect against both accidental bias and malicious intent.

12.1 Adversarial Robustness

Attackers can craft “Adversarial Perturbations”: tiny, imperceptible changes to an image or audio clip that cause a model to misclassify a Stop sign as a Speed Limit sign.

  • Defense: We include adversarial samples in our Training Pipeline (Adversarial Training) to make the model “immune” to these specific noise patterns.

12.2 PII Masking & Data Residency

In a global MLOps system, data from German users (GDPR) must stay in Germany.

  • The Implementation: Automated “PII Scrubbers” in the Data Pipeline that mask Social Security Numbers and Names before the data is stored in the Feature Store.

12.3 Model Watermarking

To prevent “Model Stealing” (where a competitor queries your API millions of times to train their own model), we can embed a unique, mathematical “Watermark” into the weights. If a competitor’s model shows the same watermark response, it’s proof of intellectual property theft.


13. Implementation: A Scalable Training & Inference Pipeline

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


# 1. Pipeline Governance: Tracking Everything with MLflow
def train_and_register_model(X_train, y_train, X_test, y_test, params):
    # Tracking context (experiment name, run name)
    mlflow.set_experiment("global_fraud_detection")

    with mlflow.start_run():
        # Log Parameters
        mlflow.log_params(params)

        # Training
        model = RandomForestClassifier(**params)
        model.fit(X_train, y_train)

        # Evaluation
        predictions = model.predict(X_test)
        acc = accuracy_score(y_test, predictions)

        # Log Metrics
        mlflow.log_metric("accuracy", acc)

        # 2. Model Versioning & Registry
        # The model is not just a file; it's an artifact with a lineage.
        if acc > 0.95:  # Gating deployment on a minimum accuracy threshold
            mlflow.sklearn.log_model(
                model,
                artifact_path="fraud_model",
                registered_model_name="fraud_classifier_prod",
            )
            print("Model Registered to Production Stage.")
        else:
            print("Performance check failed. Model discarded.")


# 3. Serving with a Wrapper for Observability
class ProductionModelWrapper:
    def __init__(self, model_uri):
        self.model = mlflow.sklearn.load_model(model_uri)
        self.stats = {}  # Injected from Prometheus in real systems

    def predict(self, input_data):
        # Pre-execution: Input Validation
        if not self.validate_schema(input_data):
            return {"error": "Invalid Input Schema", "status": 400}

        # Execution
        result = self.model.predict(input_data)

        # Post-execution: Observability logging (Data Drift Check)
        self.log_prediction_telemetry(input_data, result)

        return result

    def validate_schema(self, data):
        # Pydantic or Great Expectations validation logic
        return True

    def log_prediction_telemetry(self, data, result):
        # Async push to a Feature Monitor queue (Kafka -> ELK)
        pass


14. Case Study: The “Michelangelo” Pattern at Global Scale

Uber’s Michelangelo was the pioneer of the “ML-as-a-Service” platform. Its success came from three core philosophies:

14.1 Standardizing the Workflow

Instead of letting every data scientist pick their own libraries, Michelangelo provided a curated menu of “Supported Algorithms” (XGBoost, SparkML, TensorFlow). This allowed the platform team to build highly optimized inference engines and monitoring bridges that worked for 90% of use cases.

14.2 The “Horizontal” vs “Vertical” Team Model

  • Horizontal (Platform Team): Responsible for the feature store, the CI/CD runners, and the K8s clusters.
  • Vertical (Application Teams): Responsible for the model logic (e.g., “Uber Eats Delivery Time Prediction”).
  • This separation of concerns allowed the platform to scale to supporting thousands of unique models without the platform team becoming a bottleneck.

14.3 Near-Real-Time Feature Engineering

Michelangelo famously used Apache Flink for “streaming features.” For example, the “Average rating of this restaurant in the last 60 minutes” is calculated as events flow through Kafka and is materialized into the Online Store instantly.

15. Case Study: Architecting a Content Moderation Immune System

Imagine a social media platform like Facebook or Twitter. You have 100 million pieces of content uploaded per hour. You need a model to detect “Hate Speech” and “Graphic Violence.”

15.1 The Cascade: Balancing Safety and Speed

In a system of this scale, a single model is never sufficient. We use a Tiered Filtration Strategy:

  1. L0: Hash Matching (The Blacklist): Before any ML is invoked, the system checks the content’s digital fingerprint (hash) against a database of known illegal content. This takes < 1ms and catches 90% of re-uploaded violating material.
  2. L1: Fast Lexical & Image Features: A lightweight model analyzes the text for obvious slur patterns and the images for “blatant” violence using character-level features.
  3. L2: Large Multi-modal Transformer: The remaining 5% of “ambiguous” content is sent to a heavy transformer (like a fine-tuned Llama-Guard or CLIP) that understands cultural context, sarcasm, and subtle visual cues.
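A minimal sketch of the tiered cascade: cheap checks run first and only ambiguous content reaches the expensive model. The hash set, the lightweight classifier, and the heavy transformer call are hypothetical stand-ins.

import hashlib

KNOWN_VIOLATING_HASHES = {"5d41402abc4b2a76b9719d911017c592"}  # illustrative entries


def l0_hash_match(content: bytes) -> bool:
    """Digital fingerprint lookup against known violating material (< 1ms, no ML)."""
    return hashlib.md5(content).hexdigest() in KNOWN_VIOLATING_HASHES


def l1_fast_classifier(text: str) -> float:
    """Cheap lexical model returning a violation probability (stubbed)."""
    return 0.9 if "known_slur_pattern" in text else 0.1


def l2_heavy_transformer(text: str) -> float:
    """Expensive multi-modal model, invoked only for ambiguous content (stubbed)."""
    return 0.5


def moderate_content(content: bytes, text: str) -> str:
    if l0_hash_match(content):
        return "block"
    score = l1_fast_classifier(text)
    if score > 0.8:
        return "block"
    if score < 0.2:
        return "allow"
    return "block" if l2_heavy_transformer(text) > 0.5 else "allow"


print(moderate_content(b"some upload", "an ambiguous post"))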

15.2 The Feedback Loop: Human-in-the-Loop MLOps

When a human moderator disagrees with a model’s “High Confidence” block, the event is flagged as a Calibration Conflict. These specific examples are automatically routed into a High-Priority Data Pool for the next retraining cycle, ensuring the model’s “immune system” evolves with new slang and complex social trends.


16. Lineage and the “Black Box” Audit

In high-stakes industries, “the model said so” is not a defense. You must be able to perform a “Forensic Reconstruction” of any prediction made in the last 7 years.

16.1 The Immutable Audit Trail

For every prediction ID generated by your inference service, you must be able to retrieve:

  • The Artifacts: The exact weights (model_hash) and the exact container image hash that served it.
  • The Lineage: The DVC snapshot of the training data and the git commit of the training code.
  • The Explainability Snapshot: The SHAP or Integrated Gradients values calculated at inference time that explain why that specific result was returned.

16.2 Managing Model Decay

Models don’t stay accurate forever. We implement an Automatic Retirement Policy. If a model’s performance on a daily “Golden Test Set” drops below a certain threshold for three consecutive days, it is automatically withdrawn, and traffic is rolled back to a previous, more robust version. This “Fail-Safe” mechanism is the hallmark of a Senior-grade MLOps engineer.
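A minimal sketch of the retirement rule described above: withdraw a model after three consecutive days below the golden-set threshold. The threshold value and the rollback action are illustrative assumptions.

ACCURACY_THRESHOLD = 0.92
CONSECUTIVE_DAYS_TO_RETIRE = 3


def should_retire(daily_golden_set_accuracy) -> bool:
    """True if the last N daily evaluations all fell below the threshold."""
    recent = daily_golden_set_accuracy[-CONSECUTIVE_DAYS_TO_RETIRE:]
    return (len(recent) == CONSECUTIVE_DAYS_TO_RETIRE
            and all(acc < ACCURACY_THRESHOLD for acc in recent))


history = [0.95, 0.94, 0.91, 0.90, 0.89]
if should_retire(history):
    print("Retiring model and rolling traffic back to the previous registered version")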


17. The Modern MLOps & LLMOps Stack: A Comparison

| Category | Standard MLOps | LLMOps (The New Stack) |
| --- | --- | --- |
| Model Registry | MLflow, Kubeflow | Weights & Biases, LangSmith |
| Orchestration | Airflow, Prefect | Flyte, ZenML |
| Database | Snowflake, BigQuery | Pinecone, Weaviate, Milvus |
| Monitoring | Arize, WhyLabs | LangSmith, Giskard |
| Serving | TFServing, TorchServe | vLLM, TGI (Text Generation Inference) |
| Governance | Great Expectations | Guardrails AI, NeMo Guardrails |

18. The Future of MLOps: Self-Optimizing Infrastructure

As we move toward “Autonomous MLOps,” the role of the engineer shifts from “Builder” to “Orchestrator.”

  • Agentic MLOps: AI agents that monitor their own drift and submit PRs to fix the data pipeline when they detect a regression.
  • Green AI: Real-time carbon tracking for GPU clusters. The scheduler automatically moves training jobs to regions with the cheapest and “greenest” energy grid at that hour.
  • Zero-Shot Production: Models that learn how to handle new categories (Edge Cases) in production without needing a full retraining cycle, using RAG-based on-the-fly updates.

19. The Engineer’s Checklist for MLOps Readiness

Before you push any model to a billion users, ask yourself these ten critical questions. If you can’t answer “Yes” to all of them, your model is a liability.

  1. Can I reproduce Model V2 exactly?: Do I have the exact code, exact data snapshot, and exact environment locked?
  2. Is my Training-Serving Parity guaranteed?: Am I using a Feature Store to ensure the online and offline calculations are identical?
  3. Does my Monitoring detect silent failures?: Am I tracking Data Drift (PSI) and Concept Drift, or just watching CPU usage?
  4. Is my Calibration healthy?: Does “90% confidence” actually mean 90% accuracy in the real world?
  5. Can I Rollback in under 60 seconds?: Is my traffic-shaper (Istio/Seldon) configured for instant model reversion?
  6. Are my PII filters active?: Is sensitive data being scrubbed before hitting the Feature Store and the logs?
  7. Is my Cost-per-Prediction within budget?: Do I know how much my H100 GPU cluster is costing per user session?
  8. Is my Evaluation Pipeline robust?: Have I tested the model against adversarial noise and extreme edge cases?
  9. Are my LLM Guardrails enforced?: Am I running a safety layer on top of my GPT-5/Llama-4 queries?
  10. Do I have an Audit trail?: Can I explain a specific decision from 6 months ago to a regulator?

20. Summary & Final Thoughts

  1. Infrastructure is the Model: In the long run, your pipeline’s resilience matters more than your model’s architecture.
  2. DevOps + Science: MLOps is the bridge between the rigor of the laboratory and the chaos of the internet.
  3. LLMOps is Non-linear: Prompt versioning and Vector DB sync are the new “hard” problems.
  4. Drift is Inevitable: Design your systems to expect decay and heal themselves automatically.

Production Machine Learning is the ultimate engineering frontier. It requires us to build systems that aren’t just reliable, but also “enlightened” enough to know when they are losing their grasp on reality. Master these principles, and you will build the intelligent systems of the next decade.


FAQ

What is the difference between MLOps and LLMOps?

MLOps covers the full lifecycle of traditional ML models including data pipelines, feature stores, reproducible training, model registries, and monitoring for drift. LLMOps extends this with LLM-specific concerns: prompt versioning as code, vector database synchronization for RAG, token cost management, hallucination detection, and safety guardrails against jailbreaking and PII leakage.

What is training-serving skew and how do feature stores prevent it?

Training-serving skew occurs when the code calculating features during training (Python/Spark) differs from production inference code (C++/Java). A centralized feature store like Tecton or Feast ensures both environments use identical feature calculations. The offline store handles training with historical data, while the online store provides sub-10ms latency lookups for real-time serving, connected by automated materialization.

How do you detect model drift in production ML systems?

Track data drift using Population Stability Index (PSI) or Kolmogorov-Smirnov tests to detect changes in input feature distributions. Monitor concept drift through rollup accuracy over time when ground truth labels become available. Watch for calibration decay using the Brier Score, which reveals when model confidence scores no longer match actual outcomes.

What deployment strategies minimize risk when releasing new ML models?

Use shadow deployment to log predictions without serving them to users, canary releases to gradually route traffic from 1% to 100% while monitoring metrics, and A/B testing with randomized user buckets for scientific validation of business impact. Always maintain instant rollback capability through traffic shapers like Istio or Seldon, targeting sub-60-second reversion.


Originally published at: arunbaby.com/ml-system-design/0063-mlops-llmops-production-playbook
