Cost Optimization for ML
A comprehensive guide to FinOps for Machine Learning: reducing TCO without compromising accuracy or latency.
The Challenge: Efficiency vs. Performance
In the world of Machine Learning System Design, building a model that achieves 99.9% accuracy is only half the battle. The other half is ensuring that this model doesn’t bankrupt your company.
Cost Optimization for ML is the art and science of reducing the financial footprint of your ML workloads without compromising on user experience (latency) or model quality (accuracy).
As a junior engineer, you might think, “Cost is a manager’s problem.” But in modern tech companies, FinOps (Financial Operations) is everyone’s responsibility. An engineer who can design a system that saves the company $50,000 a month is often more valuable than one who improves model accuracy by 0.1%.
In this deep dive, we will explore the entire stack—from hardware selection to model compression to architectural patterns—to uncover where the money goes and how to save it. We will compare cloud providers, dive into the physics of semiconductors, and write actual Kubernetes configuration files.
Glossary of FinOps Terms
Before we begin, let’s define the language of money in tech.
- CAPEX (Capital Expenditure): Upfront money spent on buying physical servers (e.g., buying 100 H100s for your own data center).
- OPEX (Operational Expenditure): Ongoing money spent on cloud services (e.g., renting AWS EC2 instances). Most modern ML is OPEX.
- TCO (Total Cost of Ownership): The sum of all costs (compute + storage + networking + engineering salaries + maintenance) over the life of the project.
- ROI (Return on Investment): (Revenue - Cost) / Cost. If your model costs $100 to run and generates $110 in value, the ROI is 10%.
- Unit Economics: The cost to serve one unit of value (e.g., “Cost per 1000 predictions”). This is the most important metric for scaling.
- Commitment Savings Plan (CSP): A contract where you commit to spending $X/hour for 1-3 years in exchange for a 30-50% discount (e.g., AWS Savings Plans, GCP Committed Use Discounts).
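To make Unit Economics and ROI concrete, here is a tiny worked example; every number in it (instance price, replica count, traffic, attributed revenue) is a hypothetical illustration, not a benchmark:

```python
# Illustrative unit-economics math (all numbers are hypothetical).
instance_cost_per_hour = 0.526       # e.g., one GPU instance, On-Demand
replicas = 3
requests_per_month = 30_000_000

monthly_cost = instance_cost_per_hour * 24 * 30 * replicas
cost_per_1000_predictions = monthly_cost / (requests_per_month / 1000)

revenue_attributed = 1500.0          # assumed monthly value the model generates
roi = (revenue_attributed - monthly_cost) / monthly_cost

print(f"Monthly cost:           ${monthly_cost:,.2f}")
print(f"Cost per 1000 requests: ${cost_per_1000_predictions:.4f}")
print(f"ROI:                    {roi:.0%}")
```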
The Anatomy of ML Costs
To fix a leak, you first have to find it. Let’s break down the bill.
1. Compute (The Big One)
This is usually 70-80% of the cost.
- Training: Massive bursts of high-end GPU usage (e.g., NVIDIA A100s, H100s). Training a large language model can cost millions.
- Inference: Continuous usage of smaller GPUs (T4, L4) or CPUs. While per-hour cost is lower, this runs 24/7, so it adds up.
- Development: Notebooks (Jupyter/Colab) left running overnight. This is “zombie spend.”
2. Storage
- Object Storage (S3/GCS): Storing petabytes of raw data, logs, and model checkpoints.
- Block Storage (EBS/Persistent Disk): High-speed disks attached to GPU instances. These are expensive!
- Feature Store: Low-latency databases (Redis/DynamoDB) for serving features.
3. Data Transfer (Networking)
- Egress: Moving data out of the cloud provider (e.g., serving images to users).
- Cross-Zone/Region: Moving data between availability zones (AZs) for redundancy. Training a model in Zone A with data in Zone B incurs massive costs.
Strategy 1: The Spot Instance Revolution
Cloud providers have excess capacity. They sell this spare capacity at a massive discount (60-90% off) called Spot Instances (AWS) or Preemptible VMs (GCP). The catch? They can take it back with a 30-second to 2-minute warning.
How to Tame the Spot Beast
You cannot run a standard web server on Spot instances without risk. But for ML, they are a goldmine.
For Training
Training is long-running but often checkpointable.
- Checkpoint Frequently: Save your model weights to S3 every 10-15 minutes.
- Auto-Resume: Use a job orchestrator (like Ray, Slurm, or Kubernetes Jobs) that detects when a node dies and automatically spins up a new one, loading the last checkpoint.
- Mixed Clusters: Use a small “On-Demand” head node (manager) and a fleet of “Spot” worker nodes. If workers die, the manager survives and requests new workers.
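A minimal sketch of that checkpoint-and-resume loop, assuming PyTorch for the model and `boto3` for S3 access; the bucket name, key prefix, and the `train_one_epoch` call are hypothetical placeholders for your own training code:

```python
import boto3
import torch

BUCKET = "my-training-checkpoints"   # hypothetical bucket
PREFIX = "bert-sentiment/"           # hypothetical key prefix
s3 = boto3.client("s3")

def save_checkpoint(model, optimizer, epoch, path="/tmp/ckpt.pt"):
    # Persist weights + optimizer state locally, then push to S3.
    torch.save({"epoch": epoch,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, path)
    s3.upload_file(path, BUCKET, f"{PREFIX}epoch_{epoch:04d}.pt")

def load_latest_checkpoint(model, optimizer, path="/tmp/ckpt.pt"):
    # On (re)start, find the most recent checkpoint and resume from it.
    objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    if not objects:
        return 0                     # fresh start, no checkpoint yet
    latest = max(objects, key=lambda o: o["Key"])["Key"]
    s3.download_file(BUCKET, latest, path)
    state = torch.load(path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1        # resume after the saved epoch

def train(model, optimizer, total_epochs=10):
    start = load_latest_checkpoint(model, optimizer)
    for epoch in range(start, total_epochs):
        train_one_epoch(model, optimizer)        # placeholder training step
        save_checkpoint(model, optimizer, epoch) # cheap insurance vs. preemption
```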
For Inference
This is trickier because you can’t drop user requests.
- Over-provisioning: Run 20% more replicas than you need. If some get preempted, the others handle the load while new ones spin up.
- Graceful Shutdown: Listen for the “Preemption Notice” (a signal sent by the cloud provider). When received:
- Stop accepting new requests (update Load Balancer health check to ‘fail’).
- Finish processing current requests.
- Upload logs.
- Die peacefully.
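Here is a minimal sketch of that shutdown flow for a single replica, assuming the AWS spot interruption notice is readable from the legacy (IMDSv1) metadata path; the `flush_logs` and `in_flight_requests` callables are placeholders for your own service:

```python
import threading
import time

import requests

# AWS publishes the spot interruption notice here roughly two minutes before
# reclamation (the path returns 404 until then). Other clouds have analogues.
SPOT_NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

DRAINING = threading.Event()

def health_check():
    # Wire this into the load balancer health probe: report "unhealthy" once we
    # start draining so no new requests are routed to this replica.
    return not DRAINING.is_set()

def watch_for_preemption(poll_seconds=5):
    # Background poller that flips the draining flag when a notice appears.
    while not DRAINING.is_set():
        try:
            resp = requests.get(SPOT_NOTICE_URL, timeout=1)
            if resp.status_code == 200:   # notice issued
                DRAINING.set()
                break
        except requests.RequestException:
            pass                           # metadata endpoint unreachable; keep polling
        time.sleep(poll_seconds)

def shutdown_gracefully(flush_logs, in_flight_requests):
    DRAINING.wait()                 # block until a preemption notice arrives
    while in_flight_requests():     # finish requests we already accepted
        time.sleep(1)
    flush_logs()                    # upload logs, then let the process exit

threading.Thread(target=watch_for_preemption, daemon=True).start()
```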
Strategy 2: Model Optimization (Make it Smaller)
The most effective way to save compute is to do less math.
1. Quantization
Standard ML models use FP32 (32-bit floating point numbers).
- FP16 (Half Precision): Most modern GPUs run faster on FP16. It cuts memory usage in half.
- INT8 (8-bit Integer): This is the game changer. It reduces model size by 4x and speeds up inference by 2-4x on CPUs.
Types of Quantization:
- Post-Training Quantization (PTQ): Take a trained model and convert weights to INT8. Simple, but can drop accuracy.
- Quantization-Aware Training (QAT): Simulate low-precision during training. The model learns to be robust to rounding errors. Higher accuracy, but more complex.
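As a concrete example of PTQ, PyTorch ships a dynamic-quantization helper that converts the weights of chosen layer types to INT8 in one call; the tiny model below is just a stand-in for whatever you actually serve:

```python
import io

import torch
import torch.nn as nn

# Stand-in model; in practice this would be your trained network.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 2))
model.eval()

# Post-Training (dynamic) Quantization: Linear weights become INT8,
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_mb(m):
    # Rough size comparison via serialized state_dicts.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"FP32: {size_mb(model):.2f} MB -> INT8: {size_mb(quantized):.2f} MB")
```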
2. Pruning
Neural networks are over-parameterized. Many weights are close to zero and contribute nothing.
- Unstructured Pruning: Set individual weights to zero. Makes the matrix “sparse.” Requires specialized hardware to see speedups.
- Structured Pruning: Remove entire neurons, channels, or layers. This shrinks the matrix dimensions, leading to immediate speedups on all hardware.
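A short sketch of both flavors using PyTorch's `torch.nn.utils.prune` utilities (the layer is a toy stand-in for part of a real network):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)   # toy layer standing in for part of a real model

# Unstructured: zero out the 30% of individual weights with the smallest |w|.
# The tensor becomes sparse but keeps its shape, so speedups need sparse kernels.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Structured: zero out 50% of entire output rows (dim=0) by L2 norm.
# These rows can later be removed to get a physically smaller layer,
# which is what yields speedups on ordinary hardware.
prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)

# Make the pruning permanent (bake the mask into the weight tensor).
prune.remove(layer, "weight")
```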
3. Knowledge Distillation
Train a massive “Teacher” model (e.g., BERT-Large) to get high accuracy. Then, train a tiny “Student” model (e.g., DistilBERT) to mimic the Teacher’s output probabilities.
- Result: The Student is 40% smaller and 60% faster, retaining 97% of the Teacher’s accuracy.
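A minimal sketch of the standard soft-target distillation loss (temperature-scaled KL divergence blended with hard-label cross-entropy); the logits here are random tensors just to show the shapes involved:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL loss (mimic the teacher) with the usual hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                        # rescale to keep gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example with random tensors: batch of 8, 3 classes.
student = torch.randn(8, 3)
teacher = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
print(distillation_loss(student, teacher, labels))
```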
Strategy 3: Hardware Selection Deep Dive
Don’t default to the most expensive GPU. Let’s look at the physics.
NVIDIA GPUs: The Workhorses
- A100 (Ampere): The king of training. 40GB/80GB VRAM. Massive memory bandwidth (1.6TB/s). Use this for training LLMs. Cost: ~$3-4/hr.
- H100 (Hopper): The new king. Specialized Transformer Engine. 3x faster than A100 for LLMs. Cost: ~$4-5/hr (if you can find one).
- T4 (Turing): The inference workhorse. 16GB VRAM. Cheap, widely available, supports INT8 well. Cost: ~$0.35-0.50/hr.
- L4 (Ada Lovelace): The successor to T4. 24GB VRAM. Much faster ray tracing and video encoding. Great for generative AI (Stable Diffusion). Cost: ~$0.50-0.70/hr.
- A10G: A middle ground. 24GB VRAM. Good for fine-tuning smaller models (7B params). Cost: ~$1.00/hr.
Google TPUs (Tensor Processing Units)
- Architecture: Systolic Arrays. Data flows through the chip like a heartbeat.
- Pros: Massive throughput for large matrix math (perfect for Transformers).
- Cons: Harder to debug than GPUs. Tightly coupled with XLA compiler.
- Versions: TPUv4, TPUv5e (Efficiency focused).
AWS Inferentia / Trainium
- Custom Silicon: Built by AWS specifically for cost.
- Pros: Up to 40% cheaper than comparable GPU instances.
- Cons: Requires recompiling models using AWS Neuron SDK.
| Hardware | Best For | Cost Profile |
|---|---|---|
| NVIDIA A100/H100 | Training massive LLMs | Very High ($3-4/hr) |
| NVIDIA T4/L4 | Inference, Fine-tuning small models | Medium ($0.50/hr) |
| CPU (Intel/AMD) | Classical ML (XGBoost), Small DL models | Low ($0.05-0.10/hr) |
| AWS Inferentia | Specialized DL Inference | Very Low (High performance/$) |
| Google TPU | Massive Training/Inference | Varies (Great for TensorFlow/JAX) |
Rule of Thumb: Always try to run inference on CPU first. If it’s too slow, try a T4. Only use A100 for training.
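Before paying for a GPU, measure. A tiny latency harness like the sketch below (where `fake_model` stands in for your real predict call) tells you whether a modern CPU already meets your p95 latency target:

```python
import statistics
import time

def benchmark(predict, sample, warmup=10, iters=200):
    # Warm up caches / lazy initialization before timing.
    for _ in range(warmup):
        predict(sample)
    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        predict(sample)
        latencies.append((time.perf_counter() - start) * 1000)  # ms
    latencies.sort()
    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[int(0.95 * len(latencies)) - 1],
        "p99_ms": latencies[int(0.99 * len(latencies)) - 1],
    }

def fake_model(x):
    # Stand-in workload; replace with your real predict call.
    return sum(i * i for i in range(10_000))

# If p95 is comfortably under your latency SLO, stay on CPU.
print(benchmark(fake_model, sample=None))
```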
Strategy 4: Kubernetes Cost Optimization
Most ML runs on Kubernetes (K8s). Here is how to configure it for cost.
1. Node Pools
Create separate pools for different workloads.
- `cpu-pool`: For the API server, logging, monitoring. (Cheap instances.)
- `gpu-pool`: For inference pods. (Expensive instances.)
- `spot-gpu-pool`: For batch jobs. (Cheap, risky instances.)
2. Taints and Tolerations
Prevent non-critical pods from stealing expensive GPU nodes.
Node Configuration:
    # On the GPU node
    taints:
      - key: "accelerator"
        value: "nvidia-tesla-t4"
        effect: "NoSchedule"
Pod Configuration:
    # In your Inference Deployment
    tolerations:
      - key: "accelerator"
        operator: "Equal"
        value: "nvidia-tesla-t4"
        effect: "NoSchedule"
3. Resource Requests & Limits
If you don’t set these, one pod can eat the whole node.
- Requests: “I need at least this much.” K8s uses this for scheduling.
- Limits: “Kill me if I use more than this.” K8s uses this for throttling/OOMKill.
Best Practice: Set Requests = Limits for memory (to avoid OOM kills). Set Requests < Limits for CPU (to allow bursting).
Detailed Case Study: The “Expensive Classifier”
Scenario: You work at a startup. You have a sentiment analysis model (BERT-Base) that processes 1 million user reviews per day.
- Current Setup: 5 x `g4dn.xlarge` (NVIDIA T4) instances running 24/7.
- Cost: $0.526/hr * 24 hrs * 30 days * 5 instances = $1,893 / month.
The Junior Engineer’s Optimization Plan:
Step 1: Auto-scaling (HPA). Traffic isn’t constant. It peaks at 9 AM and drops at 2 AM.
- You implement Kubernetes HPA.
- Average instance count drops from 5 to 3.
- New Cost: $1,135 / month. (Saved $758)
Step 2: Spot Instances. You switch the node pool to Spot instances.
- Spot price for `g4dn.xlarge` is ~$0.15/hr (approx. 70% discount).
- New Cost: $0.15 * 24 * 30 * 3 = $324 / month. (Saved $811)
Step 3: Quantization & CPU Migration. You quantize the model to INT8 using ONNX Runtime. It now runs fast enough on a CPU!
- You switch to `c6i.large` (Compute Optimized CPU) instances.
- Spot price for `c6i.large` is ~$0.03/hr.
- Because CPU is slower than GPU, you need 6 instances instead of 3 to handle the load.
- New Cost: $0.03 * 24 * 30 * 6 = $129 / month. (Saved $195)
Total Savings: From $1,893 to $129 per month. That is a 93% reduction in cost. This is the power of system design.
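For reference, here is the case-study arithmetic as a small script you can adapt to your own workload; the hourly prices are the approximate figures quoted above, and real Spot prices fluctuate:

```python
# Reproduce the case-study math; hourly prices are approximate.
HOURS_PER_MONTH = 24 * 30

def monthly_cost(price_per_hour, instances):
    return price_per_hour * HOURS_PER_MONTH * instances

steps = [
    ("Baseline: 5x g4dn.xlarge On-Demand", monthly_cost(0.526, 5)),
    ("Step 1: HPA, avg 3 replicas",        monthly_cost(0.526, 3)),
    ("Step 2: Spot g4dn.xlarge",           monthly_cost(0.15, 3)),
    ("Step 3: INT8 on 6x c6i.large Spot",  monthly_cost(0.03, 6)),
]

baseline = steps[0][1]
for name, cost in steps:
    print(f"{name:40s} ${cost:8.2f}  ({1 - cost / baseline:.0%} below baseline)")
```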
Implementation: Cost-Aware Router
Let’s look at code for a “Cascade” router. This is a pattern where you try a cheap model first, and only call the expensive model if the cheap one is unsure.
    import requests

    class ModelCascade:
        def __init__(self):
            self.cheap_model_url = "http://cpu-service/predict"
            self.expensive_model_url = "http://gpu-service/predict"
            self.confidence_threshold = 0.85

        def predict(self, input_text):
            # Step 1: Call the Cheap Model (DistilBERT on CPU)
            response = requests.post(self.cheap_model_url, json={"text": input_text})
            result = response.json()

            confidence = result['confidence']
            prediction = result['label']

            print(f"Cheap Model Confidence: {confidence}")

            # Step 2: Check Confidence
            if confidence >= self.confidence_threshold:
                # Good enough! Return early.
                return prediction

            # Step 3: Fallback to Expensive Model (GPT-4 / Large BERT on GPU)
            print("Confidence too low. Calling Expensive Model...")
            response = requests.post(self.expensive_model_url, json={"text": input_text})
            return response.json()['label']

    # Usage
    cascade = ModelCascade()

    # "I love this product!" -> Cheap model is 99% sure. Returns. Cost: $0.0001
    # "The nuance of the texture was..." -> Cheap model is 60% sure. Calls GPU. Cost: $0.01
Monitoring & Metrics: The FinOps Dashboard
You cannot optimize what you cannot measure. You need a dashboard.
Tools:
- Prometheus: Scrapes metrics from your pods.
- Grafana: Visualizes the metrics.
- Kubecost: A specialized tool that tells you exactly how much each namespace/deployment costs.
Key Metrics to Track:
- Cost per Inference: Total Cost / Total Requests. (Goal: Drive this down).
- GPU Utilization: If average utilization is < 30%, you are wasting money. Scale down or bin-pack more models.
- Spot Interruption Rate: How often are your nodes dying? If > 5%, your reliability might suffer.
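As one way to wire this up, the sketch below exposes a request counter and a rolling cost-per-1000-requests gauge via `prometheus_client`; the metric names and the node price are illustrative assumptions, not a standard:

```python
from prometheus_client import Counter, Gauge, start_http_server

# Illustrative metric names -- pick whatever fits your naming convention.
REQUESTS = Counter("inference_requests_total", "Total prediction requests served")
COST_PER_1K = Gauge("cost_per_1000_requests_dollars", "Rolling cost per 1000 predictions")

NODE_PRICE_PER_HOUR = 0.15   # assumed Spot price of this replica's node

def handle_request():
    REQUESTS.inc()           # call this from your prediction handler

def update_unit_economics(requests_served, uptime_hours):
    # Recompute periodically (e.g., every minute from a background thread),
    # passing in the request count and uptime tracked by your service.
    dollars_spent = NODE_PRICE_PER_HOUR * uptime_hours
    if requests_served:
        COST_PER_1K.set(dollars_spent / (requests_served / 1000))

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes http://<pod>:9100/metrics
```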
Vendor Comparison: AWS vs GCP vs Azure
| Feature | AWS (SageMaker) | GCP (Vertex AI) | Azure (ML Studio) |
|---|---|---|---|
| Spot Instances | “Spot Instances” (Deep pools, reliable) | “Preemptible VMs” (Cheaper, but hard 24h limit) | “Spot VMs” (Variable eviction policy) |
| Inference Hardware | Inferentia (Custom cheap chips) | TPUs (Fastest for massive models) | Strong partnership with OpenAI/NVIDIA |
| Serverless | Lambda (Good support) | Cloud Run (Excellent container support) | Azure Functions |
| Pricing | Complex, many hidden fees | Per-second billing (very friendly) | Enterprise-focused, bundled deals |
Verdict:
- GCP is often the cheapest for pure compute and easiest to use (K8s native).
- AWS has the most mature ecosystem and hardware options (Inferentia).
- Azure is best if you are already a Microsoft shop.
Green AI: The Hidden Cost
Cost isn’t just money. It’s carbon. Training a single large Transformer model can emit as much CO2 as 5 cars in their lifetimes.
- Measure: Use tools like `CodeCarbon` to estimate your emissions.
- Optimize: Train in regions with green energy (e.g., Montreal, Oregon) where electricity comes from hydro/wind.
- Impact: Cost optimization usually leads to carbon optimization. Using fewer GPUs means burning less coal.
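A minimal example of the measurement step with the `codecarbon` package; the project name and `run_training()` call are placeholders, and the exact interface may vary slightly between versions:

```python
from codecarbon import EmissionsTracker

def run_training():
    # Placeholder for your actual training loop.
    pass

tracker = EmissionsTracker(project_name="bert-finetune")  # assumed project name
tracker.start()

run_training()

emissions_kg = tracker.stop()   # estimated kg of CO2-equivalent for this run
print(f"Estimated emissions: {emissions_kg:.3f} kg CO2eq")
```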
Future Trends
Where is this field going?
- Neuromorphic Computing: Chips that mimic the human brain (Spiking Neural Networks). They consume milliwatts instead of watts.
- Optical Computing: Using light (photons) instead of electricity (electrons) for matrix multiplication. Potentially 1000x faster and cheaper.
- Federated Learning: Training models on user devices (phones) instead of central servers. Shifts the cost from you to the user (and preserves privacy).
Checklist for Junior Engineers
Before you deploy, ask yourself:
- Do I really need a GPU? Have I benchmarked on a modern CPU?
- Is my model quantized? Can I use INT8?
- Am I using Spot instances? If not, why?
- Is auto-scaling enabled? Or am I paying for idle time?
- Are my logs optimized? Am I logging huge tensors to CloudWatch/Datadog? (This is a hidden cost killer!)
- Is the data in the same region? Check for cross-region transfer fees.
Appendix A: System Design Interview Transcript
Interviewer: “Design a cost-efficient training platform for a startup.”
Candidate: “Okay, let’s start with requirements. How many users? What kind of models?”
Interviewer: “50 data scientists. Training BERT and ResNet models. Budget is tight.”
Candidate: “Understood. I propose a Kubernetes-based architecture on AWS.
- Compute: We will use a mixed cluster.
  - Head Node: On-Demand `m5.large` for the K8s control plane.
  - Notebooks: Spot `t3.medium` instances. If they die, we lose the kernel but data is on EFS.
  - Training: Spot `g4dn.xlarge` instances. We will use the `Volcano` scheduler for batch scheduling.
- Storage:
  - Data: S3 Standard-IA (Infrequent Access) to save money.
  - Checkpoints: S3 Intelligent-Tiering.
  - Scratch Space: Amazon FSx for Lustre (expensive but needed for speed) or just local NVMe on the instances.
- Networking:
  - Keep everything in `us-east-1` to avoid data transfer fees.
  - Use VPC Endpoints for S3 to avoid NAT Gateway charges.”
Interviewer: “How do you handle Spot interruptions during training?”
Candidate: “We will use TorchElastic or Ray Train. These frameworks support fault tolerance. When a Spot node is reclaimed, the job pauses. The K8s autoscaler requests a new Spot node. Once it joins, the job resumes from the last checkpoint stored in S3.”
Interviewer: “What if Spot capacity is unavailable for hours?”
Candidate: “We can implement a ‘Fallback to On-Demand’ policy. If a job is pending for > 1 hour, we spin up an On-Demand instance. It costs more, but it unblocks the team.”
Appendix B: FAQ
Q: Is Serverless always cheaper? A: No. If you have constant high traffic (e.g., 100 requests/sec 24/7), a dedicated instance is cheaper. Serverless is cheaper for “spiky” or low-volume traffic.
Q: Does quantization hurt accuracy? A: Usually < 1% drop for INT8. If you go to INT4, the drop is significant unless you use advanced techniques like QLoRA.
Q: Why is data transfer so expensive? A: Cloud providers charge a premium for “Egress” (data leaving their network). It’s a lock-in mechanism.
Q: What is the best region for cost?
A: us-east-1 (N. Virginia), us-west-2 (Oregon), and eu-west-1 (Ireland) are usually the cheapest and have the most capacity.
Conclusion
Cost optimization is not about being “cheap.” It’s about being efficient. It’s about maximizing the business value extracted from every compute cycle.
By mastering these techniques—Spot instances, quantization, architectural patterns like Cascading—you become a force multiplier for your team. You allow your company to run more experiments, serve more users, and build better products with the same budget.
Key Takeaways:
- Spot Instances are your best friend for batch workloads.
- Quantization (INT8) is the easiest way to slash inference costs.
- Right-sizing hardware (CPU vs GPU) is critical.
- FinOps is an engineering discipline, not just accounting.
Originally published at: arunbaby.com/ml-system-design/0022-cost-optimization-for-ml
If you found this helpful, consider sharing it with others who might benefit.