ML Capacity Planning and Infrastructure Scaling
“Capacity Planning is the art of predicting the future while paying for the present. In ML, it is the difference between a high-growth product and a bankrupt one.”
TL;DR
ML capacity planning balances three pillars: availability, performance (p99 latency), and cost efficiency (targeting over 70% GPU utilization). The core calculation divides peak RPS by per-instance throughput, then adds overhead, but the real insight is that throughput optimization is profit optimization. A model that runs twice as fast cuts your infrastructure bill in half. Predictive auto-scaling, hybrid cloud strategies, and robust failure handling (rate limiting, circuit breakers) separate production systems from expensive guesswork. For more on scaling inference engines, see the ML inference optimization guide and the advanced caching strategies that reduce the load on your GPU fleet.

1. Introduction: The Billion Dollar Guessing Game
2. Higher-Level Goals: The Three Pillars
- Availability: Ensure the system stays up during peak traffic.
- Performance: Maintain the latency budget across the 99th percentile (p99).
- Cost Efficiency: Maximize the utilization of expensive GPU resources (targeting > 70%).
3. High-Level Architecture: The Capacity Lifecycle
3.1 Load Projection (The Input)
- Growth Modeling: Using historical DAU (Daily Active Users) to predict traffic.
- Seasonality: Factoring in time-of-day and weekly cycles.
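Load projection as described above can be sketched as a small function. This is a minimal sketch with hypothetical inputs (the growth rate, requests-per-user figure, and peak multiplier are illustrative assumptions, not values from the text): compound DAU growth forward, convert to average RPS, then apply a peak-hour seasonality multiplier.

```python
def projected_peak_rps(current_dau: float,
                       monthly_growth: float,
                       months_ahead: int,
                       requests_per_user_per_day: float,
                       peak_multiplier: float) -> float:
    """Estimate peak RPS some months from now.

    current_dau: today's daily active users
    monthly_growth: fractional month-over-month growth (0.10 = 10%)
    peak_multiplier: ratio of peak-hour RPS to daily-average RPS
    """
    future_dau = current_dau * (1 + monthly_growth) ** months_ahead
    avg_rps = future_dau * requests_per_user_per_day / 86_400  # seconds per day
    return avg_rps * peak_multiplier

# Example: 5M DAU growing 10%/month, 20 requests/user/day,
# peak hour carries 3x the daily average.
peak = projected_peak_rps(5_000_000, 0.10, 6, 20, 3.0)
```

The output of this projection becomes the `Peak_RPS` input to the provisioning math in section 4.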
3.2 Benchmarking (The Profiler)
- Baseline Throughput: How many requests per second (RPS) can one instance handle at the target latency?
- Resource Saturation: Identifying bottlenecks at specific RPS levels.
3.3 Provisioning (The Output)
- Buffer/Headroom: Leaving ~30% capacity for sudden spikes.
- Auto-Scaling: Dynamically adding/removing nodes based on real-time metrics.
4. Operational Math: Calculating “The Number”
Consider a hypothetical Speech ASR system:
4.1 Variables
- Peak Traffic: 10,000 requests per second (RPS).
- Single Instance Performance: 5 RPS per GPU (with a 200ms p99 latency).
- Overhead: 20% system overhead.
4.2 The Calculation
Required_Nodes = (Peak_RPS / Instance_RPS) * (1 + Overhead)
Required_Nodes = (10,000 / 5) * 1.2 = 2,400 GPUs.
At $2 per hour per GPU, this system costs **$4,800 per hour** to operate. If efficiency drops to 4 RPS per GPU, the fleet grows to 3,000 GPUs and costs jump by $1,200 per hour. This is why Latency Optimization is actually Profit Margin Optimization.
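The calculation above is straightforward to encode. A minimal sketch (the function names are mine; the numbers match the ASR example in this section):

```python
import math

def required_gpus(peak_rps: float, instance_rps: float, overhead: float = 0.20) -> int:
    """Core capacity formula: (Peak_RPS / Instance_RPS) * (1 + Overhead).

    Rounds up, since you cannot provision a fractional GPU.
    """
    return math.ceil((peak_rps / instance_rps) * (1 + overhead))

def hourly_cost(gpus: int, cost_per_gpu_hour: float = 2.0) -> float:
    """Fleet cost per hour at a given GPU-hour price."""
    return gpus * cost_per_gpu_hour

gpus = required_gpus(10_000, 5)           # (10,000 / 5) * 1.2 = 2,400 GPUs
cost = hourly_cost(gpus)                  # 2,400 * $2 = $4,800/hour
gpus_degraded = required_gpus(10_000, 4)  # 3,000 GPUs if throughput drops to 4 RPS
```

Running the degraded case makes the profit-margin point concrete: a 20% throughput loss adds 600 GPUs and $1,200 per hour.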
5. Scaling Strategy: Distributed vs. Vertical
- Vertical Scaling: Upgrading to more powerful GPUs. Best for models with massive memory requirements.
- Horizontal Scaling: Adding more nodes to a cluster. Best for handling massive request volume.
- The Hybrid Approach: Use high-end GPUs for training and cost-effective chips for serving.
6. Real-time Implementation: Auto-scaling and Load Balancing
- Least-Loaded Balancing: Directing requests to workers with the most free resources.
- Predictive Auto-scaling: Scaling up servers in anticipation of known traffic patterns.
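Predictive auto-scaling can be sketched as a schedule-driven replica count. This is a simplified illustration, not a production controller: the hourly traffic profile, warm-up window, and per-instance throughput are assumed values, and it reuses the ~30% headroom buffer from section 3.3.

```python
import math

# Assumed traffic profile learned from historical seasonality (hour -> peak RPS).
HOURLY_PEAK_RPS = {8: 4_000, 12: 10_000, 20: 7_000}
DEFAULT_RPS = 1_000      # assumed off-peak baseline
INSTANCE_RPS = 5         # per-GPU throughput from the benchmark step
HEADROOM = 0.30          # ~30% buffer for unexpected surges
WARMUP_HOURS = 1         # scale up one hour before the predicted spike

def replicas_for_hour(hour: int) -> int:
    """Pre-provision for the traffic expected one warm-up window ahead."""
    expected = HOURLY_PEAK_RPS.get((hour + WARMUP_HOURS) % 24, DEFAULT_RPS)
    return math.ceil(expected / INSTANCE_RPS * (1 + HEADROOM))

# At 11:00 the scaler already provisions for the 12:00 peak of 10,000 RPS.
replicas_for_hour(11)  # -> 2600
```

The key design choice is the look-ahead: reactive scalers respond to load that has already arrived, while this shape of scaler has capacity warm before the spike hits.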
7. Comparative Analysis: Cloud vs. On-Premise
| Metric | Cloud (AWS/GCP) | On-Premise (Owned Hardware) |
|---|---|---|
| CapEx | $0 | Extremely High |
| Agility | High | Low |
| Visibility | Low | High |
| Break-even | Good for startups | Good for established volume |
8. Failure Modes in Infrastructure Scaling
- The “Thundering Herd”: Sudden traffic surges that crash new nodes before they warm up.
- Mitigation: Use Rate Limiting and Circuit Breakers.
- Resource Fragmentation: Inefficient request distribution preventing effective batching.
- Cloud Stockouts: Regional unavailability of specific GPU types.
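One mitigation for the thundering herd is a rate limiter in front of freshly started nodes. Below is a minimal token-bucket sketch (a standard technique; the class and parameters are illustrative, not from a specific library): requests drain tokens, tokens refill at a steady rate, and bursts beyond the bucket's capacity are rejected instead of crashing a cold node.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter.

    Smooths sudden surges so newly added nodes are not overwhelmed
    before they warm up; callers should retry or shed rejected requests.
    """

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the request."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=100, capacity=200)  # 100 RPS sustained, 200-request bursts
if bucket.allow():
    pass  # forward the request to the worker
```

A circuit breaker complements this: where the rate limiter caps inbound volume, the breaker stops sending traffic to a backend that is already failing, giving it time to recover.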
9. Real-World Case Study: Capacity Limits
Major AI companies have had to pause new signups due to physical constraints in data centers and lead times on hardware. Growth is limited by the maximum area of your hardware histogram: if you don’t pre-book capacity, growth hits a hard ceiling.
10. Key Takeaways
- Throughput is a Profit Metric: Making a model twice as fast is equivalent to cutting your infrastructure bill in half.
- Proactive is Cheaper than Reactive: Predictive scaling prevents customer churn.
- The Histogram Connection: Capacity is about finding the “Largest Stable Area” of operation without crossing into saturation.
- Reliability is Scaling: An unreliable agent is often just a slow agent running on saturated hardware.
FAQ
How do you calculate GPU requirements for ML inference?
Divide peak requests per second by single-instance throughput, then multiply by an overhead factor (typically 1.2x). For example, 10,000 RPS at 5 RPS per GPU with 20% overhead requires 2,400 GPUs. At $2 per hour per GPU, this directly translates to operational cost, making throughput optimization a profit margin concern.
What is the difference between vertical and horizontal scaling for ML systems?
Vertical scaling upgrades to more powerful GPUs, which is best for models with massive memory requirements that cannot be easily sharded. Horizontal scaling adds more nodes to a cluster, which is best for handling high request volume with models that fit on a single device. Most production systems use a hybrid approach: high-end GPUs for training and cost-effective chips for serving.
How do you handle sudden traffic spikes in ML infrastructure?
Use rate limiting and circuit breakers to prevent thundering herd problems where sudden surges crash new nodes before they warm up. Implement predictive auto-scaling based on known traffic patterns (time-of-day, weekly cycles) and maintain approximately 30% capacity headroom for unexpected surges.
When should you choose cloud versus on-premise ML infrastructure?
Cloud provides zero CapEx and high agility, making it ideal for startups and variable workloads. On-premise offers high visibility and better break-even economics for established companies with predictable, high-volume workloads. The decision often comes down to whether your GPU utilization is consistent enough to justify the upfront capital investment.
Originally published at: arunbaby.com/ml-system-design/0057-capacity-planning
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch