Resource Partitioning in ML Clusters
“How to share a supercomputer without stepping on each other’s toes.”
1. The Problem: Multi-Tenant ML Clusters
Training Large Language Models (LLMs) requires massive compute clusters (e.g., 10k H100s).
- Users: Research, Product, Data Science.
- Workloads:
- Training: Long-running (weeks), gang-scheduled, fault-intolerant.
- Inference: Latency-sensitive, high availability, bursty.
- Dev/Notebooks: Interactive, low utilization.
Goal: Maximize utilization while ensuring fairness and isolation.
2. Partitioning Strategies
1. Static Partitioning
- Concept: Dedicate physical nodes to specific teams.
- Team A gets Rack 1-10.
- Team B gets Rack 11-20.
- Pros: Perfect isolation. No “noisy neighbors”.
- Cons: Low utilization. If Team A is sleeping, their GPUs sit idle.
2. Dynamic Partitioning (Cluster Autoscaling)
- Concept: A shared pool of resources. Jobs request what they need.
- Scheduler: Decides placement based on availability and priority.
- Pros: High utilization.
- Cons: Complex scheduling, potential for interference.
3. Kubernetes Resource Management
Kubernetes (K8s) is the de facto OS for ML clusters.
Requests vs. Limits
- Request: “I need at least 4 CPUs.” (Used for scheduling).
- Limit: “Kill me if I use more than 8 CPUs.” (Used for isolation/throttling).
- Best Practice: For training, set Request == Limit (Guaranteed QoS).
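A minimal sketch of such a Guaranteed-QoS pod using the official `kubernetes` Python client; the pod name, image, and `team-nlp` namespace are illustrative assumptions:

```python
# Sketch: requests == limits for every resource puts the pod in the Guaranteed QoS class.
from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={"cpu": "4", "memory": "32Gi", "nvidia.com/gpu": "1"},
    limits={"cpu": "4", "memory": "32Gi", "nvidia.com/gpu": "1"},   # == requests
)
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="trainer-0", namespace="team-nlp"),  # illustrative names
    spec=client.V1PodSpec(
        containers=[client.V1Container(name="trainer", image="my-trainer:latest",
                                       resources=resources)],
        restart_policy="Never",
    ),
)
# client.CoreV1Api().create_namespaced_pod("team-nlp", pod)  # requires cluster credentials
```

Because requests equal limits, the kubelet treats this pod as Guaranteed: it is not CPU-throttled below its request and is the last to be evicted under node pressure.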
Namespaces & Quotas
- Namespace: Logical partition (e.g., team-nlp, team-vision).
- ResourceQuota: Hard limit on aggregate resource usage per namespace.
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
spec:
  hard:
    requests.nvidia.com/gpu: "100"
```
4. GPU Partitioning
GPUs are expensive. We need to slice them.
1. Multi-Instance GPU (MIG)
- Hardware Support: NVIDIA A100/H100.
- Mechanism: Physically partitions the GPU into up to 7 isolated instances (Compute + Memory).
- Use Case: Inference, small model training.
- Isolation: Strong (Hardware-level).
2. Time-Slicing (MPS)
- Mechanism: Multiple processes share the GPU. Plain time-slicing switches contexts so kernels from different processes are interleaved; MPS (Multi-Process Service) funnels them through one server process so their kernels can execute concurrently.
- Use Case: Dev notebooks.
- Isolation: Weak. One process can crash the GPU or hog memory bandwidth.
3. Virtualization (vGPU)
- Mechanism: Hypervisor manages GPU access.
- Use Case: Cloud providers (AWS EC2 instances).
5. Scheduling Algorithms
How do we decide who goes next?
1. FIFO (First-In, First-Out)
- Simple.
- Problem: Head-of-line blocking. A small job waits behind a massive 1-week training job.
2. Dominant Resource Fairness (DRF)
- Generalization of Max-Min Fairness for multi-dimensional resources (CPU, RAM, GPU).
- Idea: Equalize the “dominant share” of resources across users.
- If User A uses lots of CPU and User B uses lots of RAM, DRF balances their dominant usage (a toy sketch follows after this list).
3. Gang Scheduling (All-or-Nothing)
- Critical for Distributed Training.
- If a job needs 100 GPUs, it must get all 100 at once.
- If it gets 99, it waits (holding the 99 hostage).
- Coscheduling: K8s plugins (Volcano, Kueue) ensure gang scheduling.
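A toy sketch of the DRF idea, with made-up cluster totals and per-user usage (not a real scheduler implementation):

```python
# Minimal Dominant Resource Fairness (DRF) sketch with illustrative numbers.
TOTAL = {"cpu": 900, "ram_gb": 1800, "gpu": 64}

def dominant_share(usage):
    # A user's dominant share is their largest fractional usage across resource types.
    return max(usage[r] / TOTAL[r] for r in TOTAL)

def next_user(users):
    # DRF hands the next allocation to the user with the smallest dominant share.
    return min(users, key=lambda u: dominant_share(users[u]))

users = {
    "A": {"cpu": 300, "ram_gb": 100, "gpu": 4},  # dominant resource: CPU (300/900)
    "B": {"cpu": 50,  "ram_gb": 900, "gpu": 4},  # dominant resource: RAM (900/1800)
}
print({u: round(dominant_share(users[u]), 2) for u in users})  # {'A': 0.33, 'B': 0.5}
print(next_user(users))                                        # 'A' gets the next allocation
```

User B's dominant share (RAM, 0.5) exceeds User A's (CPU, 0.33), so DRF allocates the next task to A even though A holds far more CPU.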
6. Deep Dive: Ray for ML Orchestration
Ray sits on top of K8s and provides Python-native resource management.
- Actors: Stateful workers (e.g., a Parameter Server).
- Tasks: Stateless functions.
- Placement Groups:
```python
import ray
from ray.util.placement_group import placement_group

ray.init()

# Reserve a bundle of resources atomically
pg = placement_group([{"CPU": 1, "GPU": 1}] * 10, strategy="STRICT_PACK")
ray.get(pg.ready())
```
- Strategy:
  - STRICT_PACK: Put everything on the same node (low latency).
  - SPREAD: Distribute across nodes (fault tolerance).
7. Real-World Case Studies
Case Study 1: OpenAI’s Supercomputer
- Scale: Thousands of GPUs.
- Challenge: Training GPT-4.
- Solution: Kubernetes + Azure. Custom scheduler to handle massive gang scheduling and topology-aware placement (minimizing InfiniBand hops).
Case Study 2: Uber Michelangelo
- Workload: Mixed (Training + Inference).
- Solution:
- Training: Mesos (later K8s) with gang scheduling.
- Inference: Dedicated serving cluster with HPA (Horizontal Pod Autoscaler).
- Quota: “Elastic Quota” allows teams to burst into idle capacity but get preempted if the owner returns.
8. Summary
| Level | Technology | Strategy |
|---|---|---|
| Cluster | Kubernetes | Namespaces, Quotas |
| Node | Kubelet | Requests/Limits |
| Device | NVIDIA MIG | Hardware Partitioning |
| Workload | Ray / Volcano | Gang Scheduling |
9. Deep Dive: Kubernetes Scheduler Internals
The K8s scheduler decides which node a pod goes to. It runs in a loop:
- Filtering (Predicates): Remove nodes that don’t fit.
  - PodFitsResources: Does the node have enough CPU/GPU?
  - PodFitsHostPorts: Is the port available?
  - TaintToleration: Does the pod tolerate the node's taints (e.g., gpu-node=true:NoSchedule)?
- Scoring (Priorities): Rank remaining nodes.
  - LeastRequestedPriority: Spread pods to balance load.
  - MostRequestedPriority: Pack pods to fill nodes (bin packing, good for autoscaling).
  - ImageLocalityPriority: Prefer nodes that already have the Docker image.
- Binding: Assign the pod to the highest-ranked node.
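A toy Python sketch of this filter → score → bind loop; the node/pod dictionaries and the bin-packing score are invented for illustration, not the real scheduler framework API:

```python
# Toy filter -> score -> bind loop with illustrative node/pod shapes.
def schedule(pod, nodes):
    # 1. Filtering: drop nodes that cannot fit the pod or whose taints aren't tolerated.
    feasible = [n for n in nodes
                if n["free_gpu"] >= pod["gpu"]
                and n["free_cpu"] >= pod["cpu"]
                and pod["tolerations"] >= n["taints"]]
    if not feasible:
        return None  # pod stays Pending
    # 2. Scoring: MostRequested-style bin packing -- prefer the fullest feasible node.
    best = min(feasible, key=lambda n: n["free_gpu"] + n["free_cpu"])
    # 3. Binding: assign the pod and update bookkeeping.
    best["free_gpu"] -= pod["gpu"]
    best["free_cpu"] -= pod["cpu"]
    return best["name"]

nodes = [
    {"name": "gpu-node-01", "free_cpu": 32, "free_gpu": 8, "taints": set()},
    {"name": "gpu-node-02", "free_cpu": 8,  "free_gpu": 2, "taints": set()},
]
pod = {"cpu": 4, "gpu": 1, "tolerations": set()}
print(schedule(pod, nodes))  # gpu-node-02 (bin packing picks the fuller node)
```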
Custom Scheduling for ML: Standard K8s scheduler is bad for ML because it schedules one pod at a time. It doesn’t understand “I need 100 GPUs or nothing”.
10. Deep Dive: Gang Scheduling with Volcano
Volcano is a batch system built on K8s.
Architecture:
- Volcano Scheduler: Replaces default K8s scheduler.
- PodGroup: A CRD that groups pods together.
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: tensorflow-job
spec:
  minMember: 10  # Gang size
```
Logic:
- Wait until minMember pods are pending.
- Check if the cluster has resources for all of them.
- If yes, bind all.
- If no, wait (don't partially schedule). This prevents deadlocks where Job A holds 50% of the GPUs, Job B holds the other 50%, and both wait for 100%.
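A minimal sketch of this all-or-nothing check with toy pod and GPU counts (not the actual Volcano implementation):

```python
# Toy all-or-nothing gang admission check.
def try_admit_gang(pending_pods, min_member, free_gpus):
    # 1. Wait until the whole gang is pending.
    if len(pending_pods) < min_member:
        return []
    # 2. Admit only if the cluster can hold *all* members; otherwise bind nothing.
    needed = sum(p["gpu"] for p in pending_pods)
    if needed > free_gpus:
        return []           # wait -- no partial scheduling, no hostage GPUs
    return pending_pods     # bind the whole gang atomically

gang = [{"name": f"worker-{i}", "gpu": 1} for i in range(10)]
print(len(try_admit_gang(gang, min_member=10, free_gpus=8)))   # 0  -> wait
print(len(try_admit_gang(gang, min_member=10, free_gpus=16)))  # 10 -> bind all
```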
11. Deep Dive: GPU Virtualization (vGPU vs MIG)
NVIDIA vGPU (Virtual GPU):
- Software-based. Hypervisor time-slices the GPU.
- Memory: Shared (can oversubscribe).
- Performance: Overhead from context switching.
- Use Case: VDI (Virtual Desktop), Cloud Gaming.
NVIDIA MIG (Multi-Instance GPU):
- Hardware-based. A100/H100 splits into partitions.
- Memory: Dedicated (cannot oversubscribe).
- Performance: Near-native. No context switching overhead between instances.
- Configuration:
  - 1g.5gb: 1 compute slice, 5 GB of GPU memory.
  - 3g.20gb: 3 compute slices, 20 GB of GPU memory.
- Use Case: Inference, lightweight training.
12. Deep Dive: Ray Autoscaler Logic
Ray's autoscaler is demand-driven, unlike the K8s HPA (Horizontal Pod Autoscaler): it scales on pending resource requests rather than on observed metrics like CPU utilization.
Logic:
- Resource Demand: The Ray scheduler sees a task needing {"GPU": 1}, but no node has a free GPU.
- Pending State: The task goes into PENDING.
- Upscaling: The autoscaler calculates: "I need a node of type g4dn.xlarge to satisfy this."
- Launch: It calls the cloud provider API (e.g., AWS EC2) to launch the node.
- Bin Packing: Ray tries to pack tasks onto existing nodes before launching new ones to save money.
Downscaling:
- If a node is idle for idle_timeout_minutes (default 5), terminate it.
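A small sketch of how demand reaches the autoscaler, assuming an existing autoscaling Ray cluster reachable at address="auto":

```python
# Sketch: pending GPU tasks (or an explicit request) drive the Ray autoscaler.
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")  # assumes a running Ray cluster with autoscaling enabled

# A GPU task with no free GPU goes PENDING; the autoscaler provisions a node for it.
@ray.remote(num_gpus=1)
def train_shard(shard_id: int) -> int:
    return shard_id

futures = [train_shard.remote(i) for i in range(4)]

# Optionally, request capacity up front so nodes start launching before tasks arrive.
request_resources(bundles=[{"GPU": 1}] * 4)

print(ray.get(futures))  # blocks until the autoscaler has added enough GPU nodes
```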
13. Deep Dive: Spot Instance Management
Training on spot (preemptible) instances can cut compute cost by roughly 70%, but adds interruption risk.
Strategy:
- Checkpointing: Save model weights to S3 every 10 minutes (a sketch follows after this list).
- Elastic Training (TorchElastic):
- If a node dies, the job pauses.
- Remaining nodes re-rendezvous.
- Training continues with fewer nodes (or waits for replacement).
- Mixed Cluster: Use On-Demand for the “Head Node” (Parameter Server / Scheduler) and Spot for “Worker Nodes”.
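A sketch of the checkpointing piece of this strategy; the S3 bucket name, run id, and local path are illustrative assumptions, and boto3 credentials are assumed to be configured:

```python
# Sketch: periodic checkpoint to S3 so a spot interruption loses at most ~10 minutes of work.
import time
import torch
import boto3

s3 = boto3.client("s3")
BUCKET = "my-training-checkpoints"      # hypothetical bucket
CHECKPOINT_EVERY_SEC = 10 * 60          # every 10 minutes, as in the strategy above

def maybe_checkpoint(model, optimizer, step, last_ckpt_time):
    if time.time() - last_ckpt_time < CHECKPOINT_EVERY_SEC:
        return last_ckpt_time
    path = f"/tmp/ckpt_step_{step}.pt"                      # illustrative local path
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, path)
    s3.upload_file(path, BUCKET, f"run-42/ckpt_step_{step}.pt")  # run id is illustrative
    return time.time()
```

Call `maybe_checkpoint` inside the training loop; on restart, load the latest object under the run prefix before resuming.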
14. Case Study: Meta’s Training Cluster (RSC)
Research SuperCluster (RSC):
- Hardware: 16,000 A100 GPUs.
- Network: InfiniBand (dedicated high-speed network).
- Storage: Pure Storage FlashBlade (high throughput).
Partitioning:
- Physical Isolation: Experiments are physically separated to avoid “noisy neighbor” network congestion.
- Topology-Aware Scheduling: The scheduler knows the physical wiring. It places communicating pods on the same switch to minimize latency.
15. System Design: Multi-Tenant Inference Platform
Scenario: Serve 1000 different models (LLMs, ResNets, BERTs) for 50 teams.
Design:
- Ingress: Nginx/Istio routes request to correct service.
- Model Loading:
- Hot: Top 50 models loaded in GPU memory.
- Warm: Next 200 models in CPU RAM.
- Cold: Rest in S3.
- Partitioning:
- Tier 1 (High Priority): Dedicated GPUs. No sharing.
- Tier 2 (Batch): MIG-partitioned GPUs.
- Tier 3 (Dev): Time-sliced GPUs (MPS).
- Isolation: K8s Namespaces + NetworkPolicies prevent Team A from calling Team B’s model.
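A toy sketch of the hot/warm/cold tiering above, treating GPU memory as an LRU cache of models; the capacities and the load_from_s3 loader are illustrative assumptions:

```python
# Toy hot (GPU) / warm (CPU RAM) / cold (S3) model cache.
from collections import OrderedDict

class ModelCache:
    def __init__(self, gpu_capacity=50, cpu_capacity=200):   # illustrative tier sizes
        self.gpu = OrderedDict()   # hot: resident on GPU
        self.cpu = OrderedDict()   # warm: weights kept in host RAM
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity

    def get(self, model_id, load_from_s3):
        if model_id in self.gpu:                 # hot hit
            self.gpu.move_to_end(model_id)
            return self.gpu[model_id]
        weights = self.cpu.pop(model_id, None)   # warm hit?
        if weights is None:
            weights = load_from_s3(model_id)     # cold: fetch from object storage
        self._evict_if_full()
        self.gpu[model_id] = weights             # promote to hot
        return weights

    def _evict_if_full(self):
        if len(self.gpu) >= self.gpu_capacity:   # demote least-recently-used hot model
            lru_id, lru_weights = self.gpu.popitem(last=False)
            self.cpu[lru_id] = lru_weights
            if len(self.cpu) > self.cpu_capacity:
                self.cpu.popitem(last=False)     # drop coldest from RAM (still in S3)
```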
16. Deep Dive: Preemption and Priority Classes
Problem: High-priority job arrives, but cluster is full.
Solution: Preempt (kill) low-priority jobs.
Kubernetes PriorityClass:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
globalDefault: false
description: "For production inference"
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  priorityClassName: high-priority
  containers:
  - name: model
    image: bert-serving
```
Preemption Logic:
- Scheduler sees high-priority pod can’t fit.
- Finds low-priority pods to evict.
- Sends SIGTERM to those pods.
- Waits for graceful shutdown (default 30s).
- Schedules high-priority pod.
Best Practice:
- Production Inference: Priority 1000.
- Training: Priority 500.
- Dev/Notebooks: Priority 100.
17. Deep Dive: Resource Fragmentation Problem
Scenario: Cluster has 100 GPUs total, but they’re spread across 100 nodes (1 GPU each). A job needs 8 GPUs on the same node (for NVLink). It can’t run.
This is fragmentation.
Solutions:
1. Bin Packing (MostRequestedPriority):
- Pack pods tightly to leave some nodes completely empty.
- Empty nodes can be terminated (autoscaling) or reserved for large jobs.
2. Defragmentation (Descheduler):
- Periodically evict pods from underutilized nodes.
- Re-schedule them to consolidate resources.
3. Topology-Aware Scheduling:
- Prefer nodes with multiple GPUs when scheduling multi-GPU jobs.
Code (Custom Scheduler Plugin):
```go
func (p *GPUTopologyPlugin) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	nodeInfo, err := p.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.AsStatus(err)
	}
	// Prefer nodes with more allocatable GPUs, capped at the framework's max node score (100)
	availableGPUs := nodeInfo.Allocatable.ScalarResources["nvidia.com/gpu"]
	score := availableGPUs * 10
	if score > framework.MaxNodeScore {
		score = framework.MaxNodeScore
	}
	return score, nil
}
```
18. Advanced: NUMA-Aware Scheduling
NUMA (Non-Uniform Memory Access): Modern servers have multiple CPU sockets. Memory attached to Socket 0 is faster for cores on Socket 0.
Problem: If a pod’s CPUs are on Socket 0 but its memory is on Socket 1, performance degrades (cross-socket memory access).
Solution: Topology Manager (K8s 1.18+):
```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: training
    resources:
      requests:
        cpu: "16"
        memory: "64Gi"
      limits:            # requests == limits (Guaranteed QoS) is required for alignment
        cpu: "16"
        memory: "64Gi"
# The kubelet's Topology Manager (configured per node) then keeps the pod's CPUs and
# memory on the same NUMA node.
```
Policies:
- none: No alignment (default).
- best-effort: Try to align, but don't fail if impossible.
- restricted: Only allow if alignment is possible.
- single-numa-node: All resources must be on a single NUMA node.
19. Advanced: Network Topology-Aware Placement
Problem: Distributed training has massive inter-GPU communication (All-Reduce).
Network Hierarchy:
GPU 0 ←→ GPU 1 (NVLink: 600 GB/s)
↓ ↓
Node 0 ←→ Node 1 (InfiniBand: 200 Gb/s ≈ 25 GB/s)
↓ ↓
Rack 0 ←→ Rack 1 (Ethernet: 100 Gb/s ≈ 12.5 GB/s)
Optimization: Place all 8 GPUs of a job on the same node (NVLink) > same rack (IB) > different racks (Ethernet).
Implementation (Volcano Topology Plugin):
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: distributed-training
spec:
  minMember: 8
  queue: default
```
The PodGroup defines the gang; the co-location constraint itself is expressed as pod affinity on the job's pod template (Volcano honors standard Kubernetes affinity):
```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          job: distributed-training
      topologyKey: "topology.kubernetes.io/zone"  # same zone (or a custom rack label)
```
20. Case Study: Google Borg (Predecessor to Kubernetes)
Borg is Google’s internal cluster manager (2003-present).
Key Innovations:
- Allocs (Allocations): Reserve resources for future use.
- “I’ll need 100 GPUs tomorrow at 9am.”
- Quota Reclamation: If Team A’s quota is unused, Team B can borrow it (but gets preempted when A returns).
- Borgmaster: Centralized scheduler (handles 10k+ machines).
- Borglet: Agent on each machine (like Kubelet).
Lessons for K8s:
- Declarative API: “I want 10 replicas” (not “start pod 1, start pod 2…”).
- Labels/Selectors: Flexible grouping.
- Reconciliation Loop: Continuously drive actual state toward desired state.
21. Production Monitoring and Debugging
Key Metrics:
- Cluster Utilization:
  - GPU_Utilization = (Allocated_GPUs / Total_GPUs) * 100
  - Target: > 80%.
  - Alert: If < 60% for > 1 hour, investigate.
- Pending Pods:
  - Pods stuck in the Pending state indicate scheduling failures.
  - Reasons: insufficient resources, taints, affinity rules.
- Preemption Rate:
  - How often are low-priority jobs killed?
  - A high rate frustrates users. Consider adding capacity.
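A small sketch of collecting the pending-pod signal with the official `kubernetes` Python client, assuming kubeconfig or in-cluster credentials are available:

```python
# Sketch: list Pending pods and surface the scheduler's reason for each.
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() when running in a pod
v1 = client.CoreV1Api()

pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending").items
print(f"pending pods: {len(pending)}")
for pod in pending:
    # The scheduler records why it failed (e.g., "Insufficient nvidia.com/gpu") in conditions.
    for cond in (pod.status.conditions or []):
        if cond.type == "PodScheduled" and cond.status == "False":
            print(pod.metadata.namespace, pod.metadata.name, cond.reason, cond.message)
```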
Debugging Tools:
# Why is my pod pending?
kubectl describe pod my-pod | grep -A 10 Events
# Common reasons:
# - "Insufficient nvidia.com/gpu"
# - "Node had taints that the pod didn't tolerate"
# - "PodGroup is not ready" (gang scheduling)
# Check node resources
kubectl describe node gpu-node-01 | grep -A 5 Allocated
# Check quota
kubectl describe resourcequota -n team-nlp
22. Common Pitfalls and How to Avoid Them
Pitfall 1: Setting Limits Too High
- limits.memory: 1TB on a 512GB node.
- Result: The pod gets scheduled, then OOMKilled.
- Fix: Set limits close to requests.
Pitfall 2: Forgetting Gang Scheduling
- Distributed training job requests 100 GPUs.
- K8s schedules 99, waits for 1.
- Result: Deadlock (99 GPUs wasted).
- Fix: Use Volcano/Kueue.
Pitfall 3: Ignoring Topology
- 8-GPU job spread across 8 nodes.
- Result: 10x slower (network bottleneck).
- Fix: Use affinity rules or topology-aware scheduler.
Pitfall 4: No Resource Quotas
- One team launches 1000 pods, starves everyone.
- Fix: Enforce a ResourceQuota per namespace.
Pitfall 5: Not Monitoring Fragmentation
- Cluster is “full” but no single node has 8 GPUs.
- Fix: Run descheduler periodically.
23. Advanced: Elastic Training with TorchElastic
Problem: Spot instances can be reclaimed mid-training.
TorchElastic (PyTorch 1.9+):
- Rendezvous: Workers discover each other dynamically.
- Fault Tolerance: If a worker dies, remaining workers re-form the group.
- Elasticity: Can add/remove workers mid-training.
Code:
```python
import torch.distributed as dist
from torch.distributed.elastic.multiprocessing.errors import record

@record
def train():
    dist.init_process_group(backend="nccl")
    # model, dataloader, and optimizer are assumed to be defined (and restored
    # from the latest checkpoint) elsewhere
    for epoch in range(100):
        for batch in dataloader:
            # If a worker dies here, the elastic agent tears the group down,
            # re-runs rendezvous, and restarts workers from the last checkpoint
            optimizer.zero_grad()
            loss = model(batch)
            loss.backward()
            optimizer.step()

if __name__ == "__main__":
    # Launch with torchrun (replaces torch.distributed.launch):
    #   torchrun --nproc_per_node=8 --nnodes=10 train.py
    train()
```
Benefit: Training survives spot interruptions without manual intervention.
24. Interview Questions for Resource Partitioning
Q1: How would you design a scheduler for a multi-tenant GPU cluster? Answer: Use Kubernetes with custom scheduler plugins. Implement:
- Gang scheduling (Volcano) for distributed training
- Priority classes for production vs. dev workloads
- Resource quotas per team/namespace
- Topology-aware placement to minimize network latency
- Preemption for high-priority jobs
Q2: What’s the difference between MIG and time-slicing? Answer:
- MIG: Hardware partitioning. Strong isolation, dedicated memory, no context switching overhead. Only on A100/H100.
- Time-Slicing (MPS): Software multiplexing. Weak isolation, shared memory, context switching overhead. Works on any GPU.
Q3: How do you handle a job that requests 100 GPUs but only 99 are available? Answer: This is the gang scheduling problem. Solutions:
- Use Volcano/Kueue to ensure all-or-nothing scheduling
- Implement backfilling: If a small job can run without blocking the large job, schedule it
- Preemption: Kill low-priority jobs to free resources
Q4: How would you optimize cost for training on spot instances? Answer:
- Checkpointing every 10 minutes to S3
- TorchElastic for fault tolerance
- Mixed cluster: On-demand for head node, spot for workers
- Diversification: Use multiple instance types to reduce interruption probability
Q5: What metrics would you monitor for cluster health? Answer:
- Utilization: GPU/CPU/Memory usage (target >80%)
- Pending pods: Indicates scheduling bottlenecks
- Preemption rate: High rate = users frustrated
- Job completion time: Detect performance degradation
- Network bandwidth: Detect congestion
25. Cost Optimization Strategies
1. Right-Sizing:
- Don’t request 8 GPUs if you only use 4.
- Tool: Profile with nvidia-smi to measure actual usage (see the sketch below).
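A sketch of measuring actual utilization programmatically with pynvml (the library behind nvidia-smi); the 50% threshold is an illustrative assumption, not a standard:

```python
# Sketch: sample per-GPU utilization and memory to spot over-provisioned requests.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # percent over the last sample window
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: util={util.gpu}% mem={mem.used / mem.total:.0%}")
    if util.gpu < 50:  # illustrative threshold
        print(f"GPU {i} looks over-provisioned; consider fewer GPUs or a MIG slice")
pynvml.nvmlShutdown()
```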
2. Spot Instance Strategies:
```python
# Illustrative pseudocode: launch_spot_instance and get_on_demand_price are
# hypothetical helpers wrapping the cloud provider's API (e.g., boto3)

# Diversify across instance types
instance_types = ['p3.8xlarge', 'p3.16xlarge', 'p4d.24xlarge']

# Bid strategy: max price = On-Demand price
for instance_type in instance_types:
    launch_spot_instance(
        instance_type=instance_type,
        max_price=get_on_demand_price(instance_type)
    )
```
3. Autoscaling Policies:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  # Note: metrics-server only exposes cpu/memory as Resource metrics; scaling on GPU
  # utilization in practice requires a custom metrics pipeline (e.g., DCGM exporter +
  # Prometheus adapter).
  - type: Resource
    resource:
      name: nvidia.com/gpu
      target:
        type: Utilization
        averageUtilization: 70
```
4. Idle Resource Reclamation:
- Descheduler: Evict pods from underutilized nodes.
- Cluster Autoscaler: Terminate empty nodes after 10 minutes.
5. Batch Job Scheduling:
- Run low-priority batch jobs during off-peak hours (nights, weekends).
- Savings: 50-70% by using cheaper spot instances.
Cost Breakdown Example:
Training GPT-3 (175B params):
- Hardware: 10,000 V100s for 1 month
- On-Demand: $3/hr/GPU × 10,000 × 720 hrs = $21.6M
- Spot (70% discount): $6.5M
- Savings: $15.1M
26. Ethical Considerations
1. Energy Consumption:
- Training GPT-3 consumed ~1,287 MWh (equivalent to 120 US homes for a year).
- Mitigation:
- Use renewable energy data centers (Google, AWS have carbon-neutral regions)
- Efficient architectures (MoE, sparse models)
- Model distillation (train large, deploy small)
2. Access Inequality:
- Only large organizations can afford 10k GPU clusters.
- Impact: Concentrates AI research in Big Tech.
- Solutions:
- Public compute grants (NSF, EU HPC)
- Open-source pre-trained models (Hugging Face)
- Federated learning (train on distributed data)
3. Resource Hoarding:
- One team reserves 1000 GPUs “just in case”.
- Fix: Enforce quotas, implement “use it or lose it” policies.
4. Bias in Allocation:
- Prioritizing certain teams/projects over others.
- Transparency: Publish allocation policies, audit logs.
27. Advanced: Multi-Cloud Resource Partitioning
Scenario: Burst to AWS when on-prem cluster is full.
Architecture:
┌──────────────┐
│ On-Prem K8s │ (Primary)
└──────┬───────┘
│
▼
┌──────────────┐
│ Kubefed │ (Federation)
└──────┬───────┘
│
┌───┴────┐
▼ ▼
┌─────┐ ┌─────┐
│ AWS │ │ GCP │ (Burst)
└─────┘ └─────┘
Challenges:
- Data Transfer: Moving training data to cloud is slow.
- Solution: Replicate data to S3/GCS in advance.
- Network Latency: Cross-cloud communication is slow.
- Solution: Use cloud for independent jobs, not distributed training.
- Cost: Egress fees can be expensive.
- Solution: Minimize data movement, use cloud-native storage.
28. Production Deployment Checklist
Before launching a multi-tenant ML cluster:
Infrastructure:
- Set up Kubernetes cluster with GPU support
- Install NVIDIA device plugin
- Configure MIG partitions (if using A100/H100)
- Set up monitoring (Prometheus + Grafana)
- Configure logging (ELK stack or CloudWatch)
Resource Management:
- Define ResourceQuotas for each team/namespace
- Create PriorityClasses (production, training, dev)
- Install gang scheduler (Volcano or Kueue)
- Configure autoscaling policies
- Set up descheduler for defragmentation
Security:
- Enable RBAC (Role-Based Access Control)
- Configure NetworkPolicies for isolation
- Set up Pod Security Policies
- Enable audit logging
- Implement secrets management (Vault or AWS Secrets Manager)
Cost Optimization:
- Enable cluster autoscaler
- Configure spot instance policies
- Set up cost monitoring (Kubecost)
- Implement idle resource reclamation
- Define off-peak batch job schedules
Disaster Recovery:
- Set up automated backups (etcd, persistent volumes)
- Test failover procedures
- Document runbooks for common issues
- Configure alerts for critical failures
- Implement multi-region redundancy (if needed)
29. Further Reading
- “Large-scale cluster management at Google with Borg” (Verma et al., 2015): The Borg paper.
- “Kubernetes Scheduling” (K8s Docs): Official scheduler documentation.
- “Volcano: A Cloud Native Batch System” (Volcano Team): Gang scheduling for ML.
- “Ray: A Distributed Framework for Emerging AI Applications” (Moritz et al., 2018): Ray architecture.
- “NVIDIA Multi-Instance GPU User Guide”: MIG configuration and best practices.
30. Conclusion
Resource partitioning in ML clusters is a balancing act between utilization (pack jobs tightly), fairness (everyone gets their share), and performance (avoid interference). The shift from static partitioning to dynamic, topology-aware scheduling has enabled organizations to train models 10x larger on the same hardware budget. As models grow (GPT-5 will likely need 100k+ GPUs), the challenges of gang scheduling, fault tolerance, and network optimization will only intensify. The future lies in intelligent schedulers that understand model characteristics (memory footprint, communication patterns) and elastic training that adapts to resource availability in real-time.
31. Summary
| Level | Technology | Strategy |
|---|---|---|
| Cluster | Kubernetes | Namespaces, Quotas |
| Node | Kubelet | Requests/Limits |
| Device | NVIDIA MIG | Hardware Partitioning |
| Workload | Ray / Volcano | Gang Scheduling |
| Cost | Spot Instances | Checkpointing, Elasticity |
| Performance | Topology-Aware | NUMA, Network Placement |
Originally published at: arunbaby.com/ml-system-design/0036-resource-partitioning