# GPU Scheduling in Kubernetes: Lessons from Running ML Workloads
Practical lessons from running GPU-heavy ML training and inference workloads on Kubernetes in production.
## Context
We run a mix of training and inference workloads across a cluster with heterogeneous GPU types (A100s for training, T4s for inference). Kubernetes doesn’t make this easy out of the box.
## The Setup
We use the NVIDIA device plugin with custom scheduling, node affinity rules, and priority classes to manage GPU allocation.
```yaml
# Node labels for GPU type routing
apiVersion: v1
kind: Node
metadata:
  labels:
    gpu-type: a100
    gpu-memory: "80Gi"
    workload-class: training
```
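Pods then target a GPU class through node affinity and request GPUs via the device plugin's `nvidia.com/gpu` extended resource. A sketch of what such a pod spec looks like; the pod name, image, and `training-low` priority class are illustrative placeholders, not our actual values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job                     # illustrative name
spec:
  priorityClassName: training-low     # assumed PriorityClass, created separately
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu-type         # matches the node label above
                operator: In
                values: ["a100"]
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 8           # extended resource exposed by the NVIDIA device plugin
```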
## Three Things That Bit Us
1. GPU memory fragmentation. A single pod requesting 1 GPU on an 8-GPU node can block scheduling for a pod that needs all 8. We solved this with bin-packing scheduling and dedicated training nodes.
2. Preemption cascades. High-priority inference pods evicting training jobs caused checkpoint corruption. We added graceful termination handlers and checkpoint-on-eviction logic.
3. Node scaling lag. GPU nodes take 5-7 minutes to become ready. We maintain a warm pool of 2 standby nodes to absorb burst traffic.
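For the fragmentation problem, one way to get bin-packing behavior is the kube-scheduler `NodeResourcesFit` plugin's `MostAllocated` scoring strategy, which prefers nodes that are already heavily allocated. A sketch; the profile name `gpu-binpack` and the weights are assumptions, not our exact config:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: gpu-binpack        # illustrative profile name; pods opt in via schedulerName
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated       # pack pods onto already-busy nodes
            resources:
              - name: nvidia.com/gpu
                weight: 5             # illustrative: weight GPUs above memory
              - name: memory
                weight: 1
```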
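The checkpoint-on-eviction logic hinges on the fact that Kubernetes sends SIGTERM and then waits `terminationGracePeriodSeconds` before SIGKILL. A minimal Python sketch, assuming a simple dict as stand-in for real model/optimizer state; the atomic rename guards against a half-written checkpoint if the grace period runs out mid-write:

```python
import json
import os
import signal
import tempfile

# Hypothetical training state; in a real job this is model/optimizer state.
state = {"step": 0}
checkpoint_path = os.path.join(tempfile.gettempdir(), "ckpt.json")

def save_checkpoint():
    """Write state to a temp file, then rename: the rename is atomic on
    POSIX, so readers never observe a partially written checkpoint."""
    tmp = checkpoint_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, checkpoint_path)

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM on eviction/preemption, then SIGKILL after
    # terminationGracePeriodSeconds; checkpoint now and exit cleanly.
    save_checkpoint()
    raise SystemExit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```

We pair this with a generous `terminationGracePeriodSeconds` on training pods so the checkpoint write actually has time to finish.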
## What Worked
- Separate node pools for training vs. inference
- Custom metrics-based HPA for inference (tokens/second, not CPU)
- Automated spot instance fallback for non-critical training jobs
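The custom-metrics HPA can be expressed with the `autoscaling/v2` API once a custom metrics adapter (e.g. prometheus-adapter) exposes the metric. A sketch; the deployment name, replica bounds, and per-pod target of 500 tokens/second are illustrative, not our production numbers:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa             # illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference               # placeholder deployment name
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: tokens_per_second # served by a custom metrics adapter
        target:
          type: AverageValue
          averageValue: "500"     # illustrative per-pod target
```

Scaling on tokens/second tracks actual serving load; GPU-bound inference pods can saturate long before CPU utilization moves.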
Running ML on Kubernetes is doable, but you’ll fight the scheduler more than you’d like.