# GPU Scheduling in Kubernetes: Lessons from Running ML Workloads
Practical lessons from running GPU-heavy ML training and inference workloads on Kubernetes in production.
## Context
We run a mix of training and inference workloads across a cluster with heterogeneous GPU types (A100s for training, T4s for inference). Kubernetes doesn’t make this easy out of the box.
## The Setup
We use the NVIDIA device plugin with custom scheduling, node affinity rules, and priority classes to manage GPU allocation.
```yaml
# Node labels for GPU type routing
apiVersion: v1
kind: Node
metadata:
  labels:
    gpu-type: a100
    gpu-memory: "80Gi"
    workload-class: training
```
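Pods then target a GPU class through node affinity and request GPUs via the device plugin's `nvidia.com/gpu` extended resource. A sketch of what such a pod spec looks like; the pod name, image, and `training-low` priority class are illustrative placeholders, not our actual values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job                     # illustrative name
spec:
  priorityClassName: training-low     # assumed PriorityClass, created separately
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu-type         # matches the node label above
                operator: In
                values: ["a100"]
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 8           # extended resource exposed by the NVIDIA device plugin
```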
## Three Things That Bit Us
1. GPU memory fragmentation. A single pod requesting 1 GPU on an 8-GPU node can block scheduling for a pod that needs all 8. We solved this with bin-packing scheduling and dedicated training nodes.
2. Preemption cascades. High-priority inference pods evicting training jobs caused checkpoint corruption. We added graceful termination handlers and checkpoint-on-eviction logic.
3. Node scaling lag. GPU nodes take 5-7 minutes to become ready. We maintain a warm pool of 2 standby nodes to absorb burst traffic.
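For the fragmentation problem, one way to get bin-packing behavior is the kube-scheduler `NodeResourcesFit` plugin's `MostAllocated` scoring strategy, which prefers nodes that are already heavily allocated. A sketch; the profile name `gpu-binpack` and the weights are assumptions, not our exact config:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: gpu-binpack        # illustrative profile name; pods opt in via schedulerName
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated       # pack pods onto already-busy nodes
            resources:
              - name: nvidia.com/gpu
                weight: 5             # illustrative: weight GPUs above memory
              - name: memory
                weight: 1
```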
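The checkpoint-on-eviction logic hinges on the fact that Kubernetes sends SIGTERM and then waits `terminationGracePeriodSeconds` before SIGKILL. A minimal Python sketch, assuming a simple dict as stand-in for real model/optimizer state; the atomic rename guards against a half-written checkpoint if the grace period runs out mid-write:

```python
import json
import os
import signal
import tempfile

# Hypothetical training state; in a real job this is model/optimizer state.
state = {"step": 0}
checkpoint_path = os.path.join(tempfile.gettempdir(), "ckpt.json")

def save_checkpoint():
    """Write state to a temp file, then rename: the rename is atomic on
    POSIX, so readers never observe a partially written checkpoint."""
    tmp = checkpoint_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, checkpoint_path)

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM on eviction/preemption, then SIGKILL after
    # terminationGracePeriodSeconds; checkpoint now and exit cleanly.
    save_checkpoint()
    raise SystemExit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```

We pair this with a generous `terminationGracePeriodSeconds` on training pods so the checkpoint write actually has time to finish.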
## What Worked
- Separate node pools for training vs. inference
- Custom metrics-based HPA for inference (tokens/second, not CPU)
- Automated spot instance fallback for non-critical training jobs
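The custom-metrics HPA can be expressed with the `autoscaling/v2` API once a custom metrics adapter (e.g. prometheus-adapter) exposes the metric. A sketch; the deployment name, replica bounds, and per-pod target of 500 tokens/second are illustrative, not our production numbers:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa             # illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference               # placeholder deployment name
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: tokens_per_second # served by a custom metrics adapter
        target:
          type: AverageValue
          averageValue: "500"     # illustrative per-pod target
```

Scaling on tokens/second tracks actual serving load; GPU-bound inference pods can saturate long before CPU utilization moves.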
Running ML on Kubernetes is doable, but you’ll fight the scheduler more than you’d like.