The Kubernetes scheduler is responsible for assigning Pods to Nodes, considering resource requirements, affinity rules, taints/tolerations, and cluster capacity. Understanding the scheduler internals is valuable for system design interviews at companies running large Kubernetes clusters (Airbnb, Shopify, Lyft) and for understanding how distributed resource management systems work at scale. At its core, the scheduler is a constraint-satisfaction engine that must keep making good placement decisions under sustained production load.
Scheduler Architecture: Watch-Bind Loop
The scheduler operates as a reconciliation loop: it watches for unscheduled Pods (Pods with no spec.nodeName set) via the Kubernetes API server watch stream. For each unscheduled Pod, it runs the scheduling algorithm (filter + score nodes), selects the best node, and writes a Binding object to the API server (which sets pod.spec.nodeName). The scheduler is stateless — all state is stored in etcd via the API server. Multiple scheduler replicas can run with leader election: only the leader actively schedules; followers are warm standbys ready to take over within seconds.
// Scheduler high-level loop (simplified)
func (s *Scheduler) Run(ctx context.Context) {
	for {
		// Blocks until a pod needs scheduling.
		pod := s.nextUnscheduledPod()

		// Phase 1: Filter — find feasible nodes.
		feasible := s.filterNodes(pod, s.cache.Snapshot())

		// Phase 2: Score — rank feasible nodes (sorted highest first).
		scored := s.scoreNodes(pod, feasible)

		// Phase 3: Select — pick the highest scorer.
		selectedNode := scored[0].Node

		// Phase 4: Bind — write a Binding, setting pod.spec.nodeName.
		s.bind(pod, selectedNode)
	}
}
Filter Plugins: Feasibility Constraints
Filter plugins eliminate nodes that cannot run the pod. Standard filters:
- NodeResourcesFit: the node must have enough allocatable CPU and memory (requested resources must not exceed node capacity minus already-allocated resources)
- NodeAffinity: match nodeSelector and affinity rules (e.g., the pod must run on nodes labeled region=us-east)
- TaintToleration: the pod must tolerate all node taints
- PodAffinity/AntiAffinity: the pod must (or must not) co-locate with other pods matching a selector
- VolumeBinding: required PersistentVolumes must be available on the node or its zone
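The two most common filters can be sketched in a few lines. This is a simplified illustration, not the real scheduler framework API: the `Node`, `Pod`, and helper names here are hypothetical stand-ins for the framework's `*v1.Pod` and `NodeInfo` types.

```go
package main

import "fmt"

// Simplified stand-ins for the scheduler framework's node/pod views.
type Node struct {
	Name           string
	AllocatableCPU int64 // millicores
	AllocatableMem int64 // bytes
	RequestedCPU   int64 // sum of requests of pods already on the node
	RequestedMem   int64
	Taints         []string
}

type Pod struct {
	Name        string
	CPURequest  int64
	MemRequest  int64
	Tolerations map[string]bool
}

// nodeResourcesFit mirrors NodeResourcesFit: the pod's requests must fit
// within allocatable minus already-requested resources.
func nodeResourcesFit(p Pod, n Node) bool {
	return p.CPURequest <= n.AllocatableCPU-n.RequestedCPU &&
		p.MemRequest <= n.AllocatableMem-n.RequestedMem
}

// taintToleration mirrors TaintToleration: the pod must tolerate every
// taint on the node.
func taintToleration(p Pod, n Node) bool {
	for _, t := range n.Taints {
		if !p.Tolerations[t] {
			return false
		}
	}
	return true
}

// filterNodes returns the feasible subset of nodes for the pod.
func filterNodes(p Pod, nodes []Node) []Node {
	var feasible []Node
	for _, n := range nodes {
		if nodeResourcesFit(p, n) && taintToleration(p, n) {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

func main() {
	nodes := []Node{
		{Name: "node-a", AllocatableCPU: 4000, AllocatableMem: 8 << 30, RequestedCPU: 3500, RequestedMem: 2 << 30},
		{Name: "node-b", AllocatableCPU: 4000, AllocatableMem: 8 << 30, RequestedCPU: 1000, RequestedMem: 1 << 30},
		{Name: "node-c", AllocatableCPU: 8000, AllocatableMem: 16 << 30, Taints: []string{"gpu=true:NoSchedule"}},
	}
	pod := Pod{Name: "web-1", CPURequest: 1000, MemRequest: 1 << 30}
	for _, n := range filterNodes(pod, nodes) {
		// node-a lacks free CPU; node-c has an untolerated taint.
		fmt.Println(n.Name) // node-b
	}
}
```

Note that filters answer a strict yes/no question per node; the real framework short-circuits a node as soon as any filter rejects it, which is what the `&&` in filterNodes captures.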
Score Plugins: Ranking Feasible Nodes
Score plugins rank feasible nodes on a 0-100 scale; the highest scorer is selected. Key scoring plugins:
- LeastAllocated: prefer nodes with the most available resources — spreads load across the cluster
- BalancedAllocation: prefer nodes where CPU and memory usage are balanced (avoid nodes fully allocated on CPU but empty on memory)
- ImageLocality: prefer nodes that already have the pod container image cached (avoids pull latency)
- InterPodAffinity: score nodes that satisfy soft affinity preferences
Plugin scores are weighted and summed — operators can tune weights.
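The LeastAllocated idea reduces to simple arithmetic: score each node by its average fraction of free CPU and memory, mapped onto 0-100. The sketch below uses hypothetical simplified types; the real plugin reads these values from the scheduler's NodeInfo snapshot.

```go
package main

import "fmt"

// Simplified node view; RequestedCPU/Mem include the incoming pod's requests.
type NodeStats struct {
	Name           string
	AllocatableCPU int64 // millicores
	AllocatableMem int64 // bytes
	RequestedCPU   int64
	RequestedMem   int64
}

// leastAllocatedScore mirrors the LeastAllocated plugin: the score is the
// average fraction of free CPU and memory on a 0-100 scale. Emptier nodes
// score higher, which spreads load across the cluster.
func leastAllocatedScore(n NodeStats) int64 {
	cpuScore := (n.AllocatableCPU - n.RequestedCPU) * 100 / n.AllocatableCPU
	memScore := (n.AllocatableMem - n.RequestedMem) * 100 / n.AllocatableMem
	return (cpuScore + memScore) / 2
}

func main() {
	nodes := []NodeStats{
		{Name: "busy", AllocatableCPU: 4000, AllocatableMem: 8 << 30, RequestedCPU: 3000, RequestedMem: 6 << 30},
		{Name: "idle", AllocatableCPU: 4000, AllocatableMem: 8 << 30, RequestedCPU: 1000, RequestedMem: 2 << 30},
	}
	for _, n := range nodes {
		fmt.Printf("%s: %d\n", n.Name, leastAllocatedScore(n))
	}
	// busy: 25, idle: 75 — the scheduler prefers "idle"
}
```

Inverting the formula (free → used) gives the MostAllocated strategy, which bin-packs pods onto fewer nodes; operators pick one or the other depending on whether they optimize for headroom or for scale-down.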
Scheduler Extensibility: Custom Schedulers
The scheduler framework allows custom plugins at each extension point: PreFilter, Filter, PostFilter, PreScore, Score, Reserve, Permit, PreBind, Bind, PostBind. Two common uses:
- ML training workloads: a custom gang-scheduling plugin (Permit phase) holds pods until all pods of a training job can be scheduled simultaneously, preventing partial scheduling that wastes GPU resources
- Spot/preemptible nodes: a custom score plugin steers batch jobs onto spot nodes (cheaper) while keeping latency-sensitive services on on-demand nodes
In both cases the default scheduler is extended, not replaced.
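The gang-scheduling idea can be sketched independently of the framework API. This is a hypothetical, stripped-down illustration: real implementations (e.g., the coscheduling plugin) use a PodGroup CRD and the framework's WaitingPod mechanism with timeouts, but the core logic is just "hold each pod at Permit until the whole gang has reserved nodes."

```go
package main

import "fmt"

// GangPermit is a hypothetical sketch of a Permit-phase plugin for one gang.
type GangPermit struct {
	gangSize int
	reserved map[string]bool // gang members that have passed Reserve
}

// Permit is called after a node has been reserved for the pod. It returns
// "allow" only once every member of the gang holds a reservation; until
// then the pod waits, and its reservation is held rather than released.
func (g *GangPermit) Permit(podName string) string {
	g.reserved[podName] = true
	if len(g.reserved) < g.gangSize {
		return "wait" // real plugins return a Wait status with a timeout
	}
	return "allow" // gang complete: all waiting pods are released together
}

func main() {
	g := &GangPermit{gangSize: 3, reserved: map[string]bool{}}
	fmt.Println(g.Permit("trainer-0")) // wait
	fmt.Println(g.Permit("trainer-1")) // wait
	fmt.Println(g.Permit("trainer-2")) // allow: bind all three at once
}
```

The timeout matters in practice: if the gang never completes (e.g., only 2 of 3 GPUs are free), all reservations must be released so other pods can use the nodes; otherwise a half-scheduled gang deadlocks the cluster.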
Key Interview Discussion Points
- Preemption: if no feasible node exists, the scheduler may evict lower-priority pods to make room — PriorityClass determines eviction order
- Scheduling throughput: the scheduler processes ~150-200 pods/second on large clusters; for higher throughput, run multiple scheduler instances with different profiles or use queue parallelism
- Resource requests vs. limits: the scheduler uses requests (guaranteed) for placement decisions; limits cap runtime usage but are not considered during scheduling
- Cluster autoscaler integration: when no feasible node exists and preemption fails, the cluster autoscaler provisions new nodes — the scheduler then reschedules the pending pod
- Descheduler: periodically re-evaluates existing pod placements and evicts suboptimally placed pods (e.g., after node addition) so the scheduler can place them better
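The preemption point above is worth being able to sketch in an interview. The victim-selection logic below is a hypothetical simplification: evict the lowest-priority pods (strictly below the pending pod's priority) until enough CPU is freed. The real scheduler additionally honors PodDisruptionBudgets and graceful termination, and chooses the node where preemption causes the least disruption.

```go
package main

import (
	"fmt"
	"sort"
)

// RunningPod is a simplified view of a pod already placed on a node.
type RunningPod struct {
	Name     string
	Priority int32 // resolved from the pod's PriorityClass
	CPU      int64 // millicores requested
}

// selectVictims returns the pods to evict from one node so that a pending
// pod needing cpuNeeded millicores can fit, or nil if preemption cannot
// help on this node. Only lower-priority pods may be evicted.
func selectVictims(pods []RunningPod, pendingPriority int32, cpuNeeded int64) []RunningPod {
	var candidates []RunningPod
	for _, p := range pods {
		if p.Priority < pendingPriority {
			candidates = append(candidates, p)
		}
	}
	// Evict lowest-priority pods first.
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].Priority < candidates[j].Priority
	})
	var victims []RunningPod
	var freed int64
	for _, p := range candidates {
		if freed >= cpuNeeded {
			break
		}
		victims = append(victims, p)
		freed += p.CPU
	}
	if freed < cpuNeeded {
		return nil
	}
	return victims
}

func main() {
	pods := []RunningPod{
		{Name: "batch-1", Priority: 0, CPU: 500},
		{Name: "web-1", Priority: 1000, CPU: 1000},
		{Name: "batch-2", Priority: 0, CPU: 700},
	}
	// A pending pod with priority 500 needs 1000m: both batch pods are
	// evicted; web-1 (priority 1000) is untouchable.
	for _, v := range selectVictims(pods, 500, 1000) {
		fmt.Println(v.Name)
	}
}
```

Returning nil when freed < cpuNeeded is the signal that this node is not a viable preemption target, which is exactly the case where the cluster autoscaler (point above) takes over and provisions a new node.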