Low Level Design: Kubernetes Scheduler Internals

What the Scheduler Does

The Kubernetes scheduler is a control plane component with a single job: assign unscheduled pods to nodes. When a pod is created without a nodeName, it sits in the pending state in etcd. The scheduler watches for these pods, runs them through a pipeline of constraints and policies, picks a node, and writes the binding back to the API server. Everything downstream — the kubelet, the container runtime — reacts to that binding.

The scheduler is stateless and pluggable. It keeps a local informer cache of node and pod state, fed by watches on the API server (it never reads etcd directly), and it can be replaced or extended. Multiple schedulers can coexist in the same cluster, with pods opting into a specific scheduler via schedulerName.

The Scheduling Pipeline: Filter, Score, Bind

Every pod goes through three phases:

  • Filter (Predicates): Eliminate nodes that cannot run the pod. Any node failing a single predicate is removed from consideration. The result is a feasible set.
  • Score (Priorities): Rank the remaining feasible nodes. Each scoring function assigns 0–100 points; scores are weighted and summed. The highest-scoring node wins.
  • Bind: Persist the pod-to-node assignment by writing a Binding object to the API server, which updates the pod’s nodeName in etcd. The kubelet on that node detects the change and starts the pod.

If no node passes filtering, the pod stays pending. The scheduler logs the reason (e.g., "0/10 nodes are available: 3 Insufficient cpu, 7 node(s) didn’t match Pod’s node affinity") — these messages are key for debugging stuck pods.
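
The three phases above can be sketched as a single loop. This is an illustrative simplification, not the real kube-scheduler API: `filters`, `scorers`, and the node/pod dicts are hypothetical stand-ins.

```python
def schedule(pod, nodes, filters, scorers):
    """Pick a node for `pod`, or return None if it stays Pending.

    filters: list of predicates (pod, node) -> bool
    scorers: list of (score_fn, weight) pairs; score_fn returns 0-100
    """
    # Filter: one failing predicate removes the node from consideration.
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # pod remains Pending; the reasons would be logged

    # Score: weighted sum across scoring functions; highest total wins.
    def total(node):
        return sum(weight * score(pod, node) for score, weight in scorers)

    return max(feasible, key=total)["name"]
```

The bind step (writing the Binding object) is omitted; in a real scheduler the winner's name would be persisted to the API server.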

Filter Predicates

Filter plugins run in parallel across all nodes. Common predicates:

  • NodeSelector / NodeAffinity: The pod’s nodeSelector or required nodeAffinity terms must match the node’s labels. Hard requirements only — no partial matches allowed.
  • TaintToleration: A node taint (e.g., gpu=true:NoSchedule) blocks all pods that don’t carry a matching toleration. Used to reserve nodes for specialized workloads.
  • ResourceFit: For each resource dimension (CPU, memory, GPU), the scheduler checks: sum of requests of all running pods + new pod’s request <= node’s allocatable capacity. Limits are not considered here — only requests determine schedulability.
  • PodAffinity / PodAntiAffinity: Required affinity rules are enforced as filters. A pod requiring co-location with pods labeled app=cache will be filtered out from any node that has no such pods.
  • VolumeNodeAffinity: PersistentVolumes backed by local storage or zone-restricted cloud disks can only be used on specific nodes. This predicate ensures the pod lands where its volume is accessible.
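
The ResourceFit check described above reduces to simple arithmetic over requests. A minimal sketch, with resource maps as plain dicts (hypothetical shapes, not the real plugin interface):

```python
def resource_fit(pod_requests, node_allocatable, scheduled_requests):
    """For each resource dimension, the sum of requests already scheduled
    on the node plus the new pod's request must fit within allocatable
    capacity. Limits are never consulted here."""
    for resource, request in pod_requests.items():
        used = sum(r.get(resource, 0) for r in scheduled_requests)
        if used + request > node_allocatable.get(resource, 0):
            return False
    return True
```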

Scoring Functions

After filtering, scoring functions determine which feasible node is best. The four most important:

  • LeastRequestedPriority: Prefers nodes with the most free resources (CPU and memory). Score = 100 × (capacity − requested) / capacity. Spreads pods across nodes, maximizing headroom. Good for availability — failure of one node affects fewer pods.
  • MostRequestedPriority: The opposite — prefers nodes that are already heavily loaded. Packs pods onto fewer nodes, leaving others empty. Good for cost efficiency in autoscaled clusters: empty nodes can be terminated.
  • BalancedResourceAllocation: Penalizes nodes where CPU and memory utilization are imbalanced (e.g., 90% CPU used but 10% memory used). Encourages even resource consumption ratios.
  • ImageLocalityPriority: Prefers nodes that already have the pod’s container image cached. Avoids image pull latency. Particularly valuable for large ML model images (>10 GB) where a cold pull takes minutes.

Spreading vs. bin packing is a policy choice. Most production clusters use LeastRequested for stateless services (resilience) and MostRequested for batch or cost-sensitive workloads (efficiency).
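
The spreading and bin-packing scores are mirror images of each other, which a few lines make concrete (scaled to the 0–100 range used per scoring function):

```python
def least_requested(requested, capacity):
    """Spreading: more headroom -> higher score."""
    return int(100 * (capacity - requested) / capacity)

def most_requested(requested, capacity):
    """Bin packing: more utilization -> higher score."""
    return int(100 * requested / capacity)
```

For a node with 8 CPUs and 2 already requested, LeastRequested scores it 75 while MostRequested scores it 25; the two policies rank the same fleet in opposite order.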

Pod Affinity and Anti-Affinity

Affinity rules let you express topology preferences relative to other pods, not just nodes.

  • Pod Affinity: Schedule this pod near pods matching a label selector, within a topology domain (e.g., same node, same zone). Classic use case: co-locate a web server with its Redis cache to minimize network latency.
  • Pod Anti-Affinity: Spread replicas away from each other. Require that no two replicas of app=api land on the same node or same availability zone, so a single failure doesn’t take down multiple replicas.

Rules can be requiredDuringSchedulingIgnoredDuringExecution (hard — enforced as a filter) or preferredDuringSchedulingIgnoredDuringExecution (soft — enforced as a score). The "IgnoredDuringExecution" suffix means running pods are not evicted if conditions change after scheduling.
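
A required anti-affinity rule behaves as a filter, as described above. A sketch under simplified assumptions (label selectors as exact-match dicts, hypothetical pod/node shapes):

```python
def anti_affinity_allows(selector, topology_key, node, existing_pods):
    """Reject the node if any existing pod matching the selector lives
    in the same topology domain (same value of topology_key)."""
    domain = node["labels"].get(topology_key)
    for other in existing_pods:
        same_domain = other["node_labels"].get(topology_key) == domain
        matches = all(other["labels"].get(k) == v
                      for k, v in selector.items())
        if same_domain and matches:
            return False
    return True
```

With topology_key set to kubernetes.io/hostname this spreads per node; with topology.kubernetes.io/zone it spreads per zone.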

Taints and Tolerations

Taints mark a node as special-purpose. A taint has a key, value, and effect:

  • NoSchedule: Don’t schedule pods without a matching toleration.
  • PreferNoSchedule: Soft version — avoid if possible.
  • NoExecute: Evict existing pods without a toleration and block new ones.

Common patterns: GPU nodes are tainted nvidia.com/gpu=present:NoSchedule so only GPU-requesting pods land there. Spot/preemptible nodes are tainted cloud.google.com/gke-spot=true:NoSchedule so only fault-tolerant batch workloads tolerate them. Control plane nodes are tainted node-role.kubernetes.io/control-plane:NoSchedule to keep user workloads off them.
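
Taint matching follows a small set of rules: a toleration matches a taint when the keys agree, the effect agrees (an empty effect matches any), and the operator is either Exists or Equal with a matching value. A rough sketch of that logic (dict shapes are illustrative):

```python
def tolerates(taint, tolerations):
    """True if some toleration matches the taint's key, effect, and
    (for the Equal operator) value."""
    for t in tolerations:
        if t.get("key") and t["key"] != taint["key"]:
            continue
        if t.get("effect") and t["effect"] != taint["effect"]:
            continue
        if t.get("operator", "Equal") == "Equal" and \
                t.get("value") != taint.get("value"):
            continue
        return True
    return False

def node_schedulable(node_taints, tolerations):
    """Every NoSchedule taint must be tolerated for the pod to land."""
    return all(tolerates(t, tolerations)
               for t in node_taints if t["effect"] == "NoSchedule")
```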

Resource Requests vs. Limits

The distinction is critical for understanding scheduling behavior:

  • Request: The amount of CPU/memory the scheduler reserves for the pod on the node. The node’s allocatable capacity is decremented by the sum of all pod requests. A pod is guaranteed at least its requested resources.
  • Limit: The maximum the pod can consume. Enforced at runtime by cgroups (CPU throttling) and the kernel OOM killer (memory). The scheduler ignores limits — a node can be overcommitted on limits.

A pod with no requests is scheduled as if it requests zero resources — it can land on a fully packed node and compete for whatever’s available. Setting requests accurately is essential for meaningful scheduling decisions. The Quality of Service class (Guaranteed, Burstable, BestEffort) is derived from whether requests equal limits, affecting eviction priority under node pressure.
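
The QoS derivation can be sketched as follows. This is a simplification: the real rule for Guaranteed additionally requires both CPU and memory to be set on every container, which this sketch glosses over.

```python
def qos_class(containers):
    """Derive the QoS class from per-container requests/limits dicts.
    Guaranteed: every container has requests equal to its limits.
    BestEffort: no container sets any requests or limits.
    Burstable: everything in between."""
    if not any(c.get("requests") or c.get("limits") for c in containers):
        return "BestEffort"
    if all(c.get("requests") and c.get("requests") == c.get("limits")
           for c in containers):
        return "Guaranteed"
    return "Burstable"
```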

Priority Classes and Preemption

When no node can fit a high-priority pod, the scheduler can evict lower-priority pods to make room. The process:

  • A PriorityClass assigns an integer value (higher = more important). System-critical pods use values >1,000,000,000.
  • The scheduler identifies nodes where evicting lower-priority pods would create enough room for the pending pod.
  • It selects the node that requires the minimum disruption (fewest evictions, lowest-priority victims).
  • Victims receive a graceful termination period before the pending pod is bound.

Preemption is powerful but dangerous — it can cascade if many high-priority pods are pending simultaneously. PodDisruptionBudgets (PDBs) can block eviction of pods that are already at their minimum availability threshold.
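
The "minimum disruption" selection above can be sketched as a comparison over candidate nodes. Here each candidate maps a node name to the priorities of the pods that would have to be evicted (a hypothetical shape, not the real nominator API):

```python
def pick_preemption_node(candidates):
    """candidates: {node_name: [priorities of would-be victims]}.
    Prefer the node with the fewest evictions, breaking ties by the
    lowest maximum victim priority (evict the least important pods)."""
    def disruption(item):
        _, victims = item
        return (len(victims), max(victims, default=0))
    return min(candidates.items(), key=disruption)[0]
```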

Topology Spread Constraints

Topology spread constraints offer fine-grained control over pod distribution across failure domains:

  • topologyKey: The node label defining the domain (e.g., topology.kubernetes.io/zone).
  • maxSkew: Maximum allowed difference in pod count between the most and least loaded domains.
  • whenUnsatisfiable: DoNotSchedule (hard) or ScheduleAnyway (soft).

Example: 3 zones, 9 replicas, maxSkew=1 forces an even 3/3/3 split, since any 4/3/2 layout has a skew of 2. This replaces the older podAntiAffinity zone-spreading pattern with a more expressive and efficient mechanism.
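
The maxSkew check is simple arithmetic over per-domain pod counts. A sketch of which domains remain eligible for the next replica:

```python
def allowed_domains(counts, max_skew):
    """counts: topology domain -> current pod count. Return the domains
    where placing one more pod keeps (max - min) <= maxSkew."""
    ok = []
    for domain in counts:
        after = {**counts, domain: counts[domain] + 1}
        if max(after.values()) - min(after.values()) <= max_skew:
            ok.append(domain)
    return ok
```

With counts of 3/3/2 across three zones and maxSkew=1, only the zone with 2 pods is eligible, which is exactly how the constraint converges on an even spread.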

Gang Scheduling and Custom Schedulers

Gang scheduling addresses ML training and batch jobs where all pods must start together or not at all. The default scheduler places pods individually; if only 7 of 8 GPU workers can be scheduled, the 7 sit idle waiting for the 8th. Gang schedulers (Volcano, Koordinator) hold all pods until the entire group can be placed simultaneously.

Scheduler extenders let you add custom filter/score logic via a webhook without forking the scheduler. The scheduler calls your HTTP endpoint with node and pod data; you return a filtered or scored list.
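
An extender's filter endpoint receives the pod and candidate nodes and returns the subset it accepts. The sketch below shows only the handler logic with simplified lowercase payload keys (the real ExtenderArgs/ExtenderFilterResult field names differ); the GPU-node naming convention is purely hypothetical.

```python
def extender_filter(args):
    """Hypothetical extender filter: if the pod requests a GPU, keep only
    nodes whose names carry a 'gpu-' prefix; otherwise keep all nodes."""
    pod, nodes = args["pod"], args["nodenames"]
    wants_gpu = any(
        "nvidia.com/gpu" in c.get("resources", {}).get("requests", {})
        for c in pod["spec"]["containers"])
    keep = [n for n in nodes if not wants_gpu or n.startswith("gpu-")]
    failed = {n: "no GPU on node" for n in nodes if n not in keep}
    return {"nodenames": keep, "failednodes": failed}
```

In production this function would sit behind an HTTP endpoint registered in the scheduler's extender configuration; note that extenders add a network round trip to every scheduling cycle, which is why the in-process framework plugins below are usually preferred.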

Custom schedulers are standalone deployments implementing the full scheduling loop. Pods opt in via schedulerName: my-custom-scheduler. Used for specialized placement logic that can’t be expressed in standard constraints — e.g., hardware affinity for FPGAs, network topology awareness, or co-location with licensed software.

The scheduler framework (kube-scheduler framework) exposes extension points at every phase (PreFilter, Filter, PostFilter, PreScore, Score, Reserve, Permit, PreBind, Bind, PostBind), making it possible to build sophisticated scheduling behavior as plugins without modifying core code.

Frequently Asked Questions

How does the Kubernetes scheduler filter nodes?

The Filter phase runs each candidate node through a set of filter plugins. Key filters: NodeSelector/NodeAffinity checks node labels match pod requirements. TaintToleration checks that the node’s taints are tolerated by the pod. ResourceFit verifies that the sum of all running pod requests plus the new pod’s request does not exceed node allocatable capacity (CPU, memory, GPU). VolumeNodeAffinity checks that the pod’s PersistentVolumes are accessible from the candidate node. Nodes failing any filter are eliminated from consideration.

How does bin packing differ from spreading in Kubernetes scheduling?

Spreading (LeastRequestedPriority): prefer nodes with the most free resources, distributing pods evenly across the cluster. This improves fault tolerance — a node failure affects fewer pods. Bin packing (MostRequestedPriority): prefer nodes with the least free resources, packing pods densely. This improves resource utilization and allows underloaded nodes to be scaled down (for cloud cost savings). The choice depends on priorities: high availability favors spreading; cost efficiency favors bin packing. Descheduler can rebalance after the fact.

What are pod affinity and anti-affinity rules?

Pod affinity schedules a pod near other pods matching a label selector (e.g., co-locate a web pod with its Redis cache pod on the same node for low latency). Pod anti-affinity does the opposite (e.g., spread replicas of a service across different nodes or availability zones to avoid a single node failure taking all replicas). Both support required rules (hard constraints, pod is not scheduled if violated) and preferred rules (soft constraints, scored but not enforced). Topology key specifies the domain: kubernetes.io/hostname for node-level, topology.kubernetes.io/zone for zone-level.

How does Kubernetes scheduler handle resource overcommitment?

Kubernetes separates resource requests (used for scheduling) from limits (enforced at runtime by cgroups). The scheduler places pods based on requests — a node with 8 CPU may schedule pods totaling 16 CPU requested if limits are higher. Overcommitment allows higher utilization but risks throttling (CPU) or OOM kill (memory) if actual usage exceeds node capacity. QoS classes: Guaranteed (request == limit), Burstable (request < limit), BestEffort (no request/limit) — BestEffort pods are evicted first under memory pressure.

