Resource Scheduler Low-Level Design: Bin Packing, Preemption, and Fairness Queues

Resource Scheduler: Overview and Requirements

A resource scheduler assigns compute resources — CPU, memory, GPU, and storage — to workloads submitted by users or services. It must maximize utilization through bin packing, enforce fairness across competing tenants, support preemption of low-priority jobs, and respect per-tenant resource quotas.

Functional Requirements

  • Accept job submissions with resource requests, priority class, and runtime constraints.
  • Assign each job to a node that has sufficient available capacity.
  • Preempt lower-priority jobs when a higher-priority job cannot be scheduled otherwise.
  • Enforce resource quotas per tenant: reject or queue submissions that would exceed quota.
  • Maintain multiple fairness queues to prevent a single high-volume tenant from starving others.

Non-Functional Requirements

  • Scheduling decisions for batch workloads must complete within 500 ms for clusters up to 10,000 nodes.
  • Interactive or real-time priority jobs must be scheduled within 50 ms.
  • The scheduler must handle node failures by rescheduling affected jobs within 30 seconds.
  • Quota enforcement must be exact: no tenant may exceed its quota even under concurrent submission bursts.

Data Model

  • Job: job_id, tenant_id, priority_class (real-time | high | default | low | best-effort), cpu_request, mem_request_mb, gpu_request, node_selector, status (pending | scheduled | running | preempted | completed | failed), assigned_node_id, submitted_at, started_at, ended_at.
  • Node: node_id, pool, cpu_total, mem_total_mb, gpu_total, cpu_allocated, mem_allocated_mb, gpu_allocated, status (ready | draining | offline), taints, labels.
  • Quota: quota_id, tenant_id, cpu_limit, mem_limit_mb, gpu_limit, cpu_used, mem_used_mb, gpu_used, version.
  • Queue: queue_id, tenant_id, priority_class, depth, head_job_id, weighted_share.

Bin Packing Algorithm

Bin packing maximizes node utilization by preferring nodes that are already partially full, leaving empty nodes available for large jobs that require contiguous resources.

  • For each pending job, the scheduler computes a candidate set of nodes that satisfy the resource request and node selector constraints.
  • Candidates are scored using a weighted sum of resource fit scores. The fit score for a resource dimension is computed as the ratio of the job request to the remaining capacity on that node. Higher scores indicate a tighter fit.
  • The node with the highest aggregate fit score is selected. This is a best-fit heuristic applied across multiple dimensions simultaneously.
  • When no candidate node has sufficient free capacity, the scheduler checks whether preemption can create room.

The candidate set computation uses a node index backed by a sorted structure per resource dimension, allowing O(log N) filtering rather than a full O(N) scan of all nodes.
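The scoring and selection steps above can be sketched as follows. This assumes dict-shaped jobs and nodes; the field names and the weight values are illustrative, not from the design:

```python
def fit_score(job, node, weights=(0.5, 0.4, 0.1)):
    """Weighted sum of per-dimension fit ratios (request / remaining capacity).

    A ratio near 1.0 means the job nearly fills the node's remaining capacity
    in that dimension, i.e. a tight fit. Returns -1.0 if the job does not fit.
    """
    dims = [
        (job["cpu"], node["cpu_total"] - node["cpu_alloc"]),
        (job["mem"], node["mem_total"] - node["mem_alloc"]),
        (job["gpu"], node["gpu_total"] - node["gpu_alloc"]),
    ]
    score = 0.0
    for w, (req, free) in zip(weights, dims):
        if req > free:
            return -1.0                  # insufficient capacity in this dimension
        if free > 0:
            score += w * (req / free)    # tighter fit -> higher contribution
    return score

def pick_node(job, candidates):
    """Best-fit: choose the candidate with the highest aggregate fit score."""
    scored = [(fit_score(job, n), n) for n in candidates]
    scored = [(s, n) for s, n in scored if s >= 0]
    return max(scored, key=lambda sn: sn[0])[1] if scored else None
```

Note that the half-full node wins over an empty one for a small job, which is exactly the bin-packing behavior described above: empty nodes stay free for large jobs.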

Preemption Policy

Preemption evicts running jobs to free resources for a higher-priority job. The policy minimizes disruption by selecting the smallest set of victims that collectively free sufficient resources:

  • Victims must have a lower priority class than the preemptor.
  • Among eligible victims on a given node, the scheduler selects those with the shortest elapsed runtime first, minimizing the amount of already-invested compute that is thrown away.
  • Preempted jobs are moved to status preempted and re-enqueued at the head of their priority queue, with a preemption backoff delay before they become schedulable again; this prevents cascading preemption storms.
  • The victim job receives a termination signal followed by a grace period, configurable per priority class, before forcible termination.
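The victim-selection rules above can be sketched as a greedy pass, assuming victims are ordered by priority class and then by elapsed runtime. The priority ranking, the dict field names, and the CPU/memory-only accounting are illustrative assumptions:

```python
PRIORITY_RANK = {"best-effort": 0, "low": 1, "default": 2, "high": 3, "real-time": 4}

def select_victims(preemptor, running, need_cpu, need_mem):
    """Greedily pick a small victim set on one node that frees enough CPU and
    memory for the preemptor.

    Eligible victims have a strictly lower priority class; among them, jobs
    with the least elapsed runtime go first, protecting invested compute.
    Returns the victim list, or None if preemption cannot create room.
    """
    eligible = [j for j in running
                if PRIORITY_RANK[j["priority"]] < PRIORITY_RANK[preemptor["priority"]]]
    eligible.sort(key=lambda j: (PRIORITY_RANK[j["priority"]], j["elapsed_s"]))
    victims, freed_cpu, freed_mem = [], 0.0, 0
    for j in eligible:
        if freed_cpu >= need_cpu and freed_mem >= need_mem:
            break                        # already freed enough; stop evicting
        victims.append(j)
        freed_cpu += j["cpu"]
        freed_mem += j["mem"]
    if freed_cpu >= need_cpu and freed_mem >= need_mem:
        return victims
    return None                          # preemption cannot free enough here
```

A greedy pass approximates, but does not guarantee, the minimum victim set; an exact minimum would require a knapsack-style search that is usually not worth the latency.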

Multi-Queue Fairness

Fair scheduling prevents a single high-volume tenant from monopolizing the cluster. The scheduler uses a Dominant Resource Fairness (DRF) algorithm:

  • Each tenant has a weighted share of cluster capacity proportional to its subscription tier.
  • At each scheduling cycle, the scheduler selects the job from the tenant whose dominant resource share — the maximum of its CPU, memory, and GPU consumption ratios — is currently lowest.
  • This ensures that no tenant consumes more than its fair share of any single resource dimension over time.
  • Queue weights can be adjusted dynamically via the admin API without restarting the scheduler.
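The DRF selection rule above reduces to a few lines. This is a minimal sketch assuming usage and capacity are dicts keyed by resource dimension, with weights applied by dividing each tenant's dominant share by its weight (a common weighted-DRF formulation, assumed here rather than specified by the design):

```python
def dominant_share(tenant_usage, cluster_capacity):
    """A tenant's dominant share: the maximum consumption ratio across dimensions."""
    return max(tenant_usage[d] / cluster_capacity[d] for d in cluster_capacity)

def next_tenant(usages, capacity, weights):
    """Pick the tenant with the lowest weight-adjusted dominant share.

    A higher weight (larger subscription tier) shrinks the adjusted share,
    so that tenant is scheduled more often before others catch up.
    """
    return min(usages,
               key=lambda t: dominant_share(usages[t], capacity) / weights.get(t, 1.0))
```

In production this min is maintained incrementally with a priority queue rather than recomputed per cycle, matching the O(log N) budget of the scheduling loop.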

Quota Enforcement

Quotas are enforced at submission time using optimistic locking on the Quota record:

  • The scheduler reads the current Quota row and checks that cpu_used + job.cpu_request does not exceed cpu_limit, and similarly for memory and GPU.
  • If within quota, the scheduler issues UPDATE quotas SET cpu_used = cpu_used + :delta, version = version + 1 WHERE tenant_id = :tid AND version = :v. A row-count check detects concurrent modifications; the scheduler retries up to three times before returning a quota-exceeded error.
  • On job completion or failure, quota usage is decremented in the same atomic update pattern.
  • Soft quotas can be configured to allow bursting up to 150% of the hard limit for jobs in the low priority class, subject to available cluster capacity.
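The optimistic-locking flow above can be sketched against SQLite for a single resource dimension. The table shape, function name, and retry bound of three follow the description; everything else is an illustrative assumption:

```python
import sqlite3

def reserve_quota(conn, tenant_id, cpu_delta, retries=3):
    """Optimistically increment cpu_used, guarded by the version column.

    The UPDATE only matches if no concurrent writer bumped the version since
    our read; a zero row count means we lost the race and must retry.
    """
    for _ in range(retries):
        used, limit, version = conn.execute(
            "SELECT cpu_used, cpu_limit, version FROM quotas WHERE tenant_id = ?",
            (tenant_id,)).fetchone()
        if used + cpu_delta > limit:
            return False                  # would exceed quota: reject or queue
        cur = conn.execute(
            "UPDATE quotas SET cpu_used = cpu_used + ?, version = version + 1 "
            "WHERE tenant_id = ? AND version = ?",
            (cpu_delta, tenant_id, version))
        if cur.rowcount == 1:
            conn.commit()
            return True                   # reservation succeeded
    return False                          # lost the race on every retry
```

Job completion runs the same pattern with a negative delta, so usage accounting stays exact even under concurrent submission bursts.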

API Design

  • POST /jobs — submit a job with resource requests, priority class, and constraints.
  • GET /jobs/{job_id} — get job status, assigned node, and scheduling events.
  • DELETE /jobs/{job_id} — cancel a pending or running job.
  • GET /nodes — list nodes with allocated and available resources per dimension.
  • GET /tenants/{tenant_id}/quota — get current quota usage and limits.
  • PATCH /tenants/{tenant_id}/quota — update quota limits (admin only).
  • GET /queues — view queue depths, weighted shares, and head-of-line jobs per tenant and priority class.

Scalability and Observability

For large clusters the scheduler maintains an in-memory node snapshot updated via a node heartbeat stream rather than querying the database on every scheduling cycle. Scheduling decisions are made against the snapshot; the database is updated asynchronously. A reconciliation loop corrects any drift between the snapshot and the database every 10 seconds.
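One pass of the reconciliation loop can be sketched as a dict diff, assuming the database is treated as the source of truth for the periodic pass (the design leaves the correction direction unspecified) and that allocations are summarized per node:

```python
def reconcile(snapshot, db_rows):
    """One reconciliation pass: diff the in-memory node snapshot against the
    database's node records and correct the snapshot in place.

    Returns the corrections applied, with None marking a node that vanished
    from the database (e.g. decommissioned).
    """
    corrections = {}
    for node_id, db_alloc in db_rows.items():
        if snapshot.get(node_id) != db_alloc:
            corrections[node_id] = db_alloc
            snapshot[node_id] = db_alloc
    for node_id in snapshot.keys() - db_rows.keys():
        corrections[node_id] = None       # node no longer in the database
        del snapshot[node_id]
    return corrections
```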

Key metrics: scheduling latency by priority class, bin packing efficiency (mean node utilization), preemption rate per priority pair, quota rejection rate per tenant, and queue depth per tenant and priority class. Alerts fire when scheduling latency for real-time jobs exceeds 100 ms or when any node is offline for more than 60 seconds.

Frequently Asked Questions

How does multi-dimensional bin packing apply to resource scheduling?

Each node is a bin with a capacity vector (CPU, memory, GPU, network bandwidth), and each workload is an item with a corresponding request vector. The scheduler solves a variant of vector bin packing: it scores candidate nodes by a dot-product fit metric (e.g., most-requested or least-requested policies) and picks the best fit. Because optimal bin packing is NP-hard, production schedulers use greedy heuristics with priority queues, tolerating slight sub-optimality in exchange for O(n log n) scheduling latency.

What preemption policies are used when a high-priority workload can't be scheduled?

When no node has sufficient free capacity, the scheduler identifies the minimum set of lower-priority pods to evict. Common policies: preempt the fewest pods (minimize disruption), preempt pods with the lowest priority class first, and among equals prefer pods that have run the shortest time (protecting long-running jobs). Preempted workloads are placed back on the pending queue with a grace period. Preemption is gated by PodDisruptionBudgets to avoid taking down an entire deployment.

How do Dominant Resource Fairness (DRF) queues work in a multi-tenant scheduler?

DRF computes each tenant's dominant share: the maximum of their allocated fraction across all resource dimensions. The scheduler always picks the runnable task belonging to the tenant with the smallest dominant share, enforcing max-min fairness across heterogeneous workloads. This prevents a CPU-heavy tenant from starving a memory-heavy one. DRF is implemented as a priority queue sorted by dominant share, updated incrementally as tasks start and finish.

How is resource quota enforcement implemented without a central bottleneck?

Quotas are stored per namespace in a quota table with soft and hard limits. Before admitting a workload, the scheduler runs an admission check that atomically reads current usage (aggregated from a usage_ledger table) and compares against the hard limit. Usage is incremented optimistically; if the resulting sum exceeds the quota, the request is rejected. A reconciliation job periodically recomputes actual usage from running workloads and corrects drift, decoupling the hot admission path from expensive full scans.
