Resource Scheduler: Overview and Requirements
A resource scheduler assigns compute resources — CPU, memory, GPU, and storage — to workloads submitted by users or services. It must maximize utilization through bin packing, enforce fairness across competing tenants, support preemption of low-priority jobs, and respect per-tenant resource quotas.
Functional Requirements
- Accept job submissions with resource requests, priority class, and runtime constraints.
- Assign each job to a node that has sufficient available capacity.
- Preempt lower-priority jobs when a higher-priority job cannot be scheduled otherwise.
- Enforce resource quotas per tenant: reject or queue submissions that would exceed quota.
- Maintain multiple fairness queues to prevent a single high-volume tenant from starving others.
Non-Functional Requirements
- Scheduling decisions for batch workloads must complete within 500 ms for clusters up to 10,000 nodes.
- Interactive or real-time priority jobs must be scheduled within 50 ms.
- The scheduler must handle node failures by rescheduling affected jobs within 30 seconds.
- Quota enforcement must be exact: no tenant may exceed its quota even under concurrent submission bursts.
Data Model
- Job: job_id, tenant_id, priority_class (real-time | high | default | low | best-effort), cpu_request, mem_request_mb, gpu_request, node_selector, status (pending | scheduled | running | preempted | completed | failed), assigned_node_id, submitted_at, started_at, ended_at.
- Node: node_id, pool, cpu_total, mem_total_mb, gpu_total, cpu_allocated, mem_allocated_mb, gpu_allocated, status (ready | draining | offline), taints, labels.
- Quota: quota_id, tenant_id, cpu_limit, mem_limit_mb, gpu_limit, cpu_used, mem_used_mb, gpu_used, version.
- Queue: queue_id, tenant_id, priority_class, depth, head_job_id, weighted_share.
Bin Packing Algorithm
Bin packing maximizes node utilization by preferring nodes that are already partially full, leaving empty nodes available for large jobs that require contiguous resources.
- For each pending job, the scheduler computes a candidate set of nodes that satisfy the resource request and node selector constraints.
- Candidates are scored using a weighted sum of resource fit scores. The fit score for a resource dimension is computed as the ratio of the job request to the remaining capacity on that node. Higher scores indicate a tighter fit.
- The node with the highest aggregate fit score is selected. This is equivalent to first-fit-decreasing applied across multiple dimensions simultaneously.
- When no candidate node has sufficient free capacity, the scheduler checks whether preemption can create room.
The candidate set computation uses a node index backed by a sorted structure per resource dimension, allowing O(log N) filtering rather than a full O(N) scan of all nodes.
Preemption Policy
Preemption evicts running jobs to free resources for a higher-priority job. The policy minimizes disruption by selecting the smallest set of victims that collectively free sufficient resources:
- Victims must have a lower priority class than the preemptor.
- Among eligible victims on a given node, the scheduler selects those with the shortest remaining estimated runtime first, minimizing wasted compute.
- Preempted jobs are moved to status preempted and re-enqueued at the head of their priority queue with a preemption backoff penalty to prevent cascading preemption storms.
- A grace period is sent to the victim job before forcible termination, configurable per priority class.
Multi-Queue Fairness
Fair scheduling prevents a single high-volume tenant from monopolizing the cluster. The scheduler uses a Dominant Resource Fairness (DRF) algorithm:
- Each tenant has a weighted share of cluster capacity proportional to its subscription tier.
- At each scheduling cycle, the scheduler selects the job from the tenant whose dominant resource share — the maximum of its CPU, memory, and GPU consumption ratios — is currently lowest.
- This ensures that no tenant consumes more than its fair share of any single resource dimension over time.
- Queue weights can be adjusted dynamically via the admin API without restarting the scheduler.
Quota Enforcement
Quotas are enforced at submission time using optimistic locking on the Quota record:
- The scheduler reads the current Quota row and checks that cpu_used + job.cpu_request does not exceed cpu_limit, and similarly for memory and GPU.
- If within quota, the scheduler issues an UPDATE quotas SET cpu_used = cpu_used + :delta WHERE tenant_id = :tid AND version = :v. A row-count check detects concurrent modifications and retries up to three times before returning a quota-exceeded error.
- On job completion or failure, quota usage is decremented in the same atomic update pattern.
- Soft quotas can be configured to allow bursting up to 150% of the hard limit for jobs in the low priority class, subject to available cluster capacity.
API Design
POST /jobs— submit a job with resource requests, priority class, and constraints.GET /jobs/{job_id}— get job status, assigned node, and scheduling events.DELETE /jobs/{job_id}— cancel a pending or running job.GET /nodes— list nodes with allocated and available resources per dimension.GET /tenants/{tenant_id}/quota— get current quota usage and limits.PATCH /tenants/{tenant_id}/quota— update quota limits (admin only).GET /queues— view queue depths, weighted shares, and head-of-line jobs per tenant and priority class.
Scalability and Observability
For large clusters the scheduler maintains an in-memory node snapshot updated via a node heartbeat stream rather than querying the database on every scheduling cycle. Scheduling decisions are made against the snapshot; the database is updated asynchronously. A reconciliation loop corrects any drift between the snapshot and the database every 10 seconds.
Key metrics: scheduling latency by priority class, bin packing efficiency (mean node utilization), preemption rate per priority pair, quota rejection rate per tenant, and queue depth per tenant and priority class. Alerts fire when scheduling latency for real-time jobs exceeds 100 ms or when any node is offline for more than 60 seconds.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture