Fair Server Scheduling: Round Robin, Weighted Fair Queueing, and Modern Approaches
“Design a server that processes a fair number of functions per client” is a classic systems-design and concurrency interview question. It maps to a real production concern: under bursty traffic, how do you ensure no single client (or job type) starves the others? This guide covers the standard scheduling algorithms — round robin, weighted fair queueing, deficit round robin, and modern variants — with working code, the trade-offs between them, and the practical considerations interviewers ask about.
The Problem
A server receives requests from many clients. Some clients submit many requests; others submit few. You want to:
- Process all requests without starving any client
- Bound the maximum delay any client experiences
- Optionally, give some clients more capacity than others (priority / weighting)
- Handle dynamic changes (clients arrive and leave)
Approach 1: Round Robin
Maintain a queue of clients with pending work. Process one item from each client in turn; cycle back to the start.
from collections import deque, defaultdict
from threading import Lock

class RoundRobinScheduler:
    def __init__(self):
        self.queues = defaultdict(deque)   # client_id -> queue of jobs
        self.client_order = deque()        # order to visit clients
        self.lock = Lock()

    def submit(self, client_id, job):
        with self.lock:
            if not self.queues[client_id]:
                self.client_order.append(client_id)  # first pending job: join the rotation
            self.queues[client_id].append(job)

    def next_job(self):
        with self.lock:
            while self.client_order:
                client_id = self.client_order.popleft()
                if self.queues[client_id]:
                    job = self.queues[client_id].popleft()
                    if self.queues[client_id]:
                        self.client_order.append(client_id)  # reschedule
                    return client_id, job
            return None, None
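A quick usage sketch (client names are illustrative): even though client "a" has three times the backlog, client "b" is served on the first full turn.

rr = RoundRobinScheduler()
for i in range(3):
    rr.submit("a", f"a{i}")
rr.submit("b", "b0")

print([rr.next_job() for _ in range(4)])
# [('a', 'a0'), ('b', 'b0'), ('a', 'a1'), ('a', 'a2')]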
Properties: Each client gets an equal share, delay is bounded (each round visits every active client), and it is simple to implement.
Limitation: It treats all clients identically. If client A submits 100 jobs and client B submits 1, each still gets one slot per round, so A's backlog drains slowly and there is no way to give any client (say, a premium tier) a larger share. The weighted variants below address this.
Approach 2: Weighted Round Robin
Each client has a weight; processing visits each client a number of times proportional to its weight per round.
class WeightedRoundRobin:
    def __init__(self):
        self.queues = defaultdict(deque)
        self.weights = {}   # client_id -> weight
        self.credits = {}   # client_id -> remaining credits this round
        self.client_order = deque()
        self.lock = Lock()

    def submit(self, client_id, job, weight=1):
        with self.lock:
            if client_id not in self.weights:
                self.weights[client_id] = weight
                self.credits[client_id] = weight
                self.client_order.append(client_id)
            self.queues[client_id].append(job)

    def next_job(self):
        with self.lock:
            if not any(self.queues[c] for c in self.client_order):
                return None, None   # nothing pending anywhere
            while True:
                for _ in range(len(self.client_order)):
                    client_id = self.client_order[0]
                    if self.queues[client_id] and self.credits[client_id] > 0:
                        job = self.queues[client_id].popleft()
                        self.credits[client_id] -= 1
                        return client_id, job
                    self.client_order.append(self.client_order.popleft())
                # Full rotation without serving anyone: credits exhausted.
                # Reset for the next round and retry.
                for c in self.credits:
                    self.credits[c] = self.weights[c]
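A usage sketch, assuming a paid tier weighted 3:1 over a free tier (names and weights are illustrative):

from collections import Counter

wrr = WeightedRoundRobin()
for i in range(6):
    wrr.submit("paid", f"p{i}", weight=3)
    wrr.submit("free", f"f{i}", weight=1)

served = [wrr.next_job()[0] for _ in range(8)]
print(Counter(served))   # Counter({'paid': 6, 'free': 2}): a 3:1 split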
Properties: Clients with higher weight get proportionally more processing per round. Useful when you want to give paid users priority over free users, or production traffic priority over background jobs.
Approach 3: Deficit Round Robin (DRR)
An improvement on weighted RR: each client gets a "deficit" counter that accumulates a quantum every round, and unused quantum carries over to the next round, smoothing throughput.
class DeficitRoundRobin:
    def __init__(self, default_quantum: int = 1):
        self.queues = defaultdict(deque)
        self.deficits = {}   # client_id -> deficit counter
        self.quantum = {}    # client_id -> quantum added per round
        self.client_order = deque()
        self.default_quantum = default_quantum
        self.lock = Lock()

    def submit(self, client_id, job, quantum=None):
        with self.lock:
            if client_id not in self.quantum:
                self.quantum[client_id] = quantum or self.default_quantum
                self.deficits[client_id] = 0
            if not self.queues[client_id]:
                self.client_order.append(client_id)  # (re)join the rotation
            self.queues[client_id].append(job)

    def next_job(self):
        with self.lock:
            while self.client_order:
                client_id = self.client_order.popleft()
                if not self.queues[client_id]:
                    self.deficits[client_id] = 0  # idle clients don't bank quantum
                    continue
                self.deficits[client_id] += self.quantum[client_id]
                # Each "service unit" costs 1; serve if the deficit covers it
                if self.deficits[client_id] >= 1:
                    job = self.queues[client_id].popleft()
                    self.deficits[client_id] -= 1
                    if self.queues[client_id]:
                        self.client_order.append(client_id)
                    else:
                        self.deficits[client_id] = 0  # queue drained: reset
                    return client_id, job
                self.client_order.append(client_id)
            return None, None
Properties: Smooth throughput; bounded fairness. Used in network packet schedulers (Linux’s sch_drr) and modern scheduler designs.
Approach 4: Token Bucket Per Client
Each client has its own token bucket; submissions consume tokens. The server processes at full capacity but rate-limits per client. This is the typical approach for production API servers — see our Rate Limiter guide for the full implementation.
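A minimal sketch of the idea, not the full implementation from the rate limiter guide; the rate and burst defaults are illustrative. Each submission must pass its client's bucket before entering a shared FIFO:

import time
from collections import defaultdict, deque
from threading import Lock

class TokenBucketGate:
    def __init__(self, rate=10.0, burst=20):
        self.rate = rate                          # tokens refilled per second
        self.burst = burst                        # bucket capacity
        self.tokens = defaultdict(lambda: float(burst))  # client_id -> tokens
        self.last = {}                            # client_id -> last refill time
        self.ready = deque()                      # shared FIFO of admitted jobs
        self.lock = Lock()

    def submit(self, client_id, job):
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last.get(client_id, now)
            self.last[client_id] = now
            # Refill in proportion to elapsed time, capped at the burst size
            self.tokens[client_id] = min(self.burst,
                                         self.tokens[client_id] + elapsed * self.rate)
            if self.tokens[client_id] < 1:
                return False                      # over rate: reject (or delay)
            self.tokens[client_id] -= 1
            self.ready.append((client_id, job))
            return True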
Approach 5: Priority Queue with Aging
Use a priority queue whose sort key blends priority with submission time. Older items effectively gain priority the longer they wait, which prevents starvation.
import heapq
import time

class PriorityScheduler:
    def __init__(self, aging_interval: float = 10.0):
        # A waiting job gains one effective priority level
        # per aging_interval seconds
        self.aging_interval = aging_interval
        self.heap = []
        self.lock = Lock()

    def submit(self, client_id, job, priority=0):
        with self.lock:
            # Lower key = served sooner. Folding submission time into the
            # key implements linear aging: comparing p1 + t1/T against
            # p2 + t2/T is equivalent to comparing the aged priorities
            # p - (now - t)/T, since "now" cancels out. Waiting jobs
            # therefore overtake higher-priority newcomers eventually.
            now = time.time()
            key = priority + now / self.aging_interval
            heapq.heappush(self.heap, (key, now, client_id, job))

    def next_job(self):
        with self.lock:
            if not self.heap:
                return None, None
            _, _, client_id, job = heapq.heappop(self.heap)
            return client_id, job
Trade-offs: Priority plus aging gives flexible prioritization. Use it when fairness should be weighted by importance rather than enforced as strict per-client rotation.
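To see the aging in action (values are illustrative): with a 0.1-second aging interval, a low-priority job that has waited 0.6 seconds has aged about six priority levels, so it outranks a freshly submitted high-priority job.

sched = PriorityScheduler(aging_interval=0.1)
sched.submit("bg", "cleanup", priority=5)    # low importance
time.sleep(0.6)                              # waits ~6 aging intervals
sched.submit("web", "request", priority=0)   # high importance, but fresh
print(sched.next_job())                      # ('bg', 'cleanup')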
Common Production Considerations
Hot keys / power users
One client with 1000× the traffic of others can dominate the queue. Solutions: per-client rate limits; sharding; fanout limiting. Round-robin alone doesn’t handle hot keys well — the client’s queue grows without bound.
Latency vs throughput trade-off
Round-robin minimizes worst-case latency per client. Priority queues optimize for high-priority clients. Weighted approaches let you pick the trade-off explicitly.
Job duration variance
If some jobs take 1 ms and others take 10 s, RR alone doesn't account for service time. Use time-slicing (in the spirit of Linux's SCHED_RR) plus per-job timeouts, shortest-job-first scheduling, or a cost-aware scheduler like the DRR variant sketched below.
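Charging the DRR deficit per unit of estimated work, rather than one unit per job, is how packet schedulers handle this (they charge by bytes). A sketch extending the DeficitRoundRobin class from earlier; the (job, cost) tuple and millisecond cost units are assumptions for illustration, and quanta should be on the same scale as the costs.

class CostAwareDRR(DeficitRoundRobin):
    def submit(self, client_id, job, cost=1.0, quantum=None):
        # cost: estimated work, e.g. expected milliseconds of service time
        super().submit(client_id, (job, cost), quantum=quantum)

    def next_job(self):
        with self.lock:
            while self.client_order:
                client_id = self.client_order.popleft()
                if not self.queues[client_id]:
                    self.deficits[client_id] = 0
                    continue
                self.deficits[client_id] += self.quantum[client_id]
                job, cost = self.queues[client_id][0]      # peek at the head job
                if self.deficits[client_id] >= cost:       # deficit covers its cost
                    self.queues[client_id].popleft()
                    self.deficits[client_id] -= cost
                    if self.queues[client_id]:
                        self.client_order.append(client_id)
                    else:
                        self.deficits[client_id] = 0       # queue drained: reset
                    return client_id, job
                self.client_order.append(client_id)        # bank deficit for later
            return None, None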
Distributed scheduling
For multi-server systems, a single scheduler is a single point of failure. Modern approaches use consistent hashing (each client maps to a specific server), distributed queues (Kafka, RabbitMQ), or per-shard scheduling with global rate limits.
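For the consistent-hashing option, a minimal sketch (node names, the virtual-node count, and MD5 as the hash are all illustrative choices): each client hashes to a fixed position on a ring, so its requests always land on the same scheduler node, and adding or removing a node only remaps the clients nearest to it.

import bisect
import hashlib

class ConsistentHashRouter:
    def __init__(self, nodes, vnodes=100):
        # Place vnodes points per node on the ring for smoother balance
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, client_id):
        # First ring position at or after the client's hash, wrapping around
        i = bisect.bisect(self.keys, self._hash(client_id)) % len(self.keys)
        return self.ring[i][1]

router = ConsistentHashRouter(["sched-a", "sched-b", "sched-c"])
print(router.node_for("client-42"))   # same client, same node, every time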
Backpressure
When all queues are saturated, refuse new submissions or redirect upstream. Let the load shed gracefully rather than building unbounded queues.
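A sketch of that on the RoundRobinScheduler from earlier; the max_pending cap is an assumed knob. submit now reports rejection so callers can back off instead of letting queues grow without bound.

class BoundedScheduler(RoundRobinScheduler):
    def __init__(self, max_pending=1000):
        super().__init__()
        self.max_pending = max_pending    # per-client queue cap (assumed knob)

    def submit(self, client_id, job):
        with self.lock:
            if len(self.queues[client_id]) >= self.max_pending:
                return False              # shed load: caller backs off or retries
            if not self.queues[client_id]:
                self.client_order.append(client_id)
            self.queues[client_id].append(job)
            return True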
Frequently Asked Questions
What’s the expected interview answer?
Round robin as the baseline. Discuss its limitations (no priority, doesn’t handle different job sizes well). Mention weighted RR or DRR for prioritization, token buckets for rate limiting, priority queue with aging for flexible scheduling. Strong candidates draw the system, identify the bottlenecks, and propose monitoring.
How does this differ from rate limiting?
Rate limiting rejects or delays excess requests; scheduling decides which queued requests to process next. They’re complementary: rate limiting controls input; scheduling controls processing order. Real systems use both. See our rate limiter guide for the rate-limiting side.
What’s the right choice for a real production API?
Token bucket per client (rate limiting) plus FIFO processing within each queue (or weighted by priority for premium clients). Scheduling complexity beyond this is rarely necessary unless you have specific fairness or QoS requirements. Most large-scale APIs (Stripe, Cloudflare, AWS) use variants of this pattern.
How do I prevent starvation?
For round robin: it’s automatic — every client gets a turn each round. For priority queue: add aging (priority boost over time). For token bucket: ensure refill rate is non-zero so all clients eventually make progress. For multi-server: ensure routing fairness so no single client’s traffic clusters on one node.
What if jobs have very different durations?
RR with FIFO per queue doesn’t handle this — a 10s job blocks the queue for 10s. Solutions: time-slice processing (Linux SCHED_RR-style), separate queues for fast vs slow jobs, or asynchronous worker pools that pull from all queues. Strong candidates identify the issue and propose explicit handling.
See also: Producer-Consumer Problem • Implement a Rate Limiter • Execution and Slippage
💡 Strategies for Solving This Problem
Load Balancing and Request Handling
Got this at Meta in 2024. It's about designing scalable systems that handle variable load. Tests understanding of queuing, load balancing, and priority systems.
The Problem
Design a server that processes thousands of function requests per second. Challenges:
- Unknown request rate
- Different priorities
- Need fairness
- Can't block
Approach 1: Simple Queue
Single FIFO queue. Process requests as they come. Simple but:
- No priorities
- High-priority requests wait behind low-priority
- Doesn't scale well
Approach 2: Priority Queue
Use heap or priority queue. High-priority requests first. Better, but:
- Starvation: low-priority may never execute
- Not truly "fair"
Approach 3: Multiple Queues + Round Robin
Separate queue per priority level. Use weighted round-robin to process:
- 60% time on high priority
- 30% on medium
- 10% on low
Prevents starvation while maintaining priorities.
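A minimal sketch of that split (queue names and the exact 6/3/1 pattern are illustrative):

import itertools
from collections import deque

queues = {"high": deque(), "medium": deque(), "low": deque()}
schedule = itertools.cycle(["high"] * 6 + ["medium"] * 3 + ["low"])

def next_request():
    # One full pass over the 10-slot pattern; skip empty queues so
    # unused slots go to whoever has work instead of being wasted
    for _ in range(10):
        q = queues[next(schedule)]
        if q:
            return q.popleft()
    return None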
Approach 4: Thread Pool with Work Stealing
Multiple worker threads, each with a local queue. If idle, a worker "steals" work from another's queue (see the sketch after this list). Good for:
- Load balancing
- Utilizing all cores
- Handling variable load
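A minimal work-stealing sketch (one shared lock keeps it short; a real pool uses per-queue locks or lock-free deques): each worker pops its own queue FIFO and, when idle, steals from the opposite end of a random nonempty victim.

import random
from collections import deque
from threading import Lock

class WorkStealingPool:
    def __init__(self, n_workers=4):
        self.queues = [deque() for _ in range(n_workers)]
        self.lock = Lock()

    def submit(self, worker_id, task):
        with self.lock:
            self.queues[worker_id].append(task)

    def take(self, worker_id):
        with self.lock:
            own = self.queues[worker_id]
            if own:
                return own.popleft()                 # own work: oldest first
            victims = [q for i, q in enumerate(self.queues)
                       if i != worker_id and q]
            if victims:
                return random.choice(victims).pop()  # steal newest from a victim
            return None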
Key Components
Rate limiting: Prevent any one client from overwhelming the server
Backpressure: Reject requests when overloaded
Monitoring: Track queue sizes, latencies
Autoscaling: Add workers when load increases
At Meta
They wanted to see if I understood the tradeoffs. Started with simple queue, they asked about priorities. Then about fairness and starvation. Finally discussed scaling and monitoring.