An operating system scheduler decides which process or thread runs on each CPU core at any given moment. The scheduler must balance throughput (maximizing CPU utilization), fairness (preventing starvation), latency (minimizing response time for interactive tasks), and real-time guarantees (meeting deadlines for latency-sensitive work). Linux CFS (Completely Fair Scheduler), Windows NT scheduler, and real-time schedulers each take distinct approaches. Understanding OS scheduling explains why application-level thread pools, goroutines, and async event loops exist and how they relate to kernel scheduling.
Scheduling Algorithms
- First-Come-First-Served (FCFS): run processes in arrival order. Simple, but suffers the convoy effect: short jobs wait behind long jobs.
- Shortest Job First (SJF): run the shortest job next. Provably optimal for average waiting time, but requires knowing each job's duration in advance, which is impossible in general; practical variants estimate the next CPU burst from past bursts.
- Round-Robin: each process gets a fixed time quantum (e.g., 10ms) in rotation and is preempted when its quantum expires. Fair, but short quanta raise context-switch overhead while long quanta degenerate toward FCFS.
- Multilevel Feedback Queue (used by Windows and by Linux before CFS): multiple priority queues; processes start at high priority and are demoted to lower queues if they use their full time quantum (CPU-bound behavior); I/O-bound processes that block frequently stay at high priority, since they use little CPU per burst. This naturally prioritizes interactive and I/O-bound tasks.
// Linux CFS: Completely Fair Scheduler
// Each process tracks virtual runtime (vruntime): actual CPU time scaled by (1024 / weight)
// The runqueue is a red-black tree ordered by vruntime
// The scheduler always picks the process with the smallest vruntime (leftmost node)
// This ensures proportional CPU sharing based on priority (nice value)
// nice value -20 (highest priority) to +19 (lowest priority)
// weight = 1024 / (1.25 ^ nice) -- exponential: nice 0 = 1024, nice +1 = 820, nice -1 = 1277
// CFS key parameters:
// /proc/sys/kernel/sched_latency_ns = 6000000 (6ms target scheduling period)
// /proc/sys/kernel/sched_min_granularity_ns = 750000 (0.75ms minimum run time)
// /proc/sys/kernel/sched_wakeup_granularity_ns = 1000000 (1ms wakeup preemption)
// (defaults; the latency target is scaled up with CPU count; since kernel 5.13
//  these tunables live under /sys/kernel/debug/sched/ instead)
// Monitor scheduling:
// cat /proc/[pid]/schedstat -- time on CPU (ns), runqueue wait time (ns), timeslices run
// perf sched record/report -- detailed scheduling trace
// chrt -f 50 ./realtime_app -- set SCHED_FIFO real-time policy, priority 50
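The weight formula above determines each task's proportional CPU share. A sketch of the arithmetic (the kernel actually uses a precomputed table, sched_prio_to_weight, with slightly different rounding, e.g. 820 for nice +1 and 1277 for nice -1):

```go
package main

import (
	"fmt"
	"math"
)

// weight approximates the CFS load weight for a nice value:
// weight ≈ 1024 / 1.25^nice, so each nice step changes CPU share by ~10%.
func weight(nice int) float64 {
	return 1024 / math.Pow(1.25, float64(nice))
}

func main() {
	fmt.Printf("nice  0: %.0f\n", weight(0))  // 1024
	fmt.Printf("nice +1: %.0f\n", weight(1))  // 819 (kernel table: 820)
	fmt.Printf("nice -1: %.0f\n", weight(-1)) // 1280 (kernel table: 1277)

	// Two runnable tasks share the CPU in proportion to their weights:
	// here, a nice 0 task vs. a nice +5 task on one core.
	w0, w5 := weight(0), weight(5)
	fmt.Printf("nice 0 task's share: %.0f%%\n", 100*w0/(w0+w5)) // 75%
}
```

Since vruntime advances more slowly for heavier weights, always picking the smallest vruntime yields exactly these proportional shares over time.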
Preemption and Context Switching
Preemption allows the kernel to interrupt a running process and switch to another. Without preemption (cooperative multitasking), processes run until they voluntarily yield, so a single buggy process can freeze the system. Preemptive scheduling relies on timer interrupts: every N milliseconds a hardware timer fires an interrupt, the kernel's timer handler runs, and the scheduler decides whether to preempt. A context switch saves the CPU register state (PC, SP, general-purpose registers, FPU state) into the process control block (PCB) and restores another process's state. The direct cost is roughly 1-10 microseconds, dominated by cache effects (the new process's working set is likely not in L1/L2 cache). A high context switch rate is detectable via vmstat or pidstat -w.
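The voluntary/involuntary split can be observed from a running process via POSIX getrusage(). A Unix-specific sketch (field names are from Go's syscall.Rusage on Linux):

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

// contextSwitches reads this process's context-switch counters.
// Voluntary switches: the process blocked (I/O, sleep) and yielded the CPU.
// Involuntary switches: the scheduler preempted it (quantum expiry,
// higher-priority wakeup). These are the same counters pidstat -w reports.
func contextSwitches() (voluntary, involuntary int64) {
	var ru syscall.Rusage
	syscall.Getrusage(syscall.RUSAGE_SELF, &ru)
	return ru.Nvcsw, ru.Nivcsw
}

func main() {
	before, _ := contextSwitches()
	time.Sleep(10 * time.Millisecond) // blocking should add voluntary switches
	after, inv := contextSwitches()
	fmt.Printf("voluntary switches during sleep: %d (involuntary total: %d)\n",
		after-before, inv)
}
```

A CPU-bound loop would instead accumulate involuntary switches as its quanta expire.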
Thread vs. Process Scheduling
Modern OSes schedule threads (the unit of CPU execution), not processes. Threads within a process share the address space and file descriptors but have separate stacks and CPU register state. The kernel schedules kernel threads; on Linux, the pthreads library (NPTL) maps each user thread 1:1 onto a kernel thread. M:N threading (Go goroutines, Erlang processes): M user-level tasks multiplexed over N kernel threads by a runtime scheduler. The runtime scheduler (Go: work-stealing, Erlang: reduction-counting preemption) avoids kernel context-switch overhead for goroutine switching: a goroutine switch costs on the order of 100ns vs. roughly 1-10µs for a kernel thread switch. This is why Go can efficiently run millions of goroutines.
Real-Time Scheduling
Real-time schedulers guarantee that critical tasks run within a deadline. Linux real-time policies: SCHED_FIFO: fixed priority, run until blocked or preempted by higher-priority task. SCHED_RR: like FIFO but with round-robin among equal priorities. SCHED_DEADLINE: specify runtime (how much CPU needed), deadline (when needed by), and period (how often). The kernel uses EDF (Earliest Deadline First) to schedule DEADLINE tasks. Real-time tasks preempt all CFS tasks. Use cases: audio processing (PulseAudio, JACK), industrial control systems, autonomous vehicle sensor processing. CPU isolation (isolcpus kernel parameter) dedicates cores exclusively to real-time tasks, preventing OS interference.
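SCHED_DEADLINE comes with admission control: the kernel rejects a new reservation if the task set's total utilization would make deadlines unmeetable. A simplified single-CPU sketch of that check (hypothetical task values; the kernel's real test also accounts for CPU count and a share reserved for non-deadline work):

```go
package main

import "fmt"

// task is a SCHED_DEADLINE-style reservation: runtime ns of CPU every period ns.
type task struct {
	runtime, period int64 // nanoseconds
}

// admissible reports whether total utilization sum(runtime/period) fits on
// one CPU -- the necessary condition for EDF to meet every deadline.
func admissible(tasks []task) bool {
	var u float64
	for _, t := range tasks {
		u += float64(t.runtime) / float64(t.period)
	}
	return u <= 1.0
}

func main() {
	audio := task{runtime: 2_000_000, period: 10_000_000}   // 2ms/10ms = 0.20
	control := task{runtime: 5_000_000, period: 20_000_000} // 5ms/20ms = 0.25
	fmt.Println(admissible([]task{audio, control})) // true: 0.45 total

	heavy := task{runtime: 9_000_000, period: 10_000_000} // 0.90
	fmt.Println(admissible([]task{audio, control, heavy})) // false: 1.35 total
}
```

This is why sched_setattr() with SCHED_DEADLINE can fail with EBUSY: the requested reservation would overcommit the CPU.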
Key Interview Discussion Points
- Priority inversion: a low-priority task holds a lock needed by a high-priority task, while a medium-priority task preempts the low-priority task — the high-priority task is effectively blocked by a lower-priority task; solution: priority inheritance (temporarily elevate the lock holder to the priority of the waiting task)
- NUMA awareness: on multi-socket systems, memory access latency differs by whether memory is on the same NUMA node as the CPU; Linux NUMA-aware scheduling keeps threads near their memory; use numactl to pin processes
- CPU affinity: pin a process to specific CPU cores with taskset or sched_setaffinity() to maximize cache locality and avoid migration overhead
- Scheduler bypass: high-performance networking (DPDK) and storage (SPDK) bypass the kernel scheduler entirely by busy-polling on a dedicated CPU core — eliminates interrupt latency at the cost of one full CPU core
- cgroup CPU quotas: Kubernetes CPU limits use cgroup cpu.cfs_quota_us (cgroup v1; cgroup v2 expresses the same limit via cpu.max) to throttle processes — a pod limited to 0.5 CPU gets 50ms of CPU time per 100ms period
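The quota arithmetic in the last bullet is just limit × period. A small sketch of how a Kubernetes-style CPU limit maps to CFS bandwidth values (illustrative helper, not any real API):

```go
package main

import "fmt"

// cfsQuota converts a fractional CPU limit into a cpu.cfs_quota_us value for
// the given period. A group exceeding its quota within a period is throttled
// (runnable but not run) until the next period begins.
func cfsQuota(cpuLimit float64, periodUs int64) int64 {
	return int64(cpuLimit * float64(periodUs))
}

func main() {
	period := int64(100000) // default cpu.cfs_period_us: 100ms

	fmt.Println(cfsQuota(0.5, period)) // 50000: 50ms of CPU per 100ms
	fmt.Println(cfsQuota(2.0, period)) // 200000: up to two cores' worth per period
}
```

Note the quota is a bandwidth cap, not a core count: a multi-threaded pod limited to 0.5 CPU can burn its 50ms early in each period and then stall, which shows up as latency spikes.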