Load shedding is the deliberate rejection of requests when a system is overloaded — sacrificing some requests to protect the system’s ability to serve others. Without load shedding, an overloaded service degrades for all users: queues grow, latencies spike, and the system eventually collapses. With load shedding, the system rejects low-priority requests quickly, keeping high-priority requests fast.
Why Load Shedding Is Necessary
HTTP servers, databases, and queues all have finite capacity. When demand exceeds capacity, queueing theory predicts that latency grows superlinearly: under an M/M/1 model, mean latency at 90% utilization is five times higher than at 50%. Beyond 100%, queues grow unbounded and the system fails for everyone. Load shedding keeps utilization below the knee of the latency curve, ensuring the system degrades gracefully rather than failing completely.
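The superlinear blow-up falls directly out of the M/M/1 mean-latency formula, W = 1/(μ − λ). A quick sketch (the service rate of 100 req/s is an illustrative number):

```python
# Illustrative only: mean time in system for an M/M/1 queue, W = 1 / (mu - lambda).
def mm1_latency(service_rate: float, arrival_rate: float) -> float:
    """Mean latency; only defined while arrivals stay below capacity."""
    assert arrival_rate < service_rate, "queue is unstable at >= 100% utilization"
    return 1.0 / (service_rate - arrival_rate)

mu = 100.0  # requests/sec the server can process
print(mm1_latency(mu, 50.0))  # 50% utilization: 0.02 s
print(mm1_latency(mu, 90.0))  # 90% utilization: 0.10 s, five times higher
```

Pushing arrival_rate to 99 req/s gives 1.0 s, fifty times the latency at half load, which is why the curve is described as having a knee.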
When to Shed Load
Detect overload via: (1) CPU utilization above a threshold (e.g., >85%), (2) request queue depth exceeding a threshold, (3) p99 latency exceeding an SLO, (4) active request count exceeding a concurrency limit, (5) memory pressure approaching OOM. Use a combination: CPU alone misses I/O-bound overload, and latency alone lags the onset of overload. The concurrency-based approach is the most theoretically sound, via Little's Law (L = λW): given a target throughput and a target latency, concurrency limit = target_throughput × target_latency.
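The Little's Law sizing reduces to a one-line calculation; the throughput and latency targets below are hypothetical numbers:

```python
# Derive a concurrency limit from Little's Law, L = lambda * W.
def concurrency_limit(target_throughput_rps: float, target_latency_s: float) -> int:
    """Maximum in-flight requests consistent with the throughput and latency targets."""
    return max(1, int(round(target_throughput_rps * target_latency_s)))

# 2000 req/s at a 50 ms latency target -> allow roughly 100 concurrent requests.
print(concurrency_limit(2000, 0.05))  # -> 100
```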
What to Shed
LIFO Queue Dropping
When a queue is full, drop the newest requests instead of the oldest (LIFO drop). The reasoning: requests at the front of the queue have already waited — their clients are more likely to still be waiting for a response. Requests at the back are newest — their clients may have already timed out. LIFO drop maximizes the chance that served requests have a client still waiting for them.
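A minimal sketch of a bounded queue with LIFO drop, assuming a single caller (no locking shown). Note the drop policy is LIFO while service order stays FIFO:

```python
from collections import deque

class LifoDropQueue:
    """Bounded queue that rejects the *newest* request when full, so the
    oldest waiters (whose clients are most likely still listening) keep
    their place."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._q: deque = deque()

    def offer(self, request) -> bool:
        """Returns False (shed) if the queue is full."""
        if len(self._q) >= self.capacity:
            return False  # drop the newcomer, not the head of the queue
        self._q.append(request)
        return True

    def poll(self):
        """Workers still serve in FIFO order; only the drop policy is LIFO."""
        return self._q.popleft() if self._q else None
```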
Priority-Based Shedding
Assign priority tiers: P0 (health checks, auth), P1 (logged-in user requests), P2 (anonymous requests), P3 (background jobs). Under load, shed P3 first, then P2, preserving P0 and P1. Priority is determined from request metadata: JWT claims (authenticated vs. anonymous), endpoint category, or request headers. This requires a priority classification layer at the edge or load balancer.
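A sketch of the classification layer; the path prefixes and the `is_authenticated` flag (derived from JWT claims in practice) are illustrative assumptions:

```python
from enum import IntEnum

class Priority(IntEnum):
    P0 = 0  # health checks, auth
    P1 = 1  # logged-in user requests
    P2 = 2  # anonymous requests
    P3 = 3  # background jobs

def classify(path: str, is_authenticated: bool) -> Priority:
    """Map request metadata to a priority tier (hypothetical route prefixes)."""
    if path.startswith(("/healthz", "/auth")):
        return Priority.P0
    if path.startswith("/jobs"):
        return Priority.P3
    return Priority.P1 if is_authenticated else Priority.P2

def should_shed(priority: Priority, shed_floor: Priority) -> bool:
    """Shed any request at `shed_floor` or below (higher number = lower priority)."""
    return priority >= shed_floor
```

Under mild overload set `shed_floor = Priority.P3`; tighten it to `Priority.P2` as pressure grows, preserving P0 and P1 throughout.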
Random Shedding
When load exceeds capacity by X%, reject X% of requests randomly. Simple to implement: generate a random float per request; reject if float < shedding_rate. No state required, no priority classification needed. Works well when all requests have roughly equal cost and priority. The downside: some high-value requests are rejected while low-value ones are served.
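The entire policy fits in one function; a minimal sketch:

```python
import random

def shed(rate: float, rng: random.Random = random) -> bool:
    """True means reject this request; `rate` is the shedding fraction in [0, 1]."""
    return rng.random() < rate
```

With `rate = 0.2`, roughly one in five requests is rejected, with no per-request state and no classification step.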
Concurrency Limiters
A concurrency limiter caps the number of simultaneous in-flight requests. New requests beyond the limit are rejected immediately (not queued). This bounds the system’s active work, preventing memory exhaustion and thread pool saturation. Implementation: a semaphore or atomic counter tracking in-flight requests. When a request completes (or times out), decrement the counter. Netflix’s Concurrency Limit library implements this with adaptive limit adjustment based on measured latency.
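A sketch of a fixed-limit limiter using a non-blocking semaphore; the adaptive limit adjustment that Netflix's library performs is omitted here:

```python
import threading

class ConcurrencyLimiter:
    """Caps simultaneous in-flight requests; excess requests are rejected, not queued."""

    def __init__(self, limit: int):
        self._sem = threading.Semaphore(limit)

    def try_acquire(self) -> bool:
        """Non-blocking: False means the request should be rejected immediately (429)."""
        return self._sem.acquire(blocking=False)

    def release(self) -> None:
        """Call when the request completes or times out."""
        self._sem.release()
```

Typical handler shape: `if not limiter.try_acquire(): return 429`, then do the work inside `try ... finally: limiter.release()` so the slot is freed even on error.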
Adaptive Load Shedding
Static thresholds require manual tuning and go stale as traffic patterns change. Adaptive shedding adjusts the shedding rate automatically based on real-time system signals. Algorithm: measure current p99 latency over a rolling window. If p99 > target, increase shedding rate by a step. If p99 < target and CPU < 70%, decrease shedding rate. This feedback loop maintains the system near its optimal operating point without manual intervention.
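The feedback loop can be sketched as a pure function; the 200 ms target, 70% CPU threshold, and step size are illustrative values, not tuned recommendations:

```python
def adjust_shed_rate(rate: float, p99_s: float, cpu: float,
                     target_p99_s: float = 0.2, step: float = 0.05) -> float:
    """One iteration of the adaptive loop, run once per rolling window."""
    if p99_s > target_p99_s:
        return min(1.0, rate + step)   # overloaded: shed more
    if cpu < 0.70:
        return max(0.0, rate - step)   # healthy with CPU headroom: shed less
    return rate                        # near target: hold steady
```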
Client-Side Behavior
Shed requests must return quickly with a clear signal: HTTP 429 (Too Many Requests) or 503 (Service Unavailable) with a Retry-After header. A 429 with Retry-After tells clients to back off, preventing them from immediately retrying and amplifying load. Do not return 200 with an error body — clients cannot distinguish this from success and will retry aggressively.
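A framework-agnostic sketch of building the shed response (status code per RFC 6585, Retry-After header per RFC 9110):

```python
def shed_response(retry_after_s: int = 1):
    """Build a fast, unambiguous rejection: status, headers, empty body."""
    status = "429 Too Many Requests"
    headers = {
        "Retry-After": str(retry_after_s),  # tells clients how long to back off
        "Content-Length": "0",
    }
    return status, headers, b""
```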
Load Shedding vs. Rate Limiting
Rate limiting (token bucket, sliding window) limits request rate per client. Load shedding limits total system load regardless of client. They are complementary: rate limiting prevents any single client from overwhelming the system; load shedding protects against aggregate overload from many well-behaved clients. Deploy both: rate limiting at the edge, load shedding at the service layer.
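To make the distinction concrete, a sketch showing both checks side by side; the class and function names are illustrative:

```python
import time

class TokenBucket:
    """Per-client rate limiter: refills at `rate` tokens/sec up to `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.capacity = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

def admit(client_bucket: TokenBucket, system_overloaded: bool) -> bool:
    # Rate limiting is per client; load shedding looks at the whole system.
    return client_bucket.allow() and not system_overloaded
```

A single abusive client fails the first check; a flood of well-behaved clients passes it individually but trips the second.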
Interview Discussion Points
Key interview talking points: explain why queueing theory makes overload worse than linear degradation, describe at least two shedding policies (LIFO, priority-based, random), mention that shed requests must return 429/503 fast (not queue), discuss adaptive vs. static thresholds, and distinguish load shedding from rate limiting. Strong candidates also mention that shed requests should propagate cancellation signals to avoid doing work whose result will be discarded.