Load Shedding: Low-Level Design

Load shedding is the deliberate rejection of requests when a system is overloaded — sacrificing some requests to protect the system’s ability to serve others. Without load shedding, an overloaded service degrades for all users: queues grow, latencies spike, and the system eventually collapses. With load shedding, the system rejects low-priority requests quickly, keeping high-priority requests fast.

Why Load Shedding Is Necessary

HTTP servers, databases, and queues all have finite capacity. As demand approaches capacity, queueing theory predicts that latency grows superlinearly: in a simple M/M/1 model, mean response time scales as 1/(1 − ρ) for utilization ρ, so latency at 90% utilization is about five times the latency at 50%. At or above 100%, queues grow without bound and the system fails for everyone. Load shedding keeps utilization below the knee of the latency curve, so the system degrades gracefully rather than failing completely.
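To make the shape of that curve concrete, here is a minimal sketch that evaluates the textbook M/M/1 formula for mean time in system. The M/M/1 model (single server, random arrivals and service times) is an assumption chosen for simplicity; real services deviate from it, and the function name and the 10 ms service time are illustrative only.

```python
def mm1_response_time(service_time_s: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        # At rho >= 1 the queue has no steady state: it grows without bound.
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


if __name__ == "__main__":
    # Assumed mean service time of 10 ms, purely for illustration.
    for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
        latency_ms = mm1_response_time(0.010, rho) * 1000
        print(f"utilization={rho:.2f}  mean latency={latency_ms:.0f} ms")
```

Under these assumptions, going from 50% to 90% utilization multiplies mean latency by five, and the last few percent before saturation cost far more than the first fifty.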

When to Shed Load

Detect overload via: (1) CPU utilization above a threshold (e.g., >85%), (2) request queue depth exceeding a threshold, (3) p99 latency exceeding an SLO, (4) active request count exceeding a concurrency limit, (5) memory pressure approaching OOM. Use a combination: CPU alone misses I/O-bound overload, and latency alone is a lagging signal that reacts only after queues have already built up. The concurrency-based approach has the firmest theoretical grounding in Little’s Law (L = λW): given a target throughput and a target latency, concurrency limit = target_throughput × target_latency. For example, 1,000 requests/s at a 250 ms target latency gives a limit of 250 in-flight requests.
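A minimal sketch of the two ideas above: deriving a concurrency limit from Little's Law, and combining several signals into one overload decision. Every threshold except the 85% CPU figure is a hypothetical placeholder, and treating any single breached signal as overload is one possible combination rule, not the only one.

```python
import math
from dataclasses import dataclass


def concurrency_limit(target_throughput_rps: float, target_latency_s: float) -> int:
    """Little's Law, L = lambda * W, rounded up to whole in-flight requests."""
    if target_throughput_rps <= 0 or target_latency_s <= 0:
        raise ValueError("throughput and latency must be positive")
    # round() first so float noise such as 100.00000000000001 does not ceil to 101.
    return math.ceil(round(target_throughput_rps * target_latency_s, 9))


@dataclass
class Signals:
    cpu: float          # fraction of CPU in use, 0.0 to 1.0
    queue_depth: int    # requests waiting to be served
    p99_s: float        # rolling p99 latency in seconds
    in_flight: int      # requests currently being processed


def overloaded(s: Signals, *, cpu_max: float = 0.85, queue_max: int = 100,
               p99_slo_s: float = 0.5, in_flight_max: int = 250) -> bool:
    """Report overload if any one signal breaches its (assumed) threshold."""
    return (s.cpu > cpu_max
            or s.queue_depth > queue_max
            or s.p99_s > p99_slo_s
            or s.in_flight > in_flight_max)
```

An I/O-bound service illustrates why the combination matters: `Signals(cpu=0.30, queue_depth=500, p99_s=0.2, in_flight=100)` is flagged through queue depth even though CPU looks healthy.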

What to Shed

LIFO Queue Dropping

When the queue backs up under overload, serve the newest requests first (LIFO) and drop the oldest. The reasoning: requests that have waited the longest are the ones whose clients are most likely to have already timed out or retried, so serving them wastes capacity on responses nobody will read. The newest requests have clients that are almost certainly still waiting. LIFO service with oldest-first dropping maximizes the chance that served requests have a client still waiting for them.
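One common way to realize LIFO handling under overload is a bounded buffer that hands out the most recently arrived request first and evicts the longest-waiting one when it is full, so capacity goes to clients most likely to still be listening. The sketch below assumes a single-threaded caller; the class and method names are illustrative.

```python
from collections import deque
from typing import Deque, Generic, Optional, TypeVar

T = TypeVar("T")


class NewestFirstQueue(Generic[T]):
    """Bounded queue: serves newest first, evicts oldest when full."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._items: Deque[T] = deque()
        self._capacity = capacity
        self.dropped = 0  # count of evicted (shed) requests

    def push(self, item: T) -> Optional[T]:
        """Enqueue item. If full, evict and return the oldest queued item."""
        evicted: Optional[T] = None
        if len(self._items) >= self._capacity:
            evicted = self._items.popleft()  # oldest sits at the left end
            self.dropped += 1
        self._items.append(item)
        return evicted

    def pop(self) -> Optional[T]:
        """Return the most recently arrived item, or None if empty."""
        return self._items.pop() if self._items else None
```

The caller should answer each evicted request with a fast rejection rather than silently discarding it. A production variant might switch between FIFO and LIFO depending on queue depth; that policy is left out here.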

Priority-Based Shedding

Assign priority tiers: P0 (health checks, auth), P1 (logged-in user requests), P2 (anonymous requests), P3 (background jobs). Under load, shed P3 first, then P2, preserving P0 and P1. Priority is determined from request metadata: JWT claims (authenticated vs. anonymous), endpoint category, or request headers. This requires a priority classification layer at the edge or load balancer.
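A minimal sketch of the classification and shedding decision, using the four tiers above. The endpoint paths, the `background` flag, and the integer `load_level` scale are all hypothetical; a real classifier would read JWT claims or headers as described.

```python
from enum import IntEnum


class Priority(IntEnum):
    P0 = 0  # health checks, auth
    P1 = 1  # logged-in user requests
    P2 = 2  # anonymous requests
    P3 = 3  # background jobs


# Hypothetical critical endpoints, for illustration only.
CRITICAL_PATHS = frozenset({"/healthz", "/login"})


def classify(path: str, authenticated: bool, background: bool = False) -> Priority:
    """Map request metadata to a priority tier."""
    if path in CRITICAL_PATHS:
        return Priority.P0
    if background:
        return Priority.P3
    return Priority.P1 if authenticated else Priority.P2


def should_shed(priority: Priority, load_level: int) -> bool:
    """load_level 0: healthy; 1: shed P3; 2 or more: shed P3 and P2.

    P0 and P1 are never shed by this policy.
    """
    if load_level <= 0:
        return False
    cutoff = Priority.P3 if load_level == 1 else Priority.P2
    return priority >= cutoff
```

Keeping classification separate from the shed decision lets the edge tag requests once while each service applies its own current load level.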

Random Shedding

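When offered load exceeds capacity, reject a random fraction of requests large enough to bring admitted load back to capacity: shedding_rate = 1 − capacity / offered_load. For example, at twice capacity the rate is 50%, not 100%. Simple to implement: generate a random float per request; reject if float < shedding_rate. No per-request state is required and no priority classification is needed. This works well when all requests have roughly equal cost and priority. The downside: some high-value requests are rejected while low-value ones are served.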
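A minimal sketch of random shedding. The helper computes the fraction that must be rejected so that admitted load equals capacity, which is 1 − capacity / offered load; the function names and the injectable random generator are illustrative choices.

```python
import random
from typing import Optional


def shedding_rate(offered_rps: float, capacity_rps: float) -> float:
    """Fraction of requests to reject so admitted load equals capacity."""
    if capacity_rps <= 0:
        raise ValueError("capacity must be positive")
    if offered_rps <= capacity_rps:
        return 0.0
    return 1.0 - capacity_rps / offered_rps


def should_reject(rate: float, rng: Optional[random.Random] = None) -> bool:
    """Reject with probability `rate`; stateless apart from the RNG."""
    draw = (rng or random).random()  # uniform float in [0.0, 1.0)
    return draw < rate
```

Because `random()` never returns 1.0, a rate of 0.0 admits everything and a rate of 1.0 rejects everything, with no special cases needed.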

Concurrency Limiters

A concurrency limiter caps the number of simultaneous in-flight requests. New requests beyond the limit are rejected immediately (not queued). This bounds the system’s active work, preventing memory exhaustion and thread pool saturation. Implementation: a semaphore or atomic counter tracking in-flight requests. When a request completes (or times out), decrement the counter. Netflix’s open-source concurrency-limits library implements this with adaptive limit adjustment based on measured latency.
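A minimal sketch of a fixed-limit concurrency limiter built on a lock-protected counter. It rejects instead of queueing, and the context manager guarantees the counter is decremented even if the handler raises. The class and exception names are illustrative, and the adaptive limit adjustment mentioned above is not implemented here.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class Overloaded(Exception):
    """Raised when the in-flight limit is reached; map this to a 429 or 503."""


class ConcurrencyLimiter:
    def __init__(self, limit: int) -> None:
        if limit <= 0:
            raise ValueError("limit must be positive")
        self._limit = limit
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Claim a slot without blocking. Returns False if at the limit."""
        with self._lock:
            if self._in_flight >= self._limit:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Free a slot when a request completes or times out."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1

    @contextmanager
    def slot(self) -> Iterator[None]:
        """Hold a slot for the duration of a `with` block, or raise Overloaded."""
        if not self.try_acquire():
            raise Overloaded("concurrency limit reached")
        try:
            yield
        finally:
            self.release()
```

A handler would wrap its work in `with limiter.slot():` and translate `Overloaded` into a fast rejection at the framework boundary.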

Adaptive Load Shedding

Static thresholds require manual tuning and go stale as traffic patterns change. Adaptive shedding adjusts the shedding rate automatically based on real-time system signals. Algorithm: measure current p99 latency over a rolling window. If p99 > target, increase shedding rate by a step. If p99 < target and CPU < 70%, decrease shedding rate. This feedback loop maintains the system near its optimal operating point without manual intervention.
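The feedback loop described above can be sketched as a small controller that is called once per measurement window. The 70% CPU figure comes from the text; the step size, the cap on the shedding rate, and all names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveShedder:
    target_p99_s: float        # latency target in seconds
    cpu_ceiling: float = 0.70  # only relax shedding when CPU is below this
    step: float = 0.05         # assumed adjustment per window
    max_rate: float = 0.95     # assumed cap so some traffic is always admitted
    rate: float = 0.0          # current fraction of requests to shed

    def update(self, p99_s: float, cpu: float) -> float:
        """Adjust the shedding rate from the latest window's measurements."""
        if p99_s > self.target_p99_s:
            # Too slow: shed more.
            self.rate = min(self.max_rate, self.rate + self.step)
        elif p99_s < self.target_p99_s and cpu < self.cpu_ceiling:
            # Fast and with CPU headroom: shed less.
            self.rate = max(0.0, self.rate - self.step)
        # Otherwise hold steady (latency is fine but CPU is still high).
        return self.rate
```

The returned rate can feed a random or priority-based shedding decision. Requiring both low latency and CPU headroom before relaxing gives the loop some hysteresis, which helps avoid oscillating around the target.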

Client-Side Behavior

Shed requests must return quickly with a clear signal: HTTP 429 (Too Many Requests) or 503 (Service Unavailable), with a Retry-After header. The Retry-After value tells clients how long to back off, which keeps them from retrying immediately and amplifying load. Do not return 200 with an error body: clients, proxies, and monitoring will treat it as a success, so retry backoff, circuit breakers, and overload alerts keyed on status codes never trigger.
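A minimal, framework-agnostic sketch of building the rejection response. It returns a (status, headers, body) tuple that a handler could pass to whatever web framework is in use; the function name and the tuple shape are assumptions. Retry-After is given in whole seconds.

```python
from http import HTTPStatus
from typing import Dict, Tuple

_ALLOWED = (HTTPStatus.TOO_MANY_REQUESTS, HTTPStatus.SERVICE_UNAVAILABLE)


def shed_response(status: HTTPStatus, retry_after_s: int) -> Tuple[int, Dict[str, str], bytes]:
    """Build a fast rejection: 429 or 503 plus a Retry-After header."""
    if status not in _ALLOWED:
        raise ValueError("shed responses should use 429 or 503")
    if retry_after_s < 0:
        raise ValueError("Retry-After must be a non-negative number of seconds")
    body = status.phrase.encode("utf-8")  # short body, cheap to produce
    headers = {
        "Retry-After": str(retry_after_s),
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),
    }
    return int(status), headers, body
```

For example, `shed_response(HTTPStatus.SERVICE_UNAVAILABLE, 5)` asks the client to wait five seconds before retrying. Keeping the body tiny matters because the rejection path must stay cheap when the system is already overloaded.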

Load Shedding vs. Rate Limiting

Rate limiting (token bucket, sliding window) limits request rate per client. Load shedding limits total system load regardless of client. They are complementary: rate limiting prevents any single client from overwhelming the system; load shedding protects against aggregate overload from many well-behaved clients. Deploy both: rate limiting at the edge, load shedding at the service layer.
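To illustrate how the two layers differ, the sketch below pairs a per-client token bucket (rate limiting) with a global in-flight check (load shedding). The injectable clock, the `admit` helper, and its string results are illustrative assumptions; in practice the two checks usually live in different tiers, as noted above.

```python
import time
from typing import Callable


class TokenBucket:
    """Per-client rate limiter: refills at rate_per_s, holds at most `burst` tokens."""

    def __init__(self, rate_per_s: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_s
        self._burst = float(burst)
        self._tokens = float(burst)  # start full
        self._clock = clock
        self._last = clock()

    def allow(self) -> bool:
        """Spend one token if available, refilling for elapsed time first."""
        now = self._clock()
        self._tokens = min(self._burst, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def admit(bucket: TokenBucket, in_flight: int, concurrency_limit: int) -> str:
    """Check the client's own budget first, then total system load."""
    if not bucket.allow():
        return "rate_limited"  # this client is over its own budget
    if in_flight >= concurrency_limit:
        return "shed"          # client is well-behaved, but the system is full
    return "admitted"
```

A well-behaved client can still receive "shed" when aggregate load is high, which is exactly the case rate limiting alone cannot cover.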

Interview Discussion Points

Explain why queueing theory makes overload worse than linear degradation, describe at least two shedding policies (LIFO, priority-based, random), mention that shed requests must return 429/503 fast (not queue), discuss adaptive vs. static thresholds, and distinguish load shedding from rate limiting. Strong candidates also mention that when a request is shed, times out, or loses its client, the cancellation should propagate to downstream calls so the system avoids doing work whose result will be discarded.
