Low Level Design: Adaptive Concurrency Limiting

Adaptive concurrency limiting automatically tunes the number of concurrent requests a service allows based on observed performance. Unlike static rate limiting (fixed requests per second), concurrency limiting reacts to actual server capacity — reducing the limit when the server is slow and increasing it when the server has headroom. This prevents overload without requiring manual capacity planning.

Why Concurrency vs Rate Limiting

Rate limiting (requests per second) does not adapt to variable request duration. A server that handles 100 simple requests per second may not handle 100 complex requests per second: at the same RPS, complex requests hold resources longer and can overload the server. Concurrency limiting (maximum in-flight requests) tracks actual resource consumption. By Little's Law, throughput = concurrency / average latency: if each request takes 10ms, 100 concurrent requests sustain 10,000 RPS; if requests slow to 100ms, the same 100 concurrent requests sustain only 1,000 RPS. Throughput adjusts automatically to request duration while in-flight resource usage stays bounded.
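The arithmetic above is Little's Law. A minimal sketch (function name and numbers are illustrative, not from any library):

```python
# Little's Law: throughput (RPS) = concurrency / average request duration (s).
def throughput_rps(concurrency: int, avg_latency_s: float) -> float:
    """Achievable requests/second at a given concurrency and latency."""
    return concurrency / avg_latency_s

# Same 100-request concurrency limit, different request durations:
fast = throughput_rps(100, 0.010)   # 10 ms requests: ~10,000 RPS
slow = throughput_rps(100, 0.100)   # 100 ms requests: ~1,000 RPS
```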

AIMD (Additive Increase Multiplicative Decrease)

AIMD is the algorithm TCP uses for congestion control, adapted for service concurrency. Additive increase: if no errors are observed, increment the concurrency limit by 1 per round-trip time. Multiplicative decrease: if an error (timeout, overload response) is observed, halve the concurrency limit. This backs off quickly under overload and converges toward the available capacity under stable conditions. It is simple to implement, but because the increase is linear it is slow to reclaim capacity after a decrease or to exploit a sudden increase in available capacity.
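A minimal sketch of the AIMD rules above (the initial value and bounds are illustrative assumptions):

```python
class AIMDLimiter:
    """AIMD concurrency limiter sketch: +1 on success, halve on error."""

    def __init__(self, initial=10, min_limit=1, max_limit=1000):
        self.limit = initial
        self.min_limit = min_limit
        self.max_limit = max_limit

    def on_success(self):
        # Additive increase: probe for headroom one slot at a time.
        self.limit = min(self.limit + 1, self.max_limit)

    def on_error(self):
        # Multiplicative decrease: back off fast on timeout/overload.
        self.limit = max(self.limit // 2, self.min_limit)
```

Calling on_success after each completed request probes linearly for headroom; a single on_error halves the limit, matching TCP-style backoff.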

Vegas Algorithm

Netflix's concurrency limiter (based on TCP Vegas) estimates optimal concurrency from observed RTT. The gradient is the ratio of the best observed RTT to the current RTT: gradient = min_RTT / current_RTT, and the limit is scaled by the gradient. If the service is healthy, current_RTT ≈ min_RTT, the gradient is ≈ 1, and the limit holds steady. If the service is overloaded, current_RTT rises, the gradient drops below 1, and the limit decreases. If the service is underloaded, a new sample can come in below min_RTT, briefly pushing the gradient above 1 and letting the limit grow (until min_RTT updates). This provides smooth, proactive control without requiring explicit error signals.
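A simplified sketch of the gradient idea; a real implementation such as Netflix's also smooths samples and estimates queue size, so the class and names here are illustrative:

```python
class VegasStyleLimiter:
    """RTT-gradient limiter sketch in the spirit of TCP Vegas."""

    def __init__(self, initial=20):
        self.limit = float(initial)
        self.min_rtt = None   # best RTT observed so far

    def update(self, rtt_ms: float) -> float:
        if self.min_rtt is None:
            self.min_rtt = rtt_ms
        # Gradient ~1 when healthy, <1 under load, >1 when a new
        # sample beats the best RTT seen so far.
        gradient = self.min_rtt / rtt_ms
        self.min_rtt = min(self.min_rtt, rtt_ms)
        self.limit = max(1.0, self.limit * gradient)
        return self.limit
```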

Gradient-Based Limiting

The Gradient2 algorithm (from Netflix's concurrency-limits library; Envoy's adaptive concurrency filter uses a similar gradient controller) adjusts the limit from the RTT gradient: new_limit = old_limit * (min_RTT / avg_RTT) + sqrt(old_limit). The sqrt term provides headroom for the limit to grow even when the RTT ratio is less than 1, preventing premature convergence to a too-small limit. Update the limit every N requests rather than every request to reduce noise, and smooth the RTT measurements with an exponential moving average to filter transient spikes.
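A sketch of the update rule as stated above, with EMA smoothing and a per-N-requests update; the window size and smoothing factor are illustrative assumptions:

```python
import math

class Gradient2Limiter:
    """Gradient2-style sketch: new_limit = limit * (min/avg RTT) + sqrt(limit),
    applied every `sample_window` requests over EMA-smoothed RTT samples."""

    def __init__(self, initial=50, smoothing=0.2, sample_window=10):
        self.limit = float(initial)
        self.smoothing = smoothing
        self.sample_window = sample_window
        self.min_rtt = None
        self.avg_rtt = None
        self.samples = 0

    def record(self, rtt_ms: float) -> float:
        if self.min_rtt is None or rtt_ms < self.min_rtt:
            self.min_rtt = rtt_ms
        # Exponential moving average filters transient RTT spikes.
        self.avg_rtt = rtt_ms if self.avg_rtt is None else (
            (1 - self.smoothing) * self.avg_rtt + self.smoothing * rtt_ms)
        self.samples += 1
        # Update only every N samples to reduce noise.
        if self.samples % self.sample_window == 0:
            gradient = self.min_rtt / self.avg_rtt
            # sqrt(limit) headroom lets the limit grow even when gradient < 1.
            self.limit = self.limit * gradient + math.sqrt(self.limit)
        return self.limit
```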

Bulkhead Integration

An adaptive concurrency limiter is typically combined with bulkheads: separate concurrency limits per upstream dependency. If database queries slow down, only the database query concurrency limit decreases; other operations (cache hits, in-memory logic) continue at full concurrency. Without bulkheads, a slow database would reduce the overall service concurrency limit, starving fast code paths. Implement per-dependency limiters using the same algorithm, each tracking RTT and error rate for its specific upstream.
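A bulkhead can be sketched as one independent limiter per dependency, each applying the same AIMD-style rules; the class and dependency names are illustrative:

```python
class DependencyLimiter:
    """Per-dependency limiter: own in-flight count, AIMD adjustment."""

    def __init__(self, initial=10):
        self.limit = initial
        self.in_flight = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.limit:
            return False              # caller queues or rejects this request
        self.in_flight += 1
        return True

    def release(self, error: bool) -> None:
        self.in_flight -= 1
        # Halve on error, probe upward by 1 on success (AIMD).
        self.limit = max(1, self.limit // 2) if error else self.limit + 1

# One bulkhead per upstream: a slow database only shrinks its own limit.
bulkheads = {name: DependencyLimiter() for name in ("database", "cache")}
```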

Queue Management

When the concurrency limit is reached, new requests must be queued or rejected. A small queue (2-5 requests) smooths short bursts without adding much latency. A large queue lets the system appear to handle the load while latency builds up, defeating the purpose of the limit. Set a queue timeout: reject requests that have waited longer than the target p99 latency. This ensures accepted requests are served promptly rather than sitting in the queue long enough to time out at the caller.
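A minimal sketch of a small, deadline-bounded wait queue; the depth and timeout values are illustrative assumptions:

```python
import collections
import time

class BoundedQueue:
    """Short wait queue: bounded depth, stale entries dropped on dequeue."""

    def __init__(self, max_depth=5, timeout_s=0.250):
        self.entries = collections.deque()
        self.max_depth = max_depth
        self.timeout_s = timeout_s    # e.g. the target p99 latency

    def enqueue(self, request) -> bool:
        if len(self.entries) >= self.max_depth:
            return False              # queue full: shed load immediately
        self.entries.append((request, time.monotonic()))
        return True

    def dequeue(self):
        # Skip requests that waited past the deadline; the caller has
        # likely timed out already, so serving them would waste capacity.
        while self.entries:
            request, enqueued_at = self.entries.popleft()
            if time.monotonic() - enqueued_at <= self.timeout_s:
                return request
        return None
```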

Observability

Expose concurrency limiter state as metrics: current_limit, in_flight_requests, queue_depth, rejected_requests, min_rtt_ms, current_rtt_ms. Alert when rejected_requests is sustained (indicates the service is consistently near capacity). Dashboard the limit over time — a steadily decreasing limit indicates a service health problem. Distinguish between limit rejections (429 Too Many Requests) and error rejections (503 Service Unavailable from the upstream) in metrics to separate concurrency issues from upstream errors.
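The limiter state can be exported as a flat snapshot under the metric names above; splitting the rejection counters separates limiter 429s from upstream 503s. The extra counter names are illustrative, and a real service would register these with its metrics library:

```python
def limiter_metrics(current_limit, in_flight, queue_depth,
                    limit_rejections, upstream_errors,
                    min_rtt_ms, current_rtt_ms):
    """Snapshot of limiter state; 429 vs 503 counters are kept separate."""
    return {
        "current_limit": current_limit,
        "in_flight_requests": in_flight,
        "queue_depth": queue_depth,
        "rejected_requests": limit_rejections + upstream_errors,
        "rejected_requests_429": limit_rejections,   # limiter rejections
        "rejected_requests_503": upstream_errors,    # upstream failures
        "min_rtt_ms": min_rtt_ms,
        "current_rtt_ms": current_rtt_ms,
    }
```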
