Token Bucket
The token bucket algorithm models a bucket of capacity B tokens that refills at rate R tokens per second. On each request of cost C: if tokens >= C, deduct C and allow; otherwise reject. The bucket allows bursts up to B requests instantaneously, making it well-suited for APIs where occasional bursts are acceptable. Implementation uses lazy refill: rather than a background thread, compute tokens = min(B, tokens + R * (now - last_refill_time)) at request time. Store two values per user: tokens and last_refill_time. In Redis, a Lua script reads both fields, applies the formula, and conditionally decrements — atomically — in a single round trip.
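The lazy-refill scheme can be sketched in a few lines. This is a minimal single-process illustration, not the Redis/Lua version the text describes; the class name and the injectable `now` parameter (added so the clock can be controlled in tests) are my own choices:

```python
import time

class TokenBucket:
    """Lazy-refill token bucket: state is just (tokens, last_refill_time)."""

    def __init__(self, capacity, rate, now=None):
        self.capacity = capacity                        # B: max tokens
        self.rate = rate                                # R: tokens per second
        self.tokens = capacity                          # start full
        self.last = time.monotonic() if now is None else now

    def allow(self, cost=1.0, now=None):
        now = time.monotonic() if now is None else now
        # Lazy refill: tokens = min(B, tokens + R * elapsed)
        self.tokens = min(self.capacity, self.tokens + self.rate * (now - self.last))
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The Redis version performs the same two reads, the same formula, and the conditional decrement inside one Lua script so the check-and-deduct is atomic.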
Leaky Bucket
The leaky bucket models a FIFO queue of capacity B. Incoming requests enqueue if the queue is not full; otherwise they are rejected. Requests dequeue at a constant rate R per second. The output rate is smooth and constant regardless of burst input — the "leak" is metered. This is ideal for rate-limiting outbound calls to a downstream service that cannot handle bursts. Contrast with token bucket: token bucket smooths the average rate but permits instantaneous bursts; leaky bucket enforces a strict maximum output rate. Implementation schedules a dequeue task at interval 1/R seconds, or equivalently computes the next allowed request time as next_allowed = last_request_time + 1/R and rejects requests arriving before that time.
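The "next allowed time" formulation at the end of the paragraph needs only one stored timestamp. A minimal sketch of that metering variant (no queue, so requests arriving early are rejected rather than buffered; the class name is illustrative):

```python
class LeakyBucketMeter:
    """Leaky bucket as a meter: enforce a strict minimum gap of 1/R
    seconds between allowed requests. State is a single timestamp."""

    def __init__(self, rate):
        self.interval = 1.0 / rate    # minimum spacing between requests
        self.next_allowed = 0.0       # next_allowed = last_request_time + 1/R

    def allow(self, now):
        if now < self.next_allowed:
            return False              # arrived before the next leak slot
        self.next_allowed = now + self.interval
        return True
```

A queueing implementation would instead enqueue the early arrival and serve it at `next_allowed`, trading rejection for latency.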
Fixed Window Counter
Divide time into fixed windows (e.g., one-minute buckets). Maintain a counter per (user, window) key. On each request: increment the counter; if it exceeds the limit, reject. Reset by setting a TTL on the key equal to the window size. Implementation in Redis: INCR user:1234:1714000000 followed by EXPIRE on the first increment. The critical flaw is the boundary burst: a user can send limit requests in the last second of window N and limit requests in the first second of window N+1, achieving 2× the intended rate in a two-second span. Simple to implement and reason about, but only appropriate when boundary bursts are tolerable.
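An in-memory analogue of the `INCR` + `EXPIRE` pattern, where the integer window id plays the role of the TTL'd key suffix (names are illustrative). The last assertion in a quick test of this sketch is exactly the boundary-burst flaw: a fresh window resets the count.

```python
from collections import defaultdict

class FixedWindowCounter:
    """In-memory analogue of Redis INCR + EXPIRE: one counter per
    (user, window) key."""

    def __init__(self, limit, window_size):
        self.limit = limit
        self.window_size = window_size
        self.counts = defaultdict(int)

    def allow(self, user, now):
        window = int(now // self.window_size)   # e.g. one-minute bucket id
        key = (user, window)
        self.counts[key] += 1                   # INCR
        return self.counts[key] <= self.limit
```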
Sliding Window Log
Store the timestamp of every request in a sorted set per user. On each new request: remove all entries older than now - window_size (ZREMRANGEBYSCORE), count remaining entries (ZCARD), reject if count >= limit, otherwise add the new timestamp (ZADD). This gives exact rate limiting with no boundary burst artifact. The cost is memory: O(requests_per_window) entries per user. For a limit of 1000 req/min across 1M users, that is up to 1 billion timestamp entries across all users' sorted sets, which is untenable at scale. Suitable for low-volume APIs with strict accuracy requirements where per-user request counts are small.
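The four-step sequence above maps directly onto a per-user sorted container. A minimal in-process sketch, with a plain sorted list standing in for the Redis sorted set (names are illustrative):

```python
import bisect
from collections import defaultdict

class SlidingWindowLog:
    """Exact limiter: keeps every request timestamp inside the window,
    per user. A sorted list stands in for a Redis sorted set."""

    def __init__(self, limit, window_size):
        self.limit = limit
        self.window_size = window_size
        self.logs = defaultdict(list)

    def allow(self, user, now):
        log = self.logs[user]
        cutoff = now - self.window_size
        # ZREMRANGEBYSCORE: evict entries at or before the cutoff
        while log and log[0] <= cutoff:
            log.pop(0)
        if len(log) >= self.limit:    # ZCARD check against the limit
            return False
        bisect.insort(log, now)       # ZADD the new timestamp
        return True
```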
Sliding Window Counter
Approximates the sliding window using only two counters per user: the current window count and the previous window count. The estimated count for any point in time is:
estimated = current_window_count + previous_window_count * (1 - elapsed_fraction)
where elapsed_fraction is how far into the current window we are (0.0 to 1.0). This approximation assumes requests were uniformly distributed in the previous window. The error is bounded by the variance in the previous window’s distribution — in practice well under 1% for smooth traffic. Memory cost is O(1) per user (two counters + window boundary timestamp). This is the algorithm used by Cloudflare and most production rate limiters where accuracy and memory efficiency must both be high.
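The two-counter scheme plus the estimation formula can be sketched as follows. This is a single-user, single-process illustration with illustrative names; a production version keeps the two counters and the window-start timestamp per user in a shared store:

```python
class SlidingWindowCounter:
    """Approximate limiter: two counters plus the current window's
    start time -- O(1) memory per user."""

    def __init__(self, limit, window_size):
        self.limit = limit
        self.window = window_size
        self.curr = 0            # requests in the current window
        self.prev = 0            # requests in the previous window
        self.window_start = 0.0

    def allow(self, now):
        # Roll windows forward if time has advanced past the current one.
        elapsed_windows = int((now - self.window_start) // self.window)
        if elapsed_windows >= 1:
            # After a gap of 2+ windows the previous count is stale: zero it.
            self.prev = self.curr if elapsed_windows == 1 else 0
            self.curr = 0
            self.window_start += elapsed_windows * self.window
        elapsed_fraction = (now - self.window_start) / self.window
        # Assume the previous window's requests were uniformly spread.
        estimated = self.curr + self.prev * (1.0 - elapsed_fraction)
        if estimated >= self.limit:
            return False
        self.curr += 1
        return True
```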
GCRA — Generic Cell Rate Algorithm
GCRA (also called the "virtual scheduler" or "leaky bucket as a meter") tracks a single value per user: the Theoretical Arrival Time (TAT) of the next cell. On each request:
- Allow if TAT - now <= burst_tolerance (comparing the stored TAT before updating it), otherwise reject
- On allow, set TAT = max(now, TAT) + emission_interval, where emission_interval = 1 / rate
The burst tolerance B controls how many requests can arrive simultaneously: B = (burst_size - 1) * emission_interval. GCRA is mathematically equivalent to a leaky bucket at output. It requires only one stored variable per user, making it extremely memory-efficient. It is used in ATM network traffic shaping, and Nginx’s limit_req module implements a variant of it.
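A minimal sketch of GCRA as described, with the burst tolerance derived from a burst size as in the formula above (class and parameter names are illustrative):

```python
class GCRA:
    """GCRA as a meter: a single stored value (TAT) per user."""

    def __init__(self, rate, burst_size):
        self.emission_interval = 1.0 / rate
        # B = (burst_size - 1) * emission_interval
        self.burst_tolerance = (burst_size - 1) * self.emission_interval
        self.tat = 0.0   # theoretical arrival time of the next request

    def allow(self, now):
        tat = max(self.tat, now)           # never schedule in the past
        if tat - now > self.burst_tolerance:
            return False                   # arrived too early: reject
        self.tat = tat + self.emission_interval
        return True
```

With rate 1 req/s and burst_size 3, three requests pass back-to-back at t=0, a fourth is rejected, and the steady state settles at one request per second.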
Distributed Rate Limiting
Single-node rate limiting breaks down when requests are load-balanced across multiple servers. Options in increasing complexity:
- Centralized Redis: all nodes hit a single Redis instance using a Lua script for atomic check-and-increment. Consistent but adds ~1ms latency and creates a Redis bottleneck.
- Redis Cluster: shard users by user_id % num_shards; each shard handles a subset of users independently. Scales horizontally but requires careful key routing.
- Local approximation + periodic sync: each server maintains a local token bucket and periodically syncs with a shared store. Requests are allowed locally without a remote call. Drift between servers means slightly more than the limit may pass, but this is acceptable for most APIs.
- Sticky sessions: route all requests from a given user to the same server via consistent hashing on the load balancer. Local rate limiting is then perfectly accurate. Works until server failure requires re-routing.
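The local-approximation option can be illustrated with a simple batched-fetch variant: each server spends from a locally cached slice of the global budget and contacts the shared store only when the slice runs out (syncing on exhaustion rather than on a timer). Both classes below are stand-ins I introduce for illustration; in production the SharedBudget role would be a Redis counter behind an atomic script:

```python
class SharedBudget:
    """Stand-in for a shared store holding the remaining global budget."""

    def __init__(self, limit):
        self.remaining = limit

    def take(self, n):
        granted = min(n, self.remaining)   # grant at most what is left
        self.remaining -= granted
        return granted

class LocalLimiter:
    """One instance per server: spends local tokens, and makes a remote
    call only when the local slice is exhausted."""

    def __init__(self, shared, batch):
        self.shared = shared
        self.batch = batch     # tokens fetched per sync
        self.local = 0

    def allow(self):
        if self.local == 0:
            self.local = self.shared.take(self.batch)  # the remote call
        if self.local > 0:
            self.local -= 1
            return True
        return False
```

The trade-off is visible in the batch size: larger batches mean fewer remote calls but more unspent tokens stranded on idle servers, which is the "drift" the bullet above describes.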
Adaptive Rate Limiting
Static rate limits ignore downstream health. Adaptive rate limiting adjusts allowed rates based on system signals. When backend latency exceeds a threshold (e.g., p99 > 500ms), reduce the rate limit. When latency recovers, increase it. AIMD (Additive Increase Multiplicative Decrease) is the standard control law: increase the limit by a fixed amount each interval without congestion; halve it on overload signal (borrowed from TCP congestion control). On the client side, 429 responses should trigger exponential backoff with full jitter — random delay in [0, min(cap, base * 2^attempt)] — rather than synchronized retries that create retry storms. Libraries like resilience4j and Hystrix implement these patterns out of the box.
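The two client/server mechanisms above, AIMD on the server side and full-jitter backoff on the client side, are each a few lines. A minimal sketch with illustrative names and parameters:

```python
import random

class AIMDLimit:
    """AIMD controller for an adaptive rate limit: add a fixed step per
    healthy interval, halve on an overload signal (as in TCP)."""

    def __init__(self, initial, step, floor=1.0):
        self.limit = initial
        self.step = step
        self.floor = floor    # never throttle below this rate

    def on_healthy_interval(self):
        self.limit += self.step                        # additive increase
        return self.limit

    def on_overload(self):
        self.limit = max(self.floor, self.limit / 2)   # multiplicative decrease
        return self.limit

def backoff_with_full_jitter(attempt, base=0.1, cap=10.0):
    """Client-side delay after a 429: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

The jitter matters as much as the exponent: without it, all clients that received a 429 in the same instant retry in the same instant, recreating the overload.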
How does the token bucket algorithm work?
The token bucket maintains a counter (bucket) with a maximum capacity. Tokens are added at a fixed rate (e.g., 100 per second). Each request consumes one token. If the bucket has tokens, the request is allowed and a token is consumed. If empty, the request is rejected or queued. Token bucket allows bursts up to bucket capacity while enforcing an average rate.
What is the difference between token bucket and leaky bucket?
Token bucket allows bursting: tokens accumulate when the system is idle, enabling a burst of requests at the maximum rate equal to bucket capacity. Leaky bucket enforces a strict output rate: requests enter a queue (bucket) and are processed at a fixed rate regardless of arrival pattern. Leaky bucket eliminates bursts at the cost of added latency; token bucket is burst-friendly.
How does a sliding window log rate limiter work?
The sliding window log stores a timestamp for each request in a sorted log (e.g., Redis sorted set). On each new request, it evicts timestamps older than the window size, then checks if the remaining count is below the limit. If so, it adds the new timestamp and allows the request. This is exact but memory-intensive: storage is O(requests per window) per user.
What is GCRA (Generic Cell Rate Algorithm)?
GCRA (also called virtual scheduling or leaky bucket as a meter) tracks the Theoretical Arrival Time (TAT) of the next allowed request. On each request, the limiter compares the stored TAT to now: if TAT - now exceeds the burst tolerance, the request is rejected; otherwise it is allowed and TAT is updated to max(now, TAT) + emission_interval. GCRA is exact and extremely compact, storing only a single value (TAT) per user, which makes it efficient for distributed rate limiting when nodes share a clock.
How do you implement distributed rate limiting across multiple servers?
Centralized counter in Redis using INCR with TTL: all servers increment a shared counter; if it exceeds the limit, reject. More precisely, use a Lua script for atomic check-and-increment. For low-latency, use a token bucket with Redis (EVAL script atomically reads bucket state, adds tokens for elapsed time, checks capacity, decrements). Alternatively, use local approximations with periodic sync (sliding window counter with local + remote state) for reduced Redis round-trips.