Rate limiting is a core system design topic. This guide covers every major algorithm, their trade-offs, and Redis implementation patterns you need for interviews.
## Fixed Window Counter
Divide time into fixed windows (e.g., one per minute) and keep a counter per window. Simple and cheap – O(1) work per request and O(1) state per user.
```python
import time
import redis

redis = redis.Redis()  # rebind the module name to a client, matching the calls below

# Redis commands for fixed window
def is_allowed_fixed_window(user_id, limit, window_seconds):
    # One counter per (user, window) pair
    key = f"rl:{user_id}:{int(time.time() // window_seconds)}"
    count = redis.incr(key)
    if count == 1:
        # First request in this window: set the TTL
        redis.expire(key, window_seconds)
    return count <= limit
```
The boundary problem: if the limit is 100 req/min and a user sends 100 requests at 00:59 and 100 at 01:01, they get 200 requests in a 2-second span – double the intended limit. This burst at window boundaries is the main weakness.
## Sliding Window Log
Store the timestamp of every request in a Redis sorted set. On each request, remove timestamps older than the window, then check count. Accurate but O(requests) memory.
```python
# Redis commands for sliding window log
def is_allowed_sliding_log(user_id, limit, window_seconds):
    now = time.time()
    key = f"rl:log:{user_id}"
    pipeline = redis.pipeline()
    # Remove entries outside the window
    pipeline.zremrangebyscore(key, 0, now - window_seconds)
    # Add current request (note: a denied request still leaves its
    # timestamp in the set, so it counts against the limit)
    pipeline.zadd(key, {str(now): now})
    # Count requests in window (includes the one just added)
    pipeline.zcard(key)
    pipeline.expire(key, window_seconds)
    results = pipeline.execute()
    count = results[2]  # reply to ZCARD
    return count <= limit
```
Memory cost: storing one entry per request. For a user making 1000 req/min, that is 1000 entries in the sorted set. Fine for low-traffic APIs, problematic at scale.
## Sliding Window Counter (Approximate)
Combine two adjacent fixed windows with a weighted approximation. Memory efficient with only ~0.003% error rate.
```python
def is_allowed_sliding_counter(user_id, limit, window_seconds):
    now = time.time()
    current_window = int(now // window_seconds)
    prev_window = current_window - 1
    current_key = f"rl:{user_id}:{current_window}"
    prev_key = f"rl:{user_id}:{prev_window}"
    current_count = int(redis.get(current_key) or 0)
    prev_count = int(redis.get(prev_key) or 0)
    # How far into the current window are we? (0.0 to 1.0)
    elapsed = (now % window_seconds) / window_seconds
    # Weight the previous window by how much of it still overlaps
    # the sliding window
    approximate_count = prev_count * (1 - elapsed) + current_count
    if approximate_count >= limit:
        return False
    # Note: this read-then-increment is not atomic; under heavy
    # concurrency, move the whole check into a Lua script
    pipeline = redis.pipeline()
    pipeline.incr(current_key)
    pipeline.expire(current_key, window_seconds * 2)
    pipeline.execute()
    return True
```
This is what Cloudflare uses in production. The error is bounded because the approximation assumes previous window requests were uniformly distributed – the worst case error is small enough to be acceptable for rate limiting.
## Token Bucket
A bucket holds up to capacity tokens. Tokens drip in at refill_rate per second. Each request consumes one token. Allows controlled bursts up to bucket capacity.
```python
# Lua script for an atomic token bucket in Redis
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill based on elapsed time
local elapsed = now - last_refill
local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)

if new_tokens >= requested then
    new_tokens = new_tokens - requested
    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
    return 1 -- allowed
else
    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
    return 0 -- denied
end
"""

def is_allowed_token_bucket(user_id, capacity, refill_rate):
    key = f"rl:bucket:{user_id}"
    result = redis.eval(TOKEN_BUCKET_LUA, 1, key,
                        capacity, refill_rate, time.time(), 1)
    return result == 1
```
AWS API Gateway and most cloud providers use token bucket. The Lua script ensures atomicity – check and update happen in a single Redis operation, preventing race conditions.
## Leaky Bucket
Requests enter a queue (the bucket). A processor drains the queue at a fixed rate. Bursts are absorbed by the queue; excess requests that overflow the bucket are dropped.
Key difference from token bucket: token bucket allows bursts to be processed immediately (as long as tokens are available). Leaky bucket always processes at constant rate – bursts add latency, not speed.
Use case: traffic shaping for downstream services that cannot handle bursts. The outbound rate is always smooth regardless of inbound spike. Adds latency proportional to queue depth.
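The drain-over-elapsed-time bookkeeping can be sketched with a counter-based "leaky bucket as a meter" variant – it tracks queue depth rather than holding actual requests, and the class and method names here are illustrative, not from the text:

```python
import time

class LeakyBucket:
    """Counter-based leaky bucket sketch (in-memory, single process).

    capacity   -- maximum queue depth before requests are dropped
    drain_rate -- requests drained per second at the fixed outbound rate
    """

    def __init__(self, capacity, drain_rate):
        self.capacity = capacity
        self.drain_rate = drain_rate
        self.queue_depth = 0.0
        self.last_drain = time.time()

    def try_enqueue(self):
        now = time.time()
        # Drain the queue at the fixed rate, based on elapsed time
        elapsed = now - self.last_drain
        self.queue_depth = max(0.0, self.queue_depth - elapsed * self.drain_rate)
        self.last_drain = now
        if self.queue_depth + 1 <= self.capacity:
            self.queue_depth += 1
            return True
        return False  # bucket overflow: drop the request
```

A real traffic shaper would also schedule when each queued request is released; this sketch only answers the admission question.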
## Distributed Rate Limiting
For distributed systems, use a single Redis node (or Redis Cluster with consistent hashing) as the shared counter store.
```python
# Lua script for an atomic sliding window log (distributed-safe)
SLIDING_WINDOW_LUA = """
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Remove old entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
-- Count current entries
local count = redis.call('ZCARD', key)
if count < limit then
    -- Add this request; the random suffix keeps members unique
    -- when two requests share a timestamp
    redis.call('ZADD', key, now, now .. math.random())
    redis.call('EXPIRE', key, window)
    return 1 -- allowed
end
return 0 -- denied
"""
```
For Redis Cluster: shard by user_id using consistent hashing. All requests for user X go to the same shard. This avoids cross-shard coordination at the cost of uneven load if some users are much more active than others.
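With Redis Cluster specifically, co-location can be expressed with hash tags: the cluster hashes only the substring inside `{...}`, so every key sharing that tag maps to the same hash slot. A small sketch (the key names are illustrative):

```python
def tier_keys(user_id):
    """Build rate-limit keys that all land on the same cluster slot.

    The {user_id} hash tag makes Redis Cluster hash only that substring,
    so multi-key Lua scripts over these keys stay on one shard.
    """
    return [
        f"rl:{{{user_id}}}:minute",
        f"rl:{{{user_id}}}:hour",
        f"rl:{{{user_id}}}:day",
    ]
```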
Race condition without Lua: between ZCARD (read) and ZADD (write), another request could slip through. Lua scripts in Redis execute atomically – no other command runs while the script executes, so the read and write cannot interleave.
## Rate Limit Response Headers
Always return rate limit info in response headers so clients can back off gracefully.
```
X-RateLimit-Limit: 100          # max requests per window
X-RateLimit-Remaining: 47       # requests left in current window
X-RateLimit-Reset: 1685000060   # Unix timestamp when window resets
Retry-After: 30                 # seconds until client can retry (on 429)
```
Return HTTP 429 Too Many Requests when the limit is exceeded. Include Retry-After so well-behaved clients implement exponential backoff instead of hammering the server.
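On the client side, the backoff logic can be sketched as below – the function name and defaults are illustrative; it honors a server-supplied `Retry-After` when present and otherwise falls back to exponential backoff with full jitter:

```python
import random

def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).

    A Retry-After value from a 429 response takes precedence;
    otherwise delay grows as base * 2**attempt, capped and jittered.
    """
    if retry_after is not None:
        return float(retry_after)
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```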
## Multi-Tier Rate Limiting
Production systems apply multiple rate limits simultaneously, each with different scopes and limits.
| Tier | Key | Typical limit | Purpose |
|---|---|---|---|
| Per-IP | ip:1.2.3.4 | 1000/hour | Block scrapers, DDoS |
| Per-User | user:12345 | 500/hour | Fair usage per account |
| Per-API-key | key:abc123 | 10000/hour | Tiered pricing plans |
| Per-Endpoint | user:12345:/search | 100/min | Protect expensive ops |
| Global | global:api | 1M/min | System-wide protection |
Check limits in order from most specific to least specific. A request is denied if any tier exceeds its limit. Use a pipeline to check all tiers in one Redis round trip.
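A pipelined multi-tier check could look like the sketch below (fixed-window counters for brevity; the function name is illustrative). Note one assumed trade-off not stated above: counters are incremented optimistically, so a denied request still consumes a slot in each tier.

```python
import time

def check_tiers(client, tiers):
    """Check several fixed-window limits in one Redis round trip.

    tiers -- list of (key_prefix, limit, window_seconds) tuples;
             the request is allowed only if every tier is under limit.
    """
    now = int(time.time())
    pipe = client.pipeline()
    for key_prefix, limit, window in tiers:
        window_key = f"{key_prefix}:{now // window}"
        pipe.incr(window_key)
        pipe.expire(window_key, window)
    results = pipe.execute()
    counts = results[::2]  # every other reply is an INCR result
    return all(count <= limit
               for count, (_, limit, _) in zip(counts, tiers))
```

Example: `check_tiers(redis, [("rl:ip:1.2.3.4", 1000, 3600), ("rl:user:12345", 500, 3600)])`.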
## Algorithm Comparison
| Algorithm | Memory | Burst handling | Accuracy | Best for |
|---|---|---|---|---|
| Fixed window | O(1) | 2x at boundary | Approximate | Simple internal APIs |
| Sliding log | O(requests) | None (strict) | Exact | Low-traffic, strict limits |
| Sliding counter | O(1) | Smoothed | ~99.997% | High-traffic production |
| Token bucket | O(1) | Controlled bursts OK | Exact | APIs with burst tolerance |
| Leaky bucket | O(queue size) | Queued | Exact | Traffic shaping |