Rate Limiter System Low-Level Design

Rate limiting is a core system design topic. This guide covers every major algorithm, their trade-offs, and Redis implementation patterns you need for interviews.

Fixed Window Counter

Divide time into fixed windows (e.g., each minute). Keep a counter per window. Simple and cheap – O(1) time and space per request.

# Redis commands for fixed window
def is_allowed_fixed_window(user_id, limit, window_seconds):
    key = f"rl:{user_id}:{int(time.time() // window_seconds)}"
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, window_seconds)
    return count <= limit

The boundary problem: if the limit is 100 req/min and a user sends 100 requests at 00:59 and 100 at 01:01, they get 200 requests in a 2-second span – double the intended limit. This burst at window boundaries is the main weakness.

Sliding Window Log

Store the timestamp of every request in a Redis sorted set. On each request, remove timestamps older than the window, then check count. Accurate but O(requests) memory.

# Redis commands for sliding window log
def is_allowed_sliding_log(user_id, limit, window_seconds):
    now = time.time()
    key = f"rl:log:{user_id}"
    pipeline = redis.pipeline()
    # Remove entries outside the window
    pipeline.zremrangebyscore(key, 0, now - window_seconds)
    # Add current request
    pipeline.zadd(key, {str(now): now})
    # Count requests in window
    pipeline.zcard(key)
    pipeline.expire(key, window_seconds)
    results = pipeline.execute()
    count = results[2]
    return count <= limit

Memory cost: storing one entry per request. For a user making 1000 req/min, that is 1000 entries in the sorted set. Fine for low-traffic APIs, problematic at scale.

Sliding Window Counter (Approximate)

Combine two adjacent fixed windows with a weighted approximation. Memory efficient with only ~0.003% error rate.

def is_allowed_sliding_counter(user_id, limit, window_seconds):
    now = time.time()
    current_window = int(now // window_seconds)
    prev_window = current_window - 1
    
    current_key = f"rl:{user_id}:{current_window}"
    prev_key = f"rl:{user_id}:{prev_window}"
    
    current_count = int(redis.get(current_key) or 0)
    prev_count = int(redis.get(prev_key) or 0)
    
    # How far into the current window are we? (0.0 to 1.0)
    elapsed = (now % window_seconds) / window_seconds
    
    # Weight previous window by remaining overlap
    approximate_count = prev_count * (1 - elapsed) + current_count
    
    if approximate_count >= limit:
        return False
    
    pipeline = redis.pipeline()
    pipeline.incr(current_key)
    pipeline.expire(current_key, window_seconds * 2)
    pipeline.execute()
    return True

This is what Cloudflare uses in production. The error is bounded because the approximation assumes previous window requests were uniformly distributed – the worst case error is small enough to be acceptable for rate limiting.

Token Bucket

A bucket holds up to capacity tokens. Tokens drip in at refill_rate per second. Each request consumes one token. Allows controlled bursts up to bucket capacity.

# Lua script for atomic token bucket in Redis
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill based on elapsed time
local elapsed = now - last_refill
local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)

if new_tokens >= requested then
    new_tokens = new_tokens - requested
    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
    return 1  -- allowed
else
    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
    return 0  -- denied
end
"""

def is_allowed_token_bucket(user_id, capacity, refill_rate):
    key = f"rl:bucket:{user_id}"
    result = redis.eval(TOKEN_BUCKET_LUA, 1, key, capacity, refill_rate, time.time(), 1)
    return result == 1

AWS API Gateway and most cloud providers use token bucket. The Lua script ensures atomicity – check and update happen in a single Redis operation, preventing race conditions.

Leaky Bucket

Requests enter a queue (the bucket). A processor drains the queue at a fixed rate. Bursts are absorbed by the queue; excess requests that overflow the bucket are dropped.

Key difference from token bucket: token bucket allows bursts to be processed immediately (as long as tokens are available). Leaky bucket always processes at constant rate – bursts add latency, not speed.

Use case: traffic shaping for downstream services that cannot handle bursts. The outbound rate is always smooth regardless of inbound spike. Adds latency proportional to queue depth.

Distributed Rate Limiting

For distributed systems, use a single Redis node (or Redis Cluster with consistent hashing) as the shared counter store.

# Lua script for atomic sliding window counter (distributed-safe)
SLIDING_WINDOW_LUA = """
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Remove old entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count current entries
local count = redis.call('ZCARD', key)

if count < limit then
    -- Add this request
    redis.call('ZADD', key, now, now .. math.random())
    redis.call('EXPIRE', key, window)
    return 1  -- allowed
end

return 0  -- denied
"""

For Redis Cluster: shard by user_id using consistent hashing. All requests for user X go to the same shard. This avoids cross-shard coordination at the cost of uneven load if some users are much more active than others.

Race condition without Lua: between ZCARD (read) and ZADD (write), another request could slip through. Lua scripts in Redis execute atomically – no other command can run between lines.

Rate Limit Response Headers

Always return rate limit info in response headers so clients can back off gracefully.

X-RateLimit-Limit: 100          # max requests per window
X-RateLimit-Remaining: 47       # requests left in current window
X-RateLimit-Reset: 1685000060   # Unix timestamp when window resets
Retry-After: 30                 # seconds until client can retry (on 429)

Return HTTP 429 Too Many Requests when the limit is exceeded. Include Retry-After so well-behaved clients implement exponential backoff instead of hammering the server.

Multi-Tier Rate Limiting

Production systems apply multiple rate limits simultaneously, each with different scopes and limits.

TierKeyTypical limitPurpose
Per-IPip:1.2.3.41000/hourBlock scrapers, DDoS
Per-Useruser:12345500/hourFair usage per account
Per-API-keykey:abc12310000/hourTiered pricing plans
Per-Endpointuser:12345:/search100/minProtect expensive ops
Globalglobal:api1M/minSystem-wide protection

Check limits in order from most specific to least specific. A request is denied if any tier exceeds its limit. Use a pipeline to check all tiers in one Redis round trip.

Algorithm Comparison

AlgorithmMemoryBurst handlingAccuracyBest for
Fixed windowO(1)2x at boundaryApproximateSimple internal APIs
Sliding logO(requests)ExactExactLow-traffic, strict limits
Sliding counterO(1)Smoothed~99.997%High-traffic production
Token bucketO(1)Controlled bursts OKExactAPIs with burst tolerance
Leaky bucketO(queue size)QueuedExactTraffic shaping

Stripe system design interviews cover rate limiting and API protection. See design patterns for Stripe interview: rate limiter system design.

Cloudflare engineering covers rate limiting at massive scale. See system design patterns for Cloudflare interview: rate limiting and traffic shaping design.

Datadog system design covers API rate limiting and monitoring. See patterns for Datadog interview: rate limiting and API infrastructure design.

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering

Scroll to Top