Rate Limiter System Low-Level Design

Rate limiting is a core system design topic. This guide covers every major algorithm, their trade-offs, and Redis implementation patterns you need for interviews.

Fixed Window Counter

Divide time into fixed windows (e.g., each minute). Keep a counter per window. Simple and cheap – O(1) time and space per request.

# Redis commands for fixed window
import time
import redis

redis = redis.Redis()  # rebind module name to a client; all examples below use `redis`

def is_allowed_fixed_window(user_id, limit, window_seconds):
    key = f"rl:{user_id}:{int(time.time() // window_seconds)}"
    count = redis.incr(key)
    if count == 1:
        # Set the TTL on first increment so the window key expires on its own
        redis.expire(key, window_seconds)
    return count <= limit

The boundary problem: if the limit is 100 req/min and a user sends 100 requests at 00:59 and 100 at 01:01, they get 200 requests in a 2-second span – double the intended limit. This burst at window boundaries is the main weakness.

Sliding Window Log

Store the timestamp of every request in a Redis sorted set. On each request, remove timestamps older than the window, then check count. Accurate but O(requests) memory.

# Redis commands for sliding window log
import uuid

def is_allowed_sliding_log(user_id, limit, window_seconds):
    now = time.time()
    key = f"rl:log:{user_id}"
    pipeline = redis.pipeline()
    # Remove entries older than the window
    pipeline.zremrangebyscore(key, 0, now - window_seconds)
    # Add current request; the uuid suffix keeps members unique even if
    # two requests share the same timestamp
    pipeline.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
    # Count requests now in the window (includes this one)
    pipeline.zcard(key)
    pipeline.expire(key, window_seconds)
    results = pipeline.execute()
    count = results[2]
    # Note: a denied request's timestamp stays in the set, so sustained
    # hammering keeps the user blocked until they actually back off
    return count <= limit

Memory cost: storing one entry per request. For a user making 1000 req/min, that is 1000 entries in the sorted set. Fine for low-traffic APIs, problematic at scale.

Sliding Window Counter (Approximate)

Combine two adjacent fixed windows with a weighted approximation. Memory efficient (two counters per key), and in Cloudflare's production measurements only about 0.003% of requests were incorrectly allowed or denied.

def is_allowed_sliding_counter(user_id, limit, window_seconds):
    # Note: the read-then-increment below is not atomic; under heavy
    # concurrency, fold this logic into a Lua script (see the
    # distributed rate limiting section).
    now = time.time()
    current_window = int(now // window_seconds)
    prev_window = current_window - 1
    
    current_key = f"rl:{user_id}:{current_window}"
    prev_key = f"rl:{user_id}:{prev_window}"
    
    current_count = int(redis.get(current_key) or 0)
    prev_count = int(redis.get(prev_key) or 0)
    
    # How far into the current window are we? (0.0 to 1.0)
    elapsed = (now % window_seconds) / window_seconds
    
    # Weight previous window by remaining overlap
    approximate_count = prev_count * (1 - elapsed) + current_count
    
    if approximate_count >= limit:
        return False
    
    pipeline = redis.pipeline()
    pipeline.incr(current_key)
    pipeline.expire(current_key, window_seconds * 2)
    pipeline.execute()
    return True

This is what Cloudflare uses in production. The approximation assumes that requests in the previous window were uniformly distributed; the worst-case error introduced by that assumption is small enough to be acceptable for rate limiting.
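A worked example of the weighted formula, with illustrative numbers:

```python
# Assume a 60-second window and that we are 45s into the current
# window, so the elapsed fraction is 0.75.
prev_count = 80      # requests counted in the previous window
current_count = 30   # requests counted so far in the current window
elapsed = 45 / 60    # 0.75

# The previous window contributes only its remaining 25% overlap.
approximate_count = prev_count * (1 - elapsed) + current_count
print(approximate_count)  # 80 * 0.25 + 30 = 50.0
```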

Token Bucket

A bucket holds up to capacity tokens. Tokens drip in at refill_rate per second. Each request consumes one token. Allows controlled bursts up to bucket capacity.

# Lua script for atomic token bucket in Redis
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill based on elapsed time
local elapsed = now - last_refill
local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)

if new_tokens >= requested then
    new_tokens = new_tokens - requested
    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
    return 1  -- allowed
else
    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
    return 0  -- denied
end
"""

def is_allowed_token_bucket(user_id, capacity, refill_rate):
    key = f"rl:bucket:{user_id}"
    result = redis.eval(TOKEN_BUCKET_LUA, 1, key, capacity, refill_rate, time.time(), 1)
    return result == 1

AWS API Gateway and most cloud providers use token bucket. The Lua script ensures atomicity – check and update happen in a single Redis operation, preventing race conditions.

Leaky Bucket

Requests enter a queue (the bucket). A processor drains the queue at a fixed rate. Bursts are absorbed by the queue; excess requests that overflow the bucket are dropped.

Key difference from token bucket: token bucket allows bursts to be processed immediately (as long as tokens are available). Leaky bucket always processes at a constant rate – bursts are queued and add latency rather than being served right away.

Use case: traffic shaping for downstream services that cannot handle bursts. The outbound rate is always smooth regardless of inbound spike. Adds latency proportional to queue depth.
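The queue-and-drain behavior can be sketched in memory. This is an illustrative single-process version (class and method names are my own); a production limiter would keep this state in a shared store or a proxy layer:

```python
import time
from collections import deque

class LeakyBucket:
    """Minimal in-memory leaky bucket: requests queue up and drain at
    a fixed rate; requests that overflow the queue are dropped."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity    # max queued requests
        self.leak_rate = leak_rate  # requests drained per second
        self.queue = deque()
        self.last_leak = time.monotonic()

    def _leak(self):
        # Drain however many requests the fixed rate allows since last check
        now = time.monotonic()
        drained = int((now - self.last_leak) * self.leak_rate)
        if drained > 0:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()  # hand these to the processor
            self.last_leak = now

    def try_enqueue(self, request):
        self._leak()
        if len(self.queue) >= self.capacity:
            return False  # bucket overflow: drop the request
        self.queue.append(request)
        return True
```

Note that `try_enqueue` returning True means the request was accepted into the queue, not that it has been processed yet – the latency cost the section above describes.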

Distributed Rate Limiting

For distributed systems, use a single Redis node (or Redis Cluster with consistent hashing) as the shared counter store.

# Lua script for an atomic sliding window log (distributed-safe)
SLIDING_WINDOW_LUA = """
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Remove old entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count current entries
local count = redis.call('ZCARD', key)

if count < limit then
    -- Add this request
    redis.call('ZADD', key, now, now .. math.random())
    redis.call('EXPIRE', key, window)
    return 1  -- allowed
end

return 0  -- denied
"""

For Redis Cluster: shard by user_id using consistent hashing. All requests for user X go to the same shard. This avoids cross-shard coordination at the cost of uneven load if some users are much more active than others.
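The shard-routing idea can be sketched as follows. This is a simplified stable-hash mapping (function name is illustrative); a real deployment would typically use a hash ring or Redis Cluster's own slot mapping to minimize remapping when shards change:

```python
import hashlib

def shard_for_user(user_id, shard_count):
    """Route every request for a given user to the same shard.
    Uses a stable hash (md5) so the mapping is identical across
    processes and restarts, unlike Python's randomized hash()."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % shard_count
```

Because the mapping depends only on `user_id`, all app servers agree on which shard holds a given user's counters, and no cross-shard coordination is needed.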

Race condition without Lua: between ZCARD (read) and ZADD (write), another request could slip through. Lua scripts in Redis execute atomically – no other command runs while the script executes.

Rate Limit Response Headers

Always return rate limit info in response headers so clients can back off gracefully.

X-RateLimit-Limit: 100          # max requests per window
X-RateLimit-Remaining: 47       # requests left in current window
X-RateLimit-Reset: 1685000060   # Unix timestamp when window resets
Retry-After: 30                 # seconds until client can retry (on 429)

Return HTTP 429 Too Many Requests when the limit is exceeded. Include Retry-After so well-behaved clients implement exponential backoff instead of hammering the server.
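A small helper for assembling these headers (a hypothetical function, shown with the conventional X-RateLimit-* names used above):

```python
def rate_limit_headers(limit, remaining, reset_ts, retry_after=None):
    """Build standard rate-limit response headers.
    Retry-After is only attached when given (i.e. on a 429)."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),  # never negative
        "X-RateLimit-Reset": str(reset_ts),
    }
    if retry_after is not None:
        headers["Retry-After"] = str(retry_after)
    return headers
```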

Multi-Tier Rate Limiting

Production systems apply multiple rate limits simultaneously, each with different scopes and limits.

Tier         | Key                 | Typical limit | Purpose
Per-IP       | ip:1.2.3.4          | 1000/hour     | Block scrapers, DDoS
Per-User     | user:12345          | 500/hour      | Fair usage per account
Per-API-key  | key:abc123          | 10000/hour    | Tiered pricing plans
Per-Endpoint | user:12345:/search  | 100/min       | Protect expensive ops
Global       | global:api          | 1M/min        | System-wide protection

Check limits in order from most specific to least specific. A request is denied if any tier exceeds its limit. Use a pipeline to check all tiers in one Redis round trip.
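The check itself reduces to ordered short-circuit logic. A sketch, assuming the per-tier counts were already fetched in one pipelined round trip (function and tier names are illustrative):

```python
def check_tiers(counts, tier_limits):
    """Deny if ANY tier is at or over its limit.
    `tier_limits` is an ordered list of (tier_key, limit), most
    specific first; `counts` maps tier_key -> current count, as
    fetched in a single Redis pipeline. Returns the first violated
    tier key, or None if the request is allowed."""
    for tier_key, limit in tier_limits:
        if counts.get(tier_key, 0) >= limit:
            return tier_key
    return None
```

Returning the violated tier (rather than a bare boolean) makes it easy to populate Retry-After and log which limit fired.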

Algorithm Comparison

Algorithm       | Memory        | Burst handling       | Accuracy    | Best for
Fixed window    | O(1)          | 2x at boundary       | Approximate | Simple internal APIs
Sliding log     | O(requests)   | Exact                | Exact       | Low-traffic, strict limits
Sliding counter | O(1)          | Smoothed             | ~99.997%    | High-traffic production
Token bucket    | O(1)          | Controlled bursts OK | Exact       | APIs with burst tolerance
Leaky bucket    | O(queue size) | Queued               | Exact       | Traffic shaping

