Overview
A token bucket rate limiter controls how many requests a client can make within a time period while allowing short bursts above the average rate. Tokens accumulate in a conceptual bucket at a fixed refill rate R tokens per second up to a burst capacity B. Each request consumes one token; if the bucket is empty the request is rejected or queued. This LLD covers the algorithm mechanics, the database model for rate limit rules, a Redis Lua script for atomic distributed enforcement, multi-tier limits combining per-user, per-endpoint, and global constraints, a comparison with the sliding window variant, and the key design decisions that make this production-ready.
Token Bucket Algorithm
The algorithm state per client is two values: tokens (current token count) and last_refill_at (timestamp of last check). On each request:
- Compute the elapsed time since last_refill_at.
- Add elapsed_seconds * rate_per_second tokens, capped at burst_capacity.
- If tokens >= 1: consume 1 token and allow the request.
- If tokens < 1: reject with HTTP 429 and return a Retry-After header of (1 - tokens) / rate_per_second seconds.
The key insight is that the burst capacity B lets a client that was idle accumulate tokens and then fire a burst of up to B requests before being throttled down to the sustainable rate R/sec.
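The refill-and-consume steps above can be sketched as a single-process Python class. This is an illustration of the math only, not the production path (the distributed version is the Redis Lua script later in this document):

```python
import time

class TokenBucket:
    """Single-process token bucket; same refill math as the distributed version."""

    def __init__(self, rate_per_second, burst_capacity):
        self.rate = rate_per_second
        self.burst = burst_capacity
        self.tokens = float(burst_capacity)     # start with a full bucket
        self.last_refill_at = time.monotonic()

    def allow(self, consume=1.0, now=None):
        """Refill based on elapsed time, then try to consume."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last_refill_at)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_refill_at = now
        if self.tokens >= consume:
            self.tokens -= consume
            return True
        return False
```

With rate=10 and burst=5, an idle client can fire 5 requests back-to-back, after which it is held to the sustained 10/sec refill rate.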
Core Data Model
-- Rate limit rule definitions (stored in DB, cached in app memory)
CREATE TABLE RateLimitRule (
    rule_id         BIGSERIAL PRIMARY KEY,
    key_pattern     VARCHAR(128) NOT NULL,  -- e.g. 'user:{user_id}', 'endpoint:/api/search'
    scope           VARCHAR(32) NOT NULL,   -- 'user', 'endpoint', 'global', 'ip'
    rate_per_second FLOAT NOT NULL,         -- tokens added per second
    burst_capacity  INT NOT NULL,           -- max tokens in bucket
    plan            VARCHAR(32),            -- NULL = applies to all, or 'free', 'pro', 'enterprise'
    is_active       BOOLEAN NOT NULL DEFAULT TRUE,
    description     TEXT,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (key_pattern, plan)
);
-- Example rules:
-- key_pattern='user:{user_id}', scope='user', rate=10, burst=50, plan='free'
-- key_pattern='user:{user_id}', scope='user', rate=100, burst=500, plan='pro'
-- key_pattern='endpoint:/api/search', scope='endpoint', rate=1000, burst=2000, plan=NULL
-- key_pattern='global', scope='global', rate=50000, burst=100000, plan=NULL
-- Audit log for rate limit violations (optional, for monitoring)
CREATE TABLE RateLimitViolation (
    violation_id   BIGSERIAL PRIMARY KEY,
    bucket_key     VARCHAR(256) NOT NULL,
    user_id        BIGINT,
    ip_address     INET,
    endpoint       VARCHAR(255),
    requested_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    tokens_at_time FLOAT NOT NULL,
    rule_id        BIGINT REFERENCES RateLimitRule(rule_id)
);
CREATE INDEX idx_violation_user ON RateLimitViolation(user_id, requested_at DESC);
CREATE INDEX idx_violation_ip ON RateLimitViolation(ip_address, requested_at DESC);
Redis Lua Script for Atomic Token Consumption
Atomicity is critical: without it, two concurrent requests could both read tokens=1, both decide to allow, and both consume a token — resulting in allowing two requests when only one token remained. A Lua script executes atomically on Redis (single-threaded command execution).
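To make the race concrete, here is a deterministic simulation of the lost-update interleaving, with plain Python dict reads and writes standing in for the non-atomic GET/SET round-trips of two app servers:

```python
# One token left in the bucket; two requests arrive concurrently.
bucket = {"tokens": 1}

# Interleaving: both servers READ before either WRITES back.
a_seen = bucket["tokens"]          # server A reads 1
b_seen = bucket["tokens"]          # server B also reads 1

a_allowed = a_seen >= 1
if a_allowed:
    bucket["tokens"] = a_seen - 1  # A writes 0

b_allowed = b_seen >= 1
if b_allowed:
    bucket["tokens"] = b_seen - 1  # B also writes 0 -- lost update

# Both requests were allowed even though only one token existed.
assert a_allowed and b_allowed and bucket["tokens"] == 0
```

Running the read-refill-write cycle inside one Lua script removes every such interleaving point.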
-- KEYS[1]: bucket key, e.g. "rl:user:42:global"
-- ARGV[1]: rate_per_second (float)
-- ARGV[2]: burst_capacity (int)
-- ARGV[3]: current unix timestamp as float (seconds)
-- ARGV[4]: tokens to consume (usually 1)
-- Returns: {allowed (0/1), remaining_tokens (float), retry_after_ms (int)}
local key = KEYS[1]
local rate = tonumber(ARGV[1])
local burst = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local consume = tonumber(ARGV[4])
local data = redis.call('HMGET', key, 'tokens', 'last_refill_at')
local tokens = tonumber(data[1])
local last_refill = tonumber(data[2])
if tokens == nil then
  -- First request: initialize bucket full
  tokens = burst
  last_refill = now
end

-- Refill tokens based on elapsed time
local elapsed = math.max(0, now - last_refill)
tokens = math.min(burst, tokens + elapsed * rate)

local allowed = 0
local retry_after_ms = 0
if tokens >= consume then
  tokens = tokens - consume
  allowed = 1
else
  -- Compute how long until enough tokens are available
  local deficit = consume - tokens
  retry_after_ms = math.ceil((deficit / rate) * 1000)
end

-- TTL: bucket expires after burst/rate seconds of inactivity (2x for safety)
local ttl = math.ceil(burst / rate * 2)
redis.call('HSET', key,  -- HSET takes multiple field-value pairs; HMSET is deprecated
  'tokens', tokens,
  'last_refill_at', now)
redis.call('EXPIRE', key, ttl)
return {allowed, tokens * 1000, retry_after_ms}
-- Note: remaining tokens are returned x1000 because Redis truncates Lua
-- numbers to integers in script replies; the caller divides by 1000.
Python Integration: Multi-Tier Rate Limiting
Production systems apply multiple overlapping limits simultaneously. A single request is checked against per-user, per-endpoint, and global buckets — all must pass.
import time
import redis
import hashlib
from functools import lru_cache
redis_client = redis.Redis(host='redis', port=6379, decode_responses=True)
# Load Lua script once and store SHA
LUA_SCRIPT = open('/app/rate_limit.lua').read()
SCRIPT_SHA = redis_client.script_load(LUA_SCRIPT)
@lru_cache(maxsize=512)
def get_rules_for_plan(plan: str) -> list:
    """Cache rule lookups by plan. Note: lru_cache has no TTL, so pair this
    with a periodic get_rules_for_plan.cache_clear() (e.g. every 60s) so
    rule changes in the DB propagate to app memory."""
    return db.query(
        "SELECT * FROM RateLimitRule WHERE (plan = %s OR plan IS NULL) AND is_active",
        plan,
    )

def check_rate_limit(user_id: int, endpoint: str, user_plan: str, ip: str) -> dict:
    """
    Enforce multi-tier token bucket limits.
    Returns {'allowed': bool, 'retry_after_ms': int, 'violated_tier': str|None}
    """
    now = time.time()
    rules = get_rules_for_plan(user_plan)

    # Build list of (bucket_key, rule) pairs to check
    checks = []
    for rule in rules:
        if rule['scope'] == 'user':
            key = f"rl:user:{user_id}:{rule['rule_id']}"
        elif rule['scope'] == 'endpoint':
            key = f"rl:ep:{hashlib.md5(endpoint.encode()).hexdigest()}:{rule['rule_id']}"
        elif rule['scope'] == 'global':
            key = f"rl:global:{rule['rule_id']}"
        elif rule['scope'] == 'ip':
            key = f"rl:ip:{ip}:{rule['rule_id']}"
        else:
            continue
        checks.append((key, rule))

    # Evaluate all tiers — pipeline EVALSHA calls for efficiency
    pipe = redis_client.pipeline(transaction=False)
    for bucket_key, rule in checks:
        pipe.evalsha(
            SCRIPT_SHA,
            1,                        # number of keys
            bucket_key,               # KEYS[1]
            rule['rate_per_second'],  # ARGV[1]
            rule['burst_capacity'],   # ARGV[2]
            now,                      # ARGV[3]
            1,                        # ARGV[4]: tokens to consume
        )
    results = pipe.execute()

    for (bucket_key, rule), result in zip(checks, results):
        allowed, remaining_x1000, retry_after_ms = result
        if not allowed:
            log_violation(bucket_key, user_id, ip, endpoint,
                          remaining_x1000 / 1000, rule['rule_id'])
            return {
                'allowed': False,
                'retry_after_ms': retry_after_ms,
                'violated_tier': rule['scope'],
                'violated_rule': rule['key_pattern'],
            }
    return {'allowed': True, 'retry_after_ms': 0, 'violated_tier': None}

def rate_limit_middleware(request, next_handler):
    result = check_rate_limit(
        user_id=request.user.id,
        endpoint=request.path,
        user_plan=request.user.plan,
        ip=request.remote_addr,
    )
    if not result['allowed']:
        return http_429(
            retry_after_ms=result['retry_after_ms'],
            violated=result['violated_tier'],
        )
    return next_handler(request)
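On the client side, the retry_after_ms value enables precise backoff instead of blind exponential retry. A minimal sketch (the helper name and response-dict shape are hypothetical, not part of the server code above):

```python
import time

def call_with_backoff(do_request, max_attempts=3, sleep=time.sleep):
    """Retry a request on HTTP 429, sleeping exactly as long as the server's
    retry_after_ms hint says. do_request() is assumed to return a dict like
    {'status': int, 'retry_after_ms': int}."""
    resp = do_request()
    for _ in range(max_attempts - 1):
        if resp['status'] != 429:
            break
        sleep(resp['retry_after_ms'] / 1000.0)  # wait for tokens to refill
        resp = do_request()
    return resp
```

Because the hint is derived from the refill rate, a well-behaved client wakes up exactly when a token becomes available rather than polling.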
Sliding Window Variant Comparison
The sliding window log algorithm stores the exact timestamp of each request in a sorted set and counts requests in the last N seconds. It is more precise than the token bucket but uses more memory.
# Sliding window log using Redis sorted set
def sliding_window_check(redis_client, key, limit, window_seconds):
    now = time.time()
    window_start = now - window_seconds
    pipe = redis_client.pipeline()
    pipe.zremrangebyscore(key, '-inf', window_start)  # remove old entries
    # Member must be unique per request: a bare timestamp can collide under
    # concurrency, so append a random suffix in production.
    pipe.zadd(key, {str(now): now})                   # add current request
    pipe.zcard(key)                                   # count in window
    pipe.expire(key, window_seconds + 1)
    _, _, count, _ = pipe.execute()
    return count <= limit
# Comparison:
# Token Bucket: O(1) memory per bucket. Allows bursts up to burst_capacity.
# Small floating-point state (tokens, timestamp).
# Sliding Window: O(limit) memory per key (one entry per request).
# Precise count — no burst beyond limit in any window.
# Better for strict "no more than N per hour" compliance.
# Fixed Window: O(1) memory. Resets at boundary — double-rate attack possible.
#
# Choose token bucket when bursts are legitimate (API clients doing batch ops).
# Choose sliding window when strict compliance is required (billing, abuse prevention).
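The strict-compliance property in the comparison above can be demonstrated without Redis. This in-memory sketch (a deque of timestamps, O(limit) memory per key) enforces "no more than N in any window":

```python
import collections

class SlidingWindowLog:
    """In-memory sliding window log: O(limit) memory, strict per-window cap."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = collections.deque()

    def allow(self, now):
        # Drop entries that have fallen out of the window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```

Unlike the token bucket, there is no burst reserve: once limit requests have landed in the current window, everything else is rejected until the oldest entry ages out.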
Key Design Decisions
- Lua atomicity prevents race conditions: The entire read-compute-write cycle for token consumption runs as a single atomic Redis command via EVALSHA. Two concurrent requests for the same bucket can never both see the same token count because Redis processes Lua scripts single-threaded. This eliminates the need for distributed locks and keeps latency under 1ms per check.
- Burst capacity handles legitimate traffic spikes: A burst_capacity of 50 for a rate of 10/sec means an idle client can send 50 requests immediately without being throttled, while sustained throughput is capped at 10/sec. This correctly handles clients that batch requests after a quiet period — a common pattern for webhook consumers, mobile sync clients, and cron-based API callers.
- Per-tier rate limits by plan: Maintaining separate RateLimitRule rows per plan (free/pro/enterprise) with different rate and burst values means a free-plan user hitting the per-user limit does not affect pro users’ buckets. Checking all tiers in a single Redis pipeline keeps the overhead to one round-trip regardless of how many tiers apply.
- TTL-based bucket expiry prevents memory leaks: Each bucket key is given a TTL of 2 * burst_capacity / rate seconds. A bucket for rate=10, burst=50 expires after 10 seconds of inactivity, automatically cleaning up state for churned users or inactive endpoints without any explicit eviction job.