Low-Level Design: API Rate Limiter – Token Bucket, Sliding Window, and Distributed Throttling

Core Entities

Two primary entities drive the design:

@dataclass
class RateLimitPolicy:
    policy_id: str
    key_type: str        # USER | IP | API_KEY | GLOBAL
    limit: int           # max requests
    window_seconds: int  # time window
    algorithm: str       # TOKEN_BUCKET | SLIDING_WINDOW_LOG | SLIDING_WINDOW_COUNTER

@dataclass
class RateLimitCounter:
    key: str             # e.g. "user:123" or "ip:1.2.3.4"
    window_start: int    # Unix timestamp of window start
    count: int           # requests in current window

Token Bucket Algorithm

Tokens refill at rate r tokens/second up to capacity C. Each request consumes 1 token. Allows bursting up to C requests while enforcing average rate r.
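The refill arithmetic is easiest to see without Redis first. Here is a minimal single-process sketch with the clock injected so the behavior is deterministic; the class and parameter names are illustrative, not part of the Redis version that follows:

```python
class LocalTokenBucket:
    """Single-process token bucket; state is not shared across instances."""

    def __init__(self, capacity, refill_rate, clock):
        self.capacity = capacity        # max tokens (burst limit C)
        self.refill_rate = refill_rate  # tokens per second (rate r)
        self.clock = clock              # injected time source for testability
        self.tokens = float(capacity)   # start full: allows an initial burst
        self.last_refill = clock()

    def is_allowed(self):
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens < 1:
            return False  # Rate limited
        self.tokens -= 1
        return True
```

With `capacity=3, refill_rate=1.0`, three back-to-back calls succeed (the burst), the fourth fails, and after two simulated seconds exactly two more are admitted.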

import redis
import time

class TokenBucketRateLimiter:
    def __init__(self, redis_client, capacity, refill_rate):
        self.redis = redis_client
        self.capacity = capacity          # max tokens (burst limit)
        self.refill_rate = refill_rate    # tokens per second

    def is_allowed(self, key):
        now = time.time()
        bucket_key = f"tb:{key}"
        # WATCH + MULTI/EXEC for atomicity (optimistic locking)
        with self.redis.pipeline() as pipe:
            while True:
                try:
                    pipe.watch(bucket_key)
                    data = pipe.hgetall(bucket_key)
                    if data:
                        tokens = float(data[b'tokens'])
                        last_refill = float(data[b'last_refill'])
                    else:
                        tokens = self.capacity
                        last_refill = now
                    # Refill tokens based on elapsed time
                    elapsed = now - last_refill
                    tokens = min(self.capacity, tokens + elapsed * self.refill_rate)
                    if tokens < 1:
                        pipe.reset()
                        return False  # Rate limited
                    tokens -= 1
                    pipe.multi()
                    pipe.hset(bucket_key, mapping={'tokens': tokens, 'last_refill': now})
                    pipe.expire(bucket_key, int(self.capacity / self.refill_rate) + 10)
                    pipe.execute()
                    return True
                except redis.WatchError:
                    continue  # Retry on concurrent modification

Trade-offs: allows bursting and enforces a smooth average rate, but without Redis the state is per-instance. Redis WATCH/MULTI/EXEC (optimistic locking) handles concurrent access but can retry repeatedly under high contention – a Lua script gives a single round trip with no retries.

Sliding Window Log

Store timestamp of every request in a sorted set. Count requests in [now – window, now]. Exact accuracy but O(limit) memory per key.

class SlidingWindowLogLimiter:
    def __init__(self, redis_client, limit, window_seconds):
        self.redis = redis_client
        self.limit = limit
        self.window = window_seconds

    def is_allowed(self, key):
        now = time.time()
        window_start = now - self.window
        log_key = f"swl:{key}"

        pipe = self.redis.pipeline()
        # Remove expired entries
        pipe.zremrangebyscore(log_key, 0, window_start)
        # Count requests in window
        pipe.zcard(log_key)
        # Add current request timestamp (identical timestamps collide as
        # sorted-set members; append a request ID to the member in production)
        pipe.zadd(log_key, {str(now): now})
        pipe.expire(log_key, self.window + 1)
        results = pipe.execute()

        count = results[1]
        if count >= self.limit:
            # Remove the request we just added
            self.redis.zrem(log_key, str(now))
            return False
        return True

Memory cost: O(limit) per key – if limit is 10,000 req/hour, that is 10,000 entries in the sorted set. Acceptable for low-traffic APIs, problematic at scale.
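The O(limit) memory cost is concrete in a pure-Python equivalent, which keeps one timestamp per allowed request in a deque (a single-process sketch; the names are illustrative):

```python
from collections import deque

class LocalSlidingWindowLog:
    """Single-process sliding window log; stores one timestamp per request."""

    def __init__(self, limit, window_seconds, clock):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.log = deque()  # one entry per allowed request -> O(limit) memory

    def is_allowed(self):
        now = self.clock()
        # Evict timestamps that have fallen out of the window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True
```

The deque length is exactly the number of requests admitted in the last `window_seconds` – the same entries the Redis sorted set holds.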

Sliding Window Counter (Fixed + Fractional)

Approximate sliding window using two fixed-window counters with weighted interpolation. O(1) memory per key.

class SlidingWindowCounterLimiter:
    def __init__(self, redis_client, limit, window_seconds):
        self.redis = redis_client
        self.limit = limit
        self.window = window_seconds

    def is_allowed(self, key):
        now = time.time()
        current_window = int(now // self.window)
        prev_window = current_window - 1
        elapsed_in_current = now % self.window

        curr_key = f"swc:{key}:{current_window}"
        prev_key = f"swc:{key}:{prev_window}"

        pipe = self.redis.pipeline()
        pipe.get(curr_key)
        pipe.get(prev_key)
        results = pipe.execute()

        curr_count = int(results[0] or 0)
        prev_count = int(results[1] or 0)

        # Approximate sliding window:
        # weight of previous window decreases as we move into current window
        weight = 1 - (elapsed_in_current / self.window)
        estimated = prev_count * weight + curr_count

        if estimated >= self.limit:
            return False

        # Read-then-increment is not atomic; acceptable for an approximate
        # limiter (move the whole check into a Lua script for strictness)
        pipe = self.redis.pipeline()
        pipe.incr(curr_key)
        pipe.expire(curr_key, self.window * 2)
        pipe.execute()
        return True

Accuracy: the estimate assumes requests were uniformly distributed across the previous window. In practice the error is typically under 10% for real traffic patterns. Memory: O(1) per key. Cloudflare has described using this approach in production.
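A concrete number makes the interpolation clearer. Suppose the limit is 100 requests/minute, the previous window saw 84 requests, the current window has 36 so far, and we are 15 seconds into the current window:

```python
limit = 100
window = 60          # seconds
prev_count = 84      # requests in the previous fixed window
curr_count = 36      # requests so far in the current window
elapsed = 15         # seconds into the current window

weight = 1 - elapsed / window   # previous window still covers 75% of the sliding window
estimated = prev_count * weight + curr_count

assert weight == 0.75
assert estimated == 99.0        # 84 * 0.75 + 36
assert estimated < limit        # this request is allowed; the estimate is nearly at the limit
```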

Distributed Rate Limiting

Fixed window with Redis INCR: Redis is single-threaded, so INCR is atomic. Simple and fast.

def is_allowed_fixed_window(redis_client, key, limit, window):
    now = int(time.time())
    window_key = f"{key}:{now // window}"
    count = redis_client.incr(window_key)
    if count == 1:
        # Set the TTL on first increment; wrap INCR+EXPIRE in a Lua script
        # if a crash between the two calls (leaving a key without a TTL) matters
        redis_client.expire(window_key, window)
    return count <= limit
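The known weakness of the fixed window is the boundary: a client can spend the full limit at the end of one window and again at the start of the next, so up to 2 * limit requests land inside a single window-length span. A small in-memory simulation mirroring the INCR logic above (illustrative, no Redis):

```python
from collections import defaultdict

def fixed_window_allowed(counters, key, now, limit, window):
    # Same idea as the Redis INCR version: one counter per fixed window
    window_key = (key, int(now // window))
    counters[window_key] += 1
    return counters[window_key] <= limit

counters = defaultdict(int)
limit, window = 5, 60

# 5 requests at t=59 (end of window 0) all pass
late = [fixed_window_allowed(counters, "user:1", 59, limit, window) for _ in range(5)]
# 5 more at t=61 (start of window 1) also pass
early = [fixed_window_allowed(counters, "user:1", 61, limit, window) for _ in range(5)]

# 10 requests inside a 2-second span, despite a 5-per-minute limit
assert all(late) and all(early)
```

This boundary burst is exactly what the sliding window variants above are designed to prevent.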

# Lua script for atomic token bucket (no WatchError retries)
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)
if tokens < 1 then
    return 0
end
tokens = tokens - 1
redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)  -- HMSET is deprecated
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 10)
return 1
"""

Multi-region rate limiting: strict global consistency adds 50-100ms latency (cross-region round trip). Instead:

  • Each region maintains its own Redis counter
  • Regions sync every 100ms via pub/sub or background job
  • Allow 10% overage tolerance – briefly exceed limit rather than adding latency
  • Use consistent hashing to route same user/IP to same region when possible
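The sync-plus-tolerance idea can be sketched as a small simulation: each region counts locally, a periodic sync folds in peer counts, and admission checks against limit * (1 + tolerance). The class and sync mechanics here are illustrative, not a production gossip protocol:

```python
class RegionalLimiter:
    """One per region: local counter plus last-synced view of peer counts."""

    def __init__(self, limit, tolerance=0.10):
        self.limit = limit
        self.tolerance = tolerance   # accepted overage in exchange for low latency
        self.local_count = 0
        self.synced_remote = 0       # sum of peer counts as of the last sync

    def is_allowed(self):
        estimated_global = self.local_count + self.synced_remote
        if estimated_global >= self.limit * (1 + self.tolerance):
            return False
        self.local_count += 1
        return True

def sync(regions):
    # Stand-in for the periodic pub/sub exchange of per-region counts
    total = sum(r.local_count for r in regions)
    for r in regions:
        r.synced_remote = total - r.local_count

us, eu = RegionalLimiter(limit=100), RegionalLimiter(limit=100)
for _ in range(60):
    us.is_allowed()
    eu.is_allowed()
# Before a sync, each region only sees its own 60 -> both still admit
assert us.is_allowed() and eu.is_allowed()
sync([us, eu])
# After sync, each sees an estimated 122 globally, above 100 * 1.10 -> both reject
assert not us.is_allowed() and not eu.is_allowed()
```

Between syncs the regions briefly over-admit (122 vs a limit of 100 here); the tolerance makes that overage an explicit, bounded design choice rather than a bug.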

Response Headers and 429 Handling

Standard rate limit response headers that clients expect:

def build_rate_limit_headers(limit, remaining, reset_timestamp):
    """Standard rate limit headers."""
    return {
        "X-RateLimit-Limit": str(limit),           # max requests in window
        "X-RateLimit-Remaining": str(remaining),    # requests left in window
        "X-RateLimit-Reset": str(reset_timestamp),  # Unix timestamp when window resets
    }

def build_429_response(retry_after_seconds):
    """429 Too Many Requests response."""
    return {
        "status": 429,
        "headers": {
            "Retry-After": str(retry_after_seconds),  # seconds to wait before retry
            "Content-Type": "application/json",
        },
        "body": {
            "error": "rate_limit_exceeded",
            "message": f"Too many requests. Retry after {retry_after_seconds} seconds.",
        }
    }

# RateLimit-Policy header (IETF draft draft-ietf-httpapi-ratelimit-headers,
# not yet a published RFC):
# RateLimit-Policy: 100;w=3600;burst=10;comment="sliding window"

Retry-After: required on 429 responses. Can be a number of seconds or an HTTP date. Most clients will respect this and back off. X-RateLimit-* headers should be sent on every response, not just 429s, so clients can proactively slow down before hitting the limit.
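For window-based limiters, Retry-After falls out of the window reset time. A small helper, assuming the limiter exposes a reset timestamp (the function name is illustrative):

```python
import math

def retry_after_seconds(reset_timestamp, now):
    """Seconds a client should wait before retrying; never negative, rounded up."""
    return max(0, math.ceil(reset_timestamp - now))

# Window resets at t=1700000060, current time t=1700000012.4 -> wait 48s
assert retry_after_seconds(1700000060, 1700000012.4) == 48
# Already past the reset -> retry immediately
assert retry_after_seconds(1700000060, 1700000075) == 0
```

Rounding up matters: sending `Retry-After: 47` when 47.6 seconds remain invites a retry that gets another 429.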

Frequently Asked Questions

What is the difference between token bucket and sliding window rate limiting?

Token bucket allows bursting: if you have accumulated tokens, you can send a burst of requests up to capacity C. Sliding window log is exact and does not allow bursting beyond the limit – it counts actual requests in the past window_seconds. For APIs that want to allow short bursts (CDN, mobile apps), use token bucket. For strict per-window limits (payment APIs), use sliding window.

Why use a Lua script for Redis rate limiting instead of MULTI/EXEC?

Lua scripts execute atomically on the Redis server – no round trips needed and no WatchError retries under contention. MULTI/EXEC (optimistic locking) can fail and retry repeatedly under high concurrency, degrading to O(n) attempts. The Lua script is a single network round trip and always succeeds atomically, giving predictable O(1) latency.

How accurate is the sliding window counter approximation?

The approximation assumes requests were uniformly distributed across the previous window. The worst case is when all of the previous window's requests arrived at its very end: as the current window progresses they are discounted by the weight even though they are still inside the true sliding window, so up to nearly 2 * limit requests can pass within a single window-length span. In practice with real traffic, the error is typically under 5-10%. Cloudflare has described using this approach for its global rate limiting.

How do you handle rate limiting across multiple data centers without adding latency?

Each region maintains its own Redis counter. Regions sync their counts every 100ms via pub/sub or a gossip protocol. Accept a tolerance of roughly 10% overage – briefly exceeding the limit is better than adding 50-100ms cross-region latency to every request. For user-specific limits, use consistent hashing to route the same user to the same region as much as possible.

What rate limit headers should every API return?

Return X-RateLimit-Limit (window max), X-RateLimit-Remaining (requests left), and X-RateLimit-Reset (Unix timestamp of window reset) on every response. On 429, add Retry-After (seconds to wait). Send these on all responses, not just 429s – well-behaved clients will proactively throttle when Remaining approaches 0, reducing hammering and 429s.

See also: Cloudflare Interview Prep: System Design and Infrastructure

See also: Stripe Interview Prep: System Design and API Architecture

See also: Twitter / X Interview Prep: System Design Questions
