Rate Limiting System Low-Level Design (Token Bucket, Leaky Bucket)

Why Rate Limiting?

Rate limiting protects services from abuse, ensures fair usage, and prevents cascading failures from traffic spikes. Applied at: API gateways (per-user or per-IP limits), payment endpoints (fraud prevention), search endpoints (prevent scraping), notification services (prevent spam).

Four Rate Limiting Algorithms

1. Token Bucket

A bucket holds up to capacity tokens. Tokens are added at a fixed rate (e.g., 10/second). Each request consumes 1 token. If the bucket is empty, the request is rejected. Allows bursts up to the capacity. Most common algorithm — used by AWS, Stripe.

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def allow(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

2. Leaky Bucket

Requests enter a queue (the bucket) and are processed at a fixed output rate. Excess requests are dropped if the queue is full. Provides a smooth constant output rate — useful for traffic shaping. Implemented as a FIFO queue with a fixed drain rate.

3. Fixed Window Counter

Count requests per fixed time window (e.g., 100 requests per minute). Simple: INCR counter in Redis with TTL=60s. Problem: boundary burst — 100 requests at 0:59 and 100 at 1:01 = 200 requests in 2 seconds.

def allow_fixed_window(user_id, limit=100, window=60):
    key = f"rate:{user_id}:{int(time.time() // window)}"
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, window)
    return count <= limit

4. Sliding Window Log

Keep a log of timestamps of recent requests per user. On each request: remove timestamps older than 1 minute, count remaining — if < limit, allow and add timestamp. Accurate but memory-intensive (stores every request timestamp). Use Redis sorted set: ZADD with score=timestamp, ZREMRANGEBYSCORE to expire old entries.

5. Sliding Window Counter (Recommended)

Approximate sliding window using two fixed window counters. For a 60s window: current_window_count + previous_window_count * (overlap_fraction).

def allow_sliding_window(user_id, limit=100, window=60):
    now = time.time()
    current_window = int(now // window)
    prev_window = current_window - 1
    overlap = now % window / window  # fraction of current window elapsed

    curr_key = f"rate:{user_id}:{current_window}"
    prev_key = f"rate:{user_id}:{prev_window}"

    curr_count = int(redis.get(curr_key) or 0)
    prev_count = int(redis.get(prev_key) or 0)

    # Weighted estimate: prev window contributes proportionally
    estimated = prev_count * (1 - overlap) + curr_count
    if estimated >= limit:
        return False

    redis.incr(curr_key)
    redis.expire(curr_key, window * 2)
    return True

Distributed Rate Limiting with Redis

For a distributed system (multiple API servers), rate limits must be enforced globally — not per-server. Use Redis as the central counter store. Atomic operations via Lua script prevent race conditions:

-- Lua script for atomic token bucket check:
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= 1 then
    tokens = tokens - 1
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 3600)
    return 1  -- allowed
else
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    return 0  -- rejected
end

Rate Limit Headers

Return in every response: X-RateLimit-Limit (total allowed), X-RateLimit-Remaining (remaining in window), X-RateLimit-Reset (Unix timestamp when window resets), Retry-After (seconds until next allowed request, on 429). Status 429 Too Many Requests on rejection.

Key Design Decisions

  • Token bucket for API rate limits — allows bursts, easy to understand
  • Sliding window counter for accuracy without memory overhead of log
  • Redis Lua script for atomicity — prevents race conditions in distributed environment
  • Per-user + per-IP limits — per-IP catches unauthenticated abuse, per-user catches authenticated abuse

Stripe system design covers API rate limiting and throttling. See common questions for Stripe interview: rate limiting and API design.

Uber system design covers rate limiting for high-traffic APIs. Review patterns for Uber interview: rate limiting and traffic shaping.

Atlassian system design covers API gateways and rate limiting. See design patterns for Atlassian interview: rate limiting and API gateway design.

Scroll to Top