Rate Limiting System Low-Level Design (Token Bucket, Leaky Bucket)

Why Rate Limiting?

Rate limiting protects services from abuse, ensures fair usage, and prevents cascading failures from traffic spikes. Applied at: API gateways (per-user or per-IP limits), payment endpoints (fraud prevention), search endpoints (prevent scraping), notification services (prevent spam).

Five Rate Limiting Algorithms

1. Token Bucket

A bucket holds up to capacity tokens. Tokens are added at a fixed rate (e.g., 10/second). Each request consumes 1 token. If the bucket is empty, the request is rejected. Allows bursts up to the capacity. Most common algorithm — used by AWS, Stripe.

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def allow(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

2. Leaky Bucket

Requests enter a queue (the bucket) and are processed at a fixed output rate. Excess requests are dropped if the queue is full. Provides a smooth constant output rate — useful for traffic shaping. Implemented as a FIFO queue with a fixed drain rate.
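A minimal in-process sketch of the leaky bucket, mirroring the TokenBucket above (class and parameter names are illustrative, not from a specific library):

```python
import time
from collections import deque

class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity    # max queued requests
        self.leak_rate = leak_rate  # requests drained per second
        self.queue = deque()
        self.last_leak = time.time()

    def allow(self):
        now = time.time()
        # Drain whole requests that have "leaked" since the last check.
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            # Advance by whole leaks only, keeping the fractional remainder.
            self.last_leak += leaked / self.leak_rate
        if len(self.queue) < self.capacity:
            self.queue.append(now)
            return True
        return False
```

Note the contrast with the token bucket: here a quiet period earns no burst credit; admission depends only on current queue depth, so output stays smooth.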

3. Fixed Window Counter

Count requests per fixed time window (e.g., 100 requests per minute). Simple: INCR counter in Redis with TTL=60s. Problem: boundary burst — 100 requests at 0:59 and 100 at 1:01 = 200 requests in 2 seconds.

import time

# Assumes `redis` is a connected client instance (e.g. redis.Redis()).
def allow_fixed_window(user_id, limit=100, window=60):
    key = f"rate:{user_id}:{int(time.time() // window)}"
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, window)
    return count <= limit

4. Sliding Window Log

Keep a log of timestamps of recent requests per user. On each request: remove timestamps older than 1 minute, count remaining — if < limit, allow and add timestamp. Accurate but memory-intensive (stores every request timestamp). Use Redis sorted set: ZADD with score=timestamp, ZREMRANGEBYSCORE to expire old entries.
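An in-memory sketch of the same logic, with the Redis sorted set replaced by a local list so the mechanics are visible (in production the list would be a per-user sorted set manipulated via ZADD / ZREMRANGEBYSCORE / ZCARD):

```python
import time

class SlidingWindowLog:
    def __init__(self, limit, window):
        self.limit = limit
        self.window = window   # seconds
        self.timestamps = []   # sorted request times (stand-in for the sorted set)

    def allow(self, now=None):
        now = time.time() if now is None else now
        cutoff = now - self.window
        # Equivalent of ZREMRANGEBYSCORE key -inf cutoff
        self.timestamps = [t for t in self.timestamps if t > cutoff]
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)  # equivalent of ZADD key now now
            return True
        return False
```

The memory cost is visible here: one entry per allowed request per user, which is why the counter approximation below is usually preferred at scale.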

5. Sliding Window Counter (Recommended)

Approximate a sliding window using two fixed window counters. For a 60s window: estimated = current_window_count + previous_window_count * (1 - fraction_of_current_window_elapsed).

import time

# Assumes `redis` is a connected client instance. Note: the GET/INCR
# sequence below is not atomic -- under concurrency two servers can both
# pass the check. A Redis Lua script closes this race (see the
# distributed section).
def allow_sliding_window(user_id, limit=100, window=60):
    now = time.time()
    current_window = int(now // window)
    prev_window = current_window - 1
    overlap = now % window / window  # fraction of current window elapsed

    curr_key = f"rate:{user_id}:{current_window}"
    prev_key = f"rate:{user_id}:{prev_window}"

    curr_count = int(redis.get(curr_key) or 0)
    prev_count = int(redis.get(prev_key) or 0)

    # Weighted estimate: prev window contributes proportionally
    estimated = prev_count * (1 - overlap) + curr_count
    if estimated >= limit:
        return False

    redis.incr(curr_key)
    redis.expire(curr_key, window * 2)
    return True

Distributed Rate Limiting with Redis

For a distributed system (multiple API servers), rate limits must be enforced globally — not per-server. Use Redis as the central counter store. Atomic operations via Lua script prevent race conditions:

-- Lua script for atomic token bucket check:
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= 1 then
    tokens = tokens - 1
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 3600)
    return 1  -- allowed
else
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    return 0  -- rejected
end
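From Python, the script can be loaded once and invoked via redis-py's register_script, which returns a callable that executes server-side in one round trip. A sketch, assuming a connected redis.Redis client (make_atomic_allow and the bucket: key prefix are illustrative names):

```python
import time

# The Lua token bucket script, embedded for registration.
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= 1 then
    tokens = tokens - 1
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 3600)
    return 1
else
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    return 0
end
"""

def make_atomic_allow(client, capacity, refill_rate):
    # register_script returns a callable; the script body runs atomically
    # inside Redis, so check-and-decrement cannot interleave across servers.
    script = client.register_script(TOKEN_BUCKET_LUA)
    def allow(user_id):
        return script(keys=[f"bucket:{user_id}"],
                      args=[capacity, refill_rate, time.time()]) == 1
    return allow
```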

Rate Limit Headers

Return in every response: X-RateLimit-Limit (total allowed), X-RateLimit-Remaining (remaining in window), X-RateLimit-Reset (Unix timestamp when window resets), Retry-After (seconds until next allowed request, on 429). Status 429 Too Many Requests on rejection.
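A sketch of the header logic (the X-RateLimit-* names are a de facto convention rather than an RFC standard; the function name and shape here are illustrative):

```python
import math

def rate_limit_response(limit, remaining, reset_ts, now):
    """Return (status, headers) given the limiter's current state."""
    allowed = remaining > 0
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining - 1 if allowed else 0),
        "X-RateLimit-Reset": str(int(reset_ts)),
    }
    if allowed:
        return 200, headers
    # Retry-After matters most: without it, clients retry immediately.
    headers["Retry-After"] = str(max(1, math.ceil(reset_ts - now)))
    return 429, headers
```

Emitting the headers on successful responses too lets well-behaved clients throttle themselves before ever hitting a 429.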

Key Design Decisions

  • Token bucket for API rate limits — allows bursts, easy to understand
  • Sliding window counter for accuracy without memory overhead of log
  • Redis Lua script for atomicity — prevents race conditions in distributed environment
  • Per-user + per-IP limits — per-IP catches unauthenticated abuse, per-user catches authenticated abuse
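The last point can be sketched as layered checks; CountingLimiter is a stand-in for any limiter exposing an allow() method (such as the TokenBucket above):

```python
class CountingLimiter:
    """Toy limiter: allows the first n calls, then rejects."""
    def __init__(self, n):
        self.n = n
    def allow(self):
        if self.n > 0:
            self.n -= 1
            return True
        return False

def allow_request(ip_limiter, user_limiter=None):
    # Per-IP check always applies (catches unauthenticated abuse).
    if not ip_limiter.allow():
        return False
    # Per-user check applies only to authenticated requests.
    if user_limiter is not None and not user_limiter.allow():
        return False
    return True
```

One design wrinkle: a request rejected by the per-user check has already consumed an IP token here; some designs instead check both limits before consuming either.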



