System Design Interview: API Rate Limiter Deep Dive (All Algorithms)

API rate limiting is a fundamental system design topic — nearly every API at scale needs it. This deep-dive covers algorithms, distributed implementation, and real-world edge cases. Commonly tested at Stripe, Cloudflare, LinkedIn, and any company with public APIs.

Why Rate Limiting?

  • Prevent abuse: Stop malicious clients from overwhelming your API
  • Ensure fair usage: Prevent one client from consuming all capacity
  • Cost control: ML inference APIs cost money per call — rate limits protect revenue
  • Availability: Protect downstream services from traffic spikes
  • Business tiers: Free tier (1K req/day), Pro tier (100K req/day)
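The tier-based limits in the last bullet reduce to a simple lookup at request time. The tier names and quotas below are illustrative examples, not any particular provider's pricing:

```python
# Illustrative tier-to-quota mapping (names and numbers are examples only)
TIER_LIMITS = {
    "free": {"requests_per_day": 1_000},
    "pro": {"requests_per_day": 100_000},
}

def daily_limit(tier: str) -> int:
    """Look up the daily request quota for a billing tier, defaulting to free."""
    return TIER_LIMITS.get(tier, TIER_LIMITS["free"])["requests_per_day"]
```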

Where to Implement Rate Limiting

Client → CDN (global rate limit) → API Gateway (per-route limit) → Microservice
                                           ↓
                                   Rate Limiter Service
                                           ↓
                                     Redis Cluster

Best placement: API Gateway (centralized, before application logic runs). For Cloudflare-scale: implement at CDN edge nodes.

Algorithm 1: Token Bucket

import time
import redis

class TokenBucketRateLimiter:
    """
    Token bucket: bucket holds capacity tokens, refills at rate tokens/second.
    Request consumes 1 token. Allows bursts up to capacity.
    
    Best for: APIs that want to allow short bursts (bursty traffic).
    Used by: AWS, Stripe, most API gateways.
    """

    def __init__(self, redis_client, capacity: int, refill_rate: float):
        self.redis = redis_client
        self.capacity = capacity      # max tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second

    def allow_request(self, client_id: str) -> bool:
        """Atomic check-and-consume using Redis Lua script"""
        lua_script = """
        local key = KEYS[1]
        local capacity = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])
        local requested = tonumber(ARGV[4])

        local data = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(data[1]) or capacity
        local last_refill = tonumber(data[2]) or now

        -- Add tokens based on time elapsed
        local elapsed = now - last_refill
        tokens = math.min(capacity, tokens + elapsed * refill_rate)

        if tokens >= requested then
            tokens = tokens - requested
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)
            return 1  -- allowed
        else
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)
            return 0  -- denied
        end
        """
        key = f"rate_limit:token:{client_id}"
        now = time.time()
        result = self.redis.eval(lua_script, 1, key,
                                self.capacity, self.refill_rate, now, 1)
        return bool(result)
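The Lua script's refill-and-consume math can be mirrored in plain Python, which is handy for single-process use or for unit-testing the arithmetic without a Redis instance. This is a sketch using the same capacity/refill semantics as above:

```python
import time

class InProcessTokenBucket:
    """Single-process token bucket mirroring the Lua script's math."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)        # start full, like the Lua default
        self.last_refill = time.monotonic()

    def allow_request(self, tokens_needed: int = 1) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= tokens_needed:
            self.tokens -= tokens_needed
            return True
        return False

bucket = InProcessTokenBucket(capacity=3, refill_rate=1.0)
results = [bucket.allow_request() for _ in range(4)]
# First 3 requests drain the burst capacity; the 4th is denied because
# almost no refill happens in the microseconds between calls.
```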

Algorithm 2: Sliding Window Counter

class SlidingWindowCounterRateLimiter:
    """
    Approximation of sliding window using two fixed windows.
    Current window count + (1 - elapsed_fraction) * previous window count.
    
    Cheaper than true sliding window log (no per-request storage).
    Closely approximates a true sliding window for steady traffic
    (Cloudflare reported negligible error in production).
    
    Best for: High-throughput APIs where exact precision isn't required.
    """

    def __init__(self, redis_client, limit: int, window_seconds: int):
        self.redis = redis_client
        self.limit = limit
        self.window = window_seconds

    def allow_request(self, client_id: str) -> tuple[bool, dict]:
        now = time.time()
        current_window = int(now // self.window)
        prev_window = current_window - 1
        elapsed_in_window = now % self.window

        curr_key = f"rate_limit:sw:{client_id}:{current_window}"
        prev_key = f"rate_limit:sw:{client_id}:{prev_window}"

        pipe = self.redis.pipeline()
        pipe.incr(curr_key)
        pipe.expire(curr_key, self.window * 2)
        pipe.get(prev_key)
        curr_count, _, prev_count = pipe.execute()

        curr_count = int(curr_count)
        prev_count = int(prev_count) if prev_count else 0

        # Weight previous window by remaining fraction
        weight = 1 - (elapsed_in_window / self.window)
        estimated_count = curr_count + prev_count * weight

        remaining = max(0, self.limit - estimated_count)
        reset_at = (current_window + 1) * self.window

        headers = {
            'X-RateLimit-Limit': str(self.limit),
            'X-RateLimit-Remaining': str(int(remaining)),
            'X-RateLimit-Reset': str(int(reset_at)),
        }

        if estimated_count > self.limit:
            headers['Retry-After'] = str(int(reset_at - now))
            return False, headers

        return True, headers
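To see the weighting in action, here is the arithmetic for a hypothetical case: a 60-second window with limit 100, where the previous window recorded 84 requests and we are 15 seconds into the current window with 36 requests counted so far:

```python
window = 60
limit = 100
prev_count = 84          # requests in the previous fixed window
curr_count = 36          # requests so far in the current window
elapsed_in_window = 15   # seconds into the current window

# Weight the previous window by the fraction of it still "inside"
# the sliding window: 1 - 15/60 = 0.75
weight = 1 - elapsed_in_window / window
estimated = curr_count + prev_count * weight   # 36 + 84 * 0.75 = 99.0
allowed = estimated <= limit                   # just under the limit
```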

Algorithm 3: Leaky Bucket

class LeakyBucketRateLimiter:
    """
    Leaky bucket: requests enter a queue (bucket); processed at constant rate.
    Unlike token bucket, smooths traffic — no burst allowed.
    
    Best for: network traffic shaping, video streaming (constant bitrate).
    Less common for API rate limiting (no burst tolerance).
    """
    def __init__(self, rate_per_second: int, capacity: int):
        self.rate = rate_per_second    # processing rate (leak rate)
        self.capacity = capacity       # max queue size
        self.queue = []                # pending requests
        self.last_drain = time.time()  # timestamp of the last drain pass

    def try_add_request(self, request) -> bool:
        # Drain processed requests first
        now = time.time()
        to_process = min(len(self.queue),
                        int((now - self.last_drain) * self.rate))
        self.queue = self.queue[to_process:]
        self.last_drain = now

        if len(self.queue) >= self.capacity:
            return False  # bucket full, drop request

        self.queue.append(request)
        return True
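A standalone check of the drain arithmetic used above (independent of the class, so it needs no timing): at a leak rate of 2 requests/second, a 1-second gap drains two queued items.

```python
# Standalone illustration of the leaky bucket's drain step
rate = 2                      # requests processed (leaked) per second
queue = ["r1", "r2", "r3"]    # pending requests, oldest first
elapsed = 1.0                 # seconds since the last drain

# Drain as many requests as the leak rate allows, never more than queued
to_process = min(len(queue), int(elapsed * rate))
queue = queue[to_process:]    # the two oldest requests were processed
```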

Distributed Rate Limiting

The Distributed Challenge

With multiple API gateway instances, each keeps a local counter and no instance sees the full picture: a client spreading 100 requests across 10 gateways registers only 10 at each, so a limit of 50 never trips. Solutions:

Option 1: Centralized Redis (Most Common)

class DistributedRateLimiter:
    """
    All gateway instances share a single Redis cluster.
    Lua script ensures atomicity — no race conditions.
    
    Pros: Accurate, simple
    Cons: Redis becomes a bottleneck, adds ~1ms latency per request
    """
    # All implementations above use this approach
    # Redis operations are O(1) and atomic via Lua scripts
    pass

# For very high throughput: Redis Cluster with consistent hashing
# Each client_id maps to a specific Redis shard — no cross-shard operations
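One simple way to pin each client_id to a shard is a hash-mod scheme, sketched below. The shard names are hypothetical; production Redis Cluster actually uses CRC16 hash slots, and true consistent hashing would add virtual nodes to smooth resharding:

```python
import hashlib

SHARDS = ["redis-0", "redis-1", "redis-2"]  # hypothetical shard names

def shard_for(client_id: str) -> str:
    """Deterministically map a client_id to one Redis shard."""
    digest = hashlib.sha256(client_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same client always lands on the same shard, so all of its
# rate-limit state lives in one place — no cross-shard operations.
```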

Option 2: Local Counter + Sync

class LocalCounterRateLimiter:
    """
    Each gateway maintains local counter.
    Periodically sync with Redis to redistribute quota.
    
    Pros: No Redis round-trip per request (very fast)
    Cons: Allows up to N*local_limit burst (N = number of gateways)
    Use when: approximate rate limiting is acceptable
    """
    def __init__(self, total_limit: int, num_gateways: int,
                 sync_interval_seconds: float = 0.1):
        self.local_limit = total_limit // num_gateways
        self.local_count = 0
        self.sync_interval = sync_interval_seconds
        self.last_sync = time.time()

    def allow_request(self, client_id: str) -> bool:
        if time.time() - self.last_sync > self.sync_interval:
            self._sync_with_redis(client_id)

        self.local_count += 1
        return self.local_count <= self.local_limit

    def _sync_with_redis(self, client_id: str):
        """Push local count to Redis, pull global count"""
        # Atomic: add local count, get global total, reset local
        pass
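The `_sync_with_redis` stub could work roughly as follows. This sketch uses a plain dict as a stand-in for Redis so it is runnable; with a real client, the dict update would be an atomic `INCRBY`. The key name and stop-serving policy are assumptions:

```python
class SyncingCounter:
    """Sketch of the periodic sync step, with a dict standing in for Redis."""

    def __init__(self, shared_store: dict, total_limit: int, num_gateways: int):
        self.shared = shared_store            # stand-in for Redis
        self.total_limit = total_limit
        self.local_limit = total_limit // num_gateways
        self.local_count = 0

    def sync(self, client_id: str) -> bool:
        key = f"rate_limit:global:{client_id}"
        # Push the local count into the shared total (INCRBY on real Redis),
        # then reset the local counter for the next interval.
        self.shared[key] = self.shared.get(key, 0) + self.local_count
        self.local_count = 0
        # If the global total already exceeds the limit, stop serving locally.
        return self.shared[key] < self.total_limit

# Two gateways, global limit 50, each saw 30 requests this interval:
store = {}
g1 = SyncingCounter(store, total_limit=50, num_gateways=2)
g1.local_count = 30
first_ok = g1.sync("client-a")    # global total 30, still under the limit
g2 = SyncingCounter(store, total_limit=50, num_gateways=2)
g2.local_count = 30
second_ok = g2.sync("client-a")   # global total 60, limit exceeded
```

This also shows the stated downside concretely: both gateways served 30 requests before the sync caught the overage.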

Rate Limit by What?

def get_rate_limit_key(request) -> str:
    """
    Different granularities — choose ONE based on requirements.
    (Only the first return runs; the rest are shown as alternatives.)
    """
    # By API key (most common for public APIs)
    return f"api_key:{request.headers['X-API-Key']}"

    # Alternatives — swap in as needed:
    # By user ID (authenticated endpoints):
    #   f"user:{request.user_id}"
    # By IP (unauthenticated endpoints / DDoS protection):
    #   f"ip:{request.remote_addr}"
    # By IP + endpoint (finer grained):
    #   f"ip:{request.remote_addr}:endpoint:{request.path}"
    # By user + endpoint (different limits per endpoint):
    #   f"user:{request.user_id}:endpoint:{request.path}"
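Real systems often apply several of these dimensions at once, with the most restrictive limit winning. A hypothetical key builder (all names here are illustrative) might return every applicable key, each checked against its own limiter:

```python
from typing import List, Optional

def rate_limit_keys(ip: str, path: str,
                    api_key: Optional[str] = None,
                    user_id: Optional[str] = None) -> List[str]:
    """Build every applicable limit key; a request must pass ALL of them."""
    keys = [f"ip:{ip}"]                      # always on: first line of defense
    if api_key:
        keys.append(f"api_key:{api_key}")
    if user_id:
        keys.append(f"user:{user_id}")
        keys.append(f"user:{user_id}:endpoint:{path}")
    return keys

keys = rate_limit_keys(ip="10.0.0.5", path="/search",
                       api_key="key-123", user_id="u7")
```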

Response Headers

from flask import Flask, request, jsonify, make_response

app = Flask(__name__)
rate_limiter = SlidingWindowCounterRateLimiter(redis_client, limit=100, window_seconds=60)

@app.before_request
def check_rate_limit():
    client_id = request.headers.get('X-API-Key', request.remote_addr)
    allowed, headers = rate_limiter.allow_request(client_id)

    if not allowed:
        response = make_response(jsonify({
            'error': 'rate_limit_exceeded',
            'message': 'Too many requests. See Retry-After header.',
        }), 429)
        for key, value in headers.items():
            response.headers[key] = value
        return response

    # Store headers to add to success response
    request.rate_limit_headers = headers

@app.after_request
def add_rate_limit_headers(response):
    if hasattr(request, 'rate_limit_headers'):
        for key, value in request.rate_limit_headers.items():
            response.headers[key] = value
    return response

# RFC 6585 defines the 429 status code; the X-RateLimit-* headers below are a
# de facto convention (an IETF draft standardizes them as RateLimit-*):
# X-RateLimit-Limit: 100
# X-RateLimit-Remaining: 73
# X-RateLimit-Reset: 1697500860 (Unix timestamp)
# Retry-After: 45 (seconds to wait, only on 429)

Algorithm Comparison

Algorithm               Burst Allowed?        Memory           Accuracy     Best For
Token Bucket            Yes (up to capacity)  O(1) per client  Exact        General APIs
Leaky Bucket            No                    O(queue size)    Exact        Traffic shaping
Fixed Window            Yes (2x at boundary)  O(1)             Approximate  Simple cases
Sliding Window Log      No                    O(requests)      Exact        Strict limits
Sliding Window Counter  Minimal               O(1)             ~99.9%       High-scale APIs

Interview recommendation: Start with Token Bucket (intuitive, allows burst) or Sliding Window Counter (accurate, O(1) memory). Mention that Cloudflare uses Sliding Window Counter in their published architecture.
