System Design Interview: Rate Limiter

Rate limiters protect APIs from overload, prevent abuse, and enforce fair usage. Almost every system design interview that involves an API will ask about rate limiting at some point — either as a standalone question or as a component you need to add. Understanding the algorithms, their tradeoffs, and a production-grade distributed implementation is essential.

Why Rate Limit?

  • Protect downstream services from overload (cascading failures)
  • Prevent abuse (scraping, brute-force login, DDoS amplification)
  • Enforce API tier quotas (free: 100 req/day, pro: 10,000 req/day)
  • Control costs for expensive operations (ML inference, external API calls)

Rate Limiting Algorithms

1. Token Bucket

A bucket holds tokens up to a capacity. Tokens are added at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected (or queued).

import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def allow(self) -> bool:
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

Properties: allows bursts up to bucket capacity; smooth average rate. Used by AWS, Stripe, and most production systems. Simple to implement with Redis.

2. Leaky Bucket

Requests enter a queue (the bucket) and are processed at a fixed output rate. If the queue is full, incoming requests are dropped. Produces a perfectly smooth output rate — no bursts allowed.

Properties: uniform output rate, good for smoothing traffic to downstream services. Less common for API rate limiting (callers get unpredictable delays), better for traffic shaping in network devices.
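Unlike the token bucket above, no code accompanies this one, so here is a minimal in-memory sketch (single-threaded; the class and parameter names are illustrative, not from any library):

```python
import time
from collections import deque

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity      # max requests the queue holds
        self.leak_rate = leak_rate    # requests drained per second
        self.queue = deque()
        self.last_leak = time.time()

    def _leak(self, now: float) -> None:
        # Drain however many requests would have been processed since the
        # last check, at the fixed output rate.
        drained = int((now - self.last_leak) * self.leak_rate)
        if drained > 0:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now

    def allow(self) -> bool:
        now = time.time()
        self._leak(now)
        if len(self.queue) >= self.capacity:
            return False              # bucket full: drop the request
        self.queue.append(now)
        return True
```

A production version would process the queued requests asynchronously at the leak rate; this sketch only shows the admission decision.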

3. Fixed Window Counter

Divide time into fixed windows (e.g., 1-minute buckets). Count requests per window; reject if over limit.

# Redis implementation (assumes "import time" and an initialized
# redis-py client bound to the name "redis")
def allow_fixed_window(user_id: str, limit: int, window_sec: int) -> bool:
    now = int(time.time())
    window_key = f"rl:{user_id}:{now // window_sec}"

    count = redis.incr(window_key)
    if count == 1:
        redis.expire(window_key, window_sec * 2)  # TTL cleans up stale windows

    return count <= limit

Problem: boundary burst. A user can make the full limit of requests at :59 and the full limit again at :01 of the next minute, so 2x the limit lands within 2 seconds. This is the key weakness to mention in interviews.
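The boundary burst is easy to demonstrate with a self-contained in-memory counter (time is passed in explicitly so the run is deterministic; names are illustrative):

```python
from collections import defaultdict

class FixedWindow:
    def __init__(self, limit: int, window_sec: int):
        self.limit = limit
        self.window_sec = window_sec
        self.counts = defaultdict(int)

    def allow(self, now: float) -> bool:
        window = int(now // self.window_sec)
        if self.counts[window] >= self.limit:
            return False
        self.counts[window] += 1
        return True

rl = FixedWindow(limit=100, window_sec=60)
# 100 requests at t=59s (tail of window 0): all allowed
burst1 = sum(rl.allow(59.0) for _ in range(100))
# 100 more at t=61s (head of window 1): all allowed again
burst2 = sum(rl.allow(61.0) for _ in range(100))
# 200 requests accepted in a 2-second span: double the per-minute limit
```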

4. Sliding Window Log

Store a timestamp log of each request. On each request, remove entries older than the window; count remaining; allow if under limit.

def allow_sliding_log(user_id: str, limit: int, window_sec: int) -> bool:
    now = time.time()
    key = f"rl:log:{user_id}"

    pipe = redis.pipeline()
    # Remove entries that fell outside the window
    pipe.zremrangebyscore(key, 0, now - window_sec)
    # Count what remains
    pipe.zcard(key)
    # Record this request (note: identical timestamps collide as sorted-set
    # members; production code appends a unique suffix such as a request ID)
    pipe.zadd(key, {str(now): now})
    pipe.expire(key, window_sec)
    _, count, *_ = pipe.execute()

    # count is taken BEFORE adding the current request; denied requests are
    # still logged and therefore count toward the window
    return count < limit

Properties: accurate — no boundary burst. Cost: O(requests in window) memory per user. At 1000 req/min per user, 1 million users = 60 GB of log data. Not practical at scale.
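A back-of-envelope check of the figure above, assuming roughly 60 bytes of Redis sorted-set overhead per logged timestamp (the per-entry constant is an assumption, not a measured value):

```python
# Memory estimate for a sliding window log at scale
entries_per_user = 1_000        # 1000 req/min window, fully used
users = 1_000_000               # active users
bytes_per_entry = 60            # assumed Redis sorted-set overhead per entry
total_gb = entries_per_user * users * bytes_per_entry / 1e9
```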

5. Sliding Window Counter (Hybrid) — Recommended

Combine the space efficiency of fixed windows with the accuracy of sliding windows. Estimate the current window count using a weighted sum of the previous window and current window.

def allow_sliding_counter(user_id: str, limit: int, window_sec: int) -> bool:
    now = time.time()
    current_window = int(now // window_sec)
    prev_window = current_window - 1

    curr_key = f"rl:{user_id}:{current_window}"
    prev_key = f"rl:{user_id}:{prev_window}"

    # Note: this read-then-increment is not atomic across app servers; the
    # Lua script section below shows how to close that race
    pipe = redis.pipeline()
    pipe.get(curr_key)
    pipe.get(prev_key)
    curr_count, prev_count = pipe.execute()

    curr_count = int(curr_count or 0)
    prev_count = int(prev_count or 0)

    # What fraction of previous window overlaps with our rolling window?
    elapsed_in_current = now - (current_window * window_sec)
    prev_weight = 1 - (elapsed_in_current / window_sec)

    estimated_count = prev_count * prev_weight + curr_count

    if estimated_count >= limit:
        return False

    redis.incr(curr_key)
    redis.expire(curr_key, window_sec * 2)
    return True

Properties: O(1) memory per user (just two counters), highly accurate (within ~0.003% of true sliding window for most traffic patterns), used by Cloudflare and similar systems.

Algorithm Comparison

Algorithm         Memory          Accuracy               Burst Handling          Use Case
Token Bucket      O(1)            High                   Allows up to capacity   API rate limiting (most common)
Leaky Bucket      O(queue size)   Perfect smoothing      Drops or delays         Traffic shaping
Fixed Window      O(1)            Low (boundary burst)   2x limit possible       Simple internal quotas
Sliding Log       O(requests)     Perfect                Precise                 Low-traffic, exact limits
Sliding Counter   O(1)            Very high (~99.997%)   Smooth                  Production at scale

Distributed Rate Limiter Architecture

  Client → [API Gateway] → [Rate Limiter Middleware]
                                    |
                              [Redis Cluster]
                                    |
                    [Rule Service] ← [Config DB]

Rate Limiter Middleware:
  1. Identify the rate limit key (IP, user_id, API key, endpoint)
  2. Apply the algorithm against Redis
  3. Set response headers: X-RateLimit-Limit, X-RateLimit-Remaining,
     X-RateLimit-Reset, Retry-After
  4. If allowed: forward to backend
  5. If denied: return 429 Too Many Requests
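The middleware steps above can be sketched as a framework-agnostic function. Here check_limit is a hypothetical stub standing in for whichever Redis-backed algorithm the limiter uses (it always denies here so the 429 path is visible):

```python
import json
import time

def check_limit(key: str):
    # Stub for a Redis-backed algorithm; a real implementation would run
    # one of the algorithms above. Returns (allowed, remaining, reset_ts).
    return False, 0, int(time.time()) + 30

def rate_limit_middleware(key: str, forward):
    allowed, remaining, reset_ts = check_limit(key)
    headers = {
        "X-RateLimit-Limit": "100",
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_ts),
    }
    if not allowed:
        headers["Retry-After"] = str(max(0, reset_ts - int(time.time())))
        body = json.dumps({"error": "rate_limit_exceeded",
                           "message": "Too many requests."})
        return 429, headers, body
    return forward()  # allowed: pass through to the backend handler
```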

Redis Lua Script for Atomicity

Increment and check must be atomic to prevent race conditions. Use a Lua script (executes atomically on Redis):

-- token_bucket.lua
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local bucket = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

local allowed = 0
if tokens >= requested then
    tokens = tokens - requested
    allowed = 1
end

-- Persist state and refresh the TTL in both branches
redis.call("HMSET", key, "tokens", tokens, "last_refill", now)
redis.call("EXPIRE", key, math.ceil(capacity / refill_rate) * 2)
return allowed  -- 1 = allowed, 0 = denied

Rate Limit Rules and Hierarchies

Production systems layer multiple rules:

  • IP-level: 100 req/s — protects against unauthenticated DDoS
  • User-level: 1000 req/min — per-user API quota
  • API key-level: tier-based (free/pro/enterprise)
  • Endpoint-level: expensive endpoints (e.g., ML inference: 10 req/min)
  • Global: total system capacity, circuit-breaker level

Apply rules in order from cheapest check to most expensive. IP check is a hash lookup; user check requires authentication; endpoint rules add another lookup.
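The cheapest-first ordering can be sketched with a short-circuiting chain. The three check functions here are hypothetical stubs, not real rule implementations:

```python
def check_ip(ip: str) -> bool:
    return True                          # stub: e.g. 100 req/s per IP

def check_user(user_id: str) -> bool:
    return True                          # stub: e.g. 1000 req/min per user

def check_endpoint(endpoint: str) -> bool:
    return endpoint != "/ml/inference"   # stub: expensive endpoint denied

def allow_request(ip: str, user_id: str, endpoint: str) -> bool:
    # Evaluate cheapest rule first and stop at the first denial, so the
    # cheap checks shield the expensive ones from unnecessary work.
    for check, key in ((check_ip, ip),
                       (check_user, user_id),
                       (check_endpoint, endpoint)):
        if not check(key):
            return False
    return True
```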

Handling Distributed Rate Limiter Race Conditions

In a multi-datacenter setup, each region has its own Redis. A user can exceed limits by hitting different datacenters simultaneously (race). Solutions:

  • Centralized Redis: single source of truth, but adds latency for inter-region traffic (50-150ms)
  • Local rate limiter + sync: each region enforces limit / N (N regions), sync periodically. Simple, low latency, slightly inaccurate
  • Eventual consistency: accept slight over-limit in exchange for low latency — add a small tolerance buffer (e.g., allow 110% of limit)
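The second and third options combine naturally: each region enforces its share of the global limit, inflated by a small tolerance buffer. A sketch, where the 10% default buffer is an assumed policy value:

```python
import math

def per_region_limit(global_limit: int, n_regions: int,
                     tolerance: float = 1.10) -> int:
    # Each region enforces its slice of the global limit locally, with a
    # tolerance buffer to absorb uneven traffic across regions.
    return math.ceil(global_limit * tolerance / n_regions)

# e.g. 1000 req/min across 3 regions with a 10% buffer:
# each region enforces ceil(1100 / 3) = 367 req/min locally
```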

Response Headers

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1735689600  # Unix timestamp when limit resets
Retry-After: 30                # seconds until next retry
Content-Type: application/json

{"error": "rate_limit_exceeded", "message": "Too many requests. Retry after 30 seconds."}

Interview Discussion Points

  • What is the boundary burst problem? Fixed window allows 2x limit at window boundaries
  • Why use Redis over a local in-memory counter? Shared across multiple app server instances
  • How do you handle Redis unavailability? Fail open (allow all) or fail closed (deny all) — depends on business requirement. Most public APIs fail open to avoid impacting legitimate users during outages
  • How do you rate limit unauthenticated requests? By IP — use X-Forwarded-For carefully (can be spoofed), combine with other signals

Frequently Asked Questions

What is the best rate limiting algorithm for production APIs?

The sliding window counter is the best balance for most production systems: O(1) memory per user (just two counters), highly accurate (within 0.003% of a true sliding window), and trivial to implement in Redis. Token bucket is equally good for APIs that want to allow short bursts — it lets users accumulate tokens up to a capacity then spend them in a burst. Fixed window is the simplest but has the boundary burst problem (allows 2x the limit at window boundaries). Sliding window log is the most accurate but requires O(requests) memory per user, which is impractical at scale.

How do you implement a rate limiter in a distributed system?

Use a centralized Redis store with atomic Lua scripts. The Lua script reads the current counter, checks against the limit, increments if allowed, and sets expiry — all in one atomic operation, eliminating race conditions. For multi-region setups, options are: (1) centralized Redis with cross-region replication (adds 50-150ms latency), (2) per-region limits at limit/N (may allow slightly more than limit total), or (3) accept eventual consistency with a small tolerance buffer. Most companies choose per-region limits for latency reasons and accept occasional minor over-limit.

What HTTP status code should a rate limiter return?

Return 429 Too Many Requests with a Retry-After header indicating how many seconds until the limit resets. Also include X-RateLimit-Limit (the limit), X-RateLimit-Remaining (remaining requests in current window), and X-RateLimit-Reset (Unix timestamp when the window resets). The response body should include a human-readable error message and a machine-readable error code. Never return 403 Forbidden for rate limiting — that implies authorization failure, not temporary throttling.

