Why Rate Limiting?
Rate limiting protects services from overload, abuse, and cost explosions. Without it, a single misbehaving client can saturate your API and degrade service for all other clients (the noisy-neighbor problem); scrapers and bots can exhaust your compute budget; and DDoS attacks can reach your application tier. Rate limiting is applied at multiple layers:
- At the API gateway, before requests reach application servers
- Per user/API key, for fair sharing
- Per endpoint, so expensive operations like report generation get tighter limits
- Per IP, for blocking abusive sources
Algorithm 1: Token Bucket
Each client has a bucket that holds up to capacity tokens. Tokens are added at a fixed refill rate (e.g., 100 tokens per second). Each request consumes 1 token. If the bucket is empty, the request is rejected (HTTP 429). The bucket never exceeds capacity — excess tokens are discarded.
-- Redis-based token bucket (Lua script, executed atomically via EVAL)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2]) -- tokens per second
local now = tonumber(ARGV[3]) -- current timestamp (ms)
local cost = tonumber(ARGV[4]) -- tokens this request needs (usually 1)

local data = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

-- Add tokens based on elapsed time
local elapsed = (now - last_refill) / 1000.0
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= cost then
    tokens = tokens - cost
    redis.call("HMSET", key, "tokens", tokens, "last_refill", now)
    redis.call("EXPIRE", key, 3600)
    return 1 -- allowed
else
    redis.call("HMSET", key, "tokens", tokens, "last_refill", now)
    redis.call("EXPIRE", key, 3600)
    return 0 -- rejected
end
Token bucket allows bursting: a client that has been idle accumulates tokens up to capacity and can send a burst of requests before being throttled. A capacity of 200 with a refill rate of 100/sec means the client can burst 200 requests instantly, then sustains 100/sec. This is appropriate for APIs where occasional bursts are acceptable (user clicks, webhook retries).
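The refill-and-consume bookkeeping from the Lua script can be sketched as a single-process Python class (illustrative only: the class name and injectable clock are my additions, and an in-memory limiter offers none of Redis's cross-server atomicity):

```python
import time

class TokenBucket:
    """In-memory token bucket; `clock` is injectable for testing."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.clock = clock
        self.tokens = capacity          # start full, matching the Lua script
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With capacity 200 and refill rate 100/sec, an idle client can burst 200 requests instantly and then sustains 100/sec, exactly the behavior described above.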
Algorithm 2: Sliding Window Log
For each client, maintain a sorted set of timestamps for all requests in the current window. On each request: remove all timestamps older than window_size ago, count remaining entries, reject if count >= limit, otherwise add current timestamp and allow.
-- Redis sliding window log
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window_ms = tonumber(ARGV[2]) -- e.g., 60000 for 1 minute
local now = tonumber(ARGV[3])
-- Remove old entries
redis.call("ZREMRANGEBYSCORE", key, 0, now - window_ms)
-- Count entries in window
local count = redis.call("ZCARD", key)
if count < limit then
    -- Note: the member must be unique; two requests arriving in the same
    -- millisecond would collide here, so production scripts often append
    -- a sequence number or random suffix to the member
    redis.call("ZADD", key, now, now) -- score=timestamp, member=timestamp
    redis.call("EXPIRE", key, math.ceil(window_ms / 1000) + 1)
    return 1 -- allowed
else
    return 0 -- rejected
end
Sliding window log is precise: it counts exactly the requests in any window ending at “now”. Downside: memory usage is one sorted-set entry per allowed request, up to limit entries per client per window. For a limit of 1,000 requests/minute, that is up to 1,000 entries per active user. At scale this can be expensive.
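A single-process sketch of the same log using a deque (illustrative; the class name is mine, and there is no atomicity across servers):

```python
from collections import deque
import time

class SlidingWindowLog:
    """Exact rolling-window limiter; one timestamp kept per allowed request."""

    def __init__(self, limit: int, window_s: float, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.log = deque()  # timestamps of allowed requests, oldest first

    def allow(self) -> bool:
        now = self.clock()
        # Evict entries older than the window (mirrors ZREMRANGEBYSCORE)
        while self.log and self.log[0] <= now - self.window_s:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

Because only allowed requests are recorded, memory per client is bounded by the limit itself.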
Algorithm 3: Fixed Window Counter
The simplest approach: divide time into fixed buckets (e.g., 1-minute windows). Each client gets a counter per bucket. Increment on each request; reject if counter exceeds limit.
# Python: fixed window counter with Redis
import redis, time
r = redis.Redis()
def is_allowed(user_id: str, limit: int, window_seconds: int = 60) -> bool:
    window = int(time.time()) // window_seconds  # current window number
    key = f"ratelimit:{user_id}:{window}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_seconds * 2)  # TTL for cleanup
    return count <= limit
Fixed window problem: boundary spikes. A client can make the full limit of requests at 00:59 and the full limit again at 01:00, receiving 2× the limit within about two seconds. Sliding window approaches eliminate this but cost more memory or bookkeeping.
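The boundary spike is easy to reproduce with an in-memory version of the counter (illustrative only; an injectable clock stands in for time.time):

```python
import time

class FixedWindowCounter:
    """Minimal fixed-window limiter to demonstrate the boundary-spike flaw."""

    def __init__(self, limit: int, window_s: int, clock=time.time):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.counts = {}  # window number -> request count

    def allow(self) -> bool:
        window = int(self.clock()) // self.window_s
        self.counts[window] = self.counts.get(window, 0) + 1
        return self.counts[window] <= self.limit
```

A client that fills window 0 at t=59s and window 1 at t=60s gets 200 requests through a 100/minute limit in about one second.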
Algorithm 4: Sliding Window Counter (Hybrid)
A practical compromise: use two consecutive fixed window counters and interpolate. For a request at time T: count = (previous_window_count × overlap_ratio) + current_window_count. This smooths the boundary spike without storing per-request timestamps.
def sliding_window_counter(user_id: str, limit: int, window_s: int = 60) -> bool:
    now = time.time()
    current_window = int(now) // window_s
    prev_window = current_window - 1
    elapsed_in_window = now % window_s
    prev_key = f"rl:{user_id}:{prev_window}"
    curr_key = f"rl:{user_id}:{current_window}"
    prev_count = int(r.get(prev_key) or 0)
    curr_count = int(r.get(curr_key) or 0)
    # Weight: how much of the previous window overlaps with our sliding window
    weight = 1.0 - (elapsed_in_window / window_s)
    estimated = prev_count * weight + curr_count
    if estimated < limit:
        r.incr(curr_key)
        r.expire(curr_key, window_s * 2)
        return True
    return False
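A worked instance of the interpolation, assuming a limit of 100/minute measured 30 seconds into the current window:

```python
# Worked example of the hybrid estimate (numbers are illustrative).
prev_count = 80        # requests in the previous full minute
curr_count = 30        # requests so far this minute
elapsed_in_window = 30.0
window_s = 60

# Half the previous window still overlaps our sliding window
weight = 1.0 - elapsed_in_window / window_s   # 0.5
estimated = prev_count * weight + curr_count  # 80 * 0.5 + 30 = 70
```

Since 70 < 100, the request is admitted; the estimate assumes the previous window's traffic was evenly distributed, which is where the slight imprecision comes from.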
Distributed Rate Limiting Architecture
For a multi-region API, rate limiting must be globally consistent — a user should not bypass limits by routing to different regions. Options:
- Centralized Redis: all regions query a single Redis cluster. Simple and accurate but adds cross-region latency (50-150ms roundtrip). Inappropriate for latency-sensitive APIs.
- Local + eventual sync: each region enforces a fraction of the global limit (e.g., 3 regions → each allows 40% of global limit). Fast (local Redis query in 1-2ms) but allows brief over-limit in edge cases. Cloudflare uses this approach.
- Regional rate limits with a global coordinator: enforce per-region limits locally; a background process syncs usage to a global store for reporting and global limit enforcement. Fast for most cases; global enforcement lags by ~1 second.
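The "local + eventual sync" arithmetic can be made concrete with a tiny helper (the function name and the 1.2× over-allocation factor are illustrative assumptions, not a standard):

```python
def per_region_share(global_limit: int, regions: int, overallocation: float = 1.2) -> int:
    """Split a global limit across regions with headroom for traffic imbalance.

    With 3 regions and 1.2x over-allocation, each region enforces 40% of the
    global limit locally; in the worst case the fleet briefly admits 120% of
    the global limit until the next sync corrects the counters.
    """
    return round(global_limit * overallocation / regions)
```

Tuning the over-allocation factor trades strictness (1.0 means no region can absorb a traffic skew) against worst-case overshoot.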
Rate Limit Headers
APIs should communicate rate limit status to clients via HTTP headers so clients can back off gracefully:
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 347
X-RateLimit-Reset: 1735689600 # Unix timestamp when bucket resets/refills
Retry-After: 30 # seconds to wait (on 429)

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
Retry-After: 30
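A handler might assemble these headers like so (a sketch; the helper name is mine, and the X-RateLimit-* family is a de facto convention rather than a formal standard):

```python
def rate_limit_headers(limit, remaining, reset_ts, retry_after=None):
    """Build conventional rate-limit response headers.

    reset_ts: Unix timestamp when the bucket resets/refills.
    retry_after: seconds to wait; only included on 429 responses.
    """
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_ts),
    }
    if retry_after is not None:
        headers["Retry-After"] = str(retry_after)
    return headers
```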
Rate Limit Tiers
Production APIs use multi-dimensional rate limiting:
- Per API key: primary enforcement — each customer has a plan-based limit (free: 100/hour, pro: 10,000/hour)
- Per IP: secondary DDoS protection — even with valid API keys, a single IP is limited (prevents credential stuffing)
- Per endpoint: expensive endpoints (ML inference, bulk exports) get tighter limits regardless of plan
- Global: circuit breaker for the entire service — if total RPS exceeds server capacity, shed load gracefully
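A gateway can compose these dimensions by requiring every limiter to admit the request (a sketch; note that checking sequentially consumes budget on earlier dimensions even when a later one rejects, which is why some production systems reserve-then-commit instead):

```python
def check_all(limiters: dict) -> bool:
    """Admit a request only if every dimension allows it.

    `limiters` maps a dimension name (api_key, ip, endpoint, global) to a
    zero-argument callable returning True (allowed) or False (rejected).
    all() short-circuits on the first rejection.
    """
    return all(check() for check in limiters.values())
```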
Nginx and Kong Rate Limiting
At the infrastructure level, Nginx and Kong API Gateway handle rate limiting before requests reach application code:
# Nginx rate limiting configuration
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
limit_req_zone $http_x_api_key zone=api_key:10m rate=100r/s;

server {
    location /api/ {
        limit_req zone=api burst=20 nodelay;
        limit_req zone=api_key burst=200 nodelay;
        limit_req_status 429;
        proxy_pass http://backend;
    }
}
Nginx uses a leaky bucket implementation internally. The burst parameter allows that many requests to queue; nodelay serves queued requests immediately rather than spacing them out.
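For intuition, leaky-bucket metering in the style of limit_req can be sketched in Python (this mirrors the semantics only, not nginx's actual implementation):

```python
import time

class LeakyBucket:
    """Leaky-bucket meter: requests drain at a fixed rate; `burst` may queue.

    With nodelay-style semantics, requests within the burst allowance are
    admitted immediately, and queue slots free up as the bucket drains.
    """

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate        # requests per second that drain the bucket
        self.burst = burst      # extra requests allowed to queue
        self.clock = clock
        self.level = 0.0        # outstanding requests currently in the bucket
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Drain the bucket for the elapsed time
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level < self.burst + 1:  # 1 in-flight plus `burst` queued
            self.level += 1.0
            return True
        return False
```

Unlike token bucket, an idle client accumulates no credit: the cap is always one in-flight request plus the burst allowance, and capacity returns only as the bucket drains.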
Interview Questions
- Design a rate limiter for a public API with 1M users (token bucket vs sliding window trade-offs)
- How do you handle rate limiting in a multi-region active-active deployment?
- A client is being throttled but they claim their requests are below the limit — how do you debug?
- Design rate limiting for a payments API where fairness and accuracy are critical (no false throttling)
- How do you prevent thundering herd when rate limit windows reset simultaneously for many clients?
Frequently Asked Questions
What is the difference between token bucket and sliding window rate limiting?
Token bucket and sliding window are the two most common rate limiting algorithms, and they differ in how they handle bursting and precision.

Token bucket: each client has a bucket with capacity C tokens. Tokens refill at rate R per second. Each request consumes 1 token. A client can burst up to C requests instantly (using accumulated tokens), then sustains R requests per second. This is forgiving — if a client is idle, it builds up tokens and can burst. Appropriate for APIs where occasional spikes are acceptable. Token bucket is O(1) per request in Redis (atomic Lua script with HMGET/HMSET).

Sliding window log: tracks the timestamp of every request in a sorted set. On each request, remove timestamps older than the window, count remaining, reject if at limit. No bursting — the limit is the exact count in any rolling window ending at now. Appropriate when strict fairness is required (payments, compliance-sensitive APIs). Cost: O(log N) per request for ZADD/ZREMRANGEBYSCORE, plus memory proportional to requests in the window.

Sliding window counter (hybrid): maintains two fixed-window counters and interpolates using time elapsed in the current window. Approximates the sliding window at O(1) cost with slightly less precision.

In practice, token bucket is used for most API rate limiting (GitHub, Stripe, Twilio) because bursting is acceptable and the implementation is efficient.
How do you implement distributed rate limiting across multiple API servers?
Distributed rate limiting is hard because enforcing a global limit requires coordination between servers. Three approaches:

(1) Centralized Redis: all servers query a single Redis (or Redis Cluster) instance on every request. A Lua script atomically increments the counter and checks the limit. Accurate to within 1 request but adds a Redis network roundtrip (1-5ms in the same datacenter, 50-150ms cross-region). For intra-datacenter APIs this is usually acceptable. For ultra-low-latency APIs (sub-millisecond response requirement), this adds unacceptable overhead.

(2) Local approximation with sync: each server maintains a local counter and periodically (every second) syncs to a shared Redis store. The server enforces a fraction of the global limit locally (e.g., 3 servers → each allows 40% to handle some imbalance). Fast (local memory check in microseconds) but allows brief over-limit: within a 1-second sync interval, all 3 servers could each allow 40% = 120% of the limit. Cloudflare uses this approach.

(3) Token bucket with Redis: use the Lua-based token bucket in Redis — a single atomic script handles the token math. Redis processes ~100K Lua scripts/second per node, which is sufficient for most APIs. Shard by user_id to distribute load: the user_id hash determines which Redis node handles that user's rate limiting. This is the recommended approach for most production systems.
How do you handle rate limiting for different user tiers without a per-request Redis call?
For SaaS products with multiple pricing tiers (free, pro, enterprise), rate limits vary by user. Making a Redis call per request to look up the user's tier and check their counter adds latency. Optimization strategies:

(1) JWT-embedded limits: include the user's rate limit tier in the JWT token payload (e.g., {"sub": "user123", "plan": "pro", "rpm": 10000}). The API gateway validates the JWT and reads the limit without a database or cache lookup. The token is refreshed when the user upgrades/downgrades (token rotation on plan change). Rate limiting counters are still in Redis, but the limit value comes from the JWT, not a lookup.

(2) API key prefix encoding: embed the tier in the API key itself (sk_pro_xxxx vs sk_free_xxxx). The gateway extracts the prefix to determine the limit. No lookup needed.

(3) Cache user tier with long TTL: cache the user's plan tier for 60 seconds. Plan changes take up to 60 seconds to propagate — acceptable for most use cases. The rate limit counter check is still per-request, but the limit value retrieval is cached.

In practice, the rate limit counter itself (the Redis INCR/Lua call) is the only per-request operation needed. Tier lookup is done at authentication time (once per request anyway for JWT validation) or cached.