# System Design Interview: Design an API Rate Limiter (Deep Dive)
API rate limiting is a fundamental system design topic — nearly every API at scale needs it. This deep dive covers algorithms, distributed implementation, and real-world edge cases. It is commonly tested at Stripe, Cloudflare, LinkedIn, and any company with public APIs.
## Why Rate Limiting?
- Prevent abuse: Stop malicious clients from overwhelming your API
- Ensure fair usage: Prevent one client from consuming all capacity
- Cost control: ML inference APIs cost money per call — rate limits protect revenue
- Availability: Protect downstream services from traffic spikes
- Business tiers: Free tier (1K req/day), Pro tier (100K req/day)
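The business-tier bullet above maps naturally to a per-tier policy table that the limiter consults on each request. A minimal sketch, using the tier numbers from the list (the lookup structure and function name are illustrative):

```python
# Map business tiers to rate-limit policies: (max_requests, window_seconds).
# Tier numbers mirror the bullets above; the structure itself is illustrative.
TIER_LIMITS = {
    "free": (1_000, 86_400),    # 1K requests per day
    "pro":  (100_000, 86_400),  # 100K requests per day
}

def limit_for(tier: str) -> tuple[int, int]:
    """Return (max_requests, window_seconds) for a tier, defaulting to free."""
    return TIER_LIMITS.get(tier, TIER_LIMITS["free"])
```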
## Where to Implement Rate Limiting

```
Client → CDN (global rate limit) → API Gateway (per-route limit) → Microservice
                                         ↓
                              Rate Limiter Service
                                         ↓
                                  Redis Cluster
```

Best placement: the API Gateway (centralized, before application logic runs). At Cloudflare scale: implement at CDN edge nodes.
## Algorithm 1: Token Bucket

```python
import time
import redis

class TokenBucketRateLimiter:
    """
    Token bucket: bucket holds `capacity` tokens, refills at `refill_rate` tokens/second.
    Each request consumes 1 token. Allows bursts up to `capacity`.
    Best for: APIs that want to allow short bursts (bursty traffic).
    Used by: AWS, Stripe, most API gateways.
    """
    def __init__(self, redis_client, capacity: int, refill_rate: float):
        self.redis = redis_client
        self.capacity = capacity        # max tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second

    def allow_request(self, client_id: str) -> bool:
        """Atomic check-and-consume using a Redis Lua script."""
        lua_script = """
        local key = KEYS[1]
        local capacity = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])
        local requested = tonumber(ARGV[4])

        local data = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(data[1]) or capacity
        local last_refill = tonumber(data[2]) or now

        -- Add tokens based on time elapsed
        local elapsed = now - last_refill
        tokens = math.min(capacity, tokens + elapsed * refill_rate)

        if tokens >= requested then
            tokens = tokens - requested
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)
            return 1  -- allowed
        else
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)
            return 0  -- denied
        end
        """
        key = f"rate_limit:token:{client_id}"
        now = time.time()
        result = self.redis.eval(lua_script, 1, key,
                                 self.capacity, self.refill_rate, now, 1)
        return bool(result)
```
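For intuition, the same refill math can be run in a single process without Redis. A sketch with an injectable clock (the class name and clock parameter are illustrative, not part of the Redis-backed design above) that demonstrates burst-then-throttle behavior:

```python
import time

class LocalTokenBucket:
    """Single-process token bucket, same math as the Lua script:
    refill by elapsed * rate, cap at capacity, consume 1 per request."""
    def __init__(self, capacity: int, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.clock = clock
        self.last_refill = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.last_refill
        self.last_refill = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Controllable clock: capacity=5 allows a burst of 5, then rate=1/sec applies.
t = [0.0]
bucket = LocalTokenBucket(capacity=5, refill_rate=1.0, clock=lambda: t[0])
burst = [bucket.allow() for _ in range(6)]  # 5 allowed, 6th denied
t[0] += 2.0                                 # 2 seconds pass -> 2 tokens refill
after = [bucket.allow() for _ in range(3)]  # 2 allowed, 3rd denied
```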
## Algorithm 2: Sliding Window Counter

```python
class SlidingWindowCounterRateLimiter:
    """
    Approximates a sliding window using two fixed windows:
    estimate = current window count + (1 - elapsed_fraction) * previous window count.
    Cheaper than a true sliding window log (no per-request storage), and a close
    approximation for typical traffic patterns.
    Best for: high-throughput APIs where exact precision isn't required.
    """
    def __init__(self, redis_client, limit: int, window_seconds: int):
        self.redis = redis_client
        self.limit = limit
        self.window = window_seconds

    def allow_request(self, client_id: str) -> tuple[bool, dict]:
        now = time.time()
        current_window = int(now // self.window)
        prev_window = current_window - 1
        elapsed_in_window = now % self.window

        curr_key = f"rate_limit:sw:{client_id}:{current_window}"
        prev_key = f"rate_limit:sw:{client_id}:{prev_window}"

        pipe = self.redis.pipeline()
        pipe.incr(curr_key)
        pipe.expire(curr_key, self.window * 2)
        pipe.get(prev_key)
        curr_count, _, prev_count = pipe.execute()

        curr_count = int(curr_count)
        prev_count = int(prev_count) if prev_count else 0

        # Weight the previous window by how much of the sliding window still overlaps it
        weight = 1 - (elapsed_in_window / self.window)
        estimated_count = curr_count + prev_count * weight

        remaining = max(0, self.limit - estimated_count)
        reset_at = (current_window + 1) * self.window
        headers = {
            'X-RateLimit-Limit': str(self.limit),
            'X-RateLimit-Remaining': str(int(remaining)),
            'X-RateLimit-Reset': str(int(reset_at)),
        }
        if estimated_count > self.limit:
            headers['Retry-After'] = str(int(reset_at - now))
            return False, headers
        return True, headers
```
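A quick worked example of the weighting formula, with illustrative numbers:

```python
# estimate = current_count + previous_count * (1 - elapsed_fraction)
limit = 100
window = 60        # seconds
prev_count = 80    # requests in the previous full window
curr_count = 30    # requests so far in the current window
elapsed = 15       # seconds into the current window

weight = 1 - elapsed / window      # 0.75: 75% of the sliding window overlaps prev
estimate = curr_count + prev_count * weight
# estimate = 30 + 80 * 0.75 = 90.0, still under the limit of 100
allowed = estimate <= limit
```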
## Algorithm 3: Leaky Bucket

```python
class LeakyBucketRateLimiter:
    """
    Leaky bucket: requests enter a queue (the bucket) and are processed
    at a constant rate. Unlike token bucket, it smooths traffic — no bursts allowed.
    Best for: network traffic shaping, video streaming (constant bitrate).
    Less common for API rate limiting (no burst tolerance).
    """
    def __init__(self, rate_per_second: int, capacity: int):
        self.rate = rate_per_second    # processing rate (leak rate)
        self.capacity = capacity       # max queue size
        self.queue = []                # pending requests
        self.last_drain = time.time()  # when the bucket last leaked

    def try_add_request(self, request) -> bool:
        # Drain requests processed since the last call
        now = time.time()
        to_process = min(len(self.queue),
                         int((now - self.last_drain) * self.rate))
        self.queue = self.queue[to_process:]
        self.last_drain = now

        if len(self.queue) >= self.capacity:
            return False  # bucket full, drop request
        self.queue.append(request)
        return True
```
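The constant-drain behavior is easiest to see with a controllable clock. A self-contained simulation (class and parameter names are illustrative):

```python
class SimLeakyBucket:
    """Leaky bucket with an injectable clock: requests queue up, and the
    bucket drains at a constant rate regardless of the arrival pattern."""
    def __init__(self, rate_per_second: float, capacity: int, clock):
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock
        self.queue = []
        self.last_drain = clock()

    def try_add(self, request) -> bool:
        now = self.clock()
        drained = min(len(self.queue), int((now - self.last_drain) * self.rate))
        self.queue = self.queue[drained:]
        self.last_drain = now
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

t = [0.0]
bucket = SimLeakyBucket(rate_per_second=2.0, capacity=3, clock=lambda: t[0])
accepted = [bucket.try_add(i) for i in range(5)]  # only 3 fit; no burst beyond capacity
t[0] += 1.0                                       # 1 second passes -> 2 requests drain
late = bucket.try_add("late")                     # room again
```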
## Distributed Rate Limiting

### The Distributed Challenge

With multiple API gateway instances, each keeps a local counter, and the counts race. Example: a client spreads 100 requests across 10 gateways; each gateway counts only 10, so a limit of 50 is bypassed because no single gateway sees it reached. Solutions:

### Option 1: Centralized Redis (Most Common)
```python
class DistributedRateLimiter:
    """
    All gateway instances share a single Redis cluster.
    A Lua script ensures atomicity — no race conditions.
    Pros: accurate, simple.
    Cons: Redis becomes a bottleneck and adds ~1ms latency per request.
    """
    # All implementations above use this approach:
    # Redis operations are O(1) and atomic via Lua scripts.
    pass

# For very high throughput: Redis Cluster with consistent hashing.
# Each client_id maps to a specific Redis shard — no cross-shard operations.
```
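The shard-per-client idea can be sketched with a stable hash. (Redis Cluster itself routes keys by CRC16 mod 16384 hash slots; the function below is a simplified stand-in to show why all of one client's operations land on one shard.)

```python
import hashlib

def shard_for(client_id: str, num_shards: int) -> int:
    """Stable mapping from client_id to a shard index.
    Simplified stand-in for Redis Cluster's CRC16 hash-slot routing."""
    digest = hashlib.sha256(client_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# The same client always lands on the same shard, so its counter
# never needs cross-shard coordination.
a = shard_for("client-42", 4)
b = shard_for("client-42", 4)
```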
### Option 2: Local Counter + Sync

```python
class LocalCounterRateLimiter:
    """
    Each gateway maintains a local counter and periodically syncs with
    Redis to redistribute quota.
    Pros: no Redis round-trip per request (very fast).
    Cons: approximate — counts drift between syncs, and a static split
          (total_limit / num_gateways) mishandles uneven routing.
    Use when: approximate rate limiting is acceptable.
    """
    def __init__(self, total_limit: int, num_gateways: int,
                 sync_interval_seconds: float = 0.1):
        self.local_limit = total_limit // num_gateways
        self.local_count = 0
        self.sync_interval = sync_interval_seconds
        self.last_sync = time.time()

    def allow_request(self, client_id: str) -> bool:
        if time.time() - self.last_sync > self.sync_interval:
            self._sync_with_redis(client_id)
        self.local_count += 1
        return self.local_count <= self.local_limit

    def _sync_with_redis(self, client_id: str):
        """Push the local count to Redis, pull the global count."""
        # Atomic: add local count, get global total, reset local
        self.last_sync = time.time()
```
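The cost of static quota splitting is easy to demonstrate: with the global limit divided evenly, a client whose traffic is routed unevenly gets denied while still under the global limit. A self-contained simulation (the `Gateway` class is illustrative):

```python
class Gateway:
    """One gateway's share of a global quota (total_limit / num_gateways),
    illustrating the approximation cost of local counting between syncs."""
    def __init__(self, local_limit: int):
        self.local_limit = local_limit
        self.count = 0

    def allow(self) -> bool:
        self.count += 1
        return self.count <= self.local_limit

total_limit, n = 90, 3
gateways = [Gateway(total_limit // n) for _ in range(n)]

# Uneven routing: 60 requests all hit gateway 0.
results = [gateways[0].allow() for _ in range(60)]
allowed = sum(results)
# Only 30 are allowed even though the client is under the global limit of 90:
# the price of skipping a Redis round-trip until the next sync.
```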
## Rate Limit by What?

```python
def get_rate_limit_key(request) -> str:
    """Different granularities — pick one based on requirements."""
    # By API key (most common for public APIs)
    return f"api_key:{request.headers['X-API-Key']}"

    # Alternatives (only one return is live; shown for comparison):
    # By user ID (authenticated endpoints):
    #   f"user:{request.user_id}"
    # By IP (unauthenticated endpoints / DDoS protection):
    #   f"ip:{request.remote_addr}"
    # By IP + endpoint (finer grained):
    #   f"ip:{request.remote_addr}:endpoint:{request.path}"
    # By user + endpoint (different limits per endpoint):
    #   f"user:{request.user_id}:endpoint:{request.path}"
```
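Real deployments often check several of these keys per request and admit only when every limit passes, so the most restrictive one wins. A minimal sketch of that composition (the checker callables here are hypothetical stand-ins for real limiter lookups):

```python
def allow(limiters) -> bool:
    """Admit a request only if every independent limit allows it.
    `limiters` is an iterable of zero-arg callables returning bool."""
    return all(check() for check in limiters)

per_ip = lambda: True         # stand-in: IP limit has headroom
per_user = lambda: True       # stand-in: user limit has headroom
per_endpoint = lambda: False  # stand-in: this endpoint's limit is exhausted

ok = allow([per_ip, per_user, per_endpoint])
```

Note that `all()` short-circuits on the first denial; in a production limiter you would decide deliberately whether earlier checks should still consume quota when a later one denies.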
## Response Headers

```python
import redis
from flask import Flask, request, jsonify, make_response

app = Flask(__name__)
redis_client = redis.Redis()  # assumes a local Redis; configure for your deployment
rate_limiter = SlidingWindowCounterRateLimiter(redis_client, limit=100, window_seconds=60)

@app.before_request
def check_rate_limit():
    client_id = request.headers.get('X-API-Key', request.remote_addr)
    allowed, headers = rate_limiter.allow_request(client_id)
    if not allowed:
        response = make_response(jsonify({
            'error': 'rate_limit_exceeded',
            'message': 'Too many requests. See Retry-After header.',
        }), 429)
        for key, value in headers.items():
            response.headers[key] = value
        return response
    # Store headers to add to the success response
    request.rate_limit_headers = headers

@app.after_request
def add_rate_limit_headers(response):
    if hasattr(request, 'rate_limit_headers'):
        for key, value in request.rate_limit_headers.items():
            response.headers[key] = value
    return response
```

The standard headers (RFC 6585 defines the 429 status and its use of Retry-After; the `X-RateLimit-*` names are a widely used de facto convention, with an emerging IETF draft standardizing `RateLimit-*` equivalents):

```
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1697500860   (Unix timestamp)
Retry-After: 45                 (seconds to wait, only on 429)
```
## Algorithm Comparison
| Algorithm | Burst Allowed? | Memory | Accuracy | Best For |
|---|---|---|---|---|
| Token Bucket | Yes (up to capacity) | O(1) per client | Exact | General APIs |
| Leaky Bucket | No | O(queue size) | Exact | Traffic shaping |
| Fixed Window | Yes (2x at boundary) | O(1) | Approximate | Simple cases |
| Sliding Window Log | No | O(requests) | Exact | Strict limits |
| Sliding Window Counter | Minimal | O(1) | ~99.9% | High-scale APIs |
Interview recommendation: Start with Token Bucket (intuitive, allows burst) or Sliding Window Counter (accurate, O(1) memory). Mention that Cloudflare uses Sliding Window Counter in their published architecture.
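The "2x at boundary" weakness of fixed windows in the table above can be shown numerically (the helper function is an illustrative sketch):

```python
def fixed_window_allowed(timestamps, limit, window):
    """Count how many timestamped requests a fixed-window counter admits."""
    counts = {}
    admitted = 0
    for ts in timestamps:
        w = int(ts // window)           # window index resets counts at boundaries
        counts[w] = counts.get(w, 0) + 1
        if counts[w] <= limit:
            admitted += 1
    return admitted

# limit = 100 per 60s window. A client sends 100 requests in the last second
# of one window and 100 more in the first second of the next: all 200 are
# admitted, even though they arrived within about one second.
burst = [59.5] * 100 + [60.5] * 100
admitted = fixed_window_allowed(burst, limit=100, window=60)
```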