Rate limiting protects APIs from abuse, prevents resource exhaustion, and ensures fair usage across clients. Designing a rate limiter is a classic system design interview question that tests your knowledge of algorithms, distributed systems, and system architecture. This guide covers the major rate limiting algorithms, distributed rate limiting challenges, and production implementation patterns.
Rate Limiting Algorithms
Five algorithms, each with different characteristics:
(1) Token Bucket — a bucket holds up to B tokens. Tokens are added at rate R per second. Each request consumes one token. If the bucket is empty, the request is rejected (HTTP 429). The capacity B allows bursts (up to B requests instantly), while the refill rate R limits the sustained rate. The most popular algorithm — used by AWS, Stripe, and most API gateways.
(2) Leaky Bucket — requests enter a queue (the bucket). The queue is processed at a fixed rate R. If the queue is full (capacity B), new requests are rejected. Unlike the token bucket, the leaky bucket produces a perfectly smooth output rate — no bursts. Use it when you need a strictly uniform processing rate.
(3) Fixed Window Counter — count requests per time window (e.g., per minute). If the count exceeds the limit, reject. Simple, but it has a boundary problem: 99 requests at 11:59:30 plus 100 requests at 12:00:30 is 199 requests within 60 seconds, exceeding the 100/minute limit.
(4) Sliding Window Log — store the timestamp of each request and count requests in the last N seconds. Accurate but memory-intensive (stores every timestamp).
(5) Sliding Window Counter — a hybrid of fixed and sliding windows. Estimate the count as previous_window_count * overlap_percentage + current_window_count. Low memory, good accuracy.
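The token bucket logic fits in a few lines. Here is a minimal in-memory, single-process sketch (the class and parameter names are illustrative, not from any particular library); the caller injects the clock, which keeps the sketch deterministic and testable:

```python
class TokenBucket:
    """Token bucket: capacity B permits bursts; refill rate R caps the sustained rate."""

    def __init__(self, capacity: float, rate: float, now: float = 0.0):
        self.capacity = capacity   # B: maximum tokens (burst size)
        self.rate = rate           # R: tokens added per second
        self.tokens = capacity     # start full, so an initial burst is allowed
        self.last_refill = now     # caller supplies a monotonic clock reading

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # out of tokens: respond with HTTP 429
```

With B=3 and R=1/sec, a client can burst three requests at once, then sustain one per second. A production limiter would also need one bucket per client key and protection against concurrent access.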
Distributed Rate Limiting
In a distributed system with multiple API servers, rate limiting must be centralized. If each server maintains its own counter, a client sending requests to different servers bypasses the limit. Centralized approach: store rate limiting state in Redis.
Token bucket in Redis: use a Lua script for atomicity. The script checks the current token count, refills based on elapsed time, deducts one token if available, and returns allow/deny. Since Redis executes Lua scripts atomically, there are no race conditions.
Sliding window in Redis: use a sorted set. ZADD with the current timestamp, ZREMRANGEBYSCORE to remove old entries, ZCARD to count — all in a single pipeline for efficiency.
Race conditions: without atomic operations, two concurrent requests might both read “99 requests” (limit 100), both add one, and the count becomes 101. The Lua script approach eliminates this.
Performance: Redis handles 100,000+ operations/sec on a single instance. For higher throughput, shard rate limiting keys across Redis Cluster nodes (each client key hashes to one node).
Latency: a Redis round-trip adds 0.5-1ms to each API request. For ultra-low-latency requirements, use a local in-memory rate limiter with periodic sync to Redis (accepting slight inaccuracy).
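The sorted-set recipe is easiest to see in code. Since no live Redis is assumed here, this sketch mirrors the same three-step sequence (ZREMRANGEBYSCORE, ZCARD, ZADD) against an in-process sorted list of timestamps; the class name and the limit/window parameters are illustrative:

```python
import bisect

class SlidingWindowLog:
    """In-memory stand-in for the Redis sorted-set recipe. With real Redis, the
    same three commands per check would be issued in a single pipeline (or a Lua
    script, to make the check-then-add atomic across servers)."""

    def __init__(self, limit: int, window: float):
        self.limit = limit            # max requests per window
        self.window = window          # window length in seconds
        self.log: list[float] = []    # sorted request timestamps (the "sorted set")

    def allow(self, now: float) -> bool:
        # ZREMRANGEBYSCORE: drop timestamps older than the window.
        cutoff = now - self.window
        self.log = self.log[bisect.bisect_right(self.log, cutoff):]
        # ZCARD: count what remains; reject if the limit is reached.
        if len(self.log) >= self.limit:
            return False
        # ZADD: record this request's timestamp.
        self.log.append(now)
        return True
```

The memory cost is visible here: one entry per request in the window, which is why the sliding window counter is preferred when approximate counts are acceptable.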
Rate Limiter Architecture
Where to place the rate limiter:
(1) API Gateway — the rate limiter runs at the API gateway level (Kong, AWS API Gateway, Envoy). Every request passes through the gateway before reaching backend services. Pros: centralized, no changes to backend services. Cons: the gateway becomes a bottleneck.
(2) Middleware — rate limiting logic in a middleware layer within each service. Pros: per-service customization, no single point of failure. Cons: duplicated logic, harder to enforce global limits.
(3) Service mesh (Istio/Envoy) — rate limiting at the sidecar proxy level, configured declaratively via an Istio EnvoyFilter or a RateLimit service. Pros: transparent to application code, configurable per service. Cons: the operational complexity of running a mesh.
Rate limiting dimensions:
(1) Per API key — each API consumer has a quota. Different tiers: free (100 req/hour), pro (10,000 req/hour).
(2) Per IP address — for unauthenticated endpoints. Prevents DDoS and brute-force attacks.
(3) Per user — for authenticated endpoints. Each user gets a fair share.
(4) Per endpoint — critical endpoints (login, payment) have stricter limits than read endpoints.
Response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After (on 429). These help clients implement backoff.
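A middleware might assemble these response headers along the following lines (a hedged sketch; the helper name and parameters are illustrative — only the header names come from the convention above):

```python
def rate_limit_headers(limit: int, remaining: int, reset_at: int, now: int) -> dict:
    """Build informational rate-limit headers for an API response.
    reset_at and now are Unix timestamps; reset_at is when the quota refills."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_at),
    }
    if remaining <= 0:
        # Pair the 429 with Retry-After so clients know how long to wait.
        headers["Retry-After"] = str(max(0, reset_at - now))
    return headers
```

Sending these headers on every response, not just on 429s, lets well-behaved clients slow down before they hit the limit.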
Handling Rate-Limited Requests
When a request is rate limited:
(1) Return HTTP 429 Too Many Requests with a Retry-After header indicating when the client can retry.
(2) Include X-RateLimit-Remaining: 0 and X-RateLimit-Reset: {unix_timestamp} so the client knows when the limit resets.
(3) For authenticated users, log the event for abuse monitoring. Frequent rate limiting of a legitimate user may indicate the limit is too low.
(4) For unauthenticated endpoints under attack, consider escalating from rate limiting to IP blocking (e.g., a temporary 15-minute ban after repeated rate limit violations).
Client-side handling: implement exponential backoff with jitter when receiving 429 responses. Start with Retry-After if provided; otherwise use 1s, 2s, 4s, 8s delays. Add random jitter to prevent synchronized retries from multiple clients.
Graceful degradation: instead of hard rejection, consider returning degraded responses for rate-limited requests. A search API might return cached results instead of rejecting entirely. This maintains a better user experience while protecting backend resources.