System Design: Rate Limiter — Token Bucket, Sliding Window, Leaky Bucket, Distributed Rate Limiting, API Gateway

Rate limiting protects APIs from abuse, prevents resource exhaustion, and ensures fair usage across clients. Designing a rate limiter is a classic system design interview question that tests your knowledge of algorithms, distributed systems, and system architecture. This guide covers the major rate limiting algorithms, distributed rate limiting challenges, and production implementation patterns.

Rate Limiting Algorithms

Four algorithms, each with different characteristics: (1) Token Bucket — a bucket holds up to B tokens. Tokens are added at rate R per second. Each request consumes one token. If the bucket is empty, the request is rejected (HTTP 429). The bucket capacity B allows bursts (up to B requests instantly), while the refill rate R limits the sustained rate. Most popular algorithm — used by AWS, Stripe, and most API gateways. (2) Leaky Bucket — requests enter a queue (the bucket). The queue is processed at a fixed rate R. If the queue is full (capacity B), new requests are rejected. Unlike token bucket, leaky bucket produces a perfectly smooth output rate — no bursts. Use when you need a strictly uniform processing rate. (3) Fixed Window Counter — count requests per time window (e.g., per minute). If the count exceeds the limit, reject. Simple but has the boundary problem: 99 requests at 11:59:30 and 100 requests at 12:00:30 = 199 requests in 60 seconds, exceeding the 100/minute limit. (4) Sliding Window Log — store the timestamp of each request. Count requests in the last N seconds. Accurate but memory-intensive (stores every timestamp). (5) Sliding Window Counter — hybrid of fixed window and sliding window. Estimate the count using: previous_window_count * overlap_percentage + current_window_count. Low memory, good accuracy.

Distributed Rate Limiting

In a distributed system with multiple API servers, rate limiting must be centralized. If each server maintains its own counter, a client sending requests to different servers bypasses the limit. Centralized approach: store rate limiting state in Redis. Token bucket in Redis: use a Lua script for atomicity. The script checks the current token count, refills based on elapsed time, deducts one token if available, and returns allow/deny. Since Redis executes Lua scripts atomically, there are no race conditions. Sliding window in Redis: use a sorted set. ZADD with the current timestamp, ZREMRANGEBYSCORE to remove old entries, ZCARD to count. All in a single pipeline for efficiency. Race conditions: without atomic operations, two concurrent requests might both read “99 requests” (limit 100), both add one, and the count becomes 101. The Lua script approach eliminates this. Performance: Redis handles 100,000+ operations/sec on a single instance. For higher throughput, shard rate limiting keys across Redis Cluster nodes (each client key hashes to one node). Latency: a Redis round-trip adds 0.5-1ms to each API request. For ultra-low-latency requirements, use a local in-memory rate limiter with periodic sync to Redis (accepting slight inaccuracy).

Rate Limiter Architecture

Where to place the rate limiter: (1) API Gateway — the rate limiter runs at the API gateway level (Kong, AWS API Gateway, Envoy). Every request passes through the gateway before reaching backend services. Pros: centralized, no changes to backend services. Cons: the gateway becomes a bottleneck. (2) Middleware — rate limiting logic in a middleware layer within each service. Pros: per-service customization, no single point of failure. Cons: duplicated logic, harder to enforce global limits. (3) Service mesh (Istio/Envoy) — rate limiting at the sidecar proxy level. Configured declaratively via Istio EnvoyFilter or RateLimit service. Pros: transparent to application code, configurable per service. Rate limiting dimensions: (1) Per API key — each API consumer has a quota. Different tiers: free (100 req/hour), pro (10,000 req/hour). (2) Per IP address — for unauthenticated endpoints. Prevents DDoS and brute-force attacks. (3) Per user — for authenticated endpoints. Each user gets a fair share. (4) Per endpoint — critical endpoints (login, payment) have stricter limits than read endpoints. Response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After (on 429). These help clients implement backoff.

Handling Rate-Limited Requests

When a request is rate limited: (1) Return HTTP 429 Too Many Requests with a Retry-After header indicating when the client can retry. (2) Include X-RateLimit-Remaining: 0 and X-RateLimit-Reset: {unix_timestamp} so the client knows when the limit resets. (3) For authenticated users, log the event for abuse monitoring. Frequent rate limiting of a legitimate user may indicate the limit is too low. (4) For unauthenticated endpoints under attack, consider escalating from rate limiting to IP blocking (temporary ban for 15 minutes after repeated rate limit violations). Client-side handling: implement exponential backoff with jitter when receiving 429 responses. Start with Retry-After if provided, otherwise use 1s, 2s, 4s, 8s delays. Add random jitter to prevent synchronized retries from multiple clients. Graceful degradation: instead of hard rejection, consider returning degraded responses for rate-limited requests. A search API might return cached results instead of rejecting entirely. This maintains a better user experience while protecting backend resources.

Scroll to Top