Question 1

What is the sliding window log rate limiting algorithm?

Accepted Answer

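Store each request timestamp in a Redis sorted set (key = ratelimit:{user_id}, score = timestamp, member = a unique request ID so that two requests with the same timestamp do not collide). On each request: (1) ZREMRANGEBYSCORE to remove entries older than the window (current_time - window_seconds). (2) ZCARD to count the remaining entries. (3) If count >= limit, reject. (4) Otherwise ZADD the current timestamp and refresh the key's TTL with EXPIRE so idle users do not hold memory. Recording only accepted requests keeps memory at O(limit) per user; if you ZADD before counting, rejected requests are logged too, memory grows with request volume, and a client that keeps retrying stays blocked. Advantages: exact over the rolling window, no boundary burst problem. Disadvantages: O(limit) memory per user (each request stored), and multiple Redis commands per check (wrap them in a Lua script for atomicity). Use when accuracy is critical (financial APIs, security-sensitive endpoints). For high-traffic consumer APIs, the sliding window counter approximation is preferred.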
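The steps can be sketched without Redis to make the logic easier to follow. This is a minimal in-memory illustration, not a production implementation: the class name SlidingWindowLog and the per-user deque are assumptions of the example, standing in for the sorted set, and the sketch checks the count before recording so that rejected requests do not occupy the log.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLog:
    """In-memory stand-in for the Redis sorted-set approach (one log per user)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.logs = defaultdict(deque)  # user_id -> accepted timestamps, oldest first

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        log = self.logs[user_id]
        # Equivalent of ZREMRANGEBYSCORE: drop entries that have left the window.
        while log and log[0] <= now - self.window:
            log.popleft()
        # Equivalent of ZCARD: reject when the window is already full.
        if len(log) >= self.limit:
            return False
        # Equivalent of ZADD: record only accepted requests.
        log.append(now)
        return True
```

In a single process the three steps run without interleaving; against Redis they would need to run inside one Lua script (or a transaction) to get the same guarantee.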

Question 2

What is the fixed window boundary burst problem and how does sliding window counter solve it?

Accepted Answer

Fixed window problem: if the limit is 100 req/minute and windows reset at :00, a user can make 100 requests at :59 and 100 more at :01 of the next minute, which is 200 requests in about 2 seconds, double the intended rate. Sliding window counter approximation: estimate the request count in the rolling window from two adjacent fixed window counters. Formula: count = current_window_count + previous_window_count * (1 - elapsed_fraction_of_current_window). Example with a 60s window, 45s into the current window: elapsed_fraction = 45/60 = 0.75, so count = current_count + previous_count * 0.25. The estimate assumes requests in the previous window were evenly spread, so it is approximate rather than exact, but the boundary burst shrinks from 2x to a small overshoot (empirically <0.003% of requests misclassified). Uses only 2 counters per user regardless of request rate.
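A small in-memory sketch of the weighted estimate may help. The class name SlidingWindowCounter and the dictionary keyed by (user_id, window_index) are assumptions of this example; in Redis each entry would be one counter key with a TTL of two windows.

```python
import time


class SlidingWindowCounter:
    """Approximate rolling-window limiter built from two fixed-window counters."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        # (user_id, window_index) -> accepted requests.
        # Old entries are never evicted here; a real store would expire them.
        self.counts = {}

    def estimate(self, user_id, now):
        index = int(now // self.window)
        elapsed_fraction = (now % self.window) / self.window
        current = self.counts.get((user_id, index), 0)
        previous = self.counts.get((user_id, index - 1), 0)
        # Weight the previous window by the share of it still inside the rolling window.
        return current + previous * (1 - elapsed_fraction)

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        if self.estimate(user_id, now) >= self.limit:
            return False
        key = (user_id, int(now // self.window))
        self.counts[key] = self.counts.get(key, 0) + 1
        return True
```

With a limit of 100 per 60s, a user who sends 100 requests at second 59 gets only 2 more accepted at second 61, because the previous window still carries a weight of 59/60.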

Question 3

How does the token bucket algorithm work?

Accepted Answer

A token bucket has a maximum capacity of burst_size tokens. Tokens are added at a constant rate (rate tokens/second) until the bucket is full; tokens that would overflow are discarded. Each request consumes one token. If no token is available, the request is rejected (or queued). This bounds the long-run average at rate requests/second while allowing controlled bursts of up to burst_size. Implementation with Redis: store (tokens, last_refill_time) per user. On each request: compute the elapsed time since last_refill_time, add elapsed * rate tokens (capped at burst_size), subtract 1 if at least one token is available (otherwise reject), and store both values back. Use a Lua script for atomicity. AWS API Gateway and many other cloud rate limiters use token bucket because it enforces a steady average rate while tolerating short bursts (e.g., a mobile app firing a burst of concurrent requests on launch).
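The refill-then-consume step can be sketched as a single-process class. This is an illustration under assumptions: the class name TokenBucket, the cost parameter, and the choice to start with a full bucket are not from the text above, and a Redis version would keep tokens and last_refill in a hash updated by one Lua script.

```python
import time


class TokenBucket:
    """Single-process token bucket with lazy refill on each request."""

    def __init__(self, rate, burst_size, now=None):
        self.rate = rate                  # tokens added per second
        self.burst_size = burst_size      # maximum tokens the bucket can hold
        self.tokens = float(burst_size)   # assumption: the bucket starts full
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, now=None, cost=1):
        now = time.monotonic() if now is None else now
        # Refill for the time elapsed since the last call, capped at burst_size.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.burst_size, self.tokens + elapsed * self.rate)
        self.last_refill = now
        # Consume only if enough tokens are available; otherwise reject.
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True
```

Refilling lazily on each request avoids a background timer: the stored pair (tokens, last_refill) is enough to reconstruct the bucket state at any later time.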

Question 4

How do you implement distributed rate limiting across multiple servers?

Accepted Answer

Centralized Redis store: all app servers send rate limit checks to a shared Redis cluster. Redis is usually fast enough for this (typically sub-millisecond within the same datacenter, though actual p99 depends on network and load). Shard by user_id so that all counters for a user live on the same Redis shard, avoiding cross-shard coordination; this can be client-side consistent hashing, or Redis Cluster's built-in hash slots. Use Lua scripts for atomic check-and-update to prevent race conditions between the read and the write. Alternative: approximate local rate limiting per app server, where each server enforces a share of the limit and you accept slight over-allowing at startup or during failover. For multi-region: either use a global Redis (adds cross-region latency to every check) or per-region limits (allow N requests per region per window, so the worst-case total is ~N * regions: imprecise but low latency). Most systems choose per-region limits for latency reasons.
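To illustrate the routing step, here is a minimal client-side consistent hash ring that maps a user_id to a shard. The shard names, the HashRing class, and the virtual-node count are hypothetical choices for this sketch; a deployment on Redis Cluster would rely on its own slot mapping instead.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring: each shard owns many points on a 64-bit ring."""

    def __init__(self, shards, vnodes=100):
        # Place vnodes points per shard on the ring, sorted by hash value.
        self._ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        # Stable across processes, unlike Python's built-in hash().
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def shard_for(self, user_id):
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._points, self._hash(str(user_id)))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["redis-a", "redis-b", "redis-c"])  # hypothetical shard names
shard = ring.shard_for("user:12345")  # every app server computes the same shard
```

Because every app server derives the same shard from the same user_id, the Lua script for that user always runs on one node, and adding a shard only remaps the keys that land on the new shard's points.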

Question 5

What rate limit response headers should an API return?

Accepted Answer

De facto standard headers (widely used, but not defined by an RFC, so exact semantics vary between APIs): X-RateLimit-Limit (max requests in the window), X-RateLimit-Remaining (requests left in the current window), X-RateLimit-Reset (Unix timestamp when the window resets or tokens refill; some APIs send seconds remaining instead, so document which one you use). On 429 Too Many Requests: also include Retry-After, a standard HTTP header that carries either the number of seconds to wait or an HTTP date. These headers allow clients to throttle proactively instead of running into 429s. Some APIs add X-RateLimit-Policy to describe the algorithm. The Retry-After header is critical: without it, clients may implement aggressive retry loops that amplify traffic. With it, well-behaved clients back off correctly.
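As a framework-neutral sketch, the headers can be assembled from the limiter's decision like this. The function name build_response and its parameters are assumptions of the example; it treats X-RateLimit-Reset as a Unix timestamp and sends Retry-After in seconds.

```python
import math


def build_response(allowed, limit, remaining, reset_epoch, now):
    """Return (status_code, headers) for a rate-limited endpoint."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),  # Unix timestamp of the reset
    }
    if allowed:
        return 200, headers
    # On rejection, tell the client how long to wait (whole seconds, at least 1).
    headers["Retry-After"] = str(max(1, math.ceil(reset_epoch - now)))
    return 429, headers
```

Rounding Retry-After up rather than down avoids inviting a retry a fraction of a second before the window actually resets.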

Tier	Key	Typical limit	Purpose
Per-IP	ip:1.2.3.4	1000/hour	Block scrapers, DDoS
Per-User	user:12345	500/hour	Fair usage per account
Per-API-key	key:abc123	10000/hour	Tiered pricing plans
Per-Endpoint	user:12345:/search	100/min	Protect expensive ops
Global	global:api	1M/min	System-wide protection
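One way to combine the tiers is to check every applicable key and reject as soon as any tier is exhausted. The sketch below is illustrative only: it uses simple in-memory fixed-window counters for brevity, covers three of the tiers from the table, and the names TIERS and MultiTierLimiter are hypothetical.

```python
from collections import defaultdict

# (key template, limit, window in seconds), taken from the table above.
TIERS = [
    ("ip:{ip}", 1000, 3600),
    ("user:{user}", 500, 3600),
    ("user:{user}:{endpoint}", 100, 60),
]


class MultiTierLimiter:
    """Allows a request only if every tier still has capacity."""

    def __init__(self, tiers=TIERS):
        self.tiers = tiers
        self.counts = defaultdict(int)  # (key, window_index) -> requests

    def check(self, now, **ctx):
        slots = []
        # First pass: verify all tiers without consuming anything.
        for template, limit, window in self.tiers:
            slot = (template.format(**ctx), int(now // window))
            if self.counts[slot] >= limit:
                return False, slot[0]  # report which key was exhausted
            slots.append(slot)
        # Second pass: consume from every tier only once all of them passed.
        for slot in slots:
            self.counts[slot] += 1
        return True, None
```

Checking all tiers before incrementing any of them means a request rejected by the per-endpoint tier does not also eat into the per-IP or per-user budget.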

Algorithm	Memory	Burst handling	Accuracy	Best for
Fixed window	O(1)	2x at boundary	Approximate	Simple internal APIs
Sliding log	O(requests)	No boundary burst	Exact	Low-traffic, strict limits
Sliding counter	O(1)	Smoothed	~99.997%	High-traffic production
Token bucket	O(1)	Controlled bursts OK	Exact	APIs with burst tolerance
Leaky bucket	O(queue size)	Queued	Exact	Traffic shaping

Rate Limiter System Low-Level Design
