Question 1

What is the token bucket algorithm for rate limiting?

Accepted Answer

Token bucket is one of the most widely used rate limiting algorithms. A bucket holds up to B tokens (the burst capacity), and tokens are added at a constant rate of R per second until the bucket is full. Each request consumes one token: if a token is available, the request is allowed and the token is removed; if the bucket is empty, the request is rejected, typically with HTTP 429 (Too Many Requests).

The capacity B is what allows bursts: a client with a full bucket can make B requests at once, then must wait for tokens to refill at rate R. Example: B=100, R=10/sec. A client can burst 100 requests, then sustain 10/sec. If the client stops sending after the burst, the empty bucket refills completely in B/R = 10 seconds.

Implementation in Redis: use a Lua script for atomicity. Store {last_refill_time, tokens} per client. On each request: compute the time elapsed since last_refill_time, set tokens = min(B, tokens + elapsed * R), update last_refill_time, and deduct one token if at least one is available. Redis runs the script atomically, which prevents race conditions between concurrent requests for the same client.
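The refill-and-consume steps can be sketched in a few lines. This is a minimal in-memory, single-process sketch, not the Redis/Lua version; the class name TokenBucket and the injectable clock parameter are assumptions made for illustration and testing.

```python
import time


class TokenBucket:
    """In-memory token bucket: capacity B tokens, refilled at R tokens/second."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity          # B: maximum burst size
        self.refill_rate = refill_rate    # R: tokens added per second
        self.clock = clock                # hypothetical hook so tests can control time
        self.tokens = float(capacity)     # start with a full bucket
        self.last_refill = clock()

    def allow(self):
        """Consume one token and return True if available, else return False."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill for the elapsed time, never exceeding the capacity B.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With B=100 and R=10, a fresh bucket admits 100 back-to-back calls to allow() and then roughly 10 per second. A shared deployment would keep the same two fields (tokens, last_refill) per client in Redis and run the same arithmetic inside a Lua script.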

Question 2

How do you implement distributed rate limiting across multiple servers?

Accepted Answer

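In a distributed system with multiple API servers, rate limiting state must be shared. If each server keeps its own counters, a client whose requests are spread across N servers can get up to N times the intended limit. Solution: store rate limiting state in a central store such as Redis, and have each API server check it before processing a request.

Token bucket in Redis: a Lua script checks tokens, refills based on elapsed time, deducts one, and returns allow/deny. Redis executes a Lua script atomically, so concurrent checks for the same key cannot interleave.

Sliding window in Redis: use one sorted set per client, with request timestamps as scores and a unique member per request (ZREMRANGEBYSCORE to remove entries older than the window, ZCARD to count, ZADD to record the new request). Run these commands in a Lua script or a MULTI/EXEC transaction so they execute as one unit.

Performance: a single Redis instance can typically handle 100K+ operations/sec. A rate limit check adds roughly 0.5-1ms of latency per request when Redis is in the same data center. For ultra-low-latency requirements, use a local in-memory rate limiter that periodically syncs with Redis (accepts slight inaccuracy for lower latency). Sharding: for very high throughput, shard rate limit keys across Redis Cluster nodes; if each client's state lives under a single key, a check still touches only one node.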
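The sliding window steps (remove old entries, count, record) can be illustrated without a Redis server. The sketch below is an in-memory stand-in for the sorted set, assuming one sorted list of timestamps per client; the class name SlidingWindowLog and the clock parameter are hypothetical. In Redis the three steps would need to run atomically, for example inside a Lua script.

```python
import bisect
import time
from collections import defaultdict


class SlidingWindowLog:
    """Allow at most `limit` requests per client in any trailing window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock                 # hypothetical hook so tests can control time
        self.log = defaultdict(list)       # client id -> sorted request timestamps

    def allow(self, client_id):
        now = self.clock()
        timestamps = self.log[client_id]
        # Like ZREMRANGEBYSCORE: drop entries at or before (now - window).
        cutoff = bisect.bisect_right(timestamps, now - self.window)
        del timestamps[:cutoff]
        # Like ZCARD: count the requests still inside the window.
        if len(timestamps) >= self.limit:
            return False
        # Like ZADD: record this request (appending keeps the list sorted).
        timestamps.append(now)
        return True
```

One design choice here is that rejected requests are not recorded, so a client that keeps retrying while limited does not extend its own lockout. Memory grows with the limit, since every allowed request in the window is stored.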

Question 3

What rate limiting algorithm should you choose?

Accepted Answer

Token bucket: allows bursts up to the bucket capacity while limiting the sustained rate. Best for APIs where burst traffic is acceptable (web APIs, most use cases). Used by AWS, Stripe, and many API gateways.

Leaky bucket: produces a smooth output rate with no bursts. A queue absorbs incoming bursts and is drained at a fixed rate; requests that arrive when the queue is full are dropped. Best for systems requiring uniform processing (network traffic shaping).

Fixed window counter: counts requests per fixed window (minute, hour). Simple, but allows up to double the limit around a window boundary: with a limit of 100 per minute, 100 requests at 11:59:59 plus 100 at 12:00:00 means 200 requests in about one second.

Sliding window counter: estimates the count in the trailing window as a weighted sum of the previous window's count and the current window's count. Good accuracy with low memory (two counters per client).

Default recommendation: token bucket for most API rate limiting. It is intuitive (clients understand burst + sustained rate), efficient (O(1) per check), and widely adopted.
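The sliding window counter estimate is easiest to see with numbers. The sketch below assumes the common weighting in which the previous window counts in proportion to how much of it still overlaps the trailing window; the function names and the elapsed_fraction parameter are illustrative, not taken from any library.

```python
def sliding_window_estimate(prev_count, curr_count, elapsed_fraction):
    """Estimate the number of requests in the trailing window.

    elapsed_fraction is how far we are into the current fixed window
    (0.0 at its start, 1.0 at its end). The previous window's count is
    weighted by the share of it that the trailing window still covers.
    """
    return prev_count * (1.0 - elapsed_fraction) + curr_count


def allow(prev_count, curr_count, elapsed_fraction, limit):
    """Allow the request only if the estimate is still below the limit."""
    return sliding_window_estimate(prev_count, curr_count, elapsed_fraction) < limit


# 15 seconds into a 60-second window: 100 requests last window, 20 so far.
print(sliding_window_estimate(100, 20, 0.25))  # 100 * 0.75 + 20 = 95.0
```

Under this estimate, a client that used its full 100 requests at the end of the previous minute is limited to a handful at the start of the next one, which closes most of the boundary gap left by a fixed window counter.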

Question 4

Where should a rate limiter be placed in the architecture?

Accepted Answer

Three placement options:

(1) API gateway: rate limiting at the entry point (Kong, AWS API Gateway, Envoy). Every request passes through it. Pros: centralized, no backend changes. Cons: the gateway becomes a potential bottleneck.

(2) Middleware: rate limiting logic within each service. Pros: per-service customization, no single point of failure (SPOF). Cons: duplicated logic across services.

(3) Service mesh (Istio/Envoy): rate limiting at the sidecar proxy. Transparent to applications and configured declaratively.

Rate limit by multiple dimensions: per API key (tier-based quotas), per IP (unauthenticated endpoints), per user (fair usage), and per endpoint (stricter on login/payment).

Response: return 429 (Too Many Requests) with the headers X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After so clients can implement proper backoff. Retry-After is a standard HTTP header; the X-RateLimit-* names are a widely used convention rather than a standard, so document what each value means for your API.
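A small helper can show how the status code and headers fit together. This is a framework-independent sketch; the function name build_response is hypothetical, and it assumes X-RateLimit-Reset carries a Unix timestamp (some APIs send seconds until reset instead).

```python
import math


def build_response(allowed, limit, remaining, reset_epoch, now_epoch):
    """Return (status, headers) for a rate-limited endpoint.

    reset_epoch: Unix time at which the quota resets (assumed convention).
    now_epoch:   current Unix time, passed in to keep the function pure.
    """
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    if allowed:
        return 200, headers
    # Retry-After in whole seconds, rounded up so clients do not retry early.
    headers["Retry-After"] = str(max(0, math.ceil(reset_epoch - now_epoch)))
    return 429, headers
```

A client receiving the 429 can sleep for Retry-After seconds before retrying, ideally with some added jitter so that many limited clients do not all return at the same instant.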

System Design: Rate Limiter — Token Bucket, Sliding Window, Leaky Bucket, Distributed Rate Limiting, API Gateway
