Rate Limiting System Low-Level Design (Token Bucket, Leaky Bucket)

Why Rate Limiting?

Rate limiting protects services from abuse, ensures fair usage, and prevents cascading failures from traffic spikes. Applied at: API gateways (per-user or per-IP limits), payment endpoints (fraud prevention), search endpoints (prevent scraping), notification services (prevent spam).

Five Rate Limiting Algorithms

1. Token Bucket

A bucket holds up to capacity tokens. Tokens are added at a fixed rate (e.g., 10/second). Each request consumes 1 token. If the bucket is empty, the request is rejected. Allows bursts up to the capacity. Most common algorithm — used by AWS, Stripe.

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def allow(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

2. Leaky Bucket

Requests enter a queue (the bucket) and are processed at a fixed output rate. Excess requests are dropped if the queue is full. Provides a smooth constant output rate — useful for traffic shaping. Implemented as a FIFO queue with a fixed drain rate.
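A minimal in-process sketch of the leaky bucket, mirroring the TokenBucket above (class and parameter names are illustrative, not from a specific library):

```python
import time
from collections import deque

class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity    # max queued requests
        self.leak_rate = leak_rate  # requests drained per second
        self.queue = deque()
        self.last_leak = time.time()

    def allow(self):
        now = time.time()
        # Drain whole requests that have "leaked" since the last check.
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            # Advance by whole leaks only, keeping the fractional remainder.
            self.last_leak += leaked / self.leak_rate
        if len(self.queue) < self.capacity:
            self.queue.append(now)
            return True
        return False
```

Note the contrast with the token bucket: here a quiet period earns no burst credit; admission depends only on current queue depth, so output stays smooth.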

3. Fixed Window Counter

Count requests per fixed time window (e.g., 100 requests per minute). Simple: INCR counter in Redis with TTL=60s. Problem: boundary burst — 100 requests at 0:59 and 100 at 1:01 = 200 requests in 2 seconds.

import time

# Assumes `redis` is a connected client instance (e.g. redis.Redis()).
def allow_fixed_window(user_id, limit=100, window=60):
    key = f"rate:{user_id}:{int(time.time() // window)}"
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, window)
    return count <= limit

4. Sliding Window Log

Keep a log of timestamps of recent requests per user. On each request: remove timestamps older than 1 minute, count remaining — if < limit, allow and add timestamp. Accurate but memory-intensive (stores every request timestamp). Use Redis sorted set: ZADD with score=timestamp, ZREMRANGEBYSCORE to expire old entries.
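An in-memory sketch of the same logic, with the Redis sorted set replaced by a local list so the mechanics are visible (in production the list would be a per-user sorted set manipulated via ZADD / ZREMRANGEBYSCORE / ZCARD):

```python
import time

class SlidingWindowLog:
    def __init__(self, limit, window):
        self.limit = limit
        self.window = window   # seconds
        self.timestamps = []   # sorted request times (stand-in for the sorted set)

    def allow(self, now=None):
        now = time.time() if now is None else now
        cutoff = now - self.window
        # Equivalent of ZREMRANGEBYSCORE key -inf cutoff
        self.timestamps = [t for t in self.timestamps if t > cutoff]
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)  # equivalent of ZADD key now now
            return True
        return False
```

The memory cost is visible here: one entry per allowed request per user, which is why the counter approximation below is usually preferred at scale.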

5. Sliding Window Counter (Recommended)

Approximate a sliding window using two fixed window counters. For a 60s window: estimated = current_window_count + previous_window_count * (1 - fraction_of_current_window_elapsed).

import time

# Assumes `redis` is a connected client instance. Note: the GET/INCR
# sequence below is not atomic -- under concurrency two servers can both
# pass the check. A Redis Lua script closes this race (see the
# distributed section).
def allow_sliding_window(user_id, limit=100, window=60):
    now = time.time()
    current_window = int(now // window)
    prev_window = current_window - 1
    overlap = now % window / window  # fraction of current window elapsed

    curr_key = f"rate:{user_id}:{current_window}"
    prev_key = f"rate:{user_id}:{prev_window}"

    curr_count = int(redis.get(curr_key) or 0)
    prev_count = int(redis.get(prev_key) or 0)

    # Weighted estimate: prev window contributes proportionally
    estimated = prev_count * (1 - overlap) + curr_count
    if estimated >= limit:
        return False

    redis.incr(curr_key)
    redis.expire(curr_key, window * 2)
    return True

Distributed Rate Limiting with Redis

For a distributed system (multiple API servers), rate limits must be enforced globally — not per-server. Use Redis as the central counter store. Atomic operations via Lua script prevent race conditions:

-- Lua script for atomic token bucket check:
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= 1 then
    tokens = tokens - 1
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 3600)
    return 1  -- allowed
else
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    return 0  -- rejected
end
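From Python, the script can be loaded once and invoked via redis-py's register_script, which returns a callable that executes server-side in one round trip. A sketch, assuming a connected redis.Redis client (make_atomic_allow and the bucket: key prefix are illustrative names):

```python
import time

# The Lua token bucket script, embedded for registration.
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= 1 then
    tokens = tokens - 1
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 3600)
    return 1
else
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    return 0
end
"""

def make_atomic_allow(client, capacity, refill_rate):
    # register_script returns a callable; the script body runs atomically
    # inside Redis, so check-and-decrement cannot interleave across servers.
    script = client.register_script(TOKEN_BUCKET_LUA)
    def allow(user_id):
        return script(keys=[f"bucket:{user_id}"],
                      args=[capacity, refill_rate, time.time()]) == 1
    return allow
```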

Rate Limit Headers

Return in every response: X-RateLimit-Limit (total allowed), X-RateLimit-Remaining (remaining in window), X-RateLimit-Reset (Unix timestamp when window resets), Retry-After (seconds until next allowed request, on 429). Status 429 Too Many Requests on rejection.
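A sketch of the header logic (the X-RateLimit-* names are a de facto convention rather than an RFC standard; the function name and shape here are illustrative):

```python
import math

def rate_limit_response(limit, remaining, reset_ts, now):
    """Return (status, headers) given the limiter's current state."""
    allowed = remaining > 0
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining - 1 if allowed else 0),
        "X-RateLimit-Reset": str(int(reset_ts)),
    }
    if allowed:
        return 200, headers
    # Retry-After matters most: without it, clients retry immediately.
    headers["Retry-After"] = str(max(1, math.ceil(reset_ts - now)))
    return 429, headers
```

Emitting the headers on successful responses too lets well-behaved clients throttle themselves before ever hitting a 429.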

Key Design Decisions

  • Token bucket for API rate limits — allows bursts, easy to understand
  • Sliding window counter for accuracy without memory overhead of log
  • Redis Lua script for atomicity — prevents race conditions in distributed environment
  • Per-user + per-IP limits — per-IP catches unauthenticated abuse, per-user catches authenticated abuse
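The last point can be sketched as layered checks; CountingLimiter is a stand-in for any limiter exposing an allow() method (such as the TokenBucket above):

```python
class CountingLimiter:
    """Toy limiter: allows the first n calls, then rejects."""
    def __init__(self, n):
        self.n = n
    def allow(self):
        if self.n > 0:
            self.n -= 1
            return True
        return False

def allow_request(ip_limiter, user_limiter=None):
    # Per-IP check always applies (catches unauthenticated abuse).
    if not ip_limiter.allow():
        return False
    # Per-user check applies only to authenticated requests.
    if user_limiter is not None and not user_limiter.allow():
        return False
    return True
```

One design wrinkle: a request rejected by the per-user check has already consumed an IP token here; some designs instead check both limits before consuming either.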



