Rate limiting protects services from abuse, ensures fair resource allocation, and prevents cascading failures under load. Designing a production-grade rate limiting service means choosing the right algorithm for each use case and enforcing limits consistently across a distributed fleet. This post covers the algorithms, data model, and distributed architecture you need for a system design interview.
Requirements
Functional Requirements
- Enforce per-user, per-IP, and per-API-key rate limits
- Support multiple algorithms: fixed window, sliding log, and token bucket
- Return standard HTTP 429 responses with Retry-After headers
- Allow per-endpoint limit configuration independent of global defaults
- Support burst allowances above the steady-state rate
Non-Functional Requirements
- Limit check latency under 2ms at p99 (added to every API request)
- Accurate enforcement within 1% error tolerance
- Survive Redis node failures without taking down the API
- Support 500,000 limit checks per second
Algorithm Comparison
Fixed Window
Divide time into fixed windows (e.g., 1-minute buckets). Increment a counter per key per window. Simple and fast: one Redis INCR plus TTL set. The weakness is the boundary spike problem: a client can send N requests just before the window ends and N more just after it starts, achieving 2N requests in a short period.
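The fixed-window counter can be sketched in a few lines. This is a minimal in-memory simulation (in production the counter lives in Redis as an INCR with a TTL; the class and method names here are illustrative):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Fixed-window counter: one bucket per (key, window start)."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        bucket = (key, int(now // self.window))  # window start identifies the bucket
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True
```

The boundary spike is visible here: with a limit of 2 and a 60-second window, requests at t=59 and t=61 land in different buckets, so a client can exceed the intended rolling rate across the boundary.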
Sliding Window Log
Store a sorted set of request timestamps per key. On each request, remove entries older than the window, count remaining entries, and add the new timestamp if under the limit. Accurate but memory-intensive: stores one entry per request. Suitable for low-volume keys with strict accuracy requirements such as payment APIs.
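A sketch of the sliding-window log, using a sorted Python list in place of the Redis sorted set (the trim step corresponds to ZREMRANGEBYSCORE; names are illustrative):

```python
import bisect

class SlidingLogLimiter:
    """Sliding-window log: one timestamp stored per accepted request."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.logs = {}  # key -> sorted list of accepted-request timestamps

    def allow(self, key, now):
        log = self.logs.setdefault(key, [])
        # evict timestamps that have fallen out of the rolling window
        del log[:bisect.bisect_right(log, now - self.window)]
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

Unlike the fixed window, this is exact over any rolling interval, at the cost of one stored entry per accepted request.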
Token Bucket
Each key has a bucket with a maximum capacity. Tokens refill at a steady rate. Each request consumes one token. Allows controlled bursting: a client that has been idle can accumulate tokens up to the bucket capacity and then send a burst. In Redis, the bucket is stored as two values: the current token count and the last refill timestamp. An atomic Lua script computes the refill, checks the count, and decrements in one round trip.
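A sketch of the token bucket with lazy refill, mirroring the two-value layout (token count plus last refill timestamp) described above; times are in seconds:

```python
class TokenBucket:
    """Token bucket with lazy refill computed on each request."""
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.rate = refill_rate          # tokens added per second
        self.tokens = float(capacity)    # start full: idle clients can burst
        self.last_refill = 0.0

    def allow(self, now):
        # refill for the elapsed interval, capped at bucket capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Refilling lazily on each check, rather than on a timer, is what makes the two-value Redis representation sufficient: no background process has to touch idle keys.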
Sliding Window Counter (Recommended Default)
A practical hybrid: maintain counters for the current and previous fixed windows. Estimate the count in the rolling window as prev_count * (1 - elapsed/window) + curr_count. This smooths the boundary spike at the cost of slight approximation. One Redis hash operation; accurate within a few percent.
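The estimate above reduces to a one-line function; `elapsed` is the time already spent in the current window (function name is illustrative):

```python
def sliding_window_allow(prev_count, curr_count, elapsed, window, limit):
    """Approximate the rolling-window count from two fixed-window counters:
    weight the previous window by how much of it still overlaps the
    rolling window, then add the current window's count."""
    estimated = prev_count * (1 - elapsed / window) + curr_count
    return estimated < limit
```

Halfway through a 60-second window, half of the previous window's count still falls inside the rolling window, so a full previous window contributes 50 toward a limit of 100.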
Data Model
Redis is the primary store for all counter state. Keys follow the pattern rl:{algorithm}:{dimension}:{entity_id}:{window}. For fixed window: rl:fw:user:42:1713312000. For token bucket: a Redis hash with fields tokens and last_refill. For sliding log: a Redis sorted set with score = timestamp and member = request_id.
Limit configurations are stored in a relational database and cached in memory on each gateway node. The config includes entity_type, scope (global or per-endpoint), algorithm, limit, window_seconds, burst_allowance, and action (block or throttle). Changes propagate via a config push event within seconds.
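The configuration record described above might be modeled like this (field names follow the post; the example string values are assumptions):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LimitConfig:
    entity_type: str      # e.g. "user", "ip", "api_key"
    scope: str            # "global" or a specific endpoint path
    algorithm: str        # e.g. "fixed_window", "sliding_log", "token_bucket"
    limit: int            # max requests per window
    window_seconds: int
    burst_allowance: int  # extra capacity above the steady-state rate
    action: str           # "block" or "throttle"
```

Keeping the record immutable (frozen) suits the propagation model: a config change is a new record pushed to gateways, not an in-place mutation.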
Distributed Enforcement
A single Redis cluster is the source of truth. All API gateway nodes perform limit checks against the same cluster, ensuring global accuracy. The Redis commands for each algorithm are wrapped in Lua scripts executed atomically with EVAL, eliminating race conditions without requiring distributed transactions.
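As a sketch, the fixed-window variant of such a script might look like the following; the key layout and argument order are illustrative, not prescribed by the post:

```python
# Fixed-window check as an atomic Redis Lua script (illustrative sketch).
# KEYS[1] = counter key, ARGV[1] = window TTL in seconds, ARGV[2] = limit.
FIXED_WINDOW_SCRIPT = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
if count > tonumber(ARGV[2]) then
    return 0
end
return 1
"""
# With redis-py this would run as:
#   allowed = r.eval(FIXED_WINDOW_SCRIPT, 1, "rl:fw:user:42:1713312000", 60, 100)
```

Because the whole read-modify-write runs inside the script, two concurrent requests can never both observe a count below the limit and both succeed.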
For resilience, use a Redis cluster with at least 3 primary shards. Keys are distributed by hash slot. If the Redis cluster becomes unavailable, the gateway falls back to a local in-process counter. Local counters enforce limits per-node rather than globally, which allows more traffic through than intended but prevents a Redis outage from taking down the API entirely. This fail-open policy is configurable per endpoint.
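The fail-open fallback can be expressed as a small wrapper; both checks here are callables taking a key and returning a boolean, and the names are illustrative:

```python
def check_with_fallback(check_redis, check_local, key, fail_open=True):
    """Prefer the global Redis-backed check; on a Redis connection
    error, fall back to the per-node local check (fail-open) or
    deny outright (fail-closed)."""
    try:
        return check_redis(key)
    except ConnectionError:
        return check_local(key) if fail_open else False
```

Fail-open suits most product endpoints; fail-closed is the safer per-endpoint setting for abuse-sensitive paths such as login or password reset.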
API Design
- POST /internal/check — Accepts entity_id, entity_type, endpoint; returns allowed/denied plus headers
- GET /configs — List all limit configurations
- PUT /configs/{config_id} — Update a limit configuration; triggers cache invalidation
- GET /stats/{entity_id} — Return current consumption metrics for debugging
The check endpoint is on the critical path and must return in under 2ms. The response includes X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers, and on denial a Retry-After header alongside the 429 status defined in RFC 6585.
Scalability
Redis handles 500,000+ simple operations per second on a single node. Sharding by entity_id distributes load. Hot keys (a single entity hammering the API) are mitigated by local pre-checks: each gateway node tracks a local counter and only calls Redis when the local count approaches the per-node share of the limit. This reduces Redis calls by 80% under uniform load.
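The local pre-check decision reduces to a threshold test; the 0.8 threshold below is an assumed tuning knob, not a value from the post:

```python
def should_call_redis(local_count, global_limit, num_nodes, threshold=0.8):
    """Hot-key mitigation sketch: skip the Redis round trip until this
    node's local count approaches its share of the global limit."""
    per_node_share = global_limit / num_nodes
    return local_count >= threshold * per_node_share
```

Traffic below the threshold is admitted on local state alone; only keys nearing their limit pay the Redis round trip, which is what concentrates accuracy where it matters.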
For very large deployments, use a two-layer approach: local token buckets at the gateway with periodic synchronization to a central Redis counter. The synchronization interval controls the accuracy-performance trade-off.
Interview Talking Points
Be prepared to justify your algorithm choice per use case (fixed window for simplicity, token bucket for bursting, sliding log for strict accuracy), explain the Lua atomicity guarantee, and describe the fail-open fallback. Interviewers often ask about the boundary spike problem specifically and whether you know the sliding window counter approximation as a practical fix.