Rate limiting protects services from abuse, ensures fair resource allocation, and prevents cascading failures under load. Designing a production-grade rate limiting service means choosing the right algorithm for each use case and enforcing limits consistently across a distributed fleet. This post covers the algorithms, data model, and distributed architecture you need for a system design interview.
Requirements
Functional Requirements
- Enforce per-user, per-IP, and per-API-key rate limits
- Support multiple algorithms: fixed window, sliding log, and token bucket
- Return standard HTTP 429 responses with Retry-After headers
- Allow per-endpoint limit configuration independent of global defaults
- Support burst allowances above the steady-state rate
Non-Functional Requirements
- Limit check latency under 2ms at p99 (added to every API request)
- Accurate enforcement within 1% error tolerance
- Survive Redis node failures without taking down the API
- Support 500,000 limit checks per second
Algorithm Comparison
Fixed Window
Divide time into fixed windows (e.g., 1-minute buckets). Increment a counter per key per window. Simple and fast: one Redis INCR plus TTL set. The weakness is the boundary spike problem: a client can send N requests just before the window ends and N more just after it starts, achieving 2N requests in a short period.
Sliding Window Log
Store a sorted set of request timestamps per key. On each request, remove entries older than the window, count remaining entries, and add the new timestamp if under the limit. Accurate but memory-intensive: stores one entry per request. Suitable for low-volume keys with strict accuracy requirements such as payment APIs.
Token Bucket
Each key has a bucket with a maximum capacity. Tokens refill at a steady rate. Each request consumes one token. Allows controlled bursting: a client that has been idle can accumulate tokens up to the bucket capacity and then send a burst. Implemented in Redis using two values: the current token count and the last refill timestamp. Atomic Lua script computes the refill, checks the count, and decrements in one round trip.
Sliding Window Counter (Recommended Default)
A practical hybrid: maintain counters for the current and previous fixed windows. Estimate the count in the rolling window as prev_count * (1 - elapsed/window) + curr_count. This smooths the boundary spike at the cost of slight approximation. One Redis hash operation; accurate within a few percent.
Data Model
Redis is the primary store for all counter state. Keys follow the pattern rl:{algorithm}:{dimension}:{entity_id}:{window}. For fixed window: rl:fw:user:42:1713312000. For token bucket: a Redis hash with fields tokens and last_refill. For sliding log: a Redis sorted set with score = timestamp and member = request_id.
Limit configurations are stored in a relational database and cached in memory on each gateway node. The config includes entity_type, scope (global or per-endpoint), algorithm, limit, window_seconds, burst_allowance, and action (block or throttle). Changes propagate via a config push event within seconds.
Distributed Enforcement
A single Redis cluster is the source of truth. All API gateway nodes perform limit checks against the same cluster, ensuring global accuracy. The Redis commands for each algorithm are wrapped in Lua scripts executed atomically with EVAL, eliminating race conditions without requiring distributed transactions.
For resilience, use a Redis cluster with at least 3 primary shards. Keys are distributed by hash slot. If the Redis cluster becomes unavailable, the gateway falls back to a local in-process counter. Local counters enforce limits per-node rather than globally, which allows more traffic through than intended but prevents a Redis outage from taking down the API entirely. This fail-open policy is configurable per endpoint.
API Design
- POST /internal/check — Accepts entity_id, entity_type, endpoint; returns allowed/denied plus headers
- GET /configs — List all limit configurations
- PUT /configs/{config_id} — Update a limit configuration; triggers cache invalidation
- GET /stats/{entity_id} — Return current consumption metrics for debugging
The check endpoint is on the critical path. It must return in under 2ms. The response includes X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and on denial Retry-After as per RFC 6585.
Scalability
Redis handles 500,000+ simple operations per second on a single node. Sharding by entity_id distributes load. Hot keys (a single entity hammering the API) are mitigated by local pre-checks: each gateway node tracks a local counter and only calls Redis when the local count approaches the per-node share of the limit. This reduces Redis calls by 80% under uniform load.
For very large deployments, use a two-layer approach: local token buckets at the gateway with periodic synchronization to a central Redis counter. The synchronization interval controls the accuracy-performance trade-off.
Interview Talking Points
Be prepared to justify your algorithm choice per use case (fixed window for simplicity, token bucket for bursting, sliding log for strict accuracy), explain the Lua atomicity guarantee, and describe the fail-open fallback. Interviewers often ask about the boundary spike problem specifically and whether you know the sliding window counter approximation as a practical fix.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Stripe Interview Guide 2026: Process, Bug Bash Round, and Payment Systems