Low Level Design: API Rate Limiter

An API rate limiter controls how many requests a client can make within a time window, protecting backend services from overload, preventing abuse, and enforcing fair usage. Rate limiting is a fundamental component in API gateways, authentication services, and any public-facing API. The design challenge is implementing accurate, low-latency rate limiting at scale across distributed servers without a single point of failure or excessive coordination overhead.

Rate Limiting Algorithms

Four primary algorithms exist:

  • Fixed Window Counter: count requests per client per fixed time window (e.g., per minute). Simple, but suffers boundary spikes — a client can make 2x the limit by requesting at the end of one window and the start of the next.
  • Sliding Window Log: store the timestamp of each request in a sorted set; count entries within the last T seconds. Accurate, but O(requests) memory per client.
  • Sliding Window Counter: approximate the sliding window using two fixed-window counts with linear interpolation — nearly as accurate, with O(1) memory.
  • Token Bucket: tokens refill at a constant rate; each request consumes one token. Allows bursts up to bucket capacity while enforcing the average rate.

-- Redis sliding window counter (approximate, O(1) memory)
-- KEYS[1] = client_id; ARGV[1] = limit L per minute; ARGV[2] = current Unix time T

local client_id = KEYS[1]
local limit = tonumber(ARGV[1])
local now = tonumber(ARGV[2])

local curr_window = math.floor(now / 60)
local key_curr = "rl:" .. client_id .. ":" .. curr_window
local key_prev = "rl:" .. client_id .. ":" .. (curr_window - 1)

local curr_count = tonumber(redis.call("GET", key_curr) or 0)
local prev_count = tonumber(redis.call("GET", key_prev) or 0)

-- Weight the previous window by how much of it still falls inside the sliding window
local elapsed_fraction = (now % 60) / 60
local estimate = prev_count * (1 - elapsed_fraction) + curr_count

if estimate >= limit then
  return {0, "rate_limited"}  -- reject
end

redis.call("INCR", key_curr)
redis.call("EXPIRE", key_curr, 120)  -- 2-minute TTL keeps the previous window readable
return {1, math.max(0, math.floor(limit - estimate - 1))}  -- allowed, approx. requests remaining

Token Bucket Implementation

Token bucket is the most flexible algorithm: it allows short bursts while enforcing an average rate. Implementation with Redis: store (last_refill_time, token_count) per client. On each request: compute the refilled token count, min(bucket_capacity, stored_tokens + elapsed_seconds × refill_rate), then check whether ≥ 1 token is available and deduct 1 if so. Use a Lua script for atomicity. With a refill rate of 100 tokens/min and a bucket capacity of 200, a client that has banked a full bucket can burst 200 requests in 2 seconds while still averaging 100 req/min over time.
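The refill-then-deduct arithmetic above can be sketched in-process. This is an illustrative, single-node version (the `TokenBucket` class and its parameter names are hypothetical); a production implementation would live in a Redis Lua script keyed per client, as described:

```python
import time

class TokenBucket:
    """In-process sketch of the token bucket math described above."""

    def __init__(self, capacity, refill_rate, now=None):
        self.capacity = capacity          # max banked tokens (burst size)
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)     # start with a full bucket
        self.last_refill = now if now is not None else time.monotonic()

    def allow(self, now=None, cost=1.0):
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = now if now is not None else time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# 100 req/min average rate, burst capacity 200 (the example from the text)
bucket = TokenBucket(capacity=200, refill_rate=100 / 60, now=0.0)
burst = sum(bucket.allow(now=0.0) for _ in range(250))
print(burst)  # 200 — the burst drains the bucket; later requests wait for refill
```

The `cost` parameter anticipates cost-weighted limiting (see the discussion points below): an expensive request simply spends more than one token.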

Distributed Rate Limiting

With N API gateway instances, local-only rate limiting is inaccurate: each instance independently allows L requests, so a client can make up to N×L requests by spreading traffic across instances. Centralized Redis rate limiting: all instances check a shared Redis counter — accurate, but adds ~1-2ms of latency per request and creates a dependency on Redis availability. Compromise: keep local counters and synchronize them to Redis every 100ms, accepting slight inaccuracy (~10% over-limit) in exchange for resilience to Redis failures. For stricter enforcement, use Redis Cluster with each rate-limit key hashing to a fixed shard.
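The compromise approach can be sketched as follows. This is a simplified, single-threaded illustration with hypothetical names: a plain dict stands in for Redis, time is passed explicitly, and a real gateway would sync on a background timer rather than inside the request path:

```python
class LocalSyncLimiter:
    """Local counting with periodic sync to a shared store (sketch)."""

    def __init__(self, shared_store, client_id, limit, sync_interval=0.1, now=0.0):
        self.shared = shared_store        # stands in for Redis
        self.client_id = client_id
        self.limit = limit                # per-window limit across all instances
        self.sync_interval = sync_interval
        self.local_count = 0              # requests admitted since last sync
        self.global_count = 0             # cluster-wide total as of last sync
        self.last_sync = now

    def allow(self, now):
        if now - self.last_sync >= self.sync_interval:
            # Push local increments to the shared store, pull the cluster total.
            self.global_count = self.shared.get(self.client_id, 0) + self.local_count
            self.shared[self.client_id] = self.global_count
            self.local_count = 0
            self.last_sync = now
        if self.global_count + self.local_count >= self.limit:
            return False
        self.local_count += 1
        return True

shared = {}
a = LocalSyncLimiter(shared, "client-1", limit=100, now=0.0)
b = LocalSyncLimiter(shared, "client-1", limit=100, now=0.0)
# Between syncs, each instance admits requests against a stale global count —
# exactly the bounded over-limit inaccuracy the text accepts.
```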

Rate Limit Headers and Response

Return informative headers with every response: X-RateLimit-Limit: 100 (requests allowed per window), X-RateLimit-Remaining: 42 (requests remaining in the current window), X-RateLimit-Reset: 1713400000 (Unix timestamp when the window resets). These X-RateLimit-* headers are a widely used convention rather than a standard. On rejection, return HTTP 429 Too Many Requests (defined in RFC 6585) with a Retry-After header. Well-behaved clients use these headers to implement adaptive request pacing, reducing retry storms.
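Putting the pieces together, a rejected request might look like this (header names per the conventions above; all values illustrative):

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1713400000
Content-Type: application/json

{"error": "rate_limited", "retry_after_seconds": 30}
```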

Multi-Tier Rate Limiting

Production systems layer multiple rate limits: per IP (prevent DDoS), per API key (enforce tier quotas), per user (prevent account abuse), per endpoint (protect expensive operations). Apply limits in order from cheapest to most expensive check: IP check (local cache) → API key check (Redis) → per-user check (Redis) → endpoint-specific check (Redis). Short-circuit on first rejection. Separate rate limits by granularity: 10 req/sec burst + 1000 req/min sustained + 50,000 req/day quota.
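The cheapest-first, short-circuit ordering can be sketched as a simple check chain. The tier names and predicate functions below are hypothetical placeholders — in practice each would wrap a local-cache lookup or a Redis call:

```python
def check_chain(request, checks):
    """Run rate-limit checks cheapest-first; short-circuit on first rejection."""
    for name, check in checks:
        if not check(request):
            return (False, name)   # rejected by this tier
    return (True, None)            # all tiers passed

# Hypothetical tiers, ordered local-cache first, Redis-backed last.
checks = [
    ("ip",       lambda r: r["ip"] not in {"10.0.0.66"}),  # local blocklist cache
    ("api_key",  lambda r: r.get("api_key") is not None),  # would hit Redis
    ("user",     lambda r: r["user_reqs"] < 1000),         # would hit Redis
    ("endpoint", lambda r: r["endpoint_reqs"] < 50),       # would hit Redis
]

req = {"ip": "10.0.0.1", "api_key": "k1", "user_reqs": 12, "endpoint_reqs": 3}
print(check_chain(req, checks))  # (True, None)
```

Because the chain returns on the first failure, a blocked IP never triggers the more expensive Redis-backed checks behind it.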

Key Interview Discussion Points

  • Fixed window boundary spike: clients can double their rate at minute boundaries — sliding window or token bucket prevents this
  • Redis Lua scripts for atomicity: read-modify-write on rate limit counters must be atomic to prevent race conditions
  • Graceful degradation: if Redis is unavailable, fail open (allow all requests) or fail closed (reject all) — business decision based on abuse risk vs. availability requirement
  • Rate limit by cost: GraphQL and LLM APIs charge by token/complexity, not request count — weight each request by cost before decrementing the bucket
  • Leaky bucket vs. token bucket: leaky bucket enforces a constant output rate (smooth); token bucket allows bursts up to capacity