Implement a Rate Limiter: Token Bucket, Leaky Bucket, Sliding Window
Rate limiting is one of the most-asked system-design topics in 2026. It comes up at every API-heavy company — Stripe, Cloudflare, Twilio, Datadog, AWS — and at any company building public APIs at scale. The problem is conceptually simple but has enough variations and trade-offs to make a 30–45 minute interview round substantive. This guide covers the four standard rate-limiter algorithms (token bucket, leaky bucket, fixed window, sliding window), working code for each, the distributed-systems concerns at scale, and the design decisions that distinguish strong interview answers.
The Problem
A rate limiter restricts the number of requests a client can make within a time window. Common requirements:
- Allow N requests per window (e.g., 100 requests per minute per user)
- Reject or delay requests that exceed the limit
- Reset or refill counters as time passes
- Operate at low latency (typically <1ms decision time)
- Scale to millions of unique clients
- Survive process restarts and distributed deployments
The right algorithm depends on whether you want to allow short bursts (token bucket), enforce smooth output (leaky bucket), or minimize implementation complexity (fixed window).
Algorithm 1: Token Bucket
Each client has a bucket that holds up to capacity tokens. Tokens refill at refill_rate per second. Each request consumes one token; if no tokens are available, the request is rejected.
This algorithm allows controlled bursts (up to capacity requests if the bucket is full) while limiting steady-state throughput to refill_rate. Most public APIs use token bucket because of this burst-tolerance property.
import time
from threading import Lock

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()
        self.lock = Lock()

    def allow(self) -> bool:
        with self.lock:
            # Lazy refill: credit tokens accrued since the last call,
            # capped at capacity. No background thread needed.
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity,
                              self.tokens + elapsed * self.refill_rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Usage: 100 requests per minute, burst of 10
bucket = TokenBucket(capacity=10, refill_rate=100 / 60)
for _ in range(15):
    print(bucket.allow())
Properties: Allows bursts up to capacity; long-term rate capped at refill_rate. O(1) per request. Memory: O(1) per client.
Algorithm 2: Leaky Bucket
Same metaphor (a bucket), different mechanics. Requests enter a queue; the queue is drained at a fixed rate. Excess requests overflow and are dropped.
Unlike token bucket, leaky bucket smooths the output rate — even if requests arrive in bursts, they’re processed at constant rate. Used in network traffic shaping where output smoothness matters more than burst tolerance.
from collections import deque
import time
from threading import Lock

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests per second
        self.queue = deque()
        self.last_leak = time.monotonic()
        self.lock = Lock()

    def allow(self) -> bool:
        with self.lock:
            # Drain whole requests that have "leaked" since the last call.
            now = time.monotonic()
            elapsed = now - self.last_leak
            leaked = int(elapsed * self.leak_rate)
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            if leaked > 0:
                # Advance by the time actually consumed, preserving the
                # fractional remainder (setting last_leak = now would
                # silently under-leak).
                self.last_leak += leaked / self.leak_rate
            if len(self.queue) < self.capacity:
                self.queue.append(now)
                return True
            return False
Properties: Smooth output rate; doesn’t allow bursts. Used in routers and traffic shapers where downstream consumers can’t handle bursts.
Algorithm 3: Fixed Window Counter
Count requests within fixed time windows (e.g., a 1-minute window). Reset the counter at the start of each window. Reject requests when the counter exceeds the limit.
import time
from threading import Lock

class FixedWindow:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.count = 0
        self.window_start = self._current_window()
        self.lock = Lock()

    def _current_window(self) -> int:
        return int(time.monotonic()) // self.window

    def allow(self) -> bool:
        with self.lock:
            current = self._current_window()
            if current != self.window_start:
                self.window_start = current
                self.count = 0
            if self.count < self.limit:
                self.count += 1
                return True
            return False
Properties: Simple, O(1) memory and time. Major flaw: 2× burst at window boundaries. If the limit is 100/minute and 100 requests come in the last second of one window plus 100 in the first second of the next, the client gets 200 requests in 2 seconds despite the “100/minute” limit.
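A quick way to see the boundary burst is to pin the clock with unittest.mock so two full bursts land one second on either side of a window edge. This is a test sketch, not part of the limiter itself:

from unittest.mock import patch

limiter = FixedWindow(limit=100, window_seconds=60)
# 100 requests one second before the minute boundary...
with patch("time.monotonic", return_value=59.0):
    burst_1 = sum(limiter.allow() for _ in range(100))
# ...and 100 more one second after it.
with patch("time.monotonic", return_value=61.0):
    burst_2 = sum(limiter.allow() for _ in range(100))
print(burst_1 + burst_2)  # 200 accepted in ~2s, despite the "100/minute" limit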
Algorithm 4: Sliding Window Log
Track the timestamp of every request in the past window. Count timestamps within the window; reject if count exceeds limit.
from collections import deque
import time
from threading import Lock

class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()
        self.lock = Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            cutoff = now - self.window
            while self.timestamps and self.timestamps[0] < cutoff:
                self.timestamps.popleft()
            if len(self.timestamps) < self.limit:
                self.timestamps.append(now)
                return True
            return False
Properties: Exact rate enforcement; no boundary bursts. Cost: O(N) memory per client where N is the limit. For limit=10000, this stores 10000 timestamps per client — substantial at scale.
Algorithm 5: Sliding Window Counter (Hybrid)
Combines fixed-window simplicity with sliding-window accuracy. Track counters for the current and previous window; estimate the sliding count via weighted interpolation.
import time
from threading import Lock

class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.current_count = 0
        self.previous_count = 0
        self.window_index = self._current_index()
        self.lock = Lock()

    def _current_index(self) -> int:
        # Integer window indexes avoid float-equality pitfalls when
        # comparing window boundaries.
        return int(time.monotonic() // self.window)

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            index = self._current_index()
            if index != self.window_index:
                # Carry the count forward only if exactly one window has
                # elapsed; after a longer idle gap the previous window is empty.
                self.previous_count = (self.current_count
                                       if index == self.window_index + 1 else 0)
                self.current_count = 0
                self.window_index = index
            # Weight the previous window by how much of it still overlaps
            # the sliding window ending now.
            elapsed_in_window = now - index * self.window
            weight = (self.window - elapsed_in_window) / self.window
            estimated = self.previous_count * weight + self.current_count
            if estimated < self.limit:
                self.current_count += 1
                return True
            return False
Properties: O(1) memory per client; approximate accuracy (weighted estimate). For example, 25% of the way into the current window with previous_count = 80 and current_count = 10, the estimate is 80 × 0.75 + 10 = 70. A popular production choice at large-scale APIs because of this memory-vs-accuracy trade-off.
Distributed Rate Limiting
The above implementations are single-process. At scale, requests for the same client can hit any of N servers, requiring shared state.
Centralized counter (Redis)
Most common approach. Use Redis atomic increments (INCR plus EXPIRE) for fixed and sliding windows, or a Lua script to make token-bucket updates atomic.
# Token bucket as an atomic Redis Lua script (runs server-side via EVAL/EVALSHA)
LUA_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local ttl = tonumber(ARGV[4])

local tokens = tonumber(redis.call('hget', key, 'tokens') or capacity)
local last_refill = tonumber(redis.call('hget', key, 'last_refill') or now)
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call('hset', key, 'tokens', tokens, 'last_refill', now)
-- Expire idle buckets so keys for inactive clients don't accumulate
redis.call('expire', key, ttl)
return allowed
"""
Trade-offs: every request hits Redis (latency), single point of failure, single hot key for popular clients. Use Redis Cluster + sharded keys to scale.
Local + sync
Each server enforces its own share of the limit (e.g., 100/N requests per minute across N servers); clients see the aggregate effect. Drift exists but is bounded, and enforcement is eventually consistent. This gives lower latency and no central dependency, at the cost of brief over-limit periods during traffic shifts. A static-split sketch follows.
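A minimal sketch of the static split, reusing the TokenBucket class above; in practice the server count would come from service discovery rather than a constant:

N_SERVERS = 4
GLOBAL_LIMIT_PER_MIN = 100
GLOBAL_BURST = 10

# Each server runs its own bucket sized to 1/N of the global budget.
local_bucket = TokenBucket(
    capacity=max(1, GLOBAL_BURST // N_SERVERS),
    refill_rate=(GLOBAL_LIMIT_PER_MIN / N_SERVERS) / 60,
)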
Approximate algorithms
Use a count-min sketch for approximate counting at extreme scale. It trades accuracy for memory and is useful when you have millions of distinct clients and per-client storage matters; a minimal implementation follows.
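A minimal count-min sketch, for illustration rather than production use; the width and depth values here are arbitrary and control the error/confidence trade-off:

import hashlib

class CountMinSketch:
    def __init__(self, width: int = 2048, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, key: str):
        # One hash per row, derived by salting the key with the row number.
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8)
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def add(self, key: str) -> None:
        for row, col in self._indexes(key):
            self.table[row][col] += 1

    def estimate(self, key: str) -> int:
        # Never undercounts; taking the min across rows bounds collision error.
        return min(self.table[row][col] for row, col in self._indexes(key))

Note that counts only grow; a real rate limiter pairs the sketch with time windows (e.g., one sketch per window, rotated as windows expire).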
Common Interview Discussion Points
Choose the right algorithm
Token bucket for public APIs (allows bursts, intuitive for developers). Sliding window counter for high-volume internal APIs (memory-efficient, smooth enough). Fixed window for simple internal use cases where boundary bursts are acceptable.
Per-client identity
By API key, user ID, IP address, or composite. Discuss how to handle authenticated vs anonymous, NAT and shared IPs, and abuse mitigation.
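One possible keying scheme, purely illustrative, preferring stronger identifiers when available:

def client_key(api_key: str | None, user_id: str | None, ip: str) -> str:
    # Prefer credentials tied to an account; fall back to IP for
    # anonymous traffic (weak under NAT and shared egress IPs).
    if api_key:
        return f"key:{api_key}"
    if user_id:
        return f"user:{user_id}"
    return f"ip:{ip}"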
Headers and response
Standard headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After. Standard status: 429 Too Many Requests. Discuss what makes the API friendly to clients (predictable, observable).
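A sketch of assembling these headers; the exact set varies by API (an IETF draft standardizes RateLimit-* names), so treat the fields below as the common de-facto convention rather than a standard:

import time

def rate_limit_headers(limit: int, remaining: int, reset_epoch: int) -> dict:
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),  # epoch seconds when the limit resets
        "Retry-After": str(max(0, reset_epoch - int(time.time()))),
    }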
Different limits for different endpoints
Auth endpoints often have stricter limits than read endpoints. Discuss how to express this — per-route limits, tier-based limits (free vs paid users), endpoint-class limits.
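One way to express tiered, per-endpoint-class limits as data; the classes, tiers, and numbers below are made up for illustration:

# (limit, window_seconds) keyed by (endpoint class, plan tier)
LIMITS = {
    ("auth", "free"): (10, 60),
    ("auth", "paid"): (30, 60),
    ("read", "free"): (100, 60),
    ("read", "paid"): (1000, 60),
}

def limiter_for(endpoint_class: str, tier: str) -> FixedWindow:
    limit, window = LIMITS[(endpoint_class, tier)]
    return FixedWindow(limit, window)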
Failure modes
What happens if Redis is down? Fail-open (let requests through) vs fail-closed (reject all). Most production systems fail-open with degraded local enforcement; security-sensitive endpoints fail-closed.
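A sketch of the fail-open pattern as a wrapper around any limiter check; the exception type to catch depends on the backend client (redis-py raises redis.ConnectionError, for example):

def allow_fail_open(check) -> bool:
    # check is a zero-argument callable that performs the limiter lookup.
    try:
        return check()
    except Exception:
        # Fail-open: prefer availability when the backend is down.
        # Security-sensitive endpoints would return False (fail-closed),
        # ideally with a degraded local limiter as a middle ground.
        return True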
Frequently Asked Questions
What’s the typical interview answer to “design a rate limiter”?
Walk through token bucket as the default. Mention the alternatives (leaky bucket, fixed window, sliding window log, sliding window counter) with their trade-offs. Discuss distributed enforcement (Redis with Lua, eventual-consistency local enforcement). Cover failure modes and standard headers. The interviewer is testing breadth of knowledge plus specific implementation skill; aim for both.
Why does token bucket allow bursts?
The bucket starts full and refills slowly. A client that hasn’t sent requests recently has accumulated tokens up to capacity; they can spend them all in a burst before being throttled. Long-term rate is bounded by refill_rate; short-term burst is bounded by capacity. This matches what most APIs want — occasional bursts are fine, sustained high rate is not.
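This is easy to observe with the TokenBucket class above and a pinned clock (a test sketch using unittest.mock):

from unittest.mock import patch

with patch("time.monotonic", return_value=0.0):
    b = TokenBucket(capacity=5, refill_rate=1.0)
    print([b.allow() for _ in range(7)])  # 5 True, then 2 False
with patch("time.monotonic", return_value=2.0):
    print([b.allow() for _ in range(3)])  # 2 tokens refilled: True, True, False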
How do I handle clock skew in distributed rate limiting?
Use monotonic clocks on each node (time.monotonic() in Python; CLOCK_MONOTONIC on Linux) for local rate limiting. For centralized rate limiting via Redis, use Redis's own clock (the TIME command, callable from Lua scripts) so all nodes share one clock. Don't trust client-supplied timestamps.
How does Cloudflare or AWS implement rate limiting at their scale?
Sharded approaches at every layer. Edge nodes do approximate per-IP rate limiting using count-min sketch or similar. Origin-side limiting uses centralized counters (Redis, custom services). Cloudflare has published on distributed counters built on its Workers infrastructure; AWS uses tiered approaches across CloudFront, API Gateway, and per-service limits. The key insight: at extreme scale, exact rate limiting is too expensive, so approximate algorithms with bounded error are the norm.
What’s the difference between rate limiting and throttling?
Often used interchangeably; subtle distinction: rate limiting rejects requests over the limit (return 429); throttling delays them (queue and process slower). Some systems do hybrid — reject over a hard limit, throttle between soft and hard limits. Discuss which behavior your design produces; interviewers sometimes probe this distinction.
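The behavioral difference in miniature, reusing TokenBucket from above; a real throttler would queue requests or compute the exact sleep from the refill rate instead of polling:

import time

def reject(bucket: TokenBucket) -> bool:
    # Rate limiting: answer immediately; the caller returns 429 on False.
    return bucket.allow()

def throttle(bucket: TokenBucket) -> None:
    # Throttling: delay until capacity is available, then proceed.
    while not bucket.allow():
        time.sleep(0.01)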
See also: Producer-Consumer Problem • Fair Server Processing • Order Book Dynamics
💡 Strategies for Solving This Problem
System Design Meets Algorithms
Got this at Stripe in 2024. It's part system design, part implementation problem. They want to see if you understand rate limiting patterns AND can code them.
The Problem
Limit how many requests a user can make in a time window. For example: max 100 requests per hour per user.
Approach 1: Fixed Window Counter
Count requests in fixed time buckets (like hourly buckets starting at :00).
Pros: Simple, memory efficient
Cons: Burst problem at window edges. User could make 100 requests at 10:59 and 100 more at 11:00.
Approach 2: Sliding Window Log
Store timestamp of each request. For new request, count how many timestamps fall within the last window.
Pros: Accurate, no burst problem
Cons: Memory intensive (O(n) per user where n is limit)
Approach 3: Sliding Window Counter
Hybrid approach. Use buckets but calculate weighted average based on time position in current window.
Pros: Good accuracy with low memory
Cons: More complex logic
Approach 4: Token Bucket
Give user tokens at fixed rate. Each request consumes a token. Allows short bursts but enforces average rate.
Pros: Flexible, handles bursts smoothly
Cons: A bit more state and logic per client. (A background refill process isn't actually required: refill lazily on each request, as in the implementation above.)
What I Used at Stripe
Started with fixed window (simplest). Interviewer asked about burst problem. Discussed sliding window log and token bucket. Implemented token bucket because Stripe needs to handle payment bursts during sales events.
Key question they asked: "How would this scale to millions of users?" Discussed Redis, sharding by user ID, and what happens during cache failures (fail open vs fail closed).