Implement a Rate Limiter: Token Bucket, Leaky Bucket, Sliding Window

Rate limiting is one of the most-asked system-design topics in 2026. It comes up at every API-heavy company — Stripe, Cloudflare, Twilio, Datadog, AWS — and at any company building public APIs at scale. The problem is conceptually simple but has enough variations and trade-offs to make a 30–45 minute interview round substantive. This guide covers the four standard rate-limiter algorithms (token bucket, leaky bucket, fixed window, sliding window), working code for each, the distributed-systems concerns at scale, and the design decisions that distinguish strong interview answers.

The Problem

A rate limiter restricts the number of requests a client can make within a time window. Common requirements:

  • Allow N requests per window (e.g., 100 requests per minute per user)
  • Reject or delay requests that exceed the limit
  • Reset or refill counters as time passes
  • Operate at low latency (typically <1ms decision time)
  • Scale to millions of unique clients
  • Survive process restarts and distributed deployments

The right algorithm depends on whether you want to allow short bursts (token bucket), enforce smooth output (leaky bucket), or minimize implementation complexity (fixed window).

Algorithm 1: Token Bucket

Each client has a bucket that holds up to capacity tokens. Tokens refill at refill_rate per second. Each request consumes one token; if no tokens are available, the request is rejected.

This algorithm allows controlled bursts (up to capacity requests if the bucket is full) while limiting steady-state throughput to refill_rate. Most public APIs use token bucket because of this burst-tolerance property.

import time
from threading import Lock

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()
        self.lock = Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity,
                              self.tokens + elapsed * self.refill_rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False


# Usage: 100 requests per minute, burst of 10
bucket = TokenBucket(capacity=10, refill_rate=100/60)
for _ in range(15):
    print(bucket.allow())

Properties: Allows bursts up to capacity; long-term rate capped at refill_rate. O(1) per request. Memory: O(1) per client.

Algorithm 2: Leaky Bucket

Same metaphor (a bucket), different mechanics. Requests enter a queue; the queue is drained at a fixed rate. Excess requests overflow and are dropped.

Unlike token bucket, leaky bucket smooths the output rate — even if requests arrive in bursts, they’re processed at constant rate. Used in network traffic shaping where output smoothness matters more than burst tolerance.

from collections import deque
import time
from threading import Lock

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests per second
        self.queue = deque()
        self.last_leak = time.monotonic()
        self.lock = Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_leak
            leaked = int(elapsed * self.leak_rate)
            if leaked > 0:
                for _ in range(min(leaked, len(self.queue))):
                    self.queue.popleft()
                # Advance by the time actually accounted for, not to `now`,
                # so the fractional remainder of a leak interval isn't lost.
                self.last_leak += leaked / self.leak_rate
            if len(self.queue) < self.capacity:
                self.queue.append(now)
                return True
            return False

Properties: Smooth output rate; doesn’t allow bursts. Used in routers and traffic shapers where downstream consumers can’t handle bursts.

Algorithm 3: Fixed Window Counter

Count requests within fixed time windows (e.g., a 1-minute window). Reset the counter at the start of each window. Reject requests once the counter has reached the limit.

import time
from threading import Lock

class FixedWindow:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.count = 0
        self.window_start = self._current_window()
        self.lock = Lock()

    def _current_window(self) -> int:
        return int(time.monotonic()) // self.window

    def allow(self) -> bool:
        with self.lock:
            current = self._current_window()
            if current != self.window_start:
                self.window_start = current
                self.count = 0
            if self.count < self.limit:
                self.count += 1
                return True
            return False

Properties: Simple, O(1) memory and time. Major flaw: 2× burst at window boundaries. If the limit is 100/minute and 100 requests come in the last second of one window plus 100 in the first second of the next, the client gets 200 requests in 2 seconds despite the “100/minute” limit.
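The 2× burst is easy to demonstrate deterministically with a variant of the class above that takes the clock as a parameter (a self-contained sketch; the timing values are illustrative):

```python
class FixedWindowSim:
    """Fixed-window counter with an injectable clock, so the
    boundary-burst flaw can be shown without waiting in real time."""
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.count = 0
        self.window_id = -1

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window)
        if window_id != self.window_id:
            self.window_id = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

limiter = FixedWindowSim(limit=100, window_seconds=60)
# 100 requests at t=59.5s (end of window 0), 100 more at t=60.5s (start of window 1):
allowed = sum(limiter.allow(59.5) for _ in range(100))
allowed += sum(limiter.allow(60.5) for _ in range(100))
print(allowed)  # 200 -- double the "100 per minute" limit, one second apart
```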

Algorithm 4: Sliding Window Log

Track the timestamp of every request in the past window. Count timestamps within the window; reject if count exceeds limit.

from collections import deque
import time
from threading import Lock

class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()
        self.lock = Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            cutoff = now - self.window
            while self.timestamps and self.timestamps[0] < cutoff:
                self.timestamps.popleft()
            if len(self.timestamps) < self.limit:
                self.timestamps.append(now)
                return True
            return False

Properties: Exact rate enforcement; no boundary bursts. Cost: O(N) memory per client where N is the limit. For limit=10000, this stores 10000 timestamps per client — substantial at scale.
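To make "substantial at scale" concrete, a rough back-of-envelope (the client count and per-timestamp size are illustrative assumptions):

```python
limit = 10_000           # timestamps retained per client at the cap
bytes_per_timestamp = 8  # one 64-bit float per timestamp
clients = 1_000_000      # hypothetical active clients
total_gb = limit * bytes_per_timestamp * clients / 1e9
print(total_gb)  # 80.0 GB of raw timestamps, before any deque overhead
```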

Algorithm 5: Sliding Window Counter (Hybrid)

Combines fixed-window simplicity with sliding-window accuracy. Track counters for the current and previous window; estimate the sliding count via weighted interpolation.

import time
from threading import Lock

class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.current_count = 0
        self.previous_count = 0
        self.window_index = self._window_index()
        self.lock = Lock()

    def _window_index(self) -> int:
        # Integer window indices avoid exact float comparisons at boundaries.
        return int(time.monotonic() // self.window)

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            current = self._window_index()
            if current != self.window_index:
                # Carry the old count only if exactly one window elapsed;
                # after a longer idle gap the history is stale.
                self.previous_count = (self.current_count
                                       if current == self.window_index + 1
                                       else 0)
                self.current_count = 0
                self.window_index = current
            elapsed_in_window = now - current * self.window
            weight = (self.window - elapsed_in_window) / self.window
            estimated = self.previous_count * weight + self.current_count
            if estimated < self.limit:
                self.current_count += 1
                return True
            return False

Properties: O(1) memory per client; approximate accuracy (the previous window's requests are assumed to be uniformly distributed). A common production choice at large-scale APIs because of this memory-vs-accuracy trade-off.
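A worked example of the weighted estimate (the numbers are illustrative): a 60-second window, 84 requests in the previous window, 36 so far in the current one, 18 seconds (30%) into the current window:

```python
previous_count = 84       # requests in the previous full window
current_count = 36        # requests so far in the current window
window = 60.0
elapsed_in_window = 18.0  # 30% of the way into the current window

weight = (window - elapsed_in_window) / window  # 0.7 of the previous window still overlaps
estimated = previous_count * weight + current_count
print(estimated)  # ~94.8, so a request is allowed under a limit of 100
```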

Distributed Rate Limiting

The above implementations are single-process. At scale, requests for the same client can hit any of N servers, requiring shared state.

Centralized counter (Redis)

Most common approach. Use Redis with atomic increments (INCR + EXPIRE) for fixed/sliding windows; or Lua scripts for token bucket atomicity.

# Token bucket as an atomic Redis Lua script (it runs server-side, so the
# read-modify-write is race-free). Pass ARGV[3] from the server's clock
# (e.g., the Redis TIME command), not from client clocks.
LUA_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local tokens = tonumber(redis.call('hget', key, 'tokens') or capacity)
local last_refill = tonumber(redis.call('hget', key, 'last_refill') or now)
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)
local allowed = 0
if tokens >= 1 then
    tokens = tokens - 1
    allowed = 1
end
redis.call('hset', key, 'tokens', tokens, 'last_refill', now)
-- Expire idle buckets so memory stays bounded; twice the full-refill time
-- is a safe horizon, since an expired bucket re-initializes as full anyway.
redis.call('expire', key, math.ceil(2 * capacity / refill_rate))
return allowed
"""

Trade-offs: every request hits Redis (latency), single point of failure, single hot key for popular clients. Use Redis Cluster + sharded keys to scale.

Local + sync

Each server enforces its own share of the limit (e.g., limit/N requests per minute across N servers), and clients see the aggregate effect. Drift between nodes exists but is bounded, and counts converge eventually. This gives lower latency and no central dependency, at the cost of brief over- or under-limiting while traffic shifts between nodes.
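The arithmetic behind the split, and its failure mode under skewed routing (node count and traffic shares here are illustrative):

```python
global_limit = 100                # requests/min for the client overall
nodes = 4
per_node = global_limit / nodes   # each node enforces 25/min locally

# Skewed routing: 70 of the client's 100 req/min land on one "hot" node.
hot_traffic, cold_traffic = 70, 30
hot_allowed = min(hot_traffic, per_node)                  # hot node throttles at 25
cold_allowed = min(cold_traffic, per_node * (nodes - 1))  # others sit under their caps
effective = hot_allowed + cold_allowed
print(effective)  # 55.0 -- well under the advertised 100/min until traffic rebalances
```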

Approximate algorithms

Count-min sketch for approximate counting at extreme scale. Trade accuracy for memory; useful when you have millions of distinct clients and per-client storage matters.

Common Interview Discussion Points

Choose the right algorithm

Token bucket for public APIs (allows bursts, intuitive for developers). Sliding window counter for high-volume internal APIs (memory-efficient, smooth enough). Fixed window for simple internal use cases where boundary bursts are acceptable.

Per-client identity

By API key, user ID, IP address, or composite. Discuss how to handle authenticated vs anonymous, NAT and shared IPs, and abuse mitigation.

Headers and response

Standard headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After. Standard status: 429 Too Many Requests. Discuss what makes the API friendly to clients (predictable, observable).
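A throttled response might then look like this (all values are illustrative):

```
HTTP/1.1 429 Too Many Requests
Retry-After: 23
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1767225623
```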

Different limits for different endpoints

Auth endpoints often have stricter limits than read endpoints. Discuss how to express this — per-route limits, tier-based limits (free vs paid users), endpoint-class limits.

Failure modes

What happens if Redis is down? Fail-open (let requests through) vs fail-closed (reject all). Most production systems fail-open with degraded local enforcement; security-sensitive endpoints fail-closed.
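A minimal fail-open sketch; the `central_allow` stub is an assumption standing in for the Redis-backed check, and here it simulates an outage:

```python
def central_allow(client_id: str) -> bool:
    """Stand-in for the Redis-backed limiter; simulates a store outage."""
    raise ConnectionError("rate-limit store unreachable")

def allow_with_fallback(client_id: str, fail_open: bool = True) -> bool:
    try:
        return central_allow(client_id)
    except ConnectionError:
        # Fail-open admits traffic during an outage (availability first);
        # pass fail_open=False for security-sensitive endpoints.
        return fail_open

print(allow_with_fallback("client-42"))         # True: degraded but serving
print(allow_with_fallback("client-42", False))  # False: reject while the store is down
```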

Frequently Asked Questions

What’s the typical interview answer to “design a rate limiter”?

Walk through token bucket as the default. Mention the alternatives (leaky bucket, fixed window, sliding window log, sliding window counter) with their trade-offs. Discuss distributed enforcement (Redis with Lua, eventual-consistency local enforcement). Cover failure modes and standard headers. The interviewer is testing breadth of knowledge plus specific implementation skill; aim for both.

Why does token bucket allow bursts?

The bucket starts full and refills slowly. A client that hasn’t sent requests recently has accumulated tokens up to capacity; they can spend them all in a burst before being throttled. Long-term rate is bounded by refill_rate; short-term burst is bounded by capacity. This matches what most APIs want — occasional bursts are fine, sustained high rate is not.
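Concretely, with the parameters from the usage example earlier (capacity 10, 100 tokens per minute), the worst case for a previously idle client in a single minute is:

```python
capacity = 10
refill_rate = 100 / 60        # tokens per second (100 per minute)
max_in_one_minute = capacity + refill_rate * 60
print(max_in_one_minute)  # ~110: the 10-token burst plus the steady 100/min
```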

How do I handle clock skew in distributed rate limiting?

Use monotonic clocks on each node (time.monotonic() in Python; CLOCK_MONOTONIC in Linux) for local rate limiting. For centralized rate limiting via Redis, use Redis’s own time (Redis returns server time via TIME command, used in Lua scripts) so all nodes share a clock. Don’t trust client-supplied timestamps.

How does Cloudflare or AWS implement rate limiting at their scale?

Sharded approaches at every layer. Edge nodes do approximate per-IP rate limiting using count-min sketch or similar. Origin-side limiting uses centralized counters (Redis, custom services). Cloudflare in particular has published on distributed counters using their workers infrastructure; AWS uses tiered approaches across CloudFront, API Gateway, and per-service. The key insight: at extreme scale, exact rate limiting is too expensive; approximate algorithms with bounded error are universal.

What’s the difference between rate limiting and throttling?

Often used interchangeably; subtle distinction: rate limiting rejects requests over the limit (return 429); throttling delays them (queue and process slower). Some systems do hybrid — reject over a hard limit, throttle between soft and hard limits. Discuss which behavior your design produces; interviewers sometimes probe this distinction.

See also: Producer-Consumer Problem, Fair Server Processing, Order Book Dynamics

💡 Strategies for Solving This Problem

System Design Meets Algorithms

Got this at Stripe in 2024. It's part system design, part implementation problem. They want to see if you understand rate limiting patterns AND can code them.

The Problem

Limit how many requests a user can make in a time window. For example: max 100 requests per hour per user.

Approach 1: Fixed Window Counter

Count requests in fixed time buckets (like hourly buckets starting at :00).

Pros: Simple, memory efficient
Cons: Burst problem at window edges. User could make 100 requests at 10:59 and 100 more at 11:00.

Approach 2: Sliding Window Log

Store timestamp of each request. For new request, count how many timestamps fall within the last window.

Pros: Accurate, no burst problem
Cons: Memory intensive (O(n) per user where n is limit)

Approach 3: Sliding Window Counter

Hybrid approach. Use buckets but calculate weighted average based on time position in current window.

Pros: Good accuracy with low memory
Cons: More complex logic

Approach 4: Token Bucket

Give user tokens at fixed rate. Each request consumes a token. Allows short bursts but enforces average rate.

Pros: Flexible, handles bursts smoothly
Cons: Refill logic to get right (often described as needing a background refill process, though computing the refill lazily on each request avoids that)

What I Used at Stripe

Started with fixed window (simplest). Interviewer asked about burst problem. Discussed sliding window log and token bucket. Implemented token bucket because Stripe needs to handle payment bursts during sales events.

Key question they asked: "How would this scale to millions of users?" Discussed Redis, sharding by user ID, and what happens during cache failures (fail open vs fail closed).
