Overview
A token bucket rate limiter controls how many requests a client can make within a time period while allowing short bursts above the average rate. Tokens accumulate in a conceptual bucket at a fixed refill rate R tokens per second up to a burst capacity B. Each request consumes one token; if the bucket is empty the request is rejected or queued. This LLD covers the algorithm mechanics, the database model for rate limit rules, a Redis Lua script for atomic distributed enforcement, multi-tier limits combining per-user, per-endpoint, and global constraints, a comparison with the sliding window variant, and the key design decisions that make this production-ready.
Token Bucket Algorithm
The algorithm state per client is two values: tokens (current token count) and last_refill_at (timestamp of last check). On each request:
- Compute the elapsed time since last_refill_at.
- Add elapsed_seconds * rate_per_second tokens, capped at burst_capacity.
- If tokens >= 1: consume 1 token and allow the request.
- If tokens < 1: reject with HTTP 429 and return a Retry-After header of (1 - tokens) / rate_per_second seconds.
The key insight is that the burst capacity B lets a client that was idle accumulate tokens and then fire a burst of up to B requests before being throttled down to the sustainable rate R/sec.
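The refill-and-consume steps above can be sketched as a single-process Python class. This is an illustration of the math only, not the production path (the distributed version is the Redis Lua script later in this document):

```python
import time

class TokenBucket:
    """Single-process token bucket; same refill math as the distributed version."""

    def __init__(self, rate_per_second, burst_capacity):
        self.rate = rate_per_second
        self.burst = burst_capacity
        self.tokens = float(burst_capacity)     # start with a full bucket
        self.last_refill_at = time.monotonic()

    def allow(self, consume=1.0, now=None):
        """Refill based on elapsed time, then try to consume."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last_refill_at)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_refill_at = now
        if self.tokens >= consume:
            self.tokens -= consume
            return True
        return False
```

With rate=10 and burst=5, an idle client can fire 5 requests back-to-back, after which it is held to the sustained 10/sec refill rate.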
Core Data Model
-- Rate limit rule definitions (stored in DB, cached in app memory)
CREATE TABLE RateLimitRule (
    rule_id         BIGSERIAL PRIMARY KEY,
    key_pattern     VARCHAR(128) NOT NULL,  -- e.g. 'user:{user_id}', 'endpoint:/api/search'
    scope           VARCHAR(32) NOT NULL,   -- 'user', 'endpoint', 'global', 'ip'
    rate_per_second FLOAT NOT NULL,         -- tokens added per second
    burst_capacity  INT NOT NULL,           -- max tokens in bucket
    plan            VARCHAR(32),            -- NULL = applies to all, or 'free', 'pro', 'enterprise'
    is_active       BOOLEAN NOT NULL DEFAULT TRUE,
    description     TEXT,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (key_pattern, plan)
);
-- Example rules:
-- key_pattern='user:{user_id}', scope='user', rate=10, burst=50, plan='free'
-- key_pattern='user:{user_id}', scope='user', rate=100, burst=500, plan='pro'
-- key_pattern='endpoint:/api/search', scope='endpoint', rate=1000, burst=2000, plan=NULL
-- key_pattern='global', scope='global', rate=50000, burst=100000, plan=NULL
-- Audit log for rate limit violations (optional, for monitoring)
CREATE TABLE RateLimitViolation (
    violation_id   BIGSERIAL PRIMARY KEY,
    bucket_key     VARCHAR(256) NOT NULL,
    user_id        BIGINT,
    ip_address     INET,
    endpoint       VARCHAR(255),
    requested_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    tokens_at_time FLOAT NOT NULL,
    rule_id        BIGINT REFERENCES RateLimitRule(rule_id)
);
CREATE INDEX idx_violation_user ON RateLimitViolation(user_id, requested_at DESC);
CREATE INDEX idx_violation_ip ON RateLimitViolation(ip_address, requested_at DESC);
Redis Lua Script for Atomic Token Consumption
Atomicity is critical: without it, two concurrent requests could both read tokens=1, both decide to allow, and both consume a token — resulting in allowing two requests when only one token remained. A Lua script executes atomically on Redis (single-threaded command execution).
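To make the race concrete, here is a deterministic simulation of the lost-update interleaving, with plain Python dict reads and writes standing in for the non-atomic GET/SET round-trips of two app servers:

```python
# One token left in the bucket; two requests arrive concurrently.
bucket = {"tokens": 1}

# Interleaving: both servers READ before either WRITES back.
a_seen = bucket["tokens"]          # server A reads 1
b_seen = bucket["tokens"]          # server B also reads 1

a_allowed = a_seen >= 1
if a_allowed:
    bucket["tokens"] = a_seen - 1  # A writes 0

b_allowed = b_seen >= 1
if b_allowed:
    bucket["tokens"] = b_seen - 1  # B also writes 0 -- lost update

# Both requests were allowed even though only one token existed.
assert a_allowed and b_allowed and bucket["tokens"] == 0
```

Running the read-refill-write cycle inside one Lua script removes every such interleaving point.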
-- KEYS[1]: bucket key, e.g. "rl:user:42:global"
-- ARGV[1]: rate_per_second (float)
-- ARGV[2]: burst_capacity (int)
-- ARGV[3]: current unix timestamp as float (seconds)
-- ARGV[4]: tokens to consume (usually 1)
-- Returns: {allowed (0/1), remaining_tokens (float), retry_after_ms (int)}
local key = KEYS[1]
local rate = tonumber(ARGV[1])
local burst = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local consume = tonumber(ARGV[4])
local data = redis.call('HMGET', key, 'tokens', 'last_refill_at')
local tokens = tonumber(data[1])
local last_refill = tonumber(data[2])
if tokens == nil then
  -- First request: initialize bucket full
  tokens = burst
  last_refill = now
end

-- Refill tokens based on elapsed time
local elapsed = math.max(0, now - last_refill)
tokens = math.min(burst, tokens + elapsed * rate)

local allowed = 0
local retry_after_ms = 0
if tokens >= consume then
  tokens = tokens - consume
  allowed = 1
else
  -- Compute how long until enough tokens are available
  local deficit = consume - tokens
  retry_after_ms = math.ceil((deficit / rate) * 1000)
end

-- TTL: bucket expires after burst/rate seconds of inactivity (2x for safety)
local ttl = math.ceil(burst / rate * 2)
redis.call('HSET', key,  -- HSET takes multiple field-value pairs; HMSET is deprecated
  'tokens', tokens,
  'last_refill_at', now)
redis.call('EXPIRE', key, ttl)
return {allowed, tokens * 1000, retry_after_ms}
-- Note: remaining tokens are returned x1000 because Redis truncates Lua
-- numbers to integers in script replies; the caller divides by 1000.
Python Integration: Multi-Tier Rate Limiting
Production systems apply multiple overlapping limits simultaneously. A single request is checked against per-user, per-endpoint, and global buckets — all must pass.
import time
import redis
import hashlib
from functools import lru_cache
redis_client = redis.Redis(host='redis', port=6379, decode_responses=True)
# Load Lua script once and store SHA
LUA_SCRIPT = open('/app/rate_limit.lua').read()
SCRIPT_SHA = redis_client.script_load(LUA_SCRIPT)
@lru_cache(maxsize=512)
def get_rules_for_plan(plan: str) -> list:
    """Cache rule lookups by plan. Note: lru_cache has no TTL, so pair this
    with a periodic get_rules_for_plan.cache_clear() (e.g. every 60s) so
    rule changes in the DB propagate to app memory."""
    return db.query(
        "SELECT * FROM RateLimitRule WHERE (plan = %s OR plan IS NULL) AND is_active",
        plan,
    )

def check_rate_limit(user_id: int, endpoint: str, user_plan: str, ip: str) -> dict:
    """
    Enforce multi-tier token bucket limits.
    Returns {'allowed': bool, 'retry_after_ms': int, 'violated_tier': str|None}
    """
    now = time.time()
    rules = get_rules_for_plan(user_plan)

    # Build list of (bucket_key, rule) pairs to check
    checks = []
    for rule in rules:
        if rule['scope'] == 'user':
            key = f"rl:user:{user_id}:{rule['rule_id']}"
        elif rule['scope'] == 'endpoint':
            key = f"rl:ep:{hashlib.md5(endpoint.encode()).hexdigest()}:{rule['rule_id']}"
        elif rule['scope'] == 'global':
            key = f"rl:global:{rule['rule_id']}"
        elif rule['scope'] == 'ip':
            key = f"rl:ip:{ip}:{rule['rule_id']}"
        else:
            continue
        checks.append((key, rule))

    # Evaluate all tiers — pipeline EVALSHA calls for efficiency
    pipe = redis_client.pipeline(transaction=False)
    for bucket_key, rule in checks:
        pipe.evalsha(
            SCRIPT_SHA,
            1,                        # number of keys
            bucket_key,               # KEYS[1]
            rule['rate_per_second'],  # ARGV[1]
            rule['burst_capacity'],   # ARGV[2]
            now,                      # ARGV[3]
            1,                        # ARGV[4]: tokens to consume
        )
    results = pipe.execute()

    for (bucket_key, rule), result in zip(checks, results):
        allowed, remaining_x1000, retry_after_ms = result
        if not allowed:
            log_violation(bucket_key, user_id, ip, endpoint,
                          remaining_x1000 / 1000, rule['rule_id'])
            return {
                'allowed': False,
                'retry_after_ms': retry_after_ms,
                'violated_tier': rule['scope'],
                'violated_rule': rule['key_pattern'],
            }
    return {'allowed': True, 'retry_after_ms': 0, 'violated_tier': None}

def rate_limit_middleware(request, next_handler):
    result = check_rate_limit(
        user_id=request.user.id,
        endpoint=request.path,
        user_plan=request.user.plan,
        ip=request.remote_addr,
    )
    if not result['allowed']:
        return http_429(
            retry_after_ms=result['retry_after_ms'],
            violated=result['violated_tier'],
        )
    return next_handler(request)
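On the client side, the retry_after_ms value enables precise backoff instead of blind exponential retry. A minimal sketch (the helper name and response-dict shape are hypothetical, not part of the server code above):

```python
import time

def call_with_backoff(do_request, max_attempts=3, sleep=time.sleep):
    """Retry a request on HTTP 429, sleeping exactly as long as the server's
    retry_after_ms hint says. do_request() is assumed to return a dict like
    {'status': int, 'retry_after_ms': int}."""
    resp = do_request()
    for _ in range(max_attempts - 1):
        if resp['status'] != 429:
            break
        sleep(resp['retry_after_ms'] / 1000.0)  # wait for tokens to refill
        resp = do_request()
    return resp
```

Because the hint is derived from the refill rate, a well-behaved client wakes up exactly when a token becomes available rather than polling.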
Sliding Window Variant Comparison
The sliding window log algorithm stores the exact timestamp of each request in a sorted set and counts requests in the last N seconds. It is more precise than the token bucket but uses more memory.
# Sliding window log using Redis sorted set
def sliding_window_check(redis_client, key, limit, window_seconds):
    now = time.time()
    window_start = now - window_seconds
    pipe = redis_client.pipeline()
    pipe.zremrangebyscore(key, '-inf', window_start)  # remove old entries
    # Member must be unique per request: a bare timestamp can collide under
    # concurrency, so append a random suffix in production.
    pipe.zadd(key, {str(now): now})                   # add current request
    pipe.zcard(key)                                   # count in window
    pipe.expire(key, window_seconds + 1)
    _, _, count, _ = pipe.execute()
    return count <= limit
# Comparison:
# Token Bucket: O(1) memory per bucket. Allows bursts up to burst_capacity.
# Small floating-point state (tokens, timestamp).
# Sliding Window: O(limit) memory per key (one entry per request).
# Precise count — no burst beyond limit in any window.
# Better for strict "no more than N per hour" compliance.
# Fixed Window: O(1) memory. Resets at boundary — double-rate attack possible.
#
# Choose token bucket when bursts are legitimate (API clients doing batch ops).
# Choose sliding window when strict compliance is required (billing, abuse prevention).
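The strict-compliance property in the comparison above can be demonstrated without Redis. This in-memory sketch (a deque of timestamps, O(limit) memory per key) enforces "no more than N in any window":

```python
import collections

class SlidingWindowLog:
    """In-memory sliding window log: O(limit) memory, strict per-window cap."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = collections.deque()

    def allow(self, now):
        # Drop entries that have fallen out of the window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```

Unlike the token bucket, there is no burst reserve: once limit requests have landed in the current window, everything else is rejected until the oldest entry ages out.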
Key Design Decisions
- Lua atomicity prevents race conditions: The entire read-compute-write cycle for token consumption runs as a single atomic Redis command via EVALSHA. Two concurrent requests for the same bucket can never both see the same token count because Redis processes Lua scripts single-threaded. This eliminates the need for distributed locks and keeps latency under 1ms per check.
- Burst capacity handles legitimate traffic spikes: A burst_capacity of 50 for a rate of 10/sec means an idle client can send 50 requests immediately without being throttled, while sustained throughput is capped at 10/sec. This correctly handles clients that batch requests after a quiet period — a common pattern for webhook consumers, mobile sync clients, and cron-based API callers.
- Per-tier rate limits by plan: Maintaining separate RateLimitRule rows per plan (free/pro/enterprise) with different rate and burst values means a free-plan user hitting the per-user limit does not affect pro users’ buckets. Checking all tiers in a single Redis pipeline keeps the overhead to one round-trip regardless of how many tiers apply.
- TTL-based bucket expiry prevents memory leaks: Each bucket key is given a TTL of 2 * burst_capacity / rate seconds. A bucket for rate=10, burst=50 expires after 10 seconds of inactivity, automatically cleaning up state for churned users or inactive endpoints without any explicit eviction job.