Low Level Design: Rate Limiter Service

What Is a Rate Limiter Service?

A rate limiter controls how often a client can call an API or consume a resource within a time window. Rather than embedding rate limiting logic inside each microservice, a standalone rate limiter service centralizes enforcement: any service calls the rate limiter before processing a request, receives an allow/deny decision, and proceeds accordingly. This architecture keeps individual services stateless with respect to quota tracking and enables global limits across horizontally scaled fleets.

Core Data Model

The rate limiter needs to persist quota configuration and runtime counters. Configuration lives in a relational database; counters live in Redis for sub-millisecond access.

Configuration Schema

CREATE TABLE rate_limit_rules (
    rule_id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    client_key    VARCHAR(256)    NOT NULL,  -- user_id, api_key, or IP
    endpoint      VARCHAR(256)    NOT NULL DEFAULT '*',
    algorithm     ENUM('token_bucket','sliding_window_log','sliding_window_counter','fixed_window') NOT NULL,
    max_requests  INT UNSIGNED    NOT NULL,
    window_secs   INT UNSIGNED    NOT NULL,
    burst_size    INT UNSIGNED    NOT NULL DEFAULT 0,
    enabled       TINYINT(1)      NOT NULL DEFAULT 1,
    created_at    DATETIME        NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at    DATETIME        NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    PRIMARY KEY (rule_id),
    UNIQUE INDEX idx_client_endpoint (client_key, endpoint)
);

CREATE TABLE rate_limit_overrides (
    override_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    client_key    VARCHAR(256)    NOT NULL,
    rule_id       BIGINT UNSIGNED NOT NULL,
    max_requests  INT UNSIGNED    NOT NULL,
    valid_until   DATETIME        NOT NULL,
    PRIMARY KEY (override_id),
    INDEX idx_client (client_key),
    FOREIGN KEY (rule_id) REFERENCES rate_limit_rules(rule_id) ON DELETE CASCADE
);

CREATE TABLE rate_limit_violations (
    violation_id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    client_key    VARCHAR(256)    NOT NULL,
    endpoint      VARCHAR(256)    NOT NULL,
    violated_at   DATETIME        NOT NULL DEFAULT CURRENT_TIMESTAMP,
    request_count INT UNSIGNED    NOT NULL,
    limit_value   INT UNSIGNED    NOT NULL,
    PRIMARY KEY (violation_id),
    INDEX idx_client_time (client_key, violated_at)
);

Algorithm 1: Token Bucket

Each client has a bucket that holds up to burst_size tokens. Tokens refill at a rate of max_requests / window_secs tokens per second. Each request consumes one token. If the bucket is empty, the request is denied.

function token_bucket_allow(client_key, rule):
    key        = "tb:" + client_key
    now        = current_time_seconds()
    refill_rate = rule.max_requests / rule.window_secs
    burst       = rule.burst_size or rule.max_requests

    -- atomic Redis Lua script:
    tokens, last_refill = redis.get(key)  -- stored as "tokens:timestamp"
    if tokens is None:
        tokens     = burst
        last_refill = now

    elapsed = now - last_refill
    tokens  = min(burst, tokens + elapsed * refill_rate)

    if tokens >= 1:
        tokens -= 1
        redis.set(key, tokens:now, EX=rule.window_secs * 2)
        return ALLOW
    else:
        return DENY (retry_after = (1 - tokens) / refill_rate seconds)

Properties: Smooths bursty traffic up to burst_size. Tokens accumulate during idle periods, so a client that pauses can send a burst. Refill is continuous, not discrete, so there is no cliff at window boundaries.
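The Lua logic above can be prototyped in plain Python, with an in-process dict standing in for Redis and an injectable clock for deterministic testing. The class and parameter names here are illustrative sketches, not part of the service API:

```python
import time

class TokenBucket:
    """In-memory sketch of the token bucket; Redis + a Lua script replaces this in production."""

    def __init__(self, max_requests, window_secs, burst_size=0, clock=time.monotonic):
        self.refill_rate = max_requests / window_secs  # tokens per second
        self.burst = burst_size or max_requests
        self.clock = clock
        self.buckets = {}  # client_key -> (tokens, last_refill)

    def allow(self, client_key):
        now = self.clock()
        tokens, last = self.buckets.get(client_key, (self.burst, now))
        # continuous refill since the last access, capped at burst capacity
        tokens = min(self.burst, tokens + (now - last) * self.refill_rate)
        if tokens >= 1:
            self.buckets[client_key] = (tokens - 1, now)
            return True, 0.0
        self.buckets[client_key] = (tokens, now)
        return False, (1 - tokens) / self.refill_rate  # seconds until one token exists
```

A fake clock makes the refill behavior easy to verify: two requests drain a 2-token bucket, the third is denied with a positive retry hint, and advancing the clock one window refills it.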

Algorithm 2: Sliding Window Log

Store a timestamped log of every request in a Redis sorted set, scored by timestamp. To check a new request, count entries within [now - window_secs, now]. If count < limit, add the new entry and allow; otherwise deny.

function sliding_window_log_allow(client_key, rule):
    key   = "swl:" + client_key
    now   = current_time_ms()
    start = now - rule.window_secs * 1000

    -- atomic Lua script:
    redis.zremrangebyscore(key, 0, start)          -- evict old entries
    count = redis.zcard(key)
    if count < rule.max_requests:
        redis.zadd(key, now, now)                  -- score = member = timestamp
        redis.expire(key, rule.window_secs + 1)
        return ALLOW
    else:
        oldest = redis.zrange(key, 0, 0, withscores=True)[0].score
        return DENY (retry_after = (oldest + rule.window_secs * 1000 - now) / 1000)

Properties: Perfectly accurate sliding window. Memory cost is O(requests per window) per client — expensive for high-volume APIs. Not suitable when max_requests is very large.
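An in-memory sketch of the same algorithm, using a deque per client in place of the Redis sorted set (names are illustrative; `now` is passed in explicitly so behavior is deterministic):

```python
from collections import defaultdict, deque

class SlidingWindowLog:
    """In-memory sketch; in production the log is a Redis sorted set trimmed by ZREMRANGEBYSCORE."""

    def __init__(self, max_requests, window_secs):
        self.max_requests = max_requests
        self.window_secs = window_secs
        self.logs = defaultdict(deque)  # client_key -> request timestamps, oldest first

    def allow(self, client_key, now):
        log = self.logs[client_key]
        start = now - self.window_secs
        while log and log[0] <= start:
            log.popleft()  # evict entries that fell outside the window
        if len(log) < self.max_requests:
            log.append(now)
            return True, 0.0
        # the request can succeed once the oldest logged entry ages out
        return False, log[0] + self.window_secs - now
```

The O(requests per window) memory cost is visible here: the deque holds one timestamp per allowed request, just as the sorted set does.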

Algorithm 3: Sliding Window Counter

A memory-efficient approximation. Maintain integer counters for the current and previous fixed windows. Estimate the sliding-window count as:

count = prev_count * ((window_secs - elapsed_in_current_window) / window_secs)
      + curr_count

function sliding_window_counter_allow(client_key, rule):
    now          = current_time_seconds()
    window       = rule.window_secs
    curr_slot    = floor(now / window)
    prev_slot    = curr_slot - 1
    elapsed      = now mod window

    curr_key = "swc:" + client_key + ":" + str(curr_slot)
    prev_key = "swc:" + client_key + ":" + str(prev_slot)

    -- in production, wrap the reads and the increment in one Lua script for atomicity
    curr_count = int(redis.get(curr_key) or 0)
    prev_count = int(redis.get(prev_key) or 0)

    weight      = (window - elapsed) / window
    est_count   = prev_count * weight + curr_count

    if est_count < rule.max_requests:
        redis.incr(curr_key)
        redis.expire(curr_key, window * 2)
        return ALLOW
    else:
        return DENY

Properties: O(1) memory per client (two integer counters). The estimate assumes requests in the previous window were evenly distributed; Cloudflare reported that in production only about 0.003% of requests were incorrectly allowed or throttled by this approximation. Suitable for high-throughput APIs where slight approximation is acceptable.
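A minimal in-memory version of the two-counter scheme (illustrative names; a dict keyed by (client, slot) stands in for the per-slot Redis keys):

```python
from collections import defaultdict

class SlidingWindowCounter:
    """In-memory sketch of the weighted two-counter approximation."""

    def __init__(self, max_requests, window_secs):
        self.max_requests = max_requests
        self.window = window_secs
        self.counts = defaultdict(int)  # (client_key, slot) -> request count

    def allow(self, client_key, now):
        curr_slot = int(now // self.window)
        elapsed = now % self.window
        curr = self.counts[(client_key, curr_slot)]
        prev = self.counts[(client_key, curr_slot - 1)]
        # weight the previous window by how much of it still overlaps the sliding window
        est = prev * (self.window - elapsed) / self.window + curr
        if est < self.max_requests:
            self.counts[(client_key, curr_slot)] += 1
            return True
        return False
```

The weighting is easy to see at a boundary: one second into a new 60-second window, a full previous window still contributes 59/60 of its count, so a client that exhausted its quota cannot immediately send another full burst.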

Algorithm 4: Fixed Window Counter

Simplest algorithm. One counter per client per window slot. Resets at the start of each window. Vulnerable to the "boundary burst" problem: a client can send limit requests at the end of window N and limit requests at the start of window N+1, doubling the effective rate at the boundary.

function fixed_window_allow(client_key, rule):
    slot = floor(current_time_seconds() / rule.window_secs)
    key  = "fw:" + client_key + ":" + str(slot)
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, rule.window_secs + 1)
    return ALLOW if count <= rule.max_requests else DENY
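The same logic in a runnable in-memory sketch (a dict replaces the per-slot Redis keys), with the boundary-burst weakness demonstrated in the test below:

```python
from collections import defaultdict

class FixedWindowCounter:
    """In-memory sketch; in Redis this is INCR plus EXPIRE on a per-slot key."""

    def __init__(self, max_requests, window_secs):
        self.max_requests = max_requests
        self.window_secs = window_secs
        self.counts = defaultdict(int)

    def allow(self, client_key, now):
        slot = int(now // self.window_secs)
        # like Redis INCR, the counter advances even for denied requests
        self.counts[(client_key, slot)] += 1
        return self.counts[(client_key, slot)] <= self.max_requests
```

Note how a client that exhausts its quota at t=59 gets a completely fresh quota at t=60: the slot changes, so the effective rate at the boundary is double the configured limit.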

Service Architecture

The standalone rate limiter service exposes a gRPC or HTTP API with a single primary endpoint:

POST /check
{
  client_key: "user:42",
  endpoint:   "/api/v1/search",
  cost:        1
}

Response:
{
  allowed:     true,
  remaining:   47,
  reset_at:    1714000060,
  retry_after: 0
}

Internal flow: (1) Look up rule for (client_key, endpoint) from local in-process cache (TTL 30s, backed by the relational DB). (2) Check for active overrides. (3) Run the appropriate algorithm against Redis. (4) Return decision. (5) Asynchronously log violations to the violations table.
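The five-step flow can be wired together as follows. This is a hypothetical sketch: rule lookup, overrides, and the counter store are plain dicts here, standing in for the rule cache, the overrides table, and Redis; a fixed-window counter is used for brevity:

```python
from collections import defaultdict

class RateLimiterService:
    """Illustrative /check handler wiring; not a production implementation."""

    def __init__(self, rules, overrides=None):
        self.rules = rules                 # (client_key, endpoint) -> rule dict
        self.overrides = overrides or {}   # client_key -> overridden max_requests
        self.counts = defaultdict(int)

    def check(self, client_key, endpoint, now, cost=1):
        # (1) rule lookup: exact endpoint first, then the '*' wildcard
        rule = self.rules.get((client_key, endpoint)) or self.rules.get((client_key, "*"))
        if rule is None:
            return {"allowed": True}       # no rule configured: allow by default
        # (2) apply any active override to the limit
        limit = self.overrides.get(client_key, rule["max_requests"])
        # (3) run the algorithm (fixed window here; token bucket etc. plug in the same way)
        slot = int(now // rule["window_secs"])
        self.counts[(client_key, endpoint, slot)] += cost
        used = self.counts[(client_key, endpoint, slot)]
        # (4) return the decision with quota metadata for the response body
        return {
            "allowed": used <= limit,
            "remaining": max(0, limit - used),
            "reset_at": (slot + 1) * rule["window_secs"],
        }
```

Step (5), asynchronous violation logging, would hang off the denied branch (e.g., pushing a record onto a queue drained by a background writer) so it never adds latency to the decision path.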

Distributed Rate Limiting with Redis

A single Redis instance is a SPOF. For high availability:

  • Redis Cluster — shard counter keys across multiple Redis primaries. Each client_key hashes to one shard. Use Redis hash tags ({client_key}) to ensure all keys for a client land on the same shard, enabling Lua script atomicity.
  • Redis Sentinel — single primary with automatic failover to a replica. Simpler, lower throughput ceiling.
  • Gossip-based approximate counters — each rate limiter node maintains a local counter and periodically gossips its count to peers. Trades strict accuracy for availability during network partitions. Suitable when a small overage is tolerable (e.g., 5% above limit during partition).

All Redis counter updates must be atomic. Use Lua scripts to combine read-compute-write into a single atomic operation; Redis executes Lua scripts single-threaded.

Key Design Decisions and Trade-offs

  • Algorithm choice — token bucket for burst-tolerant APIs; sliding window counter for strict per-second limits at scale; sliding window log only when perfect accuracy is required and request volume is low.
  • Local vs centralized — embedding rate limiting locally avoids a network round-trip but makes global enforcement across instances impossible. A centralized service adds ~1ms latency but enforces a true global limit.
  • Synchronous vs asynchronous — calling the rate limiter synchronously on the critical path adds latency. For very latency-sensitive paths, pre-fetch token grants in batches (acquire 100 tokens at once, consume locally, re-acquire when exhausted). This risks over-issuance if a node crashes with unspent tokens.
  • Hard vs soft limits — hard limits block immediately at threshold; soft limits allow a short burst above threshold with increasing denial probability (probabilistic throttling), smoothing client experience.

Failure Handling and Edge Cases

  • Redis unavailable — fail open (allow all requests) or fail closed (deny all). Fail open is usually correct for revenue-critical APIs; fail closed for security-sensitive endpoints. Implement a circuit breaker around Redis calls.
  • Clock skew — distributed nodes may have slightly different clocks. Use Redis server time (via TIME command inside Lua scripts) as the authoritative clock to avoid skew between rate limiter nodes.
  • Thundering herd at window reset — fixed windows reset all counters simultaneously, causing a traffic spike. Sliding window algorithms naturally smooth this.
  • Key expiration race — if a Redis key expires between the read and the write in a non-Lua path, the counter resets unexpectedly. Lua scripts eliminate this race by making read-modify-write atomic.
  • Retry storms — denied clients retrying immediately amplify load. The retry_after response field instructs clients to back off. Enforce exponential backoff guidance in client SDKs.
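On the client side, honoring retry_after combined with exponential backoff and full jitter prevents synchronized retry waves. A hypothetical SDK helper might compute the delay like this (function and parameter names are illustrative):

```python
import random

def backoff_delay(attempt, retry_after=0.0, base=0.5, cap=30.0):
    """Delay in seconds before retry number `attempt` (0-based).

    Uses the server's retry_after as a floor, then adds exponential
    backoff with full jitter so retries from many clients spread out
    instead of arriving in lockstep.
    """
    exp = min(cap, base * (2 ** attempt))
    return retry_after + random.uniform(0, exp)
```

The jitter term is the important part: without it, all clients denied at the same instant would retry at the same instant, reproducing the original spike.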

Scalability Considerations

  • Rule caching — load all rules into memory at startup; refresh on a short TTL or via pub/sub invalidation. Avoid a DB lookup on every request.
  • Horizontal scaling — the rate limiter service itself is stateless (state is in Redis). Add instances behind a load balancer freely.
  • Redis throughput — a single Redis instance handles ~100k ops/sec. At higher volumes, shard by client_key prefix across Redis Cluster nodes. Each Lua script touches only one shard.
  • Multi-region — for global APIs, run rate limiter + Redis per region with per-region quotas. Cross-region synchronization (e.g., CRDTs or periodic sync) is complex and only warranted when the same user’s traffic spans regions.
  • Observability — emit metrics: requests_allowed_total, requests_denied_total, redis_latency_histogram, rule_cache_hit_ratio. Alert on spike in denied requests (client misbehavior) and spike in Redis latency (approaching SPOF failure).

Summary

A standalone rate limiter service decouples quota enforcement from business logic and provides a consistent, globally accurate control plane for request throttling. Token bucket handles bursty workloads gracefully; sliding window counter delivers accuracy at O(1) memory cost; sliding window log gives perfect accuracy when memory is not a constraint. All counter operations must be atomic — Redis Lua scripts are the standard mechanism. The primary operational risks are Redis unavailability (mitigate with fail-open circuit breakers and cluster replication) and clock skew (mitigate by using Redis server time). At scale, rule caching, Redis Cluster sharding, and per-region quota assignment keep latency under 1ms on the critical path.

