API throttling enforces usage limits to protect infrastructure from abuse, ensure fair resource allocation across tenants, and give customers predictable quota guarantees. A robust throttling system operates at multiple granularities: per-endpoint burst limits, per-user rate limits, and per-plan monthly quotas.
Throttling Tiers
Three tiers compose the throttling hierarchy:
- Per-endpoint — global rate limit on a specific endpoint regardless of caller. Protects expensive operations (e.g., report generation) from being called at arbitrary rates by any user.
- Per-user — each user has a rate limit (requests per minute) derived from their plan. Enforced independently per user.
- Per-plan (monthly quota) — each plan has a monthly quota. Once exhausted, all requests are rejected with 429 until the quota resets on the billing anniversary date.
All three limits are checked on each request. The first limit breached determines the response.
Sliding Window Counter with Redis
Fixed-window counters have a boundary problem: a user can send double their rate limit by bursting at the end of one window and the start of the next. Sliding window counters eliminate this using a Redis sorted set:
-- Conceptual Redis operations for sliding window (1-minute window)
ZADD user:{user_id}:endpoint:{endpoint} {now_ms} {unique_request_id}
ZREMRANGEBYSCORE user:{user_id}:endpoint:{endpoint} 0 {now_ms - 60000}
count = ZCARD user:{user_id}:endpoint:{endpoint}
EXPIRE user:{user_id}:endpoint:{endpoint} 60
Each request adds a timestamped member to the sorted set, removes members older than the window, and counts remaining members. The count is the number of requests in the sliding window. This runs as a Lua script for atomicity.
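The same add-evict-count cycle can be exercised without Redis. This in-memory sketch (a hypothetical class, not part of the production path) mirrors the ZADD / ZREMRANGEBYSCORE / ZCARD sequence and is handy for unit-testing limit logic:

```python
from collections import deque

class SlidingWindowCounter:
    """In-memory analogue of the Redis sorted-set sliding window."""

    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self.timestamps = deque()  # one entry per request, like ZADD

    def hit(self, now: float) -> int:
        """Record a request at time `now` and return the in-window count."""
        self.timestamps.append(now)
        # Drop entries older than the window (ZREMRANGEBYSCORE analogue).
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        return len(self.timestamps)  # ZCARD analogue
```

In production the Redis version remains the source of truth, since the window must be shared across all application servers.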
Monthly Quota Management
Monthly quotas generate far too many increments to write to PostgreSQL on every request, while Redis alone is not durable enough to serve as the billing record. The hybrid pattern:
- A Redis counter tracks usage within the current hour: INCR quota:{user_id}:{YYYY-MM-DD-HH}
- A background job runs hourly, reads all hourly buckets, and flushes the sum to the UserQuota table in PostgreSQL.
- On each request, read the DB-persisted count plus the current Redis hourly bucket for the total used.
- If used_count >= monthly_quota, return 429 with a Retry-After computed from the reset_at timestamp.
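The arithmetic behind that read can be sketched as two small helpers (names hypothetical; a real get_monthly_used would perform the Redis GET and the user_quota SELECT itself):

```python
import time

def hourly_bucket_key(user_id: int, now=None) -> str:
    """Key for the current UTC hour's Redis bucket (matches the INCR pattern above)."""
    t = time.gmtime(now) if now is not None else time.gmtime()
    return f"quota:{user_id}:{time.strftime('%Y-%m-%d-%H', t)}"

def monthly_used_total(db_persisted_count: int, hourly_raw) -> int:
    """Combine the already-flushed PostgreSQL count with the not-yet-flushed
    Redis hourly bucket (hourly_raw is GET's result, possibly None)."""
    return db_persisted_count + int(hourly_raw or 0)
```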
Adaptive Throttling Under Load
When server CPU exceeds a threshold (e.g., 80%), globally reduce rate limits by a scaling factor. This prevents a legitimate traffic surge from degrading the service for everyone while still serving as much as the infrastructure can handle:
effective_limit = plan_limit * adaptive_scale_factor()
The scale factor is read from a shared Redis key updated by a background health monitor. When CPU returns below threshold, the factor recovers linearly over 60 seconds to avoid oscillation.
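One plausible shape for the monitor-side computation, using the 80% threshold and 0.5 floor from the example above (function names and the way elapsed time is passed in are illustrative):

```python
def raw_scale_factor(cpu_pct: float) -> float:
    """80% CPU -> 1.0 (full limits); 100% CPU or more -> 0.5 (half limits)."""
    if cpu_pct <= 80.0:
        return 1.0
    return max(0.5, 1.0 - (cpu_pct - 80.0) / 40.0)

def recovered_factor(current: float, seconds_since_below_threshold: float,
                     recovery_seconds: float = 60.0) -> float:
    """Linear ramp from the throttled factor back to 1.0 over
    recovery_seconds, to avoid oscillation near the threshold."""
    if seconds_since_below_threshold >= recovery_seconds:
        return 1.0
    frac = seconds_since_below_threshold / recovery_seconds
    return current + (1.0 - current) * frac
```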
429 Response Format
A 429 response must include enough information for clients to implement correct retry logic:
HTTP/1.1 429 Too Many Requests
Retry-After: 42
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1713340800
Content-Type: application/json
{
"error": "rate_limit_exceeded",
"limit_type": "per_minute",
"retry_after_seconds": 42
}
Retry-After is the integer number of seconds until the window resets. X-RateLimit-Reset is the Unix timestamp of the reset. Clients that respect these headers wait at least Retry-After seconds before retrying, applying exponential backoff if 429s persist.
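A client-side helper consistent with that guidance might look like the following (a sketch; the 300-second cap and the jitter strategy are assumptions, not part of the API contract):

```python
import random

def retry_delay(attempt: int, retry_after: int, cap_seconds: float = 300.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based): at least
    the server's Retry-After, doubling per consecutive 429, plus up to
    100% jitter to spread out synchronized clients, capped."""
    base = max(retry_after, 1) * (2 ** attempt)
    return min(cap_seconds, base * (1.0 + random.random()))
```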
Quota Reset on Billing Anniversary
Monthly quotas reset on the billing anniversary date, not on the calendar month. A user who subscribed on the 17th gets their quota reset on the 17th of each month. The reset_at column in UserQuota stores the exact UTC timestamp of the next reset. A scheduled job runs daily and updates reset_at for users whose anniversary falls on that day.
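The month arithmetic for that daily job can be sketched with the standard library, clamping to the last day of shorter months (the function name is illustrative):

```python
import calendar
from datetime import datetime

def next_reset(current: datetime, anniversary_day: int) -> datetime:
    """Advance reset_at to the anniversary day of the following month,
    clamping to the month's last day (a day-31 anniversary resets on
    Feb 28/29, Apr 30, and so on)."""
    year, month = current.year, current.month
    month += 1
    if month > 12:
        month, year = 1, year + 1
    day = min(anniversary_day, calendar.monthrange(year, month)[1])
    return current.replace(year=year, month=month, day=day)
```

Passing the original anniversary day separately matters: after clamping to Feb 28, the next reset must jump back to the 31st in March, not stay on the 28th.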
SQL Schema
CREATE TABLE plan (
    id SERIAL PRIMARY KEY,
    name VARCHAR(50) NOT NULL UNIQUE,
    requests_per_minute INT NOT NULL,
    monthly_quota INT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE user_quota (
    user_id INT NOT NULL,
    plan_id INT NOT NULL REFERENCES plan(id),
    month DATE NOT NULL, -- first day of quota period (anniversary-based)
    used_count BIGINT NOT NULL DEFAULT 0,
    reset_at TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (user_id, month)
);

CREATE INDEX ON user_quota (reset_at) WHERE used_count > 0;

CREATE TABLE throttle_violation (
    id BIGSERIAL PRIMARY KEY,
    user_id INT NOT NULL,
    endpoint VARCHAR(200) NOT NULL,
    violated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    limit_type VARCHAR(30) NOT NULL, -- per_minute | per_endpoint | monthly_quota
    limit_value INT,
    actual_count INT
);

CREATE INDEX ON throttle_violation (user_id, violated_at DESC);
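The record_violation helper used by the throttle check can reduce to a single parameterized INSERT into throttle_violation. This sketch builds the statement and parameters (executing it through a psycopg2 cursor is assumed; the function name is illustrative):

```python
def violation_insert(user_id: int, endpoint: str, limit_type: str,
                     limit_value: int, actual_count: int):
    """Return (sql, params) for logging one throttle violation;
    violated_at falls back to the column's NOW() default."""
    sql = ("INSERT INTO throttle_violation "
           "(user_id, endpoint, limit_type, limit_value, actual_count) "
           "VALUES (%s, %s, %s, %s, %s)")
    return sql, (user_id, endpoint, limit_type, limit_value, actual_count)
```

Storing limit_value alongside actual_count makes it possible to quantify later how far over the limit offenders went.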
Python Implementation
import time
import redis
import psycopg2  # used by the DB-backed helpers (get_monthly_used, record_violation, ...)

r = redis.Redis()

LUA_SLIDING_WINDOW = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local req_id = ARGV[3]
redis.call("ZADD", key, now, req_id)
redis.call("ZREMRANGEBYSCORE", key, 0, now - window * 1000)
local count = redis.call("ZCARD", key)
redis.call("EXPIRE", key, window + 1)
return count
"""

sliding_window_script = r.register_script(LUA_SLIDING_WINDOW)

def sliding_window_count(user_id: int, endpoint: str,
                         window_seconds: int = 60) -> int:
    key = f"user:{user_id}:ep:{endpoint}"
    now_ms = int(time.time() * 1000)
    req_id = f"{now_ms}-{time.monotonic_ns()}"  # unique member, even within the same millisecond
    return sliding_window_script(keys=[key],
                                 args=[now_ms, window_seconds, req_id])

def adaptive_scale_factor() -> float:
    """Return a factor in (0, 1] based on current server CPU load."""
    cpu_pct = float(r.get("server:cpu_pct") or 0)
    if cpu_pct <= 80:
        return 1.0
    # 80% CPU -> factor 1.0; 100% CPU -> factor 0.5
    return max(0.5, 1.0 - (cpu_pct - 80) / 40)

def check_throttle(user_id: int, endpoint: str, plan: dict) -> dict:
    """
    Returns {"allowed": True} or
    {"allowed": False, "retry_after": N, "limit_type": ...}
    """
    # 1. Monthly quota check
    used = get_monthly_used(user_id)  # DB + Redis hourly bucket
    if used >= plan["monthly_quota"]:
        reset_at = get_quota_reset_at(user_id)
        retry_after = max(0, int(reset_at - time.time()))
        return {"allowed": False, "retry_after": retry_after,
                "limit_type": "monthly_quota"}

    # 2. Per-minute sliding window check
    effective_limit = int(plan["requests_per_minute"] * adaptive_scale_factor())
    count = sliding_window_count(user_id, endpoint, window_seconds=60)
    if count > effective_limit:
        # time until the oldest entry in the window expires
        retry_after = 60 - int((int(time.time() * 1000) -
                                get_oldest_entry_ms(user_id, endpoint)) / 1000)
        record_violation(user_id, endpoint, "per_minute",
                         effective_limit, count)
        return {"allowed": False, "retry_after": max(1, retry_after),
                "limit_type": "per_minute"}

    # 3. Increment monthly counter (Redis hourly bucket, keyed by UTC hour)
    r.incr(f"quota:{user_id}:{time.strftime('%Y-%m-%d-%H', time.gmtime())}")
    return {"allowed": True, "remaining": effective_limit - count}
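Whatever framework serves the API, the dict returned by check_throttle maps directly onto the headers shown in the 429 response format section. A sketch (rate_limit_headers and the caller-supplied reset_ts are illustrative):

```python
def rate_limit_headers(result: dict, limit: int, reset_ts: int) -> dict:
    """Translate a throttle-check result into X-RateLimit-* headers,
    adding Retry-After only on rejection."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, result.get("remaining", 0))),
        "X-RateLimit-Reset": str(reset_ts),
    }
    if not result["allowed"]:
        headers["Retry-After"] = str(max(1, result["retry_after"]))
    return headers
```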
Sliding Window vs Fixed Window
Fixed windows reset at fixed clock boundaries (e.g., every minute on the minute). A user can exploit this by sending their full quota in the last second of one window and the first second of the next, doubling their effective burst rate. Sliding windows track the true request count in the past N seconds regardless of clock boundaries, preventing this exploit at the cost of higher Redis memory usage (one sorted set entry per request vs one counter per window).
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Why use a sliding window instead of a fixed window for rate limiting?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Fixed windows allow clients to double their burst rate at window boundaries by sending requests at the end of one window and the start of the next. Sliding windows count requests in the true past N seconds, eliminating this boundary exploit. The tradeoff is higher Redis memory: a sorted set entry per request vs a single counter per window. For typical rate limits (100-1000 req/min), this is acceptable."
      }
    },
    {
      "@type": "Question",
      "name": "How does monthly quota reset work for billing anniversary dates?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The reset_at timestamp in the UserQuota table stores the exact UTC time of the next quota reset, calculated from the user's billing signup date. A daily job advances reset_at by one month for users whose anniversary falls that day and resets used_count to 0. This means a user who signed up on the 31st gets reset on the last day of months with fewer days."
      }
    },
    {
      "@type": "Question",
      "name": "How does adaptive throttling work under high server load?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A background health monitor samples server CPU every 10 seconds and writes the value to a shared Redis key. The throttle middleware reads this key on each request and computes a scale factor: at 80% CPU the factor is 1.0 (full limits), at 100% CPU the factor is 0.5 (half limits). Effective limits recover linearly over 60 seconds after CPU drops below threshold to prevent oscillation."
      }
    },
    {
      "@type": "Question",
      "name": "How is the Retry-After header value calculated?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "For per-minute sliding windows, Retry-After is the time until the oldest request in the current window ages out of the 60-second window, freeing up one slot. This is computed as 60 minus the age (in seconds) of the oldest sorted set member. For monthly quotas, Retry-After is the number of seconds until reset_at. Always return at least 1 to avoid client tight-loops."
      }
    }
  ]
}