Low Level Design: Token Refresh Service

What Is a Token Refresh Service?

A Token Refresh Service manages the lifecycle of short-lived access tokens and long-lived refresh tokens in an OAuth 2.0 / JWT-based authentication system. Access tokens expire quickly (minutes to hours); the refresh service silently issues new access tokens using a valid refresh token, keeping users authenticated without forcing re-login.

Data Model

A single table tracks token families and refresh token state:

CREATE TABLE refresh_tokens (
    token_id        CHAR(64)     PRIMARY KEY,
    token_family    CHAR(64)     NOT NULL,          -- groups rotated tokens together
    user_id         BIGINT       NOT NULL,
    client_id       VARCHAR(128) NOT NULL,
    scope           TEXT,
    issued_at       TIMESTAMP    NOT NULL DEFAULT NOW(),
    expires_at      TIMESTAMP    NOT NULL,
    used_at         TIMESTAMP,                       -- NULL = not yet used
    is_revoked      BOOLEAN      NOT NULL DEFAULT FALSE,
    replaced_by     CHAR(64)                         -- token_id of the next token in the chain
);

CREATE INDEX idx_rt_user_id     ON refresh_tokens(user_id);
CREATE INDEX idx_rt_family      ON refresh_tokens(token_family);
CREATE INDEX idx_rt_expires_at  ON refresh_tokens(expires_at);

Core Algorithm and Workflow

Initial Issuance

After successful login, generate a new token_family (random 32 bytes, hex-encoded).
Issue a refresh token (token_id = 32 random bytes, hex-encoded to 64 characters) tied to that family. Store it in the DB.
Issue a signed JWT access token with short expiry (e.g., 15 minutes). Return both to the client.

Token Refresh Flow

Client sends expired access token + refresh token to POST /token/refresh.
Look up refresh token by token_id. Validate: not revoked, not expired, used_at IS NULL.
Mark current token as used (used_at = NOW()).
Generate a new refresh token in the same token_family; set replaced_by on the old row.
Issue a new JWT access token. Return both new tokens to the client.
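
As a rough sketch of the refresh steps above, the following uses an in-memory SQLite table as a stand-in for the real token store. The simplified schema (integer Unix timestamps), the 30-day refresh lifetime, and the names rotate_refresh_token and InvalidRefreshToken are assumptions made for illustration; issuing the new JWT is omitted.

```python
import secrets
import sqlite3

# Simplified stand-in for the refresh_tokens table (timestamps as Unix seconds).
SCHEMA = """
CREATE TABLE refresh_tokens (
    token_id     TEXT PRIMARY KEY,
    token_family TEXT NOT NULL,
    user_id      INTEGER NOT NULL,
    expires_at   INTEGER NOT NULL,
    used_at      INTEGER,
    is_revoked   INTEGER NOT NULL DEFAULT 0,
    replaced_by  TEXT
)
"""

REFRESH_TTL_SECONDS = 30 * 24 * 3600  # assumed refresh token lifetime


class InvalidRefreshToken(Exception):
    """Raised when a refresh token is unknown, revoked, expired, or used."""


def rotate_refresh_token(conn: sqlite3.Connection, token_id: str, now: int) -> str:
    """Validate the presented token, mark it used, and issue its successor."""
    with conn:  # commits on success, rolls back if an exception escapes
        row = conn.execute(
            "SELECT token_family, user_id, expires_at, used_at, is_revoked "
            "FROM refresh_tokens WHERE token_id = ?",
            (token_id,),
        ).fetchone()
        if row is None:
            raise InvalidRefreshToken("unknown token")
        family, user_id, expires_at, used_at, is_revoked = row
        if is_revoked or expires_at <= now or used_at is not None:
            raise InvalidRefreshToken("token revoked, expired, or already used")

        # New token stays in the same family so reuse can be traced later.
        new_id = secrets.token_hex(32)
        conn.execute(
            "INSERT INTO refresh_tokens (token_id, token_family, user_id, expires_at) "
            "VALUES (?, ?, ?, ?)",
            (new_id, family, user_id, now + REFRESH_TTL_SECONDS),
        )
        # Mark the old token as used and link it to its replacement.
        conn.execute(
            "UPDATE refresh_tokens SET used_at = ?, replaced_by = ? WHERE token_id = ?",
            (now, new_id, token_id),
        )
        return new_id
```

This sketch checks and then marks the token inside one transaction on a single connection. With several concurrent workers, the check and the mark would need to be a single atomic operation, as discussed under Scalability Considerations.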

Refresh Token Rotation and Reuse Detection

If a refresh token arrives whose used_at is already set (used_at IS NOT NULL), this signals a possible replay attack: either the legitimate client or an attacker is presenting a token that was already rotated. Immediately revoke all tokens in the same token_family by setting is_revoked = TRUE on every row where token_family = ?. Force the user to re-authenticate.
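
A minimal sketch of this reuse check, again using SQLite as a stand-in for the token store. The reduced schema and the function name detect_reuse_and_revoke are hypothetical and exist only for this illustration.

```python
import sqlite3

# Reduced table with only the columns this check needs.
REUSE_DEMO_SCHEMA = """
CREATE TABLE refresh_tokens (
    token_id     TEXT PRIMARY KEY,
    token_family TEXT NOT NULL,
    used_at      INTEGER,
    is_revoked   INTEGER NOT NULL DEFAULT 0
)
"""


def detect_reuse_and_revoke(conn: sqlite3.Connection, token_id: str) -> bool:
    """Return True and revoke the whole family if token_id was already used."""
    row = conn.execute(
        "SELECT token_family, used_at FROM refresh_tokens WHERE token_id = ?",
        (token_id,),
    ).fetchone()
    if row is None or row[1] is None:
        return False  # unknown token or first use: not a reuse event

    family = row[0]
    with conn:  # one transaction so the family is revoked as a unit
        conn.execute(
            "UPDATE refresh_tokens SET is_revoked = 1 WHERE token_family = ?",
            (family,),
        )
    return True
```

When the function returns True, the caller would reject the request and require a fresh login. Tokens in other families, such as the same user's sessions on other devices, are left untouched.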

Security Considerations and Failure Handling

Rotation: Always rotate refresh tokens on use. Single-use tokens reduce the window of abuse if a token is stolen.
Family revocation: Reuse detection via token families is the primary defense against stolen refresh tokens.
Storage on client: Store refresh tokens in HttpOnly cookies, not localStorage. Anything in localStorage is readable by scripts running on the page, so a single XSS bug can expose the token.
Scope binding: Refresh tokens should only issue access tokens for the scopes originally granted.
Clock skew: Build a small grace period (30 seconds) into expiry checks to tolerate NTP drift between services.
DB failure: Fail closed. If the token store is unavailable, reject the refresh request and return 503. Do not issue tokens without DB confirmation.
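
Two of the checks above, the clock-skew grace period and scope binding, can be sketched as small pure functions. The names is_expired and narrow_scope are hypothetical, and the space-delimited scope string is an assumption based on the usual OAuth 2.0 convention.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

CLOCK_SKEW_GRACE = timedelta(seconds=30)  # grace period from the text


def is_expired(expires_at: datetime, now: datetime,
               grace: timedelta = CLOCK_SKEW_GRACE) -> bool:
    """Treat a token as expired only once the grace period has also passed."""
    return now > expires_at + grace


def narrow_scope(granted: str, requested: Optional[str]) -> str:
    """Return the scope for the new access token, never exceeding the grant.

    Scopes are space-delimited strings. If the client requests nothing,
    the originally granted scope is reused unchanged.
    """
    if requested is None:
        return granted
    granted_set = set(granted.split())
    requested_set = set(requested.split())
    if not requested_set <= granted_set:
        extra = " ".join(sorted(requested_set - granted_set))
        raise PermissionError(f"scope not originally granted: {extra}")
    return " ".join(sorted(requested_set))
```

For example, a token granted "read write" may be refreshed with the narrower scope "read", while a request for "read admin" is rejected.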

Scalability Considerations

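Read-heavy validation: Cache valid refresh token metadata in Redis. Redis and Postgres cannot be updated in one atomic step, so treat Postgres as the source of truth: on use, commit the write to Postgres first (in a DB transaction or with optimistic locking), then invalidate the cache entry.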
Atomic single-use enforcement: Use a conditional update (UPDATE ... WHERE used_at IS NULL, then check the affected row count), a DB-level unique constraint, or Redis SET NX to guarantee a token can only be consumed once, even under concurrent requests.
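Partitioning: Partition refresh_tokens by issued_at month so old partitions can be archived or dropped without a full-table operation. Note that Postgres requires the partition key in every unique constraint on a partitioned table, so the primary key would become (token_id, issued_at).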
Cleanup job: Run a periodic job to delete rows where expires_at < NOW() - INTERVAL '30 days', keeping the table bounded.
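
The single-use guarantee and the cleanup job can both be expressed as single SQL statements. The sketch below uses SQLite with integer timestamps as a stand-in for Postgres; the function names consume_once and purge_expired and the reduced schema are assumptions for illustration.

```python
import sqlite3

# Reduced table with only the columns these two operations need.
CONSUME_DEMO_SCHEMA = """
CREATE TABLE refresh_tokens (
    token_id    TEXT PRIMARY KEY,
    expires_at  INTEGER NOT NULL,
    used_at     INTEGER,
    is_revoked  INTEGER NOT NULL DEFAULT 0
)
"""

RETENTION_SECONDS = 30 * 24 * 3600  # keep expired rows for 30 days


def consume_once(conn: sqlite3.Connection, token_id: str, now: int) -> bool:
    """Atomically mark a token as used; only the first caller gets True.

    The WHERE clause makes the check and the write one statement, so two
    concurrent requests cannot both see used_at IS NULL and both succeed.
    """
    with conn:
        cur = conn.execute(
            "UPDATE refresh_tokens SET used_at = ? "
            "WHERE token_id = ? AND used_at IS NULL AND is_revoked = 0",
            (now, token_id),
        )
    return cur.rowcount == 1


def purge_expired(conn: sqlite3.Connection, now: int,
                  retention: int = RETENTION_SECONDS) -> int:
    """Delete rows that expired more than `retention` seconds ago."""
    with conn:
        cur = conn.execute(
            "DELETE FROM refresh_tokens WHERE expires_at < ?",
            (now - retention,),
        )
    return cur.rowcount
```

Keeping expired rows for a retention window, rather than deleting them at expiry, preserves the family history that reuse detection relies on.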

Summary

The Token Refresh Service is the backbone of seamless session continuity in token-based auth. The critical design points are single-use refresh tokens with rotation, token family tracking for reuse detection, and atomic single-consumption enforcement at the storage layer. Keep access tokens short-lived and treat any refresh token reuse as a compromise signal requiring full family revocation.

Frequently Asked Questions

How would you design a token refresh service that handles millions of clients?

A token refresh service needs to be stateless and horizontally scalable, validating incoming refresh tokens against a persistent store (e.g., Redis or a SQL database) that records issued refresh tokens and their revocation status. Each refresh rotates the refresh token to limit replay attack windows, and the new access token is signed with a short TTL. Rate limiting per user and per IP prevents abuse of the refresh endpoint.

What is refresh token rotation and why is it important?

Refresh token rotation means issuing a new refresh token every time one is used, invalidating the old one immediately. This limits the window during which a stolen refresh token can be exploited, since any use of an already-rotated token signals a potential compromise and can trigger revocation of the entire token family. It is a core recommendation in the OAuth 2.0 Security Best Current Practice specification.

How do you handle refresh token revocation at scale?

Revocation can be implemented by storing a revocation record in a fast lookup store like Redis and checking it on every refresh attempt, keeping the list small by only storing revoked-but-not-yet-expired tokens. Alternatively, token families can be tracked so revoking a family ID cascades to all descendant tokens without storing each individually. A background job prunes expired revocation records to keep the store lean.

How would you ensure the token refresh service is resilient to outages?

The service should be deployed across multiple availability zones behind a load balancer, with the backing store replicated for high availability. Access tokens should have a long enough TTL (e.g., 15 minutes) that a brief refresh service outage does not immediately impact logged-in users. Circuit breakers and graceful degradation patterns can prevent cascading failures when the token store becomes slow or unavailable.