API key rotation is a security control that limits the blast radius of a leaked credential. A well-designed rotation service lets clients migrate to new keys without downtime by maintaining a short dual-active window. Below is a concrete low-level design covering schema, key generation, rotation workflow, expiry, audit logging, and emergency revocation.
API Key Schema
Each key row stores the hash, never the plaintext. A short prefix is stored in clear text so support staff and dashboards can identify a key without exposing it.
CREATE TABLE api_keys (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    key_hash CHAR(64) NOT NULL UNIQUE, -- SHA-256 hex of the raw key
    prefix VARCHAR(16) NOT NULL, -- typed prefix plus the first random characters, e.g. "sk_live_ab"
    owner_id UUID NOT NULL REFERENCES users(id),
    -- allowed values enforced with CHECK constraints (PostgreSQL)
    scope TEXT NOT NULL DEFAULT 'read' CHECK (scope IN ('read','write','admin')),
    status TEXT NOT NULL DEFAULT 'active' CHECK (status IN ('active','grace','expired','revoked')),
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    expires_at TIMESTAMPTZ, -- NULL means no scheduled expiry
    last_used_at TIMESTAMPTZ
);
Scopes follow least-privilege: read allows GET requests only, write allows mutations, admin allows key management operations. The scope column is checked on every authenticated request, not just at issuance.
Key Generation
- Generate 32 cryptographically random bytes using a CSPRNG (e.g. crypto/rand in Go, secrets.token_bytes(32) in Python).
- Base64url-encode the bytes without padding to produce a 43-character URL-safe string. Prepend a typed prefix such as sk_live_ for production keys.
- Hash the full string with SHA-256 and store only the hash. Return the plaintext to the caller exactly once; it is never stored and cannot be recovered.
- Store a short leading slice of the full key (the typed prefix plus the first few random characters, e.g. sk_live_ab) as prefix for display in dashboards and logs. The first 8 characters alone would only ever be the typed prefix sk_live_ and could not tell two keys apart.
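The generation steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name `generate_api_key` is hypothetical, and the length of the display slice (typed prefix plus two characters, matching the schema's "sk_live_ab" example) is an assumption.

```python
import base64
import hashlib
import secrets

KEY_PREFIX = "sk_live_"  # typed prefix for production keys, as in the text


def generate_api_key(prefix: str = KEY_PREFIX) -> tuple[str, str, str]:
    """Return (plaintext_key, key_hash, display_prefix)."""
    raw = secrets.token_bytes(32)  # 32 bytes from the OS CSPRNG
    # Unpadded base64url of 32 bytes is 43 characters long.
    body = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    plaintext = prefix + body
    # Only this hash is persisted; it is 64 hex characters.
    key_hash = hashlib.sha256(plaintext.encode("ascii")).hexdigest()
    # Assumed display slice: typed prefix plus two random characters.
    display_prefix = plaintext[: len(prefix) + 2]
    return plaintext, key_hash, display_prefix
```

The caller would persist `key_hash` and `display_prefix`, hand `plaintext` back to the owner once, and then discard it.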
On every inbound API request, hash the provided key with SHA-256 and look it up in the api_keys table. Reject keys with status IN ('expired', 'revoked') immediately. Update last_used_at asynchronously (fire-and-forget write or a buffered batch) to avoid adding latency to the hot path.
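A rough sketch of that request-time check is shown below. The `API_KEYS` dictionary is a hypothetical in-memory stand-in for the api_keys table, and `LAST_USED_BUFFER` stands in for whatever buffered writer flushes last_used_at in the background; neither name comes from the original design.

```python
import hashlib
from collections import deque
from datetime import datetime, timezone
from typing import Optional

# Hypothetical stand-in for the api_keys table, keyed by key_hash.
API_KEYS: dict[str, dict] = {}
# Buffered last_used_at updates; a background worker (not shown) would flush them.
LAST_USED_BUFFER: deque = deque()

REJECTED_STATUSES = {"expired", "revoked"}


def authenticate(presented_key: str) -> Optional[dict]:
    """Return the key row if the key may be used, else None."""
    key_hash = hashlib.sha256(presented_key.encode("utf-8")).hexdigest()
    row = API_KEYS.get(key_hash)
    if row is None or row["status"] in REJECTED_STATUSES:
        return None
    # Defer the last_used_at write so the hot path does no extra I/O.
    LAST_USED_BUFFER.append((key_hash, datetime.now(timezone.utc)))
    return row
```

Keys in the 'grace' state still authenticate here; a real middleware would also log that a grace key was used and check the row's scope against the request.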
Rotation Workflow
Rotation follows a three-phase process designed to give clients time to update their configurations without any interruption:
- Issue new key. Create a new api_keys row with status = 'active'. Return the plaintext to the owner. The old key remains valid, so both keys work simultaneously.
- Dual-active grace period. The old key transitions to status = 'grace', with expires_at set to the end of the grace period so the cleanup job can find it. It is still accepted for authentication but flagged in logs. The default grace period is 7 days, configurable per owner.
- Expiry. After the grace period, a scheduled cleanup job sets old keys to status = 'expired'. Requests using expired keys receive HTTP 401 with a descriptive error pointing to the key prefix so the client knows which key to replace.
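The first two phases can be sketched as a single state transition. The `ApiKey` dataclass and `rotate` function are hypothetical names, and setting expires_at at the moment the old key enters grace is an assumption inferred from the cleanup query, which selects grace keys by expires_at.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_GRACE = timedelta(days=7)  # default grace period from the text


@dataclass
class ApiKey:
    key_hash: str
    owner_id: str
    status: str = "active"
    expires_at: Optional[datetime] = None


def rotate(old: ApiKey, new_key_hash: str, now: datetime,
           grace: timedelta = DEFAULT_GRACE) -> ApiKey:
    """Issue a replacement key and move the old one into its grace period."""
    if old.status != "active":
        raise ValueError("only active keys can be rotated")
    # Phase 1: the new key starts out active.
    new = ApiKey(key_hash=new_key_hash, owner_id=old.owner_id)
    # Phase 2: the old key stays usable until the grace period ends.
    old.status = "grace"
    old.expires_at = now + grace
    return new
```

Phase three is left to the scheduled cleanup job, which keeps the rotation call itself free of timers.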
Automatic Expiry and Scheduled Cleanup
A background job runs every hour and executes:
UPDATE api_keys SET status = 'expired' WHERE status IN ('active', 'grace') AND expires_at < now();
Keys can also carry an absolute expires_at regardless of rotation — useful for short-lived service tokens. The same cleanup job handles both cases in one pass.
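As an illustration, the hourly job could be a small function like the one below. It uses SQLite from the Python standard library purely as a stand-in for the production database, stores timestamps as ISO 8601 UTC strings (an assumption of this sketch), and matches both grace-period keys and active keys with an absolute expires_at so that one pass covers both cases. The name `expire_due_keys` is hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def expire_due_keys(conn: sqlite3.Connection, now: datetime) -> int:
    """Expire every active or grace key whose expires_at has passed.

    Returns the number of rows updated. Rows with a NULL expires_at are
    never matched, so keys without a scheduled expiry are left alone.
    """
    cur = conn.execute(
        "UPDATE api_keys SET status = 'expired' "
        "WHERE status IN ('active', 'grace') "
        "AND expires_at IS NOT NULL AND expires_at < ?",
        # Same-format UTC ISO strings compare correctly as text.
        (now.isoformat(timespec="seconds"),),
    )
    conn.commit()
    return cur.rowcount
```

Revoked and already-expired rows are excluded by the status filter, so rerunning the job is harmless.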
Audit Log
Every lifecycle event writes a row to api_key_audit_log: key creation, rotation initiated, grace period started, expiry, and revocation. Each row records the key prefix (not the hash), the actor user ID, the IP address, and a timestamp. The audit log is append-only — rows are never updated or deleted — and is the source of truth for compliance reporting.
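One way to picture the append-only property is a log type that exposes only append and read operations. The sketch below is an in-memory illustration with hypothetical names (`AuditEntry`, `AuditLog`) and event strings; a real api_key_audit_log table would enforce the same rule through database permissions rather than Python.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: an entry cannot be modified after creation
class AuditEntry:
    event: str       # e.g. "created", "grace_started", "expired", "revoked"
    key_prefix: str  # the display prefix, never the hash
    actor_id: str
    ip_address: str
    at: datetime


class AuditLog:
    """Append-only: offers append and read, with no update or delete."""

    def __init__(self) -> None:
        self._entries: list[AuditEntry] = []

    def append(self, event: str, key_prefix: str,
               actor_id: str, ip_address: str) -> AuditEntry:
        entry = AuditEntry(event, key_prefix, actor_id, ip_address,
                           datetime.now(timezone.utc))
        self._entries.append(entry)
        return entry

    def entries(self) -> tuple[AuditEntry, ...]:
        return tuple(self._entries)  # immutable snapshot for readers
```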
Emergency Revocation
A POST to /admin/api-keys/{id}/revoke immediately sets status = 'revoked' and writes an audit entry with the reason. Because authentication checks the database on every request, revocation takes effect within milliseconds — no cache TTL to wait out. If a distributed cache sits in front of the DB lookup, the revocation endpoint must also delete the cache entry by key hash. Revoked keys are retained in the table permanently for audit purposes.
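A handler behind that endpoint might look roughly like this. `KEYS_BY_ID`, `AUTH_CACHE`, and `AUDIT` are hypothetical in-memory stand-ins for the table, the distributed cache, and the audit log; the point of the sketch is the ordering, where the cache entry is dropped in the same call that flips the status.

```python
from datetime import datetime, timezone

# Hypothetical stand-ins: the api_keys table keyed by id, a lookup cache
# keyed by key_hash, and the audit log as a plain list.
KEYS_BY_ID: dict[str, dict] = {}
AUTH_CACHE: dict[str, dict] = {}
AUDIT: list[dict] = []


def revoke(key_id: str, actor_id: str, reason: str) -> None:
    """Revoke a key, drop its cache entry, and record why."""
    row = KEYS_BY_ID[key_id]
    row["status"] = "revoked"  # the row is kept for audit, never deleted
    # Without this delete, a cached "active" row would outlive the revocation.
    AUTH_CACHE.pop(row["key_hash"], None)
    AUDIT.append({
        "event": "revoked",
        "key_prefix": row["prefix"],
        "actor_id": actor_id,
        "reason": reason,
        "at": datetime.now(timezone.utc),
    })
```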
Interview Talking Points
- Why store a hash and not the key itself? Breach of the keys table does not expose usable credentials.
- Why a typed prefix (sk_live_, sk_test_)? Enables secret scanning tools (e.g. GitHub secret scanning) to detect accidental commits.
- How do you handle a client that never rotates? Set a hard maximum lifetime policy (e.g. 1 year) enforced at issuance via expires_at = now() + interval '365 days'.
- How do you make revocation instant with a cache? Store a revocation bloom filter in Redis, updated on every revoke call, checked before the DB hit.
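To make the last talking point concrete, here is a small in-process Bloom filter over revoked key hashes. It is a sketch only: the class name, bit-array size, and hash count are assumptions, and a Redis-backed version would keep the bit array in Redis instead of a local bytearray. A Bloom filter can report false positives but never false negatives, so a "maybe revoked" answer should fall through to the database for confirmation.

```python
import hashlib


class RevocationBloom:
    """Fixed-size Bloom filter over revoked key hashes."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key_hash: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key_hash}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key_hash: str) -> None:
        for pos in self._positions(key_hash):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key_hash: str) -> bool:
        # False: definitely not revoked. True: confirm against the database.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key_hash))
```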
Frequently Asked Questions
What is an API key rotation service and why is it needed?
An API key rotation service automates the lifecycle of API credentials: generating new keys, transitioning clients from old to new keys without service interruption, and revoking expired or compromised keys. It is needed because static, long-lived API keys are a significant security risk — they accumulate in source code, CI/CD logs, and developer machines, and are routinely found in public repositories. Rotation limits the blast radius of a leaked key by ensuring it has a finite valid lifetime. Automated rotation removes the human coordination burden of manual key changes, enforces rotation schedules (e.g., every 90 days), and provides a consistent revocation path during security incidents. The service integrates with secrets managers (AWS Secrets Manager, HashiCorp Vault) to inject fresh credentials into consuming services transparently.
How does the grace period mechanism work during key rotation?
When a new key is generated, the old key enters a grace period during which both the old and new keys are valid. This prevents immediate service disruption for clients that have not yet fetched the new credential. The rotation flow: (1) Generate and store the new key; (2) Set the old key's status to 'grace' and its expires_at to the end of the grace window (7 days by default); (3) Notify or push the new key to all registered consumers (via webhooks, Secrets Manager version rotation, or pub/sub); (4) Both keys are accepted by the authentication middleware during the grace window; (5) Once expires_at has passed, the cleanup job moves the old key to 'expired' and it is rejected. The grace window length should be tuned to the slowest expected client refresh cycle: long enough that all clients can rotate, short enough to limit exposure of the old key.
How do you store API keys securely without storing plaintext?
Never store the raw API key value. The approach mirrors password storage, with one difference in the choice of hash: (1) Generate a cryptographically random key (e.g., 32 bytes from /dev/urandom, Base64url-encoded to a 43-character string and prefixed with an identifier like sk_live_); (2) Hash the key with SHA-256, or with HMAC-SHA-256 under a server-side secret so that a stolen table cannot be checked offline without that secret, and store only the hash. A slow password hash such as Argon2id adds little here, because a 32-byte random key is not guessable, and a per-row salt would prevent looking the key up by its hash; (3) Show once: return the plaintext key to the user only at creation time and never again (display a "copy now" prompt); (4) Prefix storage: store a short leading slice of the key (e.g. sk_live_ab) in plaintext alongside the hash so users can identify which key is which in the dashboard without exposing the full secret. On each inbound request, hash the presented key and look it up by hash; wherever two hashes are compared directly, use a constant-time comparison such as hmac.compare_digest to prevent timing attacks.
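The keyed-hash variant and the constant-time comparison mentioned above could look like this in Python. `SERVER_SECRET` is a placeholder value and the function names are hypothetical; in practice the secret would be loaded from a secrets manager rather than written in source.

```python
import hashlib
import hmac

# Placeholder server-side secret, for illustration only.
SERVER_SECRET = b"example-server-side-secret"


def keyed_hash(api_key: str, secret: bytes = SERVER_SECRET) -> str:
    """HMAC-SHA-256 of the key, hex-encoded (64 characters)."""
    return hmac.new(secret, api_key.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(presented_key: str, stored_hash: str) -> bool:
    """Compare in constant time so timing does not reveal matching bytes."""
    return hmac.compare_digest(keyed_hash(presented_key), stored_hash)
```

Because the HMAC is deterministic for a given secret, the stored value can still serve as a unique lookup column, which a salted password hash could not.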
How do you handle emergency key revocation?
Emergency revocation must be near-instantaneous because a leaked key poses immediate risk. The revocation flow: (1) Set the key's status to 'revoked' in the primary database and record a revocation timestamp and reason; (2) Publish a key.revoked event to a message bus (Kafka, Redis Pub/Sub) consumed by all authentication service instances; (3) Each auth instance adds the key hash to an in-process revocation set so the key is rejected without a database round-trip on every request; entries can carry a TTL (e.g., 5 minutes) so they are evicted once the revoked key has stopped appearing in traffic; (4) Simultaneously notify the key owner via email/Slack with rotation instructions; (5) Optionally trigger an automated replacement key generation and push. The critical design constraint is that, when auth instances cache key lookups, the cache TTL sets the maximum revocation lag. For true emergency response, auth services should subscribe to the revocation event stream and update their cache in real time (push-based) rather than relying on TTL expiry (pull-based).
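The push-based variant can be illustrated with a toy in-process bus. `RevocationBus` stands in for Kafka or Redis Pub/Sub and `AuthInstance` for one authentication service process; both are hypothetical names, and a real deployment would also have to handle missed events, for example by replaying the stream on startup.

```python
from typing import Callable


class RevocationBus:
    """In-process stand-in for a message bus such as Kafka or Redis Pub/Sub."""

    def __init__(self) -> None:
        self._subscribers: list[Callable[[str], None]] = []

    def subscribe(self, handler: Callable[[str], None]) -> None:
        self._subscribers.append(handler)

    def publish_revoked(self, key_hash: str) -> None:
        # Deliver the key.revoked event to every subscribed instance.
        for handler in self._subscribers:
            handler(key_hash)


class AuthInstance:
    """One auth service instance holding a local set of revoked key hashes."""

    def __init__(self, bus: RevocationBus) -> None:
        self.revoked: set[str] = set()
        # Push-based: the set is updated as soon as the event arrives.
        bus.subscribe(self.revoked.add)

    def is_revoked(self, key_hash: str) -> bool:
        return key_hash in self.revoked
```

Each instance learns about a revocation when the event is delivered, so the lag is bounded by bus latency instead of a cache TTL.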