Overview
An SSH key rotation service automates generation of new keypairs, distribution to all target hosts, a grace period with both old and new keys active, and revocation of old keys — eliminating long-lived static SSH keys across infrastructure. This LLD covers the data model, rotation flow, host distribution worker, certificate-based SSH, and audit logging.
Core Data Model
SSH Key Table
CREATE TABLE ssh_keys (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    principal       VARCHAR NOT NULL,                  -- user or service identity
    public_key      TEXT NOT NULL,                     -- authorized_keys format
    private_key_ref VARCHAR NOT NULL,                  -- pointer to secrets manager path
    algorithm       VARCHAR NOT NULL,                  -- rsa-4096 | ed25519
    fingerprint     VARCHAR NOT NULL,                  -- SHA256 fingerprint for dedup
    status          VARCHAR NOT NULL DEFAULT 'active', -- active/retiring/revoked
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    expires_at      TIMESTAMPTZ NOT NULL,
    rotated_from    UUID REFERENCES ssh_keys(id)       -- predecessor key
);
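As a rough illustration of what the fingerprint column might hold, the sketch below derives an OpenSSH-style SHA256 fingerprint from a public key in authorized_keys format. It assumes the line has the plain "<type> <base64> [comment]" layout with no leading options; the function name `fingerprint` is hypothetical and not part of the design.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// fingerprint returns the OpenSSH-style SHA256 fingerprint of a public key
// given as "<type> <base64-blob> [comment]" (assumed: no leading options).
func fingerprint(authorizedKey string) (string, error) {
	fields := strings.Fields(authorizedKey)
	if len(fields) < 2 {
		return "", errors.New(`expected "<type> <base64> [comment]"`)
	}
	blob, err := base64.StdEncoding.DecodeString(fields[1])
	if err != nil {
		return "", fmt.Errorf("decode key blob: %w", err)
	}
	sum := sha256.Sum256(blob)
	// OpenSSH prints the digest as unpadded base64 with a "SHA256:" prefix.
	return "SHA256:" + base64.RawStdEncoding.EncodeToString(sum[:]), nil
}

func main() {
	// Placeholder blob for illustration only; it is not a complete key.
	fp, err := fingerprint("ssh-ed25519 AAAAC3NzaC1lZDI1NTE5 svc-deploy@2026-04-17")
	if err != nil {
		panic(err)
	}
	fmt.Println(fp)
}
```

Because the comment field is ignored, two rows that differ only in comment produce the same fingerprint, which is what makes the column usable for deduplication.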
Host Inventory Table
CREATE TABLE hosts (
    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    hostname         VARCHAR NOT NULL UNIQUE,
    ssh_host         VARCHAR NOT NULL,        -- IP or DNS for SSH connection
    ssh_port         INT NOT NULL DEFAULT 22,
    auth_keys_path   VARCHAR NOT NULL DEFAULT '/home/%s/.ssh/authorized_keys',
    jump_host        VARCHAR,                 -- bastion if needed
    last_sync_at     TIMESTAMPTZ,
    last_sync_status VARCHAR                  -- ok/error
);
Rotation Job Table
CREATE TABLE rotation_jobs (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    principal    VARCHAR NOT NULL,
    old_key_id   UUID REFERENCES ssh_keys(id),
    new_key_id   UUID REFERENCES ssh_keys(id),
    status       VARCHAR NOT NULL DEFAULT 'pending', -- pending/distributing/grace/revoking/done/failed
    grace_hours  INT NOT NULL DEFAULT 24,
    started_at   TIMESTAMPTZ,
    grace_until  TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    error        TEXT
);
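The status values in rotation_jobs read naturally as a small state machine. The sketch below encodes one plausible set of transitions; the happy path follows the order in the column comment, while letting any in-flight state move to failed is an assumption of this sketch, as are the names `transitions` and `canTransition`.

```go
package main

import "fmt"

// transitions lists the allowed next states for each rotation job status.
// Allowing every in-flight state to move to "failed" is an assumption.
var transitions = map[string][]string{
	"pending":      {"distributing", "failed"},
	"distributing": {"grace", "failed"},
	"grace":        {"revoking", "failed"},
	"revoking":     {"done", "failed"},
	"done":         {},
	"failed":       {},
}

// canTransition reports whether a job may move from one status to another.
func canTransition(from, to string) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("pending", "distributing")) // true
	fmt.Println(canTransition("grace", "done"))           // false: must pass through revoking
}
```

Checking a transition before each status UPDATE would keep a retried or duplicated worker from moving a job backwards.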
Audit Log Table
CREATE TABLE ssh_audit_log (
    id         BIGSERIAL PRIMARY KEY,
    event_type VARCHAR NOT NULL,   -- generated/distributed/grace_start/revoked/failed
    principal  VARCHAR NOT NULL,
    key_id     UUID,
    host       VARCHAR,
    actor      VARCHAR NOT NULL,   -- human user or "scheduler"
    timestamp  TIMESTAMPTZ NOT NULL DEFAULT now(),
    detail     JSONB
);
Rotation Flow
Step 1: Generate new keypair
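# Write the keypair into a private temp directory rather than a shared /tmp path
workdir=$(mktemp -d)
ssh-keygen -t ed25519 -C "svc-deploy@2026-04-17" -f "$workdir/newkey" -N ""
# or, for RSA:
ssh-keygen -t rsa -b 4096 -C "svc-deploy@2026-04-17" -f "$workdir/newkey" -N ""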
Step 2: Store private key in secrets manager
PUT /secrets/ssh/svc-deploy/private-key <private key PEM>
private_key_ref = "/ssh/svc-deploy/private-key"
Step 3: Insert new ssh_keys record (status=active), linked to its predecessor
INSERT INTO ssh_keys (principal, public_key, private_key_ref, algorithm, fingerprint, expires_at, rotated_from)
VALUES ('svc-deploy', 'ssh-ed25519 AAAA...', '/ssh/svc-deploy/private-key',
        'ed25519', 'SHA256:...', now() + INTERVAL '90 days', <old_id>);
Step 4: Create rotation job
INSERT INTO rotation_jobs (principal, old_key_id, new_key_id, grace_hours)
VALUES ('svc-deploy', <old_id>, <new_id>, 24);
Step 5: Distribute new public key to all hosts (see Distribution Worker)
Step 6: Grace period — both old + new keys active for grace_hours
Step 7: Revoke old key after grace_until
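The timing of steps 6 and 7 can be sketched as follows. This is a minimal illustration that assumes the grace window starts when distribution finishes; `graceUntil`, `readyToRevoke` and `mustParse` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"time"
)

// graceUntil computes the end of the overlap window from the moment the
// grace period started and the job's grace_hours.
func graceUntil(graceStart time.Time, graceHours int) time.Time {
	return graceStart.Add(time.Duration(graceHours) * time.Hour)
}

// readyToRevoke reports whether the old key may be removed at time now.
func readyToRevoke(graceStart time.Time, graceHours int, now time.Time) bool {
	return !now.Before(graceUntil(graceStart, graceHours))
}

// mustParse is a demo helper that parses an RFC 3339 timestamp.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	start := mustParse("2026-04-17T12:00:00Z")
	fmt.Println(graceUntil(start, 24).Format(time.RFC3339))                  // 2026-04-18T12:00:00Z
	fmt.Println(readyToRevoke(start, 24, mustParse("2026-04-18T11:59:59Z"))) // false
	fmt.Println(readyToRevoke(start, 24, mustParse("2026-04-18T12:00:00Z"))) // true
}
```

Storing the computed value in grace_until, rather than recomputing it on each poll, keeps the revocation time stable even if grace_hours is edited mid-rotation.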
Distribution Worker
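// For each host in inventory, append the new pubkey atomically.
// Assumes pubkey and the path contain no single quotes (validate upstream),
// since both are embedded in a single-quoted shell string.
func distributeKey(host Host, pubkey string, principal string) error {
    authKeysPath := fmt.Sprintf(host.AuthKeysPath, principal)
    tmpPath := authKeysPath + ".tmp"
    // SSH to host and run the update. set -e stops at the first failure, so a
    // failed cp never reaches mv. The grep check exits early when the exact
    // key line is already present, which makes a retry a no-op.
    script := fmt.Sprintf(`set -e
if grep -qxF -- '%[3]s' '%[1]s'; then exit 0; fi
cp '%[1]s' '%[2]s'
printf '%%s\n' '%[3]s' >> '%[2]s'
chmod 600 '%[2]s'
mv '%[2]s' '%[1]s'
`, authKeysPath, tmpPath, pubkey)
    return sshExec(host.SSHHost, host.SSHPort, script)
}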
-- Update sync status
UPDATE hosts SET last_sync_at=now(), last_sync_status='ok'
WHERE id=$1;
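The append step is easier to reason about (and to unit test) when modelled on file contents rather than on a live host. The sketch below shows one way to make the append idempotent by comparing key material instead of whole lines. It assumes authorized_keys lines use the plain "<type> <blob> [comment]" layout with no leading options; `keyBlob` and `appendKeyIfMissing` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// keyBlob returns the base64 key material of an authorized_keys line,
// assuming the "<type> <blob> [comment]" layout with no leading options.
func keyBlob(line string) string {
	fields := strings.Fields(line)
	if len(fields) < 2 {
		return ""
	}
	return fields[1]
}

// appendKeyIfMissing returns content with pubkey appended on its own line,
// or content unchanged when the same key material is already present or
// pubkey is malformed.
func appendKeyIfMissing(content, pubkey string) string {
	want := keyBlob(pubkey)
	if want == "" {
		return content
	}
	for _, line := range strings.Split(content, "\n") {
		if keyBlob(line) == want {
			return content
		}
	}
	if content != "" && !strings.HasSuffix(content, "\n") {
		content += "\n" // avoid gluing the new key onto the last line
	}
	return content + pubkey + "\n"
}

func main() {
	file := "ssh-ed25519 AAAAold old-key\n"
	newKey := "ssh-ed25519 AAAAnew svc-deploy@2026-04-17"
	once := appendKeyIfMissing(file, newKey)
	twice := appendKeyIfMissing(once, newKey)
	fmt.Println(once == twice) // true: the second call is a no-op
}
```

Comparing the key blob rather than the full line means a key re-sent with a different comment is still recognised as a duplicate.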
Revocation: Remove Old Key
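// Assumes oldPubkey and the path contain no single quotes (validate upstream).
func revokeKey(host Host, oldPubkey string, principal string) error {
    authKeysPath := fmt.Sprintf(host.AuthKeysPath, principal)
    tmpPath := authKeysPath + ".tmp"
    // Remove the matching line via a temp file and rename. grep -F treats the
    // key as a fixed string rather than a regex. Exit status 1 only means no
    // lines were left to print, so it is tolerated under set -e.
    script := fmt.Sprintf(`set -e
grep -vF -- '%[1]s' '%[2]s' > '%[3]s' || [ $? -eq 1 ]
chmod 600 '%[3]s'
mv '%[3]s' '%[2]s'
`, oldPubkey, authKeysPath, tmpPath)
    return sshExec(host.SSHHost, host.SSHPort, script)
}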
-- Mark old key revoked
UPDATE ssh_keys SET status='revoked' WHERE id=$1;
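The filtering step of revocation can likewise be modelled on file contents. The sketch below drops every line that carries the same key material as the revoked key and keeps all other lines byte-for-byte. It assumes the plain "<type> <blob> [comment]" line layout; `keyBlob` and `removeKey` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// keyBlob returns the base64 key material of an authorized_keys line,
// assuming the "<type> <blob> [comment]" layout with no leading options.
func keyBlob(line string) string {
	fields := strings.Fields(line)
	if len(fields) < 2 {
		return ""
	}
	return fields[1]
}

// removeKey returns content without any line carrying the same key material
// as pubkey. All other lines are kept exactly as they were.
func removeKey(content, pubkey string) string {
	want := keyBlob(pubkey)
	if want == "" {
		return content
	}
	var kept []string
	// SplitAfter keeps each line's trailing newline, so joining the kept
	// pieces reproduces the untouched lines exactly.
	for _, line := range strings.SplitAfter(content, "\n") {
		if line == "" || keyBlob(line) == want {
			continue
		}
		kept = append(kept, line)
	}
	return strings.Join(kept, "")
}

func main() {
	file := "ssh-ed25519 AAAAold old-key\nssh-ed25519 AAAAnew svc-deploy@2026-04-17\n"
	fmt.Print(removeKey(file, "ssh-ed25519 AAAAold any-comment"))
	// prints: ssh-ed25519 AAAAnew svc-deploy@2026-04-17
}
```

Matching on key material avoids the case where a host's copy of the line has a different comment than the database row and the revocation silently misses it.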
Certificate-Based SSH Alternative
Instead of managing authorized_keys files, use short-lived SSH certificates signed by an internal CA. Each host only needs the CA public key, installed once, so no per-principal key distribution is required:
# CA setup (one-time)
ssh-keygen -t ed25519 -f /etc/ssh/ca_key -N ""
# Add to each host's sshd_config:
TrustedUserCAKeys /etc/ssh/ca_key.pub
# Issue short-lived cert (24h TTL) on demand
ssh-keygen -s /etc/ssh/ca_key \
    -I "svc-deploy-$(date +%Y%m%d)" \
    -n svc-deploy \
    -V +24h \
    user_key.pub
# Writes user_key-cert.pub next to user_key.pub
# Client uses the cert for auth; no authorized_keys changes needed
ssh -i user_key -o CertificateFile=user_key-cert.pub svc-deploy@host.example.com
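If the issuing service shells out to ssh-keygen, building the argument list as a slice avoids interpolating request values into a shell string. The sketch below mirrors the signing command above; the function name `signArgs` and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// signArgs builds the ssh-keygen argument list for signing a user public key
// with a validity window of ttlHours starting now.
func signArgs(caKeyPath, keyID, principal string, ttlHours int, pubKeyPath string) []string {
	return []string{
		"-s", caKeyPath, // CA private key used for signing
		"-I", keyID, // key identity recorded in the certificate
		"-n", principal, // principal the certificate is valid for
		"-V", fmt.Sprintf("+%dh", ttlHours), // validity interval
		pubKeyPath,
	}
}

func main() {
	args := signArgs("/etc/ssh/ca_key", "svc-deploy-20260417", "svc-deploy", 24, "user_key.pub")
	// In a service these would be passed as exec.Command("ssh-keygen", args...).
	fmt.Println("ssh-keygen " + strings.Join(args, " "))
}
```

Each value travels as its own argv entry, so a principal containing spaces or shell metacharacters cannot alter the command.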
SSH Cert Issuance API
POST /ssh-certs
Request:
{
"principal": "svc-deploy",
"public_key": "ssh-ed25519 AAAA...",
"ttl_hours": 24,
"hosts": ["*.prod.internal"]
}
Response:
{
"certificate": "ssh-ed25519-cert-v01@openssh.com AAAA...",
"valid_before": "2026-04-18T12:00:00Z"
}
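A handler for this endpoint would likely validate the body before signing anything. The sketch below decodes the request shown above and applies a few basic checks. The 24-hour cap (`maxTTLHours`) is a hypothetical policy value, and `CertRequest` and `parseCertRequest` are illustrative names rather than part of the design.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// CertRequest mirrors the POST /ssh-certs request body.
type CertRequest struct {
	Principal string   `json:"principal"`
	PublicKey string   `json:"public_key"`
	TTLHours  int      `json:"ttl_hours"`
	Hosts     []string `json:"hosts"`
}

// maxTTLHours is a hypothetical policy cap; the design does not specify one.
const maxTTLHours = 24

// parseCertRequest decodes a request body and rejects obviously bad input.
func parseCertRequest(body string) (CertRequest, error) {
	var req CertRequest
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields() // reject misspelled or unexpected fields
	if err := dec.Decode(&req); err != nil {
		return req, fmt.Errorf("invalid JSON: %w", err)
	}
	switch {
	case req.Principal == "":
		return req, errors.New("principal is required")
	case len(strings.Fields(req.PublicKey)) < 2:
		return req, errors.New("public_key must be in authorized_keys format")
	case req.TTLHours < 1 || req.TTLHours > maxTTLHours:
		return req, fmt.Errorf("ttl_hours must be between 1 and %d", maxTTLHours)
	}
	return req, nil
}

func main() {
	body := `{"principal":"svc-deploy","public_key":"ssh-ed25519 AAAA...","ttl_hours":24,"hosts":["*.prod.internal"]}`
	req, err := parseCertRequest(body)
	fmt.Println(req.Principal, req.TTLHours, err) // svc-deploy 24 <nil>
}
```

Capping the TTL on the server side keeps a caller from requesting a certificate that outlives the short-lived window the design relies on.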
REST API
POST /keys/:principal/rotate -- trigger rotation (returns rotation job id)
GET /keys/:principal -- list active keys for principal
GET /keys/:principal/:key_id -- get key details + status
DELETE /keys/:principal/:key_id -- immediately revoke key
GET /rotation-jobs/:id -- check rotation job status
POST /ssh-certs -- issue short-lived SSH certificate
Scheduled Auto-Rotation
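-- Daily job: find active keys expiring within 14 days that have no rotation in flight
SELECT k.id, k.principal FROM ssh_keys k
WHERE k.status = 'active'
  AND k.expires_at < now() + INTERVAL '14 days'
  AND NOT EXISTS (
    SELECT 1 FROM rotation_jobs j
    WHERE j.old_key_id = k.id
      AND j.status NOT IN ('done', 'failed')
  );
-- Enqueue rotation for each (new_key_id is filled in once the keypair is generated)
INSERT INTO rotation_jobs (principal, old_key_id, grace_hours)
VALUES ($1, $2, 24);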
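The same selection rule can be expressed in application code, which is handy for unit-testing the scheduler without a database. This is a sketch under assumptions: the `Key` struct, the `inFlight` set of principals with an unfinished rotation job, and the helper names are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Key holds the ssh_keys columns the scheduler needs.
type Key struct {
	ID        string
	Principal string
	Status    string
	ExpiresAt time.Time
}

// rotationWindow matches the 14-day look-ahead used by the daily job.
const rotationWindow = 14 * 24 * time.Hour

// dueForRotation returns active keys that expire within window of now,
// skipping principals that already have a rotation in flight.
func dueForRotation(keys []Key, inFlight map[string]bool, now time.Time, window time.Duration) []Key {
	cutoff := now.Add(window)
	var due []Key
	for _, k := range keys {
		if k.Status == "active" && k.ExpiresAt.Before(cutoff) && !inFlight[k.Principal] {
			due = append(due, k)
		}
	}
	return due
}

// mustParse is a demo helper that parses an RFC 3339 timestamp.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	now := mustParse("2026-04-17T00:00:00Z")
	keys := []Key{
		{ID: "a", Principal: "svc-deploy", Status: "active", ExpiresAt: mustParse("2026-04-20T00:00:00Z")},
		{ID: "b", Principal: "svc-backup", Status: "active", ExpiresAt: mustParse("2026-06-01T00:00:00Z")},
	}
	for _, k := range dueForRotation(keys, nil, now, rotationWindow) {
		fmt.Println(k.ID, k.Principal) // a svc-deploy
	}
}
```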
High Availability Considerations
- The distribution worker's append must be idempotent: before appending, it checks whether the key is already present in authorized_keys, so running twice does not duplicate entries
- Failed host distributions are retried with exponential backoff; rotation job stays in distributing state until all hosts synced
- Grace period ensures zero downtime: services using old key continue to work while clients switch to new key
- SSH CA private key stored in HSM; cert issuance requires HSM signing operation
- Bastion/jump host support for hosts not directly reachable
- Every operation is written to an append-only audit log (immutable; deletes are not allowed)
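The retry behaviour for failed host distributions can be sketched as a small helper. This is a minimal illustration, not the service's actual retry policy: the attempt count, base delay, and the names `retry` and `errUnreachable` are assumptions, and a production version would probably add jitter and a maximum delay.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnreachable stands in for a transient SSH failure in this sketch.
var errUnreachable = errors.New("host unreachable")

// retry calls op up to attempts times (attempts should be at least 1),
// doubling the delay after each failure. It returns nil on the first
// success, or the last error once attempts run out.
func retry(attempts int, base time.Duration, op func() error) error {
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay) // no sleep after the final attempt
			delay *= 2
		}
	}
	return err
}

func main() {
	calls := 0
	err := retry(4, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errUnreachable // fail the first two attempts
		}
		return nil
	})
	fmt.Println(calls, err) // 3 <nil>
}
```

Because the append on each host is meant to be idempotent, re-running the whole operation after a partial failure is safe, which is what makes a simple retry loop like this sufficient.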
Frequently Asked Questions
Why rotate SSH keys and how often should it happen?
SSH keys are long-lived credentials that accumulate across hosts over time, making revocation difficult if a key is compromised. Rotation limits the window of exposure: a rotated key is removed from all authorized_keys files, so a leaked private key becomes useless after the rotation cycle. Best practice is to rotate every 30–90 days for service accounts and immediately on any suspected compromise. Certificate-based SSH with 24-hour TTLs eliminates the rotation problem entirely.
How do you rotate SSH keys with zero downtime?
The rotation service uses a grace period: it appends the new public key to authorized_keys on all hosts before revoking the old one. For a configured overlap window (e.g., 24 hours), both keys are accepted. Clients and automation update their private key reference during this window. After the grace period expires, the old public key line is removed from every host's authorized_keys atomically using a tmp-file-and-mv pattern to avoid partial reads.
What is certificate-based SSH and how does it improve on key-based auth?
Certificate-based SSH replaces per-host authorized_keys management with short-lived certificates signed by a trusted SSH CA. Each host trusts the CA public key (TrustedUserCAKeys in sshd_config). When a user or service needs access, they request a signed cert with a short TTL (e.g., 24 hours). The cert is valid for that window and then expires, so no revocation or authorized_keys update is needed. This scales to thousands of hosts without any per-host configuration changes.
How does the distribution worker update authorized_keys safely on each host?
The worker SSHes to each host and performs an atomic update: it copies the current authorized_keys to a temp file, appends the new public key, sets permissions to 600, then renames the temp file over the original. The rename (mv) is atomic on POSIX filesystems, so the SSH daemon never reads a partially written file. For revocation, it uses grep -v to filter out the old key into a temp file and then renames. All operations are retried with exponential backoff and recorded in the audit log.