Overview
An SSH key rotation service automates four steps: generating new keypairs, distributing them to all target hosts, keeping old and new keys active together during a grace period, and revoking the old keys. The goal is to eliminate long-lived static SSH keys across the infrastructure. This LLD covers the data model, rotation flow, host distribution worker, certificate-based SSH, and audit logging.
Core Data Model
SSH Key Table
CREATE TABLE ssh_keys (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    principal       VARCHAR NOT NULL,                  -- user or service identity
    public_key      TEXT NOT NULL,                     -- authorized_keys format
    private_key_ref VARCHAR NOT NULL,                  -- pointer to secrets manager path
    algorithm       VARCHAR NOT NULL,                  -- rsa-4096 | ed25519
    fingerprint     VARCHAR NOT NULL,                  -- SHA256 fingerprint for dedup
    status          VARCHAR NOT NULL DEFAULT 'active', -- active/retiring/revoked
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    expires_at      TIMESTAMPTZ NOT NULL,
    rotated_from    UUID REFERENCES ssh_keys(id)       -- predecessor key
);
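The fingerprint column can be derived from the public_key column rather than stored independently. The following is a minimal Go sketch, assuming public_key holds a plain "type base64-blob comment" line with no leading authorized_keys options; the function name fingerprintSHA256 is illustrative and not part of the original design.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// fingerprintSHA256 derives an OpenSSH-style SHA256 fingerprint from one
// public key line of the form "<type> <base64-blob> [comment]".
// Assumption: the line carries no leading authorized_keys options.
func fingerprintSHA256(authorizedKey string) (string, error) {
	fields := strings.Fields(authorizedKey)
	if len(fields) < 2 {
		return "", errors.New(`expected "<type> <base64-blob> [comment]"`)
	}
	blob, err := base64.StdEncoding.DecodeString(fields[1])
	if err != nil {
		return "", fmt.Errorf("decode key blob: %w", err)
	}
	sum := sha256.Sum256(blob)
	// OpenSSH prints the digest as unpadded standard base64.
	return "SHA256:" + base64.RawStdEncoding.EncodeToString(sum[:]), nil
}

func main() {
	// Placeholder blob for illustration only; this is not a real key.
	line := "ssh-ed25519 ZXhhbXBsZQ== svc-deploy@2026-04-17"
	fp, err := fingerprintSHA256(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(fp)
}
```

Because the comment field is ignored, two rows that differ only in comment produce the same fingerprint, which is what makes the value usable for deduplication.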
Host Inventory Table
CREATE TABLE hosts (
    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    hostname         VARCHAR NOT NULL UNIQUE,
    ssh_host         VARCHAR NOT NULL,           -- IP or DNS for SSH connection
    ssh_port         INT NOT NULL DEFAULT 22,
    auth_keys_path   VARCHAR NOT NULL DEFAULT '/home/%s/.ssh/authorized_keys',
    jump_host        VARCHAR,                    -- bastion if needed
    last_sync_at     TIMESTAMPTZ,
    last_sync_status VARCHAR                     -- ok/error
);
Rotation Job Table
CREATE TABLE rotation_jobs (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    principal    VARCHAR NOT NULL,
    old_key_id   UUID REFERENCES ssh_keys(id),
    new_key_id   UUID REFERENCES ssh_keys(id),
    status       VARCHAR NOT NULL DEFAULT 'pending', -- pending/distributing/grace/revoking/done/failed
    grace_hours  INT NOT NULL DEFAULT 24,
    started_at   TIMESTAMPTZ,
    grace_until  TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    error        TEXT
);
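The status column reads naturally as a small state machine. The sketch below is one possible encoding in Go; the set of allowed transitions (a linear path, with failed reachable from any non-terminal state) is an assumption inferred from the status comment, and the type and function names are hypothetical.

```go
package main

import "fmt"

// JobStatus mirrors the values listed for rotation_jobs.status.
type JobStatus string

const (
	StatusPending      JobStatus = "pending"
	StatusDistributing JobStatus = "distributing"
	StatusGrace        JobStatus = "grace"
	StatusRevoking     JobStatus = "revoking"
	StatusDone         JobStatus = "done"
	StatusFailed       JobStatus = "failed"
)

// nextStates lists the transitions this sketch allows (an assumption,
// not something the schema itself enforces). done and failed are terminal.
var nextStates = map[JobStatus][]JobStatus{
	StatusPending:      {StatusDistributing, StatusFailed},
	StatusDistributing: {StatusGrace, StatusFailed},
	StatusGrace:        {StatusRevoking, StatusFailed},
	StatusRevoking:     {StatusDone, StatusFailed},
}

// canTransition reports whether moving from one status to another is allowed.
func canTransition(from, to JobStatus) bool {
	for _, s := range nextStates[from] {
		if s == to {
			return true
		}
	}
	return false
}

// transition returns the new status, or the unchanged status and an error.
func transition(from, to JobStatus) (JobStatus, error) {
	if !canTransition(from, to) {
		return from, fmt.Errorf("illegal transition %s -> %s", from, to)
	}
	return to, nil
}

func main() {
	s := StatusPending
	for _, next := range []JobStatus{StatusDistributing, StatusGrace, StatusRevoking, StatusDone} {
		var err error
		if s, err = transition(s, next); err != nil {
			panic(err)
		}
	}
	fmt.Println(s) // prints "done"
}
```

Guarding every status update this way keeps a crashed or duplicated worker from skipping the grace period, for example by jumping straight from distributing to revoking.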
Audit Log Table
CREATE TABLE ssh_audit_log (
    id         BIGSERIAL PRIMARY KEY,
    event_type VARCHAR NOT NULL,   -- generated/distributed/grace_start/revoked/failed
    principal  VARCHAR NOT NULL,
    key_id     UUID,
    host       VARCHAR,
    actor      VARCHAR NOT NULL,   -- human user or "scheduler"
    timestamp  TIMESTAMPTZ NOT NULL DEFAULT now(),
    detail     JSONB
);
Rotation Flow
Step 1: Generate new keypair
# Use a private temp directory instead of a predictable /tmp path
keydir=$(mktemp -d)
ssh-keygen -t ed25519 -C "svc-deploy@2026-04-17" -f "$keydir/newkey" -N ""
# or for RSA:
ssh-keygen -t rsa -b 4096 -C "svc-deploy@2026-04-17" -f "$keydir/newkey" -N ""
Step 2: Store private key in secrets manager
PUT /secrets/ssh/svc-deploy/private-key <private key PEM>
private_key_ref = "/ssh/svc-deploy/private-key"
Step 3: Insert new ssh_keys record (status=active)
INSERT INTO ssh_keys (principal, public_key, private_key_ref, algorithm, fingerprint, expires_at)
VALUES ('svc-deploy', 'ssh-ed25519 AAAA...', '/ssh/svc-deploy/private-key',
'ed25519', 'SHA256:...', now() + INTERVAL '90 days');
Step 4: Create rotation job
INSERT INTO rotation_jobs (principal, old_key_id, new_key_id, grace_hours)
VALUES ('svc-deploy', <old_id>, <new_id>, 24);
Step 5: Distribute new public key to all hosts (see Distribution Worker)
Step 6: Grace period. Both old and new keys remain valid on every host for grace_hours, and the old key is marked retiring
Step 7: Revoke old key after grace_until
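Step 1 can also be done in-process, so the private key never touches the filesystem before it reaches the secrets manager. The sketch below uses only the Go standard library to generate an Ed25519 keypair and encode the public half as an authorized_keys line. Serializing the private key into OpenSSH format is deliberately left out, and the function names are illustrative assumptions.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/base64"
	"encoding/binary"
	"fmt"
)

// sshString encodes b as an SSH wire-format string:
// a big-endian uint32 length followed by the bytes.
func sshString(b []byte) []byte {
	out := make([]byte, 4+len(b))
	binary.BigEndian.PutUint32(out, uint32(len(b)))
	copy(out[4:], b)
	return out
}

// authorizedKeyLine renders an Ed25519 public key as
// "ssh-ed25519 <base64-blob> <comment>".
func authorizedKeyLine(pub ed25519.PublicKey, comment string) string {
	blob := append(sshString([]byte("ssh-ed25519")), sshString(pub)...)
	return "ssh-ed25519 " + base64.StdEncoding.EncodeToString(blob) + " " + comment
}

// generateAuthorizedKey creates a fresh keypair and returns the public key
// line plus the raw private key, which the caller would hand to the
// secrets manager (step 2).
func generateAuthorizedKey(comment string) (string, ed25519.PrivateKey, error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return "", nil, err
	}
	return authorizedKeyLine(pub, comment), priv, nil
}

func main() {
	line, _, err := generateAuthorizedKey("svc-deploy@2026-04-17")
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```

The resulting line is what step 3 stores in public_key and what the distribution worker appends on each host.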
Distribution Worker
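// For each host in inventory, append the new public key atomically.
// Idempotent: the key is appended only if it is not already present.
// Requires imports: "fmt", "strings".
func distributeKey(host Host, pubkey string, principal string) error {
	authKeysPath := fmt.Sprintf(host.AuthKeysPath, principal)
	tmpPath := authKeysPath + ".tmp"
	// Single-quote every value so the remote shell treats it as one literal word.
	q := func(s string) string { return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'" }
	// Build the new file next to the original, then rename it into place so
	// sshd never reads a partially written authorized_keys.
	script := fmt.Sprintf(`
set -e
touch %[1]s
cp -p %[1]s %[2]s
grep -qF -- %[3]s %[2]s || printf '%%s\n' %[3]s >> %[2]s
chmod 600 %[2]s
mv %[2]s %[1]s
`, q(authKeysPath), q(tmpPath), q(pubkey))
	return sshExec(host.SSHHost, host.SSHPort, script)
}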
-- Update sync status
UPDATE hosts SET last_sync_at=now(), last_sync_status='ok'
WHERE id=$1;
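With many hosts, running the distribution step one host at a time can dominate rotation time. The following is a minimal sketch of a bounded-concurrency fan-out that records which hosts failed, so that last_sync_status can be set per host. The trimmed Host type, the distributeAll function, and the errUnreachable value are all hypothetical stand-ins, and the push callback represents a call such as distributeKey.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Host is trimmed to the one field this sketch needs.
type Host struct {
	Hostname string
}

// errUnreachable is a stand-in error used by the demo below.
var errUnreachable = errors.New("host unreachable")

// distributeAll calls push for every host with at most limit calls in
// flight, and returns the errors keyed by hostname (empty map on success).
func distributeAll(hosts []Host, limit int, push func(Host) error) map[string]error {
	if limit < 1 {
		limit = 1
	}
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	failed := make(map[string]error)
	sem := make(chan struct{}, limit) // counting semaphore
	for _, h := range hosts {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit workers are busy
		go func(h Host) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := push(h); err != nil {
				mu.Lock()
				failed[h.Hostname] = err
				mu.Unlock()
			}
		}(h)
	}
	wg.Wait()
	return failed
}

func main() {
	hosts := []Host{{Hostname: "web-1"}, {Hostname: "web-2"}, {Hostname: "db-1"}}
	failed := distributeAll(hosts, 2, func(h Host) error {
		if h.Hostname == "db-1" {
			return errUnreachable
		}
		return nil
	})
	fmt.Println(len(failed), failed["db-1"])
}
```

Hosts missing from the returned map can be marked ok; the rest would be marked error and retried.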
Revocation: Remove Old Key
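// Remove the old public key from authorized_keys, replacing the file atomically.
// Requires imports: "fmt", "strings".
func revokeKey(host Host, oldPubkey string, principal string) error {
	authKeysPath := fmt.Sprintf(host.AuthKeysPath, principal)
	tmpPath := authKeysPath + ".tmp"
	// Single-quote every value so the remote shell treats it as one literal word.
	q := func(s string) string { return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'" }
	// grep -F matches the key as a fixed string, so characters such as "." in
	// the key comment are not treated as regex. grep exits 1 when it outputs
	// no lines, which is acceptable here; exit 2 is a real error.
	script := fmt.Sprintf(`
set -e
grep -vF -- %[1]s %[2]s > %[3]s || [ $? -eq 1 ]
chmod 600 %[3]s
mv %[3]s %[2]s
`, q(oldPubkey), q(authKeysPath), q(tmpPath))
	return sshExec(host.SSHHost, host.SSHPort, script)
}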
-- Mark old key revoked
UPDATE ssh_keys SET status='revoked' WHERE id=$1;
Certificate-Based SSH Alternative
Instead of managing authorized_keys files, use short-lived SSH certificates signed by an internal CA. This removes per-key distribution: each host only needs to trust the CA public key once.
# CA setup (one-time)
ssh-keygen -t ed25519 -f /etc/ssh/ca_key -N ""
# Add to each host's sshd_config:
TrustedUserCAKeys /etc/ssh/ca_key.pub
# Issue short-lived cert (24h TTL) on demand
ssh-keygen -s /etc/ssh/ca_key \
    -I "svc-deploy-$(date +%Y%m%d)" \
    -n svc-deploy \
    -V +24h \
    user_key.pub
# Writes: user_key-cert.pub
# Client authenticates with the cert; no authorized_keys changes needed
ssh -i user_key -o CertificateFile=user_key-cert.pub svc-deploy@host.example.com
SSH Cert Issuance API
POST /ssh-certs
Request:
{
"principal": "svc-deploy",
"public_key": "ssh-ed25519 AAAA...",
"ttl_hours": 24,
"hosts": ["*.prod.internal"]
}
Response:
{
"certificate": "ssh-ed25519-cert-v01@openssh.com AAAA...",
"valid_before": "2026-04-18T12:00:00Z"
}
REST API
POST /keys/:principal/rotate -- trigger rotation (returns rotation job id)
GET /keys/:principal -- list active keys for principal
GET /keys/:principal/:key_id -- get key details + status
DELETE /keys/:principal/:key_id -- immediately revoke key
GET /rotation-jobs/:id -- check rotation job status
POST /ssh-certs -- issue short-lived SSH certificate
Scheduled Auto-Rotation
-- Daily job: find keys expiring within 14 days
SELECT id, principal FROM ssh_keys
WHERE status = 'active'
AND expires_at < now() + INTERVAL '14 days';
-- Enqueue rotation for each
INSERT INTO rotation_jobs (principal, old_key_id, grace_hours)
VALUES ($1, $2, 24);
High Availability Considerations
- Distribution worker uses idempotent append: running twice does not duplicate keys (dedup by fingerprint in authorized_keys)
- Failed host distributions are retried with exponential backoff; rotation job stays in distributing state until all hosts synced
- Grace period ensures zero downtime: services using old key continue to work while clients switch to new key
- SSH CA private key stored in HSM; cert issuance requires HSM signing operation
- Bastion/jump host support for hosts not directly reachable
- All operations append-only to audit log (immutable, no deletes allowed)