Low Level Design: SSH Key Rotation Service

Overview

An SSH key rotation service automates generation of new keypairs, distribution to all target hosts, a grace period with both old and new keys active, and revocation of old keys — eliminating long-lived static SSH keys across infrastructure. This LLD covers the data model, rotation flow, host distribution worker, certificate-based SSH, and audit logging.

Core Data Model

SSH Key Table

ssh_keys (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  principal       VARCHAR NOT NULL,     -- user or service identity
  public_key      TEXT NOT NULL,        -- authorized_keys format
  private_key_ref VARCHAR NOT NULL,     -- pointer to secrets manager path
  algorithm       VARCHAR NOT NULL,     -- rsa-4096 | ed25519
  fingerprint     VARCHAR NOT NULL,     -- SHA256 fingerprint for dedup
  status          VARCHAR NOT NULL DEFAULT 'active', -- active/retiring/revoked
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
  expires_at      TIMESTAMPTZ NOT NULL,
  rotated_from    UUID REFERENCES ssh_keys(id)   -- predecessor key
)
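
The fingerprint column can be derived from the public_key column rather than stored independently. The following is a minimal sketch, assuming the key line has the plain "<type> <base64-blob> [comment]" layout with no leading options field; the function name fingerprintSHA256 is illustrative and not part of the original design.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// fingerprintSHA256 computes an OpenSSH-style SHA256 fingerprint from a
// public key line in authorized_keys format.
// Assumption: the line is "<type> <base64-blob> [comment]" with no options.
func fingerprintSHA256(authorizedKeyLine string) (string, error) {
	fields := strings.Fields(authorizedKeyLine)
	if len(fields) < 2 {
		return "", errors.New(`expected "<type> <base64-blob> [comment]"`)
	}
	blob, err := base64.StdEncoding.DecodeString(fields[1])
	if err != nil {
		return "", fmt.Errorf("decode key blob: %w", err)
	}
	sum := sha256.Sum256(blob)
	// OpenSSH prints the digest as unpadded base64 with a "SHA256:" prefix.
	return "SHA256:" + base64.RawStdEncoding.EncodeToString(sum[:]), nil
}

func main() {
	// "AAAA" is a placeholder blob, not a real key; it only shows the call shape.
	fp, err := fingerprintSHA256("ssh-ed25519 AAAA placeholder-comment")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(fp)
}
```

Because the comment field is ignored, two rows holding the same key under different comments produce the same fingerprint, which is what makes the column usable for dedup.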

Host Inventory Table

hosts (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  hostname        VARCHAR NOT NULL UNIQUE,
  ssh_host        VARCHAR NOT NULL,         -- IP or DNS for SSH connection
  ssh_port        INT NOT NULL DEFAULT 22,
  auth_keys_path  VARCHAR NOT NULL DEFAULT '/home/%s/.ssh/authorized_keys',
  jump_host       VARCHAR,                  -- bastion if needed
  last_sync_at    TIMESTAMPTZ,
  last_sync_status VARCHAR                  -- ok/error
)

Rotation Job Table

rotation_jobs (
  id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  principal     VARCHAR NOT NULL,
  old_key_id    UUID REFERENCES ssh_keys(id),
  new_key_id    UUID REFERENCES ssh_keys(id),
  status        VARCHAR NOT NULL DEFAULT 'pending', -- pending/distributing/grace/revoking/done/failed
  grace_hours   INT NOT NULL DEFAULT 24,
  started_at    TIMESTAMPTZ,
  grace_until   TIMESTAMPTZ,
  completed_at  TIMESTAMPTZ,
  error         TEXT
)

Audit Log Table

ssh_audit_log (
  id         BIGSERIAL PRIMARY KEY,
  event_type VARCHAR NOT NULL,    -- generated/distributed/grace_start/revoked/failed
  principal  VARCHAR NOT NULL,
  key_id     UUID,
  host       VARCHAR,
  actor      VARCHAR NOT NULL,    -- human user or "scheduler"
  timestamp  TIMESTAMPTZ NOT NULL DEFAULT now(),
  detail     JSONB
)

Rotation Flow

Step 1: Generate new keypair
  ssh-keygen -t ed25519 -C "svc-deploy@2026-04-17" -f /tmp/newkey -N ""
  -- or for RSA:
  ssh-keygen -t rsa -b 4096 -C "svc-deploy@2026-04-17" -f /tmp/newkey -N ""

Step 2: Store private key in secrets manager
  PUT /secrets/ssh/svc-deploy/private-key <private key PEM>
  private_key_ref = "/ssh/svc-deploy/private-key"

Step 3: Insert new ssh_keys record (status=active)
  INSERT INTO ssh_keys (principal, public_key, private_key_ref, algorithm, fingerprint, expires_at)
  VALUES ('svc-deploy', 'ssh-ed25519 AAAA...', '/ssh/svc-deploy/private-key',
          'ed25519', 'SHA256:...', now() + INTERVAL '90 days');

Step 4: Create rotation job
  INSERT INTO rotation_jobs (principal, old_key_id, new_key_id, grace_hours)
  VALUES ('svc-deploy', <old_id>, <new_id>, 24);

Step 5: Distribute new public key to all hosts (see Distribution Worker)

Step 6: Grace period — both old + new keys active for grace_hours

Step 7: Revoke old key after grace_until
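
The steps above map onto the status values of rotation_jobs. One way to keep the job from skipping a step is an explicit transition table. This is a sketch under assumptions: the source lists the status values but not the allowed transitions, so the rule that any non-terminal state may move to failed is a guess, and the names JobStatus and canTransition are hypothetical.

```go
package main

import "fmt"

// JobStatus mirrors the rotation_jobs.status column.
type JobStatus string

const (
	StatusPending      JobStatus = "pending"
	StatusDistributing JobStatus = "distributing"
	StatusGrace        JobStatus = "grace"
	StatusRevoking     JobStatus = "revoking"
	StatusDone         JobStatus = "done"
	StatusFailed       JobStatus = "failed"
)

// allowed lists the assumed legal transitions; done and failed are terminal.
var allowed = map[JobStatus][]JobStatus{
	StatusPending:      {StatusDistributing, StatusFailed},
	StatusDistributing: {StatusGrace, StatusFailed},
	StatusGrace:        {StatusRevoking, StatusFailed},
	StatusRevoking:     {StatusDone, StatusFailed},
}

// canTransition reports whether a job may move from one status to another.
func canTransition(from, to JobStatus) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(StatusPending, StatusDistributing))
	fmt.Println(canTransition(StatusPending, StatusRevoking))
}
```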

Distribution Worker

-- For each host in inventory, append new pubkey atomically
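func distributeKey(host Host, pubkey string, principal string) error {
    authKeysPath := fmt.Sprintf(host.AuthKeysPath, principal)
    tmpPath := authKeysPath + ".tmp"

    // SSH to host and run atomic update. set -e aborts on the first failing
    // command, so a failed cp can never be followed by the mv. The grep check
    // makes a retry a no-op when the exact key line is already present.
    // Assumes none of the values contain a single quote.
    script := fmt.Sprintf(`
        set -e
        grep -qxF '%[3]s' '%[1]s' && exit 0
        cp -p '%[1]s' '%[2]s'
        printf '%%s\n' '%[3]s' >> '%[2]s'
        chmod 600 '%[2]s'
        mv '%[2]s' '%[1]s'
    `, authKeysPath, tmpPath, pubkey)

    return sshExec(host.SSHHost, host.SSHPort, script)
}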

-- Update sync status
UPDATE hosts SET last_sync_at=now(), last_sync_status='ok'
WHERE id=$1;
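
Because the worker interpolates the public key and file paths into a shell script, a value containing shell metacharacters could break the command or run something unintended. A minimal sketch of POSIX single-quote escaping follows; shellQuote is a hypothetical helper, not something the original design defines.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote wraps s in single quotes for a POSIX shell. An embedded single
// quote is written as '\'' (close the quote, escaped quote, reopen the quote).
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

func main() {
	// The key and path below are placeholders for illustration only.
	pubkey := "ssh-ed25519 AAAA... it's-a-comment"
	tmpPath := "/home/svc-deploy/.ssh/authorized_keys.tmp"
	fmt.Printf("printf '%%s\\n' %s >> %s\n", shellQuote(pubkey), shellQuote(tmpPath))
}
```

Quoting every interpolated value this way keeps the script well-formed even if a key comment or path contains spaces, quotes, or dollar signs.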

Revocation: Remove Old Key

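func revokeKey(host Host, oldPubkey string, principal string) error {
    authKeysPath := fmt.Sprintf(host.AuthKeysPath, principal)
    tmpPath := authKeysPath + ".tmp"

    // Remove the matching line atomically using grep -v. -F matches the key as
    // a fixed string, not a regex. grep exits 1 when it outputs no lines (the
    // old key was the only entry), which is fine; any other failure aborts
    // before the mv. Assumes none of the values contain a single quote.
    script := fmt.Sprintf(`
        set -e
        grep -vF '%[1]s' '%[2]s' > '%[3]s' || [ $? -eq 1 ]
        chmod 600 '%[3]s'
        mv '%[3]s' '%[2]s'
    `, oldPubkey, authKeysPath, tmpPath)

    return sshExec(host.SSHHost, host.SSHPort, script)
}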

-- Mark old key revoked
UPDATE ssh_keys SET status='revoked' WHERE id=$1;
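
Matching on the whole public key line misses an entry whose comment was edited on the host or that carries a leading options field. An alternative is to match only the base64 key blob as a whole field. The sketch below shows that filter as a pure function over the file contents; removeKey and hasField are hypothetical names, and the blobs in the example are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// hasField reports whether any whitespace-separated field of line equals want.
func hasField(line, want string) bool {
	for _, f := range strings.Fields(line) {
		if f == want {
			return true
		}
	}
	return false
}

// removeKey drops every authorized_keys line that contains revokedBlob as a
// whole field and returns the new contents plus the number of lines removed.
func removeKey(authorizedKeys, revokedBlob string) (string, int) {
	var kept []string
	removed := 0
	for _, line := range strings.Split(authorizedKeys, "\n") {
		if hasField(line, revokedBlob) {
			removed++
			continue
		}
		kept = append(kept, line)
	}
	return strings.Join(kept, "\n"), removed
}

func main() {
	contents := "ssh-ed25519 OLDBLOB old@host\nssh-ed25519 NEWBLOB new@host\n"
	updated, n := removeKey(contents, "OLDBLOB")
	fmt.Printf("removed %d line(s)\n%s", n, updated)
}
```

Comparing whole fields rather than substrings means a blob that happens to be a prefix of another key does not cause an unrelated entry to be removed.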

Certificate-Based SSH Alternative

Instead of managing authorized_keys files, use short-lived SSH certificates signed by an internal CA. This eliminates per-host key distribution entirely:

-- CA setup (one-time)
ssh-keygen -t ed25519 -f /etc/ssh/ca_key -N ""
-- Add to each host sshd_config:
TrustedUserCAKeys /etc/ssh/ca_key.pub

-- Issue short-lived cert (24h TTL) on demand
ssh-keygen -s /etc/ssh/ca_key \
  -I "svc-deploy-$(date +%Y%m%d)" \
  -n svc-deploy \
  -V +24h \
  user_key.pub
-- Returns: user_key-cert.pub

-- Client uses cert for auth; no authorized_keys changes needed.
-- ssh also loads user_key-cert.pub automatically when it sits next to user_key.
ssh -i user_key -o CertificateFile=user_key-cert.pub host.example.com
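
If the issuing service shells out to ssh-keygen, passing the arguments as a list avoids building a shell string. This is a rough sketch that only assembles the argument vector for the command shown above; certArgs is a hypothetical helper, and the paths and key ID are example values.

```go
package main

import (
	"fmt"
	"strings"
)

// certArgs builds the ssh-keygen arguments for signing pubKeyPath with the CA
// key: -s (CA key), -I (key identity), -n (principal), -V (validity interval).
func certArgs(caKeyPath, keyID, principal string, ttlHours int, pubKeyPath string) []string {
	return []string{
		"-s", caKeyPath,
		"-I", keyID,
		"-n", principal,
		"-V", fmt.Sprintf("+%dh", ttlHours),
		pubKeyPath,
	}
}

func main() {
	args := certArgs("/etc/ssh/ca_key", "svc-deploy-20260417", "svc-deploy", 24, "user_key.pub")
	// Printed for inspection only; a real service would hand args to os/exec.
	fmt.Println("ssh-keygen", strings.Join(args, " "))
}
```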

SSH Cert Issuance API

POST /ssh-certs
Request:
{
  "principal": "svc-deploy",
  "public_key": "ssh-ed25519 AAAA...",
  "ttl_hours": 24,
  "hosts": ["*.prod.internal"]
}
Response:
{
  "certificate": "ssh-ed25519-cert-v01@openssh.com AAAA...",
  "valid_before": "2026-04-18T12:00:00Z"
}
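
A handler for this endpoint would normally validate the request before signing anything. The sketch below decodes the request body shown above and applies a few checks. The 24-hour cap (maxTTLHours) and the function name validate are assumptions for illustration; the source does not specify limits.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// maxTTLHours is an assumed policy cap, not a value from the design.
const maxTTLHours = 24

// CertRequest mirrors the POST /ssh-certs request body.
type CertRequest struct {
	Principal string   `json:"principal"`
	PublicKey string   `json:"public_key"`
	TTLHours  int      `json:"ttl_hours"`
	Hosts     []string `json:"hosts"`
}

// validate rejects requests that are missing fields or ask for a TTL outside
// the allowed range.
func validate(r CertRequest) error {
	if r.Principal == "" {
		return errors.New("principal is required")
	}
	if len(strings.Fields(r.PublicKey)) < 2 {
		return errors.New(`public_key must be "<type> <base64-blob>"`)
	}
	if r.TTLHours < 1 || r.TTLHours > maxTTLHours {
		return fmt.Errorf("ttl_hours must be between 1 and %d", maxTTLHours)
	}
	return nil
}

func main() {
	body := []byte(`{"principal":"svc-deploy","public_key":"ssh-ed25519 AAAA...","ttl_hours":24,"hosts":["*.prod.internal"]}`)
	var req CertRequest
	if err := json.Unmarshal(body, &req); err != nil {
		fmt.Println("bad request:", err)
		return
	}
	fmt.Println("validation error:", validate(req))
}
```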

REST API

POST   /keys/:principal/rotate     -- trigger rotation (returns rotation job id)
GET    /keys/:principal            -- list active keys for principal
GET    /keys/:principal/:key_id    -- get key details + status
DELETE /keys/:principal/:key_id    -- immediately revoke key
GET    /rotation-jobs/:id          -- check rotation job status
POST   /ssh-certs                  -- issue short-lived SSH certificate

Scheduled Auto-Rotation

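-- Daily job: find keys expiring within 14 days that have no rotation in flight
SELECT k.id, k.principal FROM ssh_keys k
WHERE k.status = 'active'
  AND k.expires_at < now() + INTERVAL '14 days'
  AND NOT EXISTS (SELECT 1 FROM rotation_jobs j
                  WHERE j.old_key_id = k.id
                    AND j.status NOT IN ('done', 'failed'));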

-- Enqueue rotation for each
INSERT INTO rotation_jobs (principal, old_key_id, grace_hours)
VALUES ($1, $2, 24);

High Availability Considerations

  • Distribution worker append must be idempotent: running twice does not duplicate keys (the worker checks authorized_keys for the key, by fingerprint or exact line match, before appending)
  • Failed host distributions are retried with exponential backoff; rotation job stays in distributing state until all hosts synced
  • Grace period ensures zero downtime: services using old key continue to work while clients switch to new key
  • SSH CA private key stored in HSM; cert issuance requires HSM signing operation
  • Bastion/jump host support for hosts not directly reachable
  • All operations append-only to audit log (immutable, no deletes allowed)
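
The retry behaviour in the list above can be sketched as a small helper. This is an illustration under assumptions: the base delay, cap, and attempt count are made-up parameters, and backoffDelay and retry are hypothetical names rather than part of the original service.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// backoffDelay returns base * 2^attempt, capped at max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// retry calls op until it succeeds or attempts are exhausted, sleeping with
// exponential backoff between tries.
func retry(attempts int, base, max time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(backoffDelay(i, base, max))
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated host sync that fails twice before succeeding.
	err := retry(5, 10*time.Millisecond, time.Second, func() error {
		calls++
		if calls < 3 {
			return errors.New("host unreachable")
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

A production worker would likely add jitter and persist the attempt count so a restart does not reset the schedule.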

Frequently Asked Questions

Why rotate SSH keys and how often should it happen?

SSH keys are long-lived credentials that accumulate across hosts over time, making revocation difficult if a key is compromised. Rotation limits the window of exposure: a rotated key is removed from all authorized_keys files, so a leaked private key becomes useless after the rotation cycle. Best practice is to rotate every 30–90 days for service accounts and immediately on any suspected compromise. Certificate-based SSH with 24-hour TTLs removes the need for per-host key rotation.

How do you rotate SSH keys with zero downtime?

The rotation service uses a grace period: it appends the new public key to authorized_keys on all hosts before revoking the old one. For a configured overlap window (e.g., 24 hours), both keys are accepted. Clients and automation update their private key reference during this window. After the grace period expires, the old public key line is removed from every host's authorized_keys atomically using a tmp-file-and-mv pattern to avoid partial reads.

What is certificate-based SSH and how does it improve on key-based auth?

Certificate-based SSH replaces per-host authorized_keys management with short-lived certificates signed by a trusted SSH CA. Each host trusts the CA public key (TrustedUserCAKeys in sshd_config). When a user or service needs access, they request a signed cert with a short TTL (e.g., 24 hours). The cert is valid for that window and then expires, so routine access needs no revocation step and no authorized_keys updates. Once the CA public key is installed, this scales to thousands of hosts without further per-host configuration changes.

How does the distribution worker update authorized_keys safely on each host?

The worker SSHes to each host and performs an atomic update: it copies the current authorized_keys to a temp file, appends the new public key, sets permissions to 600, then renames the temp file over the original. The rename (mv) is atomic on POSIX filesystems when the temp file is on the same filesystem, so the SSH daemon never reads a partially written file. For revocation, it uses grep -v to filter out the old key into a temp file and then renames. All operations are retried with exponential backoff and recorded in the audit log.
