Low Level Design: SSH Key Rotation Service

Overview

An SSH key rotation service automates generation of new keypairs, distribution to all target hosts, a grace period with both old and new keys active, and revocation of old keys — eliminating long-lived static SSH keys across infrastructure. This LLD covers the data model, rotation flow, host distribution worker, certificate-based SSH, and audit logging.

Core Data Model

SSH Key Table

CREATE TABLE ssh_keys (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  principal       VARCHAR NOT NULL,     -- user or service identity
  public_key      TEXT NOT NULL,        -- authorized_keys format
  private_key_ref VARCHAR NOT NULL,     -- pointer to secrets manager path
  algorithm       VARCHAR NOT NULL,     -- rsa-4096 | ed25519
  fingerprint     VARCHAR NOT NULL,     -- SHA256 fingerprint for dedup
  status          VARCHAR NOT NULL DEFAULT 'active'
                  CHECK (status IN ('active', 'retiring', 'revoked')),
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
  expires_at      TIMESTAMPTZ NOT NULL,
  rotated_from    UUID REFERENCES ssh_keys(id)   -- predecessor key
);

Host Inventory Table

CREATE TABLE hosts (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  hostname        VARCHAR NOT NULL UNIQUE,
  ssh_host        VARCHAR NOT NULL,         -- IP or DNS for SSH connection
  ssh_port        INT NOT NULL DEFAULT 22,
  auth_keys_path  VARCHAR NOT NULL DEFAULT '/home/%s/.ssh/authorized_keys',
  jump_host       VARCHAR,                  -- bastion if needed
  last_sync_at    TIMESTAMPTZ,
  last_sync_status VARCHAR                  -- ok/error
);
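
The auth_keys_path column stores a template with a single %s for the account name. Because the expanded path ends up in commands run on remote hosts, it is worth validating the principal before substituting it. A minimal Go sketch follows; the function name ExpandAuthKeysPath and the allowed-name pattern are assumptions made for illustration, not part of the schema.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// principalPattern is an assumed allow-list for account names (lowercase
// letters, digits, underscore, hyphen). Adjust it to local naming policy.
var principalPattern = regexp.MustCompile(`^[a-z_][a-z0-9_-]{0,31}$`)

// ExpandAuthKeysPath fills the single %s placeholder of a hosts.auth_keys_path
// template with the principal. It rejects principals that do not match the
// allow-list, which also rules out path traversal such as "../root".
func ExpandAuthKeysPath(template, principal string) (string, error) {
	if !principalPattern.MatchString(principal) {
		return "", fmt.Errorf("invalid principal %q", principal)
	}
	if strings.Count(template, "%s") != 1 {
		return "", fmt.Errorf("template %q must contain exactly one %%s", template)
	}
	return strings.Replace(template, "%s", principal, 1), nil
}
```

With the default template, ExpandAuthKeysPath("/home/%s/.ssh/authorized_keys", "svc-deploy") yields /home/svc-deploy/.ssh/authorized_keys.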

Rotation Job Table

CREATE TABLE rotation_jobs (
  id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  principal     VARCHAR NOT NULL,
  old_key_id    UUID REFERENCES ssh_keys(id),
  new_key_id    UUID REFERENCES ssh_keys(id),
  status        VARCHAR NOT NULL DEFAULT 'pending'
                CHECK (status IN ('pending', 'distributing', 'grace', 'revoking', 'done', 'failed')),
  grace_hours   INT NOT NULL DEFAULT 24,
  started_at    TIMESTAMPTZ,
  grace_until   TIMESTAMPTZ,
  completed_at  TIMESTAMPTZ,
  error         TEXT
);
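
The status column describes a small state machine. A sketch of how a worker might guard transitions is shown below. It assumes the states advance strictly in the order listed (pending, distributing, grace, revoking, done) and that failed is reachable from any non-terminal state; that ordering is an interpretation of the status comment, and the names JobStatus and Advance are hypothetical.

```go
package main

import "fmt"

// JobStatus mirrors the values allowed in rotation_jobs.status.
type JobStatus string

const (
	Pending      JobStatus = "pending"
	Distributing JobStatus = "distributing"
	Grace        JobStatus = "grace"
	Revoking     JobStatus = "revoking"
	Done         JobStatus = "done"
	Failed       JobStatus = "failed"
)

// allowed lists the assumed legal next states. Done and Failed are terminal,
// so they have no entry.
var allowed = map[JobStatus][]JobStatus{
	Pending:      {Distributing, Failed},
	Distributing: {Grace, Failed},
	Grace:        {Revoking, Failed},
	Revoking:     {Done, Failed},
}

// Advance returns nil if moving a job from one status to another is legal
// under the assumed ordering, and an error otherwise.
func Advance(from, to JobStatus) error {
	for _, next := range allowed[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("illegal transition %s -> %s", from, to)
}
```

Checking the transition in code before issuing the UPDATE keeps a retried or duplicated worker from, for example, moving a job from grace straight to done without revoking the old key.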

Audit Log Table

CREATE TABLE ssh_audit_log (
  id         BIGSERIAL PRIMARY KEY,
  event_type VARCHAR NOT NULL,    -- generated/distributed/grace_start/revoked/failed
  principal  VARCHAR NOT NULL,
  key_id     UUID,
  host       VARCHAR,
  actor      VARCHAR NOT NULL,    -- human user or "scheduler"
  timestamp  TIMESTAMPTZ NOT NULL DEFAULT now(),
  detail     JSONB
);
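
A sketch of building an audit record before it is inserted is shown below. It validates event_type against the values listed in the table comment and serializes the detail map to JSON for the JSONB column. The type AuditEvent and the constructor NewAuditEvent are illustrative names, and the choice of map[string]string for detail is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// knownEvents mirrors the event_type values listed in the ssh_audit_log comment.
var knownEvents = map[string]bool{
	"generated": true, "distributed": true, "grace_start": true,
	"revoked": true, "failed": true,
}

// AuditEvent holds the ssh_audit_log columns that callers set; id and
// timestamp are left to the database defaults.
type AuditEvent struct {
	EventType string
	Principal string
	KeyID     string // empty when the event is not tied to a key
	Host      string // empty for events that are not host-specific
	Actor     string // human user or "scheduler"
	Detail    string // JSON text destined for the JSONB column
}

// NewAuditEvent validates the required fields and encodes detail as JSON.
func NewAuditEvent(eventType, principal, keyID, host, actor string, detail map[string]string) (AuditEvent, error) {
	if !knownEvents[eventType] {
		return AuditEvent{}, fmt.Errorf("unknown event_type %q", eventType)
	}
	if principal == "" || actor == "" {
		return AuditEvent{}, fmt.Errorf("principal and actor are required")
	}
	if detail == nil {
		detail = map[string]string{} // store {} rather than JSON null
	}
	raw, err := json.Marshal(detail)
	if err != nil {
		return AuditEvent{}, err
	}
	return AuditEvent{
		EventType: eventType, Principal: principal, KeyID: keyID,
		Host: host, Actor: actor, Detail: string(raw),
	}, nil
}
```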

Rotation Flow

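Step 1: Generate new keypair (in a private temp directory, removed once Step 2 has stored the key)
  workdir=$(mktemp -d)
  ssh-keygen -t ed25519 -C "svc-deploy@2026-04-17" -f "$workdir/newkey" -N ""
  # or for RSA:
  ssh-keygen -t rsa -b 4096 -C "svc-deploy@2026-04-17" -f "$workdir/newkey" -N ""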

Step 2: Store private key in secrets manager
  PUT /secrets/ssh/svc-deploy/private-key <private key PEM>
  private_key_ref = "/ssh/svc-deploy/private-key"

Step 3: Insert new ssh_keys record (status=active)
  INSERT INTO ssh_keys (principal, public_key, private_key_ref, algorithm, fingerprint, expires_at)
  VALUES ('svc-deploy', 'ssh-ed25519 AAAA...', '/ssh/svc-deploy/private-key',
          'ed25519', 'SHA256:...', now() + INTERVAL '90 days');
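
The public_key and fingerprint values used in the INSERT can also be derived in-process instead of shelling out to ssh-keygen. The sketch below uses only the Go standard library: it encodes an Ed25519 public key in the SSH wire format (a length-prefixed "ssh-ed25519" string followed by the length-prefixed 32-byte key), base64-encodes it for the authorized_keys line, and hashes it for the SHA256 fingerprint in the same unpadded-base64 form that ssh-keygen -l prints. The helper names are illustrative, and serializing the private key for the secrets manager is left out.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"encoding/binary"
)

// sshString encodes b as an SSH wire-format string: a 4-byte big-endian
// length followed by the bytes themselves.
func sshString(b []byte) []byte {
	out := make([]byte, 4, 4+len(b))
	binary.BigEndian.PutUint32(out, uint32(len(b)))
	return append(out, b...)
}

// PublicKeyBlob returns the SSH wire encoding of an Ed25519 public key.
func PublicKeyBlob(pub ed25519.PublicKey) []byte {
	return append(sshString([]byte("ssh-ed25519")), sshString(pub)...)
}

// AuthorizedKeysLine renders the key in authorized_keys format.
func AuthorizedKeysLine(pub ed25519.PublicKey, comment string) string {
	return "ssh-ed25519 " + base64.StdEncoding.EncodeToString(PublicKeyBlob(pub)) + " " + comment
}

// Fingerprint returns "SHA256:" plus the unpadded base64 SHA-256 digest of
// the wire-format public key.
func Fingerprint(pub ed25519.PublicKey) string {
	sum := sha256.Sum256(PublicKeyBlob(pub))
	return "SHA256:" + base64.RawStdEncoding.EncodeToString(sum[:])
}

// GenerateKey creates a fresh Ed25519 keypair from the system CSPRNG.
func GenerateKey() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}
```

Generating the key in the service process avoids writing the private key to local disk at all, at the cost of having to serialize it into the OpenSSH private key format before storing it.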

Step 4: Create rotation job
  INSERT INTO rotation_jobs (principal, old_key_id, new_key_id, grace_hours)
  VALUES ('svc-deploy', <old_id>, <new_id>, 24);

Step 5: Distribute new public key to all hosts (see Distribution Worker)

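Step 6: Grace period: both old and new keys stay authorized for grace_hours (job status=grace, deadline recorded in grace_until)

Step 7: Revoke old key once grace_until has passed (see Revocation: Remove Old Key)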
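
The timing in Steps 6 and 7 reduces to two small calculations, sketched here in Go. The sketch assumes the grace window starts when distribution finishes; the document does not say which timestamp anchors grace_until, so treat that as an assumption. MustParse, GraceUntil and ReadyToRevoke are illustrative names.

```go
package main

import "time"

// MustParse parses an RFC 3339 timestamp and panics on malformed input,
// which keeps the sketch short; production code would return the error.
func MustParse(ts string) time.Time {
	t, err := time.Parse(time.RFC3339, ts)
	if err != nil {
		panic(err)
	}
	return t
}

// GraceUntil returns the end of the grace window given its start and the
// job's grace_hours value.
func GraceUntil(graceStart time.Time, graceHours int) time.Time {
	return graceStart.Add(time.Duration(graceHours) * time.Hour)
}

// ReadyToRevoke reports whether the grace window has ended, i.e. whether
// now is at or after graceUntil.
func ReadyToRevoke(now, graceUntil time.Time) bool {
	return !now.Before(graceUntil)
}
```

For a job whose grace period starts at 2026-04-17T12:00:00Z with grace_hours = 24, GraceUntil returns 2026-04-18T12:00:00Z, and ReadyToRevoke becomes true from that instant onward.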

Distribution Worker

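// For each host in inventory, append the new pubkey atomically.
func distributeKey(host Host, pubkey string, principal string) error {
    authKeysPath := fmt.Sprintf(host.AuthKeysPath, principal)
    tmpPath := authKeysPath + ".tmp"

    // Build the script that sshExec runs on the host. pubkey and principal are
    // interpolated into shell, so they must be validated before this call.
    // set -e stops before mv if any step fails, so a failed cp cannot replace
    // authorized_keys with a file holding only the new key.
    script := fmt.Sprintf(`
        set -e
        touch "%[1]s"
        grep -qxF -- "%[3]s" "%[1]s" && exit 0   # key already present: no-op
        cp -p "%[1]s" "%[2]s"
        echo "%[3]s" >> "%[2]s"
        chmod 600 "%[2]s"
        mv "%[2]s" "%[1]s"
    `, authKeysPath, tmpPath, pubkey)

    return sshExec(host.SSHHost, host.SSHPort, script)
}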

-- Update sync status
UPDATE hosts SET last_sync_at=now(), last_sync_status='ok'
WHERE id=$1;
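
An alternative to editing the file with shell commands is to read the file content, compute the new content in the worker, and write it back in one step. The pure part of that approach is sketched below under stated assumptions: AppendKey and keyBlob are hypothetical helpers, and duplicates are detected by comparing the base64 key field, which is equivalent to comparing fingerprints because the fingerprint is a hash of that field. Lines with leading options (for example command="...") are handled by looking for a known key-type token rather than assuming it comes first.

```go
package main

import "strings"

// keyTypes lists the key-type tokens this sketch recognizes.
var keyTypes = map[string]bool{
	"ssh-ed25519": true, "ssh-rsa": true,
	"ecdsa-sha2-nistp256": true, "ecdsa-sha2-nistp384": true, "ecdsa-sha2-nistp521": true,
}

// keyBlob returns the base64 key field of an authorized_keys line, or ""
// for blank lines, comments, and lines without a recognized key type.
func keyBlob(line string) string {
	if strings.HasPrefix(strings.TrimSpace(line), "#") {
		return ""
	}
	fields := strings.Fields(line)
	for i, f := range fields {
		if keyTypes[f] && i+1 < len(fields) {
			return fields[i+1]
		}
	}
	return ""
}

// AppendKey returns the file content with newLine appended, plus true, unless
// a line with the same key material already exists, in which case the content
// is returned unchanged with false.
func AppendKey(content, newLine string) (string, bool) {
	want := keyBlob(newLine)
	for _, line := range strings.Split(content, "\n") {
		if want != "" && keyBlob(line) == want {
			return content, false
		}
	}
	if content != "" && !strings.HasSuffix(content, "\n") {
		content += "\n" // keep one key per line if the file lacked a final newline
	}
	return content + strings.TrimSpace(newLine) + "\n", true
}
```

Because the comparison ignores the trailing comment, re-running distribution with a different comment for the same key is still a no-op.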

Revocation: Remove Old Key

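func revokeKey(host Host, oldPubkey string, principal string) error {
    authKeysPath := fmt.Sprintf(host.AuthKeysPath, principal)
    tmpPath := authKeysPath + ".tmp"

    // Remove matching lines, then swap the file in atomically.
    // grep -F treats the key as a fixed string rather than a regular
    // expression, so characters such as "." are matched literally.
    // grep exits 1 when it outputs no lines (the old key was the only entry),
    // which is not an error here; exit codes above 1 are.
    script := fmt.Sprintf(`
        set -e
        grep -vF -- "%[1]s" "%[2]s" > "%[3]s" || [ $? -eq 1 ]
        chmod 600 "%[3]s"
        mv "%[3]s" "%[2]s"
    `, oldPubkey, authKeysPath, tmpPath)

    return sshExec(host.SSHHost, host.SSHPort, script)
}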

-- Mark old key revoked
UPDATE ssh_keys SET status='revoked' WHERE id=$1;
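
If the worker rewrites authorized_keys content itself instead of running grep on the host, removal can match on the key material only, so a changed comment or added options do not let an old key survive revocation. The following is a sketch under that assumption; RemoveKey and keyBlob are hypothetical helper names, and only the ed25519 and RSA key types named in this design are recognized.

```go
package main

import "strings"

// keyBlob returns the base64 key field of an authorized_keys line, or ""
// for comments, blank lines, and lines without a recognized key type.
func keyBlob(line string) string {
	if strings.HasPrefix(strings.TrimSpace(line), "#") {
		return ""
	}
	fields := strings.Fields(line)
	for i, f := range fields {
		if (f == "ssh-ed25519" || f == "ssh-rsa") && i+1 < len(fields) {
			return fields[i+1]
		}
	}
	return ""
}

// RemoveKey drops every line whose key material matches oldLine and returns
// the new content together with the number of lines removed. Comments and
// unrelated keys are preserved in their original order.
func RemoveKey(content, oldLine string) (string, int) {
	target := keyBlob(oldLine)
	if target == "" || content == "" {
		return content, 0
	}
	var kept []string
	removed := 0
	for _, line := range strings.Split(strings.TrimSuffix(content, "\n"), "\n") {
		if keyBlob(line) == target {
			removed++
			continue
		}
		kept = append(kept, line)
	}
	if len(kept) == 0 {
		return "", removed
	}
	return strings.Join(kept, "\n") + "\n", removed
}
```

The removed count is useful for the audit log: a count of zero on a host that should have held the key suggests the host drifted from the expected state.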

Certificate-Based SSH Alternative

Instead of managing authorized_keys files, use short-lived SSH certificates signed by an internal CA. This eliminates per-host key distribution entirely:

-- CA setup (one-time)
ssh-keygen -t ed25519 -f /etc/ssh/ca_key -N ""
-- Add to each host sshd_config:
TrustedUserCAKeys /etc/ssh/ca_key.pub

-- Issue short-lived cert (24h TTL) on demand
ssh-keygen -s /etc/ssh/ca_key \
  -I "svc-deploy-$(date +%Y%m%d)" \
  -n svc-deploy \
  -V +24h \
  user_key.pub
-- Returns: user_key-cert.pub

-- Client uses cert for auth; no authorized_keys changes needed.
-- ssh loads user_key-cert.pub automatically when it sits next to user_key.
ssh -i user_key host.example.com
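
When an issuing service wraps the ssh-keygen call, building the argument list as a slice (rather than a shell string) avoids quoting problems. The sketch below assembles the same flags as the command above. CertSignArgs is a hypothetical helper, and the 24-hour cap is an assumed policy limit taken from the TTL used in this section.

```go
package main

import "fmt"

// maxTTLHours is an assumed upper bound on certificate lifetime.
const maxTTLHours = 24

// CertSignArgs returns the ssh-keygen arguments for signing userPubKeyPath
// with the CA key: -s (CA key), -I (key identity), -n (principal) and
// -V (validity interval relative to now).
func CertSignArgs(caKeyPath, keyID, principal string, ttlHours int, userPubKeyPath string) ([]string, error) {
	if ttlHours < 1 || ttlHours > maxTTLHours {
		return nil, fmt.Errorf("ttl_hours must be between 1 and %d, got %d", maxTTLHours, ttlHours)
	}
	if keyID == "" || principal == "" {
		return nil, fmt.Errorf("key ID and principal are required")
	}
	return []string{
		"-s", caKeyPath,
		"-I", keyID,
		"-n", principal,
		"-V", fmt.Sprintf("+%dh", ttlHours),
		userPubKeyPath,
	}, nil
}
```

The returned slice can be passed to exec.Command("ssh-keygen", args...) from os/exec, which runs the binary directly without a shell.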

SSH Cert Issuance API

POST /ssh-certs
Request:
{
  "principal": "svc-deploy",
  "public_key": "ssh-ed25519 AAAA...",
  "ttl_hours": 24,
  "hosts": ["*.prod.internal"]
}
Response:
{
  "certificate": "ssh-ed25519-cert-v01@openssh.com AAAA...",
  "valid_before": "2026-04-18T12:00:00Z"
}
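
A sketch of decoding and validating the request body on the server side follows. The struct fields match the JSON keys in the request above; the validation rules (required principal, a public key starting with "ssh-", and a TTL between 1 and 24 hours) are assumptions for illustration, as is the name ParseCertRequest.

```go
package main

import (
	"encoding/json"
	"errors"
	"strings"
)

// CertRequest mirrors the POST /ssh-certs request body.
type CertRequest struct {
	Principal string   `json:"principal"`
	PublicKey string   `json:"public_key"`
	TTLHours  int      `json:"ttl_hours"`
	Hosts     []string `json:"hosts"`
}

// maxTTLHours is an assumed cap that keeps certificates short-lived.
const maxTTLHours = 24

// ParseCertRequest decodes the JSON body and applies basic validation.
func ParseCertRequest(body []byte) (CertRequest, error) {
	var req CertRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return CertRequest{}, err
	}
	switch {
	case req.Principal == "":
		return CertRequest{}, errors.New("principal is required")
	case !strings.HasPrefix(req.PublicKey, "ssh-"):
		// Covers ssh-ed25519 and ssh-rsa, the two algorithms in this design.
		return CertRequest{}, errors.New("public_key must be in authorized_keys format")
	case req.TTLHours < 1 || req.TTLHours > maxTTLHours:
		return CertRequest{}, errors.New("ttl_hours out of range")
	}
	return req, nil
}
```

Rejecting long TTLs at the API boundary matters because a certificate cannot be pulled back from a client once issued; its lifetime is the main control.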

REST API

POST   /keys/:principal/rotate     -- trigger rotation (returns rotation job id)
GET    /keys/:principal            -- list active keys for principal
GET    /keys/:principal/:key_id    -- get key details + status
DELETE /keys/:principal/:key_id    -- immediately revoke key
GET    /rotation-jobs/:id          -- check rotation job status
POST   /ssh-certs                  -- issue short-lived SSH certificate

Scheduled Auto-Rotation

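-- Daily job: find active keys expiring within 14 days that have no rotation in flight
-- (without the NOT EXISTS check, each daily run would enqueue the same key again)
SELECT k.id, k.principal FROM ssh_keys k
WHERE k.status = 'active'
  AND k.expires_at < now() + INTERVAL '14 days'
  AND NOT EXISTS (
    SELECT 1 FROM rotation_jobs j
    WHERE j.old_key_id = k.id AND j.status NOT IN ('done', 'failed')
  );

-- Enqueue rotation for each (new_key_id is set once the new keypair exists)
INSERT INTO rotation_jobs (principal, old_key_id, grace_hours)
VALUES ($1, $2, 24);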
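
The same selection can be expressed in the scheduler process, for example when key records are already loaded in memory. The sketch below also skips keys that already have a rotation in flight, so a daily run does not enqueue the same key repeatedly. The Key struct, the inFlight set (key IDs with an unfinished job), and the function names are assumptions for illustration.

```go
package main

import "time"

// Key carries the ssh_keys columns the scheduler needs.
type Key struct {
	ID, Principal, Status string
	ExpiresAt             time.Time
}

// At parses an RFC 3339 timestamp and panics on malformed input, which keeps
// the sketch short.
func At(ts string) time.Time {
	t, err := time.Parse(time.RFC3339, ts)
	if err != nil {
		panic(err)
	}
	return t
}

// DueForRotation returns active keys that expire within windowDays of now
// and do not already have an unfinished rotation job.
func DueForRotation(keys []Key, inFlight map[string]bool, now time.Time, windowDays int) []Key {
	cutoff := now.AddDate(0, 0, windowDays)
	var due []Key
	for _, k := range keys {
		if k.Status == "active" && k.ExpiresAt.Before(cutoff) && !inFlight[k.ID] {
			due = append(due, k)
		}
	}
	return due
}
```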

High Availability Considerations

  • Distribution worker uses idempotent append: running twice does not duplicate keys (dedup by fingerprint in authorized_keys)
  • Failed host distributions are retried with exponential backoff; rotation job stays in distributing state until all hosts synced
  • Grace period ensures zero downtime: services using old key continue to work while clients switch to new key
  • SSH CA private key stored in HSM; cert issuance requires HSM signing operation
  • Bastion/jump host support for hosts not directly reachable
  • All operations append-only to audit log (immutable, no deletes allowed)
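
The retry schedule for failed host distributions can be sketched as a capped doubling delay. The base and cap values are not specified in this design, so they are parameters here, and BackoffSeconds is an illustrative name.

```go
package main

// BackoffSeconds returns the delay before retry number attempt (0-based):
// base for the first retry, doubling on each later attempt, never above max.
func BackoffSeconds(attempt, base, max int) int {
	if base < 1 {
		base = 1
	}
	d := base
	// Stop doubling once the cap is reached, which also prevents overflow
	// for large attempt counts.
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d
}
```

With a 5-second base and a 300-second cap, the delays run 5, 10, 20, 40, 80, 160 and then stay at 300. Adding random jitter to each delay is a common refinement when many hosts fail at once.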
