Key Management Service Low-Level Design: Key Hierarchy, HSM Integration, and Envelope Encryption

Problem Scope and Requirements

A Key Management Service (KMS) manages cryptographic keys throughout their lifecycle: generation, storage, rotation, and destruction. Unlike a secret manager that stores arbitrary credentials, a KMS is purpose-built for key material and enforces that root key material never leaves hardware security modules (HSMs) unencrypted; every key below the root is persisted only in wrapped (encrypted) form.

Functional Requirements

  • Create, describe, list, disable, enable, and schedule deletion of Customer Master Keys (CMKs).
  • Perform encrypt, decrypt, generate-data-key, and re-encrypt operations using CMKs.
  • Support key rotation: automatic (annual) and on-demand.
  • Maintain a key hierarchy: root keys inside the HSM, intermediate keys stored in wrapped form outside it, data keys generated per use.
  • Grant fine-grained key usage permissions via key policies and grants.

Non-Functional Requirements

  • Cryptographic operations under 5 ms P99 for software-backed keys; under 20 ms for HSM-backed operations.
  • FIPS 140-2 Level 3 compliance for root key operations.
  • No single point of failure: key material available across 3+ AZs.

Key Hierarchy

A well-designed KMS uses a layered key hierarchy to limit exposure and enable efficient rotation:

  • Root Key (RK): Generated and stored entirely within HSM hardware. Never exported. Used only to wrap/unwrap Domain Keys. One root key per HSM cluster.
  • Domain Key (DK): Wraps Customer Master Keys for a logical domain (e.g., a customer account). Stored as HSM-wrapped ciphertext in a database. Unwrapped into HSM memory on demand.
  • Customer Master Key (CMK): The key customers interact with via API. Stored as a Domain-Key-wrapped blob. Unwrapped into software memory for encrypt/decrypt operations (for software-backed CMKs) or into HSM memory for hardware-backed CMKs.
  • Data Encryption Key (DEK): Generated fresh for each GenerateDataKey call against a CMK. The CMK encrypts the DEK, producing an encrypted DEK (EDK); the DEK encrypts the user data. DEKs are ephemeral and never stored by the KMS.

Core Data Model

KeyMetadata

KeyMetadata {
    key_id:          UUID
    arn:             string          // global identifier
    description:     string
    key_usage:       enum { ENCRYPT_DECRYPT, SIGN_VERIFY, GENERATE_HMAC }
    key_spec:        enum { AES_256, RSA_2048, RSA_4096, ECC_P256, ECC_P384 }
    origin:          enum { AWS_KMS, EXTERNAL, HSM }
    state:           enum { ENABLED, DISABLED, PENDING_DELETION, PENDING_IMPORT }
    created_at:      int64
    deletion_date:   int64          // non-zero if pending deletion
    rotation_enabled: bool
    current_backing_key_version: int
}
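
The state field implies a small lifecycle state machine. The sketch below is one way to model it in Python, under stated assumptions: the transition table, the choice that cancelling a deletion returns the key to DISABLED, and the use of Unix seconds for deletion_date are illustrative and not specified by this design. The 7-day minimum comes from the ScheduleKeyDeletion API.

```python
from dataclasses import dataclass
from enum import Enum


class KeyState(Enum):
    ENABLED = "ENABLED"
    DISABLED = "DISABLED"
    PENDING_DELETION = "PENDING_DELETION"
    PENDING_IMPORT = "PENDING_IMPORT"


# Assumed transition table: the design names the states but not the allowed moves.
ALLOWED = {
    KeyState.PENDING_IMPORT: {KeyState.ENABLED, KeyState.PENDING_DELETION},
    KeyState.ENABLED: {KeyState.DISABLED, KeyState.PENDING_DELETION},
    KeyState.DISABLED: {KeyState.ENABLED, KeyState.PENDING_DELETION},
    KeyState.PENDING_DELETION: {KeyState.DISABLED},  # assumed: cancel lands in DISABLED
}

MIN_DELETION_WINDOW_DAYS = 7
SECONDS_PER_DAY = 86_400  # assumes timestamps are Unix seconds


@dataclass
class KeyMetadata:
    key_id: str
    state: KeyState = KeyState.ENABLED
    deletion_date: int = 0  # zero unless deletion is pending


def transition(key, new_state):
    """Move the key to new_state, rejecting moves that are not in ALLOWED."""
    if new_state not in ALLOWED[key.state]:
        raise ValueError(f"{key.state.value} -> {new_state.value} is not allowed")
    key.state = new_state
    if new_state is not KeyState.PENDING_DELETION:
        key.deletion_date = 0


def schedule_deletion(key, now, window_days):
    """Enter PENDING_DELETION with a deletion date at least 7 days out."""
    if window_days < MIN_DELETION_WINDOW_DAYS:
        raise ValueError("deletion window must be at least 7 days")
    transition(key, KeyState.PENDING_DELETION)
    key.deletion_date = now + window_days * SECONDS_PER_DAY
```

Keeping the table as data makes it easy to audit which API calls can move a key out of PENDING_DELETION.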

BackingKey

BackingKey {
    key_id:          UUID
    version:         int
    wrapped_key_material: bytes   // wrapped by the domain key
    algorithm:       string
    created_at:      int64
    retired_at:      int64        // set when a new version is created
}

Grant

Grant {
    grant_id:        UUID
    key_id:          UUID
    grantee_principal: string
    operations:      []enum { Encrypt, Decrypt, GenerateDataKey, Sign, Verify }
    constraints:     map[string]string   // encryption context constraints
    expires_at:      int64
}
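
A grant check could look roughly like the following. This is a minimal sketch, assuming that expires_at of 0 means "no expiry" and that a constraint is satisfied when every constrained key/value pair appears in the request's encryption context; neither rule is spelled out in the data model, and grant_allows is a hypothetical helper name.

```python
from dataclasses import dataclass, field


@dataclass
class Grant:
    grant_id: str
    key_id: str
    grantee_principal: str
    operations: frozenset
    constraints: dict = field(default_factory=dict)
    expires_at: int = 0  # assumed: 0 means the grant never expires


def grant_allows(grant, principal, key_id, operation, encryption_context, now):
    """Return True if this grant authorizes the requested operation."""
    if grant.key_id != key_id or grant.grantee_principal != principal:
        return False
    if operation not in grant.operations:
        return False
    if grant.expires_at and now >= grant.expires_at:
        return False
    # Assumed semantics: each constrained pair must be present in the request
    # context; extra context pairs are permitted.
    return all(encryption_context.get(k) == v for k, v in grant.constraints.items())
```

A stricter variant would require the context to equal the constraints exactly rather than contain them.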

HSM Integration

HSMs are reached through a PKCS#11 library (for on-prem deployments) or the AWS CloudHSM SDK. On startup, each KMS host authenticates to the HSM cluster using a partition password stored in a hardware TPM on the host. The KMS then keeps a pool of warm HSM session handles (similar to a database connection pool) to amortize the ~10 ms session establishment cost.
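
The session pool can be sketched with a bounded queue. This is an illustration only: open_session is a hypothetical callable standing in for the PKCS#11 or vendor SDK login, and the sketch omits health checks and replacement of broken sessions.

```python
import queue
from contextlib import contextmanager


class HsmSessionPool:
    """Fixed-size pool of pre-opened HSM session handles."""

    def __init__(self, open_session, size):
        # open_session is a hypothetical callable that authenticates to the
        # HSM and returns a session handle; it is called once per pool slot.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(open_session())

    @contextmanager
    def session(self, timeout=1.0):
        # Blocks until a handle is free; raises queue.Empty on timeout.
        handle = self._idle.get(timeout=timeout)
        try:
            yield handle
        finally:
            # Always return the handle, even if the crypto call failed.
            self._idle.put(handle)
```

Callers would wrap each HSM call in `with pool.session() as s:` so that handles are returned on every code path.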

For unwrap operations, call C_UnwrapKey with the root key handle and the wrapped DK ciphertext. The HSM performs the unwrap internally and returns an in-HSM key handle, so the plaintext DK never leaves HSM memory. Subsequent CMK unwraps use that DK handle inside the HSM; depending on the key's origin, the resulting CMK is either released into software memory (software-backed CMKs) or kept as an in-HSM handle (hardware-backed CMKs).

Envelope Encryption Operation Flow

GenerateDataKey

  1. Validate caller identity and key policy/grant.
  2. Unwrap the CMK's backing key material (from DB, using domain key).
  3. Generate 32 random bytes as a plaintext DEK using a CSPRNG (hardware RNG if available).
  4. Encrypt the DEK with the CMK using AES-256-GCM and a nonce that is never reused under the same CMK version. Include the encryption context as AAD.
  5. Return both the plaintext DEK (for immediate use) and the encrypted DEK (for storage alongside ciphertext).
  6. Wipe the plaintext DEK from KMS memory. The KMS never sees user data.
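
Steps 3 to 6 can be sketched as follows. The AES-256-GCM call itself is left as a caller-supplied function, encrypt_under_cmk, which is a hypothetical stand-in and not part of any library; the canonical-JSON encoding of the encryption context is likewise an assumption, chosen so that encrypt and decrypt derive identical AAD bytes regardless of key order.

```python
import json
import secrets

DEK_BYTES = 32  # 256-bit DEK, as in step 3


def context_to_aad(encryption_context):
    # Assumed serialization: JSON with sorted keys and no whitespace, so the
    # same context always yields the same AAD bytes.
    return json.dumps(
        encryption_context, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")


def generate_data_key(encrypt_under_cmk, encryption_context):
    """Return (plaintext_dek, encrypted_dek).

    encrypt_under_cmk(plaintext, aad) -> bytes is a hypothetical callable
    that performs AES-256-GCM under the unwrapped CMK.
    """
    dek = bytearray(secrets.token_bytes(DEK_BYTES))  # CSPRNG-backed
    try:
        aad = context_to_aad(encryption_context)
        encrypted_dek = encrypt_under_cmk(bytes(dek), aad)
        return bytes(dek), encrypted_dek
    finally:
        # Best-effort wipe of the working buffer. Python cannot wipe the
        # immutable bytes copies handed to the caller and to the cipher.
        for i in range(len(dek)):
            dek[i] = 0
```

In a language with manual memory control, the wipe in step 6 can be made reliable; in Python it only narrows the window.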

Decrypt (Encrypted DEK)

  1. Parse the ciphertext blob: extract the CMK key_id, backing key version, GCM nonce, ciphertext, and auth tag.
  2. Unwrap the specified CMK backing key version (supports decrypting with rotated-out key versions).
  3. Decrypt the EDK using AES-256-GCM, validating the encryption context as AAD.
  4. Return the plaintext DEK to the caller. Wipe from memory after response is sent.
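
Step 1 depends on a self-describing blob format. The layout below is an assumed one for illustration (16-byte key UUID, 4-byte big-endian version, 12-byte GCM nonce, 16-byte tag, then the ciphertext); the design does not fix a wire format, and pack_blob and parse_blob are hypothetical helpers.

```python
import struct
import uuid
from typing import NamedTuple

# Assumed header: key_id (16) | version (4, big-endian) | nonce (12) | tag (16)
_HEADER = struct.Struct(">16sI12s16s")


class EncryptedBlob(NamedTuple):
    key_id: uuid.UUID
    version: int
    nonce: bytes
    tag: bytes
    ciphertext: bytes


def pack_blob(blob):
    """Serialize the header fields followed by the ciphertext."""
    if len(blob.nonce) != 12 or len(blob.tag) != 16:
        raise ValueError("nonce must be 12 bytes and tag 16 bytes")
    header = _HEADER.pack(blob.key_id.bytes, blob.version, blob.nonce, blob.tag)
    return header + blob.ciphertext


def parse_blob(raw):
    """Split a serialized blob back into its fields."""
    if len(raw) < _HEADER.size:
        raise ValueError("ciphertext blob is shorter than its header")
    key_id, version, nonce, tag = _HEADER.unpack_from(raw)
    return EncryptedBlob(uuid.UUID(bytes=key_id), version, nonce, tag, raw[_HEADER.size:])
```

Because the version travels with the ciphertext, the decrypt path can select a rotated-out backing key without any lookup beyond the header.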

Key Rotation

When a CMK is rotated, a new backing key version is generated and stored. The old version is retained indefinitely to decrypt data that was encrypted under it. The current_backing_key_version pointer is updated atomically. All new encrypt and GenerateDataKey calls use the new version. Old encrypted blobs carry the version number in their header so decryption always selects the correct backing key.

For automatic rotation (annual), a rotation scheduler scans KeyMetadata for eligible keys and triggers the rotation workflow. The workflow is idempotent: if a new version already exists but the pointer has not been updated (crash mid-rotation), it completes the pointer update rather than generating a third version.
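
The crash-safe workflow can be sketched in memory as two steps: create the next version only if it is missing, then advance the pointer. The names Key, rotate, and new_wrapped_material are illustrative assumptions; in a real store the pointer update would be a conditional write, and the scheduler's eligibility check (not shown) is what prevents a retry from starting a second rotation.

```python
from dataclasses import dataclass, field


@dataclass
class BackingKey:
    version: int
    wrapped_key_material: bytes
    created_at: int
    retired_at: int = 0  # set when a newer version takes over


@dataclass
class Key:
    key_id: str
    current_backing_key_version: int = 1
    versions: dict = field(default_factory=dict)  # version -> BackingKey


def rotate(key, new_wrapped_material, now):
    """Advance key to its next backing key version and return that version.

    new_wrapped_material() is a hypothetical callable that returns fresh key
    bytes already wrapped by the domain key.
    """
    target = key.current_backing_key_version + 1
    # Step 1: generate the next version unless a crashed attempt already did.
    if target not in key.versions:
        key.versions[target] = BackingKey(target, new_wrapped_material(), created_at=now)
    # Step 2: retire the old version and move the pointer.
    old = key.versions.get(key.current_backing_key_version)
    if old is not None and old.retired_at == 0:
        old.retired_at = now
    key.current_backing_key_version = target
    return target
```

If a run dies between the two steps, the next run finds the target version present and skips straight to the pointer update, so no third version is created.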

API Design

POST /keys                              — CreateKey
GET  /keys/{key_id}                     — DescribeKey
POST /keys/{key_id}/encrypt             — Encrypt (small payload, up to 4 KB)
POST /keys/{key_id}/decrypt             — Decrypt
POST /keys/{key_id}/generate-data-key  — GenerateDataKey
POST /keys/{key_id}/rotate              — RotateKey (on-demand)
POST /keys/{key_id}/grants              — CreateGrant
DELETE /keys/{key_id}/grants/{grant_id} — RetireGrant
POST /keys/{key_id}/schedule-deletion  — ScheduleKeyDeletion (min 7-day window)

Scalability Considerations

  • CMK caching: Cache unwrapped CMK material in memory for a short window (e.g., 10 minutes) to avoid repeated HSM round-trips. Cache entries must be zeroed on expiry, not simply garbage collected.
  • HSM cluster scaling: HSMs have fixed throughput limits (on the order of 10,000 crypto ops/sec per module). Scale by adding HSM partitions and routing each key operation to the partition that hosts its domain key, using consistent hashing on domain_id.
  • Audit and compliance: Every cryptographic operation is logged to a CloudTrail-equivalent append-only store. Log entries include encryption context, key version used, caller identity, and a request ID for tracing.
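
The zero-on-expiry requirement for the CMK cache can be sketched as below. This is a minimal illustration, assuming key material is held in mutable bytearray buffers so it can be overwritten in place; CmkCache and its methods are hypothetical names, and Python gives no guarantee that other copies of the bytes do not linger elsewhere in memory.

```python
import time


class CmkCache:
    """TTL cache that overwrites key material before dropping it."""

    def __init__(self, ttl_seconds=600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (key_id, version) -> (expires_at, bytearray)

    def put(self, key_id, version, material):
        self.evict(key_id, version)  # zero any entry being replaced
        expires_at = self._clock() + self._ttl
        self._entries[(key_id, version)] = (expires_at, bytearray(material))

    def get(self, key_id, version):
        entry = self._entries.get((key_id, version))
        if entry is None:
            return None
        expires_at, material = entry
        if self._clock() >= expires_at:
            self.evict(key_id, version)
            return None
        return material  # the cached buffer itself, not a copy

    def evict(self, key_id, version):
        entry = self._entries.pop((key_id, version), None)
        if entry is not None:
            material = entry[1]
            material[:] = bytes(len(material))  # overwrite in place with zeros

    def sweep(self):
        # Intended to run periodically so expired keys that are never read
        # again are still zeroed.
        now = self._clock()
        for (key_id, version), (expires_at, _) in list(self._entries.items()):
            if now >= expires_at:
                self.evict(key_id, version)
```

Returning the cached buffer rather than a copy means a caller still holding it after eviction sees zeros, which is the safer failure mode.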

