How does envelope encryption work in a secret management system?

Envelope encryption uses two layers of keys: a Data Encryption Key (DEK) encrypts the actual secret value, and a Master Encryption Key (MEK) encrypts the DEK. Only the encrypted DEK is stored alongside the ciphertext. To decrypt, the system first uses the MEK (held in a KMS like AWS KMS or Google Cloud KMS) to unwrap the DEK, then uses the DEK to decrypt the secret. This limits MEK exposure and allows the MEK to be rotated by re-encrypting only the stored DEKs, not the secret values themselves.

What is the difference between dynamic secret generation and static secrets in Vault?

Static secrets in Vault are long-lived credentials stored and retrieved as-is, requiring manual rotation. Dynamic secrets are generated on demand for a specific lease duration: Vault creates a real credential (e.g., a database user with a TTL) and automatically revokes it when the lease expires. Dynamic secrets reduce blast radius because credentials are short-lived and unique per client, which greatly reduces the risk of shared or stale credentials.

How do you rotate secrets without downtime for dependent services?

Use a versioned secret model: when a new secret version is written, both the old and new versions remain valid during a grace period. Services are notified via a push mechanism (e.g., webhook, pub/sub event) or detect the new version on the next cache refresh. Services reload the new credential while the old one is still accepted by the backend. Once all services have migrated (tracked via version acknowledgment), the old version is revoked. This avoids a hard cutover.

How should access policies be designed in a secret management system?

Policies should default to deny: no access is granted unless explicitly permitted. Access is defined by path-based ACLs (e.g., secret/data/prod/db/*) combined with capability sets (read, write, delete, list). Policies are attached to roles, not individual identities, and roles are assigned via an auth method (e.g., Kubernetes service account, AWS IAM role). Use the principle of least privilege — each service gets read access only to the specific paths it needs.

How does a client SDK implement local caching with background refresh for secrets?

The SDK maintains an in-process cache keyed by secret path. On first access, it fetches from the server and stores the value with a TTL derived from the lease or a configured max-age. A background goroutine (or thread) runs before TTL expiry — typically at 75% of the TTL — to proactively refresh the secret and update the cache without blocking callers. On refresh failure, the SDK retains the stale value and retries with exponential backoff, surfacing an alert if the secret cannot be refreshed before expiry.

Low Level Design: Secret Management System

Envelope Encryption

The foundation of any secret management system is encryption at rest. Naive approaches encrypt every secret directly with a single master key — this creates catastrophic risk if the master key is ever compromised. Envelope encryption solves this by using two layers of keys.

A data encryption key (DEK) is generated per secret (or per secret version) and used to encrypt the secret value itself using AES-256-GCM. The DEK is then encrypted by a master encryption key (MEK) stored in an HSM or a cloud KMS like AWS KMS or GCP Cloud KMS. The encrypted DEK is stored alongside the encrypted secret value in the database. The MEK never leaves the KMS boundary — decryption operations are performed inside the KMS, not in application memory.

Decryption flow: fetch the encrypted DEK from the database, call the KMS Decrypt API with the encrypted DEK to retrieve the plaintext DEK (KMS performs this operation internally), then decrypt the secret value using the plaintext DEK in application memory. The plaintext DEK is discarded after use and never persisted. This design means rotating the MEK only requires re-encrypting the stored DEKs, not re-encrypting every secret value, which is critical for large deployments.
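
The encrypt, decrypt, and MEK-rotation flows described above can be sketched as follows. This is a minimal illustration under loud assumptions: FakeKms is a hypothetical in-process stand-in for a real KMS, and toy_seal/toy_open are a standard-library stand-in for AES-256-GCM (an HMAC-SHA256 keystream plus an HMAC tag) used only so the sketch runs without third-party packages. It is not suitable for production use.

```python
import hashlib
import hmac
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from HMAC-SHA256 in counter mode."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = hmac.new(key, b"enc" + nonce + counter.to_bytes(4, "big"), hashlib.sha256)
        out += block.digest()
        counter += 1
    return bytes(out[:length])


def toy_seal(key: bytes, plaintext: bytes) -> bytes:
    """ASSUMPTION: stand-in for AES-256-GCM. Returns nonce + ciphertext + tag."""
    nonce = secrets.token_bytes(12)
    stream = _keystream(key, nonce, len(plaintext))
    body = bytes(p ^ k for p, k in zip(plaintext, stream))
    tag = hmac.new(key, b"mac" + nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def toy_open(key: bytes, blob: bytes) -> bytes:
    """Verify the tag, then decrypt. Raises ValueError on a wrong key or tampering."""
    nonce, body, tag = blob[:12], blob[12:-32], blob[-32:]
    expected = hmac.new(key, b"mac" + nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ k for c, k in zip(body, stream))


class FakeKms:
    """Hypothetical KMS stand-in: the MEK is never returned to callers."""

    def __init__(self):
        self._mek = secrets.token_bytes(32)

    def encrypt(self, plaintext_dek: bytes) -> bytes:
        return toy_seal(self._mek, plaintext_dek)

    def decrypt(self, encrypted_dek: bytes) -> bytes:
        return toy_open(self._mek, encrypted_dek)


def encrypt_secret(kms: FakeKms, value: bytes) -> dict:
    """Generate a fresh DEK, encrypt the value with it, and wrap the DEK."""
    dek = secrets.token_bytes(32)
    return {"encrypted_value": toy_seal(dek, value), "encrypted_dek": kms.encrypt(dek)}


def decrypt_secret(kms: FakeKms, record: dict) -> bytes:
    """Unwrap the DEK via the KMS, then decrypt the value locally."""
    dek = kms.decrypt(record["encrypted_dek"])
    return toy_open(dek, record["encrypted_value"])


def rewrap_dek(old_kms: FakeKms, new_kms: FakeKms, record: dict) -> dict:
    """MEK rotation: only the wrapped DEK changes; the secret ciphertext is untouched."""
    dek = old_kms.decrypt(record["encrypted_dek"])
    return {**record, "encrypted_dek": new_kms.encrypt(dek)}
```

Note that rewrap_dek never touches encrypted_value, which is why MEK rotation cost scales with the number of DEKs rather than the size of the stored secrets.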

Secret Data Model

Secrets are organized by path (e.g., /prod/payments/db-password), enabling hierarchical namespacing that mirrors organizational structure. The core schema:

secrets table: secret_id (UUID), path (string, indexed), encrypted_value (bytes), encrypted_dek (bytes), version (integer), created_by (principal ID), created_at (timestamp), expires_at (nullable timestamp).

Versioning is non-destructive: writing a new value creates a new row with an incremented version number. The previous versions are retained for rollback. A separate secret_pointers table maps each path to its current active version. Reading a secret resolves the pointer and fetches the active version. Rollback is an atomic pointer update to a previous version. This allows zero-downtime rotation: write the new version, verify consumers can use it, then advance the pointer.

Expiry is enforced at read time — a secret past its expires_at timestamp returns a 404 to clients. A background reaper job also hard-deletes expired secrets after a grace period to reduce storage footprint.
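
A small in-memory sketch of the versioning, pointer, and read-time expiry behavior described above. The class and method names (VersionedStore, set_active, SecretNotFound) are hypothetical, and plain dictionaries stand in for the secrets and secret_pointers tables.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SecretRow:
    path: str
    version: int
    encrypted_value: bytes
    expires_at: Optional[float] = None  # epoch seconds; None means no expiry


class SecretNotFound(Exception):
    """Would map to the 404 returned to clients."""


class VersionedStore:
    def __init__(self):
        self._rows = {}      # (path, version) -> SecretRow, the secrets table
        self._pointers = {}  # path -> active version, the secret_pointers table

    def write(self, path, encrypted_value, expires_at=None):
        """Append a new version. Existing rows and the active pointer are not changed."""
        version = 1 + max((v for (p, v) in self._rows if p == path), default=0)
        self._rows[(path, version)] = SecretRow(path, version, encrypted_value, expires_at)
        self._pointers.setdefault(path, version)  # only the first write becomes active
        return version

    def set_active(self, path, version):
        """Advance or roll back with a single pointer update."""
        if (path, version) not in self._rows:
            raise SecretNotFound(f"{path} v{version}")
        self._pointers[path] = version

    def read(self, path, now=None):
        """Resolve the pointer and enforce expiry at read time."""
        now = time.time() if now is None else now
        row = self._rows.get((path, self._pointers.get(path)))
        if row is None or (row.expires_at is not None and now >= row.expires_at):
            raise SecretNotFound(path)
        return row
```

Because write never moves the pointer after the first version, a new version can be verified before set_active makes it visible to readers, and rollback is the same call with an older version number.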

Access Policy Engine

The policy model is principal + path pattern + allowed actions. Principals are services (identified by IAM role or service account) or human users (authenticated via SSO). Path patterns support wildcards: /prod/payments/* grants access to all secrets under the payments path. Actions are: read, write, list, delete.

The system is deny-by-default: a principal with no matching policy gets no access. Policies are stored as JSON documents and evaluated at every read and write request. Policy evaluation order: explicit deny takes precedence over explicit allow, mirroring AWS IAM semantics. Policies can be attached to individual principals or to groups, with group membership resolved at evaluation time.

Policy documents are versioned and audited. Any policy change — grant, revoke, or modification — is logged with the actor, timestamp, and before/after diff. The policy evaluator is a standalone service so it can be load tested and optimized independently. Caching evaluated decisions with a short TTL (30 seconds) reduces latency without creating large windows for stale policy.
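
The evaluation rules above (deny by default, wildcard paths, explicit deny over explicit allow, groups resolved at evaluation time) might be sketched like this. The Statement shape and is_allowed function are hypothetical, and fnmatchcase is used as an assumed wildcard matcher; note that its * also matches across / separators and that ? and [ ] are treated as pattern characters.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Statement:
    principal: str          # a principal ID or a group name
    path_pattern: str       # e.g. "/prod/payments/*"
    actions: frozenset      # subset of {"read", "write", "list", "delete"}
    effect: str = "allow"   # "allow" or "deny"


def is_allowed(statements, identities, path, action):
    """Return True only if some statement allows the request and none denies it.

    `identities` is the principal's own ID plus its resolved group names.
    """
    allowed = False
    for s in statements:
        if s.principal not in identities or action not in s.actions:
            continue
        if not fnmatchcase(path, s.path_pattern):
            continue
        if s.effect == "deny":
            return False    # explicit deny wins regardless of statement order
        allowed = True
    return allowed          # stays False when nothing matched: deny by default
```

For example, a service with read access to /prod/payments/* and an explicit deny on /prod/payments/root-* can read /prod/payments/db-password but not /prod/payments/root-key.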

Dynamic Secrets

Static credentials are the root cause of most credential sprawl incidents. Dynamic secrets eliminate static credentials by generating ephemeral credentials on demand. The architecture follows HashiCorp Vault’s secrets engine model: a backend plugin handles credential generation for a specific system.

Example — database backend: when a service requests a database credential, the secrets engine connects to the database using a privileged root credential (stored in the secret store, not exposed to clients), creates a new database user with a unique username and random password, grants it the appropriate role, and sets a lease TTL (e.g., 60 minutes). The engine returns the credential to the requesting service along with the lease ID and TTL.

When the lease expires, the engine automatically revokes the credential by dropping the database user. If the service needs longer access, it can renew the lease before expiry. Dynamic secrets work for databases, cloud IAM credentials, SSH certificates, and API tokens. The requesting service never sees a long-lived static credential: each credential is scoped to the lifetime of its lease.
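
The issue, renew, and revoke lifecycle of a database lease can be sketched as below. FakeDatabase and DatabaseSecretsEngine are hypothetical names for illustration; this is not Vault's plugin API, and a real engine would issue SQL through a privileged connection instead of mutating a dictionary.

```python
import secrets
import time
from dataclasses import dataclass


class FakeDatabase:
    """Hypothetical stand-in for a privileged database admin connection."""

    def __init__(self):
        self.users = {}

    def create_user(self, username, password, role):
        self.users[username] = (password, role)

    def drop_user(self, username):
        self.users.pop(username, None)


@dataclass
class Lease:
    lease_id: str
    username: str
    password: str
    expires_at: float


class DatabaseSecretsEngine:
    def __init__(self, db, ttl_seconds=3600.0, clock=time.time):
        self._db = db
        self._ttl = ttl_seconds
        self._clock = clock
        self._leases = {}  # lease_id -> Lease

    def issue(self, role):
        """Create a unique database user and return it with a lease."""
        username = f"v-{role}-{secrets.token_hex(4)}"
        password = secrets.token_urlsafe(24)
        self._db.create_user(username, password, role)
        lease = Lease(secrets.token_hex(8), username, password, self._clock() + self._ttl)
        self._leases[lease.lease_id] = lease
        return lease

    def renew(self, lease_id):
        """Extend a live lease; raises KeyError if it was already revoked."""
        lease = self._leases[lease_id]
        lease.expires_at = self._clock() + self._ttl
        return lease

    def revoke_expired(self):
        """Drop the database user behind every expired lease; returns the count."""
        now = self._clock()
        expired = [l for l in self._leases.values() if l.expires_at <= now]
        for lease in expired:
            self._db.drop_user(lease.username)
            del self._leases[lease.lease_id]
        return len(expired)
```

In a real deployment revoke_expired would run on a timer, and the clock parameter exists only so the behavior can be exercised deterministically.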

Secret Rotation

Even with dynamic secrets, some credentials must be static (e.g., third-party API keys). Rotation is the process of replacing an active credential with a new one without service interruption. Two modes: manual and automatic.

Manual rotation: an operator writes a new version of the secret. The active pointer still points to the old version. Consumers are notified (via event or documentation) and validate the new credential by fetching that version explicitly. Once all consumers have confirmed that the new version works for them, the pointer is advanced. The old version is retained for emergency rollback.

Automatic rotation: a rotation function (Lambda or similar) is invoked on a schedule. It generates a new credential (calls the third-party API to rotate a key, generates a new password, etc.) and writes it as a new secret version. Before retiring the old credential, the rotation function verifies that the new credential works by running a health check. Only on success does it advance the active pointer and send an event to consumers via SNS or an internal event bus so they can refresh their in-memory cache. If the check fails, the pointer stays on the old version, which prevents a failed rotation from breaking production.
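
One possible ordering of the automatic rotation steps is sketched below: generate, write, verify, advance the pointer, then notify. The rotate function and its callable parameters are hypothetical; in practice each callable would wrap the third-party API, the secret store, and the event bus.

```python
def rotate(path, generate, write_version, health_check, set_active, notify):
    """Rotate the secret at `path`; return the new active version, or None on failure.

    generate()             -> new credential value
    write_version(path, v) -> version number of the newly stored value
    health_check(v)        -> True if the new credential works
    set_active(path, ver)  -> advance the active pointer
    notify(path, ver)      -> tell consumers to refresh their caches
    """
    new_value = generate()
    version = write_version(path, new_value)  # the old version is still active here
    if not health_check(new_value):
        # Leave the pointer alone so consumers keep using the old credential.
        return None
    set_active(path, version)
    notify(path, version)
    return version
```

Notifying only after the pointer moves means a consumer that refreshes on the event always resolves the new version.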

Audit Logging

Every access to the secret store — read, write, list, delete, policy change — generates an audit log entry. Fields: principal ID, action, secret path, source IP, user agent, timestamp, success/failure, and for reads: whether the secret was served from cache or live store.

Audit logs are append-only and written to an immutable log store (e.g., AWS CloudTrail, a write-once S3 bucket with Object Lock, or a dedicated audit database with no delete permissions). The application cannot modify or delete audit logs — the audit log writer has only append permissions. Logs are shipped to a SIEM for real-time alerting on anomalous access patterns (e.g., a service suddenly reading secrets outside its normal path prefix).
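
As an illustration of the audit entry fields listed above, the sketch below serializes one entry as a single JSON line and appends it to a stream. The function names and JSON key names are assumptions. Immutability has to be enforced by the log store itself (for example Object Lock or append-only permissions); this code only shows the entry shape.

```python
import json
from datetime import datetime, timezone


def audit_line(principal_id, action, secret_path, source_ip, user_agent,
               success, served_from_cache=None, now=None):
    """Build one audit entry as a JSON line. The secret value is never included."""
    now = now or datetime.now(timezone.utc)
    entry = {
        "principal_id": principal_id,
        "action": action,                        # read, write, list, delete, policy_change
        "secret_path": secret_path,
        "source_ip": source_ip,
        "user_agent": user_agent,
        "timestamp": now.isoformat(),
        "success": success,
        "served_from_cache": served_from_cache,  # only meaningful for reads
    }
    return json.dumps(entry, sort_keys=True)


def append_audit(stream, line):
    """Append-only write: the writer never seeks, truncates, or rewrites."""
    stream.write(line + "\n")
    stream.flush()
```

One JSON object per line keeps the log easy to ship to a SIEM and to parse incrementally.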

Comprehensive audit logging is a hard requirement for SOC 2 Type II and PCI DSS compliance. Auditors will request evidence that all secret access is logged, that logs are tamper-proof, and that access reviews are conducted periodically. Build the audit trail in from day one — retrofitting it is painful.

Client SDK Design

Clients should never call the secret store on every request — that introduces latency on every operation and creates a hard dependency on secret store availability. The client SDK manages a local in-memory cache with intelligent refresh behavior.

Cache TTL is set slightly shorter than the secret’s actual TTL (e.g., if a secret expires in 60 minutes, cache it for 55 minutes). A background goroutine or thread refreshes the cached value before expiry, so the cache never goes cold under normal operation. On refresh failure, the SDK falls back to the stale cached value and logs a warning — stale credentials are better than a complete outage. The SDK retries with exponential backoff.

Critical rule: the SDK never writes cached secrets to disk. In-memory only. This keeps secrets out of on-disk cache files, backups, and container image layers. For the same reason, the SDK never logs secret values. Core dumps are a separate risk because they capture process memory, so the SDK also zeroes out secret bytes in memory after use where the language runtime allows it (mutable byte buffers in Go and Rust can be overwritten explicitly; immutable strings on the JVM and in Python cannot be wiped reliably). SDK initialization blocks until the first successful secret fetch: fail fast at startup rather than failing at the first real request.
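
A minimal sketch of the caching behavior described in this section: in-memory only, a blocking first fetch, proactive refresh before expiry, and fallback to the stale value on failure. SecretCache, refresh_due, and the refresh_fraction default are assumptions for illustration, and fetch is whatever callable talks to the secret store.

```python
import threading
import time


class SecretCache:
    def __init__(self, fetch, ttl_seconds, refresh_fraction=0.9, clock=time.monotonic):
        self._fetch = fetch                      # callable: path -> secret value
        self._refresh_after = ttl_seconds * refresh_fraction
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = {}                       # path -> (value, fetched_at); never on disk
        self._stop = threading.Event()

    def get(self, path):
        """Serve from memory; only the first access for a path hits the store."""
        with self._lock:
            entry = self._entries.get(path)
        if entry is None:
            return self._store(path)             # blocks and may raise: fail fast
        return entry[0]

    def _store(self, path):
        value = self._fetch(path)
        with self._lock:
            self._entries[path] = (value, self._clock())
        return value

    def refresh_due(self):
        """One pass of the background loop. Returns the paths that failed to refresh."""
        now = self._clock()
        with self._lock:
            due = [p for p, (_, t) in self._entries.items() if now - t >= self._refresh_after]
        failed = []
        for path in due:
            try:
                self._store(path)
            except Exception:
                failed.append(path)              # keep serving the stale value
        return failed

    def start(self, interval=1.0, max_backoff=60.0):
        """Run refresh_due() in a daemon thread, doubling the delay while refreshes fail."""
        def loop():
            delay = interval
            while not self._stop.wait(delay):
                delay = min(delay * 2, max_backoff) if self.refresh_due() else interval
        threading.Thread(target=loop, daemon=True).start()

    def stop(self):
        self._stop.set()
```

A caller would typically call get for every required path during startup and then start the background thread; the list returned by refresh_due is the natural place to log a warning or raise an alert.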

High Availability

The secret store is critical infrastructure — every service depends on it at startup and during credential refresh. A single-node secret store is a single point of failure for the entire fleet. HA design requires multiple layers.

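Cluster topology: a Vault cluster (or equivalent) using Raft consensus for leader election and log replication. With Vault's Raft-based integrated storage, one node is the active leader and the others are standbys that forward requests to it, so the cluster is not active-active for writes. Minimum 3 nodes in production: writes require a majority quorum (2 of 3), which tolerates the loss of one node. The client SDK uses a load balancer or service discovery to reach a healthy node.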
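
The majority-quorum arithmetic behind the three-node minimum can be checked with a short calculation. The function names are illustrative.

```python
def quorum(nodes: int) -> int:
    """Smallest majority of a cluster with `nodes` voting members."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return nodes - quorum(nodes)


for n in (1, 3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
```

A fourth node raises the quorum to 3 without tolerating any more failures than three nodes do, which is why odd cluster sizes are the usual choice.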

Unsealing: traditional Vault requires manual unseal with Shamir key shares after a restart, which is operationally painful. Auto-unseal via cloud KMS (AWS KMS, GCP Cloud KMS) removes the manual step — the cluster unseals automatically on restart using a KMS key that requires IAM authentication. This is safe because access to the KMS key is gated by IAM policy, not a human operator.

Multi-region replication: replica clusters in secondary regions (performance replication secondaries in Vault Enterprise terms) serve read requests locally, reducing latency for geographically distributed services. Write requests route to the primary cluster. In disaster recovery scenarios, a secondary region can be promoted to primary. The SLA target for the secret store should be 99.99% (four nines, roughly 53 minutes of downtime per year), given how many downstream services depend on it.
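
To make the four-nines target concrete, the downtime budget implied by an availability figure can be computed directly. The helper name is illustrative.

```python
def downtime_budget_minutes(availability: float, days: float = 365) -> float:
    """Minutes of allowed downtime over `days` at the given availability (0 to 1)."""
    return (1 - availability) * days * 24 * 60


yearly = downtime_budget_minutes(0.9999)            # about 52.6 minutes per year
monthly = downtime_budget_minutes(0.9999, days=30)  # about 4.3 minutes per 30 days
print(round(yearly, 2), round(monthly, 2))
```

A budget of a few minutes per month leaves little room for manual recovery steps, which is one reason auto-unseal and automated failover matter for this target.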