Secret Manager System Low-Level Design

What is a Secret Manager?

A secret manager securely stores and distributes sensitive credentials such as API keys, database passwords, TLS certificates, and OAuth tokens. It addresses three common problems: secrets hard-coded in source code or config files (and therefore exposed in git history), secrets shared over Slack (plaintext, with no audit trail), and secrets that are never rotated (so a leaked secret is a permanent compromise). Examples: HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager.

Requirements

  • Store secrets encrypted at rest; decrypt only on authorized access
  • RBAC: services can only read the secrets they are authorized for
  • Secret versioning: rotate secrets without downtime (both old and new versions active briefly)
  • Automatic rotation: rotate DB passwords and API keys on a schedule
  • Audit trail: log every access with who, what, when
  • High availability: services must keep working from cached secrets while the manager is temporarily unavailable

Data Model

Secret(secret_id UUID, name VARCHAR, path VARCHAR UNIQUE,
       description, created_by, created_at, rotation_policy_id)

SecretVersion(version_id UUID, secret_id, version_num INT, ciphertext BYTEA,
              wrapped_dek BYTEA, kms_key_id,
              status ENUM(CURRENT,PREVIOUS,DEPRECATED),
              created_at, expires_at)
-- Only one version has status=CURRENT per secret
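
The single-CURRENT rule can be enforced by the database instead of application code. The following is a minimal sketch, assuming a store that supports partial unique indexes (SQLite is used here so the example runs anywhere; PostgreSQL supports the same construct). The table is trimmed to the relevant columns, and the index name and the publish_version helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE secret_version (
    version_id  TEXT PRIMARY KEY,
    secret_id   TEXT NOT NULL,
    version_num INTEGER NOT NULL,
    status      TEXT NOT NULL
                CHECK (status IN ('CURRENT', 'PREVIOUS', 'DEPRECATED')),
    UNIQUE (secret_id, version_num)
);
-- Partial unique index: at most one CURRENT row per secret_id.
CREATE UNIQUE INDEX one_current_per_secret
    ON secret_version (secret_id) WHERE status = 'CURRENT';
""")


def publish_version(conn, secret_id, version_id, version_num):
    """Demote the existing CURRENT row and insert the new one atomically."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "UPDATE secret_version SET status = 'PREVIOUS' "
            "WHERE secret_id = ? AND status = 'CURRENT'",
            (secret_id,),
        )
        conn.execute(
            "INSERT INTO secret_version VALUES (?, ?, ?, 'CURRENT')",
            (version_id, secret_id, version_num),
        )
```

With the index in place, a buggy writer that tries to insert a second CURRENT row for the same secret gets an integrity error rather than silently creating an ambiguous state.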

SecretPolicy(policy_id, secret_id, principal_type ENUM(SERVICE,USER,ROLE),
             principal_id, actions ENUM(READ,WRITE,DELETE,ROTATE), expires_at)

SecretAudit(audit_id, secret_id, version_id, principal_id, action,
            ip_address, success BOOL, accessed_at)
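
To illustrate how SecretPolicy and SecretAudit work together, here is a small in-memory sketch of an authorization check that writes an audit record for every attempt, whether it succeeds or not. The class and method names (Policy, AuditRecord, AccessController, check) are hypothetical, the audit record omits version_id for brevity, and a real system would read policies from the database and write audit rows to durable storage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Policy:
    secret_id: str
    principal_id: str
    action: str                           # READ, WRITE, DELETE or ROTATE
    expires_at: Optional[datetime] = None  # None means the grant never expires


@dataclass(frozen=True)
class AuditRecord:
    secret_id: str
    principal_id: str
    action: str
    ip_address: str
    success: bool
    accessed_at: datetime


class AccessController:
    def __init__(self, policies):
        self._policies = list(policies)
        self.audit_log = []  # only ever appended to in this sketch

    def check(self, principal_id, secret_id, action, ip_address, now=None):
        """Return True if an unexpired policy grants the action; always audit."""
        now = now or datetime.now(timezone.utc)
        allowed = any(
            p.principal_id == principal_id
            and p.secret_id == secret_id
            and p.action == action
            and (p.expires_at is None or p.expires_at > now)
            for p in self._policies
        )
        # Denied attempts are logged too: they are often the interesting ones.
        self.audit_log.append(
            AuditRecord(secret_id, principal_id, action, ip_address, allowed, now)
        )
        return allowed
```

The default is deny: a principal with no matching, unexpired policy row gets False, and the attempt still shows up in the audit log.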

Encryption Architecture (Envelope Encryption)

Secrets are never stored or transmitted in plaintext. Use envelope encryption:

  1. A Data Encryption Key (DEK) is generated per secret
  2. The secret value is encrypted with the DEK (AES-256-GCM)
  3. The DEK itself is encrypted with a Key Encryption Key (KEK) stored in a Hardware Security Module (HSM) or KMS (AWS KMS, GCP Cloud KMS)
  4. Only the encrypted DEK (wrapped DEK) and the ciphertext are stored in the database

To decrypt: call KMS to unwrap the DEK (KMS never exposes the KEK), use the DEK to decrypt the secret. KMS access is controlled separately and audited.
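
The wrap/unwrap flow can be sketched in a few lines. Two loud assumptions: FakeKms is a hypothetical in-process stand-in for a real KMS or HSM, and seal/unseal are a standard-library stand-in for AES-256-GCM (an HMAC-SHA256 keystream plus an HMAC tag) used only so the example runs without third-party packages. Do not use this cipher in production; substitute AES-256-GCM from a vetted library. The part that carries over is the structure: a fresh DEK per secret, and only the wrapped DEK and the ciphertext are stored.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass


def _subkey(key: bytes, label: bytes) -> bytes:
    return hmac.new(key, label, hashlib.sha256).digest()


def _xor_keystream(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Stand-in stream cipher (HMAC-SHA256 in counter mode). NOT production crypto.
    stream = bytearray()
    for counter in range((len(data) + 31) // 32):
        stream += hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(data, stream))


def seal(key: bytes, plaintext: bytes) -> bytes:
    """Authenticated encryption stand-in: returns nonce + body + 32-byte tag."""
    nonce = os.urandom(12)
    body = _xor_keystream(_subkey(key, b"enc"), nonce, plaintext)
    tag = hmac.new(_subkey(key, b"mac"), nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def unseal(key: bytes, blob: bytes) -> bytes:
    nonce, body, tag = blob[:12], blob[12:-32], blob[-32:]
    expected = hmac.new(_subkey(key, b"mac"), nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed")
    return _xor_keystream(_subkey(key, b"enc"), nonce, body)


class FakeKms:
    """Hypothetical KMS: the KEK stays inside this object and is never returned."""

    def __init__(self) -> None:
        self._kek = os.urandom(32)

    def wrap(self, dek: bytes) -> bytes:
        return seal(self._kek, dek)

    def unwrap(self, wrapped_dek: bytes) -> bytes:
        return unseal(self._kek, wrapped_dek)


@dataclass
class StoredSecret:
    wrapped_dek: bytes  # DEK encrypted under the KEK
    ciphertext: bytes   # secret value encrypted under the DEK


def encrypt_secret(kms: FakeKms, value: bytes) -> StoredSecret:
    dek = os.urandom(32)  # fresh 256-bit DEK per secret
    return StoredSecret(wrapped_dek=kms.wrap(dek), ciphertext=seal(dek, value))


def decrypt_secret(kms: FakeKms, stored: StoredSecret) -> bytes:
    dek = kms.unwrap(stored.wrapped_dek)  # the only step that needs the KMS
    return unseal(dek, stored.ciphertext)
```

A database dump yields only StoredSecret rows; without a successful unwrap call against the KMS, neither field can be turned back into the secret.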

Secret Rotation

Zero-downtime rotation protocol:

  1. Generate a new secret value (e.g., a new DB password)
  2. Update the target system so that both the old and the new value are valid (e.g., add the new DB password without revoking the old one)
  3. Create a new SecretVersion (status=CURRENT) and demote the old version to status=PREVIOUS
  4. Wait for all services to fetch the new version (grace period of at least 2x the cache TTL, e.g., 120 seconds for TTL=60s)
  5. Revoke the old value in the target system and mark the old version status=DEPRECATED

Services must cache the current secret version and refresh periodically (TTL=60s) or on authentication failure (lazy refresh).
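
The rotation protocol can be modeled as two phases separated by the grace period. The sketch below is in-memory and makes several assumptions: FakeDatabase is a hypothetical target system that can accept more than one password at a time, SecretStore stands in for the SecretVersion table, and the sketch makes the new password valid in the target system before publishing the new version, so that no service can fetch a value the target would reject. Waiting out the grace period between the two calls is left to the caller (for example, a scheduler).

```python
import secrets
from dataclasses import dataclass


@dataclass
class Version:
    num: int
    value: str
    status: str  # CURRENT, PREVIOUS or DEPRECATED


class FakeDatabase:
    """Hypothetical target system that accepts several valid passwords."""

    def __init__(self, password: str) -> None:
        self._valid = {password}

    def add_password(self, password: str) -> None:
        self._valid.add(password)

    def revoke_password(self, password: str) -> None:
        self._valid.discard(password)

    def login(self, password: str) -> bool:
        return password in self._valid


class SecretStore:
    def __init__(self, initial_value: str) -> None:
        self.versions = [Version(1, initial_value, "CURRENT")]

    def by_status(self, status: str):
        return [v for v in self.versions if v.status == status]

    def current(self) -> Version:
        return self.by_status("CURRENT")[0]


def begin_rotation(store: SecretStore, db: FakeDatabase) -> Version:
    """Phase 1: make the new value valid, then publish it as CURRENT."""
    new_value = secrets.token_urlsafe(32)
    db.add_password(new_value)            # old and new are both valid from here
    old = store.current()
    old.status = "PREVIOUS"
    store.versions.append(Version(old.num + 1, new_value, "CURRENT"))
    return store.current()


def finish_rotation(store: SecretStore, db: FakeDatabase) -> None:
    """Phase 2 (after the grace period): revoke and deprecate the old value."""
    for version in store.by_status("PREVIOUS"):
        db.revoke_password(version.value)
        version.status = "DEPRECATED"
```

Between the two calls, a service holding either version can still log in, which is what makes the rotation invisible to callers.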

Service Authentication (Workload Identity)

How does a service prove its identity to the secret manager? Options:

  1. Cloud workload identity: an AWS IAM role attached to an EC2 instance or ECS task, or a GCP service account attached to a GKE pod. The cloud provider validates the identity without a separate credential.
  2. mTLS: the service presents a client certificate issued by the organization's CA.
  3. Vault AppRole: a role_id (static, low sensitivity) paired with a secret_id (dynamic, short-lived). The secret_id is injected at deploy time via a secure channel (CI/CD pipeline).

Prefer cloud workload identity: no bootstrap secret is needed.

Local Caching and High Availability

Services cache secrets in memory with a TTL (default 60s):

  • Cache hit: serve from memory, with no call to the secret manager
  • Cache miss or TTL expiry: fetch from the secret manager and update the cache
  • Secret manager unavailable: serve the stale cached value; a stale secret is far better than a service outage

Monitor staleness: alert if the cached value is older than 10 minutes, which indicates a prolonged secret manager outage.
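
A client-side cache with this behavior might look like the sketch below. The SecretCache class, its fetch callable, and the force_refresh flag are hypothetical names rather than any real SDK's API; the clock is injectable so the behavior can be tested without sleeping.

```python
import time


class SecretCache:
    def __init__(self, fetch, ttl=60.0, clock=time.monotonic):
        self._fetch = fetch    # callable(path) -> secret value; may raise
        self._ttl = ttl
        self._clock = clock
        self._entries = {}     # path -> (value, time of last successful fetch)

    def get(self, path, force_refresh=False):
        """Return the secret, refreshing on miss, TTL expiry, or force_refresh."""
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and not force_refresh and now - entry[1] < self._ttl:
            return entry[0]    # fresh cache hit: no network call
        try:
            value = self._fetch(path)
        except Exception:
            if entry is None:
                raise          # nothing cached yet, so there is no stale value
            return entry[0]    # serve stale; timestamp is left unchanged
        self._entries[path] = (value, now)
        return value

    def age(self, path):
        """Seconds since the last successful fetch, for staleness alerts."""
        return self._clock() - self._entries[path][1]
```

A caller that hits an authentication failure with a cached credential can retry once with get(path, force_refresh=True), and a monitoring hook can alert when age(path) exceeds 600 seconds.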

Key Design Decisions

  • Envelope encryption with KMS: secrets are protected even if the DB is compromised
  • Versioned secrets with grace period: zero-downtime rotation
  • Workload identity for authentication: eliminates bootstrap credential problem
  • Local cache with TTL: high availability, low latency, tolerates transient failures
  • Full audit trail: every access logged for compliance (SOC2, HIPAA)


