A TLS certificate manager automates the full lifecycle of digital certificates: issuance via the ACME protocol, storage with encrypted private keys, rotation before expiry, and distribution to load balancers and services. This post covers the design in detail, including multi-domain SAN certificates, private CA integration, and hot reload mechanics.
Certificate Lifecycle
Certificates move through these states: issued → active → expiring_soon → rotating → active (renewed). The manager monitors all active certificates and flags those within 30 days of expiry as expiring_soon. A rotation job is triggered automatically, acquiring a new certificate before the old one expires.
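As a rough illustration, the expiry check can be reduced to a pure function. The 30-day threshold and the status names come from this post; `classify_status` and `needs_rotation` are hypothetical helper names, and a real manager would also have to account for the rotating and revoked states.

```python
from datetime import datetime, timedelta, timezone

EXPIRY_THRESHOLD = timedelta(days=30)  # threshold taken from the lifecycle description


def classify_status(not_after: datetime, now: datetime,
                    threshold: timedelta = EXPIRY_THRESHOLD) -> str:
    """Map a certificate's expiry time to a lifecycle status (hypothetical helper)."""
    if now >= not_after:
        return "expired"
    if not_after - now <= threshold:
        return "expiring_soon"
    return "active"


def needs_rotation(status: str) -> bool:
    """Only certificates flagged expiring_soon are handed to the rotation job."""
    return status == "expiring_soon"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    status = classify_status(now + timedelta(days=12), now)
    print(status, needs_rotation(status))
```

Passing `now` explicitly, instead of reading the clock inside the function, keeps the check deterministic and easy to test.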
ACME Protocol and Domain Validation
The ACME protocol (RFC 8555) automates domain ownership validation and certificate issuance. Two challenge types are supported:
- HTTP-01: The ACME server fetches /.well-known/acme-challenge/{token} over HTTP on port 80 and expects the key authorization for that token in the response body. Simple to implement; requires the certificate manager to write the challenge file (or configure the web server to proxy the request).
- DNS-01: The ACME server looks up a TXT record at _acme-challenge.{domain} and expects it to contain the base64url-encoded SHA-256 digest of the key authorization. Required for wildcard certificates; requires DNS API access.
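The values both challenges expect can be derived with the standard library alone. The sketch below follows RFC 8555 (sections 8.1, 8.3 and 8.4). The function names are illustrative, and the account key thumbprint (RFC 7638) is assumed to be computed elsewhere and passed in as a string.

```python
import base64
import hashlib


def b64url(data: bytes) -> str:
    """Base64url encoding without padding, as ACME requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def key_authorization(token: str, account_thumbprint: str) -> str:
    """Key authorization = token, a dot, then the account key thumbprint."""
    return f"{token}.{account_thumbprint}"


def http01_path(token: str) -> str:
    """URL path at which the key authorization must be served for HTTP-01."""
    return f"/.well-known/acme-challenge/{token}"


def dns01_record_name(domain: str) -> str:
    """TXT record name for DNS-01; a wildcard is validated at its base domain."""
    base = domain[2:] if domain.startswith("*.") else domain
    return f"_acme-challenge.{base}"


def dns01_txt_value(key_auth: str) -> str:
    """TXT record value: base64url of the SHA-256 digest of the key authorization."""
    return b64url(hashlib.sha256(key_auth.encode("ascii")).digest())


def choose_challenge(domains: list[str]) -> str:
    """Wildcard names can only be validated with DNS-01."""
    return "dns-01" if any(d.startswith("*.") for d in domains) else "http-01"
```

A manager would typically call `choose_challenge` once per order, so that a single wildcard entry in the SAN list switches the whole order to DNS-01.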
The manager supports both Let's Encrypt (public CA) and internal ACME CAs (e.g., Step CA for private PKI).
Private Key Security
Private keys are never stored in plaintext. On generation, the key is encrypted with a Data Encryption Key (DEK) retrieved from a KMS (AWS KMS, GCP Cloud KMS, HashiCorp Vault). The encrypted key blob is stored in the database in key_ref. On distribution, the manager decrypts in-memory and transmits over mTLS to the target.
Key rotation: when a certificate is rotated, a new key pair is generated. The old private key is never reused, so a key that was compromised before rotation cannot be used with the new certificate.
SAN Certificates
Subject Alternative Names (SANs) allow a single certificate to cover multiple domains. The manager stores san_domains as a JSONB array. Because every name on a SAN certificate shares one expiry date, the certificate is rotated as a unit when it approaches expiry.
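To make the "one certificate, many names" idea concrete, here is a small sketch of how the manager might build the full name list for a rotation and check whether a hostname is covered. Both function names are hypothetical. The wildcard rule used is the usual one: `*.` matches exactly one left-most label.

```python
def certificate_names(primary: str, san_domains: list[str]) -> list[str]:
    """Primary domain plus SANs, lowercased and de-duplicated in original order."""
    normalized = (name.lower().rstrip(".") for name in [primary, *san_domains])
    return list(dict.fromkeys(normalized))


def covers(names: list[str], hostname: str) -> bool:
    """Return True if hostname matches one of the (normalized) certificate names."""
    host = hostname.lower().rstrip(".")
    for name in names:
        if name == host:
            return True
        if name.startswith("*."):
            # A wildcard stands in for exactly one left-most label.
            label, _, parent = host.partition(".")
            if label and parent == name[2:]:
                return True
    return False


if __name__ == "__main__":
    names = certificate_names("example.com", ["www.example.com", "*.api.example.com"])
    print(names, covers(names, "v1.api.example.com"))
```

Rotating as a unit then means passing the whole `certificate_names(...)` list to the issuance call, never a subset.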
Certificate Distribution and Hot Reload
After issuance or rotation, the manager pushes the new certificate and private key to all registered targets (load balancers, API gateways, application servers) via an authenticated API call. Targets perform hot reload:
- SIGHUP: nginx/HAProxy reload the TLS config without dropping connections
- API reload: Envoy, Traefik, and similar proxies accept certificate updates via xDS or admin API
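On the target side, a SIGHUP-style reload usually comes down to two steps: replace the files on disk, then signal the process. The following is a minimal sketch of that idea for a target agent, assuming a PID file and PEM files on local disk. All paths and function names are hypothetical, and the `kill` parameter exists only so that the signal step can be swapped out in tests.

```python
import os
import signal
import tempfile
from pathlib import Path


def atomic_write(path: Path, data: bytes, mode: int = 0o600) -> None:
    """Write to a temp file in the same directory, then rename it over path."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.chmod(tmp_name, mode)
        os.replace(tmp_name, path)  # atomic within one filesystem on POSIX
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def reload_via_sighup(pid_file: Path, kill=os.kill) -> int:
    """Send SIGHUP to the PID stored in pid_file and return that PID."""
    pid = int(pid_file.read_text().strip())
    kill(pid, signal.SIGHUP)
    return pid


def install_and_reload(cert_path: Path, key_path: Path, cert_pem: bytes,
                       key_pem: bytes, pid_file: Path, kill=os.kill) -> int:
    """Replace key and certificate on disk, then ask the server to reload."""
    atomic_write(key_path, key_pem, mode=0o600)   # private key: owner-only
    atomic_write(cert_path, cert_pem, mode=0o644)
    return reload_via_sighup(pid_file, kill=kill)
```

Each file is swapped atomically, but the pair is not. That is acceptable here because the server only reads the files when it handles the reload signal, which is sent after both writes finish.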
Audit Trail
Every certificate action — issuance, renewal, revocation — is appended to CertAudit. The audit log is append-only and records the actor (human user or automation job) and timestamp. This supports compliance requirements and incident postmortems.
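One way to enforce "append-only" at the storage layer, and not only by convention, is to reject UPDATE and DELETE with triggers. The sketch below uses an in-memory SQLite database purely as a stand-in so it runs anywhere; this is an assumption for illustration. In PostgreSQL a similar effect could be achieved with triggers or by withholding UPDATE and DELETE privileges from the application role.

```python
import sqlite3

# SQLite stand-in for the CertAudit table; column types are simplified.
SCHEMA = """
CREATE TABLE CertAudit (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    cert_id   TEXT NOT NULL,
    action    TEXT NOT NULL,
    actor     TEXT NOT NULL,
    timestamp TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER certaudit_no_update BEFORE UPDATE ON CertAudit
BEGIN SELECT RAISE(ABORT, 'CertAudit is append-only'); END;
CREATE TRIGGER certaudit_no_delete BEFORE DELETE ON CertAudit
BEGIN SELECT RAISE(ABORT, 'CertAudit is append-only'); END;
"""


def open_audit_db() -> sqlite3.Connection:
    """Create an in-memory database with the append-only audit table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def append_audit(conn: sqlite3.Connection, cert_id: str, action: str, actor: str) -> int:
    """Insert one audit row and return its id."""
    cur = conn.execute(
        "INSERT INTO CertAudit (cert_id, action, actor) VALUES (?, ?, ?)",
        (cert_id, action, actor),
    )
    conn.commit()
    return cur.lastrowid


if __name__ == "__main__":
    conn = open_audit_db()
    append_audit(conn, "cert-1", "issued", "acme-automation")
    try:
        conn.execute("DELETE FROM CertAudit")
    except sqlite3.DatabaseError as exc:
        print("rejected:", exc)
```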
SQL Schema
-- Certificate inventory
CREATE TABLE Certificate (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    domain        TEXT NOT NULL,
    san_domains   JSONB NOT NULL DEFAULT '[]',
    serial_number TEXT NOT NULL UNIQUE,
    not_before    TIMESTAMPTZ NOT NULL,
    not_after     TIMESTAMPTZ NOT NULL,
    status        TEXT NOT NULL DEFAULT 'issued',
    fingerprint   TEXT NOT NULL,
    key_ref       TEXT NOT NULL, -- KMS-encrypted key reference
    CONSTRAINT chk_cert_status CHECK (
        status IN ('issued','active','expiring_soon','rotating','expired','revoked')
    )
);

CREATE INDEX idx_cert_domain ON Certificate(domain);
CREATE INDEX idx_cert_not_after ON Certificate(not_after) WHERE status IN ('active','expiring_soon');

-- Rotation events linking old to new certificate
CREATE TABLE CertRotation (
    id           BIGSERIAL PRIMARY KEY,
    cert_id      UUID NOT NULL REFERENCES Certificate(id),
    triggered_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    new_cert_id  UUID REFERENCES Certificate(id),
    completed_at TIMESTAMPTZ
);

-- Append-only audit log
CREATE TABLE CertAudit (
    id        BIGSERIAL PRIMARY KEY,
    cert_id   UUID NOT NULL REFERENCES Certificate(id),
    action    TEXT NOT NULL, -- issued, renewed, revoked, distributed
    actor     TEXT NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Python Interface
from dataclasses import dataclass
from datetime import datetime

from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import ec


@dataclass
class CertificateRecord:
    id: str
    domain: str
    san_domains: list[str]
    not_after: datetime
    status: str
    fingerprint: str
    key_ref: str


class CertificateManager:
    def __init__(self, db, kms_client, acme_client):
        self.db = db
        self.kms = kms_client
        self.acme = acme_client

    def issue_certificate(
        self,
        domains: list[str],
        acme_provider: str = "letsencrypt"
    ) -> CertificateRecord:
        """Issue a new certificate via ACME for the given domains (primary + SANs)."""
        primary = domains[0]
        sans = domains[1:]
        # Step 1: validate domain ownership (HTTP-01 only; wildcards would need DNS-01)
        for domain in domains:
            token = self.acme.request_challenge(domain, challenge_type="http-01")
            self.validate_acme_challenge(domain, token)
        # Step 2: generate key pair, encrypt private key via KMS
        private_key, public_key = self._generate_key_pair()
        key_ref = self.kms.encrypt(private_key)
        # Step 3: submit CSR to ACME CA
        cert_pem = self.acme.finalize_order(public_key, domains, provider=acme_provider)
        # Step 4: persist
        record = self.db.insert_certificate(
            domain=primary,
            san_domains=sans,
            cert_pem=cert_pem,
            key_ref=key_ref,
            status="active"
        )
        self.db.append_audit(record.id, action="issued", actor="acme-automation")
        return record

    def rotate_certificate(self, cert_id: str) -> CertificateRecord:
        """Rotate an expiring certificate; issues replacement and updates status."""
        old = self.db.get_certificate(cert_id)
        # Mark the old certificate as rotating while the replacement is issued
        self.db.update_status(cert_id, "rotating")
        all_domains = [old.domain] + old.san_domains
        new_cert = self.issue_certificate(all_domains)
        self.db.record_rotation(old_cert_id=cert_id, new_cert_id=new_cert.id)
        # Retire the old certificate once the replacement exists
        self.db.update_status(cert_id, "expired")
        return new_cert

    def validate_acme_challenge(self, domain: str, token: str) -> bool:
        """Serve or confirm the ACME challenge token for HTTP-01 validation."""
        # HTTP-01: the challenge is served at /.well-known/acme-challenge/{token}
        challenge_path = f"/.well-known/acme-challenge/{token}"
        self._serve_challenge(challenge_path, token)
        return self.acme.verify_challenge(domain, token)

    def distribute_certificate(self, cert_id: str, targets: list[str]) -> None:
        """Push certificate + decrypted key to each target and trigger hot reload."""
        cert = self.db.get_certificate(cert_id)
        # The key is decrypted in memory only and never written back in plaintext
        private_key = self.kms.decrypt(cert.key_ref)
        for target_url in targets:
            self._push_to_target(target_url, cert, private_key)
            self.db.append_audit(cert_id, action="distributed", actor=target_url)

    def _generate_key_pair(self) -> tuple[bytes, bytes]:
        """Generate a fresh P-256 key pair and return (private PEM, public PEM)."""
        key = ec.generate_private_key(ec.SECP256R1())
        private_bytes = key.private_bytes(
            encoding=serialization.Encoding.PEM,
            format=serialization.PrivateFormat.PKCS8,
            encryption_algorithm=serialization.NoEncryption()
        )
        public_bytes = key.public_key().public_bytes(
            encoding=serialization.Encoding.PEM,
            format=serialization.PublicFormat.SubjectPublicKeyInfo
        )
        return private_bytes, public_bytes

    def _serve_challenge(self, path: str, token: str) -> None:
        # Write token to web server challenge directory
        pass

    def _push_to_target(self, target_url: str, cert: CertificateRecord, key: bytes) -> None:
        # POST cert+key to target reload endpoint over mTLS
        pass
Design Considerations
30-day rotation trigger: Let's Encrypt certificates are valid for 90 days. Triggering renewal at 30 days leaves a 30-day window for retries if ACME validation fails (DNS propagation, port 80 blocked, etc.). Do not wait until 7 days — that is too little margin for operational incidents.
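The margin argument is simple arithmetic, sketched here for two thresholds. The 12-hour retry interval is an assumption made only for this example, and `renewal_plan` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def renewal_plan(not_after: datetime, threshold: timedelta,
                 retry_interval: timedelta) -> tuple[datetime, int]:
    """Return (time renewal starts, number of retry slots before expiry)."""
    renew_at = not_after - threshold
    attempts = threshold // retry_interval  # whole retry intervals that fit
    return renew_at, attempts


if __name__ == "__main__":
    issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
    not_after = issued + timedelta(days=90)  # Let's Encrypt lifetime from the text
    for days in (30, 7):
        renew_at, attempts = renewal_plan(
            not_after, timedelta(days=days), timedelta(hours=12)
        )
        print(days, (renew_at - issued).days, attempts)
```

With these inputs, a 30-day threshold starts renewal on day 60 and leaves 60 retry slots, while a 7-day threshold starts on day 83 and leaves 14.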
Hot reload without downtime: Both nginx (SIGHUP) and Envoy (xDS SDS) support certificate updates without terminating existing connections. The certificate manager should verify the target accepted the new certificate before marking the rotation complete.
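One hedged way to verify acceptance is to connect to the target after the reload and compare the fingerprint of the certificate it actually serves with the stored one. The sketch assumes the `fingerprint` column holds a hex-encoded SHA-256 digest of the DER certificate, which the post does not specify. The function names are illustrative. For a private CA, the TLS context would additionally need that CA loaded as a trust root.

```python
import hashlib
import socket
import ssl
from typing import Callable


def fingerprint_sha256(cert_der: bytes) -> str:
    """Hex SHA-256 digest of a DER-encoded certificate."""
    return hashlib.sha256(cert_der).hexdigest()


def fetch_served_cert(host: str, port: int = 443, timeout: float = 5.0) -> bytes:
    """Open a TLS connection and return the leaf certificate in DER form."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    if der is None:
        raise ssl.SSLError("peer presented no certificate")
    return der


def target_serves(host: str, expected_fingerprint: str,
                  fetch: Callable[[str], bytes] = fetch_served_cert) -> bool:
    """True if the certificate served by host matches the expected fingerprint."""
    expected = expected_fingerprint.lower().replace(":", "")
    return fingerprint_sha256(fetch(host)) == expected
```

The rotation would then be marked complete (`completed_at` set) only after `target_serves` returns True for every registered target. The `fetch` parameter is injectable so the comparison can be tested without a network connection.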