Overview
An audit log service provides an authoritative, tamper-evident record of who did what to which resource and when. It is write-heavy, append-only, and must satisfy compliance queries that span months or years of history.
Audit Event Schema
Every audit event answers the five compliance questions: who, what action, on what resource, when, and from where. A structured schema makes querying reliable without full-text search.
audit_events
id UUID PK -- globally unique, time-ordered (UUIDv7)
tenant_id UUID -- multi-tenant isolation
actor_type VARCHAR -- user, service_account, system
actor_id VARCHAR -- user ID or service name
actor_ip INET
actor_user_agent TEXT
action VARCHAR -- RESOURCE_TYPE.VERB e.g. document.delete
resource_type VARCHAR
resource_id VARCHAR
resource_name VARCHAR NULL -- human label at time of event
outcome ENUM('success','failure','partial')
metadata JSONB -- action-specific fields
occurred_at TIMESTAMPTZ -- set by client, validated server-side
received_at TIMESTAMPTZ -- set by ingest layer, monotonic
prev_hash CHAR(64) -- SHA-256 of previous event (hash chain)
event_hash CHAR(64) -- SHA-256 of this row excluding event_hash
UUIDv7 encodes a millisecond timestamp in the high bits, giving natural time-ordering without a separate sequence. occurred_at comes from the caller but is bounded — events more than 5 minutes in the past or future are rejected to prevent backdating.
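The skew check above can be sketched as a small server-side validator. This is an illustrative sketch, not a prescribed implementation; `validate_occurred_at` and the `now` override are hypothetical names, and the 5-minute window matches the bound stated above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_SKEW = timedelta(minutes=5)

def validate_occurred_at(occurred_at: datetime,
                         now: Optional[datetime] = None) -> datetime:
    """Reject events whose client-supplied timestamp is more than
    5 minutes in the past or future, preventing backdating."""
    now = now or datetime.now(timezone.utc)
    if abs(now - occurred_at) > MAX_SKEW:
        raise ValueError(f"occurred_at outside the +/-5 min window: {occurred_at.isoformat()}")
    return occurred_at
```

The `now` parameter exists only to make the check deterministic in tests; production callers would omit it.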
Append-Only Write Path
The ingest API accepts events over HTTPS, validates the schema, enriches with server-side fields (received_at, resolved actor details), and writes to the primary store. No UPDATE or DELETE statements are issued against audit tables — database roles for the ingest service are granted INSERT only.
For high-throughput systems, the ingest layer publishes to a Kafka topic partitioned by tenant_id. Consumers write micro-batches to the database and to a columnar store (e.g., ClickHouse or Redshift) used for analytical queries. The Kafka offset is the secondary durability checkpoint — if the DB write fails, the consumer replays from the last committed offset.
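The at-least-once consumer loop can be sketched as below. The `consumer` and `write_batch` arguments are stand-ins for a real Kafka client and DB writer (names are illustrative); the key invariant from the text is that the offset is committed only after the batch write succeeds, so a failed write is replayed from the last committed offset.

```python
def consume_micro_batches(consumer, write_batch, batch_size=500):
    """Drain events into fixed-size micro-batches. Commit the Kafka
    offset only after a successful DB write, so a crash between write
    and commit causes replay (at-least-once) rather than data loss."""
    batch = []
    while True:
        event = consumer.poll()          # None signals end of stream here
        if event is not None:
            batch.append(event)
        if batch and (event is None or len(batch) >= batch_size):
            write_batch(batch)           # raises on failure -> no commit, replay
            consumer.commit()            # durability checkpoint advances
            batch = []
        if event is None:
            return
```

Because replay can redeliver events, downstream writes should be idempotent, e.g. `INSERT ... ON CONFLICT (id) DO NOTHING` keyed on the event UUID.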
Hash Chaining for Tamper Evidence
Each event's event_hash is computed as:
SHA-256(id || tenant_id || actor_id || action || resource_id
|| outcome || occurred_at || metadata_canonical_json
|| prev_hash)
The prev_hash of a new event is the event_hash of the most recent event for that tenant. This forms a per-tenant chain: altering any historical event invalidates every hash from that point forward, making tampering detectable.
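The chain can be sketched as follows, assuming the field order shown above, `|` as the concatenation separator, and canonical JSON meaning sorted keys with no whitespace — a real system must pin these choices down exactly, or verification will produce false mismatches. The genesis value for a tenant's first event is an assumption here.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash for a tenant's first event

def event_hash(ev: dict, prev_hash: str) -> str:
    """SHA-256 over the event fields listed above plus the previous hash."""
    canonical_meta = json.dumps(ev["metadata"], sort_keys=True,
                                separators=(",", ":"))
    parts = [ev["id"], ev["tenant_id"], ev["actor_id"], ev["action"],
             ev["resource_id"], ev["outcome"], ev["occurred_at"],
             canonical_meta, prev_hash]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

def verify_chain(events: list) -> bool:
    """Walk a tenant's events in order; altering any historical event
    invalidates every hash from that point forward."""
    prev = GENESIS
    for ev in events:
        if event_hash(ev, prev) != ev["event_hash"]:
            return False
        prev = ev["event_hash"]
    return True
```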
A background verifier periodically walks the chain for each tenant and alerts if a hash mismatch is found. For additional assurance, periodic checkpoint hashes can be published to an external immutable store (a transparency log or a blockchain anchor service).
Search and Query API
The query API supports filtering on any combination of: time range, actor ID, resource type, resource ID, action, and outcome. Pagination uses a cursor built from (received_at, id) rather than OFFSET to stay efficient on large result sets.
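A keyset cursor over `(received_at, id)` might be encoded as an opaque token so callers cannot depend on (or tamper with) its structure. This is one possible encoding, not the only one; the helper names are illustrative.

```python
import base64
import json

def encode_cursor(received_at: str, event_id: str) -> str:
    """Pack the last row's (received_at, id) into an opaque token."""
    raw = json.dumps([received_at, event_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(token: str):
    received_at, event_id = json.loads(base64.urlsafe_b64decode(token))
    return received_at, event_id

# The corresponding SQL predicate for a descending scan would be:
#   WHERE tenant_id = :tenant
#     AND (received_at, id) < (:cursor_received_at, :cursor_id)
#   ORDER BY received_at DESC, id DESC
#   LIMIT :page_size
```

Unlike `OFFSET`, this predicate lets the `(tenant_id, received_at DESC)` index seek directly to the next page regardless of how deep the caller has paged.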
Indexes on the primary OLTP store:
(tenant_id, received_at DESC)                             -- primary time-range scan
(tenant_id, actor_id, received_at DESC)                   -- actor queries
(tenant_id, resource_type, resource_id, received_at DESC) -- resource history
(tenant_id, action, received_at DESC)                     -- action-type queries
For complex free-text queries against metadata, the columnar replica handles the heavy lifting. The API router sends simple indexed queries to Postgres and analytical queries to ClickHouse, transparent to the caller.
Retention Tiers and Compliance Export
Retention requirements vary by regulation: SOC 2 typically requires one year, HIPAA up to six years, some financial regulations up to seven. The service implements tiered storage:
- Hot tier (0–90 days): primary OLTP DB, fully indexed, sub-second queries.
- Warm tier (90 days–2 years): columnar store, analytical queries in seconds.
- Cold tier (2–7 years): compressed Parquet on object storage (S3/GCS), queryable via Athena or BigQuery on demand.
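The tier boundaries above can be expressed as a simple router. A sketch, using the 90-day and 2-year bands stated above and treating a year as 365 days for simplicity:

```python
from datetime import datetime, timedelta

def storage_tier(received_at: datetime, now: datetime) -> str:
    """Map an event's age to the tier that should serve it."""
    age = now - received_at
    if age <= timedelta(days=90):
        return "hot"     # primary OLTP DB, fully indexed
    if age <= timedelta(days=2 * 365):
        return "warm"    # columnar store
    return "cold"        # Parquet on object storage
```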
A nightly archival job moves events from hot to warm and from warm to cold, verifying hashes before and after transfer. Once in cold storage, events are written with object-lock (WORM) enabled so they cannot be deleted before the retention deadline even by an admin.
The compliance export endpoint generates a signed, time-limited URL to a pre-built Parquet or CSV export for a requested time range and tenant. Exports are asynchronous for large ranges — the caller polls a job ID or receives a webhook on completion.
Security and Scalability
The ingest endpoint authenticates callers with short-lived service tokens (no long-lived API keys). The query API enforces tenant isolation at the DB query layer — every query has tenant_id = $caller_tenant injected server-side; callers cannot override it.
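Server-side tenant scoping might look like the sketch below: `tenant_id` comes from the caller's authenticated token, never from request filters, so a caller-supplied `tenant_id` is silently ignored rather than merged. Function and column names follow the schema above; the query-builder shape is illustrative.

```python
ALLOWED_FILTERS = ("actor_id", "resource_type", "resource_id",
                   "action", "outcome")

def build_query(caller_tenant: str, filters: dict):
    """Build a parameterized query with tenant_id injected from the
    authenticated caller, not from user input."""
    clauses = ["tenant_id = %s"]
    params = [caller_tenant]
    for col in ALLOWED_FILTERS:          # tenant_id is deliberately absent
        if col in filters:
            clauses.append(f"{col} = %s")
            params.append(filters[col])
    sql = "SELECT * FROM audit_events WHERE " + " AND ".join(clauses)
    return sql, params
```

The allow-list doubles as injection protection: column names never come from the request, and values travel as bind parameters.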
At scale, the append-only write pattern is friendly to horizontal partitioning. Partition the audit table by received_at month. Each partition can be attached or detached independently, making archival a DDL operation rather than a bulk DELETE. Read replicas serve the query API; the ingest path writes only to the primary.
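Monthly partition DDL can be generated mechanically. A sketch assuming the table is declared `PARTITION BY RANGE (received_at)` in PostgreSQL; the partition naming convention is an assumption:

```python
from datetime import date

def month_partition_ddl(year: int, month: int) -> str:
    """Emit CREATE TABLE ... PARTITION OF DDL for one calendar month."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"audit_events_{year}_{month:02d}"
    return (f"CREATE TABLE {name} PARTITION OF audit_events "
            f"FOR VALUES FROM ('{start}') TO ('{end}');")
```

Archival then becomes `ALTER TABLE audit_events DETACH PARTITION ...` on the expired month, a metadata-only operation, instead of a bulk DELETE that would violate the append-only grant anyway.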