An audit log is an immutable, chronological record of every significant action taken in a system — who did what to which resource, when, and with what result. Audit logs are foundational to security, compliance, and forensic investigation. Designing them correctly requires careful attention to immutability, schema design, query performance, and regulatory requirements.
Use Cases for Audit Logs
Audit logs serve several distinct purposes. Security monitoring: detecting unauthorized access, privilege abuse, or data exfiltration — "who accessed the patient record at 2am?" Regulatory compliance: HIPAA requires audit trails of PHI access; SOC 2 requires evidence of access controls; GDPR requires records of data processing activities. Forensic investigation: after an incident, audit logs are the primary evidence for reconstructing what happened — which accounts were compromised, what data was accessed, what was changed. Accountability: admin and privileged-user actions (granting roles, deleting records, exporting data) must be traceable to individuals. Without audit logs, you cannot answer "did this actually happen?" after the fact.
Actor-Action-Resource Model
The standard model for audit events is built on three axes, supplemented by a result and a timestamp. Actor: who performed the action — a human user (user_id), a service account, or an automated process; includes source IP and session ID. Action: what was done — standardized verbs like CREATE, READ, UPDATE, DELETE, LOGIN, LOGOUT, EXPORT, SHARE, GRANT_ROLE, REVOKE_ROLE. Resource: what was affected — a resource type (patient_record, invoice, user_account) and a resource ID. Result: SUCCESS or FAILURE — failed access attempts are as important as successful ones. Timestamp: ISO 8601 in UTC — never local time. This model makes queries natural: "show all READ actions on resource patient:123 in the last 30 days."
Audit Event Schema
A well-designed audit event schema, expressed as a PostgreSQL table:

CREATE TABLE audit_events (
    event_id      uuid PRIMARY KEY,
    actor_id      text NOT NULL,   -- user_id or service name
    actor_type    text NOT NULL CHECK (actor_type IN ('USER', 'SERVICE', 'SYSTEM')),
    action        text NOT NULL,   -- CREATE, READ, UPDATE, DELETE, LOGIN, ...
    resource_type text NOT NULL,   -- patient_record, invoice, ...
    resource_id   text NOT NULL,
    result        text NOT NULL CHECK (result IN ('SUCCESS', 'FAILURE')),
    metadata      jsonb,           -- before/after values, error message, etc.
    ip_address    inet,
    user_agent    text,
    session_id    text,
    created_at    timestamptz NOT NULL DEFAULT now()
);
The metadata JSONB field captures event-specific context: for an UPDATE, store the before and after values of changed fields; for a failed login, store the failure reason; for a data export, store the query parameters and record count. Keep the core schema fixed and use metadata for variable content.
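For concreteness, here is what a single event for a successful UPDATE might look like, shown as a Python dict with field names matching the schema above (all values are hypothetical):

audit_event = {
    "event_id": "5f1c2e9a-7d43-4b0a-9c1e-2a8b6d4f0e31",
    "actor_id": "user_4821",
    "actor_type": "USER",
    "action": "UPDATE",
    "resource_type": "patient_record",
    "resource_id": "patient:123",
    "result": "SUCCESS",
    "metadata": {
        # before/after values of only the fields that changed
        "changed_fields": {"phone": {"before": "555-0100", "after": "555-0199"}},
    },
    "ip_address": "203.0.113.7",
    "user_agent": "Mozilla/5.0",
    "session_id": "sess_f83a2c",
    "created_at": "2024-03-12T02:14:09Z",
}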
Append-Only Design and Immutability
The defining property of an audit log is that records are never updated or deleted. If an audit record could be modified, it would be useless as evidence. Enforce append-only at every layer: the application layer (no UPDATE/DELETE permissions on the audit table), the database layer (row-level security; revoke UPDATE and DELETE privileges from the application DB user), and the storage layer (write-once storage for archives). In PostgreSQL, you can enforce this with a trigger that raises an error on any UPDATE or DELETE, as sketched below. Document explicitly that the audit table has no soft delete and no "correction" mechanism — if an event was logged incorrectly, log a compensating event instead.
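A minimal sketch of the trigger-based enforcement, applied here with psycopg2 (the connection string, function, and trigger names are assumptions; the DDL itself is standard PostgreSQL):

import psycopg2

APPEND_ONLY_DDL = """
CREATE OR REPLACE FUNCTION forbid_audit_mutation() RETURNS trigger AS $$
BEGIN
    RAISE EXCEPTION 'audit_events is append-only';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER audit_events_append_only
BEFORE UPDATE OR DELETE ON audit_events
FOR EACH ROW EXECUTE FUNCTION forbid_audit_mutation();
"""

with psycopg2.connect("dbname=audit") as conn:
    with conn.cursor() as cur:
        # TRUNCATE should also be revoked from every role (not shown).
        cur.execute(APPEND_ONLY_DDL)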
Tamper Evidence and Hash Chains
Append-only storage prevents casual modification but doesn’t prove records haven’t been tampered with by a privileged attacker. Hash chain: each audit event includes a prev_hash field containing the hash of the previous event’s content; the chain can be verified at any time by recomputing the hashes, and any modification breaks it. Merkle tree: group events into time windows (e.g., hourly), build a Merkle tree over them, and publish the root hash to an external, independent system (a blockchain, a timestamping service like RFC 3161, or a separate audit authority). External signing: send event hashes to an external audit log signing service that provides cryptographic timestamps. These mechanisms make tampering detectable even when the attacker is a fully privileged application administrator.
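A minimal hash-chain sketch in Python, with an in-memory list standing in for the audit table (the all-zeros genesis value is an assumption):

import hashlib
import json

def event_hash(event: dict) -> str:
    # Canonical JSON so the same event always hashes identically.
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode()).hexdigest()

def append_event(chain: list[dict], event: dict) -> None:
    # Each event records the hash of its predecessor, including that
    # predecessor's own prev_hash, which is what links the chain.
    event["prev_hash"] = event_hash(chain[-1]) if chain else "0" * 64
    chain.append(event)

def verify_chain(chain: list[dict]) -> bool:
    expected = "0" * 64
    for event in chain:
        if event["prev_hash"] != expected:
            return False  # an event was altered, inserted, or removed
        expected = event_hash(event)
    return True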
Write-Once Storage for Compliance
For regulatory compliance, audit logs must be stored on media that cannot be overwritten. AWS S3 Object Lock provides WORM (Write Once Read Many) storage: objects cannot be deleted or overwritten for a specified retention period — in compliance mode, not even the account root user can override the lock (governance mode allows specially privileged users to bypass it). Azure Immutable Blob Storage provides equivalent functionality. The workflow: audit events are written to the operational audit DB for online querying; a periodic export job (daily or hourly) archives events to S3 with Object Lock enabled; the S3 copies satisfy regulatory immutability requirements. This two-tier approach gives you fast queryable online storage and compliant immutable archival storage.
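A sketch of the export step using boto3 (bucket name, key layout, and the six-year retention are assumptions; the bucket must have been created with Object Lock enabled):

from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")

def archive_audit_batch(day: str, parquet_bytes: bytes) -> None:
    s3.put_object(
        Bucket="audit-archive",
        Key=f"audit/{day}.parquet",
        Body=parquet_bytes,
        # COMPLIANCE mode: no user, including root, can shorten or
        # remove the lock before the retain-until date.
        ObjectLockMode="COMPLIANCE",
        ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=6 * 365),
    )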
High-Write Throughput Architecture
In a high-traffic system, synchronously writing to an audit DB on every request adds latency to every operation. The better architecture: the application publishes audit events to Kafka (fire-and-forget, adds ~1ms). A separate async consumer reads from Kafka and writes to the audit DB in batches. This decouples audit write latency from request latency entirely. Kafka provides durability (replication factor 3) and a buffer for burst traffic. The consumer can retry failed writes without affecting the user-facing request. Tradeoff: audit events appear in the DB with a short delay (seconds), not instantly — acceptable for virtually all compliance use cases. Never make a request fail because the audit log is down; use Kafka as a buffer.
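A sketch of the publish side with confluent-kafka (topic name and broker config are assumptions; acks=all pairs with the replication factor above):

import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092", "acks": "all"})

def emit_audit_event(event: dict) -> None:
    # produce() enqueues to a local buffer and returns immediately;
    # delivery to the replicated topic happens in the background.
    producer.produce(
        "audit-events",
        key=event["actor_id"].encode(),
        value=json.dumps(event).encode(),
    )
    producer.poll(0)  # service delivery callbacks without blocking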
Search, Indexing, and Cold Storage
Audit log queries typically filter by actor_id, resource_type, resource_id, action, and time range; index each of these access paths (a sketch follows). For full-text search on metadata (e.g., "find all events where the error message contains ‘permission denied’"), PostgreSQL’s GIN index on the JSONB column works well for moderate scale; Elasticsearch handles larger volumes with better query flexibility. Data tiering: keep the last 90 days in the hot PostgreSQL audit DB; archive older events to Parquet files in S3, queryable via Athena (or the GCS/BigQuery equivalent). Compliance queries against old data are infrequent and tolerate higher latency, so cold storage is acceptable. Hot/cold tiering keeps the operational DB small and fast.
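A hypothetical index set for these access paths, as PostgreSQL DDL held in a migration script (the error_message metadata key is an assumption):

# e.g. cur.execute(AUDIT_INDEX_DDL) from a migration script
AUDIT_INDEX_DDL = """
CREATE INDEX ON audit_events (actor_id, created_at);
CREATE INDEX ON audit_events (resource_type, resource_id, created_at);
CREATE INDEX ON audit_events (action, created_at);
-- containment and key-existence queries on metadata
CREATE INDEX ON audit_events USING GIN (metadata);
-- full-text search over one specific metadata field
CREATE INDEX ON audit_events
    USING GIN (to_tsvector('english', metadata->>'error_message'));
"""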
Retention Policies and GDPR Conflicts
Different regulations impose different retention requirements: HIPAA requires 6 years from creation or last effective date; PCI DSS requires at least 1 year of audit log retention, with a minimum of 3 months immediately available; SOC 2 typically expects 1 year. The GDPR right to erasure creates a direct conflict: a user may request deletion of all their personal data, but audit records containing their user_id may be required by HIPAA or PCI DSS for years. The standard resolution: pseudonymize the actor rather than delete the record. Log a non-reversible hash or a dissociated UUID in place of the raw user_id, and keep the mapping to the real identity in a separate store that can be deleted on an erasure request. This preserves the event record for compliance purposes (an action was taken) without retaining the personal identifier, and without mutating the append-only log.
Isolating the Audit DB
The audit DB must be completely isolated from the application DB. If audit records lived in the application database, an attacker who compromised it with DDL access could drop or truncate the audit table. Store audit logs in a separate database instance with a separate set of credentials. The application has INSERT-only access to the audit DB — it cannot SELECT (for privacy), UPDATE, or DELETE. Read access for compliance queries and dashboards goes through a separate read-only credential. This isolation ensures that even a full compromise of the application DB doesn’t give an attacker the ability to cover their tracks in the audit log.
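A sketch of the privilege separation as PostgreSQL DDL (role names are assumptions):

# run once against the audit DB instance, e.g. cur.execute(AUDIT_GRANTS_DDL)
AUDIT_GRANTS_DDL = """
REVOKE ALL ON audit_events FROM PUBLIC;
GRANT INSERT ON audit_events TO audit_writer;  -- application service: write-only
GRANT SELECT ON audit_events TO audit_reader;  -- compliance queries and dashboards
-- UPDATE, DELETE, and TRUNCATE are granted to no role at all
"""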
Real-Time Alerting on Audit Events
The audit event stream in Kafka can power real-time security alerting. A stream processor (Flink, Kafka Streams) applies rules: alert if any user performs more than 1000 READ operations on patient_record in an hour (mass data access); alert on GRANT_ROLE actions outside business hours (privilege escalation); alert on LOGIN FAILURE bursts from a single IP (brute force); alert on EXPORT of more than 10,000 records (data exfiltration). These rules are essentially windowed aggregations over the audit stream checked against anomaly thresholds. Integrate alerts with PagerDuty, Slack, or a SIEM (Splunk, Datadog). This turns the audit log from a passive compliance record into an active security monitoring system.
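A minimal sketch of the first rule (mass data access) as plain Python; a real deployment would use Flink or Kafka Streams with proper windowing, and the threshold, field names, and alert hook here are assumptions:

from collections import defaultdict

READS_PER_HOUR_LIMIT = 1000
read_counts: dict[tuple[str, int], int] = defaultdict(int)

def page_security_team(actor_id: str, count: int) -> None:
    # Stand-in for a PagerDuty/Slack/SIEM integration.
    print(f"ALERT: {actor_id} performed {count} patient_record reads this hour")

def on_audit_event(event: dict, epoch_seconds: float) -> None:
    if event["action"] != "READ" or event["resource_type"] != "patient_record":
        return
    bucket = (event["actor_id"], int(epoch_seconds // 3600))  # fixed hourly window
    read_counts[bucket] += 1
    if read_counts[bucket] == READS_PER_HOUR_LIMIT + 1:  # fire once per window
        page_security_team(event["actor_id"], read_counts[bucket])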
Interview Tips
Audit log systems come up in interviews as a compliance requirement question, a "design a logging system" question, or as a component of a larger system design. Key signals: immutability as the first-class requirement, append-only enforcement at multiple layers, hash chains for tamper evidence, Kafka for decoupled high-throughput writes, data tiering (hot DB + cold S3), GDPR pseudonymization over deletion. Mentioning the isolation of the audit DB from the application DB demonstrates security-conscious design. Real-time alerting on the audit stream elevates the design from a passive log to an active security control.
Frequently Asked Questions
What is the actor-action-resource model for audit logging?
The actor-action-resource model structures every audit event around three core entities. Actor: who performed the action (user_id, service account, IP address, session_id). Action: what was done (CREATE, READ, UPDATE, DELETE, LOGIN, EXPORT; some systems prefer namespaced past-tense event names like user.created or payment.refunded). Resource: what was affected (resource_type, resource_id, resource name). Additional fields: timestamp, result (SUCCESS/FAILURE), reason for failure, before/after state for UPDATE operations, and a correlation_id for distributed tracing. This structure enables consistent filtering and alerting regardless of the operation type.
How do you make audit logs tamper-evident?
Hash chaining: each audit event includes a hash field computed as SHA-256(previous_event_hash + current_event_content). An auditor can verify the chain by recomputing hashes from the beginning — any modification or deletion breaks the chain. For stronger guarantees, use a Merkle tree over time windows (hourly/daily): the root hash is published externally (blockchain, CT log, email to auditors). Cryptographic signing: each event is signed with the audit service’s private key; anyone with the public key can verify authenticity. AWS CloudTrail uses SHA-256 digest files to verify log file integrity.
How do you handle high write throughput for audit logging without impacting application latency?
Decouple audit logging from the request path: instead of writing synchronously to the audit DB on every API call, publish events to a message queue (Kafka or SQS) asynchronously. The application service writes the audit event to Kafka before returning the response (or fire-and-forget after). A separate audit consumer service reads from Kafka and writes to the audit DB. This ensures the audit DB write latency does not block the request. If Kafka is unavailable, the application can buffer events in a local write-ahead log and replay them when Kafka recovers.
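A sketch of that local fallback with confluent-kafka, whose produce() raises BufferError when the local queue fills up (file path and topic are assumptions; rotating the file after a successful replay is omitted):

import json

FALLBACK_LOG = "/var/log/app/audit-fallback.jsonl"

def emit_with_fallback(producer, event: dict) -> None:
    try:
        producer.produce("audit-events", json.dumps(event).encode())
        producer.poll(0)
    except BufferError:
        # Local queue is full, typically because the broker is unreachable.
        with open(FALLBACK_LOG, "a") as f:
            f.write(json.dumps(event) + "\n")

def replay_fallback(producer) -> None:
    with open(FALLBACK_LOG) as f:
        for line in f:
            producer.produce("audit-events", line.strip().encode())
    producer.flush()  # block until the backlog is delivered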
How do you balance GDPR right-to-erasure with audit log immutability?
The conflict: GDPR requires deleting a user’s personal data on request; audit logs must be immutable. Resolution: pseudonymization — replace the user’s directly identifying data (name, email, exact IP) with a pseudonymous identifier (user_id or a hashed actor_id) when logging. The mapping table (actor_id → PII) is stored separately and can be deleted on erasure request, effectively anonymizing the audit logs without modifying or deleting the log entries. The audit events remain intact (preserving the action/resource chain) but the actor is no longer re-identifiable. Consult legal counsel for jurisdiction-specific requirements.
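A minimal sketch of the mapping-table approach, with an in-memory dict standing in for the separately stored, access-controlled mapping:

import secrets

pseudonym_by_user: dict[str, str] = {}  # user_id -> pseudonym; deletable on request

def actor_id_for(user_id: str) -> str:
    # The audit log only ever sees this stable pseudonym, never the raw user_id.
    if user_id not in pseudonym_by_user:
        pseudonym_by_user[user_id] = secrets.token_hex(16)
    return pseudonym_by_user[user_id]

def erase_user(user_id: str) -> None:
    # GDPR erasure: delete the link. Existing audit events keep their
    # pseudonymous actor_id but can no longer be tied back to the person.
    pseudonym_by_user.pop(user_id, None)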