Question 1

What fields should every audit log event contain?

Accepted Answer

Every audit event needs six core fields: actor (who: a user_id, service account, or API key), action (what: resource.verb naming such as 'document.delete'), resource (to what: resource_type and resource_id), outcome (success or failure), timestamp (ISO 8601, in UTC), and request_id for correlation with application logs. For mutation events, include before and after values to show what changed. Add ip_address and user_agent to support security investigations. Never omit actor or resource: investigations almost always start from "what did this actor do?" or "what happened to this resource?", and an event missing either cannot answer those questions. Emit events as structured JSON so they can be parsed, indexed, and queried reliably.
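As an illustration of the fields listed above, here is a minimal Python sketch of an event type that rejects events without an actor or resource and serializes to JSON. The class name AuditEvent, the actor_type field, and the choice to drop unset optional fields are assumptions made for this example, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _utc_now() -> str:
    # ISO 8601 timestamp in UTC, e.g. 2024-01-01T12:00:00.000000+00:00
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class AuditEvent:
    # Hypothetical schema for illustration; field names follow the answer above.
    actor_type: str          # "user", "service_account", or "api_key"
    actor_id: str
    action: str              # resource.verb, e.g. "document.delete"
    resource_type: str
    resource_id: str
    outcome: str             # "success" or "failure"
    request_id: str          # correlates with application logs
    timestamp: str = field(default_factory=_utc_now)
    before: Optional[dict] = None   # mutation events only
    after: Optional[dict] = None    # mutation events only
    ip_address: Optional[str] = None
    user_agent: Optional[str] = None

    def __post_init__(self) -> None:
        # Actor and resource are mandatory on every event.
        if not self.actor_id or not self.resource_id:
            raise ValueError("actor_id and resource_id are required")
        if self.outcome not in ("success", "failure"):
            raise ValueError("outcome must be 'success' or 'failure'")

    def to_json(self) -> str:
        # Omit optional fields that were not set; sort keys for stable output.
        populated = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(populated, sort_keys=True)
```

A call such as AuditEvent(actor_type="user", actor_id="u-1", action="document.delete", resource_type="document", resource_id="d-9", outcome="success", request_id="req-1").to_json() yields one JSON object per event, ready to be shipped to whatever log pipeline is in use.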

Question 2

How do you make audit logs tamper-evident?

Accepted Answer

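Tamper evidence means that any modification to an audit record is detectable. Three techniques are common. Hash chaining: each record stores a cryptographic hash computed over its own content and the previous record's hash, so altering or deleting a record invalidates every hash after it. A chain alone does not stop an attacker who can rewrite the whole log and recompute every hash, so periodically anchor the latest hash somewhere the log writer cannot modify. Write-once object storage: S3 Object Lock or other WORM storage prevents modification or deletion for the retention period. Append-only database access: revoke UPDATE and DELETE privileges on the audit_log table from the application service account so that only INSERT is permitted. For regulated industries, prefer a managed service with built-in cryptographic integrity verification over building tamper evidence from scratch. AWS CloudTrail, for example, offers log file integrity validation; note that CloudTrail and Azure Activity Log record cloud control-plane activity by default, not application-level events.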
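The hash-chaining technique can be sketched in a few lines of Python. This is a minimal in-memory illustration, assuming SHA-256 over the previous hash concatenated with a canonical JSON encoding of the event; the function names and the all-zero genesis value are hypothetical choices for the example.

```python
import hashlib
import json
from typing import Optional

# Assumed starting value for the first record's prev_hash.
GENESIS_HASH = "0" * 64


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON (sorted keys, no extra whitespace) so the same event
    # always hashes to the same value.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_record(chain: list, event: dict) -> dict:
    # Link the new record to the hash of the record before it.
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    record = {
        "event": event,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, event),
    }
    chain.append(record)
    return record


def first_broken_index(chain: list) -> Optional[int]:
    # Recompute every hash from the start; return the index of the first
    # record that does not verify, or None if the whole chain is intact.
    prev_hash = GENESIS_HASH
    for i, record in enumerate(chain):
        if record["prev_hash"] != prev_hash:
            return i
        if record["hash"] != _digest(prev_hash, record["event"]):
            return i
        prev_hash = record["hash"]
    return None
```

Editing the event inside any stored record makes first_broken_index return that record's position. Verification only proves internal consistency, so the most recent hash still needs to be published or stored somewhere the log writer cannot change.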

Question 3

How do you store audit logs cost-effectively at scale?

Accepted Answer

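Use two-tier storage. The hot tier (Elasticsearch/OpenSearch) holds the last 90 days and supports the sub-second full-text and structured queries needed for active security investigations. The cold tier (S3 in Parquet format) holds 1-7 years and is queryable via Athena for compliance reports at a fraction of Elasticsearch cost. S3 lifecycle rules automatically transition objects from Standard to Infrequent Access to Glacier as they age. Retention is driven by compliance: PCI-DSS requires one year of audit history with the most recent three months immediately available; HIPAA requires six years for required documentation, which is commonly applied to audit logs; SOC 2 does not prescribe a fixed period, though one year is a common choice because it covers a typical audit window. Columnar Parquet with Snappy compression can shrink cold storage substantially compared with raw JSON; 10-20x is a commonly quoted figure, but the actual ratio depends on how repetitive the events are.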
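The S3 lifecycle transitions described above can be expressed as a configuration document. The sketch below builds one as a plain Python dict in the shape accepted by boto3's put_bucket_lifecycle_configuration; the prefix, rule ID, and day thresholds are assumptions for illustration and should be set from the actual retention requirements.

```python
def build_lifecycle_configuration(
    prefix: str = "audit/",          # assumed key prefix for audit objects
    ia_after_days: int = 30,         # assumed age for Standard -> Standard-IA
    glacier_after_days: int = 365,   # assumed age for Standard-IA -> Glacier
    expire_after_days: int = 2555,   # assumed 7-year retention before deletion
) -> dict:
    # S3 does not allow a transition to STANDARD_IA before 30 days.
    if not (30 <= ia_after_days < glacier_after_days < expire_after_days):
        raise ValueError(
            "expected 30 <= ia_after_days < glacier_after_days < expire_after_days"
        )
    return {
        "Rules": [
            {
                "ID": "audit-log-tiering",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


# Applying it requires boto3 and AWS credentials, so it is left as a comment;
# the bucket name is a placeholder:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-audit-bucket",
#     LifecycleConfiguration=build_lifecycle_configuration(),
# )
```

Keeping the thresholds as parameters makes it easy to use a longer expiry for HIPAA-scoped data than for data that only needs one year. If S3 Object Lock is also in use, the expiration must not be shorter than the lock's retention period.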

Question 4

Why should audit logs be published asynchronously rather than written synchronously?

Accepted Answer

Writing audit logs synchronously on the request path adds latency to every business operation, because the operation cannot complete until the audit write completes. If the audit log store is slow or unavailable, every business operation slows down or fails with it. The asynchronous pattern avoids this: publish audit events to a durable queue (Kafka, SQS) before returning the response, and let a separate audit log service consume the queue and persist events. The business operation waits only for the queue to accept the event, not for audit persistence. Because the queue is durable, events are not lost even if the audit log service is temporarily unavailable. This decouples audit log performance and availability from business operation performance and availability.
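The decoupling can be illustrated with a small Python sketch that uses an in-process queue and a worker thread as stand-ins for the durable queue and the audit log service. Note the loud assumption: queue.Queue lives in memory and is not durable, so this shows only the shape of the pattern; a real deployment would publish to Kafka, SQS, or similar. The names AuditPublisher and delete_document are hypothetical.

```python
import queue
import threading
from typing import Callable


class AuditPublisher:
    """Stand-in for a durable queue plus a consuming audit log service."""

    def __init__(self, sink: Callable[[dict], None]) -> None:
        self._queue: queue.Queue = queue.Queue()
        self._sink = sink            # persists one event, e.g. a DB insert
        self.failed: list = []       # events the sink rejected
        worker = threading.Thread(target=self._consume, daemon=True)
        worker.start()

    def publish(self, event: dict) -> None:
        # Enqueue and return at once; the caller never waits on the sink.
        self._queue.put(event)

    def _consume(self) -> None:
        while True:
            event = self._queue.get()
            try:
                self._sink(event)
            except Exception:
                # Keep the consumer alive; a real system would retry or
                # route the event to a dead-letter queue.
                self.failed.append(event)
            finally:
                self._queue.task_done()

    def flush(self) -> None:
        # Block until every queued event has been handled (useful in tests).
        self._queue.join()


def delete_document(doc_id: str, user_id: str, store: dict,
                    publisher: AuditPublisher) -> bool:
    # Business operation: does its work, publishes the event, and returns.
    existed = store.pop(doc_id, None) is not None
    publisher.publish({
        "actor_id": user_id,
        "action": "document.delete",
        "resource_type": "document",
        "resource_id": doc_id,
        "outcome": "success" if existed else "failure",
    })
    return existed
```

With persisted = [] and publisher = AuditPublisher(persisted.append), a call to delete_document returns as soon as the event is queued, and publisher.flush() can be used in tests to wait until the consumer has persisted it.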

Low Level Design: Audit Logging System

Audit Event Schema

Instrumentation Without Coupling

Tamper Evidence

Storage and Retention

Query Patterns

Real-Time Alerting

Access Control for Audit Logs