Low Level Design: Audit Logging System

An audit log records a tamper-evident, chronologically ordered history of all significant actions in a system: who did what, when, to which resource, and with what result. Audit logs are required for compliance (SOC 2, HIPAA, GDPR), security incident investigation, and debugging production issues. The key properties are immutability (audit logs cannot be modified after the fact), completeness (no events are missed), and queryability (finding relevant events efficiently).

Audit Event Schema

Every audit event contains: actor (who: user_id, service account, or API key), action (what: “user.login”, “document.delete”, “permission.grant”), resource (to what: resource_type, resource_id), outcome (success, failure, partial), timestamp (ISO 8601, UTC), request_id (for correlation with application logs), ip_address, user_agent, and, for mutation events, before/after values (what changed). Use a consistent action naming convention, {resource}.{verb}, and keep one verb tense across the whole catalog: either imperative (“order.cancel”, “role.assign”) or past tense (“order.cancelled”, “user.password_changed”, “role.assigned”), not a mix of both. A structured JSON format enables reliable parsing and indexing. Never omit actor or resource: these are the two dimensions that make audit logs useful for investigations.
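The schema above can be sketched as a small immutable record type. This is a minimal illustration, not a prescribed format: the class name AuditEvent, the flat field layout, and the added event_id field are assumptions made for the example.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional

VALID_OUTCOMES = {"success", "failure", "partial"}


@dataclass(frozen=True)
class AuditEvent:
    """Hypothetical audit event record; field names follow the schema in the text."""
    actor_id: str
    actor_type: str            # "user", "service_account", or "api_key"
    action: str                # "{resource}.{verb}", e.g. "document.delete"
    resource_type: str
    resource_id: str
    outcome: str               # one of VALID_OUTCOMES
    request_id: str            # correlates with application logs
    ip_address: Optional[str] = None
    user_agent: Optional[str] = None
    before: Optional[dict] = None   # mutation events only
    after: Optional[dict] = None    # mutation events only
    # event_id is an assumption added for deduplication; it is not in the text.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self):
        # Actor and resource are mandatory: reject events that omit them.
        if not self.actor_id or not self.resource_type or not self.resource_id:
            raise ValueError("actor and resource must be present")
        if self.outcome not in VALID_OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")
        resource, sep, verb = self.action.partition(".")
        if not (resource and sep and verb):
            raise ValueError("action must look like '{resource}.{verb}'")

    def to_json(self) -> str:
        # Sorted keys give a stable serialization, which helps later hashing.
        return json.dumps(asdict(self), sort_keys=True)
```

Validating at construction time means a malformed event fails in the calling service, where the missing context is still available, rather than in the downstream audit pipeline.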

Instrumentation Without Coupling

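Instrument audit logging at the service layer, not the database layer: database triggers miss business context (which user triggered the operation, which API endpoint was called). Patterns for clean instrumentation: aspect-oriented logging (AOP interceptors that wrap service methods), a decorator or middleware that logs before and after every service call, or an audit context that is populated at the request boundary and flushed at the end of the request. Avoid persisting to the audit store synchronously on the request path, since that adds latency to every operation; instead, publish audit events to a queue (Kafka) and process them asynchronously. The application publishes the event; a separate audit log service persists it. Because completeness is a core property, the publish step itself must be durable: if the application can commit a change and then crash before the event is published, the event is lost. Use acknowledged writes with retries, or stage the event in the same transaction as the change (a transactional outbox) and relay it to the queue afterward.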
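A minimal sketch of the decorator pattern combined with a request-scoped audit context. The names audited, request_ctx, and delete_document are hypothetical, and an in-process queue.Queue stands in for the Kafka producer so the example stays self-contained.

```python
import contextvars
import functools
import queue
from datetime import datetime, timezone

# Stand-in for a Kafka producer; a real system would publish to a topic.
audit_queue = queue.Queue()

# Populated once at the request boundary (for example, in HTTP middleware).
request_ctx = contextvars.ContextVar("request_ctx", default=None)


def audited(action, resource_type):
    """Wrap a service function so every call emits exactly one audit event."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(resource_id, *args, **kwargs):
            ctx = request_ctx.get() or {}
            event = {
                "actor_id": ctx.get("actor_id"),
                "request_id": ctx.get("request_id"),
                "action": action,
                "resource_type": resource_type,
                "resource_id": resource_id,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }
            try:
                result = fn(resource_id, *args, **kwargs)
                event["outcome"] = "success"
                return result
            except Exception:
                event["outcome"] = "failure"
                raise
            finally:
                # Publish whether the call succeeded or failed.
                audit_queue.put(event)
        return wrapper
    return decorator


@audited("document.delete", "document")
def delete_document(resource_id):
    # Hypothetical service method; raises to simulate a failed operation.
    if resource_id == "missing":
        raise KeyError(resource_id)
    return True
```

The service function stays free of audit code: the actor and request_id come from the context set at the request boundary, and the decorator supplies the action and resource.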

Tamper Evidence

Audit logs must be tamper-evident: if a record is deleted or modified, it should be detectable. Techniques: hash chaining (each audit record includes a cryptographic hash of the previous record; modification of any record breaks the chain), Merkle trees (efficient verification that any specific record is included in the log without reading the full log), write-once storage (AWS S3 Object Lock, WORM — Write Once Read Many — storage prevents modification or deletion), and append-only database tables (revoke UPDATE and DELETE privileges on the audit_log table from the application service account — only INSERT is permitted).
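Hash chaining can be illustrated in a few lines. This is a sketch under simple assumptions: records live in an in-memory list, SHA-256 is the hash function, and the function names and the all-zero genesis value are chosen for the example.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder "previous hash" for the first record


def record_hash(prev_hash, payload):
    """Hash the previous record's hash together with a canonical payload encoding."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain, payload):
    """Append a record that commits to everything written before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    entry = {
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": record_hash(prev_hash, payload),
    }
    chain.append(entry)
    return entry


def verify(chain):
    """Return the index of the first broken record, or -1 if the chain is intact."""
    prev_hash = GENESIS_HASH
    for i, entry in enumerate(chain):
        expected = record_hash(prev_hash, entry["payload"])
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return i
        prev_hash = entry["hash"]
    return -1
```

A chain alone does not reveal truncation of the most recent records, because the remaining prefix still verifies. Periodically anchoring the latest hash somewhere the log writer cannot modify (for example, write-once storage) closes that gap.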

Storage and Retention

Audit logs are write-heavy and query-sparse: events are written continuously but queried only during investigations or compliance audits. Storage strategy: hot storage (Elasticsearch or OpenSearch) for the last 90 days, which enables full-text and structured queries with sub-second response; cold storage (Parquet files on S3) for 1 to 7 years, which is compressed, low cost, and queryable via Athena for compliance reporting. Retention periods are compliance-driven: SOC 2 does not prescribe a fixed period, but about 1 year is the commonly expected minimum; HIPAA requires that compliance documentation be retained for 6 years, a period that is commonly applied to audit logs as well. Lifecycle rules in S3 automatically transition objects from Standard to Infrequent Access to Glacier as they age, minimizing storage cost.
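The lifecycle transitions can be expressed as configuration. The sketch below only builds the rule document, in the shape accepted by the LifecycleConfiguration argument of boto3's put_bucket_lifecycle_configuration; the helper name, rule ID, prefix, and day thresholds are assumptions to adjust to the actual retention policy.

```python
def build_lifecycle_rules(prefix, ia_after_days, glacier_after_days, expire_after_days):
    """Build an S3 lifecycle configuration: Standard -> Standard-IA -> Glacier -> expire."""
    if not (0 < ia_after_days < glacier_after_days < expire_after_days):
        raise ValueError("thresholds must be positive and strictly increasing")
    return {
        "Rules": [
            {
                "ID": "audit-log-tiering",        # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                # Delete only after the retention period has elapsed.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


# Example thresholds (assumed): infrequent access after 90 days, Glacier after
# 1 year, expiry after 6 years to match a HIPAA-style retention period.
audit_lifecycle = build_lifecycle_rules("audit/", 90, 365, 6 * 365)
```

If the bucket also uses Object Lock, the expiration must not be shorter than the lock's retention period, since locked object versions cannot be deleted before it ends.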

Query Patterns

Common audit log queries: “all actions by user X in the last 30 days” (index on actor.user_id + timestamp), “all actions on resource Y” (index on resource_type + resource_id + timestamp), “all failed login attempts from IP Z” (index on action + outcome + ip_address + timestamp), “what changed in document D between date A and date B” (index on resource_id + timestamp, return before/after). Elasticsearch handles all these with structured queries. For compliance reports querying years of history, use Athena against the cold storage Parquet files — slower but no index size constraints. Pre-build common compliance reports (access reports, failed authentication reports) as scheduled queries.
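Two of these queries, written as Elasticsearch Query DSL request bodies built in Python. This is a sketch: it assumes an index mapping where actor.user_id, action, and outcome are keyword fields, ip_address is an ip or keyword field, and timestamp is a date field. The function names are illustrative.

```python
def actions_by_user(user_id, days=30, size=100):
    """All actions by one user in the last `days` days, newest first."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Filter context: exact matches, no relevance scoring, cacheable.
                "filter": [
                    {"term": {"actor.user_id": user_id}},
                    {"range": {"timestamp": {"gte": f"now-{days}d"}}},
                ]
            }
        },
        "sort": [{"timestamp": {"order": "desc"}}],
    }


def failed_logins_from_ip(ip_address, days=30, size=100):
    """All failed login attempts from one IP address in the last `days` days."""
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"action": "user.login"}},
                    {"term": {"outcome": "failure"}},
                    {"term": {"ip_address": ip_address}},
                    {"range": {"timestamp": {"gte": f"now-{days}d"}}},
                ]
            }
        },
        "sort": [{"timestamp": {"order": "desc"}}],
    }
```

Either dictionary would be sent as the body of a search request against the audit index. Using filter clauses rather than scored matches fits audit lookups, where a record either matches the criteria or does not.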

Real-Time Alerting

Audit logs feed security alerting, which detects anomalous patterns in real time. Alert on: more than 10 failed login attempts for a user in 5 minutes (brute force), a privileged action (such as “role.assigned”) outside business hours, bulk data export (more than 10,000 records exported in one session), access to resources flagged as sensitive, and login from a country the user has never logged in from before. Stream audit events from Kafka to a SIEM (Splunk, Datadog, or a custom rules engine). Correlate with authentication logs and access logs. Tune alert thresholds to minimize false positives while still catching genuine threats.
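The brute-force rule can be sketched as a per-user sliding window. This is a minimal in-memory illustration of what a custom rules engine might do; the class name, the event field names, and the use of epoch seconds for time are assumptions.

```python
from collections import defaultdict, deque


class FailedLoginDetector:
    """Flags a user once failed logins within the window exceed the threshold."""

    def __init__(self, threshold=10, window_seconds=300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)  # actor_id -> failure timestamps

    def observe(self, event, now):
        """Process one audit event; return True if it should raise an alert."""
        if event.get("action") != "user.login" or event.get("outcome") != "failure":
            return False
        times = self._failures[event["actor_id"]]
        times.append(now)
        # Drop failures that have aged out of the window.
        cutoff = now - self.window_seconds
        while times and times[0] <= cutoff:
            times.popleft()
        # "More than 10" in the text, so the comparison is strictly greater.
        return len(times) > self.threshold
```

Passing the timestamp in, rather than reading the clock inside the detector, lets the rule run on event time from the stream and makes it straightforward to replay historical events when tuning thresholds.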

Access Control for Audit Logs

Audit logs are sensitive — they contain information about all user activity including security-relevant events. Access control: security and compliance teams have read access; engineers have read access with additional approval for sensitive records (PII in audit data); the application service account has write-only access (cannot read or modify what it writes). Log access to the audit log itself — meta-auditing. Mask or redact PII in audit log payloads where the PII is not essential for the audit purpose (log that a user changed their email, not the email address itself, unless the email change is the compliance-relevant fact). Document the data classification of audit logs in the data catalog.
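A sketch of PII redaction for mutation events: the event keeps a record of which fields changed while the sensitive values themselves are masked. The set of PII field names, the changed_fields key, and the "[REDACTED]" marker are assumptions for illustration.

```python
import copy

PII_FIELDS = frozenset({"email", "phone", "ssn"})  # assumed classification
REDACTED = "[REDACTED]"


def redact(event, pii_fields=PII_FIELDS):
    """Return a copy of the event with PII masked in before/after values."""
    before = event.get("before") or {}
    after = event.get("after") or {}
    # Record which fields changed before any values are masked.
    changed = sorted(
        key for key in set(before) | set(after)
        if before.get(key) != after.get(key)
    )
    redacted = copy.deepcopy(event)  # never mutate the caller's event
    redacted["changed_fields"] = changed
    for side in ("before", "after"):
        values = redacted.get(side)
        if isinstance(values, dict):
            for key in list(values):
                if key in pii_fields:
                    values[key] = REDACTED
    return redacted
```

Where the value itself is the compliance-relevant fact, the field can be left out of pii_fields for that action, so the exception is an explicit, reviewable decision rather than a default.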
