System Design: Audit Log — Immutable Event Trail, Compliance, and Tamper Detection

Why Audit Logs?

An audit log records every significant action in a system: who did what, to which resource, when, and from where. Audit logs serve multiple purposes: security forensics (investigate a breach: which accounts were accessed?), compliance (SOC 2, HIPAA, PCI-DSS all mandate audit trails), debugging (trace the sequence of events that caused a bug), and accountability (prove that a change was authorized). The defining property of an audit log: it must be append-only and tamper-evident. No user, not even a database administrator, should be able to delete or modify a past audit entry, and any modification that does occur must be detectable.

What to Log

Log every action that changes state or accesses sensitive data. Minimum fields per event: event_id (UUID), timestamp (with timezone, millisecond precision), actor_id (user or service that performed the action), actor_ip (source IP address), actor_user_agent, action_type (LOGIN, LOGOUT, CREATE, UPDATE, DELETE, EXPORT, VIEW_PII), resource_type (USER, ORDER, PAYMENT, REPORT), resource_id, outcome (SUCCESS, FAILURE, UNAUTHORIZED), request_id (correlates with application logs), before_state (JSON snapshot of the resource before the change), after_state (JSON snapshot after). For authentication events, log both successful logins and failed attempts: patterns of failed logins reveal brute-force attacks. For sensitive data (PII, financial records), log even read-only views; HIPAA expects access to PHI to be auditable.
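The field list above maps naturally onto a small immutable record type. The following is a minimal sketch, not a prescribed schema: the class name AuditEvent, the helper _utc_now_ms, and the sample values are assumptions made for illustration, while the field names come from the list above.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Any, Optional


def _utc_now_ms() -> str:
    """ISO 8601 timestamp in UTC with millisecond precision (hypothetical helper)."""
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds")


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class AuditEvent:
    actor_id: str
    actor_ip: str
    actor_user_agent: str
    action_type: str      # e.g. LOGIN, UPDATE, DELETE, EXPORT, VIEW_PII
    resource_type: str    # e.g. USER, ORDER, PAYMENT, REPORT
    resource_id: str
    outcome: str          # SUCCESS, FAILURE, or UNAUTHORIZED
    request_id: str       # correlates with application logs
    before_state: Optional[dict[str, Any]] = None
    after_state: Optional[dict[str, Any]] = None
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(default_factory=_utc_now_ms)

    def to_json(self) -> str:
        # sort_keys gives a stable serialization, which helps if the
        # record is later hashed or signed.
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only.
event = AuditEvent(
    actor_id="user-42", actor_ip="203.0.113.7", actor_user_agent="curl/8.0",
    action_type="UPDATE", resource_type="ORDER", resource_id="order-1001",
    outcome="SUCCESS", request_id="req-abc123",
    before_state={"status": "PENDING"}, after_state={"status": "SHIPPED"},
)
print(event.to_json())
```

Generating event_id and timestamp at construction time keeps callers from forgetting them; a real system might instead assign the timestamp on the server that persists the event.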

Immutability and Tamper Detection

An audit log is only trustworthy if it cannot be silently modified. Techniques: (1) Append-only database table: revoke UPDATE and DELETE privileges on the audit_log table from every role that does not strictly need them, including the application service account, so the application can only INSERT. Even a compromised application then cannot alter or delete logs. Superusers and the table owner can still bypass these grants, which is why the remaining techniques matter. (2) Hash chaining: each log entry includes a hash of the previous entry, similar to a blockchain: entry_hash = SHA256(prev_hash + event_data). If any entry is modified or removed from the middle of the chain, every subsequent hash fails verification. Verify integrity by recomputing the chain, and periodically publish the latest hash to a separate system so that an attacker cannot truncate the tail or recompute the whole chain unnoticed. (3) Write to an external system: replicate logs to a separate, isolated store (a dedicated S3 bucket with Object Lock, or another write-once storage service). Credentials for the primary database grant no access to this store, so compromising the database does not allow modifying the copy. (4) Digital signatures: sign each batch of log entries with the service's private key and verify with the public key. Any modification invalidates the signature.
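Hash chaining is easier to reason about with a concrete example. The sketch below implements entry_hash = SHA256(prev_hash + event_data) over an in-memory list; the all-zero GENESIS_HASH, the canonical-JSON serialization, and the function names are assumptions for illustration rather than a standard format.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, event: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the same event
    # always serializes to the same bytes.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    # prev_hash is a fixed-length hex string, so plain concatenation is unambiguous.
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain: list, event: dict) -> None:
    prev = chain[-1]["entry_hash"] if chain else GENESIS_HASH
    chain.append({"event": event, "prev_hash": prev,
                  "entry_hash": entry_hash(prev, event)})


def verify(chain: list) -> int:
    """Return the index of the first invalid entry, or -1 if the chain is intact."""
    prev = GENESIS_HASH
    for i, entry in enumerate(chain):
        expected = entry_hash(prev, entry["event"])
        if entry["prev_hash"] != prev or entry["entry_hash"] != expected:
            return i
        prev = entry["entry_hash"]
    return -1


chain = []
append(chain, {"actor_id": "u1", "action_type": "LOGIN"})
append(chain, {"actor_id": "u1", "action_type": "EXPORT"})
append(chain, {"actor_id": "u2", "action_type": "DELETE"})
print(verify(chain))  # -1: chain is intact

chain[1]["event"]["action_type"] = "VIEW"  # simulate tampering with the second entry
print(verify(chain))  # 1: first entry that no longer verifies
```

Note what this check does not catch on its own: dropping entries from the end of the chain, or rewriting an entry and recomputing every later hash. Detecting those requires comparing the latest entry_hash against a copy stored somewhere the attacker cannot write.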

Schema and Storage

Audit logs are write-heavy and rarely read (queried mainly during investigations or compliance audits), so optimize for writes. PostgreSQL with append-only access and monthly partitioning (partition by timestamp month): old partitions can be archived to cold storage without affecting active write performance. Elasticsearch: index audit logs for fast full-text and filter queries during investigations (search all events by actor_id, resource_type, time range). Sync from the primary database via CDC (Debezium). S3 + Athena: archive partitioned Parquet files to S3 and query them with Athena for compliance exports. Retention: regulatory requirements vary. HIPAA: 6 years (the documentation-retention period, commonly applied to audit logs). PCI-DSS: at least 12 months of history, with the most recent 3 months immediately available for analysis. SOC 2: no fixed period is prescribed; 1 year is a typical minimum. Keep all events for the required period, moving older events to cheaper storage such as Glacier.
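The partitioning and retention rules can be expressed as two small functions. This is a sketch under stated assumptions: the audit_log_YYYY_MM naming scheme, the 90-day hot window, and the tier labels are invented for illustration, and the retention periods are rough day counts for the figures quoted above, not legal guidance.

```python
from datetime import datetime, timedelta, timezone

# Approximate retention periods per regime (assumed day counts).
RETENTION = {
    "HIPAA": timedelta(days=6 * 365),
    "PCI_DSS": timedelta(days=365),
    "SOC2": timedelta(days=365),
}
HOT_WINDOW = timedelta(days=90)  # assumed: how long events stay in the primary database


def partition_name(ts: datetime) -> str:
    """Monthly partition for an event timestamp (hypothetical naming scheme)."""
    return f"audit_log_{ts:%Y_%m}"


def storage_tier(ts: datetime, now: datetime, regime: str) -> str:
    """Classify an event as HOT, ARCHIVE, or EXPIRED by its age."""
    age = now - ts
    if age > RETENTION[regime]:
        return "EXPIRED"   # past the required retention period
    if age > HOT_WINDOW:
        return "ARCHIVE"   # still required, but can live in cold storage
    return "HOT"


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
old = datetime(2022, 1, 1, tzinfo=timezone.utc)
print(partition_name(old))                # audit_log_2022_01
print(storage_tier(old, now, "PCI_DSS"))  # EXPIRED
print(storage_tier(old, now, "HIPAA"))    # ARCHIVE
```

When a system falls under several regimes at once, the longest applicable retention period wins.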

Implementation Patterns

Interceptor/middleware approach: add audit logging as a cross-cutting concern. In an API gateway or service middleware layer, extract actor, action, and resource from each request/response, then write to the audit log asynchronously: publish the event to a Kafka topic and let a dedicated audit log consumer write it to the database. This keeps database writes off the main request path. Pure fire-and-forget can drop events, however, so for actions that must never go unrecorded, wait for the broker acknowledgment before responding. Decorator pattern: wrap repository methods with an audit decorator that automatically captures before/after state for updates. The decorator reads the current state before the update, applies the update, then logs both states. For database-level audit: PostgreSQL triggers on sensitive tables (users, payments) automatically insert audit entries on INSERT/UPDATE/DELETE, so changes made through direct database access are still recorded. Only a superuser or the table owner can disable the triggers.
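The decorator pattern described above can be sketched in a few lines. Everything named here is hypothetical: audited, OrderRepository, and the in-process queue.Queue standing in for a Kafka topic are illustrative stand-ins, and the wrapper assumes the repository exposes a get(resource_id) method.

```python
import copy
import functools
import queue

audit_queue = queue.Queue()  # stand-in for a Kafka topic (assumption)


def audited(action_type: str, resource_type: str):
    """Wrap a repository method and record before/after state around it."""
    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, actor_id, resource_id, *args, **kwargs):
            # Snapshot the state before the change (deep copy so later
            # mutation does not alter the snapshot).
            before = copy.deepcopy(self.get(resource_id))
            outcome = "FAILURE"
            try:
                result = method(self, actor_id, resource_id, *args, **kwargs)
                outcome = "SUCCESS"
                return result
            finally:
                # Runs on success and on exceptions, so failures are logged too.
                audit_queue.put({
                    "actor_id": actor_id,
                    "action_type": action_type,
                    "resource_type": resource_type,
                    "resource_id": resource_id,
                    "outcome": outcome,
                    "before_state": before,
                    "after_state": copy.deepcopy(self.get(resource_id)),
                })
        return wrapper
    return decorator


class OrderRepository:
    """Toy in-memory repository used only to exercise the decorator."""

    def __init__(self):
        self._rows = {"order-1": {"status": "PENDING"}}

    def get(self, resource_id):
        return self._rows.get(resource_id)

    @audited("UPDATE", "ORDER")
    def update_status(self, actor_id, resource_id, status):
        self._rows[resource_id]["status"] = status


repo = OrderRepository()
repo.update_status("user-42", "order-1", "SHIPPED")
print(audit_queue.get_nowait())
```

Against a real database, the read of the before state and the update should happen in the same transaction; otherwise a concurrent writer can make the recorded snapshot inaccurate.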

Interview Tips

  • Separation: audit logs must be in a separate system from application logs. Application logs are operational (debug, warn, error); audit logs are compliance-grade (who accessed what). Different retention, access controls, and integrity requirements.
  • PII in audit logs: audit logs often contain PII (names, emails, IPs). Apply data masking for non-privileged viewers. Only compliance officers and security engineers should see full PII in audit logs.
  • Alerting on audit events: security events (multiple failed logins, bulk data export, admin privilege grant) should trigger real-time alerts via SIEM (Security Information and Event Management) systems like Splunk or Datadog SIEM.
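As an illustration of the alerting tip, a failed-login rule can be written as a sliding-window counter over the audit event stream. This is a sketch, not a SIEM rule: the class name FailedLoginDetector, the threshold of more than 5 failures in 5 minutes, and the use of epoch seconds for timestamp are assumptions, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300  # assumed window: 5 minutes
MAX_FAILURES = 5      # assumed threshold: alert on more than 5 failures


class FailedLoginDetector:
    def __init__(self, window: int = WINDOW_SECONDS, max_failures: int = MAX_FAILURES):
        self.window = window
        self.max_failures = max_failures
        self._failures = defaultdict(deque)  # actor_id -> recent failure timestamps

    def observe(self, event: dict) -> bool:
        """Return True if this event pushes the actor over the threshold."""
        if event["action_type"] != "LOGIN" or event["outcome"] != "FAILURE":
            return False
        now = event["timestamp"]  # epoch seconds (assumption)
        times = self._failures[event["actor_id"]]
        times.append(now)
        # Evict failures that have fallen out of the window.
        while times and times[0] <= now - self.window:
            times.popleft()
        return len(times) > self.max_failures


detector = FailedLoginDetector()
results = [
    detector.observe({"actor_id": "u1", "action_type": "LOGIN",
                      "outcome": "FAILURE", "timestamp": t})
    for t in range(0, 60, 10)  # six failures within one minute
]
print(results)  # [False, False, False, False, False, True]
```

In production this state would live in a stream processor or the SIEM itself rather than in process memory, so that it survives restarts and works across multiple consumers.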

Frequently Asked Questions

How do you prevent even database administrators from deleting audit logs?

Defense in depth: (1) Revoke privileges: the application service account has INSERT-only permission on the audit_log table. Even if the application is compromised, it cannot DELETE. DBAs can be further restricted by requiring dual approval for DELETE on this table. (2) Write to an immutable external store: replicate audit events to S3 with Object Lock (WORM: Write Once Read Many). Object Lock in compliance mode prevents deletion even by the AWS root user for the retention period. The primary DB cannot reach this S3 bucket to modify it. (3) Hash chaining: each entry's hash includes the previous entry's hash. A deletion leaves a "hole" in the chain that is detectable during integrity verification. (4) Signed batches: batch audit events and sign them with a key protected by a hardware security module (HSM). Modification invalidates the signature.

What is the difference between an audit log and an application log?

Application logs are operational: INFO/WARN/ERROR messages for debugging, performance monitoring, and incident response. They capture system behavior: "database query took 500ms," "cache miss on key X," "null pointer exception in method Y." They are typically short-lived (retained 7-30 days), not sensitive, and accessible to engineers. Audit logs are compliance-grade: they record business actions, that is, who did what, to which data, and when. They capture actor intent and data changes: "user admin_john@company.com deleted user account 12345 at 14:23:00 UTC." They must be retained for years (regulatory requirements), access-controlled (only security/compliance teams), tamper-proof, and complete (no missing events). A compromised system should not be able to erase its own audit trail. Application logs can be purged freely; audit logs cannot.

How do you query audit logs efficiently during a security investigation?

Investigation queries: "all actions by user X in the last 30 days," "all access to patient record Y," "all failed login attempts in the last hour," "all admin privilege grants this quarter." Primary database (PostgreSQL): index on (actor_id, timestamp), (resource_type, resource_id, timestamp), (action_type, timestamp). These cover the most common investigation patterns. Elasticsearch: sync audit events via CDC for full-text and complex filter queries. Boolean filter queries: actor_id=X AND action_type IN [DELETE, EXPORT] AND timestamp > NOW()-30d. Kibana or Grafana dashboards for security teams. For compliance exports (e.g., all PHI access for a HIPAA audit): pre-built report queries with scheduled execution. Store report results as downloadable CSVs with their own audit trail (who requested the report, when).

What events must be logged for SOC 2 compliance?

SOC 2 Trust Services Criteria require logging for the Security and Availability criteria: Authentication events (login, logout, failed login, password change, MFA enrollment/disable). Authorization events (privilege changes, role assignments, permission grants/revocations). Data access (read/write/delete of sensitive data). System configuration changes (firewall rule changes, security setting changes, new user provisioning). Data export events (bulk downloads, API exports). Third-party access (SSO logins, OAuth token grants). Infrastructure events (server start/stop, deployment). For each event, the minimum fields are actor, timestamp, source IP, outcome, and the resource affected. SOC 2 auditors typically review log completeness and the ability to investigate specific incidents. Demonstrate that logs cannot be modified and are retained for at least 1 year, which covers a typical Type II audit period.

How do you implement real-time alerting on audit log events?

Publish audit events to a Kafka topic in real time (not batch). A SIEM (Security Information and Event Management) system consumes from Kafka and evaluates alert rules. Common alert rules: (1) Brute force: more than 5 failed logins for the same user in 5 minutes: alert and auto-lock the account. (2) Impossible travel: a user logs in from New York and then Tokyo within 1 hour: alert for review. (3) Bulk data export: any single export of more than 10,000 records: alert the security team. (4) Privilege escalation: any grant of the admin role: alert and require dual approval. (5) Off-hours access to sensitive data: access to PII records outside 9am-6pm business hours: flag for review. Rules are evaluated as streaming queries (Flink, Spark Streaming, or SIEM rule engines like Splunk alerts). Alert routing: PagerDuty for critical (active breach), Slack for high (review needed), email digest for medium.
