Access Log Analyzer Low-Level Design: Streaming Parse, Pattern Detection, and Anomaly Alerting

Access Log Analyzer: Overview and Requirements

An access log analyzer ingests raw HTTP access logs from web servers and CDNs, parses them through a streaming pipeline, applies regex-based pattern matching, detects anomalies, and triggers threshold-based alerts. It replaces ad-hoc grep sessions with a structured, real-time observability layer for traffic and security teams.

Functional Requirements

  • Parse access logs in multiple formats: Combined Log Format (Apache/NGINX), JSON structured logs, and CDN vendor formats (CloudFront, Fastly).
  • Match log lines against a configurable set of named regex patterns (e.g., scanner signatures, error code spikes, path traversal attempts).
  • Detect statistical anomalies: sudden spikes in 5xx rates, request volume deviations, and new geographic sources.
  • Fire alerts via webhook, PagerDuty, or Slack when metric thresholds are crossed.
  • Provide a query interface for historical log search with filtering by IP, path, status code, and time range.

Non-Functional Requirements

  • Process at least 500,000 log lines per second per node.
  • Alert latency under 60 seconds from log event time to notification delivery.
  • Retain parsed log records for 90 days in queryable storage.
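  • Pattern match evaluation under 1 ms per log line as a worst-case latency bound. The throughput target is stricter on average: 500,000 lines per second leaves 2 µs of wall-clock time per line per node, which is, for example, 64 µs of CPU time per line on a 32-core node.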

Data Model

Parsed Log Record

  • log_id — UUID generated at parse time.
  • source_id — identifier of the log source (server or CDN account).
  • timestamp — parsed from the log line, indexed for time-range queries.
  • client_ip, method, path, status_code, bytes_sent, response_time_ms.
  • user_agent, referrer.
  • geo_country, geo_asn — enriched at parse time via a GeoIP lookup.
  • matched_patterns — array of pattern names that fired on this line.
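
The record above could be represented as follows. This is a minimal sketch, not a schema taken from a real deployment: the field names follow the list above, while the Python types, the optional fields, and the defaults are assumptions.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ParsedLogRecord:
    source_id: str
    timestamp: datetime
    client_ip: str
    method: str
    path: str
    status_code: int
    bytes_sent: int
    # Assumed optional: plain Combined Log Format carries no response time.
    response_time_ms: Optional[float] = None
    user_agent: str = ""
    referrer: str = ""
    # Assumed optional: filled in by GeoIP enrichment, None when the lookup misses.
    geo_country: Optional[str] = None
    geo_asn: Optional[int] = None
    matched_patterns: List[str] = field(default_factory=list)
    # UUID generated at parse time.
    log_id: str = field(default_factory=lambda: str(uuid.uuid4()))


# Illustrative values only.
example = ParsedLogRecord(
    source_id="web-1",
    timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
    client_ip="203.0.113.7",
    method="GET",
    path="/index.html",
    status_code=200,
    bytes_sent=2326,
)
```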

Alert Rule

  • rule_id — UUID.
  • name, description.
  • rule_type — THRESHOLD, ANOMALY, PATTERN_COUNT.
  • metric_expression — defines what to measure (e.g., count of 5xx per minute).
  • condition — operator and value (e.g., greater_than 100).
  • window_seconds — rolling evaluation window.
  • notification_channels — JSON array of webhook, PagerDuty, or Slack configs.
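
One way to model a rule and evaluate its condition is sketched below. The operator names other than greater_than, the split of condition into two fields, and the `is_breached` helper are assumptions made for illustration.

```python
import operator
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List


class RuleType(Enum):
    THRESHOLD = "THRESHOLD"
    ANOMALY = "ANOMALY"
    PATTERN_COUNT = "PATTERN_COUNT"


# Only "greater_than" appears in the text above; the other names are hypothetical.
OPERATORS = {
    "greater_than": operator.gt,
    "greater_or_equal": operator.ge,
    "less_than": operator.lt,
    "less_or_equal": operator.le,
}


@dataclass
class AlertRule:
    name: str
    rule_type: RuleType
    metric_expression: str
    condition_operator: str
    condition_value: float
    window_seconds: int
    notification_channels: List[Dict[str, Any]] = field(default_factory=list)
    description: str = ""
    rule_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def is_breached(self, metric_value: float) -> bool:
        """Apply the rule's operator to an already computed window aggregate."""
        compare = OPERATORS[self.condition_operator]
        return compare(metric_value, self.condition_value)


# Illustrative rule: more than 100 5xx responses in a 60-second window.
rule = AlertRule(
    name="5xx spike",
    rule_type=RuleType.THRESHOLD,
    metric_expression="count of 5xx per minute",
    condition_operator="greater_than",
    condition_value=100,
    window_seconds=60,
)
```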

Core Algorithms

Streaming Parse Pipeline

Log lines arrive via a Kafka topic published by log shippers (Filebeat, Fluentd, Vector). The parser service consumes the topic and applies a format detector to select the correct parser (Combined Log Format tokenizer, JSON deserializer, or CDN-specific field mapper). Each parsed record is enriched with GeoIP data using a local MaxMind database loaded into memory, then emitted to a downstream topic for pattern matching and a secondary path for storage.
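
The format detection and Combined Log Format tokenizing step can be sketched as below. The sketch assumes a deliberately simple detector (a leading `{` means JSON), a regex that does not handle escaped quotes inside quoted fields, and a hypothetical `parse_line` function; Kafka consumption, CDN formats, and GeoIP enrichment are left out.

```python
import json
import re
from datetime import datetime
from typing import Optional

# Combined Log Format: host ident authuser [date] "request" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Return a dict of parsed fields, or None so the caller can count a parse error."""
    line = line.strip()
    if line.startswith("{"):
        # Assumed detector: JSON structured logs start with an opening brace.
        try:
            return json.loads(line)
        except json.JSONDecodeError:
            return None
    m = COMBINED_RE.match(line)
    if m is None:
        return None
    return {
        "client_ip": m["client_ip"],
        "timestamp": datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z"),
        "method": m["method"],
        "path": m["path"],
        "status_code": int(m["status"]),
        # A "-" in the bytes field means no body was sent.
        "bytes_sent": 0 if m["bytes"] == "-" else int(m["bytes"]),
        "referrer": m["referrer"],
        "user_agent": m["user_agent"],
    }
```

Returning None instead of raising keeps the consumer loop simple and gives the parse error rate metric a single place to be counted.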

Pattern Matching

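Compile the active regex patterns with a linear-time engine such as RE2, which avoids the catastrophic backtracking that complex or untrusted patterns can trigger in backtracking engines. RE2's set interface (RE2::Set) evaluates many patterns in one pass over the input. For patterns that are plain literal strings (for example, known scanner user-agent tokens), an Aho-Corasick automaton serves the same purpose; it does not handle general regexes. Evaluate the compiled matcher against a concatenated string representation of each log record and return the names of all matching patterns. Because all patterns are evaluated in a single pass, adding 100 patterns does not multiply the per-line cost by 100.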
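
The matching interface (named patterns in, list of matching names out) can be sketched with Python's standard `re` module. This is only a stand-in: `re` is a backtracking engine and the loop below evaluates patterns one at a time, so it offers neither the linear-time guarantee nor the single-pass behaviour described above. The `PatternSet` class and the three example patterns are hypothetical.

```python
import re
from typing import Dict, List


class PatternSet:
    def __init__(self, patterns: Dict[str, str]) -> None:
        # Compile once at load time, never per log line.
        self._compiled = {name: re.compile(expr) for name, expr in patterns.items()}

    def match(self, text: str) -> List[str]:
        """Return the names of all patterns found anywhere in the text."""
        return [name for name, rx in self._compiled.items() if rx.search(text)]


# Hypothetical example patterns.
patterns = PatternSet({
    "path_traversal": r"\.\./",
    "scanner_user_agent": r"(?i)\b(sqlmap|nikto)\b",
    "wp_login_probe": r"/wp-login\.php",
})

# Concatenated string representation of one record (illustrative).
hits = patterns.match("GET /../../etc/passwd 404 sqlmap/1.7")
```

Keeping the interface this narrow means the engine behind `match` can later be replaced by an RE2 binding without touching callers.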

Anomaly Detection

Maintain a sliding window of metric values (5xx rate, request volume, unique IP count) per source in a ring buffer. Compute the exponentially weighted moving average (EWMA) and the standard deviation over the window. Flag a data point as anomalous when it deviates from the EWMA by more than N standard deviations, where N is configurable per rule (default: 3). After an alert fires, apply a cooldown period that suppresses repeated alerts for the same sustained anomaly.
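
A minimal sketch of this detector for a single metric on a single source follows. The smoothing factor `alpha`, the warm-up threshold `min_points`, and expressing the suppression period as a number of data points are assumptions not specified above.

```python
import statistics
from collections import deque
from typing import Optional


class EwmaAnomalyDetector:
    def __init__(self, window_size: int = 60, alpha: float = 0.3,
                 n_sigma: float = 3.0, cooldown_points: int = 5,
                 min_points: int = 10) -> None:
        self.window = deque(maxlen=window_size)  # ring buffer of recent values
        self.alpha = alpha
        self.n_sigma = n_sigma
        self.cooldown_points = cooldown_points
        self.min_points = min_points
        self.ewma: Optional[float] = None
        self._cooldown_left = 0

    def observe(self, value: float) -> bool:
        """Record one data point; return True if an alert should fire for it."""
        anomalous = False
        if self.ewma is not None and len(self.window) >= self.min_points:
            std = statistics.pstdev(self.window)
            anomalous = std > 0 and abs(value - self.ewma) > self.n_sigma * std
        # Update state after scoring so the point is not compared against itself.
        self.window.append(value)
        if self.ewma is None:
            self.ewma = value
        else:
            self.ewma = self.alpha * value + (1 - self.alpha) * self.ewma
        if self._cooldown_left > 0:
            # Still suppressing alerts from an earlier anomaly.
            self._cooldown_left -= 1
            return False
        if anomalous:
            self._cooldown_left = self.cooldown_points
            return True
        return False
```

Note that anomalous points still enter the window and the EWMA in this sketch, so a sustained shift gradually becomes the new baseline; whether that is desirable depends on the metric.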

Threshold Alerting

A dedicated alert evaluator service polls computed metrics every 10 seconds. For each active alert rule, evaluate the metric expression against the rolling window aggregate. When the condition is met for two consecutive evaluation cycles (to reduce flapping), publish an alert event to the notification fanout queue.
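
The two-consecutive-cycles rule can be sketched as a small per-rule streak counter. The class name is hypothetical, and firing only once when the streak first reaches the required length (rather than on every later cycle) is an assumption.

```python
from collections import defaultdict
from typing import Dict


class ConsecutiveBreachEvaluator:
    def __init__(self, required_cycles: int = 2) -> None:
        self.required_cycles = required_cycles
        self._streak: Dict[str, int] = defaultdict(int)

    def evaluate(self, rule_id: str, breached: bool) -> bool:
        """Call once per polling cycle; return True when an alert event should be published."""
        if not breached:
            # Any healthy cycle resets the streak, which is what damps flapping.
            self._streak[rule_id] = 0
            return False
        self._streak[rule_id] += 1
        # Fire exactly once, on the cycle where the streak reaches the threshold.
        return self._streak[rule_id] == self.required_cycles
```

With a 10-second poll, this adds up to one extra cycle of detection delay in exchange for ignoring single-cycle blips.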

Scalability Design

  • Partition the Kafka input topic by source_id so that records from the same server are processed in order by the same parser instance.
  • Scale parser consumers horizontally; each consumer writes parsed records to ClickHouse (columnar store optimized for time-range aggregation queries) and to an Elasticsearch index for full-text path and user-agent search.
  • Compute windowed metrics (5xx rate, top paths, top IPs) using a stream processing framework (Apache Flink or Kafka Streams) writing results to Redis for the alert evaluator to read.
  • Archive records older than 30 days to Parquet on S3 and drop them from ClickHouse to manage storage costs; the archived records remain queryable via Athena for the rest of the 90-day retention period.
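
The windowed 5xx rate that the stream processor computes can be illustrated with an in-memory tumbling window keyed by source. This is a stand-in for a Flink or Kafka Streams job, not an example of either API; the class name and the 60-second bucket size are assumptions, and the sketch omits eviction of old buckets and the write to Redis.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class FiveXxRateWindow:
    def __init__(self, bucket_seconds: int = 60) -> None:
        self.bucket_seconds = bucket_seconds
        # (source_id, bucket_start) -> [total_requests, error_requests]
        self._counts: DefaultDict[Tuple[str, int], List[int]] = defaultdict(lambda: [0, 0])

    def add(self, source_id: str, epoch_seconds: float, status_code: int) -> None:
        """Count one record in the bucket that contains its event time."""
        bucket = int(epoch_seconds // self.bucket_seconds) * self.bucket_seconds
        counts = self._counts[(source_id, bucket)]
        counts[0] += 1
        if 500 <= status_code <= 599:
            counts[1] += 1

    def rate(self, source_id: str, bucket_start: int) -> float:
        """Fraction of requests in the bucket that returned a 5xx status."""
        total, errors = self._counts.get((source_id, bucket_start), (0, 0))
        return errors / total if total else 0.0
```

Bucketing by event time rather than arrival time keeps the metric stable when a shipper delivers lines late, at the cost of having to decide how long a bucket stays open.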

API Design

  • GET /v1/logs/search?source={id}&start={ts}&end={ts}&status={code}&ip={ip}&path={pattern} — filtered log search; returns paginated parsed records.
  • GET /v1/metrics/timeseries?source={id}&metric={name}&start={ts}&end={ts}&interval={s} — time-series metric data for dashboards.
  • POST /v1/alert-rules — create a new alert rule.
  • PUT /v1/alert-rules/{rule_id} — update thresholds or notification channels.
  • DELETE /v1/alert-rules/{rule_id} — disable a rule.
  • GET /v1/alerts/history?start={ts}&end={ts} — retrieve fired alert history with context snapshots.
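
A client could assemble a log search request as sketched below. The host name, the ISO 8601 timestamp format, and the `build_search_url` helper are assumptions; the endpoint path and parameter names come from the list above.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://logs.example.internal"  # hypothetical host


def build_search_url(source: str, start: str, end: str,
                     status: Optional[int] = None,
                     ip: Optional[str] = None,
                     path: Optional[str] = None) -> str:
    """Build a GET /v1/logs/search URL, omitting filters that were not supplied."""
    params = {"source": source, "start": start, "end": end}
    optional = {"status": status, "ip": ip, "path": path}
    params.update({k: v for k, v in optional.items() if v is not None})
    # urlencode percent-escapes reserved characters such as ":" and "/".
    return f"{BASE_URL}/v1/logs/search?{urlencode(params)}"


url = build_search_url(
    "web-1", "2024-01-01T00:00:00Z", "2024-01-01T01:00:00Z", status=503
)
```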

Observability

  • Track parse error rate per source — a spike indicates the upstream log format changed without notice.
  • Monitor consumer lag on the Kafka input topic. Sustained lag approaching 60 seconds means the pipeline cannot keep up with the ingestion rate and needs more consumers; because the end-to-end alert latency budget is also 60 seconds, the scaling trigger should sit well below that value.
  • Alert on the GeoIP lookup miss rate (records left without geo_country or geo_asn). A rising miss rate may indicate that the in-memory MaxMind database is stale or failed to reload after an update.
  • Measure latency end-to-end, from log event time to notification delivery, to verify that the 60-second target is met under load.
