How is a streaming log parse pipeline structured for access log analysis?

Log lines are shipped from servers via a lightweight forwarder (Fluent Bit, Filebeat) to a message broker (Kafka). A stream processor (Flink, Spark Structured Streaming) consumes partitioned topics, parses each line with a compiled regex or schema-driven parser, extracts fields (timestamp, IP, method, path, status, latency), and emits structured events to downstream aggregation and alerting stages.

How does Aho-Corasick pattern matching apply to access log analysis?

Aho-Corasick builds a finite automaton from a dictionary of known literal attack signatures (SQLi fragments, path traversal sequences, scanner user agents) and scans each log field in a single pass whose cost is linear in the field length plus the number of matches, regardless of dictionary size. This is far more efficient than running each signature as a separate regex. The automaton is rebuilt and hot-swapped when the signature set changes.

How is EWMA used for anomaly detection in access logs?

Exponentially weighted moving average (EWMA) maintains a smoothed baseline of metrics such as request rate, error rate, and p99 latency per endpoint. Each new observation updates the average with a decay factor alpha. A metric is flagged as anomalous when it deviates from the EWMA by more than k standard deviations, computed from a parallel EWMA of squared deviations.
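
A minimal Python sketch of this update rule, assuming one detector instance per metric and endpoint. The class name, the default alpha and k, and the warm-up count are illustrative choices, not values given in this design.

```python
import math


class EwmaDetector:
    """EWMA baseline plus a parallel EWMA of squared deviations (variance estimate)."""

    def __init__(self, alpha=0.1, k=3.0, warmup=10):
        self.alpha = alpha    # decay factor: higher values react faster
        self.k = k            # number of standard deviations that counts as anomalous
        self.warmup = warmup  # observations to absorb before flagging (assumption)
        self.mean = None
        self.var = 0.0
        self.n = 0

    def update(self, x):
        """Score x against the current baseline, then fold it into the baseline."""
        self.n += 1
        if self.mean is None:
            self.mean = x
            return False
        dev = x - self.mean
        std = math.sqrt(self.var)
        anomalous = self.n > self.warmup and std > 0 and abs(dev) > self.k * std
        # Update after scoring so a spike is judged against the pre-spike baseline.
        self.mean += self.alpha * dev
        self.var = (1 - self.alpha) * self.var + self.alpha * dev * dev
        return anomalous


detector = EwmaDetector()
steady = [detector.update(v) for v in [100, 102] * 10]
spike = detector.update(500)
```

In this sketch anomalous points still update the baseline; a production detector might clamp or skip them so that one spike does not widen the variance estimate.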

How does threshold alerting work in an access log analyzer?

Static thresholds (e.g., error rate > 5%) and dynamic thresholds derived from EWMA baselines trigger alerts when breached. Alert rules are evaluated in a windowed aggregation (tumbling or sliding windows). Deduplication and flap-suppression logic (requiring N consecutive breaches before firing) reduce noise. Alerts are routed to PagerDuty or Slack with annotated context from the raw log sample.

Access Log Analyzer Low-Level Design: Streaming Parse, Pattern Detection, and Anomaly Alerting


Access Log Analyzer: Overview and Requirements

An access log analyzer ingests raw HTTP access logs from web servers and CDNs, parses them through a streaming pipeline, applies regex-based pattern matching, detects anomalies, and triggers threshold-based alerts. It replaces ad-hoc grep sessions with a structured, real-time observability layer for traffic and security teams.

Functional Requirements

Parse access logs in multiple formats: Combined Log Format (Apache/NGINX), JSON structured logs, and CDN vendor formats (CloudFront, Fastly).
Match log lines against a configurable set of named regex patterns (e.g., scanner signatures, error code spikes, path traversal attempts).
Detect statistical anomalies: sudden spikes in 5xx rates, request volume deviations, and new geographic sources.
Fire alerts via webhook, PagerDuty, or Slack when metric thresholds are crossed.
Provide a query interface for historical log search with filtering by IP, path, status code, and time range.

Non-Functional Requirements

Process at least 500,000 log lines per second per node.
Alert latency under 60 seconds from log event time to notification delivery.
Retain parsed log records for 90 days in queryable storage.
Pattern match evaluation under 1 ms per log line.

Data Model

Parsed Log Record

log_id — UUID generated at parse time.
source_id — identifier of the log source (server or CDN account).
timestamp — parsed from the log line, indexed for time-range queries.
client_ip, method, path, status_code, bytes_sent, response_time_ms.
user_agent, referrer.
geo_country, geo_asn — enriched at parse time via a GeoIP lookup.
matched_patterns — array of pattern names that fired on this line.

Alert Rule

rule_id — UUID.
name, description.
rule_type — THRESHOLD, ANOMALY, PATTERN_COUNT.
metric_expression — defines what to measure (e.g., count of 5xx per minute).
condition — operator and value (e.g., greater_than 100).
window_seconds — rolling evaluation window.
notification_channels — JSON array of webhook, PagerDuty, or Slack configs.
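
One way to represent the alert rule in code is sketched below as a Python dataclass. The field names follow the list above; the metric expression string and the channel config shape are hypothetical, since the design does not fix their syntax.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field

RULE_TYPES = {"THRESHOLD", "ANOMALY", "PATTERN_COUNT"}


@dataclass
class AlertRule:
    name: str
    rule_type: str
    metric_expression: str   # free-form here; the real expression syntax is not specified
    condition: dict          # operator and value, e.g. {"operator": "greater_than", "value": 100}
    window_seconds: int
    notification_channels: list = field(default_factory=list)
    description: str = ""
    rule_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self):
        if self.rule_type not in RULE_TYPES:
            raise ValueError(f"unknown rule_type: {self.rule_type}")


rule = AlertRule(
    name="5xx spike",
    rule_type="THRESHOLD",
    metric_expression="count(status_code >= 500)",  # hypothetical syntax
    condition={"operator": "greater_than", "value": 100},
    window_seconds=60,
    notification_channels=[{"type": "slack", "url": "https://example.invalid/hook"}],
)
payload = json.dumps(asdict(rule))  # JSON form suitable for storage or an API body
```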

Core Algorithms

Streaming Parse Pipeline

Log lines arrive via a Kafka topic published by log shippers (Filebeat, Fluentd, Vector). The parser service consumes the topic and applies a format detector to select the correct parser (Combined Log Format tokenizer, JSON deserializer, or CDN-specific field mapper). Each parsed record is enriched with GeoIP data using a local MaxMind database loaded into memory, then emitted to a downstream topic for pattern matching and a secondary path for storage.
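
The format detection and parse step might look like the following Python sketch. It covers only the Combined Log Format and JSON branches; the CDN mappers, the GeoIP enrichment, and the Kafka consumer loop are omitted. The function names and the "leading brace means JSON" heuristic are assumptions for illustration. Note that Combined Log Format carries no latency field, so response_time_ms would have to come from an extended format.

```python
import json
import re
from datetime import datetime

# Combined Log Format:
# ip ident user [time] "method path protocol" status bytes "referrer" "user-agent"
CLF_RE = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status_code>\d{3}) (?P<bytes_sent>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_clf(line):
    """Parse one Combined Log Format line; return None if it does not match."""
    m = CLF_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["status_code"] = int(rec["status_code"])
    rec["bytes_sent"] = 0 if rec["bytes_sent"] == "-" else int(rec["bytes_sent"])
    rec["timestamp"] = datetime.strptime(rec["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    return rec


def parse_json(line):
    """Parse a JSON structured log line; return None unless it is a JSON object."""
    try:
        obj = json.loads(line)
    except ValueError:
        return None
    return obj if isinstance(obj, dict) else None


def parse_line(line):
    """Format detector: pick a parser from the shape of the line."""
    line = line.strip()
    if line.startswith("{"):
        return parse_json(line)
    return parse_clf(line)


sample = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
          '"GET /index.html HTTP/1.1" 200 2326 "-" "curl/8.0"')
record = parse_line(sample)
```

Lines that match neither format return None here, which is the signal a caller would count toward the parse error rate.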

Pattern Matching

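Aho-Corasick matches literal strings, not regular expressions, so split the active patterns by kind. Compile literal signatures (scanner user-agent tokens, path traversal sequences) into a single Aho-Corasick automaton, and compile true regex patterns with the RE2 library (for example through its RE2::Set multi-pattern interface), which guarantees linear-time matching without catastrophic backtracking. Evaluate both against the concatenated string representation of each log record and return all matching pattern names. Because each matcher scans the record once for all of its patterns, adding 100 patterns does not degrade throughput linearly.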
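
A compact Python sketch of the Aho-Corasick side of this step follows. It assumes the signatures are literal substrings, since the automaton matches fixed strings rather than regular expressions, and that records are lowercased before matching. The pattern names and signature strings are made up for illustration.

```python
from collections import deque


class AhoCorasick:
    """Multi-pattern literal matcher: one pass over the text for all signatures."""

    def __init__(self, signatures):
        # signatures: dict mapping pattern name -> literal substring
        self.goto = [{}]   # per-node transitions: char -> child node index
        self.fail = [0]    # failure link per node
        self.out = [[]]    # pattern names that end at each node
        for name, text in signatures.items():
            self._insert(name, text)
        self._build_failure_links()

    def _insert(self, name, text):
        node = 0
        for ch in text:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = nxt
            node = nxt
        self.out[node].append(name)

    def _build_failure_links(self):
        # Breadth-first, so a node's failure target is finished before the node itself.
        queue = deque(self.goto[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit matches that end at the longest proper suffix.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def match(self, text):
        """Return the set of pattern names whose signature occurs in text."""
        found = set()
        node = 0
        for ch in text:
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            found.update(self.out[node])
        return found


matcher = AhoCorasick({
    "path_traversal": "../",
    "sqli_union": "union select",
    "scanner_sqlmap": "sqlmap",
})
hits = matcher.match("GET /download?file=../../etc/passwd ua=sqlmap/1.7".lower())
```

Hot-swapping a new signature set can then be as simple as building a fresh AhoCorasick instance and replacing the reference the workers read.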

Anomaly Detection

Maintain a sliding window of metric values (5xx rate, request volume, unique IP count) per source using a ring buffer. Compute the EWMA and standard deviation over the window. Flag a data point as anomalous when it deviates more than N standard deviations from the EWMA. N is configurable per rule (default: 3). Use a holdout period after an alert fires to suppress repeated alerts for the same sustained anomaly.
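
The windowed variant with a holdout can be sketched in Python as follows, using a bounded deque as the ring buffer. The window size, alpha, minimum point count, and holdout length are illustrative defaults, and timestamps are assumed to be seconds supplied by the caller.

```python
import statistics
from collections import deque


class WindowedAnomalyDetector:
    """Ring buffer of recent values, scored against the window's EWMA and std dev."""

    def __init__(self, window=60, n_sigma=3.0, alpha=0.3, holdout_s=300, min_points=10):
        self.values = deque(maxlen=window)  # oldest value is evicted automatically
        self.n_sigma = n_sigma
        self.alpha = alpha
        self.holdout_s = holdout_s
        self.min_points = min_points
        self.suppressed_until = float("-inf")

    def _ewma(self):
        # Fold the window from oldest to newest so recent values weigh most.
        it = iter(self.values)
        avg = next(it)
        for v in it:
            avg = self.alpha * v + (1 - self.alpha) * avg
        return avg

    def observe(self, value, now):
        """Return True if this data point should fire an alert."""
        fire = False
        if len(self.values) >= self.min_points:
            std = statistics.pstdev(self.values)
            deviates = std > 0 and abs(value - self._ewma()) > self.n_sigma * std
            if deviates and now >= self.suppressed_until:
                fire = True
                # Holdout: stay quiet while the same anomaly persists.
                self.suppressed_until = now + self.holdout_s
        self.values.append(value)
        return fire


det = WindowedAnomalyDetector(window=20, holdout_s=300)
baseline = [det.observe(v, now=t * 60) for t, v in enumerate([10, 12] * 10)]
first_spike = det.observe(100, now=1200)
repeat_spike = det.observe(100, now=1260)
```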

Threshold Alerting

A dedicated alert evaluator service polls computed metrics every 10 seconds. For each active alert rule, evaluate the metric expression against the rolling window aggregate. When the condition is met for two consecutive evaluation cycles (to reduce flapping), publish an alert event to the notification fanout queue.
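
The consecutive-cycle check can be sketched as a small per-rule state machine in Python. The greater_than operator name comes from the alert rule condition above; the other operator names and the class itself are assumptions.

```python
import operator

OPERATORS = {
    "greater_than": operator.gt,
    "less_than": operator.lt,           # assumed name
    "greater_or_equal": operator.ge,    # assumed name
    "less_or_equal": operator.le,       # assumed name
}


class RuleEvaluator:
    """Tracks consecutive breaches for one alert rule across polling cycles."""

    def __init__(self, op, threshold, required_breaches=2):
        self.compare = OPERATORS[op]
        self.threshold = threshold
        self.required = required_breaches
        self.streak = 0

    def evaluate(self, metric_value):
        """Call once per cycle; True means publish an alert event now."""
        if self.compare(metric_value, self.threshold):
            self.streak += 1
        else:
            self.streak = 0
        # Fire only on the cycle the streak reaches the required count,
        # so a sustained breach does not publish again every cycle.
        return self.streak == self.required


evaluator = RuleEvaluator("greater_than", 100)
decisions = [evaluator.evaluate(v) for v in [150, 90, 120, 130, 140, 80]]
```

In this run the isolated breach at 150 is ignored, and the alert is published once, on the second of the consecutive breaches (130).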

Scalability Design

Partition the Kafka input topic by source_id so that records from the same server are processed in order by the same parser instance.
Scale parser consumers horizontally; each consumer writes parsed records to ClickHouse (columnar store optimized for time-range aggregation queries) and to an Elasticsearch index for full-text path and user-agent search.
Compute windowed metrics (5xx rate, top paths, top IPs) using a stream processing framework (Apache Flink or Kafka Streams), writing results to Redis for the alert evaluator to read.
Archive records older than 30 days to Parquet on S3 and drop them from ClickHouse to manage storage costs; the archive stays queryable through Athena for the rest of the 90-day retention period.
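
To illustrate why keying by source_id preserves per-source ordering, here is a small Python sketch of key-based partition assignment. CRC32 is a stand-in chosen for brevity; Kafka's default partitioner hashes the key with murmur2, and the function name is hypothetical.

```python
import zlib


def partition_for(source_id: str, num_partitions: int) -> int:
    """Map a source_id to a stable partition index (CRC32 stand-in hash)."""
    return zlib.crc32(source_id.encode("utf-8")) % num_partitions


# The same source always lands on the same partition, so a single consumer
# sees that source's lines in the order they were produced.
first = partition_for("web-01", 12)
second = partition_for("web-01", 12)
```

The flip side of this scheme is that one very busy source cannot be spread across partitions, so a hot source caps at the throughput of a single consumer.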

API Design

GET /v1/logs/search?source={id}&start={ts}&end={ts}&status={code}&ip={ip}&path={pattern} — filtered log search; returns paginated parsed records.
GET /v1/metrics/timeseries?source={id}&metric={name}&start={ts}&end={ts}&interval={s} — time-series metric data for dashboards.
POST /v1/alert-rules — create a new alert rule.
PUT /v1/alert-rules/{rule_id} — update thresholds or notification channels.
DELETE /v1/alert-rules/{rule_id} — disable a rule.
GET /v1/alerts/history?start={ts}&end={ts} — retrieve fired alert history with context snapshots.
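
As a usage sketch, a client could assemble the log search request like this in Python. The host name is hypothetical, and the example assumes timestamps are Unix epoch seconds, which the endpoint definitions above do not specify.

```python
from urllib.parse import urlencode

BASE_URL = "https://logs.example.internal"  # hypothetical host


def search_url(source, start, end, status=None, ip=None, path=None):
    """Build a GET /v1/logs/search URL, leaving out filters that are not set."""
    params = {"source": source, "start": start, "end": end}
    optional = {"status": status, "ip": ip, "path": path}
    params.update({k: v for k, v in optional.items() if v is not None})
    return f"{BASE_URL}/v1/logs/search?{urlencode(params)}"


url = search_url("web-01", 1700000000, 1700003600, status=503, path="/api/*")
```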

Observability

Track parse error rate per source — a spike indicates the upstream log format changed without notice.
Monitor consumer lag on the Kafka input topic; lag above 60 seconds already exceeds the alert latency budget and means the pipeline cannot keep up with the ingestion rate and needs horizontal scaling.
Alert on the GeoIP lookup miss rate; a high miss rate may indicate that the in-memory MaxMind database is stale and lacks recently allocated IP ranges, so it needs to be reloaded from a current release.
Measure alert-to-notification latency end-to-end to verify the 60-second SLA is being met under load.