Log Aggregation System Low-Level Design

What is a Log Aggregator?

A log aggregator collects logs from distributed services, normalizes them, stores them for search, and enables real-time alerting. In a microservices architecture with 100 services and 1000 instances, logs are scattered across thousands of machines. Without aggregation, debugging requires SSHing to individual machines. Examples: ELK Stack (Elasticsearch + Logstash + Kibana), Grafana Loki, Splunk, AWS CloudWatch Logs.

Requirements

  • Collect logs from 1000 service instances (structured JSON + unstructured text)
  • Ingest 100MB/second, 1TB/day total volume
  • Full-text search across logs within 30 seconds of ingestion
  • Retention: 30 days hot storage (searchable), 1 year cold storage (archived)
  • Real-time alerting: alert within 60 seconds of an error pattern appearing
  • Grafana dashboards for log volume, error rate by service

Architecture

Service Instances → Log Agent (Fluent Bit / Filebeat) → Kafka
                                                       → Log Processor (Logstash / Flink)
                                                         → Elasticsearch (hot, 30d)
                                                         → S3 (cold, 1yr)
                                                         → Alert Engine (Prometheus/Alertmanager)
                                                       → Kibana / Grafana (visualization)

Log Agent (Fluent Bit)

A lightweight agent runs on each host (sidecar in Kubernetes, daemon on bare metal). Responsibilities: tail log files, parse structured JSON logs, add metadata (host, service name, environment, pod_id), buffer locally (in case of backpressure), and forward to Kafka. Fluent Bit uses <1% CPU and ~20MB RAM per instance — minimal overhead. Configuration: tail /var/log/app/*.log, parse JSON, add Kubernetes pod labels as tags, forward to Kafka topic logs.

Data Model (Log Entry)

LogEntry {
    timestamp: ISO8601
    level: ENUM(DEBUG, INFO, WARN, ERROR, FATAL)
    service: string
    host: string
    trace_id: UUID (correlates logs across services for one request)
    span_id: UUID
    message: string
    fields: {key: value, ...}  // structured fields
    raw: string                // original log line
}

Log Processing Pipeline

Kafka consumer processes log entries: (1) Parse unstructured logs (regex patterns for common formats: nginx, MySQL, Python tracebacks). (2) Enrich: add geo from IP, resolve service name from host. (3) Filter: drop DEBUG logs in production by default (configurable). (4) Route: ERROR/FATAL → alert engine immediately. All logs → Elasticsearch. Logs older than 30 days → S3 archiver.

Elasticsearch Indexing Strategy

Create a new index per day: logs-2026-04-17. This enables time-based retention (delete old indices without expensive DELETE queries) and efficient searches (query only indices within the time range). Index template: configure shards=5 (for 100GB/day index), replicas=1. Index lifecycle policy: hot (0-7d, SSD-backed), warm (7-30d, HDD-backed, fewer replicas), delete (>30d). For 1TB/day: Elasticsearch cluster of ~20 nodes with 8TB SSD each (for 7 days hot) and S3 for cold storage. Field mappings: timestamp as date, level as keyword (exact match), message as text (full-text search), service as keyword.

Real-Time Alerting

Two approaches: (1) Elasticsearch watcher: runs a query every 60 seconds, alerts if result count exceeds threshold. Simple but adds load to Elasticsearch. (2) Stream-based alerting (recommended for high volume): the log processor (Flink) maintains a sliding window (5-minute) count of errors per service. If error_count > threshold: publish alert to Alertmanager → PagerDuty/Slack. Pattern matching: regex on log messages for critical patterns (OutOfMemoryError, connection refused, NullPointerException). Alert fatigue prevention: deduplication (same alert fires at most once per 30 minutes per service), rate limit alerts.

Key Design Decisions

  • Fluent Bit agent: lightweight, Kubernetes-native, handles backpressure with local buffering
  • Kafka as buffer: decouples agents from downstream processing; handles spikes; enables replay
  • Daily Elasticsearch indices: efficient retention management, time-range query optimization
  • Hot/warm/cold tiering: SSD for recent logs, HDD for older, S3 for archive
  • trace_id in every log entry: enables distributed trace reconstruction from logs alone

Databricks system design covers observability and log aggregation. See common questions for Databricks interview: log aggregation and observability design.

Atlassian system design covers distributed logging and monitoring. Review patterns for Atlassian interview: log aggregation and monitoring system design.

Netflix system design covers large-scale log aggregation and observability. See design patterns for Netflix interview: log aggregation and observability at scale.

Scroll to Top