Log Aggregation System Low-Level Design

What is a Log Aggregator?

A log aggregator collects logs from distributed services, normalizes them, stores them for search, and enables real-time alerting. In a microservices architecture with 100 services and 1000 instances, logs are scattered across thousands of machines. Without aggregation, debugging requires SSHing to individual machines. Examples: ELK Stack (Elasticsearch + Logstash + Kibana), Grafana Loki, Splunk, AWS CloudWatch Logs.

Requirements

  • Collect logs from 1000 service instances (structured JSON + unstructured text)
  • Ingest peaks of 100MB/second; ~1TB/day total volume (≈12MB/second sustained average)
  • Full-text search across logs within 30 seconds of ingestion
  • Retention: 30 days hot storage (searchable), 1 year cold storage (archived)
  • Real-time alerting: alert within 60 seconds of an error pattern appearing
  • Grafana dashboards for log volume, error rate by service
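These numbers can be sanity-checked with a quick back-of-envelope sketch. The 30-day hot window and replicas=1 come from the storage sections below; the arithmetic also shows that 100MB/second must be a peak figure, since the sustained average implied by 1TB/day is about 12MB/second.

```python
# Back-of-envelope capacity estimate for the requirements above.
# Assumptions: 1 TB/day average volume, 30-day hot retention, replicas=1.

TB = 10**12  # decimal terabyte, in bytes
MB = 10**6

daily_volume = 1 * TB
sustained_rate = daily_volume / 86_400     # bytes/second, averaged over a day
hot_primary = 30 * daily_volume            # 30 days of searchable logs
hot_with_replica = hot_primary * 2         # replicas=1 doubles raw storage

print(f"sustained ingest ≈ {sustained_rate / MB:.1f} MB/s")          # ≈ 11.6 MB/s
print(f"hot storage (primary only) = {hot_primary / TB:.0f} TB")     # 30 TB
print(f"hot storage (with replica) = {hot_with_replica / TB:.0f} TB")# 60 TB
```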

Architecture

Service Instances → Log Agent (Fluent Bit / Filebeat) → Kafka
Kafka → Log Processor (Logstash / Flink) → Elasticsearch (hot, 30d)
                                         → S3 (cold, 1yr)
                                         → Alert Engine (Prometheus/Alertmanager)
Elasticsearch → Kibana / Grafana (visualization)

Log Agent (Fluent Bit)

A lightweight agent runs on each host (sidecar in Kubernetes, daemon on bare metal). Responsibilities: tail log files, parse structured JSON logs, add metadata (host, service name, environment, pod_id), buffer locally (in case of backpressure), and forward to Kafka. Fluent Bit uses <1% CPU and ~20MB RAM per instance — minimal overhead. Configuration: tail /var/log/app/*.log, parse JSON, add Kubernetes pod labels as tags, forward to Kafka topic logs.
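A minimal Fluent Bit configuration along these lines might look like the following. This is an illustrative sketch, not a production config: the broker addresses, file paths, and buffer location are placeholders, and exact option names should be checked against the Fluent Bit documentation for your version.

```ini
[SERVICE]
    # Placeholder path for on-disk buffering during Kafka backpressure
    storage.path    /var/lib/fluent-bit/buffer

[INPUT]
    Name            tail
    Path            /var/log/app/*.log
    Parser          json
    storage.type    filesystem

[OUTPUT]
    Name            kafka
    Match           *
    Brokers         kafka-1:9092,kafka-2:9092
    Topics          logs
```

In Kubernetes, the `kubernetes` filter would additionally attach pod labels and namespace metadata, as described above.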

Data Model (Log Entry)

LogEntry {
    timestamp: ISO8601
    level: ENUM(DEBUG, INFO, WARN, ERROR, FATAL)
    service: string
    host: string
    trace_id: UUID (correlates logs across services for one request)
    span_id: UUID
    message: string
    fields: {key: value, ...}  // structured fields
    raw: string                // original log line
}
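A concrete instance of this model, serialized as a single JSON log line, might look like the following (all field values are illustrative):

```python
import json

# An illustrative LogEntry, serialized to one JSON line as the agent
# would ship it to Kafka.
entry = {
    "timestamp": "2026-04-17T09:31:22.481Z",
    "level": "ERROR",
    "service": "checkout",
    "host": "node-14",
    "trace_id": "7f3c9a1e-2b4d-4c8a-9e1f-0a6b5d2c3e4f",
    "span_id": "1d2e3f4a-5b6c-7d8e-9f0a-1b2c3d4e5f6a",
    "message": "payment gateway timeout",
    "fields": {"order_id": "A-10293", "latency_ms": 5000},
    "raw": '{"level":"ERROR","msg":"payment gateway timeout"}',
}

line = json.dumps(entry)     # wire format on the Kafka topic
parsed = json.loads(line)    # what the log processor consumes
print(parsed["level"], parsed["service"])
```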

Log Processing Pipeline

A Kafka consumer processes each log entry: (1) Parse unstructured logs (regex patterns for common formats: nginx, MySQL, Python tracebacks). (2) Enrich: add geo data from the client IP, resolve the service name from the host. (3) Filter: drop DEBUG logs in production by default (configurable). (4) Route: ERROR/FATAL → alert engine immediately; all logs → Elasticsearch; logs older than 30 days → S3 archiver.
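The stages above can be sketched as a single processing function. This is a hypothetical sketch: the enrichment step (geo/service lookup) is stubbed out as a comment, and the traceback pattern stands in for a fuller parser library.

```python
import re

# Hypothetical sketch of the per-entry pipeline: parse -> filter -> route.
TRACEBACK_RE = re.compile(r"Traceback \(most recent call last\)")

def process(entry, env="production"):
    """Return (destinations, entry), or (None, None) if the entry is dropped."""
    # (1) Parse: tag unstructured Python tracebacks so they can be grouped.
    if TRACEBACK_RE.search(entry.get("message", "")):
        entry.setdefault("fields", {})["error_type"] = "python_traceback"

    # (2) Enrich: geo lookup and service-name resolution omitted in this sketch.

    # (3) Filter: drop DEBUG logs in production by default.
    if env == "production" and entry.get("level") == "DEBUG":
        return None, None

    # (4) Route: everything indexed; errors also fan out to the alert engine.
    destinations = ["elasticsearch"]
    if entry.get("level") in ("ERROR", "FATAL"):
        destinations.append("alert_engine")
    return destinations, entry

dests, _ = process({"level": "ERROR",
                    "message": "Traceback (most recent call last): ..."})
print(dests)  # ['elasticsearch', 'alert_engine']
```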

Elasticsearch Indexing Strategy

Create a new index per day: logs-2026-04-17. This enables time-based retention (drop old logs by deleting whole indices, avoiding expensive delete-by-query operations) and efficient searches (query only the indices covering the requested time range). Index template: size shards so each stays roughly in the tens of gigabytes — e.g., 5 shards for a ~100GB daily index, more for larger days — with replicas=1. Index lifecycle policy: hot (0-7d, SSD-backed), warm (7-30d, HDD-backed, fewer replicas), delete (>30d). For 1TB/day: an Elasticsearch cluster of ~20 nodes with 8TB SSD each (covering 7 days hot plus replicas and headroom) and S3 for cold storage. Field mappings: timestamp as date, level as keyword (exact match), message as text (full-text search), service as keyword.
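The daily-index convention makes index naming and time-range index selection trivial to compute. A small helper sketch (function names are illustrative, not part of any Elasticsearch API):

```python
from datetime import date, timedelta

# Helper sketch: daily index names and time-range index selection.
def index_for(day: date) -> str:
    """Name of the daily index holding logs for the given day."""
    return f"logs-{day:%Y-%m-%d}"

def indices_for_range(start: date, end: date) -> list[str]:
    """Indices a search spanning [start, end] (inclusive) must query."""
    days = (end - start).days
    return [index_for(start + timedelta(d)) for d in range(days + 1)]

print(index_for(date(2026, 4, 17)))                            # logs-2026-04-17
print(indices_for_range(date(2026, 4, 16), date(2026, 4, 17)))
```

A "last 24 hours" search therefore touches at most two indices, regardless of how many days of logs the cluster holds.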

Real-Time Alerting

Two approaches: (1) Elasticsearch Watcher: runs a query every 60 seconds and alerts if the result count exceeds a threshold. Simple, but adds load to Elasticsearch. (2) Stream-based alerting (recommended for high volume): the log processor (Flink) maintains a 5-minute sliding-window count of errors per service; if error_count > threshold, it publishes an alert to Alertmanager → PagerDuty/Slack. Pattern matching: regex on log messages for critical patterns (OutOfMemoryError, connection refused, NullPointerException). Alert fatigue prevention: deduplication (the same alert fires at most once per 30 minutes per service) and overall rate limiting of alerts.
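The stream-based approach can be sketched as a per-service sliding window with threshold alerting and deduplication. This is a single-process sketch of the logic a Flink job would run in distributed form; class and parameter names are illustrative.

```python
from collections import deque

# Sketch: per-service sliding-window error counts with threshold alerting
# and a deduplication interval so each service alerts at most once per period.
class ErrorRateAlerter:
    def __init__(self, window_s=300, threshold=100, dedup_s=1800):
        self.window_s = window_s      # 5-minute sliding window
        self.threshold = threshold    # errors-in-window alert threshold
        self.dedup_s = dedup_s        # suppress repeat alerts for 30 minutes
        self.events = {}              # service -> deque of error timestamps
        self.last_alert = {}          # service -> time of last alert

    def record_error(self, service, now):
        q = self.events.setdefault(service, deque())
        q.append(now)
        while q and q[0] < now - self.window_s:   # evict events outside window
            q.popleft()
        if len(q) > self.threshold:
            last = self.last_alert.get(service)
            if last is None or now - last >= self.dedup_s:   # deduplication
                self.last_alert[service] = now
                return f"ALERT {service}: {len(q)} errors in {self.window_s}s"
        return None

alerter = ErrorRateAlerter(threshold=3)
alerts = [alerter.record_error("checkout", t) for t in range(5)]
print([a for a in alerts if a])   # fires once, then deduplicates
```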

Key Design Decisions

  • Fluent Bit agent: lightweight, Kubernetes-native, handles backpressure with local buffering
  • Kafka as buffer: decouples agents from downstream processing; handles spikes; enables replay
  • Daily Elasticsearch indices: efficient retention management, time-range query optimization
  • Hot/warm/cold tiering: SSD for recent logs, HDD for older, S3 for archive
  • trace_id in every log entry: enables distributed trace reconstruction from logs alone
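The last point — trace reconstruction from logs alone — amounts to filtering the aggregated logs by trace_id and sorting chronologically. A minimal sketch with illustrative data:

```python
# Sketch: reconstructing one request's path across services from
# aggregated logs, using only trace_id and timestamp.
logs = [
    {"timestamp": "2026-04-17T09:00:02Z", "service": "payment",  "trace_id": "t-1", "message": "charge ok"},
    {"timestamp": "2026-04-17T09:00:00Z", "service": "gateway",  "trace_id": "t-1", "message": "request in"},
    {"timestamp": "2026-04-17T09:00:01Z", "service": "checkout", "trace_id": "t-2", "message": "cart load"},
    {"timestamp": "2026-04-17T09:00:01Z", "service": "orders",   "trace_id": "t-1", "message": "order created"},
]

def reconstruct_trace(logs, trace_id):
    # Fixed-format ISO 8601 timestamps sort correctly as plain strings.
    return sorted((e for e in logs if e["trace_id"] == trace_id),
                  key=lambda e: e["timestamp"])

trace = reconstruct_trace(logs, "t-1")
print([e["service"] for e in trace])   # ['gateway', 'orders', 'payment']
```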


Frequently Asked Questions

Why use a log agent like Fluent Bit instead of writing logs directly to Elasticsearch?

Writing logs directly to Elasticsearch couples services to the logging infrastructure. Problems: (1) If Elasticsearch is slow or down, the service is affected — backpressure can cause request latency spikes. (2) Every service must implement retry logic, batching, and connection management for Elasticsearch. (3) Changing the logging backend (e.g., switching from Elasticsearch to Loki) requires code changes in every service. Log agents (Fluent Bit, Filebeat) decouple services from logging: services write to local files or stdout. The agent handles tailing, buffering, parsing, enrichment, and forwarding. If Kafka is slow, the agent buffers on disk (configurable, e.g., 1GB). The service sees zero impact. Agents also add metadata (pod name, namespace, node) that the service itself doesn't know.

Why create a new Elasticsearch index per day for log storage?

Daily indices enable efficient time-based operations: (1) Retention: to delete logs older than 30 days, simply delete the oldest indices (DELETE /logs-2026-03-17). No expensive DELETE query that scans millions of documents. (2) Query efficiency: searching logs from "the last 24 hours" queries only 1-2 indices instead of scanning all shards of a single large index. (3) Hot/warm/cold tiering: older indices can be moved to cheaper storage (warm = HDD-backed nodes, cold = S3 via Elasticsearch ILM) without affecting recent indices. (4) Index tuning: recent indices (hot) get more shards and replicas; archived indices (warm) get fewer. Daily indices also simplify the index lifecycle policy (ILM): hot 0-7 days on SSD, warm 7-30 days on HDD, delete after 30 days.

How does trace_id enable distributed tracing through logs?

In a microservices architecture, one user request may span 5-10 services. Without correlation, debugging requires matching timestamps across service logs — tedious and imprecise. trace_id is a UUID generated at the entry point (API gateway) and propagated in every outbound HTTP call (as a header, e.g., X-Trace-ID) and in every log entry written during that request. When debugging an error, searching for the trace_id in the log aggregator returns all log lines from all services for that single request in chronological order. Propagation: use the OpenTelemetry SDK, which automatically injects the trace_id into HTTP headers and into log entries via MDC (Mapped Diagnostic Context). span_id identifies a specific service leg within the trace. Tools: Jaeger and Zipkin visualize the trace; Kibana/Grafana Loki searches log lines by trace_id.

How do you prevent alert fatigue in a log-based alerting system?

Alert fatigue occurs when too many alerts fire, causing on-call engineers to ignore them. Prevention strategies: (1) Deduplication: the same alert pattern should fire at most once per N minutes per service — after the first alert fires, suppress subsequent identical alerts for 30 minutes. (2) Rate thresholds: alert only if error_rate > X errors/minute, not on individual errors. Smoothing: use a 5-minute rolling average, not instantaneous counts. (3) Alert grouping: group related alerts (same service, same error type) into a single notification. (4) Severity tiers: CRITICAL pages immediately, WARNING sends a Slack message, INFO is only visible in dashboards. Only page for CRITICAL. (5) Error budget alerts: alert when the error rate exceeds the SLO error budget consumption rate (e.g., "at the current error rate, you will exhaust this week's budget in 2 hours"). This is more actionable than raw error counts.

How do you handle 1TB/day log volume cost-effectively?

At 1TB/day, storing 30 days of logs = 30TB of hot storage. Cost optimization strategies: (1) Log level filtering: drop DEBUG and TRACE logs in production at the agent level (Fluent Bit filter). Debug logs can be 10x the volume of warning/error logs. (2) Sampling: for high-volume repetitive logs (health checks, batch job progress), sample 1% and discard 99%. (3) Compression: Elasticsearch uses LZ4 compression internally; segment files on disk are compressed 3-5x. (4) Hot/warm/cold tiering: Elasticsearch ILM — hot (SSD, 7 days), warm (HDD, 30 days, reduced replicas), cold (S3 via searchable snapshots). SSD is 4x more expensive than HDD, and HDD is 10x more expensive than S3. (5) Log routing: only send ERROR and above to Elasticsearch (expensive, searchable); send all logs to S3 (cheap archive). Query S3 via Athena for historical analysis.
