What is a Log Aggregator?
A log aggregator collects logs from distributed services, normalizes them, stores them for search, and enables real-time alerting. In a microservices architecture with 100 services and 1000 instances, logs are scattered across thousands of machines. Without aggregation, debugging requires SSHing to individual machines. Examples: ELK Stack (Elasticsearch + Logstash + Kibana), Grafana Loki, Splunk, AWS CloudWatch Logs.
Requirements
- Collect logs from 1000 service instances (structured JSON + unstructured text)
- Ingest 100MB/second, 1TB/day total volume
- Full-text search across logs within 30 seconds of ingestion
- Retention: 30 days hot storage (searchable), 1 year cold storage (archived)
- Real-time alerting: alert within 60 seconds of an error pattern appearing
- Grafana dashboards for log volume, error rate by service
Architecture
Service Instances → Log Agent (Fluent Bit / Filebeat) → Kafka
→ Log Processor (Logstash / Flink)
→ Elasticsearch (hot, 30d)
→ S3 (cold, 1yr)
→ Alert Engine (Prometheus/Alertmanager)
→ Kibana / Grafana (visualization)
Log Agent (Fluent Bit)
A lightweight agent runs on each host (sidecar in Kubernetes, daemon on bare metal). Responsibilities: tail log files, parse structured JSON logs, add metadata (host, service name, environment, pod_id), buffer locally (in case of backpressure), and forward to Kafka. Fluent Bit uses <1% CPU and ~20MB RAM per instance — minimal overhead. Configuration: tail /var/log/app/*.log, parse JSON, add Kubernetes pod labels as tags, forward to Kafka topic logs.
Data Model (Log Entry)
LogEntry {
timestamp: ISO8601
level: ENUM(DEBUG, INFO, WARN, ERROR, FATAL)
service: string
host: string
trace_id: UUID (correlates logs across services for one request)
span_id: UUID
message: string
fields: {key: value, ...} // structured fields
raw: string // original log line
}
Log Processing Pipeline
Kafka consumer processes log entries: (1) Parse unstructured logs (regex patterns for common formats: nginx, MySQL, Python tracebacks). (2) Enrich: add geo from IP, resolve service name from host. (3) Filter: drop DEBUG logs in production by default (configurable). (4) Route: ERROR/FATAL → alert engine immediately. All logs → Elasticsearch. Logs older than 30 days → S3 archiver.
Elasticsearch Indexing Strategy
Create a new index per day: logs-2026-04-17. This enables time-based retention (delete old indices without expensive DELETE queries) and efficient searches (query only indices within the time range). Index template: configure shards=5 (for 100GB/day index), replicas=1. Index lifecycle policy: hot (0-7d, SSD-backed), warm (7-30d, HDD-backed, fewer replicas), delete (>30d). For 1TB/day: Elasticsearch cluster of ~20 nodes with 8TB SSD each (for 7 days hot) and S3 for cold storage. Field mappings: timestamp as date, level as keyword (exact match), message as text (full-text search), service as keyword.
Real-Time Alerting
Two approaches: (1) Elasticsearch watcher: runs a query every 60 seconds, alerts if result count exceeds threshold. Simple but adds load to Elasticsearch. (2) Stream-based alerting (recommended for high volume): the log processor (Flink) maintains a sliding window (5-minute) count of errors per service. If error_count > threshold: publish alert to Alertmanager → PagerDuty/Slack. Pattern matching: regex on log messages for critical patterns (OutOfMemoryError, connection refused, NullPointerException). Alert fatigue prevention: deduplication (same alert fires at most once per 30 minutes per service), rate limit alerts.
Key Design Decisions
- Fluent Bit agent: lightweight, Kubernetes-native, handles backpressure with local buffering
- Kafka as buffer: decouples agents from downstream processing; handles spikes; enables replay
- Daily Elasticsearch indices: efficient retention management, time-range query optimization
- Hot/warm/cold tiering: SSD for recent logs, HDD for older, S3 for archive
- trace_id in every log entry: enables distributed trace reconstruction from logs alone
Databricks system design covers observability and log aggregation. See common questions for Databricks interview: log aggregation and observability design.
Atlassian system design covers distributed logging and monitoring. Review patterns for Atlassian interview: log aggregation and monitoring system design.
Netflix system design covers large-scale log aggregation and observability. See design patterns for Netflix interview: log aggregation and observability at scale.