A log aggregation system collects log data from distributed services, centralizes it for storage and search, and provides query and alerting capabilities. The pipeline must handle high write throughput (millions of log lines per second), variable-length unstructured data, and complex search queries over petabytes of historical logs.
Log Collection
Agents run on each host or as Kubernetes DaemonSets to collect logs from files and container stdout/stderr. Common agents: Fluentd (Ruby, flexible plugin ecosystem), Fluent Bit (C, lightweight for resource-constrained environments), Logstash (JVM-based, powerful processing pipeline), Promtail (for Loki). Agents tail log files, parse log lines into structured events, enrich them with host metadata (hostname, environment, service name), and forward them to the aggregation backend.
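The enrichment step can be sketched as follows. This is a minimal illustration of what an agent does after tailing a raw line; the field names (`message`, `host`, `service`, `environment`) are assumptions, not any particular agent's schema.

```python
import json
import socket

def enrich(raw_line: str, service: str, environment: str) -> dict:
    """Wrap a raw log line in a structured event with host metadata."""
    return {
        "message": raw_line.rstrip("\n"),
        "host": socket.gethostname(),     # added by the agent, not the app
        "service": service,
        "environment": environment,
    }

event = enrich("connection refused to db:5432\n", "payment-service", "prod")
print(json.dumps(event, sort_keys=True))
```

A real agent would batch these events and forward them over the transport layer described next.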
Transport Layer
Logs are buffered in the agent (in-memory or on-disk) and forwarded to a message broker (Kafka) or directly to the storage backend. Kafka acts as a buffer: it absorbs bursts of log volume, decouples agents from the storage backend, and allows multiple consumers (Elasticsearch indexer, S3 archiver, real-time alerting). If the storage backend is temporarily unavailable, Kafka retains logs until it recovers, so nothing is lost. Partition topics by service name so consumers can process each service's logs in parallel while preserving per-service ordering.
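Key-based partitioning works roughly like this sketch, assuming a topic with a fixed partition count. Kafka clients hash the message key to pick a partition; `crc32` stands in for that hash here, and the partition count is invented.

```python
import zlib

NUM_PARTITIONS = 12  # assumed topic configuration

def partition_for(service_name: str) -> int:
    """All logs keyed by one service land on one partition, preserving order."""
    return zlib.crc32(service_name.encode("utf-8")) % NUM_PARTITIONS

# Same key always maps to the same partition, so one consumer sees
# payment-service's logs in the order they were produced.
assert partition_for("payment-service") == partition_for("payment-service")
print(partition_for("payment-service"))
```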
Log Processing and Parsing
Raw log lines are parsed into structured JSON: timestamp (parsed from log line or added by agent), log level (INFO, WARN, ERROR), service name, trace_id, message, and additional fields. Grok patterns (regex-based) parse common log formats (Apache access logs, exception stack traces). Stream processors (Flink, Logstash) apply transformations: masking PII (credit card numbers, emails), enriching with user metadata, routing to different topics based on log level.
Elasticsearch for Full-Text Search
Elasticsearch (or OpenSearch) indexes logs for full-text search and aggregation. Each log line becomes an Elasticsearch document. Time-based index naming (logs-2026.04.17) enables efficient deletion of old data by dropping entire indices. The ELK stack (Elasticsearch + Logstash + Kibana) is the traditional log aggregation solution. Kibana provides query UI, dashboards, and alerting on log data. Elasticsearch scales horizontally; sharding partitions data across nodes.
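Time-based index routing amounts to deriving the index name from the event timestamp, so retention becomes a cheap "delete old indices" job rather than per-document deletes. A sketch, with the `logs-` prefix taken from the naming example above:

```python
from datetime import datetime, timezone

def index_for(ts: datetime, prefix: str = "logs") -> str:
    """Route a log event to a daily index based on its timestamp."""
    return f"{prefix}-{ts.strftime('%Y.%m.%d')}"

ts = datetime(2026, 4, 17, 9, 30, tzinfo=timezone.utc)
print(index_for(ts))  # logs-2026.04.17
```

Dropping an entire day's index is a single metadata operation, which is why this layout beats deleting individual documents by timestamp range.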
Grafana Loki
Loki is a log aggregation system from Grafana optimized for low cost. Unlike Elasticsearch, Loki does not index the content of log lines — only labels (service, environment, host). Log content is stored compressed in chunks in object storage (S3). Queries use LogQL to filter by labels first (cheap, in-memory label index lookup) then grep through matching chunks. This reduces indexing cost by 10-100x compared to Elasticsearch at the cost of slower full-text search.
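The two-phase query model can be shown with a toy in-memory version: a small label index selects chunks, then the query scans only those chunks. The chunk IDs and log lines are invented; real Loki stores compressed chunks in object storage and evaluates LogQL, not substring match.

```python
# Phase-1 index: labels -> chunk IDs (small, cheap to query).
chunk_index = {
    ("payment-service", "prod"): ["chunk-1", "chunk-2"],
    ("checkout", "prod"): ["chunk-3"],
}
# Phase-2 data: chunk ID -> raw log lines (large, expensive to scan).
chunks = {
    "chunk-1": ["charge ok", "charge failed: card declined"],
    "chunk-2": ["refund ok"],
    "chunk-3": ["cart updated"],
}

def query(service: str, env: str, needle: str) -> list[str]:
    # Phase 1: label lookup narrows the search to a few chunks.
    candidate_chunks = chunk_index.get((service, env), [])
    # Phase 2: brute-force scan over only those chunks.
    return [line
            for cid in candidate_chunks
            for line in chunks[cid]
            if needle in line]

print(query("payment-service", "prod", "failed"))
```

The design trade-off is visible here: phase 1 touches a tiny index, while phase 2's cost grows with the volume of logs matching the labels, which is why narrow labels matter in Loki.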
Retention and Cold Storage
Hot logs (the most recent 7-30 days) are stored in Elasticsearch or Loki for fast search. Cold logs (30+ days) are archived to S3 in a compressed columnar format (Parquet) for long-term retention at low cost. Query cold logs with Athena (SQL over S3 Parquet) when historical analysis is needed. Most queries target recent logs; cold storage handles compliance and forensic use cases. Lifecycle policies automatically move logs from hot to cold storage based on age.
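An age-based lifecycle policy reduces to a simple tier decision per log (or per index). A sketch, assuming a 30-day hot window at the upper end of the range above; the threshold is deployment-specific.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 30  # assumed retention boundary

def storage_tier(log_ts: datetime, now: datetime) -> str:
    """Decide whether a log of this age belongs in hot or cold storage."""
    age = now - log_ts
    return "hot" if age <= timedelta(days=HOT_DAYS) else "cold"

now = datetime(2026, 4, 17, tzinfo=timezone.utc)
print(storage_tier(now - timedelta(days=3), now))   # hot
print(storage_tier(now - timedelta(days=90), now))  # cold
```

In practice the policy runs against whole daily indices or chunk files rather than individual events, so "move to cold" is a bulk reindex-and-delete or object-storage copy.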
Alerting on Logs
Log-based alerting detects error conditions: “if the count of ERROR log lines for service payment-service exceeds 100 in a 5-minute window, page on-call.” Implement with: Elasticsearch watchers/alerting rules, Grafana Loki alert rules, or a stream processor (Flink) consuming from Kafka that counts errors and triggers alerts. Route alerts to PagerDuty, Slack, or OpsGenie. Log alerts complement metric alerts: logs provide context (which user, which request) that metrics lack.
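The threshold rule quoted above can be sketched as a sliding-window counter, the way a stream processor consuming from Kafka might evaluate it. Timestamps are epoch seconds; the boolean return value stands in for routing to PagerDuty or Slack.

```python
from collections import deque

WINDOW_SECONDS = 300  # 5-minute window from the rule
THRESHOLD = 100       # error count from the rule

class ErrorRateAlert:
    """Fires when ERROR events in the trailing window exceed the threshold."""

    def __init__(self) -> None:
        self.error_times: deque[float] = deque()

    def observe(self, ts: float, level: str) -> bool:
        """Feed one log event; return True if the alert should fire."""
        if level == "ERROR":
            self.error_times.append(ts)
        # Evict events that fell out of the 5-minute window.
        while self.error_times and self.error_times[0] < ts - WINDOW_SECONDS:
            self.error_times.popleft()
        return len(self.error_times) > THRESHOLD

alert = ErrorRateAlert()
fired = any(alert.observe(float(t), "ERROR") for t in range(150))
print(fired)  # True: 150 errors within 150 seconds exceeds the threshold
```

This assumes events arrive roughly in timestamp order; a production processor (e.g. Flink with event-time windows) also handles late and out-of-order data, which this sketch does not.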