A log aggregation system collects log data from distributed services, centralizes it for storage and search, and provides query and alerting capabilities. The pipeline must handle high write throughput (millions of log lines per second), variable-length unstructured data, and complex search queries over petabytes of historical logs.
Log Collection
Agents run on each host or as Kubernetes DaemonSets to collect logs from files and container stdout/stderr. Common agents: Fluentd (Ruby, flexible plugin ecosystem), Fluent Bit (C, lightweight for resource-constrained environments), Logstash (JVM-based, powerful processing pipeline), Promtail (for Loki). Agents tail log files, parse log lines into structured events, enrich them with host metadata (hostname, environment, service name), and forward them to the aggregation backend.
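The enrichment step can be sketched as follows. This is a minimal illustration of what an agent does after tailing a raw line; the field names (`message`, `host`, `service`, `environment`) are assumptions, not any particular agent's schema.

```python
import json
import socket

def enrich(raw_line: str, service: str, environment: str) -> dict:
    """Wrap a raw log line in a structured event with host metadata."""
    return {
        "message": raw_line.rstrip("\n"),
        "host": socket.gethostname(),     # added by the agent, not the app
        "service": service,
        "environment": environment,
    }

event = enrich("connection refused to db:5432\n", "payment-service", "prod")
print(json.dumps(event, sort_keys=True))
```

A real agent would batch these events and forward them over the transport layer described next.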
Transport Layer
Logs are buffered in the agent (in-memory or on-disk) and forwarded to a message broker (Kafka) or directly to the storage backend. Kafka acts as a buffer: it absorbs bursts of log volume, decouples agents from the storage backend, and allows multiple consumers (Elasticsearch indexer, S3 archiver, real-time alerting). If the storage backend is temporarily unavailable, Kafka retains logs until it recovers, so nothing is lost. Partition topics by service name so consumers can process each service's logs in parallel while preserving per-service ordering.
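Key-based partitioning works roughly like this sketch, assuming a topic with a fixed partition count. Kafka clients hash the message key to pick a partition; `crc32` stands in for that hash here, and the partition count is invented.

```python
import zlib

NUM_PARTITIONS = 12  # assumed topic configuration

def partition_for(service_name: str) -> int:
    """All logs keyed by one service land on one partition, preserving order."""
    return zlib.crc32(service_name.encode("utf-8")) % NUM_PARTITIONS

# Same key always maps to the same partition, so one consumer sees
# payment-service's logs in the order they were produced.
assert partition_for("payment-service") == partition_for("payment-service")
print(partition_for("payment-service"))
```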
Log Processing and Parsing
Raw log lines are parsed into structured JSON: timestamp (parsed from log line or added by agent), log level (INFO, WARN, ERROR), service name, trace_id, message, and additional fields. Grok patterns (regex-based) parse common log formats (Apache access logs, exception stack traces). Stream processors (Flink, Logstash) apply transformations: masking PII (credit card numbers, emails), enriching with user metadata, routing to different topics based on log level.
Elasticsearch for Full-Text Search
Elasticsearch (or OpenSearch) indexes logs for full-text search and aggregation. Each log line becomes an Elasticsearch document. Time-based index naming (logs-2026.04.17) enables efficient deletion of old data by dropping entire indices. The ELK stack (Elasticsearch + Logstash + Kibana) is the traditional log aggregation solution. Kibana provides query UI, dashboards, and alerting on log data. Elasticsearch scales horizontally; sharding partitions data across nodes.
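Time-based index routing amounts to deriving the index name from the event timestamp, so retention becomes a cheap "delete old indices" job rather than per-document deletes. A sketch, with the `logs-` prefix taken from the naming example above:

```python
from datetime import datetime, timezone

def index_for(ts: datetime, prefix: str = "logs") -> str:
    """Route a log event to a daily index based on its timestamp."""
    return f"{prefix}-{ts.strftime('%Y.%m.%d')}"

ts = datetime(2026, 4, 17, 9, 30, tzinfo=timezone.utc)
print(index_for(ts))  # logs-2026.04.17
```

Dropping an entire day's index is a single metadata operation, which is why this layout beats deleting individual documents by timestamp range.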
Grafana Loki
Loki is a log aggregation system from Grafana optimized for low cost. Unlike Elasticsearch, Loki does not index the content of log lines — only labels (service, environment, host). Log content is stored compressed in chunks in object storage (S3). Queries use LogQL to filter by labels first (cheap, in-memory label index lookup) then grep through matching chunks. This reduces indexing cost by 10-100x compared to Elasticsearch at the cost of slower full-text search.
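The two-phase query model can be shown with a toy in-memory version: a small label index selects chunks, then the query scans only those chunks. The chunk IDs and log lines are invented; real Loki stores compressed chunks in object storage and evaluates LogQL, not substring match.

```python
# Phase-1 index: labels -> chunk IDs (small, cheap to query).
chunk_index = {
    ("payment-service", "prod"): ["chunk-1", "chunk-2"],
    ("checkout", "prod"): ["chunk-3"],
}
# Phase-2 data: chunk ID -> raw log lines (large, expensive to scan).
chunks = {
    "chunk-1": ["charge ok", "charge failed: card declined"],
    "chunk-2": ["refund ok"],
    "chunk-3": ["cart updated"],
}

def query(service: str, env: str, needle: str) -> list[str]:
    # Phase 1: label lookup narrows the search to a few chunks.
    candidate_chunks = chunk_index.get((service, env), [])
    # Phase 2: brute-force scan over only those chunks.
    return [line
            for cid in candidate_chunks
            for line in chunks[cid]
            if needle in line]

print(query("payment-service", "prod", "failed"))
```

The design trade-off is visible here: phase 1 touches a tiny index, while phase 2's cost grows with the volume of logs matching the labels, which is why narrow labels matter in Loki.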
Retention and Cold Storage
Hot logs (the most recent 7-30 days) are stored in Elasticsearch or Loki for fast search. Cold logs (30+ days) are archived to S3 in a compressed columnar format (Parquet) for long-term retention at low cost. Query cold logs with Athena (SQL over S3 Parquet) when historical analysis is needed. Most queries target recent logs; cold storage handles compliance and forensic use cases. Lifecycle policies automatically move logs from hot to cold storage based on age.
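An age-based lifecycle policy reduces to a simple tier decision per log (or per index). A sketch, assuming a 30-day hot window at the upper end of the range above; the threshold is deployment-specific.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 30  # assumed retention boundary

def storage_tier(log_ts: datetime, now: datetime) -> str:
    """Decide whether a log of this age belongs in hot or cold storage."""
    age = now - log_ts
    return "hot" if age <= timedelta(days=HOT_DAYS) else "cold"

now = datetime(2026, 4, 17, tzinfo=timezone.utc)
print(storage_tier(now - timedelta(days=3), now))   # hot
print(storage_tier(now - timedelta(days=90), now))  # cold
```

In practice the policy runs against whole daily indices or chunk files rather than individual events, so "move to cold" is a bulk reindex-and-delete or object-storage copy.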
Alerting on Logs
Log-based alerting detects error conditions: “if the count of ERROR log lines for service payment-service exceeds 100 in a 5-minute window, page on-call.” Implement with: Elasticsearch watchers/alerting rules, Grafana Loki alert rules, or a stream processor (Flink) consuming from Kafka that counts errors and triggers alerts. Route alerts to PagerDuty, Slack, or OpsGenie. Log alerts complement metric alerts: logs provide context (which user, which request) that metrics lack.
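The threshold rule quoted above can be sketched as a sliding-window counter, the way a stream processor consuming from Kafka might evaluate it. Timestamps are epoch seconds; the boolean return value stands in for routing to PagerDuty or Slack.

```python
from collections import deque

WINDOW_SECONDS = 300  # 5-minute window from the rule
THRESHOLD = 100       # error count from the rule

class ErrorRateAlert:
    """Fires when ERROR events in the trailing window exceed the threshold."""

    def __init__(self) -> None:
        self.error_times: deque[float] = deque()

    def observe(self, ts: float, level: str) -> bool:
        """Feed one log event; return True if the alert should fire."""
        if level == "ERROR":
            self.error_times.append(ts)
        # Evict events that fell out of the 5-minute window.
        while self.error_times and self.error_times[0] < ts - WINDOW_SECONDS:
            self.error_times.popleft()
        return len(self.error_times) > THRESHOLD

alert = ErrorRateAlert()
fired = any(alert.observe(float(t), "ERROR") for t in range(150))
print(fired)  # True: 150 errors within 150 seconds exceeds the threshold
```

This assumes events arrive roughly in timestamp order; a production processor (e.g. Flink with event-time windows) also handles late and out-of-order data, which this sketch does not.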