A centralized logging system collects, indexes, and makes searchable the logs from hundreds or thousands of microservices. Without it, debugging a production issue requires SSH-ing into individual servers and grepping through files — impossible at scale. This guide covers the architecture of a production logging pipeline from collection to search, with a focus on cost optimization — the dominant concern at scale, where logging infrastructure often costs more than the application itself.
Structured Logging: The Foundation
Unstructured logs ("User 123 placed order 456 at 2026-04-20T10:30:00Z") require regex parsing to extract fields. At scale, parsing millions of logs per second with regex is CPU-intensive and error-prone. Structured logging emits logs as JSON with consistent field names: {"timestamp": "2026-04-20T10:30:00Z", "level": "info", "service": "order-service", "trace_id": "abc123", "user_id": 123, "order_id": 456, "message": "Order placed", "duration_ms": 45}. Benefits: (1) No parsing needed — the log collector forwards JSON directly to the indexer. (2) Consistent field names enable efficient indexing and filtering: service:"order-service" AND level:"error". (3) Correlation — trace_id links logs across services for distributed tracing. (4) Machine-readable — automated alerting and anomaly detection can process structured fields without regex. Every log entry must include: timestamp (ISO 8601), level (debug/info/warn/error), service name, trace_id (for request correlation), and message. Additional context fields (user_id, order_id, duration_ms) make logs more useful for debugging.
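An entry like the one above can be produced with a small JSON formatter on top of Python's standard logging module. This is a minimal sketch: the field names follow the example above, and in a real service trace_id would be pulled from request context rather than passed manually.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line with the required fields."""
    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "trace_id": getattr(record, "trace_id", None),  # set via extra=
            "message": record.getMessage(),
        }
        # Optional context fields (user_id, order_id, duration_ms, ...)
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler(sys.stdout)  # containers log to stdout
handler.setFormatter(JsonFormatter("order-service"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Order placed", extra={
    "trace_id": "abc123",
    "context": {"user_id": 123, "order_id": 456, "duration_ms": 45},
})
```

Because the output goes to stdout as one JSON object per line, the collection pipeline described below can forward it without any parsing rules.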
Log Collection Pipeline
Collection architecture: (1) Application writes logs to stdout (container best practice). The container runtime captures stdout and writes to a log file on the node (/var/log/containers/). (2) A log collector agent runs as a DaemonSet on each Kubernetes node. Fluent Bit (lightweight, C-based) or Fluentd (feature-rich, Ruby-based) tails the container log files, parses JSON, enriches with Kubernetes metadata (pod name, namespace, node, labels), and forwards to the log aggregator. (3) A buffer/aggregator (Kafka or Fluentd aggregator) absorbs traffic spikes and provides durability. Without a buffer, a logging backend outage causes log loss. Kafka retains logs for hours/days, allowing the backend to catch up. (4) The log indexer (Elasticsearch, Loki, or a cloud service) receives logs, indexes them, and makes them searchable. Why Fluent Bit over Fluentd: Fluent Bit uses 10-50 MB of memory per node vs 100-500 MB for Fluentd. For a 1000-node cluster, this saves significant resources. Use Fluent Bit for collection and Fluentd for aggregation/routing (if complex routing rules are needed).
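Steps 2-3 of the pipeline can be sketched as a Fluent Bit configuration along these lines. This is illustrative, not a drop-in file: the parser name depends on your container runtime, and the Kafka broker address and topic name are assumptions.

```ini
[INPUT]
    Name           tail
    Path           /var/log/containers/*.log
    Parser         cri              # or 'docker', depending on runtime
    Tag            kube.*
    Mem_Buf_Limit  16MB             # bound per-node memory use

[FILTER]
    Name           kubernetes       # enrich with pod, namespace, labels
    Match          kube.*
    Merge_Log      On               # lift the JSON log body into fields

[OUTPUT]
    Name           kafka
    Match          kube.*
    Brokers        kafka.logging.svc:9092   # illustrative address
    Topics         container-logs
```

Keeping the per-node agent this simple — tail, enrich, forward — and pushing routing and filtering logic to the aggregation tier is what makes the DaemonSet footprint small.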
Elasticsearch vs Grafana Loki
Elasticsearch: indexes the full content of every log line using an inverted index. Enables fast full-text search across any field (“find all logs containing timeout error”). Powerful but expensive: the inverted index consumes significant CPU, memory, and storage. A cluster ingesting 1 TB of logs per day typically needs 10-20 nodes with 64 GB RAM each. Storage: the index can be larger than the raw log data. Grafana Loki: indexes only labels (service name, namespace, pod, log level), not log content. Log content is stored as compressed chunks in object storage (S3). Queries filter by labels first (fast, indexed), then grep through matching chunks (slower for broad text searches). Cost: 10-100x cheaper than Elasticsearch because object storage is cheap and there is no content indexing overhead. Trade-off: Loki is slower for broad text searches across large time ranges. Searching for a specific error message across all services for 30 days is fast in Elasticsearch (indexed) but slow in Loki (must scan chunks). Loki excels when you know the service and time range — the most common debugging pattern. Decision: use Elasticsearch when full-text search across all logs is critical (security analysis, compliance auditing). Use Loki when cost is a concern and most queries filter by service + time range (the typical debugging workflow).
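The two query models can be illustrated with roughly equivalent queries (sketches; the label and field names are carried over from the structured-logging example above). Elasticsearch resolves everything against the inverted index; Loki narrows by indexed labels first and then greps the matching chunks.

```
# Loki (LogQL): label filters are indexed, the |= content match is a
# grep over the chunks that survive the label filter
{service="order-service", level="error"} |= "timeout"

# Elasticsearch (Lucene query string): every field, including the
# message content, is resolved from the inverted index
service:"order-service" AND level:"error" AND message:timeout
```

For a narrow time range the two perform similarly; the difference appears when the content match must scan weeks of chunks across many services.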
Log Retention and Cost Optimization
Log storage is the largest cost in observability. A medium-sized company generating 5 TB of logs per day stores 150 TB per month. At typical Elasticsearch prices, indexing and retaining that volume in full is prohibitively expensive. Cost optimization strategies: (1) Log level filtering — drop debug and trace logs in production. Only forward info, warn, and error. Debug logs are 50-80% of total volume. Filter at the collection agent (Fluent Bit) before sending to the backend. (2) Sampling — for high-volume, low-value logs (health check access logs, heartbeat logs), sample 10% instead of keeping all. (3) Tiered retention — keep full-resolution logs for 7 days in hot storage (Elasticsearch/Loki with SSD). Move to warm storage (cheaper nodes, compressed) for 30 days. Archive to S3 for 1 year (for compliance). Delete after the compliance period. (4) Index lifecycle management (ILM) — Elasticsearch ILM automates the hot -> warm -> cold -> delete transitions based on index age. (5) Field reduction — do not index fields that are never searched. Stack traces are valuable in error logs but do not need to be indexed as individual terms. Store as a non-indexed field. (6) Compression — Loki compresses log chunks 5-10x. Elasticsearch uses LZ4 compression on stored fields. Both reduce storage significantly.
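The tiered retention in (3) and the automation in (4) can be expressed as a single Elasticsearch ILM policy, sketched below. The ages mirror the 7/30/365-day tiers above; the rollover sizes and the snapshot repository name are assumptions, and the cold-phase searchable_snapshot action requires a license tier that supports it (allocating to cheaper nodes is the alternative).

```json
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "s3-archive" }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

Once attached to an index template, the policy moves each daily index through the tiers with no operator involvement.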
Log-Based Alerting and Anomaly Detection
Logs complement metrics for alerting: metrics tell you something is wrong (error rate increased); logs tell you what is wrong (the specific error message and context). Log-based alerts: (1) Error rate from logs — count error-level logs per service per minute. Alert if the count exceeds a threshold or deviates from the baseline. (2) Specific error patterns — alert when a log matches a critical pattern: “database connection refused”, “out of memory”, “certificate expired”. (3) Absence detection — alert if a specific log is NOT seen within a time window. Example: a scheduled job should log “daily report completed” every day. Alert if the log is missing for 25 hours. Implementation: Elasticsearch Watcher or Loki alerting rules (LogQL queries evaluated periodically). Anomaly detection: ML-based log analysis detects unusual patterns: (1) Log volume anomalies — a service suddenly emitting 10x its normal log volume indicates a problem (error loop, verbose logging accidentally enabled). (2) New error types — an error message that has never been seen before may indicate a new bug. Tools: Elastic ML (built into Elasticsearch), or custom anomaly detection on log metrics exported to Prometheus.
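The error-rate alert in (1) and the absence detection in (3) can be written as Loki ruler rules — Prometheus-style rule files whose expressions are LogQL, evaluated periodically. This is a sketch: the service labels and thresholds are assumptions, and absent_over_time is used for the absence case because a plain count returns no samples at all (rather than zero) when nothing matches.

```yaml
groups:
  - name: log-alerts
    rules:
      # (1) Error rate: more than 10 error-level logs/sec over 5 minutes
      - alert: HighErrorLogRate
        expr: sum(rate({service="order-service"} | json | level="error" [5m])) > 10
        for: 5m
        labels:
          severity: page

      # (3) Absence detection: no "daily report completed" log in 25 hours
      - alert: DailyReportMissing
        expr: absent_over_time({service="report-job"} |= "daily report completed" [25h])
        labels:
          severity: ticket
```

Because these rules run inside the logging backend, they catch failure modes — like a job that silently stops running — that metric-based alerts on the application itself cannot.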
{"@context":"https://schema.org","@type":"FAQPage","mainEntity":[{"@type":"Question","name":"Why is structured logging important and what fields should every log include?","acceptedAnswer":{"@type":"Answer","text":"Structured logging emits logs as JSON with consistent field names instead of unstructured text strings. Benefits: no regex parsing needed (the log collector forwards JSON directly), consistent fields enable efficient filtering (service:order-service AND level:error), trace_id links logs across services for distributed tracing, and automated systems can process structured fields without regex. Every log entry must include: timestamp (ISO 8601), level (debug/info/warn/error), service name, trace_id (for cross-service request correlation), and message. Additional context fields (user_id, order_id, duration_ms) make debugging faster. Without structured logging, finding all error logs for a specific user across 50 microservices requires different regex patterns for each service log format."}},{"@type":"Question","name":"When should you use Elasticsearch versus Grafana Loki for logs?","acceptedAnswer":{"@type":"Answer","text":"Elasticsearch: indexes full log content using an inverted index. Enables fast full-text search across any field. Powerful but expensive: a cluster ingesting 1 TB/day typically needs 10-20 nodes with 64 GB RAM each. The index can be larger than raw log data. Use when: full-text search across all logs is critical (security analysis, compliance auditing). Grafana Loki: indexes only labels (service, namespace, level), not content. Log content is compressed in S3. Queries filter by labels first (fast), then grep content (slower for broad searches). Cost: 10-100x cheaper than Elasticsearch. Use when: most queries filter by service + time range (typical debugging), and cost is a concern. The trade-off is clear: Elasticsearch provides Google-like search over all logs. Loki provides grep over a well-filtered subset. For most operational debugging (I know the service and approximate time), Loki is sufficient and dramatically cheaper."}},{"@type":"Question","name":"How do you optimize logging costs at scale?","acceptedAnswer":{"@type":"Answer","text":"Log storage dominates observability costs. For 5 TB/day: 150 TB/month. Optimization strategies: (1) Drop debug/trace logs in production — they are 50-80% of volume. Filter at the collection agent (Fluent Bit) before reaching the backend. (2) Sample high-volume, low-value logs — health check access logs at 10% sampling instead of 100%. (3) Tiered retention — full resolution for 7 days (hot, SSD), compressed for 30 days (warm), archived to S3 for 1 year (cold, compliance), then delete. (4) Index lifecycle management (ILM) — Elasticsearch automates hot->warm->cold->delete transitions. (5) Field reduction — do not index fields never searched. Stack traces do not need per-word indexing. (6) Compression — Loki compresses 5-10x. These combined typically reduce costs by 5-10x compared to indexing every log at full resolution indefinitely."}},{"@type":"Question","name":"How does the log collection pipeline work in Kubernetes?","acceptedAnswer":{"@type":"Answer","text":"Pipeline: (1) Applications write to stdout (container best practice). The container runtime captures stdout to /var/log/containers/ on the node. (2) Fluent Bit runs as a DaemonSet on each node. It tails container log files, parses JSON, enriches with Kubernetes metadata (pod name, namespace, node, labels), and forwards to the aggregator. Fluent Bit uses 10-50 MB RAM per node (vs 100-500 MB for Fluentd). (3) Kafka acts as a buffer between collectors and the indexer. Absorbs traffic spikes and provides durability — if the logging backend is down, Kafka retains logs until it recovers. (4) The indexer (Elasticsearch or Loki) receives logs, indexes them, and serves search queries. Without Kafka buffer: a logging backend outage causes log loss. With Kafka: logs are retained for hours/days, and the backend processes the backlog when it recovers."}}]}