Log Aggregation System Low-Level Design

What is a Log Aggregator?

A log aggregator collects logs from distributed services, normalizes them, stores them for search, and enables real-time alerting. In a microservices architecture with 100 services and 1000 instances, logs are scattered across thousands of machines. Without aggregation, debugging requires SSHing to individual machines. Examples: ELK Stack (Elasticsearch + Logstash + Kibana), Grafana Loki, Splunk, AWS CloudWatch Logs.

Requirements

  • Collect logs from 1000 service instances (structured JSON + unstructured text)
  • Ingest peaks of 100MB/second; ~1TB/day total volume (≈12MB/second sustained average)
  • Full-text search across logs within 30 seconds of ingestion
  • Retention: 30 days hot storage (searchable), 1 year cold storage (archived)
  • Real-time alerting: alert within 60 seconds of an error pattern appearing
  • Grafana dashboards for log volume, error rate by service
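These numbers can be sanity-checked with a quick back-of-envelope sketch. The 30-day hot window and replicas=1 come from the storage sections below; the arithmetic also shows that 100MB/second must be a peak figure, since the sustained average implied by 1TB/day is about 12MB/second.

```python
# Back-of-envelope capacity estimate for the requirements above.
# Assumptions: 1 TB/day average volume, 30-day hot retention, replicas=1.

TB = 10**12  # decimal terabyte, in bytes
MB = 10**6

daily_volume = 1 * TB
sustained_rate = daily_volume / 86_400     # bytes/second, averaged over a day
hot_primary = 30 * daily_volume            # 30 days of searchable logs
hot_with_replica = hot_primary * 2         # replicas=1 doubles raw storage

print(f"sustained ingest ≈ {sustained_rate / MB:.1f} MB/s")          # ≈ 11.6 MB/s
print(f"hot storage (primary only) = {hot_primary / TB:.0f} TB")     # 30 TB
print(f"hot storage (with replica) = {hot_with_replica / TB:.0f} TB")# 60 TB
```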

Architecture

Service Instances → Log Agent (Fluent Bit / Filebeat) → Kafka
Kafka → Log Processor (Logstash / Flink) → Elasticsearch (hot, 30d)
                                         → S3 (cold, 1yr)
                                         → Alert Engine (Prometheus/Alertmanager)
Elasticsearch → Kibana / Grafana (visualization)

Log Agent (Fluent Bit)

A lightweight agent runs on each host (sidecar in Kubernetes, daemon on bare metal). Responsibilities: tail log files, parse structured JSON logs, add metadata (host, service name, environment, pod_id), buffer locally (in case of backpressure), and forward to Kafka. Fluent Bit uses <1% CPU and ~20MB RAM per instance — minimal overhead. Configuration: tail /var/log/app/*.log, parse JSON, add Kubernetes pod labels as tags, forward to Kafka topic logs.
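A minimal Fluent Bit configuration along these lines might look like the following. This is an illustrative sketch, not a production config: the broker addresses, file paths, and buffer location are placeholders, and exact option names should be checked against the Fluent Bit documentation for your version.

```ini
[SERVICE]
    # Placeholder path for on-disk buffering during Kafka backpressure
    storage.path    /var/lib/fluent-bit/buffer

[INPUT]
    Name            tail
    Path            /var/log/app/*.log
    Parser          json
    storage.type    filesystem

[OUTPUT]
    Name            kafka
    Match           *
    Brokers         kafka-1:9092,kafka-2:9092
    Topics          logs
```

In Kubernetes, the `kubernetes` filter would additionally attach pod labels and namespace metadata, as described above.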

Data Model (Log Entry)

LogEntry {
    timestamp: ISO8601
    level: ENUM(DEBUG, INFO, WARN, ERROR, FATAL)
    service: string
    host: string
    trace_id: UUID (correlates logs across services for one request)
    span_id: UUID
    message: string
    fields: {key: value, ...}  // structured fields
    raw: string                // original log line
}
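A concrete instance of this model, serialized as a single JSON log line, might look like the following (all field values are illustrative):

```python
import json

# An illustrative LogEntry, serialized to one JSON line as the agent
# would ship it to Kafka.
entry = {
    "timestamp": "2026-04-17T09:31:22.481Z",
    "level": "ERROR",
    "service": "checkout",
    "host": "node-14",
    "trace_id": "7f3c9a1e-2b4d-4c8a-9e1f-0a6b5d2c3e4f",
    "span_id": "1d2e3f4a-5b6c-7d8e-9f0a-1b2c3d4e5f6a",
    "message": "payment gateway timeout",
    "fields": {"order_id": "A-10293", "latency_ms": 5000},
    "raw": '{"level":"ERROR","msg":"payment gateway timeout"}',
}

line = json.dumps(entry)     # wire format on the Kafka topic
parsed = json.loads(line)    # what the log processor consumes
print(parsed["level"], parsed["service"])
```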

Log Processing Pipeline

A Kafka consumer processes each log entry: (1) Parse unstructured logs (regex patterns for common formats: nginx, MySQL, Python tracebacks). (2) Enrich: add geo data from the client IP, resolve the service name from the host. (3) Filter: drop DEBUG logs in production by default (configurable). (4) Route: ERROR/FATAL → alert engine immediately; all logs → Elasticsearch; logs older than 30 days → S3 archiver.
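The stages above can be sketched as a single processing function. This is a hypothetical sketch: the enrichment step (geo/service lookup) is stubbed out as a comment, and the traceback pattern stands in for a fuller parser library.

```python
import re

# Hypothetical sketch of the per-entry pipeline: parse -> filter -> route.
TRACEBACK_RE = re.compile(r"Traceback \(most recent call last\)")

def process(entry, env="production"):
    """Return (destinations, entry), or (None, None) if the entry is dropped."""
    # (1) Parse: tag unstructured Python tracebacks so they can be grouped.
    if TRACEBACK_RE.search(entry.get("message", "")):
        entry.setdefault("fields", {})["error_type"] = "python_traceback"

    # (2) Enrich: geo lookup and service-name resolution omitted in this sketch.

    # (3) Filter: drop DEBUG logs in production by default.
    if env == "production" and entry.get("level") == "DEBUG":
        return None, None

    # (4) Route: everything indexed; errors also fan out to the alert engine.
    destinations = ["elasticsearch"]
    if entry.get("level") in ("ERROR", "FATAL"):
        destinations.append("alert_engine")
    return destinations, entry

dests, _ = process({"level": "ERROR",
                    "message": "Traceback (most recent call last): ..."})
print(dests)  # ['elasticsearch', 'alert_engine']
```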

Elasticsearch Indexing Strategy

Create a new index per day: logs-2026-04-17. This enables time-based retention (drop old logs by deleting whole indices, avoiding expensive delete-by-query operations) and efficient searches (query only the indices covering the requested time range). Index template: size shards so each stays roughly in the tens of gigabytes — e.g., 5 shards for a ~100GB daily index, more for larger days — with replicas=1. Index lifecycle policy: hot (0-7d, SSD-backed), warm (7-30d, HDD-backed, fewer replicas), delete (>30d). For 1TB/day: an Elasticsearch cluster of ~20 nodes with 8TB SSD each (covering 7 days hot plus replicas and headroom) and S3 for cold storage. Field mappings: timestamp as date, level as keyword (exact match), message as text (full-text search), service as keyword.
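The daily-index convention makes index naming and time-range index selection trivial to compute. A small helper sketch (function names are illustrative, not part of any Elasticsearch API):

```python
from datetime import date, timedelta

# Helper sketch: daily index names and time-range index selection.
def index_for(day: date) -> str:
    """Name of the daily index holding logs for the given day."""
    return f"logs-{day:%Y-%m-%d}"

def indices_for_range(start: date, end: date) -> list[str]:
    """Indices a search spanning [start, end] (inclusive) must query."""
    days = (end - start).days
    return [index_for(start + timedelta(d)) for d in range(days + 1)]

print(index_for(date(2026, 4, 17)))                            # logs-2026-04-17
print(indices_for_range(date(2026, 4, 16), date(2026, 4, 17)))
```

A "last 24 hours" search therefore touches at most two indices, regardless of how many days of logs the cluster holds.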

Real-Time Alerting

Two approaches: (1) Elasticsearch Watcher: runs a query every 60 seconds and alerts if the result count exceeds a threshold. Simple, but adds load to Elasticsearch. (2) Stream-based alerting (recommended for high volume): the log processor (Flink) maintains a 5-minute sliding-window count of errors per service; if error_count > threshold, it publishes an alert to Alertmanager → PagerDuty/Slack. Pattern matching: regex on log messages for critical patterns (OutOfMemoryError, connection refused, NullPointerException). Alert fatigue prevention: deduplication (the same alert fires at most once per 30 minutes per service) and overall rate limiting of alerts.
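The stream-based approach can be sketched as a per-service sliding window with threshold alerting and deduplication. This is a single-process sketch of the logic a Flink job would run in distributed form; class and parameter names are illustrative.

```python
from collections import deque

# Sketch: per-service sliding-window error counts with threshold alerting
# and a deduplication interval so each service alerts at most once per period.
class ErrorRateAlerter:
    def __init__(self, window_s=300, threshold=100, dedup_s=1800):
        self.window_s = window_s      # 5-minute sliding window
        self.threshold = threshold    # errors-in-window alert threshold
        self.dedup_s = dedup_s        # suppress repeat alerts for 30 minutes
        self.events = {}              # service -> deque of error timestamps
        self.last_alert = {}          # service -> time of last alert

    def record_error(self, service, now):
        q = self.events.setdefault(service, deque())
        q.append(now)
        while q and q[0] < now - self.window_s:   # evict events outside window
            q.popleft()
        if len(q) > self.threshold:
            last = self.last_alert.get(service)
            if last is None or now - last >= self.dedup_s:   # deduplication
                self.last_alert[service] = now
                return f"ALERT {service}: {len(q)} errors in {self.window_s}s"
        return None

alerter = ErrorRateAlerter(threshold=3)
alerts = [alerter.record_error("checkout", t) for t in range(5)]
print([a for a in alerts if a])   # fires once, then deduplicates
```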

Key Design Decisions

  • Fluent Bit agent: lightweight, Kubernetes-native, handles backpressure with local buffering
  • Kafka as buffer: decouples agents from downstream processing; handles spikes; enables replay
  • Daily Elasticsearch indices: efficient retention management, time-range query optimization
  • Hot/warm/cold tiering: SSD for recent logs, HDD for older, S3 for archive
  • trace_id in every log entry: enables distributed trace reconstruction from logs alone
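The last point — trace reconstruction from logs alone — amounts to filtering the aggregated logs by trace_id and sorting chronologically. A minimal sketch with illustrative data:

```python
# Sketch: reconstructing one request's path across services from
# aggregated logs, using only trace_id and timestamp.
logs = [
    {"timestamp": "2026-04-17T09:00:02Z", "service": "payment",  "trace_id": "t-1", "message": "charge ok"},
    {"timestamp": "2026-04-17T09:00:00Z", "service": "gateway",  "trace_id": "t-1", "message": "request in"},
    {"timestamp": "2026-04-17T09:00:01Z", "service": "checkout", "trace_id": "t-2", "message": "cart load"},
    {"timestamp": "2026-04-17T09:00:01Z", "service": "orders",   "trace_id": "t-1", "message": "order created"},
]

def reconstruct_trace(logs, trace_id):
    # Fixed-format ISO 8601 timestamps sort correctly as plain strings.
    return sorted((e for e in logs if e["trace_id"] == trace_id),
                  key=lambda e: e["timestamp"])

trace = reconstruct_trace(logs, "t-1")
print([e["service"] for e in trace])   # ['gateway', 'orders', 'payment']
```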


Frequently Asked Questions

Why use a log agent like Fluent Bit instead of writing logs directly to Elasticsearch?

Writing logs directly to Elasticsearch couples services to the logging infrastructure. Problems: (1) If Elasticsearch is slow or down, the service is affected — backpressure can cause request latency spikes. (2) Every service must implement retry logic, batching, and connection management for Elasticsearch. (3) Changing the logging backend (e.g., switching from Elasticsearch to Loki) requires code changes in every service. Log agents (Fluent Bit, Filebeat) decouple services from logging: services write to local files or stdout. The agent handles tailing, buffering, parsing, enrichment, and forwarding. If Kafka is slow, the agent buffers on disk (configurable, e.g., 1GB). The service sees zero impact. Agents also add metadata (pod name, namespace, node) that the service itself doesn't know.

Why create a new Elasticsearch index per day for log storage?

Daily indices enable efficient time-based operations: (1) Retention: to delete logs older than 30 days, simply delete the oldest indices (DELETE /logs-2026-03-17). No expensive DELETE query that scans millions of documents. (2) Query efficiency: searching logs from "the last 24 hours" queries only 1-2 indices instead of scanning all shards of a single large index. (3) Hot/warm/cold tiering: older indices can be moved to cheaper storage (warm = HDD-backed nodes, cold = S3 via Elasticsearch ILM) without affecting recent indices. (4) Index tuning: recent indices (hot) get more shards and replicas; archived indices (warm) get fewer. Daily indices also simplify the index lifecycle policy (ILM): hot 0-7 days on SSD, warm 7-30 days on HDD, delete after 30 days.

How does trace_id enable distributed tracing through logs?

In a microservices architecture, one user request may span 5-10 services. Without correlation, debugging requires matching timestamps across service logs — tedious and imprecise. trace_id is a UUID generated at the entry point (API gateway) and propagated in every outbound HTTP call (as a header, e.g., X-Trace-ID) and in every log entry written during that request. When debugging an error, searching for the trace_id in the log aggregator returns all log lines from all services for that single request in chronological order. Propagation: use the OpenTelemetry SDK, which automatically injects the trace_id into HTTP headers and into log entries via MDC (Mapped Diagnostic Context). span_id identifies a specific service leg within the trace. Tools: Jaeger and Zipkin visualize the trace; Kibana/Grafana Loki searches log lines by trace_id.

How do you prevent alert fatigue in a log-based alerting system?

Alert fatigue occurs when too many alerts fire, causing on-call engineers to ignore them. Prevention strategies: (1) Deduplication: the same alert pattern should fire at most once per N minutes per service — after the first alert fires, suppress subsequent identical alerts for 30 minutes. (2) Rate thresholds: alert only if error_rate > X errors/minute, not on individual errors. Smoothing: use a 5-minute rolling average, not instantaneous counts. (3) Alert grouping: group related alerts (same service, same error type) into a single notification. (4) Severity tiers: CRITICAL pages immediately, WARNING sends a Slack message, INFO is only visible in dashboards. Only page for CRITICAL. (5) Error budget alerts: alert when the error rate exceeds the SLO error budget consumption rate (e.g., "at the current error rate, you will exhaust this week's budget in 2 hours"). This is more actionable than raw error counts.

How do you handle 1TB/day log volume cost-effectively?

At 1TB/day, storing 30 days of logs = 30TB of hot storage. Cost optimization strategies: (1) Log level filtering: drop DEBUG and TRACE logs in production at the agent level (Fluent Bit filter). Debug logs can be 10x the volume of warning/error logs. (2) Sampling: for high-volume repetitive logs (health checks, batch job progress), sample 1% and discard 99%. (3) Compression: Elasticsearch uses LZ4 compression internally; segment files on disk are compressed 3-5x. (4) Hot/warm/cold tiering: Elasticsearch ILM — hot (SSD, 7 days), warm (HDD, 30 days, reduced replicas), cold (S3 via searchable snapshots). SSD is 4x more expensive than HDD, and HDD is 10x more expensive than S3. (5) Log routing: only send ERROR and above to Elasticsearch (expensive, searchable); send all logs to S3 (cheap archive). Query S3 via Athena for historical analysis.
