Low Level Design: Centralized Logging System

Log Ingestion Pipeline

Every host in the fleet runs a lightweight log shipping agent. Filebeat or Fluent Bit tails log files from disk (or reads from stdout via the container runtime) and forwards records to an aggregator tier. The aggregator—typically another Fluent Bit or Logstash instance—buffers incoming records in memory or on disk, batches them, and writes to Kafka. Kafka is the durable transport layer between producers and the indexing backend: it absorbs write spikes, provides replay capability if the indexer falls behind, and decouples producers from consumers. Backpressure is applied upstream: if the Kafka topic’s consumer lag exceeds a threshold, the aggregator slows its flush rate and the agent pauses tailing until the buffer drains. This prevents a slow indexer from causing log loss at the source.
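The backpressure behavior above can be sketched as a small control loop. This is a minimal illustration, not the API of any shipping agent: the lag threshold, flush intervals, and class names are all assumptions chosen for clarity.

```python
from collections import deque

class Aggregator:
    """Minimal sketch of the aggregator's backpressure behavior.
    LAG_THRESHOLD and the flush intervals are illustrative values,
    not recommendations from any specific tool."""
    LAG_THRESHOLD = 100_000   # max acceptable downstream consumer lag (records)
    NORMAL_FLUSH = 1.0        # seconds between flushes under normal load
    THROTTLED_FLUSH = 5.0     # slowed flush rate while the indexer catches up

    def __init__(self):
        self.buffer = deque()
        self.paused = False   # signals the agent to pause tailing

    def flush_interval(self, consumer_lag: int) -> float:
        # Slow the flush rate when downstream lag crosses the threshold.
        if consumer_lag > self.LAG_THRESHOLD:
            return self.THROTTLED_FLUSH
        return self.NORMAL_FLUSH

    def on_lag_report(self, consumer_lag: int) -> None:
        # Pause tailing until the buffer drains, rather than dropping records.
        self.paused = consumer_lag > self.LAG_THRESHOLD
```

The key design point is that pressure propagates backward (indexer → Kafka lag → aggregator → agent) instead of records being dropped mid-pipeline.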

Structured Log Format

Unstructured free-text logs are operationally expensive to query. Every service must emit JSON logs with a mandatory set of fields: timestamp (RFC3339 with millisecond precision), level (DEBUG/INFO/WARN/ERROR/FATAL normalized to uppercase), service (canonical service name from deployment config), trace_id, span_id, host, and message. The trace_id is the most operationally important field: it allows you to jump from a trace in your APM tool directly to all log lines emitted during that request across every service. Additional structured fields—user_id, order_id, HTTP status code—should be top-level keys rather than buried in the message string. Structured fields enable faceted search ("all ERROR logs for service=checkout where status_code=500") and cardinality-safe aggregations in Elasticsearch.
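A helper that emits the mandatory envelope described above might look like the following sketch. The function name and signature are illustrative; in practice this lives inside each language's logging library configuration.

```python
import json
import datetime

def make_log_record(level, service, message, trace_id, span_id, host, **fields):
    """Sketch of the mandatory structured-log envelope. Additional
    structured fields (user_id, status_code, ...) are promoted to
    top-level keys rather than buried in the message string."""
    record = {
        # RFC3339 timestamp with millisecond precision
        "timestamp": datetime.datetime.now(datetime.timezone.utc)
                         .isoformat(timespec="milliseconds"),
        "level": level.upper(),   # normalize DEBUG/INFO/WARN/ERROR/FATAL
        "service": service,
        "trace_id": trace_id,
        "span_id": span_id,
        "host": host,
        "message": message,
    }
    record.update(fields)         # extra fields stay top-level and queryable
    return json.dumps(record)
```

Because `status_code` is a top-level key rather than interpolated into `message`, the faceted search in the text ("all ERROR logs for service=checkout where status_code=500") becomes an exact-match filter instead of a full-text scan.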

Log Parsing and Enrichment

Not all logs arrive pre-structured. Legacy services, third-party libraries, and system logs emit free-text lines. A stream processing layer—Logstash or Apache Flink—consumes raw records from Kafka, applies Grok patterns to extract fields from unstructured messages (e.g., parsing nginx access log format into method, path, status, bytes, duration), and normalizes log levels across language ecosystems where conventions differ (Python’s WARNING vs Java’s WARN). After parsing, enrichment adds context that the service itself doesn’t know: geographic region from host labels, team ownership from a service catalog lookup, environment tag (prod/staging) from the Kafka topic name. Enriched records are written back to a separate Kafka topic for consumption by the indexer. Separating parsing from indexing lets you replay raw logs through an updated parser without re-ingesting from sources.
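The parse-normalize-enrich steps above can be sketched as three small functions. The regex covers only the common prefix of the nginx combined log format, and the level-alias table and catalog lookup are illustrative, not an exhaustive mapping.

```python
import re

# Prefix of the nginx combined log format; named groups match the
# fields discussed above (method, path, status, bytes).
NGINX = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

# Per-ecosystem level aliases (illustrative, not exhaustive):
# Python's WARNING vs Java's WARN, syslog's ERR, etc.
LEVEL_MAP = {"WARNING": "WARN", "ERR": "ERROR", "CRITICAL": "FATAL"}

def parse_nginx(line: str):
    m = NGINX.match(line)
    if m is None:
        return None  # leave unparsed lines for a fallback pipeline
    d = m.groupdict()
    d["status"] = int(d["status"])
    d["bytes"] = int(d["bytes"])
    return d

def normalize_level(level: str) -> str:
    level = level.upper()
    return LEVEL_MAP.get(level, level)

def enrich(record: dict, catalog: dict, topic: str) -> dict:
    # Add context the emitting service doesn't know: ownership from a
    # service-catalog lookup, environment from the Kafka topic name.
    record["team"] = catalog.get(record.get("service"), "unknown")
    record["environment"] = "prod" if topic.endswith(".prod") else "staging"
    return record
```

In a real pipeline these run inside Logstash filters or a Flink job; the sketch only shows the data transformation each stage performs.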

Indexing with Elasticsearch

The indexing layer writes enriched records into Elasticsearch. The standard index naming convention is logs-{service}-{YYYY.MM.dd}: one index per service per day. An Index Lifecycle Management (ILM) policy controls rollover at 50 GB or 24 hours (whichever comes first) and manages the hot-warm-cold tier migration automatically. The hot tier runs on NVMe SSDs with high I/O and holds recent data. Elasticsearch’s inverted index on the message field enables full-text search across billions of log lines. Structured fields (level, service, trace_id) are indexed as keyword type for exact-match filtering and aggregations. Avoid indexing high-cardinality fields like raw user IDs as text—it bloats the index and slows queries. Use keyword with doc values for those instead.
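The naming convention and field-type choices above can be made concrete. The mapping below is a partial sketch in Elasticsearch mapping syntax (expressed as a Python dict), not a complete production mapping.

```python
import datetime

def index_name(service: str, day: datetime.date) -> str:
    # logs-{service}-{YYYY.MM.dd}: one index per service per day.
    return f"logs-{service}-{day:%Y.%m.%d}"

# Partial field mapping: keyword for exact-match/aggregation fields,
# text only for the free-text message field.
MAPPING = {
    "properties": {
        "timestamp": {"type": "date"},
        "level":     {"type": "keyword"},
        "service":   {"type": "keyword"},
        "trace_id":  {"type": "keyword"},
        "user_id":   {"type": "keyword"},  # high cardinality: keyword + doc values, not text
        "message":   {"type": "text"},     # inverted index for full-text search
    }
}
```

Keyword fields skip analysis entirely, which is what makes exact-match filters and terms aggregations cheap compared to analyzed text.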

Search and Query Interface

Kibana provides the primary query interface. Engineers write KQL (Kibana Query Language) queries to filter by service, level, time range, and structured fields. Common debugging patterns—"all errors for trace_id X", "last 100 lines for pod Y"—are saved as named searches and shared via team wiki. Grafana Loki is an alternative for teams already using Grafana: LogQL is similar to PromQL and integrates naturally with metric dashboards, making it easy to jump from a latency spike panel directly to correlated logs. Critically, every alert notification should include a pre-built Kibana or Loki URL with the relevant trace_id, time range, and service filters already applied so the on-call engineer lands directly in context instead of starting from scratch.
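Embedding a pre-filtered link in an alert payload can be sketched as below. Note the hedge: real Kibana Discover URLs encode their filter state in rison, so the `base_url` here stands in for a hypothetical saved-search endpoint; the principle (filters pre-applied, engineer lands in context) is the same.

```python
from urllib.parse import urlencode

def alert_log_link(base_url: str, service: str, trace_id: str,
                   start: str, end: str) -> str:
    """Build a deep link for an alert notification with the service,
    trace_id, and time range already applied. base_url is a hypothetical
    endpoint; adapt the parameter encoding to your Kibana/Loki setup."""
    params = {
        "query": f'service:"{service}" and trace_id:"{trace_id}"',  # KQL-style filter
        "from": start,
        "to": end,
    }
    return f"{base_url}?{urlencode(params)}"
```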

Sampling and Volume Control

High-volume services can generate terabytes of debug logs per day. Indexing everything is expensive and the signal-to-noise ratio at DEBUG level is low. The solution is tiered sampling: always capture ERROR and above without sampling; sample WARN at 10%; sample INFO at 50% for most services; sample DEBUG at 1% or disable entirely in production. The sampling decision is made at the agent level—before records enter the network—to avoid paying ingestion and transport costs for logs that will be dropped later. Importantly, sampling should be coordinated with distributed tracing: if a trace is sampled for recording (head-based decision), all log records for that trace_id should bypass sampling and be fully captured. This ensures that sampled traces have complete correlated logs.
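The tiered sampling policy above, including the trace-coordination bypass, fits in a few lines. The rates mirror the text; the function name and the injectable random source are illustrative.

```python
import random

# Per-level sampling rates from the tiering described above.
SAMPLE_RATES = {"FATAL": 1.0, "ERROR": 1.0, "WARN": 0.10, "INFO": 0.50, "DEBUG": 0.01}

def should_ship(level: str, trace_sampled: bool, rng=random.random) -> bool:
    """Agent-side sampling decision. If this request's trace was sampled
    (head-based decision, propagated with the trace context), all of its
    log records bypass sampling so sampled traces keep complete logs."""
    if trace_sampled:
        return True
    return rng() < SAMPLE_RATES.get(level.upper(), 1.0)
```

Running this in the agent, before the record leaves the host, is what avoids paying transport and ingestion costs for logs that would be dropped anyway.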

Retention and Tiering

Retention policy is a cost and compliance decision. A practical default tiering: hot tier on SSD for 7 days (fast query, high cost); warm tier on HDD for 30 days (moderate query speed, lower cost, ILM migrates shards automatically); cold tier on object storage (S3 or GCS) as frozen/searchable snapshots for 90 days; delete after 1 year. Audit logs for regulated industries (authentication events, admin actions, payment records) require longer retention—typically 7 years—and must be stored immutably. Implement object storage versioning and S3 Object Lock for audit log buckets to prevent tampering. Cost optimization: compress indices before warm tier migration (best_compression codec); use force merge to reduce segment count on cold data.
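The tiering above maps directly onto an ILM policy. The sketch below (expressed as a Python dict in Elasticsearch's ILM policy shape) is a partial illustration; the snapshot repository name is hypothetical and the exact action set should be adapted to your cluster.

```python
# Sketch of an ILM policy for the hot/warm/cold/delete tiering above.
# "logs-s3-repo" is a hypothetical snapshot repository name.
ILM_POLICY = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll over at 50 GB or 24 hours, whichever comes first.
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "24h"}
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    "forcemerge": {"max_num_segments": 1},      # fewer segments on read-mostly data
                    "allocate": {"require": {"data": "warm"}}   # migrate shards to HDD nodes
                }
            },
            "cold": {
                "min_age": "30d",
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "logs-s3-repo"}
                }
            },
            "delete": {
                "min_age": "365d",
                "actions": {"delete": {}}
            }
        }
    }
}
```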

Alerting on Log Patterns

Log-based alerting complements metric-based alerting and catches errors that don’t surface in metrics. Elasticsearch Watcher or the open-source ElastAlert monitors for: error rate per service exceeding a threshold (count of level=ERROR in last 5 minutes > N); regex match on critical patterns like "OutOfMemoryError" or "deadlock detected"; spike detection where error volume increases by 3x relative to the previous hour baseline. Alerts route to PagerDuty for critical patterns requiring immediate response and to Slack for warnings. Every alert payload must include the service name, matched pattern, count, and a direct Kibana link pre-filtered to the relevant time window and service—without this, on-call engineers waste minutes reconstructing context during incidents.
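The 3x spike rule above needs one refinement in practice: a minimum-count floor, so a baseline of 1 error jumping to 3 does not page anyone. A minimal sketch, with illustrative defaults:

```python
def is_spike(current_count: int, baseline_count: int,
             factor: float = 3.0, min_count: int = 50) -> bool:
    """Sketch of the 3x spike rule: compare the current window's error
    count to the previous hour's baseline. min_count is an illustrative
    floor that keeps tiny baselines from triggering pages."""
    if current_count < min_count:
        return False
    # max(..., 1) guards against a zero baseline (e.g. a new service).
    return current_count >= factor * max(baseline_count, 1)
```

Watcher and ElastAlert express this as query aggregations rather than application code, but the decision logic they evaluate is the same.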
