System Design Interview: Log Aggregation and Observability Pipeline

Why Log Aggregation?

With hundreds of microservices each running multiple replicas, logs are scattered across thousands of containers. Without centralized log aggregation, debugging a production incident requires SSHing into individual machines — slow, error-prone, and impossible when the failing pod has already been replaced. A log aggregation system collects logs from all sources, enriches them with metadata, makes them searchable, and retains them for compliance. It is a foundational component of any observability strategy alongside metrics and distributed tracing.

The Three Pillars of Observability

  • Metrics: numeric time-series measurements (CPU%, request_rate, error_rate). Low cardinality; efficient to store (Prometheus, InfluxDB). Best for dashboards and alerting.
  • Logs: timestamped text records of discrete events. High cardinality (every request can generate a log line). Best for debugging specific incidents and auditing.
  • Traces: records of a request’s path through multiple services with timing for each hop. Best for diagnosing latency and finding which service is causing slowdowns.

Log aggregation serves the “Logs” pillar. Correlating all three (linking log lines to trace IDs to metric spikes) is the holy grail of observability — platforms like Datadog, Grafana, and Honeycomb integrate all three.

Log Pipeline Architecture

1. Log Emission (Structured Logging)

Logs should be structured JSON, not free-text strings. Structured logs are parseable without regex and allow filtering on specific fields.


# Bad: free text (hard to parse, query, and filter)
logger.info(f"User {user_id} placed order {order_id} for ${amount} at {timestamp}")

# Good: structured JSON
logger.info({
    "event":      "order.placed",
    "user_id":    12345,
    "order_id":   67890,
    "amount":     99.99,
    "trace_id":   "abc-123-def-456",  # correlation with distributed trace
    "request_id": "req-789",
    "timestamp":  "2025-04-16T14:23:01Z",
    "service":    "order-service",
    "version":    "1.4.2"
})

Standard fields on every log line: timestamp, service name, version, trace_id, request_id, log level. The trace_id is the critical link between logs and distributed traces — when debugging, find the trace first, then pull all logs with that trace_id across all services.
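The standard fields above can be stamped onto every line by the logging setup itself rather than repeated at each call site. A minimal sketch using Python's stdlib `logging` (the `JsonFormatter` class, `SERVICE_NAME`/`SERVICE_VERSION` constants, and the `extra=` convention for passing `trace_id` are illustrative assumptions, not a specific library's API):

```python
import json
import logging

SERVICE_NAME = "order-service"   # hypothetical service identity
SERVICE_VERSION = "1.4.2"

class JsonFormatter(logging.Formatter):
    """Emit every record as one JSON object with the standard fields."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": SERVICE_NAME,
            "version": SERVICE_VERSION,
            # trace_id/request_id attached per-call via logger's `extra=`
            "trace_id": getattr(record, "trace_id", None),
            "request_id": getattr(record, "request_id", None),
        }
        if isinstance(record.msg, dict):   # structured payload: merge fields
            entry.update(record.msg)
        else:                              # fall back to plain message text
            entry["message"] = record.getMessage()
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger(SERVICE_NAME)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info({"event": "order.placed", "order_id": 67890},
            extra={"trace_id": "abc-123-def-456", "request_id": "req-789"})
```

With this in place, call sites only pass the event-specific fields; the correlation fields arrive automatically on every line.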

2. Collection (Fluentd / Fluent Bit)

In Kubernetes, each node runs a log collector DaemonSet (one pod per node). Fluent Bit is the lightweight collector (C, low CPU/memory); Fluentd is the heavier aggregator (Ruby). Fluent Bit reads container logs from /var/log/containers/*.log (where Kubernetes writes container stdout/stderr), parses JSON, adds Kubernetes metadata (pod name, namespace, labels), and forwards to a central aggregator or directly to storage.


# Fluent Bit configuration (Kubernetes log tailing):
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Parser            docker   # use the `cri` parser on containerd/CRI-O runtimes
    Tag               kube.*
    Refresh_Interval  5

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Merge_Log           On   # merge JSON log content into the record

[OUTPUT]
    Name             es
    Match            *
    Host             elasticsearch.logging.svc
    Port             9200
    Logstash_Format  On           # daily indices: fluent-bit-YYYY.MM.DD
    Logstash_Prefix  fluent-bit

3. Storage: ELK Stack vs Grafana Loki

ELK Stack (Elasticsearch + Logstash + Kibana): Elasticsearch stores logs as JSON documents. Every field is indexed by default — enabling fast full-text search on any field. Logstash processes and transforms logs before indexing. Kibana provides search UI, dashboards, and alerting. Scaling: Elasticsearch shards data across nodes; hot-warm-cold architecture uses fast SSD nodes for recent logs (7 days), HDD nodes for warm logs (30 days), and S3 for cold storage (1+ years). Cost at scale: indexing all fields in Elasticsearch is expensive in CPU and storage — 10 TB/day of raw logs may require 15-20 nodes.

Grafana Loki: Loki stores logs as compressed chunks, indexed only by labels (not content). Full-text search requires a table scan over chunks — slower than Elasticsearch for ad-hoc queries. Advantage: dramatically cheaper storage (S3 as the backend), no index to maintain, and seamless Grafana integration. Best when you have structured logs with known labels and can filter by them (service, namespace, level). Unsuitable for full-text search across all log content.
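The query-model difference shows up directly in LogQL, Loki's query language: you must select a stream by its indexed labels first, then filter the unindexed content with grep-like and parser stages. A sketch (label names and the `status` field are illustrative):

```logql
# 1. Select streams by label (indexed, fast) ...
# 2. ... then filter message content (scan over chunks, slower)
{service="checkout", namespace="prod"} |= "error"

# Parse JSON log lines on the fly and filter on an extracted field:
{service="checkout"} | json | status >= 500
```

There is no way to start from `user_id=12345` across all streams the way an Elasticsearch term query can, unless `user_id` were a label, which would explode cardinality and defeat Loki's design.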

Log Sampling and Volume Management

At 100,000 requests/second, storing every log line is prohibitively expensive. Sampling strategies:

  • Error logs: 100% retention — never sample errors; they are always valuable for debugging
  • Tail-based sampling for traces: retain 100% of traces with errors or high latency; sample 1-5% of successful fast traces
  • Head-based sampling for info logs: sample 10% of successful request logs — sufficient for traffic analysis without storing every line
  • Dynamic sampling: increase sampling rate automatically when error rates spike (need more data during incidents)
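The log-level portion of these rules can be sketched as a simple head-based filter (the `SAMPLE_RATES` table and `should_emit` helper are illustrative, not a specific library's API):

```python
import random

# Per-level keep probabilities: errors always pass, info is sampled at 10%,
# debug is suppressed in production. Rates here mirror the list above.
SAMPLE_RATES = {
    "ERROR": 1.0,
    "WARN":  1.0,
    "INFO":  0.10,
    "DEBUG": 0.0,
}

def should_emit(level: str, rng=random.random) -> bool:
    """Head-based decision: made when the line is emitted, before the
    request outcome is known. Unknown levels default to keep."""
    return rng() < SAMPLE_RATES.get(level, 1.0)
```

Dynamic sampling would adjust `SAMPLE_RATES["INFO"]` upward when an error-rate monitor detects a spike, so incidents are captured at full fidelity.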

Retention Policy and Compliance

  Tier                          Duration     Storage                          Access latency
  Hot (recent logs)             7 days       Elasticsearch SSD / Loki SSD     Milliseconds
  Warm (investigation window)   30-90 days   Elasticsearch HDD / S3 + Loki    Seconds
  Cold (compliance)             1-7 years    S3 Glacier / Azure Archive       Hours

Compliance-driven retention: PCI DSS requires one year of logs; SOC 2 and HIPAA can require 6-7 years. Cold storage (S3 Glacier) at $0.004/GB/month makes long-term retention affordable: storing a terabyte costs about $4 per month.
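The cold-tier arithmetic is worth being able to do on the spot in an interview. A back-of-envelope sketch using the article's $0.004/GB/month figure (the daily-volume and compression inputs are assumptions for illustration):

```python
GLACIER_PER_GB_MONTH = 0.004  # article's S3 Glacier figure

def glacier_monthly_cost(stored_tb: float) -> float:
    """Monthly storage cost in USD for a given archived volume (TB)."""
    return stored_tb * 1000 * GLACIER_PER_GB_MONTH

def archive_growth_tb(raw_tb_per_day: float, compression_ratio: float,
                      days: int) -> float:
    """Compressed archive size accumulated over a retention window."""
    return raw_tb_per_day / compression_ratio * days

# e.g. 10 TB/day raw, ~10x compression, one year of retention:
year_tb = archive_growth_tb(10, 10, 365)   # 365 TB compressed
cost = glacier_monthly_cost(year_tb)       # a few hundred dollars/month
```

Even a year of logs from a 10 TB/day pipeline lands in the hundreds of dollars per month at Glacier prices, versus tens of thousands on an indexed hot tier.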

Alerting on Logs

Log-based alerting catches errors that metrics might miss (a new error type with no existing metric). Elasticsearch and Loki both support alert rules on log queries. Common log alerts:

  • Error rate: count of log lines with level=ERROR exceeds threshold in 5-minute window
  • Specific exception: log lines containing “OutOfMemoryError” — trigger immediately
  • No logs: expected log lines not appearing (service stopped emitting) — silent failures
  • Latency from logs: parse response_time_ms from structured logs; alert when P99 exceeds threshold
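The first alert in the list (error count over a sliding window) reduces to a small amount of bookkeeping, which alerting backends do internally. A self-contained sketch (the `ErrorRateAlert` class, threshold, and window size are illustrative):

```python
from collections import deque

class ErrorRateAlert:
    """Fire when more than `threshold` ERROR lines arrive within `window_s`."""
    def __init__(self, threshold: int, window_s: int = 300):
        self.threshold = threshold
        self.window_s = window_s
        self.events = deque()  # timestamps of recent ERROR lines

    def record_error(self, ts: float) -> bool:
        """Record one ERROR line at time `ts`; return True if alert fires."""
        self.events.append(ts)
        cutoff = ts - self.window_s
        while self.events and self.events[0] < cutoff:  # expire old events
            self.events.popleft()
        return len(self.events) > self.threshold

alert = ErrorRateAlert(threshold=100)  # >100 errors in 5 min fires
```

The "no logs" alert is the inverse: it fires when `record` has not been called at all within the window, which catches services that have silently stopped emitting.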

Key Interview Points

  • Structured JSON logging is prerequisite — free-text logs are unqueryable at scale
  • Always include trace_id in logs to correlate with distributed traces
  • Fluent Bit DaemonSet collects container logs from nodes; adds Kubernetes metadata
  • Elasticsearch: expensive but rich full-text search; Loki: cheap S3-backed but label-only indexing
  • Sample info logs (10%), never sample error logs (100%), use tail-based sampling for traces
  • Hot-warm-cold tiers: 7 days SSD, 30-90 days HDD, years on S3 Glacier for compliance

Frequently Asked Questions

What is the difference between ELK Stack and Grafana Loki for log storage?

ELK Stack (Elasticsearch, Logstash, Kibana) and Grafana Loki represent two fundamentally different indexing philosophies. Elasticsearch indexes every field in every log line — it parses each log, extracts all fields (timestamp, level, service, message, user_id, request_id, etc.), and builds inverted indexes for full-text search. This enables extremely powerful queries: search across any field, regex on message content, aggregate by any dimension. The cost: Elasticsearch uses 10-20x more storage than raw log volume, and ingestion CPU is high because every log line is fully parsed.

Loki (Grafana's log aggregation system, inspired by Prometheus) indexes only labels — a small set of key-value pairs attached to a log stream (service="checkout", env="prod", region="us-east-1"). The actual log line content is stored as compressed chunks without field extraction. Queries use LogQL: first select streams by label ({service="checkout"}), then filter within the stream by grep-like pattern (|= "error"). Loki uses 5-10x less storage than Elasticsearch. Trade-off: Loki cannot do full-text search across an arbitrary field like user_id unless it's a label — but making every field a label defeats the purpose.

Choose Elasticsearch when you need rich ad-hoc queries, compliance reporting, or security analytics across arbitrary fields. Choose Loki when the primary use case is tailing logs and filtering by service/environment, and cost efficiency matters.

How do you design log sampling to balance cost and observability?

Sampling reduces log volume while preserving observability for the cases that matter. The key insight: not all logs have equal value. Error logs are rare and always valuable — sample 100%. Slow traces (above p99 latency threshold) are infrequent and always worth keeping — sample 100%. Informational traces are frequent and mostly uninteresting — sample 1-10%. Implementation strategies:

  • Head-based sampling: the decision is made at the start of a request and propagates to all downstream services via the trace context header (X-B3-Sampled or W3C traceparent). Simple, but cannot sample based on outcome (you don't know yet whether the request will error).
  • Tail-based sampling: buffer the complete trace and make the sampling decision after the root span completes. Can keep 100% of errors and slow traces regardless of the overall sample rate. Requires a trace collector (Jaeger, Tempo) to buffer spans and apply tail-sampling rules.
  • Log-level sampling: always emit ERROR and WARN; sample INFO at 10%; suppress DEBUG in production entirely. Fluent Bit and Fluentd support sampling filters.

Business logic matters too: for financial transactions, sample 100% regardless of log level — a full audit trail is a compliance requirement. For high-traffic read-only endpoints (health checks, catalog browsing), 1% sampling is sufficient.
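A tail-based sampling decision, made after the trace completes, can be sketched in a few lines (the trace dict shape, the 5% baseline, and the `keep_trace` helper are illustrative assumptions, not a collector's real API):

```python
import random

def keep_trace(trace: dict, p99_ms: float = 500.0,
               baseline: float = 0.05, rng=random.random) -> bool:
    """Tail-sampling rule: decided only after the root span has finished."""
    if trace.get("error"):
        return True                     # keep 100% of errored traces
    if trace.get("duration_ms", 0) > p99_ms:
        return True                     # keep 100% of slow traces
    return rng() < baseline             # sample fast successes at 5%
```

This is exactly what head-based sampling cannot do: at request start, `error` and `duration_ms` do not exist yet, so the head-based decision can only be random.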

How do you implement trace ID correlation across microservices for distributed tracing?

Distributed tracing requires propagating a trace_id (and span_id) through every service call so logs, metrics, and traces from all services participating in a single request can be correlated. Implementation:

1. The entry point (API gateway or first service) generates a trace_id (UUID or 128-bit random ID) if the incoming request does not already have one.
2. The trace_id is injected into outgoing HTTP request headers using a standard propagation format: W3C Trace Context (traceparent: 00-{trace_id}-{span_id}-{flags}) or B3 propagation (X-B3-TraceId, X-B3-SpanId).
3. Every service reads the trace_id from incoming headers, includes it in all structured log lines ({"trace_id": "abc123", "service": "checkout", "level": "info", "msg": "…"}), and propagates it to all outgoing calls (HTTP headers, Kafka message headers, gRPC metadata).
4. Log aggregation systems (ELK, Loki) can then be queried by trace_id to retrieve all logs across all services for a single request.
5. OpenTelemetry is the standard SDK for instrumentation — it handles propagation automatically for common HTTP clients and frameworks.

The key operational benefit: when an error occurs in service C that was triggered by a request to service A, searching for the trace_id in Kibana shows the complete chain (service A → service B → service C), with timing for each hop and all log lines from all services for that request.
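The generate-or-reuse step for W3C Trace Context headers can be sketched as follows. This is a simplified illustration (real SDKs like OpenTelemetry handle version negotiation, flags, and tracestate; the helper names here are assumptions):

```python
import re
import secrets

# traceparent: version "00", 128-bit trace_id, 64-bit parent span_id, flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$")

def extract_or_start_trace(headers: dict) -> tuple:
    """Return (trace_id, span_id) for this hop: reuse the incoming
    trace_id if a valid traceparent header is present, else start a
    new trace at the edge."""
    span_id = secrets.token_hex(8)                 # fresh 64-bit span id
    m = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    trace_id = m.group(1) if m else secrets.token_hex(16)  # 128-bit
    return trace_id, span_id

def outgoing_headers(trace_id: str, span_id: str) -> dict:
    """Build the traceparent header to attach to downstream calls."""
    return {"traceparent": f"00-{trace_id}-{span_id}-01"}
```

The same `trace_id` is what each service writes into its structured log lines, closing the loop between the trace view and log search.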

