System Design Interview: Log Aggregation and Observability Pipeline

Why Log Aggregation?

With hundreds of microservices each running multiple replicas, logs are scattered across thousands of containers. Without centralized log aggregation, debugging a production incident requires SSHing into individual machines — slow, error-prone, and impossible when the failing pod has already been replaced. A log aggregation system collects logs from all sources, enriches them with metadata, makes them searchable, and retains them for compliance. It is a foundational component of any observability strategy alongside metrics and distributed tracing.

The Three Pillars of Observability

  • Metrics: numeric time-series measurements (CPU%, request_rate, error_rate). Low cardinality; efficient to store (Prometheus, InfluxDB). Best for dashboards and alerting.
  • Logs: timestamped text records of discrete events. High cardinality (every request can generate a log line). Best for debugging specific incidents and auditing.
  • Traces: records of a request’s path through multiple services with timing for each hop. Best for diagnosing latency and finding which service is causing slowdowns.

Log aggregation serves the “Logs” pillar. Correlating all three (linking log lines to trace IDs to metric spikes) is the holy grail of observability — platforms like Datadog, Grafana, and Honeycomb integrate all three.

Log Pipeline Architecture

1. Log Emission (Structured Logging)

Logs should be structured JSON, not free-text strings. Structured logs are parseable without regex and allow filtering on specific fields.


# Bad: free text (hard to parse, query, and filter)
logger.info(f"User {user_id} placed order {order_id} for ${amount} at {timestamp}")

# Good: structured JSON
logger.info({
    "event":      "order.placed",
    "user_id":    12345,
    "order_id":   67890,
    "amount":     99.99,
    "trace_id":   "abc-123-def-456",  # correlation with distributed trace
    "request_id": "req-789",
    "timestamp":  "2025-04-16T14:23:01Z",
    "service":    "order-service",
    "version":    "1.4.2"
})

Standard fields on every log line: timestamp, service name, version, trace_id, request_id, log level. The trace_id is the critical link between logs and distributed traces — when debugging, find the trace first, then pull all logs with that trace_id across all services.
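The standard fields above can be stamped onto every line by the logging setup itself rather than repeated at each call site. A minimal sketch using Python's stdlib `logging` (the `JsonFormatter` class, `SERVICE_NAME`/`SERVICE_VERSION` constants, and the `extra=` convention for passing `trace_id` are illustrative assumptions, not a specific library's API):

```python
import json
import logging

SERVICE_NAME = "order-service"   # hypothetical service identity
SERVICE_VERSION = "1.4.2"

class JsonFormatter(logging.Formatter):
    """Emit every record as one JSON object with the standard fields."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": SERVICE_NAME,
            "version": SERVICE_VERSION,
            # trace_id/request_id attached per-call via logger's `extra=`
            "trace_id": getattr(record, "trace_id", None),
            "request_id": getattr(record, "request_id", None),
        }
        if isinstance(record.msg, dict):   # structured payload: merge fields
            entry.update(record.msg)
        else:                              # fall back to plain message text
            entry["message"] = record.getMessage()
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger(SERVICE_NAME)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info({"event": "order.placed", "order_id": 67890},
            extra={"trace_id": "abc-123-def-456", "request_id": "req-789"})
```

With this in place, call sites only pass the event-specific fields; the correlation fields arrive automatically on every line.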

2. Collection (Fluentd / Fluent Bit)

In Kubernetes, each node runs a log collector DaemonSet (one pod per node). Fluent Bit is the lightweight collector (C, low CPU/memory); Fluentd is the heavier aggregator (Ruby). Fluent Bit reads container logs from /var/log/containers/*.log (where Kubernetes writes container stdout/stderr), parses JSON, adds Kubernetes metadata (pod name, namespace, labels), and forwards to a central aggregator or directly to storage.


# Fluent Bit configuration (Kubernetes log tailing):
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Parser            docker   # use the `cri` parser on containerd/CRI-O runtimes
    Tag               kube.*
    Refresh_Interval  5

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Merge_Log           On   # merge JSON log content into the record

[OUTPUT]
    Name             es
    Match            *
    Host             elasticsearch.logging.svc
    Port             9200
    Logstash_Format  On           # daily indices: fluent-bit-YYYY.MM.DD
    Logstash_Prefix  fluent-bit

3. Storage: ELK Stack vs Grafana Loki

ELK Stack (Elasticsearch + Logstash + Kibana): Elasticsearch stores logs as JSON documents. Every field is indexed by default — enabling fast full-text search on any field. Logstash processes and transforms logs before indexing. Kibana provides search UI, dashboards, and alerting. Scaling: Elasticsearch shards data across nodes; hot-warm-cold architecture uses fast SSD nodes for recent logs (7 days), HDD nodes for warm logs (30 days), and S3 for cold storage (1+ years). Cost at scale: indexing all fields in Elasticsearch is expensive in CPU and storage — 10 TB/day of raw logs may require 15-20 nodes.

Grafana Loki: Loki stores logs as compressed chunks, indexed only by labels (not content). Full-text search requires a table scan over chunks — slower than Elasticsearch for ad-hoc queries. Advantage: dramatically cheaper storage (S3 as the backend), no index to maintain, and seamless Grafana integration. Best when you have structured logs with known labels and can filter by them (service, namespace, level). Unsuitable for full-text search across all log content.
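The query-model difference shows up directly in LogQL, Loki's query language: you must select a stream by its indexed labels first, then filter the unindexed content with grep-like and parser stages. A sketch (label names and the `status` field are illustrative):

```logql
# 1. Select streams by label (indexed, fast) ...
# 2. ... then filter message content (scan over chunks, slower)
{service="checkout", namespace="prod"} |= "error"

# Parse JSON log lines on the fly and filter on an extracted field:
{service="checkout"} | json | status >= 500
```

There is no way to start from `user_id=12345` across all streams the way an Elasticsearch term query can, unless `user_id` were a label, which would explode cardinality and defeat Loki's design.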

Log Sampling and Volume Management

At 100,000 requests/second, storing every log line is prohibitively expensive. Sampling strategies:

  • Error logs: 100% retention — never sample errors; they are always valuable for debugging
  • Tail-based sampling for traces: retain 100% of traces with errors or high latency; sample 1-5% of successful fast traces
  • Head-based sampling for info logs: sample 10% of successful request logs — sufficient for traffic analysis without storing every line
  • Dynamic sampling: increase sampling rate automatically when error rates spike (need more data during incidents)
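The log-level portion of these rules can be sketched as a simple head-based filter (the `SAMPLE_RATES` table and `should_emit` helper are illustrative, not a specific library's API):

```python
import random

# Per-level keep probabilities: errors always pass, info is sampled at 10%,
# debug is suppressed in production. Rates here mirror the list above.
SAMPLE_RATES = {
    "ERROR": 1.0,
    "WARN":  1.0,
    "INFO":  0.10,
    "DEBUG": 0.0,
}

def should_emit(level: str, rng=random.random) -> bool:
    """Head-based decision: made when the line is emitted, before the
    request outcome is known. Unknown levels default to keep."""
    return rng() < SAMPLE_RATES.get(level, 1.0)
```

Dynamic sampling would adjust `SAMPLE_RATES["INFO"]` upward when an error-rate monitor detects a spike, so incidents are captured at full fidelity.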

Retention Policy and Compliance

  Tier                          Duration     Storage                          Access latency
  Hot (recent logs)             7 days       Elasticsearch SSD / Loki SSD     Milliseconds
  Warm (investigation window)   30-90 days   Elasticsearch HDD / S3 + Loki    Seconds
  Cold (compliance)             1-7 years    S3 Glacier / Azure Archive       Hours

Compliance-driven retention: PCI DSS requires one year of logs; SOC 2 and HIPAA can require 6-7 years. Cold storage (S3 Glacier) at $0.004/GB/month makes long-term retention affordable: storing a terabyte costs about $4 per month.
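The cold-tier arithmetic is worth being able to do on the spot in an interview. A back-of-envelope sketch using the article's $0.004/GB/month figure (the daily-volume and compression inputs are assumptions for illustration):

```python
GLACIER_PER_GB_MONTH = 0.004  # article's S3 Glacier figure

def glacier_monthly_cost(stored_tb: float) -> float:
    """Monthly storage cost in USD for a given archived volume (TB)."""
    return stored_tb * 1000 * GLACIER_PER_GB_MONTH

def archive_growth_tb(raw_tb_per_day: float, compression_ratio: float,
                      days: int) -> float:
    """Compressed archive size accumulated over a retention window."""
    return raw_tb_per_day / compression_ratio * days

# e.g. 10 TB/day raw, ~10x compression, one year of retention:
year_tb = archive_growth_tb(10, 10, 365)   # 365 TB compressed
cost = glacier_monthly_cost(year_tb)       # a few hundred dollars/month
```

Even a year of logs from a 10 TB/day pipeline lands in the hundreds of dollars per month at Glacier prices, versus tens of thousands on an indexed hot tier.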

Alerting on Logs

Log-based alerting catches errors that metrics might miss (a new error type with no existing metric). Elasticsearch and Loki both support alert rules on log queries. Common log alerts:

  • Error rate: count of log lines with level=ERROR exceeds threshold in 5-minute window
  • Specific exception: log lines containing “OutOfMemoryError” — trigger immediately
  • No logs: expected log lines not appearing (service stopped emitting) — silent failures
  • Latency from logs: parse response_time_ms from structured logs; alert when P99 exceeds threshold
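The first alert in the list (error count over a sliding window) reduces to a small amount of bookkeeping, which alerting backends do internally. A self-contained sketch (the `ErrorRateAlert` class, threshold, and window size are illustrative):

```python
from collections import deque

class ErrorRateAlert:
    """Fire when more than `threshold` ERROR lines arrive within `window_s`."""
    def __init__(self, threshold: int, window_s: int = 300):
        self.threshold = threshold
        self.window_s = window_s
        self.events = deque()  # timestamps of recent ERROR lines

    def record_error(self, ts: float) -> bool:
        """Record one ERROR line at time `ts`; return True if alert fires."""
        self.events.append(ts)
        cutoff = ts - self.window_s
        while self.events and self.events[0] < cutoff:  # expire old events
            self.events.popleft()
        return len(self.events) > self.threshold

alert = ErrorRateAlert(threshold=100)  # >100 errors in 5 min fires
```

The "no logs" alert is the inverse: it fires when `record` has not been called at all within the window, which catches services that have silently stopped emitting.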

Key Interview Points

  • Structured JSON logging is prerequisite — free-text logs are unqueryable at scale
  • Always include trace_id in logs to correlate with distributed traces
  • Fluent Bit DaemonSet collects container logs from nodes; adds Kubernetes metadata
  • Elasticsearch: expensive but rich full-text search; Loki: cheap S3-backed but label-only indexing
  • Sample info logs (10%), never sample error logs (100%), use tail-based sampling for traces
  • Hot-warm-cold tiers: 7 days SSD, 30-90 days HDD, years on S3 Glacier for compliance

Frequently Asked Questions

What is the difference between ELK Stack and Grafana Loki for log storage?

ELK Stack (Elasticsearch, Logstash, Kibana) and Grafana Loki represent two fundamentally different indexing philosophies. Elasticsearch indexes every field in every log line — it parses each log, extracts all fields (timestamp, level, service, message, user_id, request_id, etc.), and builds inverted indexes for full-text search. This enables extremely powerful queries: search across any field, regex on message content, aggregate by any dimension. The cost: Elasticsearch uses 10-20x more storage than raw log volume, and ingestion CPU is high because every log line is fully parsed.

Loki (Grafana's log aggregation system, inspired by Prometheus) indexes only labels — a small set of key-value pairs attached to a log stream (service="checkout", env="prod", region="us-east-1"). The actual log line content is stored as compressed chunks without field extraction. Queries use LogQL: first select streams by label ({service="checkout"}), then filter within the stream by grep-like pattern (|= "error"). Loki uses 5-10x less storage than Elasticsearch. Trade-off: Loki cannot do full-text search across an arbitrary field like user_id unless it's a label — but making every field a label defeats the purpose.

Choose Elasticsearch when you need rich ad-hoc queries, compliance reporting, or security analytics across arbitrary fields. Choose Loki when the primary use case is tailing logs and filtering by service/environment, and cost efficiency matters.

How do you design log sampling to balance cost and observability?

Sampling reduces log volume while preserving observability for the cases that matter. The key insight: not all logs have equal value. Error logs are rare and always valuable — sample 100%. Slow traces (above p99 latency threshold) are infrequent and always worth keeping — sample 100%. Informational traces are frequent and mostly uninteresting — sample 1-10%. Implementation strategies:

  • Head-based sampling: the decision is made at the start of a request and propagates to all downstream services via the trace context header (X-B3-Sampled or W3C traceparent). Simple, but cannot sample based on outcome (you don't know yet whether the request will error).
  • Tail-based sampling: buffer the complete trace and make the sampling decision after the root span completes. Can keep 100% of errors and slow traces regardless of the overall sample rate. Requires a trace collector (Jaeger, Tempo) to buffer spans and apply tail-sampling rules.
  • Log-level sampling: always emit ERROR and WARN; sample INFO at 10%; suppress DEBUG in production entirely. Fluent Bit and Fluentd support sampling filters.

Business logic matters too: for financial transactions, sample 100% regardless of log level — a full audit trail is a compliance requirement. For high-traffic read-only endpoints (health checks, catalog browsing), 1% sampling is sufficient.
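A tail-based sampling decision, made after the trace completes, can be sketched in a few lines (the trace dict shape, the 5% baseline, and the `keep_trace` helper are illustrative assumptions, not a collector's real API):

```python
import random

def keep_trace(trace: dict, p99_ms: float = 500.0,
               baseline: float = 0.05, rng=random.random) -> bool:
    """Tail-sampling rule: decided only after the root span has finished."""
    if trace.get("error"):
        return True                     # keep 100% of errored traces
    if trace.get("duration_ms", 0) > p99_ms:
        return True                     # keep 100% of slow traces
    return rng() < baseline             # sample fast successes at 5%
```

This is exactly what head-based sampling cannot do: at request start, `error` and `duration_ms` do not exist yet, so the head-based decision can only be random.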

How do you implement trace ID correlation across microservices for distributed tracing?

Distributed tracing requires propagating a trace_id (and span_id) through every service call so logs, metrics, and traces from all services participating in a single request can be correlated. Implementation:

1. The entry point (API gateway or first service) generates a trace_id (UUID or 128-bit random ID) if the incoming request does not already have one.
2. The trace_id is injected into outgoing HTTP request headers using a standard propagation format: W3C Trace Context (traceparent: 00-{trace_id}-{span_id}-{flags}) or B3 propagation (X-B3-TraceId, X-B3-SpanId).
3. Every service reads the trace_id from incoming headers, includes it in all structured log lines ({"trace_id": "abc123", "service": "checkout", "level": "info", "msg": "…"}), and propagates it to all outgoing calls (HTTP headers, Kafka message headers, gRPC metadata).
4. Log aggregation systems (ELK, Loki) can then be queried by trace_id to retrieve all logs across all services for a single request.
5. OpenTelemetry is the standard SDK for instrumentation — it handles propagation automatically for common HTTP clients and frameworks.

The key operational benefit: when an error occurs in service C that was triggered by a request to service A, searching for the trace_id in Kibana shows the complete chain (service A → service B → service C), with timing for each hop and all log lines from all services for that request.
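The generate-or-reuse step for W3C Trace Context headers can be sketched as follows. This is a simplified illustration (real SDKs like OpenTelemetry handle version negotiation, flags, and tracestate; the helper names here are assumptions):

```python
import re
import secrets

# traceparent: version "00", 128-bit trace_id, 64-bit parent span_id, flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$")

def extract_or_start_trace(headers: dict) -> tuple:
    """Return (trace_id, span_id) for this hop: reuse the incoming
    trace_id if a valid traceparent header is present, else start a
    new trace at the edge."""
    span_id = secrets.token_hex(8)                 # fresh 64-bit span id
    m = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    trace_id = m.group(1) if m else secrets.token_hex(16)  # 128-bit
    return trace_id, span_id

def outgoing_headers(trace_id: str, span_id: str) -> dict:
    """Build the traceparent header to attach to downstream calls."""
    return {"traceparent": f"00-{trace_id}-{span_id}-01"}
```

The same `trace_id` is what each service writes into its structured log lines, closing the loop between the trace view and log search.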

