System Design: Observability Platform — Logs, Metrics, Distributed Tracing, and Alerting

The Three Pillars of Observability

Observability lets engineers understand the internal state of a system from its external outputs. Three pillars: Logs (discrete event records — “user 123 logged in at 2:03pm”), Metrics (numerical measurements over time — “HTTP latency p99 = 450ms”), and Traces (end-to-end request journeys across microservices — “request ABC took 230ms: 10ms in API gateway, 120ms in user service, 100ms in DB”). Modern platforms combine all three for root-cause analysis: a metric alert shows latency spiked → a trace shows which service is slow → logs show the specific error in that service.

Log Aggregation Pipeline

Services write structured logs (JSON) to stdout. A log agent (Fluentd, Filebeat, Vector) on each host captures stdout, adds metadata (hostname, service name, environment), and ships to a central log store. Transport: Kafka buffers the log stream. Log store: Elasticsearch for full-text search (ELK stack), or ClickHouse for analytics queries. Index strategy: one index per day per service (logs-user-service-2025-04-17). Retention: keep 30 days in hot storage (Elasticsearch), archive to S3 Glacier for 1 year for compliance. Sampling: at high throughput (millions of log lines/sec), sample verbose logs (debug, info) — only store 1% of INFO logs but 100% of WARN and ERROR. Structured logging: enforce JSON format with required fields (request_id, user_id, service, timestamp, level) — enables fast filtered queries without regex parsing.
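The structured-logging and sampling rules above can be sketched with Python's standard logging module. This is a minimal illustration, not a production agent: the service name, field set, and 1% INFO sample rate follow the text; the `JsonFormatter` and `InfoSamplingFilter` class names are hypothetical.

```python
import json
import logging
import random
import sys
import time

INFO_SAMPLE_RATE = 0.01  # store 1% of INFO logs, 100% of WARN and ERROR

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with the required fields."""
    def format(self, record):
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "user-service",  # hypothetical service name
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
            "message": record.getMessage(),
        })

class InfoSamplingFilter(logging.Filter):
    """Drop all but a sampled fraction of INFO-and-below records."""
    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # never drop WARN or ERROR
        return random.random() < INFO_SAMPLE_RATE

handler = logging.StreamHandler(sys.stdout)  # services log to stdout
handler.setFormatter(JsonFormatter())
handler.addFilter(InfoSamplingFilter())
logger = logging.getLogger("user-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning("login failed", extra={"request_id": "abc123", "user_id": 123})
```

Because every line is a JSON object with fixed field names, the log store can filter on `request_id` or `level` directly instead of regex-parsing free text.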

Metrics Collection and Storage

Services expose metrics via an HTTP endpoint (/metrics) in Prometheus format. A Prometheus scraper pulls metrics from each service every 15 seconds. For push-based metrics (short-lived jobs, IoT): services push to a push gateway, which Prometheus scrapes. Metrics types: Counter (monotonically increasing — request count), Gauge (can go up or down — memory usage), Histogram (buckets for distribution — latency buckets: 10ms, 50ms, 100ms, 500ms, 1s+), Summary (pre-computed quantiles — p50, p95, p99). Storage: Prometheus stores data in a local TSDB (time-series DB) for 2 weeks. Long-term storage: remote write to Thanos or Cortex (horizontally scalable Prometheus) backed by S3. Query language: PromQL — rate(http_requests_total[5m]) gives the per-second request rate over the last 5 minutes.
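To make the histogram type concrete, here is a stdlib-only sketch of a Prometheus-style histogram with cumulative buckets, rendering the text exposition format a /metrics endpoint would serve. The bucket bounds and metric name are illustrative; a real service would use an official Prometheus client library instead.

```python
class Histogram:
    """Minimal Prometheus-style histogram: cumulative buckets plus _sum and _count."""
    def __init__(self, name, buckets=(0.01, 0.05, 0.1, 0.5, 1.0)):
        self.name = name
        self.buckets = sorted(buckets)
        self.counts = {b: 0 for b in self.buckets}  # cumulative count per upper bound
        self.total = 0
        self.sum = 0.0

    def observe(self, value):
        """Record one observation (e.g., a request latency in seconds)."""
        self.total += 1
        self.sum += value
        for b in self.buckets:
            if value <= b:
                self.counts[b] += 1  # cumulative: value lands in every bucket above it

    def expose(self):
        """Render the Prometheus text exposition format served at /metrics."""
        lines = [f"# TYPE {self.name} histogram"]
        for b in self.buckets:
            lines.append(f'{self.name}_bucket{{le="{b}"}} {self.counts[b]}')
        lines.append(f'{self.name}_bucket{{le="+Inf"}} {self.total}')
        lines.append(f"{self.name}_sum {self.sum}")
        lines.append(f"{self.name}_count {self.total}")
        return "\n".join(lines)

h = Histogram("request_duration_seconds")
for latency in (0.02, 0.04, 0.3, 2.0):  # hypothetical request latencies
    h.observe(latency)
print(h.expose())
```

The cumulative bucket layout is what lets PromQL's `histogram_quantile()` estimate p99 latency from bucket counts on the server side.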

Distributed Tracing

A trace represents a single request’s journey across all services. Each operation is a span: (trace_id, span_id, parent_span_id, service, operation, start_time, duration, tags, status). The trace_id propagates through all service calls via HTTP headers (X-Trace-ID) or gRPC metadata. Instrumentation: OpenTelemetry SDK auto-instruments HTTP clients and servers to create and propagate spans. Trace data is sent to a collector (OpenTelemetry Collector) which batches and forwards to a trace backend (Jaeger, Zipkin, or a commercial APM like Datadog, Honeycomb). Sampling: tracing 100% of requests is expensive (high storage and CPU overhead). Head-based sampling: sample N% of all traces at the entry point (e.g., 1% for high-throughput services). Tail-based sampling: collect all spans for every request, then discard non-interesting traces (no errors, below a latency threshold) at the collector — captures all errors and slow traces.
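The span model and header propagation described above can be sketched as follows. This is a simplified illustration, assuming a `trace_id:span_id` value in the X-Trace-ID header from the text (real OpenTelemetry instrumentation uses the W3C `traceparent` header); the helper names are hypothetical.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

TRACE_HEADER = "X-Trace-ID"  # header name from the text; carries "trace_id:span_id"

@dataclass
class Span:
    """One operation in a trace, linked to its parent by parent_span_id."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    service: str
    operation: str
    start_time: float = field(default_factory=time.time)
    duration: float = 0.0
    status: str = "ok"

def start_span(service, operation, headers):
    """Continue the trace from incoming headers, or start a new one at the entry point."""
    incoming = headers.get(TRACE_HEADER)
    if incoming:
        trace_id, parent_span_id = incoming.split(":")
    else:
        trace_id, parent_span_id = uuid.uuid4().hex, None  # this service is the entry point
    return Span(trace_id, uuid.uuid4().hex[:16], parent_span_id, service, operation)

def inject(span, headers):
    """Write trace context into outgoing headers for the next service call."""
    headers[TRACE_HEADER] = f"{span.trace_id}:{span.span_id}"
    return headers

# The API gateway starts the trace; the user service continues it.
gateway = start_span("api-gateway", "GET /users/123", {})
outgoing = inject(gateway, {})
user = start_span("user-service", "get_user", outgoing)
```

Because every span carries the same `trace_id` and a `parent_span_id` pointing at its caller, the backend can reassemble the spans into the request's tree after the fact.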

Correlation and Alerting

Correlation: every log line should include the trace_id, allowing one-click navigation from a trace span to the corresponding log lines. A metric alert says “p99 latency > 2s” → find a high-latency trace from that time window → follow it to the log lines in the slow service. Alerting: Prometheus evaluates alert rules (written in PromQL) and forwards firing alerts to Alertmanager. Example rule: fire HighLatency when histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m])) > 2. Notification routing: send P0 alerts to PagerDuty (wakes the on-call), P1 to Slack #incidents, P2 to email. Deduplication: group alerts on the same service to prevent alert storms. Inhibition: suppress lower-severity alerts when a higher-severity one is active for the same service (database is down → suppress all downstream service errors).
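Grouping and inhibition can be sketched in a few lines. This is an illustration of the logic, not Alertmanager's implementation; the severity convention (P0 most severe) and the sample alerts are hypothetical.

```python
from collections import defaultdict

SEVERITY = {"P0": 0, "P1": 1, "P2": 2}  # lower number = more severe (assumed convention)

def group_alerts(alerts):
    """Deduplicate by (alertname, service) so an alert storm becomes one notification."""
    groups = defaultdict(list)
    for a in alerts:
        groups[(a["alertname"], a["service"])].append(a)
    return groups

def apply_inhibition(alerts):
    """Suppress lower-severity alerts when a more severe alert fires for the same service."""
    worst = {}  # service -> most severe firing severity
    for a in alerts:
        s = a["service"]
        if s not in worst or SEVERITY[a["severity"]] < SEVERITY[worst[s]]:
            worst[s] = a["severity"]
    return [a for a in alerts if a["severity"] == worst[a["service"]]]

alerts = [
    {"alertname": "DatabaseDown", "service": "db", "severity": "P0"},
    {"alertname": "HighErrorRate", "service": "db", "severity": "P1"},  # suppressed
    {"alertname": "HighLatency", "service": "api", "severity": "P1"},
]
print(apply_inhibition(alerts))
```

Here the P1 database error alert is suppressed because the P0 DatabaseDown alert is already firing for the same service, while the unrelated api alert still goes out.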

Interview Tips

  • Logs vs metrics: logs are for debugging individual events; metrics are for aggregate health. Logs are expensive at scale; metrics are cheap (fixed cardinality).
  • Cardinality explosion: avoid high-cardinality labels on metrics (e.g., user_id as a label creates millions of time series). Use low-cardinality labels only (service, endpoint, status_code).
  • OpenTelemetry is the vendor-neutral standard for instrumentation — use it to stay portable across backends.
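The cardinality-explosion point is easy to quantify: the worst-case number of time series is the product of each label's cardinality. A tiny sketch, with illustrative cardinalities:

```python
from math import prod

def series_count(label_cardinalities):
    """Worst-case time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())

good = {"service": 20, "endpoint": 50, "status_code": 5}  # low-cardinality labels
bad = dict(good, user_id=1_000_000)                       # adding user_id explodes it

print(series_count(good))  # 5_000 series: manageable
print(series_count(bad))   # 5_000_000_000 series: far beyond what Prometheus can hold
```

One extra high-cardinality label multiplies, not adds, so per-user detail belongs in logs or traces, never in metric labels.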

Frequently Asked Questions

What is the difference between logs, metrics, and traces?

Logs are discrete, timestamped records of events: “user 123 failed login at 14:03:22 from IP 1.2.3.4.” They are human-readable text or structured JSON. Best for: debugging specific incidents, capturing error details, audit trails. High storage cost at scale. Metrics are numerical measurements aggregated over time: “HTTP 500 error rate = 2.3% over the last 5 minutes.” They are low-cardinality (few unique label combinations) and cheap to store. Best for: dashboards, alerting, capacity planning. Traces are end-to-end records of a single request’s path through multiple services: “Request X took 240ms: 5ms in load balancer, 110ms in auth service, 125ms in database.” Best for: understanding latency bottlenecks and identifying which service in a chain is slow. The three complement each other: a metric alert leads you to a trace, and a trace leads you to log lines.

How do you handle high-cardinality labels in Prometheus without metric explosion?

Cardinality explosion: if you use user_id as a Prometheus metric label (millions of unique values), Prometheus creates one time series per unique label combination — millions of series for a single metric. Memory and query performance degrade catastrophically. Rule: only use low-cardinality labels (service, endpoint, status_code). For per-user or per-request detail, use logs or traces instead — they are designed for high-cardinality data — and pivot from a metric data point to a specific trace via exemplars.

How does tail-based trace sampling keep the interesting traces?

At the collector, buffer all spans for each trace until the request completes, then apply keep rules: keep every trace with an error or with latency > 2 seconds; keep 1% of the rest. Forward the kept traces to Jaeger. Discard the rest. This guarantees 100% capture of errors and slow traces while still managing storage volume. The buffer requirement means the collector must hold spans in memory for the full window — scale collectors horizontally.

How do you correlate logs, metrics, and traces during an incident?

Effective correlation requires a shared identifier propagated through all three signals. The trace_id is the key. When a request enters your system: generate a trace_id. Inject it into all log lines for that request: {“trace_id”: “abc123”, “level”: “error”, “message”: “DB timeout”}. Use it as the span identifier in the trace. Optionally add it as an exemplar on histogram metrics (Prometheus exemplars link a specific metric data point to a trace_id). Investigation workflow: (1) A metric alert fires: p99 latency > 3s for service X. (2) A Grafana dashboard shows the spike started at 14:25. (3) In Jaeger, filter traces for service X from 14:25, status=slow; find trace_id “abc123.” (4) In that trace, user-service took 2.5s. (5) Jump to logs: search trace_id=“abc123” in Kibana; find “DB query exceeded 2.5s timeout.” (6) Root cause found in 2 minutes instead of 30.

How do you design an alerting system that avoids alert fatigue?

Alert fatigue: too many alerts train on-call engineers to ignore them. Design principles: (1) Alert on symptoms, not causes. “User-facing error rate > 1%” is a symptom alert — it directly indicates user impact. “Database CPU > 80%” is a cause alert — high CPU may not affect users. Alert on symptoms; use cause metrics for dashboards. (2) Set thresholds based on data, not intuition: measure the baseline and alert at three standard deviations above the mean. (3) Require sustained violations: alert only after the condition holds for 5 minutes, which avoids transient spikes. (4) Deduplicate and group: multiple services failing from the same root cause should create one alert, not dozens; Alertmanager groups by (alertname, cluster). (5) Auto-resolve: alerts should close automatically when the condition is no longer met. (6) Actionable alerts: every alert should link to a runbook describing exactly what to do.

Asked at: Cloudflare, Databricks, Netflix, Atlassian
