System Design: Observability Platform — Logs, Metrics, Distributed Tracing, and Alerting

The Three Pillars of Observability

Observability lets engineers understand the internal state of a system from its external outputs. Three pillars: Logs (discrete event records — “user 123 logged in at 2:03pm”), Metrics (numerical measurements over time — “HTTP latency p99 = 450ms”), and Traces (end-to-end request journeys across microservices — “request ABC took 230ms: 10ms in API gateway, 120ms in user service, 100ms in DB”). Modern platforms combine all three for root-cause analysis: a metric alert shows latency spiked → a trace shows which service is slow → logs show the specific error in that service.

Log Aggregation Pipeline

Services write structured logs (JSON) to stdout. A log agent (Fluentd, Filebeat, Vector) on each host captures stdout, adds metadata (hostname, service name, environment), and ships to a central log store. Transport: Kafka buffers the log stream. Log store: Elasticsearch for full-text search (ELK stack), or ClickHouse for analytics queries. Index strategy: one index per day per service (logs-user-service-2025-04-17). Retention: keep 30 days in hot storage (Elasticsearch), archive to S3 Glacier for 1 year for compliance. Sampling: at high throughput (millions of log lines/sec), sample verbose logs (debug, info) — only store 1% of INFO logs but 100% of WARN and ERROR. Structured logging: enforce JSON format with required fields (request_id, user_id, service, timestamp, level) — enables fast filtered queries without regex parsing.
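The structured-logging and sampling rules above can be sketched with Python's standard logging module. This is a minimal illustration, not a production agent: the service name, field set, and 1% INFO sample rate follow the text; the `JsonFormatter` and `InfoSamplingFilter` class names are hypothetical.

```python
import json
import logging
import random
import sys
import time

INFO_SAMPLE_RATE = 0.01  # store 1% of INFO logs, 100% of WARN and ERROR

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with the required fields."""
    def format(self, record):
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "user-service",  # hypothetical service name
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
            "message": record.getMessage(),
        })

class InfoSamplingFilter(logging.Filter):
    """Drop all but a sampled fraction of INFO-and-below records."""
    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # never drop WARN or ERROR
        return random.random() < INFO_SAMPLE_RATE

handler = logging.StreamHandler(sys.stdout)  # services log to stdout
handler.setFormatter(JsonFormatter())
handler.addFilter(InfoSamplingFilter())
logger = logging.getLogger("user-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning("login failed", extra={"request_id": "abc123", "user_id": 123})
```

Because every line is a JSON object with fixed field names, the log store can filter on `request_id` or `level` directly instead of regex-parsing free text.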

Metrics Collection and Storage

Services expose metrics via an HTTP endpoint (/metrics) in Prometheus format. A Prometheus scraper pulls metrics from each service every 15 seconds. For push-based metrics (short-lived jobs, IoT): services push to a push gateway, which Prometheus scrapes. Metrics types: Counter (monotonically increasing — request count), Gauge (can go up or down — memory usage), Histogram (buckets for distribution — latency buckets: 10ms, 50ms, 100ms, 500ms, 1s+), Summary (pre-computed quantiles — p50, p95, p99). Storage: Prometheus stores data in a local TSDB (time-series DB) for 2 weeks. Long-term storage: remote write to Thanos or Cortex (horizontally scalable Prometheus) backed by S3. Query language: PromQL — rate(http_requests_total[5m]) gives the per-second request rate over the last 5 minutes.
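To make the histogram type concrete, here is a stdlib-only sketch of a Prometheus-style histogram with cumulative buckets, rendering the text exposition format a /metrics endpoint would serve. The bucket bounds and metric name are illustrative; a real service would use an official Prometheus client library instead.

```python
class Histogram:
    """Minimal Prometheus-style histogram: cumulative buckets plus _sum and _count."""
    def __init__(self, name, buckets=(0.01, 0.05, 0.1, 0.5, 1.0)):
        self.name = name
        self.buckets = sorted(buckets)
        self.counts = {b: 0 for b in self.buckets}  # cumulative count per upper bound
        self.total = 0
        self.sum = 0.0

    def observe(self, value):
        """Record one observation (e.g., a request latency in seconds)."""
        self.total += 1
        self.sum += value
        for b in self.buckets:
            if value <= b:
                self.counts[b] += 1  # cumulative: value lands in every bucket above it

    def expose(self):
        """Render the Prometheus text exposition format served at /metrics."""
        lines = [f"# TYPE {self.name} histogram"]
        for b in self.buckets:
            lines.append(f'{self.name}_bucket{{le="{b}"}} {self.counts[b]}')
        lines.append(f'{self.name}_bucket{{le="+Inf"}} {self.total}')
        lines.append(f"{self.name}_sum {self.sum}")
        lines.append(f"{self.name}_count {self.total}")
        return "\n".join(lines)

h = Histogram("request_duration_seconds")
for latency in (0.02, 0.04, 0.3, 2.0):  # hypothetical request latencies
    h.observe(latency)
print(h.expose())
```

The cumulative bucket layout is what lets PromQL's `histogram_quantile()` estimate p99 latency from bucket counts on the server side.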

Distributed Tracing

A trace represents a single request’s journey across all services. Each operation is a span: (trace_id, span_id, parent_span_id, service, operation, start_time, duration, tags, status). The trace_id propagates through all service calls via HTTP headers (X-Trace-ID) or gRPC metadata. Instrumentation: OpenTelemetry SDK auto-instruments HTTP clients and servers to create and propagate spans. Trace data is sent to a collector (OpenTelemetry Collector) which batches and forwards to a trace backend (Jaeger, Zipkin, or a commercial APM like Datadog, Honeycomb). Sampling: tracing 100% of requests is expensive (high storage and CPU overhead). Head-based sampling: sample N% of all traces at the entry point (e.g., 1% for high-throughput services). Tail-based sampling: collect all spans for every request, then discard non-interesting traces (no errors, below a latency threshold) at the collector — captures all errors and slow traces.
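The span model and header propagation described above can be sketched as follows. This is a simplified illustration, assuming a `trace_id:span_id` value in the X-Trace-ID header from the text (real OpenTelemetry instrumentation uses the W3C `traceparent` header); the helper names are hypothetical.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

TRACE_HEADER = "X-Trace-ID"  # header name from the text; carries "trace_id:span_id"

@dataclass
class Span:
    """One operation in a trace, linked to its parent by parent_span_id."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    service: str
    operation: str
    start_time: float = field(default_factory=time.time)
    duration: float = 0.0
    status: str = "ok"

def start_span(service, operation, headers):
    """Continue the trace from incoming headers, or start a new one at the entry point."""
    incoming = headers.get(TRACE_HEADER)
    if incoming:
        trace_id, parent_span_id = incoming.split(":")
    else:
        trace_id, parent_span_id = uuid.uuid4().hex, None  # this service is the entry point
    return Span(trace_id, uuid.uuid4().hex[:16], parent_span_id, service, operation)

def inject(span, headers):
    """Write trace context into outgoing headers for the next service call."""
    headers[TRACE_HEADER] = f"{span.trace_id}:{span.span_id}"
    return headers

# The API gateway starts the trace; the user service continues it.
gateway = start_span("api-gateway", "GET /users/123", {})
outgoing = inject(gateway, {})
user = start_span("user-service", "get_user", outgoing)
```

Because every span carries the same `trace_id` and a `parent_span_id` pointing at its caller, the backend can reassemble the spans into the request's tree after the fact.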

Correlation and Alerting

Correlation: every log line should include the trace_id, allowing one-click navigation from a trace span to the corresponding log lines. A metric alert says “p99 latency > 2s” → find a high-latency trace from that time window → follow it to the log lines in the slow service. Alerting: Prometheus evaluates alert rules (written in PromQL) and forwards firing alerts to Alertmanager. Example rule: fire HighLatency when histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m])) > 2. Notification routing: send P0 alerts to PagerDuty (wakes the on-call), P1 to Slack #incidents, P2 to email. Deduplication: group alerts on the same service to prevent alert storms. Inhibition: suppress lower-severity alerts when a higher-severity one is active for the same service (database is down → suppress all downstream service errors).
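Grouping and inhibition can be sketched in a few lines. This is an illustration of the logic, not Alertmanager's implementation; the severity convention (P0 most severe) and the sample alerts are hypothetical.

```python
from collections import defaultdict

SEVERITY = {"P0": 0, "P1": 1, "P2": 2}  # lower number = more severe (assumed convention)

def group_alerts(alerts):
    """Deduplicate by (alertname, service) so an alert storm becomes one notification."""
    groups = defaultdict(list)
    for a in alerts:
        groups[(a["alertname"], a["service"])].append(a)
    return groups

def apply_inhibition(alerts):
    """Suppress lower-severity alerts when a more severe alert fires for the same service."""
    worst = {}  # service -> most severe firing severity
    for a in alerts:
        s = a["service"]
        if s not in worst or SEVERITY[a["severity"]] < SEVERITY[worst[s]]:
            worst[s] = a["severity"]
    return [a for a in alerts if a["severity"] == worst[a["service"]]]

alerts = [
    {"alertname": "DatabaseDown", "service": "db", "severity": "P0"},
    {"alertname": "HighErrorRate", "service": "db", "severity": "P1"},  # suppressed
    {"alertname": "HighLatency", "service": "api", "severity": "P1"},
]
print(apply_inhibition(alerts))
```

Here the P1 database error alert is suppressed because the P0 DatabaseDown alert is already firing for the same service, while the unrelated api alert still goes out.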

Interview Tips

  • Logs vs metrics: logs are for debugging individual events; metrics are for aggregate health. Logs are expensive at scale; metrics are cheap (fixed cardinality).
  • Cardinality explosion: avoid high-cardinality labels on metrics (e.g., user_id as a label creates millions of time series). Use low-cardinality labels only (service, endpoint, status_code).
  • OpenTelemetry is the vendor-neutral standard for instrumentation — use it to stay portable across backends.
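The cardinality-explosion point is easy to quantify: the worst-case number of time series is the product of each label's cardinality. A tiny sketch, with illustrative cardinalities:

```python
from math import prod

def series_count(label_cardinalities):
    """Worst-case time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())

good = {"service": 20, "endpoint": 50, "status_code": 5}  # low-cardinality labels
bad = dict(good, user_id=1_000_000)                       # adding user_id explodes it

print(series_count(good))  # 5_000 series: manageable
print(series_count(bad))   # 5_000_000_000 series: far beyond what Prometheus can hold
```

One extra high-cardinality label multiplies, not adds, so per-user detail belongs in logs or traces, never in metric labels.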

Frequently Asked Questions

What is the difference between logs, metrics, and traces?

Logs are discrete, timestamped records of events: “user 123 failed login at 14:03:22 from IP 1.2.3.4.” They are human-readable text or structured JSON. Best for: debugging specific incidents, capturing error details, audit trails. High storage cost at scale. Metrics are numerical measurements aggregated over time: “HTTP 500 error rate = 2.3% over the last 5 minutes.” They are low-cardinality (few unique label combinations) and cheap to store. Best for: dashboards, alerting, capacity planning. Traces are end-to-end records of a single request’s path through multiple services: “Request X took 240ms: 5ms in load balancer, 110ms in auth service, 125ms in database.” Best for: understanding latency bottlenecks and identifying which service in a chain is slow. The three complement each other: a metric alert leads you to a trace, and a trace leads you to log lines.

How do you handle high-cardinality labels in Prometheus without metric explosion?

Cardinality explosion: if you use user_id as a Prometheus metric label (millions of unique values), Prometheus creates one time series per unique label combination — millions of series for a single metric. Memory and query performance degrade catastrophically. Rule: only use low-cardinality labels (service, endpoint, status_code). For per-user or per-request detail, use logs or traces instead — they are designed for high-cardinality data — and pivot from a metric data point to a specific trace via exemplars.

How does tail-based trace sampling keep the interesting traces?

At the collector, buffer all spans for each trace until the request completes, then apply keep rules: keep every trace with an error or with latency > 2 seconds; keep 1% of the rest. Forward the kept traces to Jaeger. Discard the rest. This guarantees 100% capture of errors and slow traces while still managing storage volume. The buffer requirement means the collector must hold spans in memory for the full window — scale collectors horizontally.

How do you correlate logs, metrics, and traces during an incident?

Effective correlation requires a shared identifier propagated through all three signals. The trace_id is the key. When a request enters your system: generate a trace_id. Inject it into all log lines for that request: {“trace_id”: “abc123”, “level”: “error”, “message”: “DB timeout”}. Use it as the span identifier in the trace. Optionally add it as an exemplar on histogram metrics (Prometheus exemplars link a specific metric data point to a trace_id). Investigation workflow: (1) A metric alert fires: p99 latency > 3s for service X. (2) A Grafana dashboard shows the spike started at 14:25. (3) In Jaeger, filter traces for service X from 14:25, status=slow; find trace_id “abc123.” (4) In that trace, user-service took 2.5s. (5) Jump to logs: search trace_id=“abc123” in Kibana; find “DB query exceeded 2.5s timeout.” (6) Root cause found in 2 minutes instead of 30.

How do you design an alerting system that avoids alert fatigue?

Alert fatigue: too many alerts train on-call engineers to ignore them. Design principles: (1) Alert on symptoms, not causes. “User-facing error rate > 1%” is a symptom alert — it directly indicates user impact. “Database CPU > 80%” is a cause alert — high CPU may not affect users. Alert on symptoms; use cause metrics for dashboards. (2) Set thresholds based on data, not intuition: measure the baseline and alert at three standard deviations above the mean. (3) Require sustained violations: alert only after the condition holds for 5 minutes, which avoids transient spikes. (4) Deduplicate and group: multiple services failing from the same root cause should create one alert, not dozens; Alertmanager groups by (alertname, cluster). (5) Auto-resolve: alerts should close automatically when the condition is no longer met. (6) Actionable alerts: every alert should link to a runbook describing exactly what to do.

Asked at: Cloudflare, Databricks, Netflix, Atlassian
