Question 1

What are the three pillars of observability and when do you use each?

Accepted Answer

Metrics: low-cardinality numerical measurements over time (request rate, error rate, p99 latency). Cheap to store, good for dashboards and alerting. Use to detect that something is wrong. Logs: structured event records with rich context (request ID, user ID, error details). High-cardinality, expensive at scale. Use to understand why something is wrong. Traces: records of a request's path through distributed services, showing timing and causality across service boundaries. Use to find where in the system the bottleneck or error originates. In practice: a metrics alert fires, you drill into logs for the affected time window and service, then follow a trace to pinpoint the exact slow or failing call. OpenTelemetry provides a unified instrumentation standard that correlates all three via shared trace_id and span_id.

Question 2

How does Prometheus scraping differ from push-based metrics collection?

Accepted Answer

Pull (Prometheus): Prometheus actively scrapes /metrics HTTP endpoints on each service at regular intervals (15-30s). Services expose metrics without knowing who reads them. Advantages: Prometheus controls the scrape rate; if a service is down, Prometheus knows (scrape fails); no extra infrastructure needed. Disadvantage: services must be discoverable by Prometheus (Kubernetes service discovery, Consul, etc.); harder for short-lived jobs (use Pushgateway as intermediary). Push (StatsD, InfluxDB line protocol): services actively send metrics to a central collector. Advantages: works for short-lived jobs, batch processes, and environments where Prometheus cannot reach the service. Disadvantages: metric loss if the collector is slow; services must know the collector's address. For long-running services in Kubernetes, Prometheus pull is preferred. For batch jobs and lambdas, push is more practical.

Question 3

What is tail-based sampling in distributed tracing and why is it better than head-based?

Accepted Answer

Head-based sampling: decide at the request entry point whether to trace this request (e.g., sample 1%). Simple, low overhead. Problem: you discard 99% of traces before you know if they are interesting — slow or error traces are sampled at the same rate as fast/successful ones, and rare error patterns may never be captured. Tail-based sampling: record all spans in a buffer, make the sampling decision when the trace is complete (tail = after all spans arrive). Can then apply smart rules: always keep 100% of error traces, always keep traces with p99+ latency, sample only 1% of successful fast traces. Result: much better coverage of interesting traces with the same storage budget. Implementation: a tail-sampling proxy (OpenTelemetry Collector, Jaeger sampling server) receives all spans, groups them by trace_id, waits for the root span to close, then applies the sampling decision.

Question 4

What are the four golden signals and how do you alert on them?

Accepted Answer

The four golden signals (Google SRE Book): (1) Latency: the time to serve a request. Alert when p99 exceeds threshold (e.g., > 500ms for 5 minutes). Distinguish successful vs. error latency separately. (2) Traffic: request volume (RPS, QPS). Alert on anomalous drops (potential outage) or spikes (capacity planning). (3) Errors: rate of failed requests (5xx, timeouts, exceptions). Alert when error rate > 1% for 1 minute. (4) Saturation: how full the service is (CPU, memory, connection pool, queue depth). Alert when > 80% for sustained periods. Alert design: use multi-window burn rate for SLO-based alerting (avoids false positives from transient spikes). For each signal: fast (1-hour window) and slow (6-hour window) burn rate alerts at different severity levels.

Question 5

How does structured logging improve searchability and reduce operational cost?

Accepted Answer

Unstructured logs: "User 12345 failed to checkout at 14:32:05 with error payment_timeout". To find all payment_timeout errors for user 12345, you need regex search across all log lines. Structured JSON logs: {"timestamp":"2026-04-17T14:32:05Z","level":"ERROR","user_id":12345,"order_id":"ord_abc","error":"payment_timeout","duration_ms":5023}. Elasticsearch can now query: level=ERROR AND user_id=12345 AND error=payment_timeout in milliseconds using inverted indexes. Operational benefits: dashboards aggregate by error type without regex. Alerts trigger on JSON field values. Log correlation: every log line includes trace_id and span_id — click a log line to jump directly to the distributed trace. The cost: structured logs are slightly larger (JSON overhead), but the reduction in engineer time to diagnose issues far outweighs the storage cost.

System Design: Observability Platform — Metrics, Logs, Traces, and Alerting at Scale (2025)

The Three Pillars of Observability

Metrics Pipeline: Prometheus Architecture

Log Pipeline: Structured Logging and ELK

Distributed Tracing: OpenTelemetry and Jaeger

Alerting and On-Call