Observability is the ability to understand the internal state of a system from its external outputs. A system is observable when you can diagnose any issue — including issues you have never seen before — purely from metrics, logs, and traces that the system emits. Observability is not just about adding dashboards; it requires designing telemetry into the system from the start, at every layer.
The Three Pillars
Metrics: numerical measurements over time (request rate, error rate, latency percentiles, queue depth). Metrics are aggregated — they tell you what is happening at the service level. Best for alerting and dashboards.

Logs: discrete events with context (a specific error, a request completion with its details). Logs are per-event — they tell you what happened in a specific transaction. Best for debugging specific incidents.

Traces: the journey of a request through the system, with timing for each step. Traces show the call graph — which service caused the latency spike. Best for root-cause analysis in distributed systems.

Observability requires all three; none alone is sufficient.
The RED Method
The RED method defines the minimum metrics every service must emit: Rate (requests per second — how busy is the service?), Errors (error rate as a percentage — is the service healthy?), Duration (latency percentiles — how fast is the service?). These three metrics are sufficient to detect most service-level problems and form the basis of SLO measurement. Instrument every service endpoint with RED metrics at the framework level (middleware, interceptors) rather than manually in each handler — this ensures no endpoint is missed.
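As a minimal sketch of RED instrumentation at the middleware level, the following uses only the Python standard library; the class and function names (REDMetrics, red_middleware) are illustrative, not from any specific framework or metrics client:

```python
import time
from collections import defaultdict

class REDMetrics:
    """Tracks Rate, Errors, and Duration per endpoint."""
    def __init__(self):
        self.requests = defaultdict(int)    # Rate: total request count
        self.errors = defaultdict(int)      # Errors: 5xx count
        self.durations = defaultdict(list)  # Duration: latencies in seconds

    def observe(self, endpoint, status_code, duration_s):
        self.requests[endpoint] += 1
        if status_code >= 500:
            self.errors[endpoint] += 1
        self.durations[endpoint].append(duration_s)

    def error_rate(self, endpoint):
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0

    def p99(self, endpoint):
        """Approximate p99 by index into the sorted latency list."""
        xs = sorted(self.durations[endpoint])
        return xs[min(len(xs) - 1, int(0.99 * len(xs)))] if xs else 0.0

def red_middleware(metrics, endpoint, handler):
    """Wrap a handler so every call is observed -- no endpoint missed."""
    def wrapped(*args, **kwargs):
        start = time.monotonic()
        status = 500  # if the handler raises, record it as a server error
        try:
            status, body = handler(*args, **kwargs)
            return status, body
        finally:
            metrics.observe(endpoint, status, time.monotonic() - start)
    return wrapped
```

Because the wrapper observes in a finally block, even handlers that raise are counted, which is the point of instrumenting at the framework layer rather than per handler.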
Structured Logging
Log in a machine-parseable format (JSON) rather than free-form strings. Every log line should include: timestamp (ISO 8601), level (INFO/WARN/ERROR), service (service name), trace_id (for correlation with traces), span_id, request_id, user_id, and the event-specific fields. Avoid string interpolation in log messages — instead, use structured fields: {"event": "order_created", "order_id": 12345, "amount": 99.99} rather than "Order 12345 created for $99.99". Structured logs are queryable in Loki, Elasticsearch, and CloudWatch Logs Insights without regex parsing.
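A sketch of such a formatter using only the stdlib logging module — the JsonFormatter class name and the hard-coded service name are illustrative assumptions, and correlation IDs are passed through logging's standard `extra` mechanism:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""
    def format(self, record):
        line = {
            "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601
            "level": record.levelname,
            "service": "checkout",  # illustrative service name
            # Correlation IDs attached via the `extra` dict, if present.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            # The message is an event name, not an interpolated string.
            "event": record.getMessage(),
        }
        # Merge event-specific structured fields.
        line.update(getattr(record, "fields", {}))
        return json.dumps(line)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Structured fields instead of string interpolation:
logger.info("order_created",
            extra={"trace_id": "abc123",
                   "fields": {"order_id": 12345, "amount": 99.99}})
```

Each line the logger emits is then directly queryable by field name, with no regex parsing.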
Cardinality in Metrics
Metrics labels (dimensions) create a combinatorial explosion in storage and query cost. High-cardinality labels — user_id, request_id, order_id — must never be used as metric labels. Each unique label value combination creates a separate time series: 1 million user_ids × 10 endpoints × 5 status codes = 50 million time series, overwhelming any metrics system. Acceptable labels: service, endpoint (limited set), status_code (5xx, 4xx, 2xx), region, instance. For per-entity queries, use logs or traces — they handle high cardinality. Metrics handle aggregate queries.
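One way to enforce this is a label whitelist checked before any series is created; the following sketch (ALLOWED_LABELS and the function names are assumptions, not part of any metrics client) also reproduces the series-count arithmetic from the paragraph above:

```python
# Illustrative whitelist of low-cardinality labels.
ALLOWED_LABELS = {"service", "endpoint", "status_code", "region", "instance"}

def validate_labels(labels):
    """Reject high-cardinality labels before they reach the metrics client."""
    bad = set(labels) - ALLOWED_LABELS
    if bad:
        raise ValueError(f"high-cardinality or unknown labels: {sorted(bad)}")

def series_count(*cardinalities):
    """Total time series = product of each label's distinct-value counts."""
    total = 1
    for c in cardinalities:
        total *= c
    return total

# 1M user_ids x 10 endpoints x 5 status classes = 50M series -- far too many.
assert series_count(1_000_000, 10, 5) == 50_000_000
```

Running the check at instrumentation time turns a slow storage blowup into an immediate, debuggable error.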
Instrumenting Business Events
Technical metrics (CPU, latency) are necessary but not sufficient for understanding system health. Business metrics make the system’s purpose visible: orders placed per minute, payment success rate, checkout abandonment rate, items added to cart per session. A 5% latency spike may be acceptable if order rate is stable; a 1% drop in order rate may indicate a critical bug even if all technical metrics look healthy. Instrument business events with the same rigor as technical metrics — they are often the earliest indicator of real problems.
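The order-rate example above can be sketched as a small drop detector; the 1% threshold mirrors the text, and the function names are illustrative assumptions:

```python
def order_rate_drop(baseline_per_min, current_per_min):
    """Fractional drop in order rate relative to a recent baseline."""
    if baseline_per_min == 0:
        return 0.0
    return max(0.0, (baseline_per_min - current_per_min) / baseline_per_min)

def business_health_alert(baseline_per_min, current_per_min, threshold=0.01):
    """Fire when orders drop more than 1%, even if technical metrics look fine."""
    return order_rate_drop(baseline_per_min, current_per_min) > threshold
```

A drop from 1000 to 985 orders/minute (1.5%) fires; a drop to 995 (0.5%) does not — a signal no CPU or latency graph would surface.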
SLO-Driven Alerting
Alert on what matters to users, not on what is easy to measure. Define SLOs (e.g., 99.9% of checkout requests complete in < 500ms over a 30-day window). Derive alerts from error budget burn rate: at a 14.4x burn rate, the entire 30-day budget would be consumed in about two days (720 hours / 14.4 = 50 hours), so alert when burn reaches that multiple. This produces two alerts: fast burn (15-minute window, high burn rate — page immediately) and slow burn (6-hour window, moderate burn rate — ticket for next business day). Avoid alerting on CPU, memory, or disk unless they directly cause SLO violations — these produce noise without actionable signal.
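The burn-rate arithmetic can be made concrete with a few lines; the function names are illustrative, and the 30-day window and 99.9% SLO follow the example above:

```python
def error_budget(slo=0.999):
    """Fraction of requests allowed to fail: 0.1% for a 99.9% SLO."""
    return 1.0 - slo

def burn_rate(observed_error_rate, slo=0.999):
    """Multiples of the sustainable rate at which budget is being spent.

    A burn rate of 1.0 consumes exactly the whole budget over the window;
    14.4 consumes it 14.4x faster.
    """
    return observed_error_rate / error_budget(slo)

def budget_exhausted_hours(rate, window_days=30):
    """At this burn rate, hours until the window's entire budget is gone."""
    return (window_days * 24) / rate
```

With a 99.9% SLO, a sustained 1.44% error rate is a 14.4x burn, exhausting the 30-day budget in roughly 50 hours — hence the fast-burn page.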