System Design: Observability Platform — Metrics, Logs, Traces, and Alerting at Scale (2025)

The Three Pillars of Observability

Metrics: numerical measurements over time (request rate, error rate, latency percentiles, CPU usage). Low cardinality, cheap to store, good for dashboards and alerting. Best for “what is broken.”

Logs: structured or unstructured text records of discrete events. High cardinality (each event is unique), expensive to store at scale, but rich in context. Best for “why is it broken.”

Traces: a record of a request’s path through distributed services, connecting the timing and causality of spans across multiple services. Best for “where is the bottleneck.”

Modern observability correlates all three: a dashboard alert (metrics) links to the relevant logs, which link to a trace of the slow or failed request. Tools: Prometheus + Grafana (metrics), ELK/OpenSearch or Loki (logs), Jaeger or Zipkin (traces), and OpenTelemetry as the instrumentation standard for all three.

Metrics Pipeline: Prometheus Architecture

Prometheus uses a pull model: it scrapes metrics from an HTTP /metrics endpoint on each service at regular intervals (15-30s). Services expose metrics using a client library (prometheus-client for Python, Micrometer for Java).

Metric types. Counter: monotonically increasing (requests_total, errors_total). Gauge: an arbitrary value that can go up or down (active_connections, memory_used_bytes). Histogram: samples observations into configurable buckets (request_duration_seconds); used to compute percentiles (p50, p95, p99) via histogram_quantile(). Summary: like a histogram but computes quantiles on the client side; client-side quantiles cannot be aggregated across instances, so histograms are usually preferred.

Storage: Prometheus stores time-series data in its own TSDB (time-series database) with high compression. For long-term retention (> 2 weeks), remote-write to Thanos, Cortex, or VictoriaMetrics for horizontal scaling and multi-cluster aggregation.

from prometheus_client import Counter, Histogram, start_http_server
import time

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']
)
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

def handle_request(method, endpoint):
    start = time.time()
    try:
        response = process(method, endpoint)  # application handler, defined elsewhere
        REQUEST_COUNT.labels(method, endpoint, response.status_code).inc()
        return response
    except Exception:
        # Count failures too -- otherwise errors vanish from the request-rate metric.
        REQUEST_COUNT.labels(method, endpoint, 500).inc()
        raise
    finally:
        # Record latency for successes and failures alike.
        REQUEST_LATENCY.labels(method, endpoint).observe(time.time() - start)

start_http_server(8000)  # expose /metrics for Prometheus to scrape
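The percentile math behind histogram_quantile() can be sketched in plain Python: find the first cumulative bucket that contains the target rank, then interpolate linearly inside it. This is an illustration of the idea, not PromQL's exact implementation, and the bucket counts below are made up.

```python
def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted ascending,
             ending with (float('inf'), total_count).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                return prev_bound  # open-ended bucket: return its lower bound
            # Linear interpolation within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

# Example: 1000 requests observed into the buckets below.
buckets = [(0.1, 600), (0.5, 900), (1.0, 990), (float('inf'), 1000)]
p99 = histogram_quantile(0.99, buckets)  # rank 990 falls at the 1.0s bound
```

Because the answer is interpolated from bucket boundaries, accuracy depends entirely on how well the configured buckets bracket the latencies you care about.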

Log Pipeline: Structured Logging and ELK

Log pipeline: application emits structured JSON logs → Fluentd/Filebeat agent (running as a DaemonSet on each K8s node) ships logs → Kafka (buffering, backpressure) → Logstash/OpenSearch ingestion pipeline → Elasticsearch/OpenSearch index.

Structured logging: always log JSON with consistent fields — timestamp (ISO 8601 UTC), service, level, trace_id, span_id, request_id, user_id, message, and any domain fields. Avoid unstructured text: it cannot be queried efficiently.

Index lifecycle management (ILM): hot indices (last 7 days) on SSD with full indexing; warm indices (7-30 days) on HDD with fewer replicas; cold/frozen indices (30-90 days) on object storage; delete after 90 days (or per retention policy).

Cardinality: never index high-cardinality fields (user IDs, raw URLs) as keyword — this explodes the inverted index. Store them as text or exclude them from indexing.
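The structured-logging conventions above can be sketched with only the standard library. The service name and field set here are illustrative, not a fixed schema:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line with the
    consistent field set described above."""
    def format(self, record):
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                       time.gmtime(record.created)),
            "service": "checkout",  # illustrative service name
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge correlation/domain fields passed via logging's `extra=`.
        for key in ("trace_id", "span_id", "request_id", "user_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment authorized",
            extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
                   "user_id": 12345})
```

Because trace_id and span_id ride along on every line, the log store can join log lines to the distributed trace for the same request.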

Distributed Tracing: OpenTelemetry and Jaeger

OpenTelemetry (OTel) is the vendor-neutral standard for distributed tracing. Every request gets a trace_id (a random 128-bit ID). Each service operation creates a span with: span_id, parent_span_id (forming the tree), operation_name, start_time, duration, status (OK/ERROR), and attributes (key-value metadata). The trace context (trace_id plus the current span_id, which becomes the child’s parent_span_id) is propagated via HTTP headers (W3C Trace Context: traceparent, tracestate) and message-queue headers.

Sampling: tracing 100% of requests at high volume is prohibitively expensive. Strategies: head-based sampling (decide at request entry: sample 1% of requests); tail-based sampling (record all spans, decide at trace completion whether to persist — e.g., persist 100% of error traces plus 1% of successful traces). Jaeger and Zipkin store traces; Grafana Tempo is a cost-efficient trace backend built on object storage.
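The W3C traceparent header format (version-traceid-spanid-flags) is simple enough to sketch without a tracing library; the helper names below are hypothetical, and real OTel SDKs handle this for you:

```python
import re
import secrets

def make_traceparent(trace_id=None, sampled=True):
    """Build a W3C traceparent header: version (00), 128-bit trace_id
    as 32 hex chars, 64-bit span_id as 16 hex chars, and flags
    (01 = sampled)."""
    trace_id = trace_id or secrets.token_hex(16)   # 32 hex chars
    span_id = secrets.token_hex(8)                 # 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

TRACEPARENT_RE = re.compile(
    r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header):
    """Extract (trace_id, parent_span_id, sampled) from an incoming
    request; the sender's span_id becomes this service's parent_span_id."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None  # malformed: start a new trace instead
    trace_id, parent_span_id, flags = m.groups()
    return trace_id, parent_span_id, int(flags, 16) & 1 == 1

header = make_traceparent()
trace_id, parent_span_id, sampled = parse_traceparent(header)
```

Each hop reuses the trace_id, generates a fresh span_id, and forwards both — that is the whole propagation mechanism.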

Alerting and On-Call

Alert design principles: alert on symptoms (high error rate, high latency, low availability) rather than causes (high CPU, full disk) — symptoms are what users experience.

Four golden signals (Google SRE): Latency (p99 request duration), Traffic (requests per second), Errors (error rate), Saturation (resource utilization).

Alert thresholds: use multi-window, multi-burn-rate alerts (SLO-based alerting). Example: 5% of the error budget burned in 1 hour → page immediately; 10% burned in 6 hours → open a ticket. This balances sensitivity (catch real incidents early) with specificity (avoid alert fatigue from transient spikes).

Alert routing: PagerDuty/OpsGenie route alerts to on-call engineers based on service ownership. Runbooks: every alert links to a runbook with diagnosis steps, common causes, and resolution procedures. A well-maintained runbook significantly reduces mean time to recovery (MTTR).
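The burn-rate arithmetic above can be sketched as a function: burn rate is the observed error rate divided by the SLO error budget, and the page/ticket thresholds fall out of the 5%-in-1-hour and 10%-in-6-hours rules. The SLO and error rates below are illustrative.

```python
def burn_rate(error_rate, slo):
    """How fast the error budget is being consumed.
    burn_rate == 1 means the budget is spent exactly over the SLO window."""
    budget = 1.0 - slo            # e.g. 99.9% SLO -> 0.1% error budget
    return error_rate / budget

def alert_severity(fast_error_rate, slow_error_rate,
                   slo=0.999, window_days=30):
    """Multi-window, multi-burn-rate check following the rules above:
    page if 5% of the budget would burn in 1 hour, ticket if 10%
    would burn in 6 hours."""
    hours = window_days * 24
    page_threshold = 0.05 * hours / 1     # 36.0 for a 30-day window
    ticket_threshold = 0.10 * hours / 6   # 12.0 for a 30-day window
    if burn_rate(fast_error_rate, slo) >= page_threshold:
        return "page"
    if burn_rate(slow_error_rate, slo) >= ticket_threshold:
        return "ticket"
    return "ok"

# 4% errors over the last hour against a 99.9% SLO: burn rate 40 -> page.
severity = alert_severity(fast_error_rate=0.04, slow_error_rate=0.01)
```

The two windows are what give the rule its balance: the fast window catches severe incidents within minutes, while the slow window catches slow leaks without paging anyone for a transient spike.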


