Observability is the ability to understand the internal state of a system from its external outputs. A system is observable when you can diagnose any issue — including issues you have never seen before — purely from metrics, logs, and traces that the system emits. Observability is not just about adding dashboards; it requires designing telemetry into the system from the start, at every layer.
The Three Pillars
Metrics: numerical measurements over time (request rate, error rate, latency percentiles, queue depth). Metrics are aggregated — they tell you what is happening at the service level. Best for alerting and dashboards.

Logs: discrete events with context (a specific error, a request completion with its details). Logs are per-event — they tell you what happened in a specific transaction. Best for debugging specific incidents.

Traces: the journey of a request through the system, with timing for each step. Traces show the call graph — which service caused the latency spike. Best for root-cause analysis in distributed systems.

Observability requires all three; none alone is sufficient.
The RED Method
The RED method defines the minimum metrics every service must emit: Rate (requests per second — how busy is the service?), Errors (error rate as a percentage — is the service healthy?), Duration (latency percentiles — how fast is the service?). These three metrics are sufficient to detect most service-level problems and form the basis of SLO measurement. Instrument every service endpoint with RED metrics at the framework level (middleware, interceptors) rather than manually in each handler — this ensures no endpoint is missed.
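As a minimal sketch of RED instrumentation at the middleware level, the following uses only the Python standard library; the class and function names (REDMetrics, red_middleware) are illustrative, not from any specific framework or metrics client:

```python
import time
from collections import defaultdict

class REDMetrics:
    """Tracks Rate, Errors, and Duration per endpoint."""
    def __init__(self):
        self.requests = defaultdict(int)    # Rate: total request count
        self.errors = defaultdict(int)      # Errors: 5xx count
        self.durations = defaultdict(list)  # Duration: latencies in seconds

    def observe(self, endpoint, status_code, duration_s):
        self.requests[endpoint] += 1
        if status_code >= 500:
            self.errors[endpoint] += 1
        self.durations[endpoint].append(duration_s)

    def error_rate(self, endpoint):
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0

    def p99(self, endpoint):
        """Approximate p99 by index into the sorted latency list."""
        xs = sorted(self.durations[endpoint])
        return xs[min(len(xs) - 1, int(0.99 * len(xs)))] if xs else 0.0

def red_middleware(metrics, endpoint, handler):
    """Wrap a handler so every call is observed -- no endpoint missed."""
    def wrapped(*args, **kwargs):
        start = time.monotonic()
        status = 500  # if the handler raises, record it as a server error
        try:
            status, body = handler(*args, **kwargs)
            return status, body
        finally:
            metrics.observe(endpoint, status, time.monotonic() - start)
    return wrapped
```

Because the wrapper observes in a finally block, even handlers that raise are counted, which is the point of instrumenting at the framework layer rather than per handler.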
Structured Logging
Log in a machine-parseable format (JSON) rather than free-form strings. Every log line should include: timestamp (ISO 8601), level (INFO/WARN/ERROR), service (service name), trace_id (for correlation with traces), span_id, request_id, user_id, and the event-specific fields. Avoid string interpolation in log messages — instead, use structured fields: {"event": "order_created", "order_id": 12345, "amount": 99.99} rather than "Order 12345 created for $99.99". Structured logs are queryable in Loki, Elasticsearch, and CloudWatch Logs Insights without regex parsing.
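A sketch of such a formatter using only the stdlib logging module — the JsonFormatter class name and the hard-coded service name are illustrative assumptions, and correlation IDs are passed through logging's standard `extra` mechanism:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""
    def format(self, record):
        line = {
            "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601
            "level": record.levelname,
            "service": "checkout",  # illustrative service name
            # Correlation IDs attached via the `extra` dict, if present.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            # The message is an event name, not an interpolated string.
            "event": record.getMessage(),
        }
        # Merge event-specific structured fields.
        line.update(getattr(record, "fields", {}))
        return json.dumps(line)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Structured fields instead of string interpolation:
logger.info("order_created",
            extra={"trace_id": "abc123",
                   "fields": {"order_id": 12345, "amount": 99.99}})
```

Each line the logger emits is then directly queryable by field name, with no regex parsing.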
Cardinality in Metrics
Metrics labels (dimensions) create a combinatorial explosion in storage and query cost. High-cardinality labels — user_id, request_id, order_id — must never be used as metric labels. Each unique label value combination creates a separate time series: 1 million user_ids × 10 endpoints × 5 status codes = 50 million time series, overwhelming any metrics system. Acceptable labels: service, endpoint (limited set), status_code (5xx, 4xx, 2xx), region, instance. For per-entity queries, use logs or traces — they handle high cardinality. Metrics handle aggregate queries.
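One way to enforce this is a label whitelist checked before any series is created; the following sketch (ALLOWED_LABELS and the function names are assumptions, not part of any metrics client) also reproduces the series-count arithmetic from the paragraph above:

```python
# Illustrative whitelist of low-cardinality labels.
ALLOWED_LABELS = {"service", "endpoint", "status_code", "region", "instance"}

def validate_labels(labels):
    """Reject high-cardinality labels before they reach the metrics client."""
    bad = set(labels) - ALLOWED_LABELS
    if bad:
        raise ValueError(f"high-cardinality or unknown labels: {sorted(bad)}")

def series_count(*cardinalities):
    """Total time series = product of each label's distinct-value counts."""
    total = 1
    for c in cardinalities:
        total *= c
    return total

# 1M user_ids x 10 endpoints x 5 status classes = 50M series -- far too many.
assert series_count(1_000_000, 10, 5) == 50_000_000
```

Running the check at instrumentation time turns a slow storage blowup into an immediate, debuggable error.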
Instrumenting Business Events
Technical metrics (CPU, latency) are necessary but not sufficient for understanding system health. Business metrics make the system’s purpose visible: orders placed per minute, payment success rate, checkout abandonment rate, items added to cart per session. A 5% latency spike may be acceptable if order rate is stable; a 1% drop in order rate may indicate a critical bug even if all technical metrics look healthy. Instrument business events with the same rigor as technical metrics — they are often the earliest indicator of real problems.
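The order-rate example above can be sketched as a small drop detector; the 1% threshold mirrors the text, and the function names are illustrative assumptions:

```python
def order_rate_drop(baseline_per_min, current_per_min):
    """Fractional drop in order rate relative to a recent baseline."""
    if baseline_per_min == 0:
        return 0.0
    return max(0.0, (baseline_per_min - current_per_min) / baseline_per_min)

def business_health_alert(baseline_per_min, current_per_min, threshold=0.01):
    """Fire when orders drop more than 1%, even if technical metrics look fine."""
    return order_rate_drop(baseline_per_min, current_per_min) > threshold
```

A drop from 1000 to 985 orders/minute (1.5%) fires; a drop to 995 (0.5%) does not — a signal no CPU or latency graph would surface.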
SLO-Driven Alerting
Alert on what matters to users, not on what is easy to measure. Define SLOs (e.g., 99.9% of checkout requests complete in < 500ms over a 30-day window). Derive alerts from error budget burn rate: at a 14.4x burn rate, the entire 30-day budget would be consumed in about two days (720 hours / 14.4 = 50 hours), so alert when burn reaches that multiple. This produces two alerts: fast burn (15-minute window, high burn rate — page immediately) and slow burn (6-hour window, moderate burn rate — ticket for next business day). Avoid alerting on CPU, memory, or disk unless they directly cause SLO violations — these produce noise without actionable signal.
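The burn-rate arithmetic can be made concrete with a few lines; the function names are illustrative, and the 30-day window and 99.9% SLO follow the example above:

```python
def error_budget(slo=0.999):
    """Fraction of requests allowed to fail: 0.1% for a 99.9% SLO."""
    return 1.0 - slo

def burn_rate(observed_error_rate, slo=0.999):
    """Multiples of the sustainable rate at which budget is being spent.

    A burn rate of 1.0 consumes exactly the whole budget over the window;
    14.4 consumes it 14.4x faster.
    """
    return observed_error_rate / error_budget(slo)

def budget_exhausted_hours(rate, window_days=30):
    """At this burn rate, hours until the window's entire budget is gone."""
    return (window_days * 24) / rate
```

With a 99.9% SLO, a sustained 1.44% error rate is a 14.4x burn, exhausting the 30-day budget in roughly 50 hours — hence the fast-burn page.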