Observability is the ability to understand the internal state of a system by examining its external outputs. The three pillars of observability — metrics, logs, and traces — provide complementary views of system behavior. This guide covers the architecture of a production observability stack, from instrumentation with OpenTelemetry to storage with Prometheus and Elasticsearch, to visualization with Grafana — essential knowledge for SRE and system design interviews.
The Three Pillars of Observability
Metrics are numeric measurements over time: request rate, error rate, latency percentiles, CPU utilization, memory usage. Metrics are aggregated and stored as time series data. They answer: “what is happening?” and “is the system healthy?” Metrics are cheap to collect and query, making them ideal for dashboards and alerting.
Logs are timestamped, structured records of discrete events: a request was received, an error occurred, a user logged in. Logs provide context that metrics cannot: the specific error message, the request parameters that caused the failure, the stack trace. Logs answer: “why did this happen?” Logs are expensive to store (high volume, high cardinality) but essential for debugging.
Traces follow a single request as it traverses multiple services. A trace contains spans; each span represents one operation (an HTTP call, a database query, a function execution). Traces answer: “where is the bottleneck?” and “which service is slow?” Traces are the most expensive to collect but the most powerful for debugging distributed systems.
Metrics with Prometheus
Prometheus is the standard open-source metrics system for cloud-native applications. Architecture: (1) Instrumentation: applications expose metrics at an HTTP endpoint (/metrics) in the Prometheus exposition format. Client libraries (Go, Java, Python, Node.js) provide counters, gauges, histograms, and summaries. (2) Scraping: the Prometheus server pulls metrics from targets at a configured interval (typically 15-30 seconds). Service discovery (Kubernetes, Consul, DNS) automatically finds scrape targets. (3) Storage: Prometheus stores time series data in a local time-series database (TSDB). Each time series is identified by a metric name and a set of labels: http_requests_total{method="GET", status="200", service="api"}. (4) Querying: PromQL (Prometheus Query Language) enables powerful queries: rate(http_requests_total[5m]) computes the per-second rate over a 5-minute window. histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) computes the P99 latency. Prometheus retention: typically 15-30 days of local storage. For long-term storage, use Thanos or Cortex to write to object storage (S3).
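To make the exposition format and the idea behind rate() concrete, here is a minimal standard-library sketch. It does not use the official Prometheus client library; the helper names render_sample and simple_rate are hypothetical, and the rate calculation is deliberately simplified (no counter-reset handling, no extrapolation to the window edges, both of which the real PromQL function performs).

```python
from typing import Dict, List, Tuple


def render_sample(name: str, labels: Dict[str, str], value: float) -> str:
    """Render one sample as a line of Prometheus text exposition format.

    Simplification: label values are not escaped, so they must not contain
    quotes, backslashes, or newlines.
    """
    # Sorting is only for deterministic output in this sketch.
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"


def simple_rate(samples: List[Tuple[float, float]]) -> float:
    """Approximate per-second increase of a counter over a window.

    `samples` holds (unix_timestamp, counter_value) pairs, oldest first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


line = render_sample(
    "http_requests_total",
    {"method": "GET", "status": "200", "service": "api"},
    1027,
)

# Hypothetical counter values seen across a 5-minute (300 s) window.
window = [(0.0, 1000.0), (150.0, 1600.0), (300.0, 2200.0)]
requests_per_second = simple_rate(window)  # (2200 - 1000) / 300
```

Because the stored value is a monotonically increasing counter, the per-second rate is derived at query time from the difference between samples, which is why applications only need to expose the running total.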
Logging with the ELK Stack
The ELK stack (Elasticsearch, Logstash, Kibana) is one of the most widely deployed logging solutions. Modern variant: EFK (Elasticsearch, Fluentd/Fluent Bit, Kibana). Architecture: (1) Collection: Fluent Bit runs as a DaemonSet on each Kubernetes node, tailing container log files from /var/log/containers/. It parses structured logs (JSON), adds metadata (pod name, namespace, node), and forwards to Elasticsearch. (2) Processing: Fluentd (or Logstash) optionally sits between collectors and Elasticsearch for advanced parsing, filtering, and routing. Typical uses are dropping debug logs in production, parsing unstructured logs into fields, and routing security logs to a separate index. (3) Storage: Elasticsearch indexes log documents for full-text search. Use index lifecycle management (ILM) to automatically roll over indexes daily and delete indexes older than 30 days. (4) Querying: Kibana provides a search UI, log stream view, and dashboards. KQL (Kibana Query Language) enables filtering: kubernetes.namespace: "production" AND level: "error" AND message: "timeout". Structured logging is critical: emit logs as JSON with consistent field names (timestamp, level, service, trace_id, message). This enables efficient indexing and querying without regex parsing.
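The structured-logging advice can be sketched with Python's standard logging module. This is one possible approach, not a prescribed one: the JsonFormatter class, the build_logger helper, and the service name "checkout-api" are all illustrative assumptions. The only field names taken from the text are timestamp, level, service, trace_id, and message.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            # trace_id is passed per call via `extra=`; None when absent.
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


logger = build_logger("checkout-api")  # hypothetical service name
logger.error(
    "upstream timeout after %d ms",
    3000,
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
)
```

Since every line is already valid JSON with stable field names, a collector such as Fluent Bit can forward it without custom regex parsers, and a query like level: "error" matches a real field rather than a substring.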
Distributed Tracing with OpenTelemetry
OpenTelemetry (OTel) is the CNCF standard for telemetry collection, providing a single set of APIs, SDKs, and tools for metrics, logs, and traces. Tracing architecture: (1) Instrumentation — the OTel SDK creates spans for each operation. Auto-instrumentation libraries automatically create spans for HTTP clients/servers, database queries, gRPC calls, and message queue operations. Manual instrumentation wraps custom business logic in spans. (2) Context propagation — the trace_id and span_id are propagated across service boundaries via HTTP headers (W3C Trace Context: traceparent header). This links spans from different services into a single trace. (3) Export — the OTel SDK exports spans to the OTel Collector via OTLP (OpenTelemetry Protocol). The Collector processes spans (sampling, filtering, enrichment) and exports to a backend: Jaeger, Grafana Tempo, Datadog, or Honeycomb. (4) Visualization — the trace backend provides a waterfall view showing all spans in a trace with their durations, enabling identification of the slowest span (bottleneck). Sampling: at high traffic volumes, collecting 100% of traces is expensive. Head-based sampling (decide at trace start) or tail-based sampling (decide after trace completes, keeping errors and slow traces) reduces volume while retaining actionable data.
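Context propagation is easier to picture with the header itself. The sketch below builds and parses a W3C traceparent value by hand using only the standard library. In practice the OTel SDK's propagators do this for you; the TraceContext type and the helper function names are hypothetical, and only version 00 of the header is handled.

```python
import re
import secrets
from typing import NamedTuple


class TraceContext(NamedTuple):
    trace_id: str  # 32 lowercase hex chars, shared by every span in a trace
    span_id: str   # 16 lowercase hex chars, identifies the calling span
    sampled: bool  # lowest bit of the trace-flags byte


# version "00" - trace-id - parent-id - trace-flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_root_context(sampled: bool = True) -> TraceContext:
    """Start a brand-new trace with random identifiers."""
    return TraceContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def to_header(ctx: TraceContext) -> str:
    """Serialize a context as a traceparent header value."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def from_header(value: str) -> TraceContext:
    """Parse an incoming traceparent header value, rejecting invalid input."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        raise ValueError(f"invalid traceparent: {value!r}")
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        raise ValueError("all-zero trace_id or span_id is not allowed")
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_context(parent: TraceContext) -> TraceContext:
    """New span in the same trace: keep trace_id, mint a fresh span_id."""
    return TraceContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


# A service receives a request, then calls a downstream service.
incoming = from_header("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = to_header(child_context(incoming))
```

The trace_id is what survives every hop, while each service substitutes its own span_id before calling downstream. That is what lets the backend stitch spans from different services into one waterfall.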
Alerting Architecture
Alerting converts observability data into actionable notifications. Architecture: (1) Alert rules: defined in Prometheus as PromQL expressions with thresholds and durations. Example: alert ErrorRateTooHigh if sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01 for 5m. Both sides are wrapped in sum() so that the total 5xx rate is divided by the total request rate; without the aggregation, PromQL would divide each 5xx series by itself and the ratio would always be 1. The “for 5m” clause prevents alerting on brief spikes. (2) Alertmanager: receives firing alerts from Prometheus and handles deduplication (multiple Prometheus instances may fire the same alert), grouping (group related alerts into a single notification), silencing (suppress alerts during maintenance windows), and routing (send critical alerts to PagerDuty, warnings to Slack). (3) Notification channels: PagerDuty (paging on-call engineers for critical alerts), Slack (team channels for warnings and informational alerts), email (for non-urgent notifications). Alerting best practices: alert on symptoms (user-visible impact like error rate, latency), not causes (CPU usage and memory are dashboard metrics, not alert triggers). Use SLO-based alerting: alert when the error budget burn rate threatens the monthly SLO, not on arbitrary thresholds.
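The burn-rate idea can be shown with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 99.9% target, the 14.4 threshold, and the pairing of a long and a short window are assumed example values for a multi-window policy, not settings taken from this guide. In a real deployment the same logic would be expressed as PromQL alert rules.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent.

    A burn rate of 1.0 spends exactly the full budget over the SLO period.
    """
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / error_budget


def should_page(
    long_window_ratio: float,
    short_window_ratio: float,
    slo_target: float,
    threshold: float,
) -> bool:
    """Fire only when both windows exceed the burn-rate threshold.

    The long window shows the burn is sustained; the short window shows it
    is still happening, so the alert clears soon after recovery.
    """
    return (
        burn_rate(long_window_ratio, slo_target) > threshold
        and burn_rate(short_window_ratio, slo_target) > threshold
    )


# Assumed example: 99.9% SLO, 2% errors over the last hour, 3% over 5 minutes.
page_now = should_page(0.02, 0.03, slo_target=0.999, threshold=14.4)

# Same hour, but the last 5 minutes have recovered to 0.1% errors.
page_after_recovery = should_page(0.02, 0.001, slo_target=0.999, threshold=14.4)
```

With a 30-day period (720 hours), a sustained burn rate of 14.4 consumes 14.4 / 720 = 2% of the budget in one hour, which is why a threshold of that size is a plausible choice for paging rather than a raw error-rate number.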
Grafana: Unified Observability Dashboard
Grafana is the visualization layer that unifies metrics, logs, and traces into a single UI. Data sources: Prometheus (metrics), Elasticsearch/Loki (logs), Tempo/Jaeger (traces), and 100+ other integrations. Key features: (1) Dashboard-as-code: dashboards are defined in JSON and stored in version control. Grafana provisioning automatically loads dashboards from a ConfigMap or file system on startup. (2) Explore mode: ad-hoc querying across data sources. Start with a metric anomaly, drill down to logs filtered by the anomalous time window, then jump to a trace to identify the root cause. (3) Correlations: link metrics to logs to traces. Click on a metric spike, see logs from that time window. Click on a log entry with a trace_id, jump to the full trace in Tempo. This metrics-to-logs-to-traces workflow is the golden path for incident investigation. (4) Grafana Loki: a log aggregation system designed for cost efficiency. Unlike Elasticsearch (which indexes log content), Loki indexes log streams only by labels (service, namespace, pod) and filters log content at query time with grep-like LogQL queries. This design can make Loki roughly 10-100x cheaper than Elasticsearch for log storage, depending on workload, at the cost of slower full-text search.
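The trade-off between label-only indexing and full-text indexing can be illustrated with a toy in-memory model. This is not how Loki is implemented (Loki stores compressed chunks in object storage and has its own query engine); the class LabelIndexedLogStore and its methods are hypothetical and exist only to show where the cheap step and the expensive step sit.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Labels = FrozenSet[Tuple[str, str]]


class LabelIndexedLogStore:
    """Toy model: index streams by label set, scan line content at query time."""

    def __init__(self) -> None:
        self._streams: Dict[Labels, List[str]] = defaultdict(list)

    def append(self, labels: Dict[str, str], line: str) -> None:
        # Only the label set acts as an index key; the line is stored as-is,
        # so ingestion does no per-word indexing work.
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: Dict[str, str], contains: str = "") -> List[str]:
        wanted = set(selector.items())
        matches: List[str] = []
        for labels, lines in self._streams.items():
            if wanted <= labels:  # cheap step: labels narrow the streams
                # expensive step: brute-force substring scan, like grep
                matches.extend(line for line in lines if contains in line)
        return matches


store = LabelIndexedLogStore()
store.append({"service": "api", "namespace": "production"}, "GET /cart 200")
store.append({"service": "api", "namespace": "production"}, "upstream timeout")
store.append({"service": "api", "namespace": "staging"}, "upstream timeout")

hits = store.query({"namespace": "production"}, contains="timeout")
```

The model makes the cost profile visible: writes are cheap because nothing inside the line is indexed, while a text search must scan every line in the selected streams. Narrow label selectors and short time ranges therefore matter much more than they do with a full-text index.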
Designing an Observability Strategy
Practical strategy for a microservices architecture: (1) Instrument everything with OpenTelemetry. Use auto-instrumentation for HTTP, database, and messaging. Add manual instrumentation for critical business operations. (2) Define RED metrics for every service: Rate (requests per second), Errors (error rate), Duration (latency distribution). These are the primary dashboard and alerting metrics. (3) Define USE metrics for infrastructure: Utilization (CPU, memory, disk), Saturation (queue depth, thread pool usage), Errors (hardware errors, OOM kills). (4) Emit structured JSON logs with mandatory fields: timestamp, level, service, trace_id, and message. The trace_id enables log-to-trace correlation. (5) Sample traces at 100% for errors and slow requests (tail-based sampling), 1-10% for healthy requests. (6) Set up SLO-based alerting on the RED metrics. Alert on error budget burn rate, not raw thresholds. (7) Build a standard Grafana dashboard template that every service deploys: RED metrics, resource usage, deployment markers, and links to logs and traces.
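The sampling policy in step (5) can be written down as a small decision function. This is a sketch under stated assumptions: CompletedTrace and keep_trace are hypothetical names, and the 1000 ms slow threshold and 5% healthy sample rate are example values within the ranges mentioned above. In a real deployment this decision would normally be configured in the OTel Collector rather than written by hand.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class CompletedTrace:
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(
    trace: CompletedTrace,
    slow_threshold_ms: float = 1000.0,
    healthy_sample_rate: float = 0.05,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Tail-based decision, made after the whole trace has been observed."""
    if trace.has_error:
        return True  # keep 100% of traces containing an error
    if trace.duration_ms >= slow_threshold_ms:
        return True  # keep 100% of slow requests
    return rng() < healthy_sample_rate  # keep a small share of healthy ones


failed = CompletedTrace("a1", duration_ms=12.0, has_error=True)
slow = CompletedTrace("b2", duration_ms=4200.0, has_error=False)
healthy = CompletedTrace("c3", duration_ms=35.0, has_error=False)

kept_failed = keep_trace(failed)
kept_slow = keep_trace(slow)
# Injecting a fixed random source makes the healthy-path decision testable.
kept_healthy = keep_trace(healthy, rng=lambda: 0.5)
```

The decision needs the finished trace, which is why it has to run in a component that sees all spans of a trace together. A head-based sampler, by contrast, decides at the first span and cannot know whether the request will later fail or turn out slow.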