Observability is the ability to understand the internal state of a system by examining its external outputs. The three pillars of observability — metrics, logs, and traces — provide complementary views of system behavior. This guide covers the architecture of a production observability stack, from instrumentation with OpenTelemetry to storage with Prometheus and Elasticsearch, to visualization with Grafana — essential knowledge for SRE and system design interviews.
The Three Pillars of Observability
Metrics are numeric measurements over time: request rate, error rate, latency percentiles, CPU utilization, memory usage. Metrics are aggregated and stored as time-series data. They answer: "what is happening?" and "is the system healthy?" Metrics are cheap to collect and query, making them ideal for dashboards and alerting.

Logs are timestamped, structured records of discrete events: a request was received, an error occurred, a user logged in. Logs provide context that metrics cannot: the specific error message, the request parameters that caused the failure, the stack trace. Logs answer: "why did this happen?" Logs are expensive to store (high volume, high cardinality) but essential for debugging.

Traces follow a single request as it traverses multiple services. A trace contains spans — each span represents one operation (an HTTP call, a database query, a function execution). Traces answer: "where is the bottleneck?" and "which service is slow?" Traces are the most expensive to collect but the most powerful for debugging distributed systems.
Metrics with Prometheus
Prometheus is the standard open-source metrics system for cloud-native applications. Architecture: (1) Instrumentation — applications expose metrics at an HTTP endpoint (/metrics) in the Prometheus exposition format. Client libraries (Go, Java, Python, Node.js) provide counters, gauges, histograms, and summaries. (2) Scraping — the Prometheus server pulls metrics from targets at a configured interval (typically 15-30 seconds). Service discovery (Kubernetes, Consul, DNS) automatically finds scrape targets. (3) Storage — Prometheus stores time-series data in a local time-series database (TSDB). Each time series is identified by a metric name and a set of labels: http_requests_total{method="GET", status="200", service="api"}. (4) Querying — PromQL (Prometheus Query Language) enables powerful queries: rate(http_requests_total[5m]) computes the per-second rate over a 5-minute window, and histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) computes the P99 latency. Retention: Prometheus typically keeps 15-30 days of data locally; for long-term storage, use Thanos or Cortex to write to object storage (S3).
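To make the exposition format in step (1) concrete, here is a hand-rolled, stdlib-only sketch of what a /metrics response looks like for a labeled counter. In production you would use an official client library (e.g. prometheus_client for Python) rather than rendering the format yourself; the metric and label names here are illustrative.

```python
# Hand-rolled sketch of the Prometheus text exposition format (stdlib only).
# A real service would use an official client library instead.
from collections import Counter as MultiSet

counters = MultiSet()  # (method, status) -> count

def observe_request(method: str, status: str) -> None:
    """Increment the counter for one handled request."""
    counters[(method, status)] += 1

def render_metrics() -> str:
    """Render counters in the Prometheus text exposition format,
    as served from a /metrics endpoint."""
    lines = [
        "# HELP http_requests_total Total HTTP requests",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counters.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"

observe_request("GET", "200")
observe_request("GET", "200")
observe_request("POST", "500")
print(render_metrics())
```

Prometheus scrapes this text output, parses each line into a time series keyed by the metric name and label set, and appends the sample to its TSDB.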
Logging with the ELK Stack
The ELK stack (Elasticsearch, Logstash, Kibana) is the most widely deployed logging solution; the modern variant is EFK (Elasticsearch, Fluentd/Fluent Bit, Kibana). Architecture: (1) Collection — Fluent Bit runs as a DaemonSet on each Kubernetes node, tailing container log files from /var/log/containers/. It parses structured logs (JSON), adds metadata (pod name, namespace, node), and forwards to Elasticsearch. (2) Processing — Fluentd (or Logstash) optionally sits between collectors and Elasticsearch for advanced parsing, filtering, and routing: drop debug logs in production, parse unstructured logs into fields, route security logs to a separate index. (3) Storage — Elasticsearch indexes log documents for full-text search. Use index lifecycle management (ILM) to automatically roll over indexes daily and delete indexes older than 30 days. (4) Querying — Kibana provides a search UI, log stream view, and dashboards. KQL (Kibana Query Language) enables filtering: kubernetes.namespace: "production" AND level: "error" AND message: "timeout". Structured logging is critical: emit logs as JSON with consistent field names (timestamp, level, service, trace_id, message). This enables efficient indexing and querying without regex parsing.
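A structured-logging formatter that emits the consistent field names above can be sketched with the standard library alone. This is a minimal illustration, not a production formatter: the field names follow the convention in the text, and in a real service the trace_id would be injected from the active span's context rather than passed manually.

```python
# Structured JSON logging sketch (stdlib only). Every record carries the
# consistent field names the collectors index: timestamp, level, service,
# trace_id, message. Names are the convention described in the text.
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname.lower(),
            "service": self.service,
            # In real code the trace_id comes from the active span context
            # (MDC in Java, contextvars in Python); here it is passed via
            # the `extra` dict for illustration.
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        })

logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="api"))
logger.addHandler(handler)
logger.error("upstream timeout",
             extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```

Because each line is a single JSON object, Fluent Bit can parse it without regexes and Elasticsearch (or Loki) can index the fields directly.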
Distributed Tracing with OpenTelemetry
OpenTelemetry (OTel) is the CNCF standard for telemetry collection, providing a single set of APIs, SDKs, and tools for metrics, logs, and traces. Tracing architecture: (1) Instrumentation — the OTel SDK creates spans for each operation. Auto-instrumentation libraries automatically create spans for HTTP clients/servers, database queries, gRPC calls, and message queue operations. Manual instrumentation wraps custom business logic in spans. (2) Context propagation — the trace_id and span_id are propagated across service boundaries via HTTP headers (W3C Trace Context: traceparent header). This links spans from different services into a single trace. (3) Export — the OTel SDK exports spans to the OTel Collector via OTLP (OpenTelemetry Protocol). The Collector processes spans (sampling, filtering, enrichment) and exports to a backend: Jaeger, Grafana Tempo, Datadog, or Honeycomb. (4) Visualization — the trace backend provides a waterfall view showing all spans in a trace with their durations, enabling identification of the slowest span (bottleneck). Sampling: at high traffic volumes, collecting 100% of traces is expensive. Head-based sampling (decide at trace start) or tail-based sampling (decide after trace completes, keeping errors and slow traces) reduces volume while retaining actionable data.
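The context-propagation step (2) can be illustrated in plain Python: a version-00 W3C traceparent header is `version-traceid-spanid-flags`, with a 32-hex-character trace_id and 16-hex-character span_id. In practice the OTel SDK's propagators build and parse this header automatically; this stdlib-only sketch just shows what is on the wire.

```python
# Sketch of W3C Trace Context propagation (the traceparent header).
# Real services let the OpenTelemetry SDK's propagators do this.
import re
import secrets
from typing import Optional

def make_traceparent(trace_id: Optional[str] = None,
                     span_id: Optional[str] = None,
                     sampled: bool = True) -> str:
    """Build a version-00 traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 8 bytes  -> 16 hex chars
    flags = "01" if sampled else "00"             # bit 0 = sampled
    return f"00-{trace_id}-{span_id}-{flags}"

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str):
    """Extract (trace_id, parent_span_id, sampled) from an incoming header,
    or None if the header is malformed."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    return trace_id, span_id, bool(int(flags, 16) & 0x01)

# A downstream service reuses the caller's trace_id and starts a child span,
# which is what links spans from different services into one trace:
header = make_traceparent()
trace_id, parent_span_id, sampled = parse_traceparent(header)
child_header = make_traceparent(trace_id=trace_id)
```

Every service that keeps the same trace_id while minting a fresh span_id contributes its spans to the same trace in the backend.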
Alerting Architecture
Alerting converts observability data into actionable notifications. Architecture: (1) Alert rules — defined in Prometheus as PromQL expressions with thresholds and durations. Example: alert ErrorRateTooHigh if rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01 for 5m. The "for 5m" clause prevents alerting on brief spikes. (2) Alertmanager — receives firing alerts from Prometheus and handles: deduplication (multiple Prometheus instances may fire the same alert), grouping (group related alerts into a single notification), silencing (suppress alerts during maintenance windows), routing (send critical alerts to PagerDuty, warnings to Slack). (3) Notification channels — PagerDuty (paging on-call engineers for critical alerts), Slack (team channels for warnings and informational alerts), email (for non-urgent notifications). Alerting best practices: alert on symptoms (user-visible impact like error rate, latency), not causes (CPU usage, memory — these are dashboard metrics, not alert triggers). Use SLO-based alerting: alert when the error budget burn rate threatens the monthly SLO, not on arbitrary thresholds.
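The ErrorRateTooHigh example above would live in a Prometheus rule file. The sketch below shows the standard rule-file shape; the group name, severity label, and annotation wording are illustrative choices, not fixed conventions.

```yaml
# prometheus-rules.yaml — sketch of the alert rule described above.
groups:
  - name: slo-alerts
    rules:
      - alert: ErrorRateTooHigh
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.01
        for: 5m                  # must hold for 5 minutes before firing
        labels:
          severity: critical     # Alertmanager routes on labels like this
        annotations:
          summary: "Error rate above 1% for service {{ $labels.service }}"
```

Alertmanager then matches on the severity label to route this alert to PagerDuty while warnings go to Slack.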
Grafana: Unified Observability Dashboard
Grafana is the visualization layer that unifies metrics, logs, and traces into a single UI. Data sources: Prometheus (metrics), Elasticsearch/Loki (logs), Tempo/Jaeger (traces), and 100+ other integrations. Key features: (1) Dashboard-as-code — dashboards are defined in JSON and stored in version control. Grafana provisioning automatically loads dashboards from a ConfigMap or file system on startup. (2) Explore mode — ad-hoc querying across data sources. Start with a metric anomaly, drill down to logs filtered by the anomalous time window, then jump to a trace to identify the root cause. (3) Correlations — link metrics to logs to traces. Click on a metric spike, see logs from that time window. Click on a log entry with a trace_id, jump to the full trace in Tempo. This metrics-to-logs-to-traces workflow is the golden path for incident investigation. (4) Grafana Loki — a log aggregation system designed for cost efficiency. Unlike Elasticsearch (which indexes log content), Loki stores log streams indexed only by labels (service, namespace, pod) and uses grep-like queries. It is 10-100x cheaper than Elasticsearch for log storage but slower for full-text search.
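The dashboard-as-code feature in (1) relies on a provisioning file that tells Grafana where to find dashboard JSON on disk. A sketch of such a provider definition, with illustrative names and paths (only the apiVersion/providers structure is Grafana's):

```yaml
# Grafana dashboard provisioning sketch, conventionally placed under
# /etc/grafana/provisioning/dashboards/. Names and paths are illustrative.
apiVersion: 1
providers:
  - name: service-dashboards
    folder: Services           # Grafana folder the dashboards appear in
    type: file
    allowUiUpdates: false      # dashboards are owned by version control
    options:
      path: /var/lib/grafana/dashboards   # mounted from a ConfigMap or repo
```

With allowUiUpdates disabled, edits made in the UI cannot drift from the JSON in version control, which keeps the dashboards reviewable like any other code.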
Designing an Observability Strategy
Practical strategy for a microservices architecture: (1) Instrument everything with OpenTelemetry. Use auto-instrumentation for HTTP, database, and messaging. Add manual instrumentation for critical business operations. (2) Define RED metrics for every service: Rate (requests per second), Errors (error rate), Duration (latency distribution). These are the primary dashboard and alerting metrics. (3) Define USE metrics for infrastructure: Utilization (CPU, memory, disk), Saturation (queue depth, thread pool usage), Errors (hardware errors, OOM kills). (4) Emit structured JSON logs with mandatory fields: timestamp, level, service, trace_id, and message. The trace_id enables log-to-trace correlation. (5) Sample traces at 100% for errors and slow requests (tail-based sampling), 1-10% for healthy requests. (6) Set up SLO-based alerting on the RED metrics. Alert on error budget burn rate, not raw thresholds. (7) Build a standard Grafana dashboard template that every service deploys: RED metrics, resource usage, deployment markers, and links to logs and traces.
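The sampling policy in step (5) maps onto the tail_sampling processor from the OpenTelemetry Collector contrib distribution. The sketch below keeps all error traces and slow traces while probabilistically sampling the healthy baseline; the wait time, latency threshold, and percentage are illustrative tuning choices.

```yaml
# OTel Collector tail-based sampling sketch (tail_sampling processor,
# opentelemetry-collector-contrib). Thresholds are illustrative.
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans until the trace completes
    policies:
      - name: keep-errors         # keep 100% of traces with an error status
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow           # keep 100% of traces slower than 500 ms
        type: latency
        latency:
          threshold_ms: 500
      - name: healthy-baseline    # sample 5% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

Because the decision is made after the trace completes, the Collector must buffer spans for the decision_wait window, which is the memory cost that tail-based sampling trades for keeping every actionable trace.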
Frequently Asked Questions

What is the difference between Prometheus pull-based and push-based metrics collection?

Prometheus uses a pull model: the Prometheus server periodically scrapes (HTTP GET) metrics endpoints exposed by applications. The application exposes /metrics, and Prometheus fetches it every 15-30 seconds. Advantages of pull: (1) The Prometheus server controls the scrape rate — applications cannot overwhelm it by pushing too fast. (2) If an application is down, Prometheus detects it immediately (the scrape fails) — there is no ambiguity between an application being silent and being dead. (3) Applications do not need to know where to send metrics — they just expose an endpoint. Push-based systems (Graphite, InfluxDB, the Datadog Agent) require applications to actively send metrics to a collector. Advantages of push: it works for short-lived jobs (batch jobs that complete before Prometheus scrapes), through firewalls (the application pushes outbound, so no inbound port is needed), and for serverless functions. Prometheus supports push via the Pushgateway for short-lived jobs: the job pushes metrics to the Pushgateway, and Prometheus scrapes the Pushgateway. OpenTelemetry supports both models: the OTel Collector can scrape Prometheus endpoints and also receive pushed OTLP data.

How do you correlate metrics, logs, and traces for incident investigation?

Correlation is the golden path for incident investigation: metrics tell you something is wrong, logs explain what happened, and traces show where the bottleneck is. Implementation: (1) Include trace_id in all log entries. When the OTel SDK creates a span, it sets the trace_id in the logging context (MDC in Java, context variables in Python), and the structured JSON log includes trace_id as a field. (2) Add deployment annotations to Grafana dashboards. When you see a metric anomaly, the annotation shows whether a deployment occurred at that time. (3) Configure Grafana data source linking: link from a metric panel to Loki logs filtered by the same time window and service label, and from a log entry with a trace_id to Tempo or Jaeger to view the full trace. Investigation workflow: (1) An alert fires: the error-rate SLO burn rate exceeds its threshold. (2) Open the Grafana dashboard: identify the time window and affected service from the error rate graph. (3) Click through to Loki logs: filter by service and time window, and read the error messages and stack traces. (4) Find a trace_id in an error log and click through to Tempo: view the full request trace and identify the failing span (a database timeout, a downstream service returning 500). (5) Root cause identified in under five minutes.

How does Grafana Loki differ from Elasticsearch for log storage?

Elasticsearch indexes the full content of every log line, creating an inverted index that enables fast full-text search across any field. This is powerful but expensive: indexing requires significant CPU and memory, and the index itself can be larger than the raw log data. Elasticsearch logging infrastructure often costs 5-10x the application infrastructure it monitors. Grafana Loki takes a different approach: it indexes only the log labels (service name, namespace, pod name, log level) and stores the log content as compressed chunks in object storage (S3). Queries filter by labels first (fast, indexed), then grep through the matching log chunks (slower for full-text search) — like grep with an index on the filename but not the file content. Loki storage costs are 10-100x lower than Elasticsearch because object storage is cheap and there is no content-indexing overhead. The trade-off: Loki is slower for broad text searches across large time ranges. Searching for a specific error message across all services for the past 30 days is fast in Elasticsearch (indexed) but slow in Loki (it must scan all matching chunks). Loki excels when you know the service and time range and need to see recent logs — the most common debugging pattern.

What are the RED and USE methods for monitoring microservices?

RED and USE are complementary monitoring frameworks. RED (Rate, Errors, Duration) monitors the request-driven workload of each service. Rate: requests per second (throughput). Errors: the number or rate of failed requests (HTTP 5xx, gRPC errors). Duration: the distribution of request latency (P50, P95, P99). RED answers: is the service handling traffic? Is it failing? Is it slow? Every microservice should have a RED dashboard. RED was proposed by Tom Wilkie (Grafana) as the service-level counterpart of Google's Four Golden Signals (latency, traffic, errors, saturation). USE (Utilization, Saturation, Errors) monitors infrastructure resources. For each resource (CPU, memory, disk, network): Utilization is the percentage of the resource in use (e.g. CPU at 75%). Saturation is the degree to which the resource is overloaded (request queue length, number of threads waiting). Errors are hardware or software errors related to the resource (disk I/O errors, OOM kills). USE answers: is the infrastructure healthy? Is anything at capacity? USE was proposed by Brendan Gregg. In practice: use RED for application dashboards and alerting, and USE for infrastructure dashboards and capacity planning.