The Three Pillars of Observability
Observability is the ability to understand the internal state of a system from its external outputs. The three pillars serve complementary purposes:
- Metrics: Numeric time-series data aggregated over time. Used for trends, dashboards, and alerting. Efficient to store and query, but low resolution — they tell you something is wrong, not why. Example: `http_requests_total{status="500"}` spikes.
- Logs: Detailed event records with full context. Used for debugging and forensic analysis. High resolution but expensive to store and search at scale. Example: a log line showing the exact SQL query that failed with its parameters.
- Traces: Records of a request’s journey across services, showing each operation (span) with its duration and parent-child relationships. Used for latency attribution — finding which service or database call is responsible for slowness. Example: a 2-second API call traced to a missing index on a downstream DB query.
Modern observability systems correlate all three: a metric alert links to a trace (via exemplars), and that trace links to the relevant log lines (via trace_id). OpenTelemetry provides a unified SDK for all three signals.
Metrics Pipeline Architecture
A metrics pipeline moves data from instrumented services to storage and alerting:
- Instrumentation: Application code records metrics using a client library (Prometheus client, OpenTelemetry SDK). Metrics are exposed on an HTTP endpoint (`/metrics`) or pushed to a collector.
- Collection: Pull model (Prometheus scrapes `/metrics` endpoints on a configured interval) vs. push model (agents like Telegraf or OTLP exporters push to a remote endpoint). Pull is easier to reason about — the server controls the scrape rate and can detect when targets disappear.
- Storage: Time series database (TSDB). Prometheus has a built-in TSDB optimized for recent data. For long-term storage, remote write to Thanos, Cortex, or VictoriaMetrics — these provide horizontal scalability and multi-month retention.
- Alerting: Prometheus evaluates alert rules continuously; firing alerts are routed through Alertmanager to PagerDuty, Slack, or email with deduplication and grouping.
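The instrumentation and collection steps above can be sketched with nothing but the standard library: a toy in-memory registry rendered in the Prometheus text exposition format and served on `/metrics`. The metric names and values here are invented for illustration; a real service would use a client library such as prometheus_client rather than hand-rolling this.

```python
import http.server
import threading
import urllib.request

# Toy stand-in for a client library's registry: counters keyed by
# (metric name, label string). Values are invented for illustration.
counters = {('http_requests_total', '{status="200"}'): 42.0,
            ('http_requests_total', '{status="500"}'): 3.0}

def render_exposition() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = ['# TYPE http_requests_total counter']
    for (name, labels), value in sorted(counters.items()):
        lines.append(f'{name}{labels} {value}')
    return '\n'.join(lines) + '\n'

class MetricsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            body = render_exposition().encode()
            self.send_response(200)
            self.send_header('Content-Type', 'text/plain; version=0.0.4')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

# Serve on an ephemeral port, then "scrape" ourselves once, the way
# Prometheus would on its configured interval.
server = http.server.HTTPServer(('127.0.0.1', 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
scraped = urllib.request.urlopen(f'http://127.0.0.1:{port}/metrics').read().decode()
server.shutdown()
```

In the pull model, everything after `render_exposition()` is Prometheus's job: the service only has to keep the endpoint cheap and current.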
Metric Types
Prometheus defines four metric types:
- Counter: Monotonically increasing value. Use for counts of events (requests, errors, bytes sent). Always use `rate()` or `increase()` in queries — the raw counter value is rarely useful. Example: `http_requests_total`.
- Gauge: Arbitrary value that can go up or down. Use for current state snapshots: memory usage, queue depth, active connections. Example: `go_goroutines`.
- Histogram: Samples observations into configurable buckets and exposes counts per bucket plus sum and count. Used for latency and size distributions. Allows quantile calculation in PromQL with `histogram_quantile()`. Buckets must be configured at instrumentation time.
- Summary: Computes quantiles client-side. Less flexible for aggregation across instances compared to histograms — prefer histograms for multi-instance deployments.
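A histogram's cumulative buckets, and the linear interpolation behind `histogram_quantile()`, can be sketched in a few lines. The bucket bounds and sample latencies below are invented; Prometheus applies the same interpolation server-side.

```python
# Cumulative histogram in the Prometheus style: bucket le=b counts all
# observations <= b; an implicit +Inf bucket catches everything else.
bounds = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5]  # seconds; invented for illustration

def observe(cumulative, value):
    """Record one observation: increment every bucket that covers it."""
    for i, b in enumerate(bounds):
        if value <= b:
            cumulative[i] += 1
    cumulative[-1] += 1  # +Inf bucket, doubles as the total count

def quantile(cumulative, q):
    """Approximate histogram_quantile(): find the bucket where the
    cumulative count crosses q * total, interpolate linearly inside it."""
    total = cumulative[-1]
    rank = q * total
    prev_count, prev_bound = 0.0, 0.0
    for i, b in enumerate(bounds):
        if cumulative[i] >= rank:
            in_bucket = cumulative[i] - prev_count
            return prev_bound + (b - prev_bound) * (rank - prev_count) / in_bucket
        prev_count, prev_bound = cumulative[i], b
    return bounds[-1]  # rank falls in +Inf: clamp to the largest finite bound

counts = [0] * (len(bounds) + 1)
for latency in [0.03, 0.07, 0.07, 0.2, 0.4, 0.4, 0.8, 1.9]:
    observe(counts, latency)
```

This also shows why bucket choice matters: the estimate is only as precise as the bucket the quantile lands in.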
Prometheus and Grafana
Prometheus is the de facto standard for metrics in cloud-native environments. Key design decisions:
- Pull-based scraping: Prometheus polls each target at a configured interval (default 15s). Service discovery (Kubernetes SD, Consul SD) automatically finds targets. This makes it easy to detect dead targets.
- PromQL: A functional query language for time series. Supports rate calculations, aggregations, joins between series. Example:
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])gives the error rate as a fraction. - Alertmanager: Receives alerts from Prometheus, handles deduplication (same alert fires on multiple replicas → one notification), grouping (related alerts combined), and routing (different teams get different alert subsets).
Grafana connects to Prometheus (and Loki for logs, Tempo for traces) as data sources. Dashboards are defined in JSON and can be version-controlled. Grafana Alerting can replace Alertmanager for unified alert management across data sources.
Structured Logging
Structured logging emits logs as machine-parseable records (JSON) rather than free-form text strings. Every log entry should include a consistent set of fields:
{
"timestamp": "2026-04-18T14:32:01Z",
"level": "error",
"service": "order-service",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"request_id": "req-8821",
"user_id": 42,
"message": "payment charge failed",
"error": "card declined: insufficient funds"
}
Including `trace_id` is critical: it allows jumping from a log line directly to the corresponding distributed trace. `request_id` allows correlating all log lines for a single request within a service. Use a logging library that injects these fields from context automatically (Go’s slog, Java’s MDC, Python’s structlog) rather than manually formatting strings.
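A minimal sketch of context-injected JSON logging using only Python's standard library. The field set mirrors the example record above; a library like structlog would handle this more completely (bound contexts, processors), but the core idea is just a formatter that reads request-scoped context.

```python
import contextvars
import io
import json
import logging

# Request-scoped context: set once when the request arrives, read by
# every log call on that request without being passed explicitly.
trace_id_var = contextvars.ContextVar('trace_id', default=None)

class ContextJsonFormatter(logging.Formatter):
    """Emit each record as one JSON object, injecting trace_id from context."""
    def format(self, record):
        return json.dumps({
            'timestamp': self.formatTime(record),
            'level': record.levelname.lower(),
            'service': 'order-service',   # normally set from config/env
            'trace_id': trace_id_var.get(),
            'message': record.getMessage(),
        })

stream = io.StringIO()  # stand-in for stdout so we can inspect the output
handler = logging.StreamHandler(stream)
handler.setFormatter(ContextJsonFormatter())
log = logging.getLogger('demo')
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

trace_id_var.set('4bf92f3577b34da6a3ce929d0e0e4736')  # per-request setup
log.error('payment charge failed')
entry = json.loads(stream.getvalue())
```

The call site stays a plain `log.error(...)`; the formatter, not the caller, is responsible for the correlation fields.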
Log Aggregation Pipeline
At scale, log aggregation follows a fan-in pipeline:
- Shippers: Filebeat or Fluentd run as DaemonSets on each node, tail container log files, parse JSON, enrich with Kubernetes metadata (pod name, namespace, labels), and forward to a buffer.
- Buffer/Queue: Kafka decouples shippers from storage. It absorbs traffic spikes and allows multiple consumers (search index + cold storage + stream processing) from a single log stream.
- Storage and Search: Elasticsearch/OpenSearch indexes logs for full-text and field-based queries. Grafana Loki is a lighter alternative that indexes only labels (not full text), making it cheaper at the cost of query power. Loki stores compressed log chunks in object storage (S3).
- Retention: Hot storage (fast SSD, 7-30 days) for recent debugging; cold storage (S3 Glacier) for compliance retention (1-7 years).
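The shipper's parse-and-enrich step can be sketched as a pure function. The pod metadata here is invented for illustration; a real shipper like Fluentd looks it up from the kubelet or the Kubernetes API server.

```python
import json

# Hypothetical Kubernetes metadata a DaemonSet shipper would resolve
# for the pod that wrote this log file; values invented for illustration.
pod_metadata = {'pod': 'order-service-7d9f', 'namespace': 'prod',
                'labels': {'app': 'order-service'}}

def enrich(raw_line: str) -> dict:
    """Parse one JSON log line and attach Kubernetes metadata,
    mirroring the enrichment a shipper performs before forwarding."""
    record = json.loads(raw_line)
    record['kubernetes'] = pod_metadata
    return record

enriched = enrich('{"level": "error", "message": "payment charge failed"}')
```

Enriching at the node means downstream consumers (Elasticsearch, Loki, stream processors) all see the same labels without re-deriving them.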
Distributed Tracing
A trace is a tree of spans. Each span represents one operation: an HTTP call, a DB query, a cache lookup. Spans carry:
- Trace ID (shared across all spans in one request)
- Span ID (unique per span)
- Parent Span ID (builds the tree structure)
- Start time and duration
- Status (OK / error)
- Attributes (key-value pairs: HTTP method, DB statement, user ID)
The OpenTelemetry SDK instruments libraries automatically (HTTP clients, DB drivers, message queue consumers) and propagates trace context via HTTP headers (traceparent per W3C Trace Context spec). Spans are exported via OTLP (OpenTelemetry Protocol) to an OTel Collector, which buffers, transforms, and routes them to Jaeger or Grafana Tempo for storage and querying. Sampling is critical at scale: head-based sampling (decide at trace root, e.g., sample 1% of traces) or tail-based sampling (buffer spans and sample based on outcome — always keep error traces and slow traces).
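Context propagation hinges on the `traceparent` header. A sketch of parsing its W3C Trace Context layout (2-hex-digit version, 32-hex-digit trace ID, 16-hex-digit parent span ID, 2-hex-digit flags, with bit 0 of the flags meaning "sampled"):

```python
import re

# W3C Trace Context traceparent: version-traceid-spanid-flags,
# all lowercase hex, dash-separated.
TRACEPARENT = re.compile(
    r'^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-'
    r'(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$')

def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields; reject malformed input."""
    m = TRACEPARENT.match(header)
    if m is None:
        raise ValueError(f'malformed traceparent: {header!r}')
    fields = m.groupdict()
    fields['sampled'] = int(fields['flags'], 16) & 0x01 == 1
    return fields

# Example header using the spec's illustrative trace and span IDs.
ctx = parse_traceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')
```

A service receiving this header starts its spans with the same trace ID and uses the incoming span ID as the parent, which is how the span tree spans process boundaries.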
Exemplars: Linking Metrics to Traces
Exemplars solve the gap between metrics and traces. A histogram metric data point can carry an exemplar: a sample observation annotated with a trace ID. When Grafana shows a latency histogram and you see a spike at the 99th percentile, you can click the exemplar point to jump directly to the trace that produced that observation. This workflow — metric alert → exemplar → trace → log lines — is the practical observability loop for diagnosing production incidents. Prometheus supports exemplars in the OpenMetrics format; Grafana renders them as dots on histogram panels.
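In the OpenMetrics text format, an exemplar rides on a bucket sample line after a `#` separator: exemplar labels, the observed value, and an optional timestamp. A sketch of rendering one such line (the metric name, count, and trace ID below are invented for illustration):

```python
def exemplar_line(metric, labels, count, trace_id, observed, ts=None):
    """Render one OpenMetrics bucket sample carrying an exemplar:
    '<metric>{<labels>} <count> # {trace_id="..."} <observed> [<ts>]'."""
    label_str = ','.join(f'{k}="{v}"' for k, v in labels.items())
    line = f'{metric}{{{label_str}}} {count} # {{trace_id="{trace_id}"}} {observed}'
    if ts is not None:
        line += f' {ts}'
    return line

# One slow observation (0.43s) annotated with the trace that produced it.
line = exemplar_line('http_request_duration_seconds_bucket', {'le': '0.5'},
                     1044, '4bf92f3577b34da6a3ce929d0e0e4736', 0.43)
```

Grafana reads the trace ID out of exactly this annotation to render the clickable dot on the histogram panel.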
SLO Tracking and Error Budgets
Service Level Objectives (SLOs) quantify reliability targets. Design decisions:
- SLI (Service Level Indicator): The metric being measured. Error rate SLI: `good_requests / total_requests`. Latency SLI: fraction of requests under 200ms.
- SLO target: e.g., 99.9% of requests succeed over a 28-day rolling window.
- Error budget: `1 - SLO_target` = 0.1% of requests can fail. Over 28 days (~2,419,200 seconds), that’s ~2,419 seconds of allowed downtime.
- Burn rate alerts: Alert when the error budget is being consumed faster than sustainable. A burn rate of 1.0 means you’re consuming budget at exactly the rate that would exhaust it by end of window. Alert at burn rate > 14.4 (at that rate the whole 28-day budget is gone in about 2 days) for a fast-burn page, and burn rate > 1 sustained over 6 hours for a slow-burn ticket.
Implementing SLO tracking in Prometheus uses recording rules to compute the good/total ratio over the window, and alert rules for burn rate thresholds. Grafana dashboards visualize remaining budget and burn rate trends.
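The error-budget arithmetic above can be checked directly; the constants match the 99.9% target and 28-day window from the text.

```python
WINDOW_SECONDS = 28 * 24 * 3600   # 28-day rolling window = 2,419,200 s
SLO_TARGET = 0.999                # 99.9% of requests succeed

error_budget = 1 - SLO_TARGET                        # 0.1% may fail
allowed_bad_seconds = error_budget * WINDOW_SECONDS  # ~2,419 s of downtime

def burn_rate(error_ratio: float) -> float:
    """How fast the budget is burning: the observed error ratio divided
    by the budgeted ratio. 1.0 exhausts the budget exactly at window end."""
    return error_ratio / error_budget

# A sustained 1.44% error rate burns budget 14.4x too fast -- the
# fast-burn paging threshold from the text.
fast_burn = burn_rate(0.0144)
```

In Prometheus the same ratio is computed by a recording rule over a short window (e.g., errors over the last hour divided by the budget fraction), with the alert rule comparing it to 14.4.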
Cardinality Explosion in Metrics
Each unique combination of label values creates a new time series in the TSDB. This is cardinality. High-cardinality labels cause storage and memory explosion:
- Adding a `user_id` label to an HTTP request counter creates one time series per user — millions of series for a large product. Prometheus will OOM.
- Adding `request_id` or `trace_id` as a label is catastrophically high-cardinality. These belong in logs and traces, not metrics.
- Safe labels have bounded cardinality: `status_code` (~10 values), `http_method` (~5 values), `service` (tens of values), `endpoint` (hundreds — manageable).
To detect cardinality issues: the `prometheus_tsdb_head_series` gauge shows total active series. Tools like `mimirtool analyze` or Grafana’s cardinality explorer identify which label combinations are responsible for high series counts. Mitigation: drop high-cardinality labels at the collector layer (OTel Collector transform processor) before metrics reach Prometheus.
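Why cardinality multiplies rather than adds can be shown in a few lines (the label values below are invented for illustration):

```python
# Each unique combination of label values is one time series, so the
# series count is the *product* of the per-label cardinalities.
label_values = {
    'status_code': ['200', '404', '500'],
    'http_method': ['GET', 'POST'],
    'endpoint': ['/orders', '/users'],
}

def series_count(labels: dict) -> int:
    """Worst-case active series for a metric with these labels."""
    n = 1
    for values in labels.values():
        n *= len(values)
    return n

safe = series_count(label_values)  # 3 * 2 * 2 = 12 series

# Adding user_id with a million values multiplies every existing
# combination by a million -- this is the explosion.
exploded = series_count({**label_values, 'user_id': range(1_000_000)})
```

The same three bounded labels that cost 12 series cost 12 million the moment one unbounded label joins them.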
OpenTelemetry as the Unified Standard
OpenTelemetry (OTel) is the CNCF project that standardizes observability instrumentation. A single OTel SDK emits metrics, logs, and traces in a vendor-neutral format. Key components:
- SDK: Language-specific library (Go, Java, Python, JS, etc.) with auto-instrumentation plugins for popular frameworks and libraries.
- OTLP: The OpenTelemetry Protocol — a gRPC and HTTP/protobuf transport for all three signals. Replaces proprietary agents (Jaeger agent, Datadog agent) with a single protocol.
- Collector: A vendor-neutral proxy that receives OTLP, applies processors (sampling, attribute enrichment, filtering), and exports to multiple backends simultaneously (Prometheus, Jaeger, Datadog, Honeycomb). Decouples applications from backend choices.
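The collector's processor-then-fan-out role can be sketched in miniature. This is an illustration of the idea only, not the real OTel Collector (a Go binary configured in YAML); the span shape and backend names are invented.

```python
from typing import Callable

# Stand-ins for configured exporters: each backend receives every span.
exported = {'jaeger': [], 'tempo': []}

def scrub_user_id(span: dict) -> dict:
    """Processor: drop a high-cardinality attribute before export,
    analogous to an attribute-filtering processor in the collector."""
    attrs = {k: v for k, v in span['attributes'].items() if k != 'user_id'}
    return {**span, 'attributes': attrs}

def pipeline(span: dict, processors: "list[Callable[[dict], dict]]") -> None:
    """Run the span through every processor, then fan it out to all
    exporters -- receive, process, export, in that order."""
    for proc in processors:
        span = proc(span)
    for backend in exported:
        exported[backend].append(span)

pipeline({'name': 'GET /orders',
          'attributes': {'http.method': 'GET', 'user_id': 42}},
         [scrub_user_id])
```

Because processors run before export, every backend sees the same cleaned data, and swapping backends is a change to the export list, not to the application.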
Adopting OTel means you can switch observability backends (e.g., from self-hosted Jaeger to Grafana Cloud) without changing application instrumentation code — only the collector configuration changes.