The Three Pillars of Observability
Observability is the ability to understand the internal state of a system from its external outputs. Three signals provide different lenses: (1) Metrics: numeric time-series measurements — CPU usage at 2:00 PM, request rate per second, error rate. Aggregated and sampled, metrics are cheap to store and fast to query. (2) Logs: discrete event records — “user 12345 failed login at 2:00:03 PM from IP 1.2.3.4.” Rich context, but expensive to store and slow to query at scale. (3) Traces: records of requests as they flow through distributed services — a single user request that touches API server, auth service, database, and cache produces one trace with spans for each hop. Expensive to collect, but irreplaceable for diagnosing latency and failures across service boundaries.
Metrics Pipeline
Metrics flow through four stages: (1) Collection: agents running on each host (Datadog Agent, Prometheus Node Exporter) scrape metrics from the OS, applications, and infrastructure every 15-60 seconds. Applications emit custom metrics via StatsD UDP datagrams or Prometheus exposition format. (2) Aggregation: the agent aggregates metrics locally for 10 seconds before sending — converts 1,000 individual counter increments into one batched count. This reduces network traffic 100x. (3) Ingestion: metric data lands in a distributed time-series store via a write-optimized ingestion API. At Datadog scale (25 trillion data points per day), writes are streamed to Kafka and persisted by a cluster of time-series database nodes. (4) Query: dashboards and alerts query the time-series store, applying aggregation functions (avg, max, percentile) over time windows and tag dimensions.
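The collection and aggregation stages can be sketched in a few lines. This is a minimal, hypothetical StatsD-style client (the class name and buffering behavior are illustrative, not any real library's API): increments are aggregated in memory, and a flush sends one `name:value|c` counter datagram per metric instead of one packet per increment, which is where the ~100x traffic reduction comes from. A real agent would flush on a 10-second timer.

```python
import socket
from collections import defaultdict

class BufferedStatsd:
    """Hypothetical StatsD-style client that aggregates counters
    locally and flushes one UDP datagram per metric."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.counters = defaultdict(int)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, name, value=1):
        # Aggregate in memory: 1,000 increments become one number.
        self.counters[name] += value

    def flush(self):
        # One "name:value|c" payload per metric, StatsD counter format.
        payloads = [f"{name}:{value}|c" for name, value in self.counters.items()]
        for p in payloads:
            self.sock.sendto(p.encode(), self.addr)
        self.counters.clear()
        return payloads
```

One thousand calls to `incr("requests")` followed by a flush produce a single `requests:1000|c` datagram.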
Time-Series Storage
Time-series databases are optimized for write-heavy workloads where data is always appended in time order. InfluxDB uses a Log-Structured Merge tree; Prometheus uses chunk-based local storage with remote write for long-term retention; Datadog built a custom columnar time-series store (Husky). Key optimizations: (1) Delta encoding: store the difference between consecutive values rather than absolute values — CPU usage oscillating around 40% stores tiny deltas, not repeated 40s. (2) Run-length encoding: a metric that stays constant for an hour stores one value + count instead of 3,600 individual values. (3) Downsampling: high-resolution data (1s granularity) is retained for 15 days; after that, it is rolled up to 1-minute resolution for 3 months, then 1-hour resolution forever. Queries on old data automatically use the appropriate resolution.
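Delta encoding and run-length encoding compose naturally, since a constant metric produces a constant (zero) delta stream. A minimal sketch of the two optimizations (simplified: real engines like Gorilla work at the bit level, not on Python lists):

```python
def delta_encode(values):
    """Store the first value plus differences between consecutive samples."""
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out

def run_length_encode(deltas):
    """Collapse runs of identical deltas into (value, count) pairs."""
    runs = []
    for d in deltas:
        if runs and runs[-1][0] == d:
            runs[-1] = (d, runs[-1][1] + 1)
        else:
            runs.append((d, 1))
    return runs
```

An hour of one-second samples at a constant 40.0 collapses from 3,600 values to two pairs: `[(40.0, 1), (0.0, 3599)]`.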
Log Aggregation
Logs are shipped from applications to a central aggregation system. Architecture: application writes logs to stdout; a log shipper (Filebeat, Fluentd, or Datadog Agent) reads the log file, parses and enriches each line (adds host, service, environment tags), and forwards batches to Kafka. A stream processing job (Logstash, Flink) filters, transforms, and indexes logs into Elasticsearch for full-text search. Challenges at scale: (1) Cardinality: free-text fields (user IDs in log messages) create millions of unique terms in the Elasticsearch index — this explodes index size and slows queries. Structured logging (JSON with typed fields) mitigates this. (2) Sampling: at 10GB/second log volume, storing everything is prohibitively expensive. Sample high-volume INFO logs at 10%, keep all ERROR logs. (3) Retention: store raw logs in S3 cold storage (cheap, slow) and only index recent data in Elasticsearch (expensive, fast).
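The shipper-side enrichment and the level-based sampling policy can be sketched as follows. The function names and tag fields are illustrative assumptions, not any specific shipper's configuration; the policy matches the one described above (keep every ERROR, sample INFO at 10%):

```python
import json
import random

def enrich(line, host, service, env):
    """Shipper-side enrichment: parse a structured (JSON) log line
    and attach routing tags before forwarding to Kafka."""
    record = json.loads(line)
    record.update({"host": host, "service": service, "env": env})
    return record

def should_keep(record, info_sample_rate=0.10):
    """Sampling policy: keep all ERROR logs, sample high-volume INFO logs."""
    if record.get("level") == "ERROR":
        return True
    return random.random() < info_sample_rate
```

Structured (JSON) input is what makes the cardinality problem manageable: `user_id` lands in a typed field rather than inside a free-text message.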
Distributed Tracing
A trace is a directed acyclic graph of spans, where each span represents work in one service. The trace ID propagates across service boundaries via HTTP headers (W3C Trace Context: traceparent, tracestate). Each service creates a child span when processing a request, records the start time, end time, status, and key-value attributes, then reports it to the trace collector (Jaeger, Zipkin, Datadog APM). At high throughput, head-based sampling (record 1% of traces, decided when the root span starts) reduces storage cost — but because most traffic is fast and successful, the retained sample is dominated by uninteresting traces and misses most errors. Tail-based sampling is more intelligent: buffer spans in a collector, and after the root span arrives (indicating the full trace is complete), decide whether to retain based on error status or latency — keep 100% of error traces and slow traces, sample 1% of fast successful traces.
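Context propagation comes down to one header. A minimal sketch of building and parsing a W3C `traceparent` header (`version-traceid-spanid-flags`, all hex); a service receiving the header parses out the trace ID, then creates a child span carrying the same trace ID and a fresh span ID:

```python
import os

def make_traceparent(trace_id=None, sampled=True):
    """Build a W3C Trace Context traceparent header.
    Format: 00-<16-byte trace id>-<8-byte span id>-<flags>, hex-encoded."""
    trace_id = trace_id or os.urandom(16).hex()  # reuse inherited ID if given
    span_id = os.urandom(8).hex()                # this hop's new span ID
    flags = "01" if sampled else "00"            # bit 0 = sampled
    return f"00-{trace_id}-{span_id}-{flags}", trace_id, span_id

def parse_traceparent(header):
    """Extract trace ID, parent span ID, and sampled flag from a request."""
    _version, trace_id, span_id, flags = header.split("-")
    return trace_id, span_id, flags == "01"
```

The root service calls `make_traceparent()` with no arguments; every downstream service parses the incoming header and calls it again with the inherited `trace_id`, which is what stitches the spans into one trace.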
Alerting Architecture
Alerts evaluate metric queries on a schedule (every 60 seconds) and trigger notifications when thresholds are crossed. Architecture: a rule engine reads alert definitions from a database, runs the metric query, evaluates the threshold condition, and fires a notification (PagerDuty, Slack, email) when transitioning from OK to ALERT state. Alert deduplication: track alert state per rule — only notify on state transitions, not on every evaluation. Alert grouping: multiple related alerts fire at once during an incident; group them into a single notification to prevent alert fatigue. Multi-window alerts: an alert on a short window (1 minute) detects fast spikes; an alert on a long window (1 hour) catches slow degradations. Combine both to cover different failure modes.
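The deduplication rule above is a two-state machine. A minimal sketch (the function shape is an assumption for illustration): each evaluation cycle computes the new state, and a notification fires only when the state changes, not on every cycle while the metric stays above threshold.

```python
def evaluate(rule_state, metric_value, threshold):
    """One evaluation cycle of the rule engine (sketch).
    Notify only on OK -> ALERT or ALERT -> OK transitions."""
    new_state = "ALERT" if metric_value > threshold else "OK"
    notify = new_state != rule_state   # deduplication: transitions only
    return new_state, notify
```

A metric sitting at 95% against a 90% threshold triggers one page on the OK→ALERT transition, then stays silent on every subsequent evaluation until recovery.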
Interview Tips
- Three pillars (metrics, logs, traces) is the expected opening — know the tradeoffs of each
- Time-series storage: delta encoding + downsampling are the key optimizations
- Log aggregation: Kafka as the ingestion buffer is the standard pattern
- Tail-based trace sampling is more sophisticated than head sampling — mention both
- Alert deduplication (state transitions, not every evaluation) prevents alert fatigue
Frequently Asked Questions
What are the three pillars of observability?
The three pillars of observability are metrics, logs, and traces. Metrics are numeric time-series measurements (CPU usage, request rate, error rate) that are aggregated and cheap to store — ideal for dashboards and alerting. Logs are discrete event records with rich context ("user 12345 failed login from IP 1.2.3.4") — expensive to store and full-text search at scale, but irreplaceable for debugging specific incidents. Traces record how a single request flows through multiple services — each service creates a span recording its work; spans are linked by a trace ID that propagates across service calls via HTTP headers. Traces are expensive to collect but essential for diagnosing latency issues and failures in microservices architectures where a single user request may touch 10+ services.
How does a time-series database store metrics efficiently?
Time-series databases apply several compression techniques that exploit the temporal nature of metrics data: (1) Delta encoding: store the difference between consecutive values rather than absolute values. A CPU metric oscillating around 40% stores tiny deltas (0.1, -0.2, 0.3) instead of repeated 40s. (2) Delta-of-delta encoding: for counters that increase monotonically, the differences between deltas are even smaller. (3) Run-length encoding: a constant metric stores one value plus the count of repetitions. (4) Gorilla encoding (used by Facebook and Prometheus): XOR-based float encoding that exploits the fact that consecutive timestamps and values share many bits. Together these achieve 10-12x compression ratios, reducing storage cost dramatically. Downsampling further reduces cost: keep 15-second granularity for 2 weeks, roll up to 1-minute for 3 months, 1-hour for 2 years.
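Delta-of-delta encoding is easy to see with a concrete counter. For a request counter growing at a steady ~50 requests per interval, the first-order deltas hover around 50, and the second-order deltas are almost all zeros, which is what compresses so well:

```python
def delta_of_delta(values):
    """Second-order deltas: for a monotonically increasing counter with a
    steady rate, the delta-of-delta stream is nearly all zeros."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]
```

A counter sampled as `[100, 150, 200, 250, 301]` has deltas `[50, 50, 50, 51]` and delta-of-deltas `[0, 0, 1]` — tiny integers that fit in a few bits each.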
What is tail-based sampling for distributed traces?
Head-based sampling decides whether to record a trace at the moment the root span starts — before the trace is complete. It is simple to implement but biased: it samples by arrival rate, keeping 1% of fast successful requests and 1% of slow failing requests, which misses most errors. Tail-based sampling buffers spans in a collector until the root span completes (indicating the full trace is available), then makes the sampling decision based on the complete trace. Decision criteria: keep 100% of traces that contain an error span, 100% of traces with latency above the 99th percentile, and 1% of fast successful traces. This dramatically improves the signal-to-noise ratio for debugging. The tradeoff: tail-based sampling requires buffering all spans for potentially tens of seconds, consuming significant memory in the trace collector.
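The tail-based decision logic described above fits in one function. This is a sketch with an assumed trace shape (a dict of spans plus a precomputed duration; real collectors like the OpenTelemetry Collector express this as sampling policies, not inline code):

```python
import random

def tail_sample(trace, p99_latency_ms, base_rate=0.01):
    """Tail-based sampling decision, made after the full trace is buffered.
    Keep all errors, all slow traces, and a small fraction of the rest."""
    if any(span["status"] == "error" for span in trace["spans"]):
        return True                      # keep 100% of error traces
    if trace["duration_ms"] > p99_latency_ms:
        return True                      # keep 100% of slow traces
    return random.random() < base_rate   # sample fast, successful traces
```

The memory tradeoff is visible in the signature: the decision needs the whole `trace`, so every span must sit in the collector's buffer until the root span closes.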