Datadog monitors infrastructure and applications for thousands of organizations, ingesting trillions of data points per day across metrics, logs, and traces. Designing a monitoring platform tests your understanding of high-throughput time-series ingestion, multi-tenant data isolation, real-time dashboards, and correlating signals across the three pillars of observability. This guide covers the architecture that makes Datadog a unified observability platform.
Agent and Data Collection
The Datadog Agent runs on every monitored host (VM, container, Kubernetes pod). It collects: (1) Infrastructure metrics — CPU, memory, disk, network (from the OS). Process-level metrics (per-process CPU and memory). Container metrics (from the Docker/containerd API). Kubernetes metrics (from the kubelet and kube-state-metrics). (2) Integration metrics — the agent includes 500+ integrations. Each collects metrics from a specific technology: PostgreSQL (connections, query time, replication lag), Redis (memory, hit rate, commands/sec), Nginx (requests, connections, upstream response time), and custom applications (via the StatsD or DogStatsD protocols). (3) Logs — tail log files, collect from journald, or receive via TCP/UDP. Parse, enrich with tags (host, service, environment), and forward. (4) APM traces — instrument application code (auto-instrumentation for Java, Python, Go, Node.js). Collect distributed traces and send to the Datadog backend. Agent architecture: the agent is a multi-component process: collector (gathers metrics from integrations every 15 seconds), forwarder (batches and sends data to the Datadog intake API via HTTPS), and log agent (tails files, parses, and ships logs). The agent compresses and batches data to minimize network overhead. At 15-second collection intervals: ~4 requests per minute to the Datadog API per host.
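To make the custom-application path concrete, here is a minimal sketch of emitting a metric in the DogStatsD wire format (name:value|type|#tags) over UDP to a local agent. The metric name and tags are illustrative, and the helper functions are hypothetical, not part of any Datadog client library — in practice you would use an official DogStatsD client.

```python
import socket

def dogstatsd_packet(name, value, metric_type, tags=None):
    """Format a metric in the DogStatsD wire format: name:value|type|#tag1,tag2."""
    packet = f"{name}:{value}|{metric_type}"
    if tags:
        packet += "|#" + ",".join(tags)
    return packet

def send_metric(name, value, metric_type="c", tags=None, host="127.0.0.1", port=8125):
    # The agent listens for DogStatsD packets on UDP port 8125 by default.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(dogstatsd_packet(name, value, metric_type, tags).encode(), (host, port))
    finally:
        sock.close()

# Hypothetical example: increment a counter tagged with service and environment.
send_metric("checkout.orders", 1, "c", tags=["service:api", "env:production"])
```

UDP is fire-and-forget by design: a metrics emitter should never block or fail the application when the agent is briefly unavailable.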
Metrics Ingestion Pipeline
Datadog ingests trillions of metric data points per day. Pipeline: (1) Intake API — receives metric payloads from agents. Validates: authentication (API key), payload format, and rate limits (per-organization). The intake is a horizontally scaled fleet of stateless HTTP servers behind a load balancer. (2) Routing — metrics are routed by organization_id and metric name to the correct processing and storage tier. Kafka acts as the message bus between intake and processing. (3) Aggregation — for high-cardinality metrics (per-request latency with 100K unique tag combinations), the ingestion pipeline pre-aggregates: compute min, max, sum, count, and percentiles per 10-second window. This reduces storage volume by 10-100x while preserving statistical accuracy. (4) Storage — metrics are stored in a custom time-series database optimized for: high write throughput (millions of data points per second), efficient time-range queries (dashboard time windows), and tag-based filtering (show CPU for hosts in us-east-1 with service=api). The TSDB uses: columnar storage per metric, delta-of-delta timestamp compression, and XOR value compression (similar to Facebook's Gorilla paper). Retention: high-resolution (15-second intervals) for 15 days. 1-minute aggregates for 15 months. 1-hour aggregates for 5 years. Automatic rollup reduces storage costs for historical data.
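The pre-aggregation step above can be sketched as follows — a minimal, assumed implementation (not Datadog's actual code) that collapses raw points for one metric/tag combination into per-window min/max/sum/count summaries:

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # aggregation window from the pipeline description

def pre_aggregate(points):
    """Collapse raw (timestamp, value) points into per-window summaries.

    points: iterable of (unix_timestamp, value) for one metric + tag combination.
    Returns {window_start: {"min", "max", "sum", "count"}}.
    """
    windows = defaultdict(lambda: {"min": float("inf"), "max": float("-inf"),
                                   "sum": 0.0, "count": 0})
    for ts, value in points:
        w = windows[ts - ts % WINDOW_SECONDS]  # align timestamp to window start
        w["min"] = min(w["min"], value)
        w["max"] = max(w["max"], value)
        w["sum"] += value
        w["count"] += 1
    return dict(windows)

# Four request-latency samples (ms): three land in window [0, 10), one in [10, 20).
agg = pre_aggregate([(0, 120.0), (3, 95.0), (9, 140.0), (12, 80.0)])
```

Storing four numbers per window instead of every raw point is where the 10-100x reduction comes from; averages remain exact (sum/count), while percentiles require a mergeable sketch such as DDSketch rather than this simple summary.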
Multi-Tenant Architecture
Datadog serves thousands of organizations on shared infrastructure. Isolation: (1) Data isolation — each metric, log, and trace is tagged with org_id. All queries include org_id as a mandatory filter. No cross-organization data access is possible. (2) Performance isolation — per-organization rate limiting prevents one customer from consuming disproportionate resources. Custom metric limits: each plan tier has a maximum number of custom metrics. Exceeding the limit blocks new metrics (not existing ones). (3) Compute isolation — dashboard queries are queued and executed with per-organization concurrency limits. A single customer's complex dashboard cannot starve other customers' queries. Resource pools: priority queues ensure real-time alerting queries take precedence over dashboard refreshes. Billing: Datadog meters: number of monitored hosts (the primary billing unit), custom metrics count, log volume (GB ingested and retained), APM traces (spans per month), and additional products (Synthetics, RUM, CSPM). Metering is real-time: the agent reports its host count, the intake tracks metric cardinality, and the log pipeline meters ingested bytes. Usage data feeds the billing system with per-hour granularity. Customers see their usage in real-time dashboards to avoid surprise bills.
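Per-organization rate limiting at the intake tier is commonly built on a token bucket. A minimal sketch, with illustrative (not actual Datadog) limits — one bucket per org_id, refilled continuously, rejecting payloads once the org exhausts its burst allowance:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate: float       # tokens (accepted data points) replenished per second
    capacity: float   # maximum burst size
    tokens: float = 0.0
    last: float = field(default_factory=time.monotonic)

    def allow(self, cost=1.0):
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # over the per-org limit: reject or defer the payload

# One bucket per organization; rate/capacity would come from the org's plan tier.
buckets = {}
def admit(org_id, cost=1.0, rate=1000.0, capacity=5000.0):
    bucket = buckets.setdefault(org_id, TokenBucket(rate, capacity, tokens=capacity))
    return bucket.allow(cost)
```

The capacity term is what distinguishes this from a fixed quota: a bursty but well-behaved agent fleet is absorbed, while a sustained flood from one tenant is throttled without affecting anyone else.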
Dashboards and Query Engine
A dashboard contains widgets (panels) that visualize metrics, logs, and traces. Each widget executes a query against the storage backend. Query language: Datadog uses a custom query language: avg:system.cpu.user{service:api, env:production} by {host}. This reads: compute the average of the system.cpu.user metric, filtered by the tags service=api AND env=production, grouped by host. Query execution: (1) The frontend sends the query to the API. (2) The query engine parses the query, determines which storage shards hold the relevant data (based on metric name and time range), and fans out queries to those shards. (3) Each shard returns its partial result. The query engine merges and aggregates. (4) The result is returned to the frontend for rendering. Dashboard refresh: dashboards auto-refresh every 30-60 seconds. For a dashboard with 20 widgets: 20 queries per refresh. For popular dashboards viewed by many users: cache the query results for 15-30 seconds. Identical queries from different users within the cache window share the same result. Template variables: dashboards support dropdown filters (select service, select environment). Changing a variable re-executes all widget queries with the new filter. This enables one dashboard to serve all services — users select their service from the dropdown. Dashboard-as-code: dashboards can be created and managed via API or Terraform (the Datadog Terraform provider manages dashboards, monitors, and SLOs as code).
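The parse/fan-out/merge flow above can be sketched in a few lines. This is an assumed toy implementation, not Datadog's query engine: a regex parser for the aggregation:metric{filters} by {group} shape, and a merge step showing why shards must return (sum, count) partials rather than pre-computed averages:

```python
import re

QUERY_RE = re.compile(
    r"(?P<agg>\w+):(?P<metric>[\w.]+)\{(?P<filters>[^}]*)\}"
    r"(?:\s+by\s+\{(?P<group>[^}]*)\})?"
)

def parse_query(q):
    """Parse a query like avg:system.cpu.user{service:api} by {host}."""
    m = QUERY_RE.match(q)
    return {
        "agg": m.group("agg"),
        "metric": m.group("metric"),
        "filters": dict(f.strip().split(":") for f in m.group("filters").split(",") if f.strip()),
        "group_by": [g.strip() for g in (m.group("group") or "").split(",") if g.strip()],
    }

def merge_avg(partials):
    """Combine per-shard {group_key: (sum, count)} partials into global averages."""
    totals = {}
    for shard in partials:
        for key, (s, c) in shard.items():
            ts, tc = totals.get(key, (0.0, 0))
            totals[key] = (ts + s, tc + c)
    return {key: s / c for key, (s, c) in totals.items()}

q = parse_query("avg:system.cpu.user{service:api, env:production} by {host}")
# host-a's data spans two shards; merging sums and counts keeps the average exact.
result = merge_avg([{"host-a": (150.0, 3)}, {"host-a": (50.0, 1)}, {"host-b": (80.0, 2)}])
```

Averaging per-shard averages would be wrong whenever shards hold unequal point counts; shipping decomposable partial aggregates is the standard fix for scatter-gather engines.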
Alerting and On-Call Integration
Datadog monitors (alerts) evaluate metric queries against thresholds. Monitor types: (1) Metric monitor — alert when a metric crosses a threshold: avg(last_5m):avg:system.cpu.user{service:api} > 80. (2) Log monitor — alert on log patterns: count of error logs from service:api in the last 5 minutes > 100. (3) APM monitor — alert on service latency or error rate: p99:trace.request.duration{service:api} > 500ms. (4) Composite monitor — combine multiple monitors with boolean logic: alert when CPU > 80% AND error rate > 5% (both conditions must be true). (5) Anomaly detection — ML-based: alert when a metric deviates from its expected pattern (learned from historical data). No manual threshold needed. Good for: metrics with seasonal patterns (traffic varies by time of day). Notification routing: monitors send notifications to: PagerDuty (pages the on-call engineer for critical alerts), Slack (posts to the team channel for warnings), email, and webhooks (trigger custom automation). Muting: suppress alerts during maintenance windows (scheduled downtime). Downtime schedules can be created via the UI, API, or Terraform. SLO tracking: define SLOs (99.9% availability, P99 < 200ms) based on monitor data. Track error budget consumption over a 30-day rolling window. Alert when the burn rate threatens the SLO. Dashboard widgets show SLO status: remaining error budget, burn rate, and historical compliance.