Low Level Design: Telemetry Pipeline

Problem Statement

Design a telemetry pipeline that collects metrics from thousands of services and infrastructure nodes, applies sampling and aggregation, detects anomalies, stores data efficiently with downsampling for long-term retention, and exposes a query API for dashboards and alerting.

Requirements

Functional Requirements

  • Metric collection from services, agents, and infrastructure exporters
  • Configurable sampling strategies (full, probabilistic, rate-limited)
  • Real-time aggregation (sum, avg, percentiles, cardinality)
  • Anomaly detection with configurable sensitivity
  • Tiered long-term storage with automatic downsampling
  • Query API supporting range queries, label filtering, and aggregation functions

Non-Functional Requirements

  • Ingest 10M data points/second
  • Query latency <100ms for recent data (last 24 hours)
  • Query latency <1s for historical data (up to 2 years)
  • 99.9% pipeline availability

High-Level Architecture

Producers (StatsD / Prometheus / OTLP)
        |
  Collection Gateway
  (gRPC / HTTP push, Prometheus pull)
        |
  Message Broker (Kafka)
        |
  +-----------+-----------+
  |           |           |
Sampling  Aggregation  Anomaly
Filter    Engine       Detector
  |           |           |
  +-----+-----+           |
        |                 |
  Storage Writer    Alert Service
        |
  Hot (TSDB) -> Warm (Columnar) -> Cold (Object Store)
        |
  Query Router
  (PromQL-compatible / REST)

Component Design

1. Collection Gateway

Supports three ingestion modes:

  • Push (gRPC/HTTP): Services send OTLP or StatsD payloads. The gateway validates, normalizes label keys, and batches into Kafka.
  • Pull (Prometheus scrape): A scrape scheduler maintains a list of scrape targets from a service discovery registry (Consul / Kubernetes API). It scrapes each target on its configured interval and forwards the result to Kafka.
  • Agent forwarding: Infrastructure agents (e.g., Telegraf, Vector) ship metrics over gRPC. TLS mutual authentication is enforced.

Each metric is assigned a metric_id (hash of name + sorted label set) for efficient storage. High-cardinality labels (e.g., request IDs) are rejected at ingestion based on a per-metric cardinality policy.
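As a sketch, the metric_id derivation described above could look like the following (SHA-1 and the 16-character truncation are illustrative choices, not mandated by the design; any stable hash works):

```python
import hashlib

def metric_id(name: str, labels: dict) -> str:
    """Deterministic series ID: hash of metric name plus the sorted
    label set, so label insertion order never changes the ID."""
    canonical = name + "|" + "|".join(f"{k}={labels[k]}" for k in sorted(labels))
    return hashlib.sha1(canonical.encode()).hexdigest()[:16]

# Same labels in any order yield the same ID.
a = metric_id("http_requests_total", {"service": "api", "region": "us-east"})
b = metric_id("http_requests_total", {"region": "us-east", "service": "api"})
assert a == b
```

Sorting the labels before hashing is the key detail: without it, the same logical series would fan out into multiple IDs depending on producer serialization order.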

2. Sampling Strategies

The sampling filter sits between the gateway and the aggregation engine. Sampling policy is configured per metric family:

  • None (full): All data points pass through. Used for critical business metrics.
  • Probabilistic: Each data point is kept with probability p. Values are scaled by 1/p to preserve statistical accuracy. Suitable for high-frequency, low-stakes metrics.
  • Rate-limited: At most N samples per second per label combination. Excess samples are dropped. Used for noisy infrastructure metrics where trends matter more than individual spikes.
  • Adaptive: Automatically tightens sampling rate when pipeline backpressure is detected. Falls back to full sampling when the system is healthy.
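The probabilistic strategy above can be sketched in a few lines; the 1/p scaling is what keeps downstream sums and rates unbiased. The injectable `rng` parameter is purely for testability:

```python
import random

def probabilistic_sample(points, p, rng=random.random):
    """Keep each (timestamp, value) point with probability p and scale
    kept values by 1/p, so sums over the sampled stream are unbiased
    estimates of the true totals."""
    return [(ts, value / p) for ts, value in points if rng() < p]

# With p=1.0 every point passes through unchanged (full sampling).
points = [(i, 2.0) for i in range(10)]
assert probabilistic_sample(points, 1.0) == points
```

Note that per-point scaling preserves expected sums but inflates variance; that is why the design reserves this strategy for high-frequency, low-stakes metrics.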

3. Aggregation Engine

A stateful stream processing job (Apache Flink or a custom micro-batch worker) computes aggregates over tumbling windows.

Window: 1 minute (configurable)
Operations per metric family:
  - count, sum, min, max
  - mean, stddev
  - p50, p95, p99 (T-Digest sketch)
  - HyperLogLog for cardinality estimates
Output schema:
  { metric_id, window_start, window_end,
    labels: map, aggregates: map }

Raw data points are forwarded to the storage writer in parallel with aggregation so that sub-minute queries remain possible against the hot tier.
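A minimal in-memory sketch of the tumbling-window aggregation, omitting the T-Digest and HyperLogLog sketches for brevity (a production job would run these in Flink with checkpointed state):

```python
from collections import defaultdict

WINDOW_SECONDS = 60

def aggregate(points):
    """Group (metric_id, ts, value) points into 1-minute tumbling
    windows and compute basic aggregates per (metric_id, window)."""
    windows = defaultdict(list)
    for mid, ts, value in points:
        window_start = ts - ts % WINDOW_SECONDS  # floor to window boundary
        windows[(mid, window_start)].append(value)
    return [
        {
            "metric_id": mid,
            "window_start": ws,
            "window_end": ws + WINDOW_SECONDS,
            "aggregates": {
                "count": len(vals), "sum": sum(vals),
                "min": min(vals), "max": max(vals),
                "mean": sum(vals) / len(vals),
            },
        }
        for (mid, ws), vals in sorted(windows.items())
    ]
```

The floor-to-boundary assignment is what makes the windows tumbling (non-overlapping) rather than sliding.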

4. Anomaly Detection

The anomaly detector subscribes to the aggregation output topic. It maintains per-metric statistical baselines using a seasonal decomposition model (STL decomposition or Holt-Winters).

Detection approach:

  • Compute residual = observed - predicted (from baseline model)
  • Score = residual / rolling stddev
  • If score > threshold (default 3.0 sigma), emit an anomaly event
  • Alert suppression: consecutive anomalies within a 5-minute window are deduplicated
  • Models are retrained weekly in batch against the warm tier to capture seasonality changes
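The scoring steps above can be sketched as follows. As a simplification, a rolling mean stands in for the seasonal baseline model; the real detector would use STL or Holt-Winters predictions in its place:

```python
from collections import deque
import statistics

class AnomalyScorer:
    """Score each observation as (observed - predicted) / rolling stddev.
    A rolling mean approximates `predicted` here; in the pipeline this
    would come from the seasonal decomposition model."""

    def __init__(self, window=30, threshold=3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def score(self, observed):
        """Return (score, is_anomaly) and fold the point into history."""
        if len(self.history) < 2:
            self.history.append(observed)
            return 0.0, False
        predicted = statistics.fmean(self.history)
        stddev = statistics.stdev(self.history) or 1e-9  # guard flat series
        self.history.append(observed)
        s = (observed - predicted) / stddev
        return s, abs(s) > self.threshold
```

A flat baseline followed by a spike trips the 3-sigma threshold; once the spike itself enters the history, the inflated stddev suppresses immediate re-firing, which complements the 5-minute deduplication window.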

The Alert Service receives anomaly events, evaluates routing rules (severity, team ownership), and dispatches to PagerDuty, Slack, or webhook endpoints.

5. Storage Tiers and Downsampling

Tier      Store                               Resolution   Retention
-------   ---------------------------------   ----------   ---------
Hot       Prometheus TSDB / VictoriaMetrics   15 seconds   3 days
Warm      ClickHouse                          1 minute     90 days
Cold      Parquet on S3                       1 hour       2 years

Downsampling job (runs every hour):

  1. Reads raw points from the hot tier for the previous hour
  2. Computes 1-minute aggregates and writes to the warm tier (ClickHouse)
  3. Marks raw points older than 3 days for eviction from the hot tier
  4. Nightly, a batch job computes 1-hour aggregates from warm tier and writes Parquet files to S3, partitioned by date/metric_family/

Parquet files use dictionary encoding for label values and delta encoding for timestamps, achieving 10-20x compression over raw JSON.
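To illustrate the delta encoding of timestamps mentioned above, a minimal encode/decode pair (Parquet's actual DELTA_BINARY_PACKED encoding additionally bit-packs the deltas, but the principle is the same):

```python
def delta_encode(timestamps):
    """Store the first timestamp plus successive differences; regular
    scrape intervals produce tiny, highly compressible deltas."""
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

def delta_decode(deltas):
    """Invert delta_encode by running a cumulative sum."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out
```

For a 15-second scrape interval, the column collapses to one absolute value plus a run of near-constant small integers, which is where most of the compression win comes from.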

6. Query Router

Exposes a PromQL-compatible HTTP API. The router inspects the time range and routes sub-queries to the appropriate tier:

  • Last 3 days: hot tier (Prometheus/VictoriaMetrics remote read)
  • 3–90 days: warm tier (ClickHouse SQL translation layer)
  • 90 days+: cold tier (S3 Select or Athena for ad-hoc; pre-aggregated summaries for dashboards)

Cross-tier queries (e.g., last 100 days) are split, executed in parallel, and results are merged with timestamp alignment. The router caches frequent dashboard queries in Redis with a 60-second TTL.
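A sketch of the tier-splitting logic, assuming the hot/warm boundaries from the storage table are measured relative to query time:

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS, WARM_DAYS = 3, 90  # boundaries from the storage tier table

def split_by_tier(start, end, now=None):
    """Split a [start, end) query range into per-tier sub-ranges that
    can be executed in parallel and merged on timestamp."""
    now = now or datetime.now(timezone.utc)
    hot_floor = now - timedelta(days=HOT_DAYS)
    warm_floor = now - timedelta(days=WARM_DAYS)
    pieces = []
    for tier, lo, hi in (("cold", None, warm_floor),
                         ("warm", warm_floor, hot_floor),
                         ("hot", hot_floor, None)):
        s = start if lo is None else max(start, lo)
        e = end if hi is None else min(end, hi)
        if s < e:  # tier overlaps the requested range
            pieces.append((tier, s, e))
    return pieces
```

A 100-day range yields three sub-queries (cold, warm, hot) whose results the router merges with timestamp alignment; a 1-day range touches only the hot tier.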

API Design

Push Metrics (OTLP)

POST /v1/metrics
Content-Type: application/x-protobuf
Body: OTLP ExportMetricsServiceRequest
Response: 202 Accepted

Range Query

GET /api/v1/query_range
  ?query=rate(http_requests_total{service="api"}[5m])
  &start=2024-01-01T00:00:00Z
  &end=2024-01-02T00:00:00Z
  &step=60s
Response: Prometheus-compatible JSON matrix

Instant Query

GET /api/v1/query
  ?query=avg by (region) (cpu_usage_percent)
  &time=2024-01-01T12:00:00Z

Scaling and Reliability

  • Kafka partitioning: Partition by metric_id to co-locate a metric on the same aggregation worker, avoiding cross-partition joins.
  • Backpressure: Adaptive sampling automatically reduces ingest volume when Kafka consumer lag exceeds 10 seconds.
  • Hot tier replication: VictoriaMetrics cluster mode with 2x replication and vmstorage sharding across nodes.
  • Write path HA: Dual-write to two Kafka clusters in different AZs. If one cluster is unavailable, the gateway fails over within 5 seconds.
  • Query isolation: Dashboard queries and alerting queries use separate query router pools to prevent slow ad-hoc queries from starving alert evaluation.
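The metric_id partitioning rule can be sketched as a stable hash partitioner. MD5 is an illustrative choice here; the Java Kafka client's default partitioner uses murmur2, and any stable (non-process-seeded) hash gives the same co-location property:

```python
import hashlib

def partition_for(metric_id: str, num_partitions: int) -> int:
    """Deterministically map a metric_id to a Kafka partition so every
    point for a series lands on the same aggregation worker. A stable
    hash (not Python's seeded hash()) keeps the mapping consistent
    across gateway restarts."""
    digest = hashlib.md5(metric_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Because the aggregation engine's state is keyed by series, this mapping is what lets each worker hold its windows locally and avoid cross-partition joins.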

Interview Discussion Points

  • Why T-Digest is better than exact histogram storage for percentile computation at scale
  • Cardinality explosion: how a single high-cardinality label (user_id on a metric) can cause OOM in a TSDB
  • The trade-off between downsampling fidelity and storage cost: losing spike visibility at 1-hour resolution
  • Pull vs. push for metric collection: staleness, discoverability, and firewall traversal
  • How to handle clock skew when merging data points from distributed producers

Frequently Asked Questions

What is a telemetry pipeline and how does it differ from a logging pipeline?

A telemetry pipeline ingests, processes, and routes numeric metrics and traces emitted by services and devices (e.g., CPU usage, request latency, error rate). A logging pipeline handles unstructured or semi-structured text events (e.g., application log lines). Key differences: telemetry data is numeric and time-series oriented, enabling aggregation, downsampling, and threshold alerting; logs are text blobs suited to keyword search and debugging. Telemetry pipelines typically use columnar or time-series storage (e.g., Prometheus, InfluxDB, Cortex), while logging pipelines use inverted-index stores (e.g., Elasticsearch, Loki).

How does metric sampling work in a telemetry pipeline?

Sampling reduces data volume by recording only a fraction of incoming data points. Head-based sampling selects a data point at ingestion time with probability p (e.g., 1%), independent of its value. Tail-based sampling buffers a window of points and selects based on outcome (e.g., always keep error samples, sample successful ones at 1%). For gauge and counter metrics, reservoir sampling maintains a fixed-size representative sample from an unbounded stream. The sample rate is stored alongside each point so downstream aggregations can extrapolate totals correctly.

How does time-series data get downsampled for long-term storage?

Raw high-resolution data (e.g., 10-second intervals) is retained for a short period (e.g., 7 days). A downsampling job periodically aggregates raw points into coarser resolutions (1-minute, 1-hour, 1-day) by computing min, max, average, sum, and count within each window, then writes the rollup to a separate retention tier. Older raw data is deleted after the rollup is confirmed. Multi-resolution storage lets recent queries use high-resolution data and historical queries use rollups without scanning raw data. This pattern appears as recording rules in Prometheus and continuous aggregates in TimescaleDB.

How does anomaly detection work in a telemetry pipeline?

Anomaly detection runs as a stream processing stage or a periodic batch job over incoming metrics. Statistical methods include z-scores (flag points more than N standard deviations from a rolling mean), IQR-based outlier detection, and seasonal decomposition (STL) to separate trend, seasonality, and residual before thresholding the residual. ML-based approaches train isolation forests or autoencoders on historical baselines. Detected anomalies are emitted as alert events to a notification service. Sensitivity tuning (threshold, minimum duration, cooldown period) reduces false-positive alert fatigue.

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture