Problem Statement
Design a telemetry pipeline that collects metrics from thousands of services and infrastructure nodes, applies sampling and aggregation, detects anomalies, stores data efficiently with downsampling for long-term retention, and exposes a query API for dashboards and alerting.
Requirements
Functional Requirements
- Metric collection from services, agents, and infrastructure exporters
- Configurable sampling strategies (full, probabilistic, rate-limited)
- Real-time aggregation (sum, avg, percentiles, cardinality)
- Anomaly detection with configurable sensitivity
- Tiered long-term storage with automatic downsampling
- Query API supporting range queries, label filtering, and aggregation functions
Non-Functional Requirements
- Ingest 10M data points/second
- Query latency <100ms for recent data (last 24 hours)
- Query latency <1s for historical data (up to 2 years)
- 99.9% pipeline availability
High-Level Architecture
Producers (StatsD / Prometheus / OTLP)
|
Collection Gateway
(gRPC / HTTP push, Prometheus pull)
|
Message Broker (Kafka)
|
+-----------+-----------+
| | |
Sampling Aggregation Anomaly
Filter Engine Detector
| | |
+-----+-----+ |
| Alert
Storage Writer Svc
|
Hot (TSDB) -> Warm (Columnar) -> Cold (Object Store)
|
Query Router
(PromQL-compatible / REST)
Component Design
1. Collection Gateway
Supports three ingestion modes:
- Push (gRPC/HTTP): Services send OTLP or StatsD payloads. The gateway validates, normalizes label keys, and batches into Kafka.
- Pull (Prometheus scrape): A scrape scheduler maintains a list of scrape targets from a service discovery registry (Consul / Kubernetes API). It scrapes each target on its configured interval and forwards the result to Kafka.
- Agent forwarding: Infrastructure agents (e.g., Telegraf, Vector) ship metrics over gRPC. TLS mutual authentication is enforced.
Each metric is assigned a metric_id (hash of name + sorted label set) for efficient storage. High-cardinality labels (e.g., request IDs) are rejected at ingestion based on a per-metric cardinality policy.
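The metric_id assignment above can be sketched as a stable hash over the metric name plus its sorted label set. This is a minimal illustration, not the gateway's actual implementation; the function names and the length-based cardinality heuristic are assumptions for the example.

```python
# Sketch: stable 64-bit metric_id from name + sorted label set, plus a
# toy cardinality guard. Names and thresholds here are illustrative.
import hashlib

def metric_id(name: str, labels: dict[str, str]) -> int:
    """Same name + label set always yields the same ID, regardless of
    the order in which labels arrive."""
    canonical = name + "".join(f",{k}={labels[k]}" for k in sorted(labels))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def within_cardinality_policy(labels: dict[str, str], max_value_len: int = 64) -> bool:
    """Crude stand-in for a per-metric cardinality policy: reject labels
    whose values look unbounded (e.g. request IDs). Real policies track
    distinct-value counts per label key."""
    return all(len(v) <= max_value_len for v in labels.values())

a = metric_id("http_requests_total", {"service": "api", "region": "us-east"})
b = metric_id("http_requests_total", {"region": "us-east", "service": "api"})
assert a == b  # label order does not change the ID
```

Sorting the labels before hashing is what makes the ID order-independent, so the same series never splits into two storage keys.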
2. Sampling Strategies
The sampling filter sits between the gateway and the aggregation engine. Sampling policy is configured per metric family:
- None (full): All data points pass through. Used for critical business metrics.
- Probabilistic: Each data point is kept with probability p. Values are scaled by 1/p to preserve statistical accuracy. Suitable for high-frequency, low-stakes metrics.
- Rate-limited: At most N samples per second per label combination. Excess samples are dropped. Used for noisy infrastructure metrics where trends matter more than individual spikes.
- Adaptive: Automatically tightens the sampling rate when pipeline backpressure is detected, and returns to full sampling once the system is healthy.
3. Aggregation Engine
A stateful stream processing job (Apache Flink or a custom micro-batch worker) computes aggregates over tumbling windows.
Window: 1 minute (configurable)
Operations per metric family:
- count, sum, min, max
- mean, stddev
- p50, p95, p99 (T-Digest sketch)
- HyperLogLog for cardinality estimates
Output schema:
{ metric_id, window_start, window_end,
labels: map, aggregates: map }
Raw data points are forwarded to the storage writer in parallel with aggregation so that sub-minute queries remain possible against the hot tier.
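The tumbling-window computation above can be sketched with the simple aggregates only; the T-Digest and HyperLogLog sketches are omitted for brevity, and the tuple-based input format is an assumption for this example, not the engine's actual record schema.

```python
# Minimal tumbling-window aggregator: buckets points into fixed 1-minute
# windows keyed by (metric_id, window_start) and computes count/sum/min/
# max/mean per bucket, mirroring the output schema above.
from collections import defaultdict

WINDOW = 60  # seconds, matching the 1-minute default

def window_start(ts: float) -> int:
    return int(ts) - int(ts) % WINDOW

def aggregate(points):
    """points: iterable of (metric_id, timestamp, value) tuples.
    Returns {(metric_id, window_start): aggregate dict}."""
    acc = defaultdict(lambda: {"count": 0, "sum": 0.0,
                               "min": float("inf"), "max": float("-inf")})
    for metric_id, ts, value in points:
        a = acc[(metric_id, window_start(ts))]
        a["count"] += 1
        a["sum"] += value
        a["min"] = min(a["min"], value)
        a["max"] = max(a["max"], value)
    for a in acc.values():
        a["mean"] = a["sum"] / a["count"]
    return dict(acc)

out = aggregate([("m1", 0, 2.0), ("m1", 30, 4.0), ("m1", 61, 10.0)])
# the point at t=61 falls into the second window, [60, 120)
```

In a real Flink job the bucketing and flushing are handled by the window operator; the point here is only the shape of the per-window state.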
4. Anomaly Detection
The anomaly detector subscribes to the aggregation output topic. It maintains per-metric statistical baselines using a seasonal decomposition model (STL decomposition or Holt-Winters).
Detection approach:
- Compute residual = observed – predicted (from baseline model)
- Score = residual / rolling stddev
- If score > threshold (default 3.0 sigma), emit an anomaly event
- Alert suppression: consecutive anomalies within a 5-minute window are deduplicated
- Models are retrained weekly in batch against the warm tier to capture seasonality changes
The Alert Service receives anomaly events, evaluates routing rules (severity, team ownership), and dispatches to PagerDuty, Slack, or webhook endpoints.
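The residual z-score check above can be illustrated without the seasonal model by using an exponentially weighted mean and variance as the rolling baseline. This is a hedged sketch under that simplification; the class, the warmup guard, and the constants are assumptions for the example, not the detector's actual design.

```python
# Rolling z-score detector: flag a point when |observed - predicted|
# exceeds threshold * rolling stddev. An EW mean stands in for the
# seasonal baseline model; warmup suppresses flags before the baseline
# has seen enough points.
import math

class AnomalyScorer:
    def __init__(self, alpha: float = 0.1, threshold: float = 3.0, warmup: int = 5):
        self.alpha, self.threshold, self.warmup = alpha, threshold, warmup
        self.mean = None
        self.var = 0.0
        self.n = 0

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous vs. the rolling baseline."""
        self.n += 1
        if self.mean is None:
            self.mean = value
            return False
        residual = value - self.mean
        std = math.sqrt(self.var)
        anomalous = (self.n > self.warmup and std > 0
                     and abs(residual) / std > self.threshold)
        # Update the exponentially weighted mean/variance after scoring.
        self.mean += self.alpha * residual
        self.var = (1 - self.alpha) * (self.var + self.alpha * residual ** 2)
        return anomalous

s = AnomalyScorer()
for v in [10.0, 10.2, 9.8, 10.1, 9.9, 10.0]:
    s.observe(v)               # build the baseline, no flags yet
assert s.observe(25.0) is True  # a large spike exceeds 3 sigma
```

Scoring before updating the baseline matters: otherwise a large spike would inflate the variance it is being measured against and partially mask itself.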
5. Storage Tiers and Downsampling
Tier   Store                                Resolution   Retention
----   ----------------------------------   ----------   ---------
Hot    Prometheus TSDB or VictoriaMetrics   15 seconds   3 days
Warm   ClickHouse                           1 minute     90 days
Cold   Parquet on S3                        1 hour       2 years
Downsampling job (runs every hour):
- Reads raw points from the hot tier for the previous hour
- Computes 1-minute aggregates and writes to the warm tier (ClickHouse)
- Marks raw points older than 3 days for eviction from the hot tier
- Nightly, a batch job computes 1-hour aggregates from the warm tier and writes Parquet files to S3, partitioned by date/metric_family/
Parquet files use dictionary encoding for label values and delta encoding for timestamps, achieving 10-20x compression over raw JSON.
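The rollup step of the hourly job can be sketched as a pure function; the hot-tier reads, ClickHouse writes, and eviction marking are out of scope here, and the tuple format is an assumption for illustration.

```python
# One pass of the downsampling rollup: bucket 15-second raw points into
# 1-minute windows and emit (count, sum, min, max) per window for the
# warm tier.
from collections import defaultdict

def rollup_1m(raw_points):
    """raw_points: [(ts_seconds, value)] at raw (15 s) resolution.
    Returns {minute_start: (count, sum, min, max)}."""
    buckets = defaultdict(list)
    for ts, v in raw_points:
        buckets[ts - ts % 60].append(v)
    return {w: (len(vs), sum(vs), min(vs), max(vs))
            for w, vs in buckets.items()}

raw = [(0, 1.0), (15, 3.0), (30, 5.0), (45, 7.0), (60, 2.0)]
rolled = rollup_1m(raw)
# window 0 holds four points; window 60 holds one
```

Keeping count and sum (rather than just the mean) is deliberate: it lets the nightly 1-hour rollup recompute exact means from the 1-minute rollups without touching raw data.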
6. Query Router
Exposes a PromQL-compatible HTTP API. The router inspects the time range and routes sub-queries to the appropriate tier:
- Last 3 days: hot tier (Prometheus/VictoriaMetrics remote read)
- 3–90 days: warm tier (ClickHouse SQL translation layer)
- 90 days+: cold tier (S3 Select or Athena for ad-hoc; pre-aggregated summaries for dashboards)
Cross-tier queries (e.g., last 100 days) are split, executed in parallel, and results are merged with timestamp alignment. The router caches frequent dashboard queries in Redis with a 60-second TTL.
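The tier-selection logic can be sketched from the retention table (3 days hot, 90 days warm); a query spanning boundaries is split into per-tier sub-ranges that the router would then execute in parallel. Function and tier names here are illustrative.

```python
# Split a query time range into per-tier sub-ranges based on data age.
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=3)
WARM_WINDOW = timedelta(days=90)

def route(start: datetime, end: datetime, now: datetime):
    """Return [(tier, sub_start, sub_end)] covering [start, end)."""
    hot_edge, warm_edge = now - HOT_WINDOW, now - WARM_WINDOW
    spans = []
    if start < warm_edge:
        spans.append(("cold", start, min(end, warm_edge)))
    if start < hot_edge and end > warm_edge:
        spans.append(("warm", max(start, warm_edge), min(end, hot_edge)))
    if end > hot_edge:
        spans.append(("hot", max(start, hot_edge), end))
    return spans

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
spans = route(now - timedelta(days=100), now, now)
# a 100-day query touches all three tiers: cold, warm, hot
```

The sub-results come back at different resolutions (1 hour, 1 minute, 15 seconds), which is why the merge step needs timestamp alignment before stitching them into one series.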
API Design
Push Metrics (OTLP)
POST /v1/metrics
Content-Type: application/x-protobuf
Body: OTLP ExportMetricsServiceRequest
Response: 202 Accepted
Range Query
GET /api/v1/query_range
?query=rate(http_requests_total{service="api"}[5m])
&start=2024-01-01T00:00:00Z
&end=2024-01-02T00:00:00Z
&step=60s
Response: Prometheus-compatible JSON matrix
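As a client-side illustration of the range query above, the URL can be assembled with the standard library alone; the host name is a placeholder.

```python
# Build the range-query URL shown above; urlencode handles escaping of
# the PromQL expression.
from urllib.parse import urlencode

params = {
    "query": 'rate(http_requests_total{service="api"}[5m])',
    "start": "2024-01-01T00:00:00Z",
    "end": "2024-01-02T00:00:00Z",
    "step": "60s",
}
url = "https://metrics.example.com/api/v1/query_range?" + urlencode(params)
# A successful response is a Prometheus-compatible JSON matrix:
# {"status": "success", "data": {"resultType": "matrix", "result": [...]}}
```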
Instant Query
GET /api/v1/query
?query=avg by (region) (cpu_usage_percent)
&time=2024-01-01T12:00:00Z
Scaling and Reliability
- Kafka partitioning: Partition by metric_id to co-locate each metric on the same aggregation worker, avoiding cross-partition joins.
- Backpressure: Adaptive sampling automatically reduces ingest volume when Kafka consumer lag exceeds 10 seconds.
- Hot tier replication: VictoriaMetrics cluster mode with 2x replication and vmstorage sharding across nodes.
- Write path HA: Dual-write to two Kafka clusters in different AZs. If one cluster is unavailable, the gateway fails over within 5 seconds.
- Query isolation: Dashboard queries and alerting queries use separate query router pools to prevent slow ad-hoc queries from starving alert evaluation.
Interview Discussion Points
- Why T-Digest is better than exact histogram storage for percentile computation at scale
- Cardinality explosion: how a single high-cardinality label (user_id on a metric) can cause OOM in a TSDB
- The trade-off between downsampling fidelity and storage cost: losing spike visibility at 1-hour resolution
- Pull vs. push for metric collection: staleness, discoverability, and firewall traversal
- How to handle clock skew when merging data points from distributed producers