Requirements and Constraints
An IoT data pipeline ingests telemetry from potentially millions of connected devices, processes and enriches the stream, stores time-series data efficiently, and makes it available for dashboards and alerting.
Functional requirements:
- Support MQTT and CoAP device protocols
- Validate and decode binary or JSON payloads
- Apply stream processing (filtering, enrichment, aggregation)
- Downsample raw data for long-term retention
- Store metrics in a time-series database
Non-functional requirements:
- Ingest 500,000 messages per second at peak
- End-to-end latency under 2 seconds for alerting paths
- Store 2 years of data per device
- Tolerate individual device floods without impacting others
Key constraints: devices often send duplicate messages (reconnect bursts), payload formats vary by device firmware version, and network conditions are unreliable, so out-of-order delivery is common.
Core Data Model
- devices(device_id PK, tenant_id, device_type, firmware_version, payload_schema_id FK, registered_at, last_seen_at)
- payload_schemas(schema_id PK, device_type, firmware_version, format ENUM('json','protobuf','cbor'), schema_definition JSONB)
- raw_metrics — time-series table: (device_id, metric_name, value DOUBLE, tags JSONB, ts TIMESTAMPTZ) — partitioned by ts, primary key (device_id, metric_name, ts)
- downsampled_metrics_1h(device_id, metric_name, bucket TIMESTAMPTZ, min_val, max_val, avg_val, count) — materialized hourly rollup
- alert_rules(rule_id PK, tenant_id, metric_name, condition_expr, threshold, window_seconds, severity)
- pipeline_offsets(consumer_group, partition_id, committed_offset) — Kafka consumer state
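The canonical record that flows between pipeline stages mirrors one raw_metrics row. A minimal Python sketch, assuming an in-memory representation for the stream processors (the MetricPoint class itself is illustrative, not part of the schema above):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class MetricPoint:
    """Canonical decoded metric, mirroring one raw_metrics row.
    Illustrative class; field names follow the table definition."""
    device_id: str
    metric_name: str
    value: float
    ts: datetime                               # event time from the device payload
    tags: dict = field(default_factory=dict)   # enrichment tags (tenant, location, ...)

# Example: a single temperature reading, tags attached later by enrichment.
point = MetricPoint("dev-42", "temperature", 21.5,
                    datetime(2024, 1, 1, tzinfo=timezone.utc))
```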
Ingestion Layer
MQTT devices connect to a cluster of MQTT brokers (Mosquitto or EMQX). Each broker authenticates devices via mutual TLS with device certificates validated against the device registry. Incoming messages are published to Kafka topics partitioned by device_id, ensuring ordered delivery per device and distributing load across partitions. CoAP devices send UDP datagrams to a CoAP gateway that translates to the same Kafka topic format.
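The partitioning step reduces to a stable hash of device_id. The sketch below is an illustration, not Kafka's actual partitioner (which applies murmur2 to the key bytes); MD5 is used here only to keep the example dependency-free. The property that matters is that every message from one device lands on one partition, preserving per-device order:

```python
import hashlib

def partition_for(device_id: str, num_partitions: int) -> int:
    """Stable device_id -> partition mapping (illustrative sketch).
    All messages from a given device map to the same partition,
    so per-device ordering is preserved within Kafka."""
    digest = hashlib.md5(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```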
A rate limiter per device_id runs in the broker layer. If a device sends more than a configurable threshold (e.g., 100 messages/second), excess messages are dropped and a rate-limit event is recorded. This prevents a malfunctioning device from consuming disproportionate pipeline capacity.
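A per-device token bucket is one way to implement such a limiter. The sketch below (class name and parameters are hypothetical) admits a sustained rate with a small burst allowance and rejects the excess:

```python
import time

class TokenBucket:
    """Per-device rate limiter sketch: allow `rate` messages/second
    with a burst allowance of `burst`. One bucket per device_id."""
    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # excess message: drop and record a rate-limit event
```

In the broker layer one bucket would be kept per connected device_id; a denied `allow()` maps to the dropped message plus the recorded rate-limit event described above.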
Stream Processing
Stream processors (Apache Flink or Kafka Streams) consume from the raw ingestion topic. Processing steps in the pipeline:
- Deserialization and schema resolution: Look up payload_schema by device_id and firmware_version (cached in memory, refreshed on schema change events). Decode the binary or JSON payload into a canonical metric struct.
- Deduplication: Maintain a sliding-window Bloom filter keyed by (device_id, message_sequence_number) to drop duplicates within a 60-second window. The false-positive rate is acceptable: an occasionally dropped non-duplicate costs little compared to the state and lookup overhead of exact deduplication.
- Enrichment: Join against a broadcast device metadata stream (device type, tenant, location tags) to attach tags to each metric point without a per-message database call.
- Alert evaluation: Tumbling and sliding window aggregations compute metric averages and counts over alert rule windows. Alerts are evaluated in-stream and emitted to an alert topic consumed by a notification service.
- Write fan-out: Processed metrics are written to the time-series store and simultaneously forwarded to a hot-path topic for real-time dashboard subscriptions.
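The deduplication step above can be sketched as a pair of Bloom filters rotated on the window boundary, so a key is remembered for at least one full window and at most two. Bit count, hash count, and the int-bitmap representation are illustrative, not tuned:

```python
import hashlib
import time

class RotatingBloomDeduper:
    """Sliding-window dedup sketch: two Bloom filters, rotated every
    `window_seconds`. A key is remembered for one to two windows."""
    def __init__(self, window_seconds: int = 60,
                 num_bits: int = 1 << 20, num_hashes: int = 4):
        self.window = window_seconds
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.current = 0    # filters stored as Python int bitmaps
        self.previous = 0
        self.rotated_at = time.monotonic()

    def _bit_positions(self, key: bytes):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(i.to_bytes(2, "big") + key).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def seen(self, device_id: str, seq: int) -> bool:
        """Return True if (device_id, seq) was probably seen this window."""
        now = time.monotonic()
        if now - self.rotated_at >= self.window:
            # Rotate: current becomes previous, start a fresh filter.
            self.previous, self.current = self.current, 0
            self.rotated_at = now
        key = f"{device_id}:{seq}".encode()
        bits = list(self._bit_positions(key))
        present = all((self.current >> b) & 1 or (self.previous >> b) & 1
                      for b in bits)
        for b in bits:
            self.current |= 1 << b
        return present
```

In Flink or Kafka Streams the equivalent state would live in keyed operator state rather than a plain object, but the rotation logic is the same.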
Downsampling Strategy
Raw metrics are retained at full resolution for 7 days. Beyond that, data is downsampled to reduce storage. A downsampling job runs hourly and computes min, max, avg, and count over each one-hour bucket per (device_id, metric_name), writing to downsampled_metrics_1h. A daily job further aggregates to a 24-hour bucket table for multi-year retention. Queries for time ranges beyond 7 days are automatically redirected to the appropriate rollup table by a query routing layer that inspects the requested time range.
For time-series databases like TimescaleDB or InfluxDB, continuous aggregates or downsampling tasks handle this natively. For a custom implementation on PostgreSQL, a pg_cron job executes the rollup INSERT … SELECT with conflict handling for idempotency.
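The hourly rollup itself is a plain group-by over one-hour buckets. A pure-Python sketch of what the rollup job computes (tuple layout and function name are assumptions):

```python
from collections import defaultdict
from datetime import datetime, timezone

def hourly_rollup(points):
    """Aggregate (device_id, metric_name, ts, value) tuples into the
    downsampled_metrics_1h shape: min/max/avg/count per one-hour bucket."""
    acc = defaultdict(list)
    for device_id, metric, ts, value in points:
        # Truncate the timestamp to its one-hour bucket boundary.
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        acc[(device_id, metric, bucket)].append(value)
    return {
        key: {"min_val": min(vs), "max_val": max(vs),
              "avg_val": sum(vs) / len(vs), "count": len(vs)}
        for key, vs in acc.items()
    }

# Three readings: two fall in the 10:00 bucket, one in the 11:00 bucket.
rows = hourly_rollup([
    ("d1", "temp", datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc), 20.0),
    ("d1", "temp", datetime(2024, 1, 1, 10, 50, tzinfo=timezone.utc), 22.0),
    ("d1", "temp", datetime(2024, 1, 1, 11, 10, tzinfo=timezone.utc), 21.0),
])
```

The SQL rollup adds ON CONFLICT handling so reruns are idempotent; the aggregation itself is exactly this group-by.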
Scalability Considerations
- Kafka partitioning: Partition count should be 3-5x the expected peak consumer parallelism. Use device_id as partition key for ordering guarantees per device.
- Time-series write throughput: Batch writes in 500ms windows; a single TimescaleDB chunk insert is far more efficient than row-by-row inserts. Use COPY or batch prepared statements.
- Multi-tenancy isolation: Assign per-tenant topics (or Kafka produce/fetch quotas) so one tenant's burst cannot degrade stream-processing throughput for others; consumer groups alone do not bound a noisy tenant sharing a topic.
- Backpressure: If the time-series store write latency spikes, stream processors apply backpressure by pausing Kafka consumption rather than accumulating unbounded in-memory buffers.
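The batching and backpressure points combine naturally in one writer: buffer points, flush on either a size or time threshold, and expose a pause signal the consumer checks before polling more. A sketch with illustrative thresholds:

```python
import time

class BatchingWriter:
    """Micro-batching sketch: flush when either the batch size or the
    500 ms window is reached. `should_pause()` tells the Kafka consumer
    to stop polling once the buffer exceeds a bound, instead of
    accumulating unbounded in-memory state."""
    def __init__(self, flush_fn, max_batch=5000, max_wait_s=0.5, pause_at=20000):
        self.flush_fn = flush_fn          # e.g. one COPY / batched INSERT
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.pause_at = pause_at
        self.buffer = []
        self.window_start = time.monotonic()

    def add(self, point):
        self.buffer.append(point)
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.window_start >= self.max_wait_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
        self.window_start = time.monotonic()

    def should_pause(self) -> bool:
        return len(self.buffer) >= self.pause_at
```

A real deployment would also flush on a timer so a quiet partition does not hold its last batch indefinitely; that detail is omitted here.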
API Design
- GET /devices/{id}/metrics?metric=temperature&from=&to=&resolution=raw|1h|1d — time-series query with automatic rollup routing
- GET /devices/{id}/metrics/latest — most recent value per metric name for live dashboard widgets
- POST /alert-rules — create an alert rule with metric, condition, threshold, and window
- GET /tenants/{id}/metrics/aggregate — fleet-level aggregation across all devices in a tenant
- GET /pipeline/lag — internal: Kafka consumer group lag per topic/partition for SRE monitoring
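The automatic rollup routing behind the resolution parameter reduces to a range check against retention boundaries. A hedged sketch: table names follow the data model except downsampled_metrics_1d, which is assumed here for the daily rollup, and the 90-day threshold is illustrative:

```python
from datetime import datetime, timedelta, timezone

RAW_RETENTION = timedelta(days=7)  # raw resolution kept for 7 days

def route_resolution(frm: datetime, to: datetime, requested: str = "auto") -> str:
    """Pick the backing table for a metrics query. Explicit resolutions
    are honored when the data still exists; 'auto' (and 'raw' beyond
    retention) routes by how far back the range starts."""
    now = datetime.now(timezone.utc)
    if requested == "raw" and now - frm <= RAW_RETENTION:
        return "raw_metrics"
    if requested == "1h":
        return "downsampled_metrics_1h"
    if requested == "1d":
        return "downsampled_metrics_1d"
    # Auto routing: the oldest point in the range decides the table.
    if now - frm <= RAW_RETENTION:
        return "raw_metrics"
    if now - frm <= timedelta(days=90):
        return "downsampled_metrics_1h"
    return "downsampled_metrics_1d"
```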
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How do you ingest MQTT and CoAP telemetry at IoT scale?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A horizontally scaled broker cluster (e.g., EMQX for MQTT, Californium for CoAP) terminates device connections and authenticates devices via mutual TLS or pre-shared keys. Messages are immediately forwarded to a Kafka topic partitioned by device ID, decoupling ingestion throughput from processing latency. Back-pressure is managed at the broker with per-client rate limits and topic-level quotas so a single noisy device cannot starve the pipeline. Connection state is tracked in a distributed registry to support last-will messages and reconnection logic."
      }
    },
    {
      "@type": "Question",
      "name": "How does Flink handle out-of-order IoT events using watermarks?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Flink assigns event-time timestamps extracted from the device payload and generates bounded-out-of-orderness watermarks (e.g., maxOutOfOrderness = 30 seconds) per Kafka partition. Window operators (tumbling or sliding) trigger only when the watermark advances past the window end, allowing late-arriving messages to be included up to the configured lateness bound. Events that arrive after the allowed lateness are routed to a side output for reprocessing or alerting. Watermark alignment across partitions prevents one slow partition from stalling the entire pipeline."
      }
    },
    {
      "@type": "Question",
      "name": "How does TimescaleDB downsampling policy reduce IoT storage costs?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "TimescaleDB continuous aggregates define materialized views that pre-compute minute, hour, and day rollups (avg, min, max, last) over raw hypertable chunks. A retention policy drops raw data after a configured interval (e.g., 7 days) while preserving the aggregates indefinitely. Compression policies further reduce chunk size on-disk using columnar compression with delta-of-delta encoding for timestamps and XOR encoding for float values. This tiered approach typically achieves 10-50x storage reduction compared to retaining raw samples."
      }
    },
    {
      "@type": "Question",
      "name": "How is a Bloom filter used for deduplication in an IoT pipeline?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Devices may retransmit messages due to QoS-1/2 semantics or network instability. A Bloom filter keyed on (device_id, message_id) is held in shared memory (or Redis) at each pipeline stage. Before writing a message to the sink, the filter is checked; if the key is present, the message is discarded as a probable duplicate. False positives cause occasional missed writes but are bounded by the filter's error rate (typically configured at 0.1%). The filter is partitioned by time window and rotated on a schedule aligned to the maximum expected redelivery window to prevent unbounded growth."
      }
    }
  ]
}