Problem Statement
Design a scalable IoT data ingestion service that handles millions of connected devices sending sensor readings, events, and telemetry data. The system must support heterogeneous device types, unreliable network conditions, and high-throughput writes.
Requirements
Functional Requirements
- Device registration and identity management
- Support MQTT and HTTP/HTTPS ingestion protocols
- Schema validation and payload normalization
- Time-series storage for sensor readings
- Edge buffering for offline or intermittently connected devices
- At-least-once delivery guarantees
- Fan-out to downstream consumers (analytics, alerting)
Non-Functional Requirements
- Ingest 1M+ messages/second at peak
- End-to-end latency under 500ms (p99)
- 99.99% availability
- Data retention configurable per device class
High-Level Architecture
The system consists of five layers: Device Layer, Protocol Gateway, Ingestion Pipeline, Storage Layer, and Consumer Layer.
Devices (MQTT / HTTP)
|
Protocol Gateway
(MQTT Broker / HTTP API)
|
Message Broker (Kafka)
|
+-----------+----------+
| | |
Schema Transform Dead-Letter
Validator & Enrich Queue
|
Time-Series DB + Object Store
|
Consumers (Analytics / Alerts)
Component Design
1. Device Registry
Each device has a unique identity stored in a relational database.
CREATE TABLE devices (
device_id UUID PRIMARY KEY,
tenant_id UUID NOT NULL,
device_class VARCHAR(64),
auth_token VARCHAR(256),
schema_id UUID,
status ENUM('active','inactive','quarantined'),
registered_at TIMESTAMP,
last_seen_at TIMESTAMP
);
On connection, the gateway performs a token lookup. Results are cached in Redis with a short TTL to reduce DB load.
2. Protocol Gateway
MQTT path: An MQTT broker cluster (e.g., EMQX or HiveMQ) authenticates devices via client certificates or shared tokens. Each broker node handles ~100K concurrent connections. Topics follow the pattern tenant/{tenant_id}/device/{device_id}/data. Broker plugins publish messages to Kafka.
HTTP path: A stateless API gateway accepts POST requests to /v1/ingest/{device_id}. It validates the bearer token, batches small payloads, and writes to Kafka. Supports gzip and MessagePack in addition to JSON.
3. Ingestion Pipeline
Kafka topics are partitioned by device_id to preserve per-device ordering. Three consumer groups operate in parallel:
- Schema Validator: Fetches the device schema from a schema registry (Apache Avro / Confluent Schema Registry). Invalid messages are routed to a dead-letter topic with error metadata.
- Transform & Enrich: Normalizes units, fills default fields, attaches tenant metadata, and emits to a
normalizedtopic. - Router: Fans out to time-series storage writers and downstream event consumers.
4. Time-Series Storage
Primary store: Apache Cassandra or ClickHouse for high write throughput.
CREATE TABLE readings ( tenant_id UUID, device_id UUID, metric_name TEXT, ts TIMESTAMP, value DOUBLE, tags MAP<TEXT, TEXT>, PRIMARY KEY ((tenant_id, device_id, metric_name), ts) ) WITH CLUSTERING ORDER BY (ts DESC) AND default_time_to_live = 2592000;
Hot data (last 7 days) lives in the time-series DB. Warm data (7–90 days) is compacted and moved to columnar storage (Parquet on S3). Cold data beyond 90 days is archived as compressed blobs.
5. Edge Buffering for Offline Devices
Devices run a local edge agent that:
- Buffers messages to local flash storage using a circular log
- Applies local schema validation and drop rules to avoid buffer overflow
- Reconnects with exponential backoff (jitter added)
- Replays buffered messages with original timestamps on reconnect
- Marks messages with a
buffered=trueflag so the pipeline can skip real-time alerting for stale data
The gateway deduplicates replayed messages using a Bloom filter keyed on device_id + sequence_number with a 24-hour window.
6. Schema Registry
Schemas are versioned and stored centrally. Devices reference a schema_id in their registry record. The validator service caches schemas in-process for 60 seconds. Schema evolution follows backward-compatible rules: adding optional fields is allowed; removing or renaming fields requires a new major version.
API Design
Device Registration
POST /v1/devices
{
"device_class": "temperature_sensor",
"schema_id": "uuid",
"metadata": { "location": "warehouse-A" }
}
Response: { "device_id": "uuid", "auth_token": "..." }
HTTP Ingestion
POST /v1/ingest/{device_id}
Authorization: Bearer {auth_token}
{
"ts": 1700000000000,
"metrics": { "temperature": 23.4, "humidity": 61.2 },
"seq": 10042
}
Response: 202 Accepted
Query API
GET /v1/devices/{device_id}/metrics?metric=temperature&from=ISO8601&to=ISO8601&resolution=1m
Scaling and Reliability
- Gateway: Horizontal scaling behind a Layer-4 load balancer. MQTT brokers form a cluster with shared session state.
- Kafka: 3x replication factor, min-insync-replicas=2. Partition count sized to peak throughput / consumer throughput.
- Storage writes: Batched writes (500ms windows) to reduce Cassandra write amplification.
- Backpressure: If Kafka consumer lag exceeds threshold, the gateway applies token-bucket rate limiting per tenant.
- Multi-region: Kafka MirrorMaker replicates to a secondary region for DR. Time-series DB uses cross-DC replication with LOCAL_QUORUM reads.
Interview Discussion Points
- How would you handle a noisy-neighbor tenant consuming disproportionate Kafka partitions?
- MQTT QoS levels (0/1/2) and their trade-offs for IoT workloads
- Why time-series DBs beat general-purpose DBs for this workload (write patterns, TTL, compression)
- Deduplication window sizing: too short misses replays, too long wastes memory
- Schema evolution without downtime: deploy consumers before producers when adding fields
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture