What is an IoT data ingestion service and what protocols does it support?

An IoT data ingestion service receives high-volume telemetry streams from physical devices and routes them for storage and processing. It typically supports MQTT (lightweight publish-subscribe over TCP, ideal for constrained devices), HTTP/HTTPS (REST endpoints for devices with more resources), CoAP (UDP-based protocol for constrained networks), and AMQP. A protocol gateway normalizes all incoming messages into a canonical internal format before forwarding to downstream queues or stream processors.

How does schema validation work for IoT device data?

Each device type is registered with a schema (e.g. JSON Schema or Avro) describing expected fields, types, and value ranges. When a message arrives, the ingestion service looks up the device's registered schema by device type or topic, then validates the payload against it. Invalid messages are rejected with an error code, routed to a dead-letter queue for investigation, and an alert may be raised if the error rate for a device exceeds a threshold. Valid messages are forwarded to the processing pipeline. Schema versioning allows devices to upgrade their payload format without breaking existing consumers.

How does edge buffering handle device data when connectivity is lost?

An edge agent running on a gateway device buffers messages in local persistent storage (e.g. an embedded queue or SQLite) when the upstream connection to the cloud ingestion service is unavailable. Messages are timestamped at capture time so the original event time is preserved. When connectivity is restored, the edge agent replays buffered messages in order, typically with a configurable max-buffer-size and retention period to bound storage usage. The ingestion service handles out-of-order and late-arriving messages by using event time rather than ingestion time for partitioning and windowing.

How does an IoT ingestion service scale to millions of concurrent device connections?

The connection layer is scaled horizontally with stateless broker nodes behind a load balancer; each node handles tens of thousands of persistent MQTT or WebSocket connections using async I/O (e.g. event-loop architecture). A distributed registry (e.g. Redis) maps device IDs to broker node IDs so messages can be routed across the cluster. Incoming messages are written to a partitioned message queue (Kafka or Kinesis) keyed by device ID, decoupling ingestion throughput from downstream processing speed. Auto-scaling policies add broker and consumer nodes based on connection count and queue lag metrics.

Low Level Design: IoT Data Ingestion Service

⏱ 6 min read

Problem Statement

Design a scalable IoT data ingestion service that handles millions of connected devices sending sensor readings, events, and telemetry data. The system must support heterogeneous device types, unreliable network conditions, and high-throughput writes.

Requirements

Functional Requirements

Device registration and identity management
Support MQTT and HTTP/HTTPS ingestion protocols
Schema validation and payload normalization
Time-series storage for sensor readings
Edge buffering for offline or intermittently connected devices
At-least-once delivery guarantees
Fan-out to downstream consumers (analytics, alerting)

Non-Functional Requirements

Ingest 1M+ messages/second at peak
End-to-end latency under 500ms (p99)
99.99% availability
Data retention configurable per device class

High-Level Architecture

The system consists of five layers: Device Layer, Protocol Gateway, Ingestion Pipeline, Storage Layer, and Consumer Layer.

Devices (MQTT / HTTP)
        |
  Protocol Gateway
  (MQTT Broker / HTTP API)
        |
  Message Broker (Kafka)
        |
  +-----------+----------+
  |           |          |
Schema    Transform   Dead-Letter
Validator  & Enrich    Queue
  |
Time-Series DB + Object Store
        |
  Consumers (Analytics / Alerts)

Component Design

1. Device Registry

Each device has a unique identity stored in a relational database.

CREATE TABLE devices (
  device_id     UUID PRIMARY KEY,
  tenant_id     UUID NOT NULL,
  device_class  VARCHAR(64),
  auth_token    VARCHAR(256),
  schema_id     UUID,
  status        ENUM('active','inactive','quarantined'),
  registered_at TIMESTAMP,
  last_seen_at  TIMESTAMP
);

On connection, the gateway performs a token lookup. Results are cached in Redis with a short TTL to reduce DB load.

2. Protocol Gateway

MQTT path: An MQTT broker cluster (e.g., EMQX or HiveMQ) authenticates devices via client certificates or shared tokens. Each broker node handles ~100K concurrent connections. Topics follow the pattern tenant/{tenant_id}/device/{device_id}/data. Broker plugins publish messages to Kafka.

HTTP path: A stateless API gateway accepts POST requests to /v1/ingest/{device_id}. It validates the bearer token, batches small payloads, and writes to Kafka. Supports gzip and MessagePack in addition to JSON.

3. Ingestion Pipeline

Kafka topics are partitioned by device_id to preserve per-device ordering. Three consumer groups operate in parallel:

Schema Validator: Fetches the device schema from a schema registry (Apache Avro / Confluent Schema Registry). Invalid messages are routed to a dead-letter topic with error metadata.
Transform & Enrich: Normalizes units, fills default fields, attaches tenant metadata, and emits to a normalized topic.
Router: Fans out to time-series storage writers and downstream event consumers.

4. Time-Series Storage

Primary store: Apache Cassandra or ClickHouse for high write throughput.

CREATE TABLE readings (
  tenant_id   UUID,
  device_id   UUID,
  metric_name TEXT,
  ts          TIMESTAMP,
  value       DOUBLE,
  tags        MAP<TEXT, TEXT>,
  PRIMARY KEY ((tenant_id, device_id, metric_name), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND default_time_to_live = 2592000;

Hot data (last 7 days) lives in the time-series DB. Warm data (7–90 days) is compacted and moved to columnar storage (Parquet on S3). Cold data beyond 90 days is archived as compressed blobs.

5. Edge Buffering for Offline Devices

Devices run a local edge agent that:

Buffers messages to local flash storage using a circular log
Applies local schema validation and drop rules to avoid buffer overflow
Reconnects with exponential backoff (jitter added)
Replays buffered messages with original timestamps on reconnect
Marks messages with a buffered=true flag so the pipeline can skip real-time alerting for stale data

The gateway deduplicates replayed messages using a Bloom filter keyed on device_id + sequence_number with a 24-hour window.

6. Schema Registry

Schemas are versioned and stored centrally. Devices reference a schema_id in their registry record. The validator service caches schemas in-process for 60 seconds. Schema evolution follows backward-compatible rules: adding optional fields is allowed; removing or renaming fields requires a new major version.

API Design

Device Registration

POST /v1/devices
{
  "device_class": "temperature_sensor",
  "schema_id": "uuid",
  "metadata": { "location": "warehouse-A" }
}
Response: { "device_id": "uuid", "auth_token": "..." }

HTTP Ingestion

POST /v1/ingest/{device_id}
Authorization: Bearer {auth_token}
{
  "ts": 1700000000000,
  "metrics": { "temperature": 23.4, "humidity": 61.2 },
  "seq": 10042
}
Response: 202 Accepted

Query API

GET /v1/devices/{device_id}/metrics?metric=temperature&from=ISO8601&to=ISO8601&resolution=1m

Scaling and Reliability

Gateway: Horizontal scaling behind a Layer-4 load balancer. MQTT brokers form a cluster with shared session state.
Kafka: 3x replication factor, min-insync-replicas=2. Partition count sized to peak throughput / consumer throughput.
Storage writes: Batched writes (500ms windows) to reduce Cassandra write amplification.
Backpressure: If Kafka consumer lag exceeds threshold, the gateway applies token-bucket rate limiting per tenant.
Multi-region: Kafka MirrorMaker replicates to a secondary region for DR. Time-series DB uses cross-DC replication with LOCAL_QUORUM reads.

Interview Discussion Points

How would you handle a noisy-neighbor tenant consuming disproportionate Kafka partitions?
MQTT QoS levels (0/1/2) and their trade-offs for IoT workloads
Why time-series DBs beat general-purpose DBs for this workload (write patterns, TTL, compression)
Deduplication window sizing: too short misses replays, too long wastes memory
Schema evolution without downtime: deploy consumers before producers when adding fields