IoT platforms manage millions of connected devices (sensors, cameras, industrial equipment, smart home devices), collecting telemetry and enabling remote control. Managed services such as AWS IoT Core and Azure IoT Hub operate at very large device counts and message volumes; Google Cloud IoT Core filled the same role until it was retired in 2023. Designing an IoT platform tests your understanding of device communication protocols, massive telemetry ingestion, edge computing, and real-time alerting. This guide covers the architecture for a system design interview.
Device Communication: MQTT Protocol
MQTT (Message Queuing Telemetry Transport) is the de facto standard IoT messaging protocol. It is designed for low bandwidth (IoT devices often have limited connectivity), low power (battery-operated sensors), and unreliable networks (cellular, satellite).
Architecture: a broker (server) mediates all communication. Devices (clients) connect to the broker and publish messages to topics (sensors/temperature/room1 -> 23.5), subscribe to topics to receive commands (devices/thermostat1/commands), and receive retained messages (the last message published to a topic with the retain flag set, which the broker delivers immediately to each new subscriber).
QoS levels:
(0) At most once: fire and forget. Messages may be lost. Lowest overhead. Use for frequent sensor readings where a missed reading is acceptable.
(1) At least once: delivery is acknowledged and retried, so duplicates are possible. Use for important operations that are idempotent or that the receiver can deduplicate.
(2) Exactly once: guaranteed single delivery. Highest overhead (a four-packet handshake: PUBLISH, PUBREC, PUBREL, PUBCOMP). Use for billing events and critical commands.
MQTT over WebSocket: for browser-based dashboards, MQTT can run over WebSocket. The broker exposes both native MQTT listeners (port 1883 for plaintext, 8883 for TLS) and WebSocket listeners.
Alternative protocols: CoAP (constrained devices, UDP-based, REST-like), AMQP (enterprise messaging), and HTTP (simple, but with higher overhead per message, so not ideal for frequent telemetry).
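The publish, subscribe, and retained-message behavior described above can be sketched with a toy in-memory broker. This is an illustration only, not an MQTT implementation: `ToyBroker` and its methods are hypothetical names, topics are matched exactly (no wildcards), and there is no networking or QoS. A real device would use an MQTT client library against a real broker.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# A handler receives (topic, payload). Hypothetical type alias for this sketch.
Handler = Callable[[str, str], None]


class ToyBroker:
    """In-memory stand-in for an MQTT broker (exact-match topics only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)
        self._retained: Dict[str, str] = {}

    def publish(self, topic: str, payload: str, retain: bool = False) -> None:
        # A retained publish replaces the stored last value for the topic.
        if retain:
            self._retained[topic] = payload
        # Fan the message out to every current subscriber of the topic.
        for handler in self._subscribers[topic]:
            handler(topic, payload)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)
        # A new subscriber immediately receives the retained value, if any.
        if topic in self._retained:
            handler(topic, self._retained[topic])


broker = ToyBroker()
received = []

# The sensor publishes before any dashboard is listening.
broker.publish("sensors/temperature/room1", "23.5", retain=True)

# A late subscriber still gets the last retained reading, then live updates.
broker.subscribe("sensors/temperature/room1",
                 lambda topic, payload: received.append((topic, payload)))
broker.publish("sensors/temperature/room1", "23.7")

print(received)
```

The late subscriber sees the retained 23.5 first and the live 23.7 second, which is why retained messages are useful for dashboards that connect after a device last reported.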
Telemetry Ingestion Pipeline
Scale: 1 million devices each sending telemetry every 10 seconds produce 100,000 messages per second. Each message is small (100-500 bytes: temperature, humidity, GPS coordinates, battery level), so raw ingest is roughly 10-50 MB/s.
Pipeline:
(1) MQTT broker cluster: multiple broker instances behind a load balancer. Assuming each broker handles 10,000-50,000 concurrent device connections, 1M devices need 20-100 broker instances. Use a clustered broker such as EMQX or a managed service such as AWS IoT Core; open-source Mosquitto has no built-in clustering, so it fits smaller or single-node deployments.
(2) Bridge to Kafka: the MQTT broker bridges incoming messages to Kafka topics partitioned by device_id, which preserves per-device ordering. Kafka provides durability, replay, and decoupling from downstream consumers.
(3) Stream processing: Flink or Spark Streaming consumes from Kafka to aggregate telemetry (average temperature per room per minute), detect anomalies (a temperature reading more than 3 standard deviations from baseline), and trigger alerts (temperature > 40C -> fire alarm).
(4) Storage: raw telemetry goes to a time-series database (InfluxDB, TimescaleDB) for operational dashboards. Aggregated data goes to a data warehouse (BigQuery, Redshift) for analytics.
(5) Device state: the latest telemetry from each device is stored in a “device shadow” or “digital twin” (see the Digital Twin section).
Latency target: under 5 seconds from device publish to alert trigger for real-time alerting.
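The two alerting rules from the stream-processing step (a fixed 40C limit and a 3-standard-deviation spike) can be sketched per device as follows. This is a minimal single-process sketch, not a Flink or Spark job: `SpikeDetector`, the window size, and the minimum-sample count are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque, List


class SpikeDetector:
    """Flags readings that break a hard limit or deviate from a rolling baseline."""

    def __init__(self, window: int = 30, z_threshold: float = 3.0,
                 hard_limit_c: float = 40.0, min_samples: int = 5) -> None:
        self.history: Deque[float] = deque(maxlen=window)  # rolling baseline
        self.z_threshold = z_threshold
        self.hard_limit_c = hard_limit_c
        self.min_samples = min_samples

    def observe(self, value: float) -> List[str]:
        alerts: List[str] = []
        # Rule 1: absolute threshold (temperature > 40C).
        if value > self.hard_limit_c:
            alerts.append("hard_limit")
        # Rule 2: deviation from the rolling mean, once enough history exists.
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            if sigma > 0 and abs(value - mu) > self.z_threshold * sigma:
                alerts.append("anomaly")
        # Simplification: every reading joins the baseline, including outliers.
        self.history.append(value)
        return alerts


detector = SpikeDetector()
baseline_alerts = [detector.observe(v) for v in (22.0, 22.1, 21.9, 22.0, 22.2, 22.1)]
spike_alerts = detector.observe(30.0)
print(baseline_alerts, spike_alerts)
```

In a real pipeline this state would be keyed by device_id inside the stream processor, which is one reason the Kafka topics are partitioned by device_id.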
Device Management
Managing millions of devices requires five capabilities:
(1) Device registry: a database of all registered devices with device_id, device_type, firmware_version, owner, location, status (online/offline), last_seen, and metadata. The registry is the authoritative list of devices. New devices must be registered (provisioned) before they can connect.
(2) Authentication: each device has a unique identity (X.509 certificate or token). On connection, the broker verifies the device identity. Compromised device certificates can be revoked.
(3) OTA (Over-The-Air) firmware updates: push firmware updates to devices remotely. Publish the new firmware to object storage such as S3, then send a command to targeted devices (by device type, firmware version, or device group): “download firmware v2.1 from URL, verify checksum, install, and report status.” Roll out gradually: update 1% of devices first (canary), monitor for errors, then expand. Rollback: if the new firmware causes failures, push the previous version.
(4) Device groups: organize devices by location (all sensors in Building A), type (all temperature sensors), or custom tags. Groups enable bulk operations such as updating all cameras to firmware v3 or restarting all sensors in Zone 5.
(5) Monitoring: track device health through connection status, last heartbeat, error rates, and battery level. Alert when a device is offline for more than 1 hour, battery drops below 10%, or the error rate spikes.
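One common way to pick the canary group for a gradual rollout is to hash each device_id into a stable bucket from 0 to 99 and update only devices whose bucket falls below the current percentage. The sketch below assumes that approach; `rollout_bucket`, `in_rollout`, and the `salt` value are hypothetical names, and the source does not prescribe a selection method.

```python
import hashlib


def rollout_bucket(device_id: str, salt: str = "fw-v2.1") -> int:
    """Map a device to a stable bucket in [0, 100) for a given rollout."""
    # The salt (assumed to be the firmware version here) reshuffles buckets
    # per rollout so the same devices are not always the canaries.
    digest = hashlib.sha256(f"{salt}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(device_id: str, percent: int, salt: str = "fw-v2.1") -> bool:
    """True if the device should receive the update at this rollout percentage."""
    return rollout_bucket(device_id, salt) < percent


devices = [f"device-{i}" for i in range(1000)]
canary = {d for d in devices if in_rollout(d, 1)}     # first wave: about 1%
expanded = {d for d in devices if in_rollout(d, 10)}  # second wave: about 10%

# Raising the percentage only adds devices; earlier waves stay included.
print(len(canary), len(expanded), canary <= expanded)
```

Because the bucket depends only on the device_id and the salt, any service instance can decide membership without shared state, and expanding from 1% to 10% never removes a device that already updated.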
Digital Twin
A digital twin is a virtual representation of a physical device, stored in the cloud. It maintains the device's “desired state” (what the cloud wants the device to be) and “reported state” (what the device actually is).
Example: a smart thermostat. Desired state: {"temperature": 22, "mode": "auto"}. Reported state: {"temperature": 23.5, "mode": "auto", "battery": 87}. When the user sets the temperature to 22 via the app, the cloud updates the desired state. The device receives the desired state change (via its MQTT subscription), adjusts the thermostat, and reports back the actual state. The digital twin reconciles the two: if desired matches reported, the device is in sync. If they differ (for example, the device is offline and has not applied the change), the delta is tracked and delivered when the device reconnects.
AWS calls this “Device Shadow” and Azure calls it “Device Twin.” Google Cloud IoT Core, before its retirement, exposed a similar concept as device state and configuration.
Benefits:
(1) The app can read the device state without connecting to the device directly (it just reads the twin).
(2) Offline devices: pending changes are held in the twin as desired state. When the device reconnects, it receives the difference between desired and reported state rather than a replay of every individual command.
(3) Simulation: the twin can be used for what-if analysis without affecting the physical device.
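The reconciliation step can be sketched as a comparison of the two state documents. This is a simplified illustration using flat dictionaries; `compute_delta` is a hypothetical helper, and managed services such as AWS Device Shadow compute a comparable delta on the server side with their own document format.

```python
from typing import Any, Dict


def compute_delta(desired: Dict[str, Any], reported: Dict[str, Any]) -> Dict[str, Any]:
    """Return the desired keys whose values the device has not yet reported."""
    return {
        key: value
        for key, value in desired.items()
        if key not in reported or reported[key] != value
    }


desired = {"temperature": 22, "mode": "auto"}
reported = {"temperature": 23.5, "mode": "auto", "battery": 87}

# Only the out-of-sync field is sent to the device; battery is report-only.
delta = compute_delta(desired, reported)
print(delta)  # {'temperature': 22}

# After the device applies the change and reports back, the twin is in sync.
reported.update(delta)
print(compute_delta(desired, reported))  # {}
```

Keys that appear only in the reported state, such as battery, never show up in the delta, since the cloud has expressed no desired value for them.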
Edge Computing
Not all processing should happen in the cloud. Edge computing runs logic on devices or local gateways, for three reasons:
(1) Latency: a factory robot needs a sub-10ms response, while a cloud round trip takes 50-200ms. With edge processing, the local gateway runs the decision logic.
(2) Bandwidth: a camera generates 1 GB/hour of video, and uploading everything to the cloud is expensive. With edge processing, object detection runs on the edge device and only detected events are uploaded (for example, a 95% bandwidth reduction).
(3) Offline operation: if internet connectivity is lost, edge devices continue operating with local logic and queue telemetry for later upload.
Edge architecture: an edge gateway (a Raspberry Pi or industrial PC, often running a runtime such as AWS IoT Greengrass or Azure IoT Edge) hosts a local MQTT broker (devices communicate even without internet), stream processing (filtering, aggregation, anomaly detection), ML inference (pre-trained models deployed to the edge for image classification and predictive maintenance), and local storage (buffering telemetry during connectivity loss).
Cloud-to-edge deployment: the cloud deploys models and rules to edge devices, the edge runs them locally, and results are synced to the cloud when connected. This is a hybrid architecture: time-sensitive and bandwidth-heavy processing at the edge, aggregation and long-term analytics in the cloud.
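The offline-operation behavior (queue telemetry locally, upload when connectivity returns) can be sketched as a bounded store-and-forward buffer. This is a minimal in-memory sketch: `EdgeBuffer`, `fake_uplink`, and the capacity value are hypothetical, and a real gateway would likely persist the queue to disk so that buffered readings survive a restart.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict

Reading = Dict[str, Any]


class EdgeBuffer:
    """Store-and-forward queue: buffers readings until the uplink accepts them."""

    def __init__(self, uplink: Callable[[Reading], bool], max_buffered: int = 10_000) -> None:
        self._uplink = uplink  # returns True if the cloud accepted the reading
        # Bounded queue: when full, the oldest reading is dropped first.
        self._pending: Deque[Reading] = deque(maxlen=max_buffered)

    def __len__(self) -> int:
        return len(self._pending)

    def send(self, reading: Reading) -> None:
        self._pending.append(reading)
        self.flush()

    def flush(self) -> int:
        """Upload buffered readings in order; stop at the first failure."""
        sent = 0
        while self._pending:
            if not self._uplink(self._pending[0]):
                break
            self._pending.popleft()
            sent += 1
        return sent


# Simulated uplink whose availability is toggled by the example.
link = {"online": False}
uploaded = []


def fake_uplink(reading: Reading) -> bool:
    if link["online"]:
        uploaded.append(reading)
        return True
    return False


edge = EdgeBuffer(fake_uplink, max_buffered=3)
for temp in (21.0, 21.5, 22.0, 22.5):
    edge.send({"temp": temp})  # offline: readings accumulate, capacity is 3

link["online"] = True
flushed = edge.flush()         # connectivity restored: drain the backlog
print(flushed, uploaded)
```

Dropping the oldest reading at capacity is one possible policy; for telemetry where recent values matter most it is a reasonable default, while audit-style data would call for a larger or disk-backed buffer.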