Question 1

What are the differences between MQTT QoS levels 0, 1, and 2?

Accepted Answer

MQTT defines three Quality of Service levels. QoS 0 (at most once) sends the message once with no acknowledgment and no retry; it suits high-frequency sensor telemetry where occasional loss is acceptable. QoS 1 (at least once) requires a PUBACK from the receiver; the sender stores the message and retransmits it until it is acknowledged, so the receiver may see duplicates and should process messages idempotently. QoS 2 (exactly once) uses a four-part handshake (PUBLISH → PUBREC → PUBREL → PUBCOMP) so that each message is delivered once with no duplicates; it has the highest overhead and suits critical commands or billing events. Each level applies per hop, between a client and the broker, rather than end to end from publisher to subscriber.
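To make the duplicate behavior concrete, here is a minimal sketch of the receiver side in plain Python. It is not a broker or client library API: Qos2Receiver and qos1_receive are hypothetical names, and the QoS 2 receiver is assumed to deliver on the first PUBLISH and remember the packet id until PUBREL, which is one of the approaches the protocol permits.

```python
from dataclasses import dataclass, field


@dataclass
class Qos2Receiver:
    """Toy receiver side of a QoS 2 exchange (hypothetical, for illustration)."""
    delivered: list = field(default_factory=list)  # payloads handed to the application
    pending_ids: set = field(default_factory=set)  # packet ids still awaiting PUBREL

    def on_publish(self, packet_id: int, payload: str) -> str:
        # Deliver only the first time this packet id is seen. A retransmitted
        # PUBLISH with the same id is acknowledged again but not redelivered.
        if packet_id not in self.pending_ids:
            self.pending_ids.add(packet_id)
            self.delivered.append(payload)
        return "PUBREC"

    def on_pubrel(self, packet_id: int) -> str:
        # The sender has seen PUBREC, so the id can be forgotten and reused.
        self.pending_ids.discard(packet_id)
        return "PUBCOMP"


def qos1_receive(delivered: list, payload: str) -> str:
    # QoS 1 keeps no per-packet state here: every copy is delivered and acknowledged.
    delivered.append(payload)
    return "PUBACK"


# QoS 2: the sender retransmits PUBLISH because the first PUBREC was lost.
rx = Qos2Receiver()
rx.on_publish(7, "valve=open")
rx.on_publish(7, "valve=open")
rx.on_pubrel(7)

# QoS 1: the sender retransmits PUBLISH because the first PUBACK was lost.
qos1_log = []
qos1_receive(qos1_log, "valve=open")
qos1_receive(qos1_log, "valve=open")
```

In this run the QoS 2 receiver hands the payload to the application once, while the QoS 1 path hands it over twice, which is why QoS 1 consumers usually need their own deduplication or idempotent handlers.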

Question 2

How does hypertable partitioning work in a time-series database?

Accepted Answer

A hypertable (as in TimescaleDB) is a logical table that is automatically partitioned into chunks along the time dimension, for example one chunk per day or per week. Each chunk is a regular PostgreSQL table stored separately on disk, which enables chunk-level operations: queries are pruned to scan only the chunks that overlap the requested time range, old data is removed by dropping whole chunks (a cost that does not grow with the number of rows, unlike a row-by-row DELETE), and cold chunks can be tiered to cheaper storage. Space partitioning (for example, by a hash of the device ID) can be added as a second dimension to parallelize ingest and prevent hot spots on a single node.
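The chunking idea can be sketched with a small in-memory model. This is plain Python, not TimescaleDB's API: ToyHypertable, chunk_start, and the one-day CHUNK_INTERVAL are assumptions made for illustration, and space partitioning is left out.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

CHUNK_INTERVAL = timedelta(days=1)  # assumed chunk size for this sketch
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def chunk_start(ts: datetime) -> datetime:
    # Align a timestamp to the start of the chunk that contains it.
    return EPOCH + ((ts - EPOCH) // CHUNK_INTERVAL) * CHUNK_INTERVAL


class ToyHypertable:
    """In-memory stand-in for a time-partitioned table (hypothetical)."""

    def __init__(self):
        self.chunks = defaultdict(list)  # chunk start -> rows in that chunk

    def insert(self, ts, device_id, value):
        self.chunks[chunk_start(ts)].append((ts, device_id, value))

    def query(self, start, end):
        # Chunk pruning: skip chunks whose range does not overlap [start, end).
        rows = []
        for begin, chunk in self.chunks.items():
            if begin < end and begin + CHUNK_INTERVAL > start:
                rows.extend(r for r in chunk if start <= r[0] < end)
        return rows

    def drop_chunks(self, older_than):
        # Retention: discard whole chunks that end at or before the cutoff,
        # without touching individual rows.
        stale = [b for b in self.chunks if b + CHUNK_INTERVAL <= older_than]
        for b in stale:
            del self.chunks[b]
        return len(stale)


table = ToyHypertable()
day = datetime(2024, 1, 1, tzinfo=timezone.utc)
for offset in range(3):  # one reading on each of three consecutive days
    table.insert(day + timedelta(days=offset, hours=6), "dev-1", 20.0 + offset)

hits = table.query(day + timedelta(days=1), day + timedelta(days=2))
dropped = table.drop_chunks(older_than=day + timedelta(days=1))
```

The query for the second day touches a single chunk, and the retention call removes the first day's chunk as one unit.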

Question 3

How do you detect when an IoT device has gone offline using heartbeat TTL?

Accepted Answer

Each device publishes a heartbeat message on a fixed interval (e.g., every 30 seconds). The platform stores the last-seen timestamp for each device ID in a fast store such as Redis. A background monitor compares the current time against last-seen; if the gap exceeds a configurable TTL (e.g., 90 seconds, or three missed heartbeats), the device is marked offline and an alert is emitted. Redis key expiry can encode this natively: on each heartbeat the ingestion service sets or refreshes a per-device key with a TTL equal to the offline threshold, and key expiration triggers a keyspace notification that the monitor consumes. Keyspace notifications must be enabled through the notify-keyspace-events setting, and they are delivered over Pub/Sub without a delivery guarantee, so a periodic sweep of last-seen timestamps is a useful fallback.
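The timestamp-comparison variant can be sketched without any external store. In this minimal example, HeartbeatMonitor is a hypothetical class, a dict stands in for Redis, and the clock is injectable so the behavior can be checked without waiting 90 seconds.

```python
import time


class HeartbeatMonitor:
    """Tracks last-seen times and flags devices whose heartbeat gap exceeds the TTL."""

    def __init__(self, ttl_seconds=90.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.last_seen = {}   # device id -> time of the most recent heartbeat
        self.offline = set()  # devices currently marked offline

    def record_heartbeat(self, device_id):
        self.last_seen[device_id] = self.clock()
        self.offline.discard(device_id)  # a heartbeat brings the device back online

    def sweep(self):
        # Return devices that crossed the TTL since the last sweep, so each
        # outage produces a single alert rather than one per sweep.
        now = self.clock()
        newly_offline = []
        for device_id, seen in self.last_seen.items():
            if now - seen > self.ttl and device_id not in self.offline:
                self.offline.add(device_id)
                newly_offline.append(device_id)
        return newly_offline


fake_now = [0.0]  # mutable cell used as a controllable clock
monitor = HeartbeatMonitor(ttl_seconds=90.0, clock=lambda: fake_now[0])
monitor.record_heartbeat("dev-1")
monitor.record_heartbeat("dev-2")

fake_now[0] = 60.0
monitor.record_heartbeat("dev-2")  # dev-1 has stopped reporting

fake_now[0] = 100.0
alerts = monitor.sweep()
```

At t = 100 the gap for dev-1 is 100 seconds and for dev-2 is 40 seconds, so only dev-1 is reported. A production monitor would also need to persist state and handle devices that have never sent a heartbeat.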

Question 4

How should an OTA firmware update be rolled out to a large IoT fleet with automated gates?

Accepted Answer

A staged OTA rollout starts by deploying the firmware to a canary cohort (e.g., 1% of devices) and monitoring error rates, crash reports, and connectivity metrics for a bake period (e.g., 24 hours). Automated gates compare these metrics against baseline thresholds; if they pass, the rollout advances to the next stage (e.g., 5%, 20%, 50%, 100%). If a gate fails, the pipeline halts and optionally initiates an automatic rollback by pushing the previous firmware version. Each device reports its installed version and installation status, feeding the deployment dashboard in near real time.
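The gate logic can be sketched as a small decision function. Everything named here is an assumption for illustration: the CohortMetrics fields, the 1.2x tolerance over baseline, and the hash-based in_cohort assignment are not taken from any particular OTA system; only the stage percentages come from the example above.

```python
import hashlib
from dataclasses import dataclass

STAGES = [1, 5, 20, 50, 100]  # rollout percentages from the example above


@dataclass(frozen=True)
class CohortMetrics:
    crash_rate: float
    install_failure_rate: float
    offline_rate: float


def in_cohort(device_id: str, percent: int) -> bool:
    # Stable bucket in [0, 100): a device admitted at 1% stays admitted at 5%, 20%, ...
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent


def gate_passes(observed, baseline, tolerance=1.2):
    # Assumed rule: every metric must stay within `tolerance` times its baseline.
    return (
        observed.crash_rate <= baseline.crash_rate * tolerance
        and observed.install_failure_rate <= baseline.install_failure_rate * tolerance
        and observed.offline_rate <= baseline.offline_rate * tolerance
    )


def next_action(stage_index, observed, baseline):
    # Decide what the pipeline does once the bake period for this stage ends.
    if not gate_passes(observed, baseline):
        return ("halt_and_rollback", STAGES[stage_index])
    if stage_index == len(STAGES) - 1:
        return ("complete", STAGES[stage_index])
    return ("advance", STAGES[stage_index + 1])


baseline = CohortMetrics(crash_rate=0.010, install_failure_rate=0.020, offline_rate=0.050)
healthy = CohortMetrics(crash_rate=0.009, install_failure_rate=0.021, offline_rate=0.048)
regressed = CohortMetrics(crash_rate=0.030, install_failure_rate=0.020, offline_rate=0.050)

canary_ok = next_action(0, healthy, baseline)
canary_bad = next_action(0, regressed, baseline)
```

With these numbers the healthy canary advances to the 5% stage, while the regressed canary halts at 1% because its crash rate is three times the baseline. Hashing the device ID keeps cohort membership stable across stages, so each stage only adds devices.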

Question 5

How do Flink sliding window aggregations enable real-time alerting on IoT streams?

Accepted Answer

Apache Flink's sliding windows compute continuous aggregates (sum, average, max) over a moving time range. For example, a window of size 5 minutes sliding every 1 minute emits an aggregate every minute covering the last 5 minutes of data. For IoT alerting, a Flink job consumes sensor readings from Kafka, keys the stream by device ID, and applies a sliding window to compute rolling averages of temperature, vibration, or pressure. When an aggregate breaches a threshold, the job emits an alert event downstream to a notification service. Flink's event-time processing with watermarks handles out-of-order messages from intermittently connected devices.
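The window semantics can be modeled in a few lines of plain Python. This is not Flink code and it ignores watermarks and lateness; window_starts and rolling_alerts are hypothetical helpers, timestamps are integer event-time seconds, and windows are assumed to be aligned to multiples of the slide.

```python
from collections import defaultdict

SIZE_S = 300   # window size: 5 minutes
SLIDE_S = 60   # slide: 1 minute


def window_starts(ts: int) -> list:
    # Every window [start, start + SIZE_S) that contains ts.
    last_start = ts - ts % SLIDE_S
    return list(range(last_start, ts - SIZE_S, -SLIDE_S))


def rolling_alerts(readings, threshold):
    # readings: iterable of (device_id, event_time_seconds, value).
    sums = defaultdict(lambda: [0.0, 0])  # (device, window start) -> [total, count]
    for device_id, ts, value in readings:
        for start in window_starts(ts):
            acc = sums[(device_id, start)]
            acc[0] += value
            acc[1] += 1
    alerts = []
    for (device_id, start), (total, count) in sorted(sums.items()):
        average = total / count
        if average > threshold:
            alerts.append((device_id, start + SIZE_S, average))  # keyed by window end
    return alerts


readings = [
    ("dev-1", 600, 70.0),
    ("dev-1", 660, 72.0),
    ("dev-1", 720, 90.0),
    ("dev-1", 780, 95.0),
    ("dev-1", 840, 99.0),
]
alerts = rolling_alerts(readings, threshold=85.0)
```

Each reading lands in SIZE_S / SLIDE_S = 5 overlapping windows. The window covering 600 to 900 seconds holds all five readings, and its average of 85.2 is above the threshold, so an alert is produced for the window ending at 900. In a real Flink job the same aggregation would be emitted incrementally as watermarks pass each window end, rather than computed over a finished list.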
