Autonomous vehicle fleet management sits at the intersection of real-time embedded systems, cloud infrastructure, and safety-critical engineering. This LLD covers the major subsystems: telemetry, mission assignment, remote monitoring, OTA updates, and incident handling.
Vehicle Schema
Each vehicle in the fleet is represented as:
vehicle_id UUID PRIMARY KEY
model VARCHAR(64)
status ENUM('available','on_mission','charging','maintenance','disabled')
location_lat DECIMAL(9,6)
location_lng DECIMAL(9,6)
heading_deg SMALLINT
battery_pct SMALLINT -- or fuel_pct for hybrid
software_version VARCHAR(32)
active_mission_id UUID REFERENCES missions(mission_id)
last_seen TIMESTAMP
Telemetry Ingestion Pipeline
Each vehicle runs an onboard computer that reads CAN bus data at 100 Hz: wheel speed, steering angle, brake pressure, sensor health, GPS position, and battery/motor state. This data is aggregated and compressed, then transmitted over 4G/5G to an ingestion endpoint every 100 ms for safety-critical fields and every 1 s for diagnostics. On the backend, a Kafka topic partitioned by vehicle ID receives the stream, so each vehicle's data is consumed in order. Consumers write to a time-series database (InfluxDB or TimescaleDB) for trending and a Redis cache for the real-time dashboard. Alert rules run as stream processors (Flink or Kafka Streams) and fire within 50 ms of a threshold crossing.
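The onboard aggregation step can be sketched as follows. This is a minimal illustration, assuming a 100 ms batch window, JSON-over-gzip serialization, and an invented sample layout; a production agent would use a compact binary encoding such as protobuf.

```python
import gzip
import json

def batch_samples(samples, window_ms=100):
    """Group time-ordered samples into batches spanning `window_ms` each."""
    batches, current, window_start = [], [], None
    for s in samples:
        if window_start is None:
            window_start = s["ts_ms"]
        if s["ts_ms"] - window_start >= window_ms:
            batches.append(current)
            current, window_start = [], s["ts_ms"]
        current.append(s)
    if current:
        batches.append(current)
    return batches

def compress_batch(batch):
    """Serialize a batch and gzip it for the cellular uplink."""
    return gzip.compress(json.dumps(batch).encode("utf-8"))

def decompress_batch(payload):
    """Inverse of compress_batch, as the ingestion gateway would run it."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))
```

Each compressed payload then becomes one message keyed by vehicle ID, which is what preserves per-vehicle ordering downstream.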
Mission Schema
A mission represents one passenger trip or cargo run:
mission_id UUID PRIMARY KEY
passenger_id UUID
pickup_lat DECIMAL(9,6)
pickup_lng DECIMAL(9,6)
dropoff_lat DECIMAL(9,6)
dropoff_lng DECIMAL(9,6)
assigned_vehicle_id UUID REFERENCES vehicles(vehicle_id)
route_polyline TEXT -- encoded polyline
status ENUM('queued','dispatched','in_progress','completed','cancelled')
created_at TIMESTAMP
eta TIMESTAMP
The dispatch service picks the nearest available vehicle with sufficient battery for the full route plus the return leg to a charging station. Assignment uses an optimistic lock: a conditional update that succeeds only while status = 'available', so two dispatchers cannot claim the same vehicle.
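The selection rule above can be sketched as a straight-line filter-and-rank pass. The energy model (a flat percent-per-km consumption rate) and the vehicle dict layout are illustrative assumptions, not the production schema.

```python
import math

PCT_PER_KM = 0.5  # assumed consumption: 0.5% battery per km

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two lat/lng points, in km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def pick_vehicle(vehicles, pickup, dropoff, charger):
    """Nearest available vehicle with battery for route + return to charger."""
    best = None
    for v in vehicles:
        if v["status"] != "available":
            continue
        to_pickup = haversine_km(v["lat"], v["lng"], *pickup)
        trip = haversine_km(*pickup, *dropoff)
        to_charger = haversine_km(*dropoff, *charger)
        needed_pct = (to_pickup + trip + to_charger) * PCT_PER_KM
        if v["battery_pct"] < needed_pct:
            continue
        if best is None or to_pickup < best[0]:
            best = (to_pickup, v)
    return best[1] if best else None
```

The winner is then claimed with the conditional update described above; if that update affects zero rows, another dispatcher got there first and the selection is retried.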
Remote Monitoring Dashboard
The operations center UI shows: a live map with all vehicles, color-coded by status; per-vehicle panels with sensor health indicators (green/yellow/red), camera feed thumbnails, and current speed; anomaly alerts ranked by severity. Map positions update via WebSocket push from the Redis cache. Each operator is assigned a vehicle cohort; the system auto-escalates an anomaly to a senior operator if the assigned operator does not acknowledge it within 30 seconds.
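The 30-second acknowledgment rule can be sketched as below. The in-memory dict stands in for whatever store the dashboard backend actually uses; only the escalation rule itself mirrors the text.

```python
ACK_TIMEOUT_S = 30

class AlertEscalator:
    def __init__(self, now_fn):
        self._now = now_fn            # injected clock, for testability
        self._pending = {}            # alert_id -> (raised_at, operator)

    def raise_alert(self, alert_id, operator):
        self._pending[alert_id] = (self._now(), operator)

    def acknowledge(self, alert_id):
        self._pending.pop(alert_id, None)

    def escalations_due(self):
        """Alerts unacknowledged past the timeout, to route to a senior operator."""
        now = self._now()
        return [a for a, (t, _) in self._pending.items()
                if now - t >= ACK_TIMEOUT_S]
```

A periodic sweep (or a per-alert timer) would call escalations_due and reassign the returned alerts.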
Remote Assistance
When the vehicle’s onboard decision system encounters an edge case it cannot resolve with high confidence (e.g., ambiguous construction zone, unexpected pedestrian behavior), it requests remote assistance. The vehicle slows to a safe speed or stops. A human operator receives a low-latency video feed (target under 200ms one-way) and takes control via a remote driving console. The operator resolves the situation and returns control to autonomous mode. All remote assistance sessions are recorded for model retraining.
OTA Software Updates
Software updates use a staged rollout by vehicle cohort. A new software version is first deployed to a canary cohort of 1% of the fleet. The update manager monitors error rates, disengagement events, and sensor health for 24 hours. If metrics stay within baseline, the rollout proceeds to 10%, then 50%, then 100%, with automated holds at each stage if anomaly rates rise. Each vehicle downloads the update delta (binary diff), verifies a cryptographic signature, and applies it during a scheduled maintenance window when the vehicle is charging. If the post-update health check fails, the vehicle automatically rolls back to the previous version and reports the failure.
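The stage-gate logic can be sketched as a small promotion function. The stage percentages come from the text; the metric names and the 1.2x-of-baseline tolerance are assumptions for illustration.

```python
STAGES = [1, 10, 50, 100]  # percent of fleet, per the rollout plan above

def next_stage(current_pct, metrics, baseline, tolerance=1.2):
    """Return the next rollout percentage, or None to hold the rollout.

    Holds if any monitored metric exceeds `tolerance` x its baseline.
    """
    for name, value in metrics.items():
        if value > baseline[name] * tolerance:
            return None  # automated hold: anomaly rate rose
    idx = STAGES.index(current_pct)
    return STAGES[idx + 1] if idx + 1 < len(STAGES) else current_pct
```

The update manager would call this after each 24-hour soak window; a None result freezes the rollout for human review.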
Incident Handling
On fault detection (sensor failure, unexpected deceleration, software exception), the vehicle executes a minimal-risk maneuver: activate hazard lights, slow to a stop in the nearest safe location, and shift to a passive safe state. The operations center is notified within 5 seconds. The incident record captures: vehicle state snapshot, last 30 seconds of sensor data, GPS position, active mission, and software version. A field technician is dispatched if the vehicle cannot self-recover.
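The "last 30 seconds of sensor data" in the incident record implies a rolling buffer the snapshot can drain. A minimal sketch, assuming second-resolution timestamps and an illustrative sample shape:

```python
from collections import deque

class RollingSensorBuffer:
    """Keeps only the most recent `horizon_s` seconds of samples."""

    def __init__(self, horizon_s=30):
        self.horizon_s = horizon_s
        self._buf = deque()

    def append(self, ts_s, sample):
        self._buf.append((ts_s, sample))
        cutoff = ts_s - self.horizon_s
        while self._buf and self._buf[0][0] < cutoff:
            self._buf.popleft()  # evict samples older than the horizon

    def snapshot(self):
        """Everything inside the window, oldest first, for the incident record."""
        return list(self._buf)
```

On fault detection, snapshot() is serialized into the incident record alongside the vehicle state, GPS position, active mission, and software version.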
Safety System Integration
The perception stack uses triple redundancy: LiDAR, radar, and cameras each independently detect obstacles. The fusion layer requires at least two systems to agree before the vehicle proceeds. A hardware kill switch, independent of the main computer, can cut drive power if the watchdog timer is not refreshed within 100ms. All safety-critical code runs on an isolated real-time operating system partition separate from the application software.
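The kill-switch watchdog rule reduces to a deadline check. This sketch models only the timing logic in software; the real mechanism is an independent hardware timer, and the millisecond interface here is an illustrative simplification.

```python
WATCHDOG_TIMEOUT_MS = 100

class Watchdog:
    """Cuts drive power if the main computer misses a refresh deadline."""

    def __init__(self, now_ms):
        self._last_refresh = now_ms

    def refresh(self, now_ms):
        """Called periodically by the main computer while it is healthy."""
        self._last_refresh = now_ms

    def power_allowed(self, now_ms):
        """False once the refresh deadline has been missed."""
        return now_ms - self._last_refresh < WATCHDOG_TIMEOUT_MS
```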
Fleet Utilization Optimization
A demand forecasting model predicts ride requests by zone and time of day. The rebalancing service pre-positions idle vehicles to high-demand zones before peak hours. Charging schedules are optimized to ensure vehicles return to service at predicted demand spikes. Fleet-wide statistics (utilization rate, miles per charge, disengagement rate per mile) are reported daily to inform vehicle procurement and route expansion decisions.
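The rebalancing step can be sketched greedily: send idle vehicles to the zones with the largest forecast shortfall. The forecast and supply inputs are assumed to come from the demand model and the live fleet state; a production service would also weigh travel cost to each zone.

```python
def rebalance(idle_vehicle_ids, forecast_demand, current_supply):
    """Map vehicle_id -> target zone, filling the biggest deficits first."""
    deficits = {z: forecast_demand[z] - current_supply.get(z, 0)
                for z in forecast_demand}
    plan = {}
    for vid in idle_vehicle_ids:
        zone = max(deficits, key=deficits.get)
        if deficits[zone] <= 0:
            break  # no zone is short; leave remaining vehicles in place
        plan[vid] = zone
        deficits[zone] -= 1  # this vehicle now counts toward supply there
    return plan
```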
Frequently Asked Questions
Q: How is CAN bus telemetry ingested in an autonomous vehicle fleet system?
A: Each vehicle exposes telemetry over its CAN (Controller Area Network) bus — speed, steering angle, brake pressure, sensor health, and hundreds of other signals at rates up to 1 Mbit/s. An on-board telemetry agent reads raw CAN frames, decodes them using a DBC (database CAN) file, filters to the signals relevant for fleet management, and batches them into compressed protobuf messages. These are forwarded over a cellular or V2X link to a cloud ingestion gateway. The gateway fans messages into a partitioned Kafka topic (partitioned by vehicle ID) so downstream consumers — dashboards, anomaly detectors, ML feature pipelines — can process each vehicle’s stream independently and in order.
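The DBC-decoding step above amounts to extracting a scaled bit field from an 8-byte CAN frame. A hand-rolled sketch of the little-endian (Intel) case; the wheel-speed signal definition is invented for illustration, and a real agent would use a generated decoder or a library such as cantools.

```python
def decode_signal(frame: bytes, start_bit: int, length: int,
                  scale: float, offset: float) -> float:
    """Extract a little-endian unsigned bit field and apply DBC scale/offset."""
    raw = int.from_bytes(frame, "little")
    value = (raw >> start_bit) & ((1 << length) - 1)
    return value * scale + offset

# Hypothetical signal: wheel speed, 16 bits starting at bit 0,
# 0.01 km/h per count, no offset. Raw count 2500 -> 25.0 km/h.
frame = (2500).to_bytes(2, "little") + bytes(6)
```

Big-endian (Motorola) signals and signed fields need extra bit bookkeeping, which is exactly what the DBC file and decoding libraries handle.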
Q: How do you design a staged OTA update rollout for an AV fleet?
A: A staged OTA rollout applies updates to an increasing percentage of the fleet over time, using automated gates to halt rollout if error rates rise. The update service stores firmware artifacts in object storage with immutable versioned keys. A rollout plan defines canary (1%), early adopter (5%), limited (20%), and full (100%) cohorts. Vehicles are assigned to cohorts deterministically by hashing their VIN. Before each stage promotion the system checks: crash rate delta, safety-critical alert rate, and user-reported issue count — all must remain below thresholds. Vehicles download updates during idle charging windows, verify the package hash, apply to an inactive partition, and reboot into the new partition only after a successful health check. Rollback is instant: the bootloader is instructed to reactivate the previous partition.
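Deterministic cohort assignment by VIN hash can be sketched as below. The cohort boundaries (1/5/20/100%) come from the answer above; the choice of SHA-256 and the modulo-100 bucketing are assumptions.

```python
import hashlib

# Upper cumulative-percentage bound for each cohort.
COHORTS = [("canary", 1), ("early_adopter", 5), ("limited", 20), ("full", 100)]

def cohort_for_vin(vin: str) -> str:
    """Hash the VIN into a bucket 0-99 and map it to a rollout cohort."""
    bucket = int(hashlib.sha256(vin.encode()).hexdigest(), 16) % 100
    for name, upper in COHORTS:
        if bucket < upper:
            return name
    return "full"
```

Because the hash is deterministic, a vehicle lands in the same cohort for every rollout of a given plan, which keeps canary exposure stable and auditable.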
Q: How does a remote assist handoff protocol work for autonomous vehicles?
A: When a vehicle encounters a scenario it cannot resolve autonomously — unusual road geometry, a blocked lane, an ambiguous object — it transitions to MINIMAL_RISK_CONDITION and broadcasts a remote assist request. The request includes a live video feed (multi-camera), sensor data snapshot, and a structured description of the blocking condition. A remote operator console receives the request from a priority queue (ordered by vehicle safety state and wait time), reviews the scene, and issues a high-level directive: nudge left, proceed, pull over. The directive is transmitted back to the vehicle over a redundant low-latency link (primary LTE + backup satellite). The vehicle autonomy stack interprets the directive and resumes. The entire handoff targets sub-30-second resolution and is logged for audit and model training.
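The operator-console priority queue described above can be sketched with a heap keyed on safety state first, wait time second. The safety-state ranking here is an assumption for illustration.

```python
import heapq
import itertools

# Lower rank = more urgent. Invented states for illustration.
SAFETY_RANK = {"STOPPED_IN_LANE": 0, "MINIMAL_RISK_CONDITION": 1, "SLOWED": 2}

class AssistQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps heap comparisons total

    def push(self, vehicle_id, safety_state, requested_at):
        key = (SAFETY_RANK[safety_state], requested_at, next(self._seq))
        heapq.heappush(self._heap, (key, vehicle_id))

    def pop(self):
        """Next request an operator should handle."""
        return heapq.heappop(self._heap)[1]
```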
Q: Why do safety-critical AV systems use triple redundancy?
A: Triple redundancy (TMR — Triple Modular Redundancy) is used for safety-critical subsystems such as braking, steering, and sensor fusion because it tolerates a single component failure without loss of function. Three independent hardware channels compute the same output; a majority voter selects the result agreed upon by at least two channels. If one channel disagrees, it is flagged as faulty and the system continues operating on the remaining two (DMR — Dual Modular Redundancy) while generating a maintenance alert. Each channel uses independent power supplies, separate sensor inputs, and different hardware vendors where possible to avoid common-cause failures. Redundant architectures of this kind are a standard way to satisfy ISO 26262 ASIL-D functional safety requirements for automotive systems.
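The voter's selection logic can be stated in a few lines. Real voters run in hardware or lockstep firmware; this models only the 2-of-3 decision and fault flagging.

```python
def tmr_vote(a, b, c):
    """Return (voted_value, faulty_channel).

    faulty_channel is 0/1/2 for the disagreeing channel,
    or None when all three channels agree.
    """
    if a == b == c:
        return a, None
    if a == b:
        return a, 2
    if a == c:
        return a, 1
    if b == c:
        return b, 0
    # No majority: treat as a system-level fault, not a single-channel one.
    raise RuntimeError("no majority: all three channels disagree")
```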
Q: How do you design a fleet health dashboard for autonomous vehicles?
A: A fleet health dashboard aggregates real-time and historical metrics across all vehicles into a unified view. The data pipeline reads from the telemetry Kafka topics, computes per-vehicle and fleet-aggregate metrics in a stream processor (e.g., Apache Flink), and writes results to a time-series database (e.g., InfluxDB or TimescaleDB). The dashboard UI shows: live vehicle map with status overlays, fleet-wide KPIs (availability, miles between interventions, sensor fault rate), per-vehicle drill-down with historical trend charts, and active alert list sorted by severity. Alerts are generated by threshold rules and ML anomaly models running in the stream processor. On-call engineers receive PagerDuty notifications for CRITICAL alerts. The dashboard backend exposes a GraphQL API so multiple front-end surfaces (web, mobile NOC app) share the same data layer.
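The fleet-aggregate KPI computation the stream processor performs per window can be sketched as a plain batch function over one window of per-vehicle records. The record fields and the treatment of 'available' and 'on_mission' as in-service states are assumptions; the metric names mirror the list above.

```python
def fleet_kpis(window_records):
    """window_records: list of per-vehicle stat dicts for one time window."""
    n = len(window_records)
    in_service = sum(1 for r in window_records
                     if r["status"] in ("available", "on_mission"))
    total_miles = sum(r["miles"] for r in window_records)
    interventions = sum(r["interventions"] for r in window_records)
    return {
        "availability": in_service / n if n else 0.0,
        "miles_between_interventions":
            total_miles / interventions if interventions else float("inf"),
        "sensor_fault_rate":
            sum(r["sensor_faults"] for r in window_records) / n if n else 0.0,
    }
```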