System Design: Design Tesla Autopilot — Autonomous Driving, Sensor Fusion, Perception, Path Planning, Real-Time ML

Tesla Autopilot processes data from cameras, radar, and ultrasonic sensors to navigate roads autonomously. Designing an autonomous driving system tests your understanding of: real-time sensor fusion, ML perception (object detection + depth estimation), path planning under uncertainty, and the safety-critical systems that must make life-or-death decisions in milliseconds. This is the ultimate real-time ML system design question.

Sensor Suite and Data Pipeline

Tesla uses a vision-only approach (8 cameras, no LiDAR). Other companies (Waymo, Cruise) use LiDAR + cameras + radar. Sensors: (1) Cameras (8) — surround-view: front-facing (wide, main, narrow), side-facing (left/right), rear-facing. Each produces 1280×960 frames at 36 FPS. Total: 8 cameras * 36 FPS * ~2 MB/frame = ~576 MB/sec of raw image data. (2) Radar (Tesla began phasing it out in 2021) — detects objects and measures distance/velocity using radio waves. Works in rain, fog, and darkness, where cameras struggle. Range: 250m. (3) Ultrasonic sensors (12) — short-range (< 8m) for parking and close-proximity detection.

Data pipeline: all sensor data flows to the onboard computer (the Tesla FSD computer: dual redundant chips, each with a neural network accelerator). Processing is entirely on-device (no cloud — the car must work without internet). Latency budget from photon hitting the camera sensor to the car actuating (steering, braking): < 100ms total. Breakdown: image capture (10ms) + neural network inference (20-30ms) + perception post-processing (10ms) + path planning (10ms) + control output (10ms) + actuator response (20-30ms). Every millisecond matters: at 60 mph, the car travels ~27 meters per second, so a 100ms delay means ~2.7 meters of uncontrolled travel.
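The bandwidth and latency arithmetic above is worth sanity-checking. A minimal sketch, using only the numbers quoted in the text (which are illustrative, not exact specs):

```python
# Back-of-envelope check of the sensor bandwidth and latency budget.
CAMERAS = 8
FPS = 36
FRAME_MB = 2.0  # ~2 MB per 1280x960 frame

bandwidth_mb_s = CAMERAS * FPS * FRAME_MB
print(f"Raw camera bandwidth: {bandwidth_mb_s:.0f} MB/s")  # 576 MB/s

# Latency budget: photon -> actuation, from the breakdown in the text
# (taking the upper end of each 20-30ms range).
budget_ms = {
    "image_capture": 10,
    "nn_inference": 30,
    "perception_post": 10,
    "path_planning": 10,
    "control_output": 10,
    "actuator_response": 30,
}
total_ms = sum(budget_ms.values())

speed_mps = 60 * 1609.34 / 3600  # 60 mph in meters/second (~26.8 m/s)
blind_distance = speed_mps * total_ms / 1000
print(f"Total latency: {total_ms} ms -> {blind_distance:.2f} m traveled at 60 mph")
```

This is why the budget is split so finely: shaving 10ms off inference directly buys back a quarter-meter of reaction distance at highway speed.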

Perception: What Is Around the Car

The perception system interprets raw sensor data into a structured understanding of the environment. Tasks: (1) Object detection — identify and locate: vehicles (cars, trucks, motorcycles), pedestrians, cyclists, traffic signs, traffic lights, lane markings, road boundaries, and construction zones. For each object: bounding box (2D in image, 3D in world coordinates), class, and confidence score. (2) Depth estimation — from 2D camera images, estimate the 3D distance to every pixel. Tesla uses a multi-camera network that learns depth from stereo geometry (multiple camera views of the same scene) and motion parallax (depth from the car's movement between frames). (3) Occupancy network — Tesla's latest approach: a 3D voxel grid around the car where each voxel is classified as occupied (something is there) or free (safe to drive). This handles arbitrary objects (not just known classes like “car” — a fallen tree or a construction cone is just “occupied space”). (4) Lane and road understanding — detect lane lines, road edges, drivable surface, intersections, and merge points. Predict the road topology (how lanes connect at intersections). (5) Tracking — track objects across frames, assigning consistent IDs to detected objects and predicting their future trajectories (where will this pedestrian be in 2 seconds?).

Neural network architecture: a shared backbone processes all 8 cameras simultaneously, and a transformer-based fusion stage projects multi-scale features into a Bird's-Eye-View (BEV) representation — a top-down view of the world around the car. This BEV representation is the input to planning.
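The tracking task above (consistent IDs across frames) can be sketched with a toy greedy IoU matcher. Production trackers use learned association and motion models; this minimal version, with invented names and thresholds, only illustrates the ID-assignment idea:

```python
# Toy frame-to-frame tracker: match each new detection to the existing
# track with the highest IoU, or open a new track if nothing matches.
from itertools import count

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

class Tracker:
    def __init__(self, iou_threshold=0.3):
        self.tracks = {}          # track id -> last seen box
        self.next_id = count()
        self.threshold = iou_threshold

    def update(self, detections):
        """Assign an ID to each detected box; returns {id: box}."""
        assigned = {}
        unmatched = dict(self.tracks)
        for box in detections:
            best = max(unmatched, key=lambda t: iou(unmatched[t], box),
                       default=None)
            if best is not None and iou(unmatched[best], box) >= self.threshold:
                assigned[best] = box       # same object, same ID
                del unmatched[best]
            else:
                assigned[next(self.next_id)] = box  # new object enters scene
        self.tracks = assigned
        return assigned
```

A car that moves a few pixels between frames overlaps heavily with its previous box, so it keeps its ID; the stable ID is what lets the planner accumulate a velocity estimate and predict the object's trajectory.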

Path Planning: What Should the Car Do

Given the perception output (where objects are, where lanes are, what traffic signals say), the planner decides the car's trajectory. Two levels: (1) Route planning (high-level) — navigate from A to B using a road map. Similar to Google Maps routing. Runs once per trip or when the route needs updating (missed turn, road closure). (2) Motion planning (low-level) — plan the exact trajectory for the next 5-10 seconds, updated 10x per second. Must account for: other vehicles (maintain safe following distance, avoid collisions), traffic rules (stop at red lights, yield at intersections, speed limits), road geometry (curves, lane changes, merges), pedestrians and cyclists (predict their movement, give space), and comfort (smooth acceleration, gentle turns — passengers should not feel jerky movements).

Planning approaches: (1) Rule-based — hand-coded rules: “if the car ahead brakes, brake proportionally.” Simple and interpretable but cannot handle every scenario (10,000+ edge cases). (2) Optimization-based — define a cost function: minimize (deviation from desired speed + proximity to objects + jerk + lane deviation). Search for the trajectory with minimum cost. Handles complex scenarios, but the cost function is hard to design for all cases. (3) ML-based (Tesla FSD v12+) — an end-to-end neural network that maps perception to control outputs (steering angle, acceleration, braking). Trained on millions of miles of human driving data. Handles edge cases naturally (whatever a human would do, the model learns). Risk: neural networks are black boxes — it is hard to guarantee safety.
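The optimization-based approach can be made concrete with a toy cost function. This is a minimal sketch with invented weights and terms (real planners have many more, and search over far richer trajectory spaces): score each candidate trajectory and pick the cheapest.

```python
# Toy optimization-based motion planner: weighted cost over speed deviation,
# obstacle proximity, and jerk, following the cost terms named in the text.

def trajectory_cost(traj, desired_speed, obstacles,
                    w_speed=1.0, w_obstacle=10.0, w_jerk=0.5):
    """traj: list of (x, y, v) waypoints; obstacles: list of (x, y)."""
    speed_cost = sum((v - desired_speed) ** 2 for _, _, v in traj)

    # Penalize waypoints that pass within 5 m of any obstacle.
    obstacle_cost = 0.0
    for x, y, _ in traj:
        for ox, oy in obstacles:
            d = ((x - ox) ** 2 + (y - oy) ** 2) ** 0.5
            obstacle_cost += max(0.0, 5.0 - d)

    # Jerk ~ change in acceleration, approximated from waypoint speeds.
    speeds = [v for _, _, v in traj]
    accels = [b - a for a, b in zip(speeds, speeds[1:])]
    jerk_cost = sum((b - a) ** 2 for a, b in zip(accels, accels[1:]))

    return w_speed * speed_cost + w_obstacle * obstacle_cost + w_jerk * jerk_cost

def plan(candidates, desired_speed, obstacles):
    """Pick the minimum-cost candidate trajectory."""
    return min(candidates, key=lambda t: trajectory_cost(t, desired_speed, obstacles))
```

The difficulty the text mentions is visible even here: the weights encode trade-offs (is a close pass worse than hard braking?), and no single set of weights is right for every scenario.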

Safety and Redundancy

Autonomous driving is safety-critical: failures can kill. Safety architecture: (1) Dual redundant compute — two independent chips. If one fails, the other takes over. Each can independently control the car. (2) Watchdog systems — independent simple processors monitor the main system. If the main system fails to produce a control output within 50ms: the watchdog takes over (applies brakes gradually, activates hazard lights). (3) Graceful degradation — if perception is degraded (camera obscured, heavy rain): reduce speed, increase following distance, and alert the driver to take over. Do not continue at highway speed with degraded sensing. (4) Operational Design Domain (ODD) — define where the system works: highways only (Level 2), city streets (Level 4), or everywhere (Level 5). Outside the ODD: the system must refuse to engage or hand off to the human driver. (5) Driver monitoring — a camera watches the driver's face. Detect: eyes off the road, drowsiness, and hands off the wheel. Alert with increasing urgency: audio chime -> steering wheel vibration -> automatic slowdown and stop. (6) Validation — billions of miles of simulation testing. Shadow mode: the autopilot runs in parallel with a human driver. Compare: would the autopilot have made the same decision? Log disagreements for analysis.

Hardware failures: steering motors, brake actuators, and sensors can fail. Redundant steering (dual motors). Redundant braking (electric + hydraulic). Sensor degradation detection (compare left and right cameras — if one differs significantly, it may be obscured).

Training Infrastructure

Training autonomous driving models requires massive data and compute: (1) Data collection — Tesla's fleet (millions of cars) continuously collects driving data. Auto-labeling: the production model's predictions provide initial labels. Interesting scenarios (near-collisions, unusual objects, disengagements) are flagged and uploaded for human review and manual labeling. Scale: petabytes of driving video per day. Only a fraction is used for training (selected for diversity and difficulty). (2) Compute — the Tesla Dojo supercomputer: custom-designed for video training. ExaFLOP-scale compute. Training a new FSD version: weeks on thousands of accelerators. (3) Simulation — synthetic data generation. Render driving scenarios with: varying weather (rain, snow, fog, night), diverse road types (highway, urban, rural), and rare events (pedestrian jaywalking, debris in road, emergency vehicle). Simulation enables training on dangerous scenarios without real-world risk. (4) Over-the-air updates — new model versions are deployed to the fleet via OTA updates. Shadow mode first (the new model runs but does not control the car — compare with the current model). If the new model is better (fewer predicted disengagements, smoother trajectories): gradually roll out to production. Canary: 1% of fleet, then 10%, then 100%. (5) Continual learning — the model improves continuously from fleet data. Edge cases discovered by one car improve the model for all cars. This is the fundamental advantage of a large fleet: more data -> better model -> safer driving -> more customer trust -> larger fleet -> more data (flywheel effect).
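The shadow-mode evaluation in point (4) can be sketched as a disagreement logger. All names, tolerances, and the control-output format here are illustrative assumptions, not Tesla's actual interface:

```python
# Shadow-mode sketch: run a candidate model alongside the incumbent,
# compare their control decisions per frame, and collect meaningful
# disagreements for offline review.

def shadow_compare(frames, incumbent, candidate,
                   steer_tol=2.0, accel_tol=0.5):
    """Return frames where the two models disagree beyond tolerance.

    incumbent/candidate: functions mapping a frame -> (steer_deg, accel_mps2).
    """
    disagreements = []
    for frame in frames:
        steer_a, accel_a = incumbent(frame)
        steer_b, accel_b = candidate(frame)
        if abs(steer_a - steer_b) > steer_tol or abs(accel_a - accel_b) > accel_tol:
            disagreements.append({
                "frame": frame,
                "incumbent": (steer_a, accel_a),
                "candidate": (steer_b, accel_b),
            })
    return disagreements  # flagged for upload and human review
```

This is the same mechanism that drives the data flywheel: disagreements are exactly the "interesting scenarios" worth uploading, labeling, and folding back into training.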
