Design Google Maps / Navigation System

Design a navigation system like Google Maps. This problem combines geospatial data at scale, graph algorithms, real-time data ingestion, and distributed systems — more moving parts than most system design questions. Interviewers use it to see how you decompose a complex product into discrete systems.

Requirements Clarification

  • Core features: Display a map, search for a location, get directions between two points, turn-by-turn navigation with rerouting, ETA estimates.
  • Scale: 1 billion users, 100M daily active navigating, 20M driver location updates/sec (like Waze with crowdsourced data).
  • Map scope: Global road network, points of interest (restaurants, gas stations), real-time traffic.
  • Latency: Route computation < 2 seconds for cross-country route. Map tile loads < 100ms.

The Four Sub-Systems

Google Maps is really four separate systems that share data:

  1. Map tile service — serving the visual map
  2. Search / geocoding — converting addresses and queries to coordinates
  3. Routing engine — finding the optimal path through the road graph
  4. Navigation service — real-time guidance and rerouting

Pick one to go deep on in the interview unless told otherwise. Most interviewers focus on routing and navigation.

Map Tiles

The map you see is a grid of image tiles. At zoom level 0, one 256×256 tile covers the whole world. Each zoom level quadruples the number of tiles (2× in each dimension). Zoom level 20 (street level) has 2^40 ≈ 1 trillion tiles, most of which are ocean or unpopulated land.

Zoom 0:  1 tile   — entire world
Zoom 1:  4 tiles  — hemispheres
Zoom 5:  1,024    — countries
Zoom 10: 1M       — cities
Zoom 15: 1B       — neighborhoods
Zoom 20: 1T       — individual buildings
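The tile addressing scheme above is the standard Web Mercator "slippy map" convention: a (lat, lng, zoom) triple maps to integer tile coordinates. A minimal sketch:

```python
import math

def latlng_to_tile(lat_deg: float, lng_deg: float, zoom: int):
    """Convert WGS84 lat/lng to (x, y) tile indices at a zoom level
    using the standard Web Mercator slippy-map scheme."""
    n = 2 ** zoom  # tiles per axis; total tiles at this zoom = n * n = 4**zoom
    x = int((lng_deg + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat_deg)
    # Mercator projection compresses latitude via asinh(tan(lat))
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y
```

This is why each zoom level quadruples the tile count: `n` doubles per axis, so the total is `4**zoom`.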

Vector tiles vs raster tiles:

  • Raster tiles (PNG/JPEG images): pre-rendered on the server. Simple to serve. Can’t change style on client. Fixed resolution — blurry when zoomed between levels.
  • Vector tiles (protobuf-encoded geometry): raw road/polygon data sent to client. Client renders using WebGL/Metal. Style changes (dark mode, accessibility) without new server requests. Smooth zoom. Google Maps and Mapbox both use vector tiles.

Tiles are pre-generated offline (hours-long pipeline) and served from CDN edge nodes. A user in Tokyo gets Tokyo tiles from Tokyo CDN, not from a US data center. Cache hit rate for popular city tiles approaches 100%.

Road Network as a Graph

The road network is a directed weighted graph:

  • Nodes: Intersections and road endpoints. ~400M globally for OSM data.
  • Edges: Road segments between intersections. Directed (one-way streets). ~900M globally.
  • Edge weights: Travel time (not distance). Updated in real time by traffic data.

Node: {node_id, lat, lng}
Edge: {from_node, to_node, distance_m, speed_limit_kmh, road_type, current_travel_time_s}

This graph is too large to serve efficiently from a single machine: nodes alone take ~40GB (400M × ~100 bytes), and the 900M edges plus road geometry and precomputed routing data multiply that footprint several times over. It’s partitioned into geographic tiles and loaded on demand by routing servers.
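The node/edge schema above maps naturally onto an adjacency-list structure. A minimal sketch (class and field names are illustrative, not Google's):

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    to_node: int
    distance_m: float
    speed_limit_kmh: float
    road_type: str
    current_travel_time_s: float  # updated continuously by the traffic pipeline

@dataclass
class RoadGraph:
    coords: dict = field(default_factory=dict)  # node_id -> (lat, lng)
    adj: dict = field(default_factory=dict)     # node_id -> outgoing edges

    def add_node(self, node_id: int, lat: float, lng: float):
        self.coords[node_id] = (lat, lng)
        self.adj.setdefault(node_id, [])

    def add_edge(self, frm: int, to: int, distance_m: float,
                 speed_kmh: float, road_type: str = "residential"):
        # Directed edge (one-way streets get a single edge); the initial
        # weight is free-flow travel time at the speed limit.
        t = distance_m / (speed_kmh / 3.6)
        self.adj[frm].append(Edge(to, distance_m, speed_kmh, road_type, t))
```

Per-tile subgraphs of this shape are what a routing server loads on demand.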

Routing: Finding the Shortest Path

Dijkstra’s algorithm finds shortest paths in a weighted graph, but exploring 400M nodes for a cross-country route takes minutes. Several optimizations make it fast enough:

A* search: Adds a heuristic — the straight-line (Haversine) distance to the destination. Nodes are explored in order of (current cost + estimated remaining cost). This directs the search toward the destination rather than expanding in all directions, reducing nodes explored by ~10×.

import heapq
from math import inf

# Fastest plausible road speed; dividing straight-line metres by this keeps
# the heuristic admissible when edge weights are travel times in seconds.
MAX_SPEED_MPS = 130 / 3.6

def a_star(graph, start, end):
    open_set = [(0, start)]   # (f_score, node) min-heap
    came_from = {}            # needed to reconstruct the path at the end
    g_score = {start: 0}

    while open_set:
        _, current = heapq.heappop(open_set)
        if current == end:
            return reconstruct_path(came_from, end)

        for neighbor, edge_weight in graph[current]:
            tentative_g = g_score[current] + edge_weight
            if tentative_g < g_score.get(neighbor, inf):
                g_score[neighbor] = tentative_g
                came_from[neighbor] = current
                h = haversine(graph.coords[neighbor], graph.coords[end]) / MAX_SPEED_MPS
                heapq.heappush(open_set, (tentative_g + h, neighbor))
    return None  # destination unreachable

def reconstruct_path(came_from, node):
    path = [node]
    while node in came_from:
        node = came_from[node]
        path.append(node)
    return path[::-1]
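The `haversine` heuristic referenced above is the great-circle distance between two coordinates. A self-contained version:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000

def haversine(a, b):
    """Great-circle distance in metres between two (lat, lng) points."""
    lat1, lng1 = map(radians, a)
    lat2, lng2 = map(radians, b)
    dlat, dlng = lat2 - lat1, lng2 - lng1
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))
```

It never overestimates the true road distance, which is exactly the property A* needs from its heuristic.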

Bidirectional A*: Run A* simultaneously from start toward end AND from end toward start. When the two frontiers meet, you’ve found the shortest path. Reduces explored nodes by another ~10× — critical for long routes.

Contraction Hierarchies (CH): The technique Google actually uses. Precompute a “highway hierarchy” — identify that some nodes (highway junctions) are more important than others (local streets). Shortcuts are added between important nodes that bypass less-important ones. At query time, only important nodes are explored, reducing the search space from millions to thousands of nodes. Route computation drops to milliseconds. CH preprocessing takes hours on the full graph but only needs to be rerun when the graph changes significantly.
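The core CH preprocessing step can be illustrated on a toy graph. This is a simplified sketch of contracting a single node, not a production implementation: a shortcut u→w is added only when no "witness" path avoiding v is at least as short.

```python
import heapq
from math import inf

def dijkstra_without(adj, src, dst, skip, limit):
    """Shortest src->dst distance ignoring node `skip` (the witness search);
    gives up once distances exceed `limit`."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, inf) or d > limit:
            continue
        for v, w in adj.get(u, []):
            if v == skip:
                continue
            nd = d + w
            if nd < dist.get(v, inf):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return inf

def contract(adj, v):
    """Contract node v: for every in-neighbour u and out-neighbour w,
    add shortcut u->w unless a witness path avoiding v is as short."""
    in_edges = [(u, w_uv) for u, edges in adj.items()
                for t, w_uv in edges if t == v]
    shortcuts = []
    for u, w_uv in in_edges:
        for w, w_vw in adj.get(v, []):
            if u == w:
                continue
            via_v = w_uv + w_vw
            if dijkstra_without(adj, u, w, v, via_v) > via_v:
                shortcuts.append((u, w, via_v))
    for u, w, weight in shortcuts:
        adj.setdefault(u, []).append((w, weight))
    return shortcuts
```

Repeating this for every node in importance order yields the hierarchy; queries then run a bidirectional search that only relaxes edges toward more important nodes.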

Real-Time Traffic Integration

Traffic data sources:

  • Crowdsourced GPS: Phones running Google Maps/Waze report their location and speed every few seconds. 20M updates/sec. These are anonymized and aggregated per road segment.
  • Historical patterns: ML model trained on years of traffic data. “Tuesday at 5pm on I-405 near LAX” is predictably slow. Used for ETA when no real-time data is available.
  • Incident reports: User-reported accidents, construction, police — injected as temporary high-cost edges.

Edge weights are updated continuously. The routing engine periodically re-reads updated weights (every ~30 seconds). Routes are re-computed with fresh weights for new navigation requests.

Traffic aggregation pipeline:

Phone GPS updates → [Kafka] → Stream Processor (Flink)
  → Map-match GPS points to road segments (hidden Markov model)
  → Aggregate speed per segment (rolling 5-min window)
  → Update edge weight in road graph (Redis / in-memory graph store)
  → Routing servers pick up updated weights
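The per-segment aggregation step can be sketched in plain Python (the real pipeline runs in Flink; this single-process version just shows the rolling-window logic, with illustrative names):

```python
import time
from collections import defaultdict, deque

WINDOW_S = 300  # rolling 5-minute window per road segment

class SegmentSpeedAggregator:
    def __init__(self):
        # segment_id -> deque of (timestamp, speed_kmh) observations
        self.obs = defaultdict(deque)

    def record(self, segment_id, speed_kmh, ts=None):
        ts = ts if ts is not None else time.time()
        self.obs[segment_id].append((ts, speed_kmh))

    def current_speed(self, segment_id, now=None):
        """Mean speed over the window, or None when there is no recent
        data (callers then fall back to historical speeds)."""
        now = now if now is not None else time.time()
        q = self.obs[segment_id]
        while q and q[0][0] < now - WINDOW_S:
            q.popleft()  # evict observations older than the window
        if not q:
            return None
        return sum(s for _, s in q) / len(q)
```

The output speed is what gets written into `current_travel_time_s` on the corresponding graph edge.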

Turn-by-Turn Navigation

Once a route is computed, the navigation service monitors the user’s position and provides instructions:

  1. Phone reports GPS position every second over WebSocket
  2. Navigation service map-matches position to the route
  3. If position deviates from route by >50m for >5 seconds: trigger reroute
  4. Reroute: run routing algorithm from current position to destination with fresh traffic data
  5. Push new route to phone via WebSocket or APNs/FCM

Rerouting must be fast — under 1 second perceived latency. With Contraction Hierarchies, computing a route for a city-level trip takes <10ms server-side, leaving budget for network and rendering.
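The "deviates by >50m for >5 seconds" trigger from step 3 can be sketched as a small state machine (thresholds from the text; class and method names are illustrative):

```python
REROUTE_DISTANCE_M = 50   # deviation threshold from the planned route
REROUTE_DURATION_S = 5    # must stay off-route this long before rerouting

class DeviationMonitor:
    """Tracks how long the vehicle has been off the planned route;
    distance_to_route_m comes from map-matching each GPS fix against
    the route geometry."""
    def __init__(self):
        self.off_route_since = None

    def update(self, distance_to_route_m, ts):
        if distance_to_route_m <= REROUTE_DISTANCE_M:
            self.off_route_since = None   # back on route; reset the timer
            return False
        if self.off_route_since is None:
            self.off_route_since = ts     # first off-route observation
        # Trigger a reroute once the deviation has persisted long enough
        return ts - self.off_route_since > REROUTE_DURATION_S
```

The time hysteresis matters: a single noisy GPS fix in an urban canyon should not trigger a full reroute.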

ETA Estimation

ETA = sum of travel times along all route edges. Edge travel time uses:

  • Current traffic speed (if available)
  • Historical speed for this segment at this time of day / day of week
  • Road type baseline (highway, arterial, residential)
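The fallback chain above can be sketched as a per-edge speed selection followed by a simple sum (function and field names are illustrative):

```python
def edge_travel_time_s(distance_m, live_speed_kmh=None,
                       historical_speed_kmh=None, baseline_speed_kmh=50.0):
    """Pick the best available speed estimate for one edge:
    live traffic, else the historical profile, else a road-type baseline."""
    speed = live_speed_kmh or historical_speed_kmh or baseline_speed_kmh
    return distance_m / (speed / 3.6)  # km/h -> m/s

def eta_s(edges):
    """ETA = sum of per-edge travel times; each edge is a dict with the
    fields above."""
    return sum(edge_travel_time_s(**e) for e in edges)
```

For example, a 1km edge with live traffic at 36 km/h contributes 100s, and a 900m edge with only a historical speed of 54 km/h contributes 60s.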

Google’s ETA model also factors in traffic light timing (green wave effects), turn delays, and merge delays at highway on-ramps. DeepMind and Google Maps have reported that a graph-neural-network ETA model reduced negative ETA outcomes (significant underestimates) by up to 40% in some cities compared with the prior graph-based approach.

Scale Considerations

  • Routing servers: Stateless, can scale horizontally. Each holds the full road graph for its region in memory. Route requests are dispatched to the server holding the relevant region.
  • GPS update ingest: 20M updates/sec → Kafka cluster. Each update is ~50 bytes → 1GB/sec ingest throughput.
  • Map tile CDN: 99%+ cache hit for popular areas. Long-tail tiles (middle of ocean) are generated on-demand and cached.

Interview Follow-ups

  • How does the system handle a major road closure or a new road being built? What’s the pipeline to update the graph?
  • How would you add multi-modal routing — walk to subway, take train, walk to destination?
  • How do you prevent the entire city from being routed down the same “fastest” street, causing it to become slow?
  • How do offline maps work — what data is downloaded and how is routing done without a connection?
  • How would you design the ETA feature for ridesharing — “driver arrives in 4 min”?

Related System Design Topics

  • Design a Proximity Service — Geohash and QuadTree for static POI search; Google Maps uses the same primitives for business search
  • Design a Ride-sharing App — real-time driver location tracking and ETA computation build on the same geo-indexing and routing infrastructure
  • Caching Strategies — map tile CDN caching at all zoom levels; routing result caching for popular origin-destination pairs
  • Message Queues — Kafka pipeline for ingesting 20M GPS updates/sec into the traffic aggregation stream processor
  • Consistent Hashing — distributing regional road graph segments across routing server nodes