Spatial Index Low-Level Design: R-Tree, Geohash Grid, and Range Query Optimization

Why B-Trees Cannot Index 2D Space

A B-tree sorts data along a single dimension. Given a point (lat, lng), there is no single linear ordering that keeps spatially nearby points adjacent in the index. If you sort by latitude, then two points at lat=40 but lng=-73 and lng=120 are adjacent in the index but thousands of kilometers apart. A range query “find all points within 10km of (40.7, -74.0)” translates to a two-dimensional bounding box, which a one-dimensional B-tree cannot evaluate without scanning every row within the latitude range and filtering on longitude.

Spatial indexes solve this by organizing data in multi-dimensional space, allowing range queries to prune large portions of the search space without scanning every point.

R-Tree Structure

An R-tree is a balanced tree where every node stores a Minimum Bounding Rectangle (MBR) — the smallest axis-aligned rectangle that encloses all geometries in its subtree. Leaf nodes store individual geometries plus their MBRs. Internal nodes store the MBR of their children's MBRs.

The key insight: if a query bounding box does not intersect a node's MBR, no geometry in that subtree can possibly match the query. The search prunes entire subtrees early without examining individual geometries.

R-tree properties:

  • All leaves are at the same depth (balanced)
  • Each node except the root has between m and M entries (m ≥ ceil(M/2))
  • The root has at least 2 entries unless it is a leaf

To find all geometries intersecting a query rectangle Q:

  1. At the root, check each child MBR. If it intersects Q, recurse into that child.
  2. At each internal node, prune children whose MBRs do not intersect Q.
  3. At leaf nodes, check each geometry's MBR against Q. If it intersects, add to candidates.
  4. For exact geometric predicates (not just MBR), perform the precise geometry intersection check on each candidate.

This filter-refinement approach avoids precise geometry computation on the vast majority of indexed shapes.

R-Tree Insert and Split Algorithms

To insert a geometry G:

  1. Descend from root, at each level choosing the child whose MBR requires the smallest area increase to accommodate G.
  2. Insert G into the chosen leaf.
  3. If the leaf overflows (more than M entries), split it. Splitting algorithms:
  • Linear split — O(M), low quality: find the two seeds with maximum separation along each axis, assign remaining entries to the nearer seed.
  • Quadratic split — O(M²), better quality: find the pair of entries that would waste the most space if grouped together; use as seeds.
  • R*-tree split — O(M log M), best quality: sort entries along each axis, evaluate split points by combined area and overlap; also introduces forced reinsert (remove a fraction of entries from an overflowing node and reinsert them, often avoiding the split entirely).

R*-Tree and GiST

The R*-tree is the preferred R-tree variant in practice. Its forced reinsert strategy reduces MBR overlap between sibling nodes, which is the main cause of R-tree query inefficiency (high overlap means more subtrees must be explored per query). PostgreSQL implements spatial indexing via GiST (Generalized Search Tree) — a pluggable index framework. The PostGIS extension provides a GiST operator class for the GEOMETRY type that implements R-tree semantics using the GiST API. Queries using && (MBR overlap), ~ (contains), and @ (within) operators use the GiST index automatically.

Geohash Grid

A geohash encodes a (lat, lng) coordinate as a string by interleaving the binary representations of latitude and longitude. The string prefix represents a rectangular cell; longer prefixes represent smaller, more precise cells. Cells that share a prefix are geographically adjacent (with some edge-case exceptions at cell boundaries).

Geohash is simpler than an R-tree and works naturally with a B-tree index on the geohash string. A bounding box query translates to a set of geohash prefixes covering the bounding box. The main limitation: geohash cells are rectangular and do not align with arbitrary query shapes; queries near cell boundaries may require checking multiple prefix ranges.

Geohash is widely used for approximate proximity queries (e.g., “find all restaurants with geohash prefix dr5ru”) in systems where a simple B-tree index is preferred over a specialized spatial index.

Spatial Join

A spatial join finds all pairs (A, B) where geometry A intersects geometry B. Naive nested-loop join is O(|A| × |B|). The standard approach uses an index:

  1. Filter phase: for each geometry in A, use the R-tree on B to find candidates whose MBR intersects A's MBR.
  2. Refinement phase: for each candidate pair, perform precise geometric intersection (PostGIS uses GEOS library for this).

The spatial join cost is O(|A| log |B|) for the index lookup phase plus O(|candidates| × geometry_cost) for refinement.

SQL Data Model

-- Spatial objects table (requires PostGIS extension)
CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE SpatialObject (
    id          BIGSERIAL PRIMARY KEY,
    geom        GEOMETRY(Geometry, 4326) NOT NULL,  -- WGS84 coordinate system
    geom_type   VARCHAR(64) NOT NULL,               -- Point, LineString, Polygon, etc.
    properties  JSONB,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- GiST spatial index (R-tree semantics)
CREATE INDEX idx_spatialobject_geom ON SpatialObject USING GIST(geom);

-- Bounding box filter table for pre-filtering
CREATE TABLE BoundingBox (
    id          BIGSERIAL PRIMARY KEY,
    object_id   BIGINT REFERENCES SpatialObject(id),
    min_lat     DOUBLE PRECISION NOT NULL,
    min_lng     DOUBLE PRECISION NOT NULL,
    max_lat     DOUBLE PRECISION NOT NULL,
    max_lng     DOUBLE PRECISION NOT NULL
);

CREATE INDEX idx_bbox_coords ON BoundingBox(min_lat, max_lat, min_lng, max_lng);

Python Implementation Sketch

from dataclasses import dataclass, field
from typing import Optional
import math

@dataclass
class BBox:
    min_x: float; min_y: float
    max_x: float; max_y: float

    def intersects(self, other: "BBox") -> bool:
        return (self.min_x = other.min_x and
                self.min_y = other.min_y)

    def area(self) -> float:
        return (self.max_x - self.min_x) * (self.max_y - self.min_y)

    def expand_to_include(self, other: "BBox") -> "BBox":
        return BBox(min(self.min_x, other.min_x), min(self.min_y, other.min_y),
                    max(self.max_x, other.max_x), max(self.max_y, other.max_y))

@dataclass
class RTreeEntry:
    mbr: BBox
    data: object = None         # geometry or properties for leaf entries
    children: list = field(default_factory=list)  # child RTreeEntry for internal nodes

    @property
    def is_leaf(self) -> bool:
        return len(self.children) == 0 and self.data is not None

class RTree:
    MAX_ENTRIES = 8
    MIN_ENTRIES = 3

    def __init__(self):
        self.root: Optional[RTreeEntry] = None

    def insert_geometry(self, geom, properties: dict) -> None:
        """Insert a geometry with its bounding box into the R-tree."""
        mbr = self._compute_mbr(geom)
        entry = RTreeEntry(mbr=mbr, data={"geom": geom, "properties": properties})
        if self.root is None:
            self.root = RTreeEntry(mbr=mbr, children=[entry])
        else:
            self._insert(self.root, entry, depth=0)

    def _compute_mbr(self, geom) -> BBox:
        """Compute minimum bounding rectangle for a geometry."""
        # Placeholder: in practice, iterate coordinates
        if isinstance(geom, tuple) and len(geom) == 2:
            x, y = geom
            return BBox(x, y, x, y)
        return BBox(0, 0, 0, 0)

    def range_query(self, bbox: BBox) -> list:
        """Return all geometries whose MBR intersects bbox."""
        results = []
        if self.root:
            self._search(self.root, bbox, results)
        return results

    def _search(self, node: RTreeEntry, bbox: BBox, results: list) -> None:
        if not node.mbr.intersects(bbox):
            return
        if node.is_leaf:
            results.append(node.data)
            return
        for child in node.children:
            self._search(child, bbox, results)

    def nearest_neighbor_query(self, point: tuple, k: int) -> list:
        """Return k nearest geometries to point using branch-and-bound."""
        # Placeholder: priority queue ordered by min distance to node MBR
        return []

    def _insert(self, node: RTreeEntry, entry: RTreeEntry, depth: int) -> None:
        """Recursive insert — choose subtree with min MBR enlargement."""
        if not node.children or node.children[0].is_leaf:
            node.children.append(entry)
            node.mbr = node.mbr.expand_to_include(entry.mbr)
            # Handle overflow: split if needed
            if len(node.children) > self.MAX_ENTRIES:
                pass  # split logic omitted for brevity
        else:
            best = min(node.children,
                       key=lambda c: c.mbr.expand_to_include(entry.mbr).area() - c.mbr.area())
            self._insert(best, entry, depth + 1)
            node.mbr = node.mbr.expand_to_include(entry.mbr)

def spatial_join(layer_a: list, layer_b: list, predicate: str = "intersects") -> list:
    """Naive filter phase spatial join. Replace inner loop with R-tree lookup."""
    results = []
    for a in layer_a:
        for b in layer_b:
            a_bbox = BBox(*a["bbox"])
            b_bbox = BBox(*b["bbox"])
            if a_bbox.intersects(b_bbox):
                results.append((a, b))
    return results

def encode_geohash(lat: float, lng: float, precision: int = 8) -> str:
    """Encode lat/lng as a geohash string of given precision."""
    BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"
    lat_range, lng_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, bit_count, geohash = 0, 0, []
    is_lng = True
    while len(geohash) = mid:
                bits = (bits << 1) | 1; lng_range[0] = mid
            else:
                bits = bits <= mid:
                bits = (bits << 1) | 1; lat_range[0] = mid
            else:
                bits = bits << 1; lat_range[1] = mid
        is_lng = not is_lng
        bit_count += 1
        if bit_count == 5:
            geohash.append(BASE32[bits]); bits, bit_count = 0, 0
    return "".join(geohash)

Frequently Asked Questions

R-Tree vs Geohash

R-tree supports arbitrary geometry types and handles queries of any shape efficiently. Geohash uses a B-tree on encoded strings — simpler to implement but struggles near cell boundaries and does not handle non-point geometries well. Use R-tree (GiST in PostgreSQL) for production spatial queries; geohash for simple proximity lookups where approximate results are acceptable.

MBR Overlap Problem in R-Tree

When sibling MBRs overlap, a query intersecting both must explore both subtrees. In the worst case all MBRs overlap and the tree degenerates to a linear scan. R*-tree's forced reinsert strategy minimizes overlap by reinserting overflow entries rather than immediately splitting.

GiST Operator Class

GiST separates tree traversal from domain key comparison. An operator class provides consistent(), union(), compress(), penalty(), and picksplit(). PostGIS implements these for GEOMETRY, enabling any query using &&, @, or ~ operators to use the R-tree GiST index automatically.

Spatial Join Complexity

Naive nested-loop join: O(|A| × |B|). With an R-tree index on B: O(|A| log |B| + candidates). PostGIS executes this as an index-accelerated join via ST_Intersects when a GiST index exists on the geometry columns.

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering

See also: Uber Interview Guide 2026: Dispatch Systems, Geospatial Algorithms, and Marketplace Engineering

Scroll to Top