Spatial Index Low-Level Design: R-Tree, Geohash Grid, and Range Query Optimization

Why B-Trees Cannot Index 2D Space

A B-tree sorts data along a single dimension. Given a point (lat, lng), there is no single linear ordering that keeps spatially nearby points adjacent in the index. If you sort by latitude, then two points at lat=40 but lng=-73 and lng=120 are adjacent in the index but thousands of kilometers apart. A range query “find all points within 10km of (40.7, -74.0)” translates to a two-dimensional bounding box, which a one-dimensional B-tree cannot evaluate without scanning every row within the latitude range and filtering on longitude.

Spatial indexes solve this by organizing data in multi-dimensional space, allowing range queries to prune large portions of the search space without scanning every point.

R-Tree Structure

An R-tree is a balanced tree where every node stores a Minimum Bounding Rectangle (MBR) — the smallest axis-aligned rectangle that encloses all geometries in its subtree. Leaf nodes store individual geometries plus their MBRs. Internal nodes store the MBR of their children's MBRs.

The key insight: if a query bounding box does not intersect a node's MBR, no geometry in that subtree can possibly match the query. The search prunes entire subtrees early without examining individual geometries.

R-tree properties:

All leaves are at the same depth (balanced)
Each node except the root has between m and M entries (m ≥ ceil(M/2))
The root has at least 2 entries unless it is a leaf

R-Tree Search

To find all geometries intersecting a query rectangle Q:

At the root, check each child MBR. If it intersects Q, recurse into that child.
At each internal node, prune children whose MBRs do not intersect Q.
At leaf nodes, check each geometry's MBR against Q. If it intersects, add to candidates.
For exact geometric predicates (not just MBR), perform the precise geometry intersection check on each candidate.

This filter-refinement approach avoids precise geometry computation on the vast majority of indexed shapes.

R-Tree Insert and Split Algorithms

To insert a geometry G:

Descend from root, at each level choosing the child whose MBR requires the smallest area increase to accommodate G.
Insert G into the chosen leaf.
If the leaf overflows (more than M entries), split it. Splitting algorithms:

Linear split — O(M), low quality: find the two seeds with maximum separation along each axis, assign remaining entries to the nearer seed.
Quadratic split — O(M²), better quality: find the pair of entries that would waste the most space if grouped together; use as seeds.
R*-tree split — O(M log M), best quality: sort entries along each axis, evaluate split points by combined area and overlap; also introduces forced reinsert (remove a fraction of entries from an overflowing node and reinsert them, often avoiding the split entirely).

R*-Tree and GiST

The R*-tree is the preferred R-tree variant in practice. Its forced reinsert strategy reduces MBR overlap between sibling nodes, which is the main cause of R-tree query inefficiency (high overlap means more subtrees must be explored per query). PostgreSQL implements spatial indexing via GiST (Generalized Search Tree) — a pluggable index framework. The PostGIS extension provides a GiST operator class for the GEOMETRY type that implements R-tree semantics using the GiST API. Queries using && (MBR overlap), ~ (contains), and @ (within) operators use the GiST index automatically.

Geohash Grid

A geohash encodes a (lat, lng) coordinate as a string by interleaving the binary representations of latitude and longitude. The string prefix represents a rectangular cell; longer prefixes represent smaller, more precise cells. Cells that share a prefix are geographically adjacent (with some edge-case exceptions at cell boundaries).

Geohash is simpler than an R-tree and works naturally with a B-tree index on the geohash string. A bounding box query translates to a set of geohash prefixes covering the bounding box. The main limitation: geohash cells are rectangular and do not align with arbitrary query shapes; queries near cell boundaries may require checking multiple prefix ranges.

Geohash is widely used for approximate proximity queries (e.g., “find all restaurants with geohash prefix dr5ru”) in systems where a simple B-tree index is preferred over a specialized spatial index.

Spatial Join

A spatial join finds all pairs (A, B) where geometry A intersects geometry B. Naive nested-loop join is O(|A| × |B|). The standard approach uses an index:

Filter phase: for each geometry in A, use the R-tree on B to find candidates whose MBR intersects A's MBR.
Refinement phase: for each candidate pair, perform precise geometric intersection (PostGIS uses GEOS library for this).

The spatial join cost is O(|A| log |B|) for the index lookup phase plus O(|candidates| × geometry_cost) for refinement.

SQL Data Model

-- Spatial objects table (requires PostGIS extension)
CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE SpatialObject (
    id          BIGSERIAL PRIMARY KEY,
    geom        GEOMETRY(Geometry, 4326) NOT NULL,  -- WGS84 coordinate system
    geom_type   VARCHAR(64) NOT NULL,               -- Point, LineString, Polygon, etc.
    properties  JSONB,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- GiST spatial index (R-tree semantics)
CREATE INDEX idx_spatialobject_geom ON SpatialObject USING GIST(geom);

-- Bounding box filter table for pre-filtering
CREATE TABLE BoundingBox (
    id          BIGSERIAL PRIMARY KEY,
    object_id   BIGINT REFERENCES SpatialObject(id),
    min_lat     DOUBLE PRECISION NOT NULL,
    min_lng     DOUBLE PRECISION NOT NULL,
    max_lat     DOUBLE PRECISION NOT NULL,
    max_lng     DOUBLE PRECISION NOT NULL
);

CREATE INDEX idx_bbox_coords ON BoundingBox(min_lat, max_lat, min_lng, max_lng);

Python Implementation Sketch

from dataclasses import dataclass, field
from typing import Optional
import math

@dataclass
class BBox:
    min_x: float; min_y: float
    max_x: float; max_y: float

    def intersects(self, other: "BBox") -> bool:
        return (self.min_x = other.min_x and
                self.min_y = other.min_y)

    def area(self) -> float:
        return (self.max_x - self.min_x) * (self.max_y - self.min_y)

    def expand_to_include(self, other: "BBox") -> "BBox":
        return BBox(min(self.min_x, other.min_x), min(self.min_y, other.min_y),
                    max(self.max_x, other.max_x), max(self.max_y, other.max_y))

@dataclass
class RTreeEntry:
    mbr: BBox
    data: object = None         # geometry or properties for leaf entries
    children: list = field(default_factory=list)  # child RTreeEntry for internal nodes

    @property
    def is_leaf(self) -> bool:
        return len(self.children) == 0 and self.data is not None

class RTree:
    MAX_ENTRIES = 8
    MIN_ENTRIES = 3

    def __init__(self):
        self.root: Optional[RTreeEntry] = None

    def insert_geometry(self, geom, properties: dict) -> None:
        """Insert a geometry with its bounding box into the R-tree."""
        mbr = self._compute_mbr(geom)
        entry = RTreeEntry(mbr=mbr, data={"geom": geom, "properties": properties})
        if self.root is None:
            self.root = RTreeEntry(mbr=mbr, children=[entry])
        else:
            self._insert(self.root, entry, depth=0)

    def _compute_mbr(self, geom) -> BBox:
        """Compute minimum bounding rectangle for a geometry."""
        # Placeholder: in practice, iterate coordinates
        if isinstance(geom, tuple) and len(geom) == 2:
            x, y = geom
            return BBox(x, y, x, y)
        return BBox(0, 0, 0, 0)

    def range_query(self, bbox: BBox) -> list:
        """Return all geometries whose MBR intersects bbox."""
        results = []
        if self.root:
            self._search(self.root, bbox, results)
        return results

    def _search(self, node: RTreeEntry, bbox: BBox, results: list) -> None:
        if not node.mbr.intersects(bbox):
            return
        if node.is_leaf:
            results.append(node.data)
            return
        for child in node.children:
            self._search(child, bbox, results)

    def nearest_neighbor_query(self, point: tuple, k: int) -> list:
        """Return k nearest geometries to point using branch-and-bound."""
        # Placeholder: priority queue ordered by min distance to node MBR
        return []

    def _insert(self, node: RTreeEntry, entry: RTreeEntry, depth: int) -> None:
        """Recursive insert — choose subtree with min MBR enlargement."""
        if not node.children or node.children[0].is_leaf:
            node.children.append(entry)
            node.mbr = node.mbr.expand_to_include(entry.mbr)
            # Handle overflow: split if needed
            if len(node.children) > self.MAX_ENTRIES:
                pass  # split logic omitted for brevity
        else:
            best = min(node.children,
                       key=lambda c: c.mbr.expand_to_include(entry.mbr).area() - c.mbr.area())
            self._insert(best, entry, depth + 1)
            node.mbr = node.mbr.expand_to_include(entry.mbr)

def spatial_join(layer_a: list, layer_b: list, predicate: str = "intersects") -> list:
    """Naive filter phase spatial join. Replace inner loop with R-tree lookup."""
    results = []
    for a in layer_a:
        for b in layer_b:
            a_bbox = BBox(*a["bbox"])
            b_bbox = BBox(*b["bbox"])
            if a_bbox.intersects(b_bbox):
                results.append((a, b))
    return results

def encode_geohash(lat: float, lng: float, precision: int = 8) -> str:
    """Encode lat/lng as a geohash string of given precision."""
    BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"
    lat_range, lng_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, bit_count, geohash = 0, 0, []
    is_lng = True
    while len(geohash) = mid:
                bits = (bits << 1) | 1; lng_range[0] = mid
            else:
                bits = bits <= mid:
                bits = (bits << 1) | 1; lat_range[0] = mid
            else:
                bits = bits << 1; lat_range[1] = mid
        is_lng = not is_lng
        bit_count += 1
        if bit_count == 5:
            geohash.append(BASE32[bits]); bits, bit_count = 0, 0
    return "".join(geohash)

Frequently Asked Questions

{
“@context”: “https://schema.org”,
“@type”: “FAQPage”,
“mainEntity”: [
{
“@type”: “Question”,
“name”: “What is the difference between an R-tree and a geohash index?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “An R-tree stores minimum bounding rectangles in a tree structure, supports arbitrary geometry types (points, lines, polygons), and handles queries of any shape efficiently. A geohash encodes coordinates as strings with a prefix hierarchy and uses a regular B-tree on those strings. Geohash is simpler and works well for point proximity queries at fixed precision levels, but struggles near cell boundaries (two nearby points can have completely different geohash prefixes), requires querying multiple prefix ranges for bounding boxes, and does not handle non-point geometries well. R-tree is more general and precise; geohash is simpler to implement and understand.”
}
},
{
“@type”: “Question”,
“name”: “Why does high MBR overlap degrade R-tree performance?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “When sibling nodes have overlapping MBRs, a search query that intersects both MBRs must explore both subtrees to find all matches. In the worst case, if all MBRs overlap, every query must visit every node in the tree — degenerating to a linear scan. R*-tree's split algorithm minimizes MBR overlap by evaluating multiple split axes and using forced reinsert (removing overflow entries and reinserting them, which often results in less overlap than splitting). High overlap is the primary reason naive R-tree implementations underperform on non-point geometries.”
}
},
{
“@type”: “Question”,
“name”: “How does the GiST operator class work in PostgreSQL?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “GiST (Generalized Search Tree) is a pluggable index framework in PostgreSQL that separates the tree traversal logic from the domain-specific key comparison logic. An operator class provides implementations of consistent() (does this entry match the query?), union() (compute the MBR of a set of entries), compress()/decompress() (key storage format), penalty() (cost of inserting an entry into a node), and picksplit() (how to split an overflowing node). PostGIS provides a GiST operator class for the GEOMETRY type that implements these functions using R-tree geometry. Any query using spatial operators (&&, @, ~) automatically uses the GiST index.”
}
},
{
“@type”: “Question”,
“name”: “What is the complexity of a spatial join?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “A naive nested-loop spatial join is O(|A| * |B|). Using an R-tree index on B, the filter phase reduces this to O(|A| * log|B| + |candidates|) for the MBR intersection checks, plus O(|candidates| * geometry_cost) for precise geometry intersection in the refinement phase. For most practical geometries, |candidates| is much smaller than |A| * |B|. PostGIS implements this as an index-accelerated join when you write: WHERE ST_Intersects(a.geom, b.geom). The planner uses the GiST index on one or both tables depending on selectivity estimates.”
}
}
]
}

R-Tree vs Geohash

R-tree supports arbitrary geometry types and handles queries of any shape efficiently. Geohash uses a B-tree on encoded strings — simpler to implement but struggles near cell boundaries and does not handle non-point geometries well. Use R-tree (GiST in PostgreSQL) for production spatial queries; geohash for simple proximity lookups where approximate results are acceptable.

MBR Overlap Problem in R-Tree

When sibling MBRs overlap, a query intersecting both must explore both subtrees. In the worst case all MBRs overlap and the tree degenerates to a linear scan. R*-tree's forced reinsert strategy minimizes overlap by reinserting overflow entries rather than immediately splitting.

GiST Operator Class

GiST separates tree traversal from domain key comparison. An operator class provides consistent(), union(), compress(), penalty(), and picksplit(). PostGIS implements these for GEOMETRY, enabling any query using &&, @, or ~ operators to use the R-tree GiST index automatically.

Spatial Join Complexity

Naive nested-loop join: O(|A| × |B|). With an R-tree index on B: O(|A| log |B| + candidates). PostGIS executes this as an index-accelerated join via ST_Intersects when a GiST index exists on the geometry columns.

{
“@context”: “https://schema.org”,
“@type”: “FAQPage”,
“mainEntity”: [
{
“@type”: “Question”,
“name”: “How does an R-tree prune subtrees during a spatial query?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Each internal node stores the MBR (minimum bounding rectangle) of all geometries in its subtree; the query bounding box is tested against each child MBR; subtrees whose MBR does not intersect the query box are pruned, avoiding traversal of irrelevant portions of the tree.”
}
},
{
“@type”: “Question”,
“name”: “How does geohash differ from R-tree for proximity search?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Geohash encodes 2D coordinates into a 1D string, enabling prefix-based range queries on a standard B-tree index; R-tree natively handles 2D bounding box queries with higher precision but requires a specialized index structure; geohash is simpler and works with standard databases.”
}
},
{
“@type”: “Question”,
“name”: “What is the filter-refinement approach in spatial queries?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “The filter step uses the R-tree MBR index to quickly find candidate geometries whose bounding boxes intersect the query; the refinement step computes exact geometric intersection on only the candidates, eliminating false positives from MBR approximation.”
}
},
{
“@type”: “Question”,
“name”: “How is a spatial join optimized between two large geometry layers?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “A nested loop spatial join scans each geometry in layer A and queries the R-tree of layer B; with index on both sides, a sort-merge join using geohash-encoded extents can reduce the cross-product to only candidate pairs with overlapping MBRs.”
}
}
]
}