Why do you need to search neighbor geohash cells in proximity queries?

A query point near a geohash cell boundary will have nearby entities in adjacent cells that share a different prefix. Searching only the center cell misses those entities. By always fetching the 8 neighboring cells at the chosen precision and applying haversine filtering as a second pass, you guarantee no nearby entity is missed due to cell boundary effects.

When should you use haversine vs PostGIS ST_DWithin?

Use haversine when PostGIS is unavailable — it is accurate to within 0.3% for distances under a few hundred km and easy to implement. Use ST_DWithin with a GEOGRAPHY column when PostGIS is available, as it handles ellipsoidal earth geometry correctly, integrates with the GiST spatial index automatically, and supports the kNN distance operator for efficient nearest-neighbor queries without a fixed radius.

What spatial index type should you use for proximity queries?

For PostGIS, use a GiST index on a GEOGRAPHY or GEOMETRY column. GiST implements an R-tree variant that supports bounding box overlap queries and distance ordering efficiently. For non-PostGIS databases, a B-tree index on the geohash string supports prefix matching (LIKE 'prefix%') which provides an approximate bounding box filter, though it is less accurate than a true spatial index.

How do you handle moving entities like ride-share drivers in a proximity system?

Moving entities require frequent location updates (every 3-5 seconds for drivers). A relational table with geohash is too slow for this update rate. Use Redis geospatial commands instead: GEOADD to update position and GEORADIUS or GEOSEARCH for radius queries. Redis geospatial uses an internal geohash stored in a sorted set, providing sub-millisecond reads and writes. The relational table serves as a durable backup, updated asynchronously.

How does geohash handle boundary queries at cell borders?

A point near the edge of its geohash cell may be closer to points in adjacent cells than to points in its own cell; the query expands to the 8 neighboring geohash cells (same precision) to capture all candidates within the radius.

How does PostGIS ST_DWithin compare to haversine for radius queries?

ST_DWithin uses the spatial GiST index to prune candidates before computing exact spherical distance, making it orders of magnitude faster than computing haversine for every row in the table.

How are moving entities (e.g., drivers) handled in proximity search?

Moving entities update their Location row on a heartbeat schedule (e.g., every 5 seconds); the geohash column is updated on each write, keeping the spatial index current; stale locations are filtered by updated_at threshold.

How is the spatial index chosen between GiST and SP-GiST?

GiST (R-tree based) is preferred for general geometry; SP-GiST (space-partitioned tree) is better for point-heavy datasets with non-overlapping regions; both support ST_DWithin and nearest-neighbor operators.

Proximity Search System Low-Level Design: Geohash Indexing, Radius Queries, and Nearest Neighbor

⏱ 8 min read

Proximity Search System: Overview

A proximity search system finds entities (restaurants, drivers, users) within a geographic radius of a query point. Low-level design covers coordinate storage, geohash-based bounding box filtering, haversine distance computation, spatial indexing, k-nearest neighbor retrieval, and strategies for handling moving entities.

Coordinate Storage

Latitude and longitude are stored as DECIMAL(9,6) columns, providing ~11cm precision at the equator — more than sufficient for any proximity use case:

lat  DECIMAL(9,6)   -- range: -90.000000 to 90.000000
lng  DECIMAL(9,6)   -- range: -180.000000 to 180.000000

Alternatively, PostGIS uses a GEOGRAPHY(POINT, 4326) column that stores coordinates as WGS84 and unlocks native spatial functions. For non-PostGIS environments, the DECIMAL approach with application-level haversine is standard.

Geohash Encoding

A geohash encodes a coordinate pair into a short alphanumeric string. Each additional character halves the cell size:

5 characters: ~4.9km x 4.9km cell
6 characters: ~1.2km x 0.6km cell
7 characters: ~152m x 152m cell
8 characters: ~38m x 19m cell

Entities within the same geohash prefix are geographically close. A prefix search on the geohash column returns a bounding box of candidates without computing exact distances for all rows.

For a given query point and radius, compute the minimum geohash prefix length that contains the radius circle, then search that prefix plus its 8 neighbors (to handle boundary crossing at cell edges).

Neighbor Geohash Cells for Boundary Crossing

A query point near a geohash cell boundary will have nearby entities in adjacent cells. Always expand the search to the 8 neighboring geohash cells at the chosen precision level. Most geohash libraries expose a neighbors(geohash) function returning the 8 adjacent cells.

After fetching candidates from all 9 cells (center + 8 neighbors), apply exact haversine filtering to discard candidates outside the true radius.

Haversine Formula

The haversine formula computes the great-circle distance between two points on a sphere:

a = sin^2((lat2-lat1)/2) + cos(lat1)*cos(lat2)*sin^2((lng2-lng1)/2)
distance = 2 * R * arcsin(sqrt(a))    -- R = 6371 km (Earth radius)

This is accurate for distances up to a few hundred km. For precision over long distances, use the Vincenty formula on the WGS84 ellipsoid.

PostGIS for Accurate Radius Queries

When the database has PostGIS, use ST_DWithin on a GEOGRAPHY column for accurate spherical distance filtering:

SELECT id, entity_type, entity_id
FROM Location
WHERE ST_DWithin(
    geog,
    ST_SetSRID(ST_MakePoint(:lng, :lat), 4326)::geography,
    :radius_meters
)
ORDER BY ST_Distance(geog, ST_SetSRID(ST_MakePoint(:lng, :lat), 4326)::geography)
LIMIT :k;

ST_DWithin on a GEOGRAPHY column uses the GiST spatial index automatically for efficient bounding box pre-filtering followed by exact distance computation.

Spatial Index: GiST vs R-tree

GiST (Generalized Search Tree): PostgreSQL's native spatial index, used by PostGIS. Supports bounding box overlap queries (&&) and distance ordering (<-> operator). Default choice for PostGIS.
R-tree: The underlying data structure in many spatial databases (SpatiaLite, MySQL spatial, Oracle Spatial). Hierarchically groups bounding boxes. GiST in PostgreSQL implements an R-tree variant.

Without PostGIS, index the geohash column with a standard B-tree index and use LIKE 'prefix%' queries for prefix matching. This is less accurate but works in plain SQL.

K-Nearest Neighbor Query

To find the k closest entities regardless of a fixed radius:

SELECT id, entity_id, lat, lng,
       ST_Distance(geog, :query_point::geography) AS dist_meters
FROM Location
WHERE entity_type = :entity_type
ORDER BY geog <-> :query_point::geography
LIMIT :k;

The <-> distance operator in PostGIS uses the GiST index for efficient kNN without a radius pre-filter. For non-PostGIS, fetch candidates from geohash prefix + neighbors, compute haversine for each, sort, and take the top k.

Partitioning: Tile-Based Sharding

For global scale, partition the Location table by geohash prefix (first 2-3 characters). Each shard handles a geographic region. Queries for a given area route to one or two shards (for boundary queries). This collapses hotspots in dense cities to predictable shards that can be independently scaled.

Caching

Popular area results (e.g., restaurants near city center) are cached with a short TTL (30-60 seconds). Cache key: nearby:{geohash_prefix}:{radius}:{entity_type}. For moving entities (ride-share drivers), skip caching or use a very short TTL (5 seconds) since positions change rapidly.

SQL Schema

-- Location table (non-PostGIS variant)
CREATE TABLE Location (
    id          BIGSERIAL PRIMARY KEY,
    entity_type VARCHAR(32) NOT NULL,   -- driver, restaurant, user
    entity_id   BIGINT NOT NULL,
    lat         DECIMAL(9,6) NOT NULL,
    lng         DECIMAL(9,6) NOT NULL,
    geohash     VARCHAR(12) NOT NULL,   -- pre-computed geohash
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- B-tree index for geohash prefix search
CREATE INDEX idx_location_geohash ON Location(geohash, entity_type);
-- Index for point lookups by entity
CREATE UNIQUE INDEX idx_location_entity ON Location(entity_type, entity_id);

-- PostGIS variant: add a GEOGRAPHY column and GiST index
-- ALTER TABLE Location ADD COLUMN geog GEOGRAPHY(POINT, 4326);
-- CREATE INDEX idx_location_geog ON Location USING GIST(geog);

Python Implementation

import math
import geohash2  # pip install geohash2

EARTH_RADIUS_KM = 6371.0

def encode_geohash(lat: float, lng: float, precision: int = 7) -> str:
    """Encode coordinates to geohash string at given precision."""
    return geohash2.encode(lat, lng, precision=precision)

def geohash_precision_for_radius(radius_km: float) -> int:
    """Choose geohash precision so cell size  4.9:
        return 4
    elif radius_km > 1.2:
        return 5
    elif radius_km > 0.15:
        return 6
    return 7

def compute_haversine(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Return great-circle distance in km."""
    r = EARTH_RADIUS_KM
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lng2 - lng1)

    a = math.sin(dphi/2)**2 + math.cos(phi1)*math.cos(phi2)*math.sin(dlambda/2)**2
    return 2 * r * math.asin(math.sqrt(a))

def find_nearby(lat: float, lng: float, radius_km: float,
                entity_type: str, limit: int = 20) -> list:
    """Find nearby entities using geohash prefix filter + haversine exact check."""
    precision = geohash_precision_for_radius(radius_km)
    center_hash = encode_geohash(lat, lng, precision)
    neighbor_hashes = geohash2.neighbors(center_hash)
    search_hashes = [center_hash] + list(neighbor_hashes.values())

    placeholders = ','.join(['%s'] * len(search_hashes))
    rows = db.execute(
        f"SELECT entity_id, lat, lng FROM Location"
        f" WHERE entity_type=%s"
        f"   AND LEFT(geohash, %s) IN ({placeholders})",
        (entity_type, precision, *search_hashes)
    ).fetchall()

    results = []
    for entity_id, elat, elng in rows:
        dist = compute_haversine(lat, lng, float(elat), float(elng))
        if dist <= radius_km:
            results.append({"entity_id": entity_id, "distance_km": round(dist, 4)})

    results.sort(key=lambda x: x["distance_km"])
    return results[:limit]

Key Design Decisions Summary

Geohash prefix search reduces candidates from millions to hundreds before exact distance computation.
Always search 8 neighbor cells to prevent boundary-crossing misses.
PostGIS ST_DWithin + GiST index is the most accurate and performant option when available.
Haversine is sufficient for distances up to ~500km; use Vincenty for global precision requirements.
Moving entities (drivers) require frequent location updates — use Redis geospatial (GEOADD / GEORADIUS) for sub-second update and query latency.