System Design: Geospatial Platform — Proximity Search, Geofencing, and Location Data at Scale (2025)

Geospatial Data Fundamentals

Geospatial queries are common in production systems: “find the 10 nearest restaurants,” “is this user inside a geofence?”, “show all events within 50km.” Without a spatial index, a proximity query requires a full table scan — O(n) at minimum. Geospatial indexes make these queries O(log n) or O(k), where k is the result count. Key concepts:

- Haversine distance: the great-circle distance between two lat/lng points on Earth's surface (not a straight line through the Earth). For short distances, a cheaper equirectangular approximation suffices: distance ≈ sqrt((dlat * 111km)^2 + (dlng * cos(lat) * 111km)^2).
- Bounding box: a rectangle of lat/lng coordinates used to pre-filter candidates before the exact distance calculation.
- Geohash: encodes a lat/lng pair into a string of characters where a common prefix implies geographic proximity.
- S2 geometry: Google's library that divides Earth into hierarchical cells for efficient spatial operations.
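The two distance calculations above can be sketched in Python (an illustrative sketch: the function names are ours, and the constants assume Earth's mean radius of ~6371 km and ~111 km per degree of latitude):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lng1, lat2, lng2):
    """Exact great-circle distance between two lat/lng points, in km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = math.radians(lat2 - lat1)
    dlng = math.radians(lng2 - lng1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def approx_km(lat1, lng1, lat2, lng2):
    """Equirectangular approximation: cheap, adequate for short distances."""
    dlat_km = (lat2 - lat1) * 111.0                                 # ~111 km per degree latitude
    dlng_km = (lng2 - lng1) * 111.0 * math.cos(math.radians(lat1))  # shrinks with latitude
    return math.sqrt(dlat_km ** 2 + dlng_km ** 2)

# One degree of longitude at the equator is ~111 km by either method
print(haversine_km(0.0, 0.0, 0.0, 1.0))  # ~111.2
print(approx_km(0.0, 0.0, 0.0, 1.0))     # 111.0
```

Over a couple of kilometers the two agree to within a few meters; the approximation degrades as distances grow or near the poles.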

Proximity Search: PostGIS and Redis

-- PostGIS: find restaurants within 5km of a point
CREATE TABLE restaurants (
    id BIGSERIAL PRIMARY KEY,
    name TEXT,
    location GEOGRAPHY(POINT, 4326)  -- WGS84; PostGIS point order is (lng, lat)
);
CREATE INDEX idx_restaurants_location ON restaurants USING GIST(location);

-- Query: find all restaurants within 5km, sorted by distance
SELECT id, name,
    ST_Distance(location,
                ST_MakePoint(-73.9857, 40.7484)::GEOGRAPHY) AS dist_meters
FROM restaurants
WHERE ST_DWithin(
    location,
    ST_MakePoint(-73.9857, 40.7484)::GEOGRAPHY,
    5000  -- meters
)
ORDER BY dist_meters
LIMIT 20;
# Redis geospatial: fast proximity search (redis-py shown; GEORADIUS is
# deprecated in favor of GEOSEARCH since Redis 6.2, but still widely deployed)
# Add restaurant locations — note Redis takes longitude before latitude
redis.geoadd("restaurants", (longitude, latitude, restaurant_id))

# Find restaurants within 5km, sorted by distance
results = redis.georadius(
    "restaurants",
    longitude=-73.9857,
    latitude=40.7484,
    radius=5,
    unit="km",
    sort="ASC",
    count=20,
    withcoord=True,
    withdist=True
)
# Returns: [(member, dist_km, (lng, lat)), ...]
# (members are bytes unless the client uses decode_responses=True)

Geohash-Based Sharding

Geohash encodes a lat/lng pair into a string: each character adds 5 bits, subdividing the cell 32-fold (each individual bit halves it along one axis). A shared geohash prefix implies a shared geographic region. Precision: length 4 ≈ 39km x 20km, length 6 ≈ 1.2km x 0.6km, length 8 ≈ 38m x 19m.

Sharding by geohash: store entities in shards keyed by their geohash prefix. Proximity query: compute the geohash of the query point, find its neighboring geohash cells (up to 8 neighbors), and query those shards.

Limitation: geohash cells are rectangles, not circles, their proportions vary with latitude, and neighboring cells do not always share a prefix. Edge case: locations near a geohash boundary can be physically close yet have very different prefixes, so always check the 8 neighbors, not just the query point's own cell. Google's S2 library takes a different approach: it projects a cube onto the sphere and subdivides each face into hierarchical quadrilateral cells, and its cell-covering operations reduce the boundary artifacts of plain geohash prefixes. (Hexagonal cells are the approach of Uber's H3 library, not S2.)
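The prefix property is easiest to see in code. This is a minimal, illustrative geohash encoder (`geohash_encode` is our own helper name, not a library API; use a maintained geohash library in production):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lng, length=8):
    """Encode lat/lng by repeatedly bisecting the lng and lat ranges,
    interleaving bits (longitude bit first), 5 bits per base32 character."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    bits = []
    even = True  # even bit positions refine longitude
    while len(bits) < length * 5:
        if even:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits.append(1); lng_lo = mid
            else:
                bits.append(0); lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        even = not even
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, length * 5, 5)
    )

# Nearby points share a prefix; a longer hash refines the shorter one.
a = geohash_encode(40.7484, -73.9857, 8)  # Empire State Building
b = geohash_encode(40.7490, -73.9850, 8)  # ~70 m away
# a[:4] == b[:4] — both fall in the same length-4 (~39km x 20km) cell
```

By construction, truncating a hash yields the hash of the enclosing coarser cell, which is exactly what makes prefix-based sharding work.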

Geofencing at Scale

Geofencing: detect when a moving entity (user, vehicle) enters or exits a defined polygon.

Naive approach: for each location update, check whether the point is inside any polygon (point-in-polygon test via ray casting). That is O(k * n), where k = location updates/sec and n = total geofences.

Optimized approach:

- Bounding-box pre-filter: each geofence stores its bounding box (min_lat, max_lat, min_lng, max_lng). Index the bounding boxes in a 2D spatial index (a GiST/R-tree index in PostGIS, or geohash-based).
- On each location update: find all geofences whose bounding box contains the point (a fast index lookup), then run the exact point-in-polygon test only on those candidates (typically < 10 instead of thousands).
- State tracking: maintain the last known inside/outside status per (entity, geofence) pair so ENTER/EXIT events fire correctly. Store this state in Redis with a TTL so it expires after inactivity.
- Kafka-based pipeline: location updates → Kafka → geofence consumer → state comparison → ENTER/EXIT events → downstream automation.

Location History and Privacy

Storing fine-grained location history (every GPS update) creates privacy risks and storage costs.

Storage architecture: raw location stream → Kafka → location consumer, which (1) updates the entity's current location in Redis (TTL = 30s, for real-time features); (2) downsamples to 1 update per minute and writes to a time-series table (TimescaleDB or InfluxDB); (3) summarizes to “visited places” (significant location stays, anonymized) for long-term retention.

Data retention: raw 1-minute samples, 7 days; summarized visit data, 1 year; aggregated/anonymized analytics, indefinite.

GDPR compliance: users can request deletion of all their location data. Implement a “right to erasure” pipeline that deletes raw samples, summarized data, and cached locations within 72 hours. Location data is PII: encrypt it at rest and restrict access by role.
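Reducing stored precision can be as simple as rounding coordinates before retention (a sketch; 3 decimal places is roughly 111 m of latitude, i.e. block-level rather than exact position):

```python
def coarsen(lat, lng, decimals=3):
    """Round coordinates for storage: 3 decimals is ~111 m in latitude
    (and less in longitude away from the equator), enough for city/block
    level analytics without retaining the exact GPS fix."""
    return round(lat, decimals), round(lng, decimals)

# Store the coarsened point instead of the raw GPS fix
stored = coarsen(40.748441, -73.985664)
print(stored)  # (40.748, -73.986)
```

Rounding alone is not full anonymization (repeated coarse points can still identify a home address), so pair it with the retention limits above.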


FAQ

What is the difference between GEOGRAPHY and GEOMETRY in PostGIS?
GEOMETRY treats coordinates as points on a flat Cartesian plane; distance calculations are in the coordinate units of the SRID (degrees for SRID 4326). Computation is fast but inaccurate over long distances on Earth's curved surface. GEOGRAPHY treats coordinates as lat/lng on the WGS84 spheroid (SRID 4326); distance calculations are performed on the spheroid and returned in meters — accurate for any distance on Earth, at a modest performance cost. For most applications involving real-world locations, use GEOGRAPHY: ST_DWithin with GEOGRAPHY correctly means “within N meters,” whereas ST_DWithin with a GEOMETRY in SRID 4326 would treat degrees as the distance unit, giving wrong results at most latitudes.

What are geohash neighbors and why must you check them for proximity queries?
Geohash encodes a lat/lng pair into a string where strings with a common prefix lie in the same geographic area. The issue: cells on opposite sides of a geohash boundary are physically close but have different prefixes. For example, points just either side of the boundary between cells u09t and u09s can be less than 10 meters apart yet share only the 3-character prefix u09. If your proximity radius can span a cell boundary, you must also check the 8 neighboring cells (N, NE, E, SE, S, SW, W, NW); compute them by decoding the geohash to its lat/lng bounds and encoding the adjacent cells. Query the center cell plus its 8 neighbors, then filter results to the exact distance, and you will never miss a nearby point due to the boundary effect. At very small radii a single cell may suffice; tune this based on your radius vs. geohash precision.

How does Redis implement geospatial commands internally?
Redis geospatial commands (GEOADD, GEODIST, GEOSEARCH/GEORADIUS) are backed by a single sorted set (ZSET). Each member's score is its position encoded as a 52-bit interleaved geohash of latitude and longitude, so the sorted set is effectively indexed by geohash. A radius query: (1) determine the geohash cell containing the query point at a precision matched to the radius; (2) find the neighboring cells needed to cover the search area; (3) compute the score ranges corresponding to those cells; (4) run a sorted-set range query for each range, O(log n + m) each; (5) filter the candidates by exact haversine distance. Overall this is roughly O(log n + k) per query, where k is the result count. At 52 bits the cells are well under a meter across — sufficient precision for practical use cases.

How do you implement geofencing for millions of moving entities at scale?
Naive approach: for each location update, check the entity against all geofences — O(entities * geofences) per second, infeasible at scale. Spatial-index approach: index all geofence bounding boxes with a PostGIS GiST (R-tree-like) index. On each location update: (1) an indexed query finds the geofences whose bounding box contains the new point; (2) an exact point-in-polygon test (ray casting, or PostGIS ST_Contains) runs on those candidates; (3) the resulting inside/outside state is compared with the previous state (from Redis) to generate ENTER or EXIT events. For very high update rates (millions per second), pre-shard geofences by region: route each location update to the shard responsible for its geohash cell, so each shard handles only a geographic subset of geofences. Use Kafka for the location update stream and a stateful stream processor such as Flink for geofence evaluation.

How do you handle location privacy and GDPR compliance for stored location data?
Location data is personal data under GDPR. Key requirements: (1) Consent: obtain explicit consent before collecting location, with granular controls (e.g., share location only while using the app, not in the background). (2) Data minimization: collect only the precision needed; if city-level location suffices, do not store raw GPS coordinates, and round stored coordinates to ~100m precision. (3) Retention limits: define and enforce retention periods (raw GPS: 7 days; aggregated data: 1 year) and delete automatically at expiry. (4) Right to erasure: on a deletion request, delete all of the user's location records within 72 hours, including derived data such as visit history and inferred home/work locations. (5) Access controls: restrict location data to specific roles and services, log all accesses, and encrypt at rest; with per-user keys, deleting a key destroys that user's data.

See also: Uber Interview Prep

See also: DoorDash Interview Prep

See also: Lyft Interview Prep
