Design a Proximity Service (Yelp / Nearby Search)

Design a proximity service like Yelp’s “restaurants near me” or Google Maps’ nearby search. The core challenge is geospatial indexing — you need to find all businesses within a radius quickly, without scanning your entire database of millions of locations.

Requirements Clarification

  • Operations: Find businesses within radius R of (lat, lng), filtered by category and optional rating. Add/update/delete business listings. View business details.
  • Scale: 100M businesses globally (Yelp has ~8M US listings; Google has 200M+). 1B search queries/day.
  • Search radius options: 500m, 1km, 5km, 20km. Return top 20 results per query.
  • Read/write ratio: Extremely read-heavy. Searches >> business updates. Businesses update infrequently (hours to days). Slight staleness is acceptable.
  • Latency: <100ms p99 for search queries.

Back-of-Envelope

  • 1B searches/day → ~11,600 searches/sec (peak ~50K/sec in major cities at lunch)
  • 100M businesses × ~200 bytes metadata = 20GB — fits comfortably in memory
  • Business updates: ~1M/day → ~12/sec. Truly read-dominated.
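The estimates above are easy to sanity-check in a few lines of Python:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# 1B searches/day
searches_per_sec = 1_000_000_000 / SECONDS_PER_DAY
print(round(searches_per_sec))       # ~11,574/sec, i.e. the ~11,600/sec above

# 100M businesses x ~200 bytes of metadata
metadata_gb = 100_000_000 * 200 / 1e9
print(metadata_gb)                   # 20.0 GB, fits comfortably in memory

# 1M updates/day
updates_per_sec = 1_000_000 / SECONDS_PER_DAY
print(round(updates_per_sec, 1))     # ~11.6/sec
```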

The Geospatial Indexing Problem

A naive approach: SELECT * FROM businesses WHERE lat BETWEEN ? AND ? AND lng BETWEEN ? AND ?. This scans a 2D bounding box. At 100M businesses this is slow: a B-tree index can serve only one of the two range predicates efficiently (even with separate lat and lng indexes, the planner picks one, or falls back to an expensive index merge), so you still scan millions of rows to filter the other dimension.

The solution: map 2D coordinates to a 1D space that preserves proximity. Two dominant approaches: Geohash and QuadTree.

Approach 1: Geohash

Geohash recursively divides the world into a grid of rectangular cells. Each cell is identified by a Base32 string. Cells that are geographically close share a common prefix.

San Francisco: 37.7749° N, 122.4194° W
Geohash precision 6: "9q8yyk"
Geohash precision 5: "9q8yy" (larger cell, ~4.9km × 4.9km)
Geohash precision 4: "9q8y"  (larger cell, ~39km × 20km)

Geohash precision vs cell size:

  Length  Cell size (approx)  Use case
  4       ~39km × 20km        City-level
  5       ~4.9km × 4.9km      Neighborhood
  6       ~1.2km × 0.6km      Street block
  7       ~150m × 150m        Individual address

(Cell aspect ratio alternates with length because each added character splits longitude and latitude an unequal number of times.)
To find nearby businesses: compute the user’s geohash at the appropriate precision, find all businesses in that cell and the 8 neighboring cells (to handle edge cases where the user is near a cell boundary).
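The geohash computation itself is small. This is a sketch of the standard algorithm in pure Python: repeatedly halve the latitude/longitude ranges, interleave the resulting bits (longitude first), and map every 5 bits to a Base32 character:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash Base32 (no a, i, l, o)

def geohash_encode(lat: float, lng: float, precision: int = 6) -> str:
    """Halve the lat/lng ranges repeatedly, interleaving bits (longitude first);
    every 5 bits become one Base32 character."""
    lat_rng, lng_rng = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count, even = [], 0, 0, True
    while len(chars) < precision:
        rng, value = (lng_rng, lng) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        bit = int(value >= mid)
        rng[1 - bit] = mid          # keep the half that contains the point
        bits = (bits << 1) | bit
        bit_count += 1
        even = not even
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

# San Francisco (37.7749, -122.4194) at the precisions discussed above
print(geohash_encode(37.7749, -122.4194, 6))  # "9q8yyk"
print(geohash_encode(37.7749, -122.4194, 5))  # "9q8yy"
print(geohash_encode(37.7749, -122.4194, 4))  # "9q8y"
```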

-- Index on the precision-6 geohash column
CREATE INDEX idx_geohash_6 ON businesses(geohash_6);

-- Query: user at geohash "9q8yyk", search radius ~1km → precision 6
SELECT * FROM businesses
WHERE geohash_6 IN ('9q8yyk', /* ...plus its 8 neighbors */)
  AND category = 'restaurant'
ORDER BY rating DESC
LIMIT 20;

After the DB query, apply exact distance filtering in application code (Haversine formula) to exclude any results outside the true radius — the geohash cell is rectangular, not circular.
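The exact-distance step is a straightforward haversine check. A minimal sketch (pure Python; the candidate row shape is illustrative):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two (lat, lng) points, in kilometers."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def within_radius(user, candidates, radius_km):
    """Drop candidates that fall inside the geohash cells but outside the true circle."""
    return [b for b in candidates
            if haversine_km(user[0], user[1], b["lat"], b["lng"]) <= radius_km]

# Two points in San Francisco, ~1.3km apart
d = haversine_km(37.7749, -122.4194, 37.7831, -122.4089)
print(f"{d:.2f} km")  # ~1.30 km: inside a 2km radius, outside a 1km radius
```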

Geohash edge case: Two locations very close together but in different parent cells may not share a prefix. Always include the 8 neighboring cells, not just the user’s cell.
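One way to compute those 8 neighbors without lookup tables: decode the cell to its center and dimensions, step one cell size in each direction, and re-encode. A sketch in pure Python (ignores pole and antimeridian wrap-around for brevity):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def encode(lat, lng, precision):
    """Standard geohash encoding: halve ranges, interleave bits, longitude first."""
    lat_rng, lng_rng = [-90.0, 90.0], [-180.0, 180.0]
    out, bits, n, even = [], 0, 0, True
    while len(out) < precision:
        rng, v = (lng_rng, lng) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        bit = int(v >= mid)
        rng[1 - bit] = mid  # tighten the half the point falls in
        bits, n, even = (bits << 1) | bit, n + 1, not even
        if n == 5:
            out.append(BASE32[bits])
            bits, n = 0, 0
    return "".join(out)

def decode_cell(gh):
    """Return (center_lat, center_lng, lat_size, lng_size) of a geohash cell."""
    lat_rng, lng_rng = [-90.0, 90.0], [-180.0, 180.0]
    even = True
    for ch in gh:
        val = BASE32.index(ch)
        for shift in range(4, -1, -1):
            rng = lng_rng if even else lat_rng
            bit = (val >> shift) & 1
            rng[1 - bit] = (rng[0] + rng[1]) / 2
            even = not even
    return (sum(lat_rng) / 2, sum(lng_rng) / 2,
            lat_rng[1] - lat_rng[0], lng_rng[1] - lng_rng[0])

def neighbors(gh):
    """The 8 surrounding cells: step one cell size each way and re-encode."""
    clat, clng, dlat, dlng = decode_cell(gh)
    return {encode(clat + dy * dlat, clng + dx * dlng, len(gh))
            for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)}

print(sorted(neighbors("9q8yy")))  # 8 cells; "9q8yz" (the eastern one) is among them
```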

Approach 2: QuadTree

A QuadTree recursively divides a 2D space into four quadrants until each leaf cell contains at most K businesses (typically K=100). Dense urban areas create deeper trees; sparse rural areas are shallow.

World
├── NW quadrant (Europe, N. America...)
│   ├── NW (Canada, W. Europe)
│   │   └── ... (recurse until <100 businesses per leaf)
│   ├── NE (E. Europe, Russia)
│   ├── SW (S. America, W. Africa)
│   └── SE (...)
└── SE quadrant

Search: traverse the tree to find the leaf containing (lat, lng). Expand to neighboring cells until you’ve collected enough results. QuadTree nodes are small enough to fit the entire index in memory on a single server (Google Maps keeps quadtree segments in memory on each serving node).
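A toy version of the structure makes the split-on-overflow behavior concrete. This sketch uses a tiny capacity for illustration (real systems use K around 100 and store only business IDs):

```python
class QuadTree:
    """Leaf-split quadtree over (lat, lng) points; leaves hold at most `capacity` entries.
    Assumes inserted points are distinct enough to separate under subdivision."""

    def __init__(self, lat_lo, lat_hi, lng_lo, lng_hi, capacity=4):
        self.bounds = (lat_lo, lat_hi, lng_lo, lng_hi)
        self.capacity = capacity
        self.points = []      # [(lat, lng, business_id)] while this node is a leaf
        self.children = None  # four child QuadTrees once split

    def _contains(self, lat, lng):
        lat_lo, lat_hi, lng_lo, lng_hi = self.bounds
        return lat_lo <= lat < lat_hi and lng_lo <= lng < lng_hi

    def insert(self, lat, lng, business_id):
        if not self._contains(lat, lng):
            return False
        if self.children is None:
            self.points.append((lat, lng, business_id))
            if len(self.points) > self.capacity:
                self._split()
            return True
        return any(c.insert(lat, lng, business_id) for c in self.children)

    def _split(self):
        lat_lo, lat_hi, lng_lo, lng_hi = self.bounds
        lat_mid, lng_mid = (lat_lo + lat_hi) / 2, (lng_lo + lng_hi) / 2
        self.children = [
            QuadTree(lat_mid, lat_hi, lng_lo, lng_mid, self.capacity),  # NW
            QuadTree(lat_mid, lat_hi, lng_mid, lng_hi, self.capacity),  # NE
            QuadTree(lat_lo, lat_mid, lng_lo, lng_mid, self.capacity),  # SW
            QuadTree(lat_lo, lat_mid, lng_mid, lng_hi, self.capacity),  # SE
        ]
        for p in self.points:
            any(c.insert(*p) for c in self.children)  # push points down to the leaves
        self.points = []

    def query(self, lat_lo, lat_hi, lng_lo, lng_hi, out=None):
        """Collect all points inside the query box, pruning non-overlapping subtrees."""
        if out is None:
            out = []
        b = self.bounds
        if lat_hi < b[0] or lat_lo >= b[1] or lng_hi < b[2] or lng_lo >= b[3]:
            return out  # query box misses this node entirely
        if self.children is None:
            out.extend(p for p in self.points
                       if lat_lo <= p[0] <= lat_hi and lng_lo <= p[1] <= lng_hi)
        else:
            for c in self.children:
                c.query(lat_lo, lat_hi, lng_lo, lng_hi, out)
        return out

world = QuadTree(-90, 90, -180, 180, capacity=2)
for lat, lng, bid in [(37.77, -122.41, "a"), (37.78, -122.40, "b"),
                      (37.79, -122.42, "c"), (40.71, -74.00, "nyc")]:
    world.insert(lat, lng, bid)
print(world.query(37.5, 38.0, -123.0, -122.0))  # the three SF points, not NYC
```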

Geohash vs QuadTree:

  Aspect             Geohash                              QuadTree
  Implementation     Simple: a string column in the DB    Complex: in-memory tree structure
  Density handling   Fixed cell size (uneven density)     Adaptive: dense areas get smaller cells
  Cell boundaries    Must query the 8 neighboring cells   Tree traversal handles them naturally
  Update latency     Milliseconds (a DB write)            Requires tree rebuild or delta update
  Best for           Most production use cases            Very high traffic, uniform QPS

For a system design interview, Geohash is the recommended starting point. It’s simpler to explain, implement, and operate. Mention QuadTree as an alternative for interviewer brownie points.

Architecture

Client → API Gateway
           ├─ Location Service (geospatial queries) → Redis (geohash index) → MySQL (business data)
           └─ Business Service (CRUD)               → MySQL → [Kafka] → Location Service cache update

Location Service:
  1. Compute geohash(6) for user (lat, lng)
  2. Generate 9 geohash cells (user cell + 8 neighbors)
  3. Fetch business IDs from Redis: SMEMBERS geohash:{cell}
  4. Batch-fetch business details from MySQL or Redis cache
  5. Filter by exact radius, category, rating
  6. Sort and return top 20
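Steps 3-6 above, sketched against in-memory stand-ins for the Redis sets and the business store (cell IDs, rows, and precomputed distances are all made up for illustration; the real flow fetches from Redis/MySQL and computes haversine distances):

```python
# Toy stand-in for Redis: geohash cell -> set of business IDs (SMEMBERS geohash:{cell}).
cell_index = {
    "9q8yyk": {1, 2},
    "9q8yym": {3},
    "9q8yy7": {4},
}
# Toy stand-in for the MySQL / Redis business-detail store.
businesses = {
    1: {"name": "Taqueria A", "category": "restaurant", "rating": 4.5, "dist_km": 0.3},
    2: {"name": "Cafe B",     "category": "cafe",       "rating": 4.8, "dist_km": 0.4},
    3: {"name": "Diner C",    "category": "restaurant", "rating": 4.1, "dist_km": 0.9},
    4: {"name": "Bistro D",   "category": "restaurant", "rating": 4.7, "dist_km": 1.6},
}

def search(cells, category, radius_km, limit=20):
    ids = set().union(*(cell_index.get(c, set()) for c in cells))        # step 3
    rows = [businesses[i] for i in ids]                                  # step 4
    rows = [r for r in rows                                              # step 5
            if r["category"] == category and r["dist_km"] <= radius_km]
    return sorted(rows, key=lambda r: r["rating"], reverse=True)[:limit] # step 6

# User cell plus its neighbors (precomputed here for illustration)
results = search(["9q8yyk", "9q8yym", "9q8yy7"], "restaurant", radius_km=1.0)
print([r["name"] for r in results])  # ['Taqueria A', 'Diner C']: Bistro D is outside 1km
```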

Redis for Geospatial

Redis has built-in geospatial commands:

# Add businesses (longitude first: Redis convention)
GEOADD businesses -122.4194 37.7749 "business:1234"
GEOADD businesses -122.4089 37.7831 "business:5678"

# Find the 20 nearest businesses within 1km of a point
GEORADIUS businesses -122.4194 37.7749 1 km
  WITHCOORD WITHDIST COUNT 20 ASC

# Returns: [["business:1234", 0.123, [-122.4194, 37.7749]], ...]


Redis GEORADIUS (now GEOSEARCH in Redis 6.2+) uses a Geohash-based sorted set internally. This gives O(N+log M) performance where N is results and M is total entries — fast even at 100M businesses if sharded by region.
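That "Geohash-based sorted set" can be made concrete: the score Redis stores per member is essentially a 52-bit integer of interleaved longitude/latitude bits, so geographically close members get numerically close scores. A simplified sketch of that encoding (pure Python; the actual Redis internals differ in detail, e.g. the latitude range it uses):

```python
def geo_score(lat, lng, bits_per_axis=26):
    """Interleave lng/lat range-halving bits into one integer, longitude first.
    26 bits per axis -> a 52-bit score, as in Redis sorted sets."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    score = 0
    for _ in range(bits_per_axis):
        mid = (lng_lo + lng_hi) / 2
        bit = int(lng >= mid)
        score = (score << 1) | bit
        if bit: lng_lo = mid
        else:   lng_hi = mid
        mid = (lat_lo + lat_hi) / 2
        bit = int(lat >= mid)
        score = (score << 1) | bit
        if bit: lat_lo = mid
        else:   lat_hi = mid
    return score

sf_1 = geo_score(37.7749, -122.4194)
sf_2 = geo_score(37.7755, -122.4190)   # ~100m away, same precision-6 geohash cell
nyc  = geo_score(40.7128, -74.0060)

# Nearby points share their high-order bits; distant points do not:
print(sf_1 >> 22 == sf_2 >> 22)  # True
print(sf_1 >> 22 == nyc  >> 22)  # False
```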

Sharding Redis: Shard by region (continent or country). All US businesses on US Redis nodes, EU on EU nodes. The location service routes based on user’s rough region. Cross-shard queries (a user near the US/Canada border) are handled by querying both shards and merging.

Caching

Search results are highly cacheable. “Restaurants within 1km of geohash 9q8yy4” returns the same results for every user who queries from that cell — cache at the geohash+category+radius level. TTL: 1 minute (balance freshness vs cache hit rate). Cache in Redis with a simple string key.
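A hypothetical shape for that cache key and TTL behavior (names are illustrative; in production this is a Redis SETEX/GET, modeled here as a tiny in-process dict so the expiry logic is visible):

```python
import time

SEARCH_TTL_SECONDS = 60  # the 1-minute freshness window described above

def search_cache_key(cell: str, category: str, radius_km: float) -> str:
    # Every user querying from the same cell hits the same key.
    return f"search:{cell}:{category}:{radius_km}"

class TTLCache:
    """Minimal stand-in for SETEX/GET with expiry."""
    def __init__(self):
        self._store = {}  # key -> (expires_at, value)

    def set(self, key, value, ttl, now=None):
        now = time.time() if now is None else now
        self._store[key] = (now + ttl, value)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(key)
        if entry is None or entry[0] <= now:
            return None  # missing or expired
        return entry[1]

cache = TTLCache()
key = search_cache_key("9q8yyk", "restaurant", 1.0)
cache.set(key, ["business:1234"], SEARCH_TTL_SECONDS, now=0)
print(cache.get(key, now=30))  # hit, still within the 60s TTL
print(cache.get(key, now=90))  # None: expired
```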

Business details (name, address, hours, photos, rating) change infrequently — cache per business_id with a 5-minute TTL.

Business Review and Rating

Ratings are aggregated offline (batch job every hour) rather than computed on every query. Store pre-aggregated avg_rating and review_count on the business record. Individual reviews go to a separate reviews service and table, which feeds the aggregation job.
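The hourly job reduces to a group-by over new reviews. A sketch (row shapes are hypothetical):

```python
from collections import defaultdict

def aggregate_ratings(reviews):
    """reviews: iterable of (business_id, stars).
    Returns business_id -> (avg_rating, review_count)."""
    totals = defaultdict(lambda: [0.0, 0])  # business_id -> [sum of stars, count]
    for business_id, stars in reviews:
        totals[business_id][0] += stars
        totals[business_id][1] += 1
    return {bid: (round(s / n, 2), n) for bid, (s, n) in totals.items()}

# The job writes these back onto the business records as avg_rating / review_count.
agg = aggregate_ratings([(1, 5), (1, 4), (2, 3)])
print(agg)  # {1: (4.5, 2), 2: (3.0, 1)}
```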

Interview Follow-ups

  • How do you handle businesses that span multiple locations (e.g., a chain restaurant with 50 locations)?
  • How would you add real-time business information — “busy right now” indicator based on live foot traffic?
  • How does your design change if users are searching while moving (navigation mode vs point search)?
  • How do you personalize results for a returning user? (Factoring in cuisine preferences, past visits, ratings given.)
  • How would you implement the autocomplete for the business search box?
