Graph Search System: Overview
A graph search system enables queries over a network of nodes and edges — users, connections, groups, and their relationships. Low-level design covers storage format, traversal algorithms, index structures, query execution, and distributed partitioning for multi-hop queries at scale.
Adjacency List Storage in Relational DB
The simplest durable graph representation uses a relational Edge table:
- Each directed edge is one row: (src_id, dst_id, edge_type, weight).
- Undirected edges are stored as two rows (both directions) or queried with OR on src/dst.
- A covering index on
(src_id, edge_type, dst_id)allows fetching all neighbors of a node for a specific edge type in a single index scan without a table heap access. - Node metadata is stored in a separate Node table with a JSONB properties column for flexible attributes.
For read-heavy workloads, the in-memory adjacency list is replicated to Redis as a set per node: SMEMBERS adj:{node_id}:{edge_type}.
BFS and DFS Traversal
Breadth-First Search (BFS) is preferred for shortest path and k-hop neighbor queries because it finds the shortest path first and allows early termination.
Algorithm for k-hop BFS:
- Initialize frontier = {start_node}, visited = {start_node}.
- For each hop 1..k: expand frontier by fetching neighbors of each node in frontier, add unvisited neighbors to next_frontier and visited.
- Return visited {start_node} or the specific target if found.
Cycle prevention: the visited set ensures no node is processed twice. Use a hash set for O(1) membership checks.
DFS is used for reachability queries and connected component detection where depth-first exploration is sufficient.
Friends-of-Friends Query (2-Hop)
A friends-of-friends query finds all users reachable in exactly 2 hops (direct friends of friends, excluding direct friends).
Using a recursive CTE in PostgreSQL:
WITH RECURSIVE hops AS (
-- Level 1: direct friends
SELECT dst_id AS node_id, 1 AS depth
FROM Edge
WHERE src_id = :user_id AND edge_type = 'friend'
UNION ALL
-- Level 2: friends of friends
SELECT e.dst_id, h.depth + 1
FROM Edge e
JOIN hops h ON e.src_id = h.node_id
WHERE h.depth < 2 AND e.edge_type = 'friend'
)
SELECT DISTINCT node_id
FROM hops
WHERE depth = 2
AND node_id != :user_id
AND node_id NOT IN (
SELECT dst_id FROM Edge WHERE src_id = :user_id AND edge_type = 'friend'
)
LIMIT 100;
For performance, cap the expansion with LIMIT at each level to avoid explosion on high-degree nodes.
Index-Assisted Graph Traversal
Covering index on (src_id, edge_type, dst_id) enables index-only scans for neighbor lookups. For multi-attribute node filtering (e.g., find friends who work at company X), add a NodeIndex table that inverts node properties:
- Index rows: (node_id, attribute_name, attribute_value)
- Query: find nodes where attribute_name='company' AND attribute_value='Acme', then intersect with adjacency list.
This avoids full JSONB scans on the Node table for common filter attributes.
Graph Partitioning
For distributed graphs, partitioning strategy determines how many cross-shard hops are needed:
- Random partitioning: simple but most edges cross shards, making multi-hop queries expensive.
- Degree-aware sharding: high-degree (hub) nodes are assigned to dedicated shards and replicated to neighbors' shards to reduce cross-shard fetches.
- Community-based partitioning: cluster the graph (e.g., via METIS or Louvain) and assign each community to a shard, minimizing cross-shard edges.
For social graphs, co-locating a user with their direct neighbors reduces 1-hop queries to a single shard. 2-hop queries then require at most one cross-shard scatter.
Distributed Traversal: Scatter-Gather
For multi-hop queries spanning multiple shards:
- The query coordinator identifies the shard for each node in the current frontier.
- Scatter: send partial BFS expansion requests to each relevant shard in parallel.
- Gather: collect results, deduplicate, and form the next frontier.
- Repeat for each hop until depth limit or target found.
Latency grows with hop count and shard count. Optimize by pruning high-degree hub nodes from expansion (they add noise and cross too many shards).
SQL Schema
-- Node table with flexible properties
CREATE TABLE Node (
id BIGSERIAL PRIMARY KEY,
type VARCHAR(32) NOT NULL, -- user, group, page, event
properties JSONB NOT NULL DEFAULT '{}'
);
CREATE INDEX idx_node_type ON Node(type);
CREATE INDEX idx_node_properties ON Node USING GIN(properties);
-- Edge table (directed)
CREATE TABLE Edge (
src_id BIGINT NOT NULL REFERENCES Node(id),
dst_id BIGINT NOT NULL REFERENCES Node(id),
edge_type VARCHAR(32) NOT NULL, -- friend, follow, member_of, likes
weight DOUBLE PRECISION NOT NULL DEFAULT 1.0,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (src_id, dst_id, edge_type)
);
-- Covering index for forward neighbor lookup
CREATE INDEX idx_edge_src ON Edge(src_id, edge_type, dst_id);
-- Reverse index for incoming edge queries
CREATE INDEX idx_edge_dst ON Edge(dst_id, edge_type, src_id);
-- Inverted index for node attribute filtering
CREATE TABLE NodeIndex (
node_id BIGINT NOT NULL REFERENCES Node(id),
attribute_name VARCHAR(64) NOT NULL,
attribute_value TEXT NOT NULL,
PRIMARY KEY (attribute_name, attribute_value, node_id)
);
Python Implementation
from collections import deque
from typing import List, Optional, Set
def bfs_neighbors(node_id: int, max_hops: int, edge_type: str) -> Set[int]:
"""Return all node IDs reachable within max_hops via edge_type."""
visited: Set[int] = {node_id}
frontier: Set[int] = {node_id}
for hop in range(max_hops):
if not frontier:
break
placeholders = ','.join(['%s'] * len(frontier))
rows = db.execute(
f"SELECT dst_id FROM Edge WHERE src_id IN ({placeholders})"
f" AND edge_type=%s",
(*frontier, edge_type)
).fetchall()
next_frontier: Set[int] = set()
for (dst_id,) in rows:
if dst_id not in visited:
visited.add(dst_id)
next_frontier.add(dst_id)
frontier = next_frontier
visited.discard(node_id)
return visited
def friends_of_friends(user_id: int, limit: int = 50) -> List[dict]:
"""Return 2-hop friends excluding direct friends, ranked by mutual connection count."""
direct = bfs_neighbors(user_id, max_hops=1, edge_type='friend')
two_hop_candidates: dict[int, int] = {}
for friend_id in direct:
fof = bfs_neighbors(friend_id, max_hops=1, edge_type='friend')
for candidate in fof:
if candidate != user_id and candidate not in direct:
two_hop_candidates[candidate] = two_hop_candidates.get(candidate, 0) + 1
ranked = sorted(two_hop_candidates.items(), key=lambda x: x[1], reverse=True)
return [{"user_id": uid, "mutual_count": cnt} for uid, cnt in ranked[:limit]]
def shortest_path(src_id: int, dst_id: int, edge_type: str = 'friend') -> Optional[List[int]]:
"""BFS shortest path between two nodes. Returns list of node IDs or None if unreachable."""
if src_id == dst_id:
return [src_id]
visited = {src_id: None} # node -> parent
queue = deque([src_id])
while queue:
current = queue.popleft()
rows = db.execute(
"SELECT dst_id FROM Edge WHERE src_id=%s AND edge_type=%s",
(current, edge_type)
).fetchall()
for (neighbor,) in rows:
if neighbor not in visited:
visited[neighbor] = current
if neighbor == dst_id:
# Reconstruct path
path = []
node = dst_id
while node is not None:
path.append(node)
node = visited[node]
return list(reversed(path))
queue.append(neighbor)
return None # unreachable
Key Design Decisions Summary
- Covering index on (src_id, edge_type, dst_id) enables index-only neighbor lookups — the most frequent operation.
- Recursive CTEs work for 2-hop queries in Postgres but must be capped at each level to avoid degree-explosion.
- Visited set in BFS prevents cycles and redundant processing in O(1) per check.
- Community-based graph partitioning minimizes cross-shard edges, critical for keeping scatter-gather latency bounded.
- NodeIndex inverts JSONB properties for efficient attribute-based filtering without full JSONB scans.
See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering