System Design: Knowledge Graph — Entity Storage, Relationship Traversal, and Semantic Search (2025)

What is a Knowledge Graph?

A knowledge graph stores entities (nodes) and typed relationships (edges) between them. Examples: Google Knowledge Graph (person -> born_in -> city), LinkedIn Economic Graph (person -> works_at -> company -> located_in -> country), Meta social graph (person -> friend_of -> person). Core operations: entity lookup (get all properties of an entity), neighbor traversal (who does entity X relate to via relationship type R?), path finding (shortest path between two entities), pattern matching (find all companies where person X worked before founding startup Y), and semantic similarity search (find entities similar to this entity). The choice of storage backend determines which operations are efficient.

Data Model: Labeled Property Graph

Each Node has: node_id (UUID), labels (set of type strings, e.g., [“Person”, “Employee”]), properties (JSONB: name, birth_date, description), embedding (VECTOR(1536) for semantic search), created_at. Each Edge has: edge_id, source_id, target_id, relationship_type (string: WORKS_AT, BORN_IN, KNOWS, FOUNDED), properties (JSONB: start_date, end_date, confidence_score), is_directed (bool). Indexes: node label + property (for entity lookup), source_id + relationship_type (for outgoing traversal), target_id + relationship_type (for incoming traversal), embedding vector index (HNSW for approximate nearest neighbor search).

Storage Backend Options

Native graph database (Neo4j, Amazon Neptune): Store nodes and edges with native adjacency lists. Index-free adjacency: edge traversal follows direct pointers, O(degree) per hop regardless of total graph size. Best for: deep traversal (3+ hops), pattern matching queries (Cypher/Gremlin). Cons: limited horizontal scaling, complex ops. Relational (PostgreSQL): Nodes table + Edges table. Adjacency queries use JOINs. Simple for 1-2 hop queries. Recursive CTEs for multi-hop (WITH RECURSIVE). Best for: small-medium graphs, existing Postgres infrastructure. Document store (Elasticsearch): Nodes as documents with embedded neighbor lists. Best for: full-text search on entity properties, not deep traversal. Hybrid: PostgreSQL for entity storage + property lookups; graph DB for relationship traversal; Elasticsearch for text search; vector DB (Pinecone/pgvector) for semantic similarity. Most production systems use a hybrid.

Multi-Hop Traversal with BFS

def bfs_neighbors(start_id: str, relationship_types: list[str],
                  max_hops: int, max_results: int = 1000) -> list[dict]:
    # BFS traversal up to max_hops with visited set for cycle prevention
    visited = {start_id}
    queue = [(start_id, 0)]  # (node_id, current_depth)
    results = []

    while queue and len(results) = max_hops:
            continue

        # Batch fetch neighbors to minimize DB round trips
        neighbors = db.query(
            "SELECT target_id, relationship_type, properties "
            "FROM edges WHERE source_id = %s "
            "AND relationship_type = ANY(%s)",
            node_id, relationship_types
        )
        for edge in neighbors:
            if edge.target_id not in visited:
                visited.add(edge.target_id)
                results.append({
                    "node_id": edge.target_id,
                    "via": edge.relationship_type,
                    "depth": depth + 1
                })
                queue.append((edge.target_id, depth + 1))

    return results

# Optimization: cache hot node neighborhoods in Redis
# Key: node:{node_id}:rel:{rel_type}:neighbors, TTL 1h
# Invalidate on edge insert/delete for that source node

Semantic Search with Embeddings

Each entity has an embedding vector generated by a language model (OpenAI text-embedding-3-small, 1536 dimensions). Embeddings are computed on entity creation and updated when properties change. Storage: pgvector extension in PostgreSQL with HNSW index for approximate nearest-neighbor search. Query: “find entities semantically similar to Tesla”: (1) embed the query string, (2) run vector similarity search: SELECT node_id, 1 – (embedding query_vec) AS similarity FROM nodes WHERE labels @> ARRAY[‘Company’] ORDER BY embedding query_vec LIMIT 20. The operator computes cosine distance. HNSW index gives O(log n) approximate search instead of O(n) brute force. Combine with keyword search for hybrid retrieval: rank by 0.7 * vector_score + 0.3 * text_score.

Meta uses knowledge graphs for the social graph and entity understanding. See common design questions at Meta interview: knowledge graph and social graph design.

LinkedIn’s Economic Graph is a large-scale knowledge graph. See system design patterns asked at LinkedIn interview: knowledge graph and economic graph design.

Databricks interviews cover graph analytics and semantic search systems. See common patterns at Databricks interview: graph analytics and vector search.

Scroll to Top