System Design: Knowledge Graph — Entity Storage, Relationship Traversal, and Semantic Search (2025)

What is a Knowledge Graph?

A knowledge graph stores entities (nodes) and typed relationships (edges) between them. Examples: Google Knowledge Graph (person -> born_in -> city), LinkedIn Economic Graph (person -> works_at -> company -> located_in -> country), Meta social graph (person -> friend_of -> person). Core operations: entity lookup (get all properties of an entity), neighbor traversal (who does entity X relate to via relationship type R?), path finding (shortest path between two entities), pattern matching (find all companies where person X worked before founding startup Y), and semantic similarity search (find entities similar to this entity). The choice of storage backend determines which operations are efficient.
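The core operations above can be sketched over a toy in-memory edge list. This is a minimal illustration, not a production store: the entities (`alice`, `acme`, `berlin`) and the triple format are made-up assumptions for demonstration.

```python
from collections import deque

# Toy edge list of (source, relationship_type, target) triples -- illustrative only
EDGES = [
    ("alice", "WORKS_AT", "acme"),
    ("acme", "LOCATED_IN", "berlin"),
    ("alice", "KNOWS", "bob"),
    ("bob", "WORKS_AT", "acme"),
]

def neighbors(node, rel_type=None):
    """Neighbor traversal: targets reachable from node, optionally filtered by type."""
    return [t for s, r, t in EDGES if s == node and (rel_type is None or r == rel_type)]

def shortest_path(source, target):
    """Path finding: BFS returns one shortest path of node ids, or None if unreachable."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None
```

For example, `neighbors("alice", "WORKS_AT")` returns `["acme"]`, and `shortest_path("alice", "berlin")` returns `["alice", "acme", "berlin"]`. Every backend discussed below is, at heart, a way to make these operations fast at scale.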

Data Model: Labeled Property Graph

Each Node has: node_id (UUID), labels (set of type strings, e.g., ["Person", "Employee"]), properties (JSONB: name, birth_date, description), embedding (VECTOR(1536) for semantic search), created_at. Each Edge has: edge_id, source_id, target_id, relationship_type (string: WORKS_AT, BORN_IN, KNOWS, FOUNDED), properties (JSONB: start_date, end_date, confidence_score), is_directed (bool). Indexes: node label + property (for entity lookup), source_id + relationship_type (for outgoing traversal), target_id + relationship_type (for incoming traversal), embedding vector index (HNSW for approximate nearest neighbor search).

Storage Backend Options

Native graph database (Neo4j, Amazon Neptune): stores nodes and edges with native adjacency lists. Index-free adjacency means edge traversal follows direct pointers, O(degree) per hop regardless of total graph size. Best for: deep traversal (3+ hops), pattern-matching queries (Cypher/Gremlin). Cons: limited horizontal scaling, complex operations.

Relational (PostgreSQL): a nodes table plus an edges table; adjacency queries use JOINs. Simple for 1-2 hop queries; recursive CTEs (WITH RECURSIVE) handle multi-hop. Best for: small-to-medium graphs, existing Postgres infrastructure.

Document store (Elasticsearch): nodes as documents with embedded neighbor lists. Best for: full-text search on entity properties, not deep traversal.

Hybrid: PostgreSQL for entity storage and property lookups, a graph DB for relationship traversal, Elasticsearch for text search, and a vector DB (Pinecone/pgvector) for semantic similarity. Most production systems use a hybrid.
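The recursive-CTE approach for the relational backend can be sketched as follows. SQLite stands in for Postgres here (the WITH RECURSIVE shape is the same); the table contents are made-up sample data.

```python
import sqlite3

# SQLite stands in for Postgres; schema mirrors the edges table above
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (source_id TEXT, target_id TEXT, relationship_type TEXT);
    INSERT INTO edges VALUES
        ('alice', 'acme', 'WORKS_AT'),
        ('acme', 'berlin', 'LOCATED_IN'),
        ('berlin', 'germany', 'LOCATED_IN');
""")

def reachable(start_id: str, max_hops: int) -> list[tuple[str, int]]:
    """All nodes reachable from start_id within max_hops, with their depth."""
    return conn.execute("""
        WITH RECURSIVE walk(node_id, depth) AS (
            SELECT ?, 0
            UNION
            SELECT e.target_id, w.depth + 1
            FROM walk w JOIN edges e ON e.source_id = w.node_id
            WHERE w.depth < ?
        )
        SELECT node_id, depth FROM walk WHERE depth > 0
        ORDER BY depth
    """, (start_id, max_hops)).fetchall()
```

The UNION (rather than UNION ALL) deduplicates revisited nodes, which bounds the recursion on cyclic graphs. Each hop is a self-JOIN, which is why this degrades past a few hops compared to index-free adjacency.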

Multi-Hop Traversal with BFS

from collections import deque

# db is assumed to be a database client exposing query(sql, *params) -> rows
def bfs_neighbors(start_id: str, relationship_types: list[str],
                  max_hops: int, max_results: int = 1000) -> list[dict]:
    # BFS traversal up to max_hops with visited set for cycle prevention
    visited = {start_id}
    queue = deque([(start_id, 0)])  # (node_id, current_depth)
    results = []

    while queue and len(results) < max_results:
        node_id, depth = queue.popleft()
        if depth >= max_hops:
            continue

        # Batch fetch neighbors to minimize DB round trips
        neighbors = db.query(
            "SELECT target_id, relationship_type, properties "
            "FROM edges WHERE source_id = %s "
            "AND relationship_type = ANY(%s)",
            node_id, relationship_types
        )
        for edge in neighbors:
            if edge.target_id not in visited:
                visited.add(edge.target_id)
                results.append({
                    "node_id": edge.target_id,
                    "via": edge.relationship_type,
                    "depth": depth + 1
                })
                queue.append((edge.target_id, depth + 1))

    return results

# Optimization: cache hot node neighborhoods in Redis
# Key: node:{node_id}:rel:{rel_type}:neighbors, TTL 1h
# Invalidate on edge insert/delete for that source node
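The cache-aside pattern described in the comments above can be sketched as follows. A plain dict with expiry timestamps stands in for Redis, and `fetch_from_db` is an assumed callback; in production these would be a Redis client and the edges query.

```python
import time

# Dict-with-expiry standing in for Redis: key -> (expires_at, neighbor_ids)
CACHE: dict[str, tuple[float, list[str]]] = {}
TTL_SECONDS = 3600  # 1h TTL, matching the key scheme above

def _key(node_id: str, rel_type: str) -> str:
    return f"node:{node_id}:rel:{rel_type}:neighbors"

def get_neighbors_cached(node_id, rel_type, fetch_from_db):
    key = _key(node_id, rel_type)
    hit = CACHE.get(key)
    if hit and hit[0] > time.monotonic():
        return hit[1]                                   # cache hit, still fresh
    neighbors = fetch_from_db(node_id, rel_type)        # miss: go to the DB
    CACHE[key] = (time.monotonic() + TTL_SECONDS, neighbors)
    return neighbors

def on_edge_change(source_id, rel_type):
    # Invalidate on edge insert/delete for the affected source node
    CACHE.pop(_key(source_id, rel_type), None)
```

With real Redis, `CACHE[key] = ...` becomes SETEX and `pop` becomes DEL; the hit/miss/invalidate flow is the same.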

Semantic Search with Embeddings

Each entity has an embedding vector generated by a language model (OpenAI text-embedding-3-small, 1536 dimensions). Embeddings are computed on entity creation and updated when properties change. Storage: pgvector extension in PostgreSQL with an HNSW index for approximate nearest-neighbor search. Query: “find entities semantically similar to Tesla”: (1) embed the query string, (2) run a vector similarity search: SELECT node_id, 1 - (embedding <=> query_vec) AS similarity FROM nodes WHERE labels @> ARRAY['Company'] ORDER BY embedding <=> query_vec LIMIT 20. The <=> operator computes cosine distance. The HNSW index gives O(log n) approximate search instead of O(n) brute force. Combine with keyword search for hybrid retrieval: rank by 0.7 * vector_score + 0.3 * text_score.
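The similarity and hybrid-ranking math can be sketched in a few lines. This assumes both scores are already normalized to [0, 1]; the toy 2-dimensional vectors are stand-ins for the 1536-dimensional embeddings.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # 1 - cosine distance, i.e. the "similarity" column in the SQL query above
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hybrid_score(vector_score: float, text_score: float) -> float:
    # Weights from the hybrid-retrieval formula above
    return 0.7 * vector_score + 0.3 * text_score
```

In production the cosine computation runs inside pgvector via `<=>`; computing it client-side like this is only useful for re-ranking a small candidate set after the index has done the heavy lifting.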



