Knowledge Graph Low-Level Design: Entity Storage, Relationship Traversal, and Graph Query Optimization

Knowledge Graph Overview

A knowledge graph stores entities (people, organizations, places, events) and the typed relationships between them, enabling structured queries over facts rather than just keyword search. The two dominant models — RDF triples and property graphs — make different tradeoffs between standardization and expressiveness.

Graph Data Models

  • RDF triple: Every fact is a subject-predicate-object triple. Example: (Elon_Musk, founded, SpaceX). Standardized format with SPARQL as the query language. Enables schema-free fact addition.
  • Property graph: Nodes have labels and properties; edges have types and properties. Example: node Person {name: "Elon Musk"} connected by edge FOUNDED {year: 2002} to node Company {name: "SpaceX"}. Queried with Cypher (Neo4j) or Gremlin.
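The difference shows up directly in code: a triple is a self-contained 3-tuple, while a property-graph edge can carry its own attributes (in the triple model, an edge property like the founding year requires reification into extra triples). A minimal sketch in Python, with illustrative names:

```python
from dataclasses import dataclass, field

# RDF-style triple: the whole fact is one subject-predicate-object tuple.
Triple = tuple[str, str, str]
fact: Triple = ("Elon_Musk", "founded", "SpaceX")

# Property-graph style: nodes and edges each carry labels/types plus properties.
@dataclass
class Node:
    label: str                           # e.g. "Person", "Company"
    properties: dict = field(default_factory=dict)

@dataclass
class Edge:
    type: str                            # e.g. "FOUNDED"
    properties: dict = field(default_factory=dict)

musk = Node("Person", {"name": "Elon Musk"})
spacex = Node("Company", {"name": "SpaceX"})
founded = Edge("FOUNDED", {"year": 2002})  # edge attribute lives on the edge itself
```

Note how `year: 2002` attaches naturally to the property-graph edge, whereas the bare triple has nowhere to put it without introducing an intermediate node or statement identifier.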

Storage Backends

  • RDF triple stores: Apache Jena (embedded or TDB backend), Virtuoso (high-scale commercial), Amazon Neptune (managed, supports both SPARQL and Gremlin).
  • Property graph: Neo4j (native graph storage), JanusGraph (distributed, pluggable backends: Apache Cassandra for scale, Apache HBase for Hadoop environments), Amazon Neptune.

For read-heavy analytical workloads, a JanusGraph deployment backed by Cassandra scales horizontally. For transactional updates with ACID requirements, Neo4j's native engine is simpler to operate.

Entity and Relationship Schema

Entity schema:

  • entity_id (UUID)
  • type: Person | Company | Place | Event | Product
  • properties: {name, aliases[], description, external_ids: {wikidata_id, freebase_id, crunchbase_id}}

Relationship schema:

  • from_entity_id, relationship_type, to_entity_id
  • confidence_score: 0.0–1.0 (how certain is this fact?)
  • source: where this fact was extracted from
  • valid_from, valid_to: temporal validity (facts can expire)
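The two schemas above can be sketched as Python dataclasses (field names follow the lists above; the defaults are assumptions):

```python
import uuid
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Entity:
    entity_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    type: str = "Person"                 # Person | Company | Place | Event | Product
    name: str = ""
    aliases: list[str] = field(default_factory=list)
    description: str = ""
    external_ids: dict[str, str] = field(default_factory=dict)  # wikidata_id, freebase_id, ...

@dataclass
class Relationship:
    from_entity_id: str
    relationship_type: str
    to_entity_id: str
    confidence_score: float = 1.0        # 0.0-1.0: how certain is this fact?
    source: str = ""                     # provenance: where the fact was extracted from
    valid_from: Optional[date] = None    # temporal validity; None = unbounded
    valid_to: Optional[date] = None      # facts can expire
```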

Graph Query Patterns

Cypher examples on a property graph:

// Find all companies founded by a person
MATCH (p:Person)-[:FOUNDED]->(c:Company)
WHERE p.name = 'Elon Musk'
RETURN c.name

// Find co-founders (two people who both founded the same company)
MATCH (p1:Person)-[:FOUNDED]->(c:Company)<-[:FOUNDED]-(p2:Person)
WHERE p1 <> p2
RETURN p1.name, p2.name, c.name

// Find companies related through shared investors (two hops from target)
MATCH (target:Company)<-[:INVESTED_IN]-(inv)-[:INVESTED_IN]->(related:Company)
WHERE target.name = 'SpaceX'
RETURN related.name, count(DISTINCT inv) AS shared_investors
ORDER BY shared_investors DESC

Traversal Optimization

Deep graph traversals are expensive without optimization:

  • Materialized paths: Precompute and store the full path for common deep traversals (e.g., organizational hierarchy up to root). Trade storage for query speed.
  • BFS with depth limit: Bound traversal depth to prevent runaway queries on highly connected nodes.
  • Pregel-style distributed computation: For global graph algorithms (PageRank, community detection), use a distributed graph processing framework (Apache Spark GraphX, or a Pregel-model system such as Apache Giraph) that iterates over the full graph in parallel.
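The BFS-with-depth-limit strategy can be sketched in a few lines of Python (the adjacency dict and toy graph are illustrative):

```python
from collections import deque

def bfs_neighbors(adjacency: dict[str, list[str]], start: str, max_depth: int) -> set[str]:
    """Breadth-first traversal bounded by max_depth, with a visited set so
    cycles and diamond patterns are expanded only once."""
    visited = {start}
    frontier = deque([(start, 0)])
    reachable = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue                     # depth bound: do not expand further
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                reachable.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return reachable

graph = {"A": ["B", "C"], "B": ["D"], "D": ["E"]}
# Within 2 hops of A: B and C (hop 1) plus D (hop 2); E is 3 hops away.
```

The depth bound is what protects against runaway expansion on high-degree nodes; production systems typically also cap the frontier size per hop.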

Entity Disambiguation and Resolution

The same real-world entity appears under multiple names across sources:

  • “Apple Inc”, “Apple”, “AAPL”, “Apple Computer” all refer to the same entity.
  • Entity resolution combines string similarity (edit distance, token overlap), context signals (co-occurring entities, industry, location), and external ID matching (Wikidata Q number).
  • Resolved entities are merged: all aliases stored in the aliases[] array, all external IDs stored in external_ids.
  • Confidence score on relationships reflects extraction confidence — low-confidence facts are candidates for human review.
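A simplified sketch of this resolution flow, combining token-overlap similarity, a decisive external-ID check, and alias merging (the threshold and suffix stoplist are illustrative assumptions):

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase tokens, ignoring corporate suffixes."""
    stop = {"inc", "inc.", "corp", "co", "ltd", "llc"}
    ta = {t for t in a.lower().split() if t not in stop}
    tb = {t for t in b.lower().split() if t not in stop}
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def same_entity(a: dict, b: dict, threshold: float = 0.5) -> bool:
    """An external-ID match is decisive; otherwise fall back to name similarity."""
    a_wd = a.get("external_ids", {}).get("wikidata_id")
    b_wd = b.get("external_ids", {}).get("wikidata_id")
    if a_wd and b_wd:
        return a_wd == b_wd
    return token_overlap(a["name"], b["name"]) >= threshold

def merge(canonical: dict, duplicate: dict) -> dict:
    """Fold the duplicate's name, aliases, and external IDs into the canonical record."""
    canonical.setdefault("aliases", [])
    for alias in [duplicate["name"], *duplicate.get("aliases", [])]:
        if alias not in canonical["aliases"] and alias != canonical["name"]:
            canonical["aliases"].append(alias)
    canonical.setdefault("external_ids", {}).update(duplicate.get("external_ids", {}))
    return canonical
```

A real pipeline would add edit distance and context signals (co-occurring entities, industry) as extra features; the structure of the decision, ID match first, fuzzy match as fallback, merge on success, stays the same.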

Entity Embeddings for Semantic Queries

Knowledge graph embedding models (TransE, RotatE) learn dense vector representations of entities and relations:

  • TransE: Models a relation as a translation in embedding space: head + relation ≈ tail.
  • RotatE: Models relations as rotations in complex vector space, handles symmetric and antisymmetric relations better than TransE.
  • Use cases: Missing link prediction (which entities are likely related but not yet connected?), entity similarity search (find entities semantically similar to a given entity).
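The TransE scoring rule (head + relation ≈ tail) is simple enough to sketch directly. The toy 2-dimensional embeddings below are illustrative (real models train hundreds of dimensions on observed triples):

```python
import math

def transe_score(head: list, relation: list, tail: list) -> float:
    """TransE plausibility: L2 distance of (head + relation) from tail.
    Lower is better; a perfect fact has head + relation == tail."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def predict_tail(head: list, relation: list, candidates: dict, k: int = 3) -> list:
    """Missing-link prediction: rank candidate tail entities by TransE score."""
    ranked = sorted(candidates, key=lambda name: transe_score(head, relation, candidates[name]))
    return ranked[:k]

# Toy embeddings (hypothetical values, for illustration only).
person = [0.0, 0.0]
founded_rel = [1.0, 0.0]             # relation vector for FOUNDED
companies = {"SpaceX": [1.0, 0.0], "Acme": [0.0, 5.0]}
```

This is also why TransE struggles with symmetric relations: if head + r = tail, then tail + r = head forces r toward zero, which is the failure mode RotatE's rotations avoid.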

Knowledge Graph Construction

Three sources feed the graph:

  1. Automated extraction from text: Named Entity Recognition (NER) identifies entities, relation extraction models identify relationships between entity pairs in the same sentence or paragraph.
  2. Manual curation: Human editors add and verify high-value facts, especially for prominent entities where accuracy is critical.
  3. Data fusion: Merge facts from multiple structured sources (Wikidata, Crunchbase, company registries). Conflict resolution when sources disagree: use confidence scores and source authority ranking.
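The conflict-resolution step in data fusion can be sketched as a simple ranking over (source authority, confidence); the authority table and fact values below are illustrative assumptions:

```python
# Hypothetical authority ranking: curated > structured sources > automated extraction.
SOURCE_AUTHORITY = {"manual_curation": 3, "wikidata": 2, "crunchbase": 2, "text_extraction": 1}

def resolve_conflict(facts: list[dict]) -> dict:
    """Pick the winning value when sources disagree: highest source authority
    first, highest extraction confidence as the tiebreaker."""
    return max(facts, key=lambda f: (SOURCE_AUTHORITY.get(f["source"], 0),
                                     f.get("confidence_score", 0.0)))

facts = [
    {"value": "City_A", "source": "wikidata", "confidence_score": 0.9},
    {"value": "City_B", "source": "text_extraction", "confidence_score": 0.95},
]
# City_A wins: wikidata outranks automated extraction despite the lower confidence.
```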

Graph Updates and Temporal Facts

Facts change over time: a person changes companies, a company changes headquarters, a product is discontinued. The valid_from / valid_to fields on relationships handle temporal validity. Queries can be scoped to a point in time. Entity merging on disambiguation must propagate all relationships from the merged entity to the canonical entity without losing historical provenance.
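A point-in-time query scope reduces to an interval filter over valid_from / valid_to (the relationship records below are illustrative):

```python
from datetime import date

def facts_as_of(relationships: list[dict], day: date) -> list[dict]:
    """Point-in-time scoping: keep relationships whose [valid_from, valid_to]
    interval covers `day`; a None bound means unbounded on that side."""
    def covers(rel: dict) -> bool:
        start_ok = rel.get("valid_from") is None or rel["valid_from"] <= day
        end_ok = rel.get("valid_to") is None or day <= rel["valid_to"]
        return start_ok and end_ok
    return [r for r in relationships if covers(r)]

rels = [
    {"type": "HQ_IN", "to": "City_A", "valid_from": date(2010, 1, 1), "valid_to": date(2020, 12, 31)},
    {"type": "HQ_IN", "to": "City_B", "valid_from": date(2021, 1, 1), "valid_to": None},
]
```

Expiring a fact is then an update to valid_to rather than a delete, which preserves historical provenance and keeps old point-in-time queries answerable.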

{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "How do you design the storage layer for a large-scale knowledge graph?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Knowledge graph storage must efficiently support three access patterns: entity lookup by ID, neighborhood traversal (find all edges of a node), and predicate-based filtering (find all entities with property X = Y). A common approach is a hybrid storage model: a distributed key-value store (RocksDB, Bigtable, or DynamoDB) for entity property storage, keyed by entity ID; an adjacency list store for graph structure, where each entity ID maps to a sorted list of (predicate, target entity ID) pairs; and an inverted index for predicate-based lookups. For RDF-style triples (subject, predicate, object), specialized stores like Apache Jena TDB or Google's Spanner-backed triple store use compound indexes: SPO, POS, and OSP orderings to support all three lookup directions efficiently. Sharding is done by entity ID hash, with hot entities (high-degree nodes like major celebrities or cities) replicated across shards to prevent hotspots."
}
},
{
"@type": "Question",
"name": "How do you efficiently traverse relationships in a knowledge graph to answer multi-hop queries?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Multi-hop traversal (e.g., 'find all movies directed by directors who worked with actor X') is expensive at scale. Optimizations include: BFS with early termination and visited-set deduplication to avoid cycles; query planning that estimates cardinality at each hop and reorders joins to minimize intermediate result set size (similar to a relational query optimizer); and precomputed materialized paths for common traversal patterns (stored as edge lists on popular entity pairs). For latency-sensitive serving, a graph embedding approach (TransE, RotatE, or ComplEx) encodes entities and relations as dense vectors, enabling approximate multi-hop reasoning via vector arithmetic in O(1) — at the cost of some accuracy. Production systems like Google's Knowledge Graph use a combination: exact traversal for precision-critical queries (factual lookups) and embedding-based retrieval for exploratory or fuzzy queries. Caching neighborhood results for high-degree nodes with a short TTL dramatically reduces traversal cost for popular entities."
}
},
{
"@type": "Question",
"name": "How do you design a graph query engine to optimize performance for complex SPARQL or Gremlin queries?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Graph query optimization focuses on join ordering, index selection, and push-down filtering. The query planner parses the query into a logical plan (a series of triple pattern joins for SPARQL, or step sequences for Gremlin), estimates selectivity for each pattern using statistics (entity count per predicate, predicate cardinality histograms), and generates a physical plan that starts from the most selective pattern to minimize intermediate result sizes. Predicate-object indexes (PO index) allow starting traversal from the object side when the object is more selective than the subject. For large intermediate results, hash joins are used; for small probe-side inputs, nested-loop joins with index lookups are faster. Execution is parallelized by partitioning intermediate results across workers. Adaptive query execution monitors runtime statistics (actual vs. estimated cardinality) and can replan mid-execution. Query results for common subgraph patterns are cached in a result cache keyed by the normalized subquery hash."
}
},
{
"@type": "Question",
"name": "How do you handle knowledge graph updates — entity additions, relation changes, and entity merges — at scale?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Knowledge graph updates fall into three categories: additive (new entity or relation), mutative (property value change), and structural (entity merge/split). Additive updates are written via a WAL-backed API that propagates changes to the adjacency store and inverted indexes, achieving eventual consistency across replicas within seconds. Mutative updates use optimistic concurrency control with version vectors to detect conflicts; last-write-wins is acceptable for factual updates, but schema-level changes require coordination. Entity merges — collapsing two entity IDs into one (deduplication) — are the hardest operation: all inbound edges pointing to the deprecated ID must be redirected, which is O(degree) writes. This is handled by maintaining a canonical ID mapping table (deprecated ID → canonical ID) that is applied transparently at read time, deferring the expensive edge rewrite to a background compaction job. Knowledge provenance (source, timestamp, confidence score) is stored per-triple to support conflict resolution and auditing when multiple sources disagree on a fact."
}
}
]
}

