Knowledge Graph Low-Level Design: Entity Storage, Relationship Traversal, and Graph Query Optimization

Knowledge Graph Overview

A knowledge graph stores entities (people, organizations, places, events) and the typed relationships between them, enabling structured queries over facts rather than just keyword search. The two dominant models — RDF triples and property graphs — make different tradeoffs between standardization and expressiveness.

Graph Data Models

  • RDF triple: Every fact is a subject-predicate-object triple. Example: (Elon_Musk, founded, SpaceX). Standardized format with SPARQL as the query language. Enables schema-free fact addition.
  • Property graph: Nodes have labels and properties; edges have types and properties. Example: node Person {name: "Elon Musk"} connected by edge FOUNDED {year: 2002} to node Company {name: "SpaceX"}. Queried with Cypher (Neo4j) or Gremlin.
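The difference shows up directly in code: a triple is a self-contained 3-tuple, while a property-graph edge can carry its own attributes (in the triple model, an edge property like the founding year requires reification into extra triples). A minimal sketch in Python, with illustrative names:

```python
from dataclasses import dataclass, field

# RDF-style triple: the whole fact is one subject-predicate-object tuple.
Triple = tuple[str, str, str]
fact: Triple = ("Elon_Musk", "founded", "SpaceX")

# Property-graph style: nodes and edges each carry labels/types plus properties.
@dataclass
class Node:
    label: str                           # e.g. "Person", "Company"
    properties: dict = field(default_factory=dict)

@dataclass
class Edge:
    type: str                            # e.g. "FOUNDED"
    properties: dict = field(default_factory=dict)

musk = Node("Person", {"name": "Elon Musk"})
spacex = Node("Company", {"name": "SpaceX"})
founded = Edge("FOUNDED", {"year": 2002})  # edge attribute lives on the edge itself
```

Note how `year: 2002` attaches naturally to the property-graph edge, whereas the bare triple has nowhere to put it without introducing an intermediate node or statement identifier.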

Storage Backends

  • RDF triple stores: Apache Jena (embedded or TDB backend), Virtuoso (high-scale commercial), Amazon Neptune (managed, supports both SPARQL and Gremlin).
  • Property graph: Neo4j (native graph storage), JanusGraph (distributed, pluggable backends: Apache Cassandra for scale, Apache HBase for Hadoop environments), Amazon Neptune.

For read-heavy analytical workloads, a JanusGraph deployment backed by Cassandra scales horizontally. For transactional updates with ACID requirements, Neo4j's native engine is simpler to operate.

Entity and Relationship Schema

Entity schema:

  • entity_id (UUID)
  • type: Person | Company | Place | Event | Product
  • properties: {name, aliases[], description, external_ids: {wikidata_id, freebase_id, crunchbase_id}}

Relationship schema:

  • from_entity_id, relationship_type, to_entity_id
  • confidence_score: 0.0–1.0 (how certain is this fact?)
  • source: where this fact was extracted from
  • valid_from, valid_to: temporal validity (facts can expire)
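The two schemas above can be sketched as Python dataclasses (field names follow the lists above; the defaults are assumptions):

```python
import uuid
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Entity:
    entity_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    type: str = "Person"                 # Person | Company | Place | Event | Product
    name: str = ""
    aliases: list[str] = field(default_factory=list)
    description: str = ""
    external_ids: dict[str, str] = field(default_factory=dict)  # wikidata_id, freebase_id, ...

@dataclass
class Relationship:
    from_entity_id: str
    relationship_type: str
    to_entity_id: str
    confidence_score: float = 1.0        # 0.0-1.0: how certain is this fact?
    source: str = ""                     # provenance: where the fact was extracted from
    valid_from: Optional[date] = None    # temporal validity; None = unbounded
    valid_to: Optional[date] = None      # facts can expire
```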

Graph Query Patterns

Cypher examples on a property graph:

// Find all companies founded by a person
MATCH (p:Person)-[:FOUNDED]->(c:Company)
WHERE p.name = 'Elon Musk'
RETURN c.name

// Find co-founders (two people who both founded the same company)
MATCH (p1:Person)-[:FOUNDED]->(c:Company)<-[:FOUNDED]-(p2:Person)
WHERE p1 <> p2
RETURN p1.name, p2.name, c.name

// Find companies related through shared investors (two hops from target)
MATCH (target:Company)<-[:INVESTED_IN]-(inv)-[:INVESTED_IN]->(related:Company)
WHERE target.name = 'SpaceX'
RETURN related.name, count(DISTINCT inv) AS shared_investors
ORDER BY shared_investors DESC

Traversal Optimization

Deep graph traversals are expensive without optimization:

  • Materialized paths: Precompute and store the full path for common deep traversals (e.g., organizational hierarchy up to root). Trade storage for query speed.
  • BFS with depth limit: Bound traversal depth to prevent runaway queries on highly connected nodes.
  • Pregel-style distributed computation: For global graph algorithms (PageRank, community detection), use a distributed graph processing framework (Apache Spark GraphX, or a Pregel-model system such as Apache Giraph) that iterates over the full graph in parallel.
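The BFS-with-depth-limit strategy can be sketched in a few lines of Python (the adjacency dict and toy graph are illustrative):

```python
from collections import deque

def bfs_neighbors(adjacency: dict[str, list[str]], start: str, max_depth: int) -> set[str]:
    """Breadth-first traversal bounded by max_depth, with a visited set so
    cycles and diamond patterns are expanded only once."""
    visited = {start}
    frontier = deque([(start, 0)])
    reachable = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue                     # depth bound: do not expand further
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                reachable.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return reachable

graph = {"A": ["B", "C"], "B": ["D"], "D": ["E"]}
# Within 2 hops of A: B and C (hop 1) plus D (hop 2); E is 3 hops away.
```

The depth bound is what protects against runaway expansion on high-degree nodes; production systems typically also cap the frontier size per hop.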

Entity Disambiguation and Resolution

The same real-world entity appears under multiple names across sources:

  • “Apple Inc”, “Apple”, “AAPL”, “Apple Computer” all refer to the same entity.
  • Entity resolution combines string similarity (edit distance, token overlap), context signals (co-occurring entities, industry, location), and external ID matching (Wikidata Q number).
  • Resolved entities are merged: all aliases stored in the aliases[] array, all external IDs stored in external_ids.
  • Confidence score on relationships reflects extraction confidence — low-confidence facts are candidates for human review.
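A simplified sketch of this resolution flow, combining token-overlap similarity, a decisive external-ID check, and alias merging (the threshold and suffix stoplist are illustrative assumptions):

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase tokens, ignoring corporate suffixes."""
    stop = {"inc", "inc.", "corp", "co", "ltd", "llc"}
    ta = {t for t in a.lower().split() if t not in stop}
    tb = {t for t in b.lower().split() if t not in stop}
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def same_entity(a: dict, b: dict, threshold: float = 0.5) -> bool:
    """An external-ID match is decisive; otherwise fall back to name similarity."""
    a_wd = a.get("external_ids", {}).get("wikidata_id")
    b_wd = b.get("external_ids", {}).get("wikidata_id")
    if a_wd and b_wd:
        return a_wd == b_wd
    return token_overlap(a["name"], b["name"]) >= threshold

def merge(canonical: dict, duplicate: dict) -> dict:
    """Fold the duplicate's name, aliases, and external IDs into the canonical record."""
    canonical.setdefault("aliases", [])
    for alias in [duplicate["name"], *duplicate.get("aliases", [])]:
        if alias not in canonical["aliases"] and alias != canonical["name"]:
            canonical["aliases"].append(alias)
    canonical.setdefault("external_ids", {}).update(duplicate.get("external_ids", {}))
    return canonical
```

A real pipeline would add edit distance and context signals (co-occurring entities, industry) as extra features; the structure of the decision, ID match first, fuzzy match as fallback, merge on success, stays the same.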

Entity Embeddings for Semantic Queries

Knowledge graph embedding models (TransE, RotatE) learn dense vector representations of entities and relations:

  • TransE: Models a relation as a translation in embedding space: head + relation ≈ tail.
  • RotatE: Models relations as rotations in complex vector space, handles symmetric and antisymmetric relations better than TransE.
  • Use cases: Missing link prediction (which entities are likely related but not yet connected?), entity similarity search (find entities semantically similar to a given entity).
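The TransE scoring rule (head + relation ≈ tail) is simple enough to sketch directly. The toy 2-dimensional embeddings below are illustrative (real models train hundreds of dimensions on observed triples):

```python
import math

def transe_score(head: list, relation: list, tail: list) -> float:
    """TransE plausibility: L2 distance of (head + relation) from tail.
    Lower is better; a perfect fact has head + relation == tail."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def predict_tail(head: list, relation: list, candidates: dict, k: int = 3) -> list:
    """Missing-link prediction: rank candidate tail entities by TransE score."""
    ranked = sorted(candidates, key=lambda name: transe_score(head, relation, candidates[name]))
    return ranked[:k]

# Toy embeddings (hypothetical values, for illustration only).
person = [0.0, 0.0]
founded_rel = [1.0, 0.0]             # relation vector for FOUNDED
companies = {"SpaceX": [1.0, 0.0], "Acme": [0.0, 5.0]}
```

This is also why TransE struggles with symmetric relations: if head + r = tail, then tail + r = head forces r toward zero, which is the failure mode RotatE's rotations avoid.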

Knowledge Graph Construction

Three sources feed the graph:

  1. Automated extraction from text: Named Entity Recognition (NER) identifies entities, relation extraction models identify relationships between entity pairs in the same sentence or paragraph.
  2. Manual curation: Human editors add and verify high-value facts, especially for prominent entities where accuracy is critical.
  3. Data fusion: Merge facts from multiple structured sources (Wikidata, Crunchbase, company registries). Conflict resolution when sources disagree: use confidence scores and source authority ranking.
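The conflict-resolution step in data fusion can be sketched as a simple ranking over (source authority, confidence); the authority table and fact values below are illustrative assumptions:

```python
# Hypothetical authority ranking: curated > structured sources > automated extraction.
SOURCE_AUTHORITY = {"manual_curation": 3, "wikidata": 2, "crunchbase": 2, "text_extraction": 1}

def resolve_conflict(facts: list[dict]) -> dict:
    """Pick the winning value when sources disagree: highest source authority
    first, highest extraction confidence as the tiebreaker."""
    return max(facts, key=lambda f: (SOURCE_AUTHORITY.get(f["source"], 0),
                                     f.get("confidence_score", 0.0)))

facts = [
    {"value": "City_A", "source": "wikidata", "confidence_score": 0.9},
    {"value": "City_B", "source": "text_extraction", "confidence_score": 0.95},
]
# City_A wins: wikidata outranks automated extraction despite the lower confidence.
```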

Graph Updates and Temporal Facts

Facts change over time: a person changes companies, a company changes headquarters, a product is discontinued. The valid_from / valid_to fields on relationships handle temporal validity. Queries can be scoped to a point in time. Entity merging on disambiguation must propagate all relationships from the merged entity to the canonical entity without losing historical provenance.
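A point-in-time query scope reduces to an interval filter over valid_from / valid_to (the relationship records below are illustrative):

```python
from datetime import date

def facts_as_of(relationships: list[dict], day: date) -> list[dict]:
    """Point-in-time scoping: keep relationships whose [valid_from, valid_to]
    interval covers `day`; a None bound means unbounded on that side."""
    def covers(rel: dict) -> bool:
        start_ok = rel.get("valid_from") is None or rel["valid_from"] <= day
        end_ok = rel.get("valid_to") is None or day <= rel["valid_to"]
        return start_ok and end_ok
    return [r for r in relationships if covers(r)]

rels = [
    {"type": "HQ_IN", "to": "City_A", "valid_from": date(2010, 1, 1), "valid_to": date(2020, 12, 31)},
    {"type": "HQ_IN", "to": "City_B", "valid_from": date(2021, 1, 1), "valid_to": None},
]
```

Expiring a fact is then an update to valid_to rather than a delete, which preserves historical provenance and keeps old point-in-time queries answerable.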

{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "How do you design the storage layer for a large-scale knowledge graph?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Knowledge graph storage must efficiently support three access patterns: entity lookup by ID, neighborhood traversal (find all edges of a node), and predicate-based filtering (find all entities with property X = Y). A common approach is a hybrid storage model: a distributed key-value store (RocksDB, Bigtable, or DynamoDB) for entity property storage, keyed by entity ID; an adjacency list store for graph structure, where each entity ID maps to a sorted list of (predicate, target entity ID) pairs; and an inverted index for predicate-based lookups. For RDF-style triples (subject, predicate, object), specialized stores like Apache Jena TDB or Google's Spanner-backed triple store use compound indexes: SPO, POS, and OSP orderings to support all three lookup directions efficiently. Sharding is done by entity ID hash, with hot entities (high-degree nodes like major celebrities or cities) replicated across shards to prevent hotspots."
}
},
{
"@type": "Question",
"name": "How do you efficiently traverse relationships in a knowledge graph to answer multi-hop queries?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Multi-hop traversal (e.g., 'find all movies directed by directors who worked with actor X') is expensive at scale. Optimizations include: BFS with early termination and visited-set deduplication to avoid cycles; query planning that estimates cardinality at each hop and reorders joins to minimize intermediate result set size (similar to a relational query optimizer); and precomputed materialized paths for common traversal patterns (stored as edge lists on popular entity pairs). For latency-sensitive serving, a graph embedding approach (TransE, RotatE, or ComplEx) encodes entities and relations as dense vectors, enabling approximate multi-hop reasoning via vector arithmetic in O(1) — at the cost of some accuracy. Production systems like Google's Knowledge Graph use a combination: exact traversal for precision-critical queries (factual lookups) and embedding-based retrieval for exploratory or fuzzy queries. Caching neighborhood results for high-degree nodes with a short TTL dramatically reduces traversal cost for popular entities."
}
},
{
"@type": "Question",
"name": "How do you design a graph query engine to optimize performance for complex SPARQL or Gremlin queries?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Graph query optimization focuses on join ordering, index selection, and push-down filtering. The query planner parses the query into a logical plan (a series of triple pattern joins for SPARQL, or step sequences for Gremlin), estimates selectivity for each pattern using statistics (entity count per predicate, predicate cardinality histograms), and generates a physical plan that starts from the most selective pattern to minimize intermediate result sizes. Predicate-object indexes (PO index) allow starting traversal from the object side when the object is more selective than the subject. For large intermediate results, hash joins are used; for small probe-side inputs, nested-loop joins with index lookups are faster. Execution is parallelized by partitioning intermediate results across workers. Adaptive query execution monitors runtime statistics (actual vs. estimated cardinality) and can replan mid-execution. Query results for common subgraph patterns are cached in a result cache keyed by the normalized subquery hash."
}
},
{
"@type": "Question",
"name": "How do you handle knowledge graph updates — entity additions, relation changes, and entity merges — at scale?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Knowledge graph updates fall into three categories: additive (new entity or relation), mutative (property value change), and structural (entity merge/split). Additive updates are written via a WAL-backed API that propagates changes to the adjacency store and inverted indexes, achieving eventual consistency across replicas within seconds. Mutative updates use optimistic concurrency control with version vectors to detect conflicts; last-write-wins is acceptable for factual updates, but schema-level changes require coordination. Entity merges — collapsing two entity IDs into one (deduplication) — are the hardest operation: all inbound edges pointing to the deprecated ID must be redirected, which is O(degree) writes. This is handled by maintaining a canonical ID mapping table (deprecated ID → canonical ID) that is applied transparently at read time, deferring the expensive edge rewrite to a background compaction job. Knowledge provenance (source, timestamp, confidence score) is stored per-triple to support conflict resolution and auditing when multiple sources disagree on a fact."
}
}
]
}

