Knowledge Graph Low-Level Design: Entity Storage, Relationship Traversal, and Graph Query Optimization

Knowledge Graph Overview

A knowledge graph stores entities (people, organizations, places, events) and the typed relationships between them, enabling structured queries over facts rather than just keyword search. The two dominant models — RDF triples and property graphs — make different tradeoffs between standardization and expressiveness.

Graph Data Models

  • RDF triple: Every fact is a subject-predicate-object triple. Example: (Elon_Musk, founded, SpaceX). Standardized format with SPARQL as the query language. Enables schema-free fact addition.
  • Property graph: Nodes have labels and properties; edges have types and properties. Example: node Person {name: "Elon Musk"} connected by edge FOUNDED {year: 2002} to node Company {name: "SpaceX"}. Queried with Cypher (Neo4j) or Gremlin.

Storage Backends

  • RDF triple stores: Apache Jena (embedded or TDB backend), Virtuoso (high-scale commercial), Amazon Neptune (managed, supports both SPARQL and Gremlin).
  • Property graph: Neo4j (native graph storage), JanusGraph (distributed, pluggable backends: Apache Cassandra for scale, Apache HBase for Hadoop environments), Amazon Neptune.

For read-heavy analytical workloads, a graph backed by Cassandra scales horizontally. For transactional updates with ACID requirements, Neo4j's native engine is simpler to operate.

Entity and Relationship Schema

Entity schema:

  • entity_id (UUID)
  • type: Person | Company | Place | Event | Product
  • properties: {name, aliases[], description, external_ids: {wikidata_id, freebase_id, crunchbase_id}}

Relationship schema:

  • from_entity_id, relationship_type, to_entity_id
  • confidence_score: 0.0–1.0 (how certain is this fact?)
  • source: where this fact was extracted from
  • valid_from, valid_to: temporal validity (facts can expire)

Graph Query Patterns

Cypher examples on a property graph:

// Find all companies founded by a person
MATCH (p:Person)-[:FOUNDED]->(c:Company)
WHERE p.name = 'Elon Musk'
RETURN c.name

// Find co-founders (two people who both founded the same company)
MATCH (p1:Person)-[:FOUNDED]->(c:Company)<-[:FOUNDED]-(p2:Person)
WHERE p1  p2
RETURN p1.name, p2.name, c.name

// Find investors two hops away
MATCH (target:Company)(related:Company)
RETURN related.name, count(*) AS shared_investors
ORDER BY shared_investors DESC

Traversal Optimization

Deep graph traversals are expensive without optimization:

  • Materialized paths: Precompute and store the full path for common deep traversals (e.g., organizational hierarchy up to root). Trade storage for query speed.
  • BFS with depth limit: Bound traversal depth to prevent runaway queries on highly connected nodes.
  • Pregel-style distributed computation: For global graph algorithms (PageRank, community detection), use a distributed graph processing framework (Apache Spark GraphX, Pregel) that iterates over the full graph in parallel.

Entity Disambiguation and Resolution

The same real-world entity appears under multiple names across sources:

  • “Apple Inc”, “Apple”, “AAPL”, “Apple Computer” all refer to the same entity.
  • Entity resolution combines string similarity (edit distance, token overlap), context signals (co-occurring entities, industry, location), and external ID matching (Wikidata Q number).
  • Resolved entities are merged: all aliases stored in the aliases[] array, all external IDs stored in external_ids.
  • Confidence score on relationships reflects extraction confidence — low-confidence facts are candidates for human review.

Entity Embeddings for Semantic Queries

Knowledge graph embedding models (TransE, RotatE) learn dense vector representations of entities and relations:

  • TransE: Models a relation as a translation in embedding space: head + relation ≈ tail.
  • RotatE: Models relations as rotations in complex vector space, handles symmetric and antisymmetric relations better than TransE.
  • Use cases: Missing link prediction (which entities are likely related but not yet connected?), entity similarity search (find entities semantically similar to a given entity).

Knowledge Graph Construction

Three sources feed the graph:

  1. Automated extraction from text: Named Entity Recognition (NER) identifies entities, relation extraction models identify relationships between entity pairs in the same sentence or paragraph.
  2. Manual curation: Human editors add and verify high-value facts, especially for prominent entities where accuracy is critical.
  3. Data fusion: Merge facts from multiple structured sources (Wikidata, Crunchbase, company registries). Conflict resolution when sources disagree: use confidence scores and source authority ranking.

Graph Updates and Temporal Facts

Facts change over time: a person changes companies, a company changes headquarters, a product is discontinued. The valid_from / valid_to fields on relationships handle temporal validity. Queries can be scoped to a point in time. Entity merging on disambiguation must propagate all relationships from the merged entity to the canonical entity without losing historical provenance.

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering

See also: LinkedIn Interview Guide 2026: Social Graph Engineering, Feed Ranking, and Professional Network Scale

Scroll to Top