Knowledge Graph Overview
A knowledge graph stores entities (people, organizations, places, events) and the typed relationships between them, enabling structured queries over facts rather than just keyword search. The two dominant models — RDF triples and property graphs — make different tradeoffs between standardization and expressiveness.
Graph Data Models
- RDF triple: Every fact is a subject-predicate-object triple. Example:
(Elon_Musk, founded, SpaceX). Standardized format with SPARQL as the query language. Enables schema-free fact addition. - Property graph: Nodes have labels and properties; edges have types and properties. Example: node
Person {name: "Elon Musk"}connected by edgeFOUNDED {year: 2002}to nodeCompany {name: "SpaceX"}. Queried with Cypher (Neo4j) or Gremlin.
Storage Backends
- RDF triple stores: Apache Jena (embedded or TDB backend), Virtuoso (high-scale commercial), Amazon Neptune (managed, supports both SPARQL and Gremlin).
- Property graph: Neo4j (native graph storage), JanusGraph (distributed, pluggable backends: Apache Cassandra for scale, Apache HBase for Hadoop environments), Amazon Neptune.
For read-heavy analytical workloads, a graph backed by Cassandra scales horizontally. For transactional updates with ACID requirements, Neo4j's native engine is simpler to operate.
Entity and Relationship Schema
Entity schema:
entity_id(UUID)type: Person | Company | Place | Event | Productproperties: {name, aliases[], description, external_ids: {wikidata_id, freebase_id, crunchbase_id}}
Relationship schema:
from_entity_id,relationship_type,to_entity_idconfidence_score: 0.0–1.0 (how certain is this fact?)source: where this fact was extracted fromvalid_from,valid_to: temporal validity (facts can expire)
Graph Query Patterns
Cypher examples on a property graph:
// Find all companies founded by a person
MATCH (p:Person)-[:FOUNDED]->(c:Company)
WHERE p.name = 'Elon Musk'
RETURN c.name
// Find co-founders (two people who both founded the same company)
MATCH (p1:Person)-[:FOUNDED]->(c:Company)<-[:FOUNDED]-(p2:Person)
WHERE p1 p2
RETURN p1.name, p2.name, c.name
// Find investors two hops away
MATCH (target:Company)(related:Company)
RETURN related.name, count(*) AS shared_investors
ORDER BY shared_investors DESC
Traversal Optimization
Deep graph traversals are expensive without optimization:
- Materialized paths: Precompute and store the full path for common deep traversals (e.g., organizational hierarchy up to root). Trade storage for query speed.
- BFS with depth limit: Bound traversal depth to prevent runaway queries on highly connected nodes.
- Pregel-style distributed computation: For global graph algorithms (PageRank, community detection), use a distributed graph processing framework (Apache Spark GraphX, Pregel) that iterates over the full graph in parallel.
Entity Disambiguation and Resolution
The same real-world entity appears under multiple names across sources:
- “Apple Inc”, “Apple”, “AAPL”, “Apple Computer” all refer to the same entity.
- Entity resolution combines string similarity (edit distance, token overlap), context signals (co-occurring entities, industry, location), and external ID matching (Wikidata Q number).
- Resolved entities are merged: all aliases stored in the
aliases[]array, all external IDs stored inexternal_ids. - Confidence score on relationships reflects extraction confidence — low-confidence facts are candidates for human review.
Entity Embeddings for Semantic Queries
Knowledge graph embedding models (TransE, RotatE) learn dense vector representations of entities and relations:
- TransE: Models a relation as a translation in embedding space: head + relation ≈ tail.
- RotatE: Models relations as rotations in complex vector space, handles symmetric and antisymmetric relations better than TransE.
- Use cases: Missing link prediction (which entities are likely related but not yet connected?), entity similarity search (find entities semantically similar to a given entity).
Knowledge Graph Construction
Three sources feed the graph:
- Automated extraction from text: Named Entity Recognition (NER) identifies entities, relation extraction models identify relationships between entity pairs in the same sentence or paragraph.
- Manual curation: Human editors add and verify high-value facts, especially for prominent entities where accuracy is critical.
- Data fusion: Merge facts from multiple structured sources (Wikidata, Crunchbase, company registries). Conflict resolution when sources disagree: use confidence scores and source authority ranking.
Graph Updates and Temporal Facts
Facts change over time: a person changes companies, a company changes headquarters, a product is discontinued. The valid_from / valid_to fields on relationships handle temporal validity. Queries can be scoped to a point in time. Entity merging on disambiguation must propagate all relationships from the merged entity to the canonical entity without losing historical provenance.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering