A graph database stores entities as nodes and relationships as edges, with properties on both. Unlike relational databases that simulate graphs via foreign key joins, native graph databases use index-free adjacency — direct physical pointers from each node to its adjacent relationships — enabling constant-time local neighbor access that does not degrade as the graph grows.
Native Graph Storage Model
In a native graph store, each record type is stored in its own file or store:
- Node store: fixed-size records, each containing a pointer to the first relationship record and a pointer to the first property record.
- Relationship store: each record contains src_node pointer, dst_node pointer, relationship type ID, pointers to the next/previous relationship for both the source and destination node (doubly-linked list per node), and a pointer to properties.
- Property store: key-value pairs packed into property records, with overflow to a dynamic store for large values.
- Label and type stores: token stores mapping label/type names to integer IDs for compact in-record storage.
Index-free adjacency means traversal follows physical pointers — when you expand a node's neighbors, the engine reads the first relationship record directly via the stored pointer and then follows the doubly-linked list without any index lookup. This makes the cost of a single hop O(degree) rather than O(log N) for a global index lookup.
Relationship Record Structure
RelationshipRecord {
id: uint64
src_node_id: uint64
dst_node_id: uint64
relationship_type: uint32 // token ID
src_prev_rel: uint64 // doubly-linked list for src node
src_next_rel: uint64
dst_prev_rel: uint64 // doubly-linked list for dst node
dst_next_rel: uint64
property_ptr: uint64
in_use: bool
}
Traversal walks the linked list of relationships at a node — reading only the relationships of the node being expanded, not the full edge table.
Traversal Engine: DFS and BFS with Visitor Pattern
The traversal engine supports both depth-first and breadth-first traversal. A visitor pattern allows pluggable logic at each node and relationship during traversal:
- DFS: lower memory footprint, suitable for path existence queries and deep hierarchies.
- BFS: finds shortest paths, suitable for social distance queries (degrees of separation).
- Bidirectional BFS: starts from both source and target simultaneously; meets in the middle, dramatically reducing explored nodes for shortest-path queries in dense social graphs.
Pruning is applied via relationship type filters (only follow FOLLOWS edges) and property filters evaluated during traversal rather than after.
Cypher Query Execution
A representative Cypher pattern:
MATCH (a:User)-[:FOLLOWS]->(b:User)
WHERE a.id = $id
RETURN b.name, b.email
ORDER BY b.name
Execution pipeline:
- Parse: tokenise and produce AST.
- Logical plan: NodeIndexSeek(User, id=$id) → Expand(FOLLOWS, OUTGOING) → Filter(b:User) → Project(b.name, b.email) → Sort.
- Optimise: push label filter into Expand; use property index for entry point.
- Physical plan: NodeIndexSeek uses the User.id B-tree index to find node a in O(log N), then Expand follows index-free adjacency pointers.
Label and Property Indexes
Indexes in a graph database serve as entry points for pattern matching. Without an index, the engine must scan all nodes with the given label. B-tree indexes are used for range queries; hash indexes for exact-match lookups. Composite indexes cover multi-property predicates.
Index maintenance: indexes are updated synchronously on write. For high-write workloads, bulk-loading with deferred index build is preferred.
Graph Algorithms
- PageRank: iterative O(k*(V+E)); rank flows from high-connectivity nodes outward.
- Louvain (community detection): modularity optimisation; ~O(V log V) in practice.
- Dijkstra / A*: weighted shortest path; O((V+E) log V) with binary heap.
- Betweenness centrality: O(V*E) — typically computed offline.
SQL DDL: Relational Analog
-- Nodes
CREATE TABLE GNode (
id BIGSERIAL PRIMARY KEY,
labels JSONB NOT NULL DEFAULT '[]',
properties JSONB NOT NULL DEFAULT '{}'
);
CREATE INDEX idx_gnode_labels ON GNode USING GIN (labels);
CREATE INDEX idx_gnode_properties ON GNode USING GIN (properties);
-- Edges (directed relationships)
CREATE TABLE GEdge (
id BIGSERIAL PRIMARY KEY,
src_id BIGINT NOT NULL REFERENCES GNode(id),
dst_id BIGINT NOT NULL REFERENCES GNode(id),
rel_type VARCHAR(128) NOT NULL,
properties JSONB NOT NULL DEFAULT '{}'
);
CREATE INDEX idx_gedge_src ON GEdge (src_id, rel_type);
CREATE INDEX idx_gedge_dst ON GEdge (dst_id, rel_type);
-- Property index for fast entry-point lookup
CREATE TABLE GIndex (
label VARCHAR(128) NOT NULL,
property_name VARCHAR(128) NOT NULL,
node_id BIGINT NOT NULL REFERENCES GNode(id),
PRIMARY KEY (label, property_name, node_id)
);
Python: Core Operations
from neo4j import GraphDatabase
from typing import Any
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
def create_node(labels: list[str], properties: dict[str, Any]) -> int:
"""Create a node and return its internal ID."""
label_str = ":".join(labels)
query = f"CREATE (n:{label_str} $props) RETURN id(n) AS node_id"
with driver.session() as session:
result = session.run(query, props=properties)
return result.single()["node_id"]
def create_edge(src: int, dst: int, rel_type: str, properties: dict[str, Any]) -> None:
"""Create a directed relationship between two nodes by internal ID."""
query = (
"MATCH (a) WHERE id(a) = $src "
"MATCH (b) WHERE id(b) = $dst "
f"CREATE (a)-[r:{rel_type} $props]->(b)"
)
with driver.session() as session:
session.run(query, src=src, dst=dst, props=properties)
def match_pattern(user_id: int) -> list[dict]:
"""Return all users directly followed by the given user."""
query = (
"MATCH (a:User {id: $uid})-[:FOLLOWS]->(b:User) "
"RETURN b.id AS id, b.name AS name ORDER BY b.name"
)
with driver.session() as session:
result = session.run(query, uid=user_id)
return [dict(record) for record in result]
def run_page_rank(graph_name: str = "myGraph") -> list[dict]:
"""Run PageRank via GDS and return top-ranked nodes."""
query = (
"CALL gds.pageRank.stream($graph) "
"YIELD nodeId, score "
"RETURN gds.util.asNode(nodeId).name AS name, score "
"ORDER BY score DESC LIMIT 10"
)
with driver.session() as session:
result = session.run(query, graph=graph_name)
return [dict(record) for record in result]
Design Considerations Summary
- Index-free adjacency: core differentiator; local traversal cost is O(degree), not O(log N).
- Entry-point indexes: always index the property used in WHERE to avoid full label scans.
- Cypher compilation: filter pushdown is critical; ensure predicate order matches traversal direction.
- Graph algorithms: run offline for large graphs; store results as node properties for real-time reads.
- Graph partitioning: distributing a graph across machines is hard due to high inter-partition edge traffic; prefer native graph databases that fit the working set in memory.
See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture