Graph Database Low-Level Design: Native Graph Storage, Traversal Engine, and Cypher Query Execution


A graph database stores entities as nodes and relationships as edges, with properties on both. Unlike relational databases that simulate graphs via foreign key joins, native graph databases use index-free adjacency — direct physical pointers from each node to its adjacent relationships — enabling constant-time local neighbor access that does not degrade as the graph grows.

Native Graph Storage Model

In a native graph store, each record type is stored in its own file or store:

  • Node store: fixed-size records, each containing a pointer to the first relationship record and a pointer to the first property record.
  • Relationship store: each record contains src_node pointer, dst_node pointer, relationship type ID, pointers to the next/previous relationship for both the source and destination node (doubly-linked list per node), and a pointer to properties.
  • Property store: key-value pairs packed into property records, with overflow to a dynamic store for large values.
  • Label and type stores: token stores mapping label/type names to integer IDs for compact in-record storage.

Index-free adjacency means traversal follows physical pointers: when you expand a node's neighbors, the engine reads the first relationship record directly via the stored pointer and then follows the doubly-linked list without any index lookup. Expanding a node therefore costs O(degree), proportional only to that node's own relationships, rather than O(log N) per neighbor through a global edge index.

Relationship Record Structure

RelationshipRecord {
    id:                  uint64
    src_node_id:         uint64
    dst_node_id:         uint64
    relationship_type:   uint32   // token ID
    src_prev_rel:        uint64   // doubly-linked list for src node
    src_next_rel:        uint64
    dst_prev_rel:        uint64   // doubly-linked list for dst node
    dst_next_rel:        uint64
    property_ptr:        uint64
    in_use:              bool
}

Traversal walks the linked list of relationships at a node — reading only the relationships of the node being expanded, not the full edge table.
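The linked-list walk can be sketched as a small in-memory model in Python (field names mirror the record layout above, abridged to the fields the walk needs; the `NIL` sentinel and the store dict are assumptions for illustration):

```python
from dataclasses import dataclass
from typing import Iterator

NIL = 0  # sentinel pointer: end of a relationship chain

@dataclass
class RelationshipRecord:      # abridged: only the fields the walk touches
    id: int
    src_node_id: int
    dst_node_id: int
    src_next_rel: int          # next relationship in the src node's chain
    dst_next_rel: int          # next relationship in the dst node's chain

@dataclass
class NodeRecord:
    id: int
    first_rel: int             # pointer to the first relationship record

def relationships_of(node: NodeRecord,
                     rel_store: dict[int, RelationshipRecord]) -> Iterator[RelationshipRecord]:
    """Walk this node's relationship chain via direct pointers -- no global index."""
    rel_id = node.first_rel
    while rel_id != NIL:
        rel = rel_store[rel_id]
        yield rel
        # follow whichever chain belongs to the node being expanded
        rel_id = rel.src_next_rel if rel.src_node_id == node.id else rel.dst_next_rel
```

Only the expanded node's own chain is read, so the walk stays O(degree) no matter how many edges exist elsewhere in the graph.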

Traversal Engine: DFS and BFS with Visitor Pattern

The traversal engine supports both depth-first and breadth-first traversal. A visitor pattern allows pluggable logic at each node and relationship during traversal:

  • DFS: lower memory footprint, suitable for path existence queries and deep hierarchies.
  • BFS: finds shortest paths, suitable for social distance queries (degrees of separation).
  • Bidirectional BFS: starts from both source and target simultaneously; meets in the middle, dramatically reducing explored nodes for shortest-path queries in dense social graphs.

Pruning is applied via relationship type filters (only follow FOLLOWS edges) and property filters evaluated during traversal rather than after.
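A minimal BFS sketch with a pluggable visitor and the two pruning hooks described above (the adjacency-dict representation is an assumption standing in for the record stores):

```python
from collections import deque
from typing import Callable, Optional

# node -> list of (rel_type, neighbor); a stand-in for the record stores
Graph = dict[int, list[tuple[str, int]]]

def bfs(graph: Graph, start: int,
        visit: Callable[[int, int], None],
        rel_filter: Optional[str] = None,
        max_depth: int = 3) -> None:
    """Breadth-first traversal with a visitor callback, relationship-type
    pruning, and a depth limit -- filters are applied during expansion,
    not after the traversal completes."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        visit(node, depth)            # visitor hook: pluggable per-node logic
        if depth == max_depth:
            continue                  # prune: do not expand past the depth limit
        for rel_type, nbr in graph.get(node, []):
            if rel_filter is not None and rel_type != rel_filter:
                continue              # prune: only follow the requested edge type
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, depth + 1))
```

Because BFS visits nodes in increasing depth order, the first time a target is visited gives its degree of separation from the start node.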

Cypher Query Execution

A representative Cypher pattern:

MATCH (a:User)-[:FOLLOWS]->(b:User)
WHERE a.id = $id
RETURN b.name, b.email
ORDER BY b.name

Execution pipeline:

  1. Parse: tokenise and produce AST.
  2. Logical plan: NodeIndexSeek(User, id=$id) → Expand(FOLLOWS, OUTGOING) → Filter(b:User) → Project(b.name, b.email) → Sort.
  3. Optimise: push label filter into Expand; use property index for entry point.
  4. Physical plan: NodeIndexSeek uses the User.id B-tree index to find node a in O(log N), then Expand follows index-free adjacency pointers.
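The operator chain in steps 2-4 can be modeled as composed iterators, a pull-based (Volcano-style) sketch; the dict-backed index and adjacency map are stand-ins for the real stores:

```python
from typing import Any, Iterable, Iterator

Node = dict[str, Any]

def node_index_seek(index: dict[Any, Node], key: Any) -> Iterator[Node]:
    """Entry point: index lookup for the anchor node (dict stands in for the B-tree)."""
    if key in index:
        yield index[key]

def expand(rows: Iterable[Node],
           adjacency: dict[int, list[Node]]) -> Iterator[tuple[Node, Node]]:
    """Expand(FOLLOWS, OUTGOING): follow adjacency from each incoming row."""
    for a in rows:
        for b in adjacency.get(a["id"], []):
            yield (a, b)

def project(rows: Iterable[tuple[Node, Node]], *keys: str) -> Iterator[Node]:
    """Project(b.name, b.email): keep only the requested properties of b."""
    for _a, b in rows:
        yield {k: b[k] for k in keys}

def sort_rows(rows: Iterable[Node], key: str) -> list[Node]:
    """Sort: materialize and order the projected rows."""
    return sorted(rows, key=lambda r: r[key])

# Compose the plan in the optimizer's order:
# NodeIndexSeek -> Expand -> Project -> Sort
```

Each operator pulls rows lazily from its child, so filtering and projection happen as rows stream through rather than after full materialization; only the final Sort must buffer.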

Label and Property Indexes

Indexes in a graph database serve as entry points for pattern matching. Without an index, the engine must scan all nodes with the given label. B-tree indexes are used for range queries; hash indexes for exact-match lookups. Composite indexes cover multi-property predicates.

Index maintenance: indexes are updated synchronously on write. For high-write workloads, bulk-loading with deferred index build is preferred.
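A sorted-array sketch of such an entry-point index (standing in for a B-tree; `bisect` provides the O(log N) seek, and `insert` illustrates synchronous maintenance on write):

```python
import bisect
from typing import Any

class PropertyIndex:
    """Sorted (value, node_id) pairs: exact seek and range scan in O(log N)."""

    def __init__(self) -> None:
        self._entries: list[tuple[Any, int]] = []

    def insert(self, value: Any, node_id: int) -> None:
        # Synchronous maintenance: the index is updated as part of the write.
        bisect.insort(self._entries, (value, node_id))

    def seek(self, value: Any) -> list[int]:
        """Exact-match lookup: all node IDs with the given property value."""
        i = bisect.bisect_left(self._entries, (value,))
        out = []
        while i < len(self._entries) and self._entries[i][0] == value:
            out.append(self._entries[i][1])
            i += 1
        return out

    def range(self, low: Any, high: Any) -> list[int]:
        """Range scan: node IDs with low <= value < high."""
        lo = bisect.bisect_left(self._entries, (low,))
        hi = bisect.bisect_left(self._entries, (high,))
        return [nid for _, nid in self._entries[lo:hi]]
```

A hash index would make `seek` O(1) but could not serve `range`, which is why both index types coexist.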

Graph Algorithms

  • PageRank: iterative O(k*(V+E)); rank flows from high-connectivity nodes outward.
  • Louvain (community detection): modularity optimisation; ~O(V log V) in practice.
  • Dijkstra / A*: weighted shortest path; O((V+E) log V) with binary heap.
  • Betweenness centrality: O(V*E) — typically computed offline.
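The PageRank entry above can be sketched compactly (damping factor 0.85; dangling-node redistribution is omitted for brevity, so this is a simplified illustration of the O(k*(V+E)) iteration, not a production implementation):

```python
def pagerank(graph: dict[int, list[int]], d: float = 0.85,
             tol: float = 1e-6, max_iters: int = 50) -> dict[int, float]:
    """Iterative PageRank: each pass is O(V + E); stops once ranks converge."""
    nodes = set(graph) | {n for nbrs in graph.values() for n in nbrs}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iters):
        new = {v: (1 - d) / n for v in nodes}      # teleport term
        for src, nbrs in graph.items():
            if nbrs:
                share = d * rank[src] / len(nbrs)  # rank split across out-edges
                for dst in nbrs:
                    new[dst] += share
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:                            # converged below tolerance
            break
    return rank
```

As noted above, results like these are typically computed offline and written back as node properties, so real-time queries read a precomputed score instead of re-running the iteration.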

SQL DDL: Relational Analog

-- Nodes
CREATE TABLE GNode (
    id          BIGSERIAL PRIMARY KEY,
    labels      JSONB         NOT NULL DEFAULT '[]',
    properties  JSONB         NOT NULL DEFAULT '{}'
);

CREATE INDEX idx_gnode_labels ON GNode USING GIN (labels);
CREATE INDEX idx_gnode_properties ON GNode USING GIN (properties);

-- Edges (directed relationships)
CREATE TABLE GEdge (
    id          BIGSERIAL PRIMARY KEY,
    src_id      BIGINT        NOT NULL REFERENCES GNode(id),
    dst_id      BIGINT        NOT NULL REFERENCES GNode(id),
    rel_type    VARCHAR(128)  NOT NULL,
    properties  JSONB         NOT NULL DEFAULT '{}'
);

CREATE INDEX idx_gedge_src ON GEdge (src_id, rel_type);
CREATE INDEX idx_gedge_dst ON GEdge (dst_id, rel_type);

-- Property index for fast entry-point lookup
CREATE TABLE GIndex (
    label         VARCHAR(128)  NOT NULL,
    property_name VARCHAR(128)  NOT NULL,
    node_id       BIGINT        NOT NULL REFERENCES GNode(id),
    PRIMARY KEY (label, property_name, node_id)
);
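For contrast with index-free adjacency, each hop over this relational analog is an index seek on the edge table. A runnable sketch using SQLite (types simplified; the table, column, and index names follow the DDL above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE GNode (
        id          INTEGER PRIMARY KEY,
        properties  TEXT NOT NULL DEFAULT '{}'
    );
    CREATE TABLE GEdge (
        id       INTEGER PRIMARY KEY,
        src_id   INTEGER NOT NULL REFERENCES GNode(id),
        dst_id   INTEGER NOT NULL REFERENCES GNode(id),
        rel_type TEXT NOT NULL
    );
    CREATE INDEX idx_gedge_src ON GEdge (src_id, rel_type);
""")
conn.executemany("INSERT INTO GNode (id) VALUES (?)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO GEdge (src_id, dst_id, rel_type) VALUES (?, ?, ?)",
    [(1, 2, "FOLLOWS"), (1, 3, "BLOCKS")],
)

def neighbors(conn: sqlite3.Connection, src_id: int, rel_type: str) -> list[int]:
    """One hop = one index seek on (src_id, rel_type): O(log E) per hop,
    growing with total edge count -- unlike a direct-pointer walk."""
    rows = conn.execute(
        "SELECT dst_id FROM GEdge WHERE src_id = ? AND rel_type = ?",
        (src_id, rel_type),
    )
    return [r[0] for r in rows]
```

A k-hop query repeats this seek k times (or nests k joins), which is exactly the per-hop index cost that native storage avoids.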

Python: Core Operations

from neo4j import GraphDatabase
from typing import Any

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def create_node(labels: list[str], properties: dict[str, Any]) -> int:
    """Create a node and return its internal ID."""
    label_str = ":".join(labels)
    query = f"CREATE (n:{label_str} $props) RETURN id(n) AS node_id"
    with driver.session() as session:
        result = session.run(query, props=properties)
        return result.single()["node_id"]

def create_edge(src: int, dst: int, rel_type: str, properties: dict[str, Any]) -> None:
    """Create a directed relationship between two nodes by internal ID."""
    query = (
        "MATCH (a) WHERE id(a) = $src "
        "MATCH (b) WHERE id(b) = $dst "
        f"CREATE (a)-[r:{rel_type} $props]->(b)"
    )
    with driver.session() as session:
        session.run(query, src=src, dst=dst, props=properties)

def match_pattern(user_id: int) -> list[dict]:
    """Return all users directly followed by the given user."""
    query = (
        "MATCH (a:User {id: $uid})-[:FOLLOWS]->(b:User) "
        "RETURN b.id AS id, b.name AS name ORDER BY b.name"
    )
    with driver.session() as session:
        result = session.run(query, uid=user_id)
        return [dict(record) for record in result]

def run_page_rank(graph_name: str = "myGraph") -> list[dict]:
    """Run PageRank via GDS and return top-ranked nodes."""
    query = (
        "CALL gds.pageRank.stream($graph) "
        "YIELD nodeId, score "
        "RETURN gds.util.asNode(nodeId).name AS name, score "
        "ORDER BY score DESC LIMIT 10"
    )
    with driver.session() as session:
        result = session.run(query, graph=graph_name)
        return [dict(record) for record in result]

Design Considerations Summary

  • Index-free adjacency: core differentiator; local traversal cost is O(degree), not O(log N).
  • Entry-point indexes: always index the property used in WHERE to avoid full label scans.
  • Cypher compilation: filter pushdown is critical; ensure predicate order matches traversal direction.
  • Graph algorithms: run offline for large graphs; store results as node properties for real-time reads.
  • Graph partitioning: distributing a graph across machines is hard due to high inter-partition edge traffic; prefer native graph databases that fit the working set in memory.

