Graph Database Low-Level Design: Native Graph Storage, Traversal Engine, and Cypher Query Execution

A graph database stores entities as nodes and relationships as edges, with properties on both. Unlike relational databases that simulate graphs via foreign key joins, native graph databases use index-free adjacency — direct physical pointers from each node to its adjacent relationships — enabling constant-time local neighbor access that does not degrade as the graph grows.

Native Graph Storage Model

In a native graph store, each record type is stored in its own file or store:

  • Node store: fixed-size records, each containing a pointer to the first relationship record and a pointer to the first property record.
  • Relationship store: each record contains src_node pointer, dst_node pointer, relationship type ID, pointers to the next/previous relationship for both the source and destination node (doubly-linked list per node), and a pointer to properties.
  • Property store: key-value pairs packed into property records, with overflow to a dynamic store for large values.
  • Label and type stores: token stores mapping label/type names to integer IDs for compact in-record storage.

Index-free adjacency means traversal follows physical pointers — when you expand a node's neighbors, the engine reads the first relationship record directly via the stored pointer and then follows the doubly-linked list without any index lookup. This makes the cost of a single hop O(degree) rather than O(log N) for a global index lookup.

Relationship Record Structure

RelationshipRecord {
    id:                  uint64
    src_node_id:         uint64
    dst_node_id:         uint64
    relationship_type:   uint32   // token ID
    src_prev_rel:        uint64   // doubly-linked list for src node
    src_next_rel:        uint64
    dst_prev_rel:        uint64   // doubly-linked list for dst node
    dst_next_rel:        uint64
    property_ptr:        uint64
    in_use:              bool
}

Traversal walks the linked list of relationships at a node — reading only the relationships of the node being expanded, not the full edge table.

Traversal Engine: DFS and BFS with Visitor Pattern

The traversal engine supports both depth-first and breadth-first traversal. A visitor pattern allows pluggable logic at each node and relationship during traversal:

  • DFS: lower memory footprint, suitable for path existence queries and deep hierarchies.
  • BFS: finds shortest paths, suitable for social distance queries (degrees of separation).
  • Bidirectional BFS: starts from both source and target simultaneously; meets in the middle, dramatically reducing explored nodes for shortest-path queries in dense social graphs.

Pruning is applied via relationship type filters (only follow FOLLOWS edges) and property filters evaluated during traversal rather than after.

Cypher Query Execution

A representative Cypher pattern:

MATCH (a:User)-[:FOLLOWS]->(b:User)
WHERE a.id = $id
RETURN b.name, b.email
ORDER BY b.name

Execution pipeline:

  1. Parse: tokenise and produce AST.
  2. Logical plan: NodeIndexSeek(User, id=$id) → Expand(FOLLOWS, OUTGOING) → Filter(b:User) → Project(b.name, b.email) → Sort.
  3. Optimise: push label filter into Expand; use property index for entry point.
  4. Physical plan: NodeIndexSeek uses the User.id B-tree index to find node a in O(log N), then Expand follows index-free adjacency pointers.

Label and Property Indexes

Indexes in a graph database serve as entry points for pattern matching. Without an index, the engine must scan all nodes with the given label. B-tree indexes are used for range queries; hash indexes for exact-match lookups. Composite indexes cover multi-property predicates.

Index maintenance: indexes are updated synchronously on write. For high-write workloads, bulk-loading with deferred index build is preferred.

Graph Algorithms

  • PageRank: iterative O(k*(V+E)); rank flows from high-connectivity nodes outward.
  • Louvain (community detection): modularity optimisation; ~O(V log V) in practice.
  • Dijkstra / A*: weighted shortest path; O((V+E) log V) with binary heap.
  • Betweenness centrality: O(V*E) — typically computed offline.

SQL DDL: Relational Analog

-- Nodes
CREATE TABLE GNode (
    id          BIGSERIAL PRIMARY KEY,
    labels      JSONB         NOT NULL DEFAULT '[]',
    properties  JSONB         NOT NULL DEFAULT '{}'
);

CREATE INDEX idx_gnode_labels ON GNode USING GIN (labels);
CREATE INDEX idx_gnode_properties ON GNode USING GIN (properties);

-- Edges (directed relationships)
CREATE TABLE GEdge (
    id          BIGSERIAL PRIMARY KEY,
    src_id      BIGINT        NOT NULL REFERENCES GNode(id),
    dst_id      BIGINT        NOT NULL REFERENCES GNode(id),
    rel_type    VARCHAR(128)  NOT NULL,
    properties  JSONB         NOT NULL DEFAULT '{}'
);

CREATE INDEX idx_gedge_src ON GEdge (src_id, rel_type);
CREATE INDEX idx_gedge_dst ON GEdge (dst_id, rel_type);

-- Property index for fast entry-point lookup
CREATE TABLE GIndex (
    label         VARCHAR(128)  NOT NULL,
    property_name VARCHAR(128)  NOT NULL,
    node_id       BIGINT        NOT NULL REFERENCES GNode(id),
    PRIMARY KEY (label, property_name, node_id)
);

Python: Core Operations

from neo4j import GraphDatabase
from typing import Any

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def create_node(labels: list[str], properties: dict[str, Any]) -> int:
    """Create a node and return its internal ID."""
    label_str = ":".join(labels)
    query = f"CREATE (n:{label_str} $props) RETURN id(n) AS node_id"
    with driver.session() as session:
        result = session.run(query, props=properties)
        return result.single()["node_id"]

def create_edge(src: int, dst: int, rel_type: str, properties: dict[str, Any]) -> None:
    """Create a directed relationship between two nodes by internal ID."""
    query = (
        "MATCH (a) WHERE id(a) = $src "
        "MATCH (b) WHERE id(b) = $dst "
        f"CREATE (a)-[r:{rel_type} $props]->(b)"
    )
    with driver.session() as session:
        session.run(query, src=src, dst=dst, props=properties)

def match_pattern(user_id: int) -> list[dict]:
    """Return all users directly followed by the given user."""
    query = (
        "MATCH (a:User {id: $uid})-[:FOLLOWS]->(b:User) "
        "RETURN b.id AS id, b.name AS name ORDER BY b.name"
    )
    with driver.session() as session:
        result = session.run(query, uid=user_id)
        return [dict(record) for record in result]

def run_page_rank(graph_name: str = "myGraph") -> list[dict]:
    """Run PageRank via GDS and return top-ranked nodes."""
    query = (
        "CALL gds.pageRank.stream($graph) "
        "YIELD nodeId, score "
        "RETURN gds.util.asNode(nodeId).name AS name, score "
        "ORDER BY score DESC LIMIT 10"
    )
    with driver.session() as session:
        result = session.run(query, graph=graph_name)
        return [dict(record) for record in result]

Design Considerations Summary

  • Index-free adjacency: core differentiator; local traversal cost is O(degree), not O(log N).
  • Entry-point indexes: always index the property used in WHERE to avoid full label scans.
  • Cypher compilation: filter pushdown is critical; ensure predicate order matches traversal direction.
  • Graph algorithms: run offline for large graphs; store results as node properties for real-time reads.
  • Graph partitioning: distributing a graph across machines is hard due to high inter-partition edge traffic; prefer native graph databases that fit the working set in memory.

See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering

See also: LinkedIn Interview Guide 2026: Social Graph Engineering, Feed Ranking, and Professional Network Scale

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture

Scroll to Top