Graph Database: Low-Level Design

⏱ 4 min read

A graph database stores data as nodes (entities) and edges (relationships), enabling efficient traversal of complex, highly connected data. Social networks (friends-of-friends), fraud ring detection (shared accounts and devices), recommendation engines (users who liked X also liked Y), and knowledge graphs are natural fits. Neo4j, Amazon Neptune, and TigerGraph are production graph databases. Understanding when and how to use graph databases is a valuable system design skill.

Graph Data Model

Property graph model (used by Neo4j): nodes have labels (types) and properties (key-value pairs). Edges have types and properties. Example social network: Node (label: User, properties: {user_id: 123, name: “Alice”, joined: 2020}). Edge (type: FOLLOWS, properties: {since: 2021}). Edge (type: LIKED, properties: {at: timestamp}). Queries: “find all users Alice follows who also follow Bob” — a two-hop traversal. RDF (Resource Description Framework): an alternative model used by Neptune and academic knowledge graphs. Everything is a (subject, predicate, object) triple. Flexible but less intuitive for developers. Property graphs are generally preferred for application development. Labeled Property Graph (LPG) schema: define node labels and edge types with optional property constraints. Not as rigid as relational schemas but provides enough structure for validation and query optimization.

Graph Traversal and Query Language

Cypher (Neo4j’s query language): a pattern-matching language for graph traversal. Find friends of Alice’s friends: MATCH (alice:User {name: “Alice”})-[:FOLLOWS]->(:User)-[:FOLLOWS]->(fof:User) WHERE NOT (alice)-[:FOLLOWS]->(fof) AND fof alice RETURN fof.name LIMIT 10. The pattern (a)-[:REL]->(b) matches nodes and relationships in the graph. Variable-length traversal: (a)-[:FOLLOWS*1..3]->(b) matches paths where Alice reaches b in 1 to 3 FOLLOWS hops — useful for “mutual connections” or “degrees of separation.” Gremlin: Apache TinkerPop’s graph traversal language, used by Neptune and JanusGraph. g.V().has(“name”,”Alice”).out(“FOLLOWS”).out(“FOLLOWS”).dedup().toList() — equivalent to the Cypher query above. BFS/DFS traversal: shortest path between two nodes uses breadth-first search. Neo4j has built-in shortestPath() and allShortestPaths() functions. APOC plugin: extended algorithms (PageRank, community detection, centrality) as stored procedures.

Storage and Indexing

Graph database storage is optimized for traversal, not sequential scans. Neo4j uses a native graph storage engine: nodes and relationships are stored in fixed-size records with direct pointers to adjacent relationships — following an edge is O(1) pointer dereference, not a join. Adjacency list per node: each node stores a doubly-linked list of its relationships. Traversal follows pointer chains — no index lookup needed for adjacent nodes. Property storage: properties are stored separately from node/relationship records; accessed when needed (lazy loading). Index types: B-tree index on node properties for entry-point queries (MATCH (u:User {email: “alice@example.com”}) uses an index on User.email to find the starting node). Full-text index for text search within properties. Relationship traversal complexity: O(degree) per node hop, where degree is the number of relationships. Supernode problem: a node with millions of relationships (a celebrity in a social network) makes traversals through it expensive. Mitigation: partition supernode relationships by type; limit traversal depth near supernodes.

Graph Algorithms for Fraud and Recommendations

Graph algorithms unlock patterns invisible in relational databases. Fraud ring detection: fraudsters share devices, IP addresses, phone numbers, and email domains across multiple accounts. In a graph: accounts are nodes; shared attributes are edges. A connected component of accounts sharing devices and IPs is a fraud ring. Algorithm: weakly connected components (WCC) identifies clusters. Run periodically or in real time on new account registrations. If a new account shares a device with 3 known-fraudulent accounts: high fraud risk score. PageRank for trust scoring: accounts connected to many trusted accounts have higher trust. Attackers create isolated or weakly-connected accounts. Recommendation: collaborative filtering via graph — “users connected to Alice via LIKES relationships who also LIKES items not yet liked by Alice.” This is a 2-hop traversal. Graph neural networks (GNNs) learn node embeddings from graph structure — used by Pinterest (PinSage) and LinkedIn for recommendations at billion-node scale. Community detection (Louvain algorithm): identify clusters of densely connected users — interest communities, organizational hierarchies.

When to Use a Graph Database vs. Relational

Graph databases excel at: multi-hop traversals (“friends of friends who bought X”), variable-depth queries (“ancestors up to 5 levels up”), and highly connected data where relationship patterns are the primary query concern. Relational databases struggle with: recursive queries (self-joins or CTEs for multi-hop), variable-depth traversals (expensive recursive CTEs), and queries where the answer depends on graph structure (shortest path, connected components). Use graph database when: (1) Relationships are first-class — as important to query as the entities themselves. (2) Traversal depth is variable or unknown at schema design time. (3) The application needs graph algorithms (PageRank, community detection, shortest path). Use relational when: (1) Data is primarily tabular with well-defined foreign key relationships. (2) Most queries are point lookups or aggregations, not traversals. (3) You need strong ACID transactions across multiple entities (graph databases vary in ACID support). Hybrid approach: store relational data in PostgreSQL; sync a graph projection to Neo4j for graph-specific queries. Maintain both stores via CDC (change data capture).