Low Level Design: Graph Analytics System

Vertex-Centric Computation

The Pregel model, introduced by Google in 2010, reimagines distributed graph computation around the vertex as the unit of parallelism. Each vertex owns a piece of local state (its current value, accumulated messages, active/inactive flag) and executes a user-defined compute function during each superstep. The compute function receives the messages sent to this vertex during the previous superstep, updates the vertex’s local state, and optionally sends new messages to neighboring vertices along outgoing edges. All vertices execute their compute functions in parallel within a superstep, and no vertex can observe the state of another vertex directly — all communication goes through the message-passing mechanism. After all vertices complete their compute functions, a global synchronization barrier separates supersteps: no superstep N+1 computation begins until all superstep N compute functions and message deliveries are complete. A vertex can vote to halt if it has no more work to do and is not expecting messages; the computation terminates when all vertices have voted to halt and the message queue is empty. This model maps naturally to distributed systems because each machine owns a partition of vertices and their associated messages, processes them locally, and exchanges messages over the network only for edges that cross partition boundaries.
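The superstep loop described above can be sketched as a minimal single-machine simulation. All names here (run_pregel, max_compute, the compute-function signature) are hypothetical, not from any real framework; the sketch illustrates the barrier between supersteps, message delivery, vote-to-halt, and reactivation by incoming messages, using max-value propagation as the example algorithm.

```python
def run_pregel(graph, state, compute, max_supersteps=50):
    """graph: {vid: [out-neighbors]}, state: {vid: value}.
    compute(...) returns True when the vertex votes to halt."""
    inbox = {v: [] for v in graph}
    active = set(graph)                      # every vertex is active in superstep 0
    superstep = 0
    while active and superstep < max_supersteps:
        outbox = {v: [] for v in graph}
        next_active = set()
        for v in active:                     # conceptually parallel within a superstep
            halted = compute(superstep, v, state, graph[v], inbox[v],
                             lambda dst, msg: outbox[dst].append(msg))
            if not halted:
                next_active.add(v)
        # Global barrier: all messages are delivered before superstep N+1 begins.
        inbox = outbox
        # An incoming message reactivates a halted vertex.
        active = next_active | {v for v, msgs in inbox.items() if msgs}
        superstep += 1
    return state

def max_compute(step, v, state, out_edges, msgs, send):
    """Each vertex converges to the maximum value in its connected component."""
    new = max([state[v]] + msgs)
    if step == 0 or new > state[v]:
        state[v] = new
        for n in out_edges:
            send(n, new)
        return False                         # still active
    return True                              # vote to halt
```

On the directed cycle 1→2→3→1 with initial values {1: 3, 2: 7, 3: 5}, every vertex converges to 7 and the computation terminates once all vertices have halted and no messages remain in flight.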

Graph Partitioning

How a graph is distributed across machines profoundly affects communication cost and load balance. Edge-cut partitioning assigns each vertex to exactly one machine and cuts edges that cross partition boundaries — those edges require network messages during computation. For graphs with a relatively uniform degree distribution (e.g., road networks), edge-cut partitioning works well because high-degree vertices are rare and inter-machine traffic is proportional to the cut size. However, for power-law graphs (web graphs, social networks), a small number of vertices have degree in the millions — celebrities, Google’s homepage, hub airports. Assigning a hub vertex to any single machine makes that machine a bottleneck: it must process millions of messages per superstep. Vertex-cut partitioning, used by PowerGraph and GraphX, addresses this by assigning edges to machines rather than vertices. A high-degree vertex is split into mirrors on multiple machines; each mirror holds a replica of the vertex’s state, processes its share of incident edges locally, and participates in a reduce step to synchronize state across mirrors after each superstep. The communication cost is proportional to the replication factor (number of mirrors per vertex) rather than degree, which dramatically reduces hot spots on power-law graphs at the cost of more complex state synchronization.
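A vertex-cut assignment can be sketched as follows: edges are placed on machines (here by simple hashing; PowerGraph uses smarter greedy heuristics), and each vertex gets a mirror on every machine holding at least one of its incident edges. Function names are illustrative, not from any real system.

```python
from collections import defaultdict

def vertex_cut(edges, num_machines):
    """edges: list of (src, dst) pairs.
    Returns (edges per machine, mirror set per vertex)."""
    placement = defaultdict(list)
    mirrors = defaultdict(set)
    for u, v in edges:
        m = hash((u, v)) % num_machines      # naive hash placement of the edge
        placement[m].append((u, v))
        mirrors[u].add(m)                    # both endpoints get a mirror on m
        mirrors[v].add(m)
    return placement, mirrors

def replication_factor(mirrors):
    """Average mirrors per vertex: proportional to synchronization cost."""
    return sum(len(ms) for ms in mirrors.values()) / len(mirrors)
```

For a star graph (one hub with many leaves), the hub's edges spread across machines and the hub's mirror count is bounded by the machine count rather than its degree, which is exactly the hot-spot mitigation the paragraph above describes.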

PageRank Implementation

PageRank is the canonical graph algorithm and a useful benchmark for any graph processing system. In the Pregel formulation, each vertex maintains its current rank as a float. In superstep 0, each vertex initializes its rank to 1/N where N is the total vertex count. In each subsequent superstep, a vertex computes its contribution as rank / out_degree and sends this value as a message to each outgoing neighbor. At the start of the next superstep, each vertex receives the contributions from its incoming neighbors, sums them, applies the damping formula rank = 0.15/N + 0.85 * sum_of_contributions, updates its local rank, and sends its new contribution to neighbors. The damping factor of 0.85 models the probability that a random web surfer follows an outgoing link; with the remaining probability, 0.15, the surfer teleports to a uniformly random page. Iteration continues until the maximum rank delta across all vertices falls below a convergence threshold, typically 0.001 or smaller. In a distributed system, the teleportation term 0.15/N requires knowing the global vertex count N, which is stored as a broadcast constant. Dangling nodes (vertices with out_degree zero) are handled by adding their rank contribution to a global accumulator and redistributing it evenly across all vertices each superstep, preventing rank from leaking out of the graph.
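The iteration above, including dangling-node redistribution, can be sketched as a plain synchronous loop over an adjacency dict (single machine; in Pregel the per-vertex body would be the compute function). The function name and signature are illustrative.

```python
def pagerank(graph, damping=0.85, tol=1e-3, max_iters=100):
    """graph: {vertex: [out-neighbors]}. Returns {vertex: rank}."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}           # superstep 0: uniform 1/N
    for _ in range(max_iters):
        # Dangling mass is redistributed evenly so total rank stays 1.
        dangling = sum(rank[v] for v in graph if not graph[v])
        contrib = {v: 0.0 for v in graph}
        for v, nbrs in graph.items():
            if nbrs:
                share = rank[v] / len(nbrs)      # rank / out_degree
                for u in nbrs:
                    contrib[u] += share
        new_rank = {v: (1 - damping) / n + damping * (contrib[v] + dangling / n)
                    for v in graph}
        delta = max(abs(new_rank[v] - rank[v]) for v in graph)
        rank = new_rank
        if delta < tol:                          # max-delta convergence test
            break
    return rank
```

Because the dangling mass is fed back in each iteration, the ranks continue to sum to 1, which is an easy invariant to assert in tests.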

Convergence Detection

Determining when an iterative graph algorithm has converged requires a global view that no single vertex possesses. Pregel provides aggregators for this purpose: user-defined reduce functions that combine values from all vertices into a single global result available at the start of the next superstep. For PageRank convergence, each vertex computes the absolute difference between its current rank and its previous rank and contributes this delta to a SumAggregator. After the global barrier, the master reads the aggregated sum of deltas. If the sum falls below the convergence threshold, the master signals all vertices to halt; otherwise, the next superstep begins. Aggregators can also implement other global operations: MaxAggregator tracks the largest value seen across all vertices (useful for diameter estimation), AndAggregator checks whether all vertices satisfy some predicate (useful for bipartite detection), and custom aggregators can implement histogram collection for monitoring algorithmic progress. The cost of aggregation is O(V/P) per machine where P is the number of partitions, plus O(P) for the master to reduce across machines — negligible compared to per-superstep computation cost for most algorithms.
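The aggregator pattern can be sketched with small classes in the spirit of Pregel's API; class and method names here are hypothetical stand-ins, and in a real system the per-machine partials would be reduced at the master after the barrier.

```python
class SumAggregator:
    """Vertices contribute values during a superstep; the master reads
    the reduced sum after the global barrier, then resets for the next one."""
    def __init__(self):
        self.value = 0.0
    def aggregate(self, x):              # vertex side, during the superstep
        self.value += x
    def read_and_reset(self):            # master side, after the barrier
        v, self.value = self.value, 0.0
        return v

class MaxAggregator:
    """Tracks the largest contributed value (e.g., for diameter estimation)."""
    def __init__(self):
        self.value = float("-inf")
    def aggregate(self, x):
        self.value = max(self.value, x)
    def read_and_reset(self):
        v, self.value = self.value, float("-inf")
        return v

def converged(agg, threshold=1e-3):
    """Master-side check: halt when the summed rank deltas fall below threshold."""
    return agg.read_and_reset() < threshold
```

For PageRank, each vertex would call `aggregate(abs(new_rank - old_rank))` during its compute function, and the master would call `converged` once per superstep.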

Community Detection

Community detection identifies groups of vertices that are more densely connected to each other than to the rest of the graph. The Label Propagation Algorithm (LPA) is one of the simplest and most parallelizable methods and maps cleanly onto the vertex-centric model. In superstep 0, each vertex initializes its label to its own vertex ID. In each subsequent superstep, a vertex collects the labels of all its neighbors (sent as messages in the previous superstep), selects the most frequently occurring label (breaking ties by choosing the smallest label ID), and updates its own label to that value. The vertex then broadcasts its new label to all neighbors. Vertices that did not change their label vote to halt. The algorithm converges when no vertex changes its label, meaning each vertex already holds the dominant label among its neighbors. At convergence, all vertices with the same label form a community. LPA runs in O(E) time per iteration and typically converges in a small number of iterations (5-10) for most real-world graphs, making it practical at billion-edge scale. Its weakness is non-determinism: the order in which ties are broken can produce different community assignments across runs, and some graphs have many near-equivalent solutions. For more stable results, modularity-maximizing algorithms like Louvain produce better quality communities at higher computational cost.
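A synchronous version of LPA can be sketched over an undirected adjacency dict; single machine, illustrative names. Each iteration plays the role of one superstep: labels are computed from the previous iteration's labels, mirroring message passing, and the loop stops when no label changes (all vertices have voted to halt).

```python
from collections import Counter

def label_propagation(adj, max_iters=20):
    """adj: {vertex: neighbors} (undirected). Returns {vertex: label}."""
    labels = {v: v for v in adj}                 # superstep 0: label = own ID
    for _ in range(max_iters):
        changed = False
        new_labels = {}
        for v, nbrs in adj.items():
            if not nbrs:
                new_labels[v] = labels[v]        # isolated vertex keeps its label
                continue
            counts = Counter(labels[n] for n in nbrs)
            # Pick the most frequent label; break ties by smallest label ID.
            best = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))[0]
            if best != labels[v]:
                changed = True
            new_labels[v] = best
        labels = new_labels
        if not changed:                          # convergence: no label changed
            break
    return labels
```

On two triangles joined by a single bridge edge, the sketch converges to two communities, one per triangle, in a handful of iterations. Note that synchronous LPA can oscillate on some graphs (e.g., bipartite structures), which is one facet of the non-determinism mentioned above; the max_iters cap guards against that.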

Shortest Path

Single-source shortest path (SSSP) from a designated source vertex to all other vertices can be implemented in Pregel as a direct application of Bellman-Ford relaxation. The source vertex initializes its distance to zero and all other vertices to infinity. In each superstep, each vertex that received a shorter distance estimate in the previous superstep propagates distance + edge_weight to each outgoing neighbor. A neighbor updates its distance only if the proposed value is smaller than its current estimate, then propagates the update. Vertices that did not improve their estimate vote to halt. Because Bellman-Ford requires up to V-1 supersteps in the worst case (a path of length V-1), this approach is expensive on large-diameter graphs. For single-pair shortest path, bidirectional BFS dramatically reduces the search space: run BFS simultaneously from both source and destination, terminating when the two frontiers meet. The meeting point gives the shortest path. For dense graphs or small-world networks where diameter is low (the familiar six degrees of separation), even the basic Pregel SSSP converges quickly. Edge weights in this formulation are assumed non-negative: termination via vote-to-halt relies on distance estimates only shrinking toward a fixed point. Bellman-Ford relaxation tolerates negative edge weights in principle, but shortest paths are undefined in the presence of negative cycles, which can only be detected, not solved — either by Bellman-Ford itself or by SPFA (Shortest Path Faster Algorithm), a queue-based variant of Bellman-Ford.
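The relax-and-propagate structure above can be sketched as a frontier loop, where the frontier plays the role of the active vertex set in each superstep (single machine, non-negative weights assumed; names are illustrative).

```python
import math

def sssp(graph, source):
    """graph: {vertex: [(neighbor, weight), ...]}. Returns {vertex: distance}."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    frontier = {source}                      # vertices improved last superstep
    while frontier:                          # empty frontier = all voted to halt
        next_frontier = set()
        for v in frontier:
            for u, w in graph[v]:
                if dist[v] + w < dist[u]:    # relax: keep only strict improvements
                    dist[u] = dist[v] + w
                    next_frontier.add(u)     # u propagates in the next superstep
        frontier = next_frontier
    return dist
```

On a long path graph the loop runs for roughly V supersteps, matching the worst-case bound above, while on low-diameter graphs it terminates in a few rounds.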

Graph Representation

The choice of in-memory graph representation has major performance implications for both traversal and analytics workloads. Adjacency list representation stores each vertex’s neighbors as a variable-length list, using a hash map or sorted array keyed by vertex ID. This is flexible and supports efficient single-vertex neighbor lookups, making it the standard choice for online graph databases like Neo4j. For batch analytics, the Compressed Sparse Row (CSR) format packs all neighbor lists into two flat arrays: an offsets array of length V+1 where offsets[v] gives the start index of vertex v’s neighbor list in the second array, and an adjacency array of length E storing all neighbor vertex IDs. CSR eliminates pointer indirection and achieves excellent cache locality during sequential BFS or superstep computation because neighbors of a vertex occupy contiguous memory. A property graph extends both models by attaching attribute maps to vertices and edges — vertex properties might include name, age, and join_date; edge properties might include weight, relationship_type, and created_at. In distributed systems, property storage is co-located with the vertex or edge it describes, partitioned by the same strategy as the graph structure, so property lookups during compute functions incur no additional network round trips.
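The CSR layout described above can be sketched with a degree count followed by a prefix sum; this is a pure-Python illustration (production code would typically use numpy arrays), and the function names are hypothetical.

```python
def build_csr(num_vertices, edges):
    """edges: list of (src, dst). Returns (offsets, adjacency) arrays."""
    degree = [0] * num_vertices
    for u, _ in edges:
        degree[u] += 1
    offsets = [0] * (num_vertices + 1)       # length V+1, as described above
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + degree[v]   # prefix sum of degrees
    adjacency = [0] * len(edges)             # length E
    cursor = offsets[:-1].copy()             # next write slot per vertex
    for u, v in edges:
        adjacency[cursor[u]] = v
        cursor[u] += 1
    return offsets, adjacency

def neighbors(offsets, adjacency, v):
    """Neighbors of v occupy one contiguous slice: no pointer chasing."""
    return adjacency[offsets[v]:offsets[v + 1]]
```

The contiguous slice per vertex is what gives CSR its cache behavior during sequential BFS or superstep scans: visiting all neighbors of all vertices is a single linear pass over the adjacency array.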

Incremental Processing

Real-world graphs are not static: edges are added and removed continuously as users follow each other, transactions occur, and web pages link to new content. Rerunning a full batch analytics job (PageRank, community detection, SSSP) from scratch after each small update is wasteful when only a tiny fraction of the graph changed. Incremental graph processing, implemented most rigorously in the differential dataflow model (Frank McSherry's work, originating in the Naiad system and used in Materialize), tracks changes as deltas — additions and deletions of vertices and edges — and propagates only the consequences of those deltas through the computation graph. For PageRank, an edge addition between vertices u and v changes the rank contribution flowing from u to v and requires re-propagating rank changes through the subgraph reachable from u. The key insight of differential dataflow is that if you model the computation as a fixed-point iteration over a partially ordered version space, you can precisely identify which portions of the output change in response to an input delta and recompute only those portions. In practice, for low-connectivity graphs, an edge insertion affects a small local neighborhood and converges in a few supersteps. For hub vertices in power-law graphs, a single edge insertion can propagate rank changes globally, making incremental computation approach full recomputation cost — a fundamental limitation of incremental approaches on scale-free graphs.
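A much-simplified flavor of delta propagation (this is not differential dataflow, just a push-style perturbation spread) can be sketched as follows: a rank perturbation is seeded at one vertex and pushed along out-edges, damped at each hop, until residual deltas fall below a threshold. All names and the seeding policy are illustrative assumptions.

```python
from collections import deque

def propagate_delta(graph, rank, seed_vertex, seed_delta,
                    damping=0.85, tol=1e-6):
    """Apply a rank perturbation at seed_vertex and spread its damped
    effect through the out-neighborhood until deltas die out."""
    pending = {seed_vertex: seed_delta}       # un-applied delta per vertex
    queue = deque([seed_vertex])
    while queue:
        v = queue.popleft()
        d = pending.pop(v, 0.0)
        if abs(d) < tol:                      # residual too small to matter
            continue
        rank[v] += d
        out = graph.get(v, [])
        if not out:
            continue
        share = damping * d / len(out)        # each neighbor absorbs a damped share
        for u in out:
            if u not in pending:
                queue.append(u)
            pending[u] = pending.get(u, 0.0) + share
    return rank
```

On a chain graph the delta decays geometrically by the damping factor at each hop, so it dies out after O(log(seed/tol)) hops; seeded at a power-law hub, the same loop would touch a large fraction of the graph, which is the limitation the paragraph above describes.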

