How is degree distribution computed at social-graph scale using MapReduce?

In the Map phase each edge record emits (user_id, 1) for both endpoints. The Reduce phase sums counts per user to produce per-node degree. A second MapReduce job then uses degree as the key and counts how many nodes share that degree, yielding the degree distribution histogram. Combiners on each mapper node reduce shuffle volume significantly for power-law graphs.

How does the Louvain algorithm detect communities in a large social graph?

Louvain is a greedy modularity-optimisation algorithm. Phase 1 assigns each node to its own community and iteratively moves nodes to neighbouring communities when doing so increases modularity. Phase 2 collapses each community into a super-node and repeats. The two phases alternate until modularity stops improving. For distributed execution the graph is partitioned and local Louvain runs are merged with cross-partition edge reconciliation passes.

What is personalized PageRank and how is it used in social graph analytics?

Personalized PageRank (PPR) modifies the random-walk teleportation step to always return to a specific source node rather than a uniform distribution. The resulting stationary distribution scores all other nodes by their relevance to that source, making it useful for feed ranking, friend recommendation, and influence measurement. In practice PPR vectors are approximated with Monte Carlo random walks or the push-flow algorithm to avoid full matrix inversion on billion-node graphs.

How do you design scalable batch computation for social graph analytics?

Scalable batch analytics separates storage (edge lists in columnar format on object storage) from compute (Spark or Pregel-style BSP engine). Incremental graph snapshots with delta encoding avoid reprocessing the full graph daily. Precomputed results — degree, community labels, PPR vectors — are written to a low-latency key-value store for online serving. Job scheduling uses dependency graphs so downstream analytics wait for upstream graph exports to complete.

Social Graph Analytics Low-Level Design: Degree Distribution, Cluster Detection, and Influence Scoring

⏱ 5 min read

Overview

Social graph analytics computes structural metrics over large-scale user graphs including degree distribution, community detection, and influence scoring. These metrics power feed ranking, content amplification decisions, spam detection, and growth insights reported to product teams.

Requirements

Functional Requirements

Compute and serve degree distribution histograms (in-degree and out-degree) for the full graph.
Detect communities or clusters and assign each user a community ID.
Score user influence using a PageRank-variant algorithm updated daily.
Expose per-user analytics via a low-latency read API.
Support time-windowed analytics comparing graph snapshots across days or weeks.

Non-Functional Requirements

Full graph analytics pipeline completes within a six-hour window on a graph of ten billion edges.
Per-user metric read latency under 5 ms at p99.
Pipeline is fault-tolerant with checkpoint-based recovery on failure.

Data Model

The analytics store maintains a metrics table with one row per user containing: user ID (partition key), in-degree, out-degree, community ID, influence score, PageRank score, last updated timestamp, and a JSON blob of auxiliary metrics. This table is backed by a key-value store optimized for point reads.

Degree distribution data is stored as aggregated histograms in a time-series table keyed by (metric_name, bucket_boundary, snapshot_date). Each snapshot represents a full graph computation result, enabling trend analysis over time.

Community membership is stored as a separate table mapping (community_id, user_id) for reverse lookups, allowing enumeration of all members of a given community.

Core Algorithms

Degree Distribution Computation

Degree distribution is computed via a single-pass MapReduce job over the edge list. The map phase emits (user_id, 1) for each edge, and the reduce phase sums counts per user to produce in-degree and out-degree values. A second pass aggregates these values into histogram buckets using logarithmic binning (bucket boundaries at powers of two) to handle the heavy-tailed distribution typical of social graphs where most users have low degree and a small number have millions of followers.

Community Detection

The pipeline implements the Louvain algorithm adapted for distributed execution:

Phase 1 (local optimization): each node evaluates the modularity gain of moving to each neighbor community and moves to the highest-gain community. This phase runs iteratively until no node moves.
Phase 2 (aggregation): communities found in Phase 1 are collapsed into super-nodes. Edge weights between super-nodes are summed. The algorithm then repeats Phase 1 on the aggregated graph.
The distributed implementation partitions nodes across workers. Border nodes that span partition boundaries are handled via message passing between workers to synchronize community assignments.

The algorithm converges in logarithmically many passes over the graph and scales to tens of billions of edges on a cluster of hundreds of workers.

Influence Scoring and PageRank

Influence scores are computed using a personalized PageRank variant with damping factor 0.85. The iterative update rule is:

Initialize each node score to 1/N where N is total node count.
At each iteration, each node distributes its score equally to all out-neighbors.
Each node receives the sum of incoming score contributions multiplied by the damping factor, plus (1 – damping) / N as the teleportation term.
Iteration continues until the L1 norm of score changes falls below a convergence threshold of 1e-6.

The distributed implementation uses Pregel-style vertex-centric computation. Each worker holds a partition of the graph and passes messages to neighboring partitions over the network. Convergence typically requires 30 to 50 iterations on real social graphs.

API Design

GetUserMetrics(user_id) — returns degree, community ID, influence score, and PageRank for a single user.
GetDegreeDistribution(direction, snapshot_date) — returns histogram buckets for in-degree or out-degree.
GetCommunityMembers(community_id, page_token, limit) — returns paginated member list for a community.
GetInfluenceLeaderboard(limit, snapshot_date) — returns top-N users by influence score for a given snapshot.
CompareSnapshots(metric, date_a, date_b) — returns delta statistics between two graph snapshots.

Scalability

Pipeline Architecture

The analytics pipeline runs on a distributed graph processing framework. The edge list is ingested from the operational graph store via a daily export to distributed object storage. The pipeline reads from object storage to avoid impacting the operational store. Checkpoints are written after each major phase (degree computation, community detection, PageRank) so that failures restart from the last checkpoint rather than the beginning.

Serving Layer

After pipeline completion, computed metrics are bulk-loaded into the analytics key-value store using a sorted string table import process that bypasses the write path and directly loads files into the storage engine. This eliminates write amplification from individual put operations and reduces load time from hours to minutes for tables with billions of rows.

Monitoring

Pipeline health is tracked via job duration per phase, checkpoint success rate, convergence iteration count for PageRank, and number of communities detected (an anomalous count signals a data or algorithm issue). The read API tracks p50/p95/p99 latency and error rate per method. Alerts fire if the pipeline fails to complete within the six-hour SLA or if influence score distributions shift more than 10% day-over-day, which may indicate graph data corruption.