Social Graph Analytics Low-Level Design: Degree Distribution, Cluster Detection, and Influence Scoring

Overview

Social graph analytics computes structural metrics over large-scale user graphs including degree distribution, community detection, and influence scoring. These metrics power feed ranking, content amplification decisions, spam detection, and growth insights reported to product teams.

Requirements

Functional Requirements

  • Compute and serve degree distribution histograms (in-degree and out-degree) for the full graph.
  • Detect communities or clusters and assign each user a community ID.
  • Score user influence using a PageRank-variant algorithm updated daily.
  • Expose per-user analytics via a low-latency read API.
  • Support time-windowed analytics comparing graph snapshots across days or weeks.

Non-Functional Requirements

  • Full graph analytics pipeline completes within a six-hour window on a graph of ten billion edges.
  • Per-user metric read latency under 5 ms at p99.
  • Pipeline is fault-tolerant with checkpoint-based recovery on failure.

Data Model

The analytics store maintains a metrics table with one row per user containing: user ID (partition key), in-degree, out-degree, community ID, influence score, PageRank score, last updated timestamp, and a JSON blob of auxiliary metrics. This table is backed by a key-value store optimized for point reads.

Degree distribution data is stored as aggregated histograms in a time-series table keyed by (metric_name, bucket_boundary, snapshot_date). Each snapshot represents a full graph computation result, enabling trend analysis over time.

Community membership is stored as a separate table mapping (community_id, user_id) for reverse lookups, allowing enumeration of all members of a given community.

Core Algorithms

Degree Distribution Computation

Degree distribution is computed via a single-pass MapReduce job over the edge list. The map phase emits (user_id, 1) for each edge, and the reduce phase sums counts per user to produce in-degree and out-degree values. A second pass aggregates these values into histogram buckets using logarithmic binning (bucket boundaries at powers of two) to handle the heavy-tailed distribution typical of social graphs where most users have low degree and a small number have millions of followers.

Community Detection

The pipeline implements the Louvain algorithm adapted for distributed execution:

  • Phase 1 (local optimization): each node evaluates the modularity gain of moving to each neighbor community and moves to the highest-gain community. This phase runs iteratively until no node moves.
  • Phase 2 (aggregation): communities found in Phase 1 are collapsed into super-nodes. Edge weights between super-nodes are summed. The algorithm then repeats Phase 1 on the aggregated graph.
  • The distributed implementation partitions nodes across workers. Border nodes that span partition boundaries are handled via message passing between workers to synchronize community assignments.

The algorithm converges in logarithmically many passes over the graph and scales to tens of billions of edges on a cluster of hundreds of workers.

Influence Scoring and PageRank

Influence scores are computed using a personalized PageRank variant with damping factor 0.85. The iterative update rule is:

  • Initialize each node score to 1/N where N is total node count.
  • At each iteration, each node distributes its score equally to all out-neighbors.
  • Each node receives the sum of incoming score contributions multiplied by the damping factor, plus (1 – damping) / N as the teleportation term.
  • Iteration continues until the L1 norm of score changes falls below a convergence threshold of 1e-6.

The distributed implementation uses Pregel-style vertex-centric computation. Each worker holds a partition of the graph and passes messages to neighboring partitions over the network. Convergence typically requires 30 to 50 iterations on real social graphs.

API Design

  • GetUserMetrics(user_id) — returns degree, community ID, influence score, and PageRank for a single user.
  • GetDegreeDistribution(direction, snapshot_date) — returns histogram buckets for in-degree or out-degree.
  • GetCommunityMembers(community_id, page_token, limit) — returns paginated member list for a community.
  • GetInfluenceLeaderboard(limit, snapshot_date) — returns top-N users by influence score for a given snapshot.
  • CompareSnapshots(metric, date_a, date_b) — returns delta statistics between two graph snapshots.

Scalability

Pipeline Architecture

The analytics pipeline runs on a distributed graph processing framework. The edge list is ingested from the operational graph store via a daily export to distributed object storage. The pipeline reads from object storage to avoid impacting the operational store. Checkpoints are written after each major phase (degree computation, community detection, PageRank) so that failures restart from the last checkpoint rather than the beginning.

Serving Layer

After pipeline completion, computed metrics are bulk-loaded into the analytics key-value store using a sorted string table import process that bypasses the write path and directly loads files into the storage engine. This eliminates write amplification from individual put operations and reduces load time from hours to minutes for tables with billions of rows.

Monitoring

Pipeline health is tracked via job duration per phase, checkpoint success rate, convergence iteration count for PageRank, and number of communities detected (an anomalous count signals a data or algorithm issue). The read API tracks p50/p95/p99 latency and error rate per method. Alerts fire if the pipeline fails to complete within the six-hour SLA or if influence score distributions shift more than 10% day-over-day, which may indicate graph data corruption.

See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture

Scroll to Top