System Design Interview: Social Graph and Friend Recommendations

Social Graph Requirements

A social graph stores relationships between users (friendships, follows, connections) and enables graph queries: friends of friends, mutual connections, shortest path between users, influence scoring, and “People You May Know” (PYMK) recommendations. LinkedIn has roughly 1 billion members with ~50 connections each, or about 50 billion adjacency-list entries (each friendship is stored once per endpoint, so roughly 25 billion unique edges). Facebook has ~3 billion users with ~200 friends each, or about 600 billion entries. The graph is the core data asset — recommendation quality directly drives engagement and revenue.
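These sizes can be sanity-checked with a back-of-envelope estimate. A rough sketch, assuming 16 bytes per adjacency entry (two 8-byte user IDs) and ignoring timestamps, replication, and index overhead:

```python
# Back-of-envelope storage estimate for a raw adjacency list.
# Assumption (illustrative): 16 bytes per entry = two 8-byte user IDs,
# ignoring timestamps, replication, and index overhead.

BYTES_PER_ENTRY = 16

def edge_storage_gb(users: int, avg_degree: int) -> float:
    """Raw adjacency storage in GB, one entry per (user, friend) pair."""
    return users * avg_degree * BYTES_PER_ENTRY / 1e9

print(edge_storage_gb(1_000_000_000, 50))   # LinkedIn-scale: 800.0 GB
print(edge_storage_gb(3_000_000_000, 200))  # Facebook-scale: 9600.0 GB
```

Even before indexes and replication, the raw data outgrows a single commodity machine, which motivates the sharding discussion below.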

Graph Storage: Adjacency List

The most practical storage for social graphs at scale is a sharded adjacency list in a relational or key-value database. Each user has a list of their connections.


-- Relational adjacency list (bidirectional friendship):
CREATE TABLE friendships (
    user_id    BIGINT NOT NULL,
    friend_id  BIGINT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (user_id, friend_id),
    CHECK (user_id < friend_id)  -- store each edge once; query both directions
);
-- The primary key already indexes (user_id, friend_id), so only the
-- reverse direction needs a separate index:
CREATE INDEX idx_friendships_friend ON friendships(friend_id);

-- Directed follow (Twitter/Instagram style):
CREATE TABLE follows (
    follower_id  BIGINT NOT NULL,
    followee_id  BIGINT NOT NULL,
    created_at   TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (follower_id, followee_id)
);
-- Separate indexes for "who follows X" and "who does X follow"
CREATE INDEX idx_follows_followee ON follows(followee_id);

-- Queries:
-- Get all friends of user 42:
SELECT friend_id FROM friendships WHERE user_id = 42
UNION
SELECT user_id FROM friendships WHERE friend_id = 42;

-- Count mutual friends between user A and user B (expensive without caching):
SELECT COUNT(*) FROM (
    SELECT friend_id FROM friendships WHERE user_id = :a
    UNION SELECT user_id FROM friendships WHERE friend_id = :a
) AS a_friends
JOIN (
    SELECT friend_id FROM friendships WHERE user_id = :b
    UNION SELECT user_id FROM friendships WHERE friend_id = :b
) AS b_friends USING (friend_id);

Sharding Strategy

With 50 billion edges across hundreds of shards, the sharding strategy determines query efficiency. Two approaches:

  • User-based sharding: shard by user_id. All of user 42’s friendships are on one shard — efficient for “get my friends” queries. But “friends of friends” requires querying N shards (one per friend). LinkedIn uses this approach.
  • Edge-based sharding: shard by hash(min(u,v), max(u,v)). Mutual friends queries are more efficient but basic “get my friends” requires scatter-gather across shards. Less common for social graphs.

# User-based sharding: shard_id = user_id % NUM_SHARDS

def get_friends(user_id: int) -> list[int]:
    shard = db_shard(user_id % NUM_SHARDS)
    return shard.query("SELECT friend_id FROM friendships WHERE user_id = ?", user_id)

def get_friends_of_friends(user_id: int) -> set[int]:
    friends = set(get_friends(user_id))  # set, so we can subtract below
    # Must query up to N shards (one per friend's shard)
    fof = set()
    for friend_id in friends:
        fof.update(get_friends(friend_id))
    return fof - friends - {user_id}  # exclude existing friends and self

Graph Databases

Relational databases struggle with multi-hop graph queries (friends of friends of friends). Graph databases store adjacency natively and use pointer-based traversal instead of join-based traversal:

  • Neo4j: property graph model. Nodes and edges both have properties. Cypher query language. Used by eBay, Airbnb, and Walmart for recommendations. Excellent for rich, property-heavy graphs.
  • Amazon Neptune: managed graph database supporting both Gremlin and SPARQL. Used by Lyft, Samsung. Better for cloud-native deployments.
  • JanusGraph: distributed graph database backed by Cassandra or HBase. Used by IBM, LinkedIn (partially). Scales horizontally but operationally complex.

// Neo4j Cypher: find all friends of friends within 2 hops
MATCH (user:User {id: 42})-[:FRIEND*1..2]-(fof:User)
WHERE NOT (user)-[:FRIEND]-(fof) AND fof.id <> 42
RETURN fof.id, fof.name, COUNT(*) AS mutual_connections
ORDER BY mutual_connections DESC
LIMIT 20;

// Index-free adjacency: traversal cost O(edges_traversed), not O(log N) per hop
// Compared to SQL: each hop is a self-join, paying an O(log N) index probe per row
// For 6-hop queries: graph DB → ms; relational → minutes or timeout

People You May Know (PYMK) Algorithm

PYMK recommends users you might want to connect with. The core signal is mutual friends — if you have 10 mutual friends with someone, you likely know them in real life. But at scale, computing mutual friends for all non-connected user pairs (billions × billions) is infeasible. Approximate algorithms:


# Step 1: Candidate generation (coarse filter)
# For user U, collect all friends-of-friends (2-hop neighbors not already connected)
# This produces O(friends^2) candidates — for 500 friends, ~250,000 candidates

def get_pymk_candidates(user_id: int) -> dict[int, int]:
    """Returns {candidate_id: mutual_count}"""
    friends = set(get_friends(user_id))
    mutual_counts = {}

    for friend_id in friends:
        for fof_id in get_friends(friend_id):
            if fof_id not in friends and fof_id != user_id:
                mutual_counts[fof_id] = mutual_counts.get(fof_id, 0) + 1

    return mutual_counts

# Step 2: Scoring (fine filter with ranking model)
# Features beyond mutual friends:
# - Same employer (LinkedIn profile data)
# - Same school (profile data)
# - Shared interests/groups
# - Viewing their profile recently
# - Second-degree connection (friend of friend vs stranger)
# - Geographic proximity

# Step 3: Ranking with ML model
# LinkedIn uses a gradient boosted tree + neural network to rank candidates
# trained on historical connection acceptance rates

# Step 4: Pre-computation
# Running PYMK for every user on every page load is too slow
# Pre-compute top-50 recommendations for each user offline (daily/hourly)
# Store in key-value store: user_id → [recommended_user_ids]
# Update when social graph changes significantly (new connections)
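The scoring and ranking steps above can be sketched as follows. The linear weights are hypothetical stand-ins for the trained model mentioned in step 3; a production system would learn them from historical acceptance data:

```python
# Minimal PYMK scoring sketch. Weights are hypothetical, hand-tuned
# stand-ins for a trained ranking model.

def score_candidate(mutual_count: int, same_employer: bool,
                    same_school: bool, viewed_profile: bool) -> float:
    """Linear score over a few of the features listed in step 2."""
    return (1.0 * mutual_count
            + 3.0 * same_employer
            + 2.0 * same_school
            + 4.0 * viewed_profile)

def rank_candidates(features: dict[int, tuple[int, bool, bool, bool]],
                    top_n: int = 50) -> list[int]:
    """Top-N candidate IDs by score, suitable for offline pre-computation."""
    return sorted(features, key=lambda c: score_candidate(*features[c]),
                  reverse=True)[:top_n]
```

The ranked IDs are what step 4 would materialize into the key-value store.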

MinHash for Approximate Mutual Friend Counting

Computing exact mutual friends between all user pairs requires O(N²) intersection operations. MinHash enables approximate similarity estimation in O(N × k) where k is the number of hash functions.


import hashlib
import numpy as np

def minhash_signature(friend_set: set[int], num_hashes: int = 128) -> list[int]:
    """Compute MinHash signature for a set of friend IDs."""
    sig = [2**128] * num_hashes  # sentinel above any MD5 value (< 2^128)
    for friend_id in friend_set:
        for i in range(num_hashes):
            # A different hash function per i, derived by seeding with i
            h = int(hashlib.md5(f"{i}:{friend_id}".encode()).hexdigest(), 16)
            sig[i] = min(sig[i], h)
    return sig

def jaccard_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Estimate Jaccard similarity (mutual friends / total friends)."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

# LinkedIn computes MinHash signatures offline for all users
# Locality-Sensitive Hashing (LSH) finds similar signatures in O(N) rather than O(N^2)
# Users with similar friend sets are candidates for PYMK
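The banding trick behind LSH can be sketched as follows; the band and row counts are illustrative tuning parameters (more bands catches lower-similarity pairs at the cost of more false positives):

```python
from collections import defaultdict

def lsh_buckets(signatures: dict[int, list[int]],
                bands: int = 32, rows: int = 4) -> dict:
    """Hash each signature band-by-band into buckets.

    Splits each 128-value signature into `bands` bands of `rows` values;
    users that collide in at least one band become candidate pairs, so we
    never compare all N^2 signature pairs directly.
    """
    buckets = defaultdict(set)
    for user_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(user_id)
    return buckets

def candidate_pairs(buckets: dict) -> set[tuple[int, int]]:
    """All user pairs that shared at least one bucket."""
    pairs = set()
    for users in buckets.values():
        ordered = sorted(users)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

Only the pairs surfaced here need an exact (or MinHash-estimated) mutual-friend count, shrinking the comparison set from all pairs to near-duplicates.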

Feed Generation from the Social Graph

News feed (showing posts from connections) requires traversing the social graph on every feed load. Two architectures:

  • Fan-out on write (push model): when user A posts, push the post ID to all A’s followers’ feed lists (Redis sorted sets). Feed read is O(1). Write is O(followers) — problematic for influencers with 10M followers.
  • Fan-out on read (pull model): when user B loads their feed, query all of B’s followees for recent posts and merge-sort by timestamp. Reads are expensive (N queries); writes are O(1). Works well for users who follow only a few accounts.
  • Hybrid: push for normal users (< 10K followers); pull for celebrities. Facebook and Twitter use this approach.
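A minimal in-memory sketch of the hybrid approach, with plain dicts standing in for Redis and an illustrative celebrity threshold:

```python
import heapq

CELEBRITY_THRESHOLD = 10_000  # illustrative push-vs-pull cutoff

followers: dict[int, list[int]] = {}          # author_id -> follower ids
posts: dict[int, list[tuple[int, int]]] = {}  # author_id -> [(ts, post_id)]
feeds: dict[int, list[tuple[int, int]]] = {}  # user_id -> pushed (ts, post_id)

def publish(author_id: int, post_id: int, ts: int) -> None:
    """Store the post; fan out on write only for non-celebrity authors."""
    posts.setdefault(author_id, []).append((ts, post_id))
    if len(followers.get(author_id, [])) < CELEBRITY_THRESHOLD:
        for f in followers.get(author_id, []):
            feeds.setdefault(f, []).append((ts, post_id))

def read_feed(user_id: int, followees: list[int], limit: int = 20) -> list[int]:
    """Merge pushed entries with posts pulled from celebrity followees."""
    pulled = []
    for fid in followees:
        if len(followers.get(fid, [])) >= CELEBRITY_THRESHOLD:
            pulled.extend(posts.get(fid, []))
    merged = feeds.get(user_id, []) + pulled
    return [pid for ts, pid in heapq.nlargest(limit, merged)]
```

In production the feed lists would live in Redis sorted sets keyed by timestamp, but the control flow (push below the threshold, pull above it, merge at read time) is the same.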

Interview Questions

  • Design LinkedIn’s “People You May Know” feature — how do you generate recommendations for 1B users?
  • How do you store and query a directed social graph with 10 billion edges?
  • Design the friend suggestion algorithm — what signals matter beyond mutual connections?
  • How do you compute the shortest path between two users in a graph of 3 billion people?
  • How does sharding affect multi-hop graph queries and how do you mitigate it?

Frequently Asked Questions

How do you store and query a social graph with 50 billion edges?

A social graph at LinkedIn or Facebook scale (50-600 billion edges) requires distributed storage and specialized query patterns. The standard approach is a sharded adjacency list: store each user's friend list as a row or set of rows in a distributed database, sharded by user_id. Each shard holds the complete friend list for the users assigned to it — a "get my friends" query hits exactly one shard. Facebook uses TAO (The Associations and Objects), a custom distributed key-value store optimized for social graph reads. TAO stores objects (users, posts) and associations (friendships, likes) separately. Associations have a type (friend, like, comment), a timestamp, and data payload. TAO caches aggressively in a two-tier cache: per-region L1 cache (thin Memcached layer) and a cross-region L2 cache, with the MySQL database as the source of truth. Most queries hit L1 (sub-millisecond). LinkedIn uses Voldemort (their distributed key-value store) for the social graph with user_id as the partition key — all of a member's connections are on one partition, enabling O(1) lookups. For multi-hop queries (friends of friends), LinkedIn's Expander system pre-computes 2-hop neighborhoods and materializes them in a secondary store. The fundamental tension: user-based sharding (efficient for "my friends") vs multi-hop queries that must scatter-gather across many shards. Production systems accept this inefficiency for multi-hop and pre-compute results for PYMK offline.

How does the "People You May Know" algorithm work at scale?

People You May Know (PYMK) recommends users to connect with based on shared signals. The algorithm runs in three phases: candidate generation, feature extraction, and ranking. Candidate generation identifies non-connected users with high potential. The primary signal is mutual friends — two users with 20 mutual connections almost certainly know each other. For user U, PYMK collects all 2-hop neighbors (friends of friends not already connected to U). For a user with 500 friends, this generates up to 500×500 = 250,000 candidates. Additional candidate sources: users who appear in the same email contact list (imported during signup), users who work at the same company (profile data), users who attended the same school (profile data), users who have viewed U's profile recently, and users in the same geographic area. Feature extraction computes a feature vector for each (user, candidate) pair: mutual friend count, mutual group count, shared employer, shared school, shared connections in common, recency of candidate's connections to U's friends, and whether the candidate has already sent a connection request. Ranking uses a gradient-boosted tree or neural network trained on historical connection acceptance rates. The model predicts P(acceptance | features). The top-N candidates by predicted acceptance are shown. Pre-computation: PYMK cannot run in real time for 1 billion users. LinkedIn runs batch pre-computation daily (or multiple times per day) using Spark, materializing the top-25 recommendations per user in a key-value store. The recommendations are refreshed when the social graph changes significantly for a user.

When should you use a graph database vs a relational database for social graph queries?

Graph databases and relational databases have fundamentally different cost models for graph queries. In a relational database, a JOIN is a set operation — scanning one index and probing another. For a 2-hop query (friends of friends), you need two JOINs, each with O(N log N) cost where N is the table size. For 3+ hops, the cost grows polynomially and most RDBMS optimizers give up or time out. In a graph database (Neo4j, Amazon Neptune, JanusGraph), each node stores direct pointers to its adjacent nodes. Traversal follows pointers — O(1) per hop, independent of total graph size. This is called index-free adjacency. A 6-hop traversal in Neo4j takes milliseconds; the same query in PostgreSQL would take minutes or time out. When to choose relational: the social graph is an auxiliary feature (most queries are attribute-based, not graph-traversal); the graph is sparse and small (< 10M edges); you need ACID transactions across graph and non-graph data; your team has strong SQL expertise. When to choose graph database: graph traversal (multi-hop paths, communities, subgraph matching) is a primary use case; you need to find the shortest path between two users; recommendation algorithms require real-time 3+ hop traversals; the graph is dense and large. Most social platforms actually use both: a relational database (or distributed K/V store) for the raw adjacency list (friend lookups), and periodic offline computation (Spark, Pregel/GraphX) for complex graph algorithms (PageRank, community detection, PYMK). Real-time graph traversal beyond 2 hops is rare in production social graph systems.

