System Design Interview: Social Graph and Friend Recommendations

Social Graph Requirements

A social graph stores relationships between users (friendships, follows, connections) and enables graph queries: friends of friends, mutual connections, shortest path between users, influence scoring, and “People You May Know” (PYMK) recommendations. LinkedIn has roughly 1 billion members with ~50 connections each, or about 50 billion adjacency-list entries (each friendship is stored once per endpoint, so roughly 25 billion unique edges). Facebook has ~3 billion users with ~200 friends each, or about 600 billion entries. The graph is the core data asset — recommendation quality directly drives engagement and revenue.
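These sizes can be sanity-checked with a back-of-envelope estimate. A rough sketch, assuming 16 bytes per adjacency entry (two 8-byte user IDs) and ignoring timestamps, replication, and index overhead:

```python
# Back-of-envelope storage estimate for a raw adjacency list.
# Assumption (illustrative): 16 bytes per entry = two 8-byte user IDs,
# ignoring timestamps, replication, and index overhead.

BYTES_PER_ENTRY = 16

def edge_storage_gb(users: int, avg_degree: int) -> float:
    """Raw adjacency storage in GB, one entry per (user, friend) pair."""
    return users * avg_degree * BYTES_PER_ENTRY / 1e9

print(edge_storage_gb(1_000_000_000, 50))   # LinkedIn-scale: 800.0 GB
print(edge_storage_gb(3_000_000_000, 200))  # Facebook-scale: 9600.0 GB
```

Even before indexes and replication, the raw data outgrows a single commodity machine, which motivates the sharding discussion below.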

Graph Storage: Adjacency List

The most practical storage for social graphs at scale is a sharded adjacency list in a relational or key-value database. Each user has a list of their connections.


-- Relational adjacency list (bidirectional friendship):
CREATE TABLE friendships (
    user_id    BIGINT NOT NULL,
    friend_id  BIGINT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (user_id, friend_id),
    CHECK (user_id < friend_id)  -- store each edge once; query both directions
);
-- The primary key already indexes (user_id, friend_id), so only the
-- reverse direction needs a separate index:
CREATE INDEX idx_friendships_friend ON friendships(friend_id);

-- Directed follow (Twitter/Instagram style):
CREATE TABLE follows (
    follower_id  BIGINT NOT NULL,
    followee_id  BIGINT NOT NULL,
    created_at   TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (follower_id, followee_id)
);
-- Separate indexes for "who follows X" and "who does X follow"
CREATE INDEX idx_follows_followee ON follows(followee_id);

-- Queries:
-- Get all friends of user 42:
SELECT friend_id FROM friendships WHERE user_id = 42
UNION
SELECT user_id FROM friendships WHERE friend_id = 42;

-- Count mutual friends between user A and user B (expensive without caching):
SELECT COUNT(*) FROM (
    SELECT friend_id FROM friendships WHERE user_id = :a
    UNION SELECT user_id FROM friendships WHERE friend_id = :a
) AS a_friends
JOIN (
    SELECT friend_id FROM friendships WHERE user_id = :b
    UNION SELECT user_id FROM friendships WHERE friend_id = :b
) AS b_friends USING (friend_id);

Sharding Strategy

With 50 billion edges across hundreds of shards, the sharding strategy determines query efficiency. Two approaches:

  • User-based sharding: shard by user_id. All of user 42’s friendships are on one shard — efficient for “get my friends” queries. But “friends of friends” requires querying N shards (one per friend). LinkedIn uses this approach.
  • Edge-based sharding: shard by hash(min(u,v), max(u,v)). Mutual friends queries are more efficient but basic “get my friends” requires scatter-gather across shards. Less common for social graphs.

# User-based sharding: shard_id = user_id % NUM_SHARDS

def get_friends(user_id: int) -> list[int]:
    shard = db_shard(user_id % NUM_SHARDS)
    return shard.query("SELECT friend_id FROM friendships WHERE user_id = ?", user_id)

def get_friends_of_friends(user_id: int) -> set[int]:
    friends = set(get_friends(user_id))  # set, so we can subtract below
    # Must query up to N shards (one per friend's shard)
    fof = set()
    for friend_id in friends:
        fof.update(get_friends(friend_id))
    return fof - friends - {user_id}  # exclude existing friends and self

Graph Databases

Relational databases struggle with multi-hop graph queries (friends of friends of friends). Graph databases store adjacency natively and use pointer-based traversal instead of join-based traversal:

  • Neo4j: property graph model. Nodes and edges both have properties. Cypher query language. Used by eBay, Airbnb, and Walmart for recommendations. Excellent for rich, property-heavy graphs.
  • Amazon Neptune: managed graph database supporting both Gremlin and SPARQL. Used by Lyft, Samsung. Better for cloud-native deployments.
  • JanusGraph: distributed graph database backed by Cassandra or HBase. Used by IBM, LinkedIn (partially). Scales horizontally but operationally complex.

// Neo4j Cypher: find all friends of friends within 2 hops
MATCH (user:User {id: 42})-[:FRIEND*1..2]-(fof:User)
WHERE NOT (user)-[:FRIEND]-(fof) AND fof.id <> 42
RETURN fof.id, fof.name, COUNT(*) AS mutual_connections
ORDER BY mutual_connections DESC
LIMIT 20;

// Index-free adjacency: traversal cost O(edges_traversed), not O(log N) per hop
// Compared to SQL: each hop is a self-join, paying an O(log N) index probe per row
// For 6-hop queries: graph DB → ms; relational → minutes or timeout

People You May Know (PYMK) Algorithm

PYMK recommends users you might want to connect with. The core signal is mutual friends — if you have 10 mutual friends with someone, you likely know them in real life. But at scale, computing mutual friends for all non-connected user pairs (billions × billions) is infeasible. Approximate algorithms:


# Step 1: Candidate generation (coarse filter)
# For user U, collect all friends-of-friends (2-hop neighbors not already connected)
# This produces O(friends^2) candidates — for 500 friends, ~250,000 candidates

def get_pymk_candidates(user_id: int) -> dict[int, int]:
    """Returns {candidate_id: mutual_count}"""
    friends = set(get_friends(user_id))
    mutual_counts = {}

    for friend_id in friends:
        for fof_id in get_friends(friend_id):
            if fof_id not in friends and fof_id != user_id:
                mutual_counts[fof_id] = mutual_counts.get(fof_id, 0) + 1

    return mutual_counts

# Step 2: Scoring (fine filter with ranking model)
# Features beyond mutual friends:
# - Same employer (LinkedIn profile data)
# - Same school (profile data)
# - Shared interests/groups
# - Viewing their profile recently
# - Second-degree connection (friend of friend vs stranger)
# - Geographic proximity

# Step 3: Ranking with ML model
# LinkedIn uses a gradient boosted tree + neural network to rank candidates
# trained on historical connection acceptance rates

# Step 4: Pre-computation
# Running PYMK for every user on every page load is too slow
# Pre-compute top-50 recommendations for each user offline (daily/hourly)
# Store in key-value store: user_id → [recommended_user_ids]
# Update when social graph changes significantly (new connections)
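The scoring and ranking steps above can be sketched as follows. The linear weights are hypothetical stand-ins for the trained model mentioned in step 3; a production system would learn them from historical acceptance data:

```python
# Minimal PYMK scoring sketch. Weights are hypothetical, hand-tuned
# stand-ins for a trained ranking model.

def score_candidate(mutual_count: int, same_employer: bool,
                    same_school: bool, viewed_profile: bool) -> float:
    """Linear score over a few of the features listed in step 2."""
    return (1.0 * mutual_count
            + 3.0 * same_employer
            + 2.0 * same_school
            + 4.0 * viewed_profile)

def rank_candidates(features: dict[int, tuple[int, bool, bool, bool]],
                    top_n: int = 50) -> list[int]:
    """Top-N candidate IDs by score, suitable for offline pre-computation."""
    return sorted(features, key=lambda c: score_candidate(*features[c]),
                  reverse=True)[:top_n]
```

The ranked IDs are what step 4 would materialize into the key-value store.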

MinHash for Approximate Mutual Friend Counting

Computing exact mutual friends between all user pairs requires O(N²) intersection operations. MinHash enables approximate similarity estimation in O(N × k) where k is the number of hash functions.


import hashlib
import numpy as np

def minhash_signature(friend_set: set[int], num_hashes: int = 128) -> list[int]:
    """Compute MinHash signature for a set of friend IDs."""
    sig = [2**128] * num_hashes  # sentinel above any MD5 value (< 2^128)
    for friend_id in friend_set:
        for i in range(num_hashes):
            # A different hash function per i, derived by seeding with i
            h = int(hashlib.md5(f"{i}:{friend_id}".encode()).hexdigest(), 16)
            sig[i] = min(sig[i], h)
    return sig

def jaccard_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Estimate Jaccard similarity (mutual friends / total friends)."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

# LinkedIn computes MinHash signatures offline for all users
# Locality-Sensitive Hashing (LSH) finds similar signatures in O(N) rather than O(N^2)
# Users with similar friend sets are candidates for PYMK
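The banding trick behind LSH can be sketched as follows; the band and row counts are illustrative tuning parameters (more bands catches lower-similarity pairs at the cost of more false positives):

```python
from collections import defaultdict

def lsh_buckets(signatures: dict[int, list[int]],
                bands: int = 32, rows: int = 4) -> dict:
    """Hash each signature band-by-band into buckets.

    Splits each 128-value signature into `bands` bands of `rows` values;
    users that collide in at least one band become candidate pairs, so we
    never compare all N^2 signature pairs directly.
    """
    buckets = defaultdict(set)
    for user_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(user_id)
    return buckets

def candidate_pairs(buckets: dict) -> set[tuple[int, int]]:
    """All user pairs that shared at least one bucket."""
    pairs = set()
    for users in buckets.values():
        ordered = sorted(users)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

Only the pairs surfaced here need an exact (or MinHash-estimated) mutual-friend count, shrinking the comparison set from all pairs to near-duplicates.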

Feed Generation from the Social Graph

News feed (showing posts from connections) requires traversing the social graph on every feed load. Two architectures:

  • Fan-out on write (push model): when user A posts, push the post ID to all A’s followers’ feed lists (Redis sorted sets). Feed read is O(1). Write is O(followers) — problematic for influencers with 10M followers.
  • Fan-out on read (pull model): when user B loads their feed, query all of B’s followees for recent posts and merge-sort by timestamp. Reads are expensive (N queries); writes are O(1). Works well for users who follow only a few accounts.
  • Hybrid: push for normal users (< 10K followers); pull for celebrities. Facebook and Twitter use this approach.
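A minimal in-memory sketch of the hybrid approach, with plain dicts standing in for Redis and an illustrative celebrity threshold:

```python
import heapq

CELEBRITY_THRESHOLD = 10_000  # illustrative push-vs-pull cutoff

followers: dict[int, list[int]] = {}          # author_id -> follower ids
posts: dict[int, list[tuple[int, int]]] = {}  # author_id -> [(ts, post_id)]
feeds: dict[int, list[tuple[int, int]]] = {}  # user_id -> pushed (ts, post_id)

def publish(author_id: int, post_id: int, ts: int) -> None:
    """Store the post; fan out on write only for non-celebrity authors."""
    posts.setdefault(author_id, []).append((ts, post_id))
    if len(followers.get(author_id, [])) < CELEBRITY_THRESHOLD:
        for f in followers.get(author_id, []):
            feeds.setdefault(f, []).append((ts, post_id))

def read_feed(user_id: int, followees: list[int], limit: int = 20) -> list[int]:
    """Merge pushed entries with posts pulled from celebrity followees."""
    pulled = []
    for fid in followees:
        if len(followers.get(fid, [])) >= CELEBRITY_THRESHOLD:
            pulled.extend(posts.get(fid, []))
    merged = feeds.get(user_id, []) + pulled
    return [pid for ts, pid in heapq.nlargest(limit, merged)]
```

In production the feed lists would live in Redis sorted sets keyed by timestamp, but the control flow (push below the threshold, pull above it, merge at read time) is the same.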

Interview Questions

  • Design LinkedIn’s “People You May Know” feature — how do you generate recommendations for 1B users?
  • How do you store and query a directed social graph with 10 billion edges?
  • Design the friend suggestion algorithm — what signals matter beyond mutual connections?
  • How do you compute the shortest path between two users in a graph of 3 billion people?
  • How does sharding affect multi-hop graph queries and how do you mitigate it?

Frequently Asked Questions

How do you store and query a social graph with 50 billion edges?

A social graph at LinkedIn or Facebook scale (50-600 billion edges) requires distributed storage and specialized query patterns. The standard approach is a sharded adjacency list: store each user's friend list as a row or set of rows in a distributed database, sharded by user_id. Each shard holds the complete friend list for the users assigned to it — a "get my friends" query hits exactly one shard. Facebook uses TAO (The Associations and Objects), a custom distributed key-value store optimized for social graph reads. TAO stores objects (users, posts) and associations (friendships, likes) separately. Associations have a type (friend, like, comment), a timestamp, and data payload. TAO caches aggressively in a two-tier cache: per-region L1 cache (thin Memcached layer) and a cross-region L2 cache, with the MySQL database as the source of truth. Most queries hit L1 (sub-millisecond). LinkedIn uses Voldemort (their distributed key-value store) for the social graph with user_id as the partition key — all of a member's connections are on one partition, enabling O(1) lookups. For multi-hop queries (friends of friends), LinkedIn's Expander system pre-computes 2-hop neighborhoods and materializes them in a secondary store. The fundamental tension: user-based sharding (efficient for "my friends") vs multi-hop queries that must scatter-gather across many shards. Production systems accept this inefficiency for multi-hop and pre-compute results for PYMK offline.

How does the "People You May Know" algorithm work at scale?

People You May Know (PYMK) recommends users to connect with based on shared signals. The algorithm runs in three phases: candidate generation, feature extraction, and ranking. Candidate generation identifies non-connected users with high potential. The primary signal is mutual friends — two users with 20 mutual connections almost certainly know each other. For user U, PYMK collects all 2-hop neighbors (friends of friends not already connected to U). For a user with 500 friends, this generates up to 500×500 = 250,000 candidates. Additional candidate sources: users who appear in the same email contact list (imported during signup), users who work at the same company (profile data), users who attended the same school (profile data), users who have viewed U's profile recently, and users in the same geographic area. Feature extraction computes a feature vector for each (user, candidate) pair: mutual friend count, mutual group count, shared employer, shared school, shared connections in common, recency of candidate's connections to U's friends, and whether the candidate has already sent a connection request. Ranking uses a gradient-boosted tree or neural network trained on historical connection acceptance rates. The model predicts P(acceptance | features). The top-N candidates by predicted acceptance are shown. Pre-computation: PYMK cannot run in real time for 1 billion users. LinkedIn runs batch pre-computation daily (or multiple times per day) using Spark, materializing the top-25 recommendations per user in a key-value store. The recommendations are refreshed when the social graph changes significantly for a user.

When should you use a graph database vs a relational database for social graph queries?

Graph databases and relational databases have fundamentally different cost models for graph queries. In a relational database, a JOIN is a set operation — scanning one index and probing another. For a 2-hop query (friends of friends), you need two JOINs, each with O(N log N) cost where N is the table size. For 3+ hops, the cost grows polynomially and most RDBMS optimizers give up or time out. In a graph database (Neo4j, Amazon Neptune, JanusGraph), each node stores direct pointers to its adjacent nodes. Traversal follows pointers — O(1) per hop, independent of total graph size. This is called index-free adjacency. A 6-hop traversal in Neo4j takes milliseconds; the same query in PostgreSQL would take minutes or time out. When to choose relational: the social graph is an auxiliary feature (most queries are attribute-based, not graph-traversal); the graph is sparse and small (< 10M edges); you need ACID transactions across graph and non-graph data; your team has strong SQL expertise. When to choose graph database: graph traversal (multi-hop paths, communities, subgraph matching) is a primary use case; you need to find the shortest path between two users; recommendation algorithms require real-time 3+ hop traversals; the graph is dense and large. Most social platforms actually use both: a relational database (or distributed K/V store) for the raw adjacency list (friend lookups), and periodic offline computation (Spark, Pregel/GraphX) for complex graph algorithms (PageRank, community detection, PYMK). Real-time graph traversal beyond 2 hops is rare in production social graph systems.

