Social networks combine graph data structures, feed generation, search, and recommendation systems — all at massive scale. This question appears frequently at LinkedIn, Meta, Twitter/X, and Snap. The key insight is that the social graph is the product — every feature depends on efficient graph traversal and storage.
Functional Requirements
- User profiles (name, bio, work history, skills)
- Connection/follow graph (directed for Twitter-style, undirected for LinkedIn-style)
- News feed: posts from connections, ordered by relevance + recency
- Search: find users by name, company, skill
- People you may know (PYMK): connection recommendations
- Notifications: when someone connects, likes, comments
Non-Functional Requirements and Scale
| Metric | LinkedIn-Scale Target |
|---|---|
| Users | 1 billion (200M daily active) |
| Connections per user | Average 500, max 30,000+ |
| Posts per day | 5 million new posts |
| Profile views per day | 100M+ |
| Feed read QPS | ~500,000/s |
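A quick back-of-the-envelope pass over these targets (using the table's averages) shows why feed reads, not post writes, dominate the load:

```python
# Back-of-envelope estimation from the requirements table above.
SECONDS_PER_DAY = 86_400

posts_per_day = 5_000_000
avg_connections = 500        # average fan-out per post
feed_read_qps = 500_000      # from requirements

post_write_qps = posts_per_day / SECONDS_PER_DAY       # ~58 posts/s
fanout_write_qps = post_write_qps * avg_connections    # ~29,000 feed-cache writes/s

print(f"post writes:    {post_write_qps:,.0f}/s")
print(f"fan-out writes: {fanout_write_qps:,.0f}/s")
print(f"feed reads:     {feed_read_qps:,}/s")
```

Even with full fan-out on write, feed reads outnumber feed-cache writes by more than an order of magnitude, which is why the push model's O(1) reads pay off.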
Social Graph Storage
Approach 1: Relational Database
-- Adjacency list: simple, works for moderate scale
CREATE TABLE connections (
    user_id_1  BIGINT NOT NULL,
    user_id_2  BIGINT NOT NULL,
    status     ENUM('pending', 'connected') DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id_1, user_id_2)
);
-- Index both directions for undirected graph
CREATE INDEX idx_connections_reverse ON connections(user_id_2, user_id_1);
-- Get all connections for user 123
SELECT user_id_2 AS friend
FROM connections
WHERE user_id_1 = 123 AND status = 'connected'
UNION
SELECT user_id_1
FROM connections
WHERE user_id_2 = 123 AND status = 'connected';
Approach 2: Graph Database (LinkedIn uses this for PYMK)
Graph databases (Neo4j, Amazon Neptune) model nodes and edges natively and offer traversal-oriented query languages such as Cypher:
// Neo4j: Find 2nd-degree connections (friends of friends)
// Undirected pattern, since LinkedIn-style connections are mutual
MATCH (u:User {id: 123})-[:CONNECTED_TO]-(:User)-[:CONNECTED_TO]-(fof:User)
WHERE NOT (u)-[:CONNECTED_TO]-(fof) AND fof <> u
RETURN fof.name, COUNT(*) AS mutual_connections
ORDER BY mutual_connections DESC
LIMIT 20
// Why graph DB? SQL requires self-join for each degree of separation.
// 3rd degree in SQL: 3 nested joins — slow and messy.
// Graph DB traverses naturally.
Approach 3: Hybrid (most production systems)
- PostgreSQL for direct connections (1st degree) — queried on every profile load
- Redis cache for connection lists of active users
- Graph DB or custom distributed graph store for multi-hop PYMK traversals (LinkedIn built its own in-house graph database, LIquid, for this)
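The Redis layer in this hybrid is a standard cache-aside pattern. A minimal sketch — in production the cache is Redis (SET with a TTL) and `load_from_db` runs the SQL UNION query shown earlier; here a plain dict and a stub stand in so the pattern itself is visible:

```python
def get_connections(user_id, cache, load_from_db):
    """Cache-aside lookup of a user's 1st-degree connection list."""
    key = f"connections:{user_id}"
    if key in cache:
        return cache[key]            # cache hit: no database round-trip
    connections = load_from_db(user_id)
    cache[key] = connections         # in Redis: SET with e.g. a 1-hour TTL
    return connections
```

A second lookup for the same user is then served entirely from the cache, which is what keeps profile loads cheap for active users.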
Feed Generation: Fan-out On Write vs Fan-out On Read
Fan-out on Write (Push Model)
When a user posts, precompute and write the post to all followers' feed caches.
def publish_post(user_id, post):
    post_id = db.insert_post(user_id, post)
    # May be 50M followers for a celebrity — never fan out synchronously
    followers = graph.get_followers(user_id)
    # Fan out via message queue so the posting request isn't blocked
    for batch in chunks(followers, 1000):
        queue.send("feed_fanout", {"post_id": post_id,
                                   "timestamp": post.timestamp,
                                   "recipients": batch})

# Feed worker
def process_fanout(post_id, timestamp, recipients):
    for user_id in recipients:
        # Sorted set per user, scored by post timestamp
        redis.zadd(f"feed:{user_id}", {post_id: timestamp}, nx=True)
        # Trim to the latest 1,000 entries
        redis.zremrangebyrank(f"feed:{user_id}", 0, -1001)
Pros: read is O(1) — just read from the cache. Cons: high write amplification — a user with 50M followers triggers 50M cache writes.
Fan-out on Read (Pull Model)
When the user opens their feed, fetch posts from all connections.
def get_feed(user_id, limit=20):
    connections = cache.get_connections(user_id)
    # Single query via ANY(...) avoids N per-connection queries,
    # but the DB still scans recent posts from every connection
    post_ids = db.query("""
        SELECT id, created_at FROM posts
        WHERE user_id = ANY(%s)
          AND created_at > NOW() - INTERVAL '3 days'
        ORDER BY created_at DESC LIMIT %s
    """, connections, limit)
    return post_ids
Pros: no fan-out on write. Cons: read is O(connections) — slow for users with many connections.
Hybrid Model (Meta, LinkedIn approach)
- Regular users (followers < 10,000): fan-out on write — precompute their posts into followers' feeds
- Celebrities (followers > 10,000): fan-out on read — their posts are fetched on demand and merged with the precomputed feed
- Threshold tunable based on system load
def get_feed(user_id):
    # 1. Precomputed feed from cache (posts from regular connections)
    precomputed = redis.zrevrange(f"feed:{user_id}", 0, 200, withscores=True)
    # 2. Celebrity connections are excluded from fan-out on write
    celebrities = get_celebrity_connections(user_id)
    # 3. Fetch celebrity posts at read time
    celebrity_posts = db.get_recent_posts(celebrities, limit=50)
    # 4. Merge the two sorted streams by timestamp
    merged = merge_sorted(precomputed, celebrity_posts)
    return merged[:20]  # top 20
People You May Know (PYMK)
PYMK recommends 2nd-degree connections (friends of friends). The naive approach: for each friend, get their friends, count how many mutual connections each candidate has. Rank by mutual connection count.
def pymk(user_id: int, limit: int = 10) -> list[int]:
    connections = set(get_connections(user_id))
    candidate_scores: dict[int, int] = {}
    for conn_id in connections:
        for fof_id in get_connections(conn_id):
            if fof_id == user_id or fof_id in connections:
                continue  # skip self and existing connections
            candidate_scores[fof_id] = candidate_scores.get(fof_id, 0) + 1
    # Rank candidates by mutual-connection count (descending)
    return sorted(candidate_scores, key=candidate_scores.get, reverse=True)[:limit]
At scale: precompute PYMK scores offline (Spark job), store in Redis sorted set, update incrementally when connections change. LinkedIn uses graph-based collaborative filtering that also considers shared employers, schools, and skills.
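The incremental update can be sketched without the offline job: when a new edge (a, b) appears, every existing connection of a gains one more mutual connection with b, and vice versa. A minimal in-memory version (plain dicts stand in for the Redis sorted sets; all names are illustrative):

```python
from collections import defaultdict

adjacency: dict[int, set[int]] = defaultdict(set)
# pymk_scores[user][candidate] = number of mutual connections
pymk_scores: dict[int, dict[int, int]] = defaultdict(lambda: defaultdict(int))

def add_connection(a: int, b: int) -> None:
    """Incrementally update PYMK scores when edge (a, b) is created."""
    for friend in adjacency[a]:
        if friend != b and b not in adjacency[friend]:
            pymk_scores[friend][b] += 1   # friend and b now share mutual a
            pymk_scores[b][friend] += 1
    for friend in adjacency[b]:
        if friend != a and a not in adjacency[friend]:
            pymk_scores[friend][a] += 1
            pymk_scores[a][friend] += 1
    adjacency[a].add(b)
    adjacency[b].add(a)
    # a and b are now connected: drop them from each other's candidate lists
    pymk_scores[a].pop(b, None)
    pymk_scores[b].pop(a, None)
```

Connecting 1–2 and then 2–3 leaves users 1 and 3 as mutual candidates for each other with a score of 1; connecting them directly removes the candidacy.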
Search
User search requires full-text search with faceting (by company, location, skill). Use Elasticsearch:
- Index user profiles: name, headline, company, skills, location
- Query: BM25 relevance + personalization boost (connection degree, mutual connections)
- Facets: filter by company, job title, location
- Typeahead: prefix search with Elasticsearch completion suggester
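The bullets above can be combined into one Elasticsearch query body: a `multi_match` clause for BM25 relevance, a `term` filter for faceting, and a boosted `terms` clause for personalization. A sketch — the index fields and boost values are assumptions, not LinkedIn's actual mapping:

```python
def build_search_query(text, company=None, connection_ids=None):
    """Build an Elasticsearch query dict for personalized people search."""
    query = {
        "bool": {
            "must": [{
                "multi_match": {  # BM25 relevance across profile fields
                    "query": text,
                    "fields": ["name^3", "headline^2", "skills", "company"],
                }
            }],
            "filter": [],
            "should": [],
        }
    }
    if company:  # facet filter: exact match on the company keyword field
        query["bool"]["filter"].append({"term": {"company.keyword": company}})
    if connection_ids:  # personalization: boost 1st-degree connections
        query["bool"]["should"].append(
            {"terms": {"user_id": connection_ids, "boost": 2.0}})
    return {"query": query, "size": 20}
```

`filter` clauses don't affect scoring (and are cacheable), while `should` clauses add to the score — which is exactly the split between facets and personalization boosts.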
Database Architecture
User Service:
PostgreSQL (primary + 3 read replicas)
Shard by user_id (hash-based) at 1B+ users
Cache: Redis for hot profiles (TTL 5 min)
Graph Service:
MySQL for 1st-degree connections (OLTP)
Custom graph store for multi-hop traversals
Redis for active user connection lists
Post Service:
Cassandra (time-series, high write throughput)
Partition key: (user_id), clustering key: (created_at DESC)
Allows efficient "get N most recent posts by user X"
Feed Service:
Redis sorted sets (one per user, score = timestamp)
Keep hot users' feeds resident; evict cold ones (LRU) to bound memory
Cache miss: rebuild from graph + post service
Search Service:
Elasticsearch (near-real-time indexing from Kafka)
Personalization scores computed at query time
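The "cache miss: rebuild from graph + post service" step above can be sketched as follows — plain dicts and callables stand in for Redis and the two services, and the function names are illustrative:

```python
def rebuild_feed(user_id, get_connections, get_recent_posts, feed_store,
                 max_entries=1000):
    """Rebuild a user's feed cache from scratch (a pipelined ZADD in production)."""
    entries = []
    for conn_id in get_connections(user_id):
        for post_id, ts in get_recent_posts(conn_id):
            entries.append((ts, post_id))
    # Keep only the newest max_entries, ordered newest-first,
    # mirroring the trimmed Redis sorted set.
    entries.sort(reverse=True)
    feed_store[user_id] = entries[:max_entries]
    return feed_store[user_id]
```

Rebuilds are expensive (a full fan-in), so they should be rate-limited and reserved for genuinely cold users rather than run on every eviction.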
Interview Discussion Points
- How do you handle the thundering herd when a celebrity posts? Rate-limit fan-out writes; use async queue with backpressure; skip inactive users (not logged in for 30+ days)
- How do you detect fake accounts? Graph analysis — new accounts with too many connections too fast, connection pattern anomalies
- How do you rank the feed? Engagement signals (likes, comments, shares), freshness, connection strength, content type preference, diversity (avoid showing all posts from one person)
- How do you handle privacy settings? Filter before fan-out; re-check permissions at read time for sensitive posts; separate visibility levels (public, connections only, custom)
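The privacy point — filter before fan-out, then re-check at read time — comes down to a visibility predicate applied to every post before it is returned. A minimal sketch, with the three visibility levels from the bullet above (field names are assumptions):

```python
def can_view(post, viewer_id, connections_of_author):
    """Read-time visibility re-check, run after the feed cache is read."""
    if post["visibility"] == "public":
        return True
    if post["visibility"] == "connections":
        return viewer_id == post["author_id"] or viewer_id in connections_of_author
    if post["visibility"] == "custom":
        return viewer_id in post.get("allowed_viewers", set())
    return False  # fail closed on unknown visibility levels

def filter_feed(posts, viewer_id, get_connections):
    """Drop posts the viewer is no longer allowed to see (e.g. after unfriending)."""
    return [p for p in posts
            if can_view(p, viewer_id, get_connections(p["author_id"]))]
```

Re-checking at read time matters because permissions can change after fan-out: a post pushed to a follower's cache while connected must disappear if the connection is removed.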
Frequently Asked Questions
What is the difference between fan-out on write and fan-out on read for social feeds?
Fan-out on write (push model): when a user posts, the system immediately writes the post to all followers' feed caches. Reads are O(1) — just fetch from cache. Downside: a user with 50 million followers triggers 50 million cache writes per post, causing massive write amplification. Fan-out on read (pull model): when a user loads their feed, the system fetches recent posts from all their connections on demand. No write amplification, but reads are slow (must query N connections). Production systems use a hybrid: fan-out on write for regular users (fewer than 10,000 followers) and fan-out on read for celebrities, merging the two at read time.
How do you store and query the social graph at scale?
Most systems use a tiered approach. For 1st-degree connections (directly queried for every profile view), use a relational database (MySQL/PostgreSQL) with an adjacency list table, indexed in both directions. Hot connection lists are cached in Redis. For multi-hop traversals (2nd and 3rd degree connections needed for PYMK recommendations), a graph database (Neo4j, Amazon Neptune) or a custom distributed graph system is more efficient than multi-level SQL self-joins. LinkedIn built its own distributed graph database (LIquid) to support traversals across billions of nodes. At billion-user scale, shard the adjacency list by user_id hash.
How does People You May Know (PYMK) work?
The simplest PYMK algorithm counts mutual connections: for each of your connections, look at their connections (friends-of-friends), exclude people you already know, and rank by how many mutual connections you share. This gives you the most "socially close" recommendations. At scale, this is precomputed offline (Spark job) rather than computed on demand, because iterating through all 2nd-degree connections in real time is too slow. More sophisticated systems add additional signals: shared employers, shared schools, shared skills, geographic proximity, profile view history, and collaborative filtering (users with similar connection patterns tend to share interests). LinkedIn refreshes PYMK scores daily.