Social networks combine graph data structures, feed generation, search, and recommendation systems — all at massive scale. This question appears frequently at LinkedIn, Meta, Twitter/X, and Snap. The key insight is that the social graph is the product — every feature depends on efficient graph traversal and storage.
Functional Requirements
- User profiles (name, bio, work history, skills)
- Connection/follow graph (directed for Twitter-style, undirected for LinkedIn-style)
- News feed: posts from connections, ordered by relevance + recency
- Search: find users by name, company, skill
- People you may know (PYMK): connection recommendations
- Notifications: when someone connects, likes, comments
Non-Functional Requirements and Scale
| Metric | LinkedIn-Scale Target |
|---|---|
| Users | 1 billion (200M daily active) |
| Connections per user | Average 500, max 30,000+ |
| Posts per day | 5 million new posts |
| Profile views per day | 100M+ |
| Feed read QPS | ~500,000/s |
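A quick back-of-the-envelope pass over these targets (using the table's averages) shows why feed reads, not post writes, dominate the load:

```python
# Back-of-envelope estimation from the requirements table above.
SECONDS_PER_DAY = 86_400

posts_per_day = 5_000_000
avg_connections = 500        # average fan-out per post
feed_read_qps = 500_000      # from requirements

post_write_qps = posts_per_day / SECONDS_PER_DAY       # ~58 posts/s
fanout_write_qps = post_write_qps * avg_connections    # ~29,000 feed-cache writes/s

print(f"post writes:    {post_write_qps:,.0f}/s")
print(f"fan-out writes: {fanout_write_qps:,.0f}/s")
print(f"feed reads:     {feed_read_qps:,}/s")
```

Even with full fan-out on write, feed reads outnumber feed-cache writes by more than an order of magnitude, which is why the push model's O(1) reads pay off.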
Social Graph Storage
Approach 1: Relational Database
-- Adjacency list: simple, works for moderate scale
CREATE TABLE connections (
    user_id_1  BIGINT NOT NULL,
    user_id_2  BIGINT NOT NULL,
    status     ENUM('pending', 'connected') DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id_1, user_id_2)
);
-- Index both directions for undirected graph
CREATE INDEX idx_connections_reverse ON connections(user_id_2, user_id_1);
-- Get all connections for user 123
SELECT user_id_2 AS friend
FROM connections
WHERE user_id_1 = 123 AND status = 'connected'
UNION
SELECT user_id_1
FROM connections
WHERE user_id_2 = 123 AND status = 'connected';
Approach 2: Graph Database (LinkedIn uses this for PYMK)
Graph databases (Neo4j, Amazon Neptune) model nodes and edges natively and offer traversal-oriented query languages such as Cypher:
// Neo4j: Find 2nd-degree connections (friends of friends)
// Undirected pattern, since LinkedIn-style connections are mutual
MATCH (u:User {id: 123})-[:CONNECTED_TO]-(:User)-[:CONNECTED_TO]-(fof:User)
WHERE NOT (u)-[:CONNECTED_TO]-(fof) AND fof <> u
RETURN fof.name, COUNT(*) AS mutual_connections
ORDER BY mutual_connections DESC
LIMIT 20
// Why graph DB? SQL requires self-join for each degree of separation.
// 3rd degree in SQL: 3 nested joins — slow and messy.
// Graph DB traverses naturally.
Approach 3: Hybrid (most production systems)
- PostgreSQL for direct connections (1st degree) — queried on every profile load
- Redis cache for connection lists of active users
- Graph DB or custom distributed graph store for multi-hop PYMK traversals (LinkedIn built its own in-house graph database, LIquid, for this)
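The Redis layer in this hybrid is a standard cache-aside pattern. A minimal sketch — in production the cache is Redis (SET with a TTL) and `load_from_db` runs the SQL UNION query shown earlier; here a plain dict and a stub stand in so the pattern itself is visible:

```python
def get_connections(user_id, cache, load_from_db):
    """Cache-aside lookup of a user's 1st-degree connection list."""
    key = f"connections:{user_id}"
    if key in cache:
        return cache[key]            # cache hit: no database round-trip
    connections = load_from_db(user_id)
    cache[key] = connections         # in Redis: SET with e.g. a 1-hour TTL
    return connections
```

A second lookup for the same user is then served entirely from the cache, which is what keeps profile loads cheap for active users.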
Feed Generation: Fan-out On Write vs Fan-out On Read
Fan-out on Write (Push Model)
When a user posts, precompute and write the post to all followers' feed caches.
def publish_post(user_id, post):
    post_id = db.insert_post(user_id, post)
    # May be 50M followers for a celebrity — never fan out synchronously
    followers = graph.get_followers(user_id)
    # Fan out via message queue so the posting request isn't blocked
    for batch in chunks(followers, 1000):
        queue.send("feed_fanout", {"post_id": post_id,
                                   "timestamp": post.timestamp,
                                   "recipients": batch})

# Feed worker
def process_fanout(post_id, timestamp, recipients):
    for user_id in recipients:
        # Sorted set per user, scored by post timestamp
        redis.zadd(f"feed:{user_id}", {post_id: timestamp}, nx=True)
        # Trim to the latest 1,000 entries
        redis.zremrangebyrank(f"feed:{user_id}", 0, -1001)
Pros: read is O(1) — just read from the cache. Cons: high write amplification — a user with 50M followers triggers 50M cache writes.
Fan-out on Read (Pull Model)
When the user opens their feed, fetch posts from all connections.
def get_feed(user_id, limit=20):
    connections = cache.get_connections(user_id)
    # Single query via ANY(...) avoids N per-connection queries,
    # but the DB still scans recent posts from every connection
    post_ids = db.query("""
        SELECT id, created_at FROM posts
        WHERE user_id = ANY(%s)
          AND created_at > NOW() - INTERVAL '3 days'
        ORDER BY created_at DESC LIMIT %s
    """, connections, limit)
    return post_ids
Pros: no fan-out on write. Cons: read is O(connections) — slow for users with many connections.
Hybrid Model (Meta, LinkedIn approach)
- Regular users (followers < 10,000): fan-out on write — precompute their posts into followers' feeds
- Celebrities (followers > 10,000): fan-out on read — their posts are fetched on demand and merged with the precomputed feed
- Threshold tunable based on system load
def get_feed(user_id):
    # 1. Precomputed feed from cache (posts from regular connections)
    precomputed = redis.zrevrange(f"feed:{user_id}", 0, 200, withscores=True)
    # 2. Celebrity connections are excluded from fan-out on write
    celebrities = get_celebrity_connections(user_id)
    # 3. Fetch celebrity posts at read time
    celebrity_posts = db.get_recent_posts(celebrities, limit=50)
    # 4. Merge the two sorted streams by timestamp
    merged = merge_sorted(precomputed, celebrity_posts)
    return merged[:20]  # top 20
People You May Know (PYMK)
PYMK recommends 2nd-degree connections (friends of friends). The naive approach: for each friend, get their friends, count how many mutual connections each candidate has. Rank by mutual connection count.
def pymk(user_id: int, limit: int = 10) -> list[int]:
    connections = set(get_connections(user_id))
    candidate_scores: dict[int, int] = {}
    for conn_id in connections:
        for fof_id in get_connections(conn_id):
            if fof_id == user_id or fof_id in connections:
                continue  # skip self and existing connections
            candidate_scores[fof_id] = candidate_scores.get(fof_id, 0) + 1
    # Rank candidates by mutual-connection count (descending)
    return sorted(candidate_scores, key=candidate_scores.get, reverse=True)[:limit]
At scale: precompute PYMK scores offline (Spark job), store in Redis sorted set, update incrementally when connections change. LinkedIn uses graph-based collaborative filtering that also considers shared employers, schools, and skills.
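The incremental update can be sketched without the offline job: when a new edge (a, b) appears, every existing connection of a gains one more mutual connection with b, and vice versa. A minimal in-memory version (plain dicts stand in for the Redis sorted sets; all names are illustrative):

```python
from collections import defaultdict

adjacency: dict[int, set[int]] = defaultdict(set)
# pymk_scores[user][candidate] = number of mutual connections
pymk_scores: dict[int, dict[int, int]] = defaultdict(lambda: defaultdict(int))

def add_connection(a: int, b: int) -> None:
    """Incrementally update PYMK scores when edge (a, b) is created."""
    for friend in adjacency[a]:
        if friend != b and b not in adjacency[friend]:
            pymk_scores[friend][b] += 1   # friend and b now share mutual a
            pymk_scores[b][friend] += 1
    for friend in adjacency[b]:
        if friend != a and a not in adjacency[friend]:
            pymk_scores[friend][a] += 1
            pymk_scores[a][friend] += 1
    adjacency[a].add(b)
    adjacency[b].add(a)
    # a and b are now connected: drop them from each other's candidate lists
    pymk_scores[a].pop(b, None)
    pymk_scores[b].pop(a, None)
```

Connecting 1–2 and then 2–3 leaves users 1 and 3 as mutual candidates for each other with a score of 1; connecting them directly removes the candidacy.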
Search
User search requires full-text search with faceting (by company, location, skill). Use Elasticsearch:
- Index user profiles: name, headline, company, skills, location
- Query: BM25 relevance + personalization boost (connection degree, mutual connections)
- Facets: filter by company, job title, location
- Typeahead: prefix search with Elasticsearch completion suggester
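The bullets above can be combined into one Elasticsearch query body: a `multi_match` clause for BM25 relevance, a `term` filter for faceting, and a boosted `terms` clause for personalization. A sketch — the index fields and boost values are assumptions, not LinkedIn's actual mapping:

```python
def build_search_query(text, company=None, connection_ids=None):
    """Build an Elasticsearch query dict for personalized people search."""
    query = {
        "bool": {
            "must": [{
                "multi_match": {  # BM25 relevance across profile fields
                    "query": text,
                    "fields": ["name^3", "headline^2", "skills", "company"],
                }
            }],
            "filter": [],
            "should": [],
        }
    }
    if company:  # facet filter: exact match on the company keyword field
        query["bool"]["filter"].append({"term": {"company.keyword": company}})
    if connection_ids:  # personalization: boost 1st-degree connections
        query["bool"]["should"].append(
            {"terms": {"user_id": connection_ids, "boost": 2.0}})
    return {"query": query, "size": 20}
```

`filter` clauses don't affect scoring (and are cacheable), while `should` clauses add to the score — which is exactly the split between facets and personalization boosts.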
Database Architecture
User Service:
PostgreSQL (primary + 3 read replicas)
Shard by user_id (hash-based) at 1B+ users
Cache: Redis for hot profiles (TTL 5 min)
Graph Service:
MySQL for 1st-degree connections (OLTP)
Custom graph store for multi-hop traversals
Redis for active user connection lists
Post Service:
Cassandra (time-series, high write throughput)
Partition key: (user_id), clustering key: (created_at DESC)
Allows efficient "get N most recent posts by user X"
Feed Service:
Redis sorted sets (one per user, score = timestamp)
Keep hot users' feeds resident; evict cold ones (LRU) to bound memory
Cache miss: rebuild from graph + post service
Search Service:
Elasticsearch (near-real-time indexing from Kafka)
Personalization scores computed at query time
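The "cache miss: rebuild from graph + post service" step above can be sketched as follows — plain dicts and callables stand in for Redis and the two services, and the function names are illustrative:

```python
def rebuild_feed(user_id, get_connections, get_recent_posts, feed_store,
                 max_entries=1000):
    """Rebuild a user's feed cache from scratch (a pipelined ZADD in production)."""
    entries = []
    for conn_id in get_connections(user_id):
        for post_id, ts in get_recent_posts(conn_id):
            entries.append((ts, post_id))
    # Keep only the newest max_entries, ordered newest-first,
    # mirroring the trimmed Redis sorted set.
    entries.sort(reverse=True)
    feed_store[user_id] = entries[:max_entries]
    return feed_store[user_id]
```

Rebuilds are expensive (a full fan-in), so they should be rate-limited and reserved for genuinely cold users rather than run on every eviction.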
Interview Discussion Points
- How do you handle the thundering herd when a celebrity posts? Rate-limit fan-out writes; use async queue with backpressure; skip inactive users (not logged in for 30+ days)
- How do you detect fake accounts? Graph analysis — new accounts with too many connections too fast, connection pattern anomalies
- How do you rank the feed? Engagement signals (likes, comments, shares), freshness, connection strength, content type preference, diversity (avoid showing all posts from one person)
- How do you handle privacy settings? Filter before fan-out; re-check permissions at read time for sensitive posts; separate visibility levels (public, connections only, custom)
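The privacy point — filter before fan-out, then re-check at read time — comes down to a visibility predicate applied to every post before it is returned. A minimal sketch, with the three visibility levels from the bullet above (field names are assumptions):

```python
def can_view(post, viewer_id, connections_of_author):
    """Read-time visibility re-check, run after the feed cache is read."""
    if post["visibility"] == "public":
        return True
    if post["visibility"] == "connections":
        return viewer_id == post["author_id"] or viewer_id in connections_of_author
    if post["visibility"] == "custom":
        return viewer_id in post.get("allowed_viewers", set())
    return False  # fail closed on unknown visibility levels

def filter_feed(posts, viewer_id, get_connections):
    """Drop posts the viewer is no longer allowed to see (e.g. after unfriending)."""
    return [p for p in posts
            if can_view(p, viewer_id, get_connections(p["author_id"]))]
```

Re-checking at read time matters because permissions can change after fan-out: a post pushed to a follower's cache while connected must disappear if the connection is removed.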
Frequently Asked Questions
What is the difference between fan-out on write and fan-out on read for social feeds?
Fan-out on write (push model): when a user posts, the system immediately writes the post to all followers' feed caches. Reads are O(1) — just fetch from cache. Downside: a user with 50 million followers triggers 50 million cache writes per post, causing massive write amplification. Fan-out on read (pull model): when a user loads their feed, the system fetches recent posts from all their connections on demand. No write amplification, but reads are slow (must query N connections). Production systems use a hybrid: fan-out on write for regular users (fewer than 10,000 followers) and fan-out on read for celebrities, merging the two at read time.
How do you store and query the social graph at scale?
Most systems use a tiered approach. For 1st-degree connections (directly queried for every profile view), use a relational database (MySQL/PostgreSQL) with an adjacency list table, indexed in both directions. Hot connection lists are cached in Redis. For multi-hop traversals (2nd and 3rd degree connections needed for PYMK recommendations), a graph database (Neo4j, Amazon Neptune) or a custom distributed graph system is more efficient than multi-level SQL self-joins. LinkedIn built its own distributed graph database (LIquid) to support traversals across billions of nodes. At billion-user scale, shard the adjacency list by user_id hash.
How does People You May Know (PYMK) work?
The simplest PYMK algorithm counts mutual connections: for each of your connections, look at their connections (friends-of-friends), exclude people you already know, and rank by how many mutual connections you share. This gives you the most "socially close" recommendations. At scale, this is precomputed offline (Spark job) rather than computed on demand, because iterating through all 2nd-degree connections in real time is too slow. More sophisticated systems add additional signals: shared employers, shared schools, shared skills, geographic proximity, profile view history, and collaborative filtering (users with similar connection patterns tend to share interests). LinkedIn refreshes PYMK scores daily.