System Design Interview: Design a Social Graph (Twitter/LinkedIn)

Social graph design is a foundational problem asked at Twitter, LinkedIn, Meta, Snap, and any company building social features. It tests your understanding of graph data structures at scale, feed generation, follower/following relationships, and recommendation systems.

Requirements Clarification

Functional Requirements

  • Follow/unfollow users (Twitter-style: asymmetric) or connect (LinkedIn-style: symmetric)
  • Get followers and following lists for any user
  • Suggest users to follow (people you may know)
  • Check if user A follows user B
  • Calculate mutual connections/followers between two users
  • Generate social feed from followed users’ posts

Non-Functional Requirements

  • Scale: Twitter — 400M users, celebrities with 100M+ followers
  • Follow/unfollow: low latency write (~10ms)
  • Feed generation: P99 under 500ms
  • Follow check: O(1) or O(log n) — used in every feed generation
  • High read-to-write ratio: 100:1
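
These targets can be sanity-checked with a quick back-of-envelope estimate. A sketch, assuming an average of roughly 200 follows per user (an illustrative figure, not from the requirements):

```python
# Back-of-envelope sizing for the edge store (illustrative assumptions)
USERS = 400_000_000          # from the scale requirement above
AVG_FOLLOWS_PER_USER = 200   # assumption, for illustration only
EDGE_BYTES = 8 + 8 + 8       # follower_id + followee_id + timestamp

total_edges = USERS * AVG_FOLLOWS_PER_USER
raw_edge_storage_tb = total_edges * EDGE_BYTES / 1e12

print(f"edges: {total_edges:,}")                          # 80,000,000,000
print(f"raw edge storage: {raw_edge_storage_tb:.2f} TB")  # 1.92 TB before indexes/replication
```

Roughly 80B edges and a few terabytes of raw edge data: small enough that the hard part is access patterns and hot spots, not total volume.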

Data Model: Edge-Based Graph

-- Core: each follow relationship is one row
CREATE TABLE follows (
    follower_id     BIGINT NOT NULL,      -- who is following
    followee_id     BIGINT NOT NULL,      -- who is being followed
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (follower_id, followee_id)
);

-- Indexes for bidirectional queries
CREATE INDEX idx_follows_followee ON follows(followee_id, created_at DESC);
-- follower_id index via PK; followee lookup via explicit index

-- Denormalized counters (avoid expensive COUNT queries)
CREATE TABLE user_stats (
    user_id         BIGINT PRIMARY KEY,
    follower_count  INT NOT NULL DEFAULT 0,
    following_count INT NOT NULL DEFAULT 0,
    post_count      INT NOT NULL DEFAULT 0
);

-- For LinkedIn symmetric: store both directions
-- follow(A,B) AND follow(B,A) on connection accept
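
For the symmetric case, the two rows can be derived with a small pure helper. A sketch (the `connection_edges` name is mine, not from any real system):

```python
def connection_edges(user_a: int, user_b: int) -> list[tuple[int, int]]:
    """Both (follower_id, followee_id) rows written when a LinkedIn-style
    connection is accepted; empty for a self-connection."""
    if user_a == user_b:
        return []
    return [(user_a, user_b), (user_b, user_a)]

# On accept, both rows would be inserted into `follows` in one transaction,
# each with ON CONFLICT DO NOTHING, plus counter updates for both users.
```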

Follow/Unfollow Operations

import redis
from typing import Optional

redis_client = redis.Redis(decode_responses=True)

def follow_user(follower_id: int, followee_id: int) -> bool:
    """
    Follow user. Atomic: insert edge + update counters.
    Also update Redis cache for fast lookups.
    """
    if follower_id == followee_id:
        return False

    with db.transaction():
        # Insert edge (idempotent with ON CONFLICT)
        result = db.execute("""
            INSERT INTO follows (follower_id, followee_id)
            VALUES ($1, $2)
            ON CONFLICT DO NOTHING
            RETURNING 1
        """, follower_id, followee_id)

        if not result:
            return False  # Already following

        # Update counters atomically
        db.execute("""
            UPDATE user_stats SET following_count = following_count + 1
            WHERE user_id = $1
        """, follower_id)
        db.execute("""
            UPDATE user_stats SET follower_count = follower_count + 1
            WHERE user_id = $1
        """, followee_id)

    # Update Redis cache
    # Follower's following set
    redis_client.sadd(f"following:{follower_id}", followee_id)
    # Followee's followers set
    redis_client.sadd(f"followers:{followee_id}", follower_id)
    # Invalidate feed cache for follower (new posts from followee)
    redis_client.delete(f"feed:{follower_id}")

    # Trigger async: update feed fan-out for follower
    queue.publish('follow_events', {
        'type': 'FOLLOW',
        'follower_id': follower_id,
        'followee_id': followee_id,
    })
    return True
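
The unfollow path mirrors `follow_user`: delete the edge, decrement the counters, and undo the cache writes. A sketch with `db` and `cache` passed in (rather than the module globals above) so the logic is easy to exercise; the cache calls assume redis-py's `srem`/`delete`:

```python
def unfollow_user(follower_id: int, followee_id: int, db, cache) -> bool:
    """Remove a follow edge. Returns False if there was nothing to remove."""
    if follower_id == followee_id:
        return False

    with db.transaction():
        result = db.execute("""
            DELETE FROM follows
            WHERE follower_id = $1 AND followee_id = $2
            RETURNING 1
        """, follower_id, followee_id)
        if not result:
            return False  # Was not following

        db.execute("""
            UPDATE user_stats SET following_count = following_count - 1
            WHERE user_id = $1
        """, follower_id)
        db.execute("""
            UPDATE user_stats SET follower_count = follower_count - 1
            WHERE user_id = $1
        """, followee_id)

    # Remove from both cached adjacency sets; drop the now-stale feed
    cache.srem(f"following:{follower_id}", followee_id)
    cache.srem(f"followers:{followee_id}", follower_id)
    cache.delete(f"feed:{follower_id}")
    return True
```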

def is_following(follower_id: int, followee_id: int) -> bool:
    """
    O(1) check using Redis set membership.
    Fall back to DB if cache miss.
    """
    # Try Redis first
    cache_key = f"following:{follower_id}"
    if redis_client.exists(cache_key):
        return redis_client.sismember(cache_key, followee_id)

    # DB fallback
    result = db.fetchval("""
        SELECT 1 FROM follows WHERE follower_id = $1 AND followee_id = $2
    """, follower_id, followee_id)
    return result is not None

def get_followers(user_id: int, cursor: Optional[str] = None,
                  limit: int = 20) -> dict:
    """Paginated follower list using cursor-based pagination"""
    query = """
        SELECT follower_id, created_at FROM follows
        WHERE followee_id = $1
    """
    params = [user_id]

    if cursor:
        # Decode cursor: base64-encoded (created_at, follower_id)
        cursor_time, cursor_id = decode_cursor(cursor)
        query += " AND (created_at, follower_id) < ($2, $3)"
        params.extend([cursor_time, cursor_id])

    query += f" ORDER BY created_at DESC, follower_id DESC LIMIT {limit + 1}"

    rows = db.fetch(query, *params)
    has_more = len(rows) > limit
    rows = rows[:limit]

    next_cursor = (encode_cursor(rows[-1]['created_at'], rows[-1]['follower_id'])
                   if rows and has_more else None)
    return {
        'followers': [r['follower_id'] for r in rows],
        'next_cursor': next_cursor,
        'has_more': has_more,
    }
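
The `encode_cursor`/`decode_cursor` helpers above are left undefined; one minimal sketch packs the sort key into URL-safe base64-encoded JSON:

```python
import base64
import json
from datetime import datetime

def encode_cursor(created_at: datetime, follower_id: int) -> str:
    """Opaque cursor: base64 of (created_at ISO string, follower_id)."""
    raw = json.dumps([created_at.isoformat(), follower_id])
    return base64.urlsafe_b64encode(raw.encode()).decode()

def decode_cursor(cursor: str) -> tuple[datetime, int]:
    ts, follower_id = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return datetime.fromisoformat(ts), follower_id
```

Keeping the cursor opaque lets the server change its contents (e.g. add a shard hint) without breaking clients.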

Social Graph Storage at Scale

Option 1: Relational DB (up to ~100M edges)

PostgreSQL with proper indexes handles Twitter-scale traffic for most users. The follows table is sharded by follower_id. Limitation: celebrity follower lists (100M rows) are slow to page through.

Option 2: Graph Database (Neo4j/Neptune)

# Neo4j Cypher query for mutual connections
# "People who follow X also follow..."
mutual_query = """
MATCH (me:User {id: $user_id})-[:FOLLOWS]->(common:User)<-[:FOLLOWS]-(other:User)
WHERE other.id <> $user_id
  AND NOT (me)-[:FOLLOWS]->(other)
WITH other, COUNT(common) AS mutual_count
ORDER BY mutual_count DESC
LIMIT 20
RETURN other.id, mutual_count
"""
# Neo4j excels at variable-length graph traversals (BFS/DFS)
# but doesn't scale as well as purpose-built systems for 100M+ edges

Option 3: Custom Graph Store (Twitter Flock, LinkedIn Leo)

class SocialGraph:
    """
    Twitter's Flock: dedicated social graph service.
    Stores adjacency lists in sharded Redis with MySQL as source of truth.
    
    Key insight: for most queries, you only need:
    - Does A follow B? → O(1) set membership
    - Who does A follow? → list of IDs (paginated)
    - Who follows A? → list of IDs (paginated)
    """

    def __init__(self, redis_cluster, db):
        self.redis = redis_cluster
        self.db = db

    def get_following_ids(self, user_id: int) -> list[int]:
        """
        Return all user IDs that user_id follows.
        For normal users: stored in Redis set.
        For super-followers (10K+ following): paginate from DB.
        """
        redis_key = f"following:{user_id}"
        cached = self.redis.smembers(redis_key)

        if cached:
            return [int(uid) for uid in cached]

        # Cache miss: load from DB and populate Redis
        rows = self.db.fetch("""
            SELECT followee_id FROM follows WHERE follower_id = $1
        """, user_id)
        ids = [r['followee_id'] for r in rows]

        if ids:
            pipe = self.redis.pipeline()
            pipe.sadd(redis_key, *ids)
            pipe.expire(redis_key, 3600)  # 1 hour TTL
            pipe.execute()

        return ids

People You May Know (PYMK) Recommendations

def suggest_users_to_follow(user_id: int, limit: int = 10) -> list[dict]:
    """
    Friend-of-friend recommendations:
    Find users followed by people I follow, but not yet followed by me.
    Score by: mutual_follower_count, followed_by_celebs_i_follow, etc.
    """
    my_following = set(get_following_ids(user_id))

    # Second-degree connections
    candidates = {}
    for followee_id in list(my_following)[:100]:  # sample to avoid explosion
        their_following = get_following_ids(followee_id)
        for candidate_id in their_following:
            if candidate_id == user_id or candidate_id in my_following:
                continue
            if candidate_id not in candidates:
                candidates[candidate_id] = {'mutual': 0, 'score': 0}
            candidates[candidate_id]['mutual'] += 1

    # Score and rank
    scored = sorted(candidates.items(),
                   key=lambda x: x[1]['mutual'],
                   reverse=True)[:limit * 2]

    # Enrich with user data
    suggested_ids = [uid for uid, _ in scored[:limit]]
    users = db.fetch("SELECT * FROM users WHERE id = ANY($1)", suggested_ids)
    return users

# More sophisticated: use pre-computed embeddings
# Twitter's WTF (Who to Follow): bipartite graph with SALSA algorithm
# LinkedIn's People You May Know: uses explicit signals (company, school, location)

Feed Generation from Social Graph

def get_feed(user_id: int, cursor: Optional[str] = None, limit: int = 20) -> dict:
    """
    Pull-based feed (simpler, handles celebrity problem):
    Fetch recent posts from everyone user follows.
    Works well for users following <1000 accounts.
    """
    following_ids = get_following_ids(user_id)

    if not following_ids:
        return {'posts': [], 'cursor': None}

    # Fetch recent posts from all followed users
    posts = db.fetch("""
        SELECT p.*, u.username, u.profile_photo
        FROM posts p
        JOIN users u ON u.id = p.author_id
        WHERE p.author_id = ANY($1::bigint[])
          AND ($2::timestamptz IS NULL OR p.created_at < $2)
        ORDER BY p.created_at DESC
        LIMIT $3
    """, following_ids, cursor, limit + 1)
    has_more = len(posts) > limit
    posts = posts[:limit]
    new_cursor = posts[-1]['created_at'].isoformat() if posts and has_more else None

    return {'posts': posts, 'cursor': new_cursor}

# Push-based feed (Twitter fanout-on-write): better for users following many
# When celebrity posts: fan out to all followers' feed timelines (in Redis)
# Celebrity problem: 100M followers x 1 post = 100M writes — use hybrid approach
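
The hybrid approach in the comments above boils down to a single branch at post time. A sketch, where `get_follower_ids` and `push_to_feed` are injected stand-ins for the graph service and the Redis timeline write, and the threshold is an assumed cutoff:

```python
CELEBRITY_THRESHOLD = 1_000_000  # assumption: aligns with the celebrity tier below

def fan_out_post(post_id: int, author_id: int, follower_count: int,
                 get_follower_ids, push_to_feed) -> bool:
    """Hybrid fan-out: push on write for normal authors, skip for celebrities.

    Returns True if the post was fanned out on write; celebrity posts are
    instead merged into followers' feeds at read time.
    """
    if follower_count >= CELEBRITY_THRESHOLD:
        return False
    for follower_id in get_follower_ids(author_id):
        push_to_feed(follower_id, post_id)
    return True
```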

Sharding Strategy for the Graph

  • Shard by follower_id: “Who does user X follow?” is fast (same shard). But “Who follows user X?” requires scatter-gather across all shards for celebrities.
  • Dual-write: Store both (follower→followee) and (followee→follower) edges in separate tables sharded by respective IDs. Doubles storage but eliminates scatter-gather.
  • Celebrity tier: Special handling for users with >1M followers. Store celebrity followers in separate highly-replicated store. Feed generation for celebrity posts uses async fan-out workers.
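
The dual-write scheme reduces to deterministic routing of each edge direction to its owner's shard. A minimal sketch (simple modulo; production systems typically use consistent hashing so resharding moves fewer keys):

```python
NUM_SHARDS = 64  # illustrative

def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Deterministic shard routing by user ID."""
    return user_id % num_shards

# Dual-write: the forward edge lives on the follower's shard and the reverse
# edge on the followee's shard, so both directions are single-shard reads.
def edge_shards(follower_id: int, followee_id: int) -> tuple[int, int]:
    return shard_for(follower_id), shard_for(followee_id)
```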

Key Interview Points

  • Asymmetric (Twitter follow) vs symmetric (LinkedIn connect) — different data models
  • Celebrity problem: Ariana Grande with 100M followers cannot be treated same as regular users
  • Redis for O(1) follow checks and following lists (hot cache); DB as source of truth
  • PYMK via friend-of-friend traversal; at scale use pre-computed ML embeddings
  • Feed: pull model for low fanout users, push model for active users, hybrid for celebrities