System Design Interview: Design a Social Graph (Twitter Follow / LinkedIn Connect)
Social graph design is a foundational problem asked at Twitter, LinkedIn, Meta, Snap, and any company building social features. It tests your understanding of graph data structures at scale, feed generation, follower/following relationships, and recommendation systems.
Requirements Clarification
Functional Requirements
- Follow/unfollow users (Twitter-style: asymmetric) or connect (LinkedIn-style: symmetric)
- Get followers and following lists for any user
- Suggest users to follow (people you may know)
- Check if user A follows user B
- Calculate mutual connections/followers between two users
- Generate social feed from followed users’ posts
Non-Functional Requirements
- Scale: Twitter — 400M users, celebrities with 100M+ followers
- Follow/unfollow: low latency write (~10ms)
- Feed generation: P99 under 500ms
- Follow check: O(1) or O(log n) — used in every feed generation
- High read-to-write ratio: 100:1
Data Model: Edge-Based Graph
-- Core: each follow relationship is one row
CREATE TABLE follows (
follower_id BIGINT NOT NULL, -- who is following
followee_id BIGINT NOT NULL, -- who is being followed
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (follower_id, followee_id)
);
-- Indexes for bidirectional queries
CREATE INDEX idx_follows_followee ON follows(followee_id, created_at DESC);
-- follower_id index via PK; followee lookup via explicit index
-- Denormalized counters (avoid expensive COUNT queries)
CREATE TABLE user_stats (
user_id BIGINT PRIMARY KEY,
follower_count INT NOT NULL DEFAULT 0,
following_count INT NOT NULL DEFAULT 0,
post_count INT NOT NULL DEFAULT 0
);
-- For LinkedIn symmetric: store both directions
-- follow(A,B) AND follow(B,A) on connection accept
Follow/Unfollow Operations
import redis
from typing import Optional
redis_client = redis.Redis(decode_responses=True)
def follow_user(follower_id: int, followee_id: int) -> bool:
"""
Follow user. Atomic: insert edge + update counters.
Also update Redis cache for fast lookups.
"""
if follower_id == followee_id:
return False
with db.transaction():
# Insert edge (idempotent with ON CONFLICT)
result = db.execute("""
INSERT INTO follows (follower_id, followee_id)
VALUES ($1, $2)
ON CONFLICT DO NOTHING
RETURNING 1
""", follower_id, followee_id)
if not result:
return False # Already following
# Update counters atomically
db.execute("""
UPDATE user_stats SET following_count = following_count + 1
WHERE user_id = $1
""", follower_id)
db.execute("""
UPDATE user_stats SET follower_count = follower_count + 1
WHERE user_id = $1
""", followee_id)
# Update Redis cache
# Follower's following set
redis_client.sadd(f"following:{follower_id}", followee_id)
# Followee's followers set
redis_client.sadd(f"followers:{followee_id}", follower_id)
# Invalidate feed cache for follower (new posts from followee)
redis_client.delete(f"feed:{follower_id}")
# Trigger async: update feed fan-out for follower
queue.publish('follow_events', {
'type': 'FOLLOW',
'follower_id': follower_id,
'followee_id': followee_id,
})
return True
def is_following(follower_id: int, followee_id: int) -> bool:
"""
O(1) check using Redis set membership.
Fall back to DB if cache miss.
"""
# Try Redis first
cache_key = f"following:{follower_id}"
if redis_client.exists(cache_key):
return redis_client.sismember(cache_key, followee_id)
# DB fallback
result = db.fetchval("""
SELECT 1 FROM follows WHERE follower_id = $1 AND followee_id = $2
""", follower_id, followee_id)
return result is not None
def get_followers(user_id: int, cursor: Optional[str] = None,
limit: int = 20) -> dict:
"""Paginated follower list using cursor-based pagination"""
query = """
SELECT follower_id, created_at FROM follows
WHERE followee_id = $1
"""
params = [user_id]
if cursor:
# Decode cursor: base64-encoded (follower_id, created_at)
cursor_time, cursor_id = decode_cursor(cursor)
query += " AND (created_at, follower_id) limit
rows = rows[:limit]
next_cursor = encode_cursor(rows[-1]['created_at'], rows[-1]['follower_id']) if rows and has_more else None
return {
'followers': [r['follower_id'] for r in rows],
'next_cursor': next_cursor,
'has_more': has_more,
}
Social Graph Storage at Scale
Option 1: Relational DB (up to ~100M edges)
PostgreSQL with proper indexes handles Twitter-scale for most users. The follows table sharded by follower_id. Limitation: celebrity followers lists (100M rows) are slow to page through.
Option 2: Graph Database (Neo4j/Neptune)
# Neo4j Cypher query for mutual connections
# "People who follow X also follow..."
mutual_query = """
MATCH (me:User {id: })-[:FOLLOWS]->(common:User)<-[:FOLLOWS]-(other:User)
WHERE other.id
AND NOT (me)-[:FOLLOWS]->(other)
WITH other, COUNT(common) AS mutual_count
ORDER BY mutual_count DESC
LIMIT 20
RETURN other.id, mutual_count
"""
# Neo4j excels at variable-length graph traversals (BFS/DFS)
# but doesn't scale as well as purpose-built systems for 100M+ edges
Option 3: Custom Graph Store (Twitter Flock, LinkedIn Leo)
class SocialGraph:
"""
Twitter's Flock: dedicated social graph service.
Stores adjacency lists in sharded Redis with MySQL as source of truth.
Key insight: for most queries, you only need:
- Does A follow B? → O(1) set membership
- Who does A follow? → list of IDs (paginated)
- Who follows A? → list of IDs (paginated)
"""
def __init__(self, redis_cluster, db):
self.redis = redis_cluster
self.db = db
def get_following_ids(self, user_id: int) -> list[int]:
"""
Return all user IDs that user_id follows.
For normal users: stored in Redis set.
For super-followers (10K+ following): paginate from DB.
"""
redis_key = f"following:{user_id}"
cached = self.redis.smembers(redis_key)
if cached:
return [int(uid) for uid in cached]
# Cache miss: load from DB and populate Redis
rows = self.db.fetch("""
SELECT followee_id FROM follows WHERE follower_id = $1
""", user_id)
ids = [r['followee_id'] for r in rows]
if ids:
pipe = self.redis.pipeline()
pipe.sadd(redis_key, *ids)
pipe.expire(redis_key, 3600) # 1 hour TTL
pipe.execute()
return ids
People You May Know (PYMK) Recommendations
def suggest_users_to_follow(user_id: int, limit: int = 10) -> list[dict]:
"""
Friend-of-friend recommendations:
Find users followed by people I follow, but not yet followed by me.
Score by: mutual_follower_count, followed_by_celebs_i_follow, etc.
"""
my_following = set(get_following_ids(user_id))
# Second-degree connections
candidates = {}
for followee_id in list(my_following)[:100]: # sample to avoid explosion
their_following = get_following_ids(followee_id)
for candidate_id in their_following:
if candidate_id == user_id or candidate_id in my_following:
continue
if candidate_id not in candidates:
candidates[candidate_id] = {'mutual': 0, 'score': 0}
candidates[candidate_id]['mutual'] += 1
# Score and rank
scored = sorted(candidates.items(),
key=lambda x: x[1]['mutual'],
reverse=True)[:limit * 2]
# Enrich with user data
suggested_ids = [uid for uid, _ in scored[:limit]]
users = db.fetch("SELECT * FROM users WHERE id = ANY($1)", suggested_ids)
return users
# More sophisticated: use pre-computed embeddings
# Twitter's WTF (Who to Follow): bipartite graph with SALSA algorithm
# LinkedIn's People You May Know: uses explicit signals (company, school, location)
Feed Generation from Social Graph
def get_feed(user_id: int, cursor: str = None, limit: int = 20) -> dict:
"""
Pull-based feed (simpler, handles celebrity problem):
Fetch recent posts from everyone user follows.
Works well for users following <1000 accounts.
"""
following_ids = get_following_ids(user_id)
if not following_ids:
return {'posts': [], 'cursor': None}
# Fetch recent posts from all followed users
posts = db.fetch("""
SELECT p.*, u.username, u.profile_photo
FROM posts p
JOIN users u ON u.id = p.author_id
WHERE p.author_id = ANY($1::bigint[])
AND ($2::timestamptz IS NULL OR p.created_at limit
posts = posts[:limit]
new_cursor = posts[-1]['created_at'].isoformat() if posts and has_more else None
return {'posts': posts, 'cursor': new_cursor}
# Push-based feed (Twitter fanout-on-write): better for users following many
# When celebrity posts: fan out to all followers' feed timelines (in Redis)
# Celebrity problem: 100M followers x 1 post = 100M writes — use hybrid approach
Sharding Strategy for the Graph
- Shard by follower_id: “Who does user X follow?” is fast (same shard). But “Who follows user X?” requires scatter-gather across all shards for celebrities.
- Dual-write: Store both (follower→followee) and (followee→follower) edges in separate tables sharded by respective IDs. Doubles storage but eliminates scatter-gather.
- Celebrity tier: Special handling for users with >1M followers. Store celebrity followers in separate highly-replicated store. Feed generation for celebrity posts uses async fan-out workers.
Key Interview Points
- Asymmetric (Twitter follow) vs symmetric (LinkedIn connect) — different data models
- Celebrity problem: Ariana Grande with 100M followers cannot be treated same as regular users
- Redis for O(1) follow checks and following lists (hot cache); DB as source of truth
- PYMK via friend-of-friend traversal; at scale use pre-computed ML embeddings
- Feed: pull model for low fanout users, push model for active users, hybrid for celebrities