What Is a Social Graph?
A social graph represents relationships between users: friendships (Facebook), follow relationships (Twitter/Instagram), professional connections (LinkedIn). Core operations: add/remove an edge, check whether two users are connected, find mutual friends, suggest new connections (friends-of-friends), and traverse degrees of separation. For a sense of scale: LinkedIn’s graph has roughly 1B nodes and 30B edges; Facebook’s social graph has around 3B nodes.
Graph Representation
Adjacency list stored in a relational DB or graph DB:
-- Undirected (friendship):
friendships: user_id_1, user_id_2, created_at
-- Directed (follow):
follows: follower_id, followee_id, created_at
Indexes: (user_id_1, user_id_2) unique for O(1) connection check; (user_id_1) for O(degree) neighbor lookup; (user_id_2) for reverse lookup.
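The schema and indexes above can be sketched with SQLite standing in for the relational store. Helper names (add_friendship, are_friends) are illustrative; the canonical (min, max) ordering keeps each undirected edge as a single row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE friendships (
    user_id_1 INTEGER NOT NULL,
    user_id_2 INTEGER NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id_1, user_id_2)       -- unique pair; serves the fast connection check
);
CREATE INDEX idx_reverse ON friendships (user_id_2);  -- reverse lookup
""")

def add_friendship(a, b):
    # Store once as (min, max) so the undirected edge has one canonical row.
    lo, hi = min(a, b), max(a, b)
    conn.execute(
        "INSERT OR IGNORE INTO friendships (user_id_1, user_id_2) VALUES (?, ?)",
        (lo, hi),
    )

def are_friends(a, b):
    # Single indexed point lookup on the primary key.
    lo, hi = min(a, b), max(a, b)
    row = conn.execute(
        "SELECT 1 FROM friendships WHERE user_id_1 = ? AND user_id_2 = ?",
        (lo, hi),
    ).fetchone()
    return row is not None
```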
Mutual Friends
Mutual friends between user A and user B = the intersection of A’s friend set and B’s friend set. Naive: SELECT friends of A, SELECT friends of B, intersect in application code. Better in SQL: SELECT friend_id FROM friendships WHERE user_id = A INTERSECT SELECT friend_id FROM friendships WHERE user_id = B. For high-degree nodes (celebrities with 10M followers), this query is expensive. Cache friend sets in Redis sets and compute the intersection server-side: SINTERSTORE result friends:A friends:B. Redis SINTER runs in O(N*M) worst case, where N is the cardinality of the smallest set and M is the number of sets — fast for typical friend counts.
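The intersection step can be sketched in plain Python; the mutual_friends helper below is illustrative and mirrors the smallest-set-first strategy SINTER uses server-side:

```python
def mutual_friends(friends_a, friends_b):
    """Intersect two friend sets, iterating the smaller one --
    O(len(smaller)) membership checks against the larger set."""
    small, large = sorted((set(friends_a), set(friends_b)), key=len)
    return {f for f in small if f in large}

# In production the sets would live in Redis and the intersection would be
# computed server-side, e.g. r.sinterstore("result", ["friends:A", "friends:B"]).
```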
Friend-of-Friend Suggestions
Suggest users who are 2 hops away (friends of friends) and not already friends. Algorithm:
- Get user’s friends (1-hop neighbors): set F
- Get all friends of friends (2-hop): union of F[f] for all f in F
- Remove already-known users (user themselves + F)
- Rank by mutual friend count (higher = stronger suggestion)
At scale: 2-hop traversal can explode — a user with 1000 friends, each with 1000 friends = 1M candidates before dedup. Precompute top-K suggestions per user with a weekly batch job (Spark), store results in a suggestions table. Serve from cache.
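The four steps above can be sketched as a single pass over the 2-hop neighborhood. Here graph is an in-memory adjacency dict and suggest_friends is an illustrative helper; at scale this logic would run inside the batch job:

```python
from collections import Counter

def suggest_friends(graph, user, k=10):
    """graph: dict mapping user -> set of friends.
    Returns up to k (candidate, mutual_friend_count) pairs, strongest first."""
    friends = graph.get(user, set())
    counts = Counter()
    for f in friends:                       # 1-hop neighbors
        for fof in graph.get(f, set()):     # 2-hop neighbors
            if fof != user and fof not in friends:
                counts[fof] += 1            # each shared friend adds one mutual count
    return counts.most_common(k)
```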
Graph Database vs. Relational
Relational (PostgreSQL): works well for 1-2 hop queries with good indexing. Struggles with deep traversals (6 degrees of separation). Schema: foreign keys, joins for every hop.
Graph DB (Neo4j, Amazon Neptune): optimized for multi-hop traversal. Stores edges with direct pointers (no join needed). A Cypher query like MATCH (a)-[:FRIEND*2]-(b) WHERE a.id = 123 RETURN b traverses 2 hops natively (the relationship is left undirected because friendship is symmetric). Better for: recommendation engines, fraud ring detection, org-chart traversal.
LinkedIn uses a custom distributed graph store (Espresso + custom graph layer). Facebook uses TAO (The Associations and Objects) — a distributed key-value store with graph semantics built on MySQL.
High-Degree Node Problem (Celebrities)
A celebrity with 10M followers creates challenges: storing 10M edges takes significant space, reading them for fan-out (timeline) is expensive, and mutual-friend queries become slow. Solutions:
- Edge splitting: for “following” (asymmetric), only store follower→celebrity edges; fan-out is pull-based
- Separate data paths: celebrity posts distributed via CDN/pub-sub rather than iterating all followers
- Approximate mutual friends: HyperLogLog for approximate intersection count when exact is too slow
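As a sketch of the approximate-count idea, here is a minimal HyperLogLog with inclusion-exclusion (|A∩B| ≈ |A| + |B| − |A∪B|) for the intersection estimate. This is illustrative only — a production system would use a tuned implementation such as Redis’s built-in HLL commands:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HLL sketch: 2^p registers, ~1.04/sqrt(2^p) standard error."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias-correction constant

    def add(self, item):
        # 64-bit hash: low p bits pick a register, the rest feed the rank.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)
        w = h >> self.p
        rank = (64 - self.p) - w.bit_length() + 1  # leading-zero count + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        e = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:
            e = self.m * math.log(self.m / zeros)  # linear counting for small cardinalities
        return e

    def merge(self, other):
        # Union of two sketches: elementwise max of registers.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged
```

With sketches for friends:A, friends:B, and their merge, the approximate mutual count is count(A) + count(B) − count(A∪B); note the error compounds, so this suits display strings like “~500 mutual connections,” not exact ranking.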
Interview Tips
- Friendship (undirected): store once as (min(A,B), max(A,B)) to avoid duplicate edges.
- Connection check: O(1) with index on both users; don’t iterate all edges.
- BFS for shortest path (6 degrees): bidirectional BFS from both ends meets in the middle — O(b^(d/2)) instead of O(b^d).
- High-degree nodes: always ask about the distribution — if power-law, need special handling for celebrities.
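The bidirectional-BFS tip can be sketched as follows (adjacency-dict graph; shortest_path is an illustrative helper name):

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Bidirectional BFS on an undirected graph (dict: node -> set of neighbors).
    Returns a shortest path as a list of nodes, or None if unreachable."""
    if start == goal:
        return [start]
    # Each side keeps parent pointers for path reconstruction.
    parents_s, parents_g = {start: None}, {goal: None}
    frontier_s, frontier_g = deque([start]), deque([goal])

    def expand(frontier, parents, other_parents):
        # Expand one full BFS level; return a meeting node if frontiers touch.
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for nbr in graph.get(node, set()):
                if nbr not in parents:
                    parents[nbr] = node
                    if nbr in other_parents:
                        return nbr
                    frontier.append(nbr)
        return None

    while frontier_s and frontier_g:
        # Expand the smaller frontier first to keep the search balanced.
        if len(frontier_s) <= len(frontier_g):
            meet = expand(frontier_s, parents_s, parents_g)
        else:
            meet = expand(frontier_g, parents_g, parents_s)
        if meet is not None:
            # Stitch the two half-paths together at the meeting node.
            path, n = [], meet
            while n is not None:
                path.append(n)
                n = parents_s[n]
            path.reverse()
            n = parents_g[meet]
            while n is not None:
                path.append(n)
                n = parents_g[n]
            return path
    return None
```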
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How do you efficiently find mutual friends between two users at scale?",
      "acceptedAnswer": { "@type": "Answer", "text": "Mutual friends = intersection of user A's friend set and user B's friend set. Naive approach: load all friends of A into a set, iterate friends of B, check membership. Time O(|A| + |B|), space O(|A|). Works fine for typical friend counts (hundreds). At LinkedIn/Facebook scale with celebrities having millions of connections: (1) Redis set intersection: the friend sets are kept in Redis; SINTER friends:A friends:B computes the intersection server-side in O(min(|A|, |B|)) for two sets. (2) HyperLogLog for approximate count: when you only need the count of mutual friends (not the list), HLL gives an estimate with ~0.81% standard error in O(1) space. Used for 'X and Y have ~500 mutual connections.' (3) For recommendation scoring: precompute mutual friend counts between all pairs within 2 hops as a weekly Spark job. Store top-K potential connections with mutual count in a recommendations table. Query is O(1) at serve time. (4) Bloom filter: if you only need to check whether a specific person is a mutual friend, check whether they are in both friend Bloom filters. O(1), no false negatives (only false positives)." }
    },
    {
      "@type": "Question",
      "name": "How does bidirectional BFS find the shortest social path between two users?",
      "acceptedAnswer": { "@type": "Answer", "text": "Single-source BFS from A to B explores O(b^d) nodes where b is the average branching factor (friend count) and d is the path length. For a social graph with b=200 average friends and d=6 (six degrees of separation): 200^6 = 64 trillion nodes — infeasible. Bidirectional BFS: start BFS simultaneously from both A and B. Alternate expanding one level from A, then one level from B. Stop when the frontiers meet (a node is found in both visited sets). Path length = level_from_A + level_from_B. Complexity: O(b^(d/2)) from each side, so O(b^(d/2)) total. For b=200, d=6: 200^3 = 8 million per side — feasible. The path is found when a neighbor being explored from A already exists in B's visited set (or vice versa); it is reconstructed by following parent pointers from the meeting node toward both endpoints. Implementation: two queues (one per side) and two visited dictionaries (each storing a parent pointer for reconstruction). Stop condition: check whether a newly discovered node exists in the other side's visited set during each level expansion." }
    },
    {
      "@type": "Question",
      "name": "How does TAO (Facebook's social graph store) work?",
      "acceptedAnswer": { "@type": "Answer", "text": "TAO (The Associations and Objects) is Facebook's distributed graph store, built on top of MySQL with a custom caching layer. Two entity types: Objects (users, posts, pages — keyed records with typed fields) and Associations (directed edges between objects — friend, like, comment — with a timestamp and payload). TAO provides a simple API: assoc_get (get edges from a source), assoc_count (count edges), obj_get (get an object). Why not a graph database: Facebook wanted the reliability and operational familiarity of MySQL rather than a novel graph DB. TAO adds: (1) Sharding: objects and their outgoing edges are sharded by object ID, so all edges from one object land on the same shard. (2) Caching: a two-tier caching layer (leader and follower cache servers) sits in front of MySQL; most reads are served from cache. (3) Geographic distribution: follower cache tiers in multiple regions serve reads close to users. (4) Write-through: writes go through the cache to MySQL and invalidate stale entries. The design choice: deep operational familiarity with MySQL plus a custom cache and a simple graph API beat a purpose-built graph database at their scale and operational requirements." }
    }
  ]
}