What Is a Social Graph Service?
A social graph service models relationships between users on a platform. These relationships can be bidirectional (friendships on Facebook) or unidirectional (follows on Twitter/X). The service needs to support fast lookups of connections, mutual friends, degree-of-separation queries, and feed construction. It sits at the core of almost every social product and must scale to billions of nodes and trillions of edges.
Data Model
At the storage layer the graph is most naturally represented as an adjacency list. Each user ID maps to an ordered set of neighbor IDs. A relational schema looks like this:
CREATE TABLE edges (
    from_user_id BIGINT NOT NULL,
    to_user_id   BIGINT NOT NULL,
    edge_type    ENUM('follow', 'friend', 'block') NOT NULL,
    created_at   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (from_user_id, to_user_id),
    INDEX idx_to (to_user_id, from_user_id)
);
For bidirectional edges, two rows are written atomically. The secondary index on (to_user_id, from_user_id) allows efficient reverse lookups (who follows me?). At scale this table is sharded by from_user_id, with the reverse index living on a separate shard keyed by to_user_id. Graph databases such as Neo4j or Amazon Neptune store native pointer-based adjacency and excel at multi-hop traversals, though they introduce operational complexity. Many large platforms (LinkedIn, Meta) use custom in-memory graph stores backed by persistent key-value stores like RocksDB.
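The two-shard layout can be sketched in a few lines. This is a minimal illustration, assuming a fixed shard count and simple modulo hashing; the helper names are hypothetical, not a real routing API:

```python
# Route the forward row and the reverse-index row to their shards.
# A single edge (alice -> bob) may land on two different shards,
# which is exactly why the dual-write problem discussed below exists.

NUM_SHARDS = 64  # assumed fixed shard count for illustration

def forward_shard(from_user_id: int) -> int:
    """Shard holding the forward adjacency row (who this user follows)."""
    return from_user_id % NUM_SHARDS

def reverse_shard(to_user_id: int) -> int:
    """Shard holding the reverse-index row (who follows this user)."""
    return to_user_id % NUM_SHARDS

alice, bob = 1001, 2002
fwd, rev = forward_shard(alice), reverse_shard(bob)  # typically two distinct shards
```

Real deployments use consistent hashing or a lookup service rather than bare modulo, so that shards can be split and rebalanced without rehashing every key.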
Core Workflow
The primary operations are:
- Add edge: Validate both users exist, write forward and reverse rows in a transaction, invalidate cached adjacency lists for both users, enqueue a fan-out event for feed services.
- Remove edge: Delete both rows, invalidate caches, enqueue an undo-fan-out event.
- Get neighbors: Read from cache (Redis sorted set keyed by user ID); on miss, load from the shard, populate cache with a TTL of several minutes.
- Check edge: Hash-lookup in a Bloom filter for a fast negative answer; on positive signal confirm against DB or cache.
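The read-path operations above can be sketched together. The dicts stand in for Redis and the sharded database, and the Bloom filter is a toy implementation; all names here are illustrative assumptions, not a production API:

```python
# Cache-aside neighbor lookup plus a Bloom-filter fast path for edge checks.
import hashlib

DB = {1: {2, 3}, 2: {3}}   # simulated shard: user_id -> followed user_ids
CACHE = {}                 # simulated Redis (real code would set a TTL)

def get_neighbors(user_id):
    """Cache-aside read: serve from cache, else load the shard and populate."""
    if user_id in CACHE:
        return CACHE[user_id]
    neighbors = DB.get(user_id, set())
    CACHE[user_id] = neighbors
    return neighbors

class BloomFilter:
    """Toy Bloom filter: a miss is a definite 'no', a hit is only 'maybe'."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, key):
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def check_edge(bloom, frm, to):
    """Fast negative via the Bloom filter; confirm positives against the store."""
    if not bloom.maybe_contains((frm, to)):
        return False  # guaranteed absent, no DB round trip
    return to in get_neighbors(frm)

bloom = BloomFilter()
for frm, tos in DB.items():
    for to in tos:
        bloom.add((frm, to))
```

The Bloom filter pays off because most edge checks on a sparse graph are negative ("does A follow B?" is usually "no"), so the common case never touches cache or database.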
Degree-of-separation queries (BFS up to depth 2 or 3) are executed against the in-memory graph replica to avoid heavy DB reads on every request.
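A depth-capped BFS of this kind is short enough to sketch directly. This is an illustrative version over a plain in-memory adjacency map, not the replica's actual interface:

```python
# Breadth-first search capped at max_depth, returning the degree of
# separation or None if the target is not reachable within the cap.
from collections import deque

def within_degrees(graph, start, target, max_depth=2):
    if start == target:
        return 0
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the cap
        for nbr in graph.get(node, ()):
            if nbr == target:
                return depth + 1
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return None

graph = {1: [2, 3], 2: [4], 3: [4], 4: [5]}
```

The cap matters: on a graph with average degree in the hundreds, an uncapped BFS can touch millions of nodes by depth 3, which is why these queries run against the in-memory replica rather than the primary store.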
Failure Handling and Consistency
Writing two shard rows atomically is the hardest part. Options: use a distributed transaction (two-phase commit), or write the canonical row first and use an async reconciler to write the reverse row, accepting a brief window of inconsistency. Most production systems choose eventual consistency for the reverse row and use idempotent reconcilers with retry queues. If the forward write succeeds but the reverse fails, the user sees the follow immediately while the reverse index catches up within seconds.
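The eventual-consistency option can be sketched as follows. The queue and both shards are simulated in memory, and the names are hypothetical; the point is that re-applying a queued event is a no-op, so retries are safe:

```python
# Forward row written synchronously; reverse row deferred to an
# idempotent reconciler draining a retry queue.
from collections import deque

forward = {}          # shard A: from_user -> set(to_users)
reverse = {}          # shard B: to_user -> set(from_users)
retry_queue = deque() # stands in for a durable queue (e.g. Kafka)

def add_edge(frm, to):
    """Synchronous forward write; reverse write enqueued for the reconciler."""
    forward.setdefault(frm, set()).add(to)
    retry_queue.append((frm, to))

def run_reconciler():
    """Drain the queue. set.add is idempotent, so redelivery is harmless."""
    while retry_queue:
        frm, to = retry_queue.popleft()
        reverse.setdefault(to, set()).add(frm)

add_edge(1, 2)
# Inconsistency window: the follow is visible forward but not yet reverse.
before = 1 in reverse.get(2, set())   # False
run_reconciler()
after = 1 in reverse.get(2, set())    # True
```

Idempotency is what makes this safe under at-least-once delivery: a crashed reconciler can replay the same events without corrupting the reverse index.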
Cache invalidation uses a write-through strategy. If a cache node fails, reads fall back to the database. Hot adjacency lists are replicated across multiple cache nodes to prevent single-point failures from causing DB overload.
Scalability Considerations
Sharding by from_user_id distributes write load but creates hot shards for celebrity accounts with hundreds of millions of followers. Strategies to address this:
- Read replicas: Route all reads for hot nodes to replicas, keeping the write path uncontested.
- Partitioned adjacency: Split a celebrity's follower list across multiple shards; merge at read time.
- Separate storage tiers: Store edges for accounts above a follower threshold in a dedicated high-throughput store.
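The partitioned-adjacency strategy can be sketched briefly. The partition count and routing rule are illustrative assumptions; the merge-at-read step is the essential idea:

```python
# A celebrity's follower list is split across several partitions by
# hashing the follower ID, then merged at read time.

PARTITIONS = 4  # assumed per-celebrity partition count

store = {}  # (celebrity_id, partition) -> set of follower_ids

def add_follower(celebrity_id, follower_id):
    """Route each follower to one of the celebrity's partitions."""
    key = (celebrity_id, follower_id % PARTITIONS)
    store.setdefault(key, set()).add(follower_id)

def get_followers(celebrity_id):
    """Merge all partitions at read time."""
    result = set()
    for p in range(PARTITIONS):
        result |= store.get((celebrity_id, p), set())
    return result

for f in range(10):
    add_follower(7, f)
```

Spreading the hot key across partitions means no single shard absorbs all writes for a celebrity account, at the cost of a fan-in read; in practice the merged list is paginated per partition rather than materialized whole.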
Graph traversal for recommendations and feed construction is offloaded to asynchronous batch jobs (Spark, Flink) that periodically materialize second-degree neighbor sets into a separate recommendation store.
Summary
A social graph service is fundamentally an adjacency list at massive scale. The key design decisions are shard strategy, consistency model for dual-row writes, cache layer design, and how to handle high-degree nodes. Keeping hot reads out of the primary write path and accepting eventual consistency on reverse indexes are the two most impactful choices. For interviews, walk through the edge schema, explain the two-shard write problem, and articulate the celebrity-node strategy clearly.