System Design: Chat System (WhatsApp / Messenger Scale)

A chat system is a perennial system design interview question. It requires you to reason about real-time communication, persistent storage, delivery guarantees, and state management simultaneously. WhatsApp handles 100 billion messages per day with a surprisingly lean architecture. Understanding how they did it — and the trade-offs they made — will take you far in an interview.

Step 1: Clarify Requirements

  • Chat types: 1-on-1 only, or group chats too? How large (WhatsApp limits groups to 1,024)?
  • Message types: Text only, or media (images, video, audio)?
  • Delivery guarantees: At-least-once? Exactly-once? Delivery receipts (sent/delivered/read)?
  • Online presence: Show online/offline/last seen status?
  • Message history: Stored server-side forever, or deleted after delivery (Signal model)?
  • Scale: How many DAU? Messages per day?

Assume: 500M DAU, 100B messages/day (~1.1M/sec), 1-on-1 and group chat (max 100 members), text + media, server-side message storage with delivery receipts, online presence required.

Step 2: Back-of-Envelope

Messages: 100B/day = ~1.16M/sec average, ~5M/sec peak
Storage per message:
  msg_id (8B) + sender_id (8B) + recipient_id (8B)
  + content (avg 100B) + timestamp (8B) + status (1B) ≈ 133 bytes
  100B msgs/day × 133B = ~13TB/day → ~5PB over 1 year

Media:
  ~10% of messages contain media, avg 500KB
  10B media msgs/day × 500KB = 5PB/day (needs CDN + object storage)

Active connections:
  500M DAU, each holding a persistent WebSocket connection while active
  → up to 500M concurrent connections across chat servers (peak concurrency is a fraction of DAU, but plan for the worst case)
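The arithmetic above is worth sanity-checking. A quick script (not part of the design, just the envelope math):

```python
# Back-of-envelope sanity check for the estimates above.
MSGS_PER_DAY = 100e9
SECONDS_PER_DAY = 86_400

avg_rate = MSGS_PER_DAY / SECONDS_PER_DAY       # ~1.16M msgs/sec average
bytes_per_msg = 8 + 8 + 8 + 100 + 8 + 1         # ids + content + timestamp + status = 133B
text_per_day = MSGS_PER_DAY * bytes_per_msg     # ~13.3 TB/day
text_per_year = text_per_day * 365              # ~4.9 PB/year

media_per_day = 0.10 * MSGS_PER_DAY * 500e3     # 10B media msgs x 500KB = 5 PB/day

print(f"{avg_rate / 1e6:.2f}M msgs/sec average")
print(f"{text_per_day / 1e12:.1f} TB/day text, {text_per_year / 1e15:.1f} PB/year")
print(f"{media_per_day / 1e15:.0f} PB/day media")
```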

Step 3: Why WebSockets?

HTTP is request-response — the client must ask the server for new messages (polling). For real-time chat this means either:

  • Short polling: Client asks “any new messages?” every second. Wasteful — 99% of polls return nothing.
  • Long polling: Client sends a request, the server holds it open until a message arrives or a timeout fires. Better, but every delivered message costs a fresh HTTP round trip, and the client still can't send without opening a separate request.
  • WebSockets: A persistent, full-duplex TCP connection. Server can push messages to the client at any time. One connection per user. This is what WhatsApp, Slack, and Discord all use.
# WebSocket lifecycle
Client → Chat Server: HTTP Upgrade request
Chat Server → Client: 101 Switching Protocols
# Now bidirectional — server can push, client can send
Client → Server: { "type": "message", "to": "user_b", "content": "hey" }
Server → Client: { "type": "message", "from": "user_a", "content": "hey" }
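The Upgrade step in the trace above includes a key check defined by RFC 6455: the server proves it understood the WebSocket request by hashing the client's Sec-WebSocket-Key with a fixed GUID and echoing the result back as Sec-WebSocket-Accept. A minimal sketch of that computation:

```python
import base64
import hashlib

# RFC 6455 handshake: the server responds with
# Sec-WebSocket-Accept: base64(SHA-1(Sec-WebSocket-Key + magic GUID))
WS_MAGIC = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(client_key: str) -> str:
    digest = hashlib.sha1((client_key + WS_MAGIC).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

# The example key/accept pair from RFC 6455 itself:
print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))
# → s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```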

Step 4: High-Level Architecture

                     ┌─────────────────┐
                     │   API Gateway   │  ← REST API for non-realtime (profile, media upload)
                     └────────┬────────┘
                              │
              ┌───────────────┼───────────────┐
              ↓               ↓               ↓
       ┌────────────┐  ┌────────────┐  ┌────────────┐
       │Chat Server │  │Chat Server │  │Chat Server │  ← stateful WebSocket servers
       │(node 1)    │  │(node 2)    │  │(node 3)    │    each holds ~100K connections
       └─────┬──────┘  └─────┬──────┘  └─────┬──────┘
             └───────────────┼───────────────┘
                             ↓
                    ┌─────────────────┐
                    │  Message Queue  │  ← Kafka: decouple delivery from storage
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              ↓              ↓              ↓
       ┌───────────┐  ┌───────────┐  ┌───────────┐
       │Message DB │  │Presence   │  │Push Notif │
       │(Cassandra)│  │Service    │  │Service    │
       └───────────┘  │(Redis)    │  └───────────┘
                      └───────────┘

Step 5: Message Flow

Sending a Message (User A → User B)

  1. User A sends message over WebSocket to their connected Chat Server (node 1).
  2. Chat Server persists the message to the Message DB (Cassandra) and assigns a message ID.
  3. Chat Server publishes to Kafka: { msg_id, sender: A, recipient: B, content }.
  4. A Delivery Service consumes from Kafka. It checks the Presence Service to find which Chat Server node User B is currently connected to.
  5. If User B is online: Delivery Service sends the message to that Chat Server node via an internal pub/sub (Redis pub/sub or internal gRPC call). That node pushes to User B’s WebSocket.
  6. If User B is offline: Delivery Service hands off to the Push Notification Service (APNs for iOS, FCM for Android).
  7. User B’s client acknowledges receipt → Chat Server updates message status to “delivered.”
  8. When User B reads the message → status updates to “read” → User A sees the read receipt.
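Steps 4 through 6 can be sketched as a single routing function. The names here (`presence`, `push_to_server`, `send_push_notification`) are illustrative stand-ins, not a real API:

```python
# Illustrative delivery logic for steps 4-6. The presence store maps
# user_id -> chat server node, or None when the user is offline.
def deliver(message, presence, push_to_server, send_push_notification):
    recipient = message["recipient"]
    node = presence.get(recipient)                  # step 4: find recipient's node
    if node is not None:
        push_to_server(node, message)               # step 5: push via that node's WebSocket
    else:
        send_push_notification(recipient, message)  # step 6: offline -> APNs/FCM

# Usage with fakes standing in for the real services:
delivered, notified = [], []
presence = {"user_b": "node-3"}
deliver({"msg_id": 1, "recipient": "user_b", "content": "hey"},
        presence,
        lambda node, msg: delivered.append((node, msg["msg_id"])),
        lambda user, msg: notified.append(user))
# user_b is online on node-3, so the message is pushed, not queued for notification
```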

Step 6: Routing Messages Across Chat Servers

This is the hard part. User A is on Chat Server 1. User B is on Chat Server 3. How does Server 1 deliver to Server 3?

Option A — Kafka + per-server subscription: Each Chat Server subscribes to a Kafka topic named after itself (or uses a consumer group per server). The Delivery Service determines the target server and publishes to that server’s topic. The server consumes and pushes to the user’s WebSocket. Clean and scalable.

Option B — Redis pub/sub: Each Chat Server subscribes to Redis channels for each connected user: SUBSCRIBE user:{user_b_id}. When a message arrives for User B, the Delivery Service publishes to PUBLISH user:{user_b_id} {message}. Any server subscribed for that user receives it and pushes via WebSocket. Simpler than Kafka for this hop, but Redis pub/sub has no persistence (fire-and-forget).

In practice: Both approaches are valid in an interview. Mention the trade-off: Kafka gives durability and replay; Redis pub/sub is simpler and lower latency but requires the Delivery Service to handle the case where the message is lost (e.g., server crash between publish and delivery).
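Option B's per-user channel scheme can be illustrated with an in-memory stand-in for Redis pub/sub. Real Redis would carry SUBSCRIBE/PUBLISH over the network; this only shows the routing shape, including the fire-and-forget semantics:

```python
from collections import defaultdict

class MiniPubSub:
    """In-memory stand-in for Redis pub/sub channel routing."""
    def __init__(self):
        self.channels = defaultdict(list)   # channel -> list of subscriber callbacks

    def subscribe(self, channel, callback):
        self.channels[channel].append(callback)

    def publish(self, channel, message):
        # Fire-and-forget, like Redis pub/sub: no subscriber, message is dropped.
        for cb in self.channels.get(channel, []):
            cb(message)

# Chat Server 3 subscribes on behalf of its connected user:
bus = MiniPubSub()
inbox = []
bus.subscribe("user:user_b", inbox.append)      # SUBSCRIBE user:{user_b_id}

# The Delivery Service routes a message for user_b:
bus.publish("user:user_b", {"from": "user_a", "content": "hey"})
```

Publishing to a channel nobody subscribes to silently drops the message, which is exactly the durability gap the Delivery Service must cover in Option B.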

Step 7: Message Storage

Why Cassandra? Message storage is a write-heavy, time-series workload. Users query messages by conversation and time (“get the last 50 messages in conversation X”). Cassandra’s column-family model is designed exactly for this pattern.

-- Cassandra schema
CREATE TABLE messages (
    conversation_id UUID,
    message_id      TIMEUUID,      -- time-ordered v1 UUID (chronological within a partition)
    sender_id       UUID,
    content         TEXT,
    media_url       TEXT,
    status          TINYINT,       -- 0=sent, 1=delivered, 2=read
    PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

-- Query: get last 50 messages in a conversation
SELECT * FROM messages
WHERE conversation_id = ?
LIMIT 50;

conversation_id is the partition key — all messages in a conversation are stored together on the same Cassandra node, enabling efficient sequential reads. message_id is a TIMEUUID (time-ordered), so messages within a conversation are naturally sorted by time.
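TIMEUUIDs are version-1 UUIDs, which embed a 60-bit timestamp. Python's `uuid.uuid1()` generates them; note that sorting by the embedded timestamp (not the raw bytes, whose low-order time bits come first) reproduces Cassandra's chronological clustering order:

```python
import uuid

# Version-1 UUIDs embed a 60-bit timestamp (100ns ticks since 1582-10-15).
a = uuid.uuid1()
b = uuid.uuid1()

assert a.version == 1
assert a.time <= b.time   # generated later -> embedded timestamp is no earlier

# Within a conversation, sorting messages by this embedded timestamp
# mirrors the CLUSTERING ORDER BY (message_id) behavior.
msgs = [{"message_id": b, "text": "second"}, {"message_id": a, "text": "first"}]
msgs.sort(key=lambda m: m["message_id"].time)
print([m["text"] for m in msgs])   # → ['first', 'second']
```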

Step 8: Online Presence

-- Redis for presence
SET presence:{user_id} "online"  EX 30   # heartbeat every 15s from client
GET presence:{user_id}                   # check if user is online

Clients send a heartbeat every 15 seconds. The Chat Server refreshes the key’s TTL on each heartbeat. If no heartbeat for 30 seconds, the key expires and the user appears offline. “Last seen” is recorded when the key expires or the WebSocket closes.

At 500M DAU, storing 500M presence keys in Redis is ~40GB — easily handled by a Redis cluster.
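The heartbeat/TTL mechanism can be modeled with a tiny expiring key store. This is a stand-in for Redis `SET ... EX 30`, with the clock injected so expiry is easy to follow:

```python
class PresenceStore:
    """Stand-in for Redis TTL keys: SET presence:{user} "online" EX 30."""
    TTL = 30  # seconds

    def __init__(self):
        self.expiry = {}   # user_id -> unix time at which the key expires

    def heartbeat(self, user_id, now):
        # Each heartbeat refreshes the TTL, like re-running SET ... EX 30.
        self.expiry[user_id] = now + self.TTL

    def is_online(self, user_id, now):
        return self.expiry.get(user_id, 0) > now

store = PresenceStore()
store.heartbeat("user_a", now=100)
print(store.is_online("user_a", now=115))   # True  — within 30s of last heartbeat
print(store.is_online("user_a", now=131))   # False — key expired, user appears offline
```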

Step 9: Group Chat

Group chat is fanout-on-write at the message layer. When User A sends to a group with 100 members:

1. Insert one row into messages table (conversation_id = group_id)
2. Publish to Kafka
3. Fanout worker reads members of the group (from a group_members table)
4. For each online member: push via their Chat Server
5. For each offline member: send push notification
6. Status: "delivered" only when ALL members have received
           "read" per-member (tracked separately)

For large groups (1,000+ members), the fanout is expensive. WhatsApp limits groups to 1,024 members partly for this reason. Discord handles “server” channels differently — users pull from a channel’s message log rather than receiving individual push delivery.
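The fanout worker in steps 3 through 5 can be sketched as follows, with in-memory stand-ins for the group_members table and the presence store:

```python
def fanout(message, group_members, presence, push_to_server, send_push_notification):
    """Fan a group message out to every member except the sender."""
    for member in group_members[message["group_id"]]:    # step 3: read members
        if member == message["sender"]:
            continue
        node = presence.get(member)
        if node is not None:
            push_to_server(node, member, message)        # step 4: online -> push
        else:
            send_push_notification(member, message)      # step 5: offline -> notify

pushed, notified = [], []
fanout({"group_id": "g1", "sender": "a", "content": "hi all"},
       group_members={"g1": ["a", "b", "c"]},
       presence={"b": "node-2"},            # b online on node-2, c offline
       push_to_server=lambda node, user, msg: pushed.append(user),
       send_push_notification=lambda user, msg: notified.append(user))
```

The cost is linear in group size per message, which is why the member cap matters.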

Step 10: Media Messages

Never send media through the chat server pipeline. Binary data would overwhelm the message broker and DB.

1. Client requests a pre-signed S3 upload URL from Media Service
2. Client uploads media directly to S3 (bypasses chat servers entirely)
3. S3 returns a permanent URL
4. Client sends a normal text message with the media_url attached
5. Recipient downloads media from CDN (not from your servers)

This keeps the chat message pipeline lean (text only) and offloads binary storage to S3/CDN.
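A pre-signed URL is just a URL carrying an expiry and a server-side signature over it, so the client can upload without the server proxying bytes. Real S3 pre-signing uses AWS Signature v4 via the SDK; the sketch below shows the generic idea with a plain HMAC (the secret and paths are hypothetical):

```python
import hashlib
import hmac

SECRET = b"media-service-signing-key"   # hypothetical server-side secret

def presign(path: str, expires_at: int) -> str:
    """Generic signed-URL sketch — not AWS's actual Signature v4 scheme."""
    payload = f"{path}?expires={expires_at}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}&sig={sig}"

def verify(url: str) -> bool:
    # Recompute the signature over everything before "&sig=" and compare.
    payload, _, sig = url.rpartition("&sig=")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

url = presign("/uploads/img_123.jpg", expires_at=1_700_000_000)
print(verify(url))                                  # True  — untampered URL
print(verify(url.replace("img_123", "img_999")))    # False — path was altered
```

A production verifier would also reject URLs whose `expires` timestamp has passed; this sketch checks integrity only.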

Follow-up Questions

Q: How do you guarantee exactly-once delivery?
True exactly-once is very hard. The standard approach is at-least-once delivery + idempotent clients. Each message has a unique message_id. If the client receives a duplicate (network retry), it deduplicates by ID before displaying.
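Client-side deduplication is only a few lines: keep a set of seen message IDs and drop repeats before rendering. A minimal sketch:

```python
class MessageInbox:
    """At-least-once delivery made idempotent: dedupe by msg_id."""
    def __init__(self):
        self.seen = set()
        self.messages = []

    def receive(self, msg):
        if msg["msg_id"] in self.seen:
            return False            # duplicate from a network retry — drop it
        self.seen.add(msg["msg_id"])
        self.messages.append(msg)
        return True

inbox = MessageInbox()
inbox.receive({"msg_id": "m1", "content": "hey"})
inbox.receive({"msg_id": "m1", "content": "hey"})   # retried duplicate
print(len(inbox.messages))   # → 1
```

A real client bounds the seen set (e.g. a sliding window of recent IDs) rather than growing it forever.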

Q: How do you scale to 500M concurrent WebSocket connections?
Each Chat Server handles ~100K–200K connections (tuned with connection limits and event loops). You need 2,500–5,000 Chat Server instances. They’re behind a load balancer that routes WebSocket upgrades using IP hash or consistent hashing to minimize connection reshuffling during deploys.
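Consistent hashing for that routing can be sketched as a sorted ring of server hashes. This is a simplified stand-in (MD5 as the hash, a fixed virtual-node count), not a production load balancer:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: adding or removing a server only
    remaps the keys that fell on that server's arcs of the ring."""
    def __init__(self, servers, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{s}#{i}"), s)
            for s in servers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, client_ip: str) -> str:
        # First ring position clockwise from the client's hash (with wraparound).
        idx = bisect.bisect(self.keys, self._hash(client_ip)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["chat-1", "chat-2", "chat-3"])
# The same client IP always maps to the same server, so a reconnect
# lands back on the node that holds (or recently held) its session.
print(ring.server_for("203.0.113.7") == ring.server_for("203.0.113.7"))  # True
```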

Q: How do you handle message ordering?
TIMEUUID (or Snowflake ID) gives time-ordered IDs. Within a conversation, Cassandra returns messages in ID order. Network reordering is handled client-side — the client sorts by message_id before rendering.

Q: End-to-end encryption (like Signal/WhatsApp)?
Keys are generated on device and never sent to the server. The server stores ciphertext only — it cannot read message content. Key exchange uses the Double Ratchet Algorithm. This is separate from the architecture above; mentioning it shows awareness of the privacy dimension.

Summary

Chat systems require persistent WebSocket connections for real-time delivery. The hard problems are routing messages across Chat Server nodes (Kafka or Redis pub/sub), storing messages efficiently (Cassandra keyed by conversation + time), and handling offline delivery (push notifications). Online presence is cheaply managed with Redis TTL keys and heartbeats. Group chat is fanout-on-write at the delivery layer. Media bypasses the chat pipeline entirely via pre-signed S3 URLs.

Related System Design Topics

  • Message Queues — Kafka routes messages between chat servers and to the analytics/notification pipeline.
  • Caching Strategies — Redis pub/sub routes messages across server nodes; Redis TTL keys power the presence system.
  • SQL vs NoSQL — Cassandra is chosen for message storage because of its time-series write performance.
  • CAP Theorem — chat systems favor availability (AP) — it’s better to show a slightly stale “last seen” than refuse to load.
  • Load Balancing — WebSocket connections require IP-hash or consistent-hash routing so connections aren’t disrupted on deploys.
  • Design Twitter Feed — another real-time system; compare fanout strategies.
