System Design Interview: Design a Chat Application (WhatsApp)
Designing a real-time messaging application like WhatsApp is a comprehensive system design question covering WebSocket connections, message persistence, delivery receipts, end-to-end encryption, and online presence. It is asked at Meta, LinkedIn, Slack, and Discord.
Requirements Clarification
Functional Requirements
- One-to-one messaging with real-time delivery
- Group chats (up to 256 members)
- Message delivery receipts: sent, delivered, read (single/double/blue checkmarks)
- Online/last-seen status
- Media sharing: images, videos, documents
- Message history: access past messages on new devices
- Push notifications when the recipient is offline (app not connected)
Non-Functional Requirements
- Scale: 2B users, 100B messages/day
- Latency: message delivery <100ms for online users
- Availability: 99.99% (users expect messaging to always work)
- Durability: messages must not be lost
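A quick back-of-envelope calculation shows what these numbers imply for throughput. The sketch below uses the 2B users and 100B messages/day stated above; the 3x peak factor and the 100-byte average message size are illustrative assumptions, not figures from the requirements.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def capacity_estimate(messages_per_day, users, peak_factor=3.0, avg_message_bytes=100):
    """Rough throughput and storage numbers derived from daily volume.

    peak_factor and avg_message_bytes are assumed values for illustration.
    """
    avg_per_sec = messages_per_day / SECONDS_PER_DAY
    return {
        "avg_messages_per_sec": avg_per_sec,
        "peak_messages_per_sec": avg_per_sec * peak_factor,
        "messages_per_user_per_day": messages_per_day / users,
        "storage_bytes_per_day": messages_per_day * avg_message_bytes,
    }


estimate = capacity_estimate(messages_per_day=100e9, users=2e9)
for name, value in estimate.items():
    print(f"{name}: {value:,.0f}")
```

The average works out to roughly 1.16 million messages per second and 50 messages per user per day, which is why the write path below is built around a partitioned log and a write-optimized store.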
Core Architecture: WebSocket Connections
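Real-time messaging requires persistent connections. With plain HTTP request/response the server cannot push to the client, so the client would have to poll, which adds latency (up to one polling interval) and per-request overhead (headers, and possibly connection setup). WebSocket provides full-duplex communication over a single TCP connection: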
Client <--WebSocket--> Chat Server
- Client connects on app open
- Server can push messages instantly
- Heartbeat every 30s to detect disconnections
- Client reconnects on disconnect
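When many clients lose their connection at once (for example, when a chat server restarts), reconnecting immediately can overload the remaining servers. One common mitigation is exponential backoff with jitter. This is an assumption layered on top of the "client reconnects on disconnect" step, not something the design above specifies; the function name and the default values are hypothetical.

```python
import random


def reconnect_delays(max_attempts=6, base=1.0, cap=30.0, rng=None):
    """Yield wait times (seconds) before each reconnect attempt.

    Exponential backoff with full jitter: each delay is drawn uniformly
    from [0, min(cap, base * 2**attempt)].
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0, ceiling)


# Seeded generator only so that repeated runs produce the same schedule.
delays = list(reconnect_delays(rng=random.Random(42)))
print([round(d, 2) for d in delays])
```

The cap keeps a long outage from pushing the wait time beyond one heartbeat interval, and the jitter spreads reconnects from different clients over time.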
High-Level Architecture
Users
|
Load Balancer (WebSocket-aware, sticky sessions)
|
Chat Servers (stateful: maintain WebSocket connections)
|
Message Queue (Kafka)
|
Message Processor
- Fan-out to recipient's chat server
- Persist to message DB
- Trigger push notification if offline
|
Message DB (Cassandra)        Media Store (S3 + CDN)
|
Presence Service (Redis)      Push Notification Service (FCM, APNs)
Message Delivery Flow
Alice sends message to Bob:
1. Alice's app sends message over WebSocket to Chat Server A
2. Chat Server A publishes to Kafka: {from:alice, to:bob, content, msg_id, ts}
3. Message Processor consumes from Kafka:
a. Persist message to Cassandra
b. Look up which Chat Server Bob is connected to (via Redis hash)
c. Forward message to Chat Server B
4. Chat Server B pushes message to Bob over WebSocket
5. Bob's client sends ACK
6. Chat Server B updates delivery status: "delivered"
7. Alice receives delivered receipt
If Bob is offline:
Step 3b: Bob is not connected
Step 3c: Send push notification via FCM/APNs
Bob opens app -> WebSocket connects -> fetches unread messages from Cassandra
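The online and offline paths above can be sketched as a small in-memory simulation. Plain dictionaries and lists stand in for Cassandra, Redis, Kafka, and FCM/APNs; the class and method names (MessageProcessor, connect, process, ack) are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


class MessageProcessor:
    """In-memory stand-in for the Kafka consumer described above."""

    def __init__(self):
        self.store = defaultdict(list)  # stands in for Cassandra: recipient -> messages
        self.registry = {}              # stands in for Redis: user -> chat server id
        self.forwarded = []             # (server_id, msg_id) handed to a chat server
        self.pushes = []                # (user, msg_id) sent via FCM/APNs

    def connect(self, user, server_id):
        self.registry[user] = server_id
        # On (re)connect, return messages not yet delivered to this user.
        return [m for m in self.store[user] if m["status"] == "SENT"]

    def disconnect(self, user):
        self.registry.pop(user, None)

    def process(self, msg):
        # 3a. Persist first so the message survives a failed delivery.
        self.store[msg["to"]].append(dict(msg, status="SENT"))
        # 3b. Look up the recipient's chat server.
        server_id = self.registry.get(msg["to"])
        if server_id is None:
            # Offline: fall back to a push notification.
            self.pushes.append((msg["to"], msg["msg_id"]))
            return "pushed"
        # 3c. Forward to the recipient's chat server.
        self.forwarded.append((server_id, msg["msg_id"]))
        return "forwarded"

    def ack(self, user, msg_id):
        # Steps 5-6: the recipient's ACK marks the message as delivered.
        for record in self.store[user]:
            if record["msg_id"] == msg_id:
                record["status"] = "DELIVERED"


mp = MessageProcessor()
mp.connect("bob", "chat-server-B")
r1 = mp.process({"from": "alice", "to": "bob", "content": "hi", "msg_id": 1, "ts": 1716000000})
mp.ack("bob", 1)
mp.disconnect("bob")
r2 = mp.process({"from": "alice", "to": "bob", "content": "there?", "msg_id": 2, "ts": 1716000060})
unread = mp.connect("bob", "chat-server-C")
print(r1, r2, [m["msg_id"] for m in unread])  # forwarded pushed [2]
```

Persisting before the registry lookup means an offline or failed delivery never loses the message; Bob simply picks it up on his next connect.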
Message Storage (Cassandra)
# Schema optimized for conversation queries
messages_by_conversation:
partition_key: (conversation_id)
clustering_key: (message_id DESC) -- newest first
columns: sender_id, content, content_type, status, created_at
# conversation_id = min(user_a, user_b) + "_" + max(user_a, user_b) for 1:1
# conversation_id = group_id for group chats
# Query: last 50 messages in a conversation
SELECT * FROM messages_by_conversation
WHERE conversation_id = ?
ORDER BY message_id DESC
LIMIT 50;
Message IDs are time-ordered (Snowflake or ULID), so sorting by message_id gives chronological order and supports efficient range scans. Cassandra fits this workload well: high write throughput, a time-series access pattern, and straightforward partitioning by conversation_id.
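A minimal sketch of the two identifiers used in the schema follows. The conversation_id function mirrors the min/max rule in the schema comment. The ID generator is modeled on Snowflake-style IDs; its bit layout (41 bits of milliseconds, 10 bits of worker ID, 12 bits of sequence), the custom epoch, and the class name are assumptions for illustration.

```python
import threading
import time


def conversation_id(user_a, user_b):
    """Order-independent ID for a 1:1 conversation."""
    return min(user_a, user_b) + "_" + max(user_a, user_b)


class SnowflakeLikeIds:
    """64-bit IDs: milliseconds since epoch | 10-bit worker | 12-bit sequence."""

    EPOCH_MS = 1_600_000_000_000  # assumed custom epoch

    def __init__(self, worker_id):
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    @staticmethod
    def _now_ms():
        return int(time.time() * 1000)

    def next_id(self):
        with self.lock:
            # Clamp so a clock that steps backwards cannot reorder IDs.
            now = max(self._now_ms(), self.last_ms)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # 4096 IDs used in this millisecond: wait for the next one.
                    while now <= self.last_ms:
                        now = self._now_ms()
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence


gen = SnowflakeLikeIds(worker_id=5)
ids = [gen.next_id() for _ in range(10000)]
print(conversation_id("bob", "alice"))  # alice_bob
```

Because the timestamp occupies the high bits, IDs from one worker sort in creation order, which is what the DESC clustering key relies on. IDs from different workers within the same millisecond are ordered by worker ID, not by true arrival order.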
Online Presence
# Redis hash: user_id -> {server_id, last_heartbeat}
HSET presence:user123 server_id "chat-server-5" last_heartbeat 1716000000
# Heartbeat every 30s (client pings server); each heartbeat resets the TTL
# TTL: 60s (auto-expire if no heartbeat = offline)
EXPIRE presence:user123 60
# On disconnect: remove the key from Redis
DEL presence:user123
# To find Bob's server:
HGET presence:bob server_id # returns "chat-server-5" or nil (offline)
# Last seen: store timestamp on disconnect
SET last_seen:bob {timestamp} EX 2592000 # 30 day TTL
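The TTL behaviour is easier to see in a small in-memory model. This sketch imitates the Redis keys above with dictionaries and an injectable clock so that expiry can be demonstrated without waiting. The PresenceRegistry class is hypothetical, and recording last seen when an entry expires (rather than only on an explicit disconnect) is an assumption.

```python
import time


class PresenceRegistry:
    """In-memory model of the presence keys with a 60s heartbeat TTL."""

    HEARTBEAT_TTL = 60  # seconds: two missed 30s heartbeats

    def __init__(self, clock=time.time):
        self.clock = clock
        self.online = {}     # user_id -> {"server_id", "last_heartbeat"}
        self.last_seen = {}  # user_id -> timestamp

    def heartbeat(self, user_id, server_id):
        # Each heartbeat rewrites the entry, which effectively resets the TTL.
        self.online[user_id] = {"server_id": server_id, "last_heartbeat": self.clock()}

    def server_for(self, user_id):
        """Return the user's chat server, or None if offline or expired."""
        entry = self.online.get(user_id)
        if entry is None:
            return None
        if self.clock() - entry["last_heartbeat"] > self.HEARTBEAT_TTL:
            # Treat like a Redis key whose TTL ran out.
            self.last_seen[user_id] = entry["last_heartbeat"]
            del self.online[user_id]
            return None
        return entry["server_id"]

    def disconnect(self, user_id):
        if self.online.pop(user_id, None) is not None:
            self.last_seen[user_id] = self.clock()


now = [1716000000]  # fake clock so the example is deterministic
presence = PresenceRegistry(clock=lambda: now[0])
presence.heartbeat("bob", "chat-server-5")
first_lookup = presence.server_for("bob")
now[0] += 61  # more than one TTL passes with no heartbeat
second_lookup = presence.server_for("bob")
print(first_lookup, second_lookup, presence.last_seen["bob"])
# chat-server-5 None 1716000000
```

A TTL of twice the heartbeat interval tolerates one lost heartbeat before a user is shown as offline.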
Group Messaging Fan-out
Group with 256 members: when Alice sends a message, it must be delivered to 255 other members. Fan-out strategies:
- Fan-out on send: message processor looks up all group members, sends to each member’s chat server. 255 server lookups + 255 deliveries per message for a full group. Acceptable for groups up to roughly 1,000 members.
- Fan-out on receive: store one message copy, each client pulls on reconnect. More efficient for large groups but higher read latency.
- Hybrid: fan-out on send for small groups, fan-out on receive for large ones. WhatsApp groups are small (256 max), so fan-out on send alone is used.
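Fan-out on send can be sketched as a single pass over the member list. Grouping recipients by chat server, so each server receives one batched forward instead of one per member, is an optimization assumed here and not part of the description above; the function name and the example registry are hypothetical.

```python
from collections import defaultdict


def fan_out_on_send(sender, members, registry):
    """Split group recipients into per-server batches and offline users.

    registry maps user_id -> chat server id for connected users only.
    Returns (batches, offline): batches maps server_id -> recipients,
    offline lists members who need a push notification instead.
    """
    batches = defaultdict(list)
    offline = []
    for member in members:
        if member == sender:
            continue  # the sender does not receive their own message
        server_id = registry.get(member)
        if server_id is None:
            offline.append(member)
        else:
            batches[server_id].append(member)
    return dict(batches), offline


# Example: 256 members; every fourth member is offline.
members = [f"user{i}" for i in range(256)]
registry = {m: f"chat-server-{i % 8}" for i, m in enumerate(members) if i % 4 != 0}
batches, offline = fan_out_on_send("user1", members, registry)
online_count = sum(len(recipients) for recipients in batches.values())
print(len(batches), online_count, len(offline))  # 6 191 64
```

The 191 online recipients plus 64 offline ones add up to the 255 other members, and the work per message stays linear in group size.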
Delivery Receipts
Message status state machine:
SENT (stored in Cassandra) -> DELIVERED (recipient's device received) -> READ (recipient opened chat)
# Single check: sent to server
# Double check: delivered to device
# Blue check: read by recipient
Implementation:
1. Delivered: recipient's device sends delivery ACK when message arrives
Server updates message status to DELIVERED
2. Read: recipient's client sends read receipt when user opens the conversation
Server updates status to READ and notifies sender via WebSocket
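Receipts can arrive late, duplicated, or out of order, so the status update should be monotonic: a message never moves backwards in the state machine. A minimal sketch follows, assuming a read receipt may advance a message straight from SENT to READ; the names Status, CHECKMARKS, and advance are hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    SENT = 1
    DELIVERED = 2
    READ = 3


CHECKMARKS = {
    Status.SENT: "single check",
    Status.DELIVERED: "double check",
    Status.READ: "blue check",
}


def advance(current, incoming):
    """Return the new status, never moving backwards.

    A DELIVERED ACK that arrives after READ leaves the status at READ.
    """
    return max(current, incoming)


status = Status.SENT
status = advance(status, Status.DELIVERED)
status = advance(status, Status.READ)
status = advance(status, Status.DELIVERED)  # late duplicate ACK is ignored
print(status.name, CHECKMARKS[status])  # READ blue check
```

Making the update idempotent this way also lets the server retry receipt writes safely.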
Media Sharing
- Client uploads media directly to S3 via pre-signed URL (bypass chat servers)
- Client sends message with media URL (and thumbnail) via WebSocket
- Recipient downloads media from CDN URL
- End-to-end encrypted: client encrypts media before upload; key shared only with recipients
- Expiry: media stored for 30-90 days; permanent for saved media
Interview Tips
- Lead with WebSocket – explain why HTTP polling is insufficient
- Explain the connection registry (Redis hash: user_id to server_id)
- Describe Cassandra schema – partition by conversation, cluster by message_id DESC
- Cover delivery receipts state machine (sent/delivered/read)
- Discuss group fan-out: at WhatsApp scale (256 max members) fan-out on send works
- Push notifications for offline users via FCM/APNs