Requirements
- One-on-one messaging and group chats (up to 500 members)
- Real-time message delivery (< 100ms), offline message storage, read receipts
- Message ordering guaranteed per conversation
- 10M DAU, 100M messages/day
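The scale figures above translate into a modest average write rate. The following back-of-envelope sketch works it out; the peak factor and average message size are assumptions chosen for illustration, not numbers from the requirements.

```python
# Back-of-envelope load estimate from the stated requirements.
DAU = 10_000_000
MESSAGES_PER_DAY = 100_000_000
SECONDS_PER_DAY = 24 * 60 * 60

# Assumptions (not part of the requirements): traffic peaks at 3x the
# daily average, and a stored message averages 200 bytes.
PEAK_FACTOR = 3
AVG_MESSAGE_BYTES = 200

avg_writes_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY
peak_writes_per_sec = avg_writes_per_sec * PEAK_FACTOR
messages_per_user_per_day = MESSAGES_PER_DAY / DAU
storage_per_day_gb = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES / 1e9

print(f"average writes/s:      {avg_writes_per_sec:,.0f}")
print(f"assumed peak writes/s: {peak_writes_per_sec:,.0f}")
print(f"messages/user/day:     {messages_per_user_per_day:.0f}")
print(f"storage/day (GB):      {storage_per_day_gb:.1f}")
```

Under these assumptions the average is on the order of a thousand writes per second, which suggests that connection count and fan-out are likely to be harder problems than raw write throughput.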
Data Model
Conversation(conv_id, type ENUM(DM,GROUP), created_at, last_message_id)
Participant(conv_id, user_id, joined_at, last_read_message_id)
Message(message_id BIGINT, conv_id, sender_id, body TEXT, type ENUM(TEXT,IMAGE,FILE),
created_at, client_msg_id UUID) -- client_msg_id for deduplication
MessageMedia(media_id, message_id, url, mime_type, size_bytes)
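message_id is a monotonically increasing sequence per conversation (not global), which is what provides per-conversation ordering. A per-conv counter in Redis (INCR on a key per conv_id) gives dense, strictly increasing IDs. A Snowflake-style ID is an alternative that needs no shared counter, but it is globally unique and only roughly time-ordered: its ordering is as good as the clocks of the servers that generate it, so strict ordering within a conversation still requires routing each conversation through a single sequencer.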
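A per-conversation counter can be sketched as follows. This is an in-memory stand-in, assuming the production version would be an atomic increment on a per-conversation Redis key; the class name `ConversationSequencer` and the conversation IDs are hypothetical.

```python
import threading
from collections import defaultdict


class ConversationSequencer:
    """In-memory stand-in for a per-conversation counter.

    Assumption: in production this would be an atomic increment in a shared
    store (for example Redis INCR on one key per conv_id).
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)

    def next_id(self, conv_id: str) -> int:
        # The lock makes increment-and-read atomic, as INCR would be.
        with self._lock:
            self._counters[conv_id] += 1
            return self._counters[conv_id]


seq = ConversationSequencer()
ids_a = [seq.next_id("conv-a") for _ in range(3)]
ids_b = [seq.next_id("conv-b") for _ in range(2)]
print(ids_a, ids_b)
```

Each conversation gets its own independent sequence, so no coordination is needed across conversations.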
Real-Time Delivery with WebSockets
Each client maintains a persistent WebSocket connection to a Chat Server. When Alice sends a message to Bob:
- Alice’s client sends the message over WebSocket to her Chat Server
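- The Chat Server assigns a message_id and stores the message in the DB
- It publishes the message to a Pub/Sub channel for the conversation (a Redis Pub/Sub channel per conversation, or a Kafka topic partitioned by conv_id, since one Kafka topic per conversation does not scale to millions of conversations)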
- Bob’s Chat Server subscribes to that channel, receives the message, pushes to Bob’s WebSocket
- If Bob is offline: Chat Server stores in a pending_messages queue; delivers on reconnect
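The online path of this flow can be sketched with in-memory stand-ins. `Broker` plays the role of the Pub/Sub layer and a plain list plays the role of the message DB; both classes and all names are hypothetical, and offline queuing is left out to keep the sketch short.

```python
from collections import defaultdict


class Broker:
    """Stand-in for the Pub/Sub layer: channel name -> subscriber callbacks."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self.subscribers[channel]:
            callback(message)


class ChatServer:
    def __init__(self, name, broker, store):
        self.name = name
        self.broker = broker
        self.store = store   # shared list standing in for the message DB
        self.sockets = {}    # user_id -> frames "pushed" over that user's WebSocket

    def connect(self, user_id, conv_id):
        self.sockets[user_id] = []
        self.broker.subscribe(conv_id, lambda msg, u=user_id: self._push(u, msg))

    def _push(self, user_id, msg):
        # Do not echo a message back to its sender.
        if user_id in self.sockets and msg["sender_id"] != user_id:
            self.sockets[user_id].append(msg)

    def handle_send(self, sender_id, conv_id, body):
        # Simplification: a single global counter instead of a per-conv one.
        msg = {"message_id": len(self.store) + 1, "conv_id": conv_id,
               "sender_id": sender_id, "body": body}
        self.store.append(msg)              # persist first
        self.broker.publish(conv_id, msg)   # then fan out to subscribers
        return msg["message_id"]


broker, store = Broker(), []
server_a = ChatServer("A", broker, store)
server_b = ChatServer("B", broker, store)
server_a.connect("alice", "conv-1")
server_b.connect("bob", "conv-1")
assigned_id = server_a.handle_send("alice", "conv-1", "hi Bob")
print(assigned_id, server_b.sockets["bob"])
```

Persisting before publishing means a message that reaches a recipient is always recoverable from storage, which is what the reconnect path relies on.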
Chat Server Architecture
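- Stateful servers: each server holds WebSocket connections for a set of users. The connection registry lives in Redis as one key per user: SET connections:{user_id} {server_id} with a TTL, refreshed on each heartbeat (30s). A per-user key matches the lookup below and lets each entry expire on its own; fields inside a single hash traditionally share the TTL of the whole key.
- Routing: to find Bob’s Chat Server, look up connections:{bob_id}. Forward the message to that server via HTTP or an internal message bus.
- Group chats: fan out to all online participants. For a 500-member group, that means at most 500 Chat Servers, and fewer when members share a server. Offline members get the message queued.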
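The registry lookup and group fan-out can be sketched as below. The registry is an in-memory stand-in for per-user Redis keys with a TTL; the 60-second TTL, the injectable clock, and the names `ConnectionRegistry` and `plan_fanout` are assumptions made for illustration.

```python
import time
from collections import defaultdict

# Assumption: the TTL is twice the 30s heartbeat, so one missed beat is tolerated.
HEARTBEAT_TTL_SECONDS = 60


class ConnectionRegistry:
    """In-memory stand-in for Redis keys like connections:{user_id} with a TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # user_id -> (server_id, expires_at)

    def heartbeat(self, user_id, server_id):
        # Writing the entry again refreshes its expiry.
        self._entries[user_id] = (server_id, self._clock() + HEARTBEAT_TTL_SECONDS)

    def lookup(self, user_id):
        entry = self._entries.get(user_id)
        if entry is None or entry[1] <= self._clock():
            self._entries.pop(user_id, None)  # expired entries count as offline
            return None
        return entry[0]


def plan_fanout(registry, recipients):
    """Group online recipients by Chat Server; return (per_server, offline)."""
    per_server, offline = defaultdict(list), []
    for user_id in recipients:
        server_id = registry.lookup(user_id)
        if server_id is None:
            offline.append(user_id)
        else:
            per_server[server_id].append(user_id)
    return dict(per_server), offline


now = [0.0]  # fake clock so the example is deterministic
registry = ConnectionRegistry(clock=lambda: now[0])
registry.heartbeat("bob", "chat-1")
registry.heartbeat("carol", "chat-1")
registry.heartbeat("dave", "chat-2")
now[0] = 45.0
registry.heartbeat("bob", "chat-1")
registry.heartbeat("carol", "chat-1")  # dave misses this heartbeat
now[0] = 70.0
per_server, offline = plan_fanout(registry, ["bob", "carol", "dave", "erin"])
print(per_server, offline)
```

Grouping recipients by server first means a group message is forwarded once per Chat Server rather than once per member.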
Message Storage and Retrieval
The message workload is write-heavy, and most messages are read once (users scroll back only occasionally). Store messages in Cassandra (wide-column), partitioned by conv_id:
CREATE TABLE messages (
conv_id UUID,
message_id BIGINT,
sender_id UUID,
body TEXT,
created_at TIMESTAMP,
PRIMARY KEY (conv_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
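Partition key = conv_id keeps all messages for a conversation in the same partition, so a conversation's history is read from one replica set. Clustering by message_id DESC enables efficient latest-first pagination. MySQL works at smaller scale; Cassandra spreads millions of conversations evenly across nodes. A single very active conversation can still grow into a large or hot partition, which is commonly handled by adding a time bucket to the partition key.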
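Latest-first pagination over one partition can be illustrated with a small cursor sketch. The list stands in for a single conv_id partition already sorted by message_id descending, and `fetch_page` is a hypothetical helper mirroring a query of the form "message_id < cursor LIMIT n".

```python
def fetch_page(partition, before_id=None, limit=3):
    """Return (page, next_cursor) for a partition sorted by message_id DESC.

    Mirrors a range query on the clustering column: rows with
    message_id < before_id, newest first, capped at `limit`.
    """
    rows = [m for m in partition if before_id is None or m < before_id]
    page = rows[:limit]
    # A full page means there may be more; a short page means history is exhausted.
    next_cursor = page[-1] if len(page) == limit else None
    return page, next_cursor


partition = [7, 6, 5, 4, 3, 2, 1]  # message_ids, newest first

page1, cursor1 = fetch_page(partition)
page2, cursor2 = fetch_page(partition, before_id=cursor1)
page3, cursor3 = fetch_page(partition, before_id=cursor2)
print(page1, page2, page3)
```

Using the last message_id of each page as the cursor keeps pagination stable even when new messages arrive while the user scrolls back.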
Offline Message Delivery
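When a user reconnects, the client sends its last_seen_message_id for each conversation. The server queries: SELECT * FROM messages WHERE conv_id=X AND message_id > last_seen ORDER BY message_id ASC LIMIT 100. The explicit ascending order matters because the table clusters by message_id DESC: without it, the query returns the newest 100 messages and skips older missed ones whenever more than 100 are pending. The client repeats the query with the updated last_seen until fewer than 100 rows come back, which delivers missed messages in order. For push notifications (APNs/FCM) while the app is backgrounded, a separate notification service consumes new-message events from Kafka and sends push payloads.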
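The catch-up loop on reconnect can be sketched as follows. The list of dicts stands in for one conversation's rows, and `fetch_missed` and `sync_conversation` are hypothetical helpers; the important details are that each batch is read oldest-first and that the cursor advances to the last delivered message_id.

```python
def fetch_missed(messages, last_seen_id, limit=100):
    """Return up to `limit` messages newer than last_seen_id, oldest first."""
    newer = sorted(
        (m for m in messages if m["message_id"] > last_seen_id),
        key=lambda m: m["message_id"],
    )
    return newer[:limit]


def sync_conversation(messages, last_seen_id, page_size=100):
    """Page forward from last_seen_id until the client has caught up."""
    delivered = []
    while True:
        batch = fetch_missed(messages, last_seen_id, page_size)
        if not batch:
            break
        delivered.extend(batch)
        last_seen_id = batch[-1]["message_id"]  # advance the cursor
    return delivered


# A conversation with 250 messages; the client last saw message 20.
messages = [{"message_id": i, "body": f"m{i}"} for i in range(1, 251)]
missed = sync_conversation(messages, last_seen_id=20)
print(len(missed), missed[0]["message_id"], missed[-1]["message_id"])
```

Reading oldest-first is what prevents gaps: a newest-first batch of 100 would leave messages 21 through 150 undelivered in this example.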
Read Receipts
Store last_read_message_id per (user, conversation) in the Participant table. When Alice opens a conversation: UPDATE Participant SET last_read_message_id=latest_id WHERE conv_id=X AND user_id=alice. To compute the unread count: SELECT COUNT(*) FROM messages WHERE conv_id=X AND message_id > alice.last_read_message_id. This count reads every unread row, so cache unread counts in Redis per user; if message_id is a dense per-conversation counter, the count reduces to last_message_id minus last_read_message_id. Fan out read receipt events to the other participants over WebSocket so they see “✓✓”.
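A minimal sketch of the read-state bookkeeping follows. It assumes message_id is a dense per-conversation counter (1, 2, 3, ...), so the unread count is a subtraction rather than a COUNT(*) query; with sparse IDs that shortcut would not hold. The class `ReadState` and the event shape are hypothetical.

```python
class ReadState:
    """Tracks the newest message per conversation and each user's read position."""

    def __init__(self):
        self.last_message_id = {}  # conv_id -> newest message_id
        self.last_read = {}        # (user_id, conv_id) -> last_read_message_id

    def on_new_message(self, conv_id, message_id):
        current = self.last_message_id.get(conv_id, 0)
        self.last_message_id[conv_id] = max(current, message_id)

    def mark_read(self, user_id, conv_id, message_id):
        key = (user_id, conv_id)
        # Only move forward: a late or duplicate receipt must not undo progress.
        self.last_read[key] = max(self.last_read.get(key, 0), message_id)
        # Event to fan out to the other participants over WebSocket.
        return {"type": "read_receipt", "conv_id": conv_id, "user_id": user_id,
                "last_read_message_id": self.last_read[key]}

    def unread_count(self, user_id, conv_id):
        # Valid only under the dense-counter assumption stated above.
        newest = self.last_message_id.get(conv_id, 0)
        return newest - self.last_read.get((user_id, conv_id), 0)


state = ReadState()
for message_id in range(1, 6):
    state.on_new_message("conv-1", message_id)

unread_before = state.unread_count("alice", "conv-1")
event = state.mark_read("alice", "conv-1", 3)
stale = state.mark_read("alice", "conv-1", 2)  # out-of-order receipt
unread_after = state.unread_count("alice", "conv-1")
print(unread_before, unread_after, event)
```

Making the read position monotonic keeps the unread count from jumping upward when receipts arrive out of order from multiple devices.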
Message Deduplication
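Clients generate a UUID client_msg_id before sending. If the send request times out, the client retries with the same client_msg_id. The server performs a conditional insert. In a SQL store: INSERT INTO messages … WHERE NOT EXISTS (SELECT 1 FROM messages WHERE client_msg_id=X), ideally backed by a unique index on client_msg_id within the conversation so that concurrent retries cannot both succeed. Cassandra has no subqueries, so there the same check needs a lookup keyed by client_msg_id, for example a small dedup table written with INSERT … IF NOT EXISTS. Either way the insert is idempotent: a duplicate message is silently dropped, and the server returns the message_id assigned by the first insert.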
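The idempotent insert can be sketched with an in-memory store. The dictionary keyed by (conv_id, client_msg_id) stands in for whatever uniqueness mechanism the real store provides; the class `MessageStore` and its return shape are hypothetical.

```python
import uuid


class MessageStore:
    """In-memory stand-in for a message table with a uniqueness check."""

    def __init__(self):
        self._by_client_id = {}  # (conv_id, client_msg_id) -> message_id
        self._last_id = {}       # conv_id -> last assigned message_id
        self.rows = []

    def insert(self, conv_id, sender_id, body, client_msg_id):
        """Return (message_id, created). A retry returns the original id."""
        key = (conv_id, client_msg_id)
        existing = self._by_client_id.get(key)
        if existing is not None:
            return existing, False  # duplicate: drop it, echo the first id
        message_id = self._last_id.get(conv_id, 0) + 1
        self._last_id[conv_id] = message_id
        self._by_client_id[key] = message_id
        self.rows.append({"conv_id": conv_id, "message_id": message_id,
                          "sender_id": sender_id, "body": body})
        return message_id, True


store = MessageStore()
client_msg_id = str(uuid.uuid4())  # generated once by the client, reused on retry

first = store.insert("conv-1", "alice", "hello", client_msg_id)
retry = store.insert("conv-1", "alice", "hello", client_msg_id)
other = store.insert("conv-1", "alice", "hello again", str(uuid.uuid4()))
print(first, retry, other, len(store.rows))
```

Returning the original message_id on a retry lets the client treat the timeout and the retry as one successful send.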
Key Design Decisions
- WebSocket + Pub/Sub for real-time; push notifications for background delivery
- Connection registry in Redis to route messages across Chat Servers
- Cassandra partitioned by conv_id for write-scalable message storage
- client_msg_id deduplication prevents duplicates on network retry
- Per-conversation message_id sequence for ordering without global coordination