Question 1

How does a chat system deliver messages in real time using WebSockets?

Accepted Answer

WebSocket is a persistent, full-duplex protocol that runs over a single TCP connection between client and server. Unlike plain HTTP (request/response), it lets the server push messages to the client at any time. On connect: the client sends an HTTP Upgrade request and, once the server accepts it, the connection stays open for the lifetime of the session. On send: the client writes the message payload to the WebSocket connection; the server receives it, persists it to the message store, and routes it to the recipient. On receive: the WebSocket server that holds the recipient's connection writes the message to that socket, so the client receives it without polling. Scale challenge: WebSocket servers are stateful, because each connection is pinned to one server instance. To route messages across instances (User A on Server 1, User B on Server 2), use Redis Pub/Sub: Server 1 publishes the message to User B's channel, and Server 2, which subscribed to that channel when User B connected, delivers it. Redis Pub/Sub is fire-and-forget and does not retain messages for absent subscribers, which is why the message is persisted before it is published.
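The cross-server routing step can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: `InMemoryPubSub` stands in for Redis Pub/Sub, `ChatServer` for a WebSocket server instance, and the `user:{user_id}` channel naming is an assumption made for the example. Persistence is left out.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryPubSub:
    """Hypothetical stand-in for Redis Pub/Sub: channel -> subscriber callbacks."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, callback: Callable[[str], None]) -> None:
        self._subscribers[channel].append(callback)

    def publish(self, channel: str, payload: str) -> int:
        # Deliver to current subscribers only; nothing is stored for later.
        callbacks = list(self._subscribers.get(channel, []))
        for callback in callbacks:
            callback(payload)
        return len(callbacks)


class ChatServer:
    """Hypothetical WebSocket server instance holding a set of connections."""

    def __init__(self, name: str, bus: InMemoryPubSub) -> None:
        self.name = name
        self.bus = bus
        # user_id -> payloads written to that user's socket (a list stands in for the socket)
        self.sockets: Dict[str, List[str]] = {}

    def connect(self, user_id: str) -> None:
        # On connect, subscribe this server to the user's channel.
        self.sockets[user_id] = []
        self.bus.subscribe(
            f"user:{user_id}",
            lambda payload: self.sockets[user_id].append(payload),
        )

    def send(self, recipient_id: str, payload: str) -> bool:
        # A real server would persist the message before publishing.
        # Returns False when no server currently holds the recipient's connection.
        return self.bus.publish(f"user:{recipient_id}", payload) > 0


bus = InMemoryPubSub()
server_1 = ChatServer("server-1", bus)
server_2 = ChatServer("server-2", bus)
server_1.connect("alice")
server_2.connect("bob")
server_1.send("bob", "hello from alice")  # delivered through server_2's socket for bob
```

A `False` return from `send` is the signal that the recipient has no live connection anywhere, which is the point where an offline delivery path would take over.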

Question 2

How do you store and retrieve chat messages efficiently?

Accepted Answer

Chat messages are write-heavy (billions per day) and are read with a time-series access pattern: "give me the last 50 messages in this chat, then load more on scroll." Cassandra fits this well: partition key = chat_id (all messages for a chat live in one partition), clustering key = (created_at DESC, message_id DESC), so recent messages come first and message_id breaks ties between messages with the same timestamp. Read pattern: SELECT * FROM messages WHERE chat_id = X ORDER BY created_at DESC LIMIT 50. Pagination: cursor = (last_created_at, last_message_id). CQL does not support OR in a WHERE clause, so express the cursor as a tuple comparison on the clustering columns: WHERE chat_id = X AND (created_at, message_id) < (cursor_ts, cursor_mid). Write throughput: Cassandra handles hundreds of thousands of writes/second with horizontal scaling. Retention: keep messages indefinitely for users who need history, or delete them after N months where a retention policy (for example, one adopted for GDPR compliance) requires it (soft delete: set status=DELETED and purge the content). Media (images, files): store the object in S3 and keep only its S3 URL in the message.
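The cursor logic is easier to check in isolation. The sketch below applies keyset pagination on (created_at, message_id) to an in-memory list; `Message` and `fetch_page` are hypothetical names for this example, and a real implementation would push the comparison down to the database instead of sorting in application code.

```python
from typing import List, NamedTuple, Optional, Tuple


class Message(NamedTuple):
    created_at: int  # epoch milliseconds
    message_id: int
    body: str


Cursor = Tuple[int, int]  # (created_at, message_id) of the last row already returned


def fetch_page(
    messages: List[Message], limit: int, cursor: Optional[Cursor] = None
) -> Tuple[List[Message], Optional[Cursor]]:
    """Return up to `limit` messages, newest first, strictly older than `cursor`."""
    newest_first = sorted(
        messages, key=lambda m: (m.created_at, m.message_id), reverse=True
    )
    if cursor is not None:
        # Tuple comparison: older timestamp, or same timestamp and smaller message_id.
        newest_first = [
            m for m in newest_first if (m.created_at, m.message_id) < cursor
        ]
    page = newest_first[:limit]
    # A full page may have more rows behind it; a short page means the end was reached.
    next_cursor = (
        (page[-1].created_at, page[-1].message_id) if len(page) == limit else None
    )
    return page, next_cursor


history = [
    Message(100, 1, "first"),
    Message(100, 2, "same millisecond as first"),
    Message(200, 3, "later"),
    Message(300, 4, "latest"),
]
page_1, cursor_1 = fetch_page(history, limit=2)
page_2, cursor_2 = fetch_page(history, limit=2, cursor=cursor_1)
```

Messages 1 and 2 share a timestamp, which is the case a timestamp-only cursor would mishandle: the tiebreaker on message_id keeps both rows and returns each exactly once.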

Question 3

How do read receipts work in a chat system at scale?

Accepted Answer

Read receipts track two states per message per recipient: delivered (the message reached the device) and read (the user opened the conversation and the message was on screen). Delivered: when the WebSocket server pushes the message to the client, the client replies with a DELIVERED ACK and the server updates MessageReceipt(message_id, user_id, status=DELIVERED). Read: when the user opens the chat and messages are visible on screen, the client sends a single READ ACK carrying the highest message_id seen, rather than one ACK per message. The server updates last_read_message_id in ChatMember, and every message up to that ID is implicitly read; this assumes message IDs increase monotonically within a chat. Scale concern: in a 500-person group chat, every message generates up to 499 delivery receipts (one per recipient other than the sender). At 10B messages/day with an average chat size of 10, that is on the order of 10B * 10 = 100B receipt events/day. Mitigations: batch on the client (one READ event per chat open, not per message), buffer receipt events in Kafka and apply them to the database in batches, and cache last_read_message_id in Redis for fast unread-count queries.
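A rough sketch of the watermark idea, assuming message IDs increase within a chat. The function names and the dictionary standing in for the ChatMember table are hypothetical; the `events` list plays the role of a batch drained from a buffer such as a Kafka topic.

```python
from typing import Dict, Iterable, Tuple

ReadEvent = Tuple[str, str, int]  # (chat_id, user_id, highest message_id seen)
Watermarks = Dict[Tuple[str, str], int]  # (chat_id, user_id) -> last_read_message_id


def collapse_read_events(events: Iterable[ReadEvent]) -> Watermarks:
    """Reduce a buffered batch to one watermark per (chat_id, user_id)."""
    latest: Watermarks = {}
    for chat_id, user_id, message_id in events:
        key = (chat_id, user_id)
        if message_id > latest.get(key, -1):
            latest[key] = message_id
    return latest


def apply_watermarks(store: Watermarks, batch: Watermarks) -> int:
    """Apply a collapsed batch; watermarks only move forward. Returns rows changed."""
    changed = 0
    for key, message_id in batch.items():
        if message_id > store.get(key, -1):
            store[key] = message_id
            changed += 1
    return changed


def is_read(store: Watermarks, chat_id: str, user_id: str, message_id: int) -> bool:
    """A message is read if it is at or below the reader's watermark."""
    return message_id <= store.get((chat_id, user_id), -1)


events = [("c1", "u1", 5), ("c1", "u1", 9), ("c1", "u1", 7), ("c1", "u2", 3)]
store: Watermarks = {("c1", "u2"): 4}  # u2 had already read up to message 4
batch = collapse_read_events(events)   # four events become two candidate writes
rows_changed = apply_watermarks(store, batch)
```

Four buffered events turn into a single database write here: the three events for u1 collapse to one watermark, and the stale event for u2 is dropped because it would move the watermark backwards.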

Question 4

How does presence detection work in a chat system?

Accepted Answer

Presence shows whether a user is currently online. Implementation: on WebSocket connect, set the Redis key presence:{user_id} = online with TTL=30s. The client sends a heartbeat every 20s, and each heartbeat resets the TTL. On clean disconnect: delete the key. On unclean disconnect (network drop): no more heartbeats arrive, so the key expires on its own and the user appears offline at most 30s after the last heartbeat. At 100M connected users, one Redis key per user costs about 100M * ~100 bytes = ~10GB, which is manageable with Redis Cluster. Presence broadcasting: when a user's status changes (online to offline or back), broadcast the change to their contacts. Fan-out challenge: if a user has 5,000 contacts, one status change means 5,000 notifications. Optimization: lazy presence, where online/offline updates go only to contacts who are currently viewing a chat with that user, not to all contacts. Group presence: in large groups, show only an aggregated count ("5 of 10 members online").
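The TTL-plus-heartbeat mechanism can be sketched without Redis. `PresenceStore` below is a hypothetical in-memory stand-in for the presence:{user_id} keys; the injectable `clock` parameter is an assumption added so that expiry can be demonstrated without sleeping.

```python
import time
from typing import Callable, Dict


class PresenceStore:
    """In-memory stand-in for Redis keys with a TTL (illustrative only)."""

    def __init__(
        self, ttl_seconds: float = 30.0, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}  # user_id -> expiry time

    def heartbeat(self, user_id: str) -> None:
        # Connect and heartbeat do the same thing: (re)set the key with a fresh TTL.
        self._expires_at[user_id] = self._clock() + self._ttl

    def disconnect(self, user_id: str) -> None:
        # Clean disconnect: remove the key immediately.
        self._expires_at.pop(user_id, None)

    def is_online(self, user_id: str) -> bool:
        expires_at = self._expires_at.get(user_id)
        if expires_at is None:
            return False
        if self._clock() >= expires_at:
            # Unclean disconnect: heartbeats stopped and the TTL ran out.
            del self._expires_at[user_id]
            return False
        return True


now = [0.0]  # fake clock, advanced by hand
presence = PresenceStore(ttl_seconds=30.0, clock=lambda: now[0])
presence.heartbeat("alice")          # connect at t=0, key expires at t=30
now[0] = 20.0
presence.heartbeat("alice")          # heartbeat at t=20, key now expires at t=50
now[0] = 49.0
online_before_expiry = presence.is_online("alice")
now[0] = 50.0
online_after_expiry = presence.is_online("alice")
```

The 20s heartbeat against a 30s TTL leaves a 10s margin, so a single delayed heartbeat does not flip the user to offline.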

Question 5

How do you handle message delivery to offline users in a chat system?

Accepted Answer

Offline users cannot receive WebSocket messages, so delivery relies on three mechanisms. (1) Push notifications: when a message is sent and the recipient is offline (no active WebSocket connection, presence key expired), send a push notification via APNs (iOS) or FCM (Android). Include the sender name and a message preview. On tap: the app opens to the specific chat. (2) Message inbox: messages are persisted in Cassandra regardless of online status. When the user reconnects, the client fetches missed messages one chat at a time, because chat_id is the partition key: SELECT * FROM messages WHERE chat_id = X AND created_at > last_seen_at ORDER BY created_at DESC. A single query with chat_id IN (my_chats) would fan out across many partitions and is best avoided; only chats with a non-zero unread count need a fetch. (3) Unread count: maintain an unread count per (user, chat) in Redis: HINCRBY unread:{user_id} {chat_id} 1 on each new message. On chat open: reset the field to 0. Show badge on app icon = sum of all unread counts. Offline delivery guarantee: messages are durably stored, so users do not lose messages within the retention window even if they are offline for weeks.
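The unread counter and the online/offline branch fit together as in the following sketch. `UnreadCounters` mimics a Redis hash per user (increment, reset, sum) with plain dictionaries, and `deliver` is a hypothetical function that only returns which channel would be used; no real push provider or socket is involved.

```python
from typing import Dict, Set


class UnreadCounters:
    """In-memory stand-in for one Redis hash per user: chat_id -> unread count."""

    def __init__(self) -> None:
        self._counts: Dict[str, Dict[str, int]] = {}

    def increment(self, user_id: str, chat_id: str) -> int:
        # Same effect as HINCRBY unread:{user_id} {chat_id} 1.
        chats = self._counts.setdefault(user_id, {})
        chats[chat_id] = chats.get(chat_id, 0) + 1
        return chats[chat_id]

    def reset(self, user_id: str, chat_id: str) -> None:
        # Called when the user opens the chat.
        self._counts.get(user_id, {}).pop(chat_id, None)

    def badge(self, user_id: str) -> int:
        # App icon badge = sum of unread counts across all chats.
        return sum(self._counts.get(user_id, {}).values())


def deliver(
    recipient_id: str, chat_id: str, online_users: Set[str], counters: UnreadCounters
) -> str:
    """Count the message as unread, then pick the delivery channel."""
    counters.increment(recipient_id, chat_id)
    return "websocket" if recipient_id in online_users else "push"


counters = UnreadCounters()
first = deliver("bob", "c1", online_users={"alice"}, counters=counters)         # bob offline
second = deliver("bob", "c1", online_users={"alice"}, counters=counters)        # still offline
third = deliver("bob", "c2", online_users={"alice", "bob"}, counters=counters)  # bob back online
badge_before_open = counters.badge("bob")
counters.reset("bob", "c1")  # bob opens chat c1
badge_after_open = counters.badge("bob")
```

The counter is incremented on both paths because being connected does not mean the chat is open; only opening the chat clears it.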

Chat System Low-Level Design (WhatsApp / Messenger)

Requirements

Architecture

WebSocket Connection Management

Data Model

Message Delivery Flow

Read Receipts

Presence Service

Message Search

Key Design Decisions