Chat Service Low-Level Design: Message Storage, Delivery Guarantees, and Read Receipts

Message Schema

Every message is a structured record stored durably before any delivery is attempted:

message_id: Snowflake ID — 64-bit integer encoding timestamp, datacenter ID, and sequence number. Time-sortable without a secondary sort key.
conversation_id: identifies the 1:1 or group conversation thread
sender_id: authenticated user who created the message
content: text payload or reference to media object in blob storage
type: text | image | file | voice | system
status: SENT | DELIVERED | READ (updated as delivery progresses)
created_at, deleted_at: creation timestamp, plus a deleted_at timestamp that is set on soft delete and empty otherwise

Message Storage

Cassandra is well-suited for chat message storage:

Partition key: conversation_id — all messages for a conversation land on the same partition
Clustering key: message_id DESC — messages stored in reverse chronological order within the partition
Append-only writes: message content is never updated in place (edits create new versions), so Cassandra's write-optimized SSTable structure is a good fit. Status changes are small single-column writes, which Cassandra also handles as appends.
Time-range queries: fetch last N messages with WHERE conversation_id = ? AND message_id < ? LIMIT 50 — efficient with the clustering key
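The cursor-style query above can be illustrated without a database. The following is a minimal sketch, assuming an in-memory list stands in for one conversation's partition (already sorted by message_id descending); `fetch_page` is a hypothetical helper, not a Cassandra driver call.

```python
from bisect import bisect_right
from typing import List, Optional


def fetch_page(partition_desc: List[int],
               before_id: Optional[int],
               limit: int = 50) -> List[int]:
    """Return up to `limit` message IDs older than `before_id`, newest first.

    `partition_desc` mimics one partition: message IDs sorted descending.
    Passing before_id=None fetches the most recent page.
    """
    if before_id is None:
        start = 0
    else:
        # Negate the IDs so the descending list looks ascending to bisect,
        # then find the first position whose ID is strictly below the cursor.
        negated = [-m for m in partition_desc]
        start = bisect_right(negated, -before_id)
    return partition_desc[start:start + limit]


# Usage: page backwards through history, using the oldest ID seen as the cursor.
partition = list(range(100, 0, -1))          # IDs 100, 99, ..., 1
first_page = fetch_page(partition, None, limit=3)
second_page = fetch_page(partition, first_page[-1], limit=3)
```

The client only needs to remember the oldest message_id it has displayed; no offset is involved, so the cost of a page does not grow with how far back the user scrolls.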

Delivery Flow

The sequence from send to delivery:

Sender submits message to chat server over WebSocket or HTTP
Chat server persists message to Cassandra with status SENT
ACK sent back to sender — sender's UI updates message to SENT state
Chat server looks up each recipient in the connection routing table to find the WebSocket server holding their connection
Message forwarded to recipient's WebSocket server → pushed to recipient's connected client
Recipient's client ACKs delivery → server updates status to DELIVERED
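The steps above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not a server implementation: a dict stands in for Cassandra, a per-user list stands in for an open WebSocket, and the names `ChatServer`, `connect`, `send`, and `ack_delivery` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Message:
    message_id: int
    conversation_id: str
    sender_id: str
    content: str
    status: str = "SENT"


class ChatServer:
    def __init__(self) -> None:
        self.store: Dict[int, Message] = {}          # stands in for Cassandra
        self.sockets: Dict[str, List[Message]] = {}  # user_id -> pushed messages

    def connect(self, user_id: str) -> None:
        self.sockets[user_id] = []

    def send(self, msg: Message, recipient_id: str) -> str:
        # Persist first, so a crash after this point cannot lose the message.
        self.store[msg.message_id] = msg
        # Route and push only if the recipient currently has a connection.
        if recipient_id in self.sockets:
            self.sockets[recipient_id].append(msg)
        # The return value plays the role of the ACK to the sender. A real
        # server would send it right after the write, before the push.
        return msg.status

    def ack_delivery(self, message_id: int) -> None:
        # Called when the recipient's client confirms receipt.
        self.store[message_id].status = "DELIVERED"
```

Note that the status only moves to DELIVERED on the recipient's ACK, never on the push itself, since a push over a half-dead connection may not arrive.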

Online Delivery via WebSocket

WebSocket provides a persistent, full-duplex connection between client and chat server. When the recipient is connected:

The chat server pushes the message over the open WebSocket connection immediately
No polling required — sub-100ms delivery latency is achievable
Connection routing: a shared key-value store (e.g., Redis) maps user_id → WebSocket server address, allowing any chat server to forward to the correct server holding the recipient's connection
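A minimal sketch of the routing table follows, assuming a plain dict in place of the shared store. `ConnectionRegistry` and its methods are hypothetical names. The one non-obvious detail it illustrates is that a disconnect should only clear the entry if it still points at the server reporting the disconnect; otherwise a late disconnect from an old server could erase a fresh reconnect.

```python
from typing import Dict, Optional


class ConnectionRegistry:
    """Maps user_id -> address of the WebSocket server holding the connection."""

    def __init__(self) -> None:
        self._routes: Dict[str, str] = {}   # stand-in for the shared store

    def register(self, user_id: str, server_addr: str) -> None:
        # Last writer wins: a reconnect simply overwrites the old route.
        self._routes[user_id] = server_addr

    def unregister(self, user_id: str, server_addr: str) -> None:
        # Ignore stale disconnects from a server the user has already left.
        if self._routes.get(user_id) == server_addr:
            del self._routes[user_id]

    def lookup(self, user_id: str) -> Optional[str]:
        # None means the user is offline and delivery must be deferred.
        return self._routes.get(user_id)
```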

Offline Queuing

When the recipient is disconnected, the message is persisted with status SENT and delivery is deferred:

On reconnect, the client sends its last_seen_message_id
The server queries Cassandra for all messages in the conversation with message_id > last_seen_message_id
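Missed messages are replayed oldest-first (ascending message_id) over the newly established WebSocket connection. Because the partition is clustered in descending order, the server must request ascending order or reverse the result before replay.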
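The replay step can be sketched as follows, assuming the conversation's partition is available as a list of message IDs in descending order (mirroring the clustering order). `missed_messages` is a hypothetical helper name.

```python
from itertools import takewhile
from typing import List


def missed_messages(partition_desc: List[int],
                    last_seen_message_id: int) -> List[int]:
    """Return IDs newer than the client's cursor, oldest first for replay."""
    # The list is newest-first, so everything the client missed sits at the
    # front; stop scanning at the first ID the client has already seen.
    newer_desc = list(takewhile(lambda m: m > last_seen_message_id,
                                partition_desc))
    # Reverse so the client receives messages in the order they were sent.
    return newer_desc[::-1]
```

For example, with stored IDs 50, 40, 30, 20, 10 and a cursor of 20, the client is sent 30, then 40, then 50.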

Message Ordering

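Snowflake IDs provide a total order that closely tracks send time, without distributed coordination:

The timestamp component (41 bits, millisecond resolution) orders messages by the generator's wall-clock time; IDs from different generator nodes are only as well ordered as those nodes' clocks are synchronized
The sequence number component resolves ties within the same millisecond on the same generator node
Clients sort received messages by message_id as a 64-bit integer: numeric order matches chronological order (a lexicographic sort of decimal strings does not, unless the strings are zero-padded)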
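A minimal generator sketch makes the bit layout concrete. The 41-bit timestamp comes from the text; the 10-bit node field, 12-bit sequence field, and the custom epoch value are assumptions chosen for illustration, and `SnowflakeGenerator` is a hypothetical class rather than a library API.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000   # arbitrary custom epoch (assumption)
NODE_BITS = 10                 # datacenter + worker bits combined (assumption)
SEQ_BITS = 12                  # per-millisecond sequence (assumption)
SEQ_MASK = (1 << SEQ_BITS) - 1


def _now_ms() -> int:
    return int(time.time() * 1000)


class SnowflakeGenerator:
    def __init__(self, node_id: int, clock=_now_ms) -> None:
        if not 0 <= node_id < (1 << NODE_BITS):
            raise ValueError("node_id out of range")
        self._node_id = node_id
        self._clock = clock
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Refuse to issue IDs that would sort before earlier ones.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._seq = (self._seq + 1) & SEQ_MASK
                if self._seq == 0:
                    # Sequence exhausted: spin until the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._seq = 0
            self._last_ms = now
            return (((now - EPOCH_MS) << (NODE_BITS + SEQ_BITS))
                    | (self._node_id << SEQ_BITS)
                    | self._seq)


def timestamp_ms(snowflake_id: int) -> int:
    """Recover the creation time (Unix ms) from an ID."""
    return (snowflake_id >> (NODE_BITS + SEQ_BITS)) + EPOCH_MS
```

Because the timestamp occupies the most significant bits, comparing two IDs as integers compares their timestamps first, then node, then sequence.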

At-Least-Once Delivery

The sender retries if it does not receive an ACK within a timeout (e.g., 5 seconds):

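Each message carries a unique ID that stays the same across retries. For deduplication this ID must be chosen by the client (a client-generated message_id or a separate idempotency key), because a server-assigned ID is unknown to the sender until the first ACK arrives
The server makes the write idempotent on that ID at the storage layer. Cassandra has no unique constraints, so in practice this means a conditional INSERT ... IF NOT EXISTS or an upsert of identical data: a duplicate submit returns the existing record instead of creating a new one
Retries give at-least-once delivery to the server, and the idempotent write keeps duplicate messages from appearing in the conversation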
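The retry-plus-deduplication pattern can be sketched as below. It assumes the ID is fixed by the client before the first attempt, uses a dict in place of the storage layer, and signals a missing ACK with `TimeoutError`. `MessageStore`, `insert_if_absent`, and `send_with_retry` are hypothetical names.

```python
from typing import Callable, Dict, Tuple


class MessageStore:
    """Stand-in for the storage layer with insert-if-absent semantics."""

    def __init__(self) -> None:
        self._rows: Dict[int, dict] = {}

    def insert_if_absent(self, message_id: int, row: dict) -> Tuple[dict, bool]:
        # Returns (stored_row, created). A duplicate submit gets the
        # original row back and creates nothing.
        if message_id in self._rows:
            return self._rows[message_id], False
        self._rows[message_id] = row
        return row, True

    def __len__(self) -> int:
        return len(self._rows)


def send_with_retry(submit: Callable[[int, str], dict],
                    message_id: int,
                    content: str,
                    max_attempts: int = 3) -> dict:
    """Resend the same message_id until an ACK (a return value) arrives."""
    for _ in range(max_attempts):
        try:
            return submit(message_id, content)
        except TimeoutError:
            continue   # ACK lost or late: retry with the SAME message_id
    raise TimeoutError("no ACK after %d attempts" % max_attempts)
```

The important case is a write that succeeds while its ACK is lost: the retry reaches the store a second time, finds the existing row, and the conversation still contains exactly one copy.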

Read Receipts

Read receipts allow senders to know when their message has been seen:

When the recipient's client renders a message, it sends a read event: { conversation_id, last_read_message_id }
The server updates the recipient's last_read_message_id in a conversation_members table
The server broadcasts the read event to other conversation members so their UIs can display read receipts
Batch reads: clients batch read events into a single update rather than sending one per message to reduce write amplification
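The per-member cursor can be sketched as a high-water mark that only moves forward. This is an in-memory illustration; `ReadCursors`, `mark_read`, and `read_by` are hypothetical names, and the dict stands in for the conversation_members table.

```python
from typing import Dict, Iterable, List


class ReadCursors:
    """Tracks last_read_message_id per member of one conversation."""

    def __init__(self, members: Iterable[str]) -> None:
        # 0 means "has not read anything yet" (real message IDs are positive).
        self._last_read: Dict[str, int] = {m: 0 for m in members}

    def mark_read(self, user_id: str, last_read_message_id: int) -> bool:
        """Advance the cursor. Returns True if it moved, i.e. a broadcast
        to other members is warranted; stale or repeated events are ignored."""
        if last_read_message_id <= self._last_read[user_id]:
            return False
        self._last_read[user_id] = last_read_message_id
        return True

    def read_by(self, message_id: int) -> List[str]:
        """Members whose cursor has reached or passed `message_id`."""
        return sorted(u for u, cursor in self._last_read.items()
                      if cursor >= message_id)
```

A single `mark_read("bob", 105)` implicitly marks every message up to 105 as read by bob, which is what lets clients batch many rendered messages into one event.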

Group Chat Fan-Out

In 1:1 chat, delivery is simple. Group chat requires delivering one message to N members:

Fan-out-on-write (small groups, up to roughly 500 members): write one copy of the message to each member's delivery queue immediately. Simple, low read latency.
Fan-out-on-read (large groups, beyond roughly 500 members): store one copy of the message, members fetch it when they open the conversation. Avoids write amplification for very large groups (e.g., broadcast channels).
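The choice between the two strategies can be sketched as a size check at send time. The 500-member threshold is taken from the text; treating a group of exactly 500 as large, always keeping one canonical copy in a shared log, and skipping the sender's own queue are assumptions of this sketch. `fan_out` is a hypothetical function name.

```python
from collections import defaultdict
from typing import DefaultDict, List

FANOUT_ON_WRITE_MAX = 500   # groups below this size use fan-out-on-write


def fan_out(message_id: int,
            sender_id: str,
            members: List[str],
            inboxes: DefaultDict[str, List[int]],
            conversation_log: List[int]) -> str:
    """Store the message and return which strategy was used."""
    # One canonical copy is always stored with the conversation.
    conversation_log.append(message_id)
    if len(members) < FANOUT_ON_WRITE_MAX:
        # Small group: enqueue a reference for every other member now,
        # so each member's read path is a cheap queue scan.
        for member in members:
            if member != sender_id:
                inboxes[member].append(message_id)
        return "write"
    # Large group: no per-member writes; members pull from the log
    # when they open the conversation.
    return "read"
```

For a 1,000-member channel this turns one send into a single write instead of 999 queue writes, at the cost of a log query when each member opens the conversation.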

Message Editing, Deletion, Push Notifications, and E2E Encryption

Editing: store a new version of the message content with an edited_at timestamp. Keep edit history for audit purposes. Clients display the latest version with an “edited” indicator.

Soft deletion: set deleted_at on the message record. Clients display “This message was deleted.” No content is transmitted after deletion.

Push notifications: for offline users, send an APNs (iOS) or FCM (Android) push notification with a truncated message preview. The notification wakes the app, which then reconnects and fetches the messages it missed via the offline queue mechanism.

End-to-end encryption: the sender encrypts message content on the device, using keys derived from the recipient's public keys, before transmitting to the server. The server stores only ciphertext and cannot read message content. Key agreement and per-message key rotation follow the Signal protocol (Double Ratchet algorithm), which provides forward secrecy.

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How do you design the message storage schema for a chat service that supports billions of messages?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Model messages with a wide-column store (e.g., Cassandra or DynamoDB) using a partition key of (conversation_id) and a clustering key of (message_id DESC), where message_id is a time-ordered ID (UUIDv7 or Snowflake ID). This layout gives O(1) writes and efficient range scans for paginated history. Store message payload, sender_id, type, and client-assigned idempotency_key in the row. Cap partition size by bucketing large conversations: partition key becomes (conversation_id, bucket) where bucket = floor(message_timestamp / bucket_duration). Maintain a separate `conversations` table in a relational store for metadata (participant list, last_message_id, created_at) and a `conversation_members` table for group membership and per-member last-read pointers."
      }
    },
    {
      "@type": "Question",
      "name": "How do you guarantee at-least-once message delivery with deduplication in a chat system?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The sender assigns a client-generated idempotency_key (UUID) to each message. The server stores this key with the message and returns an ACK containing the server-assigned message_id. If the sender doesn't receive an ACK within a timeout, it retransmits the same message with the same idempotency_key. The server detects duplicates via a UNIQUE constraint (or conditional write) on (conversation_id, idempotency_key) and returns the original message_id without re-inserting. For delivery to recipients, publish the message to a queue (e.g., Kafka topic partitioned by conversation_id). Each recipient's push worker consumes the queue, pushes to the client via WebSocket or FCM, and tracks delivery state in a `message_deliveries` table (message_id, recipient_id, status ENUM('pending','delivered','read')). Clients ACK receipt; unACKed messages are retried by the worker."
      }
    },
    {
      "@type": "Question",
      "name": "How do read receipts work at scale, and how do you avoid write amplification in large group chats?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "In 1:1 chats, a read receipt is a simple UPDATE on `message_deliveries` (recipient_id, message_id) → status='read', plus a WebSocket push to the sender. In group chats with N members, a naive per-message-per-member receipt table produces O(N × M) rows for M messages. Mitigate with a cursor approach: store only each member's `last_read_message_id` (a high-water mark) in the `conversation_members` table. A message is considered 'read by all' when MIN(last_read_message_id) across all members is at least its ID. This reduces receipt storage to O(N) per conversation regardless of message count. Batch read-receipt updates: clients send a single 'read up to message_id X' event rather than one event per message. Debounce on the server side with a short window (1–2 seconds) before flushing to the DB to absorb rapid scroll-through events."
      }
    },
    {
      "@type": "Question",
      "name": "How do you handle message ordering and gap detection when clients reconnect after being offline?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Assign each message a monotonically increasing sequence number scoped to the conversation (a per-conversation counter maintained in Redis with INCR, persisted to the DB on write). Clients track the highest sequence number they've received. On reconnect, the client sends a 'sync' request with its last_seen_seq; the server queries the message store for all messages WHERE conversation_id = X AND seq > last_seen_seq ORDER BY seq ASC LIMIT 200. Clients detect gaps by checking for sequence discontinuities in the stream. For real-time delivery, use a fan-out-on-write pattern: when a message is persisted, push it to an in-memory pub/sub channel (e.g., Redis Pub/Sub) keyed by conversation_id; all connected participants' WebSocket servers subscribe to this channel and forward to clients. Offline clients miss pub/sub events and rely entirely on the sync-on-reconnect pull path."
      }
    }
  ]
}