What Is a Real-Time Messaging Service?
A real-time messaging service is the transport layer that moves data from one endpoint to another with millisecond-level latency. Unlike a generic chat service that also handles UI concerns like threads and reactions, the real-time messaging layer focuses purely on connection management, message routing, and delivery guarantees. It underpins chat apps, collaborative editing tools, live feeds, and multiplayer games.
Data Model / Schema
The messaging service maintains minimal state. Most persistent data lives in upstream services; the messaging layer tracks sessions and queued frames:
-- Active sessions (in-memory, e.g., Redis Hash)
sessions:{user_id} = {
server_node : STRING, -- which Chat Server holds the socket
connected_at : TIMESTAMP,
last_ping : TIMESTAMP
}
-- Outbound queue (per user, when socket is temporarily unavailable)
CREATE TABLE outbox (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
user_id BIGINT NOT NULL,
payload JSON NOT NULL,
created_at TIMESTAMP DEFAULT NOW(),
expires_at TIMESTAMP,
INDEX (user_id, id)
);
JSON payloads in the outbox store the full message envelope so delivery can be retried without querying upstream services again.
Core Algorithm / Workflow
The real-time path has two legs: the ingress path (client to server) and the egress path (server to client).
Ingress Path
- Client sends a frame over WebSocket. Frame format:
{ type, msg_id, payload }. - Server validates the frame, assigns a server-side timestamp, and publishes to the appropriate Kafka topic.
- Server immediately ACKs the frame back to the sender with the server-assigned
msg_id.
Egress Path
- A Router Service consumes Kafka and resolves the target
user_idlist from the conversation service. - For each target, the Router looks up the session map to find the correct Chat Server node.
- The Router publishes the frame to a per-node Redis channel. The Chat Server node receives it and writes to the open socket.
Failure Handling
Connection drops: Clients use an exponential back-off reconnect loop (starting at 100 ms, capped at 30 s). On reconnect, the client sends the last confirmed msg_id in the session handshake. The server replays any frames in the outbox with a higher ID.
Server node crash: The load balancer detects the dead node via health checks and reroutes new connections. Existing sessions are lost; clients reconnect and replay from the outbox. Session TTLs in Redis expire automatically, preventing stale routing.
Kafka consumer lag: If the Router Service falls behind, messages are buffered in Kafka (configured retention of at least 24 hours). This acts as a natural buffer during traffic spikes without dropping messages.
Scalability Considerations
Connection scaling: Each server node handles ~50 k WebSocket connections using an async I/O event loop (epoll/kqueue). Thousands of nodes behind a Layer 4 load balancer give virtually unlimited horizontal capacity.
Hot conversations: A very active group chat generates fan-out to hundreds of nodes simultaneously. Batch the per-node publish calls and pipeline Redis writes to minimize round trips.
Geo-distribution: Deploy server nodes in multiple regions. Route users to the nearest region via anycast or GeoDNS. Cross-region messages travel the Kafka backbone between regional clusters.
Summary
A real-time messaging service achieves low latency by keeping hot state (sessions, queues) in memory, using async I/O for connection management, and relying on Kafka as a durable, ordered backbone. The separation of the transport layer from business logic (conversations, threads) makes it easy to scale and operate independently.
See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering
See also: Atlassian Interview Guide
See also: Snap Interview Guide