WebSocket provides full-duplex, low-latency communication between a browser and server over a single persistent TCP connection. Unlike HTTP polling (client repeatedly requests updates), WebSocket pushes data from server to client the moment it is available. Scaling WebSocket servers is fundamentally different from scaling stateless HTTP services: WebSocket connections are stateful and long-lived, requiring sticky session routing, careful connection tracking, and efficient fan-out for broadcast use cases.
WebSocket Connection Lifecycle
A WebSocket connection begins with an HTTP Upgrade handshake: the client sends GET /ws HTTP/1.1 with Upgrade: websocket and Sec-WebSocket-Key headers; the server responds with 101 Switching Protocols and Sec-WebSocket-Accept. The connection then upgrades from HTTP to the WebSocket protocol on the same TCP connection. The server and client exchange framed messages until one side closes. Each frame has an opcode (text, binary, ping, pong, close) and payload. The server sends pings periodically; the client responds with pongs. Absent pong responses indicate a broken connection — close and clean up the server-side state.
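The handshake's accept value is mechanical: the server concatenates the client's Sec-WebSocket-Key with a fixed GUID, SHA-1 hashes it, and base64-encodes the digest (RFC 6455). A minimal sketch in Go, using the example key from the RFC:

```go
package main

import (
	"crypto/sha1"
	"encoding/base64"
	"fmt"
)

// secWebSocketAccept derives the Sec-WebSocket-Accept header value from the
// client's Sec-WebSocket-Key, per RFC 6455: base64(SHA-1(key + fixed GUID)).
func secWebSocketAccept(key string) string {
	const guid = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"
	h := sha1.Sum([]byte(key + guid))
	return base64.StdEncoding.EncodeToString(h[:])
}

func main() {
	// Example key from RFC 6455, section 1.3.
	fmt.Println(secWebSocketAccept("dGhlIHNhbXBsZSBub25jZQ=="))
	// s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
}
```

In practice a library (gorilla/websocket, ws, Netty's WebSocket codec) performs this step; the point is that the handshake is ordinary HTTP until the 101 response, so it passes through standard HTTP load balancers and proxies.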
Connection State Management
Each WebSocket connection is stateful: the server holds a reference to the connection (file descriptor, goroutine, thread, or event loop callback) for the lifetime of the connection. With 1 million concurrent connections, the server holds 1 million live connection handles. Memory per connection: the TCP receive/send buffer (4-8KB per side), application-level buffers, and metadata (user_id, subscriptions, last_ping_time). Use an event-driven, non-blocking I/O model (Node.js, Go with goroutines, Netty) rather than one-thread-per-connection — a million OS threads would exhaust memory. Go goroutines start with a ~2KB growable stack; Netty channels use off-heap buffers managed by the runtime.
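The per-connection state described above can be held in a mutex-guarded registry on each instance. A minimal sketch, assuming Go; the field and type names (connMeta, registry) are illustrative, not a standard API:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// connMeta is the per-connection state the server retains for the
// connection's lifetime: identity, subscriptions, liveness, and an
// outbound buffer drained by a writer goroutine.
type connMeta struct {
	UserID        string
	Subscriptions map[string]bool
	LastPingTime  time.Time
	Send          chan []byte
}

// registry tracks all live connections on one server instance.
type registry struct {
	mu    sync.RWMutex
	conns map[string]*connMeta // connection ID -> metadata
}

func newRegistry() *registry {
	return &registry{conns: make(map[string]*connMeta)}
}

func (r *registry) add(connID, userID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.conns[connID] = &connMeta{
		UserID:        userID,
		Subscriptions: make(map[string]bool),
		LastPingTime:  time.Now(),
		Send:          make(chan []byte, 64),
	}
}

func (r *registry) remove(connID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.conns, connID) // free buffers and metadata on disconnect
}

func (r *registry) count() int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return len(r.conns)
}

func main() {
	r := newRegistry()
	r.add("c1", "alice")
	r.add("c2", "bob")
	r.remove("c1")
	fmt.Println(r.count()) // 1
}
```

The bounded Send channel doubles as backpressure: when a slow client fills it, the server can drop messages or close the connection rather than buffering without limit.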
Sticky Sessions and Routing
A WebSocket connection is pinned to one server instance — the connection is not migrated mid-session. Once the TCP connection is established, its packets naturally flow to the same backend; sticky sessions ensure that the handshake, reconnects, and any HTTP fallback requests from the same client also reach the instance holding the session state. Mechanisms: IP hash (route based on client IP — breaks with NAT where many clients share one IP), cookie-based stickiness (the L7 load balancer sets a cookie identifying the server and routes subsequent requests to the same server), and connection ID in the URL path (/ws/{server_id}/{conn_id} — clients reconnect to the correct server). AWS ALB supports sticky sessions via cookie. Nginx upstream hash can hash on client IP or any upstream variable.
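An illustrative Nginx configuration for the IP-hash approach; the upstream hostnames and paths are placeholders, and the Upgrade/Connection headers are what allow the 101 handshake to pass through the proxy:

```nginx
# Sketch: ip_hash pins each client IP to one upstream instance.
upstream ws_backend {
    ip_hash;                       # sticky by client IP (breaks behind large NATs)
    server ws1.internal:8080;
    server ws2.internal:8080;
}

server {
    listen 443 ssl;
    location /ws {
        proxy_pass http://ws_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 3600s;  # don't reap idle long-lived connections
    }
}
```

The generous proxy_read_timeout matters: Nginx's default of 60 seconds will silently close any connection that goes a minute without traffic, which is exactly the failure mode application-level pings are meant to avoid.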
Fan-Out for Broadcast
Broadcasting a message to all subscribers of a channel (chat room, game lobby, live sports score) requires sending to all connections subscribed to that channel — which may be spread across multiple server instances. Fan-out architectures: fan-out on write via pub/sub (when a message arrives, publish it to a Redis channel; all server instances subscribe and forward the message to their local connections matching the channel), fan-out via Kafka (each server instance has a consumer for its partition; messages for connections on that server are routed to the correct partition), and centralized fan-out service (a dedicated broadcast service holds all subscriptions in memory and fans out to server instances). Redis pub/sub is the standard starting point; Kafka scales better for very high fan-out rates.
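The per-instance half of the pub/sub fan-out can be sketched as a local subscription table: channel name mapped to each subscriber's outbound queue. In a real deployment fanOut would be called from the Redis subscriber (or Kafka consumer) callback; here the wiring is local so the flow is easy to follow, and all names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// hub holds this instance's local subscriptions:
// channel -> connection ID -> outbound queue.
type hub struct {
	mu   sync.RWMutex
	subs map[string]map[string]chan string
}

func newHub() *hub {
	return &hub{subs: make(map[string]map[string]chan string)}
}

// subscribe registers a connection on a channel and returns its queue.
func (h *hub) subscribe(channel, connID string) chan string {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.subs[channel] == nil {
		h.subs[channel] = make(map[string]chan string)
	}
	ch := make(chan string, 16)
	h.subs[channel][connID] = ch
	return ch
}

// fanOut forwards one published message to every local subscriber of the
// channel, dropping the message when a subscriber's queue is full — a
// common policy so one slow client cannot stall the broadcast.
func (h *hub) fanOut(channel, msg string) int {
	h.mu.RLock()
	defer h.mu.RUnlock()
	delivered := 0
	for _, ch := range h.subs[channel] {
		select {
		case ch <- msg:
			delivered++
		default: // slow consumer: drop rather than block
		}
	}
	return delivered
}

func main() {
	h := newHub()
	a := h.subscribe("room:42", "connA")
	b := h.subscribe("room:42", "connB")
	h.fanOut("room:42", "goal!")
	fmt.Println(<-a, <-b) // goal! goal!
}
```

Note the design choice in fanOut: broadcast latency stays bounded by the slowest non-full queue, and slow consumers lose messages instead of degrading everyone else — acceptable only because the reconnection protocol (below) can recover missed messages.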
Reconnection and State Recovery
WebSocket connections drop frequently: mobile network switches, server restarts, load balancer timeouts. Clients must reconnect with exponential backoff (0.5s, 1s, 2s, 4s, max 30s) plus jitter to prevent thundering herds when a server restarts. On reconnection, clients need to recover missed messages. Techniques: sequence numbers (server assigns a sequence to each message; client requests messages since last_seen_seq on reconnect), message log with TTL (store recent messages in Redis for 5 minutes; client fetches missed messages on reconnect), and client-side state reconciliation (client compares its local state snapshot with a fresh snapshot from the REST API and applies only the diff). Design the reconnection protocol before the first line of WebSocket code.
Presence and Connection Tracking
Presence — knowing which users are currently connected — requires tracking connections across all server instances. Use Redis to store the connection set: on connect, SET presence:{user_id} {server_id} EX 30 (30-second TTL); on disconnect, DEL presence:{user_id}; heartbeat keeps the key alive. For room-level presence (who is in this chat room), use a Redis hash: HSET room:{room_id}:presence {user_id} {server_id}, refreshed on heartbeat. Query presence: HGETALL room:{room_id}:presence. Note that plain hash fields have no per-field TTL, so stale room entries need periodic cleanup (or per-field expiry via HEXPIRE, Redis 7.4+). Presence TTL handles ghost connections (server crashes without sending close): connections that don’t send heartbeats expire from the presence store within TTL seconds.
Horizontal Scaling
Scale WebSocket servers horizontally by adding instances behind a load balancer with sticky sessions. Each server instance handles N connections (typically 10,000-100,000 per instance depending on message rate). The fan-out pub/sub layer (Redis) coordinates cross-instance message delivery. Scale the pub/sub layer: Redis Cluster with sharded pub/sub (SSUBSCRIBE/SPUBLISH, Redis 7.0+) at very high message rates (classic pub/sub broadcasts every message to all cluster nodes; sharded pub/sub confines each channel to one shard). Add a WebSocket gateway tier (Nginx, Envoy, or a custom gateway) that handles TLS termination, authentication, and connection routing before traffic reaches WebSocket application servers. Monitor: connections per instance, message throughput, fan-out latency (time from message arrival to delivery to all subscribers).
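Fleet sizing follows directly from the per-instance connection cap, with one wrinkle: when an instance dies, all of its clients reconnect and land on the survivors, so provision headroom for that surge. A back-of-envelope sketch with illustrative numbers:

```go
package main

import "fmt"

// instancesNeeded estimates fleet size: total expected connections divided
// by the effective per-instance capacity, where headroom (e.g. 0.3 = 30%)
// is reserved so that reconnect surges after an instance failure don't
// push the survivors past their limit. Rounds up to whole instances.
func instancesNeeded(totalConns, perInstance int, headroom float64) int {
	effective := float64(perInstance) * (1 - headroom)
	n := int(float64(totalConns) / effective)
	if float64(totalConns) > float64(n)*effective {
		n++ // round up: partial instances don't exist
	}
	return n
}

func main() {
	// 1M connections, 50k-connection cap per instance, 30% headroom.
	fmt.Println(instancesNeeded(1_000_000, 50_000, 0.3)) // 29
}
```

Treat the result as a floor, not a target: message rate, fan-out amplification, and TLS handshake load can all cap an instance well below its raw connection limit, which is why the monitoring signals above matter more than the static count.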