WebSocket provides full-duplex, low-latency communication between a browser and server over a single persistent TCP connection. Unlike HTTP polling (client repeatedly requests updates), WebSocket pushes data from server to client the moment it is available. Scaling WebSocket servers is fundamentally different from scaling stateless HTTP services: WebSocket connections are stateful and long-lived, requiring sticky session routing, careful connection tracking, and efficient fan-out for broadcast use cases.
WebSocket Connection Lifecycle
A WebSocket connection begins with an HTTP Upgrade handshake: the client sends GET /ws HTTP/1.1 with Upgrade: websocket and Sec-WebSocket-Key headers; the server responds with 101 Switching Protocols and Sec-WebSocket-Accept. The connection then upgrades from HTTP to the WebSocket protocol on the same TCP connection. The server and client exchange framed messages until one side closes. Each frame has an opcode (text, binary, ping, pong, close) and payload. The server sends pings periodically; the client responds with pongs. Absent pong responses indicate a broken connection — close and clean up the server-side state.
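The handshake's accept value is mechanical: the server concatenates the client's Sec-WebSocket-Key with a fixed GUID, SHA-1 hashes it, and base64-encodes the digest (RFC 6455). A minimal sketch in Go, using the example key from the RFC:

```go
package main

import (
	"crypto/sha1"
	"encoding/base64"
	"fmt"
)

// secWebSocketAccept derives the Sec-WebSocket-Accept header value from the
// client's Sec-WebSocket-Key, per RFC 6455: base64(SHA-1(key + fixed GUID)).
func secWebSocketAccept(key string) string {
	const guid = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"
	h := sha1.Sum([]byte(key + guid))
	return base64.StdEncoding.EncodeToString(h[:])
}

func main() {
	// Example key from RFC 6455, section 1.3.
	fmt.Println(secWebSocketAccept("dGhlIHNhbXBsZSBub25jZQ=="))
	// s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
}
```

In practice a library (gorilla/websocket, ws, Netty's WebSocket codec) performs this step; the point is that the handshake is ordinary HTTP until the 101 response, so it passes through standard HTTP load balancers and proxies.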
Connection State Management
Each WebSocket connection is stateful: the server holds a reference to the connection (file descriptor, goroutine, thread, or event loop callback) for the lifetime of the connection. With 1 million concurrent connections, the server holds 1 million live connection handles. Memory per connection: the TCP receive/send buffer (4-8KB per side), application-level buffers, and metadata (user_id, subscriptions, last_ping_time). Use an event-driven, non-blocking I/O model (Node.js, Go with goroutines, Netty) rather than one-thread-per-connection — a million OS threads would exhaust memory. Go goroutines start with a ~2KB growable stack; Netty channels use off-heap buffers managed by the runtime.
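The per-connection state described above can be held in a mutex-guarded registry on each instance. A minimal sketch, assuming Go; the field and type names (connMeta, registry) are illustrative, not a standard API:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// connMeta is the per-connection state the server retains for the
// connection's lifetime: identity, subscriptions, liveness, and an
// outbound buffer drained by a writer goroutine.
type connMeta struct {
	UserID        string
	Subscriptions map[string]bool
	LastPingTime  time.Time
	Send          chan []byte
}

// registry tracks all live connections on one server instance.
type registry struct {
	mu    sync.RWMutex
	conns map[string]*connMeta // connection ID -> metadata
}

func newRegistry() *registry {
	return &registry{conns: make(map[string]*connMeta)}
}

func (r *registry) add(connID, userID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.conns[connID] = &connMeta{
		UserID:        userID,
		Subscriptions: make(map[string]bool),
		LastPingTime:  time.Now(),
		Send:          make(chan []byte, 64),
	}
}

func (r *registry) remove(connID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.conns, connID) // free buffers and metadata on disconnect
}

func (r *registry) count() int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return len(r.conns)
}

func main() {
	r := newRegistry()
	r.add("c1", "alice")
	r.add("c2", "bob")
	r.remove("c1")
	fmt.Println(r.count()) // 1
}
```

The bounded Send channel doubles as backpressure: when a slow client fills it, the server can drop messages or close the connection rather than buffering without limit.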
Sticky Sessions and Routing
A WebSocket connection is pinned to one server instance — the connection is not migrated mid-session. Once the TCP connection is established, its packets naturally flow to the same backend; sticky sessions ensure that the handshake, reconnects, and any HTTP fallback requests from the same client also reach the instance holding the session state. Mechanisms: IP hash (route based on client IP — breaks with NAT where many clients share one IP), cookie-based stickiness (the L7 load balancer sets a cookie identifying the server and routes subsequent requests to the same server), and connection ID in the URL path (/ws/{server_id}/{conn_id} — clients reconnect to the correct server). AWS ALB supports sticky sessions via cookie. Nginx upstream hash can hash on client IP or any upstream variable.
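An illustrative Nginx configuration for the IP-hash approach; the upstream hostnames and paths are placeholders, and the Upgrade/Connection headers are what allow the 101 handshake to pass through the proxy:

```nginx
# Sketch: ip_hash pins each client IP to one upstream instance.
upstream ws_backend {
    ip_hash;                       # sticky by client IP (breaks behind large NATs)
    server ws1.internal:8080;
    server ws2.internal:8080;
}

server {
    listen 443 ssl;
    location /ws {
        proxy_pass http://ws_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 3600s;  # don't reap idle long-lived connections
    }
}
```

The generous proxy_read_timeout matters: Nginx's default of 60 seconds will silently close any connection that goes a minute without traffic, which is exactly the failure mode application-level pings are meant to avoid.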
Fan-Out for Broadcast
Broadcasting a message to all subscribers of a channel (chat room, game lobby, live sports score) requires sending to all connections subscribed to that channel — which may be spread across multiple server instances. Fan-out architectures: fan-out on write via pub/sub (when a message arrives, publish it to a Redis channel; all server instances subscribe and forward the message to their local connections matching the channel), fan-out via Kafka (each server instance has a consumer for its partition; messages for connections on that server are routed to the correct partition), and centralized fan-out service (a dedicated broadcast service holds all subscriptions in memory and fans out to server instances). Redis pub/sub is the standard starting point; Kafka scales better for very high fan-out rates.
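The per-instance half of the pub/sub fan-out can be sketched as a local subscription table: channel name mapped to each subscriber's outbound queue. In a real deployment fanOut would be called from the Redis subscriber (or Kafka consumer) callback; here the wiring is local so the flow is easy to follow, and all names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// hub holds this instance's local subscriptions:
// channel -> connection ID -> outbound queue.
type hub struct {
	mu   sync.RWMutex
	subs map[string]map[string]chan string
}

func newHub() *hub {
	return &hub{subs: make(map[string]map[string]chan string)}
}

// subscribe registers a connection on a channel and returns its queue.
func (h *hub) subscribe(channel, connID string) chan string {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.subs[channel] == nil {
		h.subs[channel] = make(map[string]chan string)
	}
	ch := make(chan string, 16)
	h.subs[channel][connID] = ch
	return ch
}

// fanOut forwards one published message to every local subscriber of the
// channel, dropping the message when a subscriber's queue is full — a
// common policy so one slow client cannot stall the broadcast.
func (h *hub) fanOut(channel, msg string) int {
	h.mu.RLock()
	defer h.mu.RUnlock()
	delivered := 0
	for _, ch := range h.subs[channel] {
		select {
		case ch <- msg:
			delivered++
		default: // slow consumer: drop rather than block
		}
	}
	return delivered
}

func main() {
	h := newHub()
	a := h.subscribe("room:42", "connA")
	b := h.subscribe("room:42", "connB")
	h.fanOut("room:42", "goal!")
	fmt.Println(<-a, <-b) // goal! goal!
}
```

Note the design choice in fanOut: broadcast latency stays bounded by the slowest non-full queue, and slow consumers lose messages instead of degrading everyone else — acceptable only because the reconnection protocol (below) can recover missed messages.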
Reconnection and State Recovery
WebSocket connections drop frequently: mobile network switches, server restarts, load balancer timeouts. Clients must reconnect with exponential backoff (0.5s, 1s, 2s, 4s, max 30s) plus jitter to prevent thundering herds when a server restarts. On reconnection, clients need to recover missed messages. Techniques: sequence numbers (server assigns a sequence to each message; client requests messages since last_seen_seq on reconnect), message log with TTL (store recent messages in Redis for 5 minutes; client fetches missed messages on reconnect), and client-side state reconciliation (client compares its local state snapshot with a fresh snapshot from the REST API and applies only the diff). Design the reconnection protocol before the first line of WebSocket code.
Presence and Connection Tracking
Presence — knowing which users are currently connected — requires tracking connections across all server instances. Use Redis to store the connection set: on connect, SET presence:{user_id} {server_id} EX 30 (30-second TTL); on disconnect, DEL presence:{user_id}; heartbeat keeps the key alive. For room-level presence (who is in this chat room), use a Redis hash: HSET room:{room_id}:presence {user_id} {server_id}, refreshed on heartbeat. Query presence: HGETALL room:{room_id}:presence. Note that plain hash fields have no per-field TTL, so stale room entries need periodic cleanup (or per-field expiry via HEXPIRE, Redis 7.4+). Presence TTL handles ghost connections (server crashes without sending close): connections that don’t send heartbeats expire from the presence store within TTL seconds.
Horizontal Scaling
Scale WebSocket servers horizontally by adding instances behind a load balancer with sticky sessions. Each server instance handles N connections (typically 10,000-100,000 per instance depending on message rate). The fan-out pub/sub layer (Redis) coordinates cross-instance message delivery. Scale the pub/sub layer: Redis Cluster with sharded pub/sub (SSUBSCRIBE/SPUBLISH, Redis 7.0+) at very high message rates (classic pub/sub broadcasts every message to all cluster nodes; sharded pub/sub confines each channel to one shard). Add a WebSocket gateway tier (Nginx, Envoy, or a custom gateway) that handles TLS termination, authentication, and connection routing before traffic reaches WebSocket application servers. Monitor: connections per instance, message throughput, fan-out latency (time from message arrival to delivery to all subscribers).
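Fleet sizing follows directly from the per-instance connection cap, with one wrinkle: when an instance dies, all of its clients reconnect and land on the survivors, so provision headroom for that surge. A back-of-envelope sketch with illustrative numbers:

```go
package main

import "fmt"

// instancesNeeded estimates fleet size: total expected connections divided
// by the effective per-instance capacity, where headroom (e.g. 0.3 = 30%)
// is reserved so that reconnect surges after an instance failure don't
// push the survivors past their limit. Rounds up to whole instances.
func instancesNeeded(totalConns, perInstance int, headroom float64) int {
	effective := float64(perInstance) * (1 - headroom)
	n := int(float64(totalConns) / effective)
	if float64(totalConns) > float64(n)*effective {
		n++ // round up: partial instances don't exist
	}
	return n
}

func main() {
	// 1M connections, 50k-connection cap per instance, 30% headroom.
	fmt.Println(instancesNeeded(1_000_000, 50_000, 0.3)) // 29
}
```

Treat the result as a floor, not a target: message rate, fan-out amplification, and TLS handshake load can all cap an instance well below its raw connection limit, which is why the monitoring signals above matter more than the static count.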