WebSockets in Production: Scaling, Auth, and Reconnect

WebSockets are easy to demo and hard to run at scale. Senior frontend interviews probe whether you understand the production realities: connection management, server-side scaling, auth, and the many ways a real-time system can fail.

The basic flow

Client opens WebSocket: new WebSocket('wss://...')
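HTTP upgrade handshake: the client sends a GET with Upgrade: websocket, the server answers 101 Switching Protocols
The underlying TCP connection stays open
Both sides can send messages at any time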

Authentication

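Cookies are sent with the upgrade request, so cookie sessions work when the socket endpoint is on the same site as the page. The server should still check the Origin header, because WebSocket handshakes are not blocked by the same-origin policy. The browser WebSocket API cannot set custom headers such as Authorization, so cross-origin or token-based setups need one of: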

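Token in URL: works, but URLs end up in server and proxy logs (avoid)
First-message auth: client sends an auth message after connecting; server validates it before processing anything else
Subprotocol header: client passes the token in Sec-WebSocket-Protocol (the second argument of new WebSocket); the server should echo an offered subprotocol in its response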

Token expiry handling: when the token expires, the server closes the connection with a specific close code (application-defined codes live in the 4000–4999 range); the client refreshes the token and reconnects.
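First-message auth can be sketched as a small gate that sits in front of the message handler. This is a minimal sketch, not a definitive implementation: AuthGate, AuthConn, the auth message shape, and the close codes 4401 and 4403 are all hypothetical names chosen for illustration, and the token check is injected so any scheme (JWT, session lookup) can be plugged in.

```typescript
// Minimal connection surface, so the sketch is independent of any WebSocket library.
interface AuthConn {
  send(data: string): void;
  close(code: number, reason: string): void;
}

// Hypothetical application close codes (RFC 6455 leaves 4000-4999 for private use).
const CLOSE_UNAUTHENTICATED = 4401;
const CLOSE_TOKEN_EXPIRED = 4403;

type TokenCheck =
  | { status: "ok"; userId: string }
  | { status: "expired" }
  | { status: "invalid" };

class AuthGate {
  private userId: string | null = null;
  private conn: AuthConn;
  private verify: (token: string) => TokenCheck;
  private onMessage: (userId: string, raw: string) => void;

  constructor(
    conn: AuthConn,
    verify: (token: string) => TokenCheck, // injected, e.g. a JWT or session lookup
    onMessage: (userId: string, raw: string) => void,
  ) {
    this.conn = conn;
    this.verify = verify;
    this.onMessage = onMessage;
  }

  // Call this for every inbound text frame.
  handle(raw: string): void {
    if (this.userId !== null) {
      this.onMessage(this.userId, raw); // already authenticated: pass through
      return;
    }
    const token = AuthGate.extractToken(raw);
    if (token === null) {
      this.conn.close(CLOSE_UNAUTHENTICATED, "auth required");
      return;
    }
    const result = this.verify(token);
    if (result.status === "ok") {
      this.userId = result.userId;
      this.conn.send(JSON.stringify({ type: "auth_ok" }));
    } else if (result.status === "expired") {
      this.conn.close(CLOSE_TOKEN_EXPIRED, "token expired");
    } else {
      this.conn.close(CLOSE_UNAUTHENTICATED, "invalid token");
    }
  }

  // Expects the first frame to look like {"type":"auth","token":"..."}.
  private static extractToken(raw: string): string | null {
    try {
      const parsed = JSON.parse(raw);
      if (parsed && parsed.type === "auth" && typeof parsed.token === "string") {
        return parsed.token;
      }
    } catch {
      // Malformed JSON falls through to null.
    }
    return null;
  }
}
```

On the client, a close event carrying the "token expired" code is the signal to refresh the token before reconnecting, while other codes can go straight to the normal reconnect path.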

Reconnection

Networks drop. Standard reconnect:

Exponential backoff (1s, 2s, 4s, 8s)
Cap at 30s
Indicate to user: “Reconnecting…”
Reset timer on successful reconnect

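Libraries handle this: reconnecting-websocket wraps the native API; socket.io-client also reconnects, but it speaks the Socket.IO protocol and needs a Socket.IO server on the other end.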
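If you roll your own, the backoff schedule above is only a few lines. The following is a sketch under stated assumptions: backoffDelay, withJitter, and ReconnectPolicy are hypothetical names, and the jitter helper is an optional addition (not part of the list above) that spreads out clients reconnecting after the same outage.

```typescript
interface BackoffOptions {
  baseMs: number; // first delay
  capMs: number;  // upper bound on any delay
}

// Delay before reconnect attempt `attempt` (0-based): base * 2^attempt, capped.
function backoffDelay(
  attempt: number,
  opts: BackoffOptions = { baseMs: 1000, capMs: 30000 },
): number {
  return Math.min(opts.capMs, opts.baseMs * 2 ** attempt);
}

// Optional: pick a random delay in [0, delayMs) so clients do not reconnect in lockstep.
function withJitter(delayMs: number, random: () => number = Math.random): number {
  return Math.floor(random() * delayMs);
}

// Tracks the attempt count between failures and successful reconnects.
class ReconnectPolicy {
  private attempt = 0;

  nextDelay(): number {
    const delay = backoffDelay(this.attempt);
    this.attempt += 1;
    return delay;
  }

  // Call once the socket's open event fires again.
  reset(): void {
    this.attempt = 0;
  }
}

// Attempts 0..5 yield 1000, 2000, 4000, 8000, 16000, 30000 (capped).
const schedule = [0, 1, 2, 3, 4, 5].map((n) => backoffDelay(n));
```

In a client, the close handler would call setTimeout(connect, policy.nextDelay()) and the open handler would call policy.reset().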

Message ordering on reconnect

The hard part. A common strategy:

Client tracks last received message ID
On reconnect, sends “give me messages since X”
Server replays missed messages

Without this, users miss messages during reconnect.
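The replay steps can be sketched with a bounded server-side buffer and a client-side cursor. This is an illustrative sketch: ReplayBuffer, ReplayCursor, and the numeric id scheme are assumptions, and a real system would also need per-channel buffers and persistence, which are omitted here.

```typescript
interface SeqMessage {
  id: number;   // monotonically increasing, assigned by the server
  body: string;
}

// Server side: keeps the most recent `capacity` messages for replay.
class ReplayBuffer {
  private messages: SeqMessage[] = [];
  private nextId = 1;
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  append(body: string): SeqMessage {
    const msg: SeqMessage = { id: this.nextId, body };
    this.nextId += 1;
    this.messages.push(msg);
    if (this.messages.length > this.capacity) this.messages.shift();
    return msg;
  }

  // Messages with id > lastSeenId, or null when the gap is older than the
  // buffer and the client has to do a full resync instead.
  since(lastSeenId: number): SeqMessage[] | null {
    const first = this.messages[0];
    const oldestId = first ? first.id : this.nextId;
    if (lastSeenId + 1 < oldestId) return null;
    return this.messages.filter((m) => m.id > lastSeenId);
  }
}

// Client side: remembers the last applied id and drops duplicates.
class ReplayCursor {
  lastSeenId = 0;

  // Returns true when the message is new and should be applied.
  accept(msg: SeqMessage): boolean {
    if (msg.id <= this.lastSeenId) return false;
    this.lastSeenId = msg.id;
    return true;
  }
}
```

On reconnect the client sends its lastSeenId; a null result from since() tells the server to answer with a "resync" instruction rather than a partial history.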

Heartbeat / ping

Networks may silently drop connections. Detect with heartbeat:

Client sends ping every 30s
Server responds with pong
If no pong within timeout, declare connection dead and reconnect

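The WebSocket protocol has built-in ping/pong control frames. Server libraries usually expose them; the browser WebSocket API does not, although browsers answer server pings automatically. A heartbeat initiated from the browser therefore has to be an application-level message.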
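The heartbeat logic can be kept separate from the socket so it is easy to test. In this sketch HeartbeatMonitor, sendPing, and onDead are hypothetical names, and time is passed in explicitly rather than read from a timer.

```typescript
interface HeartbeatOptions {
  intervalMs: number;   // how often to ping (30s in the list above)
  timeoutMs: number;    // how long to wait for a pong
  sendPing: () => void; // e.g. send an application-level {"type":"ping"} message
  onDead: () => void;   // e.g. close the socket and start reconnecting
}

class HeartbeatMonitor {
  private opts: HeartbeatOptions;
  private lastPingAt: number;
  private awaitingPongSince: number | null = null;
  private dead = false;

  constructor(opts: HeartbeatOptions, now: number) {
    this.opts = opts;
    this.lastPingAt = now;
  }

  // Drive this from a timer, e.g. setInterval(() => hb.tick(Date.now()), 1000).
  tick(now: number): void {
    if (this.dead) return;
    if (this.awaitingPongSince !== null) {
      // A ping is outstanding: declare the connection dead once the timeout passes.
      if (now - this.awaitingPongSince >= this.opts.timeoutMs) {
        this.dead = true;
        this.opts.onDead();
      }
      return;
    }
    if (now - this.lastPingAt >= this.opts.intervalMs) {
      this.lastPingAt = now;
      this.awaitingPongSince = now;
      this.opts.sendPing();
    }
  }

  // Call when the pong (or matching application-level reply) arrives.
  pong(): void {
    this.awaitingPongSince = null;
  }

  isDead(): boolean {
    return this.dead;
  }
}
```

Passing the clock in keeps the class deterministic; the only timer lives in the thin wiring code around the socket.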

Server-side scaling

WebSockets are stateful. Each connection ties to a server instance. Scaling concerns:

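Connection limits: each connection holds a file descriptor. The default per-process limit on Linux is often just 1024 and has to be raised (ulimit -n, or LimitNOFILE under systemd). The often-quoted 65K figure is the TCP port range, which caps outbound connections from one IP to one destination (relevant for proxies), not inbound connections to a server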
Memory: ~10–50KB per connection, depending on framework
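Sticky sessions: an established WebSocket stays on the instance that accepted it. The load balancer needs session affinity when you use HTTP fallbacks (such as Socket.IO long polling) or keep per-client state in instance memory across reconnects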

For 100K+ concurrent connections, plan capacity carefully.
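As a rough worked example of that planning, the per-connection memory range above can be turned into totals. This is back-of-the-envelope arithmetic only: estimateCapacity is a hypothetical helper, and the 20,000 connections per instance is an assumed figure, not a measurement.

```typescript
interface CapacityEstimate {
  memoryGB: number;  // connection memory only, decimal units
  instances: number; // instances needed at the assumed per-instance load
}

function estimateCapacity(
  connections: number,
  kbPerConnection: number,
  connectionsPerInstance: number,
): CapacityEstimate {
  return {
    memoryGB: (connections * kbPerConnection) / 1e6, // 1 GB = 1e6 KB
    instances: Math.ceil(connections / connectionsPerInstance),
  };
}

// 100K connections at 10 KB each: 1 GB across 5 instances.
const lowEstimate = estimateCapacity(100_000, 10, 20_000);
// 100K connections at 50 KB each: 5 GB across 5 instances.
const highEstimate = estimateCapacity(100_000, 50, 20_000);
```

The totals exclude application state, message buffers, and headroom for reconnect storms, so treat them as a floor.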

Pub/sub for fanout

Broadcasting messages to many users:

Server instances subscribe to a Redis pubsub channel
App publishes to Redis
All instances receive; broadcast to their connected clients

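This pattern (with Redis, NATS, or Kafka as the bus) is standard for chat, live updates, etc. Redis pub/sub is fire-and-forget: an instance that is down or disconnected from Redis misses the message, so it does not replace a replay mechanism.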
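The shape of the fanout can be shown without a broker. In this sketch InMemoryBus is an in-process stand-in for Redis pub/sub (an assumption made so the example runs on its own), and GatewayInstance and FanoutClient are hypothetical names for a WebSocket server process and its connections.

```typescript
type BusHandler = (payload: string) => void;

// In-memory stand-in for Redis pub/sub, so the sketch runs without a server.
class InMemoryBus {
  private handlers = new Map<string, Set<BusHandler>>();

  subscribe(channel: string, handler: BusHandler): void {
    let set = this.handlers.get(channel);
    if (!set) {
      set = new Set<BusHandler>();
      this.handlers.set(channel, set);
    }
    set.add(handler);
  }

  publish(channel: string, payload: string): void {
    const set = this.handlers.get(channel);
    if (set) set.forEach((handler) => handler(payload));
  }
}

interface FanoutClient {
  send(data: string): void;
}

// One WebSocket server process holding its own set of client connections.
class GatewayInstance {
  private clients = new Set<FanoutClient>();

  constructor(bus: InMemoryBus, channel: string) {
    // Every instance subscribes; each one relays to its local clients only.
    bus.subscribe(channel, (payload) => this.broadcast(payload));
  }

  add(client: FanoutClient): void {
    this.clients.add(client);
  }

  remove(client: FanoutClient): void {
    this.clients.delete(client);
  }

  private broadcast(payload: string): void {
    this.clients.forEach((client) => client.send(payload));
  }
}
```

The application code that produces an event only calls publish; it never needs to know which instance holds which connection.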

Channel / room management

Users care about specific topics (chat rooms, document IDs). Pattern:

Client joins channel after connecting
Server tracks which connections are in which channels
Broadcast only to relevant channels
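The bookkeeping behind this pattern is a map from channel name to a set of connections. A minimal sketch, with ChannelRegistry and RoomConn as hypothetical names:

```typescript
interface RoomConn {
  send(data: string): void;
}

class ChannelRegistry {
  private members = new Map<string, Set<RoomConn>>();

  join(channel: string, conn: RoomConn): void {
    let set = this.members.get(channel);
    if (!set) {
      set = new Set<RoomConn>();
      this.members.set(channel, set);
    }
    set.add(conn);
  }

  leave(channel: string, conn: RoomConn): void {
    const set = this.members.get(channel);
    if (!set) return;
    set.delete(conn);
    if (set.size === 0) this.members.delete(channel); // avoid leaking empty channels
  }

  // Call on disconnect so dead connections do not linger in any channel.
  leaveAll(conn: RoomConn): void {
    this.members.forEach((set, channel) => {
      set.delete(conn);
      if (set.size === 0) this.members.delete(channel);
    });
  }

  // Sends only to members of `channel`; returns how many connections were reached.
  broadcast(channel: string, data: string): number {
    const set = this.members.get(channel);
    if (!set) return 0;
    set.forEach((conn) => conn.send(data));
    return set.size;
  }
}
```

Forgetting the leaveAll call on disconnect is the usual source of memory leaks in this structure.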

Backpressure

If a slow client cannot keep up:

The server's outbound buffer for that client fills
Left unbounded, the server runs out of memory
The fix is to close the slow connection rather than let it affect other clients

Implement explicit backpressure: drop messages, close slow clients, signal “you fell behind, reconnect.”
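One way to make the limit explicit is a bounded per-client queue that closes the connection once it overflows. This is a sketch of the "close slow clients" policy only: OutboundQueue, SlowConn, and the close code 4408 are hypothetical, and the threshold is something you would tune.

```typescript
interface SlowConn {
  close(code: number, reason: string): void;
}

// Hypothetical application close code meaning "you fell behind, reconnect".
const CLOSE_FELL_BEHIND = 4408;

class OutboundQueue {
  private queue: string[] = [];
  private closed = false;
  private conn: SlowConn;
  private maxQueued: number;

  constructor(conn: SlowConn, maxQueued: number) {
    this.conn = conn;
    this.maxQueued = maxQueued;
  }

  // Returns false when the message was not queued (client closed or over the limit).
  enqueue(msg: string): boolean {
    if (this.closed) return false;
    if (this.queue.length >= this.maxQueued) {
      this.closed = true;
      this.queue = []; // release the memory held for this client
      this.conn.close(CLOSE_FELL_BEHIND, "fell behind, reconnect");
      return false;
    }
    this.queue.push(msg);
    return true;
  }

  // Call when the transport can take more data; returns up to `max` messages to write.
  takeBatch(max: number): string[] {
    return this.queue.splice(0, max);
  }
}
```

A client that reconnects after this close can catch up through message replay, which is why the two mechanisms are usually built together.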

Proxy and load balancer issues

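Some proxies idle-timeout WebSocket connections after 30–60 seconds; a heartbeat shorter than the timeout keeps them open
Older or misconfigured HTTP proxies may strip the Upgrade and Connection headers, which breaks the handshake
nginx and HAProxy support WebSockets but need explicit config (in nginx: proxy_http_version 1.1 plus forwarding the Upgrade and Connection headers; in both: a long enough read or tunnel timeout)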

Test with your actual deployment topology.

Mobile-specific

Backgrounded apps lose their WebSocket connection
iOS app suspended? Connection dies. Reconnect on foreground.
Network switches (Wi-Fi → LTE) drop the connection
For critical real-time, use push notifications as a backup signal

Common mistakes

No reconnection logic
Token expires; connection silently dies
No heartbeat; zombie connections accumulate
No message replay; users miss messages on reconnect
Single-server architecture; cannot scale

Frequently Asked Questions

Should I use Socket.io or native WebSocket?

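Native WebSocket is leaner. Socket.IO adds reconnection, a long-polling fallback, and namespaces, at the cost of its own protocol that requires a Socket.IO server on the other end. That overhead is often worth it.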

Can I run WebSockets on serverless?

Yes. AWS API Gateway WebSocket APIs and Cloudflare Durable Objects both support them. Expect somewhat higher latency than dedicated servers in exchange for easier operations.

How many concurrent WebSocket connections can a Node.js server handle?

Tens of thousands per instance with proper tuning. Beyond that, scale horizontally.