Low Level Design: TCP Connection Management

Three-Way Handshake

Every TCP connection begins with a three-way handshake that establishes synchronized sequence numbers on both sides and verifies bidirectional reachability. The sequence is:

  1. SYN: client sends a SYN segment with its Initial Sequence Number (ISN_c), a randomly chosen 32-bit value. The SYN flag indicates this is a connection request.
  2. SYN-ACK: server responds with its own ISN_s (also random) and acknowledges the client’s ISN by setting ACK = ISN_c + 1. The server is simultaneously opening its side of the connection.
  3. ACK: client acknowledges the server’s ISN by sending ACK = ISN_s + 1. The connection is now ESTABLISHED on both sides.
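The sequence-number arithmetic above can be sketched as a small simulation (illustrative names; real ISNs come from the kernel):

```python
import random

def three_way_handshake():
    """Simulate the sequence-number exchange of the TCP handshake."""
    isn_c = random.getrandbits(32)   # client's Initial Sequence Number
    isn_s = random.getrandbits(32)   # server's Initial Sequence Number

    # 1. SYN: client -> server, carries ISN_c.
    syn = {"flags": "SYN", "seq": isn_c}

    # 2. SYN-ACK: server -> client. The SYN consumes one sequence number,
    #    so the server acknowledges ISN_c + 1 (mod 2**32).
    syn_ack = {"flags": "SYN-ACK", "seq": isn_s,
               "ack": (syn["seq"] + 1) % 2**32}

    # 3. ACK: client -> server, acknowledging ISN_s + 1. Both sides are
    #    now ESTABLISHED.
    ack = {"flags": "ACK", "seq": syn_ack["ack"],
           "ack": (syn_ack["seq"] + 1) % 2**32}
    return syn, syn_ack, ack

syn, syn_ack, ack = three_way_handshake()
assert syn_ack["ack"] == (syn["seq"] + 1) % 2**32
assert ack["ack"] == (syn_ack["seq"] + 1) % 2**32
```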

The ISN is random (pseudo-random based on a clock and a secret in modern implementations) for two reasons: first, it prevents sequence number prediction attacks where an attacker forges segments with valid sequence numbers; second, it prevents delayed packets from a previous connection on the same port pair from being misinterpreted as belonging to the new connection.

The handshake costs 1 RTT before the client can send data: the SYN and SYN-ACK each take half an RTT, and the final ACK can carry the first data bytes. TLS 1.3 on top adds another 1 RTT, meaning a fresh HTTPS connection costs 2 RTT before the first byte of application data arrives; this is the core motivation for connection pooling and TCP Fast Open.

Half-open connections (SYN received but handshake incomplete) are stored in the SYN queue. SYN flood attacks exhaust this queue. SYN cookies mitigate this: instead of allocating state for each SYN, the server encodes connection parameters in the ISN_s, validating them only when the ACK arrives.
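The SYN-cookie idea can be sketched with an HMAC over the connection 4-tuple (a simplification: real kernels also pack a timestamp and an MSS index into specific bits of the 32-bit ISN; the function names here are illustrative):

```python
import hmac, hashlib, os

SECRET = os.urandom(16)  # server-side secret, rotated periodically in practice

def syn_cookie(src_ip, src_port, dst_ip, dst_port, client_isn):
    """Derive ISN_s from the connection parameters instead of storing state."""
    msg = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}-{client_isn}".encode()
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")  # 32-bit ISN_s

def validate_ack(src_ip, src_port, dst_ip, dst_port, client_isn, ack):
    """On ACK arrival, recompute the cookie; ack must equal ISN_s + 1."""
    expected = (syn_cookie(src_ip, src_port, dst_ip, dst_port,
                           client_isn) + 1) % 2**32
    return ack == expected

c = syn_cookie("10.0.0.1", 12345, "10.0.0.2", 443, 777)
assert validate_ack("10.0.0.1", 12345, "10.0.0.2", 443, 777, (c + 1) % 2**32)
```

No per-SYN state is allocated; a valid ACK alone proves the client completed the handshake.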

TCP Connection State Machine

TCP is a state machine with 11 states. Understanding the state transitions is essential for diagnosing connection issues with netstat or ss:

  • CLOSED: no connection exists
  • LISTEN: server waiting for incoming connections (bound socket)
  • SYN_SENT: client sent SYN, waiting for SYN-ACK
  • SYN_RECEIVED: server received SYN, sent SYN-ACK, waiting for ACK
  • ESTABLISHED: connection is open, data transfer in progress
  • FIN_WAIT_1: active closer sent FIN, waiting for ACK
  • FIN_WAIT_2: active closer received ACK of FIN, waiting for passive closer’s FIN
  • TIME_WAIT: active closer received passive’s FIN, sent final ACK, waiting 2*MSL
  • CLOSE_WAIT: passive closer received FIN, sent ACK, waiting for application to close
  • LAST_ACK: passive closer sent FIN, waiting for final ACK
  • CLOSING: both sides sent FIN simultaneously, waiting for ACK of own FIN
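The transitions for the common open and close paths can be written down as a lookup table (a sketch covering the events named above, not the full RFC 793 diagram):

```python
# (state, event) -> next state, for the common client/server paths.
TRANSITIONS = {
    ("CLOSED",       "passive_open"): "LISTEN",
    ("CLOSED",       "send_syn"):     "SYN_SENT",
    ("LISTEN",       "recv_syn"):     "SYN_RECEIVED",
    ("SYN_SENT",     "recv_syn_ack"): "ESTABLISHED",
    ("SYN_RECEIVED", "recv_ack"):     "ESTABLISHED",
    ("ESTABLISHED",  "send_fin"):     "FIN_WAIT_1",   # active close
    ("ESTABLISHED",  "recv_fin"):     "CLOSE_WAIT",   # passive close
    ("FIN_WAIT_1",   "recv_ack"):     "FIN_WAIT_2",
    ("FIN_WAIT_1",   "recv_fin"):     "CLOSING",      # simultaneous close
    ("FIN_WAIT_2",   "recv_fin"):     "TIME_WAIT",
    ("CLOSING",      "recv_ack"):     "TIME_WAIT",
    ("CLOSE_WAIT",   "send_fin"):     "LAST_ACK",
    ("LAST_ACK",     "recv_ack"):     "CLOSED",
    ("TIME_WAIT",    "2msl_timeout"): "CLOSED",
}

def step(state, event):
    return TRANSITIONS[(state, event)]

# The active closer's path through a normal close:
path = ["ESTABLISHED"]
for ev in ("send_fin", "recv_ack", "recv_fin", "2msl_timeout"):
    path.append(step(path[-1], ev))
assert path == ["ESTABLISHED", "FIN_WAIT_1", "FIN_WAIT_2",
                "TIME_WAIT", "CLOSED"]
```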

The four-way close sequence: the active closer sends FIN and enters FIN_WAIT_1; the passive closer ACKs it (active moves to FIN_WAIT_2, passive to CLOSE_WAIT); the passive closer sends its own FIN and enters LAST_ACK; the active closer ACKs that FIN and enters TIME_WAIT. The asymmetry exists because CLOSE_WAIT lets the passive side finish sending data before closing.

A large number of CLOSE_WAIT connections in ss output indicates a bug: the application received a FIN but never called close() on the socket. This leaks file descriptors and eventually exhausts the connection table.

Flow Control with Sliding Window

TCP flow control prevents a fast sender from overwhelming a slow receiver’s buffer. The mechanism is the receive window (rwnd), advertised in every ACK segment.

rwnd is the number of bytes the receiver can currently accept — the free space in its receive buffer. The sender must not have more unacknowledged bytes in flight than rwnd. As the application reads data from the receive buffer, the receiver advertises a larger window; as the buffer fills, rwnd shrinks.

The sliding window allows continuous transmission without waiting for individual ACKs. The sender maintains three pointers: last byte acknowledged (left edge), last byte sent, and last byte it can send (left edge + rwnd). The "window" slides right as ACKs arrive.

A zero window occurs when the receiver advertises rwnd = 0 — the receive buffer is full. The sender pauses and sends periodic window probe segments (1 byte) to check if the window has reopened. This probing continues with exponential backoff. The receiver sends a window update (unsolicited ACK with new rwnd) when buffer space becomes available.
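The three sender-side pointers and the zero-window pause can be sketched as follows (illustrative names, not a real TCP implementation):

```python
class SlidingWindowSender:
    """Track last-byte-acked, last-byte-sent, and the rwnd-limited right edge."""

    def __init__(self, rwnd):
        self.snd_una = 0   # left edge: oldest unacknowledged byte
        self.snd_nxt = 0   # next byte to send
        self.rwnd = rwnd   # receiver-advertised window

    def usable_window(self):
        # Bytes the sender may still put in flight: (snd_una + rwnd) - snd_nxt.
        return self.snd_una + self.rwnd - self.snd_nxt

    def send(self, nbytes):
        sendable = min(nbytes, self.usable_window())
        self.snd_nxt += sendable
        return sendable    # 0 when the window is closed: pause and probe

    def on_ack(self, ack, rwnd):
        # The ACK slides the left edge right; the new rwnd may reopen the window.
        self.snd_una = max(self.snd_una, ack)
        self.rwnd = rwnd

s = SlidingWindowSender(rwnd=10)
assert s.send(8) == 8        # 8 bytes in flight, 2 usable
assert s.send(5) == 2        # clamped to the remaining window
assert s.send(1) == 0        # zero usable window: probe instead
s.on_ack(ack=10, rwnd=10)    # receiver drained its buffer
assert s.usable_window() == 10
```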

The silly window syndrome is a pathology where the receiver opens the window by a tiny amount and the sender immediately transmits a tiny segment. Clark’s solution: receiver should not open the window unless it can accept at least min(MSS, half of receive buffer). Nagle’s algorithm (discussed below) addresses the sender side.

Modern TCP uses window scaling (RFC 7323) to support windows larger than 65535 bytes (the 16-bit field limit), essential for high-bandwidth-delay-product links. A scale factor is negotiated in the SYN/SYN-ACK and applied as a bit shift to the advertised window.
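The scale factor is a plain bit shift. A quick bandwidth-delay-product calculation shows why it is needed:

```python
# Window scaling (RFC 7323): true window = advertised 16-bit value << scale.
def scaled_window(advertised, scale):
    assert 0 <= advertised <= 0xFFFF and 0 <= scale <= 14  # RFC caps scale at 14
    return advertised << scale

# A 1 Gbit/s link with 100 ms RTT needs a BDP-sized window to stay full:
bdp_bytes = (1_000_000_000 // 8) * 100 // 1000   # 12,500,000 bytes
assert bdp_bytes > 65535                          # far beyond the 16-bit limit
assert scaled_window(0xFFFF, 8) == 16_776_960     # scale factor 8 covers it
```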

Congestion Control Algorithms

While flow control prevents receiver buffer overflow, congestion control prevents network buffer overflow. TCP infers congestion from packet loss and RTT increases, adjusting its send rate accordingly.

The sender maintains a congestion window (cwnd). The effective send window is min(cwnd, rwnd). The congestion control algorithm governs how cwnd changes:

Slow Start: begins with cwnd = 1 MSS (or 10 MSS in modern Linux). Each ACK increases cwnd by 1 MSS — this doubles cwnd per RTT (exponential growth). Slow start ends when cwnd reaches the slow start threshold (ssthresh) or a loss event occurs.

Congestion Avoidance: once cwnd >= ssthresh, increase cwnd by MSS * MSS / cwnd per ACK — approximately 1 MSS per RTT (linear growth). This is the AIMD (Additive Increase, Multiplicative Decrease) regime.

Fast Retransmit: receiving 3 duplicate ACKs indicates the receiver got segments after a gap — the missing segment is retransmitted immediately without waiting for a retransmit timeout (RTO). This avoids the costly multi-second RTO wait.

Fast Recovery (TCP Reno): on loss via 3 dup ACKs, set ssthresh = cwnd / 2, set cwnd = ssthresh + 3, retransmit the lost segment. On timeout (more severe), set cwnd = 1 and restart slow start.

CUBIC (default in Linux since 2.6.19): replaces Reno’s linear congestion avoidance with a cubic function of time since the last congestion event. CUBIC is more aggressive on high-bandwidth-delay-product links and fairer to multiple flows sharing a bottleneck. The cubic function grows rapidly toward the previous window size, then plateaus near it, then probes cautiously beyond.
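The CUBIC window function itself is W(t) = C·(t − K)³ + W_max, where K = ∛(W_max·(1 − β)/C) is chosen so the curve reaches W_max exactly at the plateau (constants per RFC 8312):

```python
# CUBIC window growth (RFC 8312): W(t) = C*(t - K)^3 + W_max.
C = 0.4      # scaling constant (RFC 8312 default)
BETA = 0.7   # multiplicative decrease factor

def cubic_window(t, w_max):
    """cwnd (in MSS) t seconds after a congestion event at window w_max."""
    k = ((w_max * (1 - BETA)) / C) ** (1 / 3)  # seconds to regain w_max
    return C * (t - k) ** 3 + w_max

w_max = 100.0
k = ((w_max * (1 - BETA)) / C) ** (1 / 3)
assert abs(cubic_window(0, w_max) - w_max * BETA) < 1e-9  # starts at beta*W_max
assert abs(cubic_window(k, w_max) - w_max) < 1e-9         # plateaus at W_max
assert cubic_window(2 * k, w_max) > w_max                 # then probes beyond
```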

BBR (Bottleneck Bandwidth and RTT, Google 2016): departs from loss-based control entirely. BBR estimates the bottleneck bandwidth and minimum RTT, and uses pacing (spreading packets across the RTT rather than burst-then-stop) to keep the pipe full without filling buffers. BBR dramatically improves throughput on lossy links (e.g., cellular) where loss-based algorithms over-reduce their window.

TIME_WAIT State and Ephemeral Port Exhaustion

After sending the final ACK, the active closer enters TIME_WAIT and remains there for 2*MSL (Maximum Segment Lifetime). RFC 793 specifies MSL = 2 minutes, giving 4 minutes of TIME_WAIT; Linux instead hardcodes the TIME_WAIT duration to 60 seconds (TCP_TIMEWAIT_LEN), an effective MSL of 30 seconds.

TIME_WAIT serves two purposes:

  1. Reliable final ACK delivery: if the final ACK is lost, the passive closer retransmits its FIN. The active closer must still be alive to re-send the ACK. Without TIME_WAIT, a new connection on the same port pair might receive the stale FIN.
  2. Preventing delayed packet confusion: delayed segments from the old connection could arrive after a new connection reuses the same 4-tuple (src IP, src port, dst IP, dst port). TIME_WAIT ensures all segments from the old connection have expired before the tuple can be reused.

On high-connection-rate servers (proxies, load balancers, crawlers), TIME_WAIT sockets accumulate rapidly and exhaust the ephemeral port range (typically 32768–60999, ~28000 ports). Mitigations:

  • net.ipv4.tcp_tw_reuse = 1: allow reusing TIME_WAIT sockets for new outbound connections when safe (timestamped connections only)
  • SO_REUSEADDR: allow binding to a port in TIME_WAIT (server restart use case)
  • Expand ephemeral port range: net.ipv4.ip_local_port_range = 1024 65535
  • Use connection pooling to reduce connection churn
  • net.ipv4.tcp_tw_recycle was a dangerous option (caused issues with NAT) and was removed in Linux 4.12
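The sysctl mitigations above are applied at runtime (or persisted in /etc/sysctl.d/); values are the examples from this section, to be tuned per workload:

```shell
# Reuse TIME_WAIT sockets for new outbound connections (safe with timestamps).
sysctl -w net.ipv4.tcp_tw_reuse=1

# Widen the ephemeral port range from the ~28000-port default.
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

# Note: net.ipv4.tcp_tw_recycle no longer exists (removed in Linux 4.12).
```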

Nagle Algorithm and TCP_NODELAY

The Nagle algorithm (RFC 896) addresses the small-packet problem: an application sending many tiny writes (e.g., 1-byte keypresses over telnet) would generate a flood of small segments, each with 40 bytes of IP+TCP headers — catastrophic on slow links.

Nagle’s rule: a TCP sender may have at most one unacknowledged small segment outstanding. If data to send is smaller than MSS AND there is already unacknowledged data, hold the data in the send buffer and wait. Release the buffered data when either: (a) an ACK arrives for the outstanding data, or (b) enough data accumulates to fill an MSS.
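Nagle's decision rule fits in a few lines (a sketch; `buffered`, `mss`, and `unacked` are illustrative names for the sender's state):

```python
def nagle_should_send(buffered, mss, unacked):
    """Return True if the sender may transmit now under Nagle's rule.

    buffered: bytes waiting in the send buffer
    mss:      maximum segment size
    unacked:  bytes sent but not yet acknowledged
    """
    if buffered >= mss:
        return True    # a full-sized segment always goes out
    if unacked == 0:
        return True    # nothing outstanding: a small segment is allowed
    return False       # small write with outstanding data: wait for the ACK

assert nagle_should_send(buffered=1460, mss=1460, unacked=5000)   # full MSS
assert nagle_should_send(buffered=12, mss=1460, unacked=0)        # idle link
assert not nagle_should_send(buffered=12, mss=1460, unacked=500)  # held back
```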

This batching reduces packet count dramatically for chatty applications. However, it introduces latency: a write is held until the previous ACK arrives, adding up to 1 RTT of delay. For latency-sensitive applications (online games, high-frequency trading, interactive terminals), this is unacceptable.

The TCP_NODELAY socket option disables Nagle. Any data written to the socket is sent immediately. This is the correct setting for RPC-style protocols where the client sends a complete request in one or a few writes and needs the server to receive it without artificial delay.
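Disabling Nagle is a single setsockopt call; in Python:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle: segments are sent as soon as the application writes them.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

assert sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) == 1
sock.close()
```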

A subtle interaction: Nagle + delayed ACK (the receiver holds the ACK for up to 200ms to piggyback on a response) can cause a 200ms delay for two-write request patterns (header in first write, body in second). The first write goes out immediately; the second write is held by Nagle waiting for ACK of first; the ACK is delayed by the receiver. This is a classic misconfiguration in custom protocol implementations.

TCP Connection Pooling

Establishing a TCP connection costs 1 RTT (handshake) + 1 RTT (TLS 1.3) = 2 RTT before application data flows. At 100ms RTT (cross-continent), that’s 200ms of latency overhead per connection, plus the slow start penalty for the first window of data. For databases, microservices, and HTTP backends, per-request connection establishment is prohibitively expensive.

Connection pooling maintains a pool of pre-established, keep-alive connections. New requests borrow a connection from the pool, use it, and return it. The pool handles:

  • Pool size: maximum connections. Too few and requests queue waiting for a connection; too many and the backend is burdened with idle connections consuming memory and file descriptors.
  • Idle timeout: close connections after an idle period to release server-side resources. It must be shorter than the server's idle timeout, or a borrowed "live" connection may already have been reset by the server.
  • Health checking / validation: test connections on borrow (with a lightweight validation query or TCP keepalive) to detect silently broken connections, e.g., a firewall that dropped the mapping after its own timeout.
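A minimal borrow/return pool with a size cap and an on-borrow validation hook can be sketched as follows (illustrative only; production pools such as HikariCP add idle reaping, borrow timeouts, and metrics):

```python
import queue

class ConnectionPool:
    """Bounded pool: borrow blocks when all connections are checked out."""

    def __init__(self, factory, size, validate=lambda conn: True):
        self._factory = factory        # creates a new connection
        self._validate = validate      # health check run on borrow
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())  # pre-establish the whole pool

    def borrow(self, timeout=None):
        conn = self._pool.get(timeout=timeout)
        if not self._validate(conn):   # silently broken? replace it
            conn = self._factory()
        return conn

    def give_back(self, conn):
        self._pool.put(conn)

# Hypothetical usage with stand-in "connections":
pool = ConnectionPool(factory=lambda: object(), size=2)
c1, c2 = pool.borrow(), pool.borrow()
pool.give_back(c1)
assert pool.borrow() is c1             # reused, not re-established
```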

Server-side poolers proxy connections at the database protocol level: pgBouncer for PostgreSQL (transaction-mode pooling: a backend connection is only held during an active transaction, not for the client’s entire session), ProxySQL for MySQL (with query routing, caching, and read/write splitting). These allow thousands of application connections to multiplex onto a smaller number of actual database connections.

Client-side pools (HikariCP for Java, pgx pool for Go, psycopg3 pool for Python) sit in the application process. They are simpler and avoid a network hop to the proxy, but each application instance has its own pool, so total connections to the database = instances * pool_size.

TCP Fast Open

TCP Fast Open (TFO, RFC 7413) reduces connection establishment latency for repeat connections by allowing data to be sent in the SYN packet, eliminating 1 RTT.

The mechanism uses a TFO cookie: an HMAC-based token tied to the client’s IP address, generated by the server and stored by the client.

First connection (cookie request): client sends SYN with a TFO cookie request option. Server generates a cookie (HMAC(secret, client_IP)) and returns it in SYN-ACK. Client stores the cookie for future connections to this server.

Subsequent connections: client sends SYN + TFO cookie option + application data. Server validates the cookie: if valid, it passes the data to the application immediately, before the handshake completes. Server sends SYN-ACK + response data (if available). The final ACK from the client completes the handshake.

This saves 1 RTT for repeat connections — critical for short-lived connections like HTTP/1.1 without keep-alive, or for latency-sensitive RPC calls. TFO is supported in Linux (client since 3.6, server since 3.7) and requires TCP_FASTOPEN socket option or net.ipv4.tcp_fastopen sysctl.
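On Linux, enabling TFO on the server side is one setsockopt before listen(); a Python sketch (server-side TFO must also be allowed by the net.ipv4.tcp_fastopen sysctl):

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# TCP_FASTOPEN's value is the queue length for pending TFO requests.
srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_FASTOPEN, 16)
srv.bind(("127.0.0.1", 0))
srv.listen()

# A TFO client sends data in the SYN via sendto() with socket.MSG_FASTOPEN
# (Linux) instead of calling connect() first; the kernel attaches the
# stored cookie, or falls back to a normal handshake.
srv.close()
```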

TFO has limitations: the SYN data can be replayed by the network (delayed duplicate SYN), so TFO is only safe for idempotent requests. HTTP GET is safe; POST with side effects is not. Additionally, some middleboxes (firewalls, NAT devices) drop SYN packets with unknown options, breaking TFO — TFO implementations fall back to normal handshake on failure.

What happens during the TCP three-way handshake?

The client sends SYN with its initial sequence number. The server responds with SYN-ACK, acknowledging the client ISN and sending its own ISN. The client sends ACK, acknowledging the server ISN. After this exchange, both sides have synchronized sequence numbers and the connection enters ESTABLISHED state. Each SYN consumes one sequence number.

What is the TCP TIME_WAIT state and why does it exist?

After receiving the peer’s FIN and sending the final ACK, the active closer enters TIME_WAIT for 2*MSL (twice the Maximum Segment Lifetime; 60 seconds total on Linux). TIME_WAIT ensures the final ACK reaches the peer (if lost, the peer retransmits its FIN and the closer can re-ACK). It also prevents old duplicate segments from a previous connection from being interpreted as new data on a reused port+address pair.

How does TCP flow control work?

The receiver advertises a receive window (rwnd) in each ACK — the amount of buffer space available. The sender limits in-flight bytes to min(cwnd, rwnd). If rwnd drops to zero, the sender pauses and probes with 1-byte window probe segments until the receiver advertises a non-zero window. This prevents the sender from overwhelming the receiver.

What is TCP Fast Open (TFO)?

TFO reduces connection latency by allowing data to be sent in the SYN packet. The client first performs a regular handshake and receives a TFO cookie from the server. On subsequent connections, the client includes the cookie and data in the SYN. The server validates the cookie and processes the data without waiting for the handshake to complete, saving one round trip.

How does TCP congestion control work in CUBIC?

CUBIC is the default congestion control in Linux. It uses a cubic function of time since the last congestion event to compute the congestion window, rather than the linear increase of Reno. CUBIC is aggressive in recovering bandwidth after a loss event: it quickly approaches the previous maximum window, then probes slowly. This makes CUBIC efficient for high-bandwidth, high-latency networks.
