Overview of the Network Protocol Stack
The network protocol stack is the layered software and hardware system responsible for transmitting data between processes, machines, and networks. It is organized according to the TCP/IP model (a practical simplification of the OSI 7-layer model): Application → Transport (TCP/UDP) → Network (IP) → Data Link (Ethernet) → Physical. Each layer adds a header (encapsulation) on transmit and strips it on receive. Understanding the internals — from socket API to kernel bypass — is essential for designing high-throughput, low-latency networked systems.
The Socket API
The Berkeley socket API is the standard programming interface to the kernel’s networking stack; essentially the same calls exist on Linux, macOS, and Windows (Winsock). The core calls:
- socket(AF_INET, SOCK_STREAM, 0): create a TCP socket (returns a file descriptor).
- bind(fd, addr, addrlen): assign a local address and port.
- listen(fd, backlog): mark the socket as passive and set the accept queue depth.
- accept(fd, addr, addrlen): dequeue a completed connection and return a new fd.
- connect(fd, addr, addrlen): initiate the TCP three-way handshake.
- send(fd, buf, len, flags) / recv(fd, buf, len, flags): write to / read from the TCP send/receive buffers.
- close(fd): initiate the TCP FIN sequence and release the fd.
The socket fd is just a file descriptor — it integrates with the UNIX I/O model: select, poll, epoll, read, write, and sendfile all work on socket fds. This uniformity is a design strength of the UNIX philosophy.
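The passive-open sequence above (socket → bind → listen) can be sketched as follows. This is a minimal illustration, and make_listener is a hypothetical helper name, not part of the API; it binds to loopback and lets the kernel pick a port when port is 0:

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

// Create a TCP socket listening on 127.0.0.1:port.
// Returns the listening fd, or -1 on error.
int make_listener(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    int one = 1;  // allow quick restart while old sockets linger in TIME_WAIT
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);  // port 0 = kernel picks a free port

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 128) < 0) {    // 128 = accept-queue depth (backlog)
        close(fd);
        return -1;
    }
    return fd;
}
```

A server would then call accept() on the returned fd in a loop, getting a fresh fd per connection.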
Socket Buffers and Tuning
The kernel maintains a send buffer (SO_SNDBUF) and receive buffer (SO_RCVBUF) per socket. send() copies data into the kernel send buffer and returns immediately (if there is space) — the kernel handles the actual transmission asynchronously. recv() copies data from the kernel receive buffer to userspace.
For high-throughput connections (e.g., cross-datacenter at 10 Gbps), the TCP receive window (negotiated from receive buffer size) limits in-flight data. On a link with 100ms RTT and 10 Gbps bandwidth, the bandwidth-delay product (BDP) is 125 MB — the receive buffer must be at least that large to keep the link full. Linux auto-tunes buffer sizes up to net.core.rmem_max / net.core.wmem_max. Set these to 16–64 MB on high-throughput servers. TCP_WINDOW_CLAMP and window scaling (RFC 1323) are also relevant for WAN links.
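A sketch of the buffer-sizing dance, assuming a helper name request_rcvbuf of my own invention: on Linux, setsockopt(SO_RCVBUF) doubles the requested value to account for kernel bookkeeping, and silently caps it at net.core.rmem_max, so reading the value back with getsockopt is the only way to see what you actually got:

```c
#include <sys/socket.h>

// Request a receive buffer of `bytes` and return the size the kernel
// actually granted (Linux doubles the request and caps it at rmem_max).
int request_rcvbuf(int fd, int bytes) {
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof bytes);
    int actual = 0;
    socklen_t len = sizeof actual;
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);
    return actual;
}
```

For the 125 MB BDP example above, the request would only take effect after raising net.core.rmem_max accordingly (or with CAP_NET_ADMIN and SO_RCVBUFFORCE).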
Blocking vs Non-Blocking Sockets
By default, sockets are blocking: recv() blocks the calling thread until data arrives; send() blocks if the send buffer is full; accept() blocks until a connection arrives. This is simple but limits a single thread to serving one connection at a time.
Non-blocking sockets (set with fcntl(fd, F_SETFL, O_NONBLOCK)) return immediately with EAGAIN/EWOULDBLOCK if no data is available or the buffer is full. The application must use an event notification mechanism (select, poll, epoll) to know when to retry. Non-blocking I/O is the foundation of all high-connection-count servers: Nginx, Node.js, Redis, HAProxy.
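The fcntl dance and the EAGAIN contract can be shown in a few lines (set_nonblocking and read_would_block are illustrative helper names):

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>

// Switch an fd to non-blocking mode; returns 0 on success, -1 on error.
// Read the current flags first so we don't clobber them.
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

// Returns 1 if reading an empty non-blocking socket reported
// "would block" (EAGAIN/EWOULDBLOCK) instead of blocking the thread.
int read_would_block(int fd) {
    char c;
    ssize_t n = recv(fd, &c, 1, 0);
    return n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK);
}
```

An event loop registers the fd with epoll after getting EAGAIN and retries the read once readability is reported.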
epoll: Scalable Event Notification
The classic select() and poll() syscalls require passing the entire set of monitored file descriptors on every call — O(N) overhead per wakeup. For 10,000+ connections, this is prohibitive. Linux epoll (added in 2.6) solves this with O(1) wakeup cost regardless of the number of watched fds.
Usage: epoll_create1() returns an epoll instance fd; epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event) registers interest; epoll_wait(epfd, events, maxevents, timeout) blocks until events occur and returns only the ready fds. Two modes: level-triggered (LT) (default — keeps returning ready until all data is consumed, safer) and edge-triggered (ET) (fires once on state change — requires draining the entire buffer with a loop, enables more efficient event loops). Nginx, Node.js, Redis, and Envoy all use epoll as their event loop foundation on Linux.
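The create/register/wait cycle looks like this in miniature. The helper name wait_readable is mine; a real event loop would keep the epoll instance alive and watch many fds, but the three calls are exactly the ones described above:

```c
#include <sys/epoll.h>
#include <unistd.h>

// Register fd for readability on a fresh epoll instance, then block in
// epoll_wait until it is ready (or timeout_ms elapses).
// Returns the number of ready fds: 1, 0 on timeout, -1 on error.
int wait_readable(int fd, int timeout_ms) {
    int epfd = epoll_create1(0);
    if (epfd < 0) return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);   // interest list lives in the kernel

    struct epoll_event out;
    int n = epoll_wait(epfd, &out, 1, timeout_ms);  // returns only ready fds
    close(epfd);
    return n;
}
```

The key contrast with select/poll: the interest list is registered once with epoll_ctl, so epoll_wait does not re-scan all fds on every call.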
io_uring: Next-Generation Async I/O
Introduced in Linux 5.1 (2019), io_uring is a radical redesign of async I/O in Linux. It uses two lock-free ring buffers shared between kernel and userspace: a submission queue (SQ) where the application posts I/O requests, and a completion queue (CQ) where the kernel posts results. The application can submit batches of operations with a single io_uring_enter() syscall — or zero syscalls if the kernel is polling the SQ in a dedicated thread (IORING_SETUP_SQPOLL).
io_uring supports sockets, regular files, pipes, splice, and more — unifying async I/O across all fd types (unlike epoll, which only works with fds that support polling). It enables true zero-syscall I/O submission for high-IOPS workloads. Frameworks built on io_uring: liburing (reference library), Tokio (Rust async runtime) experimenting with io_uring backend, NGINX async I/O experiments, and database projects like ScyllaDB and RocksDB.
Zero-Copy I/O
Serving a file over a TCP connection traditionally requires: read file from kernel page cache → copy to userspace buffer → copy from userspace to kernel socket buffer → transmit. That’s two extra copies and two syscalls (read and write) per chunk. Zero-copy eliminates the userspace trip:
- sendfile(out_fd, in_fd, offset, count): transfers data directly from the file page cache to the socket buffer entirely in the kernel. Used by Nginx for static file serving and by Kafka for log segment delivery to consumers — Kafka’s famous "zero-copy" is precisely this.
- splice(fd_in, off_in, fd_out, off_out, len, flags): moves data between two kernel buffers (e.g., file to pipe, pipe to socket) without copying to userspace, using a pipe as an intermediary.
- MSG_ZEROCOPY flag on send(): pins userspace buffer pages and DMAs directly from them — avoids the copy into the kernel send buffer, with a completion notification when the NIC is done with the buffer.
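A minimal sendfile sketch, with the caveat that in_fd must be a regular (mmap-able) file. The demo uses memfd_create purely as a test fixture so the example is self-contained; in Nginx or Kafka the out_fd would be a connected socket:

```c
#define _GNU_SOURCE
#include <sys/mman.h>       // memfd_create: anonymous in-memory file (fixture only)
#include <sys/sendfile.h>
#include <unistd.h>

// Copy `count` bytes from offset 0 of in_fd to out_fd entirely inside
// the kernel — the data never touches a userspace buffer.
ssize_t copy_in_kernel(int out_fd, int in_fd, size_t count) {
    off_t off = 0;                         // sendfile advances this as it reads
    return sendfile(out_fd, in_fd, &off, count);
}

// Self-contained demo: round-trip 5 bytes through sendfile between two
// anonymous in-memory files. Returns 1 on success, 0 on failure.
int sendfile_roundtrip_ok(void) {
    int in = memfd_create("src", 0), out = memfd_create("dst", 0);
    if (in < 0 || out < 0) return 0;
    if (write(in, "hello", 5) != 5) return 0;
    if (copy_in_kernel(out, in, 5) != 5) return 0;
    char buf[5];
    if (pread(out, buf, 5, 0) != 5) return 0;
    int ok = (buf[0] == 'h' && buf[4] == 'o');
    close(in);
    close(out);
    return ok;
}
```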
Kernel Bypass Networking: DPDK and RDMA
Even with zero-copy, the Linux kernel TCP/IP stack adds latency: syscall overhead (~1 μs), scheduler jitter, interrupt handling, and protocol processing in the kernel. For extreme throughput (100 Gbps+) or sub-10 μs latency, kernel bypass removes the kernel from the data path entirely.
DPDK (Data Plane Development Kit): a userspace framework that programs the NIC directly via a Poll Mode Driver (PMD), polling the NIC ring buffers in a tight loop on dedicated CPU cores. No syscalls, no interrupts, no kernel involvement. Achieves ~100 Gbps line rate and ~1–5 μs latency for packet processing. Used in Cisco VPP, OVS-DPDK, 5G vRAN, and financial trading infrastructure.
RDMA (Remote Direct Memory Access): allows one server to read/write another server’s memory directly over the network without involving the remote CPU or OS. The NIC handles the transfer autonomously using the DMA engine. Protocols: InfiniBand (dedicated fabric), RoCE (RDMA over Converged Ethernet), iWARP. Used in HPC clusters, AI training (NCCL all-reduce), distributed databases, and Microsoft Azure’s RDMA-based storage.
TCP Tuning for High Performance
Key TCP tunable parameters for production servers:
- TCP_NODELAY (disable Nagle’s algorithm): send small packets immediately without waiting to coalesce. Critical for RPC latency (Redis, gRPC, databases). The Nagle + delayed ACK interaction can add 40ms of latency.
- tcp_max_syn_backlog and net.core.somaxconn: depth of the SYN queue and accept queue; increase for high-connection-rate servers.
- tcp_tw_reuse: allow reuse of TIME_WAIT sockets for new connections with the same 4-tuple. Important for high-churn client workloads (e.g., short-lived HTTP/1.1 connections).
- tcp_fin_timeout: how long to keep FIN_WAIT_2 state; reduce to free fd resources faster.
- tcp_slow_start_after_idle=0: disable slow-start restart after idle; important for persistent connections with bursts of activity.
- Congestion control algorithm: BBR (Bottleneck Bandwidth and RTT) outperforms CUBIC on high-BDP and lossy links — used by Google for YouTube and GCP.
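The per-socket knob most relevant to RPC latency, TCP_NODELAY, is a one-line setsockopt (disable_nagle is an illustrative helper name):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>    // TCP_NODELAY
#include <sys/socket.h>

// Disable Nagle's algorithm so small writes (RPC requests, Redis
// commands) go out immediately instead of waiting to be coalesced.
// Returns 0 on success.
int disable_nagle(int fd) {
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
}
```

The sysctl-level knobs (tcp_tw_reuse, somaxconn, the BBR selection via net.ipv4.tcp_congestion_control) are set system-wide rather than per socket.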
QUIC: Userspace Protocol Stack
QUIC (RFC 9000, standardized 2021) is a transport protocol built on UDP that implements reliable, ordered, multiplexed streams — essentially TCP+TLS+HTTP/2 features, but running entirely in userspace. Running in userspace means: no kernel changes required for protocol improvements, faster deployment of new congestion control algorithms, and protocol evolution at the speed of software releases rather than OS kernel cycles.
QUIC solves TCP’s head-of-line blocking at the transport layer (in TCP, one lost packet stalls delivery on all multiplexed streams; in QUIC, only the stream awaiting the lost packet stalls). It integrates TLS 1.3 natively, reducing the handshake to 1-RTT (or 0-RTT for resumed sessions). HTTP/3 is HTTP over QUIC. Major implementations: quiche (Cloudflare/Rust), ngtcp2, MsQuic (Microsoft), quic-go. Deployed by Google, Cloudflare, Meta, and AWS.
Interview Design Checklist
- Trace a send() call from userspace through the kernel TCP/IP stack to the NIC.
- Explain the difference between blocking, non-blocking, and async I/O.
- Compare select/poll vs epoll — why is epoll O(1)?
- What is io_uring and how does it differ from epoll?
- Explain sendfile() and name two systems that use it for zero-copy.
- When would you use DPDK? What are the trade-offs vs kernel TCP?
- What is RDMA and in what workloads does it provide the most benefit?
- Why does QUIC run in userspace? What TCP limitations does it address?