Low Level Design: Virtual Memory System

What Is a Virtual Memory System?

Virtual memory is a foundational OS abstraction that gives each process the illusion of having its own large, contiguous address space — isolated from every other process. The CPU and OS cooperate to translate virtual addresses (VA) used by programs into physical addresses (PA) in DRAM. This translation happens on every memory access via the Memory Management Unit (MMU) built into the CPU. Without virtual memory, every process would need to be written for specific physical addresses, isolation would be impossible, and running multiple programs safely would require massive engineering effort.

Page Tables and Multi-Level Address Translation

The OS maintains a page table per process that maps virtual page numbers to physical frame numbers. On x86-64, a 4-level page table is used: PGD (Page Global Directory) → PUD (Page Upper Directory) → PMD (Page Middle Directory) → PTE (Page Table Entry). Each level is indexed by a slice of the 48-bit virtual address (9 bits per level, 12 bits for the page offset). A 4KB page is the standard granularity.

Each PTE contains the physical frame number plus control flags: present (page is in RAM), writable, user/kernel (privilege level), accessed (set by hardware on any access, read or write), and dirty (set by hardware on write). The OS reads the accessed and dirty bits to implement its page replacement policy without needing to intercept every access.

TLB: Translation Lookaside Buffer

A 4-level page table walk requires 4 memory accesses per translation — unacceptable overhead for every load/store instruction. The CPU therefore includes a Translation Lookaside Buffer (TLB): a small, highly associative hardware cache of recent VA→PA translations. A TLB hit resolves the translation in a cycle or two. A TLB miss triggers a hardware page table walk (on x86) or a software refill handler (on classic MIPS and some other RISC architectures).

When an OS modifies a page table mapping — for example during munmap() or when revoking write access — it must invalidate TLB entries on all CPUs that may have cached the old translation. This is done via a TLB shootdown: the modifying CPU sends an Inter-Processor Interrupt (IPI) to all other CPUs, causing them to execute INVLPG or reload CR3 to flush the stale entries. TLB shootdown is a real scalability bottleneck on large multi-core servers doing heavy mmap/munmap operations.

Demand Paging

Demand paging is the technique of not loading a page into physical memory until it is actually accessed. When a process starts, the OS records its mappings (in Linux, as VMA structures) and leaves the corresponding page table entries not-present. When the program accesses a page for the first time, the CPU raises a page fault (exception). The OS page fault handler runs in kernel mode: it allocates a physical frame, reads the page from disk (or zeros it for anonymous memory), updates the PTE to mark it present with the correct physical address, and returns to userspace — which retries the faulting instruction transparently.

Demand paging enables programs larger than physical memory, fast startup (only pages actually touched are loaded), and efficient sharing of pages between processes running the same binary.

Copy-on-Write (COW)

Copy-on-write is a critical optimization used by fork(). After fork, parent and child share the same physical pages, but both have their PTEs marked read-only. When either process attempts a write, a page fault fires. The fault handler checks that this is a COW fault (not a true access violation), allocates a new physical frame, copies the content of the original page into it, updates the faulting process’s PTE to point to the new frame with write permission, and returns.

This means fork() followed immediately by exec() (the classic Unix pattern for spawning a process) is very cheap — almost no physical memory is copied because exec() discards the address space before any write fault occurs. COW is also used by Redis for background saves: fork() a child to snapshot memory while the parent continues serving writes, with only modified pages consuming extra memory.

Page Replacement Algorithms

When physical memory is full and a new page must be brought in, the OS must evict an existing page. True LRU is expensive to implement in hardware, so operating systems use approximations. The CLOCK algorithm (also called second-chance) sweeps through pages in circular order: if a page’s accessed bit is set, clear it and move on; if it is clear, evict that page. This approximates LRU at a small amortized cost per eviction.

Linux uses a more sophisticated active/inactive list split: pages start on the inactive list, get promoted to active on second access, and eviction candidates are pulled from the tail of the inactive list. Anonymous pages (heap/stack) and file-backed pages are managed separately. The OOM killer is invoked as a last resort when memory cannot be reclaimed — it selects and kills a process based on memory consumption and heuristics.

Swapping

Swapping extends effective physical memory by evicting anonymous pages (heap, stack, mmap’d anonymous regions) to a swap partition or swap file on disk. The PTE of a swapped-out page is marked not-present, and the swap location is encoded in the PTE bits. On access, the page fault handler reads the page back from swap into a physical frame.

Linux exposes a vm.swappiness tunable (0–100) controlling how aggressively the kernel swaps anonymous pages vs reclaims file-backed page cache. Setting vm.swappiness=0 on database servers is common to keep hot heap data in RAM, though it does not disable swapping entirely under severe memory pressure. Containers and VMs often disable swap entirely for predictable latency.

Huge Pages

The default page size of 4KB means a process using 1GB of memory needs 262,144 PTEs. With huge pages (2MB on x86-64), that drops to 512 entries — far less TLB pressure for workloads with large working sets. 1GB huge pages are also supported for extremely large mappings.

Linux offers two mechanisms: Transparent Huge Pages (THP), which automatically promotes eligible anonymous regions to 2MB pages in the background (with khugepaged), and hugetlbfs, which pre-allocates a pool of huge pages that applications explicitly request via mmap() with MAP_HUGETLB. Oracle Database, JVMs, and in-memory databases like Redis and Aerospike explicitly use huge pages to reduce TLB misses and improve throughput on large heaps.

NUMA: Non-Uniform Memory Access

Multi-socket servers have multiple memory controllers, each attached to one CPU socket. Accessing memory attached to the local socket takes ~80ns; accessing remote memory (across the QPI/UPI interconnect) takes ~140ns or more. This is the NUMA architecture. For latency-sensitive workloads, memory allocation policy matters significantly.

Linux provides numactl to bind processes to specific NUMA nodes and control allocation policy (local-first, interleave, etc.). The kernel’s default policy is local-first allocation. A subtler issue is false sharing: two CPU cores on different sockets (or same socket) modify different variables that happen to reside in the same 64-byte cache line. The cache line bounces between cores’ L1/L2 caches, serializing what should be independent writes. Fix: pad structs so hot variables occupy their own cache lines, or use per-CPU data structures.

Memory-Mapped Files

mmap() maps a file (or device) directly into the process’s virtual address space. Reading a mapped region triggers page faults that load file data from the page cache — the same unified cache used by read()/write() syscalls, so data is not duplicated. Writes to a MAP_SHARED mapping update the page cache and are flushed asynchronously to disk; msync() forces synchronous flush.

Memory-mapped files are used extensively: databases use them for I/O (SQLite and LMDB; RocksDB optionally supports mmap-based reads); the JVM uses mmap for class file loading and memory-mapped ByteBuffers; Kafka memory-maps its index files and relies on the page cache plus sendfile() for zero-copy transfer of log segments. Shared mappings (MAP_SHARED) also enable efficient inter-process communication — two processes mapping the same file share the same physical pages.

Interview Design Checklist

  • Explain the VA→PA translation path: page table walk, TLB hit/miss, MMU role.
  • Draw the 4-level page table structure for x86-64 and explain PTE flags.
  • Describe the page fault handler flow for an anonymous mapping and a file-backed mapping.
  • Explain COW and why fork()+exec() is cheap.
  • Compare LRU, CLOCK, and Linux active/inactive list eviction.
  • When would you use huge pages? What are the trade-offs with THP vs hugetlbfs?
  • Explain TLB shootdown and why it is a scalability concern.
  • How does NUMA topology affect memory allocation strategy on a 4-socket server?

FAQ: Virtual Memory System

How does virtual memory enable process isolation?

Each process has its own virtual address space — typically 128TB of user space on x86-64 with 48-bit addressing. The OS maintains a separate page table for each process. When the CPU accesses a virtual address, the MMU translates it to a physical address using the current process’s page table. Process A and Process B can both use virtual address 0x1000 yet map to different physical frames. A rogue process cannot read another process’s memory because the mappings simply don’t exist in its page table.

What is copy-on-write and how does it make fork() efficient?

When a process calls fork(), the OS creates a new process with an identical address space. Copying all pages immediately would be expensive. Copy-on-write (CoW) defers the copy: both parent and child initially share the same physical pages, marked read-only. When either process writes to a shared page, a page fault fires. The kernel allocates a new physical page, copies the content, updates the writing process’s page table entry to point to the new page, and resumes execution. Only written pages are ever duplicated. For processes that exec() immediately after fork(), almost no pages are ever copied.

What are huge pages and when should you use them?

The default page size is 4KB. With millions of pages, the page table is large and TLB pressure is high — a process that accesses 1GB of data needs 262,144 page table entries and TLB misses are frequent. Huge pages (2MB on x86) reduce the number of TLB entries needed by 512x. Use huge pages for: databases (PostgreSQL and Oracle support explicitly allocated huge pages via hugetlbfs), in-memory data stores (Redis can benefit), and JVMs with large heaps. Linux Transparent Huge Pages (THP) automatically backs memory regions with huge pages. Disable THP for latency-sensitive workloads (Redis recommends disabling it) because THP defragmentation can cause latency spikes.
