Low Level Design: Memory Management and Allocators

Stack vs Heap Allocation

Every running program uses two distinct memory regions for dynamic data: the stack and the heap. Understanding the tradeoffs between them drives most memory management design decisions.

The stack is a LIFO (last-in, first-out) region managed automatically by the CPU and compiler. Each function call pushes a stack frame containing local variables, saved registers, and the return address. When the function returns, the frame is popped — its memory is instantly reclaimed by decrementing the stack pointer. Stack allocation is O(1): a single subtract instruction. There is no fragmentation because frames are always reclaimed in LIFO order.

The stack has a fixed size per thread, typically 8 MB on Linux (configurable with ulimit -s or pthread_attr_setstacksize). Exceeding this limit causes a stack overflow: a segfault from touching the unmapped guard page beyond the end of the stack. The stack is therefore not suitable for large data structures or for data that must outlive the function that created it.

The heap is a region managed explicitly (in C/C++) or by a garbage collector (Java, Go, Python). Allocation returns a pointer to a block of memory that persists until explicitly freed or collected. Heap allocation is significantly more expensive than stack allocation due to metadata management, fragmentation handling, and synchronization between threads. The heap is bounded only by available virtual memory (typically many gigabytes), making it suitable for large and long-lived allocations.

The general rule: prefer stack allocation for small, short-lived data; use heap allocation for large data, data shared between functions/threads, or data whose lifetime is not tied to a single function scope.
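
A minimal C sketch of this rule of thumb; the function names and sizes are arbitrary, chosen only to contrast the two lifetimes:

  #include <stdio.h>
  #include <stdlib.h>

  /* Small, short-lived data: lives in this function's stack frame and is
     reclaimed automatically when the function returns. */
  static int sum_of_squares(int n) {
      int squares[16];                     /* stack allocation: one stack-pointer adjustment */
      int total = 0;
      for (int i = 0; i < n && i < 16; i++) {
          squares[i] = i * i;
          total += squares[i];
      }
      return total;                        /* frame (including squares) reclaimed here */
  }

  /* Large data that must outlive its creating function: has to come from the
     heap, and the caller becomes responsible for freeing it. */
  static double *make_samples(size_t count) {
      double *samples = malloc(count * sizeof *samples);   /* heap allocation */
      return samples;                      /* pointer remains valid after return */
  }

  int main(void) {
      printf("%d\n", sum_of_squares(10));
      double *s = make_samples(1u << 20);  /* 8 MB of doubles: too large for the stack */
      free(s);                             /* heap memory must be released explicitly */
      return 0;
  }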

malloc Implementation: Free Lists and Policies

The heap allocator’s job is to service malloc(n) requests by returning a pointer to n usable bytes, and to reclaim that memory on free(ptr). The core data structure is the free list: a collection of free (available) memory blocks.

Each free block stores metadata in its header (and sometimes footer): block size and a flag indicating whether it’s free or allocated. The allocator uses this metadata to find free blocks and to merge adjacent free blocks.

Free block search policies:

  • First-fit: scan the free list from the beginning and return the first block large enough. Fast when a suitable block sits near the front, but splinters tend to accumulate at the front of the list over time.
  • Next-fit: like first-fit, but resume scanning from where the previous search left off. Distributes fragmentation more evenly across the heap.
  • Best-fit: scan the entire free list and return the smallest block that fits. Minimizes wasted space per allocation but requires a full scan (O(n)) and produces many tiny unusable fragments.
  • Segregated free lists: maintain separate free lists for different size classes (e.g., 8, 16, 32, 64, 128 bytes, …). Allocation for a given size goes straight to the appropriate list (O(1) in the common case). This is the approach used by modern high-performance allocators such as jemalloc and tcmalloc.
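
A minimal sketch of a first-fit search, assuming a singly linked free list of blocks (the header layout is simplified; a real allocator packs size and status into a header word and splits the returned block if it is much larger than the request):

  #include <stddef.h>

  /* Simplified free-list node: the size of the free block plus a link to the next one. */
  typedef struct block {
      size_t size;
      struct block *next;
  } block_t;

  /* First-fit: return the first block large enough and unlink it. Returns NULL
     if nothing fits, in which case the allocator would grow the heap via
     sbrk/mmap and retry. */
  static block_t *first_fit(block_t **free_list, size_t want) {
      block_t **prev = free_list;
      for (block_t *b = *free_list; b != NULL; prev = &b->next, b = b->next) {
          if (b->size >= want) {
              *prev = b->next;             /* unlink the chosen block */
              return b;
          }
      }
      return NULL;
  }

Best-fit would instead remember the smallest block seen that still fits before unlinking it; segregated fits would replace the scan entirely by indexing an array of free lists by size class.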

When free(ptr) is called, the block is returned to the appropriate free list. If the block is adjacent to other free blocks, they should be coalesced (merged) into a single larger block to prevent external fragmentation. Coalescing requires knowing the size and status of adjacent blocks.

Boundary Tags for O(1) Coalescing

Knuth’s boundary tag technique enables O(1) coalescing with both the previous and next block in memory, without maintaining any additional linked list of blocks.

The key insight: store a copy of the block size at both the beginning (header) and end (footer) of each block. The footer is at address block_start + size - sizeof(footer).

With boundary tags, coalescing on free(ptr) proceeds:

  1. Check next block: jump forward by current_size bytes to find next block’s header. If next block is free, merge.
  2. Check previous block: read the footer at ptr - sizeof(footer) to get previous block’s size and free status. If previous block is free, merge.

Each check and merge is O(1). Without footers, finding the previous block requires traversing from the heap start, making coalescing O(n).
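
A sketch of the two neighbor checks, assuming the common convention that the header and footer each hold a single word packing the block size (a multiple of 8) with an allocated flag in the low bit; the names here are illustrative, not any particular allocator's API:

  #include <stdint.h>
  #include <stddef.h>

  #define ALLOCATED 0x1u                  /* low bit of the packed size word */

  static size_t blk_size(uintptr_t word)  { return (size_t)(word & ~(uintptr_t)ALLOCATED); }
  static int    blk_alloc(uintptr_t word) { return (int)(word & ALLOCATED); }

  /* hdr points at a block's header word; sizes count the whole block,
     header and footer included. */
  static uintptr_t *next_header(uintptr_t *hdr) {
      return (uintptr_t *)((char *)hdr + blk_size(*hdr));   /* jump forward by block size */
  }
  static uintptr_t *prev_footer(uintptr_t *hdr) {
      return hdr - 1;                                       /* previous block's footer sits just before our header */
  }

  /* Called from free(): both neighbor checks are constant-time pointer arithmetic. */
  static void coalesce(uintptr_t *hdr) {
      size_t size = blk_size(*hdr);

      uintptr_t next_word = *next_header(hdr);
      if (!blk_alloc(next_word))                            /* 1. next block is free: absorb it */
          size += blk_size(next_word);

      uintptr_t prev_word = *prev_footer(hdr);
      if (!blk_alloc(prev_word)) {                          /* 2. previous block is free: extend backward */
          size += blk_size(prev_word);
          hdr = (uintptr_t *)((char *)hdr - blk_size(prev_word));
      }

      *hdr = (uintptr_t)size;                               /* merged header, allocated bit clear */
      *((uintptr_t *)((char *)hdr + size) - 1) = (uintptr_t)size;   /* matching footer */
  }

Real allocators also place allocated sentinel (prologue/epilogue) blocks at both ends of the heap so these checks never walk past a boundary, and they relink the merged block into the appropriate free list.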

The overhead is 8–16 bytes per allocation (header + footer). For small allocations (8–32 bytes), this overhead is significant — modern allocators use separate metadata for small objects (slab allocator) to avoid per-object header/footer cost.

Boundary tags also enable safe traversal of the heap for debugging and ASAN-style memory safety tools: starting from the heap base, each block’s size is in its header, so all blocks can be visited in O(n) total.

Buddy System Allocator

The buddy system allocator constrains all allocations to power-of-2 sizes and uses a clever pairing scheme to achieve O(log N) allocation, deallocation, and coalescing.

The allocator maintains a free list for each size class: 2^0, 2^1, 2^2, …, 2^k bytes. Allocation of n bytes:

  1. Round up n to the next power of 2, say 2^k
  2. If the free list for 2^k is non-empty, return a block from it
  3. Otherwise, find the smallest free list with blocks (say 2^j where j > k), take a block, split it in half, put one half in the 2^(j-1) free list, recurse until 2^k

Deallocation of a 2^k block at address addr:

  1. Compute the buddy address: buddy = addr XOR (1 << k), where addr is the block's byte offset from the start of the managed region. The buddy is always at a specific, computable offset; no searching is required.
  2. If the buddy is free (check the free list), remove it from the free list and merge the two into a 2^(k+1) block
  3. Recurse upward: check if the merged block’s buddy is free, merge again, repeat

The XOR trick works because every block of size 2^k starts at an offset that is a multiple of 2^k, and each buddy pair occupies a region aligned to 2^(k+1); the two buddies' offsets therefore differ only in bit k.
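
A runnable illustration of the buddy computation, treating addresses as byte offsets from the start of the managed region (the zone size and offsets below are arbitrary):

  #include <stdio.h>
  #include <stddef.h>

  /* Buddy of the block of size 2^k that starts at byte offset addr: flip bit k.
     This works because every block starts at a multiple of its own size. */
  static size_t buddy_of(size_t addr, unsigned k) {
      return addr ^ ((size_t)1 << k);
  }

  int main(void) {
      unsigned k = 12;                         /* order-0 blocks are 2^12 = 4 KB pages */
      size_t a = 5 * 4096;                     /* a page at offset 20480 */

      size_t b = buddy_of(a, k);               /* 16384: the page just below it */
      printf("buddy of %zu is %zu\n", a, b);
      printf("buddy of %zu is %zu\n", b, buddy_of(b, k));   /* symmetric: back to 20480 */

      /* After merging that pair into an 8 KB block at offset 16384, its buddy
         at the next order is found the same way, with k+1. */
      printf("order-1 buddy of %zu is %zu\n", b, buddy_of(b, k + 1));   /* 24576 */
      return 0;
  }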

The Linux kernel’s zoned page allocator uses the buddy system for physical page allocation (page size = 4 KB; orders 0–10 give 4 KB–4 MB blocks); the per-order free-block counts are what /proc/buddyinfo reports. The buddy system is ideal for kernel page allocation because power-of-2 sizes match natural alignment requirements and the computable-buddy property makes coalescing fast.

The main weakness: severe internal fragmentation. Requesting 65 bytes requires a 128-byte block, wasting 63 of the 128 bytes (roughly 49%). The buddy system is best suited for large, power-of-2-aligned allocations. Slab allocators handle the fixed-size, small-object case more efficiently.

Slab Allocator

The slab allocator, designed by Jeff Bonwick (1994) for the SunOS kernel, addresses the allocation pattern dominant in kernel code: repeatedly allocating and freeing objects of the same type (e.g., struct task_struct, struct inode, struct socket).

A slab is a contiguous region of memory (one or more pages) divided into fixed-size slots, each sized for a specific object type. A cache manages one or more slabs for a single object type.

The key innovation: when an object is freed, it is returned to the slab but its initialization is preserved. Mutexes remain initialized, linked list heads remain initialized, reference counts are reset but the memory layout is intact. The next allocation from the slab returns a pre-initialized object — the constructor cost is paid only once per object across its entire lifetime in the slab.

Slab states:

  • full: all slots occupied
  • partial: some slots free (preferred for allocation)
  • empty: all slots free (can be returned to the buddy allocator)

Allocation: take a slot from a partial slab (or create a new slab from the buddy allocator if none exists). Free: return slot to its slab, mark as available.

The Linux kernel uses SLAB, SLUB (the modern default since 2.6.23), and SLOB (for embedded systems) variants. SLUB simplifies the original design by eliminating per-slab queues and storing metadata in the page frame itself, reducing memory overhead.

User-space analogs: memory pools in application code follow the same principle — pre-allocate a fixed number of objects, hand them out from a free list, return them without calling the system allocator. This is critical in latency-sensitive systems (game engines, HFT) where malloc/free latency spikes are unacceptable.
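
A user-space sketch of the same idea, using an invented connection-like object type: the pool pre-allocates one slab of N slots and threads an intrusive free list through the unused ones, so alloc and free are a couple of pointer operations and never call malloc after setup:

  #include <stdlib.h>

  typedef struct conn { long id; char buf[248]; } conn_t;   /* pooled object type (illustrative) */

  typedef struct pool {
      conn_t *slots;        /* one contiguous slab of objects */
      void  **free_head;    /* intrusive free list threaded through unused slots */
  } pool_t;

  static int pool_init(pool_t *p, size_t n) {
      p->slots = calloc(n, sizeof(conn_t));     /* one up-front allocation for the whole slab */
      if (!p->slots) return -1;
      p->free_head = NULL;
      for (size_t i = 0; i < n; i++) {          /* push every slot onto the free list */
          void **slot = (void **)&p->slots[i];
          *slot = p->free_head;
          p->free_head = slot;
      }
      return 0;
  }

  static conn_t *pool_alloc(pool_t *p) {        /* O(1): pop the free-list head */
      void **slot = p->free_head;
      if (!slot) return NULL;                   /* pool exhausted */
      p->free_head = (void **)*slot;
      return (conn_t *)slot;
  }

  static void pool_free(pool_t *p, conn_t *c) { /* O(1): push the slot back */
      void **slot = (void **)c;
      *slot = p->free_head;
      p->free_head = slot;
  }

Unlike the kernel slab, this sketch reuses the first bytes of each free object for the free-list link, so constructed state is not preserved across free/alloc; a constructor-preserving pool keeps its free-list links outside the object's constructed fields, as Bonwick's original design does.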

Memory Fragmentation

Fragmentation is the condition where the allocator cannot satisfy a request even though sufficient total free memory exists. There are two kinds:

External fragmentation: free memory exists but is split into many small, non-contiguous blocks. A request for 1 MB fails even if there are hundreds of free 10 KB blocks scattered across the heap. This is the classic fragmentation problem and is fundamentally unsolvable without moving objects (compaction).

External fragmentation is caused by interleaving allocations of different lifetimes. A common pattern: short-lived and long-lived objects are allocated in sequence; when short-lived objects are freed, they leave gaps between long-lived objects that are too small for subsequent large allocations.

Internal fragmentation: a block is allocated larger than requested. Causes:

  • Size class rounding: a request for 65 bytes is served from the 128-byte size class — 63 bytes wasted
  • Alignment requirements: allocations padded to 8 or 16-byte boundaries
  • Minimum block size: allocator requires blocks of at least 16 bytes for metadata

Internal fragmentation is bounded (by the size class granularity) and predictable. External fragmentation is unbounded in theory but controlled by allocator design (segregated free lists, periodic compaction) and allocation patterns.

Compaction eliminates external fragmentation by moving live objects to fill gaps and updating all pointers. This requires either a runtime that tracks all pointers (garbage-collected languages) or a level of indirection (handles instead of raw pointers). C and C++ programs cannot compact their heap because raw pointers cannot be updated transparently. Java, Go, and C# GCs compact routinely.

Garbage Collection Overview

Mark-and-sweep is the foundational GC algorithm. Phase 1 (mark): starting from roots (global variables, stack variables, CPU registers), traverse all reachable objects and mark them. Phase 2 (sweep): scan the entire heap; any unmarked object is dead and returned to the free list. Mark-and-sweep requires a stop-the-world pause to ensure a consistent view of the heap during marking — the mutator (application) cannot run while the GC is marking, or newly created objects might be missed.
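
A compact sketch of the two phases over a toy heap, where every object carries a mark bit, a fixed array of child pointers, and a link on an all-objects list (this layout is invented for illustration; a real collector finds pointers via type maps or conservative stack scanning):

  #include <stddef.h>
  #include <stdlib.h>

  #define MAX_CHILDREN 4

  typedef struct obj {
      int marked;                         /* set during the mark phase */
      struct obj *children[MAX_CHILDREN]; /* outgoing references */
      struct obj *all_next;               /* intrusive list of every allocated object */
  } obj_t;

  /* Phase 1: mark everything reachable from the roots (depth-first). */
  static void mark(obj_t *o) {
      if (o == NULL || o->marked) return;
      o->marked = 1;
      for (int i = 0; i < MAX_CHILDREN; i++)
          mark(o->children[i]);
  }

  /* Phase 2: sweep the whole heap; unmarked objects are unreachable and freed. */
  static void sweep(obj_t **all_objects) {
      obj_t **link = all_objects;
      while (*link) {
          obj_t *o = *link;
          if (o->marked) {
              o->marked = 0;              /* clear the bit for the next collection cycle */
              link = &o->all_next;
          } else {
              *link = o->all_next;        /* unlink and reclaim the dead object */
              free(o);
          }
      }
  }

  static void collect(obj_t **roots, size_t nroots, obj_t **all_objects) {
      for (size_t i = 0; i < nroots; i++) mark(roots[i]);   /* stop-the-world in this sketch */
      sweep(all_objects);
  }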

Generational GC exploits the generational hypothesis: most objects die young. Objects are allocated in a small, fast young generation (also called the nursery). Minor GC collects the young generation frequently; since most objects there are already dead, this is fast. Objects that survive several minor GCs are promoted to the old generation, which is collected infrequently with a more expensive major GC. HotSpot's Serial, Parallel, and G1 collectors are generational; ZGC and Shenandoah were originally single-generation, with generational variants added more recently (Generational ZGC shipped in JDK 21).

Concurrent GC runs the mark phase concurrently with the mutator, eliminating most of the stop-the-world pause. The challenge: the mutator can modify the object graph while marking is in progress. Write barriers (hooks on pointer writes) ensure the GC sees all pointer modifications during concurrent marking. ZGC and Shenandoah (Java), and Go’s GC use concurrent marking. Pause times drop from hundreds of milliseconds to sub-millisecond.

Incremental GC breaks the GC work into small increments interleaved with mutator execution. Instead of one long pause, many small pauses occur. The total work is higher (due to write barrier overhead) but individual pauses are bounded. This is the approach used by older low-latency GC designs and JavaScript engines.

Reference counting (Python, Swift, Rust's Rc/Arc) tracks each object's reference count and frees the object the moment its count reaches zero. Advantages: immediate reclamation, no stop-the-world pauses. Disadvantages: cannot collect cycles (Python supplements it with a cycle detector), high overhead for pointer-heavy workloads (every pointer assignment modifies reference counts), and poor cache behavior.
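
Reference counting in miniature, as a hand-rolled retain/release pair rather than any particular runtime's API:

  #include <stdlib.h>

  typedef struct node {
      long refcount;
      struct node *other;    /* a reference this object holds */
  } node_t;

  static node_t *node_new(void) {
      node_t *n = calloc(1, sizeof *n);
      if (n) n->refcount = 1;            /* the creator holds the first reference */
      return n;
  }

  static node_t *retain(node_t *n) {     /* every new owner increments the count */
      if (n) n->refcount++;
      return n;
  }

  static void release(node_t *n) {       /* the last owner out frees the object */
      if (n && --n->refcount == 0) {
          release(n->other);             /* drop the references this object held */
          free(n);
      }
  }

Every pointer assignment becomes a retain of the new target plus a release of the old one, which is the per-write overhead noted above, and the counts must become atomic once objects are shared across threads (as with Arc). Two nodes whose other fields point at each other never drop to zero: that is the cycle problem.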

jemalloc and tcmalloc

Performance-sensitive production systems often replace the system malloc (glibc's ptmalloc2): under heavy multi-threaded allocation, its arena locking becomes a serious contention bottleneck. Purpose-built allocators like jemalloc and tcmalloc address this with per-thread or per-CPU caches.

jemalloc (Jason Evans, 2005; used by FreeBSD, Firefox, Meta’s backend services):

  • Arenas: multiple independent heaps (by default, 4x the number of CPUs). Each thread is assigned to an arena, distributing lock contention across arenas.
  • Size classes: fine-grained size classes (e.g., 8, 16, 32, 48, 64, 80, 96, 112, 128, …) minimize internal fragmentation
  • Bins: per-arena, per-size-class free lists. Small allocations (<= 14 KB) are served from bins; large allocations are managed as extents (contiguous runs of pages).
  • Dirty page decay: unused pages are returned to the OS after a configurable delay, controlling RSS.
  • Introspection: jemalloc provides heap profiling and detailed runtime statistics, e.g. via malloc_stats_print()
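
When a program is linked against (or LD_PRELOADed with) jemalloc, that statistics report can be requested directly; malloc_stats_print() is jemalloc's entry point for this, and with all-NULL arguments the report goes to stderr:

  /* Build against jemalloc, e.g. by linking -ljemalloc. */
  #include <stdlib.h>
  #include <jemalloc/jemalloc.h>

  int main(void) {
      void *p = malloc(1 << 20);            /* allocate something so the stats are non-trivial */
      free(p);
      malloc_stats_print(NULL, NULL, NULL); /* dump per-arena, per-size-class statistics to stderr */
      return 0;
  }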

tcmalloc (Thread-Caching Malloc; Google 2005; used by Chrome, Google servers, Go runtime):

  • Thread-local cache: each thread has a per-thread cache for small allocations (<= 256KB). Allocation and deallocation from the thread cache are lock-free.
  • Central free list: when thread cache overflows, excess blocks are returned to the central free list (one per size class, with a spinlock).
  • Page heap: large allocations (> 256KB) go directly to the page heap, which manages spans (runs of pages) with a radix tree.
  • Go’s runtime uses a heavily modified tcmalloc-derived allocator with per-P (processor) mcaches instead of per-thread caches, integrated with the garbage collector.

Benchmarks of allocation-heavy multi-threaded workloads often show jemalloc and tcmalloc running 2–5x faster than ptmalloc2, with lower fragmentation. For services with high allocation rates (web servers, databases, compilers), loading jemalloc or tcmalloc via LD_PRELOAD in place of the system allocator is a common, low-effort performance win.

How does malloc find a free block?

malloc maintains free lists organized by size class. The allocator searches the appropriate size class list for a block large enough to satisfy the request. If no block is found, it requests more memory from the OS via sbrk or mmap. Common search strategies: first fit (first block large enough), best fit (smallest block that fits), or segregated fits (separate free list per size class).

What are boundary tags in a heap allocator?

Boundary tags (footer tags) store the block size and allocation status at both the beginning and end of each block. The footer allows O(1) coalescing with the previous block: when a block is freed, the allocator checks the previous block’s footer to determine if it is free, then merges both into a single larger free block, reducing fragmentation.

How does the buddy system work?

The buddy system allocates memory in powers of two. To allocate N bytes, the allocator rounds up to the next power of two and finds a free block of that size. If only a larger block is available, it splits it in half (buddies), recursing until the correct size is reached. When a block is freed, it coalesces with its buddy (if also free) to form a larger block.

How does jemalloc differ from ptmalloc (glibc malloc)?

jemalloc distributes threads across multiple arenas, each with size-segregated free lists, avoiding the lock contention ptmalloc suffers when many threads allocate concurrently. jemalloc is optimized for multi-threaded workloads (used by Facebook and Firefox) and achieves lower fragmentation through its fine-grained size class design. tcmalloc (Google) takes a similar per-thread cache approach and is used in Chrome and many Google services.

What is the slab allocator?

The slab allocator pre-allocates pools of fixed-size objects (slabs). Each slab holds multiple objects of the same type. When an object is freed, it returns to its slab pool rather than being coalesced back into the heap. This eliminates fragmentation for frequently allocated and freed objects (e.g., kernel data structures) and avoids the overhead of general-purpose malloc for known fixed-size allocations.
