Low Level Design: CPU Cache Optimization

Why CPU Cache Optimization Matters

Modern CPUs execute instructions in nanoseconds, but DRAM access takes ~80ns — roughly 200 CPU cycles of stall on every cache miss. The gap between CPU speed and memory speed (the "memory wall") makes cache behavior the single most important factor in the real-world performance of compute-intensive code. Understanding the cache hierarchy and writing cache-friendly code can mean an order-of-magnitude difference in per-operation cost in tight inner loops.

Cache Hierarchy and Access Latencies

Modern x86 processors have three levels of cache between the CPU registers and DRAM:

  • L1 cache: 32–64 KB per core, split into instruction (L1i) and data (L1d), ~1–4 ns latency (4–12 cycles).
  • L2 cache: 256 KB–1 MB per core, unified, ~5–12 ns latency (12–40 cycles).
  • L3 cache (LLC): 4–64 MB shared across cores, ~20–40 ns latency (40–80 cycles).
  • DRAM: gigabytes, ~60–100 ns latency (~200 cycles).

A cache miss that falls all the way to DRAM costs 50–100× more than an L1 hit. Optimizing for cache is fundamentally about keeping working sets small and access patterns predictable.

Cache Lines: The Unit of Transfer

Caches operate in units of cache lines, which are 64 bytes on all modern x86/ARM processors. When you read a single byte, the CPU fetches the entire 64-byte cache line containing it. This has a critical implication: spatial locality is free — if you read element 0 of an array, elements 1–15 (assuming 4-byte ints) are already in the cache line. Sequential array traversal is extremely cache-friendly; pointer-chasing through a linked list is not, because each node can be anywhere in memory.

Practically: pack fields that are accessed together into the same struct. Avoid padding that spreads hot data across multiple cache lines. Use sizeof to verify struct sizes and __attribute__((packed)) judiciously when layout matters.

False Sharing

False sharing occurs when two threads on different CPU cores write to different variables that reside in the same 64-byte cache line. Even though there is no logical data sharing, the cache coherence protocol (MESI) forces the cache line to bounce between cores’ L1 caches — effectively serializing what should be fully parallel writes.

Classic example: a global array of counters, one per thread, laid out as int counters[NUM_THREADS]. If NUM_THREADS is 16 and each int is 4 bytes, all 16 counters fit in a single cache line, causing massive false sharing. Fix: pad each counter to 64 bytes with alignas(64) or use a struct with padding. Java’s @Contended annotation does this automatically, and the Linux kernel uses ____cacheline_aligned_in_smp for per-CPU variables.

Data Locality: AoS vs SoA

The layout of data structures profoundly affects cache efficiency. Two canonical patterns:

Array of Structs (AoS): struct Particle { float x, y, z, mass; } particles[N]. All fields of one particle are adjacent. Good when you frequently access all fields of a single particle together (e.g., physics simulation updating position from velocity).

Struct of Arrays (SoA): float x[N], y[N], z[N], mass[N]. All X coordinates are adjacent. Good when you process one field across all elements — for example, computing the sum of all masses, or applying SIMD to all X coordinates simultaneously. SoA is standard in game engine ECS (Entity Component System) architectures and scientific computing.

The rule of thumb: if your hot loop touches only a subset of struct fields, SoA avoids loading irrelevant fields into cache. Profile first — the right answer depends on your access pattern.

Branch Prediction

CPUs speculatively execute instructions before knowing whether a branch is taken. A branch misprediction costs approximately 15–20 cycles as the pipeline is flushed and the correct path is fetched. Modern branch predictors are sophisticated and achieve >99% accuracy on predictable patterns (e.g., loop counters), but data-dependent branches on random input can be essentially unpredictable.

Strategies to reduce misprediction cost: use branchless patterns: (condition) ? a : b often compiles to a CMOV (conditional move) instruction with no branch. Sort data before processing to make branches more predictable. Use profile-guided optimization (PGO) so the compiler knows which branches are hot. Use __builtin_expect(condition, 1) (GCC/Clang) to hint the likely branch to the compiler, affecting code layout for better instruction cache behavior.

SIMD: Single Instruction Multiple Data

SIMD instructions process multiple data elements in parallel using wide registers: SSE2 (128-bit: 4×float or 2×double), AVX2 (256-bit: 8×float), AVX-512 (512-bit: 16×float). A single vmulps instruction multiplies 8 floats simultaneously on AVX2, providing up to 8× throughput on floating point workloads.

Compilers auto-vectorize loops at -O2/-O3 when: the loop has no cross-iteration dependencies, the data is contiguous in memory, and the compiler can prove no aliasing. SoA layout is far more amenable to auto-vectorization than AoS. Use compiler reports (-Rpass=loop-vectorize in Clang) to verify vectorization happened. For maximum control, use intrinsics directly or libraries like Intel’s oneDPL / Eigen.

Hardware and Software Prefetching

The CPU’s hardware prefetcher automatically detects sequential and strided access patterns and prefetches cache lines before they are demanded, hiding memory latency. Sequential array traversal benefits automatically. Irregular access patterns (pointer chasing, hash table probing) defeat the hardware prefetcher.

For irregular accesses, you can use software prefetch instructions: in GCC/Clang, __builtin_prefetch(addr, rw, locality) (where rw=0 for read, 1 for write; locality 0–3 for cache level hint). Issue the prefetch several iterations ahead of when the data will be needed — typically 10–20 iterations for DRAM latency. This technique is used in B-tree implementations to prefetch the next node while processing the current one.

Memory Alignment

Accessing data at its natural alignment (4-byte int at 4-byte boundary, 16-byte SIMD vector at 16-byte boundary) is always at least as fast as misaligned access, and on some architectures (older ARM, some atomic operations) misaligned access is either slower or outright illegal. For cache line-sensitive data, use alignas(64) (C++11) or __attribute__((aligned(64))) to ensure a struct starts at a cache line boundary.

Dynamic allocations: aligned_alloc(alignment, size) (C11) or posix_memalign() for POSIX. Many SIMD intrinsics require 16- or 32-byte aligned pointers for the aligned load variants (_mm_load_ps vs _mm_loadu_ps); the aligned variant was historically faster, and on modern x86 unaligned loads pay a penalty mainly when they straddle a cache line boundary.

Cache-Oblivious Algorithms

Cache-oblivious algorithms achieve optimal cache performance for any cache size without knowing the cache parameters. The key idea: a recursive divide-and-conquer algorithm that halves the problem size at each level will eventually reach subproblems that fit in L1, then L2, then L3 — naturally exploiting the entire cache hierarchy.

Classic examples: cache-oblivious matrix multiplication (recursive tiling), cache-oblivious merge sort (Funnelsort). These algorithms are theoretically elegant but in practice, explicitly cache-aware blocked algorithms with hand-tuned tile sizes often win due to simpler control flow and better SIMD opportunities. Libraries like BLAS use blocked algorithms with architecture-specific tile sizes determined at build time.

Cache Awareness in Database Engines

Database engines are among the most cache-conscious software systems. InnoDB (MySQL) uses 16KB pages; PostgreSQL uses 8KB pages — sized as multiples of OS pages for efficient I/O. The buffer pool (InnoDB) or shared buffers (PostgreSQL) serves as the primary in-memory cache of disk pages. Sequential scans issue prefetch I/O to keep the buffer pool warm ahead of the scan, and large sequential scans use a small ring buffer to avoid evicting hot random-access data.

Column-store databases (ClickHouse, DuckDB, Redshift) exploit SoA layout at the storage level — data for a single column is stored contiguously, enabling excellent cache utilization and SIMD processing for analytical queries that touch few columns across many rows. Row-store databases (PostgreSQL, MySQL) use AoS layout, which is better for OLTP workloads accessing all columns of individual rows.

Interview Design Checklist

  • Explain the cache hierarchy and state the latency of L1, L2, L3, and DRAM.
  • What is a cache line? Why does it matter for struct layout?
  • Define false sharing and give a concrete fix using padding.
  • Compare AoS vs SoA and explain when each is preferable.
  • How does branch misprediction affect performance? What is CMOV?
  • What is auto-vectorization? What loop patterns enable or block it?
  • When would you use software prefetch over hardware prefetch?
  • Explain why cache-oblivious algorithms work without knowing cache size.

FAQ: CPU Cache Optimization

What is false sharing and how do you fix it?

False sharing occurs when two threads on different CPUs frequently write to different variables that happen to occupy the same 64-byte cache line. The MESI protocol invalidates and transfers the entire cache line between CPUs on each write — even though the variables are logically independent. This serializes what should be parallel operations. Fix: pad each variable to occupy a full cache line (add 56 bytes of padding after a long/8-byte field), or ensure each thread’s hot data is allocated in separate cache lines. Java’s @Contended annotation (JEP 142) automates padding. This is why Java’s AtomicLong performs poorly under contention — use LongAdder (striped counters) instead.

What is the difference between Array of Structs (AoS) and Struct of Arrays (SoA)?

AoS: each element is a struct containing all fields, stored contiguously: {x1,y1,z1},{x2,y2,z2}… Accessing a single field across all elements (e.g., all x values for a physics update) loads unnecessary y and z data into cache. SoA: separate arrays for each field: x[]{x1,x2,…}, y[]{y1,y2,…}. Processing all x values is purely sequential with perfect cache utilization. SoA enables SIMD: load 8 floats into an AVX register and compute 8 x-values in one instruction. Game engines (Unity’s ECS, DOTS), scientific computing (numpy), and database column stores use SoA for this reason.

How does branch misprediction affect performance?

Modern CPUs speculatively execute instructions beyond a branch before knowing the branch outcome (branch prediction). If the prediction is wrong, ~15–20 cycles are wasted flushing the pipeline. For a branch mispredicted 50% of the time on a 3GHz CPU, that is roughly 3ns of wasted work per branch on average — significant in tight loops. Fixes: branchless code (use conditional move CMOV instead of a branch — x = condition ? a : b compiles to CMOV with -O2), sorted data (sort before processing so branches are predictable — cache-friendly and branch-friendly), profile-guided optimization (PGO: the compiler uses profiling data to mark likely/unlikely branches), and the __builtin_expect hint in GCC/Clang.
