Low Level Design: Lock-Free Data Structures

Compare-and-Swap (CAS)

Compare-and-Swap is the atomic read-modify-write instruction at the heart of most lock-free algorithms. It takes three operands: a memory address, an expected value, and a new value. If the memory location currently contains the expected value, it is atomically replaced with the new value and the operation returns true. Otherwise it returns false and makes no change.

bool CAS(int* addr, int expected, int desired) {
    // executed atomically by CPU
    if (*addr == expected) { *addr = desired; return true; }
    return false;
}

Lock-free algorithms wrap CAS in a retry loop: load the current value, compute the desired new state, attempt CAS, and retry the entire sequence if CAS returns false (meaning another thread intervened). Progress is guaranteed because at least one thread succeeds on every contended CAS — even if others retry. This is the definition of lock-free: system-wide progress even if individual threads are delayed.
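The retry loop can be sketched in C++ with std::atomic. The fetch_multiply helper below is illustrative, not part of any library: the standard offers atomic add and bitwise operations but no multiply, so we build one from CAS.

```cpp
#include <atomic>

// Hypothetical fetch_multiply: load the current value, compute the desired
// new value, attempt CAS, and retry the whole sequence if CAS fails.
int fetch_multiply(std::atomic<int>& value, int factor) {
    int expected = value.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads the current value into
    // `expected`, so each loop iteration retries with fresh data.
    while (!value.compare_exchange_weak(expected, expected * factor,
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed)) {
        // another thread intervened between our load and the CAS; retry
    }
    return expected;  // the value observed just before the successful CAS
}
```

Note compare_exchange_weak may fail spuriously on some architectures; that is harmless here because the loop retries anyway, and the weak form compiles to cheaper code on LL/SC machines.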

Lock-Free Stack

A lock-free stack maintains a top pointer to the head of a singly-linked list.

Push: Read top, set new_node.next = top, then CAS top from the old value to new_node. If CAS fails (another thread changed top), repeat.

Pop: Read top; if null, return empty. Read top.next. CAS top from the observed node to top.next. If CAS fails, repeat. Return the old top node.
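A minimal sketch of this design (often called a Treiber stack) in C++. It deliberately leaks popped nodes to sidestep the reclamation and ABA issues covered later in this text; the struct and member names are illustrative.

```cpp
#include <atomic>

struct Node {
    int value;
    Node* next;
};

struct LockFreeStack {
    std::atomic<Node*> top{nullptr};

    void push(int v) {
        Node* n = new Node{v, nullptr};
        n->next = top.load(std::memory_order_relaxed);
        // Retry until we swing `top` from the value we observed to `n`.
        // A failed CAS rewrites n->next with the current top, so the
        // loop body has nothing left to do.
        while (!top.compare_exchange_weak(n->next, n,
                                          std::memory_order_release,
                                          std::memory_order_relaxed)) {
        }
    }

    bool pop(int& out) {
        Node* old = top.load(std::memory_order_acquire);
        // On CAS failure `old` is refreshed; re-check for empty each time.
        while (old && !top.compare_exchange_weak(old, old->next,
                                                 std::memory_order_acquire,
                                                 std::memory_order_acquire)) {
        }
        if (!old) return false;  // stack was empty
        out = old->value;
        // NOTE: `old` is leaked; real code needs hazard pointers or EBR.
        return true;
    }
};
```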

Each operation is O(1) amortized. Under low contention, CAS almost always succeeds on the first try. Under high contention, threads contend on the single top pointer — a scalability bottleneck. Elimination arrays can reduce contention by pairing push and pop operations that cancel each other out without touching the shared pointer.

ABA Problem

CAS checks whether a value equals the expected value, but it cannot detect that the value changed from A to B and back to A between the load and the CAS. Consider a lock-free pop on a stack where top = A → B → C:

  1. Thread 1 reads top = A, reads A.next = B. Gets preempted.
  2. Thread 2 pops A, pops B, then pushes A back. Stack is now A → C.
  3. Thread 1 resumes, CAS succeeds (top is still A), sets top = B. But B was popped in step 2 and may already have been freed; the stack now points at invalid memory.

Solutions:

  • Tagged pointer: Pack a version counter into unused bits of the pointer (e.g., low 3 bits on 8-byte-aligned allocations, or high bits on x86-64). The sequence becomes A:1 → B:2 → A:3. CAS compares the full tagged value, so A:1 ≠ A:3 — the stale CAS fails correctly.
  • Hazard pointers: Threads publish which nodes they are currently reading; reclamation is deferred until no thread holds a hazard pointer to the node.
  • Epoch-based reclamation: Delay freeing memory until all threads have passed through a safe epoch, ensuring no thread holds a stale pointer.
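A sketch of the tagged-pointer idea, assuming 8-byte-aligned nodes so the low 3 bits of the pointer are free for a small version counter. A 3-bit tag wraps every 8 updates, so this illustrates the mechanism rather than giving a complete guarantee; real implementations often use wider tags via double-width CAS. The helper names are illustrative.

```cpp
#include <atomic>
#include <cstdint>

// With 8-byte alignment the low 3 bits of a pointer are always zero,
// so we can store a version counter there.
constexpr uintptr_t TAG_MASK = 0x7;

uintptr_t pack(void* p, uintptr_t tag) {
    return reinterpret_cast<uintptr_t>(p) | (tag & TAG_MASK);
}
void*     unpack_ptr(uintptr_t t) { return reinterpret_cast<void*>(t & ~TAG_MASK); }
uintptr_t unpack_tag(uintptr_t t) { return t & TAG_MASK; }

// CAS on the tagged word succeeds only if pointer AND tag both match,
// so A:1 vs A:3 compare unequal even though the raw pointers are equal.
bool tagged_cas(std::atomic<uintptr_t>& slot, uintptr_t expected, void* new_ptr) {
    uintptr_t desired = pack(new_ptr, unpack_tag(expected) + 1);  // bump version
    return slot.compare_exchange_strong(expected, desired);
}
```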

Lock-Free Queue

The Michael-Scott queue (1996) is the standard lock-free FIFO queue. It uses separate head and tail pointers to allow concurrent enqueue and dequeue without interfering.

Enqueue: Find the last node (tail or tail.next if tail lags), CAS last.next from null to the new node, then CAS tail to the new node. This is a two-phase operation — if a thread is preempted between the two CASes, other threads "help" by advancing tail before their own enqueue.

Dequeue: Read head.next (the sentinel-based design keeps a dummy head node); if it is null, the queue is empty. Otherwise CAS head from the current head to head.next and return the value from the old head.next, which becomes the new dummy.

The two-phase enqueue and the helping mechanism are what make this queue linearizable and lock-free despite not using any mutex.
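A condensed C++ sketch of the queue showing the two-phase enqueue and the helping step. Retired dummy nodes are leaked here, since safe reclamation is the subject of the following sections; names are illustrative.

```cpp
#include <atomic>

struct QNode {
    int value;
    std::atomic<QNode*> next{nullptr};
    explicit QNode(int v = 0) : value(v) {}
};

struct MSQueue {
    std::atomic<QNode*> head;
    std::atomic<QNode*> tail;

    MSQueue() {
        QNode* dummy = new QNode();  // sentinel: head always points here
        head.store(dummy);
        tail.store(dummy);
    }

    void enqueue(int v) {
        QNode* n = new QNode(v);
        for (;;) {
            QNode* last = tail.load(std::memory_order_acquire);
            QNode* next = last->next.load(std::memory_order_acquire);
            if (next == nullptr) {
                // Phase 1: link the new node after the last node.
                if (last->next.compare_exchange_weak(next, n)) {
                    // Phase 2: swing tail; fine if another thread already did.
                    tail.compare_exchange_strong(last, n);
                    return;
                }
            } else {
                // Tail is lagging: help advance it, then retry.
                tail.compare_exchange_strong(last, next);
            }
        }
    }

    bool dequeue(int& out) {
        for (;;) {
            QNode* first = head.load(std::memory_order_acquire);
            QNode* last  = tail.load(std::memory_order_acquire);
            QNode* next  = first->next.load(std::memory_order_acquire);
            if (first == last) {
                if (next == nullptr) return false;        // queue empty
                tail.compare_exchange_strong(last, next); // help lagging tail
            } else {
                out = next->value;  // value lives in the node AFTER the dummy
                if (head.compare_exchange_weak(first, next))
                    return true;    // old dummy `first` is leaked here
            }
        }
    }
};
```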

Memory Ordering

Modern CPUs and compilers reorder memory operations for performance. Without explicit ordering constraints, lock-free code breaks silently on weakly-ordered architectures (ARM, POWER) and can even break on x86 due to compiler reordering.

C++11 memory order options for atomic operations:

  • memory_order_relaxed: No ordering guarantee beyond atomicity. Appropriate for statistics counters and similar cases where only atomicity, not ordering, matters.
  • memory_order_acquire: No subsequent reads or writes in the current thread can be reordered before this load. Used when reading a flag that guards other data.
  • memory_order_release: No prior reads or writes can be reordered after this store. Used when publishing data by writing a flag.
  • memory_order_seq_cst: Total sequential consistency — all threads see all seq_cst operations in the same global order. Strongest guarantee, highest cost.

The acquire/release pairing forms the fundamental publication pattern: the writer does a release store of a flag, the reader does an acquire load of the same flag, guaranteeing that all writes before the release are visible after the acquire.
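A minimal sketch of the pattern, with the writer and reader intended to run on different threads (names illustrative):

```cpp
#include <atomic>

int payload = 0;                // plain (non-atomic) data being published
std::atomic<bool> ready{false};

void writer() {
    payload = 42;                                  // (1) write the data
    ready.store(true, std::memory_order_release);  // (2) publish: nothing
                                                   // above may sink below
}

int reader() {
    // Acquire load: once we observe true, nothing below may float above,
    // so the payload write from (1) is guaranteed to be visible.
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    return payload;  // reads 42, never a stale 0
}
```

If both operations were relaxed, the compiler or CPU could make the flag visible before the payload, and the reader could observe ready == true yet payload == 0.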

Hazard Pointers

Hazard pointers solve the safe memory reclamation problem: how do you free a node that a concurrent thread might still be reading?

Each thread maintains a small array of hazard pointers — globally visible pointers to nodes currently in use. Before dereferencing a shared pointer, a thread writes it to one of its hazard pointer slots. When a thread wants to free a retired node, it scans all hazard pointer slots across all threads. If any thread’s hazard pointer points to that node, freeing is deferred. Once no thread holds a hazard pointer to the node, it is safe to free.

The overhead is proportional to the number of threads times hazard pointer slots per thread — a scan before each free. Practical implementations batch retirements and perform the scan when the retired list exceeds a threshold (e.g., 2 * threads * hazard_slots_per_thread).
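A deliberately simplified sketch of the mechanism: fixed thread count, one hazard slot per thread, and a single global retired list instead of the per-thread lists and batching a practical implementation would use. All names are illustrative.

```cpp
#include <algorithm>
#include <atomic>
#include <vector>

constexpr int MAX_THREADS = 8;
std::atomic<void*> hazard_slot[MAX_THREADS];  // zero-initialized to nullptr

// Publish the pointer this thread is about to dereference.
void protect(int tid, void* p) { hazard_slot[tid].store(p); }
void clear(int tid)            { hazard_slot[tid].store(nullptr); }

// Nodes removed from the structure but not yet freed.
std::vector<int*> retired;
void retire(int* node) { retired.push_back(node); }

// Scan: free every retired node that no hazard slot still points to.
void scan() {
    std::vector<void*> hazards;
    for (auto& h : hazard_slot) hazards.push_back(h.load());
    auto it = std::remove_if(retired.begin(), retired.end(), [&](int* n) {
        bool in_use =
            std::find(hazards.begin(), hazards.end(), n) != hazards.end();
        if (!in_use) delete n;  // safe: no thread is reading it
        return !in_use;         // drop freed nodes from the retired list
    });
    retired.erase(it, retired.end());
}
```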

Epoch-Based Reclamation

Epoch-based reclamation (EBR) uses a global epoch counter and per-thread epoch tracking. Threads "enter" the current epoch at the start of an operation and "exit" when done. Retired objects are tagged with the epoch in which they were retired.

An object retired in epoch E can be safely freed only when all threads have observed epoch E+1 or later — guaranteeing no thread holds a pointer into epoch E’s retired set. The global epoch advances when all active threads are in the current epoch.

EBR has lower per-operation overhead than hazard pointers (no per-dereference writes to shared memory) but requires threads to make progress — a stalled thread blocks epoch advancement and prevents memory reclamation. Read-Copy-Update (RCU) in the Linux kernel is a variant of this idea, widely used for read-heavy data structures like routing tables and process credential caches.
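A toy sketch of the epoch machinery, assuming a fixed thread count and a global retire list; real implementations keep per-thread retire lists and attempt epoch advancement opportunistically. Names are illustrative.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

constexpr int MAX_THREADS = 4;
constexpr uint64_t IDLE = UINT64_MAX;

std::atomic<uint64_t> global_epoch{0};
std::atomic<uint64_t> thread_epoch[MAX_THREADS];  // call init_ebr() first

// Three retire buckets suffice: epochs cycle through e-1, e, e+1.
std::vector<int*> retired[3];

void init_ebr() {
    for (auto& t : thread_epoch) t.store(IDLE);
}
void enter(int tid) { thread_epoch[tid].store(global_epoch.load()); }
void leave(int tid) { thread_epoch[tid].store(IDLE); }
void retire(int* p) { retired[global_epoch.load() % 3].push_back(p); }

// Advance the epoch if every active thread has observed the current one;
// everything retired two epochs ago is then unreachable and can be freed.
bool try_advance() {
    uint64_t e = global_epoch.load();
    for (auto& t : thread_epoch) {
        uint64_t te = t.load();
        if (te != IDLE && te != e) return false;  // a thread lags behind
    }
    global_epoch.store(e + 1);
    auto& old = retired[(e + 2) % 3];  // the bucket for epoch e-1
    for (int* p : old) delete p;
    old.clear();
    return true;
}
```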

Practical Considerations

Lock-free is not a synonym for fast. Key tradeoffs:

  • High contention: CAS retry loops spin and waste CPU proportional to the number of competing threads. A mutex with a wait queue lets the OS scheduler park contended threads, freeing CPU for useful work. Lock-free wins when contention is low or zero.
  • Non-blocking requirement: Lock-free is mandatory in contexts where blocking is illegal: interrupt handlers, real-time threads with strict deadlines, signal handlers.
  • False sharing: Lock-free structures often have hot fields (head, tail, top) accessed by all threads. If these fields share a cache line with other data, every CAS invalidates the cache line across all cores. Use explicit cache-line padding (alignas(64)) to place each hot field on its own cache line.
  • Complexity: Lock-free code is significantly harder to write, review, and debug than mutex-protected code. ABA bugs and missing memory barriers are subtle. Reach for a well-tested library (e.g., Java’s ConcurrentLinkedQueue, Intel TBB, Folly’s lock-free structures) before rolling your own.
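The false-sharing fix from the list above can be sketched with alignas (the struct is illustrative; 64 bytes is the common x86 cache-line size, and C++17's std::hardware_destructive_interference_size can replace the hard-coded constant):

```cpp
#include <atomic>

// Each hot field gets its own 64-byte cache line, so a CAS on `head`
// does not invalidate the line holding `tail` on every other core.
struct PaddedQueuePointers {
    alignas(64) std::atomic<void*> head{nullptr};
    alignas(64) std::atomic<void*> tail{nullptr};
};

static_assert(sizeof(PaddedQueuePointers) >= 128,
              "head and tail must land on different cache lines");
```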

What is the ABA problem in lock-free programming?

The ABA problem occurs when a CAS operation reads a value A, another thread changes it to B and back to A, and the original thread incorrectly assumes nothing changed. CAS succeeds even though the data structure state may have changed. Solutions: tagged pointers (attach a version counter to the pointer, incrementing on each change) or hazard pointers which defer reclamation until no thread holds a reference.

How does a lock-free stack work?

A lock-free stack uses a CAS on the top pointer. Push: create a new node pointing to the current top, then CAS(top, current, new). Pop: read the current top, then CAS(top, current, current->next). On CAS failure, retry. ABA problem arises on pop; tagged pointers or hazard pointers mitigate it. Lock-free stacks are used in work-stealing schedulers and memory allocators.

What is Epoch-Based Reclamation (EBR)?

EBR is a memory reclamation scheme for lock-free data structures. All threads participate in a global epoch counter. When a thread enters a critical section, it records the current epoch. A node can only be safely freed when no thread is in an epoch older than the node’s deletion epoch. This provides safe reclamation without per-object reference counting overhead.

When should you prefer lock-free over mutex-based data structures?

Lock-free structures are preferable when: threads must not block (real-time systems), performance at high concurrency matters (lock contention is a bottleneck), or the critical section is very small (a single CAS is cheaper than acquiring and releasing a mutex). However, lock-free structures are harder to implement correctly. For most use cases, well-tuned mutex-based structures (like Java ConcurrentHashMap) outperform naive lock-free alternatives.

How does Java ConcurrentHashMap achieve high concurrency?

Java 8+ ConcurrentHashMap uses per-bucket fine-grained locking and CAS for insertions. Empty bucket slots use CAS to insert the first node without locking. Slots with existing nodes use synchronized on the bucket head. Size counting uses distributed LongAdder-style cells to avoid contention on a single counter. The result is lock-free reads and low-contention writes.
