What Is a Lock-Free Queue?
A lock-free queue provides a progress guarantee: at least one thread makes progress even if others are preempted or delayed. Unlike mutex-based queues, no thread ever blocks waiting for a lock. This eliminates priority inversion, convoy effects, and deadlock at the cost of increased algorithmic complexity. Lock-free data structures rely on atomic compare-and-swap (CAS) operations provided by modern CPUs.
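A CAS retry loop is easiest to see on a shared counter. The sketch below simulates it in Python; AtomicRef is an invented helper whose internal lock stands in for the hardware CAS instruction, since Python exposes no real atomics:

```python
import threading

class AtomicRef:
    """Simulated atomic word: a lock makes each CAS atomic, standing in
    for the hardware compare-and-swap instruction."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Succeeds only if the current value still equals `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

counter = AtomicRef(0)

def increment(n):
    for _ in range(n):
        while True:                      # the classic CAS retry loop
            old = counter.load()
            if counter.compare_and_swap(old, old + 1):
                break                    # our update won the race

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert counter.load() == 40_000          # no increments lost
```

Each thread re-reads the current value and retries until its CAS wins; no thread ever holds a lock across the read-modify-write, which is the essence of the lock-free pattern.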
Michael-Scott Lock-Free Queue
The Michael-Scott algorithm is the canonical lock-free queue. It uses a singly-linked list headed by a dummy sentinel node, with two pointers: head (for dequeue) and tail (for enqueue). Both pointers are updated via CAS.
Enqueue Operation
Enqueue is a two-step process for linearizability:
- Allocate a new node with the value.
- CAS tail.next from null to the new node. If this fails, another thread has already linked a node; help it advance tail first, then retry.
- CAS tail from old_tail to the new node (swing the tail pointer forward).
The two-step design means tail may lag one step behind the true last node. Any thread can observe a lagging tail and help advance it — this is the helping mechanism that ensures lock-freedom.
Dequeue Operation
- Read head.next (the first real node, since head is always a sentinel).
- If head.next is null, the queue is empty.
- Read the value from head.next, then CAS head from old_head to head.next. If the CAS succeeds, return the value; otherwise retry.
The ABA Problem
The ABA problem is a subtle correctness hazard in CAS-based algorithms. Thread T1 reads a pointer value A and is then preempted. While T1 sleeps:
- T2 pops node A from the queue.
- T2 pushes a new node B.
- T2 pops node B and then pushes node A again (perhaps because A was recycled from a memory pool).
When T1 resumes, its CAS on A succeeds because the pointer matches — but the state of the data structure has fundamentally changed. The queue may be corrupted.
Mitigation: Tagged Pointers
The standard solution is to pair the pointer with a monotonically incrementing version counter (a tag) and update both in a single atomic CAS. On 64-bit systems a few tag bits can be stored in the low-order bits of an aligned pointer, which are always zero; a wider tag requires packing the pointer and counter into one double-width word. Each successful CAS increments the tag, so A and A' differ even if the raw address is the same.
On x86-64, the CMPXCHG16B instruction enables 128-bit CAS — enough for a full 64-bit pointer plus a 64-bit counter.
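A toy sketch of the tagged-pointer idea, using invented names (TaggedRef) and storing the tag alongside the reference rather than bit-packing it; the point is only that a stale CAS fails after an A-B-A round trip:

```python
class TaggedRef:
    """Illustrative tagged reference: CAS compares value AND tag,
    so a value that round-trips back to A is still rejected."""
    def __init__(self, value):
        self._value = value
        self._tag = 0

    def load(self):
        return self._value, self._tag

    def compare_and_swap(self, expected_value, expected_tag, new_value):
        if self._value is expected_value and self._tag == expected_tag:
            self._value = new_value
            self._tag += 1               # every successful CAS bumps the version
            return True
        return False

A, B = object(), object()
ref = TaggedRef(A)
val, tag = ref.load()                    # T1 reads (A, tag=0), then is "preempted"
ref.compare_and_swap(A, 0, B)            # T2: A -> B  (tag becomes 1)
ref.compare_and_swap(B, 1, A)            # T2: B -> A  (tag becomes 2)
assert ref.load()[0] is A                # same address as before...
assert not ref.compare_and_swap(val, tag, B)   # ...but T1's stale CAS now fails
```

Without the tag, T1's final CAS would have succeeded and silently corrupted the structure; with the tag, the intermediate B is detectable.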
Memory Ordering
CPUs and compilers reorder memory operations for performance. Lock-free algorithms must specify the correct memory ordering for every atomic operation:
- Acquire on a load: the thread sees all writes that were released before this load. Used when reading a shared pointer to follow it safely.
- Release on a store: all preceding writes are visible to any thread that subsequently acquires this location. Used when publishing a new node.
- Sequentially consistent: total order across all SC operations; highest overhead; the default in many high-level libraries.
In the Michael-Scott queue, the CAS on tail.next uses acquire-release ordering. Non-synchronizing reads can use relaxed ordering, which avoids unnecessary fences and reduces synchronization overhead.
Safe Memory Reclamation
In a garbage-collected language, freed nodes are never reused while a thread holds a reference. In C/C++, reclamation is explicit and risky because another thread may be traversing a node at the moment it is freed.
Hazard Pointers
Each thread maintains a small array of hazard pointers. Before following a pointer to a node, a thread publishes the pointer in its hazard array. The reclaiming thread scans all hazard pointers; if a node appears in any, it is deferred. Nodes not in any hazard array can be freed immediately.
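The publish-scan-defer protocol can be sketched as follows; the single-slot layout and names (hazard_slots, retire, scan) are illustrative, not a real hazard-pointer library:

```python
hazard_slots = {}        # thread_id -> node that thread is currently reading
retired = []             # unlinked nodes awaiting reclamation
freed = []               # nodes actually reclaimed (freed, in C/C++)

def protect(tid, node):
    hazard_slots[tid] = node             # publish before dereferencing

def clear(tid):
    hazard_slots[tid] = None

def retire(node):
    retired.append(node)
    scan()

def scan():
    # Free only nodes no thread currently protects; defer the rest.
    protected = {id(n) for n in hazard_slots.values() if n is not None}
    still_deferred = []
    for node in retired:
        if id(node) in protected:
            still_deferred.append(node)
        else:
            freed.append(node)
    retired[:] = still_deferred

nodeA, nodeB = object(), object()
protect("T1", nodeA)                     # T1 is traversing nodeA
retire(nodeA)                            # reclaimer unlinks it: deferred
retire(nodeB)                            # unprotected: freed immediately
assert nodeA in retired and nodeB in freed
clear("T1")
scan()
assert nodeA in freed                    # safe once no hazard slot names it
```

The key invariant: a thread publishes its hazard pointer before following the pointer, so any node it can still be reading is visible to every scan.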
Epoch-Based Reclamation
A global epoch counter is advanced periodically. Threads record the current epoch when they enter a critical section and clear it when they leave. A node is safe to free only when every active thread has entered a newer epoch than the one in which the node was unlinked. This amortizes reclamation overhead, but a thread stalled inside a critical section pins its epoch and can delay reclamation indefinitely.
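A single-threaded toy of the epoch bookkeeping, with invented names; real implementations use per-thread atomics and typically a two-epoch grace period rather than this simplified minimum-epoch rule:

```python
global_epoch = 0
thread_epochs = {}        # thread -> epoch observed at critical-section entry
retired = []              # (node, epoch at which it was unlinked)
freed = []

def enter(tid):
    thread_epochs[tid] = global_epoch    # pin the current epoch

def leave(tid):
    thread_epochs.pop(tid, None)

def retire(node):
    retired.append((node, global_epoch))

def try_advance_and_reclaim():
    global global_epoch
    global_epoch += 1
    # A node is safe once every active thread entered after it was retired.
    min_active = min(thread_epochs.values(), default=global_epoch)
    remaining = []
    for node, epoch in retired:
        if epoch < min_active:
            freed.append(node)
        else:
            remaining.append((node, epoch))
    retired[:] = remaining

nodeA = object()
enter("T1")
retire(nodeA)                            # unlinked while T1 is in epoch 0
try_advance_and_reclaim()
assert not freed                         # T1 may still hold an epoch-0 reference
leave("T1"); enter("T1")                 # T1 re-enters in a newer epoch
try_advance_and_reclaim()
assert nodeA in freed                    # every active thread has moved on
```

Note how a thread that never calls leave() would keep min_active pinned and block reclamation forever, which is the bounded-delay requirement mentioned above.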
SQL Schema
CREATE TABLE LockFreeQueueStat (
id BIGSERIAL PRIMARY KEY,
queue_name VARCHAR(128) NOT NULL,
enqueue_count BIGINT NOT NULL DEFAULT 0,
dequeue_count BIGINT NOT NULL DEFAULT 0,
cas_failure_count BIGINT NOT NULL DEFAULT 0,
sampled_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_lfqs_queue_sampled
ON LockFreeQueueStat (queue_name, sampled_at DESC);
Python Simulation
import threading

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

class LockFreeQueue:
    """
    Michael-Scott lock-free queue simulation.
    A lock inside _cas stands in for a hardware compare-and-swap:
    each individual CAS is atomic, while the surrounding retry
    loops mirror the real lock-free algorithm exactly.
    """
    def __init__(self):
        sentinel = Node(None)
        self._head = sentinel
        self._tail = sentinel
        self._cas_lock = threading.Lock()

    def _cas(self, obj, attr, expected, new_value):
        """Simulate an atomic CAS: returns True if the swap succeeded."""
        with self._cas_lock:
            if getattr(obj, attr) is expected:
                setattr(obj, attr, new_value)
                return True
            return False

    def enqueue(self, value):
        node = Node(value)
        while True:
            tail = self._tail
            tail_next = tail.next
            if tail is self._tail:               # tail still current?
                if tail_next is None:
                    # Step 1: link the new node after the last node.
                    if self._cas(tail, 'next', None, node):
                        # Step 2: swing the tail forward (may fail harmlessly
                        # if another thread already helped).
                        self._cas(self, '_tail', tail, node)
                        return
                else:
                    # Tail is lagging; help advance it, then retry.
                    self._cas(self, '_tail', tail, tail_next)

    def dequeue(self):
        while True:
            head = self._head
            tail = self._tail
            head_next = head.next
            if head is self._head:               # head still current?
                if head is tail:
                    if head_next is None:
                        return None              # queue is empty
                    # Tail lags behind a half-finished enqueue; help it.
                    self._cas(self, '_tail', tail, head_next)
                else:
                    value = head_next.value      # read value before the CAS
                    if self._cas(self, '_head', head, head_next):
                        return value

def benchmark(num_producers=4, num_consumers=4, items=1000):
    # Assumes num_producers == num_consumers so supply matches demand;
    # otherwise consumers would spin forever or leave items behind.
    q = LockFreeQueue()
    results = []
    lock = threading.Lock()

    def producer():
        for i in range(items):
            q.enqueue(i)

    def consumer():
        seen = 0
        while seen < items:
            v = q.dequeue()
            if v is not None:
                with lock:
                    results.append(v)
                seen += 1

    threads = (
        [threading.Thread(target=producer) for _ in range(num_producers)] +
        [threading.Thread(target=consumer) for _ in range(num_consumers)]
    )
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(results)
Performance Characteristics
- Throughput under low contention: comparable to mutex queue; CAS succeeds on the first attempt.
- Throughput under high contention: CAS failures increase; threads spin-retry; cache line bouncing degrades performance. Consider batching or segmented queues (one queue per producer).
- Latency tail: no mutex means no unbounded waits; p99 latency is tighter than mutex queues under contention spikes.
- Memory overhead: hazard pointers add O(threads x hazard_count) memory; epoch reclamation adds per-thread epoch tracking.
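One way to sketch the segmented-queue idea from the contention bullet above: a hypothetical SegmentedQueue gives each producer its own sub-queue (plain deques here, standing in for lock-free segments), so producers never contend with each other, and consumers round-robin across segments. Note that FIFO order holds only within a producer's segment:

```python
from collections import deque
import itertools

class SegmentedQueue:
    """Illustrative segmented queue: one sub-queue per producer to
    reduce contention on a single shared tail."""
    def __init__(self, num_producers):
        self._segments = [deque() for _ in range(num_producers)]
        self._rr = itertools.cycle(range(num_producers))

    def enqueue(self, producer_id, value):
        # Each producer touches only its own segment: no cross-producer CAS races.
        self._segments[producer_id].append(value)

    def dequeue(self):
        # Round-robin over segments; try each one at most once per call.
        for _ in range(len(self._segments)):
            seg = self._segments[next(self._rr)]
            if seg:
                return seg.popleft()
        return None                              # all segments empty

q = SegmentedQueue(2)
q.enqueue(0, "a"); q.enqueue(1, "b"); q.enqueue(0, "c")
```

With the three items above, successive dequeues alternate segments, draining "a", "b", then "c"; a production version would make each segment a lock-free SPSC ring rather than a deque.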
Lock-Free vs Wait-Free
Lock-free guarantees system-wide progress — at least one thread advances. Wait-free is strictly stronger: every thread makes progress within a bounded number of steps regardless of contention. Wait-free queues exist but are significantly more complex. For most production systems, lock-free is sufficient and more practical.
Frequently Asked Questions
What is the difference between lock-free and wait-free?
Lock-free guarantees that at least one thread makes progress at any time. Wait-free guarantees that every thread individually completes its operation within a bounded number of steps, regardless of what other threads do. Wait-free is stronger but harder to implement and often slower in practice due to the helping overhead required to bound individual thread progress.
What exactly is the ABA problem in CAS?
The ABA problem occurs when a CAS reads value A, another thread changes the location to B and back to A, and the original CAS then incorrectly succeeds because it observes A without detecting the intermediate change. Tagged pointers solve this by embedding a version counter in the atomic word so that A' (A after a round-trip) differs from the original A even at the same address.
How do hazard pointers prevent premature reclamation?
Before a thread follows a pointer to a node, it publishes that pointer in a per-thread hazard slot. Any thread wishing to free a node must first scan all hazard slots. If the node's address appears in any slot, freeing is deferred until no thread is protecting that pointer. This ensures nodes are never freed while another thread is actively reading them.
Why does memory ordering matter for lock-free queues?
Without explicit ordering constraints, the CPU or compiler may reorder the store of a node's value after the store of the pointer pointing to it. A consumer thread could follow the pointer and read uninitialized data. Acquire ordering on loads and release ordering on stores establishes a happens-before relationship that prevents this class of bug across cores and CPU sockets.