Concurrency is one of the hardest problems in systems design. Bugs are non-deterministic, hard to reproduce, and can corrupt data silently. Understanding the canonical concurrency patterns — and when each one applies — is essential for any system design interview and for building robust production software.
Producer-Consumer Pattern
The producer-consumer pattern decouples the rate at which work is generated from the rate at which it is processed. Producers push tasks onto a shared queue; consumer threads pull from that queue and do the actual work.
The critical implementation detail is using a bounded blocking queue. In Java, ArrayBlockingQueue is the standard choice: it has a fixed capacity, and put() blocks when full, take() blocks when empty. This gives you back-pressure for free — if consumers fall behind, producers automatically slow down rather than growing memory unboundedly until the JVM crashes.
Back-pressure propagation matters in chained systems. If you have multiple pipeline stages, each stage should be backed by a bounded queue so slowness at any stage propagates upstream rather than causing unbounded buffering. Unbounded queues (like LinkedBlockingQueue with no capacity argument) will accept work indefinitely and cause out-of-memory errors under sustained overload.
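The pattern above can be sketched with ArrayBlockingQueue. This is a minimal illustration, not production code; the class name, the queue capacity of 8, and the poison-pill shutdown convention are all assumptions for the example:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class BoundedPipeline {
    // Sentinel telling the consumer to stop (a common convention, not a library feature).
    static final int POISON_PILL = -1;

    // Produces the integers 1..count, consumes them, returns the consumed sum.
    public static long run(int count) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(8); // bounded: put() blocks when full
        AtomicLong consumedSum = new AtomicLong();

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) {
                    queue.put(i);       // blocks if the consumer falls behind: back-pressure
                }
                queue.put(POISON_PILL); // signal shutdown
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    int item = queue.take(); // blocks when empty
                    if (item == POISON_PILL) break;
                    consumedSum.addAndGet(item);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return consumedSum.get();
    }
}
```

Because the queue holds at most 8 items, the producer can never race more than 8 items ahead of the consumer, no matter how fast it generates work.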
Reader-Writer Lock
The reader-writer problem arises when a shared resource has reads that are safe to run concurrently but writes that require exclusive access. A naive mutex serializes all access — readers block each other unnecessarily. A ReadWriteLock allows multiple concurrent readers OR one exclusive writer.
In Java, ReentrantReadWriteLock implements this. Readers acquire the read lock (shared); writers acquire the write lock (exclusive). A writer blocks until all current readers release their locks, and new readers block while a writer is waiting.
Write starvation is the core risk: if readers arrive continuously, a writer may never get the lock. The ReentrantReadWriteLock(true) fair variant queues requests in arrival order, preventing starvation at the cost of some throughput. For read-heavy workloads with infrequent writes, this pattern yields significant throughput gains over a plain mutex.
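A minimal sketch of the read-mostly cache this pattern is typically used for (the class name and map-based cache are illustrative assumptions):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RwCache {
    private final Map<String, String> map = new HashMap<>();
    // Pass true to the constructor for the fair variant that prevents write starvation.
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public String get(String key) {
        lock.readLock().lock();   // shared: many readers may hold this at once
        try {
            return map.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    public void put(String key, String value) {
        lock.writeLock().lock();  // exclusive: waits until all readers drain
        try {
            map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

The unlock-in-finally shape matters: if a reader or writer throws while holding the lock, every other thread would otherwise block forever.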
Thread Pool Pattern
Creating a new OS thread for each task is expensive: stack allocation (~1MB default on many JVMs), kernel context switches, and scheduling overhead. The thread pool pattern solves this by pre-creating a fixed number of threads that wait for work, execute it, then wait again — eliminating per-task thread creation cost.
Java’s Executors factory offers three key variants: newFixedThreadPool(n) — constant number of threads, backed by an unbounded queue (careful: queue can grow without bound); newCachedThreadPool() — grows as needed, recycles idle threads after 60s, unbounded thread count (risky under load spikes); newScheduledThreadPool(n) — supports delayed and periodic tasks.
For production use, prefer constructing a ThreadPoolExecutor directly with explicit core/max thread counts, a bounded queue, and a rejection policy. CallerRunsPolicy is the most robust choice: when the queue is full and no threads are available, the calling thread executes the task itself, automatically slowing down the submitter and propagating back-pressure up the call stack.
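A sketch of that explicit construction follows; the pool sizes, queue capacity, and timeout are illustrative placeholders, not recommendations:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedPool {
    // Submits n trivial tasks and returns how many completed.
    public static int runTasks(int n) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4,                         // core and max thread counts (illustrative)
                30, TimeUnit.SECONDS,         // idle timeout for non-core threads
                new ArrayBlockingQueue<>(8),  // bounded queue: overflow triggers the policy
                new ThreadPoolExecutor.CallerRunsPolicy()); // submitter runs overflow tasks

        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < n; i++) {
            pool.execute(done::incrementAndGet); // under overload, may run on the caller itself
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }
}
```

When the queue and all four threads are saturated, CallerRunsPolicy makes the submitting thread execute the task inline, which is exactly the back-pressure behavior described above.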
Fork-Join Framework
Fork-join is designed for recursive divide-and-conquer parallelism. A large task is split into smaller sub-tasks (fork), each runs in parallel, and results are combined (join). Classic use cases: parallel merge sort, parallel tree traversal, parallel prefix sums.
The key insight of Java’s ForkJoinPool (and Rust’s rayon) is work stealing: each worker thread has its own double-ended queue (deque) of tasks. When a thread runs out of work, it steals tasks from the tail of another thread’s deque. This keeps all threads busy without a central queue bottleneck, and the double-ended structure means the owner pushes/pops from one end (LIFO — better cache locality) while stealers take from the other end (FIFO — larger tasks stolen first, reducing steal frequency).
Fork-join performs poorly when sub-tasks have highly variable runtime (stragglers hold up joins) or when tasks have external I/O waits (threads block, stealing breaks down). It’s best for CPU-bound recursive computation.
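The divide-and-conquer shape can be made concrete with a RecursiveTask that sums an array; the threshold value is an arbitrary assumption for the example:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ParallelSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // below this, sum sequentially
    private final long[] data;
    private final int lo, hi;

    ParallelSum(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {          // base case: small enough to do inline
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        ParallelSum left = new ParallelSum(data, lo, mid);
        ParallelSum right = new ParallelSum(data, mid, hi);
        left.fork();                          // schedule left half; a thief may steal it
        long rightSum = right.compute();      // compute right half in this thread
        return rightSum + left.join();        // wait for the forked half, then combine
    }

    public static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new ParallelSum(data, 0, data.length));
    }
}
```

Note the fork/compute/join idiom: forking one half and computing the other in the current thread avoids idling while waiting, which is the standard way to use the framework.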
Actor Model
The actor model eliminates shared mutable state entirely. Each actor is an independent unit of computation with its own private state and a mailbox queue. Actors communicate exclusively by sending immutable messages; no actor can directly access another’s state.
An actor processes one message at a time from its mailbox — sequential within each actor, concurrent across actors. This means there are no race conditions by design: there is no shared memory to race on. The runtime (Akka, Erlang/OTP) schedules actors across thread pools transparently.
Akka in Java/Scala and Erlang are the canonical implementations. Erlang’s actor model (called processes) is the basis for telecom systems with nine-nines availability — lightweight actors, supervision trees, and let-it-crash fault recovery. The main challenge with actors is reasoning about message ordering and handling mailbox overflow when producers outpace consumers.
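The core mechanics — private state, a mailbox, one message at a time — can be sketched in plain Java without any framework. This is an illustration of the idea only, not Akka's API; the class and method names are invented for the example:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Minimal actor sketch: the single-thread executor is the mailbox, so messages
// are processed strictly one at a time and the state needs no locks.
public class CounterActor {
    private long count = 0; // private state: only the mailbox thread ever touches it
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();

    public void tell(long delta) {               // messages are immutable values
        mailbox.execute(() -> count += delta);   // processed sequentially: no races
    }

    public void askCount(Consumer<Long> replyTo) { // replies flow back as messages too
        mailbox.execute(() -> replyTo.accept(count));
    }

    public long shutdownAndGet() {
        mailbox.shutdown();
        try {
            mailbox.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return count; // safe: awaitTermination establishes happens-before
    }
}
```

Any number of threads can call tell() concurrently, yet count is never racy, because the mailbox serializes every state mutation onto one thread.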
Async/Await and the Event Loop
Traditional thread-per-request servers hit limits around tens of thousands of concurrent connections — OS thread overhead becomes prohibitive. The event loop model handles massive concurrency with a single thread (or small thread pool) by never blocking.
The event loop runs continuously, picking up I/O completion events. When a request needs I/O (network, disk), it registers a callback and yields control immediately. When the I/O completes, the callback is queued for execution. No thread sits blocked waiting — the same thread handles thousands of in-flight requests. Node.js, Python’s asyncio, and Rust’s Tokio all use this model.
async/await is syntactic sugar that makes event-loop code look sequential. The compiler transforms it into a state machine. The critical rule: never block the event loop thread — a synchronous sleep or blocking I/O call stalls all other requests. CPU-intensive work must be offloaded to a thread pool.
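Java expresses the same non-blocking composition with CompletableFuture. A minimal sketch of offloading a CPU-heavy step to a dedicated pool (the class name and the "expensive" computation are stand-ins):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OffloadExample {
    // Stand-in for a CPU-heavy step; offloaded so it never stalls an event-loop thread.
    static int expensiveSquare(int x) {
        return x * x;
    }

    public static int pipeline(int input) {
        ExecutorService cpuPool = Executors.newFixedThreadPool(2); // dedicated CPU pool
        try {
            return CompletableFuture
                    .supplyAsync(() -> expensiveSquare(input), cpuPool) // offload heavy work
                    .thenApply(sq -> sq + 1)                            // cheap continuation
                    .join(); // demo only: real async code keeps composing instead of joining
        } finally {
            cpuPool.shutdown();
        }
    }
}
```

The key point is the second argument to supplyAsync: heavy work runs on the explicit pool, while lightweight continuations can safely run wherever the future completes.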
Green threads (goroutines in Go, fibers in some runtimes) are a middle path: cooperatively scheduled lightweight threads managed by the runtime rather than the OS. Millions can exist concurrently with low overhead. The runtime multiplexes them onto OS threads, yielding at I/O points automatically.
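Java 21 added its own green-thread analog, virtual threads. A small sketch, assuming a Java 21+ runtime (the class name and counts are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Requires Java 21+: virtual threads are the JVM's counterpart to goroutines.
public class VirtualThreadDemo {
    public static int runMany(int n) {
        AtomicInteger completed = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            // Each virtual thread is cheap to create; blocking inside one parks
            // the virtual thread, not the underlying OS carrier thread.
            threads.add(Thread.ofVirtual().start(() -> {
                try {
                    Thread.sleep(10); // simulated I/O wait
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                completed.incrementAndGet();
            }));
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return completed.get();
    }
}
```

With platform threads, a thousand concurrent 10ms sleeps would pin a thousand OS threads; here they all multiplex onto a handful of carriers.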
Compare-and-Swap (CAS) for Lock-Free Programming
Mutexes can be expensive: acquiring a contended lock may involve kernel calls, context switches, and cache-line invalidations. For simple counters and flags, compare-and-swap (CAS) provides lock-free atomicity using a single CPU instruction.
CAS semantics: "if the current value equals expected, atomically set it to new and return true; otherwise return false." A CAS loop retries until it succeeds: read current value, compute new value, CAS — if another thread modified it between read and CAS, retry. Java’s AtomicInteger, AtomicLong, AtomicReference expose this. Rust’s std::sync::atomic provides the same with explicit memory ordering controls.
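The retry loop looks like this in practice. The capped-add transform is an invented example of an update that a plain atomic increment cannot express:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger();

    // Classic CAS retry loop: read, compute, attempt the swap, retry on conflict.
    public int addCapped(int delta, int cap) {
        while (true) {
            int current = value.get();
            int next = Math.min(current + delta, cap); // arbitrary transform, not just +1
            if (value.compareAndSet(current, next)) {
                return next; // no other thread intervened: success
            }
            // Another thread changed the value between get() and compareAndSet(): retry.
        }
    }

    public int get() {
        return value.get();
    }
}
```

Under contention the loop may retry a few times, but no thread ever blocks or holds a lock, so there is no possibility of deadlock or priority inversion.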
The ABA problem: a value changes from A to B and back to A between your read and CAS — your CAS succeeds despite a logical change. Solution: use a stamped/versioned reference (AtomicStampedReference in Java) that increments a version counter on every write, making A-with-version-1 distinct from A-with-version-3.
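A small single-threaded sketch makes the mechanism visible: the A-to-B-to-A history is simulated inline rather than by a real racing thread:

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class StampedRefDemo {
    // Returns whether a stale CAS succeeds after an A -> B -> A history. It should not.
    public static boolean staleCasSucceeds() {
        AtomicStampedReference<String> ref = new AtomicStampedReference<>("A", 0);

        int[] stampHolder = new int[1];
        String seen = ref.get(stampHolder); // read value and version atomically
        int seenStamp = stampHolder[0];     // stamp 0 at the time of our read

        // Simulate another thread doing A -> B -> A, bumping the stamp each write.
        ref.compareAndSet("A", "B", 0, 1);
        ref.compareAndSet("B", "A", 1, 2);

        // A plain CAS on the value alone would succeed here. The stamped CAS fails
        // because the version moved from 0 to 2 even though the value is "A" again.
        return ref.compareAndSet(seen, "C", seenStamp, seenStamp + 1);
    }
}
```

The version counter turns "is the value still A?" into "is it still the *same* A I read?", which is the question the ABA problem forces you to ask.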
Thread-Local Storage
Sometimes you want per-thread state without any synchronization at all. Thread-local storage gives each thread its own independent copy of a variable — reads and writes never contend with other threads.
Java’s ThreadLocal<T> is the standard mechanism. Common uses: database connection per thread (connection pools often do this internally), request context propagation (store current user/request ID for logging without passing through every method call), non-thread-safe formatter instances (e.g., SimpleDateFormat).
The major pitfall: thread pool thread reuse. If a thread-local is set during request handling and not cleared, the next request on the same thread sees the previous request’s value. Always clear the value with remove() in a finally block; InheritableThreadLocal, which copies values into child threads, demands the same discipline.
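The request-context use case with correct cleanup can be sketched as follows (class and method names are invented for the example):

```java
public class RequestContext {
    // One value per thread; reads and writes never contend with other threads.
    private static final ThreadLocal<String> CURRENT_USER = new ThreadLocal<>();

    public static String handleRequest(String user) {
        CURRENT_USER.set(user);
        try {
            return doWork();          // any depth of calls can read the context freely
        } finally {
            CURRENT_USER.remove();    // critical on pooled threads: no leak to the next request
        }
    }

    private static String doWork() {
        // No parameter threading needed: the context rides along with the thread.
        return "handled for " + CURRENT_USER.get();
    }

    static String peek() {            // exposed here only to verify cleanup
        return CURRENT_USER.get();
    }
}
```

After handleRequest returns, the thread-local is empty again, so a pooled thread picking up the next request starts from a clean slate.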
Structured Concurrency
Unstructured concurrency — spawning tasks with no guaranteed lifecycle — is a major source of bugs: leaked tasks, tasks that outlive their parent, tasks that fail silently. Structured concurrency imposes a discipline: all tasks spawned within a scope must complete before the scope exits.
Java 21 introduced StructuredTaskScope as a preview API: open a scope, fork tasks within it, call scope.join() — the join blocks until all forked tasks complete (or the scope is cancelled). If any task fails, the scope can cancel remaining tasks. No task outlives the scope — lifetime is lexically bounded, like structured programming vs. goto.
Swift’s structured concurrency (async let, task groups) follows the same model. This eliminates entire classes of concurrency bugs: no orphaned background tasks, no need to track task handles manually, cancellation propagates automatically through the scope tree.
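On standard (non-preview) Java APIs, the same discipline can be approximated with ExecutorService.invokeAll, which does not return until every submitted task has completed. A sketch under that assumption, with invented task bodies:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Approximation of structured concurrency: invokeAll() blocks until all tasks
// finish, so no task outlives this method's scope.
public class ScopedTasks {
    public static int fetchBoth() {
        ExecutorService scope = Executors.newFixedThreadPool(2);
        try {
            List<Callable<Integer>> tasks = List.of(
                    () -> 20,  // stand-ins for two independent sub-computations
                    () -> 22);
            int sum = 0;
            for (Future<Integer> f : scope.invokeAll(tasks)) { // waits for all of them
                sum += f.get();
            }
            return sum;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            scope.shutdown(); // the pool's lifetime is lexically bounded by this method
        }
    }
}
```

This lacks StructuredTaskScope's automatic sibling cancellation on failure, but it does guarantee the core property: no orphaned tasks escape the method.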
Choosing the Right Pattern
Quick decision guide: for work queues with variable load, use producer-consumer with a bounded blocking queue and CallerRunsPolicy. For read-heavy shared state, use ReadWriteLock. For I/O-bound high-concurrency services, use async/await or green threads. For recursive CPU-bound parallelism, use fork-join. For distributed systems where shared state is impossible, use actors. For simple counters and flags without lock overhead, use CAS atomics. For request-scoped state in thread pools, use ThreadLocal with diligent cleanup. For any new concurrent code, prefer structured concurrency to prevent task leaks.
Frequently Asked Questions
What is the producer-consumer pattern and how does back-pressure work?
The producer-consumer pattern decouples producers (which generate work) from consumers (which process it) using a shared queue. A bounded queue provides back-pressure: when the queue is full, producers block or reject new items, preventing memory exhaustion. Java’s ArrayBlockingQueue implements a bounded blocking queue. When consumers are slower than producers, the queue fills and producers naturally slow down. Without back-pressure (unbounded queue), producers can generate work faster than memory allows, causing OOM. Reactive Streams formalizes back-pressure in async pipelines.
What is the difference between parallelism and concurrency?
Concurrency is about dealing with multiple tasks at once — structuring a program so multiple things can be in progress simultaneously (e.g., one thread handles I/O while another processes data). Parallelism is about doing multiple things at the same time — executing multiple computations simultaneously using multiple CPU cores. A single-core machine can be concurrent (via context switching) but not parallel. Node.js is concurrent but not parallel (single event loop thread). Java with multiple threads is both concurrent and parallel on a multi-core machine.
What is the actor model and how does it avoid shared state?
In the actor model, all computation happens inside actors — isolated units with their own private state. Actors communicate exclusively by sending and receiving immutable messages; there is no shared memory between actors. Each actor has a mailbox (message queue) and processes one message at a time. This eliminates data races by construction — no locks needed. Akka (JVM) and Erlang are the primary actor model implementations. The model scales naturally to distributed systems because message passing works the same way across network boundaries.