System Design: Distributed Lock Service — ZooKeeper, etcd, Redis Redlock, Fencing Tokens, Consensus, Leader Election

Distributed locks coordinate access to shared resources across multiple processes or machines. Unlike local mutexes, distributed locks must handle network partitions, process crashes, and clock skew. This guide covers the architecture of distributed lock services like ZooKeeper, etcd, and Redis-based locking, including the subtle correctness issues that make this one of the hardest problems in distributed systems.

Why Distributed Locks Are Needed

Use cases requiring distributed locks: (1) Leader election — only one instance of a service should perform a specific task (cron job, queue consumer, migration runner). The lock holder is the leader; if it crashes, another instance acquires the lock and becomes leader. (2) Resource protection — only one process should write to a specific database row, file, or external resource at a time. Example: two payment processors should not charge the same order simultaneously. (3) Rate limiting with distributed state — a global rate limiter needs atomic increment operations across multiple API gateway instances. (4) Workflow coordination — distributed workflow engines use locks to ensure exactly-once execution of workflow steps.

Redis-Based Distributed Locks

The simplest distributed lock uses a single Redis instance. Acquire: SET lock_key unique_value NX EX 30. NX ensures only one client succeeds (atomic set-if-not-exists). EX 30 sets a 30-second TTL to prevent deadlocks if the holder crashes. Release: use a Lua script to atomically check the value and delete: if redis.call("get", KEYS[1]) == ARGV[1] then return redis.call("del", KEYS[1]) else return 0 end. The unique_value (a UUID) prevents a client from releasing another client's lock. Problem: if the Redis instance fails, the lock is lost. A client that acquired the lock may continue operating while another client acquires a new lock on the replacement Redis; two clients now hold the "same" lock simultaneously.
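The acquire/release semantics above can be sketched in Python. This is a minimal illustration over an in-memory dict standing in for Redis (the class name, key names, and the omission of TTL handling are all simplifying assumptions); a real implementation would issue SET with NX/EX via a Redis client and run the release check as a Lua script so it is atomic:

```python
import uuid

class SingleRedisLock:
    """Sketch of a single-instance Redis lock over an in-memory dict.
    Illustrative only: a real client would call SET key value NX EX 30
    and run the compare-and-delete as a server-side Lua script."""
    def __init__(self):
        self.store = {}  # key -> unique holder value (TTL omitted for brevity)

    def acquire(self, key, value):
        # SET key value NX: succeed only if the key does not already exist.
        if key in self.store:
            return False
        self.store[key] = value
        return True

    def release(self, key, value):
        # Mirrors the Lua script: delete only if we still hold the lock.
        if self.store.get(key) == value:
            del self.store[key]
            return 1
        return 0

lock = SingleRedisLock()
a, b = str(uuid.uuid4()), str(uuid.uuid4())
got_a = lock.acquire("lock:order:42", a)        # A acquires first
got_b = lock.acquire("lock:order:42", b)        # B is blocked
bad_release = lock.release("lock:order:42", b)  # B cannot free A's lock
good_release = lock.release("lock:order:42", a) # A releases cleanly
```

The unique value per client is what makes release safe: without it, a client whose lock already expired could delete a lock now held by someone else.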

Redlock: Multi-Node Redis Locking

Martin Kleppmann and Salvatore Sanfilippo (antirez) famously debated the correctness of Redlock, Redis's multi-node locking algorithm. The Redlock algorithm: deploy 5 independent Redis instances (not replicas). To acquire a lock: (1) Record the current time T1. (2) Attempt to acquire the lock on all 5 instances sequentially, with a short timeout (5-50ms per instance). (3) If the lock was acquired on at least 3 of 5 instances (a majority), and the total time elapsed (T2 - T1) is less than the lock TTL, the lock is acquired. (4) Otherwise, release the lock on all instances. Kleppmann's argument against Redlock: if the lock-holding process pauses (GC pause, page fault) for longer than the TTL, the lock expires and another client acquires it. The original holder resumes and believes it still holds the lock, so two clients now operate on the shared resource. This problem is fundamental to any TTL-based lock: the lock can expire before the holder finishes its work.
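The quorum-and-timing check in steps (3)-(4) reduces to a small predicate. A hedged sketch (the function name and the 2ms clock-drift allowance are assumptions; Redlock as described also subtracts a drift factor from the remaining validity time):

```python
def redlock_acquired(grants, total_instances, elapsed_ms, ttl_ms, drift_ms=2):
    """Decide whether a Redlock acquisition round succeeded.
    grants: how many of the independent instances granted the lock.
    The lock counts as held only if a majority granted it AND some
    validity time remains after subtracting the time spent acquiring
    and a small clock-drift allowance (drift_ms is an assumed value)."""
    majority = total_instances // 2 + 1
    validity_ms = ttl_ms - elapsed_ms - drift_ms
    return grants >= majority and validity_ms > 0

ok = redlock_acquired(3, 5, elapsed_ms=40, ttl_ms=10_000)           # majority, fast
too_few = redlock_acquired(2, 5, elapsed_ms=40, ttl_ms=10_000)      # no majority
too_slow = redlock_acquired(5, 5, elapsed_ms=9_999, ttl_ms=10_000)  # TTL nearly spent
```

Note that this predicate only governs acquisition; it does nothing about the pause-after-acquire problem Kleppmann describes, which no amount of quorum logic can fix.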

Fencing Tokens for Correctness

Fencing tokens solve the TTL expiration problem. When the lock service grants a lock, it also returns a monotonically increasing fencing token (an integer). The lock holder includes this token in every request to the shared resource. The resource (database, storage service) rejects any request with a token lower than the highest token it has seen. Example: client A acquires the lock with token 33, then pauses. The lock expires. Client B acquires the lock with token 34 and writes to the database with token 34. Client A resumes and tries to write with token 33, which the database rejects because it has already seen token 34. This requires the resource to participate in the protocol by tracking and comparing tokens. ZooKeeper's ephemeral sequential nodes naturally provide monotonically increasing values that serve as fencing tokens. Redis does not natively provide fencing tokens; you must implement them separately.
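The resource-side check is the essential part of the protocol. A minimal sketch of a fenced store (the class and method names are illustrative; a real database would enforce this per-record or per-table, typically in a transaction):

```python
class FencedStore:
    """Resource-side fencing: track the highest token seen and reject
    any write carrying a lower token. Illustrative in-memory model."""
    def __init__(self):
        self.max_token = 0
        self.data = {}

    def write(self, token, key, value):
        # Reject stale lock holders: token must not be lower than the
        # highest token this resource has already accepted.
        if token < self.max_token:
            return False
        self.max_token = token
        self.data[key] = value
        return True

store = FencedStore()
a_writes = store.write(33, "balance", 100)      # client A, token 33
b_writes = store.write(34, "balance", 200)      # client B, token 34
stale_a = store.write(33, "balance", 999)       # A resumes after pause: rejected
```

This mirrors the example in the text: once token 34 has been seen, the paused holder of token 33 can no longer corrupt the resource, regardless of what it believes about its lock.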

ZooKeeper-Based Distributed Locks

ZooKeeper provides stronger guarantees than Redis for distributed locking. Lock acquisition: (1) Create an ephemeral sequential node under a lock path: /locks/resource_name/lock-. ZooKeeper appends a sequence number: /locks/resource_name/lock-0000000001. (2) Get all children of /locks/resource_name/. (3) If the created node has the lowest sequence number, the lock is acquired. (4) If not, set a watch on the node with the next-lower sequence number and wait for it to be deleted. Ephemeral nodes are automatically deleted when the client session ends (client crashes or disconnects). This prevents deadlocks without relying on TTLs. The sequential ordering provides fairness (FIFO) and prevents the thundering herd problem — only the next waiter is notified when the lock is released, not all waiters. etcd provides similar functionality with its lease mechanism and compare-and-swap operations.
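Steps (2)-(4) of the acquisition protocol are a pure decision over the sorted children list. A sketch of that step (the function name is an assumption; node names follow ZooKeeper's lock-&lt;sequence&gt; convention from the text):

```python
def zk_lock_step(my_node, children):
    """Given our ephemeral sequential node and the current children of
    the lock path, decide whether we hold the lock or which node to
    watch. Returns ("acquired", None) or ("waiting", predecessor)."""
    # Sort by the numeric suffix ZooKeeper appended to each node name.
    ordered = sorted(children, key=lambda n: int(n.rsplit("-", 1)[1]))
    idx = ordered.index(my_node)
    if idx == 0:
        return ("acquired", None)
    # Watch only the immediate predecessor: this is what avoids the
    # thundering herd, since only the next waiter is notified on release.
    return ("waiting", ordered[idx - 1])

children = ["lock-0000000003", "lock-0000000001", "lock-0000000002"]
first = zk_lock_step("lock-0000000001", children)
third = zk_lock_step("lock-0000000003", children)
```

The real client would re-run this check whenever its watch fires, because the predecessor may have crashed (ephemeral node deleted) rather than released the lock in order.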

etcd Locking with Leases

etcd uses leases (similar to TTLs but with explicit renewal) for lock management. Lock acquisition: (1) Create a lease with a TTL (e.g., 15 seconds): etcdctl lease grant 15. (2) Put a key with the lease attached: etcdctl put /locks/resource lock_holder --lease=LEASE_ID. (3) The put succeeds only if no other key exists at that path (using a transaction with a compare operation). (4) The holder must periodically renew the lease (keep-alive) before it expires: etcdctl lease keep-alive LEASE_ID. If the holder crashes and stops renewing, the lease expires after 15 seconds and the key is deleted, so another client can acquire the lock. etcd's advantages over ZooKeeper: etcd uses the Raft consensus protocol (simpler than ZooKeeper's ZAB), has a gRPC API (better for modern microservices), and provides linearizable reads by default.
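The lease lifecycle (grant, keep-alive, expiry) can be modeled with timestamps. A toy sketch, not etcd's actual implementation (the class and the use of plain float seconds are assumptions):

```python
class Lease:
    """Toy model of an etcd lease: granted with a TTL, kept alive by
    explicit renewal; keys attached to an expired lease are deleted."""
    def __init__(self, ttl_s, now):
        self.ttl_s = ttl_s
        self.expires_at = now + ttl_s

    def keep_alive(self, now):
        # Renewal resets the expiry a full TTL into the future.
        self.expires_at = now + self.ttl_s

    def expired(self, now):
        return now >= self.expires_at

lease = Lease(ttl_s=15, now=0.0)
lease.keep_alive(now=10.0)              # holder renews before expiry
alive_at_20 = not lease.expired(now=20.0)  # valid until 10 + 15 = 25
dead_at_26 = lease.expired(now=26.0)       # holder stopped renewing
```

The key difference from a bare TTL is visible here: a healthy holder can keep the lock indefinitely by renewing, while a crashed holder loses it within one TTL.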

Leader Election Pattern

Leader election is the most common use case for distributed locks. Pattern: all instances of a service attempt to acquire the same lock. The instance that succeeds becomes the leader and performs the exclusive work (processing a queue, running scheduled jobs, coordinating other workers). Non-leaders either idle or perform non-exclusive work. When the leader crashes, its lock is released (ephemeral node deleted or lease expired), and another instance acquires the lock and becomes the new leader. Implementation with etcd: use the etcd concurrency package (clientv3/concurrency in Go). The leader campaigns for election, and the package handles lease creation, renewal, and leader change notifications. Kubernetes relies on this pattern for the controller-manager and scheduler: leader election (via lock objects persisted through the API server, which is backed by etcd) ensures only one instance actively runs at a time in an HA deployment.
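The pattern reduces to "everyone tries the same lock; the winner leads". A hedged sketch with an in-memory lock standing in for the coordination service (the function names and single shared dict are illustrative assumptions):

```python
lock_holder = {}  # stand-in for the coordination service's lock key

def try_acquire(instance):
    """Single shared lock: only the first caller succeeds. In practice
    this would be an ephemeral ZooKeeper node or an etcd lease+txn."""
    if "leader" in lock_holder:
        return False
    lock_holder["leader"] = instance
    return True

def elect(instances):
    """All instances campaign; the one whose acquire succeeds leads.
    Returns None if the lock is already held by someone else."""
    for inst in instances:
        if try_acquire(inst):
            return inst
    return None

first_leader = elect(["a", "b", "c"])   # "a" wins the initial campaign
blocked = elect(["b", "c"])             # lock held: no new leader
del lock_holder["leader"]               # "a" crashes; ephemeral lock freed
second_leader = elect(["b", "c"])       # "b" takes over
```

In a real deployment the losing instances do not poll in a loop like this; they block on a watch or notification from the lock service and campaign again when the lock frees.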

Choosing the Right Distributed Lock

Decision matrix: (1) If you need best-effort locking for efficiency (prevent duplicate work but tolerate occasional overlap), use a single Redis instance with SET NX EX. Simple, fast, and good enough for most use cases. (2) If you need correctness guarantees (financial transactions, inventory management), use ZooKeeper or etcd with fencing tokens. The consensus-based coordination service provides stronger guarantees. (3) If you already run Kubernetes, etcd is available and well-understood — use it rather than adding ZooKeeper. (4) If latency is critical (sub-millisecond lock acquisition), Redis is faster than ZooKeeper/etcd because it skips consensus. Accept the weaker guarantees. (5) Avoid Redlock in production for correctness-critical paths — the debate around its safety is unresolved, and the operational complexity of 5 independent Redis instances is high. Use it only when a single Redis instance is insufficient for availability but correctness is not critical.
