Distributed locks appear deceptively simple — acquire a lock, do some work, release it — but implementing them correctly in the presence of network partitions, clock skew, and process pauses requires careful reasoning about failure modes. This guide covers the spectrum from naive Redis locks to ZooKeeper-based solutions, explaining exactly what each approach guarantees and where it can go wrong.
Why Distributed Locks Are Hard
In a single-process system a mutex works because the CPU provides atomic compare-and-swap and the lock state lives in shared memory. In a distributed system neither property holds. Three failure modes make distributed locking fundamentally harder than its local counterpart.
Clock skew means that two nodes cannot agree on the current time. If a lock has a TTL of 10 seconds and node A's clock runs 5 seconds fast, A may believe its lock has expired while the lock service still considers it held, or vice versa.

Process pauses are equally dangerous: a JVM garbage-collection pause, a VM live migration, or a kernel scheduling stall can suspend a lock holder for seconds or even minutes. During the pause the lock's TTL expires and another node acquires it; when the paused process resumes, it believes it still holds the lock and proceeds to write shared state concurrently with the new holder, corrupting it.

Network partitions can cause a node to lose its connection to the lock service, leaving it unable to determine whether it still holds the lock. These are not theoretical edge cases; they occur multiple times per day in large production fleets.
Fencing Tokens
The fencing token pattern, popularized by Martin Kleppmann in his analysis of distributed locking, is the only mechanism discussed here that remains safe in the presence of process pauses. Every time a client acquires the lock, the lock service returns a fencing token: a monotonically increasing integer. The client passes this token with every write to the storage service it is protecting. The storage service tracks the highest token it has seen and rejects any write carrying a lower token.
Consider the GC pause scenario: client A acquires the lock and receives token 33. A is paused for 30 seconds. The lock expires, client B acquires it and receives token 34. B writes to storage with token 34, which the storage service accepts. A resumes and tries to write with token 33 — the storage service sees that 34 has already been processed and rejects A’s write. Fencing tokens push safety enforcement to the resource being protected rather than relying on the lock service and the client staying synchronized. This is architecturally significant: no matter how buggy or delayed the lock client is, the storage service’s monotonic check prevents double-writes. The lock service needs only to guarantee that tokens are strictly increasing, which it can do with a simple atomic counter.
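The storage-side check is small enough to sketch directly. The class below is a hypothetical in-memory stand-in for the protected storage service (the name `FencedStore` and its methods are illustrative, not from any library); a real service would persist the highest token alongside the data so the check survives restarts.

```python
import threading

class FencedStore:
    """Illustrative storage service that enforces fencing tokens: it tracks
    the highest token seen and rejects writes carrying a lower one."""

    def __init__(self):
        self._mutex = threading.Lock()   # protects the store's own state
        self._highest_token = -1
        self._data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        """Accept the write only if token is >= the highest token seen.
        Equal tokens are allowed: the same holder may write repeatedly."""
        with self._mutex:
            if token < self._highest_token:
                return False             # stale holder: a newer holder already wrote
            self._highest_token = token
            self._data[key] = value
            return True
```

Replaying the scenario above: a write with token 34 succeeds, after which a resumed client's write with token 33 is rejected, regardless of how long that client was paused.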
Redis Single-Instance Lock
The canonical Redis lock uses SET with the NX (only set if not exists) and PX (TTL in milliseconds) options in a single atomic command: SET lock_key client_uuid NX PX 30000. The NX flag ensures only one client can set the key. The value is a UUID unique to the lock holder, which is critical for safe release: an unconditional DEL could release another client's lock, because if the original holder's TTL expires and a second client acquires, a late DEL from the first client would remove the second client's key. Making release conditional on the UUID prevents this.
Safe release requires an atomic check-and-delete, which Redis implements with a Lua script: if redis.call("get", KEYS[1]) == ARGV[1] then return redis.call("del", KEYS[1]) else return 0 end. Redis executes Lua scripts atomically, so no other command can interpose between the GET and the DEL. The PX TTL acts as a safety net: if the lock holder crashes without releasing, the key expires automatically and other clients can acquire. The practical limit of this approach is that it provides no safety guarantee if the Redis node itself fails — a replica-promoted master may not have received the latest write, and a key that was just set may appear absent on the new master.
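The acquire and release steps above can be sketched as two small functions. This assumes a redis-py-style client (one exposing `set(key, value, nx=True, px=...)` and `eval(script, numkeys, *args)`); the function names `acquire_lock`/`release_lock` are illustrative.

```python
import uuid
from typing import Optional

# The Lua check-and-delete from the text: delete only if the stored UUID is ours.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""

def acquire_lock(client, key: str, ttl_ms: int = 30000) -> Optional[str]:
    """SET key uuid NX PX ttl in one atomic command.
    Returns the holder's UUID on success, None if the lock is taken."""
    token = str(uuid.uuid4())                       # identity for safe release
    if client.set(key, token, nx=True, px=ttl_ms):
        return token
    return None

def release_lock(client, key: str, token: str) -> bool:
    """Atomic check-and-delete via the Lua script; True if our key was deleted."""
    return client.eval(RELEASE_SCRIPT, 1, key, token) == 1
```

A second `acquire_lock` call before release fails (NX sees the key), and `release_lock` with the wrong UUID is a no-op, which is exactly the safety property the Lua script exists to provide.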
Redlock Algorithm
Redlock is Redis’s answer to single-node failure. To acquire the lock the client attempts to set the key on N independent Redis instances (typically 5) using the same NX PX command as the single-instance case. The algorithm records the time before and after the acquisition attempts. The lock is considered acquired only if the client succeeded on at least N/2+1 instances (a quorum) and the time elapsed during acquisition is less than the lock’s TTL minus a safety margin. The validity time — the window during which the client can safely use the lock — is TTL minus the elapsed acquisition time.
Release requires sending the Lua delete script to all N instances, regardless of whether acquisition succeeded on each; this prevents a delayed SET from arriving after a failed acquisition and creating a ghost lock. Martin Kleppmann's 2016 critique of Redlock argues that it still fails in the GC pause scenario: after acquiring the lock, a process pause can outlast the validity window, and the algorithm provides no fencing token mechanism to detect this. Redis's author, Salvatore Sanfilippo, disputes whether this failure mode is practically significant given typical TTL values. The practical consensus is: use Redlock for efficiency (preventing duplicate work) but not for correctness (preventing concurrent conflicting writes to storage that lacks its own concurrency control).
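The quorum and validity arithmetic can be made concrete. This is a minimal sketch, not a production Redlock: it assumes redis-py-style clients, and the `drift_ms` safety margin and function names are illustrative choices.

```python
import time
import uuid

# Same Lua check-and-delete as the single-instance lock.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""

def _try_set(client, key, token, ttl_ms):
    """SET NX PX on one instance; treat an unreachable instance as a failure."""
    try:
        return bool(client.set(key, token, nx=True, px=ttl_ms))
    except Exception:
        return False

def redlock_release(clients, key, token):
    """Send the delete script to ALL instances, even where SET seemed to fail,
    so a delayed SET cannot leave a ghost lock behind."""
    for c in clients:
        try:
            c.eval(RELEASE_SCRIPT, 1, key, token)
        except Exception:
            pass  # unreachable instance: its key will expire via TTL anyway

def redlock_acquire(clients, key, ttl_ms=30000, drift_ms=100):
    """Returns (token, validity_ms) on quorum success, else (None, 0.0)."""
    token = str(uuid.uuid4())
    start = time.monotonic()
    acquired = sum(_try_set(c, key, token, ttl_ms) for c in clients)
    elapsed_ms = (time.monotonic() - start) * 1000.0
    validity_ms = ttl_ms - elapsed_ms - drift_ms     # usable window, minus margin
    if acquired >= len(clients) // 2 + 1 and validity_ms > 0:
        return token, validity_ms
    redlock_release(clients, key, token)             # failed: clean up everywhere
    return None, 0.0
```

Note that the caller must finish its work within `validity_ms` and, per the critique above, still has no fencing token to protect against a pause that outlasts that window.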
ZooKeeper Ephemeral Nodes
ZooKeeper’s lock recipe uses ephemeral sequential nodes to implement both mutual exclusion and fair ordering. To acquire a lock, a client creates an ephemeral sequential znode under a lock path — for example /locks/mylock/lock-0000000042. ZooKeeper assigns a monotonically increasing sequence number to each node. The client then lists all children of /locks/mylock and checks whether its node has the lowest sequence number. If it does, the client holds the lock. If not, the client sets a watch on the node with the next-lower sequence number and waits for a notification.
The ephemeral node is the key safety mechanism: when a client’s session expires — either because the client crashed, was paused long enough to miss heartbeats, or lost network connectivity — ZooKeeper automatically deletes all ephemeral nodes created by that session. This triggers watches on the next waiter, which then acquires the lock. There is no manual TTL to tune, no renewal heartbeat to implement, and no UUID to track: session expiry handles it all. Watching only the predecessor rather than all nodes prevents the thundering herd problem — when the lock is released, only one client (the next in queue) is notified, not all waiters. ZooKeeper’s consensus protocol (ZAB) ensures that all writes are linearizable, so sequence numbers and the lock state are globally consistent without relying on clock synchronization.
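In practice one would use a client library's built-in lock recipe (kazoo's `Lock`, for example) rather than hand-rolling this, but the decision step of the recipe is a pure function worth seeing. The sketch below, with the illustrative name `lock_status`, takes the children of the lock path and the client's own node name, and decides whether the client holds the lock or which predecessor to watch. Node names follow ZooKeeper's sequential-suffix convention, e.g. lock-0000000042.

```python
def lock_status(children: list, my_node: str):
    """Given the children of the lock path and our own znode name, return
    ("held", None) if we have the lowest sequence number, otherwise
    ("waiting", predecessor) where predecessor is the single node to watch."""
    def seq(name: str) -> int:
        return int(name.rsplit("-", 1)[1])   # parse the 10-digit sequence suffix
    ordered = sorted(children, key=seq)      # queue order = creation order
    idx = ordered.index(my_node)
    if idx == 0:
        return ("held", None)
    return ("waiting", ordered[idx - 1])     # watch only the predecessor,
                                             # avoiding the thundering herd
```

When the predecessor's znode is deleted (released or session-expired), the watch fires and the client re-runs this check, at which point it either holds the lock or watches a new predecessor.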
Lock TTL and Renewal
Any lock with a TTL-based expiry — which includes Redis locks and ZooKeeper session timeouts — must address the renewal problem. If the lock TTL is shorter than the time needed to complete the protected work, the lock expires before the work is done, defeating its purpose. Setting a very long TTL is dangerous because a crashed holder blocks other clients for the full duration.
The solution is active lease renewal: a background thread (or goroutine, or async task) periodically extends the lock’s TTL before it expires. For a Redis lock with a 30-second TTL, the renewal thread fires every 10 seconds and resets the TTL to 30 seconds using a PEXPIRE command, conditioned on the lock still being held by this client (checked via a Lua script that verifies the UUID). For ZooKeeper, session renewal happens automatically through the ZooKeeper client library’s heartbeat thread — the application does not need to implement it explicitly. The renewal thread must be monitored: if it fails (due to a network partition or GC starvation), the client should detect the missed renewal and abort the protected operation rather than continuing under a potentially expired lock. Redisson (the Java Redis client) implements this watchdog pattern, automatically extending lock TTL until the lock is explicitly released.
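A minimal watchdog for the Redis lock above might look like the following sketch. It assumes a redis-py-style client; the class name `LockWatchdog`, the renew-at-one-third-of-TTL interval, and the `lost` event are illustrative choices, loosely modeled on the Redisson pattern rather than taken from it.

```python
import threading

# Lua renew: extend the TTL only if our UUID still owns the key.
RENEW_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("pexpire", KEYS[1], ARGV[2])
else
    return 0
end
"""

class LockWatchdog:
    """Background thread that renews a held Redis lock until stopped.
    If a renewal fails, it sets .lost so the caller can abort its work."""

    def __init__(self, client, key, token, ttl_ms=30000):
        self.client, self.key, self.token, self.ttl_ms = client, key, token, ttl_ms
        self.lost = threading.Event()       # signalled when the lock is gone
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        interval_s = self.ttl_ms / 3000.0   # renew at one third of the TTL
        while not self._stop.wait(interval_s):
            try:
                ok = self.client.eval(RENEW_SCRIPT, 1, self.key,
                                      self.token, str(self.ttl_ms))
            except Exception:
                ok = 0                      # partition or error counts as a miss
            if ok != 1:
                self.lost.set()             # expired or stolen: caller must abort
                return

    def stop(self):
        self._stop.set()
        self._thread.join()
```

The protected work should periodically check `watchdog.lost.is_set()` and abort if it fires, rather than finishing under a lock it may no longer hold.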
Deadlock Prevention
Distributed deadlocks occur when two or more clients each hold a lock that the other needs. In systems where a single operation must acquire multiple locks — for example, transferring funds between two accounts requires locking both accounts — deadlock is a real risk.
Lock ordering is the simplest prevention strategy: always acquire multiple locks in a globally consistent order (for example, sorted by resource ID). If all clients follow this ordering, circular wait — the necessary condition for deadlock — cannot occur. For cases where ordering is impractical, timeout with exponential backoff provides a practical alternative: if acquiring the second lock fails within a deadline, the client releases the first lock, waits a random exponential backoff interval, and retries the full acquisition sequence from scratch. The tryLock-with-deadline pattern makes this explicit: tryLock(lockA, 100ms) and if successful tryLock(lockB, 100ms), aborting and releasing on any timeout. The backoff must include jitter (randomization) to prevent synchronized retry storms where all blocked clients retry simultaneously, causing a new round of contention.
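Both strategies combine naturally. The sketch below (function names `acquire_all`/`release_all` are illustrative) works over any lock object exposing an `acquire(timeout=...)`/`release()` shape, such as Python's `threading.Lock`, standing in for a distributed lock client: sort by resource ID for global ordering, release everything on any timeout, and retry with jittered exponential backoff.

```python
import random
import time

def acquire_all(locks_by_id: dict, timeout: float = 0.1,
                max_retries: int = 5) -> bool:
    """Acquire every lock in globally sorted ID order; on any timeout,
    release what is held and retry after jittered exponential backoff."""
    ordered = [locks_by_id[i] for i in sorted(locks_by_id)]  # global order
    delay = 0.05
    for _ in range(max_retries):
        held = []
        for lock in ordered:
            if lock.acquire(timeout=timeout):     # tryLock with deadline
                held.append(lock)
            else:
                for h in reversed(held):          # abort: undo partial acquisition
                    h.release()
                break
        else:
            return True                           # acquired the full set
        time.sleep(delay * random.random())       # jitter prevents retry storms
        delay *= 2                                # exponential backoff
    return False

def release_all(locks_by_id: dict):
    for i in sorted(locks_by_id, reverse=True):   # release in reverse order
        locks_by_id[i].release()
```

Because every client sorts the same way, a funds transfer locking accounts "acct-17" and "acct-42" always takes "acct-17" first, so no circular wait can form.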
Fair Lock Semantics
An unfair lock grants the lock to any waiting client when it is released. Under high contention this causes starvation: some clients acquire the lock repeatedly while others wait indefinitely. A fair lock guarantees FIFO ordering — the client that has been waiting longest is always next to acquire.
ZooKeeper’s sequential node recipe is inherently fair: clients acquire in the order their sequential nodes were created. Redis does not natively support fair locking, but it can be emulated with a list-based queue: clients enqueue themselves by pushing a UUID to a Redis list (RPUSH), then spin-wait polling the head of the list (LINDEX 0) and comparing it to their UUID. Only the client at the head of the list may proceed; others wait with exponential backoff. The tradeoff of fairness is throughput: an unfair (thundering herd) lock can achieve higher throughput when contention is sporadic because the first client to retry after a release succeeds immediately without waiting for a queue traversal. Fair locks sacrifice peak throughput for predictable latency distribution, which is preferable in systems with SLA guarantees or where starvation would cause correctness problems (such as lease-based leadership election where a perpetually starved candidate can never become leader).
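The list-based emulation can be sketched as follows, assuming a redis-py-style client exposing `rpush`, `lindex`, `lrem`, and `lpop`. The names `fair_acquire`/`fair_release`, the poll interval, and the timeout are illustrative; a production version would replace the fixed poll with backoff and make the release check-and-pop atomic via a Lua script.

```python
import time
import uuid

def fair_acquire(client, queue_key, poll_s=0.05, timeout_s=10.0):
    """Enqueue a UUID with RPUSH, then poll LINDEX 0 until we reach the
    head of the queue (FIFO). Returns the UUID, or None on timeout."""
    me = str(uuid.uuid4()).encode()
    client.rpush(queue_key, me)                    # join the back of the queue
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if client.lindex(queue_key, 0) == me:      # head of list holds the lock
            return me
        time.sleep(poll_s)
    client.lrem(queue_key, 1, me)                  # timed out: leave the queue
    return None

def fair_release(client, queue_key, me):
    """Pop ourselves off the head so the next waiter becomes the new head.
    (Non-atomic here; a Lua script would fuse the check and the pop.)"""
    if client.lindex(queue_key, 0) == me:          # only the head may release
        client.lpop(queue_key)
```

Waiters behind the head keep polling but are never notified en masse, and the queue order gives exactly the FIFO guarantee described above.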