Question 1

How does a Redis-based distributed lock work?

Accepted Answer

The core command: SET lock_key unique_token NX PX ttl_milliseconds. NX (set only if Not eXists) makes the SET conditional: it succeeds only if the key does not already exist. PX sets an expiry in milliseconds so the lock auto-releases if the holder crashes. On success, the key is set and the caller holds the lock. On failure (nil returned), another process holds the lock. To release, use an atomic Lua script that checks that the stored value matches the caller's token and then deletes the key. Never use a separate GET followed by DEL for release: if the lock expires after the GET, another process can acquire it before the DEL runs, and the DEL then removes that process's lock. The unique token (a UUID) per acquisition ensures you only release your own lock, not a lock re-acquired by another process after your TTL expired.
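The acquisition step can be sketched as follows. This is a minimal sketch, assuming a redis-py style client whose set() accepts nx= and px= keyword arguments; the function name acquire_lock and the key naming are hypothetical, not part of the original answer.

```python
import uuid


def acquire_lock(client, key, ttl_ms):
    """Try once to take the lock.

    Returns the unique token on success, or None if another process
    already holds the lock. `client` is assumed to be a redis-py style
    client (for example redis.Redis()).
    """
    token = str(uuid.uuid4())  # unique per acquisition, needed for safe release
    # Equivalent to: SET key token NX PX ttl_ms
    ok = client.set(key, token, nx=True, px=ttl_ms)
    return token if ok else None
```

The caller must keep the returned token, since it is the only proof of ownership when releasing or renewing the lock.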

Question 2

Why must distributed lock release be atomic?

Accepted Answer

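Two separate failure modes must be ruled out. The first is releasing without checking ownership: (1) Server A holds the lock with token A1. (2) A pauses for longer than the TTL and the lock expires. (3) Server B acquires the lock with token B1. (4) A resumes and issues a plain DEL, which removes B's lock. Comparing the stored value with A's own unique token prevents this: A sees B1, not A1, and skips the delete. A non-unique token (for example, a fixed string shared by all clients) defeats the check, because A sees a match and deletes B's lock anyway. The second failure mode is checking and deleting in separate commands: A does GET lock_key and sees A1, the lock then expires and B acquires it, and A's subsequent DEL removes B's lock. This is a check-then-act (TOCTOU) race. Running the compare-and-delete as a Lua script closes the gap, because Redis executes a script atomically: no other command can run between the check and the delete.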
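A compare-and-delete release could look like the sketch below. It assumes a redis-py style client exposing eval(script, numkeys, *keys_and_args); the names RELEASE_SCRIPT and release_lock are hypothetical.

```python
# Lua runs atomically inside Redis, so no other command can execute
# between the GET and the DEL.
RELEASE_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end
"""


def release_lock(client, key, token):
    """Delete the lock only if it still holds our token.

    Returns True if the lock was released, False if it had already
    expired or is now owned by another process.
    """
    return client.eval(RELEASE_SCRIPT, 1, key, token) == 1
```

A False result is worth logging: it means the critical section outlived the lock, so another process may have run concurrently.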

Question 3

What is a fencing token and when do you need it?

Accepted Answer

A fencing token is a monotonically increasing integer returned by each lock acquisition. The lock service increments a counter on each successful acquisition: the first caller gets token 1, the second gets token 2, and so on. When lock holder A sends a request to a downstream system (database, file storage), it includes its fencing token. The downstream system rejects any request whose token is lower than the highest token it has seen. This prevents the following sequence from corrupting data: A holds the lock (token=5) and pauses past the TTL; B acquires the lock (token=6) and writes to the DB with token 6; A resumes and tries to write with token 5; the DB rejects A's write because 5 < 6. You need fencing whenever a stale lock holder could otherwise damage shared state, and it only works if the downstream resource implements the token check. Examples cited for this pattern include Google Chubby (sequencers) and HBase RegionServer assignment.
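The token check on the downstream side can be illustrated with a small in-memory sketch. The class name FencedStore and its fields are hypothetical; a real resource would persist the highest token together with the data.

```python
class FencedStore:
    """Toy downstream resource that enforces fencing tokens."""

    def __init__(self):
        self.highest_token = 0  # highest fencing token seen so far
        self.value = None

    def write(self, token, value):
        """Apply the write unless the token is stale.

        A token lower than the highest one seen means the caller's lock
        has since been handed to someone else, so the write is rejected.
        """
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True


store = FencedStore()
b_ok = store.write(6, "from B")  # B acquired the lock after A's TTL expired
a_ok = store.write(5, "from A")  # A resumes with its stale token
```

In this run B's write is accepted and A's is rejected, so the stored value remains the one written by B.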

Question 4

When should you use ZooKeeper locks instead of Redis?

Accepted Answer

Use ZooKeeper (or etcd) when strong consistency is required (critical financial operations, leader election in distributed databases) and correctness matters more than throughput. ZooKeeper is a CP system: under a network partition it stays consistent and sacrifices availability. A Redis-based lock is not safe if the holder pauses beyond the TTL, because two processes can then simultaneously believe they hold the lock. ZooKeeper ties the lock to a client session instead of a fixed TTL: an ephemeral node is deleted when the session ends (explicit close or session timeout), and the next waiter is notified through a watch, which gives automatic failover. A holder that pauses for longer than the session timeout can still lose the lock without noticing, so operations that must never overlap still need fencing tokens. Latency is roughly 1-10 ms for ZooKeeper versus 0.1-1 ms for Redis. For deduplication or low-stakes mutual exclusion, Redis is fine. For leader election where incorrectness causes data corruption, use ZooKeeper or etcd.
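The failover behaviour of ephemeral nodes can be illustrated with a small in-memory simulation. This is not the ZooKeeper API; the class LockQueueSim and its methods are hypothetical stand-ins for a lock directory in which each waiter creates a sequentially numbered ephemeral node and the lowest number holds the lock.

```python
import itertools


class LockQueueSim:
    """In-memory stand-in for a ZooKeeper lock directory (illustration only)."""

    def __init__(self):
        self._seq = itertools.count()
        self.nodes = {}  # sequence number -> session id

    def join(self, session_id):
        """Create an 'ephemeral sequential' node for this session."""
        seq = next(self._seq)
        self.nodes[seq] = session_id
        return seq

    def holder(self):
        """The session owning the lowest sequence number holds the lock."""
        if not self.nodes:
            return None
        return self.nodes[min(self.nodes)]

    def expire_session(self, session_id):
        """Session ended: all of its ephemeral nodes disappear."""
        self.nodes = {s: sid for s, sid in self.nodes.items() if sid != session_id}


queue = LockQueueSim()
for session in ("A", "B", "C"):
    queue.join(session)
first_holder = queue.holder()
queue.expire_session("A")  # A crashes or its session times out
next_holder = queue.holder()
```

In a real deployment each waiter would watch the node just ahead of its own and be notified when that node is deleted, rather than polling as this simulation implies.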

Question 5

What happens if the lock holder crashes while holding a distributed lock?

Accepted Answer

The TTL (time-to-live) is the safety valve. If the holder crashes, the key expires once the TTL elapses and other processes can acquire the lock. Choose a TTL longer than the expected critical section but short enough that recovery from a crash is quick. For operations of unknown duration, implement lock renewal (a heartbeat): while the holder is alive and working, it periodically extends the TTL, and only if the stored token still matches its own. If the holder crashes, the heartbeat stops and the TTL eventually expires. With ZooKeeper, renewal is built into the session: the client keeps its session alive with heartbeats, and the ephemeral node persists as long as the session does. The session timeout is how long ZooKeeper waits before declaring the client dead and deleting its ephemeral nodes.
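Token-checked renewal for the Redis case might be sketched as below. It assumes a redis-py style client exposing eval(script, numkeys, *keys_and_args); the names EXTEND_SCRIPT, extend_lock and start_heartbeat, and the choice to renew about three times per TTL, are illustrative assumptions.

```python
import threading

# Extend the TTL only if the lock still holds our token.
EXTEND_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("PEXPIRE", KEYS[1], ARGV[2])
else
    return 0
end
"""


def extend_lock(client, key, token, ttl_ms):
    """Return True if the TTL was extended, False if the lock is no longer ours."""
    return client.eval(EXTEND_SCRIPT, 1, key, token, ttl_ms) == 1


def start_heartbeat(client, key, token, ttl_ms, stop_event):
    """Renew the lock in a background thread until stop_event is set."""
    interval = ttl_ms / 3000.0  # seconds; roughly three renewals per TTL

    def loop():
        # wait() returns False on timeout and True once stop_event is set
        while not stop_event.wait(interval):
            if not extend_lock(client, key, token, ttl_ms):
                break  # lock lost, stop renewing

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

The holder sets stop_event when the critical section finishes and then releases the lock. If the process crashes instead, the thread dies with it and the TTL expires on its own.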

Distributed Lock System Low-Level Design

Why Distributed Locks?

Redis-Based Lock (Single Node)

Safe Lock Release with Lua Script

The TTL Problem and Fencing Tokens

Redlock (Multi-Node Redis)

ZooKeeper and etcd Locks

Comparison