System Design Interview: Distributed Lock Manager (Redis / Zookeeper)

Why Distributed Locks Are Needed

In a distributed system, multiple instances of a service may run concurrently. When these instances need to coordinate — ensuring only one processes a cron job, one holds a resource exclusively, or one is elected leader — a distributed lock provides mutual exclusion across processes on different machines. Unlike a single-machine mutex (which only works within one process), a distributed lock is stored in a shared external system (Redis, Zookeeper, or a database) that all instances can access.

Redis SET NX: Basic Distributed Lock

The simplest distributed lock uses Redis SET NX EX (set if not exists, with expiry). The command is atomic: it sets the key with a TTL only if the key does not already exist. The lock holder stores a unique token (UUID) as the value — only the holder that set the key can release it.


import redis, uuid

r = redis.Redis()

def acquire_lock(lock_name, ttl_seconds=30):
    token = str(uuid.uuid4())
    # SET lock_name token NX EX ttl_seconds
    acquired = r.set(lock_name, token, nx=True, ex=ttl_seconds)
    return token if acquired else None

# Atomic check-and-delete via Lua script, registered once at module load
# (register_script only computes the script's SHA locally, so this is cheap):
# get the value; if it matches our token, delete it; otherwise do nothing.
# This prevents releasing a lock re-acquired by another instance after our TTL expired.
RELEASE_SCRIPT = r.register_script("""
    local v = redis.call('get', KEYS[1])
    if v == ARGV[1] then
        return redis.call('del', KEYS[1])
    end
    return 0
""")

def release_lock(lock_name, token):
    RELEASE_SCRIPT(keys=[lock_name], args=[token])

# Usage
token = acquire_lock("job:daily-report", ttl_seconds=60)
if token:
    try:
        run_daily_report()
    finally:
        release_lock("job:daily-report", token)

Critical details: (1) The TTL prevents an orphaned lock from blocking everyone forever if the holder crashes before releasing. (2) The unique token prevents releasing a lock that expired and was re-acquired by another instance. (3) The Lua script for release is atomic: the check and the delete cannot be interleaved with concurrent operations.
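The acquire above is non-blocking: it returns immediately whether or not the lock was taken. Callers that want to wait for the lock typically wrap it in a retry loop. A minimal sketch (the `acquire_fn` parameter is assumed to behave like `acquire_lock` above, returning a token or None; the jitter values are illustrative):

```python
import random
import time

def acquire_with_retry(acquire_fn, lock_name, wait_seconds=10):
    """Poll a non-blocking acquire function until it succeeds or we time out.

    Jittered sleeps between attempts avoid a thundering herd when many
    waiters contend for the same lock.
    """
    deadline = time.monotonic() + wait_seconds
    while True:
        token = acquire_fn(lock_name)
        if token is not None:
            return token
        if time.monotonic() >= deadline:
            return None  # give up; caller decides what to do without the lock
        time.sleep(random.uniform(0.05, 0.2))
```

Returning None on timeout (rather than raising) keeps the calling code's happy path symmetrical with the non-blocking version.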

The Redlock Algorithm

A single Redis instance is a single point of failure: if it crashes, the lock is lost. Salvatore Sanfilippo (antirez), the creator of Redis, proposed the Redlock algorithm for higher availability: acquire the lock on N independent Redis nodes (typically 5) with the same key and TTL. The lock is considered acquired if the client successfully set the key on a majority (N/2 + 1, i.e. 3 out of 5) within a small fraction of the TTL. Release by deleting the key on all N nodes. A single node failure does not prevent acquisition, as long as a majority of nodes remain available. Redlock is controversial: Martin Kleppmann argued it is unsafe under GC pauses, clock jumps, or network partitions (the lock holder may believe it holds the lock while the TTL has already expired on the Redis nodes). Use fencing tokens (see below) to handle these cases.
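The quorum logic can be sketched as follows. This is a simplification, not a production Redlock client: the node objects are assumed to expose redis-py style `set`/`get`/`delete` methods, and a real implementation would also use the atomic Lua release and budget for clock drift when checking the remaining TTL:

```python
import time
import uuid

def redlock_acquire(nodes, lock_name, ttl_ms):
    """Try to acquire the lock on a majority of independent Redis-like nodes.

    Returns the token on success, or None after releasing any partial
    acquisitions so they expire cleanly.
    """
    token = str(uuid.uuid4())
    start = time.monotonic()
    acquired = 0
    for node in nodes:
        try:
            if node.set(lock_name, token, nx=True, px=ttl_ms):
                acquired += 1
        except Exception:
            pass  # a down node simply doesn't count toward the quorum
    elapsed_ms = (time.monotonic() - start) * 1000
    quorum = len(nodes) // 2 + 1
    # Valid only if a majority was reached AND meaningful TTL remains.
    if acquired >= quorum and elapsed_ms < ttl_ms:
        return token
    redlock_release(nodes, lock_name, token)
    return None

def redlock_release(nodes, lock_name, token):
    for node in nodes:
        try:
            if node.get(lock_name) == token:  # real code: atomic Lua check-and-del
                node.delete(lock_name)
        except Exception:
            pass  # that node's key will simply expire on its own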

Fencing Tokens

A process can hold a lock, experience a long GC pause (or network delay), have the lock expire and be re-acquired by another process, then resume — believing it still holds the lock. This causes split-brain. Fencing tokens solve this: each lock acquisition returns a monotonically increasing token (stored in the lock service). The lock holder includes this token in every request to the protected resource. The protected resource rejects requests with tokens lower than the highest it has seen — even if the stale lock holder sends a request after the lock was re-acquired, the resource rejects it because the newer token is higher.


# Lock service: atomic increment on acquire
def acquire_lock_with_token(lock_name, ttl):
    # INCR is atomic and global, so tokens seen by successive successful
    # holders are strictly increasing. A failed attempt still consumes a
    # token number, which is harmless: only monotonicity matters.
    token = r.incr("lock:token:" + lock_name)
    acquired = r.set(lock_name, token, nx=True, ex=ttl)
    return token if acquired else None

# Protected resource: reject stale tokens
class StaleTokenError(Exception):
    pass

class ProtectedResource:
    def __init__(self):
        self.max_token_seen = 0

    def write(self, data, fencing_token):
        # In a real system this check-and-update must itself be atomic and
        # durable (e.g. a conditional write in the storage layer).
        if fencing_token <= self.max_token_seen:
            raise StaleTokenError("Rejecting stale write")
        self.max_token_seen = fencing_token
        self._do_write(data)

Zookeeper for Leader Election

Zookeeper provides stronger consistency guarantees than Redis for distributed coordination. Leader election using Zookeeper ephemeral nodes works as follows: each candidate creates an ephemeral sequential node under /election/. The candidate with the lowest sequence number is the leader. If the leader dies, Zookeeper detects the session loss and deletes its ephemeral node automatically; the candidate with the next-lowest sequence number, which watches its predecessor for deletion, becomes leader. This pattern is crash-safe with no TTL to expire, because ephemeral nodes are removed as soon as the Zookeeper session ends. Kafka (before KRaft) and HDFS use Zookeeper for controller election. etcd (used by Kubernetes) provides similar guarantees with a simpler API.
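The election decision itself reduces to sorting the candidate node names by their sequence suffix. A minimal sketch of that logic, assuming the zero-padded naming Zookeeper produces for sequential nodes (the `n_…` prefix is illustrative; a real client such as kazoo would create the nodes and set the watches):

```python
def election_outcome(my_node, all_nodes):
    """Decide this candidate's role from the ephemeral sequential node names.

    The lowest sequence number wins. Every other candidate watches only its
    immediate predecessor, so a node's death wakes exactly one watcher
    instead of the whole herd.
    """
    seq = lambda name: int(name.rsplit("_", 1)[-1])
    ordered = sorted(all_nodes, key=seq)
    if my_node == ordered[0]:
        return ("leader", None)
    predecessor = ordered[ordered.index(my_node) - 1]
    return ("follower", predecessor)  # watch the predecessor for deletion
```

Watching the predecessor rather than the leader is the standard recipe: when any node disappears, only its successor re-runs the election check.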

Database-Based Locks

For low-throughput lock needs, a database table provides a simple distributed lock: INSERT INTO distributed_locks (lock_name, owner, acquired_at) with a unique constraint on lock_name. Success means the lock is held; unique violation means it is already held. A background job cleans up stale locks older than a timeout. Pros: no additional infrastructure, ACID guarantees. Cons: database becomes a bottleneck at high lock contention, no automatic TTL (requires cleanup job), typically millisecond lock acquisition latency vs Redis microsecond.
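The pattern above can be sketched with SQLite standing in for the shared database (the table and function names are illustrative; a real deployment would point at the application's shared database and run the cleanup on a schedule):

```python
import sqlite3
import time

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE distributed_locks (
        lock_name   TEXT PRIMARY KEY,   -- unique constraint enforces mutual exclusion
        owner       TEXT NOT NULL,
        acquired_at REAL NOT NULL
    )
""")

def db_acquire(conn, lock_name, owner):
    """The INSERT succeeds only if no row exists for lock_name: that IS the lock."""
    try:
        conn.execute(
            "INSERT INTO distributed_locks (lock_name, owner, acquired_at) "
            "VALUES (?, ?, ?)",
            (lock_name, owner, time.time()),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:  # unique violation: lock already held
        return False

def db_release(conn, lock_name, owner):
    # Owner check so one worker cannot release another worker's lock.
    conn.execute(
        "DELETE FROM distributed_locks WHERE lock_name = ? AND owner = ?",
        (lock_name, owner),
    )
    conn.commit()

def db_cleanup_stale(conn, max_age_seconds):
    """Background-job substitute for a TTL: drop locks older than the timeout."""
    conn.execute(
        "DELETE FROM distributed_locks WHERE acquired_at < ?",
        (time.time() - max_age_seconds,),
    )
    conn.commit()
```

The owner column in the DELETE plays the same role as the UUID token in the Redis version: it prevents releasing a lock that a cleanup job already handed to someone else.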

Interview Tips

  • Redis SET NX EX with a UUID value and Lua-based atomic release is the standard answer
  • Redlock is worth mentioning for multi-node HA, but acknowledge the controversy
  • Fencing tokens are the correct solution for GC-pause safety — this advanced detail impresses interviewers
  • Zookeeper ephemeral nodes are the right answer for leader election (crash-safe, no TTL dependency)
  • Always ask: what is the lock protecting? What happens if a stale holder acts after the lock expires?

Frequently Asked Questions

How do you implement a distributed lock with Redis?

Use Redis SET NX EX (set if not exists with expiry): SET lock_name unique_token NX EX 30. This is atomic — it sets the key with a 30-second TTL only if it does not already exist. Return value indicates whether the lock was acquired. The unique_token (UUID) is stored as the value; only the holder that set this token can release the lock. Release uses a Lua script for atomic check-and-delete: get the value; if it matches the caller token, delete the key; otherwise do nothing. The Lua script atomicity prevents a race where the TTL expires and another instance acquires the lock between the get and delete. The TTL prevents stale locks if the holder crashes before releasing. This pattern handles 99% of distributed lock use cases.

What is a fencing token and why is it needed for distributed locks?

A fencing token is a monotonically increasing number issued with each lock acquisition. It protects against split-brain: a process can hold a lock, pause (GC, network delay), have the lock TTL expire and be re-acquired by another process, then resume believing it still holds the lock. Without fencing tokens, both processes might write to the same resource simultaneously. With fencing tokens: the lock service atomically increments a counter on each acquisition and returns the value as a token. The lock holder includes this token in every request to the protected resource. The resource tracks the highest token it has seen and rejects any request with a lower token. The stale holder has a lower token than the new holder, so its requests are safely rejected even if it acts after the lock was re-acquired.
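That timeline can be reproduced with a toy in-memory resource (the class and token values here are illustrative, not from any library): the new holder writes with token 34 first, and the paused holder's delayed write with token 33 is rejected when it finally arrives.

```python
class StaleTokenError(Exception):
    pass

class FencedResource:
    """Toy in-memory resource that enforces monotonically increasing tokens."""
    def __init__(self):
        self.max_token_seen = 0
        self.data = None

    def write(self, data, fencing_token):
        if fencing_token <= self.max_token_seen:
            raise StaleTokenError(f"token {fencing_token} is stale")
        self.max_token_seen = fencing_token
        self.data = data

resource = FencedResource()
resource.write("from new holder", 34)   # token 34 accepted
try:
    resource.write("from stale holder", 33)  # token 33 arrives late
except StaleTokenError:
    pass  # the stale write never reaches the data
```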

When should you use Zookeeper instead of Redis for distributed locking?

Use Zookeeper when you need stronger consistency and crash-safety guarantees that Redis cannot provide. The key difference: Redis locks rely on a TTL — if the holder crashes, the lock expires after TTL seconds. Zookeeper uses ephemeral nodes — nodes tied to a client session that are automatically deleted when the client connection drops, with no TTL dependency. Zookeeper also provides a linearizable consensus protocol (ZAB), giving stronger ordering guarantees than Redis replication (which is asynchronous and can lose writes on failover). Use Zookeeper for: leader election in critical systems (Kafka controller, HDFS NameNode), distributed coordination where losing a lock transition is unacceptable, and systems already running Zookeeper for configuration management. Use Redis for: high-throughput lock scenarios, simple mutual exclusion with acceptable TTL-based expiry, and when adding Zookeeper is operationally too expensive.

