Low Level Design: Distributed Lock Service

What Is a Distributed Lock Service?

A distributed lock service allows multiple processes or nodes across a network to coordinate exclusive access to a shared resource. Unlike a single-process mutex, a distributed lock must handle partial failures, network partitions, and clock skew. Classic use cases include leader election, rate limiting, inventory reservation, and preventing double-writes in distributed databases.

Data Model

The simplest backing store is Redis. Each lock is represented as a key with an expiry (TTL) and a unique owner token:

SET lock:<resource> <owner_token> NX PX <ttl_ms>

Fields:

  • lock:<resource> — namespaced key identifying the protected resource.
  • owner_token — a UUID or ULID generated by the client at acquire time. Used to prevent a client from releasing a lock it no longer owns.
  • TTL (ms) — wall-clock lease duration. Must be long enough to complete the critical section but short enough to recover quickly after a crash.
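The acquire/release semantics above can be sketched in miniature. This is a toy in-memory stand-in, not a real client: in production the acquire step would issue the `SET lock:<resource> <owner_token> NX PX <ttl_ms>` command against Redis, and release would go through the Lua check-and-delete script.

```python
import time
import uuid

class InMemoryLockStore:
    """Toy single-node stand-in for the Redis SET NX PX pattern."""

    def __init__(self):
        self._locks = {}  # resource -> (owner_token, expires_at)

    def acquire(self, resource, ttl_ms):
        now = time.monotonic()
        entry = self._locks.get(resource)
        if entry is not None and entry[1] > now:
            return None  # held by another owner and not yet expired
        token = str(uuid.uuid4())  # unique owner token, generated at acquire time
        self._locks[resource] = (token, now + ttl_ms / 1000.0)
        return token

    def release(self, resource, token):
        entry = self._locks.get(resource)
        # Check-and-delete: only the current owner's token may release the lock.
        if entry is not None and entry[0] == token:
            del self._locks[resource]
            return True
        return False

store = InMemoryLockStore()
t1 = store.acquire("orders:42", ttl_ms=5000)   # succeeds
t2 = store.acquire("orders:42", ttl_ms=5000)   # fails: already held
assert t1 is not None and t2 is None
assert store.release("orders:42", "wrong-token") is False  # stale token rejected
assert store.release("orders:42", t1) is True
```

The token check on release is exactly why the owner token exists: without it, a client whose lease expired could delete a lock now held by someone else.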

In a SQL-backed variant, a single locks table is used:

CREATE TABLE locks (
  resource    VARCHAR(255) PRIMARY KEY,
  owner_token VARCHAR(64)  NOT NULL,
  acquired_at TIMESTAMP    NOT NULL DEFAULT NOW(),
  expires_at  TIMESTAMP    NOT NULL,
  status      ENUM('held', 'released') NOT NULL DEFAULT 'held'
);

An atomic INSERT ... ON DUPLICATE KEY UPDATE (MySQL) or INSERT ... ON CONFLICT DO NOTHING (Postgres) provides the compare-and-set semantics needed for safe acquisition. Note that taking over an expired lock additionally requires the update path to be conditioned on expires_at < NOW(), so a live lock is never stolen.
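As a runnable sketch of the SQL path, the following uses SQLite (whose upsert syntax mirrors Postgres's ON CONFLICT clause) with a simplified schema, since SQLite has no ENUM type. The compare-and-set shape of the INSERT is the same as in the MySQL/Postgres variants described above.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE locks (
      resource    TEXT PRIMARY KEY,
      owner_token TEXT NOT NULL,
      acquired_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
      expires_at  TIMESTAMP NOT NULL
    )
""")

def try_acquire(conn, resource, ttl_seconds):
    """Atomic acquisition: the INSERT succeeds only if no row exists."""
    token = str(uuid.uuid4())
    cur = conn.execute(
        """
        INSERT INTO locks (resource, owner_token, expires_at)
        VALUES (?, ?, datetime('now', ?))
        ON CONFLICT(resource) DO NOTHING
        """,
        (resource, token, f"+{ttl_seconds} seconds"),
    )
    conn.commit()
    # rowcount 0 means the conflict branch fired: the lock is already held.
    return token if cur.rowcount == 1 else None

t1 = try_acquire(conn, "inventory:sku-9", ttl_seconds=30)  # wins the row
t2 = try_acquire(conn, "inventory:sku-9", ttl_seconds=30)  # loses: returns None
```

The primary key on resource is what makes this safe: two concurrent INSERTs cannot both succeed, so at most one caller receives a token.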

Core Algorithm: Redlock

For a single Redis instance the SET NX pattern suffices, but it creates a single point of failure. The Redlock algorithm extends this to N independent Redis masters (typically 5):

  1. Record the current time in milliseconds (t_start).
  2. Attempt SET NX PX <ttl> sequentially on all N nodes with a small per-node timeout (e.g., 5 ms).
  3. Compute the elapsed time elapsed = t_now - t_start. The lock is valid only if it was acquired on a majority (> N/2) of nodes AND elapsed < ttl.
  4. The effective validity window is ttl - elapsed - clock_drift_factor, where clock_drift_factor compensates for bounded clock drift between nodes.
  5. On success, proceed with the critical section. On failure, immediately release all partial locks and retry with exponential backoff and jitter.
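The five steps above can be sketched as follows. Each "node" is modelled as a plain dict, and try_set stands in for issuing SET NX PX against one Redis master; a real implementation would use N network clients with per-node timeouts.

```python
import time
import uuid

CLOCK_DRIFT_FACTOR = 0.01  # assumes clock drift bounded at 1% of TTL

def redlock_acquire(nodes, resource, ttl_ms):
    """Sketch of the Redlock quorum check over N independent nodes."""
    token = str(uuid.uuid4())
    t_start = time.monotonic() * 1000  # step 1: record start time (ms)

    def try_set(node):
        if resource in node:
            return False
        node[resource] = token
        return True

    acquired = sum(try_set(node) for node in nodes)   # step 2: sequential attempts

    elapsed = time.monotonic() * 1000 - t_start       # step 3: elapsed time
    validity = ttl_ms - elapsed - CLOCK_DRIFT_FACTOR * ttl_ms  # step 4

    if acquired > len(nodes) // 2 and validity > 0:
        return token, validity
    # step 5: failed -- release any partial locks before retrying with backoff
    for node in nodes:
        if node.get(resource) == token:
            del node[resource]
    return None, 0

nodes = [{}, {}, {}, {}, {}]          # five independent "masters"
nodes[0]["jobs:reindex"] = "other"    # one node already holds the key
token, validity = redlock_acquire(nodes, "jobs:reindex", ttl_ms=10_000)
assert token is not None  # 4 of 5 nodes acquired: majority reached
```

Note that the partial-release step only deletes keys still carrying our own token, so a competing client's lock on node 0 is left untouched.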

Release is always done via a Lua script to make the check-and-delete atomic:

if redis.call('get', KEYS[1]) == ARGV[1] then
  return redis.call('del', KEYS[1])
else
  return 0
end

Failure Modes

  • TTL Expiry During GC Pause: If the process pauses (for example, a long JVM GC pause) for longer than the remaining TTL, the lock expires while the process still believes it holds it. Mitigate with fencing tokens: a monotonically increasing version number the lock server issues at grant time; the storage backend rejects writes carrying a stale token.
  • Split-Brain: A network partition can cause two clients to each win a majority on different sides. Redlock reduces but does not eliminate this risk. For strict safety, prefer a CP system (etcd, ZooKeeper).
  • Clock Drift: Redis TTL relies on wall-clock time. Significant NTP jumps can cause premature expiry or extended leases. Keep clock drift bounds tight (< 1% of TTL).
  • Crash After Acquire, Before Release: The TTL acts as the safety net. Clients should not hold locks longer than one TTL period; use watchdog threads to renew if needed.
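The fencing-token mitigation from the first bullet can be sketched as two cooperating pieces: a lock server that stamps each grant with an increasing number, and a storage backend that refuses writes carrying an outdated one. Class and method names here are illustrative, not from any particular library.

```python
import itertools

class FencedLockServer:
    """Issues a monotonically increasing fencing token with each grant."""

    def __init__(self):
        self._counter = itertools.count(1)

    def grant(self, resource):
        return next(self._counter)

class FencedStorage:
    """Storage backend that rejects writes from stale lock holders."""

    def __init__(self):
        self._highest_seen = {}
        self.data = {}

    def write(self, resource, fencing_token, value):
        if fencing_token < self._highest_seen.get(resource, 0):
            return False  # stale holder, e.g. a client that woke from a GC pause
        self._highest_seen[resource] = fencing_token
        self.data[resource] = value
        return True

server = FencedLockServer()
storage = FencedStorage()

old = server.grant("ledger")   # client A acquires, then pauses (GC)
new = server.grant("ledger")   # A's lease expires; client B acquires
assert storage.write("ledger", new, "B's update") is True
assert storage.write("ledger", old, "A's late update") is False  # fenced off
```

The key property is that correctness no longer depends on clocks or TTLs: even if the paused client resumes and writes, the storage layer discards the write because its token is older than one it has already seen.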

Scalability Considerations

  • Sharding: Partition the lock namespace by resource prefix so different lock families route to different Redis clusters.
  • Lock Granularity: Prefer fine-grained locks (per-row, per-entity) over coarse-grained (per-table) to maximise concurrency.
  • Lease Renewal: Long-running critical sections should use a background thread to renew the TTL via PEXPIRE, preventing spurious expiry.
  • Observability: Track lock contention rate, wait time, and lease utilization. High contention signals that the critical section is too broad or the TTL is too short.
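A lease-renewal watchdog like the one described above can be sketched with a background thread. The renew_fn callback is an assumption standing in for whatever renews the lease in a real deployment (e.g. issuing PEXPIRE on the lock key).

```python
import threading
import time

class LeaseRenewer:
    """Background watchdog that renews a lease at a fraction of its TTL."""

    def __init__(self, renew_fn, ttl_ms):
        self._renew_fn = renew_fn
        self._interval = ttl_ms / 1000.0 / 3  # renew at ~1/3 of the TTL
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Wake every interval until stopped; each wake renews the lease.
        while not self._stop.wait(self._interval):
            self._renew_fn()

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()       # stop renewing once the critical section ends
        self._thread.join()

renewals = []
with LeaseRenewer(lambda: renewals.append(time.monotonic()), ttl_ms=300):
    time.sleep(0.5)  # a long-running critical section
assert len(renewals) >= 2  # the lease was renewed while held
```

Stopping renewal on exit is deliberate: if the process crashes mid-section, the watchdog dies with it and the TTL reclaims the lock, which is exactly the safety net described under Failure Modes.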

Summary

A distributed lock service is a deceptively simple component that hides significant failure-mode complexity. For most production workloads, Redis with the Redlock algorithm is sufficient if paired with fencing tokens at the storage layer. When strict linearizability is required, use etcd or ZooKeeper. Always instrument lock acquisition latency and expiry events — silent lock contention is one of the hardest bugs to debug in distributed systems.
