System Design: Design a Key-Value Store (like Redis or DynamoDB)
A key-value store is the simplest and most widely used NoSQL data model. Designing one from scratch is a common system design interview question that tests your knowledge of hashing, replication, consistency, and persistence.
Requirements
Functional: put(key, value), get(key), delete(key). Optional: TTL expiry, range queries, transactions.
Non-functional: low latency (P99 < 10ms), high availability (99.99%), eventual or strong consistency (clarify!), durability (survive restarts), horizontal scalability to terabytes.
Single-Node Design
Start simple: in-memory hash map. O(1) get/put. Issues: limited by RAM, data lost on restart.
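The single-node starting point can be sketched in a few lines (illustrative Python; the class and method names are my own, not from any particular system):

```python
class KVStore:
    """Minimal in-memory key-value store: O(1) average put/get/delete."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)  # None if the key is missing

    def delete(self, key):
        # Returns True if the key existed, False otherwise
        return self._data.pop(key, None) is not None
```

Everything else in the design (persistence, partitioning, replication) is about working around this version's two weaknesses: data fits in RAM, and a restart loses everything.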
Persistence options:
- Append-only log (WAL) — write every operation to disk sequentially before returning success. Fast writes, slow recovery (replay entire log).
- Periodic snapshot (RDB) — dump full state to disk every N seconds. Fast recovery, potential data loss between snapshots. Redis uses both.
- LSM tree + SSTable — write to in-memory memtable, flush sorted to disk as immutable SSTables, compact in background. Used by LevelDB, RocksDB, Cassandra. Excellent write throughput.
- B-tree — update data in-place on disk pages. Used by PostgreSQL, MySQL. Better read performance, slower writes due to random I/O.
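The first option above, a hash map backed by an append-only log, can be sketched as follows (a toy illustration with JSON-encoded records; real engines use binary formats, checksums, and log compaction):

```python
import json
import os

class DurableKV:
    """Hash map + write-ahead log: every mutation is appended to disk
    before being acknowledged; recovery replays the log in order."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._replay()                      # slow part: re-read whole log
        self.log = open(path, "a")

    def _replay(self):
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                op = json.loads(line)
                if op["op"] == "put":
                    self.data[op["k"]] = op["v"]
                else:                       # delete
                    self.data.pop(op["k"], None)

    def _append(self, record):
        self.log.write(json.dumps(record) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())         # durability before ack

    def put(self, k, v):
        self._append({"op": "put", "k": k, "v": v})
        self.data[k] = v

    def delete(self, k):
        self._append({"op": "del", "k": k})
        self.data.pop(k, None)

    def get(self, k):
        return self.data.get(k)
```

Note how the trade-off from the bullet list shows up directly: writes are a sequential append (fast), but `_replay` must re-read the entire log on startup (slow recovery), which is exactly what periodic snapshots are meant to bound.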
Distributed Design
Partitioning (Sharding)
Distribute keys across nodes using consistent hashing: nodes and keys are both hashed onto a ring, and each key belongs to the first node clockwise from its hash. Virtual nodes (typically 100-200 per physical node) smooth out the distribution. When a node is added, only the keys in the ranges it takes over migrate (roughly 1/n of the total); all other keys stay put.
```python
# Key → partition → node mapping. This is the fixed-partition variant:
# hash into a fixed number of partitions (so the modulo never changes),
# then move whole partitions between nodes as the cluster grows.
partition_id = hash(key) % num_partitions
node = partition_to_node_map[partition_id]
```
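The hash-ring variant with virtual nodes can be sketched like this (illustrative Python; MD5 and 150 vnodes are arbitrary choices here, and real implementations add replication-aware successor walks):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes: each physical node hashes
    to many ring positions; a key is owned by the first node clockwise
    from the key's hash."""

    def __init__(self, vnodes=150):
        self.vnodes = vnodes
        self.ring = []              # sorted list of (hash, node)

    def _hash(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    def remove_node(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key):
        h = self._hash(key)
        # First ring position at or after the key's hash, wrapping around
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]
```

The payoff is the migration property described above: adding a node reassigns only the keys that now fall in its ranges, and every reassigned key moves to the new node, not between existing ones.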
Replication
Each partition has N replicas (N=3 is common). Two strategies:
- Leader-follower (primary-replica): all writes go to the leader, which replicates them to followers. Simple; reads from the leader are strongly consistent. Used by Redis (with Sentinel handling failover) and, internally, by DynamoDB (one leader per partition).
- Leaderless (Dynamo-style): the client or a coordinator writes to W nodes and reads from R nodes. No single point of failure. W + R > N for strong consistency. Used by the original Dynamo design, Cassandra, and Riak.
Consistency Model
Use quorum reads/writes to tune consistency vs availability:
- N=3, W=3, R=1 → write to all, fast reads, high write latency
- N=3, W=1, R=3 → fast writes, slow reads
- N=3, W=2, R=2 → balanced, W+R=4>3 guarantees overlap
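A toy illustration of why W + R = 4 > N = 3 guarantees overlap: a quorum read collects (value, version) pairs from the contacted replicas and returns the highest version. If every read set of size R intersects the write set of size W, at least one contacted replica holds the latest write. (Versions here stand in for vector clocks or timestamps.)

```python
from itertools import combinations

def quorum_read(responses):
    """responses: (value, version) pairs from the contacted replicas.
    Return the value carrying the highest version (the freshest)."""
    return max(responses, key=lambda vv: vv[1])[0]

# N=3, W=2: a write of version 2 reached replicas 0 and 1; replica 2 is stale.
replicas = [("new", 2), ("new", 2), ("old", 1)]

# R=2 (W+R=4 > N=3): every possible 2-replica read set sees the new value.
for read_set in combinations(replicas, 2):
    assert quorum_read(read_set) == "new"

# R=1 (W+R=3, not > N): a read may land on the lone stale replica.
assert quorum_read([replicas[2]]) == "old"
```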
Handling Failures
- Hinted handoff: when target node is down, another node temporarily stores writes with a hint to forward when target recovers
- Anti-entropy / Merkle trees: background process compares tree hashes between replicas and syncs divergent subtrees
- Vector clocks: track causality for conflict detection; last-write-wins is simpler but loses data
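Vector clock comparison and merge fit in a few lines (a minimal sketch, representing a clock as a dict of node → counter; real systems also prune old entries to bound clock size):

```python
def merge(vc_a, vc_b):
    """Merge two vector clocks: element-wise maximum of the counters."""
    return {n: max(vc_a.get(n, 0), vc_b.get(n, 0))
            for n in vc_a.keys() | vc_b.keys()}

def compare(vc_a, vc_b):
    """Return 'before', 'after', 'equal', or 'concurrent'."""
    nodes = vc_a.keys() | vc_b.keys()
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"   # true conflict: surface both versions to the client
```

The `"concurrent"` case is exactly the data that last-write-wins would silently discard: neither clock dominates, so the store must either keep both siblings (Riak-style) or apply an application-level merge.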
Architecture Diagram
Client
│
▼
Coordinator Node (any node in cluster)
│ consistent hash → determines N replica nodes
├──▶ Replica 1 (leader/primary)
├──▶ Replica 2
└──▶ Replica 3
Each replica: WAL + MemTable + SSTables on disk
TTL / Expiry
Two approaches:
- Lazy expiry: check expiry on each read, delete if expired. Simple, may keep dead keys in memory.
- Active expiry: background thread scans a random sample of keys with TTL and deletes expired ones. Redis uses both.
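Lazy expiry is the simpler of the two and can be sketched as follows (illustrative Python; `time.monotonic` avoids wall-clock jumps, and a real store would pair this with an active background sweep):

```python
import time

class TTLStore:
    """Key-value store with lazy expiry: each entry carries an optional
    absolute deadline; reads delete and hide entries past it."""

    def __init__(self):
        self._data = {}   # key -> (value, deadline or None)

    def put(self, key, value, ttl=None):
        deadline = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, deadline)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, deadline = entry
        if deadline is not None and time.monotonic() >= deadline:
            del self._data[key]   # lazy deletion: reclaim on read
            return None
        return value
```

The weakness named above is visible here: an expired key that is never read again stays in `_data` forever, which is why a periodic active sweep is still needed.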
Comparison: Redis vs DynamoDB vs Cassandra
| Dimension | Redis | DynamoDB | Cassandra |
|---|---|---|---|
| Consistency | Strong (single node) | Eventual / strong | Tunable |
| Data model | Rich types (list, set, sorted set) | Document + KV | Wide column |
| Persistence | Optional (RDB + AOF) | Always durable | Always durable |
| Latency | Sub-ms (in-memory) | Single-digit ms | Low ms |
| Scale | Cluster up to TBs | Petabytes (managed) | Petabytes |
Interview Checklist
- Clarify: strong vs eventual consistency? Read-heavy or write-heavy? TTL needed?
- Start single-node, then add partitioning and replication
- Explain your storage engine choice (hash map + WAL, LSM, B-tree) and trade-offs
- Discuss quorum W+R>N for consistency guarantees
- Address failure handling: hinted handoff, anti-entropy, conflict resolution
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "What is an LSM tree and why do databases like Cassandra use it?",
"acceptedAnswer": { "@type": "Answer", "text": "An LSM (Log-Structured Merge) tree buffers writes in an in-memory table (memtable), then flushes sorted immutable files (SSTables) to disk when the memtable is full. Background compaction merges SSTables to reclaim space. All writes are sequential—no random disk I/O—giving excellent write throughput. Reads check the memtable, bloom filters, and SSTables in order. Cassandra, LevelDB, and RocksDB use LSM trees for write-heavy workloads." }
},
{
"@type": "Question",
"name": "How does consistent hashing minimize data movement when nodes are added or removed?",
"acceptedAnswer": { "@type": "Answer", "text": "In consistent hashing, both nodes and keys are mapped to positions on a virtual ring via a hash function. Each key is owned by the nearest node clockwise. When a node joins, it takes over keys from only its clockwise neighbor—roughly 1/n of total keys. When a node leaves, its keys shift to the next node. Compare to modular hashing (key % n) where adding a node remaps nearly all keys, causing a thundering herd on cache misses." }
},
{
"@type": "Question",
"name": "What is quorum consistency and how do you choose W and R values?",
"acceptedAnswer": { "@type": "Answer", "text": "With N replicas, writing to W nodes and reading from R nodes guarantees you see the latest write when W + R > N (the read and write sets must overlap). Common configs for N=3: W=2, R=2 (strong consistency, balanced latency); W=1, R=3 (fast writes, slow reads); W=3, R=1 (durable writes, fast reads). For eventual consistency (analytics, counters), use W=1, R=1—maximum performance, possible stale reads." }
}
]
}