A column-family store organises data in a hierarchy: keyspace → table → partition → rows. The partition key determines which node owns the data via consistent hashing, while clustering columns control the physical sort order of rows within each partition. This model makes column-family stores exceptionally well-suited to wide-row workloads such as time-series, user timelines, and IoT sensor data.
Data Model: Keyspace, Table, Partition, and Rows
A keyspace is the top-level namespace, analogous to a database in a relational system. It defines the replication strategy (e.g., NetworkTopologyStrategy) and replication factor. A table defines the schema: partition key columns, clustering columns, and regular columns.
A partition is the unit of data distribution. All rows sharing the same partition key live on the same set of nodes. Wide rows allow millions of columns per partition — a time-series table might store thousands of sensor readings under a single (sensor_id, date_bucket) partition key, with each reading as a row sorted by event_time DESC.
Clustering columns define sort order on disk. Because rows are stored physically sorted by clustering columns, range queries (e.g., “all events between 10:00 and 11:00”) scan contiguously on disk without random I/O.
Write Path: Memtable, Commit Log, and SSTable Flush
- Commit log write: Every write is first appended to the commit log (WAL) for durability. This is a sequential write and is extremely fast.
- Memtable write: The write is applied to an in-memory data structure called the memtable, which keeps rows sorted by clustering columns.
- SSTable flush: When the memtable reaches a size threshold, it is flushed to disk as an immutable SSTable (Sorted String Table). The commit log segment is then eligible for recycling.
SSTables are immutable. Updates do not modify existing SSTables; instead, a new cell with a higher timestamp is written. The latest timestamp wins at read time.
Read Path: Memtable, Bloom Filter, and SSTable Merge
- Memtable check: The memtable is checked first since it holds the most recent writes.
- Row cache (optional): If enabled, a row cache may serve the result directly.
- Bloom filter: Each SSTable has a Bloom filter. Before reading an SSTable, the engine checks the Bloom filter to determine whether the partition key might exist in that SSTable, avoiding unnecessary disk reads.
- SSTable merge: Data from all relevant SSTables (and the memtable) is merged. The cell with the highest timestamp for each column wins. Tombstones shadow older values.
Compaction Strategies
Compaction merges multiple SSTables into fewer, larger ones, discarding overwritten cells and expired tombstones.
Size-Tiered Compaction Strategy (STCS): Groups SSTables of similar size and merges them. Low write amplification makes it ideal for write-heavy append workloads (time-series ingestion). Read amplification is higher because there may be many SSTables to check.
Leveled Compaction Strategy (LCS): Organises SSTables into levels where each level is a multiple of the previous level's size. SSTables within a level are non-overlapping, which means a row exists in at most one SSTable per level. This greatly reduces read amplification at the cost of higher write amplification. Preferred for read-heavy lookup workloads.
Time-Window Compaction Strategy (TWCS): Groups SSTables by write time window. Ideal for time-series with TTL because entire SSTables expire and can be dropped without cell-level tombstone scanning.
TTL and Tombstone Garbage Collection
Columns can have a Time-To-Live (TTL) value in seconds. When a column's TTL expires, it becomes a tombstone. Tombstones are physically removed during compaction only after gc_grace_seconds (default 864,000 seconds / 10 days) have elapsed, ensuring all replicas have received the tombstone before it is removed.
Heavy deletion patterns accumulate tombstones that degrade read performance. Mitigation strategies include using TWCS with TTL (whole SSTables expire without cell-level tombstones), or increasing compaction throughput.
SQL DDL: Relational Analog
-- Time-series events: sensor_id as partition key, event_time as clustering key DESC
CREATE TABLE TimeSeriesEvent (
sensor_id BIGINT NOT NULL,
event_time TIMESTAMPTZ NOT NULL,
value DOUBLE PRECISION NOT NULL,
metadata JSONB,
PRIMARY KEY (sensor_id, event_time)
-- In a column-family store, PRIMARY KEY ((sensor_id), event_time DESC)
);
-- Covering index to support range scans on event_time per sensor
CREATE INDEX idx_ts_event_sensor_time ON TimeSeriesEvent (sensor_id, event_time DESC);
-- User timeline: user_id as partition key, post_time as clustering key DESC
CREATE TABLE UserTimeline (
user_id BIGINT NOT NULL,
post_time TIMESTAMPTZ NOT NULL,
post_id BIGINT NOT NULL,
PRIMARY KEY (user_id, post_time, post_id)
);
CREATE INDEX idx_timeline_user_time ON UserTimeline (user_id, post_time DESC);
Python: Core Operations
import time
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from datetime import datetime, timezone
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('sensor_keyspace')
INSERT_EVENT = session.prepare(
"INSERT INTO time_series_event (sensor_id, event_time, value, metadata) "
"VALUES (?, ?, ?, ?) USING TTL 2592000"
)
SELECT_EVENTS = session.prepare(
"SELECT event_time, value, metadata FROM time_series_event "
"WHERE sensor_id = ? AND event_time >= ? AND event_time <= ? "
"ORDER BY event_time DESC"
)
SELECT_TIMELINE = session.prepare(
"SELECT post_time, post_id FROM user_timeline "
"WHERE user_id = ? AND post_time None:
"""Write a sensor event. TTL of 30 days applied at column level."""
session.execute(INSERT_EVENT, (sensor_id, event_time, value, None))
def get_events(sensor_id: int, start_time: datetime, end_time: datetime) -> list:
"""Retrieve all events for a sensor within a time range."""
rows = session.execute(SELECT_EVENTS, (sensor_id, start_time, end_time))
return [
{'event_time': row.event_time, 'value': row.value, 'metadata': row.metadata}
for row in rows
]
def get_user_timeline(user_id: int, before_time: datetime, limit: int = 20) -> list:
"""Paginate a user's timeline in reverse chronological order."""
rows = session.execute(SELECT_TIMELINE, (user_id, before_time, limit))
return [{'post_time': row.post_time, 'post_id': row.post_id} for row in rows]
Design Considerations Summary
- Partition key: must distribute load evenly and co-locate query data.
- Clustering columns: choose order (ASC/DESC) based on the most common query pattern.
- Compaction: STCS for write-heavy, LCS for read-heavy, TWCS for time-series with TTL.
- Tombstones: avoid heavy delete patterns; use TTL + TWCS to eliminate tombstone accumulation.
- Wide rows: efficient for range scans but watch partition size limits (e.g., Cassandra's 2 billion cells per partition).
See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering