{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How do you choose a partition key in a column-family store?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A good partition key distributes writes evenly across nodes and groups data that is queried together. Avoid sequential patterns such as a raw timestamp or time bucket as the sole partition key, because all current writes land in the newest partition and create a hotspot. For time-series data, composite keys such as (sensor_id, date_bucket) spread load while keeping related rows co-located. The partition key is hashed through consistent hashing to determine the owning node, so the choice directly affects cluster balance."
      }
    },
    {
      "@type": "Question",
      "name": "What is the role of clustering columns and how do they affect sort order?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Clustering columns define the physical sort order of rows within a partition. Rows are stored on disk in clustering-column order, which makes range scans extremely efficient. Declaring event_time DESC as a clustering column means the most recent events are stored first, so a query like 'get the last 100 events' reads sequentially from the start of the partition without scanning all rows. Filtering on non-clustering columns within a partition requires ALLOW FILTERING, which triggers a full-partition scan."
      }
    },
    {
      "@type": "Question",
      "name": "When should you use size-tiered compaction versus leveled compaction?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Size-Tiered Compaction Strategy (STCS) groups similarly sized SSTables and merges them, which is efficient for write-heavy workloads because it minimises write amplification. However, STCS can result in large SSTables and higher read amplification because a query may need to check multiple SSTables at each tier. Leveled Compaction Strategy (LCS) maintains non-overlapping SSTables within each level, which reduces read amplification significantly but increases write amplification. Use STCS for append-heavy time-series ingestion and LCS for read-heavy lookup workloads."
      }
    },
    {
      "@type": "Question",
      "name": "Why do tombstones accumulate and how is tombstone garbage collection handled?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "In a column-family store, deletes are not immediate removals. Instead, a tombstone marker is written. The tombstone must propagate to all replicas before the data can be physically removed, which is why gc_grace_seconds exists: it provides a time window (typically 10 days) during which a node that was temporarily down can receive the tombstone before compaction purges it. If a replica stays down for longer than gc_grace_seconds and misses the tombstone, the deleted data can reappear once that node returns. After gc_grace_seconds, compaction physically drops the tombstone and the shadowed cell."
      }
    }
  ]
}
A column-family store organises data in a hierarchy: keyspace → table → partition → rows. The partition key determines which node owns the data via consistent hashing, while clustering columns control the physical sort order of rows within each partition. This model makes column-family stores exceptionally well-suited to wide-row workloads such as time-series, user timelines, and IoT sensor data.
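The routing step above can be sketched as a toy token ring. Everything here is illustrative: the node names, the tokens-per-node count, and the MD5-based hashing are assumptions of the sketch (Cassandra, for instance, uses Murmur3 tokens and virtual nodes), but the lookup logic — hash the partition key, find the next token clockwise on the ring — is the same.

```python
import hashlib
from bisect import bisect_right

class TokenRing:
    """Toy consistent-hashing ring: each node owns several token ranges."""

    def __init__(self, nodes, tokens_per_node=8):
        self.ring = []  # sorted list of (token, node) pairs
        for node in nodes:
            for i in range(tokens_per_node):
                self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def owner(self, partition_key: str) -> str:
        """The node owning the first token >= hash(key), wrapping around."""
        token = self._hash(partition_key)
        idx = bisect_right(self.ring, (token, ""))
        return self.ring[idx % len(self.ring)][1]

ring = TokenRing(["node-a", "node-b", "node-c"])
# The same key always maps to the same node, so routing is deterministic.
assert ring.owner("sensor-42:2024-01-15") == ring.owner("sensor-42:2024-01-15")
```

Because placement depends only on the hash of the partition key, a skewed key choice skews the ring — which is exactly why the partition key "directly affects cluster balance".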
Data Model: Keyspace, Table, Partition, and Rows
A keyspace is the top-level namespace, analogous to a database in a relational system. It defines the replication strategy (e.g., NetworkTopologyStrategy) and replication factor. A table defines the schema: partition key columns, clustering columns, and regular columns.
A partition is the unit of data distribution. All rows sharing the same partition key live on the same set of nodes. Wide rows allow millions of columns per partition — a time-series table might store thousands of sensor readings under a single (sensor_id, date_bucket) partition key, with each reading as a row sorted by event_time DESC.
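A minimal helper for composing such a bucketed key might look like the following; the `partition_key` name and the daily bucket format are hypothetical choices for this sketch.

```python
from datetime import datetime, timezone

def partition_key(sensor_id: int, event_time: datetime) -> tuple:
    """Derive a (sensor_id, date_bucket) composite partition key so one
    sensor's data is split across daily partitions instead of growing
    a single partition without bound."""
    date_bucket = event_time.strftime("%Y-%m-%d")  # one partition per sensor per day
    return (sensor_id, date_bucket)

k1 = partition_key(42, datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc))
k2 = partition_key(42, datetime(2024, 1, 15, 23, 59, tzinfo=timezone.utc))
k3 = partition_key(42, datetime(2024, 1, 16, 0, 1, tzinfo=timezone.utc))
assert k1 == k2  # same day: rows stay co-located in one partition
assert k1 != k3  # next day: a new partition, so load keeps spreading
```

The bucket granularity is a tuning knob: finer buckets cap partition size, coarser buckets keep more of a range query inside one partition.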
Clustering columns define sort order on disk. Because rows are stored physically sorted by clustering columns, range queries (e.g., “all events between 10:00 and 11:00”) scan contiguously on disk without random I/O.
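The effect of clustering order can be mimicked with a sorted in-memory list: a time-range query becomes two binary searches plus a contiguous slice, the in-memory analogue of a sequential disk scan. The `rows` layout and `range_scan` helper are invented for this sketch.

```python
from bisect import bisect_left, bisect_right

# Rows within a partition, kept in clustering-column order (event_time ASC).
rows = sorted(
    [("10:05", 1.2), ("09:55", 0.9), ("10:40", 1.7), ("11:10", 2.0)]
)  # (event_time, value) pairs

def range_scan(rows, start, end):
    """All rows with start <= event_time <= end, found without a full scan."""
    lo = bisect_left(rows, (start,))
    hi = bisect_right(rows, (end, float("inf")))
    return rows[lo:hi]  # contiguous, like a sequential read in a real SSTable

assert range_scan(rows, "10:00", "11:00") == [("10:05", 1.2), ("10:40", 1.7)]
```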
Write Path: Memtable, Commit Log, and SSTable Flush
- Commit log write: Every write is first appended to the commit log (WAL) for durability. This is a sequential write and is extremely fast.
- Memtable write: The write is applied to an in-memory data structure called the memtable, which keeps rows sorted by clustering columns.
- SSTable flush: When the memtable reaches a size threshold, it is flushed to disk as an immutable SSTable (Sorted String Table). The commit log segment is then eligible for recycling.
SSTables are immutable. Updates do not modify existing SSTables; instead, a new cell with a higher timestamp is written. The latest timestamp wins at read time.
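A toy model of this write path, under heavy simplification (a dict stands in for the memtable, the commit log is omitted, timestamps are plain integers), shows how an update leaves existing SSTables untouched and the newest timestamp wins at read time.

```python
class SimpleTable:
    """Sketch of memtable + immutable SSTables with last-write-wins reads."""

    def __init__(self, flush_threshold=3):
        self.memtable = {}   # key -> (timestamp, value)
        self.sstables = []   # immutable snapshots, newest last
        self.flush_threshold = flush_threshold

    def write(self, key, value, timestamp):
        self.memtable[key] = (timestamp, value)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Flush produces an immutable SSTable sorted by key; memtable resets.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Collect candidates from memtable and every SSTable; the cell
        # with the highest timestamp wins.
        candidates = []
        if key in self.memtable:
            candidates.append(self.memtable[key])
        for sstable in self.sstables:
            if key in sstable:
                candidates.append(sstable[key])
        return max(candidates)[1] if candidates else None

t = SimpleTable()
t.write("k1", "old", timestamp=1)
t.write("k2", "x", timestamp=2)
t.write("k3", "y", timestamp=3)    # triggers a flush; SSTable 0 holds k1="old"
t.write("k1", "new", timestamp=4)  # update is a new cell; SSTable 0 untouched
assert t.read("k1") == "new"       # highest timestamp wins
```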
Read Path: Memtable, Bloom Filter, and SSTable Merge
- Row cache (optional): If enabled, the row cache is consulted first and may serve the assembled row directly, skipping the steps below.
- Memtable check: The memtable is checked since it holds the most recent writes.
- Bloom filter: Each SSTable has a Bloom filter. Before reading an SSTable, the engine checks the Bloom filter to determine whether the partition key might exist in that SSTable, avoiding unnecessary disk reads.
- SSTable merge: Data from all relevant SSTables (and the memtable) is merged. The cell with the highest timestamp for each column wins. Tombstones shadow older values.
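A toy Bloom filter shows why the filter check is cheap and safe: membership tests never produce false negatives, so a negative answer lets the engine skip that SSTable entirely. The bit count and hash count below are arbitrary; a real engine sizes them for a target false-positive rate per SSTable.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter backed by a single Python integer as the bit array."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: str):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "maybe present".
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("sensor-42")
assert bf.might_contain("sensor-42")  # no false negatives, ever
# Most absent keys answer False, so their SSTable reads are skipped.
```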
Compaction Strategies
Compaction merges multiple SSTables into fewer, larger ones, discarding overwritten cells and expired tombstones.
Size-Tiered Compaction Strategy (STCS): Groups SSTables of similar size and merges them. Low write amplification makes it ideal for write-heavy append workloads (time-series ingestion). Read amplification is higher because there may be many SSTables to check.
Leveled Compaction Strategy (LCS): Organises SSTables into levels where each level is a multiple of the previous level's size. SSTables within a level are non-overlapping, which means a row exists in at most one SSTable per level. This greatly reduces read amplification at the cost of higher write amplification. Preferred for read-heavy lookup workloads.
Time-Window Compaction Strategy (TWCS): Groups SSTables by write time window. Ideal for time-series with TTL because entire SSTables expire and can be dropped without cell-level tombstone scanning.
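The STCS grouping step can be sketched as follows. The 50% bucket ratio and minimum threshold of four tables mirror common defaults but are assumptions of this sketch, not a reference implementation.

```python
def stcs_buckets(sstable_sizes, bucket_ratio=0.5, min_threshold=4):
    """Group SSTables of similar size; buckets reaching min_threshold
    tables become compaction candidates."""
    buckets = []  # each bucket is a list of sizes of similar magnitude
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            # "Similar size" = within bucket_ratio of the bucket's average.
            if avg * (1 - bucket_ratio) <= size <= avg * (1 + bucket_ratio):
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return [b for b in buckets if len(b) >= min_threshold]

# Four ~100 MB tables form a candidate bucket; the lone 2 GB table does not.
sizes_mb = [96, 100, 104, 110, 2048]
assert stcs_buckets(sizes_mb) == [[96, 100, 104, 110]]
```

The 2 GB straggler illustrates the STCS trade-off: until similarly sized peers accumulate, large tables sit uncompacted, and reads may have to consult several tiers.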
TTL and Tombstone Garbage Collection
Columns can have a Time-To-Live (TTL) value in seconds. When a column's TTL expires, it becomes a tombstone. Tombstones are physically removed during compaction only after gc_grace_seconds (default 864,000 seconds / 10 days) have elapsed, ensuring all replicas have received the tombstone before it is removed.
Heavy deletion patterns accumulate tombstones that degrade read performance. Mitigation strategies include using TWCS with TTL (whole SSTables expire without cell-level tombstones), or increasing compaction throughput.
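The purge decision during compaction can be modelled like this; the `compact` helper and cell layout are invented for the sketch, but the gc_grace_seconds comparison is the rule that matters.

```python
GC_GRACE_SECONDS = 864_000  # default: 10 days

def compact(cells, now, gc_grace=GC_GRACE_SECONDS):
    """Keep live cells and recent tombstones; purge tombstones whose
    grace period has elapsed (all replicas have had time to see them)."""
    live = {}
    for key, (timestamp, value, is_tombstone) in cells.items():
        if is_tombstone and now - timestamp >= gc_grace:
            continue  # grace period over: drop tombstone and shadowed data
        live[key] = (timestamp, value, is_tombstone)
    return live

now = 1_700_000_000
cells = {
    "old_delete": (now - GC_GRACE_SECONDS - 1, None, True),  # purgeable
    "new_delete": (now - 3600, None, True),                  # must be kept
    "live_cell":  (now - 3600, 42.0, False),
}
result = compact(cells, now)
assert "old_delete" not in result
assert set(result) == {"new_delete", "live_cell"}
```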
SQL DDL: Relational Analog
-- Time-series events: sensor_id as partition key, event_time as clustering key DESC
CREATE TABLE TimeSeriesEvent (
    sensor_id   BIGINT NOT NULL,
    event_time  TIMESTAMPTZ NOT NULL,
    value       DOUBLE PRECISION NOT NULL,
    metadata    JSONB,
    PRIMARY KEY (sensor_id, event_time)
    -- Column-family equivalent: PRIMARY KEY ((sensor_id), event_time)
    -- WITH CLUSTERING ORDER BY (event_time DESC)
);

-- Optional DESC index for newest-first scans per sensor; the primary key's
-- B-tree index can also be scanned backwards, so this is often redundant.
CREATE INDEX idx_ts_event_sensor_time ON TimeSeriesEvent (sensor_id, event_time DESC);

-- User timeline: user_id as partition key, post_time as clustering key DESC
CREATE TABLE UserTimeline (
    user_id    BIGINT NOT NULL,
    post_time  TIMESTAMPTZ NOT NULL,
    post_id    BIGINT NOT NULL,
    PRIMARY KEY (user_id, post_time, post_id)
);

CREATE INDEX idx_timeline_user_time ON UserTimeline (user_id, post_time DESC);
Python: Core Operations
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('sensor_keyspace')

INSERT_EVENT = session.prepare(
    "INSERT INTO time_series_event (sensor_id, event_time, value, metadata) "
    "VALUES (?, ?, ?, ?) USING TTL 2592000"
)

SELECT_EVENTS = session.prepare(
    "SELECT event_time, value, metadata FROM time_series_event "
    "WHERE sensor_id = ? AND event_time >= ? AND event_time <= ? "
    "ORDER BY event_time DESC"
)

SELECT_TIMELINE = session.prepare(
    "SELECT post_time, post_id FROM user_timeline "
    "WHERE user_id = ? AND post_time < ? "
    "ORDER BY post_time DESC LIMIT ?"
)

def record_event(sensor_id: int, event_time: datetime, value: float) -> None:
    """Write a sensor event. TTL of 30 days is applied at the column level."""
    session.execute(INSERT_EVENT, (sensor_id, event_time, value, None))

def get_events(sensor_id: int, start_time: datetime, end_time: datetime) -> list:
    """Retrieve all events for a sensor within a time range, newest first."""
    rows = session.execute(SELECT_EVENTS, (sensor_id, start_time, end_time))
    return [
        {'event_time': row.event_time, 'value': row.value, 'metadata': row.metadata}
        for row in rows
    ]

def get_user_timeline(user_id: int, before_time: datetime, limit: int = 20) -> list:
    """Paginate a user's timeline in reverse chronological order."""
    rows = session.execute(SELECT_TIMELINE, (user_id, before_time, limit))
    return [{'post_time': row.post_time, 'post_id': row.post_id} for row in rows]
Design Considerations Summary
- Partition key: must distribute load evenly and co-locate query data.
- Clustering columns: choose order (ASC/DESC) based on the most common query pattern.
- Compaction: STCS for write-heavy, LCS for read-heavy, TWCS for time-series with TTL.
- Tombstones: avoid heavy delete patterns; use TTL + TWCS to eliminate tombstone accumulation.
- Wide rows: efficient for range scans but watch partition size limits (e.g., Cassandra's 2 billion cells per partition).