LSM Compaction Strategy Low-Level Design: Size-Tiered, Leveled, FIFO, and Write Amplification Analysis

LSM Tree Structure and the Need for Compaction

A Log-Structured Merge-tree (LSM-tree) accepts all writes into an in-memory buffer (MemTable). When the MemTable fills, it is flushed to disk as a sorted, immutable file called an SSTable (Sorted String Table). Because flushes happen continuously, multiple SSTables accumulate on disk. Without compaction, reads must search every SSTable (since they can overlap in key space), space is wasted by stale versions of overwritten keys, and deleted keys linger indefinitely.

Compaction solves these problems by merging SSTables: reading multiple files, merging their sorted key streams, discarding obsolete versions and tombstoned keys, and writing the result to a new SSTable. The design question is which SSTables to merge and when — this is the compaction strategy.
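The merge step can be sketched as a heap-based k-way merge over sorted runs. The `(key, seqno, value)` record layout, the newest-version-wins rule, and the use of `None` as a tombstone marker below are illustrative assumptions, not a real SSTable format:

```python
import heapq

def merge_runs(runs):
    """Merge sorted runs of (key, seqno, value); the newest seqno wins
    and tombstoned keys (value=None) are dropped entirely."""
    # Order by key ascending, then sequence number descending, so the newest
    # version of each key is seen first.
    merged = heapq.merge(*runs, key=lambda e: (e[0], -e[1]))
    last_key = None
    for key, seqno, value in merged:
        if key == last_key:
            continue          # an older version of a key already handled
        last_key = key
        if value is None:
            continue          # tombstone: the key is deleted, emit nothing
        yield key, value

newer = [("a", 5, b"1"), ("c", 6, None)]           # "c" was deleted
older = [("a", 2, b"0"), ("b", 3, b"2"), ("c", 1, b"9")]
print(list(merge_runs([newer, older])))  # [('a', b'1'), ('b', b'2')]
```

Note that the stale version of `a` and both versions of the deleted key `c` are discarded, which is exactly how compaction reclaims space.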

L0: The Flush Layer

Level 0 (L0) receives all MemTable flushes. L0 SSTables can overlap in key space because each is an independent flush snapshot. A high L0 file count degrades read performance (every L0 file must be checked for a key) and is typically the trigger for compaction work. L0 file count is one of the most important operational metrics in an LSM-based system.

Size-Tiered Compaction Strategy (STCS)

STCS groups SSTables by size into tiers. When a tier accumulates enough files (typically 4), they are merged into a single larger SSTable that promotes to the next tier. The tier ratio T (typically 4) determines how much larger each tier is than the previous.

Write amplification in STCS is low: each byte is rewritten roughly once per tier it passes through, so total write amplification is approximately the number of tiers (about log_T of the data size). For 4 tiers, write amplification is roughly 4×. Read amplification is O(tiers × files_per_tier) because SSTables within a tier can overlap and every tier must be searched. STCS is well suited to write-heavy workloads where read latency is less critical.
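The tiering idea can be sketched by bucketing SSTables into size tiers by order of magnitude. The 1 MB base size and ratio of 4 below are assumed illustrative defaults, not fixed constants:

```python
import math

def bucket_by_size_tier(sizes_bytes, base=1 << 20, ratio=4):
    """Group SSTable sizes into tiers, where tier i holds files sized in
    [base * ratio**i, base * ratio**(i+1))."""
    tiers = {}
    for size in sizes_bytes:
        tier = max(0, int(math.log(max(size, base) / base, ratio)))
        tiers.setdefault(tier, []).append(size)
    return tiers

# Four flushes of ~1 MB land in tier 0, enough to trigger a merge; the ~4 MB
# merged output would land in tier 1 alongside the existing 4.2 MB file.
tiers = bucket_by_size_tier([1_000_000, 1_100_000, 950_000, 1_050_000, 4_200_000])
print({t: len(files) for t, files in sorted(tiers.items())})  # {0: 4, 1: 1}
```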

Space amplification in STCS can spike to 2× during compaction because the input SSTables and the output SSTable coexist on disk until the compaction completes.
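The peak usage is easy to estimate: inputs stay on disk until the merged output is durable, so both coexist momentarily. The `dedup_ratio` parameter below is a hypothetical knob for illustration, representing the fraction of bytes dropped as obsolete versions and tombstones:

```python
def stcs_peak_disk_bytes(input_sizes, dedup_ratio=0.0):
    """Peak disk usage during a size-tiered merge: input SSTables remain on
    disk until the merged output SSTable is fully written."""
    total_in = sum(input_sizes)
    output = int(total_in * (1.0 - dedup_ratio))
    return total_in + output  # both coexist when the merge completes

# Merging four 1 GB SSTables with no dedup: 8 GB peak for 4 GB of data (2x).
peak = stcs_peak_disk_bytes([1 << 30] * 4)
print(peak // (1 << 30))  # 8
```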

Leveled Compaction Strategy (LCS)

LCS maintains non-overlapping key ranges within each level (L1 and above). Each level is 10× larger than the previous. When L0 exceeds its file threshold, its SSTables are compacted into L1. When L1 exceeds its size threshold, the overlapping key range is compacted into L2, and so on.

Because keys do not overlap within a level, a point read checks at most one SSTable per level, so read amplification is O(levels). Write amplification is higher: pushing a byte from one level into the next rewrites the overlapping data already resident there, costing up to the level multiplier (~10×) per transition, so a 4-5 level tree commonly sees cumulative write amplification of 10-30×. LCS is optimal for read-heavy workloads where space efficiency and read latency matter more than write throughput.
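The per-level size budgets follow directly from the multiplier. A small sketch, assuming the common 256 MB L1 target and 10× multiplier (defaults vary by engine):

```python
def lcs_level_budgets(l1_target_bytes=256 << 20, multiplier=10, max_levels=5):
    """Per-level size budgets for leveled compaction: each level is
    `multiplier` times larger than the one before it."""
    return {f"L{i}": l1_target_bytes * multiplier ** (i - 1)
            for i in range(1, max_levels + 1)}

budgets = lcs_level_budgets()
print({lvl: b // (1 << 20) for lvl, b in budgets.items()})
# {'L1': 256, 'L2': 2560, 'L3': 25600, 'L4': 256000, 'L5': 2560000}  (MB)
```

With these defaults a 5-level tree tops out around 2.5 TB in L5, which is why `max_levels` is usually left generous and deeper levels simply stay empty until needed.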

FIFO Compaction

FIFO compaction does no merging. When total SSTable size exceeds a configured threshold, the oldest SSTables are simply deleted. This is only appropriate for time-series data with short TTLs where oldest data is genuinely expendable. FIFO has zero write amplification from compaction but provides no key deduplication or tombstone removal.
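FIFO selection can be sketched as follows. The `(created_at, size_bytes)` tuple layout and the combined TTL-plus-size-cap policy are illustrative assumptions; real engines expose these as separate options:

```python
import time

def fifo_expired(files, max_total_bytes, ttl_sec, now=None):
    """Select the oldest files to delete: everything past its TTL, plus
    enough of the oldest remainder to bring total size under the cap.
    `files` is a list of (created_at, size_bytes) tuples."""
    now = time.time() if now is None else now
    ordered = sorted(files)                           # oldest first
    doomed = [f for f in ordered if now - f[0] > ttl_sec]
    kept = [f for f in ordered if now - f[0] <= ttl_sec]
    total = sum(size for _, size in kept)
    while kept and total > max_total_bytes:
        oldest = kept.pop(0)                          # still too big: drop oldest
        doomed.append(oldest)
        total -= oldest[1]
    return doomed

files = [(100.0, 50), (200.0, 50), (300.0, 50)]
# At t=1000 with ttl=850, only the file created at t=100 (age 900) is expired;
# the remaining 100 bytes fit under the 100-byte cap, so nothing else goes.
print(fifo_expired(files, max_total_bytes=100, ttl_sec=850, now=1000.0))
```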

Write Amplification, Read Amplification, and Space Amplification

These three amplification factors form the fundamental tradeoff triangle for LSM compaction:

  • Write amplification (WAF) — bytes written to disk per byte ingested. STCS: ~1 rewrite per tier, total ≈ number of tiers. LCS: ~10× per level.
  • Read amplification (RAF) — SSTables or I/Os checked per point read. STCS: O(tiers × files_per_tier). LCS: O(levels), typically 1 per level.
  • Space amplification (SAF) — extra disk space relative to live data size. STCS: up to 2× during compaction. LCS: closer to 1.1× because only one level's worth of overlap exists at a time.

No compaction strategy minimizes all three simultaneously. STCS minimizes WAF at the cost of RAF and SAF. LCS minimizes RAF and SAF at the cost of WAF. RocksDB's Universal Compaction is an intermediate strategy that targets a combined amplification objective.
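In practice all three factors are measured from raw counters the storage engine exports. A sketch of the arithmetic, with counter names chosen for illustration rather than matching any specific engine:

```python
def amplification_factors(user_bytes, flush_bytes, compaction_bytes,
                          sstables_checked, point_reads,
                          disk_bytes, live_bytes):
    """Compute WAF, RAF, and SAF from cumulative counters."""
    waf = (flush_bytes + compaction_bytes) / user_bytes   # bytes out per byte in
    raf = sstables_checked / point_reads                  # SSTables per point read
    saf = disk_bytes / live_bytes                         # disk vs live data
    return waf, raf, saf

# 1 GB ingested, 1 GB flushed, 9 GB rewritten by compaction -> WAF 10x.
waf, raf, saf = amplification_factors(
    user_bytes=1 << 30, flush_bytes=1 << 30, compaction_bytes=9 << 30,
    sstables_checked=5_000, point_reads=1_000,
    disk_bytes=11 << 30, live_bytes=10 << 30)
print(round(waf, 1), round(raf, 1), round(saf, 2))  # 10.0 5.0 1.1
```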

Compaction Throttling

Compaction is a background I/O operation that competes with foreground reads and writes for disk bandwidth and CPU. Without throttling, a large compaction can saturate disk I/O and cause read latency spikes. Compaction throttling sets a maximum byte rate for compaction I/O (e.g., 100 MB/s). When the compaction rate limit is hit, the compaction thread sleeps until the next rate window. The tradeoff: too aggressive throttling causes L0 file buildup and degraded read performance; too little throttling causes foreground latency spikes.
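One way to implement the rate window described above is a small limiter that tracks cumulative bytes against a time budget. This is a minimal sketch of the scheme, not any engine's actual rate limiter (RocksDB's, for example, is a token-based shared limiter):

```python
import time

class CompactionRateLimiter:
    """Windowed byte-rate limiter: the compaction thread calls request()
    before each write chunk and sleeps if it is running ahead of the
    configured rate."""
    def __init__(self, rate_bytes_per_sec):
        self.rate = rate_bytes_per_sec
        self.start = time.monotonic()
        self.written = 0

    def request(self, nbytes):
        self.written += nbytes
        # Time budget we should have consumed for the bytes written so far.
        allowed_elapsed = self.written / self.rate
        actual_elapsed = time.monotonic() - self.start
        if actual_elapsed < allowed_elapsed:
            time.sleep(allowed_elapsed - actual_elapsed)

limiter = CompactionRateLimiter(rate_bytes_per_sec=50 * 1024 * 1024)
for _ in range(3):
    limiter.request(1024 * 1024)   # pretend we just wrote a 1 MB chunk
```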

SQL Data Model

-- Compaction configuration per column family or table
CREATE TABLE CompactionConfig (
    id                      SERIAL PRIMARY KEY,
    table_name              VARCHAR(255) NOT NULL UNIQUE,
    strategy                VARCHAR(32) NOT NULL,   -- STCS, LCS, FIFO
    l0_file_num_threshold   INT NOT NULL DEFAULT 4,
    level_size_multiplier   INT NOT NULL DEFAULT 10,
    max_levels              INT NOT NULL DEFAULT 7,
    max_compaction_mb_per_sec INT NOT NULL DEFAULT 100,
    created_at              TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Compaction statistics per run
CREATE TABLE CompactionStat (
    id                  BIGSERIAL PRIMARY KEY,
    table_name          VARCHAR(255) NOT NULL,
    strategy            VARCHAR(32) NOT NULL,
    write_amp_factor    NUMERIC(8,2),
    read_amp_factor     NUMERIC(8,2),
    space_amp_factor    NUMERIC(8,2),
    bytes_read          BIGINT,
    bytes_written       BIGINT,
    duration_ms         INT,
    measured_at         TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_compstat_table_time ON CompactionStat(table_name, measured_at DESC);
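Rows from the CompactionStat table can feed a rolling write-amplification dashboard. A sketch of the aggregation, with rows shown as plain dicts rather than a live database cursor:

```python
def rolling_waf(rows, user_bytes_written):
    """WAF over a window = total compaction/flush bytes written in that
    window divided by user bytes ingested in the same window."""
    total_written = sum(r["bytes_written"] for r in rows)
    return total_written / user_bytes_written if user_bytes_written else 0.0

# Hypothetical rows fetched for one table over the last hour.
rows = [
    {"table_name": "events", "bytes_written": 4 << 30},
    {"table_name": "events", "bytes_written": 6 << 30},
]
print(rolling_waf(rows, user_bytes_written=1 << 30))  # 10.0
```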

Python Implementation Sketch

import heapq, time
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class SSTable:
    level: int
    size_bytes: int
    min_key: str
    max_key: str
    file_path: str
    created_at: float = field(default_factory=time.time)

def select_sstables_for_compaction(strategy: str,
                                   levels: list[list[SSTable]]) -> list[SSTable]:
    """Return the SSTables selected for the next compaction run."""
    if strategy == "STCS":
        # Pick the first tier with enough files, then merge the 4 smallest
        # (most similar in size) together
        for level_files in levels:
            if len(level_files) >= 4:
                level_files.sort(key=lambda f: f.size_bytes)
                return level_files[:4]
    elif strategy == "LCS":
        # Compact L0 into L1 when L0 exceeds threshold
        if len(levels[0]) >= 4:
            return levels[0][:]
        # Otherwise compact the level that most exceeds its size budget
        for i, level_files in enumerate(levels[1:], 1):
            total = sum(f.size_bytes for f in level_files)
            budget = (10 ** (i - 1)) * 256 * 1024 * 1024  # 256 MB * 10^(level-1)
            if total > budget:
                return level_files[:]
    elif strategy == "FIFO":
        # No merging — mark oldest files for deletion
        all_files = sorted(
            [f for level in levels for f in level],
            key=lambda f: f.created_at
        )
        return all_files[:1] if all_files else []
    return []

def compute_write_amplification(levels: list[list[SSTable]]) -> float:
    """Estimate write amplification from SSTable sizes across levels."""
    if not levels or not levels[0]:
        return 0.0
    total_written = sum(f.size_bytes for level in levels for f in level)
    ingested = sum(f.size_bytes for f in levels[0])
    return total_written / ingested if ingested else 0.0

def throttle_compaction_io(rate_mb_per_sec: float, bytes_written: int,
                           elapsed_sec: float) -> None:
    """Sleep if compaction I/O rate exceeds the configured limit."""
    if elapsed_sec <= 0:
        return
    actual_rate_mb = bytes_written / (1024 * 1024) / elapsed_sec
    if actual_rate_mb > rate_mb_per_sec:
        sleep_sec = (bytes_written / (rate_mb_per_sec * 1024 * 1024)) - elapsed_sec
        if sleep_sec > 0:
            time.sleep(sleep_sec)

def merge_sstables(tables: list[SSTable]) -> Iterator[tuple[str, bytes]]:
    """K-way merge of sorted SSTables, yielding (key, value) in order."""
    # Placeholder: open each SSTable file and merge with a heap
    iterators = [iter([]) for _ in tables]  # replace with real SSTable iterators
    heap: list[tuple[str, int, bytes]] = []
    for i, it in enumerate(iterators):
        try:
            key, val = next(it)
            heapq.heappush(heap, (key, i, val))
        except StopIteration:
            pass
    while heap:
        key, i, val = heapq.heappop(heap)
        yield key, val
        try:
            nkey, nval = next(iterators[i])
            heapq.heappush(heap, (nkey, i, nval))
        except StopIteration:
            pass

Frequently Asked Questions

{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "When should I choose STCS over LCS?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Choose Size-Tiered Compaction (STCS) for write-heavy workloads where write throughput is the primary concern and you can tolerate higher read latency and temporary space amplification. STCS rewrites data less often, so it conserves write I/O bandwidth. Choose Leveled Compaction (LCS) when read latency and space efficiency matter more than write throughput — OLTP workloads with frequent point reads benefit from LCS because at most one SSTable per level contains any given key."
}
},
{
"@type": "Question",
"name": "How is write amplification factor calculated?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Write amplification factor (WAF) = total bytes written to disk / bytes ingested by the database. For LCS with L levels and a level multiplier of 10, each byte written to L0 may be rewritten once per level compaction it participates in, giving WAF approximately equal to the sum of bytes rewritten across all levels divided by the original write size. In practice, LCS WAF is typically 10-30x for 4-5 levels. You can measure it directly as (compaction_bytes_written + flush_bytes_written) / user_bytes_written over a time window."
}
},
{
"@type": "Question",
"name": "How does compaction I/O throttling work?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Compaction throttling limits the bytes-per-second rate at which the compaction thread reads and writes SSTables. The implementation tracks bytes written in the current time window and sleeps the compaction thread when the rate exceeds the configured limit (e.g., 100 MB/s). RocksDB implements this via its rate limiter. The tradeoff: too tight a limit allows L0 file count to grow (degrading reads); too loose a limit allows compaction to saturate disk I/O (degrading foreground latency)."
}
},
{
"@type": "Question",
"name": "When is FIFO compaction appropriate?",
"acceptedAnswer": {
"@type": "Answer",
"text": "FIFO compaction is only appropriate for time-series data with a defined TTL where the oldest data is genuinely expendable. It simply deletes the oldest SSTable files when total size exceeds the configured limit, with no merging or deduplication. Use cases include metrics ingestion pipelines, log storage, and event streams where data older than a retention window has no value. Never use FIFO for general-purpose key-value storage — it will silently drop data that is not expired."
}
}
]
}

STCS vs LCS: How to Choose

Choose STCS for write-heavy workloads where write throughput is the primary concern. Choose LCS when read latency and space efficiency matter more — OLTP workloads with frequent point reads benefit from LCS because at most one SSTable per level contains any given key.

Write Amplification Formula

WAF = total bytes written to disk / bytes ingested. For LCS with a level multiplier of 10, each level transition can rewrite up to ~10× the incoming bytes (the overlapping data already in the target level), giving observed WAF of 10–30× for 4–5 level trees. Measure it as (compaction_bytes_written + flush_bytes_written) / user_bytes_written over a time window.

Compaction I/O Throttling

Throttling limits the byte/s rate at which the compaction thread reads and writes. The thread sleeps when it exceeds the configured rate cap. Too tight a limit allows L0 file count to grow; too loose allows compaction to saturate disk I/O and spike foreground latency.

When to Use FIFO Compaction

FIFO is only for time-series data with a defined TTL. It deletes the oldest SSTable files when total size exceeds the limit — no merging, no deduplication. Never use FIFO for general-purpose key-value storage; it will silently drop non-expired data.


