Low Level Design: Compression Algorithms

Compression is one of the highest-leverage primitives in systems design. It reduces storage cost, cuts network bandwidth, and often improves query performance by reducing I/O — even with the CPU overhead of decompression. Understanding how compression algorithms work at a low level lets you make informed choices about which codec to use for which workload.

Lossless vs Lossy Compression

Lossless compression reconstructs the exact original data. Every bit is preserved. Required for text, source code, executables, databases, and any data where approximation is unacceptable. All compression discussed in this post is lossless unless noted.

Lossy compression discards information to achieve higher compression ratios, accepting that the reconstructed data is an approximation of the original. JPEG discards high-frequency image detail that human vision is less sensitive to. MP3 discards audio frequencies the ear doesn’t perceive well. H.264/H.265 exploit temporal redundancy in video, encoding only the differences between frames. Lossy compression achieves 10-100x better ratios than lossless for media but is completely inappropriate for data that must be reconstructed exactly.

Entropy Coding: The Theoretical Foundation

Information theory (Shannon, 1948) establishes that the minimum number of bits needed to represent a symbol is log₂(1/p) where p is the symbol’s probability. A symbol that appears 50% of the time needs only 1 bit. A symbol that appears 0.1% of the time needs ~10 bits. Entropy coding exploits this by assigning shorter codes to more frequent symbols, approaching the theoretical minimum (entropy) of the source.
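As a rough illustration, the entropy formula can be computed directly. The probabilities below are made-up examples; the sketch assumes nothing beyond the Python standard library:

```python
import math

def shannon_entropy(probabilities):
    """Bits per symbol for an ideal code: sum of p * log2(1/p)."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

# A symbol appearing 50% of the time needs 1 bit; a rare symbol needs far more.
print(math.log2(1 / 0.5))    # 1.0 bit
print(math.log2(1 / 0.001))  # ~9.97 bits
# A skewed 4-symbol source has lower entropy than a uniform one:
print(shannon_entropy([0.7, 0.1, 0.1, 0.1]))  # ~1.36 bits/symbol
print(shannon_entropy([0.25] * 4))            # 2.0 bits/symbol
```

No real code can beat these averages; the rest of the post is about how close practical algorithms get.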

There are two main entropy coding approaches: Huffman coding (practical, widely used) and arithmetic coding (closer to the theoretical optimum, more complex). Both are used as a final stage in modern compressors after a dictionary-based pass removes structural redundancy.

Huffman Coding

Huffman coding builds an optimal prefix-free binary code from a symbol frequency table. The algorithm: count the frequency of each symbol, build a min-heap of symbols keyed by frequency, then repeatedly merge the two lowest-frequency nodes into a parent node (frequency = sum of children) until one tree remains. Assign 0/1 to left/right branches. The resulting codes are prefix-free (no code is a prefix of another, enabling unambiguous decoding) and optimal in the sense that no other prefix-free code achieves a shorter average code length for the same frequency table.
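The merge loop can be sketched with a heap in a few lines. This is a minimal, illustrative implementation (the tiebreaker field exists only to keep heap comparisons away from the tree nodes):

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build prefix-free codes from symbol frequencies (a minimal sketch)."""
    freq = Counter(data)
    # Heap entries: (frequency, tiebreaker, tree). A tree is either a symbol
    # (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # two lowest-frequency nodes...
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (t1, t2)))  # ...merged
        tiebreak += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")  # left branch gets 0
            walk(tree[1], prefix + "1")  # right branch gets 1
        else:
            codes[tree] = prefix or "0"  # single-symbol edge case
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("abracadabra")
# 'a' (5 occurrences) gets a shorter code than 'c' or 'd' (1 occurrence each).
assert len(codes["a"]) < len(codes["c"])
```

Real implementations (like Deflate's) additionally canonicalize the codes so the decoder can rebuild the table from code lengths alone.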

A canonical example: if ‘e’ has frequency 0.13 and ‘z’ has frequency 0.0007, Huffman assigns ‘e’ something like "010" (3 bits) and ‘z’ something like "1011101" (7 bits). Average code length approaches entropy. Deflate uses Huffman coding for its literal/length and distance codes after the LZ77 pass finds repeated sequences.

Arithmetic Coding

Arithmetic coding encodes an entire message as a single fraction in the interval [0,1). It maintains a current interval, and for each symbol, narrows the interval proportionally to that symbol’s probability. After processing all symbols, any fraction within the final interval represents the message uniquely.
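The interval-narrowing step can be demonstrated with exact fractions. This is a toy sketch that skips the renormalization real coders need to handle long messages with fixed-precision arithmetic; the two-symbol alphabet is a made-up example:

```python
from fractions import Fraction

def narrow_interval(message, probs):
    """Narrow [0,1) by each symbol's probability slice (toy, no renormalization)."""
    # Assign each symbol a sub-interval of [0,1) proportional to its probability.
    cum, start = {}, Fraction(0)
    for sym, p in probs.items():
        cum[sym] = (start, start + Fraction(p))
        start += Fraction(p)
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        width = high - low
        lo_frac, hi_frac = cum[sym]
        low, high = low + width * lo_frac, low + width * hi_frac
    return low, high  # any fraction in [low, high) identifies the message

probs = {"a": Fraction(3, 4), "b": Fraction(1, 4)}
low, high = narrow_interval("aab", probs)
# Final width = product of the symbol probabilities: 3/4 * 3/4 * 1/4 = 9/64.
print(high - low)  # 9/64
```

The final interval width is the product of the symbol probabilities, so the number of bits needed to pin down a fraction inside it is essentially the message's information content: this is why arithmetic coding can spend fractional bits per symbol.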

Arithmetic coding approaches the theoretical entropy limit more closely than Huffman, especially for symbols with skewed probabilities (where Huffman wastes fractional bits by rounding code lengths up to whole numbers). The tradeoff is computational complexity — it requires high-precision interval arithmetic and a more involved implementation. Range coding, a byte-oriented variant, is used in LZMA (the algorithm behind .xz and 7-Zip), and binary arithmetic coding appears as CABAC in H.264 video coding. (bzip2, sometimes lumped in here, actually uses Huffman coding after its Burrows-Wheeler transform.) When maximum compression ratio matters more than speed, arithmetic-coded algorithms are chosen.
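Python's standard library happens to ship both an LZMA binding and zlib, which makes the ratio-versus-speed tradeoff easy to observe on repetitive data. The exact sizes below depend on the input and library versions, so treat this as illustrative:

```python
import lzma
import zlib

data = b"the quick brown fox jumps over the lazy dog. " * 200
xz = lzma.compress(data)           # LZMA: large window + range coding
gz = zlib.compress(data, level=9)  # Deflate: LZ77 + Huffman
print(len(data), len(xz), len(gz))
assert lzma.decompress(xz) == data  # lossless: exact round trip
assert len(xz) < len(data)
```

On larger, less repetitive inputs LZMA's ratio advantage grows, at the cost of markedly slower compression.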

Dictionary-Based Compression: LZ77

LZ77 (Ziv and Lempel, 1977) is the foundation of most practical compressors. The key insight: natural data (text, code, binary formats) contains many repeated byte sequences. Instead of encoding a repeated sequence again, encode a back-reference: "go back 150 bytes, copy 12 bytes from there."

LZ77 maintains a sliding window of recently seen bytes (the search buffer, e.g. 32KB). For each position in the input, it finds the longest match anywhere in the sliding window. Output is a triple: (offset back into window, match length, next literal byte that didn’t match). A back-reference of (150, 12, ‘a’) means "copy 12 bytes starting 150 bytes back, then emit literal ‘a’." If no match is found, output just the literal byte.
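The triple-based scheme can be sketched naively. Real compressors replace the linear match search below with hash chains or hash tables; this O(n·w) version is only meant to make the format concrete:

```python
def lz77_compress(data, window=32 * 1024, min_match=3):
    """Emit (offset, length, next_literal) triples (naive search, for clarity)."""
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        # Naive scan of the window; real encoders use hash chains here.
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        if best_len >= min_match:
            out.append((best_off, best_len, data[i + best_len]))
            i += best_len + 1
        else:
            out.append((0, 0, data[i]))  # no useful match: emit a literal
            i += 1
    return out

def lz77_decompress(triples):
    buf = bytearray()
    for off, length, lit in triples:
        for _ in range(length):
            buf.append(buf[-off])  # byte-by-byte copy handles overlapping matches
        buf.append(lit)
    return bytes(buf)

data = b"abcabcabcabcx"
assert lz77_decompress(lz77_compress(data)) == data
```

Note that a match may overlap its own output (offset 3, length 9 above): copying byte by byte makes this well-defined, and it is how LZ77 encodes runs cheaply.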

LZ77 is the basis for Deflate, Zstandard, LZ4, and most other modern compressors. The variants differ in window size, match-finding algorithm (hash chains vs suffix arrays vs hash tables), and the entropy coder applied to the output stream.

LZ78 and LZW

LZ78 builds a dictionary of seen strings dynamically during encoding. Instead of a sliding window, it maintains an explicit dictionary: map from string to integer index. On each step, find the longest dictionary entry matching the current input, output its index, add the matched string + next byte as a new dictionary entry.

LZW (Lempel-Ziv-Welch) is a refinement of LZ78 used in the GIF image format and in TIFF files. Both LZ78 and LZW have the advantage of a single-pass encoder (no sliding-window match search required) but generally achieve worse compression than LZ77-based algorithms on most data. LZW was also encumbered by patents until the early 2000s, which accelerated the adoption of Deflate as the universal standard.
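The dictionary-building loop that LZ78 and LZW share can be sketched as follows. This is a minimal LZW encoder (it emits a list of integer codes and ignores the variable-width bit packing a real GIF/TIFF encoder would do):

```python
def lzw_compress(data):
    """LZW: grow a string-to-code dictionary while encoding (minimal sketch)."""
    # Seed the dictionary with all single bytes, as GIF/TIFF-style LZW does.
    table = {bytes([i]): i for i in range(256)}
    next_code, current, out = 256, b"", []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate          # keep extending the current match
        else:
            out.append(table[current])   # emit the longest known string
            table[candidate] = next_code # learn the new string for next time
            next_code += 1
            current = bytes([byte])
    if current:
        out.append(table[current])
    return out

codes = lzw_compress(b"TOBEORNOTTOBEORTOBEORNOT")
# Fewer codes than input bytes once the dictionary warms up on repeats.
assert len(codes) < 24
```

The decoder can rebuild the same dictionary from the code stream alone, which is why no dictionary needs to be transmitted.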

Deflate: The Universal Standard

Deflate combines LZ77 (for dictionary-based redundancy removal) with Huffman coding (for entropy coding of the output). It’s the algorithm inside gzip, PNG, zlib, and ZIP files — arguably the most widely deployed compression algorithm in history.

Deflate achieves good compression ratios with moderate speed — suitable for general-purpose compression where both ratio and speed matter. The gzip format adds a header, CRC32 checksum, and length field around a Deflate stream. PNG uses Deflate to compress pixel data within the image format, applied after a delta filter that improves compression of image data.

Zlib is the reference implementation library used everywhere from HTTP servers (Content-Encoding: deflate) to database WAL compression. Deflate has no patent issues and is universally supported — the safe default when compatibility matters most.
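Because zlib ships in essentially every language runtime, a Deflate round trip is a one-liner. The JSON payload below is a made-up example:

```python
import gzip
import zlib

payload = b'{"status": "ok", "items": []}' * 100
compressed = zlib.compress(payload, level=6)  # 6 is zlib's default level
print(f"{len(payload)} -> {len(compressed)} bytes")
assert zlib.decompress(compressed) == payload  # lossless: exact round trip
# The gzip format wraps the same Deflate stream with a header and CRC32:
assert gzip.decompress(gzip.compress(payload)) == payload
```

Levels 1-9 trade CPU for ratio on the same format: any Deflate decoder can read output produced at any level.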

Snappy and LZ4: Speed-Optimized Compressors

Snappy was developed at Google with an explicit design goal: prioritize decompression speed over maximum compression ratio. It achieves 250-500 MB/s decompression on modern hardware (vs ~100-150 MB/s for zlib). Compression ratio is typically 1.5-2x — noticeably worse than Deflate (2-5x) but sufficient to meaningfully reduce I/O. Used in Cassandra for SSTable compression, HBase, Hadoop SequenceFiles, and LevelDB. The right choice when CPU is the bottleneck and you need decompression to keep up with fast SSDs or the network.

LZ4 pushes speed even further: 1-2 GB/s decompression, making it faster than memory bandwidth on many systems — the bottleneck shifts back to memory, not decompression. Compression ratio is roughly comparable to Snappy. Used in the Linux kernel (squashfs), ZFS, ClickHouse (default codec), RocksDB, and real-time data pipelines where latency is critical. LZ4 is the right default when you want compression overhead to be essentially zero.

Zstandard: The Modern Default

Zstandard (zstd), developed at Facebook and open-sourced in 2016, achieves the best ratio-to-speed tradeoff of any widely-used general-purpose compressor. At level 3 (default), it matches or beats zlib’s compression ratio while being 3-5x faster to decompress. At level 1 (fastest), it approaches LZ4 speed with better compression. At level 19-22 (maximum), it approaches LZMA ratios with far better speed.

Key features: dictionary pretraining for small data — train a shared dictionary on a sample of similar documents, then compress each document using the dictionary. This dramatically improves compression of small objects (JSON API responses, small log entries) where standard compressors can’t build up context. Tunable compression levels (1-22) let you trade CPU for ratio on the same algorithm.

Adoption: Linux kernel, Facebook’s entire storage infrastructure, Kafka (producer compression), RocksDB (cold data), npm package registry, ArchLinux packages. Zstd has largely displaced gzip and bzip2 as the go-to choice for new systems that don’t need legacy compatibility.

Brotli: HTTP Content Encoding

Brotli was developed at Google specifically for HTTP content encoding — compressing HTML, CSS, and JavaScript served to browsers. It achieves 15-25% better compression than gzip at comparable decompression speed, which translates directly to faster page loads on mobile networks.

Brotli uses a combination of LZ77, Huffman coding, and a built-in static dictionary of common web content strings. The static dictionary gives Brotli a head start compressing typical web content — strings like "Content-Type", "text/html", and common HTML tags are in the dictionary and compress to a few bits. Supported by all major browsers, which advertise it via the Accept-Encoding: br request header. The right choice for static web assets served at high volume; not used for database or storage compression, where Zstd dominates.

Database and File Format Codec Selection

Columnar file formats (Parquet, ORC) layer multiple compression techniques. At the column encoding level: dictionary encoding for low-cardinality columns (store distinct values once, encode each cell as an index — excellent for status fields, country codes), run-length encoding (RLE) for sorted or repeated values, delta encoding for monotonically increasing integers (timestamps, auto-increment IDs). These encodings often reduce data size by 10-50x before the general-purpose codec even runs. For the codec layer, Snappy is the Parquet default (fast, reasonable ratio), Zstd is increasingly preferred for better ratio with acceptable speed.
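Two of the column encodings above are small enough to sketch directly. The timestamp values here are made-up; the point is how delta encoding turns a monotonic column into a stream that run-length encoding collapses:

```python
def delta_encode(values):
    """Store the first value, then successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def rle_encode(values):
    """Collapse runs of equal values into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# Timestamps one second apart: the deltas are 999 identical small ints...
timestamps = list(range(1_700_000_000, 1_700_000_000 + 1000))
deltas = delta_encode(timestamps)
# ...which RLE then collapses to almost nothing.
print(rle_encode(deltas))  # [[1700000000, 1], [1, 999]]
```

Dictionary encoding for low-cardinality columns follows the same pattern: replace each value with a small index into a table of distinct values, then let RLE and the general-purpose codec finish the job.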

PostgreSQL uses TOAST (The Oversized-Attribute Storage Technique) to compress and/or store large column values out-of-line. The default TOAST compression algorithm is pglz (a simple LZ-family compressor). PostgreSQL 14+ supports LZ4 as an alternative TOAST compressor — significantly faster for workloads with heavy large-object access.

ClickHouse uses LZ4 as the default block compression codec — chosen because ClickHouse queries are already extremely fast, and LZ4’s near-zero decompression overhead avoids making compression the bottleneck. For cold data or data that will be queried infrequently, Zstd is recommended as the codec: the higher compression ratio reduces storage cost and I/O, and the slightly higher decompression cost is acceptable when queries run less frequently. ClickHouse also supports CODEC(Delta, LZ4) — apply delta encoding first, then LZ4 — which is extremely effective for time-series data where timestamps differ by small, consistent intervals.
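The effect of a Delta-then-codec pipeline can be approximated with the standard library. zlib stands in for LZ4 below, since LZ4 is not in Python's stdlib, and the timestamps are made-up; the exact compressed sizes are illustrative:

```python
import struct
import zlib

# 10,000 timestamps at a fixed 1-second interval, packed as 8-byte integers.
timestamps = range(1_700_000_000, 1_700_010_000)
raw = struct.pack("<10000q", *timestamps)

# Delta-encode first: every difference is 1, so the byte stream becomes
# one large value followed by 9,999 identical 8-byte words.
prev, deltas = 0, []
for t in timestamps:
    deltas.append(t - prev)
    prev = t
delta_bytes = struct.pack("<10000q", *deltas)

# zlib stands in for LZ4 here; the delta pass does most of the work.
print(len(zlib.compress(raw)), len(zlib.compress(delta_bytes)))
assert len(zlib.compress(delta_bytes)) < len(zlib.compress(raw))
```

The same principle underlies ClickHouse's CODEC(Delta, LZ4): a cheap transform that exposes the data's regularity lets even a fast, simple codec achieve a strong ratio.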
