Low Level Design: Probabilistic Data Structures — HyperLogLog, Count-Min Sketch, and More

Probabilistic data structures trade exactness for dramatic reductions in memory and computation. Instead of exact answers, they provide approximate answers with provable error bounds. HyperLogLog estimates cardinality (count of unique elements) using kilobytes of memory instead of gigabytes. Count-Min Sketch estimates frequency of elements in a stream. Bloom filters test set membership. These structures are foundational in large-scale analytics, stream processing, and database systems — Redis, Google BigQuery, Apache Flink, and Cassandra all use them in production.

HyperLogLog: Counting Unique Elements

Problem: count distinct active users from 1 billion events. Exact solution: store every user_id in a hash set — requires ~8 bytes × 50M unique users = 400MB. HyperLogLog solution: uses ~12KB with a standard error of about 0.81%. Algorithm: hash each element to a binary string; observe the run of leading zeros before the first 1-bit. Intuitively: a run of k leading zeros appears with probability 2^-k, so observing a maximum run of k across all hashes suggests roughly 2^k unique elements have been hashed. HyperLogLog maintains an array of M registers, each tracking the longest leading-zero run seen in its partition of the hash space. The cardinality estimate is derived from the harmonic mean of 2^register across all registers; the harmonic mean dampens outlier registers, and M registers reduce the variance — classic HyperLogLog has a relative standard error of about 1.04/√M, and HyperLogLog++ (Google) adds bias correction and a sparse representation for better accuracy at small cardinalities. Redis HyperLogLog: each key uses at most 12KB (dense encoding; small sets use a compact sparse encoding and remain effectively exact), handles cardinalities up to 2^64, and supports PFMERGE (union of multiple HyperLogLog sketches).
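
The register update and estimate above can be sketched in a few lines of Python. This is illustrative only, not Redis's implementation: it assumes SHA-1 truncated to 64 bits as the hash, p = 14 (m = 16384 registers, matching Redis's register count), and includes only the small-range linear-counting correction.

```python
# Minimal HyperLogLog sketch (illustrative, not Redis's implementation).
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                        # number of registers
        self.registers = [0] * self.m

    def _hash64(self, value: str) -> int:
        digest = hashlib.sha1(value.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value: str) -> None:
        x = self._hash64(value)
        idx = x >> (64 - self.p)               # first p bits choose a register
        rest = x & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # rank = number of leading zeros in `rest`, plus one
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias constant (m >= 128)
        harmonic = sum(2.0 ** -r for r in self.registers)
        estimate = alpha * self.m * self.m / harmonic  # raw HLL estimate
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:         # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return estimate
```

Adding 100,000 distinct values and re-adding duplicates leaves the estimate within a few percent of 100,000 while using only 16,384 small registers.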

# Redis HyperLogLog: count unique visitors per day
# Each PFADD is O(1), PFCOUNT is O(1), memory is always 12KB

# Track unique user IDs visiting a page
PFADD page:home:2024-01-15 "user_12345"
PFADD page:home:2024-01-15 "user_67890"
PFADD page:home:2024-01-15 "user_12345"  # duplicate, ignored

PFCOUNT page:home:2024-01-15  # returns ~2 (small count is exact)

# Merge multiple days to count unique weekly visitors
PFMERGE page:home:week-03 page:home:2024-01-15 page:home:2024-01-16 page:home:2024-01-17
PFCOUNT page:home:week-03  # approximate unique visitors this week

# Count-Min Sketch in Redis (via RedisBloom module):
# Estimate frequency of any element in a stream
CMS.INITBYDIM top_products 2000 5  # 2000 columns, 5 rows (hash functions)
CMS.INCRBY top_products product_123 1
CMS.QUERY top_products product_123  # returns estimated view count

# Bloom filter: test set membership without false negatives
BF.ADD seen_order_ids "order-abc-123"
BF.EXISTS seen_order_ids "order-abc-123"  # returns 1 (definitely seen)
BF.EXISTS seen_order_ids "order-xyz-999"  # returns 0 (definitely not seen) or 1 (false positive ~1%)

Count-Min Sketch: Frequency Estimation

Count-Min Sketch estimates the frequency of elements in a data stream without storing the full dataset. Structure: a 2D array of w × d counters, where w is the width and d is the depth (number of hash functions). For each element: hash with each of d hash functions → increment the corresponding counter in each row. Query: hash the element with each function → return the minimum counter value across all rows (the minimum reduces the overcount caused by hash collisions; the sketch can overestimate but never underestimates). Accuracy: with width w = ⌈e/ε⌉ and depth d = ⌈ln(1/δ)⌉, each estimate exceeds the true count by at most εN (N = total stream size) with probability at least 1−δ. Example: estimate top-k most viewed products in real-time — increment a Count-Min Sketch for each page view, query the sketch for any product ID. Much smaller than storing exact counts for billions of product views.
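
A minimal Python sketch of the structure described above (illustrative; it assumes the d hash functions can be simulated by salting a single SHA-1 hash with the row index, and sizes the table from the ε/δ formulas):

```python
# Minimal Count-Min Sketch (illustrative).
import hashlib
import math

class CountMinSketch:
    def __init__(self, epsilon=0.001, delta=0.01):
        self.w = math.ceil(math.e / epsilon)     # width: overcount <= epsilon * N
        self.d = math.ceil(math.log(1 / delta))  # depth: failure prob <= delta
        self.table = [[0] * self.w for _ in range(self.d)]

    def _hash(self, item: str, row: int) -> int:
        # One hash function per row, simulated by salting with the row index
        h = hashlib.sha1(f"{row}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.w

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.d):
            self.table[row][self._hash(item, row)] += count

    def query(self, item: str) -> int:
        # Minimum across rows reduces collision overcount; never underestimates
        return min(self.table[row][self._hash(item, row)] for row in range(self.d))
```

With ε = 0.001 and δ = 0.01 this allocates a 2719 × 5 table — a few tens of kilobytes regardless of how many distinct elements the stream contains.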

Bloom Filter: Set Membership

A Bloom filter tests whether an element is in a set. No false negatives (if it says not present, definitely not present). Small false positive rate (configurable). Structure: a bit array of m bits and k hash functions. Add element: hash with each of k functions, set the corresponding bits to 1. Query: hash with each function, check if all corresponding bits are 1. If any bit is 0, the element is not in the set. If all bits are 1, it is probably in the set (false positive possible). Sizing: a 1% false positive rate requires m/n ≈ 9.6 bits per element with k = (m/n)·ln 2 ≈ 7 hash functions. Use cases: database page-level cache (check Bloom filter before disk I/O — skip I/O for keys definitely not in a block), LSM-tree SSTables (avoid searching a file for a key that is not there), web crawlers (avoid re-crawling URLs), duplicate request detection.
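
The add/query mechanics and the sizing formulas can be sketched as follows (illustrative; it assumes k salted SHA-1 hashes over a bytearray, with m and k derived from the expected element count and target false positive rate):

```python
# Minimal Bloom filter (illustrative).
import hashlib
import math

class BloomFilter:
    def __init__(self, expected_n: int, fpr: float = 0.01):
        # m = -n * ln(p) / (ln 2)^2  ->  ~9.6 bits/element at 1% FPR
        self.m = math.ceil(-expected_n * math.log(fpr) / (math.log(2) ** 2))
        # optimal k = (m/n) * ln 2, rounded to an integer (~7 at 1% FPR)
        self.k = max(1, round((self.m / expected_n) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # All k bits set -> probably present; any bit clear -> definitely absent
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Every added element is always reported present (no false negatives); absent elements are reported present roughly 1% of the time at these parameters.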

Key Interview Discussion Points

  • T-digest: approximates quantiles (P50, P95, P99) from a data stream using adaptive precision — more precise near the tails (P99 matters more than P50 for latency monitoring); used by Elasticsearch and InfluxDB for percentile aggregations
  • MinHash for set similarity: estimate Jaccard similarity between two sets without comparing all elements; used by Spotify for playlist recommendation (similar playlist = high Jaccard similarity of songs), near-duplicate document detection
  • Flajolet-Martin algorithm: precursor to HyperLogLog; both observe leading zeros in hash values; HyperLogLog improved accuracy by using multiple registers and harmonic mean
  • Cuckoo filter: improves on Bloom filters by supporting deletion (Bloom filters cannot delete items) and slightly better space efficiency at low false positive rates; used by systems that need to remove elements from the filter
  • Approximate Top-K (heavy hitters): combining Count-Min Sketch with a heap of top-k candidates identifies the most frequent elements in a stream in O(log k) per element — used for trending hashtags, top search queries, most active API keys
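
The heavy-hitters idea in the last bullet can be sketched by pairing a Count-Min Sketch with a min-heap of candidates (illustrative; the `TopK` class and its parameters are made up for this sketch, and stale heap entries are discarded lazily so per-element heap work stays O(log k) amortized):

```python
# Approximate top-k heavy hitters: Count-Min Sketch + candidate min-heap
# (illustrative sketch).
import hashlib
import heapq

class TopK:
    def __init__(self, k: int, width: int = 2719, depth: int = 5):
        self.k = k
        self.w, self.d = width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.candidates = {}   # item -> latest estimated count
        self.heap = []         # min-heap of (est, item); may hold stale entries

    def _hash(self, item: str, row: int) -> int:
        h = hashlib.sha1(f"{row}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.w

    def add(self, item: str) -> None:
        for row in range(self.d):                 # CMS increment
            self.table[row][self._hash(item, row)] += 1
        est = min(self.table[row][self._hash(item, row)] for row in range(self.d))
        if item in self.candidates:
            self.candidates[item] = est
            heapq.heappush(self.heap, (est, item))  # old entry becomes stale
            return
        # Drop stale heap entries whose count no longer matches the candidate map
        while self.heap and self.candidates.get(self.heap[0][1]) != self.heap[0][0]:
            heapq.heappop(self.heap)
        if len(self.candidates) < self.k:
            self.candidates[item] = est
            heapq.heappush(self.heap, (est, item))
        elif est > self.heap[0][0]:               # outranks the weakest candidate
            _, evicted = heapq.heappop(self.heap)
            del self.candidates[evicted]
            self.candidates[item] = est
            heapq.heappush(self.heap, (est, item))

    def topk(self):
        return sorted(self.candidates.items(), key=lambda kv: -kv[1])
```

On a skewed stream the candidate set converges to the true heavy hitters, since an item's CMS estimate must exceed the current weakest candidate's count to enter the heap.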