System Design: Bloom Filters, Count-Min Sketch, HyperLogLog — Probabilistic Data Structures for Scale

Probabilistic data structures trade perfect accuracy for dramatic space savings. A Bloom filter can test membership in a set of 1 billion items using only 1.2 GB of memory with a 1% false positive rate — compared to 8+ GB for a hash set. These data structures appear in system design interviews and power real-world systems at Google and Facebook, as well as databases such as Cassandra. This guide covers Bloom filters, Count-Min Sketch, and HyperLogLog with practical applications.

Bloom Filter: Space-Efficient Membership Testing

A Bloom filter answers the question “is this element in the set?” with two possible answers: “definitely not in the set” (100% accurate) or “probably in the set” (may be a false positive). It never produces false negatives — if it says no, the element is absolutely not in the set.

Implementation: a bit array of m bits, initialized to all zeros, and k independent hash functions. To add an element: compute k hash values and set the corresponding bits to 1. To query: compute k hash values. If all k bits are 1, the element is “probably present.” If any bit is 0, the element is “definitely not present.” False positives occur when all k bit positions happen to be set by other elements.

The false positive rate depends on: m (bit array size — larger means fewer collisions), n (number of elements — more elements mean more bits set), and k (number of hash functions — optimal k = (m/n) * ln(2)). For a 1% false positive rate with 1 billion elements: m = ~9.6 bits per element = 1.2 GB. A hash set storing the same data would need 8+ GB.
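The add/query mechanics above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the class name is invented, and the k hash functions are simulated from two halves of a SHA-256 digest (the Kirsch–Mitzenmacher double-hashing technique), which is a common shortcut rather than something the text prescribes.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m-bit array, k simulated hash functions."""

    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)  # bit array, initialized to zeros

    def _positions(self, item: str):
        # Derive k indices from two base hashes (double hashing):
        # position_i = (h1 + i * h2) mod m.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)  # set bit to 1

    def might_contain(self, item: str) -> bool:
        # All k bits set -> "probably present"; any bit 0 -> "definitely not".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Note the asymmetry in `might_contain`: a `False` answer is definitive, while a `True` answer carries the false positive probability determined by m, n, and k.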

Bloom Filter Applications

Real-world applications: (1) Database query optimization — before querying a database for a key, check a Bloom filter. If the filter says “not present,” skip the database query entirely. Cassandra uses Bloom filters on each SSTable: before reading an SSTable from disk, check its Bloom filter to avoid unnecessary I/O. This reduces disk reads by 90%+ for point lookups. (2) Web crawler URL deduplication — a web crawler must avoid re-crawling URLs it has already visited. Storing billions of URLs in a hash set is expensive. A Bloom filter efficiently answers “have I seen this URL?” with minimal memory. False positives (occasionally skipping an unvisited URL) are acceptable. (3) Cache penetration prevention — when requests ask for keys that exist in neither the cache nor the database, every request bypasses the cache and hits the database; at volume (whether accidental or a deliberate attack), this is called cache penetration. A Bloom filter containing all valid keys rejects requests for keys that definitely do not exist, preventing the database from being hit with invalid queries. (4) Spell checking — check if a word exists in a dictionary. (5) Network filtering — check if a packet’s source belongs to a known malicious IP set.

Count-Min Sketch: Approximate Frequency Counting

A Count-Min Sketch estimates the frequency of elements in a data stream. Instead of maintaining an exact count for each element (which requires O(n) space for n unique elements), it uses a 2D array of counters with d rows and w columns, and d hash functions. To increment: hash the element with each of the d functions and increment the corresponding counter in each row. To query the count: hash the element, read the counter from each row, and return the minimum. The minimum is the best estimate because other elements may have collided and inflated some counters, but the minimum is least affected by collisions. Since collisions can only inflate counters, the estimate never falls below the true count — the error is one-sided.

Space: O(d * w) counters. Error guarantee: with probability >= 1 - delta, the estimate is at most epsilon * N above the true count, where N is the total number of increments, w = ceil(e / epsilon), and d = ceil(ln(1 / delta)).

Applications: (1) Top-K heavy hitters — find the most frequent items in a stream (most searched queries, most visited URLs) without storing all items. (2) Network traffic analysis — estimate byte counts per source IP for DDoS detection. (3) Database query optimization — estimate the selectivity of query predicates for query plan optimization.
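The increment/query-minimum logic and the w and d sizing formulas above can be sketched as follows. The class name and the per-row SHA-256 hashing scheme are illustrative choices, not part of the canonical algorithm description.

```python
import hashlib
import math

class CountMinSketch:
    """Count-Min Sketch: d rows x w columns of counters, sized from epsilon/delta."""

    def __init__(self, epsilon: float, delta: float):
        self.w = math.ceil(math.e / epsilon)      # columns: w = ceil(e / epsilon)
        self.d = math.ceil(math.log(1 / delta))   # rows:    d = ceil(ln(1 / delta))
        self.table = [[0] * self.w for _ in range(self.d)]

    def _index(self, item: str, row: int) -> int:
        # One independent-looking hash per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def increment(self, item: str, count: int = 1) -> None:
        for row in range(self.d):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Minimum across rows: least inflated by collisions; never underestimates.
        return min(self.table[row][self._index(item, row)] for row in range(self.d))
```

With epsilon = 0.01 and delta = 0.01 this allocates w = 272 columns and d = 5 rows — a few thousand counters regardless of how many distinct elements the stream contains.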

HyperLogLog: Cardinality Estimation

HyperLogLog (HLL) estimates the number of distinct elements in a dataset. Counting distinct elements exactly requires O(n) space (store every element). HLL achieves an estimate within about 2% error using only 12 KB of memory, regardless of the dataset size — whether it contains 1 million or 1 billion distinct elements.

How it works: hash each element to a uniform binary string and track the maximum number of leading zeros observed across all hashes. Intuitively: if you have seen a hash with 20 leading zeros, you have probably seen approximately 2^20 = 1 million distinct elements (because the probability of 20 leading zeros is 1/2^20). A single maximum is a very noisy estimator, so HLL improves accuracy by using m registers (typically 2^14 = 16384). Each element is mapped to one register based on the first few bits of its hash, and each register tracks the maximum number of leading zeros among elements mapped to it. The cardinality estimate is a bias-corrected, normalized harmonic mean of 2^(register value) across all registers; averaging over many registers brings the standard error down to roughly 1.04 / sqrt(m).

Applications: (1) Unique visitor counting — count unique IPs or user IDs visiting a website without storing all IDs. Redis supports HLL natively (PFADD, PFCOUNT). (2) Database DISTINCT estimation — estimate SELECT COUNT(DISTINCT column) without scanning the full table. (3) Network monitoring — estimate the number of unique source IPs in a traffic flow.
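A bare-bones version of the register scheme can be sketched as below. This is a simplified illustration assuming a 64-bit hash and omitting the small- and large-range corrections that production implementations (such as Redis’s) apply; the class name and use of SHA-256 are my own choices.

```python
import hashlib

class HyperLogLog:
    """Simplified HLL: p index bits, m = 2**p registers, no range corrections."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p                            # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant for m >= 128 (from Flajolet et al.).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # rho = position of the leftmost 1-bit (leading zeros + 1).
        rho = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rho)

    def count(self) -> float:
        # Normalized harmonic mean of 2**register across all m registers.
        z = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / z
```

With p = 12 (4096 registers of a few bits each), the expected standard error is about 1.04 / sqrt(4096) ≈ 1.6%, using only a few kilobytes of state.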

Choosing the Right Probabilistic Data Structure

Decision framework: (1) “Is this element in the set?” — Bloom filter. No false negatives. Space: ~10 bits per element for 1% false positive rate. Use for: database query optimization, cache penetration prevention, URL deduplication. (2) “How many times has this element appeared?” — Count-Min Sketch. Overestimates, never underestimates. Use for: top-K heavy hitters, traffic analysis, frequency estimation. (3) “How many distinct elements are there?” — HyperLogLog. 2% standard error with 12 KB. Use for: unique visitor counting, cardinality estimation, network monitoring. (4) “Is this element in the set, and I need to delete elements?” — Counting Bloom filter (each position is a counter, not a bit) or Cuckoo filter (supports deletion, better space efficiency than counting Bloom filters). In system design interviews: mention Bloom filters for any scenario where you need to check existence before an expensive operation (database lookup, network request). This demonstrates knowledge of space-efficient data structures and practical optimization techniques.
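The “~10 bits per element for 1% false positive rate” figure in item (1) falls out of the standard Bloom filter sizing formulas (m = -n * ln(p) / (ln 2)^2 and k = (m/n) * ln 2, the latter given in the Bloom filter section above). A small helper makes the back-of-the-envelope sizing concrete; the function name is my own.

```python
import math

def bloom_sizing(n: int, p: float) -> tuple[int, int]:
    """Return (m_bits, k_hashes) for n expected elements at false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # m = -n*ln(p)/(ln 2)^2
    k = max(1, round((m / n) * math.log(2)))              # optimal k = (m/n)*ln(2)
    return m, k

# 1 billion elements at 1% false positive rate:
m, k = bloom_sizing(1_000_000_000, 0.01)
# m / n ≈ 9.6 bits per element, m / 8 bytes ≈ 1.2 GB, k = 7 hash functions.
```

This is the arithmetic behind the interview-ready numbers: quoting “about 10 bits per element and 7 hash functions for 1% false positives” follows directly from these two formulas.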
