System Design: Bloom Filters, Count-Min Sketch, HyperLogLog — Probabilistic Data Structures for Scale

Probabilistic data structures trade perfect accuracy for dramatic space savings. A Bloom filter can test membership in a set of 1 billion items using only 1.2 GB of memory with a 1% false positive rate — compared to 8+ GB for a hash set. These data structures appear in system design interviews and power real-world systems at Google and Facebook, as well as databases such as Cassandra. This guide covers Bloom filters, Count-Min Sketch, and HyperLogLog with practical applications.

Bloom Filter: Space-Efficient Membership Testing

A Bloom filter answers the question “is this element in the set?” with two possible answers: “definitely not in the set” (100% accurate) or “probably in the set” (may be a false positive). It never produces false negatives — if it says no, the element is absolutely not in the set.

Implementation: a bit array of m bits, initialized to all zeros, and k independent hash functions. To add an element: compute k hash values and set the corresponding bits to 1. To query: compute k hash values. If all k bits are 1, the element is “probably present.” If any bit is 0, the element is “definitely not present.” False positives occur when all k bit positions happen to be set by other elements.

The false positive rate depends on: m (bit array size — larger means fewer collisions), n (number of elements — more elements mean more bits set), and k (number of hash functions — optimal k = (m/n) * ln(2)). For a 1% false positive rate with 1 billion elements: m = ~9.6 bits per element = 1.2 GB. A hash set storing the same data would need 8+ GB.
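The add/query mechanics above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the class name is invented, and the k hash functions are simulated from two halves of a SHA-256 digest (the Kirsch–Mitzenmacher double-hashing technique), which is a common shortcut rather than something the text prescribes.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m-bit array, k simulated hash functions."""

    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)  # bit array, initialized to zeros

    def _positions(self, item: str):
        # Derive k indices from two base hashes (double hashing):
        # position_i = (h1 + i * h2) mod m.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)  # set bit to 1

    def might_contain(self, item: str) -> bool:
        # All k bits set -> "probably present"; any bit 0 -> "definitely not".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Note the asymmetry in `might_contain`: a `False` answer is definitive, while a `True` answer carries the false positive probability determined by m, n, and k.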

Bloom Filter Applications

Real-world applications: (1) Database query optimization — before querying a database for a key, check a Bloom filter. If the filter says “not present,” skip the database query entirely. Cassandra uses Bloom filters on each SSTable: before reading an SSTable from disk, check its Bloom filter to avoid unnecessary I/O. This reduces disk reads by 90%+ for point lookups. (2) Web crawler URL deduplication — a web crawler must avoid re-crawling URLs it has already visited. Storing billions of URLs in a hash set is expensive. A Bloom filter efficiently answers “have I seen this URL?” with minimal memory. False positives (occasionally skipping an unvisited URL) are acceptable. (3) Cache penetration prevention — when requests ask for keys that exist in neither the cache nor the database, every request bypasses the cache and hits the database; at volume (whether accidental or a deliberate attack), this is called cache penetration. A Bloom filter containing all valid keys rejects requests for keys that definitely do not exist, preventing the database from being hit with invalid queries. (4) Spell checking — check if a word exists in a dictionary. (5) Network filtering — check if a packet’s source belongs to a known malicious IP set.

Count-Min Sketch: Approximate Frequency Counting

A Count-Min Sketch estimates the frequency of elements in a data stream. Instead of maintaining an exact count for each element (which requires O(n) space for n unique elements), it uses a 2D array of counters with d rows and w columns, and d hash functions. To increment: hash the element with each of the d functions and increment the corresponding counter in each row. To query the count: hash the element, read the counter from each row, and return the minimum. The minimum is the best estimate because other elements may have collided and inflated some counters, but the minimum is least affected by collisions. Since collisions can only inflate counters, the estimate never falls below the true count — the error is one-sided.

Space: O(d * w) counters. Error guarantee: with probability >= 1 - delta, the estimate is at most epsilon * N above the true count, where N is the total number of increments, w = ceil(e / epsilon), and d = ceil(ln(1 / delta)).

Applications: (1) Top-K heavy hitters — find the most frequent items in a stream (most searched queries, most visited URLs) without storing all items. (2) Network traffic analysis — estimate byte counts per source IP for DDoS detection. (3) Database query optimization — estimate the selectivity of query predicates for query plan optimization.
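The increment/query-minimum logic and the w and d sizing formulas above can be sketched as follows. The class name and the per-row SHA-256 hashing scheme are illustrative choices, not part of the canonical algorithm description.

```python
import hashlib
import math

class CountMinSketch:
    """Count-Min Sketch: d rows x w columns of counters, sized from epsilon/delta."""

    def __init__(self, epsilon: float, delta: float):
        self.w = math.ceil(math.e / epsilon)      # columns: w = ceil(e / epsilon)
        self.d = math.ceil(math.log(1 / delta))   # rows:    d = ceil(ln(1 / delta))
        self.table = [[0] * self.w for _ in range(self.d)]

    def _index(self, item: str, row: int) -> int:
        # One independent-looking hash per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def increment(self, item: str, count: int = 1) -> None:
        for row in range(self.d):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Minimum across rows: least inflated by collisions; never underestimates.
        return min(self.table[row][self._index(item, row)] for row in range(self.d))
```

With epsilon = 0.01 and delta = 0.01 this allocates w = 272 columns and d = 5 rows — a few thousand counters regardless of how many distinct elements the stream contains.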

HyperLogLog: Cardinality Estimation

HyperLogLog (HLL) estimates the number of distinct elements in a dataset. Counting distinct elements exactly requires O(n) space (store every element). HLL achieves an estimate within about 2% error using only 12 KB of memory, regardless of the dataset size — whether it contains 1 million or 1 billion distinct elements.

How it works: hash each element to a uniform binary string and track the maximum number of leading zeros observed across all hashes. Intuitively: if you have seen a hash with 20 leading zeros, you have probably seen approximately 2^20 = 1 million distinct elements (because the probability of 20 leading zeros is 1/2^20). A single maximum is a very noisy estimator, so HLL improves accuracy by using m registers (typically 2^14 = 16384). Each element is mapped to one register based on the first few bits of its hash, and each register tracks the maximum number of leading zeros among elements mapped to it. The cardinality estimate is a bias-corrected, normalized harmonic mean of 2^(register value) across all registers; averaging over many registers brings the standard error down to roughly 1.04 / sqrt(m).

Applications: (1) Unique visitor counting — count unique IPs or user IDs visiting a website without storing all IDs. Redis supports HLL natively (PFADD, PFCOUNT). (2) Database DISTINCT estimation — estimate SELECT COUNT(DISTINCT column) without scanning the full table. (3) Network monitoring — estimate the number of unique source IPs in a traffic flow.
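A bare-bones version of the register scheme can be sketched as below. This is a simplified illustration assuming a 64-bit hash and omitting the small- and large-range corrections that production implementations (such as Redis’s) apply; the class name and use of SHA-256 are my own choices.

```python
import hashlib

class HyperLogLog:
    """Simplified HLL: p index bits, m = 2**p registers, no range corrections."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p                            # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant for m >= 128 (from Flajolet et al.).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # rho = position of the leftmost 1-bit (leading zeros + 1).
        rho = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rho)

    def count(self) -> float:
        # Normalized harmonic mean of 2**register across all m registers.
        z = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / z
```

With p = 12 (4096 registers of a few bits each), the expected standard error is about 1.04 / sqrt(4096) ≈ 1.6%, using only a few kilobytes of state.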

Choosing the Right Probabilistic Data Structure

Decision framework: (1) “Is this element in the set?” — Bloom filter. No false negatives. Space: ~10 bits per element for 1% false positive rate. Use for: database query optimization, cache penetration prevention, URL deduplication. (2) “How many times has this element appeared?” — Count-Min Sketch. Overestimates, never underestimates. Use for: top-K heavy hitters, traffic analysis, frequency estimation. (3) “How many distinct elements are there?” — HyperLogLog. 2% standard error with 12 KB. Use for: unique visitor counting, cardinality estimation, network monitoring. (4) “Is this element in the set, and I need to delete elements?” — Counting Bloom filter (each position is a counter, not a bit) or Cuckoo filter (supports deletion, better space efficiency than counting Bloom filters). In system design interviews: mention Bloom filters for any scenario where you need to check existence before an expensive operation (database lookup, network request). This demonstrates knowledge of space-efficient data structures and practical optimization techniques.
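The “~10 bits per element for 1% false positive rate” figure in item (1) falls out of the standard Bloom filter sizing formulas (m = -n * ln(p) / (ln 2)^2 and k = (m/n) * ln 2, the latter given in the Bloom filter section above). A small helper makes the back-of-the-envelope sizing concrete; the function name is my own.

```python
import math

def bloom_sizing(n: int, p: float) -> tuple[int, int]:
    """Return (m_bits, k_hashes) for n expected elements at false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # m = -n*ln(p)/(ln 2)^2
    k = max(1, round((m / n) * math.log(2)))              # optimal k = (m/n)*ln(2)
    return m, k

# 1 billion elements at 1% false positive rate:
m, k = bloom_sizing(1_000_000_000, 0.01)
# m / n ≈ 9.6 bits per element, m / 8 bytes ≈ 1.2 GB, k = 7 hash functions.
```

This is the arithmetic behind the interview-ready numbers: quoting “about 10 bits per element and 7 hash functions for 1% false positives” follows directly from these two formulas.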
