Question 1

What does "no false negatives" mean in a Bloom filter and why does it matter?

Accepted Answer

If a Bloom filter's contains() method returns False, the element is guaranteed not to be in the set (provided every element of the set was added to the filter); this is the "no false negatives" property. If it returns True, the element is probably in the set but may not be (a false positive). This asymmetry is what makes Bloom filters useful: you can trust a negative answer completely. The practical consequence: use a Bloom filter as a fast pre-check before an expensive operation. If the filter says "not present" (False), skip the expensive lookup entirely, because it is guaranteed to return nothing. Only perform the expensive operation when the filter says "probably present" (True). This eliminates expensive DB queries, cache lookups, or disk reads for every absent-key request except the small fraction that come back as false positives.
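The pre-check pattern can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the double-hashing scheme built on SHA-256, the guarded_lookup helper, and the in-memory store standing in for the expensive backend are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; the hashing scheme is an illustrative assumption."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step (illustrative choice)
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, item: str) -> bool:
        # False is definitive; True only means "probably present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def guarded_lookup(key, bloom, expensive_lookup):
    """Run expensive_lookup only when the filter cannot rule the key out."""
    if not bloom.contains(key):
        return None  # definitely absent: skip the expensive call
    return expensive_lookup(key)  # probably present: may still find nothing


# Hypothetical backing store plus a log of expensive calls.
store = {"alice": 1, "bob": 2, "carol": 3}
calls = []


def slow_lookup(key):
    calls.append(key)
    return store.get(key)


bloom = BloomFilter(num_bits=10_000, num_hashes=7)
for name in store:
    bloom.add(name)
```

Every key that was added is reported as present, so guarded_lookup never hides a real record; a key that was never added almost always returns None without slow_lookup being called.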

Question 2

How do you calculate the optimal Bloom filter size for a given false positive rate?

Accepted Answer

Two formulas: (1) Optimal bit array size: m = -(n × ln(p)) / (ln 2)², where n = expected number of elements, p = desired false positive rate. For 1 million items at 1% FP rate: m = -(1,000,000 × ln(0.01)) / (ln 2)² ≈ 9,585,058 bits ≈ 1.2 MB (about 1.14 MiB). (2) Optimal number of hash functions: k = (m/n) × ln(2). For the above: k = (9,585,058 / 1,000,000) × ln(2) ≈ 6.64, rounded up to 7 hash functions. Key insight: the target FP rate only holds while the filter contains at most n elements. Adding more elements pushes the actual FP rate above the target. Monitor items_added/expected_items and rebuild with a larger filter as this ratio approaches or exceeds 1.0.
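The two formulas, plus the standard approximation for the FP rate after a given number of insertions, (1 - e^(-k × items/m))^k, can be checked with a short script. The function names here are illustrative, not taken from any particular library.

```python
import math


def optimal_num_bits(n: int, p: float) -> int:
    """m = -(n * ln p) / (ln 2)^2, rounded up to a whole bit."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def optimal_num_hashes(m: int, n: int) -> int:
    """k = (m / n) * ln 2, rounded to the nearest integer (at least 1)."""
    return max(1, round((m / n) * math.log(2)))


def expected_fp_rate(m: int, k: int, items_added: int) -> float:
    """Standard approximation: (1 - e^(-k * items / m))^k."""
    return (1 - math.exp(-k * items_added / m)) ** k


n, p = 1_000_000, 0.01
m = optimal_num_bits(n, p)
k = optimal_num_hashes(m, n)

print(f"bits={m} (~{m / 8 / 1e6:.2f} MB), hashes={k}")
print(f"FP rate at n items:  {expected_fp_rate(m, k, n):.4f}")
print(f"FP rate at 2n items: {expected_fp_rate(m, k, 2 * n):.4f}")
```

At exactly n items the estimate sits at about the 1% target; at 2n items it rises to roughly 16%, which is why the fill ratio is worth monitoring.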

Question 3

Why can't you delete elements from a standard Bloom filter?

Accepted Answer

Each bit in the array may have been set by multiple different elements: any element whose hash functions map to that bit position will have set it. Clearing a bit to "delete" an element could therefore clear a bit that other elements depend on, causing those elements to appear absent (a false negative) and violating the core guarantee. The Counting Bloom filter replaces each bit with a small counter: add() increments the counters, delete() decrements them, and contains() checks that all counters are non-zero. This allows deletion at the cost of 4x to 8x the memory (4-bit or 8-bit counters instead of 1 bit). Deleting an element that was never added can corrupt other elements' counters, so delete() should only be called for elements known to be present. For most use cases (crawler visited-URL tracking, username availability), deletions are not needed because the set only grows.
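A minimal sketch of the counting variant follows. It assumes the same illustrative SHA-256 double hashing as any simple filter would, and it stores counters in a plain Python list rather than packed 4-bit fields, so it shows the logic only, not the memory layout.

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter sketch; counters are unbounded Python ints here."""

    def __init__(self, num_counters: int, num_hashes: int) -> None:
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        # A real implementation would pack 4-bit or 8-bit counters and
        # saturate on overflow instead of using a list of ints.
        self.counters = [0] * num_counters

    def _positions(self, item: str) -> list:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step (illustrative choice)
        return [(h1 + i * h2) % self.num_counters for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1

    def contains(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def delete(self, item: str) -> bool:
        # Refuse when the filter says "definitely absent". This cannot catch
        # a false positive, so callers should still only delete known items.
        if not self.contains(item):
            return False
        for pos in self._positions(item):
            self.counters[pos] -= 1
        return True


cbf = CountingBloomFilter(num_counters=1_000, num_hashes=4)
cbf.add("alice")
cbf.add("bob")
removed = cbf.delete("alice")
```

After the delete, "bob" is still reported as present because its counters were only ever decremented by the amounts that "alice" contributed.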

Question 4

How do you use a Bloom filter to prevent cache penetration attacks?

Accepted Answer

Cache penetration: an attacker sends requests for keys that exist in neither the cache nor the DB (e.g., user_id=-1, product_id=999999999). Each request misses the cache, hits the DB, and returns nothing; because the miss is not cached, the next request for the same key does the same. Under high volume, this bypasses the cache entirely and floods the DB. Fix: maintain a Bloom filter of all valid IDs. On each request, check the Bloom filter first. If "definitely not present" (False), return 404 immediately without touching cache or DB. If "probably present" (True), proceed with the normal cache-then-DB lookup. False positives are harmless (you do a DB lookup that returns nothing, same as the normal miss path). Populate the filter at startup from the DB and update it when new records are created. A standard filter cannot remove IDs, so deleted records behave like false positives until the filter is rebuilt.
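The request flow can be sketched as follows. ProductService, its dict-backed cache and DB, and the (status, record) return shape are hypothetical stand-ins for a real service; the Bloom filter is a compact illustrative implementation that uses a Python int as its bit set.

```python
import hashlib


class BloomFilter:
    """Compact illustrative Bloom filter backed by a Python int."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def contains(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class ProductService:
    """Hypothetical service: Bloom filter check, then cache, then DB."""

    def __init__(self, db: dict) -> None:
        self.db = db
        self.cache = {}
        self.db_reads = 0
        self.valid_ids = BloomFilter(num_bits=10_000, num_hashes=7)
        for product_id in db:  # populate at startup from the DB
            self.valid_ids.add(product_id)

    def create(self, product_id: str, record: dict) -> None:
        self.db[product_id] = record
        self.valid_ids.add(product_id)  # keep the filter in sync with writes

    def get(self, product_id: str):
        if not self.valid_ids.contains(product_id):
            return 404, None  # definitely absent: no cache or DB access
        if product_id in self.cache:
            return 200, self.cache[product_id]
        self.db_reads += 1
        record = self.db.get(product_id)
        if record is None:
            return 404, None  # false positive: same as a normal miss
        self.cache[product_id] = record
        return 200, record


service = ProductService({"1": {"name": "kettle"}, "2": {"name": "mug"}})
bogus = [service.get(f"bogus-{i}") for i in range(1_000)]
```

With the filter in front, a flood of requests for non-existent IDs is answered from memory, and the DB read counter only moves for IDs the filter cannot rule out.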

Question 5

How is a Bloom filter used in LSM-tree databases like Cassandra and RocksDB?

Accepted Answer

LSM-tree databases (Cassandra, RocksDB, LevelDB) store data in immutable SSTables on disk. A read for a key must check multiple SSTables to find the most recent version, which can mean many disk reads. Each SSTable has a Bloom filter (typically held in memory) representing the keys it contains. On a read: check each SSTable's Bloom filter. If "definitely not present" (False), skip that SSTable entirely. If "probably present" (True), read the SSTable. A well-tuned Bloom filter with a 1% FP rate eliminates about 99% of unnecessary SSTable reads. This is why Cassandra's read path is fast despite data being spread across many SSTables: the expected number of wasted disk reads drops from one per SSTable to roughly 0.01 per SSTable, so a lookup touches little more than the SSTables that actually hold the key.
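The read path can be sketched with a toy model. The SSTable class below, with a dict standing in for the on-disk file and a disk_reads counter, is purely illustrative and is not how Cassandra or RocksDB structure their code; tombstones, memtables, and compaction are left out.

```python
import hashlib


class BloomFilter:
    """Compact illustrative Bloom filter backed by a Python int."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def contains(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class SSTable:
    """Toy SSTable: immutable rows plus an in-memory Bloom filter."""

    def __init__(self, rows: dict) -> None:
        self._rows = dict(rows)  # stands in for the on-disk file
        self.disk_reads = 0
        self.filter = BloomFilter(num_bits=4_096, num_hashes=7)
        for key in rows:
            self.filter.add(key)

    def read_from_disk(self, key: str):
        self.disk_reads += 1  # each call models one disk read
        return self._rows.get(key)


def read(sstables_newest_first: list, key: str):
    """Return the newest value for key, skipping tables the filter rules out."""
    for table in sstables_newest_first:
        if not table.filter.contains(key):
            continue  # definitely not in this SSTable: no disk read
        value = table.read_from_disk(key)
        if value is not None:
            return value  # newest version found; older tables are not consulted
    return None


newest = SSTable({"user:1": "v3"})
middle = SSTable({"user:2": "v1"})
oldest = SSTable({"user:1": "v1", "user:42": "v1"})
tables = [newest, middle, oldest]

result_42 = read(tables, "user:42")
```

Reading "user:42" consults all three filters but should touch only the oldest table on disk, since the two newer tables rule the key out from memory.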

Bloom Filter Low-Level Design: Probabilistic Set Membership at Scale

Implementation

Key Use Cases

Redis Bloom Filter (Production Use)

Sizing and Tuning

Key Interview Points