Bloom Filter: Design and Applications

A Bloom filter is a space-efficient probabilistic data structure that answers membership queries: “Is this element in the set?” It can return false positives (says “yes” when the element is not present) but never false negatives (if it says “no”, the element is definitely absent). This property — certain negatives, uncertain positives — makes Bloom filters powerful for eliminating expensive lookups for items that definitely do not exist.

How It Works

A Bloom filter consists of a bit array of m bits, initialized to all zeros, and k independent hash functions. To insert an element: hash it with all k functions, set the bit at each resulting index to 1. To query: hash the element with all k functions; if any bit is 0, the element is definitely not in the set (return false); if all bits are 1, the element is probably in the set (return true — but it may be a false positive from other elements setting those bits).
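The insert and query paths above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name is made up, and the salted-SHA-256 indexing is one simple way to derive k hash functions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array plus k salted hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _indexes(self, item):
        # k indexes derived by salting one cryptographic hash; production
        # code usually derives them via double hashing, but the effect is
        # the same for this sketch.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # Any unset bit -> definitely absent; all set -> probably present.
        return all(self.bits[i] for i in self._indexes(item))
```

Real implementations typically compute two hashes and combine them (double hashing) rather than running k independent hash functions, which gives the same false positive behavior at lower cost.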

Example with m=10 bits and k=3 hash functions: inserting “alice” sets bits at positions {2, 5, 8}; inserting “bob” sets bits at {1, 5, 9}. Querying “alice” checks bits 2, 5, 8 — all 1, so “probably present”. Querying “carol” checks bits 3, 6, 7 — bit 3 is 0, so “definitely not present”. Querying “dave” checks bits 1, 5, 8 — all happen to be 1 (false positive), returns “probably present” incorrectly.

False Positive Rate and Sizing

False positive rate p ≈ (1 – e^(-kn/m))^k, where n is the number of elements inserted. For a target false positive rate, the optimal number of bits is m = -n*ln(p) / (ln(2))^2, and the optimal number of hash functions is k = (m/n)*ln(2). For 1 million elements at 1% false positive rate: m ≈ 9.59 million bits ≈ 1.2 MB, with k = 7. A hash set storing 1 million 64-bit integers uses at least 8 MB (ignoring load-factor overhead), so the Bloom filter provides set membership at 1% FPR in roughly 7x less memory.
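These sizing formulas translate directly into code. The function name here is made up for illustration:

```python
import math

def bloom_sizing(n, p):
    """Optimal bit count m and hash count k for n elements at target FPR p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # m = -n*ln(p)/(ln 2)^2
    k = round(m / n * math.log(2))                      # k = (m/n)*ln(2)
    return m, k

m, k = bloom_sizing(1_000_000, 0.01)  # m ≈ 9.59 million bits (~1.2 MB), k = 7
```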

Core Applications

Database Query Elimination

The most important application: avoid querying the database for keys that definitely do not exist. Maintain a Bloom filter over all existing user IDs: before a database lookup for user_id=12345, check the Bloom filter. If it returns “definitely not present”, skip the database query entirely — return 404 immediately. If it returns “probably present”, query the database normally. This eliminates lookups for non-existent keys, which are both a common attack vector (cache penetration: flooding requests with random IDs that always miss the cache and hit the database) and a natural occurrence in deletion-heavy systems.
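The pre-check can be sketched as follows. TinyBloom and get_user are illustrative names, the 64-bit filter is deliberately toy-sized, and a real deployment would size the filter using the formulas above:

```python
import hashlib

class TinyBloom:
    """Tiny stand-in filter: 64 bits, two salted hashes (illustrative sizes)."""

    def __init__(self):
        self.bits = 0

    def _indexes(self, key):
        for salt in (0, 1):
            h = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % 64

    def add(self, key):
        for i in self._indexes(key):
            self.bits |= 1 << i

    def might_contain(self, key):
        return all(self.bits >> i & 1 for i in self._indexes(key))

def get_user(user_id, bloom, db):
    """Look up a user, skipping the database when the filter rules it out."""
    if not bloom.might_contain(user_id):
        return None  # definitely absent: answer 404 with no database round-trip
    return db.get(user_id)  # probably present; a miss here is a false positive
```

Note that a false positive is harmless for correctness: the database query still returns “not found”; the filter only saves work on the definite misses.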

LSM-Tree Storage Engines

RocksDB, Cassandra, and LevelDB use Bloom filters per SSTable (Sorted String Table). To find a key, a naive LSM implementation checks every SSTable level — expensive I/O. With per-SSTable Bloom filters: check the filter for the SSTable; if “definitely not present”, skip this SSTable entirely; only read SSTable files where the filter says the key might exist. This reduces point-lookup I/O from O(levels) to O(1) average, making reads fast despite the write-optimized LSM structure.
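The lookup path can be sketched as follows, with in-memory dicts standing in for on-disk SSTable files; make_filter and lsm_get are illustrative names, not the API of any of the engines named above:

```python
import hashlib

def make_filter(keys, m=1024, k=3):
    """Build a small Bloom filter over an SSTable's keys; returns a predicate."""
    bits = bytearray(m // 8)

    def indexes(key):
        for i in range(k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % m

    for key in keys:
        for j in indexes(key):
            bits[j // 8] |= 1 << (j % 8)

    def might_contain(key):
        return all(bits[j // 8] >> (j % 8) & 1 for j in indexes(key))

    return might_contain

def lsm_get(key, sstables):
    """Point lookup over (filter, table) pairs, newest level first."""
    for might_contain, table in sstables:
        if not might_contain(key):
            continue  # definitely not here: skip this SSTable's disk read
        if key in table:
            return table[key]  # stands in for reading the SSTable from disk
    return None
```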

Duplicate Detection

Web crawlers use Bloom filters to avoid re-crawling URLs already seen: add each crawled URL to the filter; before crawling a URL, check the filter. If “definitely not present”, crawl it. If “probably present”, skip it (accepting rare false positives means occasionally skipping an uncrawled URL — acceptable). A Bloom filter tracking 1 billion crawled URLs requires ~1.2 GB at 1% FPR — far cheaper than a database of 1 billion URLs.
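A sketch of the dedup check, assuming a crawler loop that consults the filter before fetching; the class name and the 1-megabit sizing are illustrative:

```python
import hashlib

class SeenURLs:
    """Bloom-filter dedup for a crawl frontier (sketch; sizes illustrative)."""

    def __init__(self, m=1 << 20, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _indexes(self, url):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}|{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def mark_crawled(self, url):
        for i in self._indexes(url):
            self.bits[i // 8] |= 1 << (i % 8)

    def should_crawl(self, url):
        # "Probably seen" -> skip; a rare false positive means occasionally
        # skipping an uncrawled URL, which the text accepts as a tradeoff.
        return not all(self.bits[i // 8] >> (i % 8) & 1
                       for i in self._indexes(url))
```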

Limitations

No deletion: clearing one element’s bits may clear bits shared with other elements, so standard Bloom filters do not support deletion. Counting Bloom filters (which store a small counter per slot rather than a single bit) support deletion at 4-8x the memory cost.

No enumeration: you cannot list the elements in a Bloom filter; it stores only bits, not the elements themselves.

False positives grow with fill rate: as the filter fills, the false positive rate increases. Size the filter for the expected maximum number of elements, not the initial count.

No exact counts: a Bloom filter cannot tell you how many distinct elements were inserted; for cardinality estimation, use HyperLogLog instead.
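The counting variant can be sketched as follows. Plain Python ints stand in for the narrow (typically 4-bit) counters used in practice, and the class name is illustrative:

```python
import hashlib

class CountingBloomFilter:
    """Counting variant: a counter per slot instead of one bit, so remove()
    can decrement without clearing bits shared by other elements (sketch)."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counts = [0] * m

    def _indexes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for i in self._indexes(item):
            self.counts[i] += 1

    def remove(self, item):
        # Only remove items previously added, or the counters go inconsistent.
        for i in self._indexes(item):
            self.counts[i] -= 1

    def might_contain(self, item):
        return all(self.counts[i] > 0 for i in self._indexes(item))
```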

Scalable Bloom Filters

When the element count is unknown, use a Scalable Bloom Filter (SBF): a series of Bloom filters, each twice the size of the previous. When a filter reaches its capacity (too many elements would push FPR above target), create a new larger filter. Insertions go to the most recent filter; queries check all filters. The total false positive rate is bounded by a geometric series of the individual filter rates: giving the i-th filter a target rate of p/2^i bounds the total at p/2 + p/4 + … = p.
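The structure above can be sketched as follows, assuming the sizing formulas from earlier; the class and helper names are illustrative, and real SBF implementations differ in bookkeeping details:

```python
import hashlib
import math

def _bloom_stage(m, k):
    """One fixed-size Bloom filter stage, as (add, might_contain) closures."""
    bits = bytearray((m + 7) // 8)

    def indexes(item):
        for i in range(k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % m

    def add(item):
        for j in indexes(item):
            bits[j // 8] |= 1 << (j % 8)

    def might_contain(item):
        return all(bits[j // 8] >> (j % 8) & 1 for j in indexes(item))

    return add, might_contain

class ScalableBloom:
    """Stages double in capacity; the i-th stage targets rate p/2^i, so the
    total false positive rate stays bounded by p (geometric series)."""

    def __init__(self, initial_capacity=128, p=0.01):
        self.p = p
        self.capacity = initial_capacity
        self.count = 0  # elements inserted into the newest stage
        self.stages = []
        self._grow()

    def _grow(self):
        stage_p = self.p * 0.5 ** (len(self.stages) + 1)  # p/2, p/4, ...
        m = math.ceil(-self.capacity * math.log(stage_p) / math.log(2) ** 2)
        k = round(m / self.capacity * math.log(2))
        self.stages.append(_bloom_stage(m, k))
        self.count = 0

    def add(self, item):
        if self.count >= self.capacity:
            self.capacity *= 2  # next stage is twice as large
            self._grow()
        self.stages[-1][0](item)
        self.count += 1

    def might_contain(self, item):
        # Queries must check every stage; any "probably present" wins.
        return any(might(item) for _, might in self.stages)
```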


