A Bloom filter is a probabilistic data structure that tests set membership with false positives but no false negatives: “definitely not in the set” is always correct, while “probably in the set” may be wrong. This makes it possible to skip expensive lookups for items the filter rules out, at the cost of an occasional wasted lookup when a false positive sends you to check for an item that isn't there.
How a Bloom Filter Works
A Bloom filter is a bit array of m bits, initially all 0. To add an element x, hash it with k independent hash functions and set the bits at positions h1(x), h2(x), …, hk(x) to 1. To test membership, hash the element and check all k positions. If any bit is 0, the element is definitely not in the set. If all bits are 1, the element is probably in the set (a false positive is possible). Deletion requires a counting Bloom filter, which replaces each bit with a small counter.
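The add and test operations can be sketched as follows. This is a minimal illustration rather than a production implementation; it derives the k positions by salting SHA-256, which is one common trick among several.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array and k hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Simulate k independent hash functions by salting SHA-256.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False means definitely absent; True means probably present.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

With `m = 9600` bits and `k = 7`, an added key always tests positive, while an unseen key almost always tests negative.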
False Positive Rate
The false positive probability is approximately (1 − e^(−kn/m))^k, where n is the number of elements inserted. For a 1% false positive rate with n = 1 million elements: m = n · ln(1/p) / (ln 2)^2 ≈ 9.6 million bits (about 1.2 MB). The optimal number of hash functions is k = (m/n) · ln 2 ≈ 7. Increasing m trades space for accuracy. A Bloom filter for 10 million URLs at 1% FPR requires roughly 12 MB, far smaller than storing the URLs themselves.
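The sizing formulas translate directly into code; the numbers below match the worked example (the function name and interface are my own):

```python
import math

def bloom_params(n, p):
    """Optimal bit count m and hash count k for n elements at target FPR p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # m = n*ln(1/p)/(ln 2)^2
    k = round((m / n) * math.log(2))                      # k = (m/n)*ln 2
    return m, k

m, k = bloom_params(1_000_000, 0.01)   # ~9.6 million bits, k = 7
```

For the 10-million-URL example, `bloom_params(10_000_000, 0.01)` gives about 95.9 million bits, roughly 12 MB.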
Cache Pre-Check (Avoiding Cache Stampede)
Before querying the database for a key, check a Bloom filter containing all known keys. If the Bloom filter says “definitely not in the set,” skip the database query. This defends against cache penetration attacks: an attacker sends requests for random non-existent keys, each of which misses the cache and would otherwise hit the database. With a Bloom filter, non-existent keys almost never reach the database; only the small false-positive fraction does. Redis (via a module) and Cassandra use Bloom filters for key existence pre-checks.
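The read path can be sketched as below. `SetBloom`, `get`, and the dict-backed cache and database are stand-ins for illustration; in production the filter would be a real Bloom filter kept in sync with the key set.

```python
class SetBloom:
    """Stand-in 'filter' backed by an exact set, for demonstration only;
    a real deployment would use an actual Bloom filter over all DB keys."""

    def __init__(self, keys):
        self.keys = set(keys)

    def might_contain(self, key):
        return key in self.keys

def get(key, bloom, cache, db):
    """Read path with a Bloom pre-check. A 'no' from the filter is always
    correct, so skipping the cache and database on 'no' is safe; a 'yes'
    may be a false positive, in which case the DB query simply misses."""
    if not bloom.might_contain(key):
        return None                    # definitely absent: no DB work at all
    if key in cache:
        return cache[key]
    value = db.get(key)                # dict stands in for the real database
    if value is not None:
        cache[key] = value
    return value
```

An attacker's request for a random key fails the filter check and returns immediately, without touching the cache or the database.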
Duplicate Detection
Web crawlers use Bloom filters to track visited URLs and avoid re-crawling. Spam filters use them to track hashes of previously seen emails. Stream processors use them for approximate deduplication of event IDs across a sliding window. In each case the Bloom filter is a cheap first pass; an exact store (a database or Redis SET) is consulted only when the filter reports a probable duplicate, catching its false positives.
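The two-stage pattern might look like the sketch below, where an in-memory set stands in for the exact store (class and method names are hypothetical):

```python
import hashlib

class DedupFilter:
    """Two-stage dedup sketch: Bloom first pass, exact store for confirmation.

    The Bloom stage cheaply rejects most brand-new IDs; only IDs the filter
    flags as 'probably seen' (including false positives) hit the exact store.
    """

    def __init__(self, m=1 << 20, k=7):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)
        self.exact = set()             # stands in for a database or Redis SET

    def _positions(self, event_id):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{event_id}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def is_duplicate(self, event_id):
        pos = list(self._positions(event_id))
        maybe_seen = all(self.bits[p // 8] & (1 << (p % 8)) for p in pos)
        for p in pos:
            self.bits[p // 8] |= 1 << (p % 8)
        if not maybe_seen:
            self.exact.add(event_id)
            return False               # Bloom says new: definitely not a dup
        if event_id in self.exact:     # confirm against the exact store
            return True
        self.exact.add(event_id)       # Bloom false positive: actually new
        return False
```

The exact store is only consulted for the small fraction of IDs the filter flags, which is the point of the first pass.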
LSM Tree and Bloom Filters
LSM-based databases (Cassandra, RocksDB, LevelDB) use Bloom filters per SSTable (sorted string table). A read for a key that doesn't exist must check every SSTable in every level, which is expensive. A per-SSTable Bloom filter answers “is this key in this SSTable?” If the filter says no, skip the SSTable entirely. This reduces read amplification for missing keys from O(levels) disk reads to O(false positive rate × levels).
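A sketch of the read path, with hypothetical in-memory stand-ins in place of real on-disk SSTables:

```python
class FakeBloom:
    """Exact-set stand-in for a per-SSTable Bloom filter (demo only)."""

    def __init__(self, keys):
        self.keys = set(keys)

    def might_contain(self, key):
        return key in self.keys

class FakeSSTable:
    """In-memory stand-in for an on-disk SSTable: rows plus a key filter."""

    def __init__(self, rows):
        self.rows = rows
        self.bloom = FakeBloom(rows)
        self.disk_reads = 0            # counts how often we 'hit disk'

    def get(self, key):
        self.disk_reads += 1
        return self.rows.get(key)

def lsm_read(key, sstables):
    """Check SSTables newest-first; a 'no' from a table's Bloom filter
    skips that table's disk read entirely."""
    for table in sstables:
        if not table.bloom.might_contain(key):
            continue                   # definitely not here: skip the read
        value = table.get(key)         # disk read; may miss on a false positive
        if value is not None:
            return value
    return None
```

Checking newest-first also gives correct shadowing: a key rewritten in a newer SSTable is returned from there without consulting older tables.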
Counting Bloom Filter
A standard Bloom filter does not support deletion: clearing a bit might affect other elements that hash to the same position. A counting Bloom filter replaces each bit with a small counter (typically 4 bits). Adding an element increments its k counters; deleting decrements them. Deletion is safe as long as counters don't overflow, which is rare if the filter is sized appropriately. Counting Bloom filters use 4x more space than standard filters but allow the membership set to shrink as well as grow.
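A minimal counting variant might look like this; it uses a full byte per counter for simplicity, where a real implementation would pack 4-bit counters:

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter sketch: one counter per slot instead of one bit."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counts = bytearray(m)     # one byte per counter, for simplicity

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            if self.counts[p] < 255:   # saturate rather than overflow
                self.counts[p] += 1

    def remove(self, item):
        # Caller must only remove items that were previously added.
        for p in self._positions(item):
            if self.counts[p] > 0:
                self.counts[p] -= 1

    def might_contain(self, item):
        return all(self.counts[p] > 0 for p in self._positions(item))
```

Note that removing an element other elements share counters with is still safe: their counters were incremented separately, so they stay above zero.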
Scalable Bloom Filter
When the number of elements to insert is unknown, a scalable Bloom filter grows dynamically. It chains multiple filters: when the current filter's false positive rate exceeds a threshold (the filter is “full”), add a new filter with a lower false positive target and continue inserting into the new filter. Membership tests check all filters in the chain. The total false positive rate is bounded by the sum of individual filter FPRs.
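The chaining scheme can be sketched as below. As a simplification, a slice is treated as "full" when its insert count reaches capacity, standing in for the real trigger of the estimated FPR crossing a threshold; names and growth factors are illustrative.

```python
import hashlib
import math

class _Slice:
    """Fixed-size Bloom filter slice used by the scalable filter below."""

    def __init__(self, capacity, p):
        self.capacity, self.count = capacity, 0
        self.m = math.ceil(-capacity * math.log(p) / (math.log(2) ** 2))
        self.k = max(1, round((self.m / capacity) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
        self.count += 1

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

class ScalableBloomFilter:
    """Each new slice doubles capacity and halves its FPR target, so the
    total FPR is bounded by the geometric sum p/2 + p/4 + ... < p."""

    def __init__(self, initial_capacity=1000, p=0.01):
        self.initial_capacity, self.p = initial_capacity, p
        self.slices = [_Slice(initial_capacity, p / 2)]

    def add(self, item):
        if self.slices[-1].count >= self.slices[-1].capacity:  # slice "full"
            n = len(self.slices)
            self.slices.append(_Slice(self.initial_capacity * 2 ** n,
                                      self.p / 2 ** (n + 1)))
        self.slices[-1].add(item)

    def might_contain(self, item):
        return any(s.might_contain(item) for s in self.slices)
```

Inserting 500 items into a filter sized for 100 grows the chain to three slices (capacities 100, 200, 400) while every inserted item still tests positive.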
Cuckoo Filter
The cuckoo filter improves on the Bloom filter by supporting deletion, achieving better space efficiency for false positive rates below 3%, and supporting lookup with two memory accesses (vs k accesses for Bloom filter). It stores fingerprints in a cuckoo hash table. A fingerprint collision can cause false positives. Cuckoo filters are preferred over counting Bloom filters when deletion support and lower memory usage are both required.
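A compact sketch of the core mechanics (fingerprints, two candidate buckets, eviction on overflow); parameters, names, and the 1-byte fingerprint size are illustrative:

```python
import hashlib
import random

class CuckooFilter:
    """Cuckoo filter sketch: 1-byte fingerprints in a two-choice hash table."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        # num_buckets must be a power of two so the XOR trick stays in range.
        self.num_buckets, self.bucket_size = num_buckets, bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _hash(self, data):
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _indexes(self, item):
        f = self._hash(item.encode()) % 255 + 1      # fingerprint in 1..255
        i1 = self._hash(b"idx:" + item.encode()) % self.num_buckets
        # Partial-key cuckoo hashing: i2 is derivable from i1 and f alone,
        # so an evicted fingerprint can always find its alternate bucket.
        i2 = i1 ^ (self._hash(bytes([f])) % self.num_buckets)
        return f, i1, i2

    def insert(self, item):
        f, i1, i2 = self._indexes(item)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(f)
                return True
        # Both buckets full: evict a resident fingerprint and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            f, self.buckets[i][j] = self.buckets[i][j], f
            i = i ^ (self._hash(bytes([f])) % self.num_buckets)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(f)
                return True
        return False                                 # filter is too full

    def contains(self, item):
        f, i1, i2 = self._indexes(item)              # at most two bucket probes
        return f in self.buckets[i1] or f in self.buckets[i2]

    def delete(self, item):
        # Only delete items that were inserted: removing a colliding
        # fingerprint belonging to another item would corrupt the filter.
        f, i1, i2 = self._indexes(item)
        for i in (i1, i2):
            if f in self.buckets[i]:
                self.buckets[i].remove(f)
                return True
        return False
```

Lookup and deletion each probe only the two candidate buckets, which is the two-memory-access property mentioned above.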