Low Level Design: Bloom Filters — Design and Applications

⏱ 3 min read

A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It can return false positives (reporting an element is in the set when it is not) but never false negatives (if it says an element is not in the set, it is definitely not there). This asymmetry makes Bloom filters powerful for pre-filtering: eliminate definite misses cheaply before expensive lookups. Applications include database query optimization, caching systems, network routing, and malicious URL detection.

How a Bloom Filter Works

A Bloom filter is a bit array of m bits, initially all 0, plus k independent hash functions. Insert: hash the element with all k functions, set bits at the k resulting positions to 1. Query: hash the element with all k functions, check all k positions — if any bit is 0, the element is definitely not in the set; if all bits are 1, the element is probably in the set. False positive rate: p ≈ (1 - e^(-kn/m))^k where n is number of inserted elements. Optimal k for given m and n: k = (m/n) × ln(2).

type BloomFilter struct {
    bits     []uint64  // bit array (packed into 64-bit words)
    numBits  uint
    numHash  uint
}

func (bf *BloomFilter) Add(item []byte) {
    h1, h2 := murmur3.Sum128(item)
    for i := uint(0); i < bf.numHash; i++ {
        // Double hashing: combine h1 and h2 to simulate k independent hashes
        pos := (h1 + uint64(i)*h2) % uint64(bf.numBits)
        bf.bits[pos/64] |= 1 << (pos % 64)
    }
}

func (bf *BloomFilter) MightContain(item []byte) bool {
    h1, h2 := murmur3.Sum128(item)
    for i := uint(0); i < bf.numHash; i++ {
        pos := (h1 + uint64(i)*h2) % uint64(bf.numBits)
        if bf.bits[pos/64] & (1 << (pos % 64)) == 0 {
            return false  // definitely not present
        }
    }
    return true  // probably present
}

Sizing a Bloom Filter

To achieve a false positive rate p for n elements: m = -n × ln(p) / (ln(2))^2. For 1 million elements and 1% false positive rate: m = 9,585,058 bits ≈ 1.14 MB, with k = 7 hash functions. Compare to storing 1M strings (average 20 bytes each) = 20 MB — the Bloom filter is 17x smaller. For 0.1% false positive rate, m grows to ~1.7 MB — still dramatically smaller than the raw set. This makes Bloom filters ideal when memory is the constraint and rare false positives are acceptable.

Application: Database Query Optimization (Cassandra, RocksDB)

LSM-tree databases (Cassandra, RocksDB, LevelDB) maintain multiple SSTables (sorted string tables) on disk. A key lookup must search SSTables from newest to oldest. Without optimization, reading a non-existent key scans all SSTables — expensive. Each SSTable maintains a Bloom filter of its keys. Before searching an SSTable, check the Bloom filter. If it returns negative, skip the SSTable entirely (no I/O). Positive results proceed to disk. This reduces read amplification for non-existent keys from O(num_sstables) to O(1) in the best case.

Application: Cache Pre-Filtering (Redis, CDN)

Cache stampede and one-hit wonders: a request for an item that has never been cached and never will be (a bot scraping random URLs) triggers a cache miss and a database lookup every time. Use a Bloom filter of recently-seen URLs. On cache miss, check the filter: if not seen before, add to filter and serve without caching (likely a one-hit wonder). If seen before (probably a repeat request), fetch from DB and cache normally. This prevents caching ephemeral content and reduces DB load from unique-key scans.

Counting Bloom Filters and Deletions

Standard Bloom filters do not support deletion — you cannot unset a bit without potentially affecting other elements sharing that bit. Counting Bloom filters replace each bit with a counter (4-bit or larger). Increment counters on add, decrement on delete. This allows deletion at the cost of 4x memory usage. Alternatively, use a Cuckoo filter (a different probabilistic structure) that supports deletion with lower false positive rate and better cache performance than Bloom filters for high occupancy rates.

Key Interview Discussion Points

False positive vs. false negative: a Bloom filter has zero false negatives — if it says no, the answer is definitely no
Scalable Bloom filters: when n is unknown, use a series of Bloom filters (each 2x larger) and query all — Scalable Bloom Filter (SBF)
Distributed Bloom filters: store the bit array in Redis (GETBIT/SETBIT) for shared access across services
Chrome uses a Bloom filter to check URLs against a local database of malicious sites before making a network call to Google Safe Browsing
When NOT to use: when false positives are unacceptable (financial transactions, security decisions) or when deletions are frequent