Bloom Filter Low-Level Design: Probabilistic Set Membership at Scale

A Bloom filter is a space-efficient probabilistic data structure that answers the question “is this element in the set?” with two possible answers: “definitely not” or “probably yes.” The “probably yes” answer has a configurable false-positive rate — the filter may report an element is present when it isn’t. But it never produces false negatives: if the filter says an element is absent, it is guaranteed to be absent. This asymmetry makes Bloom filters ideal for eliminating unnecessary expensive lookups: check the filter first; only go to the database if the filter says the element might exist.

Implementation

import math
import mmh3  # MurmurHash3 — fast, good distribution

class BloomFilter:
    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Calculate optimal bit array size and number of hash functions
        self.size = self._optimal_size(expected_items, false_positive_rate)
        self.hash_count = self._optimal_hash_count(self.size, expected_items)
        self.bit_array = bytearray(math.ceil(self.size / 8))
        self.items_added = 0

    def _optimal_size(self, n: int, p: float) -> int:
        """m = -(n * ln(p)) / (ln(2)^2)"""
        return int(-n * math.log(p) / (math.log(2) ** 2))

    def _optimal_hash_count(self, m: int, n: int) -> int:
        """k = (m/n) * ln(2)"""
        # round() rather than int(): for 1M items at 1% FP, k = 6.64 -> 7
        return max(1, round((m / n) * math.log(2)))

    def _get_positions(self, item: str) -> list[int]:
        positions = []
        for seed in range(self.hash_count):
            # Use different seeds to simulate k independent hash functions
            # mmh3.hash() may return a negative int, but Python's % is
            # always non-negative, so pos is already a valid index
            pos = mmh3.hash(item, seed) % self.size
            positions.append(pos)
        return positions

    def add(self, item: str):
        for pos in self._get_positions(item):
            byte_idx, bit_idx = divmod(pos, 8)
            self.bit_array[byte_idx] |= (1 << bit_idx)
        self.items_added += 1

    def contains(self, item: str) -> bool:
        """Returns False = definitely not present. True = probably present."""
        for pos in self._get_positions(item):
            byte_idx, bit_idx = divmod(pos, 8)
            if not (self.bit_array[byte_idx] & (1 << bit_idx)):
                return False
        return True

    def current_false_positive_rate(self) -> float:
        """Current FP rate given items added so far."""
        k, m, n = self.hash_count, self.size, self.items_added
        return (1 - math.exp(-k * n / m)) ** k

# Example: 1M items, 1% FP rate
# size = 9,585,058 bits (~1.1 MB)  vs  8 bytes/item in a hash set (~8 MB)
bf = BloomFilter(expected_items=1_000_000, false_positive_rate=0.01)
print(f"Bit array size: {bf.size:,} bits ({bf.size//8:,} bytes)")
print(f"Hash functions: {bf.hash_count}")
# Output:
# Bit array size: 9,585,058 bits (1,198,132 bytes ~= 1.1 MB)
# Hash functions: 7
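As a sanity check, the sizing formulas can be reproduced standalone. This helper (`bloom_parameters` is a hypothetical name, not part of the class above) computes the same numbers the example prints:

```python
import math

def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Optimal bit-array size m and hash count k for n items at FP rate p."""
    m = int(-n * math.log(p) / (math.log(2) ** 2))   # m = -n*ln(p) / ln(2)^2
    k = max(1, round((m / n) * math.log(2)))         # k = (m/n)*ln(2), ~6.64 -> 7
    return m, k

m, k = bloom_parameters(1_000_000, 0.01)
print(m, k)  # 9585058 7
```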

Key Use Cases

Username availability check: before querying the database to check if a username is taken, check the Bloom filter. If the filter says “definitely not present,” skip the DB query entirely. If “probably present,” query the DB to confirm. This eliminates ~99% of DB lookups for usernames that are not taken (the common case during sign-up).

class UsernameChecker:
    # `db` is an assumed application-level database client (illustrative)
    def __init__(self):
        self.bloom = BloomFilter(expected_items=50_000_000, false_positive_rate=0.001)
        self._load_existing_usernames()  # populate from DB at startup

    def _load_existing_usernames(self):
        for username in db.fetchall_scalar("SELECT username FROM User"):
            self.bloom.add(username.lower())

    def is_available(self, username: str) -> bool:
        if not self.bloom.contains(username.lower()):
            return True   # definitely available -- skip DB query
        # Bloom says "maybe taken" -- must verify with DB
        return not db.fetchone("SELECT 1 FROM User WHERE username=%s",
                               [username.lower()])

Preventing duplicate URL crawl: a web crawler maintains a Bloom filter of visited URLs. Before adding a URL to the crawl queue, check the filter. If “definitely not visited,” add to queue. False positives (filter says visited but it’s not) result in skipping a URL — acceptable for a crawler. Storage: a Bloom filter for 1 billion URLs at 1% FP rate requires ~1.2 GB; a hash set would require ~40 GB.
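A minimal sketch of the crawl-queue gate. To keep it self-contained it uses a small filter built on SHA-256 double hashing (the Kirsch-Mitzenmacher trick, h_i(x) = h1(x) + i·h2(x)) instead of mmh3; the `BloomFilter` class above would slot in the same way, and `VisitedFilter`/`enqueue_if_new` are hypothetical names:

```python
import hashlib
from collections import deque

class VisitedFilter:
    """Minimal Bloom filter: k positions derived from one SHA-256 digest."""
    def __init__(self, size_bits: int = 1 << 20, hash_count: int = 7):
        self.size = size_bits
        self.hash_count = hash_count
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url: str) -> list[int]:
        digest = hashlib.sha256(url.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # odd step for full coverage
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, url: str):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, url: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

def enqueue_if_new(url: str, visited: VisitedFilter, queue: deque) -> bool:
    """Enqueue only if the filter says 'definitely not visited'."""
    if visited.contains(url):
        return False          # probably visited (or a rare false positive) -- skip
    visited.add(url)
    queue.append(url)
    return True
```

A false positive here only means one URL is never crawled; for most crawlers that is an acceptable trade for constant memory.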

Cache penetration prevention: attackers (or buggy clients) can flood a system with requests for keys that exist in neither the cache nor the DB; every such request misses the cache and hits the DB. The fix is to maintain a Bloom filter of all valid keys, populated from the DB at startup and updated as records are created. On each request, check the filter first: if it says “definitely not present,” return empty immediately without touching the cache or the DB. False positives are harmless — they simply fall through to the normal cache-then-DB miss path.
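A sketch of this pre-check, assuming the filter holds every valid key. The names are illustrative: `ExactFilter` is a set-backed stand-in exposing the same `contains()` contract as a Bloom filter (a real deployment would use a Bloom filter to save memory), and dicts stand in for the cache and DB:

```python
class ExactFilter:
    """Stand-in with the same contains() contract as a Bloom filter."""
    def __init__(self, items):
        self._items = set(items)

    def contains(self, item) -> bool:
        return item in self._items

def get_product(product_id, valid_ids, cache, db):
    if not valid_ids.contains(product_id):
        return None                # definitely absent: no cache or DB touch
    if product_id in cache:
        return cache[product_id]   # normal cache hit
    row = db.get(product_id)       # may be None on a rare false positive
    if row is not None:
        cache[product_id] = row
    return row
```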

Redis Bloom Filter (Production Use)

# RedisBloom module (available in Redis Stack and Redis Cloud)
# No need to implement from scratch in production

import redis
r = redis.Redis()

# Create a Bloom filter with 1M capacity and 1% error rate
r.execute_command('BF.RESERVE', 'usernames', 0.01, 1_000_000)

# Add items
r.execute_command('BF.ADD', 'usernames', 'alice')
r.execute_command('BF.MADD', 'usernames', 'bob', 'charlie', 'diana')

# Query
result = r.execute_command('BF.EXISTS', 'usernames', 'alice')   # 1 (probably present)
result = r.execute_command('BF.EXISTS', 'usernames', 'unknown') # 0 (definitely absent)

# The filter persists in Redis, survives restarts, and can be replicated

Sizing and Tuning

# Relationship between size, items, and false positive rate:
# p = (1 - e^(-k*n/m))^k
# where m = bits, n = items, k = hash functions

# For 1% FP rate (~9.6 bits per item, regardless of item size):
# 10K items  -> 95,850 bits       (~12 KB)
# 1M items   -> 9,585,058 bits    (~1.1 MB)
# 100M items -> 958,505,837 bits  (~114 MB)
# 1B items   -> ~9.6 billion bits (~1.1 GB)

# For 0.1% FP rate (10x stricter):
# 1M items   -> 14,377,587 bits (~1.7 MB)  -- only 1.5x larger for 10x fewer false positives
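The per-item cost depends only on p, not on n: bits per element = -ln(p) / ln(2)² ≈ 1.44 · log2(1/p). A quick standalone check of the 1.5x claim above (`bits_per_item` is a hypothetical helper):

```python
import math

def bits_per_item(p: float) -> float:
    """Bits of filter per stored element for target FP rate p (independent of n)."""
    return -math.log(p) / (math.log(2) ** 2)

# 1% -> ~9.59 bits/item; 0.1% -> ~14.38 bits/item, an exact 1.5x ratio
# (because ln(0.001)/ln(0.01) = 3/2)
```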

Key Interview Points

  • No false negatives — if contains() returns False, the element is 100% not in the set. This is the guarantee that makes Bloom filters useful.
  • Cannot delete elements — setting bits to 0 would affect other elements that hash to the same positions. Use a Counting Bloom filter (store counts instead of bits) if deletions are needed.
  • Space efficiency: at 1% false positive rate, a Bloom filter uses ~10 bits per element regardless of element size. A hash set uses 8+ bytes per element for just the pointer, plus the element itself.
  • The false positive rate increases as elements are added beyond the initial capacity estimate. Monitor items_added vs expected_items and rebuild with a larger filter if the ratio exceeds 1.0.
  • Hash function choice matters: use independent hash functions (MurmurHash3 with different seeds, or FNV) for good distribution. MD5/SHA-256 are too slow; use them only if cryptographic strength is required.
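The Counting Bloom filter mentioned above can be sketched in a few lines. This is an illustration only (hypothetical class, fixed parameters, 8-bit saturating counters, SHA-256 double hashing in place of MurmurHash):

```python
import hashlib

class CountingBloomFilter:
    """Bloom filter variant with one small counter per slot, enabling remove()."""
    def __init__(self, size: int = 8192, hash_count: int = 4):
        self.size = size
        self.hash_count = hash_count
        self.counters = bytearray(size)   # one 8-bit counter per slot

    def _positions(self, item: str) -> list[int]:
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item: str):
        for pos in self._positions(item):
            if self.counters[pos] < 255:   # saturate instead of overflowing
                self.counters[pos] += 1

    def remove(self, item: str):
        """Only remove items known to be present, or counters go inconsistent."""
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def contains(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))
```

The memory cost of deletability is visible in the types: one byte per slot instead of one bit, a 4-8x overhead depending on counter width.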


