Bloom Filter Low-Level Design: Probabilistic Set Membership at Scale

A Bloom filter is a space-efficient probabilistic data structure that answers the question “is this element in the set?” with two possible answers: “definitely not” or “probably yes.” The “probably yes” answer has a configurable false-positive rate — the filter may report an element is present when it isn’t. But it never produces false negatives: if the filter says an element is absent, it is guaranteed to be absent. This asymmetry makes Bloom filters ideal for eliminating unnecessary expensive lookups: check the filter first; only go to the database if the filter says the element might exist.

Implementation

import math
import mmh3  # MurmurHash3 — fast, good distribution

class BloomFilter:
    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Calculate optimal bit array size and number of hash functions
        self.size = self._optimal_size(expected_items, false_positive_rate)
        self.hash_count = self._optimal_hash_count(self.size, expected_items)
        self.bit_array = bytearray(math.ceil(self.size / 8))
        self.items_added = 0

    def _optimal_size(self, n: int, p: float) -> int:
        """m = -(n * ln(p)) / (ln(2)^2)"""
        return int(-n * math.log(p) / (math.log(2) ** 2))

    def _optimal_hash_count(self, m: int, n: int) -> int:
        """k = (m/n) * ln(2)"""
        return max(1, int((m / n) * math.log(2)))

    def _get_positions(self, item: str) -> list[int]:
        positions = []
        for seed in range(self.hash_count):
            # Use different seeds to simulate k independent hash functions
            pos = mmh3.hash(item, seed) % self.size
            positions.append(abs(pos))
        return positions

    def add(self, item: str):
        for pos in self._get_positions(item):
            byte_idx, bit_idx = divmod(pos, 8)
            self.bit_array[byte_idx] |= (1 < bool:
        """Returns False = definitely not present. True = probably present."""
        for pos in self._get_positions(item):
            byte_idx, bit_idx = divmod(pos, 8)
            if not (self.bit_array[byte_idx] & (1 < float:
        """Current FP rate given items added so far."""
        k, m, n = self.hash_count, self.size, self.items_added
        return (1 - math.exp(-k * n / m)) ** k

# Example: 1M items, 1% FP rate
# size = 9,585,058 bits (~1.1 MB)  vs  8 bytes/item in a hash set (~8 MB)
bf = BloomFilter(expected_items=1_000_000, false_positive_rate=0.01)
print(f"Bit array size: {bf.size:,} bits ({bf.size//8:,} bytes)")
print(f"Hash functions: {bf.hash_count}")
# Output:
# Bit array size: 9,585,058 bits (1,198,132 bytes ~= 1.1 MB)
# Hash functions: 7

Key Use Cases

Username availability check: before querying the database to check if a username is taken, check the Bloom filter. If the filter says “definitely not present,” skip the DB query entirely. If “probably present,” query the DB to confirm. This eliminates ~99% of DB lookups for usernames that are not taken (the common case during sign-up).

class UsernameChecker:
    def __init__(self):
        self.bloom = BloomFilter(expected_items=50_000_000, false_positive_rate=0.001)
        self._load_existing_usernames()  # populate from DB at startup

    def _load_existing_usernames(self):
        for username in db.fetchall_scalar("SELECT username FROM User"):
            self.bloom.add(username.lower())

    def is_available(self, username: str) -> bool:
        if not self.bloom.contains(username.lower()):
            return True   # definitely available -- skip DB query
        # Bloom says "maybe taken" -- must verify with DB
        return not db.fetchone("SELECT 1 FROM User WHERE username=%s",
                               [username.lower()])

Preventing duplicate URL crawl: a web crawler maintains a Bloom filter of visited URLs. Before adding a URL to the crawl queue, check the filter. If “definitely not visited,” add to queue. False positives (filter says visited but it’s not) result in skipping a URL — acceptable for a crawler. Storage: a Bloom filter for 1 billion URLs at 1% FP rate requires ~1.2 GB; a hash set would require ~40 GB.

Cache negative caching (cache penetration prevention): when a cache miss occurs and the DB also returns nothing, add the key to a Bloom filter. On future requests for that key, check the filter first — if “probably present in filter” means “definitely does not exist in DB,” return empty immediately without hitting cache or DB.

Redis Bloom Filter (Production Use)

# RedisBloom module (available in Redis Stack and Redis Cloud)
# No need to implement from scratch in production

import redis
r = redis.Redis()

# Create a Bloom filter with 1M capacity and 1% error rate
r.execute_command('BF.RESERVE', 'usernames', 0.01, 1_000_000)

# Add items
r.execute_command('BF.ADD', 'usernames', 'alice')
r.execute_command('BF.MADD', 'usernames', 'bob', 'charlie', 'diana')

# Query
result = r.execute_command('BF.EXISTS', 'usernames', 'alice')   # 1 (probably present)
result = r.execute_command('BF.EXISTS', 'usernames', 'unknown') # 0 (definitely absent)

# The filter persists in Redis, survives restarts, and can be replicated

Sizing and Tuning

# Relationship between size, items, and false positive rate:
# p = (1 - e^(-k*n/m))^k
# where m = bits, n = items, k = hash functions

# For 1% FP rate:
# 10K items  -> 95,851 bits   (11 KB)
# 1M items   -> 9,585,058 bits (1.1 MB)
# 100M items -> 958,505,856 bits (115 MB)
# 1B items   -> ~1.1 GB

# For 0.1% FP rate (10x stricter):
# 1M items   -> 14,377,588 bits (1.7 MB)  -- only 1.5x larger for 10x fewer false positives

Key Interview Points

  • No false negatives — if contains() returns False, the element is 100% not in the set. This is the guarantee that makes Bloom filters useful.
  • Cannot delete elements — setting bits to 0 would affect other elements that hash to the same positions. Use a Counting Bloom filter (store counts instead of bits) if deletions are needed.
  • Space efficiency: at 1% false positive rate, a Bloom filter uses ~10 bits per element regardless of element size. A hash set uses 8+ bytes per element for just the pointer, plus the element itself.
  • The false positive rate increases as elements are added beyond the initial capacity estimate. Monitor items_added vs expected_items and rebuild with a larger filter if the ratio exceeds 1.0.
  • Hash function choice matters: use independent hash functions (MurmurHash3 with different seeds, or FNV) for good distribution. MD5/SHA-256 are too slow; use them only if cryptographic strength is required.

Bloom filter and probabilistic data structure design is discussed in Google system design interview questions.

Bloom filter and large-scale data lookup optimization is covered in Databricks system design interview preparation.

Bloom filter and username availability system design is discussed in Meta system design interview guide.

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

Scroll to Top