A Bloom filter is a space-efficient probabilistic data structure that answers the question “is this element in the set?” with two possible answers: “definitely not” or “probably yes.” The “probably yes” answer has a configurable false-positive rate — the filter may report an element is present when it isn’t. But it never produces false negatives: if the filter says an element is absent, it is guaranteed to be absent. This asymmetry makes Bloom filters ideal for eliminating unnecessary expensive lookups: check the filter first; only go to the database if the filter says the element might exist.
Implementation
import math
import mmh3 # MurmurHash3 — fast, good distribution
class BloomFilter:
def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
# Calculate optimal bit array size and number of hash functions
self.size = self._optimal_size(expected_items, false_positive_rate)
self.hash_count = self._optimal_hash_count(self.size, expected_items)
self.bit_array = bytearray(math.ceil(self.size / 8))
self.items_added = 0
def _optimal_size(self, n: int, p: float) -> int:
"""m = -(n * ln(p)) / (ln(2)^2)"""
return int(-n * math.log(p) / (math.log(2) ** 2))
def _optimal_hash_count(self, m: int, n: int) -> int:
"""k = (m/n) * ln(2)"""
return max(1, int((m / n) * math.log(2)))
def _get_positions(self, item: str) -> list[int]:
positions = []
for seed in range(self.hash_count):
# Use different seeds to simulate k independent hash functions
pos = mmh3.hash(item, seed) % self.size
positions.append(abs(pos))
return positions
def add(self, item: str):
for pos in self._get_positions(item):
byte_idx, bit_idx = divmod(pos, 8)
self.bit_array[byte_idx] |= (1 < bool:
"""Returns False = definitely not present. True = probably present."""
for pos in self._get_positions(item):
byte_idx, bit_idx = divmod(pos, 8)
if not (self.bit_array[byte_idx] & (1 < float:
"""Current FP rate given items added so far."""
k, m, n = self.hash_count, self.size, self.items_added
return (1 - math.exp(-k * n / m)) ** k
# Example: 1M items, 1% FP rate
# size = 9,585,058 bits (~1.1 MB) vs 8 bytes/item in a hash set (~8 MB)
bf = BloomFilter(expected_items=1_000_000, false_positive_rate=0.01)
print(f"Bit array size: {bf.size:,} bits ({bf.size//8:,} bytes)")
print(f"Hash functions: {bf.hash_count}")
# Output:
# Bit array size: 9,585,058 bits (1,198,132 bytes ~= 1.1 MB)
# Hash functions: 7
Key Use Cases
Username availability check: before querying the database to check if a username is taken, check the Bloom filter. If the filter says “definitely not present,” skip the DB query entirely. If “probably present,” query the DB to confirm. This eliminates ~99% of DB lookups for usernames that are not taken (the common case during sign-up).
class UsernameChecker:
def __init__(self):
self.bloom = BloomFilter(expected_items=50_000_000, false_positive_rate=0.001)
self._load_existing_usernames() # populate from DB at startup
def _load_existing_usernames(self):
for username in db.fetchall_scalar("SELECT username FROM User"):
self.bloom.add(username.lower())
def is_available(self, username: str) -> bool:
if not self.bloom.contains(username.lower()):
return True # definitely available -- skip DB query
# Bloom says "maybe taken" -- must verify with DB
return not db.fetchone("SELECT 1 FROM User WHERE username=%s",
[username.lower()])
Preventing duplicate URL crawl: a web crawler maintains a Bloom filter of visited URLs. Before adding a URL to the crawl queue, check the filter. If “definitely not visited,” add to queue. False positives (filter says visited but it’s not) result in skipping a URL — acceptable for a crawler. Storage: a Bloom filter for 1 billion URLs at 1% FP rate requires ~1.2 GB; a hash set would require ~40 GB.
Cache negative caching (cache penetration prevention): when a cache miss occurs and the DB also returns nothing, add the key to a Bloom filter. On future requests for that key, check the filter first — if “probably present in filter” means “definitely does not exist in DB,” return empty immediately without hitting cache or DB.
Redis Bloom Filter (Production Use)
# RedisBloom module (available in Redis Stack and Redis Cloud)
# No need to implement from scratch in production
import redis
r = redis.Redis()
# Create a Bloom filter with 1M capacity and 1% error rate
r.execute_command('BF.RESERVE', 'usernames', 0.01, 1_000_000)
# Add items
r.execute_command('BF.ADD', 'usernames', 'alice')
r.execute_command('BF.MADD', 'usernames', 'bob', 'charlie', 'diana')
# Query
result = r.execute_command('BF.EXISTS', 'usernames', 'alice') # 1 (probably present)
result = r.execute_command('BF.EXISTS', 'usernames', 'unknown') # 0 (definitely absent)
# The filter persists in Redis, survives restarts, and can be replicated
Sizing and Tuning
# Relationship between size, items, and false positive rate:
# p = (1 - e^(-k*n/m))^k
# where m = bits, n = items, k = hash functions
# For 1% FP rate:
# 10K items -> 95,851 bits (11 KB)
# 1M items -> 9,585,058 bits (1.1 MB)
# 100M items -> 958,505,856 bits (115 MB)
# 1B items -> ~1.1 GB
# For 0.1% FP rate (10x stricter):
# 1M items -> 14,377,588 bits (1.7 MB) -- only 1.5x larger for 10x fewer false positives
Key Interview Points
- No false negatives — if contains() returns False, the element is 100% not in the set. This is the guarantee that makes Bloom filters useful.
- Cannot delete elements — setting bits to 0 would affect other elements that hash to the same positions. Use a Counting Bloom filter (store counts instead of bits) if deletions are needed.
- Space efficiency: at 1% false positive rate, a Bloom filter uses ~10 bits per element regardless of element size. A hash set uses 8+ bytes per element for just the pointer, plus the element itself.
- The false positive rate increases as elements are added beyond the initial capacity estimate. Monitor items_added vs expected_items and rebuild with a larger filter if the ratio exceeds 1.0.
- Hash function choice matters: use independent hash functions (MurmurHash3 with different seeds, or FNV) for good distribution. MD5/SHA-256 are too slow; use them only if cryptographic strength is required.
Bloom filter and probabilistic data structure design is discussed in Google system design interview questions.
Bloom filter and large-scale data lookup optimization is covered in Databricks system design interview preparation.
Bloom filter and username availability system design is discussed in Meta system design interview guide.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering