A bloom filter is a space-efficient probabilistic data structure that answers membership queries with certainty about absence and high confidence about presence. It is one of the most practical tools in distributed systems design: a 100-million-element set can be represented in roughly 120 MB with a 1% false positive rate, versus gigabytes for an exact set. This post covers the complete low-level design of a bloom filter service.
Core Mechanics
A bloom filter consists of an m-bit array initialized to all zeros, and k independent hash functions each mapping an item to a position in [0, m).
Insert: for a given item, compute k hash positions and set each bit to 1.
Query: compute the same k hash positions. If any bit is 0, the item is definitely absent — it could not have been inserted without setting all its bits. If all bits are 1, the item is probably present — there is a small probability that the bits were set by different items.
False negatives are impossible. False positives occur when all k bit positions happen to be set by the combined effect of other insertions.
Optimal Parameter Selection
Given expected element count n and desired false positive rate p, the optimal bit array size and number of hash functions are:
m = -n * ln(p) / (ln(2))^2 # bits
k = (m / n) * ln(2) # hash functions
For n = 10,000,000 and p = 0.01 (1% false positive rate): m ≈ 95,850,584 bits (~11.5 MB), k ≈ 7.
For p = 0.001 (0.1%): m ≈ 143,775,876 bits (~17.2 MB), k ≈ 10.
Scalable Bloom Filter
A standard bloom filter has a fixed capacity. As elements are added beyond n, the false positive rate rises above the target. A Scalable Bloom Filter (SBF) chains multiple filters: when the current filter reaches its target fill ratio (typically 50%), a new filter is added with a stricter false positive rate (e.g., multiplied by r = 0.5 per stage). The effective combined false positive rate is bounded by the geometric series sum.
Query: an item is present if any filter in the chain reports present. Insert: always insert into the most recent filter.
Counting Bloom Filter
Standard bloom filters do not support deletion: clearing a bit set by item A might remove a bit needed for item B. A counting bloom filter replaces each bit with a small counter (typically 4 bits). Insert increments the k counters; delete decrements them. Query checks that all k counters are non-zero.
Counters can overflow (at 4 bits, saturates at 15 insertions of the same hash position). Use 8-bit counters for high-cardinality keys.
Redis-Backed Distributed Implementation
Redis natively supports bit array operations via SETBIT / GETBIT. A distributed bloom filter:
- Compute
khash positions for the item - Pipeline
kSETBIT commands in a single round trip for insert - Pipeline
kGETBIT commands for query; all must return 1 for “possibly present” - Use a Redis Cluster with key
{filter_name}:bits— shard by filter name for locality
Redis also provides native RedisBloom module with BF.ADD / BF.EXISTS / BF.RESERVE commands if available.
Use Cases
- Cache pre-check: before querying a DB for a key, check the bloom filter. If absent, skip the DB query entirely — eliminates cache-miss DB reads for definitely-non-existent keys.
- Email duplicate detection: before processing an inbound email event, check if the message-id has been seen. Prevents double-processing with near-zero memory per tracked message.
- Web crawler URL dedup: Googlebot-scale crawlers store visited URLs in a distributed bloom filter. A 1 billion URL set at 1% FP fits in ~1.2 GB.
SQL Schema
-- Bloom filter configuration registry
CREATE TABLE BloomFilterConfig (
id BIGSERIAL PRIMARY KEY,
name TEXT NOT NULL UNIQUE,
expected_elements BIGINT NOT NULL,
fp_rate FLOAT NOT NULL,
m_bits BIGINT NOT NULL,
k_hashes INTEGER NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Periodic stats snapshots
CREATE TABLE BloomFilterStats (
id BIGSERIAL PRIMARY KEY,
filter_id BIGINT NOT NULL REFERENCES BloomFilterConfig(id),
estimated_elements BIGINT,
false_positive_estimate FLOAT,
sampled_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_bfs_filter_sampled ON BloomFilterStats(filter_id, sampled_at DESC);
Python Implementation
import math
import hashlib
from typing import Optional
def compute_optimal_params(n: int, p: float) -> tuple[int, int]:
"""Return (m_bits, k_hashes) for n expected elements and false positive rate p."""
m = int(-n * math.log(p) / (math.log(2) ** 2))
k = int((m / n) * math.log(2))
return m, max(1, k)
class BloomFilter:
def __init__(self, expected_elements: int, fp_rate: float = 0.01):
self.m, self.k = compute_optimal_params(expected_elements, fp_rate)
self.bits = bytearray((self.m + 7) // 8)
self.count = 0
def _hash_positions(self, item: str) -> list[int]:
positions = []
for seed in range(self.k):
h = int(hashlib.sha256(f"{seed}:{item}".encode()).hexdigest(), 16)
positions.append(h % self.m)
return positions
def add(self, item: str) -> None:
for pos in self._hash_positions(item):
self.bits[pos // 8] |= (1 < bool:
"""Returns True if item is possibly present, False if definitely absent."""
for pos in self._hash_positions(item):
if not (self.bits[pos // 8] & (1 < float:
set_bits = sum(bin(b).count("1") for b in self.bits)
return set_bits / self.m
def estimated_fp_rate(self) -> float:
return (1 - math.exp(-self.k * self.count / self.m)) ** self.k
class ScalableBloomFilter:
"""Chained bloom filters that scale as capacity is exceeded."""
FILL_RATIO_THRESHOLD = 0.5
FP_TIGHTENING_RATIO = 0.5
def __init__(self, initial_capacity: int = 1_000_000, initial_fp_rate: float = 0.01):
self.initial_capacity = initial_capacity
self.initial_fp_rate = initial_fp_rate
self.filters: list[BloomFilter] = [
BloomFilter(initial_capacity, initial_fp_rate)
]
def add(self, item: str) -> None:
current = self.filters[-1]
if current.fill_ratio() >= self.FILL_RATIO_THRESHOLD:
new_fp = self.initial_fp_rate * (self.FP_TIGHTENING_RATIO ** len(self.filters))
new_capacity = self.initial_capacity * (2 ** len(self.filters))
self.filters.append(BloomFilter(new_capacity, max(new_fp, 1e-6)))
self.filters[-1].add(item)
def contains(self, item: str) -> bool:
return any(f.contains(item) for f in self.filters)
class RedisBloomFilter:
"""Redis-backed distributed bloom filter using SETBIT/GETBIT pipelines."""
def __init__(self, redis_client, filter_name: str, m_bits: int, k_hashes: int):
self.redis = redis_client
self.key = f"bloom:{filter_name}:bits"
self.m = m_bits
self.k = k_hashes
def _positions(self, item: str) -> list[int]:
return [
int(hashlib.sha256(f"{s}:{item}".encode()).hexdigest(), 16) % self.m
for s in range(self.k)
]
def add(self, item: str) -> None:
pipe = self.redis.pipeline(transaction=False)
for pos in self._positions(item):
pipe.setbit(self.key, pos, 1)
pipe.execute()
def contains(self, item: str) -> bool:
pipe = self.redis.pipeline(transaction=False)
for pos in self._positions(item):
pipe.getbit(self.key, pos)
results = pipe.execute()
return all(results)
False Positive Rate Implications
In a cache pre-check scenario, a 1% false positive rate means 1% of “definitely absent” misses still hit the database — acceptable overhead. In a security context (e.g., checking revoked tokens), a false positive means incorrectly accepting a revoked credential, which may be unacceptable. Always calibrate the false positive rate to the cost of an incorrect “present” answer in your specific use case.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Anthropic Interview Guide 2026: Process, Questions, and AI Safety