A bloom filter is a space-efficient probabilistic data structure that answers membership queries with certainty about absence and high confidence about presence. It is one of the most practical tools in distributed systems design: a 100-million-element set can be represented in roughly 120 MB with a 1% false positive rate, versus gigabytes for an exact set. This post covers the complete low-level design of a bloom filter service.
Core Mechanics
A bloom filter consists of an m-bit array initialized to all zeros, and k independent hash functions each mapping an item to a position in [0, m).
Insert: for a given item, compute k hash positions and set each bit to 1.
Query: compute the same k hash positions. If any bit is 0, the item is definitely absent — it could not have been inserted without setting all its bits. If all bits are 1, the item is probably present — there is a small probability that the bits were set by different items.
False negatives are impossible. False positives occur when all k bit positions happen to be set by the combined effect of other insertions.
Optimal Parameter Selection
Given expected element count n and desired false positive rate p, the optimal bit array size and number of hash functions are:
m = -n * ln(p) / (ln(2))^2 # bits
k = (m / n) * ln(2) # hash functions
For n = 10,000,000 and p = 0.01 (1% false positive rate): m ≈ 95,850,584 bits (~11.5 MB), k ≈ 7.
For p = 0.001 (0.1%): m ≈ 143,775,876 bits (~17.2 MB), k ≈ 10.
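These figures follow directly from the two formulas; a quick numerical check (rounding k to the nearest integer is where the ≈7 comes from):

```python
import math

def optimal_params(n: int, p: float) -> tuple[int, int]:
    """Optimal bit count m and hash count k for n expected elements at FP rate p."""
    m = -n * math.log(p) / (math.log(2) ** 2)
    k = (m / n) * math.log(2)
    return round(m), round(k)

m, k = optimal_params(10_000_000, 0.01)
print(m, k)   # ~95,850,584 bits (~11.5 MB), k = 7

m, k = optimal_params(10_000_000, 0.001)
print(m, k)   # ~143,775,876 bits (~17.2 MB), k = 10
```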
Scalable Bloom Filter
A standard bloom filter has a fixed capacity. As elements are added beyond n, the false positive rate rises above the target. A Scalable Bloom Filter (SBF) chains multiple filters: when the current filter reaches its target fill ratio (typically 50%), a new filter is added with a stricter false positive rate (e.g., multiplied by r = 0.5 per stage). The effective combined false positive rate is bounded by the geometric series sum.
Query: an item is present if any filter in the chain reports present. Insert: always insert into the most recent filter.
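The bound is easy to check numerically: if stage i is built with false positive rate p0 * r^i, a query is a false positive whenever any stage misfires, so the compound rate is at most the sum of the per-stage rates, a geometric series totalling p0 / (1 - r). With the typical r = 0.5 that is just twice the first stage's rate (a quick sketch under those assumed parameters):

```python
p0, r = 0.01, 0.5   # first-stage FP rate and per-stage tightening ratio

# Per-stage FP rates form a geometric series; the chain's compound FP rate
# is bounded by their sum (union bound over "any stage misfires").
stage_rates = [p0 * r**i for i in range(20)]
compound_bound = sum(stage_rates)

print(compound_bound)   # approaches p0 / (1 - r) = 0.02, i.e. 2x the first stage
```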
Counting Bloom Filter
Standard bloom filters do not support deletion: clearing a bit set by item A might remove a bit needed for item B. A counting bloom filter replaces each bit with a small counter (typically 4 bits). Insert increments the k counters; delete decrements them. Query checks that all k counters are non-zero.
Counters can overflow: a 4-bit counter saturates once the same position has been incremented 15 times. Use 8-bit counters for workloads that frequently re-insert the same items.
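A minimal counting variant can be sketched as follows. The hash scheme mirrors the one used in the Python implementation later in the post; the saturation guard emulates 4-bit cells by refusing to increment past 15 (a saturated counter is also never decremented, trading deletability of that position for safety against underflow):

```python
import hashlib

class CountingBloomFilter:
    """Sketch of a counting bloom filter: one small counter per position."""
    CAP = 15  # 4-bit counter maximum

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = bytearray(m)  # one byte of storage per cell, for simplicity

    def _positions(self, item: str) -> list[int]:
        return [
            int(hashlib.sha256(f"{s}:{item}".encode()).hexdigest(), 16) % self.m
            for s in range(self.k)
        ]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < self.CAP:   # saturate instead of overflowing
                self.counters[pos] += 1

    def remove(self, item: str) -> None:
        # Only safe for items previously added; saturated cells are left alone.
        for pos in self._positions(item):
            if 0 < self.counters[pos] < self.CAP:
                self.counters[pos] -= 1

    def contains(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))
```

After `add("x")`, `contains("x")` is True; after a matching `remove("x")`, the counters return to their prior values, so the deletion does not disturb other items.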
Redis-Backed Distributed Implementation
Redis natively supports bit array operations via SETBIT / GETBIT. A distributed bloom filter:
- Compute k hash positions for the item
- Pipeline k SETBIT commands in a single round trip for insert
- Pipeline k GETBIT commands for query; all must return 1 for "possibly present"
- Use a Redis Cluster with key {filter_name}:bits — shard by filter name for locality
If the RedisBloom module is loaded, Redis also offers purpose-built BF.ADD / BF.EXISTS / BF.RESERVE commands that handle hashing and sizing server-side.
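With the module route, the whole flow collapses to single commands. An illustrative redis-cli session (the filter name and sizing here are arbitrary):

```shell
# Pre-size the filter: 1% FP target, 10M expected elements
redis-cli BF.RESERVE bloom:seen_urls 0.01 10000000

# Insert and query; the module manages hashing and bit layout internally
redis-cli BF.ADD bloom:seen_urls "https://example.com/a"
redis-cli BF.EXISTS bloom:seen_urls "https://example.com/a"   # 1 (possibly present)
redis-cli BF.EXISTS bloom:seen_urls "https://example.com/b"   # 0 (definitely absent)
```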
Use Cases
- Cache pre-check: before querying a DB for a key, check the bloom filter. If absent, skip the DB query entirely — eliminates cache-miss DB reads for definitely-non-existent keys.
- Email duplicate detection: before processing an inbound email event, check if the message-id has been seen. Prevents double-processing with near-zero memory per tracked message.
- Web crawler URL dedup: Googlebot-scale crawlers store visited URLs in a distributed bloom filter. A 1 billion URL set at 1% FP fits in ~1.2 GB.
SQL Schema
-- Bloom filter configuration registry
CREATE TABLE BloomFilterConfig (
    id BIGSERIAL PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    expected_elements BIGINT NOT NULL,
    fp_rate FLOAT NOT NULL,
    m_bits BIGINT NOT NULL,
    k_hashes INTEGER NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Periodic stats snapshots
CREATE TABLE BloomFilterStats (
    id BIGSERIAL PRIMARY KEY,
    filter_id BIGINT NOT NULL REFERENCES BloomFilterConfig(id),
    estimated_elements BIGINT,
    false_positive_estimate FLOAT,
    sampled_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_bfs_filter_sampled ON BloomFilterStats(filter_id, sampled_at DESC);
Python Implementation
import math
import hashlib
def compute_optimal_params(n: int, p: float) -> tuple[int, int]:
    """Return (m_bits, k_hashes) for n expected elements and false positive rate p."""
    m = int(-n * math.log(p) / (math.log(2) ** 2))
    # Round to nearest: the optimum is 6.64 for p = 0.01, i.e. k = 7
    k = round((m / n) * math.log(2))
    return m, max(1, k)
class BloomFilter:
    def __init__(self, expected_elements: int, fp_rate: float = 0.01):
        self.m, self.k = compute_optimal_params(expected_elements, fp_rate)
        self.bits = bytearray((self.m + 7) // 8)
        self.count = 0

    def _hash_positions(self, item: str) -> list[int]:
        positions = []
        for seed in range(self.k):
            h = int(hashlib.sha256(f"{seed}:{item}".encode()).hexdigest(), 16)
            positions.append(h % self.m)
        return positions

    def add(self, item: str) -> None:
        for pos in self._hash_positions(item):
            self.bits[pos // 8] |= (1 << (pos % 8))
        self.count += 1

    def contains(self, item: str) -> bool:
        """Returns True if item is possibly present, False if definitely absent."""
        for pos in self._hash_positions(item):
            if not (self.bits[pos // 8] & (1 << (pos % 8))):
                return False
        return True

    def fill_ratio(self) -> float:
        set_bits = sum(bin(b).count("1") for b in self.bits)
        return set_bits / self.m

    def estimated_fp_rate(self) -> float:
        return (1 - math.exp(-self.k * self.count / self.m)) ** self.k
class ScalableBloomFilter:
    """Chained bloom filters that scale as capacity is exceeded."""
    FILL_RATIO_THRESHOLD = 0.5
    FP_TIGHTENING_RATIO = 0.5

    def __init__(self, initial_capacity: int = 1_000_000, initial_fp_rate: float = 0.01):
        self.initial_capacity = initial_capacity
        self.initial_fp_rate = initial_fp_rate
        self.filters: list[BloomFilter] = [
            BloomFilter(initial_capacity, initial_fp_rate)
        ]

    def add(self, item: str) -> None:
        current = self.filters[-1]
        if current.fill_ratio() >= self.FILL_RATIO_THRESHOLD:
            new_fp = self.initial_fp_rate * (self.FP_TIGHTENING_RATIO ** len(self.filters))
            new_capacity = self.initial_capacity * (2 ** len(self.filters))
            self.filters.append(BloomFilter(new_capacity, max(new_fp, 1e-6)))
        self.filters[-1].add(item)

    def contains(self, item: str) -> bool:
        return any(f.contains(item) for f in self.filters)
class RedisBloomFilter:
    """Redis-backed distributed bloom filter using SETBIT/GETBIT pipelines."""
    def __init__(self, redis_client, filter_name: str, m_bits: int, k_hashes: int):
        self.redis = redis_client
        self.key = f"bloom:{filter_name}:bits"
        self.m = m_bits
        self.k = k_hashes

    def _positions(self, item: str) -> list[int]:
        return [
            int(hashlib.sha256(f"{s}:{item}".encode()).hexdigest(), 16) % self.m
            for s in range(self.k)
        ]

    def add(self, item: str) -> None:
        pipe = self.redis.pipeline(transaction=False)
        for pos in self._positions(item):
            pipe.setbit(self.key, pos, 1)
        pipe.execute()

    def contains(self, item: str) -> bool:
        pipe = self.redis.pipeline(transaction=False)
        for pos in self._positions(item):
            pipe.getbit(self.key, pos)
        results = pipe.execute()
        return all(results)
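As a sanity check on estimated_fp_rate: plugging exactly n insertions into the (1 - e^(-kn/m))^k formula with the optimal m and k should land near the configured target. A standalone recomputation, independent of the classes above:

```python
import math

n, p = 10_000_000, 0.01
m = int(-n * math.log(p) / (math.log(2) ** 2))
k = round((m / n) * math.log(2))   # 7

# At full capacity (count == n) the estimate should recover roughly p
p_est = (1 - math.exp(-k * n / m)) ** k
print(round(p_est, 4))   # ~0.01
```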
False Positive Rate Implications
In a cache pre-check scenario, a 1% false positive rate means 1% of “definitely absent” misses still hit the database — acceptable overhead. In a security context (e.g., checking revoked tokens), a false positive means incorrectly accepting a revoked credential, which may be unacceptable. Always calibrate the false positive rate to the cost of an incorrect “present” answer in your specific use case.
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How do you calculate the optimal false positive rate for a bloom filter?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The optimal bit array size is m = -n * ln(p) / (ln(2))^2 where n is expected element count and p is the target false positive rate. The optimal number of hash functions is k = (m/n) * ln(2), which rounds to approximately 7 for a 1% false positive rate. These formulas minimize the false positive rate for a given memory budget."
      }
    },
    {
      "@type": "Question",
      "name": "Why can a bloom filter have false positives but never false negatives?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A bloom filter sets bits on insert and checks those same bits on query. If an item was inserted, its bits are guaranteed to be set, so the query always returns present — no false negatives. False positives occur when all k bit positions for a queried item happen to be set by the cumulative effect of other items being inserted, causing the filter to incorrectly report the item as present."
      }
    },
    {
      "@type": "Question",
      "name": "When should you use a scalable bloom filter instead of a standard one?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Use a scalable bloom filter when the total number of elements to be inserted is not known in advance. A standard bloom filter has a fixed capacity: exceeding the expected element count causes the false positive rate to rise above the target. A scalable bloom filter chains additional filters as capacity is reached, maintaining the target false positive rate at the cost of slightly higher query time (checking each filter in the chain)."
      }
    },
    {
      "@type": "Question",
      "name": "How does a counting bloom filter support deletion?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A counting bloom filter replaces each bit with a small counter (typically 4 bits). On insert, increment the k counters at the hash positions. On delete, decrement them. On query, check that all k counters are greater than zero. Deletion is safe as long as items are not deleted more times than they were inserted. Counter overflow is a concern for 4-bit counters (max value 15); use 8-bit counters for datasets with frequent re-insertion of the same items."
      }
    }
  ]
}