Bloom Filter Service Low-Level Design: Bit Array Design, Hash Functions, Scalable Variants, and False Positive Trade-offs

A bloom filter is a space-efficient probabilistic data structure that answers membership queries with certainty about absence and high confidence about presence. It is one of the most practical tools in distributed systems design: a 100-million-element set can be represented in roughly 120 MB with a 1% false positive rate, versus gigabytes for an exact set. This post covers the complete low-level design of a bloom filter service.

Core Mechanics

A bloom filter consists of an m-bit array initialized to all zeros, and k independent hash functions each mapping an item to a position in [0, m).

Insert: for a given item, compute k hash positions and set each bit to 1.

Query: compute the same k hash positions. If any bit is 0, the item is definitely absent — it could not have been inserted without setting all its bits. If all bits are 1, the item is probably present — there is a small probability that the bits were set by different items.

False negatives are impossible. False positives occur when all k bit positions happen to be set by the combined effect of other insertions.

Optimal Parameter Selection

Given expected element count n and desired false positive rate p, the optimal bit array size and number of hash functions are:

m = -n * ln(p) / (ln(2))^2     # bits
k = (m / n) * ln(2)            # hash functions

For n = 10,000,000 and p = 0.01 (1% false positive rate): m ≈ 95,850,584 bits (~11.5 MB), k ≈ 7.

For p = 0.001 (0.1%): m ≈ 143,775,876 bits (~17.2 MB), k ≈ 10.

Scalable Bloom Filter

A standard bloom filter has a fixed capacity. As elements are added beyond n, the false positive rate rises above the target. A Scalable Bloom Filter (SBF) chains multiple filters: when the current filter reaches its target fill ratio (typically 50%), a new filter is added with a stricter false positive rate (e.g., multiplied by r = 0.5 per stage). The effective combined false positive rate is bounded by the geometric series sum.

Query: an item is present if any filter in the chain reports present. Insert: always insert into the most recent filter.

Counting Bloom Filter

Standard bloom filters do not support deletion: clearing a bit set by item A might remove a bit needed for item B. A counting bloom filter replaces each bit with a small counter (typically 4 bits). Insert increments the k counters; delete decrements them. Query checks that all k counters are non-zero.

Counters can overflow (at 4 bits, saturates at 15 insertions of the same hash position). Use 8-bit counters for high-cardinality keys.

Redis-Backed Distributed Implementation

Redis natively supports bit array operations via SETBIT / GETBIT. A distributed bloom filter:

  • Compute k hash positions for the item
  • Pipeline k SETBIT commands in a single round trip for insert
  • Pipeline k GETBIT commands for query; all must return 1 for “possibly present”
  • Use a Redis Cluster with key {filter_name}:bits — shard by filter name for locality

Redis also provides native RedisBloom module with BF.ADD / BF.EXISTS / BF.RESERVE commands if available.

Use Cases

  • Cache pre-check: before querying a DB for a key, check the bloom filter. If absent, skip the DB query entirely — eliminates cache-miss DB reads for definitely-non-existent keys.
  • Email duplicate detection: before processing an inbound email event, check if the message-id has been seen. Prevents double-processing with near-zero memory per tracked message.
  • Web crawler URL dedup: Googlebot-scale crawlers store visited URLs in a distributed bloom filter. A 1 billion URL set at 1% FP fits in ~1.2 GB.

SQL Schema

-- Bloom filter configuration registry
CREATE TABLE BloomFilterConfig (
    id                BIGSERIAL PRIMARY KEY,
    name              TEXT NOT NULL UNIQUE,
    expected_elements BIGINT NOT NULL,
    fp_rate           FLOAT NOT NULL,
    m_bits            BIGINT NOT NULL,
    k_hashes          INTEGER NOT NULL,
    created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Periodic stats snapshots
CREATE TABLE BloomFilterStats (
    id                      BIGSERIAL PRIMARY KEY,
    filter_id               BIGINT NOT NULL REFERENCES BloomFilterConfig(id),
    estimated_elements      BIGINT,
    false_positive_estimate FLOAT,
    sampled_at              TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_bfs_filter_sampled ON BloomFilterStats(filter_id, sampled_at DESC);

Python Implementation

import math
import hashlib
from typing import Optional


def compute_optimal_params(n: int, p: float) -> tuple[int, int]:
    """Return (m_bits, k_hashes) for n expected elements and false positive rate p."""
    m = int(-n * math.log(p) / (math.log(2) ** 2))
    k = int((m / n) * math.log(2))
    return m, max(1, k)


class BloomFilter:
    def __init__(self, expected_elements: int, fp_rate: float = 0.01):
        self.m, self.k = compute_optimal_params(expected_elements, fp_rate)
        self.bits = bytearray((self.m + 7) // 8)
        self.count = 0

    def _hash_positions(self, item: str) -> list[int]:
        positions = []
        for seed in range(self.k):
            h = int(hashlib.sha256(f"{seed}:{item}".encode()).hexdigest(), 16)
            positions.append(h % self.m)
        return positions

    def add(self, item: str) -> None:
        for pos in self._hash_positions(item):
            self.bits[pos // 8] |= (1 < bool:
        """Returns True if item is possibly present, False if definitely absent."""
        for pos in self._hash_positions(item):
            if not (self.bits[pos // 8] & (1 < float:
        set_bits = sum(bin(b).count("1") for b in self.bits)
        return set_bits / self.m

    def estimated_fp_rate(self) -> float:
        return (1 - math.exp(-self.k * self.count / self.m)) ** self.k


class ScalableBloomFilter:
    """Chained bloom filters that scale as capacity is exceeded."""
    FILL_RATIO_THRESHOLD = 0.5
    FP_TIGHTENING_RATIO = 0.5

    def __init__(self, initial_capacity: int = 1_000_000, initial_fp_rate: float = 0.01):
        self.initial_capacity = initial_capacity
        self.initial_fp_rate = initial_fp_rate
        self.filters: list[BloomFilter] = [
            BloomFilter(initial_capacity, initial_fp_rate)
        ]

    def add(self, item: str) -> None:
        current = self.filters[-1]
        if current.fill_ratio() >= self.FILL_RATIO_THRESHOLD:
            new_fp = self.initial_fp_rate * (self.FP_TIGHTENING_RATIO ** len(self.filters))
            new_capacity = self.initial_capacity * (2 ** len(self.filters))
            self.filters.append(BloomFilter(new_capacity, max(new_fp, 1e-6)))
        self.filters[-1].add(item)

    def contains(self, item: str) -> bool:
        return any(f.contains(item) for f in self.filters)


class RedisBloomFilter:
    """Redis-backed distributed bloom filter using SETBIT/GETBIT pipelines."""
    def __init__(self, redis_client, filter_name: str, m_bits: int, k_hashes: int):
        self.redis = redis_client
        self.key = f"bloom:{filter_name}:bits"
        self.m = m_bits
        self.k = k_hashes

    def _positions(self, item: str) -> list[int]:
        return [
            int(hashlib.sha256(f"{s}:{item}".encode()).hexdigest(), 16) % self.m
            for s in range(self.k)
        ]

    def add(self, item: str) -> None:
        pipe = self.redis.pipeline(transaction=False)
        for pos in self._positions(item):
            pipe.setbit(self.key, pos, 1)
        pipe.execute()

    def contains(self, item: str) -> bool:
        pipe = self.redis.pipeline(transaction=False)
        for pos in self._positions(item):
            pipe.getbit(self.key, pos)
        results = pipe.execute()
        return all(results)

False Positive Rate Implications

In a cache pre-check scenario, a 1% false positive rate means 1% of “definitely absent” misses still hit the database — acceptable overhead. In a security context (e.g., checking revoked tokens), a false positive means incorrectly accepting a revoked credential, which may be unacceptable. Always calibrate the false positive rate to the cost of an incorrect “present” answer in your specific use case.

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Anthropic Interview Guide 2026: Process, Questions, and AI Safety

Scroll to Top