Counting the number of distinct elements in a large stream — cardinality estimation — seems to require storing every unique element seen. HyperLogLog achieves this with kilobytes of memory and sub-1% error for billions of elements. It is behind Redis PFCOUNT, BigQuery's APPROX_COUNT_DISTINCT, and Spark's countApproxDistinct. This post covers the complete low-level design.
The Cardinality Estimation Problem
Exact distinct counting requires a hash set of every unique element seen, growing linearly with the true cardinality. For 1 billion unique visitors per day, that is tens of gigabytes. HyperLogLog uses a fixed-size register array (12 KB for p=14) regardless of stream size, with a standard error of ~0.81%.
The HLL Algorithm
Core insight: hash each element to a uniformly random bit string. The probability that the maximum run of leading zeros across all hashes is at least r is approximately 1 / 2^r. If the max leading zeros is r, the estimated cardinality is 2^r. One register is too noisy; HLL averages across many buckets.
Algorithm:
- Choose precision
b(typically 4–16). There are2^bbuckets. - Hash each element to a 64-bit value.
- Use the first
bbits to select a bucket index. - Count the position of the leftmost 1-bit in the remaining bits (leading zeros + 1).
- Update the bucket register to the maximum of its current value and this count.
Estimator: the estimated cardinality is:
E = alpha_m * m^2 * (sum of 2^(-register[j]) for j in 0..m-1)^(-1)
where m = 2^b and alpha_m is a bias correction constant (~0.7213 for large m).
Precision, Memory, and Error
| Precision b | Registers m | Memory | Std Error |
|---|---|---|---|
| 10 | 1,024 | 768 B | 3.25% |
| 12 | 4,096 | 3 KB | 1.625% |
| 14 | 16,384 | 12 KB | 0.8125% |
| 16 | 65,536 | 48 KB | 0.4% |
Standard error = 1.04 / sqrt(m). For most applications p=14 gives a good balance: 12 KB and sub-1% error for cardinalities from 0 to hundreds of billions.
Small and Large Range Corrections
The base estimator breaks down at extremes:
- Small range: when many registers are zero, switch to linear counting — count zero registers
V, estimate =m * ln(m / V). This is more accurate for sparse register arrays. - Large range: when the estimate approaches
2^32(32-bit hash saturation), apply a correction to account for hash collisions. With 64-bit hashes this correction is practically unnecessary.
Merge for Set Union Cardinality
The HLL register array is mergeable: the union cardinality of two sets is estimated by taking the component-wise maximum of their register arrays, then running the estimator on the merged array.
merged[j] = max(hll_a[j], hll_b[j]) for j in 0..m-1
This is exact for the HLL estimator and enables distributed cardinality computation: each shard maintains its own HLL, the coordinator merges register arrays, and the merged estimate equals what you would get from a single HLL that saw all elements.
Redis Integration
Redis implements HyperLogLog natively:
PFADD key elem1 elem2 ...— add elementsPFCOUNT key [key2 ...]— estimate cardinality (merge if multiple keys)PFMERGE dest src1 src2 ...— merge HLLs into a new key
Each Redis HLL is stored in 12 KB (dense format) or as a sparse encoding for low-cardinality HLLs (< ~300 elements). The dense format uses 6 bits per register (max run length = 63 for 64-bit hashes).
Use Cases
- Unique visitor counting: track daily active users per product feature with a per-day-per-feature HLL key. PFMERGE to get weekly or monthly uniques.
- Distinct query counting: count unique query patterns hitting a search engine or database — identifies cardinality of the query space.
- A/B experiment reach: estimate distinct users exposed to each variant without storing user IDs.
SQL Schema
-- HLL counter registry
CREATE TABLE HLLCounter (
id BIGSERIAL PRIMARY KEY,
name TEXT NOT NULL UNIQUE,
precision INTEGER NOT NULL DEFAULT 14, -- b value, registers = 2^precision
registers BYTEA, -- serialized register array
last_updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Daily cardinality snapshots
CREATE TABLE HLLDailySnapshot (
id BIGSERIAL PRIMARY KEY,
counter_id BIGINT NOT NULL REFERENCES HLLCounter(id),
date DATE NOT NULL,
estimated_cardinality BIGINT NOT NULL,
serialized BYTEA NOT NULL,
UNIQUE(counter_id, date)
);
CREATE INDEX idx_hll_snap_counter ON HLLDailySnapshot(counter_id, date DESC);
Python Implementation
import math
import hashlib
import struct
from typing import Optional
class HyperLogLog:
"""
HyperLogLog cardinality estimator.
precision b: number of leading bits for bucket selection.
m = 2^b registers, standard error = 1.04/sqrt(m).
"""
ALPHA = {
4: 0.673,
5: 0.697,
6: 0.709,
}
ALPHA_DEFAULT = 0.7213
def __init__(self, precision: int = 14):
if not (4 <= precision <= 18):
raise ValueError("precision must be between 4 and 18")
self.b = precision
self.m = 1 < None:
h = int(hashlib.sha256(item.encode()).hexdigest(), 16)
# Use top b bits for bucket index
bucket = h >> (128 - self.b)
# Count leading zeros in remaining bits + 1
remaining = h & ((1 < self.registers[bucket]:
self.registers[bucket] = run_length
def estimate(self) -> int:
"""Return estimated cardinality."""
m = self.m
raw = self._alpha * m * m / sum(2.0 ** (-r) for r in self.registers)
# Small range correction: linear counting
zero_count = self.registers.count(0)
if raw 0:
return int(m * math.log(m / zero_count))
# Large range correction (for 32-bit hashes only, skip for 128-bit)
return int(raw)
def merge(self, other: "HyperLogLog") -> "HyperLogLog":
"""Return new HLL representing the union of this and other."""
if self.b != other.b:
raise ValueError("Cannot merge HLLs with different precision")
merged = HyperLogLog(self.b)
for i in range(self.m):
merged.registers[i] = max(self.registers[i], other.registers[i])
return merged
def serialize(self) -> bytes:
header = struct.pack(">B", self.b)
return header + bytes(self.registers)
@classmethod
def deserialize(cls, data: bytes) -> "HyperLogLog":
precision = struct.unpack(">B", data[:1])[0]
hll = cls(precision)
hll.registers = bytearray(data[1:])
return hll
@staticmethod
def _count_leading_zeros(value: int, total_bits: int) -> int:
if value == 0:
return total_bits
count = 0
mask = 1 <>= 1
return count
class RedisHLL:
"""Thin wrapper around Redis native HyperLogLog commands."""
def __init__(self, redis_client, key: str):
self.redis = redis_client
self.key = key
def add(self, *items: str) -> bool:
return bool(self.redis.pfadd(self.key, *items))
def estimate(self) -> int:
return self.redis.pfcount(self.key)
def merge_into(self, dest_key: str, *other_keys: str) -> None:
self.redis.pfmerge(dest_key, self.key, *other_keys)
Why Harmonic Mean?
The HLL estimator uses the harmonic mean of 2^register[j] values across buckets rather than the arithmetic mean. The arithmetic mean is skewed by outlier registers with large values (high run lengths from lucky hashes). The harmonic mean suppresses these outliers, giving a more stable and accurate estimate of the true cardinality.
See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering
See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture