Counting distinct elements is one of the most common operations in analytics: how many unique visitors hit your site today, how many distinct search terms were entered this week, how many unique IP addresses made requests to your API? The naive approach — store every element in a hash set and count — requires memory proportional to N, the number of distinct elements. For billions of events per day, that means gigabytes of RAM per counter. Probabilistic data structures like HyperLogLog solve this by trading a small, bounded error for dramatic memory savings.
Why Exact Counting Fails at Scale
Consider counting unique visitors across a web property serving 1 billion requests per day. Each visitor ID might be a 64-bit integer or a UUID. A hash set storing all unique IDs requires memory proportional to the number of distinct values. If there are 100 million unique visitors and each ID is 8 bytes, the set takes roughly 800MB — just for one counter, just for one day. Multiply that by thousands of metrics across dozens of time windows and the memory requirement becomes untenable. The key insight behind approximate counting is that we don’t need exact answers. Analytics use cases tolerate 1–3% error. If we can get that error bound with kilobytes instead of gigabytes, it’s an obvious win.
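The back-of-envelope arithmetic is easy to verify (figures here match the example above; a real hash set adds per-entry overhead on top of the raw ID bytes):

```python
# Raw storage cost of exact distinct counting for the example above.
unique_visitors = 100_000_000
bytes_per_id = 8  # 64-bit visitor IDs

raw_bytes = unique_visitors * bytes_per_id
print(f"Raw ID storage: {raw_bytes / 1e6:.0f} MB")

# A real hash set stores buckets and maintains a load factor,
# so its actual footprint is several times the raw ID bytes.
```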
FM Sketch: The Foundation (Flajolet-Martin 1983)
The Flajolet-Martin sketch was the first practical probabilistic cardinality estimator. The core observation: if you hash each element to a uniform random bit string, the probability that a hash has at least k leading zeros is 2^(-k). If you’ve seen n distinct elements, the maximum number of leading zeros you’d expect to observe is approximately log2(n). The algorithm: maintain a single integer representing the maximum number of leading zeros seen across all hashed elements. At query time, estimate cardinality as 2^(max_leading_zeros). This uses O(log log N) bits — essentially constant memory. The problem is high variance: a single unlucky hash can wildly distort the estimate. The fix is averaging: run multiple independent hash functions and average their estimates. This reduces variance but increases complexity. HyperLogLog builds on this foundation with a more principled approach to averaging.
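The single-estimator variant described above can be sketched in a few lines. This is illustrative only: `hash64` is a stand-in 64-bit hash derived from SHA-256, and the raw `2^max_lz` estimate carries the high variance discussed in the text.

```python
import hashlib

def hash64(x: str) -> int:
    # Stand-in 64-bit hash built from SHA-256; any high-quality
    # 64-bit hash function would do in practice.
    return int.from_bytes(hashlib.sha256(x.encode()).digest()[:8], "big")

def leading_zeros(h: int, bits: int = 64) -> int:
    # Count of leading zero bits in a `bits`-wide hash value.
    if h == 0:
        return bits
    return bits - h.bit_length()

class FMSketch:
    """Single-estimator Flajolet-Martin sketch: track the maximum
    leading-zero count across all hashed elements."""

    def __init__(self):
        self.max_lz = 0

    def add(self, x: str):
        self.max_lz = max(self.max_lz, leading_zeros(hash64(x)))

    def estimate(self) -> float:
        # Cardinality estimate: 2^(max leading zeros seen).
        return 2.0 ** self.max_lz
```

A single unlucky hash with many leading zeros can double or quadruple the estimate, which is exactly the variance problem HyperLogLog's register averaging addresses.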
HyperLogLog: Register-Based Estimation
HyperLogLog (Flajolet et al., 2007) uses M = 2^b registers to reduce variance. For each element: hash it to a 64-bit value, use the first b bits to select a register index (0 to M-1), count the number of leading zeros in the remaining bits plus one, update the selected register to max(current_value, leading_zeros + 1). To estimate cardinality, compute the harmonic mean of 2^register_value across all M registers and apply a bias correction constant. The harmonic mean is critical: it down-weights outlier registers with anomalously high values, giving a more stable estimate than the arithmetic mean. The formula is: estimate = alpha_m * M^2 * (sum of 2^(-register_i))^(-1), where alpha_m is a small correction constant that depends on M.
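The update and estimation steps above can be sketched as follows — an illustrative implementation with a SHA-256-derived hash and no small- or large-range corrections, not a production-ready one:

```python
import hashlib

def hash64(x: str) -> int:
    # Illustrative 64-bit hash derived from SHA-256.
    return int.from_bytes(hashlib.sha256(x.encode()).digest()[:8], "big")

class HyperLogLog:
    """Minimal HyperLogLog: M = 2^b registers, harmonic-mean estimate."""

    def __init__(self, b: int = 11):
        self.b = b
        self.m = 1 << b                  # M = 2^b registers
        self.registers = [0] * self.m
        # alpha_m bias-correction constant for M >= 128 (Flajolet et al., 2007)
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, x: str):
        h = hash64(x)
        idx = h >> (64 - self.b)         # first b bits select the register
        rest = h & ((1 << (64 - self.b)) - 1)
        # rank = leading zeros in the remaining 64-b bits, plus one
        rank = (64 - self.b) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        # Harmonic mean: estimate = alpha_m * M^2 / sum(2^(-register_i))
        z = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / z
```

With b = 11 this sketch tracks hundreds of thousands of distinct elements to within a few percent while holding only 2048 small integers.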
Error Rate and Memory Trade-off
The standard error of HyperLogLog is 1.04 / sqrt(M). With M = 2048 registers (b = 11), the error is approximately 1.04 / sqrt(2048) ≈ 2.3%. Each register needs only 6 bits: with a 64-bit hash, rank values top out around 64 − b + 1, and 6 bits cover 0–63. So 2048 registers require 2048 * 6 bits = 12,288 bits = 1.5 KB. Compare that to a hash set storing 100 million 64-bit IDs: 800 MB. That’s a compression ratio of over 500,000:1 for a 2.3% error bound. Increasing registers reduces error: 16,384 registers (b = 14) gives ~0.8% error at 12 KB. The relationship follows from the square root — quadrupling the registers halves the error, and the memory cost stays negligible at any practical register count.
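The trade-off table implied by these formulas can be computed directly (assuming 6-bit registers throughout):

```python
import math

# Standard error and register-array size for several precisions b.
for b in (10, 11, 14, 16):
    m = 1 << b                       # M = 2^b registers
    err = 1.04 / math.sqrt(m)        # relative standard error
    size_bytes = m * 6 // 8          # 6 bits per register
    print(f"b={b:2d}  M={m:6d}  error={err:.2%}  size={size_bytes} bytes")
```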
HyperLogLog++: Google’s Improvements
Google’s HyperLogLog++ paper (Heule et al., 2013) introduced two significant improvements. First, sparse representation: for small cardinalities, most registers are zero. HyperLogLog++ uses a compressed sparse encoding for registers until the cardinality grows large enough that dense representation becomes more efficient. This eliminates the fixed 12 KB overhead for small counters — a critical optimization when you’re maintaining millions of per-user or per-session counters. Second, empirical bias correction: the original HyperLogLog estimator has systematic bias at very low and very high cardinalities. HyperLogLog++ applies empirically derived correction tables to reduce this bias. The result is better accuracy across the full range of cardinalities, not just the middle range where the theoretical error bound applies. HyperLogLog++ is the implementation behind BigQuery’s COUNT(DISTINCT) and Redis’s HyperLogLog commands.
Merging HyperLogLogs
One of HyperLogLog’s most powerful properties is mergeability. To compute the union cardinality of two HyperLogLog sketches — for example, combining counts from two database shards — take the element-wise maximum of their register arrays. The result is a valid HyperLogLog representing the union of both input sets. This is exact: max(a_i, b_i) for each register i gives the same result as if you had inserted all elements from both sets into a single HyperLogLog from the start. This enables distributed cardinality estimation: run HyperLogLog locally on each shard, ship the small register arrays to a coordinator, merge with element-wise max, read off the estimate. No need to centralize raw data. This is how Presto’s APPROX_DISTINCT and most distributed analytics systems implement cross-shard distinct counts.
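The merge itself is a one-liner over the register arrays (a sketch assuming both inputs were built with the same precision b and the same hash function):

```python
def merge_hll(regs_a: list, regs_b: list) -> list:
    # Union of two HyperLogLog sketches: element-wise max of registers.
    # Valid only when both sketches share M and the hash function.
    assert len(regs_a) == len(regs_b), "sketches must share the same M"
    return [max(a, b) for a, b in zip(regs_a, regs_b)]
```

Because max is associative and commutative, shards can be merged in any order, or in a tree, and the result is identical to a single sketch fed all the raw data.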
Count-Min Sketch: Frequency Estimation
While HyperLogLog answers "how many distinct elements?", Count-Min Sketch answers "how often does element X appear?". The structure is a d × w array of counters with d independent hash functions. On insert(x): for each of the d hash functions h_i, increment counter[i][h_i(x) mod w]. On query(x): return min(counter[i][h_i(x) mod w]) across all d rows. The minimum is an overestimate because hash collisions cause some counters to be incremented by unrelated elements, but taking the minimum across d independent hash functions bounds the overcounting. Error bound: with probability 1 – delta, the estimate is within epsilon * N of the true count, where w = ceil(e / epsilon) and d = ceil(ln(1 / delta)). For 1% error with 99% confidence: w = 272, d = 5. Memory: about 1,360 counters — trivial. Applications include tracking heavy hitters in network traffic, per-IP rate limiting, and frequency estimation in stream processing.
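A compact sketch of the structure, using salted SHA-256 as a stand-in for the d independent hash functions (illustrative, not tuned for speed):

```python
import hashlib
import math

class CountMinSketch:
    """Count-Min sketch: d rows of w counters, queried via row-wise min."""

    def __init__(self, epsilon: float = 0.01, delta: float = 0.01):
        self.w = math.ceil(math.e / epsilon)      # counters per row
        self.d = math.ceil(math.log(1 / delta))   # number of rows
        self.table = [[0] * self.w for _ in range(self.d)]

    def _index(self, i: int, x: str) -> int:
        # Row salt i makes each row behave like an independent hash.
        h = hashlib.sha256(f"{i}:{x}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.w

    def add(self, x: str, count: int = 1):
        for i in range(self.d):
            self.table[i][self._index(i, x)] += count

    def query(self, x: str) -> int:
        # Minimum across rows bounds the collision-driven overcount.
        return min(self.table[i][self._index(i, x)] for i in range(self.d))
```

Note that query() never underestimates: every counter an element touches holds at least its true count, plus whatever colliding elements contributed.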
Production Applications
Redis implements HyperLogLog natively with PFADD (add elements) and PFCOUNT (estimate cardinality) commands, using 12 KB per sketch with ~0.81% error. PFMERGE merges multiple sketches into one. BigQuery’s COUNT(DISTINCT col) uses HyperLogLog++ internally with configurable precision. Presto/Trino expose APPROX_DISTINCT(x, e) where e is the desired max standard error. Database query optimizers use HyperLogLog sketches stored in table statistics to estimate cardinality for join reordering — a critical input to the cost-based optimizer. Apache Spark’s DataFrame API exposes approx_count_distinct with a configurable relative standard deviation parameter. The common thread: wherever exact distinct counting would be too expensive and small errors are acceptable, HyperLogLog is the standard tool.
Implementation Considerations
Choosing b (the number of hash bits used for the register index, which sets M = 2^b): b = 10 gives ~3.25% error at 768 bytes; b = 14 gives ~0.82% error at 12 KB. Most production systems default to b = 14. Hash function choice matters: use a high-quality 64-bit hash like MurmurHash3 or xxHash. Weak hashes lead to poor register distribution and degraded accuracy. For integer keys, avoid identity hashing — apply a mixing function first. Thread safety: HyperLogLog registers need atomic updates in multi-threaded environments. Redis sidesteps this by being single-threaded. Distributed systems merge at read time rather than locking shared registers. Serialization: the register array is compact and easily serialized for storage or network transfer. A 12 KB sketch fits in a single network packet. Compression (e.g., with zstd) can reduce it further for archival storage.
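For integer keys, the mixing step mentioned above can be as simple as a SplitMix64-style finalizer (shown here as an illustrative example; sequential IDs that would otherwise map to adjacent registers get spread uniformly):

```python
def mix64(x: int) -> int:
    # SplitMix64 finalizer: xor-shift and multiply rounds that spread
    # low-entropy integer keys uniformly across 64 bits.
    mask = (1 << 64) - 1
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9 & mask
    x = (x ^ (x >> 27)) * 0x94D049BB133111EB & mask
    return x ^ (x >> 31)
```

Without a step like this, identity-hashed sequential IDs all share the same high bits, so most elements land in a handful of registers and the estimate degrades badly.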
When Not to Use Approximate Counting
Approximate counting is the wrong tool when exact answers are required: billing systems, fraud detection with hard thresholds, deduplication before writing to a database, or any system where a 2% error translates to real money or correctness violations. It’s also the wrong tool when cardinality is small enough that a hash set is cheap — for fewer than ~10,000 elements, a simple set is both exact and inexpensive. The sweet spot is high-cardinality counting in analytics, monitoring, and query optimization contexts where approximate answers are explicitly acceptable and the memory savings matter.