Counting distinct elements is one of the most common operations in analytics: how many unique visitors hit your site today, how many distinct search terms were entered this week, how many unique IP addresses made requests to your API? The naive approach — store every element in a hash set and count — requires memory proportional to N, the number of distinct elements. For billions of events per day, that means gigabytes of RAM per counter. Probabilistic data structures like HyperLogLog solve this by trading a small, bounded error for dramatic memory savings.
Why Exact Counting Fails at Scale
Consider counting unique visitors across a web property serving 1 billion requests per day. Each visitor ID might be a 64-bit integer or a UUID. A hash set storing all unique IDs requires memory proportional to the number of distinct values. If there are 100 million unique visitors and each ID is 8 bytes, the set takes roughly 800MB — just for one counter, just for one day. Multiply that by thousands of metrics across dozens of time windows and the memory requirement becomes untenable. The key insight behind approximate counting is that we don’t need exact answers. Analytics use cases tolerate 1–3% error. If we can get that error bound with kilobytes instead of gigabytes, it’s an obvious win.
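The back-of-envelope arithmetic is easy to verify (figures here match the example above; a real hash set adds per-entry overhead on top of the raw ID bytes):

```python
# Raw storage cost of exact distinct counting for the example above.
unique_visitors = 100_000_000
bytes_per_id = 8  # 64-bit visitor IDs

raw_bytes = unique_visitors * bytes_per_id
print(f"Raw ID storage: {raw_bytes / 1e6:.0f} MB")

# A real hash set stores buckets and maintains a load factor,
# so its actual footprint is several times the raw ID bytes.
```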
FM Sketch: The Foundation (Flajolet-Martin 1983)
The Flajolet-Martin sketch was the first practical probabilistic cardinality estimator. The core observation: if you hash each element to a uniform random bit string, the probability that a hash has at least k leading zeros is 2^(-k). If you’ve seen n distinct elements, the maximum number of leading zeros you’d expect to observe is approximately log2(n). The algorithm: maintain a single integer representing the maximum number of leading zeros seen across all hashed elements. At query time, estimate cardinality as 2^(max_leading_zeros). This uses O(log log N) bits — essentially constant memory. The problem is high variance: a single unlucky hash can wildly distort the estimate. The fix is averaging: run multiple independent hash functions and average their estimates. This reduces variance but increases complexity. HyperLogLog builds on this foundation with a more principled approach to averaging.
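The single-estimator variant described above can be sketched in a few lines. This is illustrative only: `hash64` is a stand-in 64-bit hash derived from SHA-256, and the raw `2^max_lz` estimate carries the high variance discussed in the text.

```python
import hashlib

def hash64(x: str) -> int:
    # Stand-in 64-bit hash built from SHA-256; any high-quality
    # 64-bit hash function would do in practice.
    return int.from_bytes(hashlib.sha256(x.encode()).digest()[:8], "big")

def leading_zeros(h: int, bits: int = 64) -> int:
    # Count of leading zero bits in a `bits`-wide hash value.
    if h == 0:
        return bits
    return bits - h.bit_length()

class FMSketch:
    """Single-estimator Flajolet-Martin sketch: track the maximum
    leading-zero count across all hashed elements."""

    def __init__(self):
        self.max_lz = 0

    def add(self, x: str):
        self.max_lz = max(self.max_lz, leading_zeros(hash64(x)))

    def estimate(self) -> float:
        # Cardinality estimate: 2^(max leading zeros seen).
        return 2.0 ** self.max_lz
```

A single unlucky hash with many leading zeros can double or quadruple the estimate, which is exactly the variance problem HyperLogLog's register averaging addresses.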
HyperLogLog: Register-Based Estimation
HyperLogLog (Flajolet et al., 2007) uses M = 2^b registers to reduce variance. For each element: hash it to a 64-bit value, use the first b bits to select a register index (0 to M-1), count the number of leading zeros in the remaining bits plus one, update the selected register to max(current_value, leading_zeros + 1). To estimate cardinality, compute the harmonic mean of 2^register_value across all M registers and apply a bias correction constant. The harmonic mean is critical: it down-weights outlier registers with anomalously high values, giving a more stable estimate than the arithmetic mean. The formula is: estimate = alpha_m * M^2 * (sum of 2^(-register_i))^(-1), where alpha_m is a small correction constant that depends on M.
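The update and estimation steps above can be sketched as follows — an illustrative implementation with a SHA-256-derived hash and no small- or large-range corrections, not a production-ready one:

```python
import hashlib

def hash64(x: str) -> int:
    # Illustrative 64-bit hash derived from SHA-256.
    return int.from_bytes(hashlib.sha256(x.encode()).digest()[:8], "big")

class HyperLogLog:
    """Minimal HyperLogLog: M = 2^b registers, harmonic-mean estimate."""

    def __init__(self, b: int = 11):
        self.b = b
        self.m = 1 << b                  # M = 2^b registers
        self.registers = [0] * self.m
        # alpha_m bias-correction constant for M >= 128 (Flajolet et al., 2007)
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, x: str):
        h = hash64(x)
        idx = h >> (64 - self.b)         # first b bits select the register
        rest = h & ((1 << (64 - self.b)) - 1)
        # rank = leading zeros in the remaining 64-b bits, plus one
        rank = (64 - self.b) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        # Harmonic mean: estimate = alpha_m * M^2 / sum(2^(-register_i))
        z = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / z
```

With b = 11 this sketch tracks hundreds of thousands of distinct elements to within a few percent while holding only 2048 small integers.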
Error Rate and Memory Trade-off
The standard error of HyperLogLog is 1.04 / sqrt(M). With M = 2048 registers (b = 11), the error is approximately 1.04 / sqrt(2048) ≈ 2.3%. Each register needs only 6 bits: with a 64-bit hash, rank values top out around 64 − b + 1, and 6 bits cover 0–63. So 2048 registers require 2048 * 6 bits = 12,288 bits = 1.5 KB. Compare that to a hash set storing 100 million 64-bit IDs: 800 MB. That’s a compression ratio of over 500,000:1 for a 2.3% error bound. Increasing registers reduces error: 16,384 registers (b = 14) gives ~0.8% error at 12 KB. The relationship follows from the square root — quadrupling the registers halves the error, and the memory cost stays negligible at any practical register count.
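The trade-off table implied by these formulas can be computed directly (assuming 6-bit registers throughout):

```python
import math

# Standard error and register-array size for several precisions b.
for b in (10, 11, 14, 16):
    m = 1 << b                       # M = 2^b registers
    err = 1.04 / math.sqrt(m)        # relative standard error
    size_bytes = m * 6 // 8          # 6 bits per register
    print(f"b={b:2d}  M={m:6d}  error={err:.2%}  size={size_bytes} bytes")
```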
HyperLogLog++: Google’s Improvements
Google’s HyperLogLog++ paper (Heule et al., 2013) introduced two significant improvements. First, sparse representation: for small cardinalities, most registers are zero. HyperLogLog++ uses a compressed sparse encoding for registers until the cardinality grows large enough that dense representation becomes more efficient. This eliminates the fixed 12 KB overhead for small counters — a critical optimization when you’re maintaining millions of per-user or per-session counters. Second, empirical bias correction: the original HyperLogLog estimator has systematic bias at very low and very high cardinalities. HyperLogLog++ applies empirically derived correction tables to reduce this bias. The result is better accuracy across the full range of cardinalities, not just the middle range where the theoretical error bound applies. HyperLogLog++ is the implementation behind BigQuery’s COUNT(DISTINCT) and Redis’s HyperLogLog commands.
Merging HyperLogLogs
One of HyperLogLog’s most powerful properties is mergeability. To compute the union cardinality of two HyperLogLog sketches — for example, combining counts from two database shards — take the element-wise maximum of their register arrays. The result is a valid HyperLogLog representing the union of both input sets. This is exact: max(a_i, b_i) for each register i gives the same result as if you had inserted all elements from both sets into a single HyperLogLog from the start. This enables distributed cardinality estimation: run HyperLogLog locally on each shard, ship the small register arrays to a coordinator, merge with element-wise max, read off the estimate. No need to centralize raw data. This is how Presto’s APPROX_DISTINCT and most distributed analytics systems implement cross-shard distinct counts.
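The merge itself is a one-liner over the register arrays (a sketch assuming both inputs were built with the same precision b and the same hash function):

```python
def merge_hll(regs_a: list, regs_b: list) -> list:
    # Union of two HyperLogLog sketches: element-wise max of registers.
    # Valid only when both sketches share M and the hash function.
    assert len(regs_a) == len(regs_b), "sketches must share the same M"
    return [max(a, b) for a, b in zip(regs_a, regs_b)]
```

Because max is associative and commutative, shards can be merged in any order, or in a tree, and the result is identical to a single sketch fed all the raw data.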
Count-Min Sketch: Frequency Estimation
While HyperLogLog answers "how many distinct elements?", Count-Min Sketch answers "how often does element X appear?". The structure is a d × w array of counters with d independent hash functions. On insert(x): for each of the d hash functions h_i, increment counter[i][h_i(x) mod w]. On query(x): return min(counter[i][h_i(x) mod w]) across all d rows. The minimum is an overestimate because hash collisions cause some counters to be incremented by unrelated elements, but taking the minimum across d independent hash functions bounds the overcounting. Error bound: with probability 1 – delta, the estimate is within epsilon * N of the true count, where w = ceil(e / epsilon) and d = ceil(ln(1 / delta)). For 1% error with 99% confidence: w = 272, d = 5. Memory: about 1,360 counters — trivial. Applications include tracking heavy hitters in network traffic, per-IP rate limiting, and frequency estimation in stream processing.
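A compact sketch of the structure, using salted SHA-256 as a stand-in for the d independent hash functions (illustrative, not tuned for speed):

```python
import hashlib
import math

class CountMinSketch:
    """Count-Min sketch: d rows of w counters, queried via row-wise min."""

    def __init__(self, epsilon: float = 0.01, delta: float = 0.01):
        self.w = math.ceil(math.e / epsilon)      # counters per row
        self.d = math.ceil(math.log(1 / delta))   # number of rows
        self.table = [[0] * self.w for _ in range(self.d)]

    def _index(self, i: int, x: str) -> int:
        # Row salt i makes each row behave like an independent hash.
        h = hashlib.sha256(f"{i}:{x}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.w

    def add(self, x: str, count: int = 1):
        for i in range(self.d):
            self.table[i][self._index(i, x)] += count

    def query(self, x: str) -> int:
        # Minimum across rows bounds the collision-driven overcount.
        return min(self.table[i][self._index(i, x)] for i in range(self.d))
```

Note that query() never underestimates: every counter an element touches holds at least its true count, plus whatever colliding elements contributed.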
Production Applications
Redis implements HyperLogLog natively with PFADD (add elements) and PFCOUNT (estimate cardinality) commands, using 12 KB per sketch with ~0.81% error. PFMERGE merges multiple sketches into one. BigQuery’s COUNT(DISTINCT col) uses HyperLogLog++ internally with configurable precision. Presto/Trino expose APPROX_DISTINCT(x, e) where e is the desired max standard error. Database query optimizers use HyperLogLog sketches stored in table statistics to estimate cardinality for join reordering — a critical input to the cost-based optimizer. Apache Spark’s DataFrame API exposes approx_count_distinct with a configurable relative standard deviation parameter. The common thread: wherever exact distinct counting would be too expensive and small errors are acceptable, HyperLogLog is the standard tool.
Implementation Considerations
Choosing b (the number of hash bits used for the register index, which sets M = 2^b): b = 10 gives ~3.25% error at 768 bytes; b = 14 gives ~0.82% error at 12 KB. Most production systems default to b = 14. Hash function choice matters: use a high-quality 64-bit hash like MurmurHash3 or xxHash. Weak hashes lead to poor register distribution and degraded accuracy. For integer keys, avoid identity hashing — apply a mixing function first. Thread safety: HyperLogLog registers need atomic updates in multi-threaded environments. Redis sidesteps this by being single-threaded. Distributed systems merge at read time rather than locking shared registers. Serialization: the register array is compact and easily serialized for storage or network transfer. A 12 KB sketch fits in a single network packet. Compression (e.g., with zstd) can reduce it further for archival storage.
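For integer keys, the mixing step mentioned above can be as simple as a SplitMix64-style finalizer (shown here as an illustrative example; sequential IDs that would otherwise map to adjacent registers get spread uniformly):

```python
def mix64(x: int) -> int:
    # SplitMix64 finalizer: xor-shift and multiply rounds that spread
    # low-entropy integer keys uniformly across 64 bits.
    mask = (1 << 64) - 1
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9 & mask
    x = (x ^ (x >> 27)) * 0x94D049BB133111EB & mask
    return x ^ (x >> 31)
```

Without a step like this, identity-hashed sequential IDs all share the same high bits, so most elements land in a handful of registers and the estimate degrades badly.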
When Not to Use Approximate Counting
Approximate counting is the wrong tool when exact answers are required: billing systems, fraud detection with hard thresholds, deduplication before writing to a database, or any system where a 2% error translates to real money or correctness violations. It’s also the wrong tool when cardinality is small enough that a hash set is cheap — for fewer than ~10,000 elements, a simple set is both exact and inexpensive. The sweet spot is high-cardinality counting in analytics, monitoring, and query optimization contexts where approximate answers are explicitly acceptable and the memory savings matter.