Content fingerprinting detects duplicate or near-duplicate content at scale: identifying web pages that have been copied, finding similar images across billions of photos, or detecting reposted videos. Exact duplicate detection is solved by hashing. Near-duplicate detection — documents that are 95% similar — requires locality-sensitive hashing (LSH) techniques that are computationally efficient even for trillion-scale corpora.
Exact Duplicate Detection with Hashing
For exact duplicates (identical byte sequences): compute a cryptographic hash such as SHA-256 of the content. Store hashes in a hash set or database. Lookup is O(1). The chance that two distinct documents accidentally share a 128-bit or longer hash is negligible (about 1 in 2^128 for any given pair under MD5). Note, however, that MD5 is no longer collision-resistant: attackers can construct distinct inputs with identical MD5 digests, so prefer SHA-256 whenever adversarial content is a concern. For images and videos, hash the compressed bytes after normalizing format and metadata (strip EXIF, normalize to a canonical resolution). For text: normalize whitespace and encoding before hashing. Exact hash matching catches direct copies without any near-duplicate false positives.
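The normalize-then-hash flow can be sketched in a few lines of Python (SHA-256 via the standard hashlib; the whitespace collapsing shown is a minimal stand-in for the text canonicalization described above):

```python
import hashlib

def exact_fingerprint(text: str) -> str:
    """Normalize whitespace and encoding, then hash with SHA-256."""
    normalized = " ".join(text.split())  # collapse runs of whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen = set()  # in production: a database or distributed key-value store

def is_exact_duplicate(text: str) -> bool:
    """O(1) membership check against previously seen fingerprints."""
    fp = exact_fingerprint(text)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```

Because hashing happens after normalization, two documents that differ only in whitespace map to the same fingerprint.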
Simhash for Text Near-Duplicates
Simhash produces a 64-bit fingerprint for a document such that similar documents have fingerprints with few differing bits (low Hamming distance). Process: extract features (shingles: overlapping n-grams of 3-5 words), hash each feature to a 64-bit number, accumulate bit positions (add +1 if bit is set, -1 if not, weighted by feature frequency), and set the final fingerprint bit to 1 if the accumulated value is positive. Two documents with >90% similarity typically differ by 3 or fewer bits. Google used Simhash to detect near-duplicate web pages.
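A minimal sketch of the steps above (word-level 3-gram shingles, MD5 truncated to 64 bits as the per-feature hash; both choices are illustrative, not the specifics of any production system):

```python
import hashlib
from collections import Counter

def shingles(text: str, n: int = 3):
    """Overlapping n-grams of words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def feature_hash(feature: str) -> int:
    """64-bit hash of a feature (first 8 bytes of MD5; any stable hash works)."""
    return int.from_bytes(hashlib.md5(feature.encode()).digest()[:8], "big")

def simhash(text: str) -> int:
    counts = [0] * 64
    for feature, weight in Counter(shingles(text)).items():
        h = feature_hash(feature)
        for bit in range(64):
            # +weight if the feature's hash has this bit set, -weight otherwise
            counts[bit] += weight if (h >> bit) & 1 else -weight
    fp = 0
    for bit in range(64):
        if counts[bit] > 0:  # positive accumulator -> fingerprint bit is 1
            fp |= 1 << bit
    return fp

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")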
MinHash and Jaccard Similarity
MinHash estimates Jaccard similarity (intersection over union of document feature sets) without comparing the full feature sets. For each of k hash functions, compute the minimum hash value over all shingles in the document. For any one hash function, the probability that two documents share the same minimum equals their Jaccard similarity, so the fraction of positions where two signatures agree is an unbiased estimate of it. With k=200 hash functions the standard error of the estimate is sqrt(s(1-s)/k), at most about 0.035. Comparing two signatures costs only k integer comparisons rather than a full shingle-set intersection; comparing every document pair is still O(n^2) signature comparisons, however, which is the problem LSH solves.
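The per-function minimum can be sketched by seeding one base hash k ways (seeded MD5 here stands in for k independent hash functions, an illustrative choice):

```python
import hashlib

K = 200  # number of hash functions

def minhash_signature(features, k=K):
    """One minimum over all features per hash function; seed i simulates the i-th function."""
    sig = []
    for i in range(k):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{i}:{f}".encode()).digest()[:8], "big")
            for f in features))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Production systems typically use much cheaper seeded hash families (e.g. universal hashing) rather than k MD5 invocations per feature.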
Locality-Sensitive Hashing (LSH)
LSH converts the problem “find all documents within Jaccard similarity threshold T” into efficient approximate lookups. Divide the MinHash signature into b bands of r rows each and hash each band independently. Two documents with Jaccard similarity s land in the same bucket for at least one band with probability 1 - (1 - s^r)^b, an S-shaped curve that rises sharply near the threshold s ≈ (1/b)^(1/r); choose b and r so this threshold matches T. Candidates (documents in the same bucket for at least one band) are verified with exact similarity computation. LSH turns candidate generation into an O(n) scan instead of an O(n^2) pairwise comparison.
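The banding scheme might look like this (b=50 bands of r=4 rows over a 200-element signature, giving a threshold near (1/50)^(1/4) ≈ 0.38; the parameters are illustrative):

```python
from collections import defaultdict

B, R = 50, 4  # 50 bands × 4 rows = 200-element signatures

buckets = [defaultdict(list) for _ in range(B)]  # one hash table per band

def index_document(doc_id, signature):
    """File the document under its bucket key in every band."""
    for band in range(B):
        key = tuple(signature[band * R:(band + 1) * R])
        buckets[band][key].append(doc_id)

def candidates(signature):
    """Documents sharing at least one band bucket; verify these exactly afterwards."""
    out = set()
    for band in range(B):
        key = tuple(signature[band * R:(band + 1) * R])
        out.update(buckets[band].get(key, []))
    return out
```

Each query touches b hash tables regardless of corpus size, which is what makes the candidate scan linear in the number of documents rather than quadratic.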
Perceptual Hashing for Images
Perceptual hashing (pHash) generates a fingerprint based on visual content rather than bytes. Process: resize image to 32×32, convert to grayscale, apply DCT (discrete cosine transform), take the top-left 8×8 DCT coefficients (low frequencies representing structure), compare each coefficient to the median — produce a 64-bit fingerprint. Two visually similar images (same photo with different compression, slight crop, brightness adjustment) have pHashes within 8-10 bits of Hamming distance. Used by major platforms for detecting reposted images.
Video Fingerprinting
YouTube Content ID and similar systems use audio and video fingerprinting to detect re-uploads. Video fingerprinting: sample frames at regular intervals, compute perceptual hash per frame, store as a time-indexed sequence. Match by comparing frame hash sequences. Audio fingerprinting (Shazam algorithm): convert audio to spectrograms, detect peak frequency-time pairs, form “constellation maps,” hash pairs of peaks as fingerprints. A 3-second audio sample produces dozens of fingerprints that can be matched against a corpus of billions in milliseconds.
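The constellation-map hashing step can be sketched as follows, assuming peak extraction has already produced (time, frequency) pairs. The fan_out and max_dt parameters are illustrative, in the style of the Shazam landmark scheme, not its exact values:

```python
from collections import Counter

def constellation_hashes(peaks, fan_out=5, max_dt=3.0):
    """Pair each anchor peak with the next few peaks (its 'target zone') and
    hash (anchor_freq, target_freq, time_delta). peaks: (time_sec, freq_hz),
    sorted by time."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt <= max_dt:
                hashes.append(((f1, f2, round(dt, 2)), t1))  # (key, anchor time)
    return hashes

def build_index(track_id, peaks, index=None):
    index = index if index is not None else {}
    for key, t in constellation_hashes(peaks):
        index.setdefault(key, []).append((track_id, t))
    return index

def match(index, query_peaks):
    """Vote per (track, time offset); a true match piles votes on one offset."""
    votes = Counter()
    for key, qt in constellation_hashes(query_peaks):
        for track_id, t in index.get(key, []):
            votes[(track_id, round(t - qt, 2))] += 1
    return votes.most_common(1)
```

Keying the votes on the time offset (not just the track) is what makes the scheme robust: random hash collisions scatter across offsets, while a genuine re-upload aligns at a single one.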
Scalability Considerations
At web scale (billions of documents), pairwise comparison is infeasible. Fingerprinting pipelines must be incremental: compute the fingerprint for each new document and query the fingerprint index for near-matches rather than reprocessing all existing documents. Store fingerprints in an inverted index keyed by fingerprint subsets (LSH bands or Simhash blocks). Distributed fingerprint indexes partition by hash range across machines. New content is fingerprinted asynchronously after ingestion, with a delay of seconds to minutes before deduplication results are available.
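One classic incremental index for 64-bit Simhash fingerprints rests on the pigeonhole principle: split each fingerprint into four 16-bit blocks; any stored fingerprint within Hamming distance 3 of a query must match it exactly in at least one block, so indexing under all four block values turns a near-match query into four exact probes plus verification. A minimal in-memory sketch (a production system would shard these tables by hash range as described above):

```python
from collections import defaultdict

BLOCKS = 4            # 64 bits -> four 16-bit blocks
MASK = (1 << 16) - 1

index = [defaultdict(set) for _ in range(BLOCKS)]  # block position -> value -> ids
fingerprints = {}     # doc id -> full fingerprint, for exact verification

def blocks(fp):
    return [(fp >> (16 * i)) & MASK for i in range(BLOCKS)]

def add(doc_id, fp):
    fingerprints[doc_id] = fp
    for i, b in enumerate(blocks(fp)):
        index[i][b].add(doc_id)

def near_duplicates(fp, max_dist=3):
    """<=3 flipped bits can touch at most 3 of 4 blocks, so one block matches exactly."""
    cands = set()
    for i, b in enumerate(blocks(fp)):
        cands |= index[i].get(b, set())
    return [d for d in cands
            if bin(fingerprints[d] ^ fp).count("1") <= max_dist]
```

New documents call add() once and are immediately queryable, so no existing content is ever reprocessed.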