Deduplication Service Low-Level Design: Content Hashing, Fingerprint Index, and Storage Savings

What Is a Deduplication Service?

A deduplication service eliminates redundant copies of data by identifying identical content and storing only one physical copy regardless of how many logical references point to it. This is fundamental in backup systems, file sync services, and object stores where users frequently upload the same files. The service computes content hashes, maintains a fingerprint index mapping hashes to storage locations, and handles chunk-level deduplication for large objects where partial overlap is common. Storage savings reporting closes the feedback loop for capacity planning.

Requirements

Functional Requirements

  • Compute SHA-256 content hash for each uploaded object or chunk.
  • Maintain a fingerprint index mapping content hash to physical storage location and reference count.
  • Chunk-level deduplication: split large objects into variable-length chunks using content-defined chunking (CDC) so partial edits do not invalidate the entire object hash.
  • Reference counting: track how many logical objects share each physical chunk.
  • Storage savings reporting: expose bytes deduplicated, deduplication ratio, and savings by namespace or time period.

Non-Functional Requirements

  • Fingerprint lookup must complete in under a millisecond so the dedup check adds no perceptible latency to the upload path.
  • Index must scale to billions of fingerprints without degrading lookup performance.
  • Garbage collection must safely delete physical chunks only when reference count reaches zero.
  • Hash collision probability must be negligible (SHA-256 is sufficient for practical deduplication).

Data Model

  • FingerprintEntry — content_hash (PK, SHA-256 hex), storage_node_id, byte_offset, size_bytes, reference_count, created_at, last_referenced_at.
  • ObjectManifest — object_id (PK), namespace_id, total_size_bytes, chunk_hashes (ordered list of SHA-256), created_at.
  • DeduplicationStats — namespace_id, date, logical_bytes_written, physical_bytes_stored, bytes_deduplicated, dedup_ratio.
  • GCQueue — content_hash, enqueued_at, deletion_eligible_at — holds hashes whose reference_count has dropped to zero pending safe deletion.

FingerprintEntry is the hot path data structure and must live in a hash-indexed store with O(1) lookups — a distributed hash table like Redis Cluster or a purpose-built LSM-tree store (RocksDB) partitioned by the first N bits of the hash.
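The two hot-path entities could be sketched as plain dataclasses (a minimal sketch; field names follow the data model above, while concrete types such as float epoch timestamps are assumptions):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FingerprintEntry:
    content_hash: str               # SHA-256 hex digest (primary key)
    storage_node_id: str
    byte_offset: int
    size_bytes: int
    reference_count: int = 1
    created_at: float = 0.0         # epoch seconds (assumed representation)
    last_referenced_at: float = 0.0

@dataclass
class ObjectManifest:
    object_id: str
    namespace_id: str
    total_size_bytes: int
    chunk_hashes: List[str] = field(default_factory=list)  # ordered chunk list
    created_at: float = 0.0
```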

Core Algorithms

Content-Defined Chunking

Variable-length chunking using Rabin fingerprinting finds natural chunk boundaries within file content. The algorithm slides a rolling hash window over the byte stream and declares a chunk boundary whenever the hash modulo a divisor equals a magic value; the divisor is chosen so chunks average the target size (e.g., 4MB), with boundaries determined by content rather than fixed offsets. The key property: if a file is edited in the middle, only the chunks containing the edit change, while all preceding and following chunks retain the same hash and deduplicate against stored copies.
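A minimal content-defined chunker can be sketched with a Rabin-Karp style rolling hash. This is illustrative only: production systems typically use Rabin fingerprinting proper or FastCDC, and the window, mask, and chunk bounds below are tuned for a small ~8 KiB average rather than the 4MB example above.

```python
WINDOW = 48                    # rolling-hash window in bytes
MASK = (1 << 13) - 1           # 13-bit mask -> ~8 KiB average chunk size
MAGIC = 0                      # boundary when (hash & MASK) == MAGIC
MIN_CHUNK = 2 * 1024           # never cut before this many bytes
MAX_CHUNK = 64 * 1024          # always cut by this many bytes
BASE, MOD = 257, (1 << 31) - 1

def chunk_boundaries(data: bytes):
    """Yield (start, end) chunk spans; boundaries depend on content, not offsets."""
    pow_out = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    start, h = 0, 0
    for i, b in enumerate(data):
        if i - start >= WINDOW:            # slide: drop the byte leaving the window
            h = (h - data[i - WINDOW] * pow_out) % MOD
        h = (h * BASE + b) % MOD           # add the incoming byte
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == MAGIC) or length >= MAX_CHUNK:
            yield (start, i + 1)
            start, h = i + 1, 0            # reset hash state for the next chunk
    if start < len(data):
        yield (start, len(data))           # final partial chunk
```

Because the hash depends only on the last WINDOW bytes, boundaries after a middle-of-file edit resynchronize with the original boundaries once the window clears the edited region.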

Fingerprint Lookup and Write Path

For each chunk hash the service performs a point lookup in FingerprintEntry. On a hit (hash exists) the chunk is already stored: the service increments reference_count and records the hash in the ObjectManifest without writing any bytes to the storage node. This is a zero-byte write for the deduped chunk. On a miss the service writes the chunk to a storage node, creates a FingerprintEntry with reference_count=1, and records it in the manifest. The lookup and insert must be an atomic check-and-set to prevent a race where two concurrent uploads of the same chunk both miss the index and write two copies; reference count increments must likewise be atomic.
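The write path above can be sketched with an in-memory stand-in for the index; a single-process lock stands in for the distributed atomic check-and-set, and all names are illustrative:

```python
import hashlib
import threading

class InMemoryStorage:
    """Stand-in for a storage node; returns a location token per write."""
    def __init__(self):
        self.blobs = []

    def write(self, chunk: bytes) -> int:
        self.blobs.append(chunk)
        return len(self.blobs) - 1

class FingerprintIndex:
    def __init__(self):
        self._entries = {}            # content_hash -> {"location", "refs"}
        self._lock = threading.Lock()

    def put_chunk(self, chunk: bytes, storage, manifest: list) -> int:
        """Dedup-aware write; returns physical bytes written (0 on a hit)."""
        h = hashlib.sha256(chunk).hexdigest()
        with self._lock:              # lookup + insert is one atomic step
            entry = self._entries.get(h)
            if entry is not None:     # hit: bump refcount, write no bytes
                entry["refs"] += 1
                manifest.append(h)
                return 0
            location = storage.write(chunk)   # miss: store and register
            self._entries[h] = {"location": location, "refs": 1}
            manifest.append(h)
            return len(chunk)
```

A real deployment would write the chunk bytes outside the index's critical section and let the losing writer of a concurrent duplicate discard its copy, rather than holding a lock across storage I/O.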

Garbage Collection

When an object is deleted the service decrements reference_count for each hash in its ObjectManifest. Hashes reaching zero are added to GCQueue with a deletion_eligible_at timestamp set to now plus a safety delay (e.g., 1 hour). The delay provides a window to abort GC if a concurrent upload re-references the hash before physical deletion. The GC worker processes GCQueue entries past their deletion_eligible_at time, verifies reference_count is still zero (re-checking atomically), deletes the physical chunk from the storage node, and removes the FingerprintEntry.
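The delete path could be sketched as follows; the one-hour safety delay matches the example above, while the in-memory heap is illustrative (a real system would persist the queue):

```python
import heapq

GC_SAFETY_DELAY_SECONDS = 3600   # safety window before physical deletion

class GCQueue:
    """Min-heap of (deletion_eligible_at, content_hash) pairs."""
    def __init__(self):
        self._heap = []

    def enqueue(self, content_hash: str, now: float):
        heapq.heappush(self._heap, (now + GC_SAFETY_DELAY_SECONDS, content_hash))

    def pop_due(self, now: float):
        """Return hashes whose safety delay has elapsed."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due

def delete_object(manifest, ref_counts, gc_queue, now: float):
    """Decrement refs for every chunk in the manifest; enqueue zero-ref hashes."""
    for h in manifest:
        ref_counts[h] -= 1
        if ref_counts[h] == 0:
            gc_queue.enqueue(h, now)
```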

API Design

  • POST /dedup/chunk-boundaries — Given an object stream or size hint, returns recommended chunk boundaries (client-side CDC) or accepts a byte stream and performs server-side CDC.
  • POST /dedup/fingerprints/lookup — Batch lookup of chunk hashes; returns which hashes are already stored (dedup hits) and which need upload.
  • POST /dedup/fingerprints/{hash}/reference — Increment reference count for an existing hash (called when a new object reuses a known chunk).
  • DELETE /dedup/objects/{object_id} — Decrement references for all chunks in the object manifest; enqueue zero-ref chunks for GC.
  • GET /dedup/stats?namespace_id=X&start=Y&end=Z — Returns deduplication ratio and bytes saved for the namespace over the time range.
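Tying the endpoints together, a client-side upload loop might look like the sketch below. The `api` object is a hypothetical stand-in for the HTTP calls listed above, with a fake implementation included so the flow is runnable:

```python
import hashlib

def upload_object(chunks, api):
    """Hash every chunk, batch-look them up, upload only the misses.
    Returns the ordered hash list for the ObjectManifest."""
    hashes = [hashlib.sha256(c).hexdigest() for c in chunks]
    known = api.lookup(hashes)        # POST /dedup/fingerprints/lookup
    for h, chunk in zip(hashes, chunks):
        if h in known:
            api.reference(h)          # POST /dedup/fingerprints/{hash}/reference
        else:
            api.upload(h, chunk)      # store the new chunk bytes
    return hashes

class FakeAPI:
    """In-memory stand-in for the dedup service endpoints."""
    def __init__(self):
        self.stored = {}
        self.referenced = []

    def lookup(self, hashes):
        return {h for h in hashes if h in self.stored}

    def reference(self, h):
        self.referenced.append(h)

    def upload(self, h, chunk):
        self.stored[h] = chunk
```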

Scalability and Reliability

Index Partitioning at Billion-Fingerprint Scale

The FingerprintEntry index is partitioned by the first 16 bits of the content hash, yielding 65536 logical shards. Each shard is served by a node in a distributed hash table. Consistent hashing with virtual nodes allows adding capacity without rehashing the entire index. Bloom filters per shard provide a fast probabilistic miss check before the full index lookup, reducing disk IOPS for rare hashes that are unlikely to be in the index.
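Shard routing on the leading hash bits and the per-shard Bloom filter can be sketched as follows (filter size and hash count are illustrative):

```python
import hashlib

def shard_for(content_hash_hex: str, bits: int = 16) -> int:
    """Route a hex digest to one of 2**bits logical shards via its leading bits."""
    nibbles = (bits + 3) // 4                 # hex chars needed to cover `bits`
    return int(content_hash_hex[:nibbles], 16) >> ((-bits) % 4)

class BloomFilter:
    """Fixed-size Bloom filter; might_contain() has no false negatives."""
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
        self.size, self.k = size_bits, num_hashes
        self._bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.k):               # carve k positions out of one digest
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self._bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        return all(self._bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))
```

A negative `might_contain` answer lets the shard skip the full index read entirely, which is the common case for unique chunks.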

Atomic Reference Count Updates

Reference count increments and decrements are performed as atomic operations (Redis INCR/DECR or a conditional database update). The GC check (reference_count == 0 before deletion) is also atomic with the deletion to prevent a race where a concurrent upload increments the count to 1 after GC reads 0 but before GC deletes the entry. If the atomic check-and-delete fails because the count is no longer zero, GC aborts the deletion and removes the hash from GCQueue.
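The check-and-delete race can be illustrated with a lock-guarded sketch; a Redis deployment would typically place the same logic in a Lua script so it executes atomically server-side:

```python
import threading

class RefCountedIndex:
    """Reference counts with an atomic GC check-and-delete."""
    def __init__(self):
        self._refs = {}
        self._lock = threading.Lock()

    def add(self, h: str):
        with self._lock:
            self._refs[h] = 1

    def increment(self, h: str) -> bool:
        """Returns False if the entry is gone and the caller must re-upload."""
        with self._lock:
            if h not in self._refs:
                return False
            self._refs[h] += 1
            return True

    def decrement(self, h: str) -> int:
        with self._lock:
            self._refs[h] -= 1
            return self._refs[h]

    def gc_try_delete(self, h: str, delete_physical) -> bool:
        """Delete only if the count is still zero; abort if it was revived."""
        with self._lock:
            if self._refs.get(h, 0) != 0:
                return False          # concurrent upload won; drop from GCQueue
            self._refs.pop(h, None)
            delete_physical(h)        # physical delete inside the atomic step
            return True
```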

Storage Savings Reporting

On every upload the service records logical_bytes_written (the uncompressed object size) and physical_bytes_stored (the bytes actually written to storage nodes, zero for dedup hits). DeduplicationStats are aggregated by namespace and day in a time-series table. The deduplication ratio is logical_bytes_written divided by physical_bytes_stored. A dashboard surfaces per-namespace ratios, trending over time, enabling capacity planning teams to project storage growth and cost savings from deduplication investments.
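The aggregation could be sketched as follows; field names mirror the DeduplicationStats model, and the input tuple shape is an assumption:

```python
from collections import defaultdict

def aggregate_stats(uploads):
    """uploads: iterable of (namespace_id, day, logical_bytes, physical_bytes),
    where physical_bytes is 0 for a full dedup hit.
    Returns one stats row per (namespace_id, day)."""
    totals = defaultdict(lambda: [0, 0])
    for namespace_id, day, logical, physical in uploads:
        totals[(namespace_id, day)][0] += logical
        totals[(namespace_id, day)][1] += physical
    return {
        key: {
            "logical_bytes_written": lw,
            "physical_bytes_stored": ps,
            "bytes_deduplicated": lw - ps,
            "dedup_ratio": lw / ps if ps else float("inf"),
        }
        for key, (lw, ps) in totals.items()
    }
```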

Interview Tips

Key interview probes: how do you prevent a hash collision from serving wrong data? (verify chunk content on read by recomputing the hash; a mismatch triggers an error and re-fetch. SHA-256 collisions are computationally infeasible, but verifying is cheap). How do you handle deduplication across namespaces (cross-user dedup)? (a single global fingerprint index enables it; the trade-off is privacy: one user can detect whether another user has already uploaded the same file by observing upload speed, so most services restrict dedup to within a single user namespace). How does CDC interact with encrypted uploads? (client-side encryption before chunking defeats server-side dedup because the same plaintext encrypted with different keys produces different ciphertext; convergent encryption or client-side dedup is required).

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How is SHA-256 content hashing used to detect duplicate data?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Each chunk or object is hashed with SHA-256 before write. The resulting 256-bit digest is used as the lookup key in the fingerprint index. If the index already contains that hash, the incoming data is a duplicate and the write is short-circuited: only a reference to the existing chunk is stored rather than a new copy. SHA-256's collision resistance (2^128 security level) makes accidental hash collisions astronomically unlikely in practice."
      }
    },
    {
      "@type": "Question",
      "name": "How is a chunk-level fingerprint index designed for high-throughput lookups?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The fingerprint index is a distributed hash table keyed on the SHA-256 digest. Each node owns a shard of the keyspace (consistent hashing). Lookups are a single-hop read; inserts add a (hash, chunk_location) entry with a reference count. A bloom filter layer in front of the index handles the common cold-miss case in O(1) without a network round-trip, drastically reducing lookup latency for unique chunks."
      }
    },
    {
      "@type": "Question",
      "name": "What is the difference between inline and post-process deduplication?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Inline deduplication checks the fingerprint index synchronously during the write path: if a duplicate is found the write completes immediately without storing new bytes, giving instant space savings but adding latency to every write. Post-process deduplication writes data first to primary storage, then a background job computes hashes, detects duplicates, rewrites references, and reclaims space. Post-process dedup is simpler to implement and doesn't slow the write path, but storage savings are delayed."
      }
    },
    {
      "@type": "Question",
      "name": "How do you calculate and report storage savings from deduplication?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Storage savings are derived from reference counts in the fingerprint index. Each chunk's logical size is multiplied by (reference_count - 1) to get saved bytes. Total logical size minus total physical size gives the absolute savings; logical size divided by physical size yields the dedup ratio. These metrics are aggregated per tenant and globally, exported to a metrics system, and surfaced on a dashboard. A scheduled job also reconciles reference counts against the chunk store to detect any index drift."
      }
    }
  ]
}

