A file deduplication system eliminates redundant storage by detecting files with identical content and storing each unique payload exactly once. This is the foundation of services like Dropbox, Backblaze, and enterprise backup systems. Key challenges: computing content hashes efficiently for large files, chunk-level dedup for partial matches, and safely managing reference counts so data is deleted only when no references remain.
Core Data Model
-- One row per unique file content (content-addressed)
CREATE TABLE FileBlob (
blob_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
sha256_hash CHAR(64) NOT NULL UNIQUE, -- hex SHA-256 of full content; UNIQUE creates a btree index usable for lookups
size_bytes BIGINT NOT NULL,
s3_key TEXT NOT NULL, -- e.g. "blobs/{sha256_hash[:2]}/{sha256_hash}"
ref_count INT NOT NULL DEFAULT 0, -- number of FileRecord rows pointing here
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- No separate CREATE INDEX on sha256_hash: the UNIQUE constraint above already
-- builds a unique index, so an extra idx_blob_hash would only duplicate it and
-- add write amplification on every insert.
-- One row per user-facing file (can share a FileBlob)
CREATE TABLE FileRecord (
file_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL, -- owner; ownership is checked on delete
blob_id UUID NOT NULL REFERENCES FileBlob(blob_id), -- shared content: many records may point at one blob
filename TEXT NOT NULL, -- user-visible name; not part of dedup identity
folder_id UUID, -- nullable: file may live at the user's root
uploaded_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
deleted_at TIMESTAMPTZ -- soft-delete marker; NULL means live
);
-- Supports the common "list a user's live files" query (user_id, deleted_at IS NULL)
CREATE INDEX idx_file_user ON FileRecord (user_id, deleted_at);
-- For large files: chunk-level deduplication
CREATE TABLE FileChunk (
chunk_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
sha256_hash CHAR(64) NOT NULL UNIQUE, -- content address of the chunk; UNIQUE also provides the lookup index
size_bytes INT NOT NULL, -- chunk payload size (<= CHUNK_SIZE)
s3_key TEXT NOT NULL, -- e.g. "chunks/{sha256_hash[:2]}/{sha256_hash}"
ref_count INT NOT NULL DEFAULT 0 -- number of references; chunk is GC-eligible at 0
);
-- Ordered mapping from a blob to the chunks that reconstruct it
CREATE TABLE BlobChunk (
blob_id UUID NOT NULL REFERENCES FileBlob(blob_id),
chunk_order INT NOT NULL, -- 0-based position of the chunk within the blob
chunk_id UUID NOT NULL REFERENCES FileChunk(chunk_id), -- same chunk may appear in many blobs (dedup)
PRIMARY KEY (blob_id, chunk_order)
);
Upload Flow: Hash, Check, Upload
import hashlib
import boto3
from uuid import uuid4
# Module-wide S3 client and target bucket shared by all upload/delete paths below.
s3 = boto3.client('s3', region_name='us-east-1')
BUCKET = 'techinterview-blobs'
def sha256_of_file(path: str) -> tuple[str, int]:
    """Stream the file at *path* through SHA-256.

    Returns (hex digest, total size in bytes). Reads in 64 KiB blocks so
    arbitrarily large files are hashed in constant memory.
    """
    digest = hashlib.sha256()
    total = 0
    with open(path, 'rb') as fh:
        while block := fh.read(65536):
            digest.update(block)
            total += len(block)
    return digest.hexdigest(), total
def upload_file(conn, user_id: str, local_path: str, filename: str, folder_id: str | None = None) -> str:
    """
    Deduplicating upload:
      1. Hash the file locally.
      2. Check if the blob already exists in DB (skip S3 upload if so).
      3. Upsert the blob row — the unique index on sha256_hash serializes
         concurrent uploads of identical content.
      4. Create a FileRecord for the user.
    Returns file_id.
    """
    sha256, size_bytes = sha256_of_file(local_path)
    s3_key = f"blobs/{sha256[:2]}/{sha256}"
    with conn.cursor() as cur:
        # Optimistic existence check so we can skip the S3 transfer entirely
        # for known content. This check alone is racy — correctness comes from
        # the upsert below, not from this SELECT.
        cur.execute("SELECT blob_id FROM FileBlob WHERE sha256_hash = %s", (sha256,))
        if cur.fetchone() is None:
            # Content-addressed key: if a concurrent client also uploads the
            # same bytes to the same key, the duplicate PUT is harmless.
            s3.upload_file(local_path, BUCKET, s3_key)
        # Fixes the check-then-insert race: two clients finding "no blob" both
        # reach this statement, the unique index serializes them, and the loser
        # lands in ON CONFLICT and bumps ref_count instead of erroring out.
        cur.execute(
            """INSERT INTO FileBlob (blob_id, sha256_hash, size_bytes, s3_key, ref_count)
               VALUES (%s, %s, %s, %s, 1)
               ON CONFLICT (sha256_hash)
               DO UPDATE SET ref_count = FileBlob.ref_count + 1
               RETURNING blob_id""",
            (str(uuid4()), sha256, size_bytes, s3_key)
        )
        blob_id = cur.fetchone()[0]
        # User-facing record pointing at the (possibly shared) blob.
        file_id = str(uuid4())
        cur.execute(
            """INSERT INTO FileRecord (file_id, user_id, blob_id, filename, folder_id)
               VALUES (%s, %s, %s, %s, %s)""",
            (file_id, user_id, blob_id, filename, folder_id)
        )
        conn.commit()
    return file_id
Delete Flow: Reference Counting and S3 Cleanup
def delete_file(conn, user_id: str, file_id: str):
    """
    Soft-delete the FileRecord. Decrement blob ref_count.
    If ref_count reaches 0, schedule S3 deletion.

    Raises ValueError if the file does not exist, is already deleted,
    or is not owned by user_id.
    """
    with conn.cursor() as cur:
        # Ownership check + soft-delete as ONE conditional UPDATE. With the
        # original SELECT-then-UPDATE, two concurrent deletes of the same file
        # could both pass the check and decrement ref_count twice; here only
        # the transaction that actually flips deleted_at gets a row back.
        cur.execute(
            """UPDATE FileRecord SET deleted_at = NOW()
               WHERE file_id = %s AND user_id = %s AND deleted_at IS NULL
               RETURNING blob_id""",
            (file_id, user_id)
        )
        row = cur.fetchone()
        if not row:
            raise ValueError("File not found or not owned by user")
        blob_id = row[0]
        # Decrement ref count and learn whether the blob is now orphaned.
        cur.execute(
            "UPDATE FileBlob SET ref_count = ref_count - 1 WHERE blob_id = %s RETURNING ref_count, s3_key",
            (blob_id,)
        )
        ref_count, s3_key = cur.fetchone()
        conn.commit()
    if ref_count == 0:
        # Enqueue S3 deletion (async — do not delete synchronously to avoid inconsistency on rollback)
        enqueue_blob_deletion(blob_id, s3_key)
def enqueue_blob_deletion(blob_id: str, s3_key: str):
    """Enqueue a background job to delete blob from S3 after a safety delay."""
    import json, redis
    payload = {"blob_id": blob_id, "s3_key": s3_key, "delete_after": "2h"}
    queue = redis.Redis()
    queue.rpush("blob_deletion_queue", json.dumps(payload))
Chunk-Level Deduplication for Large Files
CHUNK_SIZE = 4 * 1024 * 1024 # 4 MiB fixed-size chunks; content-defined chunking (CDC) would yield better dedup ratios
def upload_chunked(conn, user_id: str, local_path: str, filename: str, folder_id: str | None = None) -> str:
    """
    Split file into CHUNK_SIZE chunks, dedup each chunk independently.
    Only upload chunks that don't already exist in S3.

    Fixes vs. the naive version:
      - The whole-file hash is checked FIRST. If the blob already exists we
        only bump its ref_count; the old flow also incremented every chunk's
        ref_count without adding BlobChunk rows, corrupting chunk accounting.
      - The file is read once up front for the hash, and a second time only
        when the blob is genuinely new (instead of always reading it twice).
      - folder_id is supported (defaults to None), matching upload_file.
    Returns file_id.
    """
    full_hash, size = sha256_of_file(local_path)
    with conn.cursor() as cur:
        cur.execute("SELECT blob_id FROM FileBlob WHERE sha256_hash = %s", (full_hash,))
        row = cur.fetchone()
        if row:
            # Identical content already stored: one blob-level ref is enough.
            # Chunk ref_counts track blobs, not files, so they stay untouched.
            blob_id = row[0]
            cur.execute("UPDATE FileBlob SET ref_count = ref_count + 1 WHERE blob_id = %s", (blob_id,))
        else:
            blob_id = str(uuid4())
            cur.execute(
                "INSERT INTO FileBlob (blob_id, sha256_hash, size_bytes, s3_key, ref_count) VALUES (%s,%s,%s,%s,1)",
                (blob_id, full_hash, size, f"blobs/{full_hash[:2]}/{full_hash}")
            )
            # New blob: chunk the file and dedup each chunk independently.
            with open(local_path, 'rb') as f:
                order = 0
                while data := f.read(CHUNK_SIZE):
                    chunk_hash = hashlib.sha256(data).hexdigest()
                    chunk_id = _get_or_create_chunk(conn, chunk_hash, data)
                    cur.execute(
                        "INSERT INTO BlobChunk (blob_id, chunk_order, chunk_id) VALUES (%s,%s,%s)",
                        (blob_id, order, chunk_id)
                    )
                    order += 1
        file_id = str(uuid4())
        cur.execute(
            "INSERT INTO FileRecord (file_id, user_id, blob_id, filename, folder_id) VALUES (%s,%s,%s,%s,%s)",
            (file_id, user_id, blob_id, filename, folder_id)
        )
        conn.commit()
    return file_id
def _get_or_create_chunk(conn, chunk_hash: str, data: bytes) -> str:
    """Return the chunk_id for chunk_hash, uploading and inserting if new.

    Bumps FileChunk.ref_count in either path. Does NOT commit — the caller
    owns the transaction, so a failed multi-chunk upload rolls back as a unit
    (the old per-chunk commit left partial blob state behind on failure).
    """
    with conn.cursor() as cur:
        # Fast path: chunk content already known.
        cur.execute("SELECT chunk_id FROM FileChunk WHERE sha256_hash = %s", (chunk_hash,))
        row = cur.fetchone()
        if row:
            cur.execute("UPDATE FileChunk SET ref_count = ref_count + 1 WHERE chunk_id = %s", (row[0],))
            return row[0]
        s3_key = f"chunks/{chunk_hash[:2]}/{chunk_hash}"
        # Content-addressed key: a concurrent duplicate PUT of identical bytes
        # is harmless, so uploading before the insert is safe.
        s3.put_object(Bucket=BUCKET, Key=s3_key, Body=data)
        # Closes the SELECT-then-INSERT race: if another transaction inserted
        # this hash first, ON CONFLICT bumps its ref_count and RETURNING hands
        # back the existing chunk_id instead of raising a unique violation.
        cur.execute(
            """INSERT INTO FileChunk (chunk_id, sha256_hash, size_bytes, s3_key, ref_count)
               VALUES (%s,%s,%s,%s,1)
               ON CONFLICT (sha256_hash) DO UPDATE SET ref_count = FileChunk.ref_count + 1
               RETURNING chunk_id""",
            (str(uuid4()), chunk_hash, len(data), s3_key)
        )
        return cur.fetchone()[0]
Key Interview Points
- Hash collision probability: SHA-256 has 2^256 possible values. The birthday problem probability of a collision among 10 billion files is approximately 10^-58 — effectively zero. In practice, use SHA-256 for dedup keys and verify content equality on first access if paranoid about hardware corruption (bitrot).
- Race condition on upload: Two clients uploading the same file simultaneously may both find “blob not found” and try to insert. Mitigate with: INSERT … ON CONFLICT (sha256_hash) DO UPDATE SET ref_count = ref_count + 1. The unique index on sha256_hash serializes concurrent inserts safely.
- S3 key design: Use the full SHA-256 hex as the S3 key (prefix with first 2 chars for key distribution: blobs/ab/abc123…). This is content-addressed storage — uploading the same content twice to the same key is idempotent in S3 (PUT is idempotent).
- Fixed vs content-defined chunking: Fixed 4MiB chunks are simple but sensitive to insertion — inserting 1 byte at the start shifts all chunk boundaries. Content-Defined Chunking (CDC) uses a rolling hash (Rabin fingerprint) to identify natural split points in content, giving much better dedup ratios for partially modified files. Used by Borg, Restic, and Duplicati.
- Storage savings calculation: For a 1,000-user system where each user uploads similar documents, dedup can reduce storage by 60–80%. Note that size_bytes lives on FileBlob, not FileRecord — so track savings as: SUM(b.size_bytes) over live FileRecord rows joined to FileBlob (logical bytes) minus SUM(size_bytes) across FileBlob (physical bytes). Report in the admin dashboard.
{“@context”:”https://schema.org”,”@type”:”FAQPage”,”mainEntity”:[{“@type”:”Question”,”name”:”Why is SHA-256 sufficient for deduplication keys instead of a faster hash?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”SHA-256 produces a 256-bit hash — the probability of a collision among 10 billion files is approximately 10^-58, making it effectively zero for practical use. Faster hashes like MD5 (128-bit) or CRC32 (32-bit) have higher collision probabilities and known collision attacks. In a deduplication system, a false positive (two different files with the same hash) means data corruption: user A’s file is silently served when user B requests their file. SHA-256 eliminates this risk at the cost of ~2x slower hashing vs. MD5 — negligible for files where upload I/O is the bottleneck. For performance-sensitive chunk-level dedup, xxHash128 is used in some systems because it provides 128-bit output with no known attacks and is ~10x faster than SHA-256.”}},{“@type”:”Question”,”name”:”How does the reference count prevent premature S3 deletion?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Each FileBlob row has a ref_count tracking how many FileRecord rows point to it. When a file is uploaded and matched to an existing blob, ref_count is incremented. When a file is deleted, ref_count is decremented. S3 deletion only happens when ref_count reaches 0 — no FileRecord references the blob. Critically, the S3 deletion is asynchronous: it’s enqueued as a background job with a safety delay (e.g., 2 hours). This delay handles the race condition where: (1) user A deletes their copy (ref_count → 0, deletion enqueued); (2) user B uploads the same file and inserts a new FileBlob row (ref_count → 1) within 2 hours; (3) the deletion job checks ref_count before deleting — finds ref_count > 0 and aborts. 
Without the delay, user B gets a dangling S3 reference.”}},{“@type”:”Question”,”name”:”How does Content-Defined Chunking improve dedup ratios over fixed-size chunking?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Fixed 4MiB chunks are sensitive to prepend insertions: inserting 1 byte at the start of a 100MiB file shifts all chunk boundaries, making every chunk unique even though 99.99% of content is unchanged — 0% dedup for a nearly identical file. Content-Defined Chunking (CDC) uses a rolling hash (Rabin-Karp fingerprint) over a sliding window (typically 64 bytes) to identify natural split points in the content — places where the hash matches a bitmask pattern. These split points are content-anchored rather than position-anchored, so inserting 1 byte only affects the chunk containing the insertion; all other chunks remain identical. Borg, Restic, and Casync all use CDC. Typical dedup improvement: CDC achieves 70%+ dedup on modified files; fixed-size achieves 10–40%.”}},{“@type”:”Question”,”name”:”How do you handle the race condition when two users upload the same file simultaneously?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Both clients compute SHA-256, both query FileBlob and find no row, both attempt to INSERT. Without guards, the second insert fails with a unique constraint violation on sha256_hash. Correct handling: INSERT INTO FileBlob (…) ON CONFLICT (sha256_hash) DO UPDATE SET ref_count = ref_count + 1 RETURNING blob_id, ref_count. The first insert wins; the second hits the conflict and increments ref_count instead. Both transactions get the correct blob_id and proceed to create their FileRecord. The S3 upload of the same key from both clients is also safe: S3 PUT is idempotent — uploading the same content to the same key twice results in no data corruption, just a redundant upload. 
In production, add a pre-signed URL upload path that checks FileBlob existence before issuing the S3 upload URL.”}},{“@type”:”Question”,”name”:”How do you calculate and report storage savings from deduplication?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Storage used without dedup (logical bytes) = SUM of blob size_bytes over all FileRecord rows, joining each record to its FileBlob since size_bytes is stored on FileBlob (one row per user upload). Storage used with dedup (physical bytes) = SUM(size_bytes) across all FileBlob rows (one row per unique content hash). Savings = logical total – physical total. Savings ratio = (logical – physical) / logical. Track this in a daily report: SELECT (SELECT SUM(b.size_bytes) FROM FileRecord r JOIN FileBlob b ON b.blob_id = r.blob_id WHERE r.deleted_at IS NULL) AS logical_bytes, (SELECT SUM(size_bytes) FROM FileBlob WHERE ref_count > 0) AS physical_bytes. For per-user reporting: a user who uploads a 1GB file that already exists in the system uses 0 physical bytes of new storage but consumes 1GB of logical quota. Charge users based on logical bytes (fair); save physical costs via dedup.”}}]}
File deduplication and content-addressed storage design is discussed in Google system design interview questions.
File deduplication and document storage system design is covered in Atlassian system design interview preparation.
File deduplication and distributed storage design is discussed in Amazon system design interview guide.