Low-Level Design: File Sharing Service

Chunked Upload Protocol

Uploading large files over HTTP without chunking is fragile. A 4GB video file uploaded in a single request will fail and require a complete restart if the connection drops at 99% completion. Chunked upload solves this by having the client split the file into fixed-size pieces — typically 5MB — and upload each chunk independently with metadata identifying the file_id, the chunk_index, and the total chunk count. The server stores each chunk as it arrives and tracks which chunks have been received in a manifest. When the client sends a completion signal, the server verifies all chunks are present, assembles the file, and transitions it to the finalized state. Resumability comes from the server’s ability to report the last confirmed chunk index in response to a resume query, allowing the client to restart from that point rather than from the beginning. The client computes a checksum for each chunk before uploading and the server validates it on receipt, catching corruption during transit. If validation fails the server rejects that chunk and the client re-uploads it. This protocol tolerates dropped connections, app restarts, and even device reboots between chunks, making it suitable for mobile clients on unreliable networks. The chunk size is a tuneable parameter: smaller chunks improve resumability granularity but increase request overhead; larger chunks are more efficient on stable connections but require longer re-uploads on failure.
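The server-side half of this protocol can be sketched as a small manifest object. This is a minimal in-memory illustration, not a production implementation: the class name `UploadSession` and the dict-backed manifest are assumptions for the example, and a real server would persist chunks and the manifest durably.

```python
import hashlib

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB, a tunable parameter


class UploadSession:
    """Tracks which chunks of a file have been received (the manifest)."""

    def __init__(self, file_id: str, total_chunks: int):
        self.file_id = file_id
        self.total_chunks = total_chunks
        self.received: dict[int, bytes] = {}  # chunk_index -> chunk bytes

    def accept_chunk(self, index: int, data: bytes, checksum: str) -> bool:
        # Validate the client-supplied checksum; reject corrupted chunks
        # so the client re-uploads them.
        if hashlib.sha256(data).hexdigest() != checksum:
            return False
        self.received[index] = data
        return True

    def last_confirmed_index(self) -> int:
        """Highest N such that chunks 0..N are all present -- the value
        reported in response to a resume query."""
        n = -1
        while n + 1 in self.received:
            n += 1
        return n

    def is_complete(self) -> bool:
        return len(self.received) == self.total_chunks

    def assemble(self) -> bytes:
        """Called on the completion signal, once all chunks are present."""
        assert self.is_complete()
        return b"".join(self.received[i] for i in range(self.total_chunks))
```

A client resuming after a dropped connection would query `last_confirmed_index()` and restart uploading from the next chunk rather than from byte zero.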

Content-Addressable Storage

Computing a SHA-256 hash of the complete file content and using that hash as the storage key unlocks a powerful deduplication property: if two users upload identical files, the blob is stored only once. When a file upload completes, the server computes the SHA-256 of the assembled content and checks whether a blob with that hash already exists in object storage. If it does, the server skips the write entirely and records only a metadata entry pointing at the existing blob. This deduplication is transparent to the uploader: they receive a successful upload response and their file is accessible through their account even though no new bytes were written to disk. Content-addressable storage also simplifies integrity verification: any time a file is downloaded, recomputing the SHA-256 and comparing it to the stored hash proves the content is intact. The hash serves as a natural cache key for CDN layers: two different users requesting the same content by different names still hit the same edge cache entry keyed on the content hash. The main design constraint of content-addressable storage is that it is immutable by definition: editing a file creates a new blob with a new hash. Versioning and edit history are therefore modeled as a sequence of metadata records pointing at different blob hashes rather than as mutations to the blob itself.
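The deduplication and integrity properties above can be shown in a few lines. This is a toy in-memory sketch under stated assumptions — the `BlobStore` class and its `writes` counter are invented for illustration; real deployments would back this with an object store such as S3:

```python
import hashlib


class BlobStore:
    """Content-addressable blob store: the key is the SHA-256 of the content."""

    def __init__(self):
        self.blobs: dict[str, bytes] = {}
        self.writes = 0  # counts actual byte writes, to demonstrate dedup

    def put(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in self.blobs:   # identical content is stored only once
            self.blobs[digest] = content
            self.writes += 1
        return digest                  # metadata records point at this hash

    def get(self, digest: str) -> bytes:
        content = self.blobs[digest]
        # Integrity check on every read: recompute and compare the hash.
        assert hashlib.sha256(content).hexdigest() == digest
        return content
```

Two uploads of identical bytes return the same digest, and only the first performs a write; the second upload records metadata only.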

Metadata Store

The blob storage layer is responsible for durably storing raw bytes, but it knows nothing about file names, owners, sharing permissions, or upload timestamps. All of that lives in a relational metadata store. The core files table has columns for file_id (a UUID generated at upload initiation), owner_id (the authenticated user who initiated the upload), sha256 (the content hash linking to the blob), size in bytes, original filename, MIME type, upload timestamp, and a status column that tracks the lifecycle from in-progress through finalized to deleted. The separation of metadata from blob storage is intentional and important: it means that deleting a file only requires marking the metadata record as deleted; the blob can be garbage collected asynchronously after verifying no other metadata records reference that SHA-256. It also means the metadata store can be queried, indexed, and backed up on its own schedule independently of the large and expensive blob storage layer. File versioning is modeled as a versions table with version_id, file_id, sha256, and created_at — each edit creates a new version record pointing at a new blob. The metadata store also holds the mapping between share tokens and file_ids, the per-file permission grants, and the virus scan results, all of which are small relational records that benefit from SQL joins rather than key-value lookups.
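The core schema described above might look like the following SQLite sketch. Column names and types are assumptions drawn from the prose, simplified for illustration (e.g. timestamps as text, no indexes shown):

```python
import sqlite3

SCHEMA = """
CREATE TABLE files (
    file_id     TEXT PRIMARY KEY,  -- UUID generated at upload initiation
    owner_id    TEXT NOT NULL,     -- authenticated uploader
    sha256      TEXT,              -- content hash linking to the blob
    size_bytes  INTEGER,
    filename    TEXT,
    mime_type   TEXT,
    uploaded_at TEXT,
    status      TEXT NOT NULL      -- lifecycle column
        CHECK (status IN ('in_progress', 'finalized', 'deleted'))
);
CREATE TABLE versions (
    version_id  TEXT PRIMARY KEY,
    file_id     TEXT NOT NULL REFERENCES files(file_id),
    sha256      TEXT NOT NULL,     -- each edit points at a new blob
    created_at  TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Deletion is then a pure metadata operation — an `UPDATE files SET status = 'deleted'` — with blob garbage collection deferred to an asynchronous job.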

Share Link Generation

A share link must be unguessable, scoped to a specific file with specific permissions, optionally time-limited, and efficient to validate on every access. The core primitive is a 128-bit cryptographically random token generated server-side using a CSPRNG. This token is stored in a share_links table alongside the file_id it grants access to, the permission level (view-only, download-only, or edit), an optional expiry timestamp, and an enabled flag that allows the owner to revoke access without deleting the record. When a recipient clicks the share link, the server looks up the token, verifies the expiry and enabled flag, and then either serves the file directly or generates a presigned URL for the CDN to serve it. Presigned URLs are signed with a short TTL — typically five to fifteen minutes — so that even if the presigned URL is leaked, it expires quickly. The share link itself is permanent until revoked or expired; the presigned URL is ephemeral and generated fresh on each access. This design means the CDN can serve the file at high bandwidth without any per-request authentication call to the origin server, while the share link validation is centralized and revocable. Password-protected share links are implemented by storing a bcrypt hash of the password in the share_links record and requiring the recipient to prove knowledge of the password before the presigned URL is generated.
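The two token types — the long-lived share token and the ephemeral presigned URL — can be sketched as follows. The signing key, URL format, and function names are assumptions for the example; a production system would manage and rotate the key in a secrets store:

```python
import hashlib
import hmac
import secrets
import time

SIGNING_KEY = b"server-side-secret"  # hypothetical key, shared with the CDN


def new_share_token() -> str:
    # 128-bit cryptographically random token from a CSPRNG.
    return secrets.token_urlsafe(16)


def presign(sha256: str, ttl_seconds: int = 600, now=None) -> str:
    """Build a short-TTL presigned URL the CDN can verify offline."""
    expires = int(now if now is not None else time.time()) + ttl_seconds
    msg = f"{sha256}:{expires}".encode()
    sig = hmac.new(SIGNING_KEY, msg, hashlib.sha256).hexdigest()
    return f"https://cdn.example.com/{sha256}?expires={expires}&sig={sig}"


def verify(sha256: str, expires: int, sig: str, now=None) -> bool:
    """Edge-side check: reject expired or tampered URLs, constant-time."""
    if (now if now is not None else time.time()) > expires:
        return False
    msg = f"{sha256}:{expires}".encode()
    expected = hmac.new(SIGNING_KEY, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

Revoking a share link only requires flipping the `enabled` flag; any presigned URLs already issued die on their own within the TTL window.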

Access Control

Every API request that touches a file must pass through an access control check before any operation is performed. The permission model supports three roles: owner (full control including sharing and deletion), editor (can upload new versions and rename, cannot delete or manage shares), and viewer (can download and preview, cannot modify). Permissions are stored in a file_permissions table with file_id, principal_id (a user_id or a group_id), and role. The access control check is a single SQL query that joins the file record with the permissions table for the requesting user, returning the effective permission level. This check must happen on every request, not just the first, because permissions can be revoked at any time. Public share links are a special case: they bypass the user authentication step but still go through the share_links validation logic that checks expiry and the enabled flag. The owner role cannot be transferred; it is set at creation time and changing it requires an explicit ownership transfer operation that sends an email confirmation to the new owner. Group permissions allow an organization to grant access to an entire team without enumerating individual users, with the group membership resolution performed at access check time by joining against the group_memberships table. Permission checks are logged to an audit trail table for compliance purposes, recording who accessed what file at what time with what permission.
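The "single SQL query" access check, including group resolution, might look like this sketch. The table shapes, the `effective_role` helper, and the role-ranking rule (strongest grant wins) are assumptions for illustration:

```python
import sqlite3

ROLE_RANK = {"viewer": 1, "editor": 2, "owner": 3}

# One query covers both direct grants and grants via group membership.
EFFECTIVE_ROLE_SQL = """
SELECT p.role
FROM file_permissions AS p
LEFT JOIN group_memberships AS g
       ON g.group_id = p.principal_id AND g.user_id = :user_id
WHERE p.file_id = :file_id
  AND (p.principal_id = :user_id OR g.user_id IS NOT NULL)
"""


def effective_role(conn, file_id: str, user_id: str):
    """Run on every request; returns the strongest applicable role or None."""
    rows = conn.execute(EFFECTIVE_ROLE_SQL,
                        {"file_id": file_id, "user_id": user_id}).fetchall()
    if not rows:
        return None  # no grant at all: deny
    return max((r[0] for r in rows), key=ROLE_RANK.__getitem__)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file_permissions (file_id TEXT, principal_id TEXT, role TEXT);
CREATE TABLE group_memberships (group_id TEXT, user_id TEXT);
INSERT INTO file_permissions VALUES ('f1', 'alice', 'owner');
INSERT INTO file_permissions VALUES ('f1', 'team-eng', 'viewer');
INSERT INTO group_memberships VALUES ('team-eng', 'bob');
""")
```

Because the check re-runs per request, deleting a `file_permissions` row revokes access immediately; the audit-trail write would hang off the same code path.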

Virus Scanning

Accepting arbitrary file uploads without scanning them makes the service a malware distribution vector. Virus scanning runs asynchronously after the upload is finalized to avoid adding latency to the upload completion response. When a file transitions to finalized status, the server publishes a scan-required event containing the file_id and sha256 to a scanning queue. A pool of scanner workers consumes this queue, retrieves the blob from storage, passes it through ClamAV or a managed cloud scanning API, and records the result — clean or quarantined — back in the metadata store. If the scan result is clean the file becomes accessible to anyone with a valid share link or permission grant. If the scan detects malware the file is quarantined: its status is set to quarantined, all share links for it are disabled, the owner is notified, and the file is flagged for manual review. Crucially, scan results are cached per SHA-256. If the same blob has already been scanned and found clean, subsequent uploads of identical content skip the scanning worker entirely. This cache has a TTL of several days rather than being permanent, because virus signature databases update and a file that was clean yesterday might be detected tomorrow. The scanning infrastructure is isolated from the main application: scanner workers run with network access to the blob storage but no access to the metadata database or the share link system, limiting the blast radius if a malicious file somehow exploits the scanner process.
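The per-SHA-256 result cache with a TTL can be sketched as below. The `ScanCache` and `process` names are assumptions, and the scanner is injected as a plain callable standing in for a real ClamAV or cloud-API call:

```python
import time

SCAN_CACHE_TTL = 3 * 24 * 3600  # cache verdicts for a few days, not forever


class ScanCache:
    """Maps content hash -> (verdict, scanned_at). Stale entries force a rescan
    because signature databases update over time."""

    def __init__(self):
        self.results: dict[str, tuple[str, float]] = {}

    def lookup(self, sha256: str, now=None):
        now = now if now is not None else time.time()
        entry = self.results.get(sha256)
        if entry is None:
            return None
        verdict, scanned_at = entry
        if now - scanned_at > SCAN_CACHE_TTL:
            return None  # clean yesterday might be detected tomorrow
        return verdict


def process(sha256: str, scanner, cache: ScanCache, now=None) -> str:
    """Worker loop body: consult the cache, otherwise invoke the scanner."""
    cached = cache.lookup(sha256, now)
    if cached is not None:
        return cached  # identical content already scanned: skip the worker
    verdict = scanner(sha256)  # ClamAV / cloud scanning call in production
    cache.results[sha256] = (verdict, now if now is not None else time.time())
    return verdict
```

A deduplicated blob uploaded by a second user therefore costs nothing on the scanning side until the TTL forces a fresh look.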

CDN Delivery

Direct download from origin object storage at scale is expensive and slow for geographically distributed users. A CDN layer absorbs the majority of download traffic by caching file content at edge nodes close to users. Downloads work through a two-step process: the application server performs the access control check and then generates a presigned CDN URL that encodes the content hash, a timestamp, and an HMAC signature. The CDN validates this signature without calling back to the origin, serving the cached content if available. Because the cache key is the SHA-256 content hash rather than the file name or user-specific URL, all users downloading the same file content share a single cache entry at each edge node, achieving very high cache hit rates for popular files. Large files are served using HTTP range requests: the client specifies a byte range, the CDN serves that range from cache or fetches it from origin using a matching range request, and clients can download different ranges in parallel for maximum throughput. This is the same mechanism that video players use for adaptive bitrate streaming. The CDN also handles MIME type headers correctly using the stored MIME type from the metadata store, ensuring that browsers render images and PDFs inline rather than forcing a download dialog. Cache invalidation on file deletion or quarantine is handled by purging the CDN cache entry for the affected SHA-256 using the CDN’s purge API.
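Byte-range serving is simple enough to sketch directly. This is an illustrative parser for the common `bytes=start-end` form only (it ignores suffix ranges and multi-range requests, which HTTP also permits), returning the status code, body slice, and `Content-Range` header value:

```python
def serve_range(blob: bytes, range_header: str):
    """Handle a single 'bytes=start-end' range request against a cached blob.

    Returns (status, body, content_range): 206 Partial Content on success,
    416 Range Not Satisfiable when the start lies past the end of the blob.
    """
    units, _, spec = range_header.partition("=")
    assert units == "bytes"
    start_s, _, end_s = spec.partition("-")
    start = int(start_s)
    # An open-ended range like 'bytes=7-' means 'from 7 to the last byte'.
    end = int(end_s) if end_s else len(blob) - 1
    if start >= len(blob):
        return 416, b"", f"bytes */{len(blob)}"
    end = min(end, len(blob) - 1)  # clamp over-long ranges to the blob size
    return 206, blob[start:end + 1], f"bytes {start}-{end}/{len(blob)}"
```

Clients downloading different ranges in parallel simply issue several such requests against the same SHA-256-keyed cache entry.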

Storage Tiering

Storing all files on high-performance object storage regardless of access frequency is wasteful. A tiering policy balances cost and access latency by moving files between storage classes based on access patterns. Hot storage — standard object storage with millisecond access latency and higher per-GB cost — holds files that have been accessed within the last 90 days. Cold storage — infrequent access or archive tiers offered by every major cloud provider — costs a fraction of hot storage per GB but imposes a retrieval latency of seconds to hours and a per-GB retrieval fee. A lifecycle policy job runs nightly, queries the metadata store for files whose last_accessed_at timestamp is older than 90 days, and submits a storage class transition request for each. When a user accesses a cold-tier file, the first request triggers a retrieval operation. If the retrieval is instant the response is served normally. If the retrieval takes time — as with glacier-class storage — the server responds with a 202 Accepted and sends the user a notification when the file is ready. Access to a cold-tier file also re-warms it to hot storage, restarting the 90-day timer. Files marked as permanently deleted are not tiered; they are scheduled for deletion from all storage classes after a grace period that allows recovery from accidental deletion. The grace period is typically 30 days, after which the blob is deleted if no other metadata record references the same SHA-256.
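The nightly lifecycle decision can be sketched as a pure function over metadata records. The dict shape, field names, and action labels are assumptions for the example; the real job would run the same logic as a query against the metadata store and issue storage-class transition requests:

```python
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=90)      # idle longer than this -> cold tier
DELETION_GRACE = timedelta(days=30)  # recovery window before purge


def nightly_lifecycle(files, now: datetime):
    """Return (file_id, action) pairs for the nightly job to execute."""
    actions = []
    for f in files:
        if f["status"] == "deleted":
            # Production would also verify no other metadata record
            # references the same SHA-256 before purging the blob.
            if now - f["deleted_at"] >= DELETION_GRACE:
                actions.append((f["file_id"], "purge"))
        elif f["tier"] == "hot" and now - f["last_accessed_at"] > HOT_WINDOW:
            actions.append((f["file_id"], "to_cold"))
    return actions
```

The re-warming path runs in the opposite direction at access time: a cold-tier hit updates `last_accessed_at` and transitions the file back to hot, restarting the 90-day clock.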
