Low-Level Design: File Sharing Service

Chunked Upload Protocol

Uploading large files over HTTP without chunking is fragile. A 4GB video file uploaded in a single request will fail and require a complete restart if the connection drops at 99% completion. Chunked upload solves this by having the client split the file into fixed-size pieces — typically 5MB — and upload each chunk independently with metadata identifying the file_id, the chunk_index, and the total chunk count. The server stores each chunk as it arrives and tracks which chunks have been received in a manifest. When the client sends a completion signal, the server verifies all chunks are present, assembles the file, and transitions it to the finalized state. Resumability comes from the server’s ability to report the last confirmed chunk index in response to a resume query, allowing the client to restart from that point rather than from the beginning. The client computes a checksum for each chunk before uploading and the server validates it on receipt, catching corruption during transit. If validation fails the server rejects that chunk and the client re-uploads it. This protocol tolerates dropped connections, app restarts, and even device reboots between chunks, making it suitable for mobile clients on unreliable networks. The chunk size is a tuneable parameter: smaller chunks improve resumability granularity but increase request overhead; larger chunks are more efficient on stable connections but require longer re-uploads on failure.
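The server-side half of this protocol can be sketched as a small manifest object. This is a minimal in-memory illustration, not a production implementation: the class name `UploadSession` and the dict-backed manifest are assumptions for the example, and a real server would persist chunks and the manifest durably.

```python
import hashlib

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB, a tunable parameter


class UploadSession:
    """Tracks which chunks of a file have been received (the manifest)."""

    def __init__(self, file_id: str, total_chunks: int):
        self.file_id = file_id
        self.total_chunks = total_chunks
        self.received: dict[int, bytes] = {}  # chunk_index -> chunk bytes

    def accept_chunk(self, index: int, data: bytes, checksum: str) -> bool:
        # Validate the client-supplied checksum; reject corrupted chunks
        # so the client re-uploads them.
        if hashlib.sha256(data).hexdigest() != checksum:
            return False
        self.received[index] = data
        return True

    def last_confirmed_index(self) -> int:
        """Highest N such that chunks 0..N are all present -- the value
        reported in response to a resume query."""
        n = -1
        while n + 1 in self.received:
            n += 1
        return n

    def is_complete(self) -> bool:
        return len(self.received) == self.total_chunks

    def assemble(self) -> bytes:
        """Called on the completion signal, once all chunks are present."""
        assert self.is_complete()
        return b"".join(self.received[i] for i in range(self.total_chunks))
```

A client resuming after a dropped connection would query `last_confirmed_index()` and restart uploading from the next chunk rather than from byte zero.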

Content-Addressable Storage

Computing a SHA-256 hash of the complete file content and using that hash as the storage key unlocks a powerful deduplication property: if two users upload identical files, the blob is stored only once. When a file upload completes, the server computes the SHA-256 of the assembled content and checks whether a blob with that hash already exists in object storage. If it does, the server skips the write entirely and records only a metadata entry pointing at the existing blob. This deduplication is transparent to the uploader: they receive a successful upload response and their file is accessible through their account even though no new bytes were written to disk. Content-addressable storage also simplifies integrity verification: any time a file is downloaded, recomputing the SHA-256 and comparing it to the stored hash proves the content is intact. The hash serves as a natural cache key for CDN layers: two different users requesting the same content by different names still hit the same edge cache entry keyed on the content hash. The main design constraint of content-addressable storage is that it is immutable by definition: editing a file creates a new blob with a new hash. Versioning and edit history are therefore modeled as a sequence of metadata records pointing at different blob hashes rather than as mutations to the blob itself.
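The deduplication and integrity properties above can be shown in a few lines. This is a toy in-memory sketch under stated assumptions — the `BlobStore` class and its `writes` counter are invented for illustration; real deployments would back this with an object store such as S3:

```python
import hashlib


class BlobStore:
    """Content-addressable blob store: the key is the SHA-256 of the content."""

    def __init__(self):
        self.blobs: dict[str, bytes] = {}
        self.writes = 0  # counts actual byte writes, to demonstrate dedup

    def put(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in self.blobs:   # identical content is stored only once
            self.blobs[digest] = content
            self.writes += 1
        return digest                  # metadata records point at this hash

    def get(self, digest: str) -> bytes:
        content = self.blobs[digest]
        # Integrity check on every read: recompute and compare the hash.
        assert hashlib.sha256(content).hexdigest() == digest
        return content
```

Two uploads of identical bytes return the same digest, and only the first performs a write; the second upload records metadata only.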

Metadata Store

The blob storage layer is responsible for durably storing raw bytes, but it knows nothing about file names, owners, sharing permissions, or upload timestamps. All of that lives in a relational metadata store. The core files table has columns for file_id (a UUID generated at upload initiation), owner_id (the authenticated user who initiated the upload), sha256 (the content hash linking to the blob), size in bytes, original filename, MIME type, upload timestamp, and a status column that tracks the lifecycle from in-progress through finalized to deleted. The separation of metadata from blob storage is intentional and important: it means that deleting a file only requires marking the metadata record as deleted; the blob can be garbage collected asynchronously after verifying no other metadata records reference that SHA-256. It also means the metadata store can be queried, indexed, and backed up on its own schedule independently of the large and expensive blob storage layer. File versioning is modeled as a versions table with version_id, file_id, sha256, and created_at — each edit creates a new version record pointing at a new blob. The metadata store also holds the mapping between share tokens and file_ids, the per-file permission grants, and the virus scan results, all of which are small relational records that benefit from SQL joins rather than key-value lookups.
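The core schema described above might look like the following SQLite sketch. Column names and types are assumptions drawn from the prose, simplified for illustration (e.g. timestamps as text, no indexes shown):

```python
import sqlite3

SCHEMA = """
CREATE TABLE files (
    file_id     TEXT PRIMARY KEY,  -- UUID generated at upload initiation
    owner_id    TEXT NOT NULL,     -- authenticated uploader
    sha256      TEXT,              -- content hash linking to the blob
    size_bytes  INTEGER,
    filename    TEXT,
    mime_type   TEXT,
    uploaded_at TEXT,
    status      TEXT NOT NULL      -- lifecycle column
        CHECK (status IN ('in_progress', 'finalized', 'deleted'))
);
CREATE TABLE versions (
    version_id  TEXT PRIMARY KEY,
    file_id     TEXT NOT NULL REFERENCES files(file_id),
    sha256      TEXT NOT NULL,     -- each edit points at a new blob
    created_at  TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Deletion is then a pure metadata operation — an `UPDATE files SET status = 'deleted'` — with blob garbage collection deferred to an asynchronous job.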

Share Link Generation

A share link must be unguessable, scoped to a specific file with specific permissions, optionally time-limited, and efficient to validate on every access. The core primitive is a 128-bit cryptographically random token generated server-side using a CSPRNG. This token is stored in a share_links table alongside the file_id it grants access to, the permission level (view-only, download-only, or edit), an optional expiry timestamp, and an enabled flag that allows the owner to revoke access without deleting the record. When a recipient clicks the share link, the server looks up the token, verifies the expiry and enabled flag, and then either serves the file directly or generates a presigned URL for the CDN to serve it. Presigned URLs are signed with a short TTL — typically five to fifteen minutes — so that even if the presigned URL is leaked, it expires quickly. The share link itself is permanent until revoked or expired; the presigned URL is ephemeral and generated fresh on each access. This design means the CDN can serve the file at high bandwidth without any per-request authentication call to the origin server, while the share link validation is centralized and revocable. Password-protected share links are implemented by storing a bcrypt hash of the password in the share_links record and requiring the recipient to prove knowledge of the password before the presigned URL is generated.
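The two token types — the long-lived share token and the ephemeral presigned URL — can be sketched as follows. The signing key, URL format, and function names are assumptions for the example; a production system would manage and rotate the key in a secrets store:

```python
import hashlib
import hmac
import secrets
import time

SIGNING_KEY = b"server-side-secret"  # hypothetical key, shared with the CDN


def new_share_token() -> str:
    # 128-bit cryptographically random token from a CSPRNG.
    return secrets.token_urlsafe(16)


def presign(sha256: str, ttl_seconds: int = 600, now=None) -> str:
    """Build a short-TTL presigned URL the CDN can verify offline."""
    expires = int(now if now is not None else time.time()) + ttl_seconds
    msg = f"{sha256}:{expires}".encode()
    sig = hmac.new(SIGNING_KEY, msg, hashlib.sha256).hexdigest()
    return f"https://cdn.example.com/{sha256}?expires={expires}&sig={sig}"


def verify(sha256: str, expires: int, sig: str, now=None) -> bool:
    """Edge-side check: reject expired or tampered URLs, constant-time."""
    if (now if now is not None else time.time()) > expires:
        return False
    msg = f"{sha256}:{expires}".encode()
    expected = hmac.new(SIGNING_KEY, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

Revoking a share link only requires flipping the `enabled` flag; any presigned URLs already issued die on their own within the TTL window.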

Access Control

Every API request that touches a file must pass through an access control check before any operation is performed. The permission model supports three roles: owner (full control including sharing and deletion), editor (can upload new versions and rename, cannot delete or manage shares), and viewer (can download and preview, cannot modify). Permissions are stored in a file_permissions table with file_id, principal_id (a user_id or a group_id), and role. The access control check is a single SQL query that joins the file record with the permissions table for the requesting user, returning the effective permission level. This check must happen on every request, not just the first, because permissions can be revoked at any time. Public share links are a special case: they bypass the user authentication step but still go through the share_links validation logic that checks expiry and the enabled flag. The owner role cannot be transferred; it is set at creation time and changing it requires an explicit ownership transfer operation that sends an email confirmation to the new owner. Group permissions allow an organization to grant access to an entire team without enumerating individual users, with the group membership resolution performed at access check time by joining against the group_memberships table. Permission checks are logged to an audit trail table for compliance purposes, recording who accessed what file at what time with what permission.
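The "single SQL query" access check, including group resolution, might look like this sketch. The table shapes, the `effective_role` helper, and the role-ranking rule (strongest grant wins) are assumptions for illustration:

```python
import sqlite3

ROLE_RANK = {"viewer": 1, "editor": 2, "owner": 3}

# One query covers both direct grants and grants via group membership.
EFFECTIVE_ROLE_SQL = """
SELECT p.role
FROM file_permissions AS p
LEFT JOIN group_memberships AS g
       ON g.group_id = p.principal_id AND g.user_id = :user_id
WHERE p.file_id = :file_id
  AND (p.principal_id = :user_id OR g.user_id IS NOT NULL)
"""


def effective_role(conn, file_id: str, user_id: str):
    """Run on every request; returns the strongest applicable role or None."""
    rows = conn.execute(EFFECTIVE_ROLE_SQL,
                        {"file_id": file_id, "user_id": user_id}).fetchall()
    if not rows:
        return None  # no grant at all: deny
    return max((r[0] for r in rows), key=ROLE_RANK.__getitem__)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file_permissions (file_id TEXT, principal_id TEXT, role TEXT);
CREATE TABLE group_memberships (group_id TEXT, user_id TEXT);
INSERT INTO file_permissions VALUES ('f1', 'alice', 'owner');
INSERT INTO file_permissions VALUES ('f1', 'team-eng', 'viewer');
INSERT INTO group_memberships VALUES ('team-eng', 'bob');
""")
```

Because the check re-runs per request, deleting a `file_permissions` row revokes access immediately; the audit-trail write would hang off the same code path.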

Virus Scanning

Accepting arbitrary file uploads without scanning them makes the service a malware distribution vector. Virus scanning runs asynchronously after the upload is finalized to avoid adding latency to the upload completion response. When a file transitions to finalized status, the server publishes a scan-required event containing the file_id and sha256 to a scanning queue. A pool of scanner workers consumes this queue, retrieves the blob from storage, passes it through ClamAV or a managed cloud scanning API, and records the result — clean or quarantined — back in the metadata store. If the scan result is clean the file becomes accessible to anyone with a valid share link or permission grant. If the scan detects malware the file is quarantined: its status is set to quarantined, all share links for it are disabled, the owner is notified, and the file is flagged for manual review. Crucially, scan results are cached per SHA-256. If the same blob has already been scanned and found clean, subsequent uploads of identical content skip the scanning worker entirely. This cache has a TTL of several days rather than being permanent, because virus signature databases update and a file that was clean yesterday might be detected tomorrow. The scanning infrastructure is isolated from the main application: scanner workers run with network access to the blob storage but no access to the metadata database or the share link system, limiting the blast radius if a malicious file somehow exploits the scanner process.
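The per-SHA-256 result cache with a TTL can be sketched as below. The `ScanCache` and `process` names are assumptions, and the scanner is injected as a plain callable standing in for a real ClamAV or cloud-API call:

```python
import time

SCAN_CACHE_TTL = 3 * 24 * 3600  # cache verdicts for a few days, not forever


class ScanCache:
    """Maps content hash -> (verdict, scanned_at). Stale entries force a rescan
    because signature databases update over time."""

    def __init__(self):
        self.results: dict[str, tuple[str, float]] = {}

    def lookup(self, sha256: str, now=None):
        now = now if now is not None else time.time()
        entry = self.results.get(sha256)
        if entry is None:
            return None
        verdict, scanned_at = entry
        if now - scanned_at > SCAN_CACHE_TTL:
            return None  # clean yesterday might be detected tomorrow
        return verdict


def process(sha256: str, scanner, cache: ScanCache, now=None) -> str:
    """Worker loop body: consult the cache, otherwise invoke the scanner."""
    cached = cache.lookup(sha256, now)
    if cached is not None:
        return cached  # identical content already scanned: skip the worker
    verdict = scanner(sha256)  # ClamAV / cloud scanning call in production
    cache.results[sha256] = (verdict, now if now is not None else time.time())
    return verdict
```

A deduplicated blob uploaded by a second user therefore costs nothing on the scanning side until the TTL forces a fresh look.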

CDN Delivery

Direct download from origin object storage at scale is expensive and slow for geographically distributed users. A CDN layer absorbs the majority of download traffic by caching file content at edge nodes close to users. Downloads work through a two-step process: the application server performs the access control check and then generates a presigned CDN URL that encodes the content hash, a timestamp, and an HMAC signature. The CDN validates this signature without calling back to the origin, serving the cached content if available. Because the cache key is the SHA-256 content hash rather than the file name or user-specific URL, all users downloading the same file content share a single cache entry at each edge node, achieving very high cache hit rates for popular files. Large files are served using HTTP range requests: the client specifies a byte range, the CDN serves that range from cache or fetches it from origin using a matching range request, and clients can download different ranges in parallel for maximum throughput. This is the same mechanism that video players use for adaptive bitrate streaming. The CDN also handles MIME type headers correctly using the stored MIME type from the metadata store, ensuring that browsers render images and PDFs inline rather than forcing a download dialog. Cache invalidation on file deletion or quarantine is handled by purging the CDN cache entry for the affected SHA-256 using the CDN’s purge API.
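Byte-range serving is simple enough to sketch directly. This is an illustrative parser for the common `bytes=start-end` form only (it ignores suffix ranges and multi-range requests, which HTTP also permits), returning the status code, body slice, and `Content-Range` header value:

```python
def serve_range(blob: bytes, range_header: str):
    """Handle a single 'bytes=start-end' range request against a cached blob.

    Returns (status, body, content_range): 206 Partial Content on success,
    416 Range Not Satisfiable when the start lies past the end of the blob.
    """
    units, _, spec = range_header.partition("=")
    assert units == "bytes"
    start_s, _, end_s = spec.partition("-")
    start = int(start_s)
    # An open-ended range like 'bytes=7-' means 'from 7 to the last byte'.
    end = int(end_s) if end_s else len(blob) - 1
    if start >= len(blob):
        return 416, b"", f"bytes */{len(blob)}"
    end = min(end, len(blob) - 1)  # clamp over-long ranges to the blob size
    return 206, blob[start:end + 1], f"bytes {start}-{end}/{len(blob)}"
```

Clients downloading different ranges in parallel simply issue several such requests against the same SHA-256-keyed cache entry.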

Storage Tiering

Storing all files on high-performance object storage regardless of access frequency is wasteful. A tiering policy balances cost and access latency by moving files between storage classes based on access patterns. Hot storage — standard object storage with millisecond access latency and higher per-GB cost — holds files that have been accessed within the last 90 days. Cold storage — infrequent access or archive tiers offered by every major cloud provider — costs a fraction of hot storage per GB but imposes a retrieval latency of seconds to hours and a per-GB retrieval fee. A lifecycle policy job runs nightly, queries the metadata store for files whose last_accessed_at timestamp is older than 90 days, and submits a storage class transition request for each. When a user accesses a cold-tier file, the first request triggers a retrieval operation. If the retrieval is instant the response is served normally. If the retrieval takes time — as with glacier-class storage — the server responds with a 202 Accepted and sends the user a notification when the file is ready. Access to a cold-tier file also re-warms it to hot storage, restarting the 90-day timer. Files marked as permanently deleted are not tiered; they are scheduled for deletion from all storage classes after a grace period that allows recovery from accidental deletion. The grace period is typically 30 days, after which the blob is deleted if no other metadata record references the same SHA-256.
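The nightly lifecycle decision can be sketched as a pure function over metadata records. The dict shape, field names, and action labels are assumptions for the example; the real job would run the same logic as a query against the metadata store and issue storage-class transition requests:

```python
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=90)      # idle longer than this -> cold tier
DELETION_GRACE = timedelta(days=30)  # recovery window before purge


def nightly_lifecycle(files, now: datetime):
    """Return (file_id, action) pairs for the nightly job to execute."""
    actions = []
    for f in files:
        if f["status"] == "deleted":
            # Production would also verify no other metadata record
            # references the same SHA-256 before purging the blob.
            if now - f["deleted_at"] >= DELETION_GRACE:
                actions.append((f["file_id"], "purge"))
        elif f["tier"] == "hot" and now - f["last_accessed_at"] > HOT_WINDOW:
            actions.append((f["file_id"], "to_cold"))
    return actions
```

The re-warming path runs in the opposite direction at access time: a cold-tier hit updates `last_accessed_at` and transitions the file back to hot, restarting the 90-day clock.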
