What Is a Document Management System?
A document management system (DMS) stores, organizes, and retrieves files in a way that supports versioning, access control, search, and retention governance. It is a staple low-level design topic because it combines blob storage, relational metadata, search indexing, and permission modeling into one coherent service.
Core Requirements
Functional
- Upload, download, and delete documents.
- Automatic versioning: each upload creates a new version; prior versions remain accessible.
- Metadata: title, tags, custom key-value attributes per document.
- Hierarchical folder organization.
- Role-based and ACL-based access control at the document and folder level.
- Full-text search over document content and metadata.
- Retention policies: auto-archive or delete documents after a configured period.
Non-Functional
- Upload latency: large files streamed directly to object storage; metadata write < 100 ms.
- Download: pre-signed URLs issued in < 50 ms.
- Search: sub-second for most queries.
- Durability: documents stored with at least 11 nines durability (S3-class).
Data Model
folder
CREATE TABLE folder (
    id          UUID PRIMARY KEY,
    parent_id   UUID REFERENCES folder(id),
    owner_id    UUID NOT NULL,
    name        VARCHAR(255) NOT NULL,
    path        TEXT NOT NULL,  -- materialized path, e.g. /root/legal/contracts
    created_at  TIMESTAMP NOT NULL
);
document
CREATE TABLE document (
    id               UUID PRIMARY KEY,
    folder_id        UUID NOT NULL REFERENCES folder(id),
    owner_id         UUID NOT NULL,
    title            VARCHAR(512) NOT NULL,
    current_version  INT NOT NULL DEFAULT 1,
    status           VARCHAR(32) NOT NULL DEFAULT 'active',  -- 'active' | 'archived' | 'deleted'
    retention_days   INT,
    created_at       TIMESTAMP NOT NULL,
    updated_at       TIMESTAMP NOT NULL
);
document_version
CREATE TABLE document_version (
    id               UUID PRIMARY KEY,
    document_id      UUID NOT NULL REFERENCES document(id),
    version_number   INT NOT NULL,
    storage_key      TEXT NOT NULL,  -- S3 object key
    mime_type        VARCHAR(128),
    size_bytes       BIGINT,
    checksum_sha256  CHAR(64),
    uploader_id      UUID NOT NULL,
    status           VARCHAR(16) NOT NULL DEFAULT 'pending',  -- 'pending' | 'active' | 'purged'
    created_at       TIMESTAMP NOT NULL,
    UNIQUE (document_id, version_number)
);
document_metadata
CREATE TABLE document_metadata (
    document_id  UUID NOT NULL REFERENCES document(id),
    meta_key     VARCHAR(128) NOT NULL,
    meta_value   TEXT,
    PRIMARY KEY (document_id, meta_key)
);
document_tag
CREATE TABLE document_tag (
    document_id  UUID NOT NULL REFERENCES document(id),
    tag          VARCHAR(128) NOT NULL,
    PRIMARY KEY (document_id, tag)
);
permission
CREATE TABLE permission (
    id              UUID PRIMARY KEY,
    resource_type   VARCHAR(16) NOT NULL,  -- 'document' | 'folder'
    resource_id     UUID NOT NULL,
    principal_type  VARCHAR(16) NOT NULL,  -- 'user' | 'group'
    principal_id    UUID NOT NULL,
    access_level    VARCHAR(16) NOT NULL,  -- 'viewer' | 'editor' | 'admin'
    inherited       BOOLEAN NOT NULL DEFAULT FALSE
);
API Design
POST /v1/folders -- create folder
POST /v1/documents -- create document record + get upload URL
POST /v1/documents/{id}/versions -- create new version record + get upload URL
GET /v1/documents/{id} -- fetch metadata + current version info
GET /v1/documents/{id}/versions -- list all versions
GET /v1/documents/{id}/versions/{v}/url -- get pre-signed download URL
DELETE /v1/documents/{id} -- soft delete
PATCH /v1/documents/{id}/metadata -- upsert metadata / tags
GET /v1/search?q=...&folder=... -- full-text search
POST /v1/documents/{id}/permissions -- grant access
DELETE /v1/documents/{id}/permissions/{pid} -- revoke access
Upload Flow
- Client calls POST /v1/documents with title, folder, and MIME type.
- Service creates a document row and a document_version row (status: pending).
- Service generates a pre-signed S3 PUT URL for the version's storage_key and returns it to the client.
- Client uploads bytes directly to S3.
- S3 triggers an event (SNS/SQS) on successful object creation.
- Service marks the version active, updates document.current_version, computes checksum, and enqueues a text-extraction job for search indexing.
This pattern keeps large binary data off the application tier entirely.
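The flow above can be sketched as a small in-memory state machine. This is only an illustration under stated assumptions: `UploadService`, `fake_presign_put`, the `documents/{id}/v1` key layout, and the use of `current_version = 0` to mean "no confirmed version yet" are all hypothetical, and a real service would call its storage SDK to sign the URL and would compute the checksum in the event handler.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Version:
    document_id: str
    version_number: int
    storage_key: str
    status: str = "pending"  # stays 'pending' until the storage event arrives


@dataclass
class Document:
    id: str
    title: str
    folder_id: str
    current_version: int = 0  # sketch-only convention: 0 = nothing confirmed yet


def fake_presign_put(storage_key: str) -> str:
    """Hypothetical stand-in for a real pre-signed PUT URL generator."""
    return f"https://storage.example.invalid/{storage_key}?signature=stub"


@dataclass
class UploadService:
    documents: dict = field(default_factory=dict)         # document_id -> Document
    versions: dict = field(default_factory=dict)          # storage_key -> Version
    extraction_queue: list = field(default_factory=list)  # keys awaiting text extraction

    def create_document(self, title: str, folder_id: str):
        """Create the document and pending version rows; return (id, upload URL)."""
        doc = Document(id=str(uuid.uuid4()), title=title, folder_id=folder_id)
        key = f"documents/{doc.id}/v1"
        self.documents[doc.id] = doc
        self.versions[key] = Version(doc.id, 1, key)
        return doc.id, fake_presign_put(key)

    def on_object_created(self, storage_key: str) -> None:
        """Handle the storage event that follows the client's direct upload."""
        version = self.versions[storage_key]
        if version.status == "active":
            return  # queue delivery may repeat, so the handler is idempotent
        version.status = "active"
        self.documents[version.document_id].current_version = version.version_number
        self.extraction_queue.append(storage_key)
```

The application tier only ever handles the small metadata request and the event notification; the file bytes travel between the client and object storage.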
Versioning
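Each upload is assigned the next version_number. Serialize concurrent uploads with an optimistic lock or a SELECT ... FOR UPDATE on the document row so that two writers cannot claim the same number; the UNIQUE (document_id, version_number) constraint acts as a backstop. Old versions keep their storage_key, and their S3 objects are not deleted when a newer version arrives. To restore a previous version, insert a new document_version row pointing to the old storage_key and update current_version.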
Access Control
Permission resolution order:
- Check explicit permissions on the document itself.
- Walk up the folder tree checking folder permissions (inherited).
- Deny by default.
Cache resolved permissions per (principal, resource) with a short TTL (e.g., 30 s) to avoid repeated tree walks. Invalidate on any permission change.
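The resolution order and the TTL cache can be sketched as follows. Several things here are assumptions rather than part of the design above: the class name `PermissionResolver`, the plain dictionaries standing in for the folder and permission tables, the rule that the nearest folder grant wins during the walk, and the coarse full-cache flush on invalidation. Group membership is left out to keep the sketch short.

```python
import time

LEVELS = {"viewer": 1, "editor": 2, "admin": 3}


class PermissionResolver:
    def __init__(self, folder_parent, document_folder, grants,
                 ttl_seconds=30.0, clock=time.monotonic):
        self.folder_parent = folder_parent      # folder_id -> parent folder_id or None
        self.document_folder = document_folder  # document_id -> folder_id
        self.grants = grants                    # (resource_id, principal_id) -> access_level
        self.ttl = ttl_seconds
        self.clock = clock
        self._cache = {}                        # (principal_id, document_id) -> (expires_at, level)

    def resolve(self, principal_id, document_id):
        """Return the effective access level, or None when access is denied."""
        key = (principal_id, document_id)
        hit = self._cache.get(key)
        if hit and hit[0] > self.clock():
            return hit[1]
        level = self._resolve_uncached(principal_id, document_id)
        self._cache[key] = (self.clock() + self.ttl, level)
        return level

    def _resolve_uncached(self, principal_id, document_id):
        # 1. An explicit grant on the document itself.
        level = self.grants.get((document_id, principal_id))
        if level:
            return level
        # 2. Walk up the folder tree; the nearest folder grant applies (assumption).
        folder_id = self.document_folder.get(document_id)
        while folder_id is not None:
            level = self.grants.get((folder_id, principal_id))
            if level:
                return level
            folder_id = self.folder_parent.get(folder_id)
        # 3. Deny by default.
        return None

    def can(self, principal_id, document_id, required):
        level = self.resolve(principal_id, document_id)
        return level is not None and LEVELS[level] >= LEVELS[required]

    def invalidate(self):
        """Call after any permission change; flushes every cached entry."""
        self._cache.clear()
```

Denials are cached as well as grants, so a newly added grant is not visible until the entry expires or `invalidate()` runs. That is why invalidating on every permission change matters.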
Full-Text Search
Use Elasticsearch or OpenSearch as the search backend:
- Index document: id, title, tags, metadata, extracted text content, folder_id, ACL list.
- Include the ACL (list of principal IDs with access) in the index document so search can apply a document-level filter at query time, avoiding a separate permission check for each result.
- Text extraction: run Tika or a Lambda function to extract plain text from PDF, DOCX, etc., then update the index.
- Re-index on new version upload and on metadata change.
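As an illustration of the query-time ACL filter, the function below builds a request body in the Elasticsearch/OpenSearch query DSL: a scored `multi_match` clause for the text, plus non-scoring `terms` and `term` filters. The field names `acl`, `content`, `title`, `tags`, and `folder_id`, the title boost, and the function name `build_search_body` are assumptions; `acl` and `folder_id` are assumed to be mapped as keyword fields.

```python
def build_search_body(query_text, principal_ids, folder_id=None, size=20):
    """Build a bool query: full-text match plus ACL (and optional folder) filters."""
    # The caller passes the user's own ID plus the IDs of every group they belong to.
    filters = [{"terms": {"acl": list(principal_ids)}}]
    if folder_id is not None:
        filters.append({"term": {"folder_id": folder_id}})
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": query_text,
                            "fields": ["title^2", "tags", "content"],
                        }
                    }
                ],
                "filter": filters,
            }
        },
    }
```

Because filter clauses do not contribute to scoring, the ACL restriction narrows the result set without changing relevance ranking.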
Retention Policies
Each document may have a retention_days value. A background job runs daily:
- Select documents where status = 'active' and created_at + retention_days <= now().
- Move them to archived (or deleted per policy).
- For hard deletes, remove S3 objects after a grace period and mark versions as purged.
- Write an audit event for every retention action.
For legal hold: add a legal_hold flag that overrides retention deletion.
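One pass of the daily job might look like the sketch below. The `Doc` dataclass, the `apply_retention` function, and the audit event shape are hypothetical. The sketch also treats a legal hold as blocking any transition, which is a stricter assumption than the deletion-only override described above. Storage cleanup after the grace period is out of scope here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Doc:
    id: str
    status: str
    created_at: datetime
    retention_days: Optional[int] = None  # None = no retention policy
    legal_hold: bool = False


def apply_retention(docs, now, target_status="archived"):
    """Transition expired active documents and return the audit events."""
    audit = []
    for doc in docs:
        if doc.status != "active" or doc.retention_days is None:
            continue
        if doc.created_at + timedelta(days=doc.retention_days) > now:
            continue  # retention period has not elapsed yet
        if doc.legal_hold:
            # Assumption: a hold suspends archiving as well as deletion.
            audit.append({"document_id": doc.id, "action": "skipped_legal_hold", "at": now})
            continue
        doc.status = target_status
        audit.append({"document_id": doc.id, "action": target_status, "at": now})
    return audit
```

Passing `now` in as a parameter, rather than reading the clock inside the function, makes the job deterministic and easy to test.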
Scalability Considerations
- Shard the document_version table by document_id if version counts grow large.
- Use S3 multipart upload for files > 100 MB.
- CDN (CloudFront) in front of S3 for frequently downloaded public documents.
- Materialized path on folder.path makes subtree queries a simple LIKE '/root/legal/%' without recursive CTEs.
- Background text extraction is async; search results may lag by seconds after upload.
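The materialized-path subtree query can be tried out with SQLite from the Python standard library. This is a sketch under assumptions: the single-column `folder` table is a trimmed stand-in, and `subtree_folders` is a hypothetical helper. Two details are easy to miss: the pattern must end in `/%` so that `/root/legal` does not also match `/root/legalese`, and LIKE wildcards inside folder names need escaping.

```python
import sqlite3


def subtree_folders(conn, root_path):
    """Return the path of root_path plus every folder beneath it."""
    root = root_path.rstrip("/")
    # Escape LIKE wildcards so a folder named "a_b" or "100%" matches only itself.
    escaped = (
        root.replace("\\", "\\\\")
        .replace("%", "\\%")
        .replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT path FROM folder "
        "WHERE path = ? OR path LIKE ? ESCAPE '\\' "
        "ORDER BY path",
        (root, escaped + "/%"),
    )
    return [path for (path,) in rows]
```

Whether a prefix LIKE can use an index on `path` depends on the database and its collation settings, so it is worth checking the query plan in the target engine.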
Common Interview Follow-Ups
- Deduplication: Store SHA-256 checksum; if a matching hash already exists in storage, reuse the S3 key (content-addressable storage).
- Collaborative editing: Integrate an OT/CRDT layer (e.g., Yjs) for real-time co-editing; save a new version on each explicit save or periodic checkpoint.
- Virus scanning: Trigger a scanner lambda on the S3 event before marking the version active.
- Quota enforcement: Track bytes used per owner; reject uploads that exceed quota before issuing the pre-signed URL.
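The deduplication and quota follow-ups can be combined in one small sketch. Everything named here is an assumption for illustration: `ContentStore` is an in-memory stand-in for object storage, the `cas/<prefix>/<digest>` key layout is hypothetical, and quota is charged per logical upload even when the bytes are deduplicated. In the real flow the quota check would use the declared size before the pre-signed URL is issued.

```python
import hashlib


class ContentStore:
    """In-memory stand-in for object storage keyed by content hash."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.blobs = {}  # storage_key -> bytes
        self.used = {}   # owner_id -> bytes charged to that owner

    def put(self, owner_id, data):
        """Store data once per distinct SHA-256; return (storage_key, was_new)."""
        if self.used.get(owner_id, 0) + len(data) > self.quota_bytes:
            raise PermissionError("quota exceeded")
        digest = hashlib.sha256(data).hexdigest()
        key = f"cas/{digest[:2]}/{digest}"
        was_new = key not in self.blobs
        if was_new:
            self.blobs[key] = data  # identical content is stored only once
        self.used[owner_id] = self.used.get(owner_id, 0) + len(data)
        return key, was_new
```

With shared keys, a hard delete must not remove an object while any other version still references it, so a content-addressable layout usually needs reference counting or a garbage-collection pass.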