What is a document management system and how does file versioning work?

A document management system (DMS) stores, organizes, retrieves, and governs documents and files. Versioning works by treating each upload or edit as a new immutable version object linked to a parent document record. The document record holds a pointer to the current version; prior versions are retained in object storage (e.g. S3) with their own keys. Users can view version history, diff metadata, or restore a previous version by updating the current-version pointer, all without deleting any stored content.

How does access control work for documents and folders in a DMS?

Access control uses a combination of role-based (RBAC) and resource-level ACLs. Each document and folder has an ACL listing principals (users or groups) with permissions (view, comment, edit, delete, share). Permissions can be inherited from a parent folder or explicitly overridden. On every access request the service evaluates the effective permission by walking the folder hierarchy and merging inherited and explicit ACL entries. A permission cache keyed by (user_id, resource_id) with short TTL avoids repeated hierarchy traversals.

How is full-text search implemented in a document management system?

When a document is uploaded or a new version is created, the DMS extracts text (via PDF parser, DOCX extractor, or OCR for images) and sends it to a search index (e.g. Elasticsearch or OpenSearch). The index stores document ID, version ID, extracted text, and metadata fields. Search queries are executed against the index with ACL post-filtering or query-time ACL injection to ensure users only see results they are authorized to access. Incremental indexing updates only changed versions.

How does a document management system enforce retention policies?

Retention policies define minimum and maximum hold periods for document categories (e.g. contracts held 7 years). Each document is tagged with a retention class at creation. A retention engine periodically scans documents whose retention period has expired and either archives them to cold storage or permanently deletes them according to policy. Legal holds can be placed on individual documents or entire custodians, suspending deletion regardless of policy until the hold is released. All retention actions are written to an immutable audit log.

Low Level Design: Document Management System

What Is a Document Management System?

A document management system (DMS) stores, organizes, and retrieves files in a way that supports versioning, access control, search, and retention governance. It is a staple low-level design topic because it combines blob storage, relational metadata, search indexing, and permission modeling into one coherent service.

Core Requirements

Functional

Upload, download, and delete documents.
Automatic versioning: each upload creates a new version; prior versions remain accessible.
Metadata: title, tags, custom key-value attributes per document.
Hierarchical folder organization.
Role-based and ACL-based access control at the document and folder level.
Full-text search over document content and metadata.
Retention policies: auto-archive or delete documents after a configured period.

Non-Functional

Upload latency: large files streamed directly to object storage; metadata write < 100 ms.
Download: pre-signed URLs issued in under 50 ms.
Search: sub-second for most queries.
Durability: documents stored with at least 11 nines durability (S3-class).

Data Model

folder

CREATE TABLE folder (
  id          UUID PRIMARY KEY,
  parent_id   UUID REFERENCES folder(id),
  owner_id    UUID NOT NULL,
  name        VARCHAR(255) NOT NULL,
  path        TEXT NOT NULL,   -- materialized path, e.g. /root/legal/contracts
  created_at  TIMESTAMP NOT NULL
);

document

CREATE TABLE document (
  id              UUID PRIMARY KEY,
  folder_id       UUID NOT NULL REFERENCES folder(id),
  owner_id        UUID NOT NULL,
  title           VARCHAR(512) NOT NULL,
  current_version INT NOT NULL DEFAULT 1,
  status          VARCHAR(32) NOT NULL DEFAULT 'active',
  -- 'active' | 'archived' | 'deleted'
  retention_days  INT,
  created_at      TIMESTAMP NOT NULL,
  updated_at      TIMESTAMP NOT NULL
);

document_version

CREATE TABLE document_version (
  id              UUID PRIMARY KEY,
  document_id     UUID NOT NULL REFERENCES document(id),
  version_number  INT NOT NULL,
  storage_key     TEXT NOT NULL,   -- S3 object key
  mime_type       VARCHAR(128),
  size_bytes      BIGINT,
  checksum_sha256 CHAR(64),
  uploader_id     UUID NOT NULL,
  status          VARCHAR(16) NOT NULL DEFAULT 'pending',
  -- 'pending' | 'active' | 'purged'
  created_at      TIMESTAMP NOT NULL,
  UNIQUE (document_id, version_number)
);

document_metadata

CREATE TABLE document_metadata (
  document_id  UUID NOT NULL REFERENCES document(id),
  meta_key     VARCHAR(128) NOT NULL,
  meta_value   TEXT,
  PRIMARY KEY (document_id, meta_key)
);

document_tag

CREATE TABLE document_tag (
  document_id  UUID NOT NULL REFERENCES document(id),
  tag          VARCHAR(128) NOT NULL,
  PRIMARY KEY (document_id, tag)
);

permission

CREATE TABLE permission (
  id             UUID PRIMARY KEY,
  resource_type  VARCHAR(16) NOT NULL,  -- 'document' | 'folder'
  resource_id    UUID NOT NULL,
  principal_type VARCHAR(16) NOT NULL,  -- 'user' | 'group'
  principal_id   UUID NOT NULL,
  access_level   VARCHAR(16) NOT NULL,  -- 'viewer' | 'editor' | 'admin'
  inherited      BOOLEAN NOT NULL DEFAULT FALSE
);

API Design

POST   /v1/folders                          -- create folder
POST   /v1/documents                        -- create document record + get upload URL
POST   /v1/documents/{id}/versions          -- create new version + get upload URL
GET    /v1/documents/{id}                   -- fetch metadata + current version info
GET    /v1/documents/{id}/versions          -- list all versions
GET    /v1/documents/{id}/versions/{v}/url  -- get pre-signed download URL
DELETE /v1/documents/{id}                   -- soft delete
PATCH  /v1/documents/{id}/metadata          -- upsert metadata / tags
GET    /v1/search?q=...&folder=...          -- full-text search
POST   /v1/documents/{id}/permissions       -- grant access
DELETE /v1/documents/{id}/permissions/{pid} -- revoke access

Upload Flow

Client calls POST /v1/documents with title, folder, and MIME type.
Service creates a document row and a document_version row (status: pending).
Service generates a pre-signed S3 PUT URL for the version's storage_key and returns it to the client.
Client uploads bytes directly to S3.
S3 triggers an event (SNS/SQS) on successful object creation.
Service consumes the event, computes the checksum, marks the version active, updates document.current_version, and enqueues a text-extraction job for search indexing.

This pattern keeps large binary data off the application tier entirely.
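The steps above can be sketched in memory as follows. This is a minimal illustration, not a definitive implementation: the UploadService class, the presign_put_url stub, the storage key layout, and the convention that current_version starts at 0 until the first upload lands are all assumptions made for the example. A real service would call the object store's pre-signing API and persist rows in the tables defined earlier.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Version:
    id: str
    document_id: str
    version_number: int
    storage_key: str
    status: str = "pending"  # stays 'pending' until the storage event arrives


@dataclass
class Document:
    id: str
    title: str
    folder_id: str
    current_version: int = 0  # sketch-only convention: 0 means no active version yet


@dataclass
class UploadService:
    documents: dict = field(default_factory=dict)         # document id -> Document
    versions: dict = field(default_factory=dict)          # storage_key -> Version
    extraction_queue: list = field(default_factory=list)  # version ids awaiting text extraction

    def presign_put_url(self, storage_key: str) -> str:
        # Hypothetical stand-in for a real pre-signing call to the object store.
        return f"https://uploads.example.invalid/{storage_key}?signature=placeholder"

    def create_document(self, title: str, folder_id: str):
        """Steps 1-3: create the rows and hand back an upload URL."""
        doc = Document(id=str(uuid.uuid4()), title=title, folder_id=folder_id)
        key = f"documents/{doc.id}/v1"  # assumed key layout
        self.documents[doc.id] = doc
        self.versions[key] = Version(str(uuid.uuid4()), doc.id, 1, key)
        return doc, self.presign_put_url(key)

    def on_object_created(self, storage_key: str) -> None:
        """Steps 5-6: handle the storage event for a finished upload."""
        version = self.versions[storage_key]
        if version.status == "active":
            return  # events may be delivered more than once; stay idempotent
        version.status = "active"
        self.documents[version.document_id].current_version = version.version_number
        self.extraction_queue.append(version.id)
```

The handler is written to be idempotent because a queue-backed event can arrive more than once; a duplicate delivery should not enqueue a second extraction job.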

Versioning

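Each upload increments version_number under an optimistic lock or a SELECT ... FOR UPDATE on the document row, so two concurrent uploads cannot claim the same number; the UNIQUE (document_id, version_number) constraint acts as a backstop. Old versions keep their storage_key in S3 and are never deleted when a new version arrives. To restore a previous version, insert a new document_version row pointing to the old storage_key and update current_version to the new version number.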
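A minimal sketch of the optimistic-lock variant, using an in-memory SQLite database so it runs anywhere. The trimmed-down tables and the add_version and restore_version helpers are assumptions for illustration; SQLite has no SELECT ... FOR UPDATE, so the example uses a compare-and-set UPDATE instead.

```python
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    """Create a reduced version of the document / document_version tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE document (
            id TEXT PRIMARY KEY,
            current_version INTEGER NOT NULL
        );
        CREATE TABLE document_version (
            id TEXT PRIMARY KEY,
            document_id TEXT NOT NULL REFERENCES document(id),
            version_number INTEGER NOT NULL,
            storage_key TEXT NOT NULL,
            UNIQUE (document_id, version_number)
        );
    """)
    return db


def add_version(db: sqlite3.Connection, document_id: str, storage_key: str) -> int:
    """Append a version; retry if another writer advanced current_version first."""
    while True:
        (seen,) = db.execute(
            "SELECT current_version FROM document WHERE id = ?", (document_id,)
        ).fetchone()
        with db:  # one transaction: commits on success, rolls back on error
            cur = db.execute(
                "UPDATE document SET current_version = ? "
                "WHERE id = ? AND current_version = ?",
                (seen + 1, document_id, seen),
            )
            if cur.rowcount == 0:
                continue  # lost the race; re-read and try again
            db.execute(
                "INSERT INTO document_version VALUES (?, ?, ?, ?)",
                (str(uuid.uuid4()), document_id, seen + 1, storage_key),
            )
            return seen + 1


def restore_version(db: sqlite3.Connection, document_id: str, version_number: int) -> int:
    """Restore by adding a new version that reuses the old storage_key."""
    (old_key,) = db.execute(
        "SELECT storage_key FROM document_version "
        "WHERE document_id = ? AND version_number = ?",
        (document_id, version_number),
    ).fetchone()
    return add_version(db, document_id, old_key)
```

Because a restore only adds a row, the history stays append-only: restoring version 1 of a two-version document yields version 3 with the same storage_key as version 1.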

Access Control

Permission resolution order:

Check explicit permissions on the document itself.
Walk up the folder tree checking folder permissions (inherited).
Deny by default.

Cache resolved permissions per (principal, resource) with a short TTL (e.g., 30 s) to avoid repeated tree walks. Invalidate on any permission change; a change to a folder's permissions affects every cached entry for resources beneath that folder.
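The resolution order and the cache can be sketched as below. This is an illustrative reading, not the only valid one: it assumes the nearest grant wins (document first, then the closest ancestor folder), handles only direct user grants rather than groups, and clears the whole cache on invalidation. The PermissionResolver class and its dict-based inputs are hypothetical.

```python
import time

LEVELS = {"viewer": 1, "editor": 2, "admin": 3}


class PermissionResolver:
    def __init__(self, folder_parent, doc_folder, grants,
                 ttl_seconds=30.0, clock=time.monotonic):
        self.folder_parent = folder_parent  # folder_id -> parent folder_id or None
        self.doc_folder = doc_folder        # document_id -> folder_id
        self.grants = grants                # (resource_id, principal_id) -> access level
        self.ttl = ttl_seconds
        self.clock = clock
        self._cache = {}                    # (principal_id, document_id) -> (expires_at, level)

    def resolve(self, principal_id, document_id):
        """Return the effective access level, or None when access is denied."""
        key = (principal_id, document_id)
        hit = self._cache.get(key)
        if hit is not None and hit[0] > self.clock():
            return hit[1]
        level = self._walk(principal_id, document_id)
        self._cache[key] = (self.clock() + self.ttl, level)
        return level

    def _walk(self, principal_id, document_id):
        # 1. Explicit grant on the document itself.
        level = self.grants.get((document_id, principal_id))
        if level is not None:
            return level
        # 2. Walk up the folder tree; the nearest folder grant wins.
        folder = self.doc_folder.get(document_id)
        while folder is not None:
            level = self.grants.get((folder, principal_id))
            if level is not None:
                return level
            folder = self.folder_parent.get(folder)
        # 3. Deny by default.
        return None

    def can(self, principal_id, document_id, required):
        level = self.resolve(principal_id, document_id)
        return level is not None and LEVELS[level] >= LEVELS[required]

    def invalidate(self):
        # Coarse invalidation: drop everything on any permission change.
        self._cache.clear()
```

Clearing the whole cache is the simplest correct choice for a sketch; a production service would more likely evict only the entries under the changed folder.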

Full-Text Search

Use Elasticsearch or OpenSearch as the search backend:

Index document: id, title, tags, metadata, extracted text content, folder_id, ACL list.
Include the ACL (list of principal IDs with access) in the index document so search can apply a document-level filter at query time, avoiding a separate permission check for each result.
Text extraction: run Tika or a Lambda function to extract plain text from PDF, DOCX, etc., then update the index.
Re-index on new version upload, on metadata change, and on permission change, since the ACL is denormalized into the index document.
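As a rough sketch of the query-time ACL filter, the function below builds an Elasticsearch/OpenSearch-style request body as a plain dictionary. The field names (acl, content, title, tags, folder_id) and the title boost are assumptions about how the index might be mapped, and build_search_query is a hypothetical helper, not part of any client library.

```python
def build_search_query(text, principal_ids, folder_id=None, size=20):
    """Build a bool query: full-text match plus non-scoring ACL and folder filters."""
    filters = [
        # Keep only documents whose ACL contains at least one of the caller's
        # principal IDs (the user's own ID plus the IDs of their groups).
        {"terms": {"acl": sorted(principal_ids)}},
    ]
    if folder_id is not None:
        filters.append({"term": {"folder_id": folder_id}})

    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": text,
                            "fields": ["title^2", "tags", "content"],
                        }
                    }
                ],
                "filter": filters,
            }
        },
    }
```

Putting the ACL check in the filter clause rather than in must keeps it out of relevance scoring, so authorization never changes the ranking of the results a user is allowed to see.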

Retention Policies

Each document may have a retention_days value. A background job runs daily:

Select documents where status = 'active', retention_days is set, and created_at + retention_days (in days) <= now().
Move them to archived (or deleted per policy).
For hard deletes, remove S3 objects after a grace period and mark versions as purged.
Write an audit event for every retention action.

For legal hold, add a legal_hold flag on the document; the retention job skips flagged documents until the hold is released.
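One pass of the daily job might look like the sketch below. The Doc dataclass, the run_retention function, and the shape of the audit events are assumptions for illustration; a real job would query the document table in batches and write to a durable audit log rather than return a list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Doc:
    id: str
    created_at: datetime
    retention_days: Optional[int]
    status: str = "active"
    legal_hold: bool = False


def run_retention(docs, now, action="archived"):
    """Apply the retention policy once and return the audit events it produced."""
    audit = []
    for doc in docs:
        if doc.status != "active" or doc.retention_days is None:
            continue  # already processed, or no policy attached
        if doc.created_at + timedelta(days=doc.retention_days) > now:
            continue  # retention period has not expired yet
        if doc.legal_hold:
            continue  # a legal hold overrides the policy
        doc.status = action  # 'archived' or 'deleted', per policy
        audit.append({
            "document_id": doc.id,
            "action": action,
            "at": now.isoformat(),
        })
    return audit
```

Passing now in as a parameter, instead of reading the clock inside the function, keeps the job deterministic and easy to test against fixed dates.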

Scalability Considerations

Shard the document_version table by document_id if version counts grow large.
Use S3 multipart upload for files > 100 MB.
CDN (CloudFront) in front of S3 for frequently downloaded public documents.
Materialized path on folder.path turns subtree queries into a simple prefix match, LIKE '/root/legal/%', with no recursive CTE.
Background text extraction is async; search results may lag by seconds after upload.
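The materialized-path subtree query can be tried out with SQLite as a stand-in database. The subtree and escape_like helpers are hypothetical; the escaping step is an addition worth considering because % and _ are wildcards in LIKE, so a folder named legal_ops would otherwise match unintended siblings.

```python
import sqlite3


def escape_like(value: str, escape: str = "\\") -> str:
    """Escape LIKE wildcards so folder names are matched literally."""
    return (
        value.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def subtree(db: sqlite3.Connection, root_path: str) -> list:
    """Return the paths of all descendants of root_path using a prefix match."""
    pattern = escape_like(root_path.rstrip("/")) + "/%"
    rows = db.execute(
        "SELECT path FROM folder WHERE path LIKE ? ESCAPE '\\' ORDER BY path",
        (pattern,),
    )
    return [row[0] for row in rows]
```

The trailing slash before % matters: it keeps /root/legalese out of the results for /root/legal.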

Common Interview Follow-Ups

Deduplication: Store SHA-256 checksum; if a matching hash already exists in storage, reuse the S3 key (content-addressable storage).
Collaborative editing: Integrate an OT/CRDT layer (e.g., Yjs) for real-time co-editing; save a new version on each explicit save or periodic checkpoint.
Virus scanning: Trigger a scanner Lambda on the S3 event and mark the version active only after the scan reports the file clean.
Quota enforcement: Track bytes used per owner; reject uploads that exceed quota before issuing the pre-signed URL.
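The deduplication idea can be sketched with a small in-memory content-addressable store. The ContentStore class, the cas/ key prefix, and the reference counter are assumptions for illustration. With the pre-signed upload flow the hash is only known after the bytes land, so in practice the check would run in the post-upload handler or rely on a client-supplied digest.

```python
import hashlib


class ContentStore:
    """Content-addressable store: identical bytes always map to one storage key."""

    def __init__(self):
        self.objects = {}   # storage_key -> bytes
        self.refcount = {}  # storage_key -> number of versions pointing at it

    @staticmethod
    def key_for(data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        return f"cas/{digest[:2]}/{digest}"  # assumed key layout

    def put(self, data: bytes) -> str:
        """Store data if it is new; otherwise reuse the existing key."""
        key = self.key_for(data)
        if key not in self.objects:
            self.objects[key] = data
        self.refcount[key] = self.refcount.get(key, 0) + 1
        return key

    def release(self, key: str) -> None:
        """Drop one reference; delete the object when nothing points at it."""
        self.refcount[key] -= 1
        if self.refcount[key] == 0:
            del self.refcount[key]
            del self.objects[key]
```

Reference counting matters once keys are shared: a hard delete under a retention policy must not remove bytes that another document's version still points to.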