Low Level Design: Document Management System

What Is a Document Management System?

A document management system (DMS) stores, organizes, and retrieves files in a way that supports versioning, access control, search, and retention governance. It is a staple low-level design topic because it combines blob storage, relational metadata, search indexing, and permission modeling into one coherent service.

Core Requirements

Functional

  • Upload, download, and delete documents.
  • Automatic versioning: each upload creates a new version; prior versions remain accessible.
  • Metadata: title, tags, custom key-value attributes per document.
  • Hierarchical folder organization.
  • Role-based and ACL-based access control at the document and folder level.
  • Full-text search over document content and metadata.
  • Retention policies: auto-archive or delete documents after a configured period.

Non-Functional

  • Upload latency: large files streamed directly to object storage; metadata write < 100 ms.
  • Download: pre-signed URL issued in < 50 ms.
  • Search: sub-second for most queries.
  • Durability: documents stored with at least eleven nines (99.999999999%) of durability (S3-class).

Data Model

folder

CREATE TABLE folder (
  id          UUID PRIMARY KEY,
  parent_id   UUID REFERENCES folder(id),
  owner_id    UUID NOT NULL,
  name        VARCHAR(255) NOT NULL,
  path        TEXT NOT NULL,   -- materialized path, e.g. /root/legal/contracts
  created_at  TIMESTAMP NOT NULL
);

document

CREATE TABLE document (
  id              UUID PRIMARY KEY,
  folder_id       UUID NOT NULL REFERENCES folder(id),
  owner_id        UUID NOT NULL,
  title           VARCHAR(512) NOT NULL,
  current_version INT NOT NULL DEFAULT 1,
  status          VARCHAR(32) NOT NULL DEFAULT 'active',
  -- 'active' | 'archived' | 'deleted'
  retention_days  INT,
  created_at      TIMESTAMP NOT NULL,
  updated_at      TIMESTAMP NOT NULL
);

document_version

CREATE TABLE document_version (
  id              UUID PRIMARY KEY,
  document_id     UUID NOT NULL REFERENCES document(id),
  version_number  INT NOT NULL,
  storage_key     TEXT NOT NULL,   -- S3 object key
  mime_type       VARCHAR(128),
  size_bytes      BIGINT,
  checksum_sha256 CHAR(64),
  uploader_id     UUID NOT NULL,
  status          VARCHAR(32) NOT NULL DEFAULT 'pending',
  -- 'pending' | 'active' | 'purged'
  created_at      TIMESTAMP NOT NULL,
  UNIQUE (document_id, version_number)
);

document_metadata

CREATE TABLE document_metadata (
  document_id  UUID NOT NULL REFERENCES document(id),
  meta_key     VARCHAR(128) NOT NULL,
  meta_value   TEXT,
  PRIMARY KEY (document_id, meta_key)
);

document_tag

CREATE TABLE document_tag (
  document_id  UUID NOT NULL REFERENCES document(id),
  tag          VARCHAR(128) NOT NULL,
  PRIMARY KEY (document_id, tag)
);

permission

CREATE TABLE permission (
  id             UUID PRIMARY KEY,
  resource_type  VARCHAR(16) NOT NULL,  -- 'document' | 'folder'
  resource_id    UUID NOT NULL,
  principal_type VARCHAR(16) NOT NULL,  -- 'user' | 'group'
  principal_id   UUID NOT NULL,
  access_level   VARCHAR(16) NOT NULL,  -- 'viewer' | 'editor' | 'admin'
  inherited      BOOLEAN NOT NULL DEFAULT FALSE
);

API Design

POST   /v1/folders                          -- create folder
POST   /v1/documents                        -- create document record + get upload URL
POST   /v1/documents/{id}/versions          -- create new version + get upload URL
GET    /v1/documents/{id}                   -- fetch metadata + current version info
GET    /v1/documents/{id}/versions          -- list all versions
GET    /v1/documents/{id}/versions/{v}/url  -- get pre-signed download URL
DELETE /v1/documents/{id}                   -- soft delete
PATCH  /v1/documents/{id}/metadata          -- upsert metadata / tags
GET    /v1/search?q=...&folder=...          -- full-text search
POST   /v1/documents/{id}/permissions       -- grant access
DELETE /v1/documents/{id}/permissions/{pid} -- revoke access

Upload Flow

  1. Client calls POST /v1/documents with title, folder, and MIME type.
  2. Service creates a document row and a document_version row (status: pending).
  3. Service generates a pre-signed S3 PUT URL for the version's storage_key and returns it to the client.
  4. Client uploads bytes directly to S3.
  5. S3 triggers an event (SNS/SQS) on successful object creation.
  6. Service records the object's size_bytes and checksum_sha256, marks the version active, updates document.current_version, and enqueues a text-extraction job for search indexing.

This pattern keeps large binary data off the application tier entirely.
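
The metadata side of this flow can be sketched as a small state machine. This is a minimal in-memory illustration, not a definitive implementation: UploadService, start_upload, on_object_created, and the storage-key layout are hypothetical names, and the pre-signed URL call is left as a comment. The event handler is written to be idempotent because queue-based notifications may be delivered more than once.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Version:
    document_id: str
    version_number: int
    storage_key: str
    status: str = "pending"  # stays 'pending' until the storage event arrives


@dataclass
class UploadService:
    """In-memory stand-in for the metadata database (hypothetical)."""
    versions: dict = field(default_factory=dict)         # storage_key -> Version
    current_version: dict = field(default_factory=dict)  # document_id -> int
    index_queue: list = field(default_factory=list)      # text-extraction jobs

    def start_upload(self, document_id: str) -> Version:
        """Steps 2-3: record a pending version and choose its storage key."""
        number = 1 + max(
            (v.version_number for v in self.versions.values()
             if v.document_id == document_id),
            default=0,
        )
        key = f"documents/{document_id}/v{number}/{uuid.uuid4()}"
        version = Version(document_id, number, key)
        self.versions[key] = version
        # A real service would return a pre-signed PUT URL for `key` here.
        return version

    def on_object_created(self, storage_key: str) -> bool:
        """Steps 5-6: handle the storage event; safe to call more than once."""
        version = self.versions.get(storage_key)
        if version is None or version.status != "pending":
            return False  # unknown key or duplicate delivery: nothing to do
        version.status = "active"
        previous = self.current_version.get(version.document_id, 0)
        self.current_version[version.document_id] = max(previous, version.version_number)
        self.index_queue.append(storage_key)  # enqueue text extraction
        return True
```

Calling start_upload and then on_object_created twice with the same key activates the version once; the second call returns False and enqueues nothing.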

Versioning

Each upload increments version_number using an optimistic lock or a SELECT ... FOR UPDATE on the document row, so that concurrent uploads cannot claim the same number; the UNIQUE (document_id, version_number) constraint is the backstop. Old versions keep their storage_key and the underlying S3 object is not deleted. To restore a previous version, insert a new document_version row pointing to the old storage_key and update current_version.
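
The optimistic-lock variant might look like the following sketch. It uses SQLite from the Python standard library only so that it runs anywhere; the trimmed two-table schema and the function names add_version and restore_version are assumptions made for illustration. A stale writer matches zero rows in the UPDATE and must re-read and retry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (id TEXT PRIMARY KEY, current_version INT NOT NULL);
CREATE TABLE document_version (
  document_id    TEXT NOT NULL REFERENCES document(id),
  version_number INT  NOT NULL,
  storage_key    TEXT NOT NULL,
  UNIQUE (document_id, version_number)
);
""")


def add_version(conn, document_id, storage_key, expected_current):
    """Append a version only if current_version is still what the caller read."""
    with conn:  # one transaction: commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE document SET current_version = current_version + 1 "
            "WHERE id = ? AND current_version = ?",
            (document_id, expected_current),
        )
        if cur.rowcount != 1:
            return None  # lost the race; caller re-reads and retries
        new_number = expected_current + 1
        conn.execute(
            "INSERT INTO document_version VALUES (?, ?, ?)",
            (document_id, new_number, storage_key),
        )
        return new_number


def restore_version(conn, document_id, old_number, expected_current):
    """Restore by appending a new version that reuses the old storage_key."""
    row = conn.execute(
        "SELECT storage_key FROM document_version "
        "WHERE document_id = ? AND version_number = ?",
        (document_id, old_number),
    ).fetchone()
    if row is None:
        return None
    return add_version(conn, document_id, row[0], expected_current)
```

Restoring never rewrites history: version 1 restored on top of version 2 becomes version 3, and both earlier rows remain listed.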

Access Control

Permission resolution order:

  1. Check explicit permissions on the document itself.
  2. Walk up the folder tree from the document's folder toward the root, checking folder permissions (inherited grants).
  3. Deny by default.

Cache resolved permissions per (principal, resource) with a short TTL (e.g., 30 s) to avoid repeated tree walks. Invalidate on any permission change.
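
The resolution order and the cache can be sketched as below. Several things here are assumptions rather than part of the design above: the PermissionResolver class and its dictionaries stand in for the permission and folder tables, the nearest ancestor folder with a grant is taken to win, group principals are left out, and invalidation simply clears the whole cache.

```python
import time

LEVELS = {"viewer": 1, "editor": 2, "admin": 3}


class PermissionResolver:
    def __init__(self, grants, folder_parent, doc_folder,
                 ttl_seconds=30.0, clock=time.monotonic):
        self.grants = grants                # (resource_type, resource_id, principal_id) -> level
        self.folder_parent = folder_parent  # folder_id -> parent folder_id or None
        self.doc_folder = doc_folder        # document_id -> folder_id
        self.ttl = ttl_seconds
        self.clock = clock
        self._cache = {}                    # (principal_id, document_id) -> (expires_at, level)

    def resolve(self, principal_id, document_id):
        """Return the effective access level, or None when access is denied."""
        key = (principal_id, document_id)
        hit = self._cache.get(key)
        if hit is not None and hit[0] > self.clock():
            return hit[1]
        level = self._walk(principal_id, document_id)
        self._cache[key] = (self.clock() + self.ttl, level)
        return level

    def _walk(self, principal_id, document_id):
        # 1. An explicit grant on the document wins.
        level = self.grants.get(("document", document_id, principal_id))
        if level:
            return level
        # 2. Otherwise the nearest ancestor folder with a grant decides (assumed rule).
        folder = self.doc_folder.get(document_id)
        while folder is not None:
            level = self.grants.get(("folder", folder, principal_id))
            if level:
                return level
            folder = self.folder_parent.get(folder)
        # 3. Deny by default.
        return None

    def allows(self, principal_id, document_id, required):
        level = self.resolve(principal_id, document_id)
        return level is not None and LEVELS[level] >= LEVELS[required]

    def invalidate(self):
        """Call on any permission change; clears every cached entry."""
        self._cache.clear()
```

Denials are cached too, so a newly granted permission is only visible after invalidate() runs or the TTL expires.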

Search

Use Elasticsearch or OpenSearch as the search backend:

  • Index document: id, title, tags, metadata, extracted text content, folder_id, ACL list.
  • Include the ACL (list of principal IDs with access) in the index document so search can apply a document-level filter at query time, avoiding a separate permission check for each result.
  • Text extraction: run Tika or a Lambda function to extract plain text from PDF, DOCX, etc., then update the index.
  • Re-index on new version upload, on metadata change, and on permission change, since the ACL is stored in the index document.
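
A query that applies the ACL filter might be built as follows. This is a sketch of the request body only: the function name build_search_query and the index field names (title, tags, content, acl, folder_id) are assumptions, and acl and folder_id are assumed to be mapped as keyword fields. The bool, multi_match, terms, and term clauses are standard query DSL in both Elasticsearch and OpenSearch.

```python
def build_search_query(text, principal_ids, folder_id=None, size=20):
    """Build a search body that only matches documents the caller may see.

    principal_ids: the user's own ID plus the IDs of every group they belong to.
    """
    # Filter clauses do not affect scoring and can be cached by the engine.
    filters = [{"terms": {"acl": list(principal_ids)}}]
    if folder_id is not None:
        filters.append({"term": {"folder_id": folder_id}})
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": text,
                            # "^2" boosts title matches over body matches.
                            "fields": ["title^2", "tags", "content"],
                        }
                    }
                ],
                "filter": filters,
            }
        },
    }
```

Because the ACL lives in the filter clause, a result the caller cannot access never enters the hit list, so pagination and hit counts stay consistent with what the user is allowed to see.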

Retention Policies

Each document may have a retention_days value. A background job runs daily:

  1. Select documents where status = 'active' and created_at + retention_days <= now().
  2. Set their status to 'archived' (or 'deleted', per policy).
  3. For hard deletes, remove S3 objects after a grace period and mark versions as purged.
  4. Write an audit event for every retention action.

Legal hold: add a legal_hold flag on the document; while it is set, the retention job must not delete the document, regardless of retention_days.
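
The daily job's selection logic, including the legal-hold override, can be sketched in a few lines. The Doc dataclass and apply_retention function are hypothetical stand-ins for the document table and the job; the sketch assumes a hold blocks only the 'deleted' action and that skipped documents are audited as well.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Doc:
    id: str
    status: str
    created_at: datetime
    retention_days: Optional[int] = None
    legal_hold: bool = False


def apply_retention(docs, now, action="archived"):
    """Apply `action` to expired active documents; return audit events."""
    audit = []
    for doc in docs:
        if doc.status != "active" or doc.retention_days is None:
            continue  # not subject to retention
        if doc.created_at + timedelta(days=doc.retention_days) > now:
            continue  # retention period has not elapsed yet
        if doc.legal_hold and action == "deleted":
            audit.append((doc.id, "skipped_legal_hold"))
            continue  # legal hold overrides retention deletion
        doc.status = action
        audit.append((doc.id, action))
    return audit
```

In a database-backed job the same predicate would run as a single query, with the audit rows written in the same transaction as the status change.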

Scalability Considerations

  • Shard the document_version table by document_id if the table grows large; all versions of a document then live on the same shard.
  • Use S3 multipart upload for files > 100 MB.
  • CDN (CloudFront) in front of S3 for frequently downloaded public documents.
  • Materialized path on folder.path makes subtree queries a simple LIKE '/root/legal/%' without recursive CTEs. The trade-off is that renaming or moving a folder must rewrite the path of every descendant.
  • Background text extraction is async; search results may lag by seconds after upload.
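
The materialized-path subtree query has one subtlety worth showing: % and _ are LIKE wildcards, so folder names that contain them must be escaped before being used as a prefix. The sketch below uses SQLite from the Python standard library for portability; like_prefix and subtree are hypothetical helper names, and only the path column of the folder table is assumed.

```python
import sqlite3


def like_prefix(path):
    """Escape LIKE wildcards in a folder path and append the subtree pattern."""
    escaped = path.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return escaped.rstrip("/") + "/%"


def subtree(conn, path):
    """Return the folder at `path` and every folder below it, sorted by path."""
    rows = conn.execute(
        "SELECT path FROM folder "
        "WHERE path = ? OR path LIKE ? ESCAPE '\\' "
        "ORDER BY path",
        (path, like_prefix(path)),
    )
    return [r[0] for r in rows]
```

Appending the trailing slash before the % keeps /root/legal from matching a sibling such as /root/legal_ops. Whether the prefix search can use an index depends on the database and its collation settings, so that is worth checking with the query planner.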

Common Interview Follow-Ups

  • Deduplication: Store SHA-256 checksum; if a matching hash already exists in storage, reuse the S3 key (content-addressable storage).
  • Collaborative editing: Integrate an OT/CRDT layer (e.g., Yjs) for real-time co-editing; save a new version on each explicit save or periodic checkpoint.
  • Virus scanning: Trigger a scanner lambda on the S3 event before marking the version active.
  • Quota enforcement: Track bytes used per owner; reject uploads that exceed quota before issuing the pre-signed URL.
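
The deduplication idea can be sketched as a content-addressed store with reference counts. ContentStore, its cas/ key layout, and the in-memory dictionaries are assumptions made for illustration. The sketch hashes bytes it already holds; with direct-to-S3 uploads the hash would instead come from the client or from a post-upload step. The reference count matters because several versions may share one object, and a purge must not remove bytes that another version still points to.

```python
import hashlib


class ContentStore:
    def __init__(self):
        self.blobs = {}     # storage_key -> bytes (stand-in for S3)
        self.refcount = {}  # storage_key -> number of versions using the key

    @staticmethod
    def key_for(data: bytes) -> str:
        """Derive the storage key from the content's SHA-256 digest."""
        digest = hashlib.sha256(data).hexdigest()
        return f"cas/{digest[:2]}/{digest}"

    def put(self, data: bytes):
        """Store data once per distinct content; return (key, was_new)."""
        key = self.key_for(data)
        is_new = key not in self.blobs
        if is_new:
            self.blobs[key] = data
        self.refcount[key] = self.refcount.get(key, 0) + 1
        return key, is_new

    def release(self, key: str) -> bool:
        """Drop one reference; delete the blob when none remain."""
        self.refcount[key] -= 1
        if self.refcount[key] == 0:
            del self.refcount[key]
            del self.blobs[key]
            return True
        return False
```

Uploading the same bytes twice yields the same key and stores one blob; the blob is deleted only when the second reference is released.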
