System Design: Collaborative Document Editing — Operational Transformation and CRDT (2025)

Requirements and Core Challenge

Functional: multiple users edit the same document simultaneously, changes from all users are reflected in real time, document state is eventually consistent across all clients, and the full edit history is preserved. Non-functional: local edits appear in under 100 ms, all connected clients eventually converge, network partitions are tolerated (offline editing plus sync on reconnect), and documents of up to 1M characters are supported. The core challenge: concurrent edits from different users can conflict. User A inserts "X" at position 5 while User B deletes the character at position 5. Who wins, and how does each client converge to the same final state?

Operational Transformation (OT)

OT approach: each edit is an operation (INSERT a character at a position, DELETE the character at a position). Two operations are concurrent when each was created without knowledge of the other, that is, neither client had seen the other's edit when it made its own. To reconcile them, transform one against the other to adjust positions. Example: A inserts at pos 5, B deletes at pos 5. If A's insert is applied first, B's delete must now target pos 6, because the character B meant to delete was shifted right by the insert. In practice, OT relies on a central server to serialize operations and broadcast the transformed versions. Google Docs uses OT.

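def transform(op1, op2):
    # op1 was applied first (server-received order); return a copy of op2
    # adjusted so that it has the intended effect when applied after op1.
    op2 = dict(op2)  # copy so the caller's operation is not mutated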
    if op1["type"] == "insert" and op2["type"] == "insert":
        if op1["pos"] <= op2["pos"]:
            op2["pos"] += 1  # op1 inserted before op2: shift op2 right
    elif op1["type"] == "insert" and op2["type"] == "delete":
        if op1["pos"] <= op2["pos"]:
            op2["pos"] += 1
    elif op1["type"] == "delete" and op2["type"] == "insert":
        if op1["pos"] < op2["pos"]:
            op2["pos"] -= 1
    elif op1["type"] == "delete" and op2["type"] == "delete":
        if op1["pos"] < op2["pos"]:
            op2["pos"] -= 1
        elif op1["pos"] == op2["pos"]:
            op2["type"] = "noop"  # both deleted same char
    return op2
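
To check the worked example above (A inserts at pos 5, B deletes at pos 5), the following minimal sketch applies the two operations in both orders and compares the results. It restates the same transform rules in a compact, non-mutating form; the helper `apply_op`, the sample text "abcdefgh", and the argument order `transform(applied, op)` are assumptions made for illustration only.

```python
def transform(applied, op):
    """Return a copy of `op` adjusted so it can be applied after `applied`."""
    op = dict(op)
    if applied["type"] == "insert":
        if applied["pos"] <= op["pos"]:
            op["pos"] += 1          # an earlier insert shifts op right
    elif applied["type"] == "delete":
        if applied["pos"] < op["pos"]:
            op["pos"] -= 1          # an earlier delete shifts op left
        elif applied["pos"] == op["pos"] and op["type"] == "delete":
            op["type"] = "noop"     # both deleted the same character
    return op


def apply_op(text, op):
    """Apply a single-character operation to a string (hypothetical helper)."""
    chars = list(text)
    if op["type"] == "insert":
        chars.insert(op["pos"], op["char"])
    elif op["type"] == "delete":
        del chars[op["pos"]]
    return "".join(chars)


doc = "abcdefgh"
op_a = {"type": "insert", "pos": 5, "char": "X"}   # user A
op_b = {"type": "delete", "pos": 5}                # user B deletes "f"

# Order 1: A first, then B transformed against A.
via_a_first = apply_op(apply_op(doc, op_a), transform(op_a, op_b))
# Order 2: B first, then A transformed against B.
via_b_first = apply_op(apply_op(doc, op_b), transform(op_b, op_a))

print(via_a_first, via_b_first)
```

Both orders end with "f" removed and "X" in its place. One caveat: with these rules, two concurrent inserts at the same position do not converge if both sides transform symmetrically, because each side shifts the other's insert. A server that applies the rule in one direction only avoids this; any client-side transform needs a deterministic tie-break, for example by user ID.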

CRDT (Conflict-free Replicated Data Type)

CRDT approach: design the data structure so that replicas converge to the same result regardless of the order in which operations are applied, with no transformation step. For text editing, assign each character a globally unique ID (user_id + logical_clock) and address characters by ID rather than by index. Insert: "insert character C after the character with ID X". Delete: mark the character as deleted (a tombstone) instead of removing it, so that operations that reference it can still be placed. Because IDs are globally unique and totally ordered, concurrent inserts after the same character are ordered by comparing their IDs, and every replica that has received the same set of operations ends up with the same document. The one ordering requirement is causal: an insert must arrive after the insert of the character it references. Figma and Notion are commonly cited as using CRDT-inspired approaches. Advantage over OT: works peer-to-peer without a central server and supports offline edits that sync on reconnect. Disadvantage: tombstoned characters accumulate (garbage collection is needed) and the implementation is more complex.
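
The ID-based insert and tombstone delete described above can be sketched as a small replicated sequence in the style of RGA (Replicated Growable Array). This is an illustrative sketch, not a production CRDT: the class name `SequenceCRDT`, the `HEAD` sentinel, and the operation dictionaries are assumptions, and it assumes causal delivery (an operation arrives after the insert it refers to).

```python
class SequenceCRDT:
    """RGA-style sequence; element IDs are (logical_clock, user_id) tuples."""

    HEAD = (0, "")  # sentinel ID present on every replica

    def __init__(self, user_id):
        self.user_id = user_id
        self.clock = 0
        # Each element is [id, char, deleted]; the sentinel is never rendered.
        self.elems = [[self.HEAD, "", True]]

    def _index(self, elem_id):
        return next(i for i, e in enumerate(self.elems) if e[0] == elem_id)

    def insert_after(self, ref_id, char):
        """Local insert: build the operation, apply it, return it for broadcast."""
        self.clock += 1
        op = {"type": "insert", "id": (self.clock, self.user_id),
              "ref": ref_id, "char": char}
        self.apply(op)
        return op

    def delete(self, elem_id):
        """Local delete: tombstone the element and return the operation."""
        op = {"type": "delete", "id": elem_id}
        self.apply(op)
        return op

    def apply(self, op):
        """Apply a local or remote operation."""
        if op["type"] == "delete":
            self.elems[self._index(op["id"])][2] = True  # tombstone, keep slot
            return
        if any(e[0] == op["id"] for e in self.elems):
            return  # duplicate delivery, already applied
        self.clock = max(self.clock, op["id"][0])  # Lamport-style clock update
        i = self._index(op["ref"]) + 1
        # Skip elements with larger IDs so every replica picks the same slot.
        while i < len(self.elems) and self.elems[i][0] > op["id"]:
            i += 1
        self.elems.insert(i, [op["id"], op["char"], False])

    def text(self):
        return "".join(char for _, char, deleted in self.elems if not deleted)


alice, bob = SequenceCRDT("alice"), SequenceCRDT("bob")
first = alice.insert_after(SequenceCRDT.HEAD, "A")
bob.apply(first)

# Concurrent edits: both insert after "A", and bob also deletes "A".
op_x = alice.insert_after(first["id"], "X")
op_y = bob.insert_after(first["id"], "Y")
op_d = bob.delete(first["id"])

# Deliver the remote operations in different orders on each replica.
alice.apply(op_d)
alice.apply(op_y)
bob.apply(op_x)

print(alice.text(), bob.text())
```

Both replicas print the same text even though "A" was deleted while "X" still referenced it, which is why the tombstone is kept rather than removed.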

Architecture: Server-Side OT with WebSocket

# Server maintains authoritative document state
class DocumentServer:
    def __init__(self, doc_id: str):
        self.doc_id = doc_id
        self.content = []      # list of chars
        self.history = []      # list of operations (revision log)
        self.revision = 0
        self.connections = {}  # conn_id -> connection object, used by broadcast()

    def apply_operation(self, op: dict, client_revision: int) -> dict:
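        # Transform op against every operation applied since client_revision.
        # transform(first, second) returns the adjusted second operation, so the
        # already-applied server op is passed first and the incoming op second.
        for server_op in self.history[client_revision:]:
            op = transform(server_op, op)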

        # Apply to document
        if op["type"] == "insert":
            self.content.insert(op["pos"], op["char"])
        elif op["type"] == "delete":
            if op["pos"] < len(self.content):
                del self.content[op["pos"]]
        # A "noop" changes nothing but is still logged to keep revisions aligned

        self.history.append(op)
        self.revision += 1
        return {"op": op, "revision": self.revision}

    def broadcast(self, op: dict, sender_conn_id: str):
        # Send transformed op to all other connected clients
        for conn_id, conn in self.connections.items():
            if conn_id != sender_conn_id:
                conn.send({"type": "op", "op": op, "revision": self.revision})
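
To illustrate how the revision number drives the catch-up transform, here is a small self-contained sketch in which three clients all edit against revision 0 and the server receives their operations one after another. `MiniServer`, `submit`, and `transform_against` are hypothetical names; the transform rules are the ones listed earlier, and networking is left out.

```python
def transform_against(op, applied):
    """Return a copy of `op` adjusted for `applied`, which the server applied first."""
    op = dict(op)
    if applied["type"] == "insert":
        if applied["pos"] <= op["pos"]:
            op["pos"] += 1
    elif applied["type"] == "delete":
        if applied["pos"] < op["pos"]:
            op["pos"] -= 1
        elif applied["pos"] == op["pos"] and op["type"] == "delete":
            op["type"] = "noop"  # the character is already gone
    return op


class MiniServer:
    """In-memory stand-in for the document server (no sockets, one document)."""

    def __init__(self, text):
        self.content = list(text)
        self.history = []  # history[i] takes the document from revision i to i + 1

    def submit(self, op, client_revision):
        # Catch the operation up with everything the client had not seen yet.
        for applied in self.history[client_revision:]:
            op = transform_against(op, applied)
        if op["type"] == "insert":
            self.content.insert(op["pos"], op["char"])
        elif op["type"] == "delete" and op["pos"] < len(self.content):
            del self.content[op["pos"]]
        self.history.append(op)  # noops are logged too, so revisions stay aligned
        return op, len(self.history)


server = MiniServer("abcdefgh")

# All three clients created their operation against revision 0.
op_a, rev_a = server.submit({"type": "insert", "pos": 5, "char": "X"}, 0)
op_b, rev_b = server.submit({"type": "delete", "pos": 5}, 0)  # wants to delete "f"
op_c, rev_c = server.submit({"type": "delete", "pos": 5}, 0)  # also wants "f"

print("".join(server.content), op_b, op_c)
```

The second delete becomes a noop because the character it targeted was already removed by the first one, so "g" is not deleted by accident.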

Presence, Cursors, and Persistence

Cursor presence: each client sends its cursor position (user_id, position, color) over the WebSocket on each keystroke, and the server fans it out to the other clients. Positions are stored in Redis (key: doc:{doc_id}:cursors, hash of user_id -> position) and expire after 30 seconds without an update. Redis TTLs traditionally apply to whole keys, so per-entry expiry needs either hash-field TTLs (HEXPIRE, available from Redis 7.4) or a last-seen timestamp stored with each entry and filtered on read.

Persistence: document operations are stored in an append-only operations table (doc_id, revision, op_json, user_id, created_at). The current document state is materialized by replaying operations from revision 0 or from a snapshot. Snapshots are taken every 1000 revisions to bound replay time. Cold start: load the latest snapshot, then replay the operations from the snapshot revision up to the current revision.

Offline support: the client stores operations locally while disconnected and submits them in order on reconnect. The server transforms each one against the operations that were applied during the offline period.
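
The snapshot-and-replay cold start described above can be sketched as follows. This is a minimal in-memory illustration: `load_document`, the `snapshots` dictionary (revision -> text), and the `op_log` list stand in for the snapshot store and the operations table, and are assumptions rather than a prescribed schema.

```python
SNAPSHOT_INTERVAL = 1000  # from the text: one snapshot every 1000 revisions


def apply_op(chars, op):
    """Apply one single-character operation to a list of characters in place."""
    if op["type"] == "insert":
        chars.insert(op["pos"], op["char"])
    elif op["type"] == "delete" and op["pos"] < len(chars):
        del chars[op["pos"]]
    # "noop" operations leave the document unchanged


def load_document(snapshots, op_log, target_revision=None):
    """Rebuild the text at `target_revision` from the nearest earlier snapshot.

    snapshots: dict mapping revision -> full text at that revision
    op_log:    list where op_log[i] takes the document from revision i to i + 1
    Returns (text, number_of_operations_replayed).
    """
    if target_revision is None:
        target_revision = len(op_log)
    # Latest snapshot at or before the target; revision 0 is the empty document.
    base = max((r for r in snapshots if r <= target_revision), default=0)
    chars = list(snapshots.get(base, ""))
    for op in op_log[base:target_revision]:
        apply_op(chars, op)
    return "".join(chars), target_revision - base


# Example log: 2500 appends of "a", with snapshots at revisions 1000 and 2000.
op_log = [{"type": "insert", "pos": i, "char": "a"} for i in range(2500)]
snapshots = {rev: "a" * rev
             for rev in range(SNAPSHOT_INTERVAL, len(op_log) + 1, SNAPSHOT_INTERVAL)}

text, replayed = load_document(snapshots, op_log)
print(len(text), replayed)
```

With snapshots in place, a cold start replays at most `SNAPSHOT_INTERVAL - 1` operations, and the same function serves version history by passing an earlier `target_revision`.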



