System Design: Collaborative Document Editing — Operational Transformation and CRDT (2025)

Requirements and Core Challenge

Functional: multiple users edit the same document simultaneously; every user's changes appear for the others in real time; document state is eventually consistent across all clients; the full edit history is preserved. Non-functional: local edits appear in under 100 ms; all connected clients converge (eventual consistency); network partitions are tolerated (offline editing, then sync on reconnect); documents of up to 1M characters are supported. The core challenge: concurrent edits from different users can conflict. User A inserts "X" at position 5 while User B deletes the character at position 5. Who wins, and how does each client converge to the same final state?

Operational Transformation (OT)

OT approach: each edit is an operation (INSERT char at position, DELETE char at position). Two operations are concurrent when each was issued before its author had seen the other, so both are expressed against the same document revision. To apply both, transform one against the other so that its position accounts for the edit that landed first. Example: A inserts at pos 5, B deletes at pos 5. If A’s insert is applied first, the character B meant to delete has shifted right, so B’s delete must now target pos 6. Production OT systems typically rely on a central server to serialize operations and broadcast the transformed versions. Google Docs uses OT.

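def transform(op1, op2):
    # op1 was applied first (server-received order). Return a copy of op2
    # adjusted so it has the intended effect when applied after op1.
    op2 = dict(op2)  # copy so the caller's operation is not mutated
    if op1["type"] == "insert" and op2["type"] == "insert":
        # On a tie, op2 shifts right, so op1's character ends up first.
        # Every site must resolve this tie identically to converge.
        if op1["pos"] <= op2["pos"]:
            op2["pos"] += 1  # op1 inserted at or before op2: shift op2 right
    elif op1["type"] == "insert" and op2["type"] == "delete":
        if op1["pos"] <= op2["pos"]:
            op2["pos"] += 1  # the target character moved right by one
    elif op1["type"] == "delete" and op2["type"] == "insert":
        if op1["pos"] < op2["pos"]:
            op2["pos"] -= 1  # a character before the insert point was removed
    elif op1["type"] == "delete" and op2["type"] == "delete":
        if op1["pos"] < op2["pos"]:
            op2["pos"] -= 1  # the target character moved left by one
        elif op1["pos"] == op2["pos"]:
            op2["type"] = "noop"  # both deleted the same char
    return op2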
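
To see why the transform matters, the sketch below replays the A-inserts / B-deletes example at two sites that apply the operations in opposite orders. It is a self-contained illustration: `xform` is a compact restatement of the rules above, and `apply_op` and the sample text are assumptions made for the demo.

```python
def xform(applied, incoming):
    """Return a copy of `incoming` adjusted for `applied` having run first."""
    out = dict(incoming)
    a, b = applied["type"], out["type"]
    if a == "insert" and b in ("insert", "delete"):
        if applied["pos"] <= out["pos"]:
            out["pos"] += 1
    elif a == "delete" and b in ("insert", "delete"):
        if applied["pos"] < out["pos"]:
            out["pos"] -= 1
        elif b == "delete" and applied["pos"] == out["pos"]:
            out["type"] = "noop"
    return out


def apply_op(text, op):
    """Apply a single-character operation to a string (hypothetical helper)."""
    if op["type"] == "insert":
        return text[:op["pos"]] + op["char"] + text[op["pos"]:]
    if op["type"] == "delete":
        return text[:op["pos"]] + text[op["pos"] + 1:]
    return text  # noop


doc = "abcdefgh"
op_a = {"type": "insert", "pos": 5, "char": "X"}  # user A
op_b = {"type": "delete", "pos": 5}               # user B, removes "f"

# Site A applies its own insert first, then B's delete transformed against it.
site_a = apply_op(apply_op(doc, op_a), xform(op_a, op_b))
# Site B applies its own delete first, then A's insert transformed against it.
site_b = apply_op(apply_op(doc, op_b), xform(op_b, op_a))

print(site_a, site_b)  # both sites end with "abcdeXgh"
```

Both sites keep A's "X" and lose the "f" that B targeted. Two inserts at the same position are the case this simple rule set does not settle symmetrically; a deterministic tie-break (for example, comparing user IDs) is the usual remedy.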

CRDT (Conflict-free Replicated Data Type)

CRDT approach: design the data structure so that operations commute. Replicas that have applied the same set of operations reach the same result regardless of order, so no transformation is needed. For text editing: assign each character a globally unique ID (user_id + logical_clock). Operations refer to characters by ID, not by position. Insert: “insert character C after character with ID X”; concurrent inserts after the same character are ordered by comparing their IDs. Delete: mark character C as deleted (tombstone) and never actually remove it, so later operations that reference it still resolve and order is preserved. Because IDs are globally unique and a total order on IDs is defined, replicas converge to the same final document state, provided each insert is delivered after the character it references. Figma and Notion use CRDT-based approaches. Advantages over OT: works peer-to-peer without a central server; supports offline edits that sync on reconnect. Disadvantages: tombstoned characters accumulate (garbage collection needed), and the implementation is more complex.
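
The ID-based scheme can be sketched in a few dozen lines. The version below is a minimal illustration loosely modeled on the RGA family of sequence CRDTs, not a production design: the `Replica` class, the `(clock, site)` ID layout, and the `ROOT` sentinel are all assumptions made for the example, and it relies on each insert arriving after the character it references.

```python
from dataclasses import dataclass, field

ROOT = (0, "")  # sentinel ID for the start of the document (assumption)


@dataclass
class Replica:
    site: str
    clock: int = 0
    # (id, char) pairs in document order; IDs are (logical_clock, site) tuples
    seq: list = field(default_factory=lambda: [(ROOT, "")])
    tombstones: set = field(default_factory=set)

    def local_insert(self, after_id, char):
        self.clock += 1
        op = ("insert", (self.clock, self.site), after_id, char)
        self.apply(op)
        return op  # ship this to the other replicas

    def local_delete(self, target_id):
        op = ("delete", target_id)
        self.apply(op)
        return op

    def apply(self, op):
        if op[0] == "delete":
            self.tombstones.add(op[1])  # mark only; the element stays in seq
            return
        _, new_id, after_id, char = op
        if any(eid == new_id for eid, _ in self.seq):
            return  # duplicate delivery is ignored
        self.clock = max(self.clock, new_id[0])  # Lamport-style clock merge
        # Start right after the referenced character...
        i = next(k for k, (eid, _) in enumerate(self.seq) if eid == after_id) + 1
        # ...and skip concurrent elements with larger IDs so every replica
        # picks the same slot.
        while i < len(self.seq) and self.seq[i][0] > new_id:
            i += 1
        self.seq.insert(i, (new_id, char))

    def text(self):
        return "".join(c for eid, c in self.seq
                       if eid != ROOT and eid not in self.tombstones)


a, b = Replica("A"), Replica("B")
op_h = a.local_insert(ROOT, "H")
b.apply(op_h)

# Concurrent edits: both insert after "H", and B also deletes "H".
op_i = a.local_insert(op_h[1], "i")
op_bang = b.local_insert(op_h[1], "!")
op_del = b.local_delete(op_h[1])

# Deliver the remote operations in different orders at each replica.
a.apply(op_del)
a.apply(op_bang)
b.apply(op_i)

print(a.text(), b.text())  # both print "!i"
```

The deleted "H" remains in `seq` as a tombstone, which is what lets both inserts still find their reference point. The linear scans keep the sketch short; real implementations index elements by ID.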

Architecture: Server-Side OT with WebSocket

# Server maintains authoritative document state
class DocumentServer:
    def __init__(self, doc_id: str):
        self.doc_id = doc_id
        self.content = []   # list of chars
        self.history = []   # list of operations (revision log)
        self.revision = 0
        self.connections = {}  # conn_id -> connection object, used by broadcast()

    def apply_operation(self, op: dict, client_revision: int) -> dict:
        # Transform op against every operation the client had not yet seen.
        # Server ops were applied first, so each goes in transform's op1 slot.
        for server_op in self.history[client_revision:]:
            op = transform(server_op, op)

        # Apply to document
        if op["type"] == "insert":
            self.content.insert(op["pos"], op["char"])
        elif op["type"] == "delete":  # "noop" ops match neither branch
            if op["pos"] < len(self.content):
                del self.content[op["pos"]]

        self.history.append(op)
        self.revision += 1
        return {"op": op, "revision": self.revision}

    def broadcast(self, op: dict, sender_conn_id: str):
        # Send transformed op to all other connected clients
        for conn_id, conn in self.connections.items():
            if conn_id != sender_conn_id:
                conn.send({"type": "op", "op": op, "revision": self.revision})
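
As a rough walk-through of the rebase step, the sketch below feeds three operations, all issued against revision 0, into a stripped-down server. `MiniServer`, `submit`, and the compact `xform` helper are hypothetical stand-ins for the class and transform function above, with networking left out.

```python
def xform(applied, incoming):
    """Return a copy of `incoming` adjusted for `applied` having run first."""
    out = dict(incoming)
    a, b = applied["type"], out["type"]
    if a == "insert" and b in ("insert", "delete"):
        if applied["pos"] <= out["pos"]:
            out["pos"] += 1
    elif a == "delete" and b in ("insert", "delete"):
        if applied["pos"] < out["pos"]:
            out["pos"] -= 1
        elif b == "delete" and applied["pos"] == out["pos"]:
            out["type"] = "noop"
    return out


class MiniServer:
    def __init__(self, text=""):
        self.content = list(text)
        self.history = []  # transformed ops; index i produced revision i + 1

    def submit(self, op, base_revision):
        # Rebase the client's op over everything it has not seen yet.
        for done in self.history[base_revision:]:
            op = xform(done, op)
        if op["type"] == "insert":
            self.content.insert(op["pos"], op["char"])
        elif op["type"] == "delete" and op["pos"] < len(self.content):
            del self.content[op["pos"]]
        self.history.append(op)  # noops are logged too, so revisions stay aligned
        return op, len(self.history)


server = MiniServer("hello")
# Three clients all edited revision 0 of "hello".
op_a, rev_a = server.submit({"type": "delete", "pos": 1}, base_revision=0)
op_b, rev_b = server.submit({"type": "delete", "pos": 1}, base_revision=0)
op_c, rev_c = server.submit({"type": "insert", "pos": 5, "char": "!"}, base_revision=0)

print("".join(server.content))  # hllo!
```

The second delete targets a character that is already gone, so it becomes a noop, and the insert at position 5 is shifted to position 4 to account for the first delete.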

Presence, Cursors, and Persistence

Cursor presence: each client broadcasts its cursor position (user_id, position, color) on each keystroke via WebSocket, and the server fans it out to the other clients. Positions are stored in Redis (key: doc:{doc_id}:cursors, hash of user_id -> position) with a 30-second expiry per entry. A Redis key TTL covers the whole key, so per-entry expiry needs hash-field expiration (Redis 7.4+), one key per user, or a timestamp stored alongside each position.

Persistence: document operations are stored in an append-only operations table (doc_id, revision, op_json, user_id, created_at). The current document state is materialized by replaying from revision 0 (or from a snapshot). Snapshots are taken every 1000 revisions to bound replay time. Cold start: load the latest snapshot, then replay operations from the snapshot revision to the current one.

Offline support: the client stores operations locally while disconnected and submits them all on reconnect. The server transforms each against the operations that happened during the offline period.
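
The snapshot-plus-replay idea can be sketched with an in-memory log standing in for the operations table. This is an illustration under assumptions: `OpLog`, `append`, and `load` are hypothetical names, and the snapshot interval is a constructor parameter so the demo can use a small value instead of 1000.

```python
import json


class OpLog:
    def __init__(self, snapshot_every=1000):
        self.ops = []              # ops[i] is the JSON of the op behind revision i + 1
        self.snapshots = {0: ""}   # revision -> full document text
        self.snapshot_every = snapshot_every

    def append(self, op):
        """Append an already-transformed op and return the new revision."""
        self.ops.append(json.dumps(op))
        revision = len(self.ops)
        if revision % self.snapshot_every == 0:
            self.snapshots[revision] = self.load()
        return revision

    def load(self):
        """Cold start: latest snapshot plus replay of the ops after it."""
        base = max(self.snapshots)
        chars = list(self.snapshots[base])
        for raw in self.ops[base:]:
            op = json.loads(raw)
            if op["type"] == "insert":
                chars.insert(op["pos"], op["char"])
            elif op["type"] == "delete":
                del chars[op["pos"]]
        return "".join(chars)


log = OpLog(snapshot_every=3)
for pos, char in enumerate("abc"):
    log.append({"type": "insert", "pos": pos, "char": char})
log.append({"type": "delete", "pos": 0})

print(log.snapshots)  # {0: '', 3: 'abc'}
print(log.load())     # bc
```

With a snapshot at revision 3, loading the document replays only the single delete that followed it rather than all four operations.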

