System Design Interview: Distributed Transactions (2PC, Saga Pattern)

System Design Interview: Distributed Transactions (2PC, Saga Pattern)

Distributed transactions are one of the hardest problems in distributed systems. When an operation spans multiple services or databases, ensuring atomicity (all-or-nothing) requires coordination protocols. Asked at Stripe, Coinbase, Uber, Shopify, and any company with microservices.

Why Distributed Transactions Are Hard

In a single database: ACID transactions are guaranteed by the DB engine. In microservices: each service has its own database. An order checkout involves inventory service, payment service, and fulfillment service — if payment succeeds but inventory decrement fails, data is inconsistent.

Two-Phase Commit (2PC)

The classic distributed transaction protocol. A coordinator manages the transaction across multiple participants (services/databases).

Phase 1 - Prepare:
  Coordinator -> all Participants: "Can you commit transaction X?"
  Each Participant:
    - Acquire locks
    - Write to write-ahead log
    - Reply: VOTE_COMMIT or VOTE_ABORT

Phase 2 - Commit or Abort:
  If all voted COMMIT:
    Coordinator -> all Participants: "Commit!"
    Each Participant: commit, release locks, ack
  If any voted ABORT:
    Coordinator -> all Participants: "Abort!"
    Each Participant: rollback, release locks, ack

2PC Problems

  • Blocking: if coordinator crashes after Phase 1 but before Phase 2, participants hold locks indefinitely waiting for coordinator to recover
  • Single point of failure: coordinator failure stalls all participants
  • Availability: violates CAP theorem – 2PC sacrifices availability for consistency
  • Performance: 2 round trips + lock holding time – slow for high-throughput systems

2PC is used within a single database system (XA transactions) but rarely across microservices in high-scale systems.

Saga Pattern

Sagas decompose a distributed transaction into a sequence of local transactions, each with a compensating transaction for rollback. No distributed locks – each service commits locally and publishes an event.

Choreography-Based Saga

Order Service: create order (PENDING) -> emit OrderCreated
  |
Inventory Service: reserve items -> emit InventoryReserved
  |                   (or emit InventoryFailed if out of stock)
Payment Service: charge card -> emit PaymentCompleted
  |                   (or emit PaymentFailed)
Order Service: mark order CONFIRMED

On failure:
PaymentFailed event -> Inventory Service: release reservation (compensate)
InventoryFailed event -> Order Service: cancel order (compensate)

Orchestration-Based Saga

Saga Orchestrator manages the sequence:
1. Tell Inventory: reserve items
   - Success: proceed to step 2
   - Failure: tell Order Service: cancel order. Done.
2. Tell Payment: charge card
   - Success: proceed to step 3
   - Failure: tell Inventory: release reservation. Done.
3. Tell Order Service: confirm order. Done.

Saga vs 2PC

  • Saga provides eventual consistency, not ACID atomicity
  • Saga is available (no cross-service locks) – better for microservices
  • Compensating transactions must be idempotent and always succeed
  • Saga does not provide isolation – other transactions may see intermediate state

Idempotency Keys

Critical for saga reliability: if a step is retried (network failure, crash), it must not double-charge or double-reserve. Each step includes a unique idempotency key. Service stores (idempotency_key, result) in a lookup table. If key seen again, return stored result without re-executing.

Table: idempotency_keys (key, status, result, created_at)
On request:
  SELECT * FROM idempotency_keys WHERE key = ?
  If found: return cached result
  If not found:
    Execute operation
    INSERT INTO idempotency_keys (key, status, result) VALUES (?, ?, ?)
    Return result

Outbox Pattern

Atomically commit DB write and event publication without distributed transaction:

In the same DB transaction:
  UPDATE orders SET status = 'PENDING' WHERE id = ?
  INSERT INTO outbox (event_type, payload) VALUES ('OrderCreated', {...})

Separate outbox worker:
  SELECT * FROM outbox WHERE processed = false ORDER BY created_at
  Publish each event to Kafka
  Mark as processed

Guarantees at-least-once event delivery. Combined with idempotent consumers, achieves exactly-once semantics.

Distributed Lock (for short critical sections)

# Redis Redlock for distributed mutex
SET lock_key {unique_value} EX 30 NX
  # EX 30: auto-expire after 30s (crash safety)
  # NX: only if not exists (atomic)

Do critical work...

DEL lock_key (only if value matches unique_value)
  # Lua script for atomic check-and-delete

Interview Tips

  • Know 2PC phases and why it blocks on coordinator failure
  • Explain Saga as the microservices alternative – eventual consistency with compensation
  • Discuss choreography (event-driven) vs orchestration (central coordinator) sagas
  • Idempotency keys are non-negotiable for any distributed transaction
  • Outbox pattern solves the dual-write problem atomically
  • Saga does not provide isolation – mention this trade-off explicitly

Scroll to Top