System Design Interview: Distributed Transactions (2PC, Saga Pattern)
Distributed transactions are one of the hardest problems in distributed systems. When an operation spans multiple services or databases, ensuring atomicity (all-or-nothing) requires coordination protocols. Asked at Stripe, Coinbase, Uber, Shopify, and any company with microservices.
Why Distributed Transactions Are Hard
In a single database: ACID transactions are guaranteed by the DB engine. In microservices: each service has its own database. An order checkout involves inventory service, payment service, and fulfillment service — if payment succeeds but inventory decrement fails, data is inconsistent.
Two-Phase Commit (2PC)
The classic distributed transaction protocol. A coordinator manages the transaction across multiple participants (services/databases).
Phase 1 - Prepare:
Coordinator -> all Participants: "Can you commit transaction X?"
Each Participant:
- Acquire locks
- Write to write-ahead log
- Reply: VOTE_COMMIT or VOTE_ABORT
Phase 2 - Commit or Abort:
If all voted COMMIT:
Coordinator -> all Participants: "Commit!"
Each Participant: commit, release locks, ack
If any voted ABORT:
Coordinator -> all Participants: "Abort!"
Each Participant: rollback, release locks, ack
2PC Problems
- Blocking: if coordinator crashes after Phase 1 but before Phase 2, participants hold locks indefinitely waiting for coordinator to recover
- Single point of failure: coordinator failure stalls all participants
- Availability: violates CAP theorem – 2PC sacrifices availability for consistency
- Performance: 2 round trips + lock holding time – slow for high-throughput systems
2PC is used within a single database system (XA transactions) but rarely across microservices in high-scale systems.
Saga Pattern
Sagas decompose a distributed transaction into a sequence of local transactions, each with a compensating transaction for rollback. No distributed locks – each service commits locally and publishes an event.
Choreography-Based Saga
Order Service: create order (PENDING) -> emit OrderCreated
|
Inventory Service: reserve items -> emit InventoryReserved
| (or emit InventoryFailed if out of stock)
Payment Service: charge card -> emit PaymentCompleted
| (or emit PaymentFailed)
Order Service: mark order CONFIRMED
On failure:
PaymentFailed event -> Inventory Service: release reservation (compensate)
InventoryFailed event -> Order Service: cancel order (compensate)
Orchestration-Based Saga
Saga Orchestrator manages the sequence:
1. Tell Inventory: reserve items
- Success: proceed to step 2
- Failure: tell Order Service: cancel order. Done.
2. Tell Payment: charge card
- Success: proceed to step 3
- Failure: tell Inventory: release reservation. Done.
3. Tell Order Service: confirm order. Done.
Saga vs 2PC
- Saga provides eventual consistency, not ACID atomicity
- Saga is available (no cross-service locks) – better for microservices
- Compensating transactions must be idempotent and always succeed
- Saga does not provide isolation – other transactions may see intermediate state
Idempotency Keys
Critical for saga reliability: if a step is retried (network failure, crash), it must not double-charge or double-reserve. Each step includes a unique idempotency key. Service stores (idempotency_key, result) in a lookup table. If key seen again, return stored result without re-executing.
Table: idempotency_keys (key, status, result, created_at)
On request:
SELECT * FROM idempotency_keys WHERE key = ?
If found: return cached result
If not found:
Execute operation
INSERT INTO idempotency_keys (key, status, result) VALUES (?, ?, ?)
Return result
Outbox Pattern
Atomically commit DB write and event publication without distributed transaction:
In the same DB transaction:
UPDATE orders SET status = 'PENDING' WHERE id = ?
INSERT INTO outbox (event_type, payload) VALUES ('OrderCreated', {...})
Separate outbox worker:
SELECT * FROM outbox WHERE processed = false ORDER BY created_at
Publish each event to Kafka
Mark as processed
Guarantees at-least-once event delivery. Combined with idempotent consumers, achieves exactly-once semantics.
Distributed Lock (for short critical sections)
# Redis Redlock for distributed mutex
SET lock_key {unique_value} EX 30 NX
# EX 30: auto-expire after 30s (crash safety)
# NX: only if not exists (atomic)
Do critical work...
DEL lock_key (only if value matches unique_value)
# Lua script for atomic check-and-delete
Interview Tips
- Know 2PC phases and why it blocks on coordinator failure
- Explain Saga as the microservices alternative – eventual consistency with compensation
- Discuss choreography (event-driven) vs orchestration (central coordinator) sagas
- Idempotency keys are non-negotiable for any distributed transaction
- Outbox pattern solves the dual-write problem atomically
- Saga does not provide isolation – mention this trade-off explicitly