System Design: Distributed Transactions — Two-Phase Commit, Saga Pattern, and the Outbox Pattern

The Problem with Distributed Transactions

In a monolith, a database transaction is atomic: either all changes commit or all roll back. In microservices, an operation spans multiple services with separate databases. Example: placing an order requires: (1) deduct inventory, (2) charge payment, (3) create order record, (4) send confirmation email. If step 3 fails after step 2 succeeds, money is charged but no order exists. Maintaining atomicity across multiple independent databases is the distributed transaction problem.

Two-Phase Commit (2PC)

2PC uses a coordinator to achieve atomicity across multiple participants. Phase 1 (Prepare): coordinator sends “prepare” to all participants. Each participant executes the transaction locally, writes to a durable log, and replies “ready” or “abort”. Phase 2 (Commit): if all replied “ready”, coordinator sends “commit” — all participants finalize. If any replied “abort”, coordinator sends “rollback” — all abort.

Problems: Blocking protocol — if the coordinator fails after prepare but before commit, participants are stuck holding locks indefinitely (they have said “ready” and cannot unilaterally abort). Performance: two network round-trips for every transaction; participants hold locks during both phases. Use in practice: 2PC is used within a single database cluster (PostgreSQL, MySQL for distributed setups) but avoided across microservices. XA protocol is the standard 2PC API supported by most databases.

Saga Pattern

A saga is a sequence of local transactions. Each step has a compensating transaction that undoes its effect. If step N fails, execute compensating transactions for steps N-1, N-2, …, 1 in reverse. No locks held across services — each service completes its local transaction immediately. Eventual consistency: the system may be in an intermediate state between steps.

Two coordination models: Choreography (event-driven): each service publishes an event after its local transaction. Other services listen and react. Order service publishes OrderCreated → Inventory service decrements stock, publishes InventoryReserved → Payment service charges, publishes PaymentCharged → Notification service sends email. On failure: Payment publishes PaymentFailed → Inventory publishes InventoryReleased (compensating). Pro: decoupled. Con: hard to track saga state, distributed logic.

Orchestration: a central saga orchestrator sends commands to each service and waits for responses. Orchestrator tracks state: STARTED → INVENTORY_RESERVED → PAYMENT_CHARGED → ORDER_CREATED → COMPLETED. On failure at any step, orchestrator sends compensating commands. Pro: clear state, easy to monitor and debug. Con: orchestrator is a central point (but not a single point of failure if stateless with persistent state in a DB).

The Outbox Pattern

The dual-write problem: after a local transaction, you must publish an event to Kafka. If the event publish fails after the transaction commits, the event is lost (other services never know the step completed). Outbox pattern: write the event to an outbox table inside the same database transaction as the business operation. A separate outbox processor reads the table and publishes to Kafka, deleting rows after successful publish. This gives exactly-once event publishing relative to the database transaction: if the transaction commits, the event is in the outbox; the publisher will eventually deliver it. If the transaction rolls back, the event is never in the outbox.

Outbox processor: poll the outbox table every 100ms, or use CDC (Change Data Capture) like Debezium to stream changes from the database’s WAL (write-ahead log) to Kafka. CDC is more efficient (no polling) and has lower latency. The outbox table: (id, event_type, aggregate_id, payload JSON, created_at, published_at). Index on published_at IS NULL for efficient polling.

Idempotency Across Services

Sagas may retry failed steps. Each service must be idempotent: receiving the same command twice produces the same result as receiving it once. Technique: include an idempotency_key (saga_id + step_name) in each command. Service stores (idempotency_key, result) in a table. On receipt: check if this key was already processed. If yes, return the stored result. If no, process and store. This allows safe retries without side effects (no double charges, no double reservations).

Interview Tips

  • 2PC vs Saga: 2PC gives strong consistency but is slow and blocking. Saga gives eventual consistency but requires idempotency and compensation logic. Use Saga for microservices.
  • Compensating transactions: not always possible (cannot un-send an email). Use best-effort compensation (send a “sorry” email) or prevent the issue (only send the email at the last step).
  • Outbox pattern is essential for reliable event publishing — always use it when a database transaction must trigger an event.

{
“@context”: “https://schema.org”,
“@type”: “FAQPage”,
“mainEntity”: [
{
“@type”: “Question”,
“name”: “Why is Two-Phase Commit (2PC) avoided in microservices architectures?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “2PC has three major problems for microservices. First, blocking: if the coordinator crashes after the prepare phase, all participants hold locks indefinitely — they replied “ready” but cannot commit or abort without coordinator direction. This blocks those records for other operations until the coordinator recovers. Second, performance: two network round-trips plus lock-holding across services means high latency and low throughput. Third, availability: 2PC requires all participants to be reachable. If any service is down during phase 1, the entire transaction must abort — in a microservices environment with many services and independent deployments, this is frequent. The CAP theorem trade-off: 2PC favors consistency over availability. Saga favors availability with eventual consistency.”
}
},
{
“@type”: “Question”,
“name”: “What is a compensating transaction and when is it not possible?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “A compensating transaction reverses the effect of a completed step. Examples: inventory reservation compensated by releasing it; payment charge compensated by a refund. Compensation is not always a true undo — it is a semantic reversal that is logically correct but may not be identical to a rollback. Situations where compensation is impossible: sending an email or SMS (cannot un-send), printing a label (cannot un-print), publishing a post publicly (cannot guarantee all readers forget it). In these cases: (1) defer the irreversible action to the last step of the saga so compensation is rarely needed; (2) use a best-effort compensation (send a “we made a mistake” follow-up email); (3) design the system to handle partial completion gracefully (order shows “pending” until all steps commit).”
}
},
{
“@type”: “Question”,
“name”: “How does the Outbox pattern guarantee exactly-once event delivery?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “The outbox pattern achieves exactly-once relative to the database transaction. The event is written to an outbox table inside the same database transaction as the business record. Since they share a transaction, they commit or rollback together — no dual-write race condition. The outbox processor (Debezium CDC or a polling worker) reads undelivered outbox rows and publishes to Kafka. If the publish succeeds: mark the row as delivered (or delete it). If the publish fails: retry. The consumer receives the event at-least-once (because the outbox row may be retried). To achieve exactly-once end-to-end: the consumer must be idempotent — process the same event twice with the same result. Together: outbox ensures the event is always delivered (at-least-once) + idempotent consumer = effectively exactly-once.”
}
},
{
“@type”: “Question”,
“name”: “How do you choose between choreography and orchestration for a saga?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Choreography: each service reacts to events from the previous step. No central coordinator. Better for: simple flows with few steps, teams that want maximum service independence, event-driven architectures already using Kafka. Downsides: saga state is implicit and distributed — hard to query “what step is order X on?”, hard to debug failures, risk of cyclic event chains. Orchestration: a saga orchestrator service maintains explicit state and sends commands. Better for: complex flows with many steps, business processes that need monitoring and dashboards, teams that want clear ownership of business logic. Downsides: orchestrator is a new service to maintain. In practice: choreography for simple 2-3 step flows; orchestration for anything with 4+ steps, complex error handling, or audit requirements.”
}
},
{
“@type”: “Question”,
“name”: “How do you implement saga idempotency to handle retries safely?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Each saga step receives a command with a saga_id and step_name. The receiving service generates an idempotency_key = hash(saga_id + step_name). Before processing: check the idempotency_keys table for this key. If found: return the previously stored result (skip processing). If not found: process the command, insert the key and result atomically in the same transaction. This guarantees that retrying the same command produces the same result. For payment steps: the idempotency key maps to the payment provider’s idempotency key (Stripe uses Idempotency-Key headers). If the saga retries a payment step, Stripe returns the original payment result rather than charging again. Idempotency keys have a TTL (24-48 hours) after which they can be cleaned up.”
}
}
]
}

Asked at: Stripe Interview Guide

Asked at: Shopify Interview Guide

Asked at: Uber Interview Guide

Asked at: Coinbase Interview Guide

Scroll to Top