Question 1

What is the difference between saga orchestration and saga choreography?

Accepted Answer

Choreography: each service listens for events, performs its action, and emits the next event. There is no central coordinator. Example: Order service emits OrderPlaced → Inventory service reserves items and emits InventoryReserved → Payment service charges the card and emits PaymentCharged → Fulfillment service ships. Advantage: loose coupling, and no coordinator that can become a single point of failure. Disadvantage: the saga flow is implicit, spread across several services' event handlers, which makes the overall flow hard to reason about, test, and debug.

Orchestration: a central saga orchestrator calls each service in sequence and tracks state, so the flow is explicit in one place. Advantage: the full saga state is visible at any time, retry and compensation logic is easier to add, and testing is simpler. Disadvantage: the orchestrator is an additional service that must be maintained.

Choose choreography for simple 2–3 step flows; choose orchestration for complex flows with many steps, conditional branching, or strict compensation requirements.
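To make the orchestration side concrete, here is a minimal in-process sketch. The `Step` and `SagaOrchestrator` names, the status strings, and the stub step functions are illustrative assumptions; a real orchestrator would persist state and call remote services.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]      # forward step
    compensate: Callable[[dict], None]  # undo for the forward step


@dataclass
class SagaOrchestrator:
    steps: List[Step]
    status: str = "pending"
    completed: List[str] = field(default_factory=list)

    def run(self, ctx: dict) -> str:
        """Run steps in order; on the first failure, undo completed steps."""
        self.status = "running"
        for step in self.steps:
            try:
                step.action(ctx)
                self.completed.append(step.name)
            except Exception:
                self.status = "compensating"
                self._compensate(ctx)
                return self.status
        self.status = "completed"
        return self.status

    def _compensate(self, ctx: dict) -> None:
        """Call compensations for completed steps in reverse order."""
        by_name = {s.name: s for s in self.steps}
        for name in reversed(self.completed):
            by_name[name].compensate(ctx)
        self.status = "compensated"


# Hypothetical step functions that only record what was called.
log: List[str] = []

def reserve(ctx): log.append("reserve_inventory")
def release(ctx): log.append("release_inventory")
def charge(ctx): raise RuntimeError("card declined")
def refund(ctx): log.append("refund_payment")

saga = SagaOrchestrator([
    Step("reserve_inventory", reserve, release),
    Step("charge_payment", charge, refund),
])
result = saga.run({"order_id": 42})
```

Because charge_payment never completed, only reserve_inventory is compensated. In a choreographed version the same ordering would be spread across each service's event handlers instead of living in `run`.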

Question 2

How do you design idempotent compensating transactions?

Accepted Answer

A compensating transaction must be safe to call multiple times, because the orchestrator retries it if the first call times out. Design each forward step together with its compensation:

(1) reserve_inventory → release_inventory. release_inventory looks up the reservation (identified by the reservation_id from step_results). If it no longer exists, it was already released, so return success immediately. If it exists, release it and return success.

(2) charge_payment → refund_payment. refund_payment passes an idempotency key to the payment processor so a retried call cannot refund twice.

(3) create_shipment → cancel_shipment. If the shipment is already cancelled, return success. If it has already shipped (too late to cancel), return a distinct already_shipped error so the orchestrator can escalate to manual intervention.

The key principle: a compensation that finds "nothing to undo" should succeed silently, not error. Make compensations naturally idempotent by always checking current state before acting.
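The check-state-before-acting rule can be sketched with an in-memory inventory service. The `InventoryService` class, its fields, and the "ok" return value are assumptions made for illustration; only the reserve_inventory and release_inventory step names come from the text above.

```python
from typing import Dict, Tuple


class InventoryService:
    """In-memory stand-in for a real inventory service (illustrative only)."""

    def __init__(self, stock: Dict[str, int]):
        self.stock = dict(stock)
        # reservation_id -> (sku, quantity)
        self.reservations: Dict[str, Tuple[str, int]] = {}

    def reserve_inventory(self, reservation_id: str, sku: str, qty: int) -> str:
        if reservation_id in self.reservations:
            return "ok"  # retried forward step: already reserved
        if self.stock.get(sku, 0) < qty:
            raise ValueError("insufficient stock")
        self.stock[sku] -= qty
        self.reservations[reservation_id] = (sku, qty)
        return "ok"

    def release_inventory(self, reservation_id: str) -> str:
        reservation = self.reservations.pop(reservation_id, None)
        if reservation is None:
            return "ok"  # nothing to undo: already released or never reserved
        sku, qty = reservation
        self.stock[sku] += qty
        return "ok"


svc = InventoryService({"sku-1": 5})
svc.reserve_inventory("res-1", "sku-1", 2)
first = svc.release_inventory("res-1")
second = svc.release_inventory("res-1")  # simulated orchestrator retry
```

The second call finds no reservation and returns success without touching stock, so a timed-out compensation can be retried safely.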

Question 3

How do you handle a saga step that partially succeeds (some records created, some failed)?

Accepted Answer

A step that creates 10 records and fails on record 7 leaves 6 orphaned records. The compensation must clean them up, but the step's response_payload may not include the IDs of the partially created records. Mitigations:

(1) Make steps atomic where possible: wrap all database operations in a single transaction, so that if the transaction fails, nothing is committed.

(2) For non-atomic steps (for example, calling an external service that creates records one by one), return partial results in the response_payload even on failure (e.g., {'created_ids': [1,2,3], 'failed_at': 4}) and design the compensation to delete that partial list.

(3) Stamp every created record with a saga correlation key, using step_record.idempotency_key as the value of a saga_idem_key column. The compensation runs SELECT * FROM records WHERE saga_idem_key=X and deletes every row it finds. This works even if the response_payload was lost in a crash.
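A rough sketch of mitigations (2) and (3) together, using an in-memory SQLite database. The table layout, the `fail_at` switch used to simulate a mid-step failure, and the function names are assumptions; only the saga_idem_key column and the partial-result payload shape follow the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    " id INTEGER PRIMARY KEY,"
    " saga_idem_key TEXT NOT NULL,"
    " payload TEXT NOT NULL)"
)


def create_records(conn, saga_idem_key, payloads, fail_at=None):
    """Non-atomic forward step: commits row by row, may stop partway."""
    created_ids = []
    for index, payload in enumerate(payloads):
        if fail_at is not None and index == fail_at:
            # Report what was created so far, even though the step failed.
            return {"ok": False, "created_ids": created_ids, "failed_at": index}
        cur = conn.execute(
            "INSERT INTO records (saga_idem_key, payload) VALUES (?, ?)",
            (saga_idem_key, payload),
        )
        conn.commit()
        created_ids.append(cur.lastrowid)
    return {"ok": True, "created_ids": created_ids}


def compensate_create_records(conn, saga_idem_key):
    """Delete by correlation key; does not need the response payload."""
    cur = conn.execute(
        "DELETE FROM records WHERE saga_idem_key = ?", (saga_idem_key,)
    )
    conn.commit()
    return cur.rowcount


create_records(conn, "saga-B", ["unrelated row"])
partial = create_records(
    conn, "saga-A", [f"row-{n}" for n in range(10)], fail_at=6
)
deleted_first = compensate_create_records(conn, "saga-A")
deleted_again = compensate_create_records(conn, "saga-A")  # retry is a no-op
remaining = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

The compensation removes the six orphaned rows for saga-A and leaves the other saga's row alone. Running it a second time deletes nothing, which keeps it idempotent.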

Question 4

How do you monitor and alert on stuck or long-running sagas?

Accepted Answer

A saga stuck in 'compensating' status for 30 minutes is a financial or data integrity emergency. Monitoring:

(1) Alert on sagas where status IN ('running','compensating') AND started_at < NOW() - INTERVAL '15 minutes'. For most sagas this should never happen: a saga that has been running for 15 minutes indicates a service is down or unresponsive.

(2) Alert on status='failed' (the compensation itself failed). A failed saga means data is in a partially inconsistent state across multiple services. Treat it as a P1 incident that requires immediate manual intervention.

(3) Dashboard: show saga count by status, age distribution, and type. The SLA dashboard should show: running (healthy: <1 min), completed, compensated, failed.

(4) Runbook: for each saga type, document how to manually complete or compensate a stuck saga, including the exact SQL and service API calls needed. Sagas with no runbook are operational debt.
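The alert queries in (1) to (3) can be sketched against an in-memory SQLite table. The `sagas` table layout and function names are assumptions, and started_at is stored as epoch seconds here so the example runs without PostgreSQL's NOW() and INTERVAL.

```python
import sqlite3

STUCK_AFTER_SECONDS = 15 * 60  # threshold from alert rule (1)


def find_stuck_sagas(conn, now_epoch):
    """Sagas still running or compensating after the threshold."""
    cutoff = now_epoch - STUCK_AFTER_SECONDS
    return conn.execute(
        "SELECT saga_id, status FROM sagas"
        " WHERE status IN ('running', 'compensating') AND started_at < ?"
        " ORDER BY started_at",
        (cutoff,),
    ).fetchall()


def find_failed_sagas(conn):
    """Sagas whose compensation failed; these need manual intervention."""
    rows = conn.execute(
        "SELECT saga_id FROM sagas WHERE status = 'failed' ORDER BY saga_id"
    ).fetchall()
    return [row[0] for row in rows]


def count_by_status(conn):
    """Dashboard input: number of sagas per status."""
    return dict(
        conn.execute("SELECT status, COUNT(*) FROM sagas GROUP BY status")
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sagas (saga_id TEXT PRIMARY KEY, status TEXT, started_at INTEGER)"
)
now = 10_000  # fixed clock for a deterministic example
conn.executemany(
    "INSERT INTO sagas VALUES (?, ?, ?)",
    [
        ("a", "running", now - 60),         # healthy
        ("b", "running", now - 1800),       # stuck
        ("c", "compensating", now - 2000),  # stuck
        ("d", "failed", now - 500),
        ("e", "completed", now - 5000),
    ],
)
conn.commit()

stuck = find_stuck_sagas(conn, now)
failed = find_failed_sagas(conn)
counts = count_by_status(conn)
```

In production the same predicates would typically run as a scheduled query feeding the alerting system, with the threshold tuned per saga type.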

Question 5

How does the saga pattern handle eventual consistency vs. strong consistency?

Accepted Answer

Sagas provide eventual consistency, not strong consistency. Intermediate state is visible to other operations while a saga runs: after inventory is reserved but before payment is charged, a read sees reserved inventory and no payment record. Consequences:

(1) Two users can simultaneously start sagas to purchase the last item in stock. Both may pass the availability check, but only one saga can end up with the item; the other must compensate. Use a reservation pattern (reserve inventory at step 1, finalize at step N) to minimize the inconsistency window.

(2) A business report run during a saga's execution may see partial data. For financial reporting, always query completed sagas (status='completed') and add manual reconciliation for in-flight sagas.

(3) The alternative, distributed two-phase commit (2PC), provides strong consistency but requires every participating service to support prepare/commit/rollback protocols. That is complex to implement, and the long-held locks hurt throughput.

For most business workflows, saga eventual consistency is acceptable. For money movements requiring zero intermediate inconsistency, use a single-database transaction instead.
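One way to sketch the reservation pattern from (1) is a conditional UPDATE that only succeeds while stock remains, so two sagas competing for the last item cannot both reserve it. The `inventory` table, its columns, and the function names are assumptions. The sketch tracks counters only, so unlike a reservation-ID design its release step is not idempotent on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory ("
    " sku TEXT PRIMARY KEY,"
    " available INTEGER NOT NULL,"
    " reserved INTEGER NOT NULL)"
)
conn.execute("INSERT INTO inventory VALUES ('sku-1', 1, 0)")  # last item
conn.commit()


def reserve(conn, sku, qty):
    """Step 1: move stock from available to reserved only if enough is left."""
    cur = conn.execute(
        "UPDATE inventory SET available = available - ?, reserved = reserved + ?"
        " WHERE sku = ? AND available >= ?",
        (qty, qty, sku, qty),
    )
    conn.commit()
    return cur.rowcount == 1


def finalize(conn, sku, qty):
    """Step N: the purchase completed, so the reserved stock leaves inventory."""
    conn.execute(
        "UPDATE inventory SET reserved = reserved - ? WHERE sku = ? AND reserved >= ?",
        (qty, sku, qty),
    )
    conn.commit()


def release(conn, sku, qty):
    """Compensation: return reserved stock to the available pool."""
    conn.execute(
        "UPDATE inventory SET reserved = reserved - ?, available = available + ?"
        " WHERE sku = ? AND reserved >= ?",
        (qty, qty, sku, qty),
    )
    conn.commit()


saga_1_reserved = reserve(conn, "sku-1", 1)
saga_2_reserved = reserve(conn, "sku-1", 1)  # loses the race, fails fast
finalize(conn, "sku-1", 1)
state = conn.execute(
    "SELECT available, reserved FROM inventory WHERE sku = 'sku-1'"
).fetchone()
```

The second saga fails at its first step, before any payment is attempted, which shortens the window in which other operations can observe inconsistent state.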

Saga Orchestration System Low-Level Design: Distributed Transactions, Compensation, and Idempotent Step Execution

Saga Orchestration System: Low-Level Design

Orchestrator: Execute Forward Steps

Compensation: Roll Back Committed Steps

Key Design Decisions