The saga pattern manages distributed transactions across multiple services by decomposing them into a sequence of local transactions, each publishing an event or message. When a step fails, compensating transactions undo the preceding steps. Without sagas, a distributed transaction that touches three services has no atomic rollback — you must build it explicitly.
Core Data Model
CREATE TABLE Saga (
saga_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
saga_type VARCHAR(50) NOT NULL, -- 'order_fulfillment', 'user_registration'
correlation_id UUID NOT NULL, -- ID of the business object (order_id, etc.)
current_step VARCHAR(50) NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'RUNNING', -- RUNNING, COMPLETED, COMPENSATING, FAILED
payload JSONB NOT NULL, -- initial input, shared across steps
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE SagaStep (
step_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
saga_id UUID NOT NULL REFERENCES Saga(saga_id),
step_name VARCHAR(50) NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'PENDING', -- PENDING, COMPLETED, COMPENSATED, FAILED
result JSONB, -- output stored for compensation
compensated_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_saga_correlation ON Saga(correlation_id);
CREATE INDEX idx_sagastep_saga ON SagaStep(saga_id, step_name);
Choreography vs Orchestration
Choreography: Each service listens for events and reacts. OrderService publishes OrderCreated; PaymentService listens, charges the card, publishes PaymentCompleted; InventoryService listens, reserves stock, publishes StockReserved. No central coordinator. Scales well; hard to debug and trace.
Orchestration: A central saga orchestrator drives the workflow. It sends commands to each service and waits for replies. Easier to monitor (the saga state machine is visible in one place), easier to compensate (orchestrator knows which steps completed), preferred for complex flows with many steps or non-trivial compensation logic. Use orchestration for anything with more than 3 steps.
Orchestrated Saga Implementation
class OrderFulfillmentSaga:
STEPS = ['reserve_inventory', 'charge_payment', 'schedule_shipping', 'send_confirmation']
def start(self, order_id: str, payload: dict) -> str:
saga_id = str(uuid4())
db.execute("""
INSERT INTO Saga (saga_id, saga_type, correlation_id, current_step, payload)
VALUES (%s, 'order_fulfillment', %s, 'reserve_inventory', %s)
""", [saga_id, order_id, json.dumps(payload)])
self._execute_step(saga_id, 'reserve_inventory', payload)
return saga_id
def _execute_step(self, saga_id: str, step: str, payload: dict):
try:
result = self._call_service(step, payload)
db.execute("""
INSERT INTO SagaStep (saga_id, step_name, status, result)
VALUES (%s, %s, 'COMPLETED', %s)
""", [saga_id, step, json.dumps(result)])
next_step = self._next_step(step)
if next_step:
db.execute("UPDATE Saga SET current_step=%s WHERE saga_id=%s", [next_step, saga_id])
self._execute_step(saga_id, next_step, {**payload, **result})
else:
db.execute("UPDATE Saga SET status='COMPLETED' WHERE saga_id=%s", [saga_id])
except ServiceException as e:
self._compensate(saga_id, step, payload)
def _compensate(self, saga_id: str, failed_step: str, payload: dict):
db.execute("UPDATE Saga SET status='COMPENSATING' WHERE saga_id=%s", [saga_id])
# Get all completed steps before the failed one, in reverse order
completed = db.fetchall("""
SELECT step_name, result FROM SagaStep
WHERE saga_id=%s AND status='COMPLETED'
ORDER BY completed_at DESC
""", [saga_id])
for step_row in completed:
try:
self._call_compensation(step_row['step_name'], payload, step_row['result'])
db.execute("""
UPDATE SagaStep SET status='COMPENSATED', compensated_at=NOW()
WHERE saga_id=%s AND step_name=%s
""", [saga_id, step_row['step_name']])
except Exception:
# Compensation failure is critical — alert and require manual intervention
alert_ops(f"Compensation failed for saga {saga_id} step {step_row['step_name']}")
db.execute("UPDATE Saga SET status='FAILED' WHERE saga_id=%s", [saga_id])
Idempotent Steps
Every saga step must be idempotent. The orchestrator may retry a step after a network timeout — the service may have already executed it. Solution: pass the saga_id + step_name as an idempotency key to each service call. The receiving service checks if it already processed this key: if yes, return the cached result; if no, execute and store.
-- In PaymentService
CREATE TABLE IdempotencyKey (
key VARCHAR(100) PRIMARY KEY, -- '{saga_id}:{step_name}'
result JSONB NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
def charge_payment(saga_id, step_name, amount, card_token):
idem_key = f"{saga_id}:{step_name}"
existing = db.fetchone("SELECT result FROM IdempotencyKey WHERE key=%s", [idem_key])
if existing:
return existing['result'] # already charged — return cached result
result = payment_processor.charge(amount, card_token)
db.execute("INSERT INTO IdempotencyKey (key, result) VALUES (%s, %s)",
[idem_key, json.dumps(result)])
return result
Key Interview Points
- Sagas trade ACID isolation for availability — two concurrent sagas can see each other’s intermediate states. Use semantic locks or reservation patterns to prevent conflicts.
- Compensating transactions are not rollbacks — they are new forward transactions that undo business effects. A charge compensation is a refund, not a DB rollback.
- Compensation failures require manual intervention or a dead-letter queue — you cannot automatically recover from a failed compensation.
- Prefer orchestration over choreography for complex flows — the state machine is testable and traceable.
- The saga log (Saga + SagaStep tables) is your audit trail and recovery mechanism — if the orchestrator crashes mid-saga, replay from the last recorded step on restart.
- At-least-once delivery + idempotent steps = exactly-once business semantics.
Saga pattern and distributed transaction design is discussed in Uber system design interview questions.
Saga pattern and payment flow reliability design is covered in Stripe system design interview preparation.
Saga pattern and order fulfillment workflow design is discussed in Amazon system design interview guide.
See also: Shopify Interview Guide