Saga Compensation Pattern Low-Level Design
Distributed transactions across microservices cannot use two-phase commit at scale — it introduces tight coupling and liveness risks. The Saga pattern breaks a distributed transaction into a sequence of local transactions, each with a corresponding compensating transaction that undoes its effect if a later step fails. This guide covers the full low-level design: orchestration-based state machine, compensating transaction registry, partial failure handling, idempotent rollback, and timeout-triggered compensation.
Orchestration vs Choreography
Orchestration places a central coordinator (the saga orchestrator) in charge of executing each step and triggering compensation. Choreography distributes the control via events — each service listens for events and emits its own. Orchestration is easier to reason about, easier to observe, and the right choice when you need explicit rollback semantics with a clear audit trail.
This design uses orchestration. The orchestrator is a persistent state machine stored in the database, driven by a polling or event-driven loop.
Saga State Machine
The saga instance transitions through a well-defined set of states:
PENDING
→ STEP_1_STARTED → STEP_1_COMPLETED
→ STEP_2_STARTED → STEP_2_COMPLETED
→ ...
→ COMPLETED
On failure at step N:
STEP_N_STARTED → STEP_N_FAILED
→ COMPENSATING
→ COMP_N-1_STARTED → COMP_N-1_COMPLETED
→ COMP_N-2_STARTED → COMP_N-2_COMPLETED
→ ...
→ COMPENSATED | COMPENSATION_FAILED
The state is persisted to the database after every transition. If the orchestrator crashes, it resumes from the last durable state on restart.
Compensating Transaction Registry
Each saga type registers its steps and their compensations at startup. The registry maps (saga_type, step_number) to a pair of callables: forward_fn and compensate_fn. Both must be idempotent — they may be called more than once due to retries.
Example saga: Order placement across inventory, payment, and fulfillment services.
- Step 1: Reserve inventory → Compensate: Release inventory reservation
- Step 2: Charge payment → Compensate: Refund payment
- Step 3: Create fulfillment order → Compensate: Cancel fulfillment order
If step 3 fails, compensation runs: cancel fulfillment (step 3 comp, if partially applied), refund payment (step 2 comp), release reservation (step 1 comp) — in reverse order.
SQL Schema
CREATE TABLE saga_instance (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
saga_type VARCHAR(128) NOT NULL,
state VARCHAR(64) NOT NULL DEFAULT 'PENDING',
current_step INT NOT NULL DEFAULT 0,
payload JSONB NOT NULL DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_saga_state ON saga_instance (state, updated_at);
CREATE TABLE saga_step (
id BIGSERIAL PRIMARY KEY,
saga_id UUID NOT NULL REFERENCES saga_instance(id),
step_number INT NOT NULL,
step_name VARCHAR(128) NOT NULL,
status VARCHAR(32) NOT NULL DEFAULT 'PENDING',
forward_result JSONB,
compensation_status VARCHAR(32) NOT NULL DEFAULT 'NOT_NEEDED',
executed_at TIMESTAMPTZ,
CONSTRAINT uq_saga_step UNIQUE (saga_id, step_number)
);
CREATE TABLE saga_compensation (
id BIGSERIAL PRIMARY KEY,
saga_id UUID NOT NULL REFERENCES saga_instance(id),
step_number INT NOT NULL,
attempt INT NOT NULL DEFAULT 1,
status VARCHAR(32) NOT NULL DEFAULT 'PENDING',
error_message TEXT,
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ
);
CREATE INDEX idx_saga_comp_saga ON saga_compensation (saga_id, step_number);
Python: Orchestrator Implementation
import uuid
import json
import psycopg2
from datetime import datetime, timezone
from typing import Callable, Dict, List, Optional, Tuple
# Registry: saga_type -> list of (step_name, forward_fn, compensate_fn)
SAGA_REGISTRY: Dict[str, List[Tuple[str, Callable, Callable]]] = {}
def register_saga(saga_type: str, steps: List[Tuple[str, Callable, Callable]]):
SAGA_REGISTRY[saga_type] = steps
def execute_saga(saga_type: str, payload: dict) -> str:
"""Create a new saga instance and begin execution. Returns saga_id."""
saga_id = str(uuid.uuid4())
with psycopg2.connect(dsn="postgresql://app:pass@db/appdb") as conn:
with conn.cursor() as cur:
cur.execute(
"""
INSERT INTO saga_instance (id, saga_type, state, current_step, payload)
VALUES (%s, %s, 'PENDING', 0, %s)
""",
(saga_id, saga_type, json.dumps(payload)),
)
conn.commit()
advance_step(saga_id)
return saga_id
def advance_step(saga_id: str):
"""Execute the next pending step of the saga."""
with psycopg2.connect(dsn="postgresql://app:pass@db/appdb") as conn:
with conn.cursor() as cur:
cur.execute(
"SELECT saga_type, state, current_step, payload FROM saga_instance WHERE id = %s FOR UPDATE",
(saga_id,),
)
row = cur.fetchone()
if not row:
raise ValueError(f"Saga {saga_id} not found")
saga_type, state, current_step, payload = row
steps = SAGA_REGISTRY.get(saga_type, [])
if current_step >= len(steps):
# All steps completed
with conn.cursor() as cur:
cur.execute(
"UPDATE saga_instance SET state = 'COMPLETED', updated_at = NOW() WHERE id = %s",
(saga_id,),
)
conn.commit()
return
step_name, forward_fn, _ = steps[current_step]
with conn.cursor() as cur:
cur.execute(
"""
INSERT INTO saga_step (saga_id, step_number, step_name, status, executed_at)
VALUES (%s, %s, %s, 'STARTED', NOW())
ON CONFLICT (saga_id, step_number) DO UPDATE SET status = 'STARTED', executed_at = NOW()
""",
(saga_id, current_step, step_name),
)
cur.execute(
"UPDATE saga_instance SET state = %s, updated_at = NOW() WHERE id = %s",
(f"STEP_{current_step}_STARTED", saga_id),
)
conn.commit()
try:
result = forward_fn(payload)
with conn.cursor() as cur:
cur.execute(
"""
UPDATE saga_step SET status = 'COMPLETED', forward_result = %s
WHERE saga_id = %s AND step_number = %s
""",
(json.dumps(result), saga_id, current_step),
)
cur.execute(
"""
UPDATE saga_instance SET state = %s, current_step = %s, updated_at = NOW()
WHERE id = %s
""",
(f"STEP_{current_step}_COMPLETED", current_step + 1, saga_id),
)
conn.commit()
advance_step(saga_id)
except Exception as exc:
with conn.cursor() as cur:
cur.execute(
"UPDATE saga_step SET status = 'FAILED' WHERE saga_id = %s AND step_number = %s",
(saga_id, current_step),
)
cur.execute(
"UPDATE saga_instance SET state = 'COMPENSATING', updated_at = NOW() WHERE id = %s",
(saga_id,),
)
conn.commit()
compensate_saga(saga_id, from_step=current_step - 1)
def compensate_saga(saga_id: str, from_step: int):
"""Run compensating transactions in reverse order from from_step down to 0."""
with psycopg2.connect(dsn="postgresql://app:pass@db/appdb") as conn:
with conn.cursor() as cur:
cur.execute(
"SELECT saga_type, payload FROM saga_instance WHERE id = %s",
(saga_id,),
)
row = cur.fetchone()
if not row:
return
saga_type, payload = row
steps = SAGA_REGISTRY.get(saga_type, [])
for step_number in range(from_step, -1, -1):
step_name, _, compensate_fn = steps[step_number]
try:
compensate_fn(payload)
with psycopg2.connect(dsn="postgresql://app:pass@db/appdb") as conn:
with conn.cursor() as cur:
cur.execute(
"""
UPDATE saga_step SET compensation_status = 'COMPLETED'
WHERE saga_id = %s AND step_number = %s
""",
(saga_id, step_number),
)
cur.execute(
"""
INSERT INTO saga_compensation (saga_id, step_number, status, completed_at)
VALUES (%s, %s, 'COMPLETED', NOW())
""",
(saga_id, step_number),
)
conn.commit()
except Exception as exc:
with psycopg2.connect(dsn="postgresql://app:pass@db/appdb") as conn:
with conn.cursor() as cur:
cur.execute(
"""
UPDATE saga_instance SET state = 'COMPENSATION_FAILED', updated_at = NOW()
WHERE id = %s
""",
(saga_id,),
)
cur.execute(
"""
INSERT INTO saga_compensation (saga_id, step_number, status, error_message)
VALUES (%s, %s, 'FAILED', %s)
""",
(saga_id, step_number, str(exc)),
)
conn.commit()
return
with psycopg2.connect(dsn="postgresql://app:pass@db/appdb") as conn:
with conn.cursor() as cur:
cur.execute(
"UPDATE saga_instance SET state = 'COMPENSATED', updated_at = NOW() WHERE id = %s",
(saga_id,),
)
conn.commit()
Timeout Handling
A step that does not complete within its deadline must trigger compensation. A background job polls saga_instance for rows in a STEP_N_STARTED state with updated_at older than the step timeout. On detection, it transitions the saga to COMPENSATING and calls compensate_saga(saga_id, from_step=N-1). The timed-out step itself is not compensated (it never completed), but all prior completed steps are.
{
“@context”: “https://schema.org”,
“@type”: “FAQPage”,
“mainEntity”: [
{
“@type”: “Question”,
“name”: “When should I use orchestration vs choreography for sagas?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Use orchestration when you need explicit rollback semantics, a clear audit trail, and centralized visibility into saga state. Orchestration makes it straightforward to answer ‘what step is this saga on and why did it fail.’ Use choreography when services are truly independent and you want to avoid introducing a central coordinator as a bottleneck or single point of failure. For most transactional workflows involving compensation, orchestration is easier to implement correctly.”
}
},
{
“@type”: “Question”,
“name”: “How do I make compensating transactions idempotent?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Design each compensating action so that applying it multiple times produces the same result as applying it once. For example, a payment refund compensator should check whether the refund already exists before issuing it, and return success if it does. Use the saga_id and step_number as idempotency keys when calling downstream services. Store the compensation status in saga_step so the orchestrator knows whether a compensation has already succeeded before retrying.”
}
},
{
“@type”: “Question”,
“name”: “How do I handle partial failures where only some steps completed?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Compensation runs in strict reverse order from the last successfully completed step. If steps 1, 2, and 3 completed and step 4 failed, compensation runs: compensate step 3, compensate step 2, compensate step 1. Steps that never completed do not need compensation. The saga_step table records the status of each step, so the orchestrator always knows exactly which steps need to be compensated.”
}
},
{
“@type”: “Question”,
“name”: “What happens when a saga step times out?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “A background monitor polls for saga instances stuck in a STEP_N_STARTED state beyond the configured timeout. It marks the step as timed out and triggers compensation for all previously completed steps (not including the timed-out step, which never committed its result). After compensation completes, the saga reaches COMPENSATED state. The timed-out downstream call may still complete eventually — the compensating transaction must handle that race condition idempotently.”
}
}
]
}
{
“@context”: “https://schema.org”,
“@type”: “FAQPage”,
“mainEntity”: [
{
“@type”: “Question”,
“name”: “How does the saga orchestrator choose between orchestration and choreography?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Orchestration is preferred when the workflow is complex or requires central visibility; choreography suits loosely coupled services where each step reacts to events independently without a coordinator.”
}
},
{
“@type”: “Question”,
“name”: “How is compensation made idempotent?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Each compensating action checks preconditions before executing (e.g., only refund if payment status is CAPTURED); repeated compensation calls are safe because the idempotency check short-circuits on already-compensated state.”
}
},
{
“@type”: “Question”,
“name”: “How are saga timeouts detected?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “A scheduler queries SagaInstance WHERE state NOT IN (COMPLETED, COMPENSATED) AND updated_at < NOW() – timeout_interval; timed-out sagas are transitioned to COMPENSATING and their completed steps are reversed."
}
},
{
"@type": "Question",
"name": "How is the saga state machine persisted durably?",
"acceptedAnswer": {
"@type": "Answer",
"text": "SagaInstance and SagaStep are written in the same transaction as each forward or compensating action; the state machine can be fully reconstructed from the step log after a crash."
}
}
]
}
See also: Stripe Interview Guide 2026: Process, Bug Bash Round, and Payment Systems
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering