Saga Compensation Pattern Low-Level Design: Distributed Transaction Rollback, Compensating Actions, and State Machine

Saga Compensation Pattern Low-Level Design

Distributed transactions across microservices cannot use two-phase commit at scale — it introduces tight coupling and liveness risks. The Saga pattern breaks a distributed transaction into a sequence of local transactions, each with a corresponding compensating transaction that undoes its effect if a later step fails. This guide covers the full low-level design: orchestration-based state machine, compensating transaction registry, partial failure handling, idempotent rollback, and timeout-triggered compensation.

Orchestration vs Choreography

Orchestration places a central coordinator (the saga orchestrator) in charge of executing each step and triggering compensation. Choreography distributes the control via events — each service listens for events and emits its own. Orchestration is easier to reason about, easier to observe, and the right choice when you need explicit rollback semantics with a clear audit trail.

This design uses orchestration. The orchestrator is a persistent state machine stored in the database, driven by a polling or event-driven loop.

Saga State Machine

The saga instance transitions through a well-defined set of states:

PENDING
  → STEP_1_STARTED → STEP_1_COMPLETED
  → STEP_2_STARTED → STEP_2_COMPLETED
  → ...
  → COMPLETED

On failure at step N:
  STEP_N_STARTED → STEP_N_FAILED
  → COMPENSATING
  → COMP_N-1_STARTED → COMP_N-1_COMPLETED
  → COMP_N-2_STARTED → COMP_N-2_COMPLETED
  → ...
  → COMPENSATED | COMPENSATION_FAILED

The state is persisted to the database after every transition. If the orchestrator crashes, it resumes from the last durable state on restart.

Compensating Transaction Registry

Each saga type registers its steps and their compensations at startup. The registry maps (saga_type, step_number) to a pair of callables: forward_fn and compensate_fn. Both must be idempotent — they may be called more than once due to retries.

Example saga: Order placement across inventory, payment, and fulfillment services.

Step 1: Reserve inventory → Compensate: Release inventory reservation
Step 2: Charge payment → Compensate: Refund payment
Step 3: Create fulfillment order → Compensate: Cancel fulfillment order

If step 3 fails, compensation runs: cancel fulfillment (step 3 comp, if partially applied), refund payment (step 2 comp), release reservation (step 1 comp) — in reverse order.

SQL Schema

CREATE TABLE saga_instance (
    id              UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
    saga_type       VARCHAR(128) NOT NULL,
    state           VARCHAR(64)  NOT NULL DEFAULT 'PENDING',
    current_step    INT          NOT NULL DEFAULT 0,
    payload         JSONB        NOT NULL DEFAULT '{}',
    created_at      TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    updated_at      TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_saga_state ON saga_instance (state, updated_at);

CREATE TABLE saga_step (
    id                  BIGSERIAL   PRIMARY KEY,
    saga_id             UUID        NOT NULL REFERENCES saga_instance(id),
    step_number         INT         NOT NULL,
    step_name           VARCHAR(128) NOT NULL,
    status              VARCHAR(32)  NOT NULL DEFAULT 'PENDING',
    forward_result      JSONB,
    compensation_status VARCHAR(32)  NOT NULL DEFAULT 'NOT_NEEDED',
    executed_at         TIMESTAMPTZ,
    CONSTRAINT uq_saga_step UNIQUE (saga_id, step_number)
);

CREATE TABLE saga_compensation (
    id              BIGSERIAL   PRIMARY KEY,
    saga_id         UUID        NOT NULL REFERENCES saga_instance(id),
    step_number     INT         NOT NULL,
    attempt         INT         NOT NULL DEFAULT 1,
    status          VARCHAR(32)  NOT NULL DEFAULT 'PENDING',
    error_message   TEXT,
    started_at      TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    completed_at    TIMESTAMPTZ
);

CREATE INDEX idx_saga_comp_saga ON saga_compensation (saga_id, step_number);

Python: Orchestrator Implementation

import uuid
import json
import psycopg2
from datetime import datetime, timezone
from typing import Callable, Dict, List, Optional, Tuple

# Registry: saga_type -> list of (step_name, forward_fn, compensate_fn)
SAGA_REGISTRY: Dict[str, List[Tuple[str, Callable, Callable]]] = {}

def register_saga(saga_type: str, steps: List[Tuple[str, Callable, Callable]]):
    SAGA_REGISTRY[saga_type] = steps


def execute_saga(saga_type: str, payload: dict) -> str:
    """Create a new saga instance and begin execution. Returns saga_id."""
    saga_id = str(uuid.uuid4())
    with psycopg2.connect(dsn="postgresql://app:pass@db/appdb") as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO saga_instance (id, saga_type, state, current_step, payload)
                VALUES (%s, %s, 'PENDING', 0, %s)
                """,
                (saga_id, saga_type, json.dumps(payload)),
            )
        conn.commit()
    advance_step(saga_id)
    return saga_id


def advance_step(saga_id: str):
    """Execute the next pending step of the saga."""
    with psycopg2.connect(dsn="postgresql://app:pass@db/appdb") as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT saga_type, state, current_step, payload FROM saga_instance WHERE id = %s FOR UPDATE",
                (saga_id,),
            )
            row = cur.fetchone()
            if not row:
                raise ValueError(f"Saga {saga_id} not found")
            saga_type, state, current_step, payload = row

        steps = SAGA_REGISTRY.get(saga_type, [])
        if current_step >= len(steps):
            # All steps completed
            with conn.cursor() as cur:
                cur.execute(
                    "UPDATE saga_instance SET state = 'COMPLETED', updated_at = NOW() WHERE id = %s",
                    (saga_id,),
                )
            conn.commit()
            return

        step_name, forward_fn, _ = steps[current_step]
        with conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO saga_step (saga_id, step_number, step_name, status, executed_at)
                VALUES (%s, %s, %s, 'STARTED', NOW())
                ON CONFLICT (saga_id, step_number) DO UPDATE SET status = 'STARTED', executed_at = NOW()
                """,
                (saga_id, current_step, step_name),
            )
            cur.execute(
                "UPDATE saga_instance SET state = %s, updated_at = NOW() WHERE id = %s",
                (f"STEP_{current_step}_STARTED", saga_id),
            )
        conn.commit()

        try:
            result = forward_fn(payload)
            with conn.cursor() as cur:
                cur.execute(
                    """
                    UPDATE saga_step SET status = 'COMPLETED', forward_result = %s
                    WHERE saga_id = %s AND step_number = %s
                    """,
                    (json.dumps(result), saga_id, current_step),
                )
                cur.execute(
                    """
                    UPDATE saga_instance SET state = %s, current_step = %s, updated_at = NOW()
                    WHERE id = %s
                    """,
                    (f"STEP_{current_step}_COMPLETED", current_step + 1, saga_id),
                )
            conn.commit()
            advance_step(saga_id)
        except Exception as exc:
            with conn.cursor() as cur:
                cur.execute(
                    "UPDATE saga_step SET status = 'FAILED' WHERE saga_id = %s AND step_number = %s",
                    (saga_id, current_step),
                )
                cur.execute(
                    "UPDATE saga_instance SET state = 'COMPENSATING', updated_at = NOW() WHERE id = %s",
                    (saga_id,),
                )
            conn.commit()
            compensate_saga(saga_id, from_step=current_step - 1)


def compensate_saga(saga_id: str, from_step: int):
    """Run compensating transactions in reverse order from from_step down to 0."""
    with psycopg2.connect(dsn="postgresql://app:pass@db/appdb") as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT saga_type, payload FROM saga_instance WHERE id = %s",
                (saga_id,),
            )
            row = cur.fetchone()
            if not row:
                return
            saga_type, payload = row

    steps = SAGA_REGISTRY.get(saga_type, [])
    for step_number in range(from_step, -1, -1):
        step_name, _, compensate_fn = steps[step_number]
        try:
            compensate_fn(payload)
            with psycopg2.connect(dsn="postgresql://app:pass@db/appdb") as conn:
                with conn.cursor() as cur:
                    cur.execute(
                        """
                        UPDATE saga_step SET compensation_status = 'COMPLETED'
                        WHERE saga_id = %s AND step_number = %s
                        """,
                        (saga_id, step_number),
                    )
                    cur.execute(
                        """
                        INSERT INTO saga_compensation (saga_id, step_number, status, completed_at)
                        VALUES (%s, %s, 'COMPLETED', NOW())
                        """,
                        (saga_id, step_number),
                    )
                conn.commit()
        except Exception as exc:
            with psycopg2.connect(dsn="postgresql://app:pass@db/appdb") as conn:
                with conn.cursor() as cur:
                    cur.execute(
                        """
                        UPDATE saga_instance SET state = 'COMPENSATION_FAILED', updated_at = NOW()
                        WHERE id = %s
                        """,
                        (saga_id,),
                    )
                    cur.execute(
                        """
                        INSERT INTO saga_compensation (saga_id, step_number, status, error_message)
                        VALUES (%s, %s, 'FAILED', %s)
                        """,
                        (saga_id, step_number, str(exc)),
                    )
                conn.commit()
            return

    with psycopg2.connect(dsn="postgresql://app:pass@db/appdb") as conn:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE saga_instance SET state = 'COMPENSATED', updated_at = NOW() WHERE id = %s",
                (saga_id,),
            )
        conn.commit()

Timeout Handling

A step that does not complete within its deadline must trigger compensation. A background job polls saga_instance for rows in a STEP_N_STARTED state with updated_at older than the step timeout. On detection, it transitions the saga to COMPENSATING and calls compensate_saga(saga_id, from_step=N-1). The timed-out step itself is not compensated (it never completed), but all prior completed steps are.

{
“@context”: “https://schema.org”,
“@type”: “FAQPage”,
“mainEntity”: [
{
“@type”: “Question”,
“name”: “When should I use orchestration vs choreography for sagas?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Use orchestration when you need explicit rollback semantics, a clear audit trail, and centralized visibility into saga state. Orchestration makes it straightforward to answer ‘what step is this saga on and why did it fail.’ Use choreography when services are truly independent and you want to avoid introducing a central coordinator as a bottleneck or single point of failure. For most transactional workflows involving compensation, orchestration is easier to implement correctly.”
}
},
{
“@type”: “Question”,
“name”: “How do I make compensating transactions idempotent?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Design each compensating action so that applying it multiple times produces the same result as applying it once. For example, a payment refund compensator should check whether the refund already exists before issuing it, and return success if it does. Use the saga_id and step_number as idempotency keys when calling downstream services. Store the compensation status in saga_step so the orchestrator knows whether a compensation has already succeeded before retrying.”
}
},
{
“@type”: “Question”,
“name”: “How do I handle partial failures where only some steps completed?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Compensation runs in strict reverse order from the last successfully completed step. If steps 1, 2, and 3 completed and step 4 failed, compensation runs: compensate step 3, compensate step 2, compensate step 1. Steps that never completed do not need compensation. The saga_step table records the status of each step, so the orchestrator always knows exactly which steps need to be compensated.”
}
},
{
“@type”: “Question”,
“name”: “What happens when a saga step times out?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “A background monitor polls for saga instances stuck in a STEP_N_STARTED state beyond the configured timeout. It marks the step as timed out and triggers compensation for all previously completed steps (not including the timed-out step, which never committed its result). After compensation completes, the saga reaches COMPENSATED state. The timed-out downstream call may still complete eventually — the compensating transaction must handle that race condition idempotently.”
}
}
]
}

{
“@context”: “https://schema.org”,
“@type”: “FAQPage”,
“mainEntity”: [
{
“@type”: “Question”,
“name”: “How does the saga orchestrator choose between orchestration and choreography?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Orchestration is preferred when the workflow is complex or requires central visibility; choreography suits loosely coupled services where each step reacts to events independently without a coordinator.”
}
},
{
“@type”: “Question”,
“name”: “How is compensation made idempotent?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “Each compensating action checks preconditions before executing (e.g., only refund if payment status is CAPTURED); repeated compensation calls are safe because the idempotency check short-circuits on already-compensated state.”
}
},
{
“@type”: “Question”,
“name”: “How are saga timeouts detected?”,
“acceptedAnswer”: {
“@type”: “Answer”,
“text”: “A scheduler queries SagaInstance WHERE state NOT IN (COMPLETED, COMPENSATED) AND updated_at < NOW() – timeout_interval; timed-out sagas are transitioned to COMPENSATING and their completed steps are reversed."
}
},
{
"@type": "Question",
"name": "How is the saga state machine persisted durably?",
"acceptedAnswer": {
"@type": "Answer",
"text": "SagaInstance and SagaStep are written in the same transaction as each forward or compensating action; the state machine can be fully reconstructed from the step log after a crash."
}
}
]
}