Outbox Pattern Low-Level Design

Outbox Pattern — Low-Level Design

The Transactional Outbox pattern ensures that database writes and message/event publishing happen atomically, without two-phase commit. It solves the dual-write problem: how do you write to both a database and a message broker (Kafka, SQS) without risking one succeeding and the other failing? This pattern is asked at Stripe, Uber, and any company doing event-driven microservices.

The Dual-Write Problem

-- Naive approach (broken):
def create_order(user_id, items):
    order = db.insert(Order, {...})       # Step 1: write to DB
    kafka.publish('order.created', order) # Step 2: publish event
    return order

-- Failure modes:
-- Step 1 succeeds, step 2 fails → DB has order but no event published
--   → downstream services never process the order
-- Step 1 fails, step 2 succeeds → event published for non-existent order
--   → downstream services process a ghost order

-- Two-phase commit (2PC) is the "correct" solution but requires
-- distributed transaction support across DB + broker — complex, slow, rarely available.

Outbox Pattern: Write to Same DB Transaction

-- The outbox table lives in the same database as the business data.
-- Writing both in one transaction makes them atomic.

CREATE TABLE OutboxEvent (
    id              BIGSERIAL PRIMARY KEY,
    aggregate_type  TEXT NOT NULL,       -- 'Order', 'Payment'
    aggregate_id    BIGINT NOT NULL,
    event_type      TEXT NOT NULL,       -- 'order.created'
    payload         JSONB NOT NULL,
    status          TEXT DEFAULT 'pending', -- pending, published
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    published_at    TIMESTAMPTZ
);

def create_order(user_id, items):
    with db.transaction():
        # Step 1: create the order
        order = db.insert(Order, {'user_id': user_id, 'status': 'pending'})

        # Step 2: write the event to the outbox — same transaction
        db.insert(OutboxEvent, {
            'aggregate_type': 'Order',
            'aggregate_id': order.id,
            'event_type': 'order.created',
            'payload': {'order_id': order.id, 'user_id': user_id, 'items': items},
        })
    # If the transaction commits: both order AND outbox event exist
    # If the transaction rolls back: neither exists
    return order

Outbox Relay Worker

def relay_outbox_events():
    """Runs continuously; picks up pending outbox events and publishes them."""
    while True:
        # Claim a batch atomically
        events = db.execute("""
            UPDATE OutboxEvent
            SET status = 'publishing'
            WHERE id IN (
                SELECT id FROM OutboxEvent
                WHERE status = 'pending'
                ORDER BY id ASC
                LIMIT 100
                FOR UPDATE SKIP LOCKED
            )
            RETURNING *
        """)

        for event in events:
            try:
                kafka.publish(
                    topic=event.event_type,
                    key=str(event.aggregate_id),
                    value=event.payload,
                    headers={'event_id': str(event.id)},  # for deduplication
                )
                db.execute("""
                    UPDATE OutboxEvent
                    SET status='published', published_at=NOW()
                    WHERE id=%(id)s
                """, {'id': event.id})
            except Exception:
                # Reset to pending for retry
                db.execute("UPDATE OutboxEvent SET status='pending' WHERE id=%(id)s",
                           {'id': event.id})

        if not events:
            time.sleep(0.1)

Change Data Capture Alternative

-- Instead of a relay worker polling the OutboxEvent table,
-- use PostgreSQL logical replication (Debezium) to stream changes.

-- Debezium reads PostgreSQL WAL (write-ahead log) and publishes
-- every INSERT into OutboxEvent directly to Kafka.

-- Advantages over relay worker:
--   Near real-time (milliseconds vs. polling interval)
--   No additional DB load from polling queries
--   Guaranteed ordering (WAL is ordered)

-- Disadvantages:
--   Requires Debezium/Kafka Connect setup
--   More infrastructure to manage
--   WAL retention must be configured (replication slot lag)

-- For most teams: start with a polling relay worker.
-- Migrate to CDC when sub-second latency is required.

Outbox Cleanup

-- Published events accumulate. Clean up after retention window.
-- Run nightly:
DELETE FROM OutboxEvent
WHERE status = 'published'
  AND published_at < NOW() - INTERVAL '7 days';

-- Or use partitioning: partition by week, drop old partitions.
-- Stale 'publishing' events (relay worker crashed):
UPDATE OutboxEvent SET status='pending'
WHERE status='publishing'
  AND created_at < NOW() - INTERVAL '5 minutes';

Key Interview Points

  • Same-DB transaction is the key insight: The outbox works because both the business record and the event record live in the same database. One ACID transaction writes both atomically. The relay worker publishes to Kafka separately — if it fails, the event stays in the outbox and is retried.
  • Events may be published more than once: The relay worker publishes at-least-once. Consumers must be idempotent (use the event.id as a deduplication key). The outbox pattern guarantees no events are lost; it does not guarantee exactly-once delivery.
  • Order preservation: Processing outbox events in id (insertion) order preserves causality. A payment.captured event will always be published after the corresponding order.created event.
  • FOR UPDATE SKIP LOCKED enables multiple relay workers: Run multiple relay workers in parallel; each claims a disjoint batch. Add workers to increase throughput.

Outbox pattern and payment event reliability design is discussed in Stripe system design interview questions.

Outbox pattern and reliable event delivery design is covered in Uber system design interview preparation.

Outbox pattern and order processing reliability is discussed in Shopify system design interview guide.

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

Scroll to Top