Outbox Pattern — Low-Level Design
The Transactional Outbox pattern ensures that database writes and message/event publishing happen atomically, without two-phase commit. It solves the dual-write problem: how do you write to both a database and a message broker (Kafka, SQS) without risking one succeeding and the other failing? This pattern is asked at Stripe, Uber, and any company doing event-driven microservices.
The Dual-Write Problem
-- Naive approach (broken):
def create_order(user_id, items):
order = db.insert(Order, {...}) # Step 1: write to DB
kafka.publish('order.created', order) # Step 2: publish event
return order
-- Failure modes:
-- Step 1 succeeds, step 2 fails → DB has order but no event published
-- → downstream services never process the order
-- Step 1 fails, step 2 succeeds → event published for non-existent order
-- → downstream services process a ghost order
-- Two-phase commit (2PC) is the "correct" solution but requires
-- distributed transaction support across DB + broker — complex, slow, rarely available.
Outbox Pattern: Write to Same DB Transaction
-- The outbox table lives in the same database as the business data.
-- Writing both in one transaction makes them atomic.
CREATE TABLE OutboxEvent (
id BIGSERIAL PRIMARY KEY,
aggregate_type TEXT NOT NULL, -- 'Order', 'Payment'
aggregate_id BIGINT NOT NULL,
event_type TEXT NOT NULL, -- 'order.created'
payload JSONB NOT NULL,
status TEXT DEFAULT 'pending', -- pending, published
created_at TIMESTAMPTZ DEFAULT NOW(),
published_at TIMESTAMPTZ
);
def create_order(user_id, items):
with db.transaction():
# Step 1: create the order
order = db.insert(Order, {'user_id': user_id, 'status': 'pending'})
# Step 2: write the event to the outbox — same transaction
db.insert(OutboxEvent, {
'aggregate_type': 'Order',
'aggregate_id': order.id,
'event_type': 'order.created',
'payload': {'order_id': order.id, 'user_id': user_id, 'items': items},
})
# If the transaction commits: both order AND outbox event exist
# If the transaction rolls back: neither exists
return order
Outbox Relay Worker
def relay_outbox_events():
"""Runs continuously; picks up pending outbox events and publishes them."""
while True:
# Claim a batch atomically
events = db.execute("""
UPDATE OutboxEvent
SET status = 'publishing'
WHERE id IN (
SELECT id FROM OutboxEvent
WHERE status = 'pending'
ORDER BY id ASC
LIMIT 100
FOR UPDATE SKIP LOCKED
)
RETURNING *
""")
for event in events:
try:
kafka.publish(
topic=event.event_type,
key=str(event.aggregate_id),
value=event.payload,
headers={'event_id': str(event.id)}, # for deduplication
)
db.execute("""
UPDATE OutboxEvent
SET status='published', published_at=NOW()
WHERE id=%(id)s
""", {'id': event.id})
except Exception:
# Reset to pending for retry
db.execute("UPDATE OutboxEvent SET status='pending' WHERE id=%(id)s",
{'id': event.id})
if not events:
time.sleep(0.1)
Change Data Capture Alternative
-- Instead of a relay worker polling the OutboxEvent table,
-- use PostgreSQL logical replication (Debezium) to stream changes.
-- Debezium reads PostgreSQL WAL (write-ahead log) and publishes
-- every INSERT into OutboxEvent directly to Kafka.
-- Advantages over relay worker:
-- Near real-time (milliseconds vs. polling interval)
-- No additional DB load from polling queries
-- Guaranteed ordering (WAL is ordered)
-- Disadvantages:
-- Requires Debezium/Kafka Connect setup
-- More infrastructure to manage
-- WAL retention must be configured (replication slot lag)
-- For most teams: start with a polling relay worker.
-- Migrate to CDC when sub-second latency is required.
Outbox Cleanup
-- Published events accumulate. Clean up after retention window.
-- Run nightly:
DELETE FROM OutboxEvent
WHERE status = 'published'
AND published_at < NOW() - INTERVAL '7 days';
-- Or use partitioning: partition by week, drop old partitions.
-- Stale 'publishing' events (relay worker crashed):
UPDATE OutboxEvent SET status='pending'
WHERE status='publishing'
AND created_at < NOW() - INTERVAL '5 minutes';
Key Interview Points
- Same-DB transaction is the key insight: The outbox works because both the business record and the event record live in the same database. One ACID transaction writes both atomically. The relay worker publishes to Kafka separately — if it fails, the event stays in the outbox and is retried.
- Events may be published more than once: The relay worker publishes at-least-once. Consumers must be idempotent (use the event.id as a deduplication key). The outbox pattern guarantees no events are lost; it does not guarantee exactly-once delivery.
- Order preservation: Processing outbox events in id (insertion) order preserves causality. A payment.captured event will always be published after the corresponding order.created event.
- FOR UPDATE SKIP LOCKED enables multiple relay workers: Run multiple relay workers in parallel; each claims a disjoint batch. Add workers to increase throughput.
Outbox pattern and payment event reliability design is discussed in Stripe system design interview questions.
Outbox pattern and reliable event delivery design is covered in Uber system design interview preparation.
Outbox pattern and order processing reliability is discussed in Shopify system design interview guide.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering