Event Bus Low-Level Design: Pub/Sub Routing, Schema Registry, Dead Letters, and Event Replay

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How does schema validation enforcement work in an event bus?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Schema validation is enforced at publish time. Each event type has a registered schema (Avro or JSON Schema) in the schema registry with a schema_version. When a publisher calls publish(event_type, payload), the bus looks up the current schema for that event type and validates the payload against it. If validation fails, the publish call is rejected with a detailed error — the event never enters the log. Schema evolution is handled via compatibility rules: backward-compatible changes (adding optional fields) are allowed without a version bump; breaking changes (removing fields, changing types) require a new schema_version and a migration period during which both versions are accepted. Subscribers declare which schema_version they support; the bus can transform events between compatible versions."
      }
    },
    {
      "@type": "Question",
      "name": "How does content-based routing work in an event bus?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Content-based routing evaluates filter expressions against the event envelope (metadata fields + payload) at delivery time. A subscription includes a filter_expr such as: {\"payload.amount\": {\">=\": 1000}, \"payload.currency\": \"USD\"}. When an event is published, the bus evaluates the filter expression for each subscriber. Only matching subscribers receive the event. Filter evaluation must be fast to avoid being the delivery bottleneck — common implementations use a compiled expression tree (not JSON parsing on every event) or a Rete-like algorithm that shares evaluation across subscribers with overlapping conditions. Content-based routing enables fine-grained fan-out without requiring publishers to know subscriber topology."
      }
    },
    {
      "@type": "Question",
      "name": "How does the dead letter queue mechanism work?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "When the bus attempts to deliver an event to a subscriber and the delivery fails (HTTP timeout, non-2xx response, or subscriber service down), the bus retries with exponential backoff. After a configured maximum number of attempts (e.g., 5), the event is moved to the Dead Letter Queue (DLQ) for that subscription. The DLQ record captures the original event, the subscription ID, the failure reason, and the number of attempts. DLQ events are not retried automatically — they require manual intervention: a human or an automated DLQ processor inspects the failure reason, fixes the underlying issue (e.g., deploys a bug fix to the subscriber), and replays the DLQ events. DLQ monitoring and alerting are critical to avoid silently dropped events."
      }
    },
    {
      "@type": "Question",
      "name": "What are the use cases for event replay from the durable log?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Event replay allows a subscriber to re-consume events from a specific point in the event log. Common use cases: (1) New subscriber bootstrap — a new service subscribes to an event type and needs to process all historical events to build its initial state; (2) Bug fix replay — a subscriber had a bug that caused incorrect processing; after the fix is deployed, it replays the affected time window to correct its state; (3) Disaster recovery — a subscriber's database was corrupted; replay rebuilds the state from the authoritative event log; (4) Analytics backfill — a new analytics query needs historical data that was not previously materialized. Replay is only possible if the event bus maintains a durable append-only log with configurable retention — unlike traditional fire-and-forget pub/sub systems that do not persist events after delivery."
      }
    }
  ]
}

An event bus is a pub/sub messaging system designed for event-driven architectures: services publish domain events, and multiple subscribers react independently. Unlike a point-to-point message queue (where each message is consumed by exactly one worker), an event bus is built for fan-out: one published event reaches many subscribers simultaneously via routing rules.

Pub/Sub Model

Publisher: a service that emits events when something happens (e.g., OrderPlaced, PaymentProcessed, UserRegistered). The publisher sends the event to the bus without knowing which services will consume it.

Event bus: receives events, validates schemas, routes to matching subscribers, delivers with retry semantics, and persists events to a durable log.

Subscriber: a service that declares interest in specific event types (and optionally, filter conditions on the payload). The bus delivers matching events to the subscriber's endpoint (HTTP webhook, queue, or function).

The key architectural property: publishers and subscribers are fully decoupled. Adding a new subscriber does not require any change to the publisher.

Topic Routing Strategies

Topic-based routing: subscriber declares “deliver all events of type OrderPlaced”. The bus routes by event_type exact match. Simple and efficient — routing decision is a hash map lookup on event_type.

Pattern-based routing: subscriber uses wildcards: Order.* matches OrderPlaced, OrderCancelled, OrderShipped. Useful for consuming all events from a domain without listing each event type explicitly. Implemented via trie or glob matching on the event_type string.

Content-based routing: subscriber provides a filter expression evaluated against the event payload: {"payload.order_value": {">=": 1000}}. More expressive but more CPU-intensive to evaluate. The bus evaluates filter expressions for each subscriber on each matching event type.
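The three strategies can be compared in a small sketch. The subscription shapes and the `matching_subscriptions` helper below are hypothetical, and pattern matching is shown with regex semantics (so `Order.*` matches `OrderPlaced`); a production bus would precompile patterns and filter expressions.

```python
import re

# Illustrative subscriptions: one per routing strategy.
subscriptions = [
    {"id": 1, "match": ("topic", "OrderPlaced")},             # exact event_type match
    {"id": 2, "match": ("pattern", r"Order.*")},              # wildcard on event_type
    {"id": 3, "match": ("topic", "OrderPlaced"),              # content-based filter
     "filter": lambda p: p.get("order_value", 0) >= 1000},
]

def matching_subscriptions(event_type: str, payload: dict) -> list[int]:
    """Return IDs of subscriptions whose routing rule matches this event."""
    matched = []
    for sub in subscriptions:
        kind, expr = sub["match"]
        if kind == "topic" and event_type != expr:
            continue
        if kind == "pattern" and not re.fullmatch(expr, event_type):
            continue
        if "filter" in sub and not sub["filter"](payload):
            continue
        matched.append(sub["id"])
    return matched

print(matching_subscriptions("OrderPlaced", {"order_value": 250}))  # [1, 2]
print(matching_subscriptions("OrderCancelled", {}))                 # [2]
```

Topic matching is a dictionary lookup in practice; the linear scan here is only for clarity.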

Schema Validation and Registry

Each event type has a registered schema (Avro or JSON Schema) stored in the EventType table. On publish:

  1. Validate payload against the schema for the declared event_type.
  2. If validation fails, reject the publish call (publisher receives a 400 error).
  3. If valid, assign event_id (UUID), record schema_version, and proceed to routing and storage.

Schema evolution rules:

  • Backward compatible: add optional fields with defaults — existing subscribers continue working.
  • Breaking change: remove required fields or change field types — requires a new schema_version, dual-version period, and subscriber migration.

The schema registry prevents schema drift: publishers cannot accidentally emit events that break downstream subscribers.
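A minimal compatibility check makes these rules concrete. The sketch below assumes a simplified schema shape ({'required': [...], 'properties': {name: type}}); real registries implement richer Avro/JSON Schema compatibility modes.

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """True if subscribers of old_schema keep working under new_schema."""
    old_req, new_req = set(old_schema["required"]), set(new_schema["required"])
    # Newly required fields break existing publishers and consumers.
    if new_req - old_req:
        return False
    # Removing a field or changing its type breaks existing subscribers.
    for name, typ in old_schema["properties"].items():
        if new_schema["properties"].get(name) != typ:
            return False
    return True

v1 = {"required": ["order_id"], "properties": {"order_id": "string"}}
v2 = {"required": ["order_id"],
      "properties": {"order_id": "string", "coupon": "string"}}      # optional add
v3 = {"required": ["order_id"], "properties": {"order_id": "integer"}}  # type change

print(is_backward_compatible(v1, v2))  # True  — safe, no version bump needed
print(is_backward_compatible(v1, v3))  # False — requires new schema_version
```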

Fan-Out Delivery

On a validated event arriving at the bus:

  1. Persist event to EventLog (durable, append-only).
  2. Look up all subscriptions for the event_type.
  3. For each subscription with a filter_expr, evaluate the expression against the event envelope.
  4. Deliver to all matching subscribers in parallel (async HTTP POST to webhook, or enqueue to subscriber's dedicated queue).
  5. Track delivery status per (event_log_id, subscription_id).

Fan-out is parallel — a slow subscriber does not block delivery to other subscribers. Each subscriber has an independent delivery queue and retry state.
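Parallel fan-out can be sketched with a thread pool. This is illustrative: the `deliver` callable and worker count are assumptions, and a production bus would enqueue to per-subscriber queues rather than run threads in the publish path.

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(event: dict, subscribers: list[dict], deliver) -> dict[int, bool]:
    """Deliver one event to all subscribers in parallel; a slow or failing
    subscriber does not delay the others. deliver(sub, event) -> bool."""
    with ThreadPoolExecutor(max_workers=16) as pool:
        futures = {pool.submit(deliver, sub, event): sub["id"]
                   for sub in subscribers}
        # Per-subscription delivery status, keyed by subscription id.
        return {sub_id: fut.result() for fut, sub_id in futures.items()}
```

The returned status map corresponds to the per-(event_log_id, subscription_id) tracking described above.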

Delivery Semantics and Idempotency

The bus delivers at-least-once: on network failure or subscriber timeout, the bus retries the delivery. Subscribers must handle duplicate delivery using the event_id as an idempotency key:

  • On receipt, check if event_id has been processed (e.g., via a processed_events table).
  • If already processed, return 2xx to acknowledge without re-processing.
  • If not processed, process and record event_id atomically (in the same database transaction as the business operation).
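The check-and-record pattern above can be sketched with SQLite standing in for the subscriber's database; the `orders` table and payload shape are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE orders (order_id TEXT, status TEXT)")

def handle_event(event_id: str, payload: dict) -> bool:
    """Process an event exactly once: the idempotency record and the business
    write commit in the same transaction. Returns False for duplicates."""
    try:
        with db:  # one transaction: both inserts commit, or neither does
            db.execute("INSERT INTO processed_events VALUES (?)", (event_id,))
            db.execute("INSERT INTO orders VALUES (?, ?)",
                       (payload["order_id"], "placed"))
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: acknowledge without re-processing

print(handle_event("evt-1", {"order_id": "o-42"}))  # True  — first delivery
print(handle_event("evt-1", {"order_id": "o-42"}))  # False — duplicate ignored
```

Because the primary-key insert and the business write share one transaction, a crash between them cannot leave the event half-processed.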

Dead Letter Queue

Each subscription can have DLQ enabled. Delivery failure handling:

  1. First failure: retry immediately.
  2. Subsequent failures: exponential backoff (1s, 2s, 4s, 8s, …).
  3. After max_attempts (configurable, e.g., 5): move event to DeadLetterEvent, stop retrying.
  4. DLQ processor (manual or automated): inspect failure_reason, fix issue, trigger replay or manual reprocessing.
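Under these rules the wait before each retry is deterministic; a small (hypothetical) helper makes the schedule explicit:

```python
def retry_delays(max_attempts: int = 5) -> list[float]:
    """Seconds to wait before each retry under the policy above: the first
    retry is immediate, later ones double starting at 1s."""
    # With max_attempts total attempts there are max_attempts - 1 retries.
    return [0.0] + [float(2 ** i) for i in range(max_attempts - 2)]

print(retry_delays(5))  # [0.0, 1.0, 2.0, 4.0]
```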

Durable Log and Event Replay

All events are stored in EventLog before delivery. The log is append-only and retained for a configurable period (e.g., 30 days). Replay API: replay_from(event_type, from_offset) re-delivers events starting at a given position to a specified subscriber.

The event log serves as the system of record — it enables new subscriber bootstrapping, disaster recovery, and bug-fix reprocessing without requiring publishers to re-emit events.

SQL DDL

CREATE TABLE EventType (
    id              BIGSERIAL PRIMARY KEY,
    name            VARCHAR(255)  NOT NULL UNIQUE,
    schema_json     JSONB         NOT NULL,
    schema_version  INTEGER       NOT NULL DEFAULT 1,
    created_at      TIMESTAMPTZ   NOT NULL DEFAULT now()
);

CREATE TABLE Subscription (
    id                 BIGSERIAL PRIMARY KEY,
    subscriber_service VARCHAR(255)  NOT NULL,
    event_type         VARCHAR(255)  NOT NULL REFERENCES EventType(name),
    filter_expr        JSONB,
    endpoint_url       TEXT          NOT NULL,
    dlq_enabled        BOOLEAN       NOT NULL DEFAULT TRUE,
    max_attempts       INTEGER       NOT NULL DEFAULT 5
);

CREATE INDEX idx_sub_event_type ON Subscription (event_type);

-- Durable event log
CREATE TABLE EventLog (
    id              BIGSERIAL PRIMARY KEY,
    event_type_id   BIGINT        NOT NULL REFERENCES EventType(id),
    event_id        UUID          NOT NULL UNIQUE DEFAULT gen_random_uuid(),
    source_service  VARCHAR(255)  NOT NULL,
    schema_version  INTEGER       NOT NULL,
    payload         JSONB         NOT NULL,
    published_at    TIMESTAMPTZ   NOT NULL DEFAULT now()
);

CREATE INDEX idx_eventlog_type_published ON EventLog (event_type_id, published_at);

CREATE TABLE DeadLetterEvent (
    id              BIGSERIAL PRIMARY KEY,
    event_log_id    BIGINT        NOT NULL REFERENCES EventLog(id),
    subscription_id BIGINT        NOT NULL REFERENCES Subscription(id),
    failure_reason  TEXT          NOT NULL,
    attempts        INTEGER       NOT NULL,
    created_at      TIMESTAMPTZ   NOT NULL DEFAULT now(),
    UNIQUE (event_log_id, subscription_id)
);

Python: Core Operations

import uuid
import json
import time
import requests

# In-memory state for illustration
_event_types: dict[str, dict] = {}
_subscriptions: list[dict] = []
_event_log: list[dict] = []
_dlq: list[dict] = []

def publish(event_type: str, payload: dict, source: str) -> str:
    """Validate, persist, and route an event. Returns event_id."""
    et = _event_types.get(event_type)
    if not et:
        raise ValueError(f"Unknown event type: {event_type}")
    # Schema validation (simplified: just check required fields exist)
    for field in et.get('required_fields', []):
        if field not in payload:
            raise ValueError(f"Missing required field: {field}")
    event = {
        'event_id': str(uuid.uuid4()),
        'event_type': event_type,
        'source_service': source,
        'schema_version': et['schema_version'],
        'payload': payload,
        'published_at': time.time(),
    }
    _event_log.append(event)
    route_event(event)
    return event['event_id']

def route_event(event: dict) -> None:
    """Fan-out event to all matching subscribers."""
    for sub in _subscriptions:
        if sub['event_type'] != event['event_type']:
            continue
        if sub.get('filter_expr') and not _evaluate_filter(sub['filter_expr'], event):
            continue
        deliver_to_subscriber(sub, event)

def deliver_to_subscriber(subscription: dict, event: dict,
                          attempt: int = 1) -> bool:
    """Deliver event to subscriber endpoint with retry and DLQ fallback."""
    max_attempts = subscription.get('max_attempts', 5)
    try:
        resp = requests.post(
            subscription['endpoint_url'],
            json=event,
            headers={'X-Event-Id': event['event_id'],
                     'X-Event-Type': event['event_type']},
            timeout=5
        )
        if resp.status_code < 300:
            return True
        raise Exception(f"HTTP {resp.status_code}: {resp.text[:200]}")
    except Exception as e:
        if attempt < max_attempts:
            if attempt > 1:  # first retry is immediate; then 1s, 2s, 4s, ...
                time.sleep(2 ** (attempt - 2))
            return deliver_to_subscriber(subscription, event, attempt + 1)
        if subscription.get('dlq_enabled', True):
            _dlq.append({
                'event_id': event['event_id'],
                'subscription_id': subscription.get('id'),
                'failure_reason': str(e),
                'attempts': attempt,
            })
        return False

def replay_events(event_type: str, from_offset: int, subscriber_id: int) -> int:
    """Re-deliver events of a given type starting from a log offset."""
    sub = next((s for s in _subscriptions if s.get('id') == subscriber_id), None)
    if sub is None:
        raise ValueError(f"Unknown subscription: {subscriber_id}")
    replayed = 0
    for i, event in enumerate(_event_log):
        if i < from_offset or event['event_type'] != event_type:
            continue
        deliver_to_subscriber(sub, event)
        replayed += 1
    return replayed

def _evaluate_filter(filter_expr: dict, event: dict) -> bool:
    """Simplified content-based filter evaluation."""
    for field_path, condition in filter_expr.items():
        parts = field_path.split('.')
        value = event
        for part in parts:
            value = value.get(part) if isinstance(value, dict) else None
        if isinstance(condition, dict):
            for op, threshold in condition.items():
                if op == '>=' and not (value is not None and value >= threshold):
                    return False
                elif op == '<=' and not (value is not None and value <= threshold):
                    return False
        elif value != condition:
            return False
    return True

Design Considerations Summary

  • Schema validation: enforce at publish time via schema registry; prevents schema drift from breaking subscribers silently.
  • Content-based routing: powerful but CPU-intensive; compile filter expressions and share evaluation across subscribers with overlapping conditions.
  • Dead letter queue: essential for operational visibility; alert on DLQ growth; provide replay tooling for DLQ reprocessing after fixes.
  • Event replay: requires a durable append-only log with sufficient retention; design for it from the start as it is expensive to add retroactively.
  • Fan-out scale: for very high fan-out (hundreds of subscribers per event type), deliver to per-subscriber queues asynchronously rather than making synchronous HTTP calls in the delivery path.

