An event bus is a pub/sub messaging system designed for event-driven architectures: services publish domain events, and multiple subscribers react independently. Unlike a message broker (which typically targets one consumer group per topic), an event bus is built for fan-out — one published event reaches many subscribers simultaneously via routing rules.
Pub/Sub Model
Publisher: a service that emits events when something happens (e.g., OrderPlaced, PaymentProcessed, UserRegistered). The publisher sends the event to the bus without knowing which services will consume it.
Event bus: receives events, validates schemas, routes to matching subscribers, delivers with retry semantics, and persists events to a durable log.
Subscriber: a service that declares interest in specific event types (and optionally, filter conditions on the payload). The bus delivers matching events to the subscriber's endpoint (HTTP webhook, queue, or function).
The key architectural property: publishers and subscribers are fully decoupled. Adding a new subscriber does not require any change to the publisher.
Topic Routing Strategies
Topic-based routing: subscriber declares “deliver all events of type OrderPlaced”. The bus routes by event_type exact match. Simple and efficient — routing decision is a hash map lookup on event_type.
Pattern-based routing: subscriber uses wildcards: Order.* matches OrderPlaced, OrderCancelled, OrderShipped. Useful for consuming all events from a domain without listing each event type explicitly. Implemented via trie or glob matching on the event_type string.
Content-based routing: subscriber provides a filter expression evaluated against the event payload: {"payload.order_value": {">=": 1000}}. More expressive but more CPU-intensive to evaluate. The bus evaluates filter expressions for each subscriber on each matching event type.
Schema Validation and Registry
Each event type has a registered schema (Avro or JSON Schema) stored in the EventType table. On publish:
- Validate payload against the schema for the declared event_type.
- If validation fails, reject the publish call (publisher receives a 400 error).
- If valid, assign event_id (UUID), record schema_version, and proceed to routing and storage.
Schema evolution rules:
- Backward compatible: add optional fields with defaults — existing subscribers continue working.
- Breaking change: remove required fields or change field types — requires a new schema_version, dual-version period, and subscriber migration.
The schema registry prevents schema drift: publishers cannot accidentally emit events that break downstream subscribers.
Fan-Out Delivery
On a validated event arriving at the bus:
- Persist event to EventLog (durable, append-only).
- Look up all subscriptions for the event_type.
- For each subscription with a filter_expr, evaluate the expression against the event envelope.
- Deliver to all matching subscribers in parallel (async HTTP POST to webhook, or enqueue to subscriber's dedicated queue).
- Track delivery status per (event_log_id, subscription_id).
Fan-out is parallel — a slow subscriber does not block delivery to other subscribers. Each subscriber has an independent delivery queue and retry state.
Delivery Semantics and Idempotency
The bus delivers at-least-once: on network failure or subscriber timeout, the bus retries the delivery. Subscribers must handle duplicate delivery using the event_id as an idempotency key:
- On receipt, check if event_id has been processed (e.g., via a processed_events table).
- If already processed, return 2xx to acknowledge without re-processing.
- If not processed, process and record event_id atomically (in the same database transaction as the business operation).
Dead Letter Queue
Each subscription can have DLQ enabled. Delivery failure handling:
- First failure: retry immediately.
- Subsequent failures: exponential backoff (1s, 2s, 4s, 8s, …).
- After max_attempts (configurable, e.g., 5): move event to DeadLetterEvent, stop retrying.
- DLQ processor (manual or automated): inspect failure_reason, fix issue, trigger replay or manual reprocessing.
Durable Log and Event Replay
All events are stored in EventLog before delivery. The log is append-only and retained for a configurable period (e.g., 30 days). Replay API: replay_from(event_type, from_offset) re-delivers events starting at a given position to a specified subscriber.
The event log serves as the system of record — it enables new subscriber bootstrapping, disaster recovery, and bug-fix reprocessing without requiring publishers to re-emit events.
SQL DDL
CREATE TABLE EventType (
id BIGSERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL UNIQUE,
schema_json JSONB NOT NULL,
schema_version INTEGER NOT NULL DEFAULT 1,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE Subscription (
id BIGSERIAL PRIMARY KEY,
subscriber_service VARCHAR(255) NOT NULL,
event_type VARCHAR(255) NOT NULL REFERENCES EventType(name),
filter_expr JSONB,
endpoint_url TEXT NOT NULL,
dlq_enabled BOOLEAN NOT NULL DEFAULT TRUE,
max_attempts INTEGER NOT NULL DEFAULT 5
);
CREATE INDEX idx_sub_event_type ON Subscription (event_type);
-- Durable event log
CREATE TABLE EventLog (
id BIGSERIAL PRIMARY KEY,
event_type_id BIGINT NOT NULL REFERENCES EventType(id),
event_id UUID NOT NULL UNIQUE DEFAULT gen_random_uuid(),
source_service VARCHAR(255) NOT NULL,
schema_version INTEGER NOT NULL,
payload JSONB NOT NULL,
published_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_eventlog_type_published ON EventLog (event_type_id, published_at);
CREATE TABLE DeadLetterEvent (
id BIGSERIAL PRIMARY KEY,
event_log_id BIGINT NOT NULL REFERENCES EventLog(id),
subscription_id BIGINT NOT NULL REFERENCES Subscription(id),
failure_reason TEXT NOT NULL,
attempts INTEGER NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (event_log_id, subscription_id)
);
Python: Core Operations
import uuid
import json
import time
import requests
from typing import Any, Callable, Optional
# In-memory state for illustration
_event_types: dict[str, dict] = {}
_subscriptions: list[dict] = []
_event_log: list[dict] = []
_dlq: list[dict] = []
def publish(event_type: str, payload: dict, source: str) -> str:
"""Validate, persist, and route an event. Returns event_id."""
et = _event_types.get(event_type)
if not et:
raise ValueError(f"Unknown event type: {event_type}")
# Schema validation (simplified: just check required fields exist)
for field in et.get('required_fields', []):
if field not in payload:
raise ValueError(f"Missing required field: {field}")
event = {
'event_id': str(uuid.uuid4()),
'event_type': event_type,
'source_service': source,
'schema_version': et['schema_version'],
'payload': payload,
'published_at': time.time(),
}
_event_log.append(event)
route_event(event)
return event['event_id']
def route_event(event: dict) -> None:
"""Fan-out event to all matching subscribers."""
for sub in _subscriptions:
if sub['event_type'] != event['event_type']:
continue
if sub.get('filter_expr') and not _evaluate_filter(sub['filter_expr'], event):
continue
deliver_to_subscriber(sub, event)
def deliver_to_subscriber(subscription: dict, event: dict,
attempt: int = 1) -> bool:
"""Deliver event to subscriber endpoint with retry and DLQ fallback."""
max_attempts = subscription.get('max_attempts', 5)
try:
resp = requests.post(
subscription['endpoint_url'],
json=event,
headers={'X-Event-Id': event['event_id'],
'X-Event-Type': event['event_type']},
timeout=5
)
if resp.status_code < 300:
return True
raise Exception(f"HTTP {resp.status_code}: {resp.text[:200]}")
except Exception as e:
if attempt int:
"""Re-deliver events of a given type starting from a log offset."""
sub = next((s for s in _subscriptions if s.get('id') == subscriber_id), None)
replayed = 0
for i, event in enumerate(_event_log):
if i bool:
"""Simplified content-based filter evaluation."""
for field_path, condition in filter_expr.items():
parts = field_path.split('.')
value = event
for part in parts:
value = value.get(part) if isinstance(value, dict) else None
if isinstance(condition, dict):
for op, threshold in condition.items():
if op == '>=' and not (value is not None and value >= threshold):
return False
elif op == '<=' and not (value is not None and value <= threshold):
return False
elif value != condition:
return False
return True
Design Considerations Summary
- Schema validation: enforce at publish time via schema registry; prevents schema drift from breaking subscribers silently.
- Content-based routing: powerful but CPU-intensive; compile filter expressions and share evaluation across subscribers with overlapping conditions.
- Dead letter queue: essential for operational visibility; alert on DLQ growth; provide replay tooling for DLQ reprocessing after fixes.
- Event replay: requires a durable append-only log with sufficient retention; design for it from the start as it is expensive to add retroactively.
- Fan-out scale: for very high fan-out (hundreds of subscribers per event type), deliver to per-subscriber queues asynchronously rather than making synchronous HTTP calls in the delivery path.
See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Anthropic Interview Guide 2026: Process, Questions, and AI Safety