What Is a Pub/Sub Messaging System?
A publish/subscribe (pub/sub) messaging system decouples producers of events from consumers. Publishers emit messages to named topics without knowing who will receive them. Subscribers register interest in one or more topics and receive messages asynchronously. This pattern underpins event-driven architectures in systems such as Google Cloud Pub/Sub, Amazon SNS/SQS, and Apache Kafka.
Data Model
CREATE TABLE topics (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(255) UNIQUE NOT NULL,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    retention_ms BIGINT NOT NULL DEFAULT 604800000 -- 7 days
);

CREATE TABLE subscriptions (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    topic_id BIGINT NOT NULL REFERENCES topics(id),
    subscriber VARCHAR(255) NOT NULL, -- endpoint URL or queue name
    filter_expr TEXT, -- optional attribute filter
    ack_deadline_s INT NOT NULL DEFAULT 30,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (topic_id, subscriber)
);

CREATE TABLE messages (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    topic_id BIGINT NOT NULL REFERENCES topics(id),
    payload BLOB NOT NULL,
    attributes JSON,
    dedup_id VARCHAR(255), -- optional publisher-supplied dedup key
    published_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    expires_at DATETIME NOT NULL
);

CREATE TABLE ack_log (
    subscription_id BIGINT NOT NULL REFERENCES subscriptions(id),
    message_id BIGINT NOT NULL REFERENCES messages(id),
    status ENUM('pending','acked','nacked') NOT NULL DEFAULT 'pending',
    deliver_count INT NOT NULL DEFAULT 0,
    last_attempt DATETIME,
    PRIMARY KEY (subscription_id, message_id)
);
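To illustrate how retention_ms and expires_at work together, here is a minimal sketch of a retention sweep, using an in-memory SQLite database as a stand-in for the broker's store (the tables are simplified to the relevant columns; publish and retention_sweep are illustrative names, not a real broker API):

```python
import sqlite3
import time

# In-memory stand-in for the broker's store, reduced to retention-relevant columns.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE topics (id INTEGER PRIMARY KEY, retention_ms INTEGER)")
db.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, topic_id INTEGER, expires_at REAL)")

def publish(topic_id: int, now: float) -> int:
    # expires_at is fixed at publish time from the topic's retention window.
    (retention_ms,) = db.execute(
        "SELECT retention_ms FROM topics WHERE id = ?", (topic_id,)).fetchone()
    cur = db.execute(
        "INSERT INTO messages (topic_id, expires_at) VALUES (?, ?)",
        (topic_id, now + retention_ms / 1000.0))
    return cur.lastrowid

def retention_sweep(now: float) -> int:
    # Periodic job: drop every message that has outlived its topic's retention.
    cur = db.execute("DELETE FROM messages WHERE expires_at < ?", (now,))
    return cur.rowcount

db.execute("INSERT INTO topics (id, retention_ms) VALUES (1, 604800000)")  # 7 days
publish(1, time.time())
```

A real broker would run the sweep as a background job and would usually delete in batches to bound transaction size.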
Core Workflow
- Publish. Producer calls publish(topic, payload, attributes). The broker assigns a monotonically increasing message ID, persists the message, and fans out an ack_log row for every active subscription.
- Deliver. Each subscription has a dedicated dispatcher thread (or actor). The dispatcher queries ack_log for pending rows, fetches the message payload, and pushes it to the subscriber endpoint (HTTP POST, gRPC, or queue write).
- Acknowledge. The subscriber confirms receipt within ack_deadline_s seconds. The dispatcher updates ack_log.status = 'acked'.
- Filter. If a subscription carries a filter_expr, the dispatcher evaluates it against messages.attributes before delivery. Non-matching messages are auto-acked.
Failure Handling and Recovery
At-least-once delivery is the default guarantee. If a subscriber does not ack within the deadline, the dispatcher increments deliver_count and retries with exponential backoff (1 s, 2 s, 4 s, up to a configurable cap). After a configured maximum attempt count, the message is routed to a Dead Letter Topic (see the DLQ post).
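That retry schedule can be sketched as follows (base delay, cap, and maximum attempt count are configurable; the defaults here simply mirror the 1 s / 2 s / 4 s example):

```python
def retry_delay(deliver_count: int, base_s: float = 1.0, cap_s: float = 60.0) -> float:
    """Exponential backoff: 1 s, 2 s, 4 s, ... capped at cap_s."""
    return min(base_s * (2 ** deliver_count), cap_s)

def should_dead_letter(deliver_count: int, max_attempts: int = 5) -> bool:
    # After max_attempts failed deliveries, route the message to the dead letter topic.
    return deliver_count >= max_attempts
```

Production systems usually add jitter to the delay so that a burst of failed deliveries does not retry in lockstep.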
Broker crash recovery: Because every message and ack state is persisted before the dispatcher acts, a restarted broker simply re-reads pending rows and continues. No in-flight state is lost.
Duplicate handling: Consumers must be idempotent. Publishers may optionally provide a dedup_id; the broker deduplicates within a configurable window using a hash index on (topic_id, dedup_id).
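One way the publish-side dedup window could look, using an in-memory map keyed by (topic_id, dedup_id) as an assumed stand-in for the hash index (class and method names are hypothetical):

```python
class Deduplicator:
    def __init__(self, window_s: float):
        self.window_s = window_s
        # (topic_id, dedup_id) -> (original message ID, publish timestamp)
        self._seen: dict[tuple[int, str], tuple[int, float]] = {}

    def check(self, topic_id: int, dedup_id: str, msg_id: int, now: float) -> int:
        """Return the original message ID if this publish is a duplicate inside
        the window; otherwise record the new message and return msg_id."""
        key = (topic_id, dedup_id)
        hit = self._seen.get(key)
        if hit is not None and now - hit[1] <= self.window_s:
            return hit[0]            # duplicate: reuse the original ID
        self._seen[key] = (msg_id, now)
        return msg_id
```

A persistent implementation would also expire old entries, since an unbounded map defeats the point of a bounded window.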
Scalability Considerations
- Partitioned topics: Large-volume topics are split into N partitions. Each partition is an ordered log. Subscribers in a consumer group are assigned non-overlapping partitions, enabling horizontal fan-out throughput scaling.
- Pull vs push: Push delivery suits low-latency use cases; pull suits batch consumers that need back-pressure control.
- Storage tiering: Hot messages (last 24 h) stay in fast SSD-backed storage. Older messages spill to object storage (S3-compatible). The dispatcher is storage-tier-aware.
- Flow control: Per-subscription in-flight windows (e.g., max 1000 unacked messages) prevent fast publishers from overwhelming slow subscribers.
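The partition routing and consumer-group assignment described in the first bullet can be sketched as (the CRC32 hash and round-robin assignment are illustrative choices, not a prescribed algorithm):

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Messages sharing a key always land on the same partition,
    # which preserves their relative order.
    return zlib.crc32(key.encode()) % num_partitions

def assign_partitions(consumers: list[str], num_partitions: int) -> dict[str, list[int]]:
    # Round-robin: each partition is owned by exactly one consumer in the group.
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment
```

Because ownership is exclusive, adding consumers (up to the partition count) scales throughput without ever having two consumers race on the same ordered log.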
Summary
A pub/sub system is built around three concerns: reliable message persistence, fan-out delivery tracking per subscription, and bounded retry with explicit acknowledgment. The data model keeps these concerns in separate tables to allow independent scaling. In interviews, be ready to discuss ordering guarantees (per-partition vs global), exactly-once semantics (two-phase ack + idempotent consumers), and how retention policies interact with storage costs.
FAQ

Q: What are the core components of a pub/sub system design?
A: Publishers that emit events, a message broker or topic layer that routes messages, and subscribers that consume events matching their subscriptions. Supporting components include a topic registry, offset/cursor tracking for durable subscriptions, and a delivery-guarantee mechanism such as at-least-once or exactly-once semantics.

Q: How do you handle message ordering in a distributed pub/sub system?
A: Strict global ordering is expensive and usually replaced with per-partition ordering, where messages sharing a partition key (e.g., user ID) are routed to the same partition and consumed sequentially. Systems like Kafka enforce ordering within a partition while allowing parallel consumption across partitions, balancing throughput with ordering guarantees.

Q: What delivery semantics should a pub/sub system support and how are they implemented?
A: The three common semantics are at-most-once (fire and forget, no retries), at-least-once (messages persisted and retried until acknowledged), and exactly-once (idempotent consumers plus transactional commits). At-least-once is the most practical default; exactly-once requires careful coordination between the broker and consumer storage, often using idempotency keys or two-phase commit.

Q: How would you scale a pub/sub system to handle millions of messages per second?
A: Horizontal scaling is achieved by partitioning topics across multiple broker nodes and assigning consumer groups so each partition is processed by exactly one consumer instance. Adding partitions increases throughput linearly; a tiered storage layer (e.g., offloading old segments to object storage) keeps brokers lean while retaining long message history for replay.