Pub/Sub Service Low-Level Design: Topic Fanout, Subscription Filters, and Dead Letter Handling

Problem Statement and Requirements

A pub/sub service enables one-to-many message delivery: publishers send messages to topics without knowing who will receive them, and subscribers receive messages matching their subscriptions. Unlike a message broker focused on point-to-point or consumer-group delivery, pub/sub emphasizes fanout — a single published message may be delivered to thousands of independent subscribers. The key design challenges are efficient fanout, attribute-based filtering to avoid unnecessary delivery, and reliable dead letter handling for messages that cannot be processed.

Functional Requirements

  • Publishers send messages to named topics; messages include a body and key-value attributes
  • Subscribers create subscriptions on topics; subscriptions support push (HTTP webhook) and pull delivery modes
  • Attribute-based filtering: subscriptions specify filter expressions; only matching messages are delivered
  • At-least-once delivery with configurable retry policy (max attempts, backoff schedule)
  • Dead letter topics: messages that exceed the maximum number of delivery attempts are forwarded to the subscription's configured dead letter topic
  • Message ordering: an optional ordering key, set by the publisher on each message, guarantees in-order delivery among messages that share the key

Non-Functional Requirements

  • Fanout: a single publish to a topic with 10,000 subscriptions must complete delivery initiation within 1 second
  • End-to-end latency (publish to push delivery): p99 < 500 ms under normal conditions
  • Message retention: undelivered messages retained for up to 7 days
  • Throughput: 1 million publishes/sec globally with horizontal scaling

Core Data Model

Topics and Subscriptions

topics (
  topic_id      TEXT PRIMARY KEY,
  project_id    TEXT NOT NULL,
  message_retention_days INTEGER DEFAULT 7,
  schema_id     TEXT,                  -- optional schema enforcement
  created_at    TIMESTAMPTZ NOT NULL
)

subscriptions (
  subscription_id  TEXT PRIMARY KEY,
  topic_id         TEXT NOT NULL REFERENCES topics(topic_id),
  delivery_type    ENUM('push','pull'),
  push_endpoint    TEXT,               -- HTTPS URL for push delivery
  ack_deadline_sec INTEGER DEFAULT 10,
  filter_expr      TEXT,               -- attribute filter expression
  ordering_key     BOOLEAN DEFAULT FALSE,  -- TRUE enables ordered delivery by each message's ordering_key
  dead_letter_topic TEXT,
  max_delivery_attempts INTEGER DEFAULT 5,
  retry_policy     JSONB,              -- {min_backoff, max_backoff}
  created_at       TIMESTAMPTZ NOT NULL
)

Message Storage

messages (
  message_id    TEXT PRIMARY KEY,      -- globally unique ID assigned by broker
  topic_id      TEXT NOT NULL,
  data          BYTEA NOT NULL,        -- message body as raw bytes (base64 only in JSON transport)
  attributes    JSONB NOT NULL DEFAULT '{}',
  ordering_key  TEXT,
  publish_time  TIMESTAMPTZ NOT NULL,
  expires_at    TIMESTAMPTZ NOT NULL
)

delivery_attempts (
  attempt_id    BIGSERIAL PRIMARY KEY,
  message_id    TEXT NOT NULL,
  subscription_id TEXT NOT NULL,
  attempt_number  INTEGER NOT NULL,
  status        ENUM('pending','delivering','acked','nacked','dead_lettered'),
  next_delivery_at TIMESTAMPTZ NOT NULL,
  ack_id        TEXT UNIQUE,           -- opaque token returned to subscriber for ack
  delivered_at  TIMESTAMPTZ,
  acked_at      TIMESTAMPTZ
)

Key Algorithms and Logic

Fanout on Publish

When a message is published, the broker must create a delivery record for every matching subscription. For topics with thousands of subscriptions, synchronous fanout in the publish path would be too slow. The broker uses a two-stage approach: the publish API writes the message to durable storage and enqueues a fanout task, returning a message ID to the publisher immediately (acknowledgment of receipt, not delivery). A fanout worker reads the topic's subscription list, evaluates filter expressions, and writes delivery_attempts rows in batches. For topics with >100 subscriptions, fanout workers run in parallel across a worker pool.
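The two-stage flow can be sketched in memory as follows. This is a minimal illustration, not the broker's actual implementation: the Broker class, its fanout_queue, and the batch size of 500 are assumptions made for the example, and plain Python containers stand in for durable storage and the task queue.

```python
import uuid
from collections import deque
from dataclasses import dataclass, field
from itertools import islice
from typing import Callable, Dict, List, Optional

Attributes = Dict[str, str]

@dataclass
class Message:
    topic_id: str
    data: bytes
    attributes: Attributes
    message_id: str = field(default_factory=lambda: uuid.uuid4().hex)

@dataclass
class Subscription:
    subscription_id: str
    topic_id: str
    # None means "no filter": every message on the topic matches.
    filter_fn: Optional[Callable[[Attributes], bool]] = None

class Broker:
    """Hypothetical in-memory stand-in for storage plus the fanout queue."""

    def __init__(self, subscriptions: List[Subscription], batch_size: int = 500):
        self.subscriptions = list(subscriptions)
        self.batch_size = batch_size
        self.messages: Dict[str, Message] = {}
        self.fanout_queue: deque = deque()
        self.delivery_batches: List[List[dict]] = []  # one entry per batched insert

    def publish(self, topic_id: str, data: bytes, attributes: Optional[Attributes] = None) -> str:
        # Stage 1: persist the message, enqueue a fanout task, return at once.
        msg = Message(topic_id, data, attributes or {})
        self.messages[msg.message_id] = msg
        self.fanout_queue.append(msg.message_id)
        return msg.message_id

    def run_fanout_once(self) -> int:
        # Stage 2: a worker expands one message into delivery rows, in batches.
        msg = self.messages[self.fanout_queue.popleft()]
        matching = (
            s for s in self.subscriptions
            if s.topic_id == msg.topic_id
            and (s.filter_fn is None or s.filter_fn(msg.attributes))
        )
        written = 0
        while True:
            batch = [
                {"message_id": msg.message_id, "subscription_id": s.subscription_id,
                 "attempt_number": 1, "status": "pending"}
                for s in islice(matching, self.batch_size)
            ]
            if not batch:
                break
            self.delivery_batches.append(batch)
            written += len(batch)
        return written
```

Note that publish() does no per-subscription work, so its cost is independent of the subscription count; all of that cost moves to run_fanout_once(), which is the part that would be parallelized across workers.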

Attribute Filter Evaluation

Filter expressions match against message attributes using a restricted expression language: exact match (attributes.env = "prod"), existence (hasAttribute("priority")), and prefix match (attributes.topic_prefix: "payments/"). Filters are parsed at subscription creation time into an AST and stored. At fanout time, the compiled filter is evaluated in-memory against the message's attributes map. Invalid filters are rejected at subscription creation with a 400 error. Filters that match all messages (empty filter) are represented as a null AST for fast-path evaluation.
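As an illustration of parse-once, evaluate-many, the sketch below handles a single clause in each of the three forms listed above. The tuple-based AST, the function names parse_filter and matches, and the regular expressions are assumptions for the example; a real filter language would likely need a proper grammar and boolean combinators.

```python
import re
from typing import Dict, Optional, Tuple

# AST node: (operator, attribute key, operand or None). Hypothetical shape.
Node = Tuple[str, str, Optional[str]]

_PATTERNS = [
    ("eq", re.compile(r'^attributes\.(\w+)\s*=\s*"([^"]*)"$')),
    ("prefix", re.compile(r'^attributes\.(\w+)\s*:\s*"([^"]*)"$')),
    ("exists", re.compile(r'^hasAttribute\("(\w+)"\)$')),
]

def parse_filter(expr: Optional[str]) -> Optional[Node]:
    """Parse at subscription creation; an empty filter becomes a null AST."""
    if expr is None or not expr.strip():
        return None
    text = expr.strip()
    for op, pattern in _PATTERNS:
        m = pattern.match(text)
        if m:
            operand = None if op == "exists" else m.group(2)
            return (op, m.group(1), operand)
    # The API layer would translate this into a 400 response.
    raise ValueError(f"invalid filter expression: {expr!r}")

def matches(ast: Optional[Node], attributes: Dict[str, str]) -> bool:
    """Evaluate a parsed filter against a message's attributes at fanout time."""
    if ast is None:  # fast path: match-all filter
        return True
    op, key, operand = ast
    if key not in attributes:
        return False
    if op == "exists":
        return True
    if op == "eq":
        return attributes[key] == operand
    if op == "prefix":
        return attributes[key].startswith(operand)
    raise ValueError(f"unknown operator: {op}")
```

For example, matches(parse_filter('attributes.env = "prod"'), {"env": "dev"}) returns False, so no delivery row would be written for that subscription.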

Push Delivery with Retry and Backoff

For push subscriptions, a delivery worker picks up pending delivery_attempts where next_delivery_at <= now() and issues an HTTPS POST to the configured endpoint with the message payload. A 2xx response triggers an ack (status is updated to acked). Any non-2xx response or timeout triggers a nack: attempt_number increments and next_delivery_at is set to now() + min(max_backoff, min_backoff * 2^attempt_number) plus a small random jitter, so retries back off exponentially without synchronizing across messages. After max_delivery_attempts failed attempts, the message is written to the dead letter topic (as a new publish) and the delivery_attempt is marked dead_lettered.
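A minimal sketch of the retry decision follows. The text does not specify how jitter is applied, so the additive jitter of up to 10% of the base delay is an assumption, as are the function names and the convention that attempt_number counts attempts made so far, including the one that just failed.

```python
import random
from typing import Optional, Tuple

def backoff_seconds(
    attempt_number: int,
    min_backoff: float,
    max_backoff: float,
    jitter_fraction: float = 0.1,      # assumed jitter size, not from the text
    rng: Optional[random.Random] = None,
) -> float:
    """Capped exponential delay plus additive random jitter."""
    rng = rng or random.Random()
    base = min(max_backoff, min_backoff * (2 ** attempt_number))
    return base + rng.uniform(0.0, jitter_fraction * base)

def on_delivery_failure(
    attempt_number: int,
    max_delivery_attempts: int,
    min_backoff: float,
    max_backoff: float,
) -> Tuple[str, Optional[float]]:
    """Decide what happens after a non-2xx response or a timeout."""
    if attempt_number >= max_delivery_attempts:
        # Caller publishes to the dead letter topic and marks the row dead_lettered.
        return ("dead_letter", None)
    # Caller sets next_delivery_at = now() + delay.
    return ("retry", backoff_seconds(attempt_number, min_backoff, max_backoff))
```

With min_backoff = 10 and max_backoff = 600, the base delays for attempts 1 through 6 are 20, 40, 80, 160, 320, and 600 seconds, with the cap taking effect from the sixth attempt onward.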

Pull Delivery and Ack Deadline

Pull subscribers call a Pull API to receive a batch of messages. Each message comes with an ack_id, and the subscriber has ack_deadline_sec seconds to acknowledge it. If the deadline expires without an ack, the message becomes available for re-delivery to any puller on the subscription. Subscribers that need more processing time call ModifyAckDeadline to extend the deadline. The mechanism is implemented through the next_delivery_at field: it is set to now() + ack_deadline_sec on delivery, and the message is re-queued if that time passes without an ack.
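The lease behaviour can be modelled with a few lines of in-memory code. This is a sketch under assumptions: the PullSubscription class and its method names are invented for the example, time is passed in explicitly as a float so the behaviour is deterministic, and a fresh ack_id is issued on every delivery so that a stale ack_id from an expired lease is rejected.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Delivery:
    message_id: str
    next_delivery_at: float = 0.0   # deliverable when <= now
    ack_id: Optional[str] = None
    acked: bool = False

class PullSubscription:
    """Hypothetical in-memory model of one pull subscription."""

    def __init__(self, ack_deadline_sec: float = 10.0):
        self.ack_deadline_sec = ack_deadline_sec
        self.deliveries: List[Delivery] = []

    def enqueue(self, message_id: str) -> None:
        self.deliveries.append(Delivery(message_id))

    def pull(self, max_messages: int, now: float) -> List[Tuple[str, str]]:
        out: List[Tuple[str, str]] = []
        for d in self.deliveries:
            if len(out) >= max_messages:
                break
            if not d.acked and d.next_delivery_at <= now:
                d.ack_id = uuid.uuid4().hex                       # new lease token
                d.next_delivery_at = now + self.ack_deadline_sec  # lease expiry
                out.append((d.message_id, d.ack_id))
        return out

    def _find(self, ack_id: str) -> Optional[Delivery]:
        return next((d for d in self.deliveries if d.ack_id == ack_id and not d.acked), None)

    def acknowledge(self, ack_id: str) -> bool:
        d = self._find(ack_id)
        if d is None:
            return False   # unknown, stale, or already acked
        d.acked = True
        return True

    def modify_ack_deadline(self, ack_id: str, seconds: float, now: float) -> bool:
        d = self._find(ack_id)
        if d is None:
            return False
        d.next_delivery_at = now + seconds   # push the lease expiry out
        return True
```

A message pulled at time 0 with a 10-second deadline is invisible to other pulls until time 10; extending it at time 5 by 30 seconds hides it until time 35.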

Ordered Delivery

When a subscription has ordering enabled, messages with the same ordering_key are delivered in publish order. The broker assigns messages with the same ordering key to the same delivery shard (consistent hashing on the key). Within a shard, messages are delivered strictly sequentially — the next message for a key is not delivered until the prior one is acked. This reduces throughput but guarantees ordering. A nack on an ordered message pauses delivery for that key until the message is acked or dead-lettered.
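The per-key sequencing rule within one shard can be sketched as below. The OrderedKeyDispatcher class and its method names are hypothetical, and shard assignment is left out so the example focuses only on the "one in-flight message per key" invariant.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

class OrderedKeyDispatcher:
    """Hypothetical single-shard dispatcher: at most one in-flight message per key."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[str]] = defaultdict(deque)  # publish order per key
        self.in_flight: Dict[str, str] = {}                      # key -> message_id

    def enqueue(self, ordering_key: str, message_id: str) -> None:
        self.queues[ordering_key].append(message_id)

    def next_deliverable(self) -> List[Tuple[str, str]]:
        """Return the head message of every key that has nothing in flight."""
        ready = []
        for key, queue in self.queues.items():
            if queue and key not in self.in_flight:
                self.in_flight[key] = queue[0]
                ready.append((key, queue[0]))
        return ready

    def ack(self, ordering_key: str, message_id: str) -> None:
        # Only an ack for the in-flight head unblocks the next message.
        if self.in_flight.get(ordering_key) == message_id:
            self.queues[ordering_key].popleft()
            del self.in_flight[ordering_key]

    def nack(self, ordering_key: str, message_id: str) -> None:
        # The head stays at the front, so later messages for the key remain blocked
        # and the same message is what gets retried.
        if self.in_flight.get(ordering_key) == message_id:
            del self.in_flight[ordering_key]

    def dead_letter(self, ordering_key: str, message_id: str) -> None:
        # Dead-lettering removes the head exactly like an ack, unblocking the key.
        self.ack(ordering_key, message_id)
```

Keys are independent of each other here: a nack for one key leaves every other key's delivery untouched, which is why ordering costs throughput per key rather than per subscription.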

API Design

  • POST /v1/projects/{project}/topics/{topic}/publish: publish one or more messages; returns array of message IDs
  • POST /v1/projects/{project}/subscriptions/{sub}/pull?maxMessages=N: pull up to N pending messages; returns messages with ack IDs
  • POST /v1/projects/{project}/subscriptions/{sub}/acknowledge: ack a list of ack IDs
  • POST /v1/projects/{project}/subscriptions/{sub}/modifyAckDeadline: extend deadline for ack IDs
  • PUT /v1/projects/{project}/subscriptions/{sub}: create or update subscription with filter, delivery config, dead letter config
  • GET /v1/projects/{project}/subscriptions/{sub}/snapshot: get current delivery position for time-travel replay
  • POST /v1/projects/{project}/subscriptions/{sub}/seek: replay from a snapshot or timestamp

Scalability Considerations

Fanout Worker Scaling

Fanout workers are stateless and scale horizontally. Topic-to-worker assignment uses consistent hashing so that all publishes to a topic go to the same worker pool, enabling local caching of the subscription list. The subscription list is cached in memory with a short TTL (30 seconds) and invalidated on subscription create/delete events via a control plane notification. For topics with 10,000+ subscriptions, a dedicated high-fanout tier of workers is used with larger batch sizes for delivery_attempts inserts.
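One way to picture the topic-to-worker mapping and the cached subscription list is the sketch below. The HashRing and SubscriptionListCache classes, the 64 virtual nodes per worker, and the loader callback are assumptions for illustration; only the 30-second TTL and the invalidate-on-change behaviour come from the description above.

```python
import bisect
import hashlib
from typing import Callable, Dict, List, Tuple

def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

class HashRing:
    """Consistent hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, workers: List[str], vnodes: int = 64):
        if not workers:
            raise ValueError("at least one worker is required")
        self._ring = sorted((_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def worker_for(self, topic_id: str) -> str:
        # First ring point clockwise from the topic's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(topic_id)) % len(self._points)
        return self._ring[idx][1]

class SubscriptionListCache:
    """Per-worker cache of a topic's subscription list with a TTL."""

    def __init__(self, loader: Callable[[str], List[str]], ttl_sec: float = 30.0):
        self._loader = loader
        self._ttl = ttl_sec
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, topic_id: str, now: float) -> List[str]:
        entry = self._entries.get(topic_id)
        if entry is None or entry[0] <= now:          # missing or expired
            entry = (now + self._ttl, self._loader(topic_id))
            self._entries[topic_id] = entry
        return entry[1]

    def invalidate(self, topic_id: str) -> None:
        # Called on a control plane notification for subscription create/delete.
        self._entries.pop(topic_id, None)
```

Because only the ring points of a removed worker disappear, topics owned by the remaining workers keep their assignment, and their warm subscription caches stay useful when the pool is resized.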

Dead Letter Handling Operations

The dead letter topic is a regular topic, so operators can inspect, replay, or discard failed messages using the same APIs. A common operational pattern pairs a dead letter monitor subscription, which alerts on new dead letter messages, with a remediation workflow: fix the consumer bug, then return the dead letter messages to the original topic through a seek-based replay or a manual re-publish script.

Interview Discussion Points

  • How do you prevent a slow push subscriber from blocking fanout for other subscriptions on the same topic?
  • What are the trade-offs between push and pull delivery? (Push is lower latency; pull allows consumer-controlled rate)
  • How do you implement exactly-once delivery semantics in a pub/sub system?
  • How would you handle a subscriber endpoint that returns 200 but does not actually process the message (silent discard)?
  • How do attribute-based filters interact with message ordering — can filters cause out-of-order delivery for a subscriber?
