What Is a News Feed Aggregator?
A news feed aggregator collects articles and posts from multiple external sources — RSS feeds, REST APIs, and scraped pages — normalizes them into a unified format, removes duplicate content, and ranks items by relevance before delivering a personalized stream to each user. Designing one at a low level requires careful attention to polling schedules, deduplication strategy, the scoring pipeline, and delivery infrastructure.
Requirements
Functional Requirements
- Poll hundreds of thousands of RSS/Atom feeds and partner APIs on configurable intervals.
- Deduplicate articles that appear across multiple sources or re-submissions of the same story.
- Score articles by freshness, source authority, and user interest signals.
- Deliver a ranked, personalized feed per user via REST and WebSocket push.
- Support category filtering, keyword subscriptions, and blocked sources.
Non-Functional Requirements
- Ingest up to 50,000 new articles per minute at peak.
- Feed API p99 latency under 150 ms for up to 50 million daily active users.
- Near-real-time delivery: articles visible within 60 seconds of source publication.
- Deduplication false-negative rate below 0.1%.
Data Model
Three primary entities drive the system. The Source record stores the feed URL, polling interval, last-fetched timestamp, ETag/Last-Modified headers for conditional GET optimization, and a reliability score used to prioritize the polling queue. The Article record holds a canonical ID (SHA-256 of normalized URL), title, body text, publication timestamp, source ID, embedding vector for similarity search, and computed relevance score. The UserFeedPreference record maps user ID to followed sources, topic weights, and blocked domains.
Articles are stored in a relational database (PostgreSQL) for metadata and in a document store (Elasticsearch) for full-text search and faceted filtering. Embedding vectors are stored in a purpose-built vector index (FAISS or pgvector) for similarity-based deduplication and personalization.
Core Algorithms
Adaptive Polling
A scheduler assigns each source a polling interval based on its historical publication rate. Sources that publish frequently get short intervals (as low as 60 seconds); inactive sources are polled every few hours. The scheduler uses a priority queue sorted by next-fetch time. Workers pull the top-N sources, issue conditional HTTP GETs, and re-enqueue with the updated interval. This reduces unnecessary requests by up to 70% compared to fixed-interval polling.
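A minimal version of this scheduler can use a binary min-heap keyed by next-fetch time, halving the interval after a productive poll and doubling it after an empty one. The 60-second floor and 4-hour ceiling below are illustrative assumptions:

```python
import heapq
import time

class PollScheduler:
    """Min-heap of (next_fetch_ts, source_id, interval_s)."""
    MIN_INTERVAL, MAX_INTERVAL = 60, 4 * 3600   # illustrative bounds

    def __init__(self):
        self._heap = []

    def schedule(self, source_id, interval_s, now=None):
        now = now if now is not None else time.time()
        interval_s = max(self.MIN_INTERVAL, min(self.MAX_INTERVAL, interval_s))
        heapq.heappush(self._heap, (now + interval_s, source_id, interval_s))

    def pop_due(self, n, now=None):
        """Return up to n sources whose next-fetch time has passed."""
        now = now if now is not None else time.time()
        due = []
        while self._heap and len(due) < n and self._heap[0][0] <= now:
            _, source_id, interval_s = heapq.heappop(self._heap)
            due.append((source_id, interval_s))
        return due

    def reschedule(self, source_id, interval_s, got_new_items, now=None):
        """Shrink the interval when a poll yields items, back off when empty."""
        factor = 0.5 if got_new_items else 2.0
        self.schedule(source_id, interval_s * factor, now)
```

Workers would call `pop_due`, issue conditional GETs for each returned source, then `reschedule` with the outcome.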
Deduplication Pipeline
Deduplication runs in two passes. The first pass uses a canonical URL normalization step — stripping tracking parameters, resolving redirects, and lowercasing the host — then checks a Bloom filter backed by Redis to detect exact URL duplicates in O(1). The second pass computes a SimHash fingerprint of the article title and first 200 characters of body text. Articles whose SimHash differs by fewer than 3 bits from an existing fingerprint stored in a sorted set are treated as near-duplicates and merged under the earliest canonical ID.
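The two passes can be sketched as follows. The tracking-parameter list is a small illustrative subset, and MD5 serves purely as a cheap per-token hash, not a security primitive:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def canonical_url(url: str) -> str:
    """Lowercase the host, drop tracking params, fragment, trailing slash."""
    parts = urlsplit(url)
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in TRACKING])
    return urlunsplit((parts.scheme, parts.netloc.lower(),
                       parts.path.rstrip("/"), query, ""))

def simhash64(text: str) -> int:
    """64-bit SimHash over whitespace tokens (a simplified sketch)."""
    v = [0] * 64
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(64):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if v[i] > 0)

def is_near_duplicate(a: int, b: int, max_bits: int = 3) -> bool:
    """Hamming distance of fewer than max_bits bits => near-duplicate."""
    return bin(a ^ b).count("1") < max_bits
```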
Relevance Scoring
Each article receives a composite score: score = w1 * freshness + w2 * authority + w3 * engagement + w4 * personalization. Freshness decays exponentially with a half-life tuned per category (breaking news: 2 hours; long reads: 48 hours). Authority derives from the source reliability score, updated weekly via a PageRank-style computation over cross-source citation graphs. Engagement uses early click-through rate and share velocity observed in the article's first 30 minutes. Personalization is the dot product of the article embedding and the user interest vector maintained in the vector store.
API Design
The Feed API exposes a single primary endpoint: GET /v1/feed?user_id={id}&cursor={cursor}&limit={n}. Responses are cursor-paginated to support infinite scroll without duplicates. Each item in the response includes article metadata, the canonical URL, thumbnail URL, source name, and a rendered score bucket (top/trending/standard) for client-side display logic.
A Subscription API at POST /v1/subscriptions accepts source URLs or topic keywords and enqueues the source for immediate first-poll. Webhooks and WebSocket channels notify subscribed clients when new articles matching their preferences are ingested, enabling sub-minute delivery without polling the feed endpoint.
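Cursor pagination can be keyset-based: the cursor encodes the last item's (score, id), and the next page contains only items that sort strictly after it, so newly ingested articles never cause repeats mid-scroll. A sketch against an in-memory ranked list (the encoding and helper names are assumptions):

```python
import base64
import json

def encode_cursor(score: float, article_id: str) -> str:
    """Opaque cursor carrying the last item's (score, id)."""
    raw = json.dumps([score, article_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor: str):
    score, article_id = json.loads(base64.urlsafe_b64decode(cursor))
    return score, article_id

def feed_page(ranked, cursor=None, limit=20):
    """ranked: list of (score, article_id), sorted by score descending.
    Keyset filter: keep only items that sort strictly after the cursor."""
    if cursor:
        last_score, last_id = decode_cursor(cursor)
        ranked = [(s, a) for s, a in ranked
                  if (-s, a) > (-last_score, last_id)]
    page = ranked[:limit]
    next_cursor = encode_cursor(*page[-1]) if page else None
    return page, next_cursor
```

In production the filter would translate to a range predicate on the feed store rather than an in-memory scan.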
Scalability and Infrastructure
The polling layer runs as a fleet of stateless workers reading from the priority queue in Redis. Workers are horizontally scaled based on queue depth. The normalization and scoring pipeline runs as a series of Kafka consumer groups, each processing a specific stage (fetch, parse, deduplicate, score, index) independently, allowing per-stage autoscaling. The feed read path is served from a pre-computed ranked list cached in Redis per user segment, refreshed every 30 seconds, with a fallback to real-time assembly for cold users. A CDN caches public trending feeds at the edge, reducing origin load by 60% during traffic spikes.
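The per-segment cache on the read path can be sketched as a TTL map with a cold-user fallback to real-time assembly (class and parameter names here are illustrative):

```python
import time

class SegmentFeedCache:
    """Per-segment ranked lists refreshed every 30 s; a segment with no
    warm entry falls back to real-time assembly."""
    TTL_S = 30   # illustrative refresh window

    def __init__(self, assemble_fn, clock=time.time):
        self._assemble = assemble_fn   # segment_id -> ranked article ids
        self._cache = {}               # segment_id -> (expires_at, feed)
        self._clock = clock

    def get(self, segment_id):
        now = self._clock()
        entry = self._cache.get(segment_id)
        if entry and entry[0] > now:
            return entry[1]                      # warm hit
        feed = self._assemble(segment_id)        # cold or expired: rebuild
        self._cache[segment_id] = (now + self.TTL_S, feed)
        return feed
```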
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How do you determine source polling cadence in a news feed aggregator?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Polling cadence is typically adaptive: high-frequency sources (breaking news outlets) get polled every 1-5 minutes, while low-activity sources drop to 30-60 minute intervals. You track each source's historical update rate and use exponential back-off on consecutive empty polls to reduce wasted requests. A priority queue ordered by next-scheduled-poll time drives the scheduler."
      }
    },
    {
      "@type": "Question",
      "name": "How does SimHash or URL normalization help with deduplication in a news aggregator?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "URL normalization strips query parameters, UTM tokens, and trailing slashes so the same article from different referral links maps to one canonical key. SimHash generates a 64-bit fingerprint from the article's token shingles; two articles within a Hamming distance of 3 are treated as near-duplicates. Storing SimHash values in a distributed hash table lets you check duplicates in O(1) without storing full content."
      }
    },
    {
      "@type": "Question",
      "name": "How is relevance scoring implemented in a news feed aggregator?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Relevance scoring combines several signals: TF-IDF or BM25 text similarity against a user's interest profile, recency decay (e.g., exponential half-life of 6 hours), source authority score derived from domain PageRank, and engagement signals like click-through rate. These features are fed into a lightweight gradient-boosted model or a dot-product scoring layer trained on implicit feedback."
      }
    },
    {
      "@type": "Question",
      "name": "How do you architect personalized delivery in a news feed aggregator at scale?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Personalized delivery is split into offline and online stages. Offline, a batch job builds per-user interest vectors and candidate sets. Online, a low-latency ranking service re-scores the candidate set using fresh signals (time of day, recent clicks) and returns a ranked list. A fan-out-on-read model is typical for smaller userbases; fan-out-on-write (pre-computed feeds in Redis) suits high-volume feeds where read latency matters more than write amplification."
      }
    }
  ]
}
See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering
See also: Twitter/X Interview Guide 2026: Timeline Algorithms, Real-Time Search, and Content at Scale