System Design: Design Twitter Trending Topics — Real-Time Analytics, Stream Processing, Count-Min Sketch, Sliding Window

Twitter Trending Topics surfaces the most discussed subjects in real-time from billions of tweets. Designing a trending detection system tests your understanding of stream processing, approximate counting, time-windowed analytics, and ranking algorithms. This is a focused system design question that goes deep on real-time data processing — a valuable complement to the broader Twitter feed design.

What Makes Something “Trending”

Trending is not just “most mentioned.” A topic with 1 million mentions per day that always has 1 million mentions is not trending; it is popular. Trending means a sudden increase in mentions relative to the baseline. The trending score measures velocity of growth, not absolute volume. Formula: trending_score = current_mentions / expected_mentions. If “earthquake” normally gets 100 mentions per hour but suddenly gets 10,000, the trending score is 100x: highly trending. If “weather” gets 50,000 mentions per hour as usual, the trending score is ~1x: not trending despite high volume. This requires: (1) A real-time count of mentions per topic in the current window (last 1 hour). (2) A baseline count (average mentions per hour over the last 7 days). (3) The ratio determines trending status. Candidates are topics whose trending_score exceeds a threshold (e.g., 5x) and whose current volume exceeds a minimum (e.g., 1,000 mentions). The volume floor prevents low-volume noise from triggering (“my cat” mentioned 10x more than usual, but only 5 total mentions).
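The thresholding logic above can be sketched in a few lines of Python. The 5x score threshold and 1,000-mention volume floor are the example values from the text, not fixed constants:

```python
def trending_score(current_mentions: int, expected_mentions: float) -> float:
    """Velocity of growth: how many times above baseline a topic currently is."""
    # Floor the baseline at 1 to avoid division by zero for brand-new topics.
    return current_mentions / max(expected_mentions, 1.0)

def is_trending(current_mentions: int, expected_mentions: float,
                score_threshold: float = 5.0, min_volume: int = 1000) -> bool:
    """A topic is a trending candidate only if it spikes AND has real volume."""
    score = trending_score(current_mentions, expected_mentions)
    return score >= score_threshold and current_mentions >= min_volume
```

With these thresholds, “earthquake” at 10,000 mentions against a baseline of 100 qualifies (score 100x, volume above the floor), while “my cat” at 50 mentions against a baseline of 5 is filtered out by the volume floor despite its 10x score.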

Stream Processing Pipeline

Every tweet is a stream event. The pipeline: (1) Tweet ingestion — tweets are published to Kafka topic tweets. With 500 million tweets per day = ~6,000 tweets per second. (2) Topic extraction — a stream processor (Flink or Spark Streaming) consumes each tweet and extracts topics: hashtags (#earthquake), named entities (detected by NLP: “Taylor Swift,” “Super Bowl”), and n-grams (frequently co-occurring words). Each tweet may produce multiple topics. (3) Counting — for each extracted topic, increment the count in the current time window. Use a sliding window with 1-minute granularity: maintain counts per topic per minute for the last 60 minutes. The hourly count is the sum of the 60 one-minute buckets. When a minute passes, drop the oldest bucket and add a new one. (4) Trending detection — every minute, compute the trending score for each topic: current_hour_count / baseline_hour_count. Rank by trending score. The top N topics with sufficient volume are the trending topics. (5) Results are written to a cache (Redis) and served to the API. The trending list refreshes every 1-5 minutes.
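The minute-bucket sliding window from step (3) can be sketched as a small in-memory structure. In production this state would live inside Flink or Spark Streaming operators and bucket rotation would be driven by event time; here the caller advances the clock explicitly:

```python
from collections import defaultdict, deque

class SlidingWindowCounter:
    """Per-topic mention counts in 1-minute buckets over a rolling window."""

    def __init__(self, window_minutes: int = 60):
        # Each element is one minute's {topic: count} map. A deque with
        # maxlen drops the oldest bucket automatically when a new one arrives.
        self.buckets = deque([defaultdict(int)], maxlen=window_minutes)

    def increment(self, topic: str, n: int = 1) -> None:
        # Writes always go into the current (newest) minute bucket.
        self.buckets[-1][topic] += n

    def advance_minute(self) -> None:
        # Called once per minute: open a fresh bucket, evict the oldest.
        self.buckets.append(defaultdict(int))

    def hourly_count(self, topic: str) -> int:
        # The windowed count is the sum over all live minute buckets.
        return sum(bucket[topic] for bucket in self.buckets)
```

A 60-minute window therefore never recomputes from raw tweets: the hourly count is just the sum of 60 small per-minute maps, and expiry is a single deque append.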

Approximate Counting at Scale

Exact counting of millions of distinct topics is memory-intensive. Approximate counting with a Count-Min Sketch reduces memory by roughly 100x while accepting a small overcount error. Count-Min Sketch for topic counts: a 2D array of counters with d rows and w columns, plus d hash functions (one per row). To increment topic T: hash T with each of the d functions and increment the corresponding counter in each row. To query: hash T, read the counter from each row, and return the minimum (the value least inflated by collisions). Memory: with d=5 rows and w=10,000 columns, 50,000 counters at 4 bytes each is 200 KB. This tracks millions of topics with controllable error. For trending specifically, maintain two Count-Min Sketches: one for the current hour (sliding window) and one for the baseline (7-day average). The trending score is current_sketch.query(topic) / baseline_sketch.query(topic). Space-Saving algorithm (alternative): maintains near-exact counts for the top-K heaviest items; when the structure is full, the minimum-count item is evicted and its counter is inherited by the newcomer, so counts carry a bounded overestimate. More accurate than CMS for the top items specifically, and used by Twitter internally for heavy-hitters detection.
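A minimal Count-Min Sketch in Python, using salted MD5 as a stand-in for the d independent hash functions (a real implementation would use faster non-cryptographic hashes such as MurmurHash):

```python
import hashlib

class CountMinSketch:
    """d rows x w columns of counters. Queries may overcount on hash
    collisions but never undercount."""

    def __init__(self, depth: int = 5, width: int = 10_000):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Salting with the row number yields d distinct hash functions.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def increment(self, item: str, n: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += n

    def query(self, item: str) -> int:
        # The minimum across rows is the estimate least inflated by collisions.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```

For trending, one such sketch holds the current-hour counts and another holds the 7-day baseline; the score is the ratio of the two query results.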

Geographic and Personalized Trends

Twitter shows trends per location (country, city). Implementation: maintain separate counting pipelines per geographic region. When a tweet is processed, determine the author location (from profile or GPS) and increment counts in the appropriate regional counter in addition to the global counter. With ~200 countries and ~500 cities tracked: 700 separate counting instances. Each is a small Count-Min Sketch or Space-Saving structure. Personalized trends: blend global/local trends with topics from accounts the user follows. If the user follows many tech accounts and “Python 4.0” is trending among tech accounts but not globally, show it as a personalized trend. Implementation: pre-compute trends per interest cluster (tech, sports, politics, entertainment). Each user is assigned to 2-3 clusters based on their follows. Their trending list merges global trends with cluster-specific trends. Trend suppression: filter out: (1) Offensive or harmful content (blocklist + ML classification). (2) Spam campaigns (detect coordinated inauthentic behavior — many new accounts tweeting the same hashtag). (3) Advertiser-manipulated trends (paid promoted trends are labeled separately).
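The global-plus-cluster merge for personalized trends can be sketched as below. The blending order (cluster-specific topics first, de-duplicated against global trends) is an illustrative assumption, not Twitter's documented behavior:

```python
def personalized_trends(global_trends: list[str],
                        cluster_trends: dict[str, list[str]],
                        user_clusters: list[str],
                        limit: int = 10) -> list[str]:
    """Merge a user's interest-cluster trends with the global list.

    Assumed blending: cluster-specific candidates come first so niche
    trends (e.g. "Python 4.0" among tech accounts) surface, then global
    trends fill the remaining slots, de-duplicated by topic name.
    """
    candidates: list[str] = []
    for cluster in user_clusters:          # user belongs to 2-3 clusters
        candidates.extend(cluster_trends.get(cluster, []))
    candidates.extend(global_trends)

    seen: set[str] = set()
    merged: list[str] = []
    for topic in candidates:
        if topic not in seen:
            seen.add(topic)
            merged.append(topic)
        if len(merged) == limit:
            break
    return merged
```

Trend suppression (blocklists, spam detection) would run before this merge, so suppressed topics never reach any user's candidate list.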

Ranking and Display

The trending list shows 10-30 topics, ranked by: (1) Trending score (velocity of growth) — the primary signal. (2) Total volume — among equally trending topics, higher volume ranks first. (3) Freshness — topics that started trending more recently rank higher than those that have been trending for hours. (4) Diversity — avoid showing 5 topics about the same event. Cluster related topics (“Super Bowl,” “halftime show,” “Usher”) and show one representative with “related” links. (5) Geographic relevance — boost topics relevant to the user location. Display format: topic name, tweet count (“125K tweets”), category label (Sports, Entertainment, Technology), and a representative tweet or description. Trend lifecycle: a topic typically trends for 2-6 hours. After the initial spike subsides (current mentions drop back toward baseline), the trending score decreases and the topic falls off the list. Long-running events (elections, sports tournaments) may trend repeatedly with each development, appearing as distinct trend entries. Caching: the trending list per region is cached in Redis with 1-minute TTL. All users in the same region see the same trending list (personalized trends add a thin personalization layer on top).
