What Is an Ad Click Aggregation System?
Ad platforms (Google Ads, Meta Ads, Amazon Advertising) must aggregate billions of ad click events per day and provide real-time reporting to advertisers: “how many users clicked my ad in the last hour?” The challenge is combining high write throughput (billions of raw click events/day) with low-latency read queries on aggregated data. This is a classic time-series aggregation problem with strict accuracy requirements (billing depends on it).
System Requirements
Functional
- Record ad click events: ad_id, user_id, timestamp, IP, device type
- Aggregate clicks: total clicks per ad per time window (last minute, last hour, last day)
- Filter aggregations: by country, device type, age group
- Real-time reporting: advertisers see click counts within 1 minute of the click
- Historical reporting: exact click counts for billing over any date range
Non-Functional
- Scale: 10B clicks/day ≈ 115K clicks/sec on average; provision for peaks several times higher
- Accuracy: billing reports must be exact (no approximation)
- Real-time latency: dashboard updates within 60 seconds
- Deduplication: invalid/duplicate clicks must not be counted
High-Level Architecture: Lambda Architecture
Two parallel pipelines serve different accuracy/latency trade-offs:
- Speed layer: real-time streaming (Kafka + Flink) → approximate counts, low latency
- Batch layer: periodic batch jobs (Spark) → exact counts, higher latency
- Serving layer: query real-time counts for recent data, batch counts for historical data
Click Event Ingestion
When a user clicks an ad, the ad server logs the event and publishes it to Kafka. Events are partitioned by ad_id (so all events for the same ad go to the same partition — enables per-ad aggregation without shuffling). Each event:
{
click_id: UUID (for deduplication),
ad_id, user_id, campaign_id,
timestamp, country, device_type
}
Kafka retains raw events for 7 days — the batch layer can reprocess from any point.
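The key-based partitioning can be sketched in a few lines. Real Kafka producers hash the message key with murmur2; the stable-hash-mod-partition-count scheme below is a simplification that shows the co-location property, and the partition count of 32 is an assumed value:

```python
import hashlib

NUM_PARTITIONS = 32  # assumed topic partition count

def partition_for(ad_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map an ad_id to a partition. Kafka's default partitioner uses
    murmur2 on the key; any stable hash demonstrates the same guarantee."""
    digest = hashlib.md5(ad_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for the same ad land on the same partition,
# so per-ad state in the consumer never requires a shuffle.
assert partition_for("ad_123") == partition_for("ad_123")
```

Because the mapping is deterministic, a Flink task reading one partition sees every event for the ads hashed to it, which is what makes purely local per-ad state possible.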
Deduplication
Users double-click, bots replay requests, network retries cause duplicates. Dedup strategy:
- Redis Bloom filter: check click_id before processing. If probably seen, discard; otherwise process and add it to the filter. Fast, O(1), with a small (~1%) false-positive rate, meaning a few valid clicks are incorrectly discarded — acceptable for real-time dashboards. Bloom filters have no false negatives, so every true duplicate is caught.
- Batch dedup: the batch pipeline uses exact dedup (GROUP BY click_id, keep first) before aggregating. This is authoritative for billing.
- Never use real-time counts for billing — always use batch-computed exact counts.
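A toy in-memory Bloom filter makes the real-time dedup concrete. In production this would be Redis with the RedisBloom module (BF.ADD / BF.EXISTS); the bit-array size and hash count here are illustrative, not tuned:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for click_id dedup (illustrative sizes)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive k positions from k salted hashes of the item.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def probably_seen(self, item: str) -> bool:
        # All bits set -> probably seen (false positives possible);
        # any bit clear -> definitely never seen (no false negatives).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
assert not bf.probably_seen("click-uuid-1")  # first occurrence: process it
bf.add("click-uuid-1")
assert bf.probably_seen("click-uuid-1")      # replay: discard
```

The `probably_seen` asymmetry is exactly why this layer is fine for dashboards but not for billing: duplicates are always caught, but a small fraction of legitimate clicks can be wrongly dropped.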
Real-Time Aggregation (Speed Layer)
A Flink job consumes from Kafka and maintains sliding window aggregates:
- Tumbling 1-minute windows: emit click count per (ad_id, country, device) every minute
- Sliding 1-hour window: rolling last-60-minutes count per ad, updated every minute
- Results written to: Redis (for dashboard reads) and a time-series database (InfluxDB / Cassandra) for trend charts
Flink state is backed by RocksDB (for large state — millions of active ads). Checkpointing every 30 seconds means a failed job restores the last checkpoint and replays at most ~30 seconds of Kafka events, so no counts are lost.
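The tumbling-window aggregation can be modeled without Flink as a pure function over a batch of events. This is a stand-in for the job's windowed keyed state (timestamps assumed to be epoch seconds), not the Flink API itself:

```python
from collections import defaultdict

def tumbling_1min_counts(events):
    """Count clicks per (ad_id, country, device_type) in 1-minute
    tumbling windows; the window key is the window's start timestamp."""
    counts = defaultdict(int)
    for e in events:
        window_start = e["timestamp"] - e["timestamp"] % 60
        key = (e["ad_id"], e["country"], e["device_type"], window_start)
        counts[key] += 1
    return dict(counts)

events = [
    {"ad_id": "ad_1", "country": "US", "device_type": "mobile", "timestamp": 100},
    {"ad_id": "ad_1", "country": "US", "device_type": "mobile", "timestamp": 110},
    {"ad_id": "ad_1", "country": "US", "device_type": "mobile", "timestamp": 130},
]
result = tumbling_1min_counts(events)
assert result[("ad_1", "US", "mobile", 60)] == 2   # window [60, 120)
assert result[("ad_1", "US", "mobile", 120)] == 1  # window [120, 180)
```

The 1-hour sliding window is then just a sum over the last 60 of these 1-minute buckets, which is why emitting minute-granularity aggregates is enough for both dashboards.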
Batch Aggregation (Batch Layer)
Hourly Spark jobs read raw Kafka events (or S3 archive), deduplicate by click_id, and compute exact aggregates:
SELECT ad_id, country, device_type,
       date_trunc('hour', timestamp) AS hour,
       COUNT(DISTINCT click_id) AS clicks
FROM raw_clicks
GROUP BY 1, 2, 3, 4
Results written to a data warehouse (BigQuery, Redshift, Snowflake) for billing reports and historical analytics. The batch job for hour H runs at H+1:05 (5-minute grace period for late-arriving events).
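The same dedup-then-aggregate logic, written out imperatively, makes the "exact" guarantee visible. This is a single-machine sketch of what the Spark job does in parallel across the raw event set:

```python
from collections import defaultdict

def exact_hourly_counts(raw_clicks):
    """Exact dedup by click_id (keep first occurrence), then count per
    (ad_id, country, device_type, hour) — mirrors the batch SQL above."""
    seen = set()
    counts = defaultdict(int)
    for c in raw_clicks:
        if c["click_id"] in seen:
            continue  # drop replayed / duplicated events exactly
        seen.add(c["click_id"])
        hour = c["timestamp"] - c["timestamp"] % 3600
        counts[(c["ad_id"], c["country"], c["device_type"], hour)] += 1
    return dict(counts)

raw = [
    {"click_id": "a", "ad_id": "ad_1", "country": "US", "device_type": "mobile", "timestamp": 10},
    {"click_id": "a", "ad_id": "ad_1", "country": "US", "device_type": "mobile", "timestamp": 10},  # duplicate
    {"click_id": "b", "ad_id": "ad_1", "country": "US", "device_type": "mobile", "timestamp": 20},
]
assert exact_hourly_counts(raw)[("ad_1", "US", "mobile", 0)] == 2
```

Unlike the Bloom filter path, the `seen` set here is exact — no false positives — which is the property billing requires.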
Query Serving Layer
Advertisers query: “clicks for ad_123 in the last 3 hours, by country.”
- Last 0–60 minutes: read from Redis real-time aggregates (fast, ~1 minute latency)
- Last 1–24 hours: read pre-aggregated hourly rows from Cassandra time-series store
- Older than 24 hours: read from data warehouse (BigQuery/Redshift)
The API layer routes queries to the appropriate store based on time range.
Click Fraud Detection
- IP rate limiting: >10 clicks/minute from same IP → flag as bot, discard
- User agent analysis: headless browsers, known bot user agents
- Click pattern ML: abnormal click velocity, impossible geographic movement between clicks
- Fraudulent clicks removed from batch counts before billing
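The IP rate-limiting rule can be sketched as a sliding-window counter per IP. In production this state would live in Redis rather than process memory; the class below just illustrates the ">10 clicks/minute" check:

```python
from collections import defaultdict, deque

class IpRateLimiter:
    """Flag an IP as a likely bot once it exceeds `limit` clicks
    within `window_secs` (sliding window)."""

    def __init__(self, limit: int = 10, window_secs: int = 60):
        self.limit = limit
        self.window_secs = window_secs
        self.clicks = defaultdict(deque)  # ip -> timestamps inside window

    def is_bot(self, ip: str, ts: float) -> bool:
        q = self.clicks[ip]
        q.append(ts)
        while q and q[0] <= ts - self.window_secs:
            q.popleft()  # evict clicks older than the window
        return len(q) > self.limit

rl = IpRateLimiter()
assert not any(rl.is_bot("1.2.3.4", t) for t in range(10))  # 10 clicks: ok
assert rl.is_bot("1.2.3.4", 10)  # 11th click inside 60s: flagged
```

Flagged clicks are discarded from the stream; the batch layer applies the same filters (plus the ML-based ones) before computing billing counts.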
Data Retention
- Raw clicks: S3 / cold storage for 7 years (regulatory requirement)
- Aggregated hourly data: data warehouse indefinitely
- Real-time Redis counts: 24-hour TTL (served by batch data after that)
Interview Tips
- Lambda architecture (speed + batch layers) is the key concept — explain why both are needed: speed layer for real-time, batch for accuracy.
- Emphasize that billing must use batch-computed exact counts — never the approximate real-time stream.
- Deduplication is a differentiator — Bloom filter for real-time, exact GROUP BY click_id for batch.
- Kafka partition by ad_id is crucial — it co-locates all events for an ad and enables stateful per-ad aggregation in Flink without cross-partition shuffles.
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is the Lambda architecture and why is it used for ad click aggregation?",
      "acceptedAnswer": { "@type": "Answer", "text": "Lambda architecture uses two parallel data pipelines: a speed layer (real-time streaming with Flink/Spark Streaming) and a batch layer (periodic batch jobs with Spark/MapReduce). The speed layer provides low-latency approximate results; the batch layer provides high-latency exact results. For ad click aggregation: advertisers want to see their click counts update within 1 minute (speed layer serves this), but billing must be exact (batch layer reprocesses raw events with full deduplication). The serving layer routes queries to the appropriate pipeline based on time range: real-time counts from Redis for the last hour, exact batch counts from the data warehouse for historical reporting. The trade-off: operational complexity of maintaining two pipelines. Kappa architecture (Kafka-only, replayable) simplifies this by treating batch processing as a special case of streaming." }
    },
    {
      "@type": "Question",
      "name": "How do you deduplicate ad clicks at scale?",
      "acceptedAnswer": { "@type": "Answer", "text": "Click deduplication requires different strategies for real-time vs. batch. For real-time: use a Redis Bloom filter keyed by click_id (UUID). Before processing a click, check the Bloom filter: if probably seen, discard. If not seen, process and add to the filter. False positive rate ~1% (some valid clicks discarded) is acceptable for real-time dashboards. Bloom filter memory: 10 billion clicks/day * 10 bits/element = 12.5 GB — fits in a large Redis deployment. TTL: 24 hours. For batch (billing): use exact deduplication via GROUP BY click_id, take the first occurrence. Spark handles this natively. Read all raw events from Kafka or S3, deduplicate, then aggregate. Exact, no false positives, used for final billing numbers. Never use real-time approximate counts for billing." }
    },
    {
      "@type": "Question",
      "name": "How do you partition Kafka topics for an ad click aggregation system?",
      "acceptedAnswer": { "@type": "Answer", "text": "Partition the click events topic by ad_id (using ad_id as the Kafka message key). This guarantees all events for the same ad go to the same partition, which means: (1) the Flink streaming job can maintain per-ad running counts in local state without cross-partition communication, (2) within-ad event ordering is preserved, and (3) the workload is distributed across Flink task managers proportionally to ad traffic. Potential hot partition issue: a very popular ad (viral campaign) generates orders-of-magnitude more clicks than others, causing one partition to lag. Mitigation: add a random suffix to the key for very high-volume ads (ad_id + random(0-9)), process each sub-partition independently, then sum results at query time. Kafka's partition count should be set to at least the number of Flink parallel instances (typical: 32–256 partitions per topic)." }
    }
  ]
}