What Is an Ad Click Aggregation System?
Ad platforms (Google Ads, Meta Ads, Amazon Advertising) must aggregate billions of ad click events per day and provide real-time reporting to advertisers: “how many users clicked my ad in the last hour?” The challenge is combining high write throughput (billions of raw click events/day) with low-latency read queries on aggregated data. This is a classic time-series aggregation problem with strict accuracy requirements (billing depends on it).
System Requirements
Functional
- Record ad click events: ad_id, user_id, timestamp, IP, device type
- Aggregate clicks: total clicks per ad per time window (last minute, last hour, last day)
- Filter aggregations: by country, device type, age group
- Real-time reporting: advertisers see click counts within 1 minute of the click
- Historical reporting: exact click counts for billing over any date range
Non-Functional
- Scale: 10B clicks/day ≈ 115K clicks/sec on average; provision for peaks several times higher
- Accuracy: billing reports must be exact (no approximation)
- Real-time latency: dashboard updates within 60 seconds
- Deduplication: invalid/duplicate clicks must not be counted
High-Level Architecture: Lambda Architecture
Two parallel pipelines serve different accuracy/latency trade-offs:
- Speed layer: real-time streaming (Kafka + Flink) → approximate counts, low latency
- Batch layer: periodic batch jobs (Spark) → exact counts, higher latency
- Serving layer: query real-time counts for recent data, batch counts for historical data
Click Event Ingestion
When a user clicks an ad, the ad server logs the event and publishes it to Kafka. Events are partitioned by ad_id (so all events for the same ad go to the same partition — enables per-ad aggregation without shuffling). Each event:
{
click_id: UUID (for deduplication),
ad_id, user_id, campaign_id,
timestamp, country, device_type
}
Kafka retains raw events for 7 days — the batch layer can reprocess from any point.
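The key-based partitioning can be sketched in a few lines. Real Kafka producers hash the message key with murmur2; the stable-hash-mod-partition-count scheme below is a simplification that shows the co-location property, and the partition count of 32 is an assumed value:

```python
import hashlib

NUM_PARTITIONS = 32  # assumed topic partition count

def partition_for(ad_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map an ad_id to a partition. Kafka's default partitioner uses
    murmur2 on the key; any stable hash demonstrates the same guarantee."""
    digest = hashlib.md5(ad_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for the same ad land on the same partition,
# so per-ad state in the consumer never requires a shuffle.
assert partition_for("ad_123") == partition_for("ad_123")
```

Because the mapping is deterministic, a Flink task reading one partition sees every event for the ads hashed to it, which is what makes purely local per-ad state possible.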
Deduplication
Users double-click, bots replay requests, network retries cause duplicates. Dedup strategy:
- Redis Bloom filter: check click_id before processing. If probably seen, discard; otherwise process and add it to the filter. Fast, O(1), with a small (~1%) false-positive rate, meaning a few valid clicks are incorrectly discarded — acceptable for real-time dashboards. Bloom filters have no false negatives, so every true duplicate is caught.
- Batch dedup: the batch pipeline uses exact dedup (GROUP BY click_id, keep first) before aggregating. This is authoritative for billing.
- Never use real-time counts for billing — always use batch-computed exact counts.
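A toy in-memory Bloom filter makes the real-time dedup concrete. In production this would be Redis with the RedisBloom module (BF.ADD / BF.EXISTS); the bit-array size and hash count here are illustrative, not tuned:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for click_id dedup (illustrative sizes)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive k positions from k salted hashes of the item.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def probably_seen(self, item: str) -> bool:
        # All bits set -> probably seen (false positives possible);
        # any bit clear -> definitely never seen (no false negatives).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
assert not bf.probably_seen("click-uuid-1")  # first occurrence: process it
bf.add("click-uuid-1")
assert bf.probably_seen("click-uuid-1")      # replay: discard
```

The `probably_seen` asymmetry is exactly why this layer is fine for dashboards but not for billing: duplicates are always caught, but a small fraction of legitimate clicks can be wrongly dropped.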
Real-Time Aggregation (Speed Layer)
A Flink job consumes from Kafka and maintains sliding window aggregates:
- Tumbling 1-minute windows: emit click count per (ad_id, country, device) every minute
- Sliding 1-hour window: rolling last-60-minutes count per ad, updated every minute
- Results written to: Redis (for dashboard reads) and a time-series database (InfluxDB / Cassandra) for trend charts
Flink state is backed by RocksDB (for large state — millions of active ads). Checkpointing every 30 seconds means a failed job restores the last checkpoint and replays at most ~30 seconds of Kafka events, so no counts are lost.
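The tumbling-window aggregation can be modeled without Flink as a pure function over a batch of events. This is a stand-in for the job's windowed keyed state (timestamps assumed to be epoch seconds), not the Flink API itself:

```python
from collections import defaultdict

def tumbling_1min_counts(events):
    """Count clicks per (ad_id, country, device_type) in 1-minute
    tumbling windows; the window key is the window's start timestamp."""
    counts = defaultdict(int)
    for e in events:
        window_start = e["timestamp"] - e["timestamp"] % 60
        key = (e["ad_id"], e["country"], e["device_type"], window_start)
        counts[key] += 1
    return dict(counts)

events = [
    {"ad_id": "ad_1", "country": "US", "device_type": "mobile", "timestamp": 100},
    {"ad_id": "ad_1", "country": "US", "device_type": "mobile", "timestamp": 110},
    {"ad_id": "ad_1", "country": "US", "device_type": "mobile", "timestamp": 130},
]
result = tumbling_1min_counts(events)
assert result[("ad_1", "US", "mobile", 60)] == 2   # window [60, 120)
assert result[("ad_1", "US", "mobile", 120)] == 1  # window [120, 180)
```

The 1-hour sliding window is then just a sum over the last 60 of these 1-minute buckets, which is why emitting minute-granularity aggregates is enough for both dashboards.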
Batch Aggregation (Batch Layer)
Hourly Spark jobs read raw Kafka events (or S3 archive), deduplicate by click_id, and compute exact aggregates:
SELECT ad_id, country, device_type,
       date_trunc('hour', timestamp) AS hour,
       COUNT(DISTINCT click_id) AS clicks
FROM raw_clicks
GROUP BY 1, 2, 3, 4
Results written to a data warehouse (BigQuery, Redshift, Snowflake) for billing reports and historical analytics. The batch job for hour H runs at H+1:05 (5-minute grace period for late-arriving events).
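The same dedup-then-aggregate logic, written out imperatively, makes the "exact" guarantee visible. This is a single-machine sketch of what the Spark job does in parallel across the raw event set:

```python
from collections import defaultdict

def exact_hourly_counts(raw_clicks):
    """Exact dedup by click_id (keep first occurrence), then count per
    (ad_id, country, device_type, hour) — mirrors the batch SQL above."""
    seen = set()
    counts = defaultdict(int)
    for c in raw_clicks:
        if c["click_id"] in seen:
            continue  # drop replayed / duplicated events exactly
        seen.add(c["click_id"])
        hour = c["timestamp"] - c["timestamp"] % 3600
        counts[(c["ad_id"], c["country"], c["device_type"], hour)] += 1
    return dict(counts)

raw = [
    {"click_id": "a", "ad_id": "ad_1", "country": "US", "device_type": "mobile", "timestamp": 10},
    {"click_id": "a", "ad_id": "ad_1", "country": "US", "device_type": "mobile", "timestamp": 10},  # duplicate
    {"click_id": "b", "ad_id": "ad_1", "country": "US", "device_type": "mobile", "timestamp": 20},
]
assert exact_hourly_counts(raw)[("ad_1", "US", "mobile", 0)] == 2
```

Unlike the Bloom filter path, the `seen` set here is exact — no false positives — which is the property billing requires.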
Query Serving Layer
Advertisers query: “clicks for ad_123 in the last 3 hours, by country.”
- Last 0–60 minutes: read from Redis real-time aggregates (fast, ~1 minute latency)
- Last 1–24 hours: read pre-aggregated hourly rows from Cassandra time-series store
- Older than 24 hours: read from data warehouse (BigQuery/Redshift)
The API layer routes queries to the appropriate store based on time range.
Click Fraud Detection
- IP rate limiting: >10 clicks/minute from same IP → flag as bot, discard
- User agent analysis: headless browsers, known bot user agents
- Click pattern ML: abnormal click velocity, impossible geographic movement between clicks
- Fraudulent clicks removed from batch counts before billing
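The IP rate-limiting rule can be sketched as a sliding-window counter per IP. In production this state would live in Redis rather than process memory; the class below just illustrates the ">10 clicks/minute" check:

```python
from collections import defaultdict, deque

class IpRateLimiter:
    """Flag an IP as a likely bot once it exceeds `limit` clicks
    within `window_secs` (sliding window)."""

    def __init__(self, limit: int = 10, window_secs: int = 60):
        self.limit = limit
        self.window_secs = window_secs
        self.clicks = defaultdict(deque)  # ip -> timestamps inside window

    def is_bot(self, ip: str, ts: float) -> bool:
        q = self.clicks[ip]
        q.append(ts)
        while q and q[0] <= ts - self.window_secs:
            q.popleft()  # evict clicks older than the window
        return len(q) > self.limit

rl = IpRateLimiter()
assert not any(rl.is_bot("1.2.3.4", t) for t in range(10))  # 10 clicks: ok
assert rl.is_bot("1.2.3.4", 10)  # 11th click inside 60s: flagged
```

Flagged clicks are discarded from the stream; the batch layer applies the same filters (plus the ML-based ones) before computing billing counts.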
Data Retention
- Raw clicks: S3 / cold storage for 7 years (regulatory requirement)
- Aggregated hourly data: data warehouse indefinitely
- Real-time Redis counts: 24-hour TTL (served by batch data after that)
Interview Tips
- Lambda architecture (speed + batch layers) is the key concept — explain why both are needed: speed layer for real-time, batch for accuracy.
- Emphasize that billing must use batch-computed exact counts — never the approximate real-time stream.
- Deduplication is a differentiator — Bloom filter for real-time, exact GROUP BY click_id for batch.
- Kafka partition by ad_id is crucial — it co-locates all events for an ad and enables stateful per-ad aggregation in Flink without cross-partition shuffles.
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is the Lambda architecture and why is it used for ad click aggregation?",
      "acceptedAnswer": { "@type": "Answer", "text": "Lambda architecture uses two parallel data pipelines: a speed layer (real-time streaming with Flink/Spark Streaming) and a batch layer (periodic batch jobs with Spark/MapReduce). The speed layer provides low-latency approximate results; the batch layer provides high-latency exact results. For ad click aggregation: advertisers want to see their click counts update within 1 minute (speed layer serves this), but billing must be exact (batch layer reprocesses raw events with full deduplication). The serving layer routes queries to the appropriate pipeline based on time range: real-time counts from Redis for the last hour, exact batch counts from the data warehouse for historical reporting. The trade-off: operational complexity of maintaining two pipelines. Kappa architecture (Kafka-only, replayable) simplifies this by treating batch processing as a special case of streaming." }
    },
    {
      "@type": "Question",
      "name": "How do you deduplicate ad clicks at scale?",
      "acceptedAnswer": { "@type": "Answer", "text": "Click deduplication requires different strategies for real-time vs. batch. For real-time: use a Redis Bloom filter keyed by click_id (UUID). Before processing a click, check the Bloom filter: if probably seen, discard. If not seen, process and add to the filter. False positive rate ~1% (some valid clicks discarded) is acceptable for real-time dashboards. Bloom filter memory: 10 billion clicks/day * 10 bits/element = 12.5 GB — fits in a large Redis deployment. TTL: 24 hours. For batch (billing): use exact deduplication via GROUP BY click_id, take the first occurrence. Spark handles this natively. Read all raw events from Kafka or S3, deduplicate, then aggregate. Exact, no false positives, used for final billing numbers. Never use real-time approximate counts for billing." }
    },
    {
      "@type": "Question",
      "name": "How do you partition Kafka topics for an ad click aggregation system?",
      "acceptedAnswer": { "@type": "Answer", "text": "Partition the click events topic by ad_id (using ad_id as the Kafka message key). This guarantees all events for the same ad go to the same partition, which means: (1) the Flink streaming job can maintain per-ad running counts in local state without cross-partition communication, (2) within-ad event ordering is preserved, and (3) the workload is distributed across Flink task managers proportionally to ad traffic. Potential hot partition issue: a very popular ad (viral campaign) generates orders-of-magnitude more clicks than others, causing one partition to lag. Mitigation: add a random suffix to the key for very high-volume ads (ad_id + random(0-9)), process each sub-partition independently, then sum results at query time. Kafka's partition count should be set to at least the number of Flink parallel instances (typical: 32–256 partitions per topic)." }
    }
  ]
}