Design an ad click aggregation system — the infrastructure that counts how many times each ad was clicked, detects fraud, and powers billing and reporting for advertisers. This is a classic stream processing problem with a twist: correctness matters directly in dollars, and you need to handle both real-time and batch processing simultaneously.
Requirements Clarification
- Operations: Ingest click events (ad_id, user_id, timestamp, publisher_id, IP), aggregate click counts by ad_id per time window, detect fraudulent clicks, generate daily billing reports for advertisers.
- Scale: 100 billion ad impressions/day, 1 billion clicks/day (1% CTR). ~11,500 clicks/sec average, ~50,000/sec at peak.
- Latency: Fraud detection needs near-real-time (under 5 minutes to flag suspicious patterns). Billing reports: generated nightly, delivered by 6am.
- Accuracy: Billing must be accurate to ±0.1%. Fraud detection accepts higher false positive rate to catch more fraud.
- Deduplication: A user refreshing a page may send the same click twice. Count it once.
Back-of-Envelope
- 11,500 clicks/sec × 500 bytes/event = 5.75 MB/sec ingestion
- 1B clicks/day × 500 bytes = 500GB raw event data per day
- Aggregated results: 10M ad_ids × 365 days × 100 bytes per count = 365GB/year — tiny
- Dedup window: keep click_id in Redis for 24h → 1B events × 16 bytes UUID = 16GB — fits in a Redis cluster
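These estimates are easy to sanity-check mechanically; the figures below are the assumptions from this section, not measurements:

```python
# Back-of-envelope sanity check using the section's own assumptions.
CLICKS_PER_DAY = 1_000_000_000
EVENT_BYTES = 500
UUID_BYTES = 16

clicks_per_sec = CLICKS_PER_DAY / 86_400                 # ~11,574/sec average
ingest_mb_per_sec = clicks_per_sec * EVENT_BYTES / 1e6   # ingestion bandwidth
raw_gb_per_day = CLICKS_PER_DAY * EVENT_BYTES / 1e9      # raw event archive
dedup_gb = CLICKS_PER_DAY * UUID_BYTES / 1e9             # dedup key payload

print(f"{clicks_per_sec:,.0f} clicks/sec, {ingest_mb_per_sec:.2f} MB/sec")
print(f"{raw_gb_per_day:.0f} GB/day raw, {dedup_gb:.0f} GB of dedup keys")
```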
Lambda Architecture: Batch + Speed Layer
Ad aggregation is a textbook Lambda architecture problem. You need two data paths:
Click Events
     │
     ├─ [Speed Layer] → Kafka → Stream Processor (Flink) → Redis/Cassandra
     │                  (real-time counts, fraud signals, 5-min windows)
     │
     └─ [Batch Layer] → Kafka → S3 (raw event archive)
                              → Spark batch job (nightly)
                              → Cassandra/Redshift (billing-grade counts)

Query Layer: Cassandra (serving real-time) + Redshift (billing reports)
Why both? The stream processor gives fast approximate counts for dashboards and fraud detection. The batch job runs on complete, deduplicated data from S3 for billing — where accuracy is contractual. The batch results overwrite or supplement the stream results nightly.
The alternative, Kappa architecture, eliminates the batch layer entirely — reprocess the full event log through the stream processor with a longer window when you need historical accuracy. Simpler to operate (one code path), but requires replaying potentially petabytes of events for a historical report. Good choice when stream processing is mature enough to handle the reprocessing load.
Ingestion: Kafka at the Front Door
Every click event hits a Kafka topic (ad-clicks) immediately. Kafka provides:
- Buffer against downstream slowness (stream processor or S3 writer falls behind)
- Replay: if the stream processor crashes, rewind and reprocess from the last checkpoint
- Fan-out: multiple consumers (fraud detector, real-time counter, S3 archiver) independently read the same events
Partition by ad_id: all events for the same ad land on the same partition, in order. This enables per-ad aggregation without cross-partition coordination.
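Kafka's default partitioner hashes the record key and maps it modulo the partition count (the real implementation uses murmur2); a simplified Python sketch of the idea, with an illustrative partition count:

```python
import hashlib

NUM_PARTITIONS = 64  # illustrative; real topics size this for peak throughput

def partition_for(ad_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of the key -> partition index. The same ad_id always
    maps to the same partition, preserving per-ad event order."""
    digest = hashlib.md5(ad_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for one ad land on one partition:
p = partition_for("ad-123")
assert all(partition_for("ad-123") == p for _ in range(100))
```

Because ordering is only guaranteed within a partition, this keying is what lets each Flink task aggregate its ads without cross-partition coordination.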
Stream Processing: Flink Aggregation
// Apache Flink — sliding window click count per ad
DataStream<ClickEvent> clicks = env.addSource(kafkaSource);

clicks
    .keyBy(ClickEvent::getAdId)
    .window(SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1)))
    .aggregate(new ClickCounter())
    .addSink(cassandraSink);
Flink handles late-arriving events with watermarks — events up to 30 seconds late are included in the window. Events later than 30 seconds go to a side output for separate handling.
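A toy Python model of this event-time logic: 1-minute buckets, a watermark trailing the highest observed timestamp by 30 seconds, and a plain list standing in for Flink's side output (class and field names are illustrative):

```python
from collections import defaultdict

ALLOWED_LATENESS = 30  # seconds, matching the watermark policy above

class WindowCounter:
    """Toy event-time counter: 1-minute buckets per ad_id, with a
    watermark trailing the max seen timestamp by ALLOWED_LATENESS.
    Events behind an already-closed window go to a side output."""
    def __init__(self):
        self.counts = defaultdict(int)   # (ad_id, bucket_start) -> count
        self.late = []                   # stand-in for Flink's side output
        self.max_ts = 0

    def process(self, ad_id, ts):
        self.max_ts = max(self.max_ts, ts)
        watermark = self.max_ts - ALLOWED_LATENESS
        bucket = ts - ts % 60
        if bucket + 60 <= watermark:     # window end already passed
            self.late.append((ad_id, ts))
        else:
            self.counts[(ad_id, bucket)] += 1

wc = WindowCounter()
for ad, ts in [("ad-1", 100), ("ad-1", 110), ("ad-1", 200), ("ad-1", 30)]:
    wc.process(ad, ts)
# ts=30 arrives after the watermark passed its window, so it goes to `late`
```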
Aggregate results per (ad_id, time_bucket) are written to Cassandra:
CREATE TABLE click_counts (
    ad_id       UUID,
    time_bucket TIMESTAMP,  -- truncated to 1-minute intervals
    click_count COUNTER,
    PRIMARY KEY (ad_id, time_bucket)
) WITH CLUSTERING ORDER BY (time_bucket DESC);
Cassandra’s COUNTER type handles concurrent increments atomically, so updates from multiple Flink workers don’t lose counts. One caveat: counter increments are not idempotent, so a replayed increment after a failure double-counts, which is why deduplication (next section) sits upstream of the counter write. Partitioning by ad_id keeps each ad’s time buckets on a single node, making per-ad range queries efficient.
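The time_bucket value is just the click timestamp truncated to the minute; a minimal sketch of that truncation in Python:

```python
from datetime import datetime, timezone

def to_time_bucket(ts: datetime) -> datetime:
    """Truncate an event timestamp to its 1-minute bucket,
    matching the time_bucket column above."""
    return ts.replace(second=0, microsecond=0)

ts = datetime(2024, 3, 1, 12, 34, 56, tzinfo=timezone.utc)
bucket = to_time_bucket(ts)  # 2024-03-01 12:34:00+00:00
```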
Deduplication
Two sources of duplicates:
- User-level: User clicks an ad, network glitch causes browser to retry. Two identical events with the same click_id arrive.
- System-level: Flink checkpoint recovery may replay events already processed.
Solution: each click event carries a unique click_id (UUID generated client-side). Before counting, check Redis:
import redis

r = redis.Redis()  # client for the dedup cluster

# Atomic check-and-set with TTL (24h dedup window)
result = r.set(f"click:{click_id}", 1, nx=True, ex=86400)
if result is None:
    # Already seen — discard
    return
# New click — count it
Redis SET NX (set if not exists) is atomic. At ~11,500 clicks/sec with 24h retention, Redis holds ~1 billion keys; the 16-byte UUIDs alone are 16GB, and per-key overhead raises real memory a few-fold, still comfortable for a sharded Redis Cluster.
Fraud Detection
Click fraud costs advertisers ~$35B/year globally. Common patterns:
- Click farms: Many clicks from the same IP or IP range in a short window
- Bot clicks: Clicks with no subsequent page engagement (immediate bounce), or inhuman click speeds
- Publisher fraud: Publishers artificially inflate their own click counts to earn more revenue share
Real-time fraud signals (computed in Flink every 5 minutes):
- Per-IP click velocity: flag if > 50 clicks on the same ad in 5 minutes
- Per-publisher CTR anomaly: flag if CTR exceeds 3× their 30-day average
- Click-impression gap: flag clicks with no corresponding impression event
- User-agent entropy: flag sources whose clicks share a single UA string at high volume (suspiciously low entropy)
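The first rule reduces to a sliding count per (IP, ad_id) pair; a minimal in-memory sketch with the threshold and window from above (a real deployment would keep this state in Flink keyed state, not a Python dict):

```python
from collections import defaultdict, deque

WINDOW = 300     # 5 minutes, in seconds
THRESHOLD = 50   # flag if more clicks than this per (ip, ad_id) in WINDOW

class VelocityRule:
    """Sliding-window click counter per (ip, ad_id); returns True
    when the pair exceeds THRESHOLD clicks inside WINDOW seconds."""
    def __init__(self):
        self.events = defaultdict(deque)  # (ip, ad_id) -> click timestamps

    def is_suspicious(self, ip, ad_id, ts):
        q = self.events[(ip, ad_id)]
        q.append(ts)
        while q and q[0] <= ts - WINDOW:  # evict clicks outside the window
            q.popleft()
        return len(q) > THRESHOLD
```

Feeding 51 clicks from one IP on one ad within the window flips the flag on the 51st click.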
Flagged clicks are quarantined — counted separately, excluded from billing until a manual review or automated ML classifier adjudicates them. Definitive fraud classification runs in a batch job (higher accuracy) and removes fraudulent clicks from final billing counts.
Billing Report Generation
Nightly batch job (Spark on EMR or Databricks):
- Read all raw click events from S3 for the billing period
- Join with impression events to verify click-impression pairs
- Apply deduplication (same click_id = same click)
- Remove fraud-flagged clicks
- Aggregate: total valid clicks per (ad_id, advertiser_id, date)
- Apply CPC (cost-per-click) rates from the campaign configuration
- Write billing summary to data warehouse (Redshift/BigQuery)
- Generate PDF/CSV invoices via billing service
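In production this runs as a Spark job over S3, but the core transformation (dedup, fraud filter, aggregate, price) can be sketched in plain Python; field names and the CPC table are illustrative:

```python
from collections import defaultdict

def billing_rollup(events, fraud_ids, cpc_by_ad):
    """events: dicts with click_id, ad_id, advertiser_id, date.
    Dedup by click_id, drop fraud-flagged clicks, then aggregate
    valid clicks and apply per-ad CPC rates."""
    seen = set()
    counts = defaultdict(int)  # (ad_id, advertiser_id, date) -> clicks
    for e in events:
        if e["click_id"] in seen or e["click_id"] in fraud_ids:
            continue
        seen.add(e["click_id"])
        counts[(e["ad_id"], e["advertiser_id"], e["date"])] += 1
    # (valid clicks, amount owed) per billing key
    return {k: (n, n * cpc_by_ad[k[0]]) for k, n in counts.items()}
```

The Spark version expresses the same steps as `dropDuplicates`, an anti-join against the fraud table, and a grouped aggregation, but the semantics are what matter for billing correctness.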
The batch job’s count is the authoritative billing number. The real-time counts from Flink are for dashboards and fraud detection only.
Data Reconciliation
Periodically compare Flink counts against batch counts. Acceptable discrepancy: <0.1% (from late-arriving or out-of-order events). Larger discrepancies trigger investigation — likely a bug in the stream processor or a deduplication failure.
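The comparison itself is a per-key relative-error check against the authoritative batch count; a sketch using the 0.1% tolerance above:

```python
TOLERANCE = 0.001  # 0.1% acceptable stream-vs-batch discrepancy

def reconcile(stream_counts, batch_counts):
    """Compare per-key click counts; batch is authoritative.
    Returns keys whose relative discrepancy exceeds TOLERANCE."""
    flagged = []
    for key, batch in batch_counts.items():
        stream = stream_counts.get(key, 0)
        if batch and abs(stream - batch) / batch > TOLERANCE:
            flagged.append(key)
    return flagged
```

Each flagged key identifies an (ad_id, day) worth investigating for stream-processor bugs or dedup failures.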
Interview Follow-ups
- How do you handle a Kafka consumer group that falls hours behind during a traffic spike? What’s your catch-up strategy without overwhelming downstream systems?
- An advertiser disputes their bill — claims our click count is 5% higher than their own tracking. How do you investigate and resolve this?
- How would you add viewability tracking — an ad impression only counts if the ad was in the viewport for at least 1 second?
- Design the real-time bidding (RTB) system — the auction that decides which ad to show in 100ms when a page loads.
- How do you detect a new type of click fraud your rules haven’t seen before?
Related System Design Topics
- Message Queues — Kafka is the ingest buffer; partition by ad_id for per-ad ordering and aggregation
- Database Sharding — partitioning click_counts by ad_id in Cassandra for local aggregation
- Caching Strategies — Redis for click deduplication window (SET NX with 24h TTL)
- Design a Payment System — billing accuracy and reconciliation requirements are identical: batch counts overwrite stream approximations
- CAP Theorem — stream counts choose availability (AP); billing batch job enforces consistency (CP)