Requirements and Scale
An analytics pipeline collects events from product surfaces (web, mobile, backend services), processes them in real time or batch, and makes the results queryable for dashboards and ad-hoc analysis. Functional requirements: ingest high-volume event streams (clickstreams, purchases, API calls), compute aggregates (DAU, revenue, funnel conversion), and expose a query layer for dashboards. Non-functional: ingest 1M events/second at peak, dashboards must reflect data within 5 minutes (near-real-time), historical queries over years of data must return in seconds. The fundamental tension: low-latency ingestion vs high-throughput batch analytics vs interactive ad-hoc queries — each favors a different storage and processing model.
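The throughput numbers above can be sanity-checked with quick arithmetic. A minimal sketch, using only the figures stated in this section (1M events/second, ~500 bytes per event):

```java
// Back-of-envelope capacity check using the requirements stated above.
// Numbers (1M events/s, 500 bytes/event) come from this section.
public class CapacityEstimate {
    static final long EVENTS_PER_SEC = 1_000_000L;
    static final long AVG_EVENT_BYTES = 500L;

    // Sustained ingest bandwidth in MB/s (decimal megabytes).
    static long ingestMBps() {
        return EVENTS_PER_SEC * AVG_EVENT_BYTES / 1_000_000;
    }

    // Raw event volume per day in TB, before compression.
    static double rawTBPerDay() {
        double bytesPerDay = (double) EVENTS_PER_SEC * AVG_EVENT_BYTES * 86_400;
        return bytesPerDay / 1e12;
    }

    public static void main(String[] args) {
        System.out.println("Ingest: ~" + ingestMBps() + " MB/s");
        System.out.println("Raw volume: ~" + rawTBPerDay() + " TB/day");
    }
}
```

At ~43 TB/day of raw events, the 7-day Kafka retention and 90-day S3 retention discussed later translate to hundreds of terabytes of storage, which is why compression and columnar formats matter.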
Ingestion Layer
Clients send events to an ingest API (REST or gRPC). The API does minimal validation (schema check, auth) and publishes to Kafka, which acts as the ingestion buffer, decoupling producers (apps) from consumers (processors). Producers write to a topic partitioned by event_type or user_id. At 1M events/second and an average event size of 500 bytes, that is ~500 MB/s of throughput; Kafka handles this with 20-50 partitions (roughly 10-25 MB/s per partition). Client-side batching: the SDK collects events for 100 ms and sends them in batches of 100, cutting HTTP request overhead by ~100x. Kafka retention: keep 7 days of raw events, which allows reprocessing if a downstream processor has a bug. Schema registry (e.g., Confluent Schema Registry): enforce Avro or Protobuf schemas per topic so malformed events cannot propagate downstream.
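The client-side batching described above can be sketched as follows. This is a simplified, single-threaded illustration, not a production SDK: the flush target (an HTTP call to the ingest API) is abstracted as a callback, and the clock is passed in explicitly so the time-based flush is easy to reason about.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of client-side event batching: buffer events and flush when the
// batch reaches 100 events or 100 ms have elapsed since the first buffered
// event. The flush callback stands in for the HTTP call to the ingest API.
public class EventBatcher {
    static final int MAX_BATCH = 100;
    static final long MAX_WAIT_MS = 100;

    private final List<String> buffer = new ArrayList<>();
    private long firstEventAt = -1;
    private final Consumer<List<String>> flushFn;

    EventBatcher(Consumer<List<String>> flushFn) {
        this.flushFn = flushFn;
    }

    // Called by the app for each event; nowMs is injected for testability.
    void add(String event, long nowMs) {
        if (buffer.isEmpty()) firstEventAt = nowMs;
        buffer.add(event);
        if (buffer.size() >= MAX_BATCH || nowMs - firstEventAt >= MAX_WAIT_MS) {
            flush();
        }
    }

    // Sends the buffered events as one request and resets the buffer.
    void flush() {
        if (buffer.isEmpty()) return;
        flushFn.accept(new ArrayList<>(buffer));  // one HTTP request per batch
        buffer.clear();
        firstEventAt = -1;
    }
}
```

A real SDK would also flush on app background/shutdown and retry failed batches (which is exactly why event_id deduplication is needed downstream).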
Stream Processing
Real-time aggregations with Apache Flink or Spark Streaming. Consume from Kafka, compute windowed aggregates, write to the serving layer. Example: count active users per 5-minute window, per country. Flink job:
stream
    .filter(e -> "page_view".equals(e.type))   // compare strings with equals(), not ==
    .keyBy(e -> e.country)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new CountAggregator())
    .addSink(new RedisSink());
Event time vs processing time: use event timestamps (event time) for accuracy, since events can arrive late due to mobile offline mode. Watermarks tell Flink how long to wait for late events (e.g., a 30-second out-of-orderness bound); events arriving after the watermark has passed their window are dropped or handled via a side output. Output: per-window aggregates are written to Redis (for real-time dashboards) and to the OLAP store (for historical analysis). Exactly-once semantics: Flink checkpointing gives exactly-once state updates, and combined with a transactional sink (such as the two-phase-commit Kafka sink) it provides end-to-end exactly-once processing, so there is no double-counting even on restart.
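The watermark bookkeeping can be illustrated without Flink. In a bounded-out-of-orderness strategy, the watermark trails the maximum event timestamp seen by the allowed lateness, and an event is late once the watermark has passed the end of its window. A simplified conceptual sketch (this mirrors Flink's behavior but is not Flink's actual API):

```java
// Simplified bounded-out-of-orderness watermark tracker. Timestamps in ms.
// Conceptual sketch only: real Flink manages this via WatermarkStrategy.
public class WatermarkTracker {
    static final long MAX_OUT_OF_ORDER_MS = 30_000;  // 30-second lateness bound
    static final long WINDOW_MS = 5 * 60_000;        // 5-minute tumbling windows

    private long maxTimestampSeen = Long.MIN_VALUE;

    // Observe an event and return the new watermark: the latest event time
    // seen so far, minus the lateness bound.
    long onEvent(long eventTimeMs) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimeMs);
        return maxTimestampSeen - MAX_OUT_OF_ORDER_MS;
    }

    // Start of the tumbling window [start, start + WINDOW_MS) for an event.
    static long windowStart(long eventTimeMs) {
        return eventTimeMs - (eventTimeMs % WINDOW_MS);
    }

    // An event is late if the watermark has already passed its window end;
    // late events go to a side output (or are dropped).
    boolean isLate(long eventTimeMs) {
        long windowEnd = windowStart(eventTimeMs) + WINDOW_MS;
        return maxTimestampSeen - MAX_OUT_OF_ORDER_MS >= windowEnd;
    }
}
```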
Batch Processing and OLAP Store
For historical queries, raw events are also written to S3 (via the Kafka S3 sink connector) in Parquet format, partitioned by date and event_type. A nightly Spark job computes full-day aggregates and writes them to the OLAP store. OLAP options: ClickHouse (columnar, extremely fast aggregations), Apache Druid (real-time + batch, good for time series), BigQuery (serverless, good for ad-hoc). ClickHouse example: queries over 10 billion rows return in under a second thanks to columnar storage and vectorized execution. Data retention: keep 90 days of raw events in S3 Standard, then move them to Glacier for long-term compliance; aggregate tables are retained indefinitely. Rollup: daily aggregates are compacted from the 5-minute aggregates, which reduces storage and speeds up year-over-year queries.
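A common way to lay out the date/event_type partitioning on S3 is Hive-style key=value directories, which Spark, Athena, and ClickHouse's S3 integration all understand. A small sketch of the key scheme; the bucket name and prefix are placeholders, not from the source:

```java
import java.time.LocalDate;

// Sketch of an S3 key layout for Parquet files partitioned by date and
// event_type (Hive-style key=value directories). Bucket and prefix names
// are hypothetical.
public class PartitionPath {
    static final String BUCKET = "s3://analytics-raw";  // placeholder bucket

    static String forFile(LocalDate date, String eventType, int fileNo) {
        return String.format("%s/events/dt=%s/event_type=%s/part-%05d.parquet",
                BUCKET, date, eventType, fileNo);
    }

    public static void main(String[] args) {
        System.out.println(forFile(LocalDate.of(2024, 3, 1), "purchase", 7));
    }
}
```

Partition pruning is the payoff: a query over one day of one event type reads only that directory instead of scanning 90 days of raw data.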
Query Layer and Dashboards
Two query patterns: (1) Pre-aggregated metrics (DAU, revenue): served from Redis or a summary table with sub-millisecond latency. Dashboard reads these via a metrics API. (2) Ad-hoc analysis: analyst writes SQL against ClickHouse or BigQuery. Query router: parse the query, determine the time range and granularity. If the range fits in pre-computed aggregates: serve from the fast path. Otherwise: hit ClickHouse or BigQuery. Caching: cache query results for 60 seconds (dashboards refresh every minute). Use a query fingerprint (hash of the SQL + time range) as the cache key. Rate limiting: limit ad-hoc queries per user to prevent one analyst from exhausting cluster resources. Backfill: if a Flink job has a bug and produces wrong metrics for 2 hours: replay the 2-hour window from Kafka raw events (Flink savepoint + replay). Kafka’s 7-day retention makes this practical.
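The fingerprint-keyed result cache described above can be sketched as follows. This is a minimal in-process stand-in: a HashMap plays the role of Redis, and the clock is injected so the 60-second TTL is testable; a production version would use Redis SETEX and normalize the SQL before hashing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Sketch of a query result cache keyed by a fingerprint: SHA-256 of the
// SQL text plus the resolved time range, with a 60-second TTL.
public class QueryCache {
    static final long TTL_MS = 60_000;

    private static final class Entry {
        final String result;
        final long storedAt;
        Entry(String result, long storedAt) { this.result = result; this.storedAt = storedAt; }
    }

    private final Map<String, Entry> store = new HashMap<>();  // stand-in for Redis

    // Fingerprint = hex SHA-256 over SQL + time range, so identical dashboard
    // queries from different users share one cache entry.
    static String fingerprint(String sql, long fromMs, long toMs) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] h = md.digest((sql + "|" + fromMs + "|" + toMs)
                    .getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : h) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // Returns the cached result, or null if absent or older than the TTL.
    String get(String key, long nowMs) {
        Entry e = store.get(key);
        if (e == null || nowMs - e.storedAt >= TTL_MS) return null;
        return e.result;
    }

    void put(String key, String result, long nowMs) {
        store.put(key, new Entry(result, nowMs));
    }
}
```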
Interview Tips
- Lambda vs Kappa architecture: Lambda has separate batch and stream paths (complex but reliable batch layer). Kappa replaces batch with Kafka replay — simpler, but requires the stream processor to handle reprocessing efficiently.
- Data freshness SLA drives architecture: 5-minute freshness means streaming; 1-hour freshness could be micro-batch; 24-hour is pure batch.
- Hotspot partitioning: if partitioning by user_id, viral users create hot partitions. Mitigate with composite keys or salting.
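The salting mitigation from the last tip can be sketched as follows: a hot user's events are spread over a small number of sub-keys so no single Kafka partition absorbs all of them, at the cost of consumers having to merge the sub-keys back together when aggregating per user. The bucket count of 8 is an arbitrary illustrative choice.

```java
// Sketch of key salting for hot partitions. A viral user's events are
// spread across SALT_BUCKETS sub-keys; consumers aggregating per user
// must merge the sub-keys back together.
public class SaltedKey {
    static final int SALT_BUCKETS = 8;  // illustrative; tune to partition count

    // Deterministic per-event salt. Math.floorMod guards against negative
    // hashCode values producing a negative bucket.
    static String salt(String userId, String eventId) {
        int bucket = Math.floorMod(eventId.hashCode(), SALT_BUCKETS);
        return userId + "#" + bucket;
    }

    // Recover the original key on the consumer side.
    static String unsalt(String saltedKey) {
        return saltedKey.substring(0, saltedKey.lastIndexOf('#'));
    }
}
```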
Frequently Asked Questions

Q: What is the difference between Lambda architecture and Kappa architecture for analytics pipelines?
A: Lambda architecture uses two separate processing paths: a batch layer (Spark on Hadoop) that recomputes accurate aggregates over all historical data nightly, and a speed layer (Flink/Storm) that computes approximate real-time aggregates. Query results merge both layers. Pros: the batch layer guarantees correctness; the speed layer provides low latency. Cons: two codebases to maintain (the same logic written twice), complex merge logic at query time, and the operational overhead of two clusters. Kappa architecture eliminates the batch layer: all processing happens in the stream processor, and historical reprocessing is done by replaying events from Kafka. Pros: one codebase, simpler operations. Cons: Kafka must retain a long event history (expensive at scale), and reprocessing can be slow if the stream processor has low throughput. Modern preference: Kappa for new systems (Kafka storage costs have dropped), Lambda for existing batch-heavy systems where the correctness guarantees of the batch layer are non-negotiable.

Q: How do you handle late-arriving events in a streaming analytics pipeline?
A: Late events arrive after their event time has passed, caused by mobile offline mode, network delays, or batch uploads. Without handling, a 5-minute window closes at T+5, but events from T+3 arrive at T+8, get dropped, and the aggregate is wrong. Watermarks define the maximum expected lateness (e.g., 30 seconds): Flink emits a watermark of current_time - 30s, and a window closes only when the watermark advances past the window end time. Events arriving within 30 seconds of their window end are included; events beyond the watermark are considered late. Late-event handling options: (1) Drop (simplest; acceptable if late events are rare). (2) Side output: route late events to a separate Kafka topic for later reprocessing. (3) Allowed-lateness extension: keep windows open for an additional N minutes and re-emit the updated aggregate. Trade-off: longer allowed lateness means more accurate results but higher memory usage, since window state stays open longer. Dashboard strategy: show the current aggregate with a "refreshed X minutes ago" indicator so users know the data may be incomplete.

Q: How would you design the schema for an event in an analytics pipeline?
A: A good analytics event schema balances expressiveness with consistency. Required fields for every event: event_id (UUID, for deduplication), event_type (string, e.g., "page_view", "purchase", "click"), timestamp (ISO-8601 with timezone, client-side event time), received_at (server-side ingestion time, used to detect clock skew), session_id, user_id (nullable if anonymous), device_id, platform (web/ios/android), app_version. Context fields: page_url, referrer, user_agent, ip_address (anonymized to /24 for privacy). Event-specific properties: stored in a properties JSON blob, giving a flexible schema for event-specific data. Schema registry: enforce the common fields via an Avro or Protobuf schema; properties are validated per event_type by a separate schema. Versioning: include a schema_version field. When you add a new required field, increment the version and handle both old and new versions in consumers for a transition period.

Q: How do you prevent duplicate event counting in an analytics pipeline?
A: Duplicates arise from client-side retries on network failure (the same event sent twice), Kafka at-least-once delivery, and stream-processor restarts from a checkpoint. Three strategies: (1) Event ID deduplication: each event carries a UUID event_id. At the ingest API, check a Redis set of recently seen event IDs (TTL = 1 hour); if seen, discard silently. Effective for client-side retries, but the dedup window is bounded by Redis memory. (2) Idempotent Kafka producers: Kafka exactly-once semantics (enable.idempotence=true plus a transactional.id) prevent broker-side duplicates. (3) Stream-processor exactly-once: Flink with Kafka checkpointing uses two-phase commit to guarantee each event is processed exactly once end to end. The Kafka sink uses transactions, writing aggregates and committing the Kafka offset atomically; if the job restarts, it resumes from the last successful checkpoint without reprocessing. For counting metrics, exactly-once matters; for sum metrics, idempotency via event_id dedup is sufficient, since reprocessing the same event_id yields the same contribution.

Q: How would you design the query layer for an analytics dashboard with 100 concurrent users?
A: With 100 concurrent users querying dashboards over diverse time ranges: (1) Pre-computed metrics API: for standard KPIs (DAU, revenue, conversion rate), pre-aggregate in ClickHouse materialized views or a Redis hash; dashboard widgets for common metrics hit this API with response times under 50 ms. (2) Ad-hoc OLAP layer: ClickHouse or BigQuery for analyst queries; handle 100 concurrent queries with connection pooling (ClickHouse supports 100+ concurrent queries with a shared thread pool). (3) Result cache: cache query results in Redis with a 60-second TTL, keyed by SHA256(sql + parameters); when 10 users load the same dashboard at once, only one query hits ClickHouse. (4) Query queue: if the OLAP cluster is under heavy load, queue excess queries rather than failing, and show users an estimated wait time. (5) Resource isolation: run separate ClickHouse clusters for internal dashboards (low latency) and analyst ad-hoc work (higher latency acceptable), so a heavy analyst query cannot impact the dashboard SLA.
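The event_id deduplication with a bounded TTL window, described in the duplicate-counting discussion above, can be sketched as follows. A HashMap stands in for the Redis set, and the clock is injected so the TTL logic is testable; real Redis would express the same check as a SET with EX and NX options.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of event_id deduplication at the ingest API. Redis set-with-TTL
// is simulated with a HashMap of id -> firstSeen timestamp.
public class EventDeduper {
    static final long TTL_MS = 60 * 60_000;  // 1-hour dedup window

    private final Map<String, Long> seen = new HashMap<>();

    // Returns true if the event should be accepted, false if it is a
    // duplicate seen within the TTL window.
    boolean accept(String eventId, long nowMs) {
        Long firstSeen = seen.get(eventId);
        if (firstSeen != null && nowMs - firstSeen < TTL_MS) {
            return false;  // duplicate: discard silently
        }
        seen.put(eventId, nowMs);
        return true;
    }
}
```

A production version would also evict expired entries (Redis does this automatically via key expiry), which is what bounds the memory footprint of the dedup window.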