Requirements and Scale
An analytics pipeline collects events from product surfaces (web, mobile, backend services), processes them in real time or batch, and makes the results queryable for dashboards and ad-hoc analysis. Functional requirements: ingest high-volume event streams (clickstreams, purchases, API calls), compute aggregates (DAU, revenue, funnel conversion), and expose a query layer for dashboards. Non-functional: ingest 1M events/second at peak, dashboards must reflect data within 5 minutes (near-real-time), historical queries over years of data must return in seconds. The fundamental tension: low-latency ingestion vs high-throughput batch analytics vs interactive ad-hoc queries — each favors a different storage and processing model.
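The throughput numbers above can be sanity-checked with quick arithmetic. A minimal sketch, using only the figures stated in this section (1M events/second, ~500 bytes per event):

```java
// Back-of-envelope capacity check using the requirements stated above.
// Numbers (1M events/s, 500 bytes/event) come from this section.
public class CapacityEstimate {
    static final long EVENTS_PER_SEC = 1_000_000L;
    static final long AVG_EVENT_BYTES = 500L;

    // Sustained ingest bandwidth in MB/s (decimal megabytes).
    static long ingestMBps() {
        return EVENTS_PER_SEC * AVG_EVENT_BYTES / 1_000_000;
    }

    // Raw event volume per day in TB, before compression.
    static double rawTBPerDay() {
        double bytesPerDay = (double) EVENTS_PER_SEC * AVG_EVENT_BYTES * 86_400;
        return bytesPerDay / 1e12;
    }

    public static void main(String[] args) {
        System.out.println("Ingest: ~" + ingestMBps() + " MB/s");
        System.out.println("Raw volume: ~" + rawTBPerDay() + " TB/day");
    }
}
```

At ~43 TB/day of raw events, the 7-day Kafka retention and 90-day S3 retention discussed later translate to hundreds of terabytes of storage, which is why compression and columnar formats matter.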
Ingestion Layer
Clients send events to an ingest API (REST or gRPC). The API does minimal validation (schema check, auth) and publishes to Kafka, which acts as the ingestion buffer, decoupling producers (apps) from consumers (processors). Producers write to a topic partitioned by event_type or user_id. At 1M events/second and an average event size of 500 bytes, that is ~500 MB/s of throughput; Kafka handles this with 20-50 partitions (roughly 10-25 MB/s per partition). Client-side batching: the SDK collects events for 100 ms and sends them in batches of 100, cutting HTTP request overhead by ~100x. Kafka retention: keep 7 days of raw events, which allows reprocessing if a downstream processor has a bug. Schema registry (e.g., Confluent Schema Registry): enforce Avro or Protobuf schemas per topic so malformed events cannot propagate downstream.
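The client-side batching described above can be sketched as follows. This is a simplified, single-threaded illustration, not a production SDK: the flush target (an HTTP call to the ingest API) is abstracted as a callback, and the clock is passed in explicitly so the time-based flush is easy to reason about.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of client-side event batching: buffer events and flush when the
// batch reaches 100 events or 100 ms have elapsed since the first buffered
// event. The flush callback stands in for the HTTP call to the ingest API.
public class EventBatcher {
    static final int MAX_BATCH = 100;
    static final long MAX_WAIT_MS = 100;

    private final List<String> buffer = new ArrayList<>();
    private long firstEventAt = -1;
    private final Consumer<List<String>> flushFn;

    EventBatcher(Consumer<List<String>> flushFn) {
        this.flushFn = flushFn;
    }

    // Called by the app for each event; nowMs is injected for testability.
    void add(String event, long nowMs) {
        if (buffer.isEmpty()) firstEventAt = nowMs;
        buffer.add(event);
        if (buffer.size() >= MAX_BATCH || nowMs - firstEventAt >= MAX_WAIT_MS) {
            flush();
        }
    }

    // Sends the buffered events as one request and resets the buffer.
    void flush() {
        if (buffer.isEmpty()) return;
        flushFn.accept(new ArrayList<>(buffer));  // one HTTP request per batch
        buffer.clear();
        firstEventAt = -1;
    }
}
```

A real SDK would also flush on app background/shutdown and retry failed batches (which is exactly why event_id deduplication is needed downstream).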
Stream Processing
Real-time aggregations with Apache Flink or Spark Streaming. Consume from Kafka, compute windowed aggregates, write to the serving layer. Example: count active users per 5-minute window, per country. Flink job:
stream
    .filter(e -> "page_view".equals(e.type))   // compare strings with equals(), not ==
    .keyBy(e -> e.country)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new CountAggregator())
    .addSink(new RedisSink());
Event time vs processing time: use event timestamps (event time) for accuracy, since events can arrive late due to mobile offline mode. Watermarks tell Flink how long to wait for late events (e.g., a 30-second out-of-orderness bound); events arriving after the watermark has passed their window are dropped or handled via a side output. Output: per-window aggregates are written to Redis (for real-time dashboards) and to the OLAP store (for historical analysis). Exactly-once semantics: Flink checkpointing gives exactly-once state updates, and combined with a transactional sink (such as the two-phase-commit Kafka sink) it provides end-to-end exactly-once processing, so there is no double-counting even on restart.
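The watermark bookkeeping can be illustrated without Flink. In a bounded-out-of-orderness strategy, the watermark trails the maximum event timestamp seen by the allowed lateness, and an event is late once the watermark has passed the end of its window. A simplified conceptual sketch (this mirrors Flink's behavior but is not Flink's actual API):

```java
// Simplified bounded-out-of-orderness watermark tracker. Timestamps in ms.
// Conceptual sketch only: real Flink manages this via WatermarkStrategy.
public class WatermarkTracker {
    static final long MAX_OUT_OF_ORDER_MS = 30_000;  // 30-second lateness bound
    static final long WINDOW_MS = 5 * 60_000;        // 5-minute tumbling windows

    private long maxTimestampSeen = Long.MIN_VALUE;

    // Observe an event and return the new watermark: the latest event time
    // seen so far, minus the lateness bound.
    long onEvent(long eventTimeMs) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimeMs);
        return maxTimestampSeen - MAX_OUT_OF_ORDER_MS;
    }

    // Start of the tumbling window [start, start + WINDOW_MS) for an event.
    static long windowStart(long eventTimeMs) {
        return eventTimeMs - (eventTimeMs % WINDOW_MS);
    }

    // An event is late if the watermark has already passed its window end;
    // late events go to a side output (or are dropped).
    boolean isLate(long eventTimeMs) {
        long windowEnd = windowStart(eventTimeMs) + WINDOW_MS;
        return maxTimestampSeen - MAX_OUT_OF_ORDER_MS >= windowEnd;
    }
}
```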
Batch Processing and OLAP Store
For historical queries, raw events are also written to S3 (via the Kafka S3 sink connector) in Parquet format, partitioned by date and event_type. A nightly Spark job computes full-day aggregates and writes them to the OLAP store. OLAP options: ClickHouse (columnar, extremely fast aggregations), Apache Druid (real-time + batch, good for time series), BigQuery (serverless, good for ad-hoc). ClickHouse example: queries over 10 billion rows return in under a second thanks to columnar storage and vectorized execution. Data retention: keep 90 days of raw events in S3 Standard, then move them to Glacier for long-term compliance; aggregate tables are retained indefinitely. Rollup: daily aggregates are compacted from the 5-minute aggregates, which reduces storage and speeds up year-over-year queries.
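A common way to lay out the date/event_type partitioning on S3 is Hive-style key=value directories, which Spark, Athena, and ClickHouse's S3 integration all understand. A small sketch of the key scheme; the bucket name and prefix are placeholders, not from the source:

```java
import java.time.LocalDate;

// Sketch of an S3 key layout for Parquet files partitioned by date and
// event_type (Hive-style key=value directories). Bucket and prefix names
// are hypothetical.
public class PartitionPath {
    static final String BUCKET = "s3://analytics-raw";  // placeholder bucket

    static String forFile(LocalDate date, String eventType, int fileNo) {
        return String.format("%s/events/dt=%s/event_type=%s/part-%05d.parquet",
                BUCKET, date, eventType, fileNo);
    }

    public static void main(String[] args) {
        System.out.println(forFile(LocalDate.of(2024, 3, 1), "purchase", 7));
    }
}
```

Partition pruning is the payoff: a query over one day of one event type reads only that directory instead of scanning 90 days of raw data.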
Query Layer and Dashboards
Two query patterns: (1) Pre-aggregated metrics (DAU, revenue): served from Redis or a summary table with sub-millisecond latency. Dashboard reads these via a metrics API. (2) Ad-hoc analysis: analyst writes SQL against ClickHouse or BigQuery. Query router: parse the query, determine the time range and granularity. If the range fits in pre-computed aggregates: serve from the fast path. Otherwise: hit ClickHouse or BigQuery. Caching: cache query results for 60 seconds (dashboards refresh every minute). Use a query fingerprint (hash of the SQL + time range) as the cache key. Rate limiting: limit ad-hoc queries per user to prevent one analyst from exhausting cluster resources. Backfill: if a Flink job has a bug and produces wrong metrics for 2 hours: replay the 2-hour window from Kafka raw events (Flink savepoint + replay). Kafka’s 7-day retention makes this practical.
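The fingerprint-keyed result cache described above can be sketched as follows. This is a minimal in-process stand-in: a HashMap plays the role of Redis, and the clock is injected so the 60-second TTL is testable; a production version would use Redis SETEX and normalize the SQL before hashing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Sketch of a query result cache keyed by a fingerprint: SHA-256 of the
// SQL text plus the resolved time range, with a 60-second TTL.
public class QueryCache {
    static final long TTL_MS = 60_000;

    private static final class Entry {
        final String result;
        final long storedAt;
        Entry(String result, long storedAt) { this.result = result; this.storedAt = storedAt; }
    }

    private final Map<String, Entry> store = new HashMap<>();  // stand-in for Redis

    // Fingerprint = hex SHA-256 over SQL + time range, so identical dashboard
    // queries from different users share one cache entry.
    static String fingerprint(String sql, long fromMs, long toMs) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] h = md.digest((sql + "|" + fromMs + "|" + toMs)
                    .getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : h) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // Returns the cached result, or null if absent or older than the TTL.
    String get(String key, long nowMs) {
        Entry e = store.get(key);
        if (e == null || nowMs - e.storedAt >= TTL_MS) return null;
        return e.result;
    }

    void put(String key, String result, long nowMs) {
        store.put(key, new Entry(result, nowMs));
    }
}
```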
Interview Tips
- Lambda vs Kappa architecture: Lambda has separate batch and stream paths (complex but reliable batch layer). Kappa replaces batch with Kafka replay — simpler, but requires the stream processor to handle reprocessing efficiently.
- Data freshness SLA drives architecture: 5-minute freshness means streaming; 1-hour freshness could be micro-batch; 24-hour is pure batch.
- Hotspot partitioning: if partitioning by user_id, viral users create hot partitions. Mitigate with composite keys or salting.
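The salting mitigation from the last tip can be sketched as follows: a hot user's events are spread over a small number of sub-keys so no single Kafka partition absorbs all of them, at the cost of consumers having to merge the sub-keys back together when aggregating per user. The bucket count of 8 is an arbitrary illustrative choice.

```java
// Sketch of key salting for hot partitions. A viral user's events are
// spread across SALT_BUCKETS sub-keys; consumers aggregating per user
// must merge the sub-keys back together.
public class SaltedKey {
    static final int SALT_BUCKETS = 8;  // illustrative; tune to partition count

    // Deterministic per-event salt. Math.floorMod guards against negative
    // hashCode values producing a negative bucket.
    static String salt(String userId, String eventId) {
        int bucket = Math.floorMod(eventId.hashCode(), SALT_BUCKETS);
        return userId + "#" + bucket;
    }

    // Recover the original key on the consumer side.
    static String unsalt(String saltedKey) {
        return saltedKey.substring(0, saltedKey.lastIndexOf('#'));
    }
}
```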
Frequently Asked Questions

Q: What is the difference between Lambda architecture and Kappa architecture for analytics pipelines?
A: Lambda architecture uses two separate processing paths: a batch layer (Spark on Hadoop) that recomputes accurate aggregates over all historical data nightly, and a speed layer (Flink/Storm) that computes approximate real-time aggregates. Query results merge both layers. Pros: the batch layer guarantees correctness; the speed layer provides low latency. Cons: two codebases to maintain (the same logic written twice), complex merge logic at query time, and the operational overhead of two clusters. Kappa architecture eliminates the batch layer: all processing happens in the stream processor, and historical reprocessing is done by replaying events from Kafka. Pros: one codebase, simpler operations. Cons: Kafka must retain a long event history (expensive at scale), and reprocessing can be slow if the stream processor has low throughput. Modern preference: Kappa for new systems (Kafka storage costs have dropped), Lambda for existing batch-heavy systems where the correctness guarantees of the batch layer are non-negotiable.

Q: How do you handle late-arriving events in a streaming analytics pipeline?
A: Late events arrive after their event time has passed, caused by mobile offline mode, network delays, or batch uploads. Without handling, a 5-minute window closes at T+5, but events from T+3 arrive at T+8, get dropped, and the aggregate is wrong. Watermarks define the maximum expected lateness (e.g., 30 seconds): Flink emits a watermark of current_time - 30s, and a window closes only when the watermark advances past the window end time. Events arriving within 30 seconds of their window end are included; events beyond the watermark are considered late. Late-event handling options: (1) Drop (simplest; acceptable if late events are rare). (2) Side output: route late events to a separate Kafka topic for later reprocessing. (3) Allowed-lateness extension: keep windows open for an additional N minutes and re-emit the updated aggregate. Trade-off: longer allowed lateness means more accurate results but higher memory usage, since window state stays open longer. Dashboard strategy: show the current aggregate with a "refreshed X minutes ago" indicator so users know the data may be incomplete.

Q: How would you design the schema for an event in an analytics pipeline?
A: A good analytics event schema balances expressiveness with consistency. Required fields for every event: event_id (UUID, for deduplication), event_type (string, e.g., "page_view", "purchase", "click"), timestamp (ISO-8601 with timezone, client-side event time), received_at (server-side ingestion time, used to detect clock skew), session_id, user_id (nullable if anonymous), device_id, platform (web/ios/android), app_version. Context fields: page_url, referrer, user_agent, ip_address (anonymized to /24 for privacy). Event-specific properties: stored in a properties JSON blob, giving a flexible schema for event-specific data. Schema registry: enforce the common fields via an Avro or Protobuf schema; properties are validated per event_type by a separate schema. Versioning: include a schema_version field. When you add a new required field, increment the version and handle both old and new versions in consumers for a transition period.

Q: How do you prevent duplicate event counting in an analytics pipeline?
A: Duplicates arise from client-side retries on network failure (the same event sent twice), Kafka at-least-once delivery, and stream-processor restarts from a checkpoint. Three strategies: (1) Event ID deduplication: each event carries a UUID event_id. At the ingest API, check a Redis set of recently seen event IDs (TTL = 1 hour); if seen, discard silently. Effective for client-side retries, but the dedup window is bounded by Redis memory. (2) Idempotent Kafka producers: Kafka exactly-once semantics (enable.idempotence=true plus a transactional.id) prevent broker-side duplicates. (3) Stream-processor exactly-once: Flink with Kafka checkpointing uses two-phase commit to guarantee each event is processed exactly once end to end. The Kafka sink uses transactions, writing aggregates and committing the Kafka offset atomically; if the job restarts, it resumes from the last successful checkpoint without reprocessing. For counting metrics, exactly-once matters; for sum metrics, idempotency via event_id dedup is sufficient, since reprocessing the same event_id yields the same contribution.

Q: How would you design the query layer for an analytics dashboard with 100 concurrent users?
A: With 100 concurrent users querying dashboards over diverse time ranges: (1) Pre-computed metrics API: for standard KPIs (DAU, revenue, conversion rate), pre-aggregate in ClickHouse materialized views or a Redis hash; dashboard widgets for common metrics hit this API with response times under 50 ms. (2) Ad-hoc OLAP layer: ClickHouse or BigQuery for analyst queries; handle 100 concurrent queries with connection pooling (ClickHouse supports 100+ concurrent queries with a shared thread pool). (3) Result cache: cache query results in Redis with a 60-second TTL, keyed by SHA256(sql + parameters); when 10 users load the same dashboard at once, only one query hits ClickHouse. (4) Query queue: if the OLAP cluster is under heavy load, queue excess queries rather than failing, and show users an estimated wait time. (5) Resource isolation: run separate ClickHouse clusters for internal dashboards (low latency) and analyst ad-hoc work (higher latency acceptable), so a heavy analyst query cannot impact the dashboard SLA.
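The event_id deduplication with a bounded TTL window, described in the duplicate-counting discussion above, can be sketched as follows. A HashMap stands in for the Redis set, and the clock is injected so the TTL logic is testable; real Redis would express the same check as a SET with EX and NX options.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of event_id deduplication at the ingest API. Redis set-with-TTL
// is simulated with a HashMap of id -> firstSeen timestamp.
public class EventDeduper {
    static final long TTL_MS = 60 * 60_000;  // 1-hour dedup window

    private final Map<String, Long> seen = new HashMap<>();

    // Returns true if the event should be accepted, false if it is a
    // duplicate seen within the TTL window.
    boolean accept(String eventId, long nowMs) {
        Long firstSeen = seen.get(eventId);
        if (firstSeen != null && nowMs - firstSeen < TTL_MS) {
            return false;  // duplicate: discard silently
        }
        seen.put(eventId, nowMs);
        return true;
    }
}
```

A production version would also evict expired entries (Redis does this automatically via key expiry), which is what bounds the memory footprint of the dedup window.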