Data Engineering Interview Questions: Spark, Kafka, and Pipeline Design

Data engineering roles require you to design, build, and maintain data pipelines that move and transform data at scale. Interviews test your knowledge of Spark, Kafka, SQL, pipeline design patterns, and data modeling. This guide covers the most common data engineering interview questions.

Apache Spark Core Concepts

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import *

spark = (
    SparkSession.builder
    .appName("DataEngInterview")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)

# RDD vs DataFrame vs Dataset:
# RDD: low-level, untyped, no Catalyst optimizer
# DataFrame: untyped rows, Catalyst optimizer, fast
# Dataset: typed (Scala/Java), Catalyst + Tungsten
# In PySpark: always use DataFrame API (Dataset not available)

# Lazy evaluation: transformations build a DAG, actions trigger execution
df = spark.read.parquet("s3://data/events/")  # No I/O yet
df_filtered = df.filter(F.col("event_type") == "purchase")  # Still lazy
df_enriched = df_filtered.withColumn("total", F.col("price") * F.col("qty"))  # Lazy
# Action: triggers execution of entire DAG
df_enriched.write.parquet("s3://output/purchases/")  # NOW it runs

# Window functions
window_spec = Window.partitionBy("user_id").orderBy("event_ts")

df_with_session = df.withColumn(
    "prev_event_ts",
    F.lag("event_ts", 1).over(window_spec)
).withColumn(
    "gap_minutes",
    (F.col("event_ts").cast("long") - F.col("prev_event_ts").cast("long")) / 60
).withColumn(
    "is_new_session",
    F.when(F.col("gap_minutes") > 30, 1).otherwise(0)  # 30-min session timeout
).withColumn(
    "session_id",
    F.concat(
        F.col("user_id"),
        F.lit("_"),
        F.sum("is_new_session").over(window_spec)
    )
)
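The same 30-minute-gap rule can be sketched in plain Python to make the logic concrete (a toy illustration for a single user, not Spark code; timestamps are epoch seconds):

```python
def assign_sessions(events, gap_minutes=30):
    """Assign session IDs to one user's events, sorted by timestamp.

    A new session starts whenever the gap to the previous event
    exceeds gap_minutes — mirrors the lag/sum window logic above.
    """
    sessions = []
    session_num = 0
    prev_ts = None
    for user_id, ts in sorted(events, key=lambda e: e[1]):
        if prev_ts is not None and (ts - prev_ts) / 60 > gap_minutes:
            session_num += 1
        sessions.append((user_id, ts, f"{user_id}_{session_num}"))
        prev_ts = ts
    return sessions

events = [("u1", 0), ("u1", 600), ("u1", 5000)]  # 10-min gap, then a ~73-min gap
print(assign_sessions(events))  # third event starts a new session
```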

Spark Performance Tuning

# Common performance problems and fixes:

# 1. DATA SKEW — one partition has 10x more data than others
# Symptom: one task takes 10x longer than others in Spark UI

# Fix: salting — artificially distribute skewed key
skewed_df = spark.read.parquet("s3://data/orders/")

# Add salt to distribute one hot seller_id across N partitions
N_SALTS = 50
df_salted = skewed_df.withColumn(
    "seller_id_salted",
    F.concat(F.col("seller_id"), F.lit("_"), (F.rand() * N_SALTS).cast("int"))
)
# Join on salted key, then aggregate away the salt
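To see why salting helps, here is a toy Python illustration (with a hypothetical hot key) of how records for one key, which would all hash to a single partition, spread across many partitions once a salt suffix is appended:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    # Deterministic hash (Python's built-in hash() is salted per process)
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_partitions

N_PARTITIONS = 8
N_SALTS = 50

# Without salting: every record for the hot seller lands on one partition
hot = [partition_for("seller_42", N_PARTITIONS) for _ in range(1000)]

# With salting: records spread across many partitions
salted = [partition_for(f"seller_42_{i % N_SALTS}", N_PARTITIONS)
          for i in range(1000)]

print(len(set(hot)), "partition(s) unsalted vs",
      len(set(salted)), "partition(s) salted")
```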

# 2. TOO MANY SMALL FILES — many 1KB files instead of fewer large ones
# Fix: coalesce before writing (reduce without shuffle) or repartition (with shuffle)
df.coalesce(100).write.parquet("output/")  # Reduce partitions, no shuffle
df.repartition(200, "date").write.parquet("output/")  # Redistribute by date

# 3. BROADCAST JOIN — small table join
# Instead of shuffle join (expensive), broadcast small table to all workers
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_lookup), "product_id")

# 4. CACHING — avoid recomputing same DataFrame multiple times
df_expensive = spark.read.parquet("s3://data/").filter(...).join(...)
from pyspark import StorageLevel

df_expensive.cache()  # Default storage level; LRU eviction under memory pressure
df_expensive.persist(StorageLevel.MEMORY_AND_DISK)  # Spill to disk instead of failing on OOM

count = df_expensive.count()  # First action: computes and caches
stats = df_expensive.agg(F.mean("revenue")).collect()  # Uses cache

df_expensive.unpersist()  # Free memory when done

# 5. PARTITION PRUNING — filter on partition column
# Good: Spark only reads relevant partitions
df = spark.read.parquet("s3://data/events/")  # Partitioned by date=YYYY-MM-DD
filtered = df.filter(F.col("date") >= "2026-01-01")  # Reads only 2026 data

# 6. ADAPTIVE QUERY EXECUTION (AQE) — Spark 3.0+
# Automatically coalesces partitions, switches join strategies based on runtime stats
spark.conf.set("spark.sql.adaptive.enabled", "true")

Kafka Architecture and Patterns

"""
Kafka concepts:
  Topic:     named stream of records; partitioned for parallelism
  Partition: ordered, immutable sequence; each record has offset
  Producer:  publishes to topics; controls partition via key or custom partitioner
  Consumer:  reads records; tracks offset per partition
  Consumer Group: multiple consumers sharing work; each partition → one consumer
  Broker:    Kafka server; multiple brokers = Kafka cluster
  Replication: each partition has N replicas; one leader, rest followers
  ISR:       In-Sync Replicas — must acknowledge writes for durability

Delivery guarantees:
  At-most-once:  fire-and-forget (fast, can lose messages)
  At-least-once: retry on failure (default; may process twice)
  Exactly-once:  idempotent producer + transactional API (slow; use sparingly)

Performance tuning:
  linger.ms:           batch messages for N ms before sending (improves throughput)
  batch.size:          max bytes per batch (default 16KB; increase for high throughput)
  compression.type:    lz4 for throughput, zstd for best ratio
  acks=all:            require all ISR to acknowledge (durability)
  max.poll.records:    records per poll() call — Java client only (tune based on processing time)
"""

from confluent_kafka import Producer, Consumer, KafkaError
import json
import time

def create_producer(bootstrap_servers: str) -> Producer:
    return Producer({
        "bootstrap.servers":   bootstrap_servers,
        "acks":               "all",          # Wait for all ISR
        "enable.idempotence": True,           # Exactly-once producer
        "compression.type":   "lz4",
        "linger.ms":          10,             # Batch for 10ms
        "batch.size":         65536,          # 64KB batch
    })

def create_consumer(bootstrap_servers: str, group_id: str) -> Consumer:
    # Note: max.poll.records is a Java-client setting; confluent-kafka
    # (librdkafka) rejects it — batch size is set per consume() call instead.
    return Consumer({
        "bootstrap.servers":  bootstrap_servers,
        "group.id":           group_id,
        "auto.offset.reset":  "earliest",     # Start from beginning if no offset
        "enable.auto.commit": False,          # Manual commit for at-least-once
    })

def reliable_consumer_loop(consumer: Consumer, topics: list, process_fn):
    consumer.subscribe(topics)
    try:
        while True:
            msgs = consumer.consume(num_messages=500, timeout=1.0)
            if not msgs:
                continue
            for msg in msgs:
                if msg.error():
                    if msg.error().code() == KafkaError._PARTITION_EOF:
                        continue
                    raise Exception(f"Consumer error: {msg.error()}")
                try:
                    record = json.loads(msg.value())
                    process_fn(record)
                except Exception as e:
                    # Dead letter queue: publish failed message for investigation
                    pass  # publish to DLQ topic
            # Commit only after successful processing of batch
            consumer.commit(asynchronous=False)
    finally:
        consumer.close()

Data Pipeline Design Patterns

"""
Lambda Architecture:
  Batch layer:   processes all historical data; produces accurate views
  Speed layer:   processes recent data in real-time; low-latency approximate views
  Serving layer: merges batch and speed views for queries
  Problem: complex; two codebases to maintain

Kappa Architecture:
  Single streaming pipeline handles everything (Kafka + Flink/Spark Streaming)
  Replay historical data by re-reading from Kafka (retention must be long enough)
  Simpler, but requires streaming system to handle batch-scale

Delta Architecture (Delta Lake / Iceberg / Hudi):
  ACID transactions on data lake (Parquet files in S3)
  Unified batch + streaming: same API for both
  Time travel: query historical versions of data
  Schema evolution: add/rename columns without rewriting all data

Data Quality patterns:
  - Validate schema at ingestion boundary (Great Expectations, Soda)
  - Monitor row counts, null rates, value distributions
  - Use quarantine tables for invalid records (never drop silently)
  - Idempotent writes: same run twice = same output (overwrite by partition)
"""

# Idempotent Spark write — safe to re-run
def write_partition_idempotent(df, output_path: str, partition_col: str):
    """
    Write partitioned Parquet. If partition already exists, overwrite it.
    Prevents data duplication on pipeline retry.
    """
    (df.write
        .mode("overwrite")
        .option("partitionOverwriteMode", "dynamic")
        .partitionBy(partition_col)
        .parquet(output_path))

# Slowly Changing Dimensions (SCD) Type 2
def apply_scd2(
    spark,
    current_df,       # Current state of dimension
    new_snapshot_df,  # New snapshot of source system
    key_col: str,
    track_cols: list,
    effective_date: str,
):
    """
    SCD Type 2: when dimension changes, close old row and open new one.
    Keeps full history of changes.
    """
    # Detect changed records
    changed = current_df.alias("curr").join(
        new_snapshot_df.alias("new"),
        on=key_col,
        how="inner"
    ).filter(
        " OR ".join([f"curr.{c} != new.{c}" for c in track_cols])
    )

    # Close old records
    closed = changed.select("curr.*").withColumn(
        "end_date", F.lit(effective_date)
    ).withColumn("is_current", F.lit(False))

    # Open new records
    new_records = changed.select("new.*").withColumn(
        "start_date", F.lit(effective_date)
    ).withColumn("end_date", F.lit(None)).withColumn("is_current", F.lit(True))

    return closed.union(new_records)
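The same SCD2 semantics in miniature, with plain dicts and illustrative field names, to show what the Spark version produces for a single changed key:

```python
def scd2_close_and_open(current_row: dict, new_row: dict, effective: str):
    """Close the existing dimension row; open a new current one."""
    closed = {**current_row, "end_date": effective, "is_current": False}
    opened = {**new_row, "start_date": effective, "end_date": None,
              "is_current": True}
    return closed, opened

curr = {"customer_id": 7, "city": "Berlin",
        "start_date": "2024-01-01", "end_date": None, "is_current": True}
new = {"customer_id": 7, "city": "Munich"}

closed, opened = scd2_close_and_open(curr, new, "2026-02-01")
# The old Berlin row is closed as of 2026-02-01; the Munich row becomes current.
```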

Data Modeling: Star vs Snowflake Schema

Concept             Star Schema                               Snowflake Schema
Structure           Fact table + denormalized dimensions      Fact table + normalized (hierarchical) dimensions
Query performance   Faster (fewer JOINs)                      Slower (more JOINs)
Storage             More (redundant data in dimensions)       Less (normalized)
Maintenance         Simpler                                   Complex (update propagation)
Use case            OLAP dashboards, ad-hoc queries           When dimension size matters, strict normalization
Examples            Redshift, BigQuery data marts             Operational data stores

Interview Questions to Expect

  • “How would you handle late-arriving data in a streaming pipeline?” — Use watermarks + allowed lateness in Flink/Spark Streaming; store in a corrections topic
  • “What is a data lakehouse?” — Combines data lake storage (S3/ADLS) with ACID transactions + SQL (Delta Lake, Apache Iceberg, Hudi)
  • “Explain partitioning strategy for a 100TB events table in Parquet” — Partition by date (daily); bucket by user_id within partitions; Z-order on frequently filtered columns (e.g. event_type) for multi-dimensional data skipping
  • “How do you detect and handle data quality issues in production?” — Schema validation, row count checks, null rate alerts, value distribution checks; dead-letter queue for failures
  • “Difference between Spark coalesce and repartition?” — coalesce: reduce partitions without shuffle; repartition: redistribute data with full shuffle; use coalesce to reduce, repartition to increase or rebalance
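The watermark idea from the first bullet, sketched in plain Python: events are accepted while they are within the allowed lateness of the maximum timestamp seen so far, and routed to a corrections path otherwise (simplified; real engines track watermarks per partition and per operator):

```python
def split_by_watermark(events, allowed_lateness_s=300):
    """events: iterable of (event_ts, payload) in arrival order.
    Returns (on_time, late): late events fall behind the watermark,
    i.e. max timestamp seen minus the allowed lateness."""
    max_ts = float("-inf")
    on_time, late = [], []
    for ts, payload in events:
        max_ts = max(max_ts, ts)
        watermark = max_ts - allowed_lateness_s
        if ts < watermark:
            late.append((ts, payload))    # route to corrections topic
        else:
            on_time.append((ts, payload))
    return on_time, late

stream = [(100, "a"), (500, "b"), (150, "c")]  # "c" arrives 350s behind the max
on_time, late = split_by_watermark(stream)
```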

Frequently Asked Questions

What is the difference between a data warehouse and a data lake?

A data warehouse stores structured, processed data in a schema-on-write model — data is transformed before loading (ETL). Optimized for BI queries with high performance on aggregations (Snowflake, Redshift, BigQuery). A data lake stores raw data in any format (structured, semi-structured, unstructured) in a schema-on-read model — transformation happens at query time (ETL vs ELT). Use a data lake for raw data retention and ML feature engineering; use a warehouse for business analytics with predictable query patterns.

What is Apache Spark and how does it differ from Hadoop MapReduce?

Apache Spark is an in-memory distributed computing engine. It builds a DAG (Directed Acyclic Graph) of transformations and optimizes the execution plan before running. Data is cached in memory between stages, making iterative algorithms (ML training) 10-100x faster than MapReduce, which writes intermediate results to HDFS between every map and reduce step. Spark supports batch, streaming (Structured Streaming), SQL (Spark SQL), ML (MLlib), and graph processing — MapReduce is batch only.

What is the difference between partitioning and bucketing in Hive/Spark?

Partitioning splits data into separate directories by a column value (e.g., date=2025-01-01/), allowing queries to skip entire partitions via partition pruning. Best for low-to-moderate-cardinality columns such as dates or categories — very high cardinality causes a small-file explosion. Bucketing distributes data into a fixed number of files using a hash of a column (typically high-cardinality, like user_id), ensuring rows with the same bucket key land in the same file. Bucketing enables efficient joins (bucket-merge join avoids shuffle) and sampling. Partition pruning reduces I/O; bucketing reduces shuffle in joins.

