Data Lake vs. Data Warehouse
A data warehouse stores structured, schema-on-write data optimized for SQL analytics (Redshift, BigQuery, Snowflake). A data lake stores raw data in any format (CSV, JSON, Parquet, images, logs) in cheap object storage (S3, GCS), with schema applied at query time (schema-on-read). The lakehouse architecture combines both: raw data in object storage with a transactional table format (Delta Lake, Apache Iceberg, Apache Hudi) that adds ACID transactions, schema enforcement, and efficient query performance on top of the data lake. Modern stacks often use a lakehouse for both analytical SQL queries and ML feature pipelines.
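The schema-on-write vs. schema-on-read distinction can be made concrete with a small sketch. Here, hypothetical raw JSON events land in the lake as-is, and a schema is applied only at query time; the field names and sample records are illustrative, not from any real dataset.

```python
import json

# Raw events landed in the lake as JSON lines (schema-on-read): nothing is
# validated at write time, so records with missing or extra fields are accepted.
raw_lines = [
    '{"user_id": 1, "event": "click", "ts": "2026-04-17T10:00:00Z"}',
    '{"user_id": 2, "event": "view"}',  # missing ts -- still landed successfully
]

def read_with_schema(lines, schema):
    """Apply a schema at query time: project known columns, default missing ones to None."""
    for line in lines:
        record = json.loads(line)
        yield {col: record.get(col) for col in schema}

rows = list(read_with_schema(raw_lines, schema=["user_id", "event", "ts"]))
```

A warehouse would instead reject the second record at load time (schema-on-write); the lake defers that decision to each reader.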
Ingestion Layer
Three ingestion patterns:

- Batch ingestion: nightly or hourly ETL jobs pull data from operational databases (full extracts, or CDC — Change Data Capture — via Debezium), transform it, and write to the lake. Tools: Apache Spark, AWS Glue, dbt. Partition by date: s3://datalake/events/year=2026/month=04/day=17/.
- Streaming ingestion: events flow through Kafka → Flink/Spark Streaming → land in the lake in micro-batches (every 1-5 minutes), enabling near-real-time analytics. Run periodic compaction jobs to merge the many small files this produces (the “small files problem”) into larger Parquet files for efficient reads.
- CDC (Change Data Capture): Debezium reads the database’s replication log (MySQL binlog, PostgreSQL WAL) and publishes every INSERT/UPDATE/DELETE as an event to Kafka. A lake consumer applies these changes to a lakehouse table, maintaining an up-to-date replica of the operational database for analytics without adding load to production.
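The CDC pattern above can be sketched as a consumer applying Debezium-style change events to a replica table. The event shape (op, before, after) mirrors Debezium's payload structure; the table, keys, and values are illustrative assumptions.

```python
# Debezium-style change events: "c" = create, "u" = update, "d" = delete.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@example.com"},
                "after": {"id": 1, "email": "a@new.com"}},
    {"op": "d", "before": {"id": 1, "email": "a@new.com"}, "after": None},
]

def apply_cdc(replica: dict, event: dict) -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    if event["op"] in ("c", "u"):
        # Insert/update: upsert the after-image of the row.
        row = event["after"]
        replica[row["id"]] = row
    elif event["op"] == "d":
        # Delete: remove by primary key from the before-image.
        replica.pop(event["before"]["id"], None)

replica: dict = {}
for e in events:
    apply_cdc(replica, e)
# The row was inserted, updated, then deleted, so the replica is empty again.
```

A real lake consumer would write these upserts/deletes to a Delta or Iceberg table (e.g. via MERGE) rather than a dict, but the apply logic is the same.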
Storage Format: Parquet and Delta Lake
Parquet is the standard columnar storage format for analytics: column-oriented (reads only the columns a query needs), compressed per column (SNAPPY or ZSTD), with per-row-group column statistics (min, max, null count) in the metadata for predicate pushdown. A query with WHERE event_date = '2026-04-17' skips entire row groups whose max event_date is below '2026-04-17'.

Delta Lake adds on top of Parquet:

- Transaction log: every write appends a JSON entry to the _delta_log/ directory describing the operation (files added/removed). Readers replay the log to construct the current table state.
- ACID transactions: concurrent writes use optimistic concurrency — two writes to the same table both succeed if they modify different partitions; otherwise one conflicts and retries.
- Time travel: query any previous version of the table: SELECT * FROM events VERSION AS OF 100 or TIMESTAMP AS OF '2026-04-01'.
- Schema evolution: add columns without rewriting existing files — old files return NULL for the new column.
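The transaction-log mechanism can be sketched in a few lines: a reader replays add/remove actions in commit order to recover the set of live data files, and time travel is just replaying a prefix of the log. The file names and commit contents are made up for illustration.

```python
# Simplified Delta-style commit log: each commit is a list of add/remove
# actions over Parquet data files, in version order.
commits = [
    [{"add": "part-000.parquet"}, {"add": "part-001.parquet"}],     # version 0
    [{"remove": "part-000.parquet"}, {"add": "part-002.parquet"}],  # version 1 (e.g. compaction)
]

def replay(log):
    """Replay commits in order; the surviving set is the current table state."""
    live = set()
    for commit in log:
        for action in commit:
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
    return live

live_files = replay(commits)            # current state (version 1)
version_0 = replay(commits[:1])         # "time travel": state as of version 0
```

Real Delta checkpoints periodically snapshot this state so readers need not replay from version 0 on every query.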
Query Engine and Metastore
The metastore (Hive Metastore, AWS Glue Catalog) stores table schemas, partition locations, and statistics. Query engines (Trino, Spark SQL, Athena, BigQuery Omni) consult the metastore to understand the table structure, then scan only the relevant Parquet files using partition pruning and predicate pushdown. Query optimization techniques:

- Partition pruning: partition the table by date; queries filtering on date skip all other partitions.
- Data skipping: Delta Lake stores min/max statistics per file; queries skip files that cannot contain matching rows.
- Z-ordering: co-locate related data (e.g., Z-order by user_id) so that queries filtering on user_id read fewer files.
- Caching: Alluxio or local SSD caches hot data near compute nodes, reducing S3 reads for repeated queries.
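File-level data skipping reduces to a range check: a file whose [min, max] statistics cannot contain the predicate value is never opened. A minimal sketch, with invented file names and statistics:

```python
from datetime import date

# Per-file min/max statistics for the filter column, as Delta Lake stores
# per data file (and Parquet stores per row group). Values are illustrative.
files = [
    {"path": "f1.parquet", "min_date": date(2026, 4, 1),  "max_date": date(2026, 4, 10)},
    {"path": "f2.parquet", "min_date": date(2026, 4, 11), "max_date": date(2026, 4, 20)},
    {"path": "f3.parquet", "min_date": date(2026, 4, 21), "max_date": date(2026, 4, 30)},
]

def files_to_scan(files, target: date):
    """Keep only files whose stats range could contain rows matching the filter."""
    return [f["path"] for f in files if f["min_date"] <= target <= f["max_date"]]

# WHERE event_date = '2026-04-17': only f2 survives pruning, f1 and f3 are skipped.
to_scan = files_to_scan(files, date(2026, 4, 17))
```

Z-ordering improves exactly this check for multi-column filters: clustering narrows each file's min/max ranges so more files can be eliminated.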
Data Governance and Lineage
Governance requirements at scale:

- Access control: column-level and row-level security (restrict PII columns to authorized users; restrict rows by region for GDPR). Enforce with Apache Ranger or AWS Lake Formation policies.
- Data lineage: track which upstream tables and jobs produced each table version (Apache Atlas, OpenLineage/Marquez). Required for debugging incorrect reports and for the GDPR “right to erasure” (identifying all downstream tables that contain a user’s data).
- Data quality: run Great Expectations or Deequ checks on each new batch (null rate, value ranges, referential integrity). Fail the pipeline and alert on quality violations rather than silently ingesting bad data.
- Catalog: a data catalog (DataHub, Amundsen) makes tables discoverable with business descriptions, ownership, and usage statistics — critical for large organizations where data consumers cannot know all available tables.
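The data-quality gating described above can be sketched without any framework: compute simple metrics on a new batch and fail fast on violations. The column names, thresholds, and sample rows are illustrative assumptions, not Great Expectations or Deequ API calls.

```python
def check_batch(rows, max_null_rate=0.01, amount_range=(0, 10_000)):
    """Return a list of quality violations; an empty list means the batch passes."""
    failures = []
    if rows:
        # Null-rate check on a required key column.
        null_count = sum(1 for r in rows if r.get("user_id") is None)
        null_rate = null_count / len(rows)
        if null_rate > max_null_rate:
            failures.append(f"user_id null rate {null_rate:.2%} exceeds {max_null_rate:.2%}")
        # Value-range check on a numeric column.
        lo, hi = amount_range
        if any(not (lo <= r["amount"] <= hi)
               for r in rows if r.get("amount") is not None):
            failures.append(f"amount outside expected range [{lo}, {hi}]")
    return failures

batch = [{"user_id": 1, "amount": 50}, {"user_id": None, "amount": 20_000}]
problems = check_batch(batch)
# In a real pipeline: if problems is non-empty, raise to fail the job and alert,
# so the bad batch never lands in the lake.
```

The same pattern extends to referential-integrity checks (every foreign key resolves against the dimension table) before the batch is committed.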