Data Lake Architecture Low-Level Design: Ingestion, Partitioning, Schema Registry, and Query Layer

A data lake is a centralized repository that stores structured and unstructured data at any scale. Unlike a data warehouse, a data lake preserves raw data in its native format and applies schema on read. The low-level design of a data lake covers the zone architecture, ingestion mechanisms, storage format, schema management, and the query layer.

Zone Architecture

Data moves through three zones, each stored in a separate S3 prefix (or equivalent object storage path):

  • Raw (landing) zone: Data arrives exactly as the source produced it — JSON, CSV, Avro, Parquet, binary logs. No transformation is applied. This zone is the source of truth and enables full reprocessing if downstream logic changes.
  • Cleansed zone: Raw data is validated, typed, deduplicated, and encoded in Parquet with a consistent schema. Malformed records are quarantined to a dead-letter prefix. This zone is queryable but reflects source semantics.
  • Curated zone: Business-level aggregates, joins, and derived datasets. These tables are optimized for analyst and ML consumption. Schemas are stable and documented. This zone has the highest query SLA.
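For illustration, the zone-and-date prefix convention described above can be captured in a small helper. The bucket layout and table name here are placeholders, not part of the design:

```python
from datetime import datetime

# Illustrative zone names matching the three-zone layout above.
ZONES = ("raw", "cleansed", "curated")

def zone_prefix(zone: str, table: str, when: datetime) -> str:
    """Build the S3 prefix for a table in a given zone, Hive-partitioned by date."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return (
        f"{zone}/{table}/"
        f"year={when.year}/month={when.month:02d}/day={when.day:02d}/"
    )

print(zone_prefix("raw", "clickstream", datetime(2024, 3, 15)))
# raw/clickstream/year=2024/month=03/day=15/
```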

Ingestion: Batch and Streaming

Batch ingestion: Source systems write files to the raw zone via S3 PUT (direct upload, SFTP sync, or scheduled export). A manifest file or S3 event notification triggers the cleansing job. Typical cadence: hourly or daily.

Streaming ingestion: Events flow from producers through Amazon Kinesis Data Streams, which buffers and delivers to Kinesis Data Firehose. Firehose batches records into S3 objects (by size or time window, e.g., 128 MB or 60 seconds) and writes them to the raw zone in Parquet or JSON. With a 60-second buffer window, data is typically queryable within a minute or two of the originating event.
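Firehose's size-or-time flush rule can be sketched as a minimal in-memory buffer. The thresholds and class name are illustrative — real Firehose buffering is managed by AWS, not application code:

```python
import time

class SizeTimeBuffer:
    """Sketch of Firehose-style buffering: flush when either the byte
    threshold or the time window is hit, whichever comes first.
    Defaults mirror the example thresholds (128 MB / 60 s)."""

    def __init__(self, max_bytes=128 * 1024 * 1024, max_seconds=60, clock=time.monotonic):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.clock = clock
        self.records = []
        self.size = 0
        self.opened_at = None

    def add(self, record: bytes):
        """Append a record; return a flushed batch if a threshold was crossed, else None."""
        if self.opened_at is None:
            self.opened_at = self.clock()
        self.records.append(record)
        self.size += len(record)
        if self.size >= self.max_bytes or self.clock() - self.opened_at >= self.max_seconds:
            return self.flush()
        return None

    def flush(self):
        """Emit the current batch and reset size and window state."""
        batch, self.records, self.size, self.opened_at = self.records, [], 0, None
        return batch
```

Injecting `clock` makes the time-window branch testable without real waiting.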

Storage Format: Parquet with Snappy Compression

Parquet is a columnar format: all values for a column are stored together. This enables column pruning (only the columns the query references are read from storage) and vectorized execution (column batches processed with SIMD). Snappy compression reduces storage cost and I/O volume with fast compression/decompression suited to query-time use.

Parquet is the standard format for all zones except raw (which preserves the source format). The default row group size of 128 MB balances metadata overhead against read parallelism.

Hive-Style Partitioning

Data is organized under S3 keys in the pattern s3://bucket/table/year=YYYY/month=MM/day=DD/. Query engines (Athena, Spark) use partition pruning: a query with WHERE event_date = '2024-03-15' scans only the day=15 partition, skipping all other partitions entirely. This is the single most impactful optimization for large datasets — a query that would scan 365 daily partitions is reduced to a single partition scan.

Over-partitioning (e.g., partitioning by hour and user_id) creates millions of small files, which degrades performance. Partition columns should have low cardinality (date, region, event_type) and be used in frequent WHERE filters.
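A minimal sketch of what partition pruning does, assuming keys follow the Hive-style k=v layout above. Real engines (Athena, Spark) read partition values from the catalog rather than parsing S3 keys, so this is a conceptual model only:

```python
def partition_values(key: str) -> dict:
    """Extract k=v partition segments from a Hive-style S3 key."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)

def prune(keys, **predicate):
    """Keep only keys whose partition values match every predicate column."""
    return [
        k for k in keys
        if all(partition_values(k).get(col) == val for col, val in predicate.items())
    ]

keys = [
    "events/year=2024/month=03/day=14/part-0.parquet",
    "events/year=2024/month=03/day=15/part-0.parquet",
    "events/year=2024/month=04/day=01/part-0.parquet",
]
# Equivalent of WHERE year='2024' AND month='03' AND day='15':
print(prune(keys, year="2024", month="03", day="15"))
```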

Schema Registry

Schema evolution (adding fields, changing types) must be managed to prevent breaking downstream consumers. The schema registry stores versioned Avro or Protobuf schemas per topic:

  • Backward compatibility: new schema can read data written by old schema. Achieved by adding fields with defaults, never removing fields.
  • Forward compatibility: old schema can read data written by new schema. Achieved by ignoring unknown fields.
  • Full compatibility: both backward and forward. The safest policy for shared datasets.

AWS Glue Schema Registry or Confluent Schema Registry enforces compatibility rules at write time, rejecting incompatible schema versions.
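The compatibility rules above can be sketched as a simple check over Avro-style record schemas, following the document's stated rule of "add with defaults, never remove." The field dictionaries are simplified assumptions, not the full Avro schema model:

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """True if the new schema can read data written with the old schema,
    under the rules: new fields must carry defaults; no field is removed."""
    old_fields = {f["name"] for f in old_schema["fields"]}
    for field in new_schema["fields"]:
        if field["name"] not in old_fields and "default" not in field:
            return False  # new required field: old data has no value for it
    new_fields = {f["name"] for f in new_schema["fields"]}
    return old_fields <= new_fields  # no field removed

v1 = {"fields": [{"name": "id"}, {"name": "ts"}]}
v2 = {"fields": [{"name": "id"}, {"name": "ts"}, {"name": "region", "default": "us"}]}
v3 = {"fields": [{"name": "id"}]}  # drops ts

print(is_backward_compatible(v1, v2))  # True  (field added with default)
print(is_backward_compatible(v1, v3))  # False (field removed)
```

A real registry (Glue, Confluent) runs a check of this shape on every schema submission and rejects the version before any producer writes with it.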

Data Catalog

The catalog stores table metadata: S3 path, partition keys, column names and types, schema version, and row count. AWS Glue Data Catalog (Hive Metastore-compatible) is the standard catalog for S3-based lakes. Glue crawlers scan S3 prefixes, infer schemas from Parquet footers, and register tables automatically. The catalog makes tables discoverable to Athena, Spark, and Redshift Spectrum without manual DDL.

Query Layer

Athena is a serverless SQL engine that queries S3 directly using Presto/Trino. It is billed by the amount of data scanned, which makes partition pruning and Parquet column projection financially critical. Athena suits ad-hoc analyst queries, with no infrastructure to manage.

Spark (EMR or Databricks) is used for large batch transforms, ML feature engineering, and jobs requiring stateful processing, custom UDFs, or iterative computation not expressible in SQL.

SQL Schema (Metadata)


CREATE TABLE DataLakeTable (
    id              BIGSERIAL PRIMARY KEY,
    database_name   TEXT NOT NULL,
    table_name      TEXT NOT NULL,
    s3_path         TEXT NOT NULL,
    partition_keys  TEXT[] NOT NULL DEFAULT '{}',
    schema_version  INTEGER NOT NULL DEFAULT 1,
    row_count       BIGINT,
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (database_name, table_name)
);

CREATE TABLE SchemaVersion (
    id              BIGSERIAL PRIMARY KEY,
    topic           TEXT NOT NULL,
    version         INTEGER NOT NULL,
    schema_json     JSONB NOT NULL,
    compatibility   TEXT NOT NULL DEFAULT 'BACKWARD', -- BACKWARD | FORWARD | FULL | NONE
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (topic, version)
);

CREATE TABLE IngestionJob (
    id                BIGSERIAL PRIMARY KEY,
    source            TEXT NOT NULL,
    destination_zone  TEXT NOT NULL,   -- raw | cleansed | curated
    rows_ingested     BIGINT NOT NULL DEFAULT 0,
    bytes_written     BIGINT NOT NULL DEFAULT 0,
    status            TEXT NOT NULL DEFAULT 'RUNNING',
    job_at            TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
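The upsert semantics implied by the UNIQUE (database_name, table_name) constraint can be demonstrated with SQLite standing in for PostgreSQL (types simplified; BIGSERIAL becomes INTEGER PRIMARY KEY AUTOINCREMENT, and the helper name is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DataLakeTable (
        id             INTEGER PRIMARY KEY AUTOINCREMENT,
        database_name  TEXT NOT NULL,
        table_name     TEXT NOT NULL,
        s3_path        TEXT NOT NULL,
        schema_version INTEGER NOT NULL DEFAULT 1,
        UNIQUE (database_name, table_name)
    )
""")

def register_table(db: str, table: str, s3_path: str, version: int = 1):
    """Insert a catalog entry, or update it in place on re-registration."""
    conn.execute(
        """INSERT INTO DataLakeTable (database_name, table_name, s3_path, schema_version)
           VALUES (?, ?, ?, ?)
           ON CONFLICT (database_name, table_name)
           DO UPDATE SET s3_path = excluded.s3_path,
                         schema_version = excluded.schema_version""",
        (db, table, s3_path, version),
    )

register_table("analytics", "events", "s3://my-data-lake/cleansed/events/")
register_table("analytics", "events", "s3://my-data-lake/cleansed/events/", version=2)
row = conn.execute(
    "SELECT schema_version FROM DataLakeTable "
    "WHERE database_name = 'analytics' AND table_name = 'events'"
).fetchone()
print(row[0])  # 2 — second registration updated, not duplicated
```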

Python Implementation


import json, boto3, hashlib
from datetime import datetime

s3 = boto3.client("s3")
athena = boto3.client("athena")
glue = boto3.client("glue")

BUCKET = "my-data-lake"
RESULT_BUCKET = "my-athena-results"

def ingest_batch(source_config: dict, destination_zone: str) -> dict:
    """Read records from source and write Parquet to the appropriate zone."""
    source_name = source_config["name"]
    records = source_config.get("records", [])
    today = datetime.utcnow()
    s3_key = (
        f"{destination_zone}/{source_name}/"
        f"year={today.year}/month={today.month:02d}/day={today.day:02d}/"
        f"batch_{today.strftime('%H%M%S')}.json"
    )
    body = "n".join(json.dumps(r) for r in records)
    s3.put_object(Bucket=BUCKET, Key=s3_key, Body=body.encode())
    return {"s3_path": f"s3://{BUCKET}/{s3_key}", "rows": len(records)}

def register_schema(topic: str, schema_json: dict, compatibility: str = "BACKWARD") -> int:
    """Register a new schema version for a topic."""
    schema_str = json.dumps(schema_json, sort_keys=True)
    version_hash = hashlib.sha256(schema_str.encode()).hexdigest()[:8]
    print(f"Registering schema for {topic} (hash={version_hash}, compat={compatibility})")
    # In production: call Glue Schema Registry or Confluent Registry API
    return 1

def query_table(sql: str, result_bucket: str = RESULT_BUCKET) -> str:
    """Submit an Athena query and return the query execution ID."""
    response = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": f"s3://{result_bucket}/athena-results/"},
        QueryExecutionContext={"Database": "default"}
    )
    execution_id = response["QueryExecutionId"]
    print(f"Athena query submitted: {execution_id}")
    return execution_id

Frequently Asked Questions

What is the purpose of the three-zone architecture in a data lake?

The raw zone preserves source data exactly as received, enabling full reprocessing when transformation logic changes. The cleansed zone provides validated, typed, consistent data for general consumption without re-reading raw files. The curated zone delivers pre-aggregated, join-resolved datasets optimized for business analysts and ML pipelines. Each zone represents a higher level of trust and a narrower schema, with the raw zone as the immutable source of truth.

How does schema evolution maintain compatibility between producers and consumers?

The schema registry enforces compatibility rules at write time. Backward compatibility (new schema reads old data) is achieved by adding fields with defaults and never removing required fields. Forward compatibility (old schema reads new data) requires that consumers ignore unknown fields. Full compatibility requires both. The registry rejects a new schema version that violates the configured compatibility rule before any data is written with that schema, preventing silent data corruption.

Why is Parquet preferred over ORC for S3-based data lakes?

Both are columnar formats with similar performance characteristics. Parquet has broader ecosystem support: Spark, Athena, Hive, Flink, DuckDB, Arrow, and Pandas all read Parquet natively. ORC has stronger Hive integration and slightly better compression for string-heavy data. For new data lakes with multi-tool access, Parquet is the safer default. The choice matters less than ensuring columnar format is used at all — both dramatically outperform row-oriented formats (JSON, CSV) for analytical workloads.

How does partition pruning reduce query cost in Athena?

Athena charges per byte scanned. Without partitioning, every query scans the entire table. With Hive-style date partitioning, a query filtering on a single day scans only that day's files — 1/365 of a year's data. Athena reads partition metadata from the Glue catalog without touching S3 data files, then issues S3 GetObject requests only for matching partitions. For a 10 TB annual dataset, a single-day query scans roughly 27 GB instead of 10 TB — a ~365x reduction in scanned bytes and cost.
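The scan-reduction arithmetic in the answer above checks out directly (using decimal TB, as the figures imply):

```python
annual_gb = 10_000        # 10 TB (decimal) in GB
partitions = 365          # one partition per day
per_day_gb = annual_gb / partitions
reduction = annual_gb / per_day_gb

print(f"{per_day_gb:.1f} GB per day, {reduction:.0f}x reduction")
# 27.4 GB per day, 365x reduction
```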


How is ingestion idempotency guaranteed in batch pipelines?

Each batch job writes to a deterministic S3 prefix keyed by source, batch ID, and timestamp. Re-running the job overwrites the same path rather than duplicating data, and Glue catalog partition registration uses create-or-replace semantics, so a retried batch leaves exactly one copy of its output.

See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence
