Low Level Design: ETL Service

What Is an ETL Service?

Extract, Transform, Load (ETL) is the classic pattern for moving data from operational systems into analytical stores. Unlike a streaming pipeline, ETL typically runs in scheduled batches: extract a snapshot or delta from the source, apply business-logic transformations, and load the result into a warehouse or data mart. The challenge is doing this reliably, repeatably, and without overloading the source system.

Data Model

-- ETL job definitions
CREATE TABLE etl_jobs (
  id             BIGINT PRIMARY KEY AUTO_INCREMENT,
  name           VARCHAR(255) NOT NULL UNIQUE,
  schedule_cron  VARCHAR(100) NOT NULL,    -- e.g. 0 2 * * *
  source_dsn     TEXT NOT NULL,            -- connection string (encrypted)
  source_query   TEXT NOT NULL,            -- parameterized SQL or API spec
  transform_ref  VARCHAR(255) NOT NULL,    -- class or script name
  target_table   VARCHAR(255) NOT NULL,
  load_strategy  ENUM('full', 'incremental', 'upsert') DEFAULT 'incremental',
  watermark_col  VARCHAR(100),             -- for incremental: updated_at
  last_watermark TIMESTAMP,
  enabled        BOOLEAN DEFAULT TRUE
);

-- Run history
CREATE TABLE etl_runs (
  id          BIGINT PRIMARY KEY AUTO_INCREMENT,
  job_id      BIGINT NOT NULL REFERENCES etl_jobs(id),
  run_start   TIMESTAMP NOT NULL,
  run_end     TIMESTAMP,
  rows_read   BIGINT DEFAULT 0,
  rows_loaded BIGINT DEFAULT 0,
  status      ENUM('running', 'success', 'failed', 'skipped') DEFAULT 'running',
  error_msg   TEXT,
  watermark_snapshot TIMESTAMP   -- value of watermark_col at run start
);

Core Workflow

  1. Extract: The scheduler triggers a job at the cron time. The extractor reads the last watermark from etl_jobs.last_watermark, issues a parameterized query (WHERE updated_at > :last_watermark), and streams results into local temp storage (disk-spilled if >100 MB). A snapshot of the current watermark is saved to etl_runs.watermark_snapshot before reading begins to avoid losing records written during the run.
  2. Transform: Records pass through a registered transform class. Typical operations: join with dimension tables (cached in memory for small dims, queried in batches for large ones), compute derived columns, normalize units, apply business rules. Output schema is validated against the target table DDL.
  3. Load: Rows are bulk-loaded into the warehouse. For full loads, swap a staging table into place atomically (RENAME TABLE or BigQuery table swap). For incremental/upsert loads, use the warehouse's native merge statement. On success, update etl_jobs.last_watermark to the snapshot value.

Failure Handling

  • Extract failure: Log the error, leave last_watermark unchanged, mark run as failed. The next scheduled run re-extracts from the same watermark — no data loss.
  • Transform failure: Bad rows are written to a quarantine table with the original payload and error message. The job does not fail; a separate alert fires when quarantine row count exceeds a threshold.
  • Load failure: Staging data is preserved for manual inspection. last_watermark is not advanced. Ops can replay the load step without re-extracting.
  • Idempotency: Upsert loads are naturally idempotent. Full loads use staging-table swap, which is atomic. Never advance the watermark until the load succeeds.

Scalability Considerations

  • Parallelism: Large extracts split by a partition key (e.g., date range or shard ID) and run concurrently. The coordinator manages a semaphore to cap source DB connections.
  • Source protection: Use read replicas for extraction. Rate-limit queries with a configurable max rows/second to avoid replication lag spikes.
  • Warehouse optimization: Load into partitioned/clustered tables. Use columnar formats (Parquet, ORC) for staging files on object storage when the warehouse supports external table loads.
  • Metadata store: A lightweight relational DB (PostgreSQL) tracks job state. At scale, add a Redis cache for scheduler hot-path reads to avoid thundering-herd on job-list queries.

Summary

A robust ETL service centers on three guarantees: safe watermark management (never advance until load succeeds), idempotent loads (upsert or atomic table swap), and source protection (replicas + rate limits). The transform step should be a pure function with a quarantine path for bad rows rather than a hard stop. Keep run history in a metadata store and expose metrics for row counts and lag to catch silent failures before they affect downstream consumers.

See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: LinkedIn Interview Guide 2026: Social Graph Engineering, Feed Ranking, and Professional Network Scale

Scroll to Top