What are the main design considerations for building a scalable ETL service?

A scalable ETL service must handle variable data volumes through horizontal scaling, support fault tolerance with restartable jobs, enforce schema validation and data quality checks during transformation, and provide observability through metrics and logging. Decoupling extract, transform, and load stages allows each to scale independently and fail without cascading failures.

How do you handle schema evolution in an ETL pipeline?

Schema evolution is managed by using a schema registry (e.g., Confluent Schema Registry) with compatibility policies such as backward or forward compatibility. ETL transforms should be written defensively to handle missing or added fields. Versioned schemas and migration scripts allow gradual rollouts without breaking existing consumers or requiring full pipeline restarts.

How would you design an ETL service to be idempotent?

Idempotency in an ETL service is achieved by assigning deterministic job IDs, using upsert (insert-or-update) semantics when writing to the destination, and tracking processed record checksums or offsets in a state store. Re-running the same ETL job for the same input data should produce identical output without duplication or data corruption.

What monitoring and alerting should be in place for an ETL service in production?

Production ETL services should monitor job success/failure rates, row counts at each stage (to detect data loss or duplication), processing latency, and resource utilization (CPU, memory, I/O). Alerts should fire on job failures, SLA breaches (jobs running longer than expected), data volume anomalies, and schema validation errors. A data quality dashboard helps detect silent data corruption early.

Low Level Design: ETL Service

⏱ 4 min read

What Is an ETL Service?

Extract, Transform, Load (ETL) is the classic pattern for moving data from operational systems into analytical stores. Unlike a streaming pipeline, ETL typically runs in scheduled batches: extract a snapshot or delta from the source, apply business-logic transformations, and load the result into a warehouse or data mart. The challenge is doing this reliably, repeatably, and without overloading the source system.

Data Model

-- ETL job definitions
CREATE TABLE etl_jobs (
  id             BIGINT PRIMARY KEY AUTO_INCREMENT,
  name           VARCHAR(255) NOT NULL UNIQUE,
  schedule_cron  VARCHAR(100) NOT NULL,    -- e.g. 0 2 * * *
  source_dsn     TEXT NOT NULL,            -- connection string (encrypted)
  source_query   TEXT NOT NULL,            -- parameterized SQL or API spec
  transform_ref  VARCHAR(255) NOT NULL,    -- class or script name
  target_table   VARCHAR(255) NOT NULL,
  load_strategy  ENUM('full', 'incremental', 'upsert') DEFAULT 'incremental',
  watermark_col  VARCHAR(100),             -- for incremental: updated_at
  last_watermark TIMESTAMP,
  enabled        BOOLEAN DEFAULT TRUE
);

-- Run history
CREATE TABLE etl_runs (
  id          BIGINT PRIMARY KEY AUTO_INCREMENT,
  job_id      BIGINT NOT NULL REFERENCES etl_jobs(id),
  run_start   TIMESTAMP NOT NULL,
  run_end     TIMESTAMP,
  rows_read   BIGINT DEFAULT 0,
  rows_loaded BIGINT DEFAULT 0,
  status      ENUM('running', 'success', 'failed', 'skipped') DEFAULT 'running',
  error_msg   TEXT,
  watermark_snapshot TIMESTAMP   -- value of watermark_col at run start
);

Core Workflow

Extract: The scheduler triggers a job at the cron time. The extractor reads the last watermark from etl_jobs.last_watermark, issues a parameterized query (WHERE updated_at > :last_watermark), and streams results into local temp storage (disk-spilled if >100 MB). A snapshot of the current watermark is saved to etl_runs.watermark_snapshot before reading begins to avoid losing records written during the run.
Transform: Records pass through a registered transform class. Typical operations: join with dimension tables (cached in memory for small dims, queried in batches for large ones), compute derived columns, normalize units, apply business rules. Output schema is validated against the target table DDL.
Load: Rows are bulk-loaded into the warehouse. For full loads, swap a staging table into place atomically (RENAME TABLE or BigQuery table swap). For incremental/upsert loads, use the warehouse's native merge statement. On success, update etl_jobs.last_watermark to the snapshot value.

Failure Handling

Extract failure: Log the error, leave last_watermark unchanged, mark run as failed. The next scheduled run re-extracts from the same watermark — no data loss.
Transform failure: Bad rows are written to a quarantine table with the original payload and error message. The job does not fail; a separate alert fires when quarantine row count exceeds a threshold.
Load failure: Staging data is preserved for manual inspection. last_watermark is not advanced. Ops can replay the load step without re-extracting.
Idempotency: Upsert loads are naturally idempotent. Full loads use staging-table swap, which is atomic. Never advance the watermark until the load succeeds.

Scalability Considerations

Parallelism: Large extracts split by a partition key (e.g., date range or shard ID) and run concurrently. The coordinator manages a semaphore to cap source DB connections.
Source protection: Use read replicas for extraction. Rate-limit queries with a configurable max rows/second to avoid replication lag spikes.
Warehouse optimization: Load into partitioned/clustered tables. Use columnar formats (Parquet, ORC) for staging files on object storage when the warehouse supports external table loads.
Metadata store: A lightweight relational DB (PostgreSQL) tracks job state. At scale, add a Redis cache for scheduler hot-path reads to avoid thundering-herd on job-list queries.

Summary

A robust ETL service centers on three guarantees: safe watermark management (never advance until load succeeds), idempotent loads (upsert or atomic table swap), and source protection (replicas + rate limits). The transform step should be a pure function with a quarantine path for bad rows rather than a hard stop. Keep run history in a metadata store and expose metrics for row counts and lag to catch silent failures before they affect downstream consumers.