Question 1

What validation steps should a data import service perform before persisting any records?

Accepted Answer

Validation occurs in two phases. Structural validation checks file format, encoding (UTF-8), delimiter consistency, header presence, and row count limits without parsing data values. Semantic validation checks each row: required fields present, data types match schema, string lengths within bounds, enum values in allowed set, foreign key references exist, and uniqueness constraints not violated within the batch. Reject the entire file on structural failure; for semantic errors, either reject the whole import or collect per-row errors and return a detailed error report, depending on the product contract.
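As a rough illustration of the two phases, here is a minimal Python sketch. The schema, the field names (external_id, quantity, status), and the row limit are hypothetical and exist only for this example. Foreign key checks are left out because they need a database lookup.

```python
import csv
import io

# Hypothetical schema for illustration only; a real service would load this from config.
SCHEMA = {
    "external_id": {"type": str, "required": True, "max_len": 32},
    "quantity": {"type": int, "required": True},
    "status": {"type": str, "required": True, "enum": {"active", "inactive"}},
}
MAX_ROWS = 1_000_000  # assumed limit


def validate_structure(raw: bytes) -> list[str]:
    """Phase 1: encoding, header presence, column-count consistency, row limit."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    errors = []
    missing = set(SCHEMA) - set(header)
    if missing:
        errors.append(f"missing header columns: {sorted(missing)}")
    for row_no, row in enumerate(reader, start=1):
        if row_no > MAX_ROWS:
            errors.append(f"row count exceeds limit of {MAX_ROWS}")
            break
        if len(row) != len(header):
            errors.append(f"row {row_no}: expected {len(header)} columns, got {len(row)}")
    return errors


def validate_rows(raw: bytes) -> list[tuple[int, str]]:
    """Phase 2: per-row checks; returns (row_number, message) pairs."""
    errors = []
    seen_ids = set()
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    for row_no, row in enumerate(reader, start=1):
        for field, rule in SCHEMA.items():
            value = (row.get(field) or "").strip()
            if not value:
                if rule.get("required"):
                    errors.append((row_no, f"{field}: required"))
                continue
            if rule["type"] is int:
                try:
                    int(value)
                except ValueError:
                    errors.append((row_no, f"{field}: not an integer"))
            if "max_len" in rule and len(value) > rule["max_len"]:
                errors.append((row_no, f"{field}: longer than {rule['max_len']}"))
            if "enum" in rule and value not in rule["enum"]:
                errors.append((row_no, f"{field}: value not allowed"))
        ext_id = (row.get("external_id") or "").strip()
        if ext_id:
            if ext_id in seen_ids:
                errors.append((row_no, "external_id: duplicate within batch"))
            seen_ids.add(ext_id)
    return errors
```

A caller would reject the file outright if validate_structure returns anything, and would otherwise either reject on the first entry from validate_rows or return the full list as the error report, depending on the product contract. This sketch reads the whole file into memory for brevity; a production version would run the same checks over a stream.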

Question 2

How do you stream-parse a multi-gigabyte CSV import file without exhausting server memory?

Accepted Answer

Read the file from object storage with a streaming HTTP GET (range requests or chunked download) and pipe the bytes through a streaming CSV parser (e.g., Papa Parse in streaming mode, Python's csv.reader over a buffered reader, or Apache Commons CSV). Process rows one at a time or in small batches (e.g., 500 rows), accumulating them in a write buffer. Flush the buffer to the database with a bulk INSERT whenever it fills, and once more at end of file for the final partial batch. This keeps memory usage at O(batch_size) regardless of file size. Record the offset of the last successfully committed row so a retry can resume from there after a crash.
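The batching loop might look like the following sketch, which uses Python's csv.reader and the standard-library sqlite3 module as a stand-in database. The items and import_jobs tables, the two-column row shape, and the single job with id 1 are assumptions made for the example. The stream argument can be any text file object, for instance a wrapper around a chunked download.

```python
import csv
import io
import sqlite3
from itertools import islice
from typing import TextIO

BATCH_SIZE = 500


def import_stream(
    conn: sqlite3.Connection,
    stream: TextIO,
    start_offset: int = 0,
    batch_size: int = BATCH_SIZE,
) -> int:
    """Insert rows in batches; return the number of data rows committed so far."""
    reader = csv.reader(stream)
    next(reader, None)  # skip the header row
    # Skip rows that an earlier attempt already committed.
    rows = islice(reader, start_offset, None)
    offset = start_offset
    while True:
        batch = list(islice(rows, batch_size))  # at most batch_size rows in memory
        if not batch:
            return offset
        # One transaction per batch: commits on success, rolls back on error,
        # so the stored offset never runs ahead of the inserted data.
        with conn:
            conn.executemany(
                "INSERT INTO items (external_id, name) VALUES (?, ?)", batch
            )
            conn.execute(
                "UPDATE import_jobs SET row_offset = ? WHERE id = 1",
                (offset + len(batch),),
            )
        offset += len(batch)
```

Because the offset update shares a transaction with the inserts, a crash between batches leaves the job row pointing at exactly the rows that were written. Resuming by skipping rows still re-reads the start of the file; resuming by byte offset with a range request would avoid that, at the cost of tracking byte positions.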

Question 3

How do you implement idempotent upserts in a data import pipeline to handle duplicate submissions safely?

Accepted Answer

Assign each import job a client-provided or server-generated idempotency key stored in an imports table. On re-submission with the same key, return the original job result without reprocessing. For row-level idempotency, use a natural business key (e.g., external_id) and execute an INSERT ... ON CONFLICT (external_id) DO UPDATE SET ... with a condition that only updates when the new row differs (comparing a hash of column values). This ensures re-importing the same file produces identical state and retrying a partial failure is safe.
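A minimal sketch of both levels of idempotency follows, again using Python's sqlite3 module, which accepts the same ON CONFLICT ... DO UPDATE syntax as PostgreSQL when the bundled SQLite is version 3.24 or later. The products table, its columns, the content_hash helper, and the version counter are hypothetical. The version column is there only to make skipped updates observable.

```python
import hashlib
import sqlite3

SCHEMA = """
CREATE TABLE imports (
    id INTEGER PRIMARY KEY,
    idempotency_key TEXT NOT NULL UNIQUE,
    status TEXT NOT NULL DEFAULT 'PENDING'
);
CREATE TABLE products (
    external_id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    price TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    version INTEGER NOT NULL DEFAULT 1
);
"""

UPSERT_SQL = """
INSERT INTO products (external_id, name, price, content_hash)
VALUES (?, ?, ?, ?)
ON CONFLICT (external_id) DO UPDATE SET
    name = excluded.name,
    price = excluded.price,
    content_hash = excluded.content_hash,
    version = products.version + 1
WHERE products.content_hash <> excluded.content_hash
"""


def content_hash(name: str, price: str) -> str:
    """Hash the mutable columns; the unit separator avoids ambiguous joins."""
    return hashlib.sha256(f"{name}\x1f{price}".encode("utf-8")).hexdigest()


def get_or_create_job(conn: sqlite3.Connection, idempotency_key: str) -> int:
    """Return the job id for this key, creating the job only on first submission."""
    with conn:
        conn.execute(
            "INSERT INTO imports (idempotency_key) VALUES (?) "
            "ON CONFLICT (idempotency_key) DO NOTHING",
            (idempotency_key,),
        )
    row = conn.execute(
        "SELECT id FROM imports WHERE idempotency_key = ?", (idempotency_key,)
    ).fetchone()
    return row[0]


def upsert_rows(conn: sqlite3.Connection, rows: list[tuple[str, str, str]]) -> None:
    """Upsert (external_id, name, price) rows; unchanged rows are left untouched."""
    params = [(ext_id, name, price, content_hash(name, price)) for ext_id, name, price in rows]
    with conn:
        conn.executemany(UPSERT_SQL, params)
```

Calling upsert_rows twice with the same rows leaves every version at 1, while changing one row's price bumps only that row. In a real service, get_or_create_job would also return the stored result of a finished job instead of just its id.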

Question 4

How would you design the import job state machine to support pause, resume, and partial rollback?

Accepted Answer

Model the job with states: PENDING, VALIDATING, PROCESSING, PAUSED, COMPLETED, FAILED. Store the last committed batch offset in the jobs table. When a pause is requested, the worker finishes the current batch, commits it, and persists the offset before stopping; on failure, the in-flight batch transaction rolls back, so the stored offset still points at the last committed batch. On resume, a new worker picks up from the stored offset. For rollback, import into a staging table first and copy to the live table only after the entire file is validated. This makes rollback a simple DELETE from staging rather than a series of compensating deletes against production data. Wrap each batch in its own database transaction, using explicit savepoints where the database supports them.
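The state machine itself can be sketched in a few lines of Python. The transition table below is one plausible reading of the states listed above, not a definitive design. In particular, allowing FAILED to return to PROCESSING (a retry from the stored offset) is an assumption. ImportJob and InvalidTransition are hypothetical names, and persistence to the jobs table is omitted.

```python
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    PENDING = "PENDING"
    VALIDATING = "VALIDATING"
    PROCESSING = "PROCESSING"
    PAUSED = "PAUSED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Assumed transition table; COMPLETED is terminal.
ALLOWED = {
    JobState.PENDING: {JobState.VALIDATING, JobState.FAILED},
    JobState.VALIDATING: {JobState.PROCESSING, JobState.FAILED},
    JobState.PROCESSING: {JobState.PAUSED, JobState.COMPLETED, JobState.FAILED},
    JobState.PAUSED: {JobState.PROCESSING, JobState.FAILED},
    JobState.FAILED: {JobState.PROCESSING},  # retry resumes from the stored offset
    JobState.COMPLETED: set(),
}


class InvalidTransition(Exception):
    """Raised when a requested state change is not in the transition table."""


@dataclass
class ImportJob:
    state: JobState = JobState.PENDING
    committed_offset: int = 0  # rows committed so far; survives pause and failure

    def transition(self, new_state: JobState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise InvalidTransition(f"{self.state.name} -> {new_state.name}")
        self.state = new_state

    def commit_batch(self, row_count: int) -> None:
        """Advance the offset after a batch transaction commits."""
        if self.state is not JobState.PROCESSING:
            raise InvalidTransition(f"cannot commit a batch while {self.state.name}")
        self.committed_offset += row_count
```

A worker would call commit_batch only after its batch transaction commits and would check for a pause request between batches, so the offset read on resume always matches what is in the staging table.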

Data Import Service Low-Level Design: File Validation, Streaming Parse, and Idempotent Upsert

Import Job Schema

Upload Flow

Validation Phase

Row Processing

Deduplication

Row-Level Error Reporting

Error Threshold and Abort

Rollback Strategy

Progress Tracking and Notification