Data Import Service Low-Level Design: File Validation, Streaming Parse, and Idempotent Upsert

Data Import Service Low-Level Design

A data import service ingests user-supplied files into the system's database. The main challenges are: handling large files without memory exhaustion, validating data quality before committing, deduplicating records on retry, and giving users actionable error feedback without hiding problems.

Import Job Schema

  • job_id — UUID
  • user_id — owner
  • file_s3_key — location of the uploaded file
  • entity_type — target table or entity (e.g., products, contacts)
  • mapping_config — JSON: maps file columns to database fields
  • statusUPLOADEDVALIDATINGPROCESSINGCOMPLETED | FAILED
  • total_rows, processed_rows, error_rows, created_at

Upload Flow

Never upload files through the application server — it adds latency and consumes server memory. Instead:

  1. Client requests a presigned S3 PUT URL from the API
  2. Client uploads the file directly to S3 (bypasses the application server entirely)
  3. Client notifies the API of upload completion, providing the S3 key
  4. API creates the import job record and enqueues processing

Validation Phase

Before processing any data rows, stream-parse the file to validate structure:

  • Check that required column headers are present and correctly named
  • Verify the file is parseable (encoding, delimiter, valid JSON if applicable)
  • Validate the mapping_config references columns that exist in the file

Structural errors are reported immediately with status FAILED before any rows are written. This saves time when the user uploaded the wrong file.

Row Processing

Stream rows from S3 in chunks of 1,000. For each chunk:

  1. Parse — apply column mapping, type coercion (string to int, date parsing)
  2. Validate — check required fields, value ranges, format constraints (email regex, positive integer)
  3. Dedup lookup — query by external_id
  4. Upsert — INSERT or UPDATE based on lookup result
  5. Collect errors — accumulate row-level errors without stopping

Deduplication

Each importable entity has an external_id field — a stable identifier from the user's source system. On every row, look up an existing record by external_id:

  • Found: UPDATE the existing record with new field values
  • Not found: INSERT a new record

This makes imports idempotent. Re-running the same file after a worker crash produces no duplicates — all UPDATEs on previously inserted rows are safe to repeat.

Row-Level Error Reporting

A row validation failure does not stop the import. The worker records the error with:

  • Row number in the original file
  • Field name that failed
  • Error message and the invalid value

Processing continues for subsequent rows. At completion, the job summary includes total errors and a downloadable error report CSV so users can fix only the failed rows and re-import.

Error Threshold and Abort

If the error rate exceeds 20% of processed rows, abort the import and mark it FAILED. A 20% error rate indicates a systemic problem — wrong file, wrong mapping, or corrupted data — and proceeding would import mostly bad data. Report the error rate and the first 50 errors so the user can diagnose the root cause.

Rollback Strategy

Two approaches with different tradeoffs:

  • Transaction rollback: wrap the entire import in a DB transaction; on critical failure, roll back all rows. Guarantees consistency but holds a long transaction, causing lock contention and replication lag for large imports.
  • Row-by-row commit with error accumulation: commit each chunk independently; on failure, report errors and leave successful rows in place. Better for large imports; user may need to re-import only failed rows.

For imports under 100K rows, use transaction rollback. For larger imports, use row-by-row commit with the error threshold abort as a safeguard.

Progress Tracking and Notification

Update job.processed_rows every 1,000 rows. Expose progress via the polling status endpoint. On completion, send an email summary: rows imported successfully, rows failed, link to the error report if applicable.

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Shopify Interview Guide

Scroll to Top