Data Import Service Low-Level Design
A data import service ingests user-supplied files into the system's database. The main challenges are: handling large files without memory exhaustion, validating data quality before committing, deduplicating records on retry, and giving users actionable error feedback without hiding problems.
Import Job Schema
job_id— UUIDuser_id— ownerfile_s3_key— location of the uploaded fileentity_type— target table or entity (e.g.,products,contacts)mapping_config— JSON: maps file columns to database fieldsstatus—UPLOADED→VALIDATING→PROCESSING→COMPLETED|FAILEDtotal_rows,processed_rows,error_rows,created_at
Upload Flow
Never upload files through the application server — it adds latency and consumes server memory. Instead:
- Client requests a presigned S3 PUT URL from the API
- Client uploads the file directly to S3 (bypasses the application server entirely)
- Client notifies the API of upload completion, providing the S3 key
- API creates the import job record and enqueues processing
Validation Phase
Before processing any data rows, stream-parse the file to validate structure:
- Check that required column headers are present and correctly named
- Verify the file is parseable (encoding, delimiter, valid JSON if applicable)
- Validate the mapping_config references columns that exist in the file
Structural errors are reported immediately with status FAILED before any rows are written. This saves time when the user uploaded the wrong file.
Row Processing
Stream rows from S3 in chunks of 1,000. For each chunk:
- Parse — apply column mapping, type coercion (string to int, date parsing)
- Validate — check required fields, value ranges, format constraints (email regex, positive integer)
- Dedup lookup — query by
external_id - Upsert — INSERT or UPDATE based on lookup result
- Collect errors — accumulate row-level errors without stopping
Deduplication
Each importable entity has an external_id field — a stable identifier from the user's source system. On every row, look up an existing record by external_id:
- Found: UPDATE the existing record with new field values
- Not found: INSERT a new record
This makes imports idempotent. Re-running the same file after a worker crash produces no duplicates — all UPDATEs on previously inserted rows are safe to repeat.
Row-Level Error Reporting
A row validation failure does not stop the import. The worker records the error with:
- Row number in the original file
- Field name that failed
- Error message and the invalid value
Processing continues for subsequent rows. At completion, the job summary includes total errors and a downloadable error report CSV so users can fix only the failed rows and re-import.
Error Threshold and Abort
If the error rate exceeds 20% of processed rows, abort the import and mark it FAILED. A 20% error rate indicates a systemic problem — wrong file, wrong mapping, or corrupted data — and proceeding would import mostly bad data. Report the error rate and the first 50 errors so the user can diagnose the root cause.
Rollback Strategy
Two approaches with different tradeoffs:
- Transaction rollback: wrap the entire import in a DB transaction; on critical failure, roll back all rows. Guarantees consistency but holds a long transaction, causing lock contention and replication lag for large imports.
- Row-by-row commit with error accumulation: commit each chunk independently; on failure, report errors and leave successful rows in place. Better for large imports; user may need to re-import only failed rows.
For imports under 100K rows, use transaction rollback. For larger imports, use row-by-row commit with the error threshold abort as a safeguard.
Progress Tracking and Notification
Update job.processed_rows every 1,000 rows. Expose progress via the polling status endpoint. On completion, send an email summary: rows imported successfully, rows failed, link to the error report if applicable.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Shopify Interview Guide