What is a Bulk Operations System?
A bulk operations system allows users to create, update, or delete thousands of records in a single request — bulk-importing products to a catalog, mass-updating order statuses, batch-deleting spam accounts, or sending a campaign to 500K users. The core design challenge: HTTP requests time out after 30-60 seconds, but processing 100K records takes minutes. The solution is async processing with job tracking, partial-failure handling, and idempotent retry.
Requirements
- Accept bulk create/update/delete via CSV upload or JSON array
- Process up to 1 million records per job
- Async: return job ID immediately; report progress and results when done
- Partial success: process all records; report which succeeded and which failed
- Idempotent: re-running a job with the same data should not create duplicates
- Result report: download CSV with per-row status (success/error/skipped)
Data Model
BulkJob(
  job_id UUID PRIMARY KEY,
  user_id UUID NOT NULL,
  operation VARCHAR,        -- 'product.create', 'order.update_status'
  status ENUM(QUEUED, PROCESSING, COMPLETED, FAILED),
  input_file_key VARCHAR,   -- S3 key of uploaded CSV/JSON
  result_file_key VARCHAR,  -- S3 key of result CSV (set on completion)
  total_records INT,
  processed INT DEFAULT 0,
  succeeded INT DEFAULT 0,
  failed INT DEFAULT 0,
  skipped INT DEFAULT 0,
  created_at TIMESTAMPTZ,
  completed_at TIMESTAMPTZ
)
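The submission path that populates this table can be sketched as below. This is a minimal in-memory sketch: `jobs` stands in for the BulkJob table, `queue` for SQS/Redis, and `submit_bulk_job` is a hypothetical handler name.

```python
import uuid
from datetime import datetime, timezone

jobs = {}    # stands in for the BulkJob table
queue = []   # stands in for SQS / a Redis queue

def submit_bulk_job(user_id, operation, input_file_key, total_records):
    """Create a QUEUED BulkJob row, enqueue it, and return the job_id
    immediately -- the HTTP handler never processes records itself."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {
        'job_id': job_id,
        'user_id': user_id,
        'operation': operation,
        'status': 'QUEUED',
        'input_file_key': input_file_key,
        'result_file_key': None,
        'total_records': total_records,
        'processed': 0, 'succeeded': 0, 'failed': 0, 'skipped': 0,
        'created_at': datetime.now(timezone.utc),
        'completed_at': None,
    }
    queue.append(job_id)  # a background worker picks this up asynchronously
    return job_id

job_id = submit_bulk_job('user-1', 'product.create', 'uploads/products.csv', 5000)
```

The key property is that the request path does constant work regardless of file size; everything expensive happens in the worker.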
Processing Pipeline
def process_bulk_job(job_id):
    job = db.get(BulkJob, job_id)
    db.update(job_id, status='PROCESSING')

    # Stream the input file from S3; decode the byte stream for the CSV reader
    input_stream = s3.get_object(BUCKET, job.input_file_key)
    reader = csv.DictReader(io.TextIOWrapper(input_stream, encoding='utf-8'))

    result_rows = []  # for very large jobs, spool to a temp file instead
    succeeded = failed = skipped = 0
    i = 0  # so the post-loop updates are defined even for an empty file
    for i, row in enumerate(reader, 1):
        try:
            result = process_row(job.operation, row)
            result_rows.append({**row, 'status': 'success',
                                'id': result.id, 'error': ''})
            succeeded += 1
        except SkipException as e:
            result_rows.append({**row, 'status': 'skipped', 'error': str(e)})
            skipped += 1
        except ValidationError as e:
            result_rows.append({**row, 'status': 'error', 'error': str(e)})
            failed += 1

        # Update progress every 1000 rows
        if i % 1000 == 0:
            db.update(job_id, processed=i, succeeded=succeeded,
                      failed=failed, skipped=skipped)

    # Write the result CSV to S3
    result_key = f'bulk-results/{job_id}/results.csv'
    write_csv_to_s3(result_key, result_rows)

    db.update(job_id, status='COMPLETED', result_file_key=result_key,
              processed=i, succeeded=succeeded, failed=failed, skipped=skipped,
              completed_at=now())
    notify_user(job.user_id, job_id=job_id, stats={
        'total': i, 'succeeded': succeeded, 'failed': failed})
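One caveat in the pipeline above: holding every result row in a Python list is risky at 1M rows. A common alternative, sketched here with hypothetical names, is to spool result rows to a local temp file as they are produced and upload the file once at the end:

```python
import csv
import tempfile

def open_result_writer(fieldnames):
    """Spool result rows to a local temp CSV instead of an in-memory list,
    so a 1M-row job never holds all results at once.
    Returns (writer, file handle); upload the file to S3 when done."""
    tmp = tempfile.NamedTemporaryFile('w+', newline='', suffix='.csv',
                                      delete=False)
    writer = csv.DictWriter(tmp, fieldnames=fieldnames)
    writer.writeheader()
    return writer, tmp

writer, tmp = open_result_writer(['name', 'status', 'error'])
writer.writerow({'name': 'Widget', 'status': 'success', 'error': ''})
tmp.flush()
# s3.upload_file(tmp.name, BUCKET, result_key)  # hypothetical upload call
```

S3 multipart upload would achieve the same end without local disk, at the cost of more bookkeeping.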
Idempotency via External ID
# For bulk create: include an external_id (client-generated) in each row
# If the record already exists with this external_id, skip instead of duplicate
def process_create_row(row):
    external_id = row.get('external_id')
    if external_id:
        existing = db.get_by_external_id(external_id)
        if existing:
            raise SkipException(f'Already exists: {existing.id}')
    return db.create(Product(
        name=row['name'],
        price=row['price'],
        external_id=external_id,
        ...
    ))
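The check-then-create above has a race window: two concurrent workers can both see "not exists" and insert twice. A unique constraint on external_id closes it at the database level. A sketch with SQLite standing in for the real store (INSERT OR IGNORE; the PostgreSQL equivalent is INSERT ... ON CONFLICT DO NOTHING):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE products (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        external_id TEXT UNIQUE   -- DB-level idempotency guard
    )""")

def create_idempotent(name, external_id):
    """INSERT OR IGNORE: a duplicate external_id becomes a no-op
    instead of a duplicate row, even under concurrent retries."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO products (name, external_id) VALUES (?, ?)",
        (name, external_id))
    return cur.rowcount == 1  # True = created, False = skipped

first = create_idempotent('Widget', 'ext-123')
retry = create_idempotent('Widget', 'ext-123')  # same row re-submitted
```

The application-level check is still worth keeping for a friendlier "skipped" message; the constraint is the correctness backstop.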
Transaction Strategy: Per-Row vs Batch
Two approaches:
Per-row transactions: each row is its own transaction. A failure in row 500 doesn’t roll back rows 1-499. Results: partial success. Appropriate for imports where each record is independent.
Batch transactions: process N rows per transaction (e.g., 1000). If any row fails the batch, retry with per-row fallback. Faster for all-succeed cases; falls back gracefully.
# Batch transaction with per-row fallback
def process_batch(rows):
    try:
        with db.transaction():
            for row in rows:
                process_row(row)
        return [{'status': 'success'} for _ in rows]
    except Exception:
        # Any failure aborts the whole batch; fall back to per-row processing
        return [safe_process_row(row) for row in rows]
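`safe_process_row` is referenced above but not defined; one plausible shape, with stub exceptions and a stub `process_row` so the sketch is self-contained (all names here are illustrative, mirroring the per-row statuses used elsewhere in this design):

```python
class SkipException(Exception): pass
class ValidationError(Exception): pass

def process_row(row):
    # Stand-in for the real handler: validate, dedupe, then "create"
    if not row.get('name'):
        raise ValidationError('name is required')
    if row.get('duplicate'):
        raise SkipException('already exists')
    return type('Result', (), {'id': 42})()

def safe_process_row(row):
    """Per-row fallback: run one row in isolation and translate the
    outcome into the same status dict the batch path returns."""
    try:
        result = process_row(row)
        return {'status': 'success', 'id': result.id, 'error': ''}
    except SkipException as e:
        return {'status': 'skipped', 'error': str(e)}
    except ValidationError as e:
        return {'status': 'error', 'error': str(e)}
```

In the real system each call would run in its own transaction, so one bad row cannot poison its neighbors.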
Key Design Decisions
- S3 for input/output files — decouple upload from processing; result CSV available for re-download
- Async job + queue — processing 1M records takes minutes; HTTP timeouts are 30-60s
- Per-row error tracking — partial success is more useful than all-or-nothing failure
- external_id for idempotency — safe to retry the entire job; already-processed rows are skipped
- Progress updates every 1000 rows — useful UX signal without excessive DB writes
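The client-facing side of those progress counters is a status payload computed from the BulkJob row; a minimal sketch (hypothetical helper; field names follow the data model above):

```python
def job_status_payload(job):
    """Build the GET /bulk-jobs/{job_id} response from a BulkJob row."""
    total = job['total_records'] or 0
    processed = job['processed']
    return {
        'job_id': job['job_id'],
        'status': job['status'],
        'processed': processed,
        'total': total,
        # guard against divide-by-zero before total_records is known
        'percent': round(100 * processed / total, 1) if total else 0.0,
        'succeeded': job['succeeded'],
        'failed': job['failed'],
        'skipped': job['skipped'],
    }

payload = job_status_payload({
    'job_id': 'j1', 'status': 'PROCESSING', 'total_records': 10000,
    'processed': 2500, 'succeeded': 2400, 'failed': 50, 'skipped': 50})
```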
FAQ
How do you handle partial failures in a bulk operation?
Process every record independently and track success/failure per row. Never abort the entire job because one row fails — the other 999,999 records should still be processed. Use per-row transactions: each record is its own DB transaction, so a validation error on row 500 doesn't roll back rows 1-499. Store the error message per row in the result. Return a result CSV with columns: original data + status (success/error/skipped) + error message. Users can fix just the failed rows and re-import.
How do you make bulk imports idempotent?
Require an external_id field in each row (a client-generated unique identifier for the record). Before creating, check whether a record with that external_id already exists for this user. If yes, skip (return status=skipped) rather than creating a duplicate. This allows safe retries: if the import job fails halfway through, re-submitting the same file produces the same result — already-created records are skipped. Store the external_id on the created record for future deduplication.
How do you handle a bulk import of 1 million records without timing out?
Never process synchronously in the HTTP request. Accept the file (upload to S3 directly via presigned URL), create a BulkJob record with status=QUEUED, and return the job_id immediately. A background worker picks up the job from a queue (SQS, Redis queue) and processes it asynchronously over minutes. The client polls GET /bulk-jobs/{job_id} for status and progress. When complete, the worker generates a result file on S3 and notifies the user via email/webhook.
What is a good batch size for bulk database inserts?
1,000-10,000 rows per batch is typical. Smaller batches (100 rows) generate many round trips to the DB. Larger batches (100K rows) can cause lock contention, exceed DB query size limits, and make partial failure recovery harder. For PostgreSQL, 1,000-row INSERT batches with explicit transactions achieve near-peak INSERT throughput. For updates, use UPDATE … WHERE id IN (:ids) with batches of 1,000 IDs. Monitor DB CPU and lock wait metrics to tune the batch size for your specific workload.
How do you notify users when a bulk job completes?
Send an email with the job summary (X succeeded, Y failed) and a link to download the result file. Also support webhooks: if the user has registered a webhook URL, POST the job completion event there. For in-product UX, show a notification in the UI (real-time via WebSocket or polling the job status endpoint). Generate a presigned S3 URL valid for 24 hours for the result CSV download — users can re-download without expiry issues during that window.
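The batched-UPDATE pattern for mass status changes can be sketched with SQLite standing in for the real database (batches of 500 here to stay under SQLite's bound-parameter limit; the pattern is identical at 1,000):

```python
import sqlite3

def chunked(ids, size):
    """Yield successive slices of `ids` of at most `size` elements."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO orders (id, status) VALUES (?, ?)",
                 [(i, 'pending') for i in range(1, 2501)])

def bulk_update_status(ids, new_status, batch_size=500):
    """UPDATE ... WHERE id IN (...) in fixed-size batches, one transaction
    per batch, so no single statement grows unbounded."""
    for batch in chunked(ids, batch_size):
        placeholders = ','.join('?' * len(batch))
        with conn:  # one transaction per batch
            conn.execute(
                f"UPDATE orders SET status = ? WHERE id IN ({placeholders})",
                [new_status, *batch])

bulk_update_status(list(range(1, 2501)), 'shipped')
```

One transaction per batch keeps lock hold times short and bounds the amount of work lost if a batch fails mid-job.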