Question 1

How do you export 10 million rows without running out of memory?

Accepted Answer

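Stream the data in batches using keyset pagination: read 10K rows at a time with a WHERE id > last_seen_id ORDER BY id LIMIT 10000 cursor, append each batch to a buffer, upload the buffer to S3 as one multipart-upload part once it reaches at least 5 MiB (S3 requires this minimum for every part except the last), then advance the cursor. Memory usage stays bounded at roughly one batch plus one part buffer, regardless of total row count. S3 multipart upload lets you stream very large files without buffering the entire export in RAM; an upload can have at most 10,000 parts, so choose a larger part size for very large exports. Never paginate large exports with LIMIT/OFFSET: the database has to scan and discard every skipped row, so the cost of each page grows linearly with the offset and the cost of the whole export grows quadratically.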
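The batching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a production exporter: the users table, the iter_batches and export_users helpers, and the PartBuffer class are hypothetical names, SQLite stands in for the real database, and upload_part is an injected callback where a real system would call the S3 multipart API.

```python
import csv
import io
import sqlite3
from typing import Callable, Iterator, List, Tuple

MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last


def iter_batches(conn: sqlite3.Connection, batch_size: int) -> Iterator[List[Tuple[int, str]]]:
    """Yield rows in id order using a keyset cursor instead of OFFSET."""
    last_seen_id = 0  # assumption: ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_seen_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_seen_id = rows[-1][0]  # advance the cursor to the last id seen


class PartBuffer:
    """Accumulate bytes and hand off a part once it reaches min_part_size."""

    def __init__(self, upload_part: Callable[[int, bytes], None],
                 min_part_size: int = MIN_PART_SIZE) -> None:
        self._upload_part = upload_part  # hypothetical callback, e.g. wraps S3 UploadPart
        self._min_part_size = min_part_size
        self._buf = bytearray()
        self._part_number = 1  # S3 part numbers start at 1

    def write(self, data: bytes) -> None:
        self._buf.extend(data)
        if len(self._buf) >= self._min_part_size:
            self._flush()

    def close(self) -> None:
        # The final part is allowed to be smaller than the minimum.
        if self._buf:
            self._flush()

    def _flush(self) -> None:
        self._upload_part(self._part_number, bytes(self._buf))
        self._part_number += 1
        self._buf.clear()


def export_users(conn: sqlite3.Connection, upload_part: Callable[[int, bytes], None],
                 batch_size: int = 10_000, min_part_size: int = MIN_PART_SIZE) -> int:
    """Stream all rows as CSV through the part buffer; return the row count."""
    buf = PartBuffer(upload_part, min_part_size)
    total = 0
    for batch in iter_batches(conn, batch_size):
        out = io.StringIO()
        csv.writer(out).writerows(batch)  # csv handles quoting and escaping
        buf.write(out.getvalue().encode("utf-8"))
        total += len(batch)
    buf.close()
    return total
```

Only one batch and one partially filled part are held in memory at a time, which is what keeps the footprint independent of the total row count.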

Question 2

How does an asynchronous data export job work end to end?

Accepted Answer

(1) The user requests an export via the API: the server validates the request, creates an ExportJob record with status=QUEUED, enqueues a job message, and returns {job_id} immediately. (2) A worker picks up the job from the queue and updates status to PROCESSING. (3) The worker streams data to S3 in batches, updating the progress field every 10K rows. (4) On completion, the worker generates a presigned S3 URL valid for 24 hours, updates status=COMPLETED, and sends the user an email or notification with the download link. Note that a URL signed with temporary credentials stops working when those credentials expire, which can be sooner than 24 hours. (5) The user downloads directly from S3, so the file never passes through your servers and costs them no bandwidth.
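The five steps can be sketched as a small in-memory model. Everything here is a stand-in chosen for illustration: the JOBS dict plays the role of the ExportJob table, JOB_QUEUE plays the message queue, make_download_url is a hypothetical callback where a real worker would create a presigned URL, and the FAILED status is an assumption, since the answer above does not name a failure state.

```python
import queue
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class ExportJob:
    job_id: str
    status: str = "QUEUED"
    progress: int = 0
    download_url: Optional[str] = None


JOBS: Dict[str, ExportJob] = {}                 # stand-in for the ExportJob table
JOB_QUEUE: "queue.Queue[str]" = queue.Queue()   # stand-in for the message queue


def request_export() -> str:
    """Step 1: record the job, enqueue it, and return the id immediately."""
    job = ExportJob(job_id=str(uuid.uuid4()))
    JOBS[job.job_id] = job
    JOB_QUEUE.put(job.job_id)
    return job.job_id


def run_next_job(batches: Iterable[List[tuple]], total_rows: int,
                 make_download_url: Callable[[str], str]) -> ExportJob:
    """Steps 2-4: take one job off the queue and drive it to completion."""
    job = JOBS[JOB_QUEUE.get_nowait()]
    job.status = "PROCESSING"
    rows_done = 0
    try:
        for batch in batches:
            # A real worker would write the batch to object storage here.
            rows_done += len(batch)
            # Hold at 99 until the download link actually exists.
            job.progress = min(99, rows_done * 100 // max(total_rows, 1))
        job.download_url = make_download_url(job.job_id)
        job.progress = 100
        job.status = "COMPLETED"
    except Exception:
        job.status = "FAILED"  # assumption: not named in the original answer
        raise
    return job
```

With boto3, make_download_url would typically wrap generate_presigned_url("get_object", Params=..., ExpiresIn=86400); the bucket and key layout are left open here.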

Question 3

How do you track export progress for the UI?

Accepted Answer

Update the progress INT field (0-100) in the ExportJob table every N rows processed. The UI polls GET /exports/{job_id} every 2 seconds while status=PROCESSING and displays a progress bar. Alternatively, use Server-Sent Events (SSE) from the API to push progress updates without polling. To compute the percentage: (rows_processed / total_rows) * 100. Get total_rows at job start with a COUNT(*) query on the filtered dataset; an index on the filter columns helps, but counting many rows can still be slow. In PostgreSQL, pg_class.reltuples gives a cheap estimate, but only for the whole table, so it suits unfiltered exports. When total_rows is an estimate, clamp the displayed percentage so it never exceeds 100.
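A minimal sketch of the percentage calculation and of an SSE message carrying it. The function names and the "progress" event name are hypothetical, and capping at 99 while the job is still running is an assumption made here so that an estimated total_rows cannot show a finished bar too early; the worker would set 100 itself on completion.

```python
import json


def progress_percent(rows_processed: int, total_rows: int) -> int:
    """Integer percentage for a running job, clamped to the range 0-99."""
    if total_rows <= 0:
        return 0  # unknown or empty total: avoid dividing by zero
    pct = rows_processed * 100 // total_rows
    return max(0, min(pct, 99))


def sse_event(job_id: str, status: str, progress: int) -> str:
    """Format one Server-Sent Events message; a blank line terminates it."""
    payload = json.dumps({"job_id": job_id, "status": status, "progress": progress})
    return f"event: progress\ndata: {payload}\n\n"
```

The same JSON payload can be returned by the polling endpoint, so the UI code that renders the bar does not need to care which transport delivered it.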

Question 4

How do you prevent export jobs from overwhelming the database?

Accepted Answer

Limit concurrent exports per user (max 2 active jobs). Limit total concurrent export workers system-wide (queue concurrency cap). Schedule very large exports to run during off-peak hours. Use DB read replicas for export queries, and never run bulk export queries against the primary write database. Add query timeouts so a stuck export doesn't hold locks or connections indefinitely. Rate-limit the export API endpoint: max 10 export requests per user per hour.
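The per-user limits above could be enforced with an admission check like the following. This is a single-process sketch: the ExportAdmission class is a hypothetical name, the counters live in memory, and a real deployment with several API servers would need to keep them in a shared store instead.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

MAX_ACTIVE_PER_USER = 2      # concurrent jobs per user
MAX_REQUESTS_PER_HOUR = 10   # export requests per user per hour
WINDOW_SECONDS = 3600


class ExportAdmission:
    """Decide whether a user may start another export right now."""

    def __init__(self) -> None:
        self._active: Dict[str, int] = defaultdict(int)
        self._requests: Dict[str, Deque[float]] = defaultdict(deque)

    def try_start(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        recent = self._requests[user_id]
        # Drop request timestamps that have left the sliding one-hour window.
        while recent and now - recent[0] >= WINDOW_SECONDS:
            recent.popleft()
        if len(recent) >= MAX_REQUESTS_PER_HOUR:
            return False
        recent.append(now)  # requests rejected below still count toward the rate limit
        if self._active[user_id] >= MAX_ACTIVE_PER_USER:
            return False
        self._active[user_id] += 1
        return True

    def finish(self, user_id: str) -> None:
        """Call when a job completes or fails, to free its concurrency slot."""
        if self._active[user_id] > 0:
            self._active[user_id] -= 1
```

The system-wide cap is usually simpler: it falls out of how many workers consume the export queue, so it needs no extra bookkeeping in the API layer.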

Question 5

How do you generate an Excel (XLSX) export for large datasets?

Accepted Answer

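Use a streaming XLSX library (xlsxwriter in constant_memory mode for Python, the ExcelJS streaming workbook writer for Node.js) that writes directly to a file or stream without building the entire workbook in memory. Write headers first, then stream rows in batches. XLSX has a limit of 1,048,576 rows per sheet (about 1M, including the header row); for larger exports, split into multiple sheets or fall back to CSV with a warning. For complex formatting needs (charts, formulas), generate a template XLSX and populate it with openpyxl; note that openpyxl loads the whole workbook into memory and, per its documentation, may drop charts and images that already exist in a file it opens and re-saves, so verify the result against your template and reserve this path for small exports. In practice, CSV is usually more practical than XLSX once an export exceeds roughly 100K rows.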
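Splitting rows across sheets can be sketched independently of any XLSX library. The place_rows generator below is a hypothetical helper: it lazily assigns each row a sheet number and a zero-based row index, and repeats the header at the top of every sheet so each one stays under the per-sheet limit.

```python
from typing import Iterable, Iterator, List, Tuple

XLSX_MAX_ROWS = 1_048_576  # per-worksheet row limit, header row included

Placement = Tuple[int, int, List[object]]  # (sheet number, row index, cell values)


def place_rows(rows: Iterable[List[object]], header: List[object],
               max_rows: int = XLSX_MAX_ROWS) -> Iterator[Placement]:
    """Yield where each row goes, starting a new sheet when one fills up."""
    if max_rows < 2:
        raise ValueError("max_rows must leave room for a header and one data row")
    sheet, row_index = 1, 1
    yield sheet, 0, header  # the first sheet always gets a header
    for values in rows:
        if row_index >= max_rows:
            sheet += 1
            yield sheet, 0, header  # repeat the header on each new sheet
            row_index = 1
        yield sheet, row_index, values
        row_index += 1
```

With xlsxwriter, a writer loop could call add_worksheet() whenever the row index is 0 and worksheet.write_row(row_index, 0, values) for every placement; constant_memory mode expects rows to be written in increasing order, which this generator respects.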

Data Export Service Low-Level Design

What is a Data Export Service?

Requirements

Data Model

Export Pipeline

Streaming to S3 (Memory-Efficient)

Key Design Decisions