Data Export Service Low-Level Design
A data export service lets users extract large datasets in their preferred format. The core constraint is that exports can involve millions of rows — the service must process them without exhausting memory, deliver results securely, and handle retries gracefully.
Export Job Schema
Each export is modeled as a job record in the database:
job_id— UUID, primary keyuser_id— owner, used for access control on downloadentity_type— which table or dataset to export (e.g.,orders,users)filters— JSON blob of query parameters (date range, status, etc.)format—csv,json,parquet, orxlsxstatus—PENDING→PROCESSING→READY|FAILEDrow_count,file_size,progress_pctdownload_url,expires_at,created_at
Async Processing Flow
Export jobs are never synchronous. The API endpoint creates the job record, enqueues a work item, and returns the job_id immediately. The client polls a status endpoint until status is READY, then downloads the file. This decouples the HTTP request lifecycle from potentially multi-minute processing time.
Data Extraction
The worker fetches data using cursor-based pagination rather than OFFSET, which degrades at large offsets. Each page is 10,000 rows. The cursor is the last seen primary key from the previous page:
SELECT * FROM orders WHERE id > :last_id AND created_at BETWEEN :start AND :end
ORDER BY id LIMIT 10000
The worker never loads the full dataset into memory — it processes and writes one batch at a time.
Streaming Write by Format
- CSV: write the header row first, then stream each batch of rows with proper comma and quote escaping; use Python's
csvmodule or equivalent - JSON: write the opening bracket, stream batches as JSON objects separated by commas, close the array on completion
- Parquet: use PyArrow's
ParquetWriterwith schema defined upfront; write row groups batch by batch — columnar format reduces storage 50–70% vs CSV for analytics - XLSX: use xlsxwriter in constant-memory mode (
use_constant_memory=True) to avoid building the full workbook in RAM
Large Export Chunking
If the estimated row count exceeds 1 million rows, split the export into multiple files (e.g., data_part1.csv, data_part2.csv), each capped at 500K rows. Once all parts are written, zip them together into a single archive. The download URL points to the zip file.
Upload to S3
Temp files are written to local disk during processing, then uploaded to S3 with multipart upload for files over 100 MB. The S3 key pattern is exports/{user_id}/{job_id}/data.{ext}. The bucket is private — files are never directly accessible.
Secure Download
When the job reaches READY, generate a presigned S3 URL with a 24-hour expiry. Before generating the URL, verify the requesting user matches job.user_id. Never expose the internal S3 key or bucket name in the API response. Presigned URLs expire automatically — no separate cleanup needed for access control.
Progress Tracking
The worker updates job.progress_pct every 10,000 rows based on estimated total row count. The client's polling endpoint returns current progress so the UI can show a progress bar. Progress estimation is approximate — use COUNT(*) with the same filters before starting, cached for the job duration.
Cleanup, Rate Limiting, and Notification
- Cleanup: a scheduled job deletes S3 files and job records older than 7 days
- Rate limiting: max 3 concurrent export jobs per user; free tier capped at 100K rows per export; enforce at job creation time, return 429 if exceeded
- Notification: send an email with the download link when the export reaches
READY; this eliminates the need for the client to poll at all for non-interactive use cases
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Shopify Interview Guide