Data Export Service Low-Level Design
A data export service lets users extract large datasets in their preferred format. The core constraint is that exports can involve millions of rows — the service must process them without exhausting memory, deliver results securely, and handle retries gracefully.
Export Job Schema
Each export is modeled as a job record in the database:
- job_id — UUID, primary key
- user_id — owner, used for access control on download
- entity_type — which table or dataset to export (e.g., orders, users)
- filters — JSON blob of query parameters (date range, status, etc.)
- format — csv, json, parquet, or xlsx
- status — PENDING → PROCESSING → READY | FAILED
- row_count, file_size, progress_pct
- download_url, expires_at, created_at
Async Processing Flow
Export jobs are never synchronous. The API endpoint creates the job record, enqueues a work item, and returns the job_id immediately. The client polls a status endpoint until status is READY, then downloads the file. This decouples the HTTP request lifecycle from potentially multi-minute processing time.
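A minimal sketch of this create-then-poll contract, using an in-memory dict and queue as stand-ins for the jobs table and the message broker (all names here are illustrative, not a prescribed API):

```python
import queue
import uuid

jobs = {}                    # stand-in for the jobs table
work_queue = queue.Queue()   # stand-in for SQS / Celery / etc.

def create_export_job(user_id, entity_type, filters, fmt):
    """POST /exports: insert the job record, enqueue work, return immediately."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {
        "job_id": job_id, "user_id": user_id, "entity_type": entity_type,
        "filters": filters, "format": fmt, "status": "PENDING",
        "progress_pct": 0, "download_url": None,
    }
    work_queue.put(job_id)   # a worker picks this up asynchronously
    return {"job_id": job_id}

def get_export_status(job_id):
    """GET /exports/{job_id}: the client polls this until status is READY."""
    job = jobs[job_id]
    return {"status": job["status"], "progress_pct": job["progress_pct"],
            "download_url": job["download_url"]}
```

The HTTP handler never touches export data; it only writes one row and one queue message, so its latency is constant regardless of export size.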
Data Extraction
The worker fetches data using cursor-based (keyset) pagination rather than OFFSET, which degrades at large offsets because the database must scan and discard every skipped row. Each page is 10,000 rows. The cursor is the last seen primary key from the previous page:
SELECT * FROM orders WHERE id > :last_id AND created_at BETWEEN :start AND :end
ORDER BY id LIMIT 10000
The worker never loads the full dataset into memory — it processes and writes one batch at a time.
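A sketch of that keyset-pagination loop, shown against an in-memory SQLite table with assumed columns id and total (the real query would also carry the job's filters, as in the SQL above):

```python
import sqlite3

def paginate_by_id(conn, batch_size=10_000):
    """Yield batches of rows using keyset pagination on the primary key."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, total FROM orders WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]   # cursor = last seen primary key
```

Because each page seeks directly to `id > last_id` via the primary-key index, page N costs the same as page 1.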
Streaming Write by Format
- CSV: write the header row first, then stream each batch of rows with proper comma and quote escaping; use Python's csv module or equivalent
- JSON: write the opening bracket, stream batches as JSON objects separated by commas, close the array on completion
- Parquet: use PyArrow's ParquetWriter with the schema defined upfront; write row groups batch by batch — the columnar format reduces storage 50–70% vs CSV for analytics workloads
- XLSX: use xlsxwriter in constant-memory mode (pass {'constant_memory': True} in the Workbook options) to avoid building the full workbook in RAM
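For the CSV case, the batch-at-a-time write might look like this sketch; write_csv_export and its arguments are illustrative, and the stdlib csv module handles the quoting and escaping:

```python
import csv

def write_csv_export(path, header, batches):
    """Stream batches to disk; memory stays O(one batch), not O(dataset)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)          # header row first
        rows_written = 0
        for batch in batches:
            writer.writerows(batch)      # one batch at a time
            rows_written += len(batch)
    return rows_written
```

Fields containing commas or quotes are quoted automatically by csv's default QUOTE_MINIMAL dialect, so no manual escaping is needed.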
Large Export Chunking
If the estimated row count exceeds 1 million rows, split the export into multiple files (e.g., data_part1.csv, data_part2.csv), each capped at 500K rows. Once all parts are written, zip them together into a single archive. The download URL points to the zip file.
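One way to sketch the part-splitting and zipping with the stdlib (the function name and file layout are assumptions; the 500K default mirrors the text):

```python
import csv
import zipfile
from pathlib import Path

def write_chunked_export(out_dir, header, rows, rows_per_part=500_000):
    """Split rows across data_partN.csv files, then bundle them into one zip."""
    out_dir = Path(out_dir)
    parts, part_rows, part_num = [], [], 0

    def flush():
        nonlocal part_num, part_rows
        part_num += 1
        path = out_dir / f"data_part{part_num}.csv"
        with open(path, "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(header)       # each part is independently readable
            w.writerows(part_rows)
        parts.append(path)
        part_rows = []

    for row in rows:
        part_rows.append(row)
        if len(part_rows) == rows_per_part:
            flush()
    if part_rows:
        flush()                      # final partial part

    archive = out_dir / "export.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in parts:
            zf.write(p, arcname=p.name)
    return archive, len(parts)
```

Repeating the header in each part keeps every file usable on its own, at the cost of a few bytes per part.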
Upload to S3
Temp files are written to local disk during processing, then uploaded to S3 with multipart upload for files over 100 MB. The S3 key pattern is exports/{user_id}/{job_id}/data.{ext}. The bucket is private — files are never directly accessible.
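The key layout and the multipart cutoff reduce to two small helpers (illustrative; in practice boto3's transfer layer, e.g. TransferConfig(multipart_threshold=...), handles the multipart mechanics once the threshold is configured):

```python
MULTIPART_THRESHOLD = 100 * 1024 * 1024  # 100 MB, per the design above

def s3_key(user_id, job_id, ext):
    """Private-bucket key layout: exports/{user_id}/{job_id}/data.{ext}."""
    return f"exports/{user_id}/{job_id}/data.{ext}"

def should_multipart(file_size_bytes):
    """Multipart upload for large files; single PUT otherwise."""
    return file_size_bytes > MULTIPART_THRESHOLD
```

Prefixing keys with user_id also makes per-user cleanup and auditing a simple prefix listing.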
Secure Download
When the job reaches READY, generate a presigned S3 URL with a 24-hour expiry. Before generating the URL, verify the requesting user matches job.user_id. Never expose the internal S3 key or bucket name in the API response. Presigned URLs expire automatically — no separate cleanup needed for access control.
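A sketch of that download-endpoint logic, with the actual presigner injected as a callable (in production this would be boto3's generate_presigned_url; Forbidden and the field names are illustrative):

```python
import datetime

class Forbidden(Exception):
    pass

def make_download_url(job, requesting_user_id, presign, expiry_hours=24):
    """Verify ownership, then delegate to a presigner without leaking
    the bucket or internal S3 key to the caller."""
    if job["user_id"] != requesting_user_id:
        raise Forbidden("job belongs to another user")   # access control first
    if job["status"] != "READY":
        raise ValueError("export not ready")
    url = presign(job["s3_key"], expires_in=expiry_hours * 3600)
    expires_at = (datetime.datetime.now(datetime.timezone.utc)
                  + datetime.timedelta(hours=expiry_hours))
    return {"download_url": url, "expires_at": expires_at.isoformat()}
```

The response carries only the opaque signed URL and its expiry; the bucket name and key never appear in the API payload.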
Progress Tracking
The worker updates job.progress_pct every 10,000 rows based on estimated total row count. The client's polling endpoint returns current progress so the UI can show a progress bar. Progress estimation is approximate — use COUNT(*) with the same filters before starting, cached for the job duration.
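The estimate itself is one line; the cap guards against rows inserted after the COUNT(*) snapshot pushing the bar past 100% (function name is illustrative):

```python
def progress_pct(rows_done, estimated_total):
    """Approximate percent complete; the estimate comes from a cached COUNT(*)."""
    if estimated_total <= 0:
        return 0
    return min(100, rows_done * 100 // estimated_total)
```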
Cleanup, Rate Limiting, and Notification
- Cleanup: a scheduled job deletes S3 files and job records older than 7 days
- Rate limiting: max 3 concurrent export jobs per user; free tier capped at 100K rows per export; enforce at job creation time, return 429 if exceeded
- Notification: send an email with the download link when the export reaches READY; this eliminates the need for the client to poll at all for non-interactive use cases
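These limits can be enforced with a pure check at job creation time (a sketch; the constants mirror the numbers above, and mapping both violations to 429 follows the text):

```python
MAX_CONCURRENT_JOBS = 3
FREE_TIER_ROW_CAP = 100_000

def check_export_allowed(active_job_count, estimated_rows, tier):
    """Enforced at job creation; returns (allowed, http_status, reason)."""
    if active_job_count >= MAX_CONCURRENT_JOBS:
        return False, 429, "too many concurrent exports"
    if tier == "free" and estimated_rows > FREE_TIER_ROW_CAP:
        return False, 429, "free tier export row limit exceeded"
    return True, 200, None
```

Checking at creation time (rather than in the worker) means the user gets immediate feedback and no queue capacity is wasted on doomed jobs.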
Frequently Asked Questions

Q: How do you design an async export job system that handles large datasets without timing out?
A: Accept the export request synchronously and immediately return a job ID, then process the export asynchronously via a worker queue (e.g., SQS + Lambda or Celery). The worker streams data from the database in paginated chunks to avoid loading the full dataset into memory, writes the output incrementally to object storage (S3), and updates job status in a jobs table. The client polls a status endpoint or receives a webhook on completion. Set a TTL on the output file and presign the download URL so it expires after a safe window.

Q: What are the trade-offs between CSV, JSON, and Parquet for a bulk data export service?
A: CSV is universally readable and compact for flat data but loses type information and handles nested structures poorly. JSON preserves types and nesting but is verbose and slow to parse at scale. Parquet is a columnar binary format that offers 5–10x compression over CSV, retains schema, and enables predicate pushdown for downstream analytics tools like Spark or Athena — but it requires a library to read and is unsuitable for human inspection. Choose CSV for interoperability with spreadsheet tools, JSON for API consumers, and Parquet for data pipeline or warehouse ingestion.

Q: How do you generate secure, time-limited download links for exported files stored in S3?
A: Use S3 presigned URLs generated server-side with the AWS SDK. The URL embeds the bucket, key, expiry timestamp, and an HMAC signature derived from your IAM credentials. Set the expiry to a short window (e.g., 15 minutes) appropriate for the use case, and never expose the S3 bucket publicly. For additional control — such as single-use enforcement or IP binding — route downloads through your own API: validate a signed token, check it is unused, mark it consumed, then issue a 302 redirect to the presigned URL.

Q: How would you implement format conversion from a normalized relational schema to a nested JSON export without loading the full dataset into memory?
A: Stream rows from the database using a server-side cursor (e.g., PostgreSQL's named cursor or JDBC streaming mode) and buffer rows for the same parent entity ID to assemble nested objects on the fly. Write completed parent objects as newline-delimited JSON (NDJSON) directly to an output stream connected to S3 via a multipart upload. This keeps memory usage bounded to one parent's children at a time. For very wide joins, denormalize with a database view or use a columnar read path to avoid row-by-row overhead.