Data Export Service Low-Level Design
A data export service lets users extract large datasets in their preferred format. The core constraint is that exports can involve millions of rows — the service must process them without exhausting memory, deliver results securely, and handle retries gracefully.
Export Job Schema
Each export is modeled as a job record in the database:
- job_id — UUID, primary key
- user_id — owner, used for access control on download
- entity_type — which table or dataset to export (e.g., orders, users)
- filters — JSON blob of query parameters (date range, status, etc.)
- format — csv, json, parquet, or xlsx
- status — PENDING → PROCESSING → READY | FAILED
- row_count, file_size, progress_pct
- download_url, expires_at, created_at
Async Processing Flow
Export jobs are never synchronous. The API endpoint creates the job record, enqueues a work item, and returns the job_id immediately. The client polls a status endpoint until status is READY, then downloads the file. This decouples the HTTP request lifecycle from potentially multi-minute processing time.
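A minimal sketch of this create-then-poll contract, using an in-memory dict and queue as stand-ins for the jobs table and the message broker (all names here are illustrative, not a prescribed API):

```python
import queue
import uuid

jobs = {}                    # stand-in for the jobs table
work_queue = queue.Queue()   # stand-in for SQS / Celery / etc.

def create_export_job(user_id, entity_type, filters, fmt):
    """POST /exports: insert the job record, enqueue work, return immediately."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {
        "job_id": job_id, "user_id": user_id, "entity_type": entity_type,
        "filters": filters, "format": fmt, "status": "PENDING",
        "progress_pct": 0, "download_url": None,
    }
    work_queue.put(job_id)   # a worker picks this up asynchronously
    return {"job_id": job_id}

def get_export_status(job_id):
    """GET /exports/{job_id}: the client polls this until status is READY."""
    job = jobs[job_id]
    return {"status": job["status"], "progress_pct": job["progress_pct"],
            "download_url": job["download_url"]}
```

The HTTP handler never touches export data; it only writes one row and one queue message, so its latency is constant regardless of export size.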
Data Extraction
The worker fetches data using cursor-based (keyset) pagination rather than OFFSET, which degrades at large offsets because the database must scan and discard every skipped row. Each page is 10,000 rows. The cursor is the last seen primary key from the previous page:
SELECT * FROM orders WHERE id > :last_id AND created_at BETWEEN :start AND :end
ORDER BY id LIMIT 10000
The worker never loads the full dataset into memory — it processes and writes one batch at a time.
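A sketch of that keyset-pagination loop, shown against an in-memory SQLite table with assumed columns id and total (the real query would also carry the job's filters, as in the SQL above):

```python
import sqlite3

def paginate_by_id(conn, batch_size=10_000):
    """Yield batches of rows using keyset pagination on the primary key."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, total FROM orders WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]   # cursor = last seen primary key
```

Because each page seeks directly to `id > last_id` via the primary-key index, page N costs the same as page 1.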
Streaming Write by Format
- CSV: write the header row first, then stream each batch of rows with proper comma and quote escaping; use Python's csv module or equivalent
- JSON: write the opening bracket, stream batches as JSON objects separated by commas, close the array on completion
- Parquet: use PyArrow's ParquetWriter with the schema defined upfront; write row groups batch by batch — the columnar format reduces storage 50–70% vs CSV for analytics workloads
- XLSX: use xlsxwriter in constant-memory mode (pass {'constant_memory': True} in the Workbook options) to avoid building the full workbook in RAM
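For the CSV case, the batch-at-a-time write might look like this sketch; write_csv_export and its arguments are illustrative, and the stdlib csv module handles the quoting and escaping:

```python
import csv

def write_csv_export(path, header, batches):
    """Stream batches to disk; memory stays O(one batch), not O(dataset)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)          # header row first
        rows_written = 0
        for batch in batches:
            writer.writerows(batch)      # one batch at a time
            rows_written += len(batch)
    return rows_written
```

Fields containing commas or quotes are quoted automatically by csv's default QUOTE_MINIMAL dialect, so no manual escaping is needed.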
Large Export Chunking
If the estimated row count exceeds 1 million rows, split the export into multiple files (e.g., data_part1.csv, data_part2.csv), each capped at 500K rows. Once all parts are written, zip them together into a single archive. The download URL points to the zip file.
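One way to sketch the part-splitting and zipping with the stdlib (the function name and file layout are assumptions; the 500K default mirrors the text):

```python
import csv
import zipfile
from pathlib import Path

def write_chunked_export(out_dir, header, rows, rows_per_part=500_000):
    """Split rows across data_partN.csv files, then bundle them into one zip."""
    out_dir = Path(out_dir)
    parts, part_rows, part_num = [], [], 0

    def flush():
        nonlocal part_num, part_rows
        part_num += 1
        path = out_dir / f"data_part{part_num}.csv"
        with open(path, "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(header)       # each part is independently readable
            w.writerows(part_rows)
        parts.append(path)
        part_rows = []

    for row in rows:
        part_rows.append(row)
        if len(part_rows) == rows_per_part:
            flush()
    if part_rows:
        flush()                      # final partial part

    archive = out_dir / "export.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in parts:
            zf.write(p, arcname=p.name)
    return archive, len(parts)
```

Repeating the header in each part keeps every file usable on its own, at the cost of a few bytes per part.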
Upload to S3
Temp files are written to local disk during processing, then uploaded to S3 with multipart upload for files over 100 MB. The S3 key pattern is exports/{user_id}/{job_id}/data.{ext}. The bucket is private — files are never directly accessible.
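The key layout and the multipart cutoff reduce to two small helpers (illustrative; in practice boto3's transfer layer, e.g. TransferConfig(multipart_threshold=...), handles the multipart mechanics once the threshold is configured):

```python
MULTIPART_THRESHOLD = 100 * 1024 * 1024  # 100 MB, per the design above

def s3_key(user_id, job_id, ext):
    """Private-bucket key layout: exports/{user_id}/{job_id}/data.{ext}."""
    return f"exports/{user_id}/{job_id}/data.{ext}"

def should_multipart(file_size_bytes):
    """Multipart upload for large files; single PUT otherwise."""
    return file_size_bytes > MULTIPART_THRESHOLD
```

Prefixing keys with user_id also makes per-user cleanup and auditing a simple prefix listing.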
Secure Download
When the job reaches READY, generate a presigned S3 URL with a 24-hour expiry. Before generating the URL, verify the requesting user matches job.user_id. Never expose the internal S3 key or bucket name in the API response. Presigned URLs expire automatically — no separate cleanup needed for access control.
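A sketch of that download-endpoint logic, with the actual presigner injected as a callable (in production this would be boto3's generate_presigned_url; Forbidden and the field names are illustrative):

```python
import datetime

class Forbidden(Exception):
    pass

def make_download_url(job, requesting_user_id, presign, expiry_hours=24):
    """Verify ownership, then delegate to a presigner without leaking
    the bucket or internal S3 key to the caller."""
    if job["user_id"] != requesting_user_id:
        raise Forbidden("job belongs to another user")   # access control first
    if job["status"] != "READY":
        raise ValueError("export not ready")
    url = presign(job["s3_key"], expires_in=expiry_hours * 3600)
    expires_at = (datetime.datetime.now(datetime.timezone.utc)
                  + datetime.timedelta(hours=expiry_hours))
    return {"download_url": url, "expires_at": expires_at.isoformat()}
```

The response carries only the opaque signed URL and its expiry; the bucket name and key never appear in the API payload.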
Progress Tracking
The worker updates job.progress_pct every 10,000 rows based on estimated total row count. The client's polling endpoint returns current progress so the UI can show a progress bar. Progress estimation is approximate — use COUNT(*) with the same filters before starting, cached for the job duration.
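The estimate itself is one line; the cap guards against rows inserted after the COUNT(*) snapshot pushing the bar past 100% (function name is illustrative):

```python
def progress_pct(rows_done, estimated_total):
    """Approximate percent complete; the estimate comes from a cached COUNT(*)."""
    if estimated_total <= 0:
        return 0
    return min(100, rows_done * 100 // estimated_total)
```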
Cleanup, Rate Limiting, and Notification
- Cleanup: a scheduled job deletes S3 files and job records older than 7 days
- Rate limiting: max 3 concurrent export jobs per user; free tier capped at 100K rows per export; enforce at job creation time, return 429 if exceeded
- Notification: send an email with the download link when the export reaches READY; this eliminates the need for the client to poll at all for non-interactive use cases
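These limits can be enforced with a pure check at job creation time (a sketch; the constants mirror the numbers above, and mapping both violations to 429 follows the text):

```python
MAX_CONCURRENT_JOBS = 3
FREE_TIER_ROW_CAP = 100_000

def check_export_allowed(active_job_count, estimated_rows, tier):
    """Enforced at job creation; returns (allowed, http_status, reason)."""
    if active_job_count >= MAX_CONCURRENT_JOBS:
        return False, 429, "too many concurrent exports"
    if tier == "free" and estimated_rows > FREE_TIER_ROW_CAP:
        return False, 429, "free tier export row limit exceeded"
    return True, 200, None
```

Checking at creation time (rather than in the worker) means the user gets immediate feedback and no queue capacity is wasted on doomed jobs.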
Frequently Asked Questions

Q: How do you design an async export job system that handles large datasets without timing out?
A: Accept the export request synchronously and immediately return a job ID, then process the export asynchronously via a worker queue (e.g., SQS + Lambda or Celery). The worker streams data from the database in paginated chunks to avoid loading the full dataset into memory, writes the output incrementally to object storage (S3), and updates job status in a jobs table. The client polls a status endpoint or receives a webhook on completion. Set a TTL on the output file and presign the download URL so it expires after a safe window.

Q: What are the trade-offs between CSV, JSON, and Parquet for a bulk data export service?
A: CSV is universally readable and compact for flat data but loses type information and handles nested structures poorly. JSON preserves types and nesting but is verbose and slow to parse at scale. Parquet is a columnar binary format that offers 5–10x compression over CSV, retains schema, and enables predicate pushdown for downstream analytics tools like Spark or Athena — but it requires a library to read and is unsuitable for human inspection. Choose CSV for interoperability with spreadsheet tools, JSON for API consumers, and Parquet for data pipeline or warehouse ingestion.

Q: How do you generate secure, time-limited download links for exported files stored in S3?
A: Use S3 presigned URLs generated server-side with the AWS SDK. The URL embeds the bucket, key, expiry timestamp, and an HMAC signature derived from your IAM credentials. Set the expiry to a short window (e.g., 15 minutes) appropriate for the use case, and never expose the S3 bucket publicly. For additional control — such as single-use enforcement or IP binding — route downloads through your own API: validate a signed token, check it is unused, mark it consumed, then issue a 302 redirect to the presigned URL.

Q: How would you implement format conversion from a normalized relational schema to a nested JSON export without loading the full dataset into memory?
A: Stream rows from the database using a server-side cursor (e.g., PostgreSQL's named cursor or JDBC streaming mode) and buffer rows for the same parent entity ID to assemble nested objects on the fly. Write completed parent objects as newline-delimited JSON (NDJSON) directly to an output stream connected to S3 via a multipart upload. This keeps memory usage bounded to one parent's children at a time. For very wide joins, denormalize with a database view or use a columnar read path to avoid row-by-row overhead.