Data Export Service Low-Level Design: Async Export Jobs, Format Conversion, and Secure Download

Data Export Service Low-Level Design

A data export service lets users extract large datasets in their preferred format. The core constraint is that exports can involve millions of rows — the service must process them without exhausting memory, deliver results securely, and handle retries gracefully.

Export Job Schema

Each export is modeled as a job record in the database:

  • job_id — UUID, primary key
  • user_id — owner, used for access control on download
  • entity_type — which table or dataset to export (e.g., orders, users)
  • filters — JSON blob of query parameters (date range, status, etc.)
  • formatcsv, json, parquet, or xlsx
  • statusPENDINGPROCESSINGREADY | FAILED
  • row_count, file_size, progress_pct
  • download_url, expires_at, created_at

Async Processing Flow

Export jobs are never synchronous. The API endpoint creates the job record, enqueues a work item, and returns the job_id immediately. The client polls a status endpoint until status is READY, then downloads the file. This decouples the HTTP request lifecycle from potentially multi-minute processing time.

Data Extraction

The worker fetches data using cursor-based pagination rather than OFFSET, which degrades at large offsets. Each page is 10,000 rows. The cursor is the last seen primary key from the previous page:

SELECT * FROM orders WHERE id > :last_id AND created_at BETWEEN :start AND :end
ORDER BY id LIMIT 10000

The worker never loads the full dataset into memory — it processes and writes one batch at a time.

Streaming Write by Format

  • CSV: write the header row first, then stream each batch of rows with proper comma and quote escaping; use Python's csv module or equivalent
  • JSON: write the opening bracket, stream batches as JSON objects separated by commas, close the array on completion
  • Parquet: use PyArrow's ParquetWriter with schema defined upfront; write row groups batch by batch — columnar format reduces storage 50–70% vs CSV for analytics
  • XLSX: use xlsxwriter in constant-memory mode (use_constant_memory=True) to avoid building the full workbook in RAM

Large Export Chunking

If the estimated row count exceeds 1 million rows, split the export into multiple files (e.g., data_part1.csv, data_part2.csv), each capped at 500K rows. Once all parts are written, zip them together into a single archive. The download URL points to the zip file.

Upload to S3

Temp files are written to local disk during processing, then uploaded to S3 with multipart upload for files over 100 MB. The S3 key pattern is exports/{user_id}/{job_id}/data.{ext}. The bucket is private — files are never directly accessible.

Secure Download

When the job reaches READY, generate a presigned S3 URL with a 24-hour expiry. Before generating the URL, verify the requesting user matches job.user_id. Never expose the internal S3 key or bucket name in the API response. Presigned URLs expire automatically — no separate cleanup needed for access control.

Progress Tracking

The worker updates job.progress_pct every 10,000 rows based on estimated total row count. The client's polling endpoint returns current progress so the UI can show a progress bar. Progress estimation is approximate — use COUNT(*) with the same filters before starting, cached for the job duration.

Cleanup, Rate Limiting, and Notification

  • Cleanup: a scheduled job deletes S3 files and job records older than 7 days
  • Rate limiting: max 3 concurrent export jobs per user; free tier capped at 100K rows per export; enforce at job creation time, return 429 if exceeded
  • Notification: send an email with the download link when the export reaches READY; this eliminates the need for the client to poll at all for non-interactive use cases

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Shopify Interview Guide

Scroll to Top