Question 1

How do you design an async export job system that handles large datasets without timing out?

Accepted Answer

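Accept the export request synchronously and return a job ID immediately, then process the export asynchronously through a worker queue (e.g., SQS + Lambda or Celery). If you choose Lambda, keep in mind that a single invocation is capped at 15 minutes, so a large export must either be split into chunks that each fit within that limit or run on a longer-lived worker. The worker streams data from the database in paginated chunks to avoid loading the full dataset into memory, writes the output incrementally to object storage (S3), and updates the job's status in a jobs table. The client polls a status endpoint or receives a webhook on completion. Expire the output file after a retention period (for example, with an S3 lifecycle rule) and hand out a pre-signed download URL that is valid only for a short window.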
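
The flow above can be sketched with in-memory stand-ins. This is a minimal illustration, not a production design: `JOBS`, `QUEUE`, `STORAGE`, `fetch_page`, `submit_export`, and `run_worker` are hypothetical names, and the dictionaries take the place of a jobs table, a real queue (SQS or Celery), and S3.

```python
import csv
import io
import uuid
from collections import deque

# Hypothetical in-memory stand-ins for the jobs table, the worker queue,
# and object storage; a real system would use a database, SQS/Celery, and S3.
JOBS = {}
QUEUE = deque()
STORAGE = {}

# Fake source table: 25 rows of (id, name).
ROWS = [(i, f"user{i}") for i in range(1, 26)]


def fetch_page(after_id, limit):
    """Keyset pagination: return up to `limit` rows with id > after_id."""
    return [r for r in ROWS if r[0] > after_id][:limit]


def submit_export():
    """Request handler: record the job, enqueue it, and return at once."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "queued", "rows_written": 0}
    QUEUE.append(job_id)
    return job_id


def run_worker(page_size=10):
    """Worker: drain the queue, exporting each job one page at a time."""
    while QUEUE:
        job_id = QUEUE.popleft()
        job = JOBS[job_id]
        job["status"] = "running"
        out = io.StringIO()  # stands in for an incremental upload stream
        writer = csv.writer(out)
        writer.writerow(["id", "name"])
        last_id = 0
        while True:
            page = fetch_page(last_id, page_size)
            if not page:
                break
            writer.writerows(page)
            last_id = page[-1][0]
            job["rows_written"] += len(page)  # progress visible to pollers
        STORAGE[f"exports/{job_id}.csv"] = out.getvalue()
        job["status"] = "done"


job_id = submit_export()   # returns immediately with status "queued"
run_worker()               # would run in a separate process in practice
```

The request path only touches the jobs table and the queue, so its latency does not depend on the size of the dataset; all of the slow work happens in `run_worker`.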

Question 2

What are the trade-offs between CSV, JSON, and Parquet for a bulk data export service?

Accepted Answer

CSV is universally readable and compact for flat data, but it loses type information (every value is text) and handles nested structures poorly. JSON preserves nesting and basic types (strings, numbers, booleans, null) but has no native date or decimal type, is verbose, and is slow to parse at scale. Parquet is a columnar binary format that is often 5-10x smaller than the equivalent CSV (the ratio depends heavily on the data and the compression codec), retains schema, and enables column pruning and predicate pushdown in downstream analytics tools like Spark or Athena. It requires a library to read, however, and is unsuitable for human inspection. Choose CSV for interoperability with spreadsheet tools, JSON for API consumers, and Parquet for data pipeline or warehouse ingestion.
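
To make the type-information point concrete, here is a small sketch that round-trips one record through CSV and JSON using only the standard library. The `record` contents are made up for illustration. Parquet is left out because reading and writing it needs a third-party library.

```python
import csv
import io
import json

record = {"id": 7, "active": True, "tags": ["a", "b"]}

# CSV: every value is written as text, and the nested list is flattened
# into its string form because CSV has no notion of nesting.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buf.seek(0)
csv_row = next(csv.DictReader(buf))

# JSON: numbers, booleans, and arrays survive the round trip.
json_row = json.loads(json.dumps(record))

print(csv_row)
print(json_row)
```

After the CSV round trip every value is a string, so the consumer has to know out of band that `id` is an integer and `tags` is a list. The JSON copy compares equal to the original record.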

Question 3

How do you generate secure, time-limited download links for exported files stored in S3?

Accepted Answer

Use S3 pre-signed URLs generated server-side with the AWS SDK. The URL embeds the bucket, key, expiry, and an HMAC signature derived from your IAM credentials, so anyone holding the URL can download the object until it expires. Set the expiry to a short window (e.g., 15 minutes) appropriate for the use case. Never expose the S3 bucket publicly. For additional control, such as single-use enforcement or IP binding, put your own API endpoint in front of S3: issue a signed token, and have the endpoint validate the token, check that it is unused, mark it consumed, and then issue a 302 redirect to the pre-signed URL.
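
The single-use token layer can be sketched with the standard library alone. This is an illustrative assumption about how such a token might look, not a prescribed format: `SECRET`, `USED_TOKENS`, `issue_token`, and `redeem_token` are hypothetical names, and the in-memory set stands in for a database table or cache. The step that generates the pre-signed URL and returns the 302 is only indicated by a comment.

```python
import base64
import hashlib
import hmac
import time

SECRET = b"replace-with-a-real-secret"  # hypothetical signing key
USED_TOKENS = set()  # stand-in for a database table or cache


def _sign(payload):
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(file_key, ttl_seconds, now=None):
    """Return 'payload.signature'; the payload encodes file key and expiry."""
    issued_at = int(now if now is not None else time.time())
    raw = f"{file_key}|{issued_at + ttl_seconds}"
    payload = base64.urlsafe_b64encode(raw.encode()).decode()
    return f"{payload}.{_sign(payload)}"


def redeem_token(token, now=None):
    """Check signature, expiry, and single use; return the file key or None."""
    payload, _, sig = token.rpartition(".")
    # Constant-time comparison avoids leaking the signature via timing.
    if not hmac.compare_digest(sig.encode(), _sign(payload).encode()):
        return None
    raw = base64.urlsafe_b64decode(payload).decode()
    file_key, _, expires = raw.rpartition("|")
    current = now if now is not None else time.time()
    if int(expires) < current:
        return None
    if token in USED_TOKENS:
        return None
    USED_TOKENS.add(token)
    # A real handler would now generate the pre-signed URL and return a 302.
    return file_key


token = issue_token("exports/job-123.csv", ttl_seconds=900, now=1000)
first = redeem_token(token, now=1100)    # succeeds
second = redeem_token(token, now=1100)   # rejected: already consumed
```

In a real deployment the "check unused, then mark consumed" step has to be atomic (for example, a conditional write or a unique-constraint insert), otherwise two concurrent requests could both redeem the same token.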

Question 4

How would you implement format conversion from a normalized relational schema to a nested JSON export without loading the full dataset into memory?

Accepted Answer

Stream rows from the database using a server-side cursor (e.g., a PostgreSQL server-side cursor, exposed as a named cursor in psycopg2, or JDBC streaming mode), with the query ordered by parent entity ID so that all rows for one parent arrive together. Buffer the rows that share a parent ID to assemble each nested object on the fly, then write the completed parent object as a line of newline-delimited JSON (NDJSON) directly to an output stream connected to S3 via a multipart upload. This keeps memory usage bounded to a single parent's children at a time. For very wide joins, denormalize with a database view or use a columnar read path to avoid row-by-row overhead.
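
A minimal sketch of the grouping step, assuming the joined rows arrive sorted by parent ID. The orders/items schema, the `rows_to_ndjson` function, and the sample data are hypothetical; an iterator stands in for the server-side cursor and a `StringIO` for the multipart-upload stream.

```python
import io
import itertools
import json


def rows_to_ndjson(rows, out):
    """Group joined rows by parent id and write one JSON object per line.

    `rows` must yield (order_id, customer, sku, qty) tuples sorted by
    order_id, as an ORDER BY on the join would produce.
    """
    count = 0
    for order_id, group in itertools.groupby(rows, key=lambda r: r[0]):
        group = list(group)  # only one parent's children are held at once
        parent = {
            "order_id": order_id,
            "customer": group[0][1],
            "items": [{"sku": sku, "qty": qty} for _, _, sku, qty in group],
        }
        out.write(json.dumps(parent) + "\n")
        count += 1
    return count


# Hypothetical joined rows; the iterator stands in for a server-side cursor.
joined = iter([
    (1, "ada", "A-1", 2),
    (1, "ada", "B-9", 1),
    (2, "bob", "A-1", 5),
])
sink = io.StringIO()  # stands in for a multipart-upload stream
written = rows_to_ndjson(joined, sink)
```

`itertools.groupby` only merges adjacent rows with the same key, which is why the ordering of the query matters: without it, one parent could be emitted as several partial objects.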

Data Export Service Low-Level Design: Async Export Jobs, Format Conversion, and Secure Download

Export Job Schema

Async Processing Flow

Data Extraction

Streaming Write by Format

Large Export Chunking

Upload to S3

Secure Download

Progress Tracking

Cleanup, Rate Limiting, and Notification