Data Export Service Low-Level Design

What is a Data Export Service?

A data export service allows users to download their data in bulk — CSV, JSON, or PDF. Use cases: GDPR data portability (user requests all their data), business analytics exports (download 1M orders to Excel), report generation, and data migration. The design challenges: exports are expensive (large queries), synchronous responses time out for large datasets, and multiple concurrent exports can overwhelm the database.

Requirements

  • Export up to 10 million rows per request
  • Formats: CSV, JSON, XLSX, PDF
  • Async: return a job ID immediately; notify user when ready (email + webhook)
  • Downloadable via a time-limited presigned URL (valid 24 hours)
  • Progress tracking: show percentage complete during generation
  • Concurrency limit: max 2 active exports per user at a time

Data Model

ExportJob(
    job_id      UUID PRIMARY KEY,
    user_id     UUID NOT NULL,
    tenant_id   UUID,
    type        VARCHAR,             -- 'orders', 'customers', 'transactions'
    filters     JSONB,               -- e.g., {"date_range": ["2026-01-01","2026-04-01"]}
    format      ENUM(CSV, JSON, XLSX, PDF),
    status      ENUM(QUEUED, PROCESSING, COMPLETED, FAILED),
    progress    INT DEFAULT 0,       -- 0-100 percent
    row_count   INT,
    file_key    VARCHAR,             -- S3 object key
    download_url VARCHAR,            -- presigned S3 URL (set on completion)
    url_expires_at TIMESTAMPTZ,
    error_message TEXT,
    created_at  TIMESTAMPTZ,
    completed_at TIMESTAMPTZ
)

Export Pipeline

# 1. API endpoint: create job immediately
def create_export(user_id, export_type, filters, format):
    # Check concurrency limit
    active = db.count(ExportJob,
                      user_id=user_id,
                      status__in=['QUEUED','PROCESSING'])
    if active >= 2:
        raise TooManyExports('Max 2 concurrent exports allowed')

    job = db.create(ExportJob(user_id=user_id, type=export_type,
                               filters=filters, format=format,
                               status='QUEUED'))
    queue.enqueue('process_export', job_id=job.job_id)
    return {'job_id': job.job_id, 'status': 'QUEUED'}
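Note that the count-then-create check above has a small race: two simultaneous requests can both observe `active == 1` and both enqueue a job. One way to close it is to fold the check and the insert into a single conditional INSERT. A minimal sketch of that idea, using SQLite as a stand-in for the real database (the table layout and `create_export_atomic` helper are illustrative; under weaker isolation levels a production system may still want an advisory lock or unique constraint):

```python
import sqlite3
import uuid

MAX_ACTIVE = 2

def create_export_atomic(conn, user_id, export_type, fmt):
    """Insert a QUEUED job only if the user is under the limit, in one statement."""
    job_id = str(uuid.uuid4())
    cur = conn.execute(
        """
        INSERT INTO export_jobs (job_id, user_id, type, format, status)
        SELECT ?, ?, ?, ?, 'QUEUED'
        WHERE (SELECT COUNT(*) FROM export_jobs
               WHERE user_id = ? AND status IN ('QUEUED', 'PROCESSING')) < ?
        """,
        (job_id, user_id, export_type, fmt, user_id, MAX_ACTIVE),
    )
    conn.commit()
    if cur.rowcount == 0:          # the WHERE clause suppressed the insert
        raise RuntimeError('Max 2 concurrent exports allowed')
    return job_id

# Demo against an in-memory table standing in for the ExportJob model
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE export_jobs (
    job_id TEXT PRIMARY KEY, user_id TEXT, type TEXT,
    format TEXT, status TEXT)""")

create_export_atomic(conn, 'u1', 'orders', 'CSV')
create_export_atomic(conn, 'u1', 'orders', 'CSV')
try:
    create_export_atomic(conn, 'u1', 'orders', 'CSV')  # third active job
    print('accepted')
except RuntimeError:
    print('rejected')
```

Because the capacity check runs inside the same statement as the insert, there is no window between "count" and "create" for a second request to slip through.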

# 2. Worker: processes the export asynchronously
def process_export(job_id):
    job = db.get(ExportJob, job_id)
    db.update(job_id, status='PROCESSING')

    s3_key = f'exports/{job.user_id}/{job_id}.{job.format.lower()}'
    total_rows = get_row_count(job.type, job.filters)

    try:
        with S3StreamWriter(s3_key, job.format) as writer:
            processed = 0
            for batch in stream_data(job.type, job.filters, batch_size=10_000):
                writer.write_batch(batch)
                processed += len(batch)
                # Update progress once per batch (every 10K rows);
                # max() guards against a zero-row result set
                db.update(job_id, progress=int(processed * 100 / max(total_rows, 1)),
                          row_count=processed)
    except Exception as exc:
        # Record the failure instead of leaving the job stuck in PROCESSING
        db.update(job_id, status='FAILED', error_message=str(exc))
        raise

    # Generate presigned download URL (valid 24h)
    url = s3.generate_presigned_url('get_object',
                                    Params={'Bucket': EXPORT_BUCKET, 'Key': s3_key},
                                    ExpiresIn=86400)
    db.update(job_id, status='COMPLETED', file_key=s3_key,
              download_url=url, url_expires_at=now() + timedelta(hours=24),
              completed_at=now())

    notify_user(job.user_id, job_id=job_id, download_url=url)
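The `stream_data` helper the worker relies on is where keyset pagination comes in: each batch query resumes from the last primary key seen, so memory stays at O(batch_size) and there is no growing OFFSET cost. A minimal sketch, assuming an integer primary key and using SQLite in place of the real store (the table name is interpolated here for brevity only; identifiers should be whitelisted in real code):

```python
import sqlite3

def stream_data(conn, table, batch_size=10_000):
    """Yield rows in batches using keyset pagination (WHERE id > last_seen)."""
    last_id = 0
    while True:
        rows = conn.execute(
            f"SELECT id, name FROM {table} WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]   # advance the cursor past the last row seen

# Demo: 25 rows with batch_size=10 stream out as batches of 10, 10, 5
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (id INTEGER PRIMARY KEY, name TEXT)')
conn.executemany('INSERT INTO orders (name) VALUES (?)',
                 [(f'order-{i}',) for i in range(25)])

batch_sizes = [len(b) for b in stream_data(conn, 'orders', batch_size=10)]
print(batch_sizes)   # [10, 10, 5]
```

The `ORDER BY id LIMIT ?` query stays an index range scan regardless of how deep into the table the export has progressed, which is exactly what LIMIT/OFFSET fails to guarantee.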

Streaming to S3 (Memory-Efficient)

class S3StreamWriter:
    def __init__(self, key, format):
        self.key = key
        self.format = format
        upload = s3.create_multipart_upload(Bucket=EXPORT_BUCKET, Key=key)
        self.upload_id = upload['UploadId']
        self.buffer = io.StringIO()      # csv writes text; encode to bytes on flush
        self.parts = []
        self.part_num = 1
        self.header_written = False

    def __enter__(self):
        return self

    def write_batch(self, rows):
        if self.format == 'CSV':
            writer = csv.DictWriter(self.buffer, fieldnames=rows[0].keys())
            if not self.header_written:   # write the header exactly once
                writer.writeheader()
                self.header_written = True
            writer.writerows(rows)

        # Flush a part once the buffer exceeds 5MB (S3's minimum part size)
        if self.buffer.tell() > 5 * 1024 * 1024:
            self._flush_part()

    def _flush_part(self):
        data = self.buffer.getvalue().encode('utf-8')
        part = s3.upload_part(Bucket=EXPORT_BUCKET, Key=self.key,
                              UploadId=self.upload_id,
                              PartNumber=self.part_num, Body=data)
        self.parts.append({'PartNumber': self.part_num, 'ETag': part['ETag']})
        self.part_num += 1
        self.buffer = io.StringIO()

    def __exit__(self, *args):
        if self.buffer.tell():            # final part may be smaller than 5MB
            self._flush_part()
        s3.complete_multipart_upload(Bucket=EXPORT_BUCKET, Key=self.key,
                                     UploadId=self.upload_id,
                                     MultipartUpload={'Parts': self.parts})

Key Design Decisions

  • Async job + queue — never block the HTTP response for large exports; return job_id immediately
  • S3 multipart upload — stream data to S3 without loading the entire export into memory
  • Batch DB reads — read 10K rows at a time using keyset pagination; avoids loading 10M rows into RAM
  • Presigned URL for download — S3 serves the file directly; no bandwidth cost on your servers
  • Concurrency limit per user — prevents one user’s mass-export job from overwhelming the DB
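Beyond the per-user limit enforced in `create_export`, a system-wide cap on running export workers protects the database from many users exporting at once. In practice this is usually the queue's consumer-concurrency setting; the idea can be sketched in-process with a semaphore (the `MAX_EXPORT_WORKERS` value and `run_export_worker` helper are illustrative):

```python
import threading

MAX_EXPORT_WORKERS = 4                       # system-wide cap (illustrative)
_worker_slots = threading.BoundedSemaphore(MAX_EXPORT_WORKERS)

def run_export_worker(job_id, process):
    """Run one export only if a worker slot is free; otherwise hand it back."""
    if not _worker_slots.acquire(blocking=False):
        return 'REQUEUED'                    # the queue redelivers the job later
    try:
        process(job_id)
        return 'DONE'
    finally:
        _worker_slots.release()

# Fill every slot, then show that one more job gets pushed back to the queue
for _ in range(MAX_EXPORT_WORKERS):
    _worker_slots.acquire(blocking=False)
print(run_export_worker('job-5', lambda job_id: None))   # REQUEUED
```

Returning the job to the queue rather than blocking keeps workers available for other queues and lets the broker's retry/backoff policy decide when to try again.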

