Data Export Service Low-Level Design

What is a Data Export Service?

A data export service allows users to download their data in bulk — CSV, JSON, or PDF. Use cases: GDPR data portability (user requests all their data), business analytics exports (download 1M orders to Excel), report generation, and data migration. The design challenges: exports are expensive (large queries), synchronous responses time out for large datasets, and multiple concurrent exports can overwhelm the database.

Requirements

  • Export up to 10 million rows per request
  • Formats: CSV, JSON, XLSX, PDF
  • Async: return a job ID immediately; notify user when ready (email + webhook)
  • Downloadable via a time-limited presigned URL (valid 24 hours)
  • Progress tracking: show percentage complete during generation
  • Concurrency limit: max 2 active exports per user at a time

Data Model

ExportJob(
    job_id      UUID PRIMARY KEY,
    user_id     UUID NOT NULL,
    tenant_id   UUID,
    type        VARCHAR,             -- 'orders', 'customers', 'transactions'
    filters     JSONB,               -- e.g., {"date_range": ["2026-01-01","2026-04-01"]}
    format      ENUM(CSV, JSON, XLSX, PDF),
    status      ENUM(QUEUED, PROCESSING, COMPLETED, FAILED),
    progress    INT DEFAULT 0,       -- 0-100 percent
    row_count   INT,
    file_key    VARCHAR,             -- S3 object key
    download_url VARCHAR,            -- presigned S3 URL (set on completion)
    url_expires_at TIMESTAMPTZ,
    error_message TEXT,
    created_at  TIMESTAMPTZ,
    completed_at TIMESTAMPTZ
)

Export Pipeline

# 1. API endpoint: create job immediately
def create_export(user_id, export_type, filters, format):
    # Check concurrency limit
    active = db.count(ExportJob,
                      user_id=user_id,
                      status__in=['QUEUED','PROCESSING'])
    if active >= 2:
        raise TooManyExports('Max 2 concurrent exports allowed')

    job = db.create(ExportJob(user_id=user_id, type=export_type,
                               filters=filters, format=format,
                               status='QUEUED'))
    queue.enqueue('process_export', job_id=job.job_id)
    return {'job_id': job.job_id, 'status': 'QUEUED'}
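Note that the count-then-create check above has a small race: two simultaneous requests can both observe `active == 1` and both enqueue a job. One way to close it is to fold the check and the insert into a single conditional INSERT. A minimal sketch of that idea, using SQLite as a stand-in for the real database (the table layout and `create_export_atomic` helper are illustrative; under weaker isolation levels a production system may still want an advisory lock or unique constraint):

```python
import sqlite3
import uuid

MAX_ACTIVE = 2

def create_export_atomic(conn, user_id, export_type, fmt):
    """Insert a QUEUED job only if the user is under the limit, in one statement."""
    job_id = str(uuid.uuid4())
    cur = conn.execute(
        """
        INSERT INTO export_jobs (job_id, user_id, type, format, status)
        SELECT ?, ?, ?, ?, 'QUEUED'
        WHERE (SELECT COUNT(*) FROM export_jobs
               WHERE user_id = ? AND status IN ('QUEUED', 'PROCESSING')) < ?
        """,
        (job_id, user_id, export_type, fmt, user_id, MAX_ACTIVE),
    )
    conn.commit()
    if cur.rowcount == 0:          # the WHERE clause suppressed the insert
        raise RuntimeError('Max 2 concurrent exports allowed')
    return job_id

# Demo against an in-memory table standing in for the ExportJob model
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE export_jobs (
    job_id TEXT PRIMARY KEY, user_id TEXT, type TEXT,
    format TEXT, status TEXT)""")

create_export_atomic(conn, 'u1', 'orders', 'CSV')
create_export_atomic(conn, 'u1', 'orders', 'CSV')
try:
    create_export_atomic(conn, 'u1', 'orders', 'CSV')  # third active job
    print('accepted')
except RuntimeError:
    print('rejected')
```

Because the capacity check runs inside the same statement as the insert, there is no window between "count" and "create" for a second request to slip through.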

# 2. Worker: processes the export asynchronously
def process_export(job_id):
    job = db.get(ExportJob, job_id)
    db.update(job_id, status='PROCESSING')

    s3_key = f'exports/{job.user_id}/{job_id}.{job.format.lower()}'
    total_rows = get_row_count(job.type, job.filters)

    try:
        with S3StreamWriter(s3_key, job.format) as writer:
            processed = 0
            for batch in stream_data(job.type, job.filters, batch_size=10_000):
                writer.write_batch(batch)
                processed += len(batch)
                # Update progress once per batch (every 10K rows);
                # max() guards against a zero-row result set
                db.update(job_id, progress=int(processed * 100 / max(total_rows, 1)),
                          row_count=processed)
    except Exception as exc:
        # Record the failure instead of leaving the job stuck in PROCESSING
        db.update(job_id, status='FAILED', error_message=str(exc))
        raise

    # Generate presigned download URL (valid 24h)
    url = s3.generate_presigned_url('get_object',
                                    Params={'Bucket': EXPORT_BUCKET, 'Key': s3_key},
                                    ExpiresIn=86400)
    db.update(job_id, status='COMPLETED', file_key=s3_key,
              download_url=url, url_expires_at=now() + timedelta(hours=24),
              completed_at=now())

    notify_user(job.user_id, job_id=job_id, download_url=url)
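The `stream_data` helper the worker relies on is where keyset pagination comes in: each batch query resumes from the last primary key seen, so memory stays at O(batch_size) and there is no growing OFFSET cost. A minimal sketch, assuming an integer primary key and using SQLite in place of the real store (the table name is interpolated here for brevity only; identifiers should be whitelisted in real code):

```python
import sqlite3

def stream_data(conn, table, batch_size=10_000):
    """Yield rows in batches using keyset pagination (WHERE id > last_seen)."""
    last_id = 0
    while True:
        rows = conn.execute(
            f"SELECT id, name FROM {table} WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]   # advance the cursor past the last row seen

# Demo: 25 rows with batch_size=10 stream out as batches of 10, 10, 5
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (id INTEGER PRIMARY KEY, name TEXT)')
conn.executemany('INSERT INTO orders (name) VALUES (?)',
                 [(f'order-{i}',) for i in range(25)])

batch_sizes = [len(b) for b in stream_data(conn, 'orders', batch_size=10)]
print(batch_sizes)   # [10, 10, 5]
```

The `ORDER BY id LIMIT ?` query stays an index range scan regardless of how deep into the table the export has progressed, which is exactly what LIMIT/OFFSET fails to guarantee.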

Streaming to S3 (Memory-Efficient)

class S3StreamWriter:
    def __init__(self, key, format):
        self.key = key
        self.format = format
        upload = s3.create_multipart_upload(Bucket=EXPORT_BUCKET, Key=key)
        self.upload_id = upload['UploadId']
        self.buffer = io.StringIO()      # csv writes text; encode to bytes on flush
        self.parts = []
        self.part_num = 1
        self.header_written = False

    def __enter__(self):
        return self

    def write_batch(self, rows):
        if self.format == 'CSV':
            writer = csv.DictWriter(self.buffer, fieldnames=rows[0].keys())
            if not self.header_written:   # write the header exactly once
                writer.writeheader()
                self.header_written = True
            writer.writerows(rows)

        # Flush a part once the buffer exceeds 5MB (S3's minimum part size)
        if self.buffer.tell() > 5 * 1024 * 1024:
            self._flush_part()

    def _flush_part(self):
        data = self.buffer.getvalue().encode('utf-8')
        part = s3.upload_part(Bucket=EXPORT_BUCKET, Key=self.key,
                              UploadId=self.upload_id,
                              PartNumber=self.part_num, Body=data)
        self.parts.append({'PartNumber': self.part_num, 'ETag': part['ETag']})
        self.part_num += 1
        self.buffer = io.StringIO()

    def __exit__(self, *args):
        if self.buffer.tell():            # final part may be smaller than 5MB
            self._flush_part()
        s3.complete_multipart_upload(Bucket=EXPORT_BUCKET, Key=self.key,
                                     UploadId=self.upload_id,
                                     MultipartUpload={'Parts': self.parts})

Key Design Decisions

  • Async job + queue — never block the HTTP response for large exports; return job_id immediately
  • S3 multipart upload — stream data to S3 without loading the entire export into memory
  • Batch DB reads — read 10K rows at a time using keyset pagination; avoids loading 10M rows into RAM
  • Presigned URL for download — S3 serves the file directly; no bandwidth cost on your servers
  • Concurrency limit per user — prevents one user’s mass-export job from overwhelming the DB
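Beyond the per-user limit enforced in `create_export`, a system-wide cap on running export workers protects the database from many users exporting at once. In practice this is usually the queue's consumer-concurrency setting; the idea can be sketched in-process with a semaphore (the `MAX_EXPORT_WORKERS` value and `run_export_worker` helper are illustrative):

```python
import threading

MAX_EXPORT_WORKERS = 4                       # system-wide cap (illustrative)
_worker_slots = threading.BoundedSemaphore(MAX_EXPORT_WORKERS)

def run_export_worker(job_id, process):
    """Run one export only if a worker slot is free; otherwise hand it back."""
    if not _worker_slots.acquire(blocking=False):
        return 'REQUEUED'                    # the queue redelivers the job later
    try:
        process(job_id)
        return 'DONE'
    finally:
        _worker_slots.release()

# Fill every slot, then show that one more job gets pushed back to the queue
for _ in range(MAX_EXPORT_WORKERS):
    _worker_slots.acquire(blocking=False)
print(run_export_worker('job-5', lambda job_id: None))   # REQUEUED
```

Returning the job to the queue rather than blocking keeps workers available for other queues and lets the broker's retry/backoff policy decide when to try again.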

