What is a Data Export Service?
A data export service allows users to download their data in bulk — CSV, JSON, or PDF. Use cases: GDPR data portability (user requests all their data), business analytics exports (download 1M orders to Excel), report generation, and data migration. The design challenges: exports are expensive (large queries), synchronous responses time out for large datasets, and multiple concurrent exports can overwhelm the database.
Requirements
- Export up to 10 million rows per request
- Formats: CSV, JSON, XLSX, PDF
- Async: return a job ID immediately; notify user when ready (email + webhook)
- Downloadable via a time-limited presigned URL (valid 24 hours)
- Progress tracking: show percentage complete during generation
- Concurrency limit: max 2 active exports per user at a time
Data Model
ExportJob(
job_id UUID PRIMARY KEY,
user_id UUID NOT NULL,
tenant_id UUID,
type VARCHAR, -- 'orders', 'customers', 'transactions'
filters JSONB, -- e.g., {"date_range": ["2026-01-01","2026-04-01"]}
format ENUM(CSV, JSON, XLSX, PDF),
status ENUM(QUEUED, PROCESSING, COMPLETED, FAILED),
progress INT DEFAULT 0, -- 0-100 percent
row_count INT,
file_key VARCHAR, -- S3 object key
download_url VARCHAR, -- presigned S3 URL (set on completion)
url_expires_at TIMESTAMPTZ,
error_message TEXT,
created_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ
)
Export Pipeline
# 1. API endpoint: create job immediately
def create_export(user_id, export_type, filters, format):
# Check concurrency limit
active = db.count(ExportJob,
user_id=user_id,
status__in=['QUEUED','PROCESSING'])
if active >= 2:
raise TooManyExports('Max 2 concurrent exports allowed')
job = db.create(ExportJob(user_id=user_id, type=export_type,
filters=filters, format=format,
status='QUEUED'))
queue.enqueue('process_export', job_id=job.job_id)
return {'job_id': job.job_id, 'status': 'QUEUED'}
# 2. Worker: processes the export asynchronously
def process_export(job_id):
job = db.get(ExportJob, job_id)
db.update(job_id, status='PROCESSING')
s3_key = f'exports/{job.user_id}/{job_id}.{job.format.lower()}'
total_rows = get_row_count(job.type, job.filters)
with S3StreamWriter(s3_key, job.format) as writer:
processed = 0
for batch in stream_data(job.type, job.filters, batch_size=10_000):
writer.write_batch(batch)
processed += len(batch)
# Update progress every 10K rows
db.update(job_id, progress=int(processed * 100 / total_rows),
row_count=processed)
# Generate presigned download URL (valid 24h)
url = s3.generate_presigned_url('get_object',
Params={'Bucket': EXPORT_BUCKET, 'Key': s3_key},
ExpiresIn=86400)
db.update(job_id, status='COMPLETED', file_key=s3_key,
download_url=url, url_expires_at=now()+timedelta(hours=24),
completed_at=now())
notify_user(job.user_id, job_id=job_id, download_url=url)
Streaming to S3 (Memory-Efficient)
class S3StreamWriter:
def __init__(self, key, format):
self.upload = s3.create_multipart_upload(Bucket=EXPORT_BUCKET, Key=key)
self.buffer = io.BytesIO()
self.parts = []
self.part_num = 1
def write_batch(self, rows):
if self.format == 'CSV':
writer = csv.DictWriter(self.buffer, fieldnames=rows[0].keys())
if self.part_num == 1:
writer.writeheader()
writer.writerows(rows)
# Flush to S3 part every 5MB (minimum part size)
if self.buffer.tell() > 5 * 1024 * 1024:
self._flush_part()
def _flush_part(self):
data = self.buffer.getvalue()
part = s3.upload_part(Body=data, PartNumber=self.part_num, ...)
self.parts.append({'PartNumber': self.part_num, 'ETag': part['ETag']})
self.part_num += 1
self.buffer = io.BytesIO()
def __exit__(self, *args):
self._flush_part()
s3.complete_multipart_upload(MultipartUpload={'Parts': self.parts}, ...)
Key Design Decisions
- Async job + queue — never block the HTTP response for large exports; return job_id immediately
- S3 multipart upload — stream data to S3 without loading the entire export into memory
- Batch DB reads — read 10K rows at a time using keyset pagination; avoids loading 10M rows into RAM
- Presigned URL for download — S3 serves the file directly; no bandwidth cost on your servers
- Concurrency limit per user — prevents one user’s mass-export job from overwhelming the DB
Data export service and async job design is discussed in Shopify system design interview questions.
Large-scale data export and streaming pipeline design is covered in Databricks system design interview preparation.
Data export service and S3 streaming design is discussed in Amazon system design interview guide.
See also: Stripe Interview Guide 2026: Process, Bug Bash Round, and Payment Systems