Overview
A reporting service allows business users and automated processes to define, schedule, execute, and distribute reports over operational data. Reports are parameterized queries against one or more data sources, producing structured output exported in multiple formats (CSV, PDF, Excel) and delivered via email, webhook, or object storage. This LLD covers the report definition model, execution engine, scheduler, export pipeline, delivery channels, caching strategy, and all key design trade-offs.
Data Model
CREATE TABLE data_sources (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
team_id BIGINT UNSIGNED NOT NULL,
name VARCHAR(128) NOT NULL,
type ENUM('postgresql','mysql','bigquery','redshift','snowflake','http_api') NOT NULL,
connection_dsn TEXT NOT NULL COMMENT 'encrypted at rest; contains host, port, db, credentials',
max_rows INT UNSIGNED NOT NULL DEFAULT 100000 COMMENT 'safety cap on result set size',
timeout_seconds SMALLINT UNSIGNED NOT NULL DEFAULT 30,
is_active TINYINT(1) NOT NULL DEFAULT 1,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
INDEX idx_team (team_id)
);
CREATE TABLE report_definitions (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
external_id CHAR(36) NOT NULL COMMENT 'UUID for external references',
team_id BIGINT UNSIGNED NOT NULL,
owner_user_id BIGINT UNSIGNED NOT NULL,
name VARCHAR(256) NOT NULL,
description TEXT NULL,
data_source_id BIGINT UNSIGNED NOT NULL,
query_template TEXT NOT NULL COMMENT 'SQL or API request template with {{param}} placeholders',
query_type ENUM('sql','api') NOT NULL DEFAULT 'sql',
param_schema JSON NOT NULL COMMENT 'JSON Schema describing accepted parameters',
default_params JSON NULL,
version SMALLINT UNSIGNED NOT NULL DEFAULT 1,
is_active TINYINT(1) NOT NULL DEFAULT 1,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
FOREIGN KEY (data_source_id) REFERENCES data_sources(id),
UNIQUE KEY uq_external (external_id),
INDEX idx_team (team_id)
);
CREATE TABLE report_schedules (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
report_definition_id BIGINT UNSIGNED NOT NULL,
cron_expression VARCHAR(64) NOT NULL COMMENT 'standard 5-field cron or 6-field with seconds',
timezone VARCHAR(64) NOT NULL DEFAULT 'UTC',
params_override JSON NULL COMMENT 'parameter values for this schedule; merged with default_params',
export_formats JSON NOT NULL COMMENT 'array of: csv, pdf, xlsx',
delivery_channel_id BIGINT UNSIGNED NOT NULL,
is_active TINYINT(1) NOT NULL DEFAULT 1,
last_run_at DATETIME NULL,
next_run_at DATETIME NULL,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (report_definition_id) REFERENCES report_definitions(id),
INDEX idx_next_run (is_active, next_run_at)
);
CREATE TABLE report_runs (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
external_id CHAR(36) NOT NULL,
report_definition_id BIGINT UNSIGNED NOT NULL,
schedule_id BIGINT UNSIGNED NULL COMMENT 'NULL for ad-hoc runs',
triggered_by BIGINT UNSIGNED NULL COMMENT 'user ID or NULL for scheduled',
params JSON NOT NULL,
status ENUM('queued','running','exporting','delivering','done','failed','cancelled') NOT NULL DEFAULT 'queued',
row_count INT UNSIGNED NULL,
error_message TEXT NULL,
cache_key CHAR(64) NULL,
cache_hit TINYINT(1) NOT NULL DEFAULT 0,
queued_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
started_at DATETIME NULL,
query_finished_at DATETIME NULL,
export_finished_at DATETIME NULL,
finished_at DATETIME NULL,
UNIQUE KEY uq_external (external_id),
FOREIGN KEY (report_definition_id) REFERENCES report_definitions(id),
INDEX idx_definition (report_definition_id, queued_at),
INDEX idx_status (status, queued_at)
);
CREATE TABLE report_outputs (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
run_id BIGINT UNSIGNED NOT NULL,
format ENUM('csv','pdf','xlsx') NOT NULL,
file_key VARCHAR(512) NOT NULL COMMENT 'S3 key',
file_size_bytes BIGINT UNSIGNED NOT NULL,
file_hash CHAR(64) NOT NULL,
expires_at DATETIME NULL,
FOREIGN KEY (run_id) REFERENCES report_runs(id),
INDEX idx_run (run_id)
);
CREATE TABLE delivery_channels (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
team_id BIGINT UNSIGNED NOT NULL,
name VARCHAR(128) NOT NULL,
type ENUM('email','webhook','s3','sftp') NOT NULL,
config JSON NOT NULL COMMENT 'type-specific: {to, cc, subject_template} or {url, headers} or {bucket, prefix}',
is_active TINYINT(1) NOT NULL DEFAULT 1,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
INDEX idx_team (team_id)
);
CREATE TABLE report_cache (
cache_key CHAR(64) PRIMARY KEY,
run_id BIGINT UNSIGNED NOT NULL,
result_file_key VARCHAR(512) NOT NULL COMMENT 'S3 key of raw result (parquet or compressed JSON)',
row_count INT UNSIGNED NOT NULL,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
expires_at DATETIME NOT NULL,
INDEX idx_expires (expires_at)
);
Core Workflow
1. Report Definition and Parameter Schema
Report definitions store a query template with named placeholders in double-brace syntax: WHERE created_at >= {{start_date}} AND status = {{status}}. The param_schema JSON Schema field describes each parameter’s type, allowed values, default, and whether it is required. At run time, submitted parameters are validated against this schema before query construction.
SQL query templates are never interpolated with raw string substitution. Parameters are extracted and passed as prepared statement bind parameters to prevent SQL injection. For date range parameters, the service generates parameterized clauses and binds the values separately. Only the query structure (table names, column selections) can vary by template; all data values are always bound.
2. Scheduling
The scheduler is a single leader process (elected via a database lock or distributed mutex such as Redis SETNX with TTL). Every 30 seconds it queries report_schedules for rows where is_active = 1 AND next_run_at <= NOW(), creates a report_runs row for each, publishes the run IDs to the execution queue, and updates last_run_at and next_run_at (computed from the cron expression and timezone using a cron parser library).
Timezone-aware scheduling is important for business reports that must run at "9:00 AM local time". Cron expressions are stored in the report’s timezone; the scheduler converts next_run_at to UTC for storage and comparison.
3. Execution Engine
- A worker dequeues a run ID, transitions status to
running, and recordsstarted_at. - Cache check: The cache key is SHA-256 of
report_definition_id + version + canonical_json(params) + floor(now / cache_ttl_bucket). The time bucket quantizes time so that reports with a 1-hour cache TTL reuse results from the same clock-hour. If a validreport_cacheentry exists, the worker skips execution, markscache_hit = 1, and jumps to export. - Query execution: The worker connects to the data source using a connection pool (pgBouncer for PostgreSQL, JDBC pool for others). The parameterized query executes with a
statement_timeoutmatchingdata_sources.timeout_seconds. Rows are streamed from the database cursor rather than fetched into memory all at once. - Result storage: Streamed rows are written to S3 in Parquet format (columnar, compressed with Snappy). Parquet allows efficient re-reading for multiple export formats without re-querying the database. The S3 key is recorded in
report_cachewith an expiry. query_finished_atandrow_countare recorded. Status transitions toexporting.
4. Export Pipeline
For each requested format, the exporter reads the Parquet result file from S3 and converts it:
- CSV: Stream rows through a CSV encoder with proper quoting and encoding (UTF-8 BOM for Excel compatibility). Compress with gzip if the row count exceeds a threshold.
- XLSX: Use a streaming XLSX writer (e.g., xlsxwriter in Python, or Apache POI streaming API in Java) that writes rows directly to disk without building the entire workbook in memory. Apply column auto-width and freeze-pane headers. Limit to 1 million rows (Excel's row cap).
- PDF: Render the result set as an HTML table using a parameterized template (report name, timestamp, parameters in header, paginated table), then convert to PDF via the PDF generation service (internal API call) or directly via headless Chromium. For large result sets, paginate the table across multiple PDF pages.
Each export output is written to S3 and recorded in report_outputs. Status transitions to delivering.
5. Delivery Channels
- Email: The delivery worker fetches the output files, generates pre-signed S3 download links (valid 7 days), and sends an email via SES or SendGrid with the links embedded. Files under 10 MB may be attached directly; larger files are link-only.
- Webhook: The worker POSTs a JSON payload to the configured URL containing the run metadata and signed download URLs. Delivery is retried with exponential backoff on non-2xx responses. After 5 failures the channel is marked as degraded and the team is alerted.
- S3: The output files are copied to a team-configured S3 bucket and prefix. Useful for downstream pipeline consumption.
- SFTP: Files are pushed to an SFTP server. SFTP credentials are stored encrypted in
delivery_channels.config.
After successful delivery, status transitions to done and finished_at is recorded.
Key Design Decisions and Trade-offs
Intermediate Parquet Storage
Storing query results as Parquet on S3 before export decouples query execution from export format generation. This means: (1) the cache can be shared across runs requesting different formats, (2) a failed Excel export can be retried without re-querying the database, and (3) multiple formats can be exported in parallel by independent workers. The trade-off is additional S3 read/write cost and the overhead of a Parquet library dependency.
Read Replica Routing
All report queries must run against read replicas, never the primary database. Enforce this at the data source level: connection DSNs for reporting data sources must point to replica endpoints. A query that accidentally runs on the primary under heavy load can cause replication lag spikes that affect production writes. Consider read replica lag monitoring; if replica lag exceeds a threshold, delay scheduled reports rather than serving stale data silently.
Query Safety and Resource Limits
Untrusted report definitions (if users can write their own SQL) are extremely dangerous. Options in increasing security order: (1) whitelist-only query builder UI with no raw SQL; (2) raw SQL but run under a database role with SELECT-only privileges on specific schemas; (3) raw SQL parsed through a query validator that rejects mutations (INSERT, UPDATE, DELETE, DDL) and disallows subqueries to untrusted tables. Always combine with statement timeouts, max-rows caps, and cost-based query plan analysis (EXPLAIN before execution, reject plans above a cost threshold).
Time-Bucketed Caching
The cache key includes a time bucket (floor of current epoch divided by TTL seconds) so that the cached result is automatically invalidated when the bucket rolls over. This avoids storing explicit TTL logic in the application layer per report and ensures all workers agree on cache validity without coordination. The trade-off is that reports can see data that is up to TTL seconds stale. TTL is configurable per report definition; time-sensitive reports set TTL = 0 to disable caching.
Scheduler Leader Election
Running multiple scheduler instances without coordination causes duplicate runs. Use database-level advisory locking (PostgreSQL pg_try_advisory_lock) or a Redis SETNX lock with a TTL equal to the scheduling interval plus a safety margin. The lock holder runs the scheduling loop; others poll until they can acquire the lock. This is simpler than Raft-based leader election for a low-frequency scheduling loop.
Failure Handling and Edge Cases
- Query timeout: The database kills the query after
timeout_seconds. The worker catches the timeout exception, setserror_message, and transitions the run tofailed. Scheduled runs are retried at the next scheduled time (not immediately), preventing a slow query from flooding the database with retries. - Result set too large: The worker enforces a row count cap by adding
LIMIT max_rows + 1to every query. If the result exceedsmax_rows, the run fails with a descriptive error suggesting the user add filters or reduce the date range. This prevents OOM conditions in the export stage. - Export OOM for large XLSX: XLSX streaming writers use constant memory regardless of row count, but the Parquet read buffer can be large. Read the Parquet file in row group chunks (Parquet's natural unit) rather than loading the entire file. Each chunk is written to XLSX and discarded before the next chunk is read.
- Delivery channel failure: Delivery is decoupled from export: if email delivery fails, the output files remain in S3 and can be re-delivered by resubmitting the delivery step. The system records delivery attempts and outcomes per channel in a delivery_attempts table (omitted from schema above for brevity) to support debugging.
- Duplicate scheduled runs: If the scheduler runs twice (split-brain during leader re-election), the same schedule may enqueue two runs for the same window. Guard against this with a unique constraint on
(schedule_id, DATE_FORMAT(queued_at, '%Y-%m-%d %H'))or by checking for an existing run in the same time window before insertion. - Parameter injection via template: If the query template contains structural SQL (table name varies by parameter), validate the parameter value against an explicit allowlist of table names stored in the report definition, never trusting caller-supplied values for structural SQL elements.
Scalability Considerations
Query execution workers and export workers should be separate fleets since they have different resource profiles: query workers are network-I/O bound (waiting on the database), while export workers are CPU and disk-I/O bound (encoding large files). Scale each independently. Use separate queues for each stage so a backlog of large exports does not block fast query executions.
For very large reports (millions of rows), the Parquet intermediate file can be partitioned into multiple S3 objects (one per 500K rows). Export workers process each partition in parallel and merge the results. For CSV this is trivial (concatenate). For XLSX, each partition writes a separate sheet or the worker stitches sheets in a post-processing step.
The report_cache table grows unboundedly without maintenance. A cleanup job deletes rows where expires_at < NOW() and removes the corresponding S3 Parquet files. Run this cleanup hourly on a low-priority worker to avoid impacting the execution path.
For multi-tenant SaaS deployments, enforce per-team concurrency limits. A team running 50 simultaneous reports should not starve other teams. Use a weighted fair-share scheduler at the queue level: each team gets a credit budget, and jobs from a team that has exhausted its credits are deprioritized until the window resets.
Summary
A reporting service is a multi-stage pipeline: definition and parameter validation, scheduled or ad-hoc execution, cached query execution against read replicas, intermediate Parquet storage, multi-format export, and multi-channel delivery. The Parquet intermediate layer is the key architectural decision that enables caching, format flexibility, and retry isolation. Query safety (injection prevention, timeouts, row caps, read-replica enforcement) and scheduler deduplication are the most critical correctness concerns. Scaling the execution and export fleets independently, combined with time-bucketed caching and per-tenant fairness, ensures the service remains responsive under mixed workloads of frequent small reports and infrequent large batch exports.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Atlassian Interview Guide