What is a reporting service and how does it differ from a data warehouse query?

A reporting service is a managed layer that translates user-facing report definitions (filters, groupings, date ranges, metrics) into queries, executes them against one or more data sources, and formats the results for presentation or export. Unlike ad-hoc data warehouse queries, a reporting service enforces tenant-level access control, applies predefined business logic and metric definitions, caches results, supports scheduling and delivery, and abstracts the underlying query engine from end users. Data warehouse tools (BigQuery, Redshift) are often one of several backends the reporting service can delegate to, but the service adds the governance, UI contract, and operational features that raw warehouse access lacks.

How are parametric report queries executed safely to prevent SQL injection?

User-supplied filter values are never interpolated directly into SQL strings. Instead, the service uses parameterized queries (prepared statements) for all database backends, passing user values as bound parameters that the database driver treats as data, not syntax. For dynamic column selection or ORDER BY clauses—where bind parameters cannot be used—the service validates each value against a strict allowlist of known column names and sort directions before constructing the query fragment. A query-builder layer (e.g., Knex, SQLAlchemy, or a custom AST) is used rather than raw string concatenation, making accidental injection structurally impossible. All generated SQL is logged for audit, and query timeouts plus row-count limits are enforced to prevent resource exhaustion from malicious or pathological inputs.

How does a reporting service support scheduled report delivery?

Users or administrators define schedules (cron expressions) and delivery targets (email addresses, S3 buckets, Slack channels, webhooks) in the service's configuration store. A scheduler process (e.g., backed by pg_cron, Quartz, or a distributed cron like Airflow) fires at the specified cadence, enqueues a report-generation job with the saved parameters, and associates it with the delivery target. A worker executes the report query, formats the output (PDF, CSV, inline HTML), and hands it to a delivery service that sends the email via SES or posts to the webhook. Delivery status, timestamps, and any errors are recorded per schedule run, and failed deliveries are retried. Users receive an in-app notification and email with a secure link regardless of the delivery channel, ensuring they can access results even if a downstream channel is unavailable.

How are large reports exported to CSV or Excel without memory issues?

Large exports use streaming rather than materializing the full result set in memory. The query is executed with a server-side cursor (e.g., PostgreSQL's named cursor or JDBC streaming result set), and rows are fetched in small batches (e.g., 1,000 rows at a time). Each batch is serialized—appended to a streaming CSV writer or passed to a streaming Excel library such as xlsxwriter in Python or Apache POI's streaming SXSSF in Java—and flushed to the response stream or to a temporary file on disk. For very large exports the output is written directly to object storage (S3) in a multipart upload, and the user is notified with a download link when the upload completes. This approach keeps memory consumption bounded at O(batch size) regardless of the total row count, allowing exports of millions of rows without OOM risk.

Low Level Design: Reporting Service

⏱ 14 min read

Overview

A reporting service allows business users and automated processes to define, schedule, execute, and distribute reports over operational data. Reports are parameterized queries against one or more data sources, producing structured output exported in multiple formats (CSV, PDF, Excel) and delivered via email, webhook, or object storage. This LLD covers the report definition model, execution engine, scheduler, export pipeline, delivery channels, caching strategy, and all key design trade-offs.

Data Model

CREATE TABLE data_sources (
    id              BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    team_id         BIGINT UNSIGNED NOT NULL,
    name            VARCHAR(128) NOT NULL,
    type            ENUM('postgresql','mysql','bigquery','redshift','snowflake','http_api') NOT NULL,
    connection_dsn  TEXT NOT NULL COMMENT 'encrypted at rest; contains host, port, db, credentials',
    max_rows        INT UNSIGNED NOT NULL DEFAULT 100000 COMMENT 'safety cap on result set size',
    timeout_seconds SMALLINT UNSIGNED NOT NULL DEFAULT 30,
    is_active       TINYINT(1) NOT NULL DEFAULT 1,
    created_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_team (team_id)
);

CREATE TABLE report_definitions (
    id              BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    external_id     CHAR(36) NOT NULL COMMENT 'UUID for external references',
    team_id         BIGINT UNSIGNED NOT NULL,
    owner_user_id   BIGINT UNSIGNED NOT NULL,
    name            VARCHAR(256) NOT NULL,
    description     TEXT NULL,
    data_source_id  BIGINT UNSIGNED NOT NULL,
    query_template  TEXT NOT NULL COMMENT 'SQL or API request template with {{param}} placeholders',
    query_type      ENUM('sql','api') NOT NULL DEFAULT 'sql',
    param_schema    JSON NOT NULL COMMENT 'JSON Schema describing accepted parameters',
    default_params  JSON NULL,
    version         SMALLINT UNSIGNED NOT NULL DEFAULT 1,
    is_active       TINYINT(1) NOT NULL DEFAULT 1,
    created_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    FOREIGN KEY (data_source_id) REFERENCES data_sources(id),
    UNIQUE KEY uq_external (external_id),
    INDEX idx_team (team_id)
);

CREATE TABLE report_schedules (
    id                  BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    report_definition_id BIGINT UNSIGNED NOT NULL,
    cron_expression     VARCHAR(64) NOT NULL COMMENT 'standard 5-field cron or 6-field with seconds',
    timezone            VARCHAR(64) NOT NULL DEFAULT 'UTC',
    params_override     JSON NULL COMMENT 'parameter values for this schedule; merged with default_params',
    export_formats      JSON NOT NULL COMMENT 'array of: csv, pdf, xlsx',
    delivery_channel_id BIGINT UNSIGNED NOT NULL,
    is_active           TINYINT(1) NOT NULL DEFAULT 1,
    last_run_at         DATETIME NULL,
    next_run_at         DATETIME NULL,
    created_at          DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (report_definition_id) REFERENCES report_definitions(id),
    INDEX idx_next_run (is_active, next_run_at)
);

CREATE TABLE report_runs (
    id                  BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    external_id         CHAR(36) NOT NULL,
    report_definition_id BIGINT UNSIGNED NOT NULL,
    schedule_id         BIGINT UNSIGNED NULL COMMENT 'NULL for ad-hoc runs',
    triggered_by        BIGINT UNSIGNED NULL COMMENT 'user ID or NULL for scheduled',
    params              JSON NOT NULL,
    status              ENUM('queued','running','exporting','delivering','done','failed','cancelled') NOT NULL DEFAULT 'queued',
    row_count           INT UNSIGNED NULL,
    error_message       TEXT NULL,
    cache_key           CHAR(64) NULL,
    cache_hit           TINYINT(1) NOT NULL DEFAULT 0,
    queued_at           DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    started_at          DATETIME NULL,
    query_finished_at   DATETIME NULL,
    export_finished_at  DATETIME NULL,
    finished_at         DATETIME NULL,
    UNIQUE KEY uq_external (external_id),
    FOREIGN KEY (report_definition_id) REFERENCES report_definitions(id),
    INDEX idx_definition (report_definition_id, queued_at),
    INDEX idx_status (status, queued_at)
);

CREATE TABLE report_outputs (
    id              BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    run_id          BIGINT UNSIGNED NOT NULL,
    format          ENUM('csv','pdf','xlsx') NOT NULL,
    file_key        VARCHAR(512) NOT NULL COMMENT 'S3 key',
    file_size_bytes BIGINT UNSIGNED NOT NULL,
    file_hash       CHAR(64) NOT NULL,
    expires_at      DATETIME NULL,
    FOREIGN KEY (run_id) REFERENCES report_runs(id),
    INDEX idx_run (run_id)
);

CREATE TABLE delivery_channels (
    id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    team_id     BIGINT UNSIGNED NOT NULL,
    name        VARCHAR(128) NOT NULL,
    type        ENUM('email','webhook','s3','sftp') NOT NULL,
    config      JSON NOT NULL COMMENT 'type-specific: {to, cc, subject_template} or {url, headers} or {bucket, prefix}',
    is_active   TINYINT(1) NOT NULL DEFAULT 1,
    created_at  DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_team (team_id)
);

CREATE TABLE report_cache (
    cache_key       CHAR(64) PRIMARY KEY,
    run_id          BIGINT UNSIGNED NOT NULL,
    result_file_key VARCHAR(512) NOT NULL COMMENT 'S3 key of raw result (parquet or compressed JSON)',
    row_count       INT UNSIGNED NOT NULL,
    created_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    expires_at      DATETIME NOT NULL,
    INDEX idx_expires (expires_at)
);

Core Workflow

1. Report Definition and Parameter Schema

Report definitions store a query template with named placeholders in double-brace syntax: WHERE created_at >= {{start_date}} AND status = {{status}}. The param_schema JSON Schema field describes each parameter’s type, allowed values, default, and whether it is required. At run time, submitted parameters are validated against this schema before query construction.

SQL query templates are never interpolated with raw string substitution. Parameters are extracted and passed as prepared statement bind parameters to prevent SQL injection. For date range parameters, the service generates parameterized clauses and binds the values separately. Only the query structure (table names, column selections) can vary by template; all data values are always bound.

2. Scheduling

The scheduler is a single leader process (elected via a database lock or distributed mutex such as Redis SETNX with TTL). Every 30 seconds it queries report_schedules for rows where is_active = 1 AND next_run_at <= NOW(), creates a report_runs row for each, publishes the run IDs to the execution queue, and updates last_run_at and next_run_at (computed from the cron expression and timezone using a cron parser library).

Timezone-aware scheduling is important for business reports that must run at "9:00 AM local time". Cron expressions are stored in the report’s timezone; the scheduler converts next_run_at to UTC for storage and comparison.

3. Execution Engine

A worker dequeues a run ID, transitions status to running, and records started_at.
Cache check: The cache key is SHA-256 of report_definition_id + version + canonical_json(params) + floor(now / cache_ttl_bucket). The time bucket quantizes time so that reports with a 1-hour cache TTL reuse results from the same clock-hour. If a valid report_cache entry exists, the worker skips execution, marks cache_hit = 1, and jumps to export.
Query execution: The worker connects to the data source using a connection pool (pgBouncer for PostgreSQL, JDBC pool for others). The parameterized query executes with a statement_timeout matching data_sources.timeout_seconds. Rows are streamed from the database cursor rather than fetched into memory all at once.
Result storage: Streamed rows are written to S3 in Parquet format (columnar, compressed with Snappy). Parquet allows efficient re-reading for multiple export formats without re-querying the database. The S3 key is recorded in report_cache with an expiry.
query_finished_at and row_count are recorded. Status transitions to exporting.

4. Export Pipeline

For each requested format, the exporter reads the Parquet result file from S3 and converts it:

CSV: Stream rows through a CSV encoder with proper quoting and encoding (UTF-8 BOM for Excel compatibility). Compress with gzip if the row count exceeds a threshold.
XLSX: Use a streaming XLSX writer (e.g., xlsxwriter in Python, or Apache POI streaming API in Java) that writes rows directly to disk without building the entire workbook in memory. Apply column auto-width and freeze-pane headers. Limit to 1 million rows (Excel's row cap).
PDF: Render the result set as an HTML table using a parameterized template (report name, timestamp, parameters in header, paginated table), then convert to PDF via the PDF generation service (internal API call) or directly via headless Chromium. For large result sets, paginate the table across multiple PDF pages.

Each export output is written to S3 and recorded in report_outputs. Status transitions to delivering.

5. Delivery Channels

Email: The delivery worker fetches the output files, generates pre-signed S3 download links (valid 7 days), and sends an email via SES or SendGrid with the links embedded. Files under 10 MB may be attached directly; larger files are link-only.
Webhook: The worker POSTs a JSON payload to the configured URL containing the run metadata and signed download URLs. Delivery is retried with exponential backoff on non-2xx responses. After 5 failures the channel is marked as degraded and the team is alerted.
S3: The output files are copied to a team-configured S3 bucket and prefix. Useful for downstream pipeline consumption.
SFTP: Files are pushed to an SFTP server. SFTP credentials are stored encrypted in delivery_channels.config.

After successful delivery, status transitions to done and finished_at is recorded.

Key Design Decisions and Trade-offs

Intermediate Parquet Storage

Storing query results as Parquet on S3 before export decouples query execution from export format generation. This means: (1) the cache can be shared across runs requesting different formats, (2) a failed Excel export can be retried without re-querying the database, and (3) multiple formats can be exported in parallel by independent workers. The trade-off is additional S3 read/write cost and the overhead of a Parquet library dependency.

Read Replica Routing

All report queries must run against read replicas, never the primary database. Enforce this at the data source level: connection DSNs for reporting data sources must point to replica endpoints. A query that accidentally runs on the primary under heavy load can cause replication lag spikes that affect production writes. Consider read replica lag monitoring; if replica lag exceeds a threshold, delay scheduled reports rather than serving stale data silently.

Query Safety and Resource Limits

Untrusted report definitions (if users can write their own SQL) are extremely dangerous. Options in increasing security order: (1) whitelist-only query builder UI with no raw SQL; (2) raw SQL but run under a database role with SELECT-only privileges on specific schemas; (3) raw SQL parsed through a query validator that rejects mutations (INSERT, UPDATE, DELETE, DDL) and disallows subqueries to untrusted tables. Always combine with statement timeouts, max-rows caps, and cost-based query plan analysis (EXPLAIN before execution, reject plans above a cost threshold).

Time-Bucketed Caching

The cache key includes a time bucket (floor of current epoch divided by TTL seconds) so that the cached result is automatically invalidated when the bucket rolls over. This avoids storing explicit TTL logic in the application layer per report and ensures all workers agree on cache validity without coordination. The trade-off is that reports can see data that is up to TTL seconds stale. TTL is configurable per report definition; time-sensitive reports set TTL = 0 to disable caching.

Scheduler Leader Election

Running multiple scheduler instances without coordination causes duplicate runs. Use database-level advisory locking (PostgreSQL pg_try_advisory_lock) or a Redis SETNX lock with a TTL equal to the scheduling interval plus a safety margin. The lock holder runs the scheduling loop; others poll until they can acquire the lock. This is simpler than Raft-based leader election for a low-frequency scheduling loop.

Failure Handling and Edge Cases

Query timeout: The database kills the query after timeout_seconds. The worker catches the timeout exception, sets error_message, and transitions the run to failed. Scheduled runs are retried at the next scheduled time (not immediately), preventing a slow query from flooding the database with retries.
Result set too large: The worker enforces a row count cap by adding LIMIT max_rows + 1 to every query. If the result exceeds max_rows, the run fails with a descriptive error suggesting the user add filters or reduce the date range. This prevents OOM conditions in the export stage.
Export OOM for large XLSX: XLSX streaming writers use constant memory regardless of row count, but the Parquet read buffer can be large. Read the Parquet file in row group chunks (Parquet's natural unit) rather than loading the entire file. Each chunk is written to XLSX and discarded before the next chunk is read.
Delivery channel failure: Delivery is decoupled from export: if email delivery fails, the output files remain in S3 and can be re-delivered by resubmitting the delivery step. The system records delivery attempts and outcomes per channel in a delivery_attempts table (omitted from schema above for brevity) to support debugging.
Duplicate scheduled runs: If the scheduler runs twice (split-brain during leader re-election), the same schedule may enqueue two runs for the same window. Guard against this with a unique constraint on (schedule_id, DATE_FORMAT(queued_at, '%Y-%m-%d %H')) or by checking for an existing run in the same time window before insertion.
Parameter injection via template: If the query template contains structural SQL (table name varies by parameter), validate the parameter value against an explicit allowlist of table names stored in the report definition, never trusting caller-supplied values for structural SQL elements.

Scalability Considerations

Query execution workers and export workers should be separate fleets since they have different resource profiles: query workers are network-I/O bound (waiting on the database), while export workers are CPU and disk-I/O bound (encoding large files). Scale each independently. Use separate queues for each stage so a backlog of large exports does not block fast query executions.

For very large reports (millions of rows), the Parquet intermediate file can be partitioned into multiple S3 objects (one per 500K rows). Export workers process each partition in parallel and merge the results. For CSV this is trivial (concatenate). For XLSX, each partition writes a separate sheet or the worker stitches sheets in a post-processing step.

The report_cache table grows unboundedly without maintenance. A cleanup job deletes rows where expires_at < NOW() and removes the corresponding S3 Parquet files. Run this cleanup hourly on a low-priority worker to avoid impacting the execution path.

For multi-tenant SaaS deployments, enforce per-team concurrency limits. A team running 50 simultaneous reports should not starve other teams. Use a weighted fair-share scheduler at the queue level: each team gets a credit budget, and jobs from a team that has exhausted its credits are deprioritized until the window resets.

Summary

A reporting service is a multi-stage pipeline: definition and parameter validation, scheduled or ad-hoc execution, cached query execution against read replicas, intermediate Parquet storage, multi-format export, and multi-channel delivery. The Parquet intermediate layer is the key architectural decision that enables caching, format flexibility, and retry isolation. Query safety (injection prevention, timeouts, row caps, read-replica enforcement) and scheduler deduplication are the most critical correctness concerns. Scaling the execution and export fleets independently, combined with time-bucketed caching and per-tenant fairness, ensures the service remains responsive under mixed workloads of frequent small reports and infrequent large batch exports.