Low Level Design: Reporting Service

Overview

A reporting service allows business users and automated processes to define, schedule, execute, and distribute reports over operational data. Reports are parameterized queries against one or more data sources, producing structured output exported in multiple formats (CSV, PDF, Excel) and delivered via email, webhook, or object storage. This LLD covers the report definition model, execution engine, scheduler, export pipeline, delivery channels, caching strategy, and all key design trade-offs.

Data Model

CREATE TABLE data_sources (
    id              BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    team_id         BIGINT UNSIGNED NOT NULL,
    name            VARCHAR(128) NOT NULL,
    type            ENUM('postgresql','mysql','bigquery','redshift','snowflake','http_api') NOT NULL,
    connection_dsn  TEXT NOT NULL COMMENT 'encrypted at rest; contains host, port, db, credentials',
    max_rows        INT UNSIGNED NOT NULL DEFAULT 100000 COMMENT 'safety cap on result set size',
    timeout_seconds SMALLINT UNSIGNED NOT NULL DEFAULT 30,
    is_active       TINYINT(1) NOT NULL DEFAULT 1,
    created_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_team (team_id)
);

CREATE TABLE report_definitions (
    id              BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    external_id     CHAR(36) NOT NULL COMMENT 'UUID for external references',
    team_id         BIGINT UNSIGNED NOT NULL,
    owner_user_id   BIGINT UNSIGNED NOT NULL,
    name            VARCHAR(256) NOT NULL,
    description     TEXT NULL,
    data_source_id  BIGINT UNSIGNED NOT NULL,
    query_template  TEXT NOT NULL COMMENT 'SQL or API request template with {{param}} placeholders',
    query_type      ENUM('sql','api') NOT NULL DEFAULT 'sql',
    param_schema    JSON NOT NULL COMMENT 'JSON Schema describing accepted parameters',
    default_params  JSON NULL,
    version         SMALLINT UNSIGNED NOT NULL DEFAULT 1,
    is_active       TINYINT(1) NOT NULL DEFAULT 1,
    created_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    FOREIGN KEY (data_source_id) REFERENCES data_sources(id),
    UNIQUE KEY uq_external (external_id),
    INDEX idx_team (team_id)
);

CREATE TABLE report_schedules (
    id                  BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    report_definition_id BIGINT UNSIGNED NOT NULL,
    cron_expression     VARCHAR(64) NOT NULL COMMENT 'standard 5-field cron or 6-field with seconds',
    timezone            VARCHAR(64) NOT NULL DEFAULT 'UTC',
    params_override     JSON NULL COMMENT 'parameter values for this schedule; merged with default_params',
    export_formats      JSON NOT NULL COMMENT 'array of: csv, pdf, xlsx',
    delivery_channel_id BIGINT UNSIGNED NOT NULL,
    is_active           TINYINT(1) NOT NULL DEFAULT 1,
    last_run_at         DATETIME NULL,
    next_run_at         DATETIME NULL,
    created_at          DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (report_definition_id) REFERENCES report_definitions(id),
    INDEX idx_next_run (is_active, next_run_at)
);

CREATE TABLE report_runs (
    id                  BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    external_id         CHAR(36) NOT NULL,
    report_definition_id BIGINT UNSIGNED NOT NULL,
    schedule_id         BIGINT UNSIGNED NULL COMMENT 'NULL for ad-hoc runs',
    triggered_by        BIGINT UNSIGNED NULL COMMENT 'user ID or NULL for scheduled',
    params              JSON NOT NULL,
    status              ENUM('queued','running','exporting','delivering','done','failed','cancelled') NOT NULL DEFAULT 'queued',
    row_count           INT UNSIGNED NULL,
    error_message       TEXT NULL,
    cache_key           CHAR(64) NULL,
    cache_hit           TINYINT(1) NOT NULL DEFAULT 0,
    queued_at           DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    started_at          DATETIME NULL,
    query_finished_at   DATETIME NULL,
    export_finished_at  DATETIME NULL,
    finished_at         DATETIME NULL,
    UNIQUE KEY uq_external (external_id),
    FOREIGN KEY (report_definition_id) REFERENCES report_definitions(id),
    INDEX idx_definition (report_definition_id, queued_at),
    INDEX idx_status (status, queued_at)
);

CREATE TABLE report_outputs (
    id              BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    run_id          BIGINT UNSIGNED NOT NULL,
    format          ENUM('csv','pdf','xlsx') NOT NULL,
    file_key        VARCHAR(512) NOT NULL COMMENT 'S3 key',
    file_size_bytes BIGINT UNSIGNED NOT NULL,
    file_hash       CHAR(64) NOT NULL,
    expires_at      DATETIME NULL,
    FOREIGN KEY (run_id) REFERENCES report_runs(id),
    INDEX idx_run (run_id)
);

CREATE TABLE delivery_channels (
    id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    team_id     BIGINT UNSIGNED NOT NULL,
    name        VARCHAR(128) NOT NULL,
    type        ENUM('email','webhook','s3','sftp') NOT NULL,
    config      JSON NOT NULL COMMENT 'type-specific: {to, cc, subject_template} or {url, headers} or {bucket, prefix}',
    is_active   TINYINT(1) NOT NULL DEFAULT 1,
    created_at  DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_team (team_id)
);

CREATE TABLE report_cache (
    cache_key       CHAR(64) PRIMARY KEY,
    run_id          BIGINT UNSIGNED NOT NULL,
    result_file_key VARCHAR(512) NOT NULL COMMENT 'S3 key of raw result (parquet or compressed JSON)',
    row_count       INT UNSIGNED NOT NULL,
    created_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    expires_at      DATETIME NOT NULL,
    INDEX idx_expires (expires_at)
);

Core Workflow

1. Report Definition and Parameter Schema

Report definitions store a query template with named placeholders in double-brace syntax: WHERE created_at >= {{start_date}} AND status = {{status}}. The param_schema JSON Schema field describes each parameter’s type, allowed values, default, and whether it is required. At run time, submitted parameters are validated against this schema before query construction.

SQL query templates are never interpolated with raw string substitution. Parameters are extracted and passed as prepared statement bind parameters to prevent SQL injection. For date range parameters, the service generates parameterized clauses and binds the values separately. Only the query structure (table names, column selections) can vary by template; all data values are always bound.

2. Scheduling

The scheduler is a single leader process (elected via a database lock or distributed mutex such as Redis SETNX with TTL). Every 30 seconds it queries report_schedules for rows where is_active = 1 AND next_run_at <= NOW(), creates a report_runs row for each, publishes the run IDs to the execution queue, and updates last_run_at and next_run_at (computed from the cron expression and timezone using a cron parser library).

Timezone-aware scheduling is important for business reports that must run at "9:00 AM local time". Cron expressions are stored in the report’s timezone; the scheduler converts next_run_at to UTC for storage and comparison.

3. Execution Engine

  1. A worker dequeues a run ID, transitions status to running, and records started_at.
  2. Cache check: The cache key is SHA-256 of report_definition_id + version + canonical_json(params) + floor(now / cache_ttl_bucket). The time bucket quantizes time so that reports with a 1-hour cache TTL reuse results from the same clock-hour. If a valid report_cache entry exists, the worker skips execution, marks cache_hit = 1, and jumps to export.
  3. Query execution: The worker connects to the data source using a connection pool (pgBouncer for PostgreSQL, JDBC pool for others). The parameterized query executes with a statement_timeout matching data_sources.timeout_seconds. Rows are streamed from the database cursor rather than fetched into memory all at once.
  4. Result storage: Streamed rows are written to S3 in Parquet format (columnar, compressed with Snappy). Parquet allows efficient re-reading for multiple export formats without re-querying the database. The S3 key is recorded in report_cache with an expiry.
  5. query_finished_at and row_count are recorded. Status transitions to exporting.

4. Export Pipeline

For each requested format, the exporter reads the Parquet result file from S3 and converts it:

  • CSV: Stream rows through a CSV encoder with proper quoting and encoding (UTF-8 BOM for Excel compatibility). Compress with gzip if the row count exceeds a threshold.
  • XLSX: Use a streaming XLSX writer (e.g., xlsxwriter in Python, or Apache POI streaming API in Java) that writes rows directly to disk without building the entire workbook in memory. Apply column auto-width and freeze-pane headers. Limit to 1 million rows (Excel's row cap).
  • PDF: Render the result set as an HTML table using a parameterized template (report name, timestamp, parameters in header, paginated table), then convert to PDF via the PDF generation service (internal API call) or directly via headless Chromium. For large result sets, paginate the table across multiple PDF pages.

Each export output is written to S3 and recorded in report_outputs. Status transitions to delivering.

5. Delivery Channels

  • Email: The delivery worker fetches the output files, generates pre-signed S3 download links (valid 7 days), and sends an email via SES or SendGrid with the links embedded. Files under 10 MB may be attached directly; larger files are link-only.
  • Webhook: The worker POSTs a JSON payload to the configured URL containing the run metadata and signed download URLs. Delivery is retried with exponential backoff on non-2xx responses. After 5 failures the channel is marked as degraded and the team is alerted.
  • S3: The output files are copied to a team-configured S3 bucket and prefix. Useful for downstream pipeline consumption.
  • SFTP: Files are pushed to an SFTP server. SFTP credentials are stored encrypted in delivery_channels.config.

After successful delivery, status transitions to done and finished_at is recorded.

Key Design Decisions and Trade-offs

Intermediate Parquet Storage

Storing query results as Parquet on S3 before export decouples query execution from export format generation. This means: (1) the cache can be shared across runs requesting different formats, (2) a failed Excel export can be retried without re-querying the database, and (3) multiple formats can be exported in parallel by independent workers. The trade-off is additional S3 read/write cost and the overhead of a Parquet library dependency.

Read Replica Routing

All report queries must run against read replicas, never the primary database. Enforce this at the data source level: connection DSNs for reporting data sources must point to replica endpoints. A query that accidentally runs on the primary under heavy load can cause replication lag spikes that affect production writes. Consider read replica lag monitoring; if replica lag exceeds a threshold, delay scheduled reports rather than serving stale data silently.

Query Safety and Resource Limits

Untrusted report definitions (if users can write their own SQL) are extremely dangerous. Options in increasing security order: (1) whitelist-only query builder UI with no raw SQL; (2) raw SQL but run under a database role with SELECT-only privileges on specific schemas; (3) raw SQL parsed through a query validator that rejects mutations (INSERT, UPDATE, DELETE, DDL) and disallows subqueries to untrusted tables. Always combine with statement timeouts, max-rows caps, and cost-based query plan analysis (EXPLAIN before execution, reject plans above a cost threshold).

Time-Bucketed Caching

The cache key includes a time bucket (floor of current epoch divided by TTL seconds) so that the cached result is automatically invalidated when the bucket rolls over. This avoids storing explicit TTL logic in the application layer per report and ensures all workers agree on cache validity without coordination. The trade-off is that reports can see data that is up to TTL seconds stale. TTL is configurable per report definition; time-sensitive reports set TTL = 0 to disable caching.

Scheduler Leader Election

Running multiple scheduler instances without coordination causes duplicate runs. Use database-level advisory locking (PostgreSQL pg_try_advisory_lock) or a Redis SETNX lock with a TTL equal to the scheduling interval plus a safety margin. The lock holder runs the scheduling loop; others poll until they can acquire the lock. This is simpler than Raft-based leader election for a low-frequency scheduling loop.

Failure Handling and Edge Cases

  • Query timeout: The database kills the query after timeout_seconds. The worker catches the timeout exception, sets error_message, and transitions the run to failed. Scheduled runs are retried at the next scheduled time (not immediately), preventing a slow query from flooding the database with retries.
  • Result set too large: The worker enforces a row count cap by adding LIMIT max_rows + 1 to every query. If the result exceeds max_rows, the run fails with a descriptive error suggesting the user add filters or reduce the date range. This prevents OOM conditions in the export stage.
  • Export OOM for large XLSX: XLSX streaming writers use constant memory regardless of row count, but the Parquet read buffer can be large. Read the Parquet file in row group chunks (Parquet's natural unit) rather than loading the entire file. Each chunk is written to XLSX and discarded before the next chunk is read.
  • Delivery channel failure: Delivery is decoupled from export: if email delivery fails, the output files remain in S3 and can be re-delivered by resubmitting the delivery step. The system records delivery attempts and outcomes per channel in a delivery_attempts table (omitted from schema above for brevity) to support debugging.
  • Duplicate scheduled runs: If the scheduler runs twice (split-brain during leader re-election), the same schedule may enqueue two runs for the same window. Guard against this with a unique constraint on (schedule_id, DATE_FORMAT(queued_at, '%Y-%m-%d %H')) or by checking for an existing run in the same time window before insertion.
  • Parameter injection via template: If the query template contains structural SQL (table name varies by parameter), validate the parameter value against an explicit allowlist of table names stored in the report definition, never trusting caller-supplied values for structural SQL elements.

Scalability Considerations

Query execution workers and export workers should be separate fleets since they have different resource profiles: query workers are network-I/O bound (waiting on the database), while export workers are CPU and disk-I/O bound (encoding large files). Scale each independently. Use separate queues for each stage so a backlog of large exports does not block fast query executions.

For very large reports (millions of rows), the Parquet intermediate file can be partitioned into multiple S3 objects (one per 500K rows). Export workers process each partition in parallel and merge the results. For CSV this is trivial (concatenate). For XLSX, each partition writes a separate sheet or the worker stitches sheets in a post-processing step.

The report_cache table grows unboundedly without maintenance. A cleanup job deletes rows where expires_at < NOW() and removes the corresponding S3 Parquet files. Run this cleanup hourly on a low-priority worker to avoid impacting the execution path.

For multi-tenant SaaS deployments, enforce per-team concurrency limits. A team running 50 simultaneous reports should not starve other teams. Use a weighted fair-share scheduler at the queue level: each team gets a credit budget, and jobs from a team that has exhausted its credits are deprioritized until the window resets.

Summary

A reporting service is a multi-stage pipeline: definition and parameter validation, scheduled or ad-hoc execution, cached query execution against read replicas, intermediate Parquet storage, multi-format export, and multi-channel delivery. The Parquet intermediate layer is the key architectural decision that enables caching, format flexibility, and retry isolation. Query safety (injection prevention, timeouts, row caps, read-replica enforcement) and scheduler deduplication are the most critical correctness concerns. Scaling the execution and export fleets independently, combined with time-bucketed caching and per-tenant fairness, ensures the service remains responsive under mixed workloads of frequent small reports and infrequent large batch exports.

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Atlassian Interview Guide

Scroll to Top