Report Builder Low-Level Design: Dynamic Query Generation, Parameterized Reports, and Scheduled Delivery

A report builder lets non-engineers define, save, schedule, and share data reports without writing raw SQL. The key design challenges are safety (preventing SQL injection from user-defined queries), performance (running reports against production data without degrading it), and reliability (scheduled delivery that handles failures gracefully).

Report Definition Schema

  • report_id — UUID
  • name, description
  • query_template — parameterized SQL string with named placeholders
  • parameters_schema — JSON Schema defining parameter names, types, and defaults
  • output_format — csv, html, pdf
  • owner_id, is_public, shared_with_team_ids[]
  • schedule — cron expression (e.g., 0 8 * * 1 for Monday 8 AM)
  • delivery_channels — array of {type: email|slack, destination}
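
A minimal sketch of a stored report definition and server-side parameter resolution. Field names follow the schema above; the concrete values are illustrative, and the type checker is deliberately simplified (a real system would validate against full JSON Schema):

```python
# Hypothetical report definition; field names match the schema above,
# values are illustrative only.
report = {
    "report_id": "3f2b9c1e-8d4a-4f6b-9e2a-1c5d7e9f0a3b",
    "name": "Weekly order summary",
    "query_template": (
        "SELECT order_id, total, status FROM orders "
        "WHERE created_at > :start_date AND status = :status "
        "ORDER BY created_at DESC LIMIT :limit"
    ),
    "parameters_schema": {
        "start_date": {"type": "string", "default": "2024-01-01"},
        "status": {"type": "string", "default": "shipped"},
        "limit": {"type": "integer", "default": 100},
    },
    "output_format": "csv",
    "schedule": "0 8 * * 1",
    "delivery_channels": [{"type": "email", "destination": "ops@example.com"}],
}

_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def resolve_parameters(schema: dict, supplied: dict) -> dict:
    """Apply defaults, reject unknown names, and type-check supplied values."""
    unknown = set(supplied) - set(schema)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved = {}
    for name, spec in schema.items():
        value = supplied.get(name, spec.get("default"))
        if value is None:
            raise ValueError(f"missing required parameter: {name}")
        if not isinstance(value, _TYPES[spec["type"]]):
            raise ValueError(f"{name}: expected {spec['type']}")
        resolved[name] = value
    return resolved
```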

Visual Query Builder

The UI exposes a point-and-click interface: select a table from a permitted list, drag in columns, add filter conditions, choose grouping and ordering. Internally, this constructs a parameterized SQL template. Advanced users can edit the SQL directly in a sandboxed editor. The UI never allows arbitrary SQL to be submitted without passing validation.

Parameterized SQL

User-supplied filter values are never string-interpolated into the query. All parameters use named placeholders resolved by the database driver:

SELECT order_id, total, status
FROM orders
WHERE created_at > :start_date
  AND status = :status
  AND customer_id = :customer_id
ORDER BY created_at DESC
LIMIT :limit

The query template is stored as-is. Parameter values are bound at execution time by the DB driver, which prevents SQL injection regardless of what values the user provides.
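
A runnable illustration of driver-side binding, using SQLite (which supports the same `:name` placeholder style) so the example is self-contained; the production system would use the PostgreSQL driver, but the binding behavior is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, total REAL, status TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 9.5, "shipped", "2024-03-02"),
        (2, 20.0, "pending", "2024-03-05"),
        (3, 5.0, "shipped", "2024-03-08"),
    ],
)

template = (
    "SELECT order_id, total, status FROM orders "
    "WHERE created_at > :start_date AND status = :status "
    "ORDER BY created_at DESC LIMIT :limit"
)
# Values are bound by the driver, never interpolated into the SQL text,
# so a malicious value like "x' OR '1'='1" stays an inert string.
rows = conn.execute(
    template, {"start_date": "2024-03-01", "status": "shipped", "limit": 10}
).fetchall()
```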

Query Validation

Before storing or running a query template, validate it server-side:

  • Parse the SQL with a SQL parser (e.g., pg_query for PostgreSQL) — reject malformed SQL
  • Allow only SELECT statements — reject INSERT, UPDATE, DELETE, DROP, CTEs that write
  • Check that all referenced tables and columns are in the permitted whitelist for the user's role
  • Reject subqueries that reference restricted tables
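
The checks above can be sketched as follows. This is a naive keyword-and-regex illustration only: the design calls for a real parser (pg_query) walking the AST, and `ALLOWED_TABLES` stands in for the per-role whitelist:

```python
import re

ALLOWED_TABLES = {"orders", "customers"}  # assumed per-role whitelist

FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|GRANT|CREATE)\b",
    re.IGNORECASE,
)

def validate_template(sql: str) -> None:
    """Naive illustration of the validation rules; a real implementation
    parses the SQL and inspects the AST rather than matching regexes."""
    stripped = sql.strip()
    if not stripped.upper().startswith("SELECT"):
        raise ValueError("only SELECT statements are allowed")
    if FORBIDDEN.search(stripped):
        raise ValueError("statement contains a forbidden keyword")
    # Every table named after FROM/JOIN must be on the whitelist.
    tables = re.findall(
        r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", stripped, re.IGNORECASE
    )
    for table in tables:
        if table.lower() not in ALLOWED_TABLES:
            raise ValueError(f"table not permitted: {table}")
```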

Execution

Run all report queries against a read replica or a dedicated analytics database — never the primary write database. Apply a hard query timeout (30 seconds default, configurable per report up to 5 minutes for scheduled reports). If the query exceeds the timeout, the execution is killed and the run is marked TIMED_OUT. Alert the report owner on repeated timeouts.
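
A sketch of the hard timeout, shown with SQLite's `interrupt()` so the example is self-contained; against PostgreSQL the equivalent control is `SET statement_timeout` on the session:

```python
import sqlite3
import threading

def run_with_timeout(conn: sqlite3.Connection, sql: str, params: dict, timeout_s: float):
    """Run a query and kill it if it exceeds timeout_s, surfacing the
    TIMED_OUT condition to the caller. Connection.interrupt() is safe to
    call from another thread."""
    timer = threading.Timer(timeout_s, conn.interrupt)
    timer.start()
    try:
        return conn.execute(sql, params).fetchall()
    except sqlite3.OperationalError:
        raise TimeoutError("run marked TIMED_OUT")
    finally:
        timer.cancel()
```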

Result Caching

Cache query results in Redis, keyed by the SHA-256 hash of {report_id}:{serialized_parameters}, with parameters serialized canonically (sorted keys) so logically identical parameter sets share one entry. TTL is set per report based on data freshness requirements:

  • Reference/config data: 1 hour
  • Business metrics: 5–15 minutes
  • Real-time operational data: no cache

Cached results serve repeated loads of the same report with the same parameters instantly, reducing replica load significantly for popular reports.
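
The key derivation can be sketched as follows; sorting keys during serialization makes the key order-independent:

```python
import hashlib
import json

def cache_key(report_id: str, parameters: dict) -> str:
    """Derive a stable cache key from the report ID and its parameters.
    sort_keys ensures {"a": 1, "b": 2} and {"b": 2, "a": 1} hash identically."""
    canonical = json.dumps(parameters, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{report_id}:{canonical}".encode()).hexdigest()
```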

Report Rendering

  • CSV: stream rows directly from query result to response; no full dataset in memory
  • HTML table: paginate at 500 rows per page for browser rendering; include total row count
  • PDF: render the HTML table in headless Chrome (Puppeteer), capture the page as PDF; suitable for formatted delivery via email
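
The CSV streaming path can be sketched as a generator that emits one encoded row at a time, so the full result set never sits in memory (in a real driver, `rows` would be a lazy server-side cursor):

```python
import csv
import io

def stream_csv(header, rows):
    """Yield the report as CSV chunks, one row per chunk, reusing a single
    small buffer instead of materializing the whole dataset."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    yield buf.getvalue()
    for row in rows:
        buf.seek(0)
        buf.truncate()
        writer.writerow(row)
        yield buf.getvalue()
```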

Scheduled Delivery

A scheduler scans for reports whose cron expression matches the current time and enqueues a run job for each match. The worker:

  1. Executes the query with default parameters
  2. Renders the output in the configured format
  3. Delivers to each configured channel (email attachment, Slack file upload)
  4. Logs the run with timestamp, duration, row count, delivery status
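
The scheduler's scan step needs a cron match check. A minimal matcher, supporting only `*` and comma-separated numbers, is sketched below; a real deployment would use a cron library (e.g., croniter):

```python
from datetime import datetime

def cron_matches(expr: str, dt: datetime) -> bool:
    """Check a 5-field cron expression (minute, hour, day-of-month, month,
    day-of-week) against a datetime. Supports '*' and comma lists only."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected 5 cron fields")
    # Cron day-of-week: 0 (or 7) = Sunday; Python weekday(): 0 = Monday.
    actual = [dt.minute, dt.hour, dt.day, dt.month, (dt.weekday() + 1) % 7]
    for i, (spec, value) in enumerate(zip(fields, actual)):
        if spec == "*":
            continue
        allowed = {int(part) for part in spec.split(",")}
        if i == 4:  # cron allows 7 as an alias for Sunday (0)
            allowed = {a % 7 for a in allowed}
        if value not in allowed:
            return False
    return True
```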

On delivery failure, retry with exponential backoff up to 3 attempts. After 3 failures, mark the scheduled run FAILED and notify the report owner.
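
The retry policy above can be sketched as follows (delays of 1s then 2s between the three attempts; after the final failure the exception surfaces so the caller can mark the run FAILED and notify the owner):

```python
import time

def deliver_with_retry(deliver, max_attempts=3, base_delay_s=1.0, sleep=time.sleep):
    """Call a delivery function, retrying with exponential backoff.
    `sleep` is injectable so tests can record delays instead of waiting."""
    for attempt in range(max_attempts):
        try:
            return deliver()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # caller marks the run FAILED and notifies the owner
            sleep(base_delay_s * (2 ** attempt))
```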

Access Control and Versioning

  • Access control: report owner sets visibility — private (owner only), team (named teams), or org-public. Column whitelists are enforced at query validation time per role.
  • Versioning: editing a report creates a new version; previous versions are preserved. Each run in the execution history is linked to the version that produced it, so historical runs remain reproducible.

Large Result Handling and Execution History

If a query returns more than 100K rows, force the output format to CSV and disable HTML and PDF rendering — a 100K-row HTML table is unusable. Log each execution with: run_id, report_version, parameters, started_at, duration_ms, row_count, delivered_to[]. This history is visible to the report owner for auditing and debugging delivery issues.
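
The format-forcing rule reduces to a small check at render time; the 100K threshold mirrors the design above:

```python
MAX_NON_CSV_ROWS = 100_000  # above this, HTML/PDF rendering is disabled

def effective_format(requested: str, row_count: int) -> str:
    """Force CSV output for oversized results, otherwise honor the request."""
    if row_count > MAX_NON_CSV_ROWS:
        return "csv"
    return requested
```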

FAQ

How do you safely generate dynamic SQL queries from user-defined report parameters without introducing SQL injection?

Never concatenate raw user input into SQL strings. Build queries programmatically using a query builder library (e.g., SQLAlchemy Core, jOOQ, Knex) that enforces parameterized bindings for all values. Allow users to select from a whitelist of column names and table identifiers that are validated server-side against an allowed schema registry — never accept arbitrary column names as strings. Aggregate functions, GROUP BY targets, and ORDER BY columns must also come from the whitelist. The final SQL is constructed by the server from safe building blocks, with all user-supplied values passed as bind parameters.

How would you design the data model for parameterized, reusable report templates?

Store report templates as rows in a report_templates table with columns: id, name, owner_id, base_query_json (structured representation of SELECT / FROM / WHERE / GROUP BY clauses), parameter_schema (JSON Schema defining accepted parameters and their types), and created_at. Parameters are placeholders in the query definition (e.g., {date_from}, {user_id}). At run time, validate the caller's supplied values against parameter_schema, substitute into the query structure, and execute. Report runs are stored in a report_runs table referencing the template ID, the resolved parameters, status, and output location.

What caching strategy would you apply to expensive report queries that many users run with the same parameters?

Cache at the result level using a key derived from a hash of (template_id + canonical sorted parameters). Store results in Redis or S3 with a TTL matched to the data freshness requirement (e.g., 5 minutes for operational reports, 1 hour for daily summaries). On a cache hit, return the cached result immediately. On a miss, check whether an in-flight computation for the same key exists (using a distributed lock or a pending-jobs table) to prevent the thundering herd problem — additional requests wait for the in-flight result rather than launching duplicate queries. Invalidate proactively when underlying data changes if the system supports change-data-capture.

How do you implement scheduled report delivery reliably, ensuring reports are delivered exactly once even if a worker crashes?

Use a durable scheduler (e.g., pg-cron, Quartz, or Temporal) to insert scheduled_report_runs rows at the configured cadence. A worker claims a run by atomically setting its status from PENDING to CLAIMED with a worker ID and a claim expiry timestamp (UPDATE … WHERE status='PENDING' RETURNING id). If the worker crashes, a watchdog process resets CLAIMED rows whose expiry has passed back to PENDING. After successfully generating and delivering the report, the worker sets status to DELIVERED. Delivery receipts (e.g., email bounce/open events via webhook) are stored against the run record for auditability.

