Report Builder Low-Level Design
A report builder lets non-engineers define, save, schedule, and share data reports without writing raw SQL. The key design challenges are safety (preventing SQL injection from user-defined queries), performance (running reports against production data without degrading it), and reliability (scheduled delivery that handles failures gracefully).
Report Definition Schema
report_id— UUIDname,descriptionquery_template— parameterized SQL string with named placeholdersparameters_schema— JSON Schema defining parameter names, types, and defaultsoutput_format—csv,html,pdfowner_id,is_public,shared_with_team_ids[]schedule— cron expression (e.g.,0 8 * * 1for Monday 8 AM)delivery_channels— array of{type: email|slack, destination}
Visual Query Builder
The UI exposes a point-and-click interface: select a table from a permitted list, drag in columns, add filter conditions, choose grouping and ordering. Internally, this constructs a parameterized SQL template. Advanced users can edit the SQL directly in a sandboxed editor. The UI never allows arbitrary SQL to be submitted without passing validation.
Parameterized SQL
User-supplied filter values are never string-interpolated into the query. All parameters use named placeholders resolved by the database driver:
SELECT order_id, total, status
FROM orders
WHERE created_at > :start_date
AND status = :status
AND customer_id = :customer_id
ORDER BY created_at DESC
LIMIT :limit
The query template is stored as-is. Parameter values are bound at execution time by the DB driver, which prevents SQL injection regardless of what values the user provides.
Query Validation
Before storing or running a query template, validate it server-side:
- Parse the SQL with a SQL parser (e.g.,
pg_queryfor PostgreSQL) — reject malformed SQL - Allow only
SELECTstatements — rejectINSERT,UPDATE,DELETE,DROP, CTEs that write - Check that all referenced tables and columns are in the permitted whitelist for the user's role
- Reject subqueries that reference restricted tables
Execution
Run all report queries against a read replica or a dedicated analytics database — never the primary write database. Apply a hard query timeout (30 seconds default, configurable per report up to 5 minutes for scheduled reports). If the query exceeds the timeout, the execution is killed and the run is marked TIMED_OUT. Alert the report owner on repeated timeouts.
Result Caching
Cache query results in Redis keyed by SHA-256({report_id}:{serialized_parameters}). TTL is set per report based on data freshness requirements:
- Reference/config data: 1 hour
- Business metrics: 5–15 minutes
- Real-time operational data: no cache
Cached results serve repeated loads of the same report with the same parameters instantly, reducing replica load significantly for popular reports.
Report Rendering
- CSV: stream rows directly from query result to response; no full dataset in memory
- HTML table: paginate at 500 rows per page for browser rendering; include total row count
- PDF: render the HTML table in headless Chrome (Puppeteer), capture the page as PDF; suitable for formatted delivery via email
Scheduled Delivery
A scheduler scans for reports whose cron expression matches the current time, enqueues a run job per report. The worker:
- Executes the query with default parameters
- Renders the output in the configured format
- Delivers to each configured channel (email attachment, Slack file upload)
- Logs the run with timestamp, duration, row count, delivery status
On delivery failure, retry with exponential backoff up to 3 attempts. After 3 failures, mark the scheduled run FAILED and notify the report owner.
Access Control and Versioning
- Access control: report owner sets visibility — private (owner only), team (named teams), or org-public. Column whitelists are enforced at query validation time per role.
- Versioning: editing a report creates a new version; previous versions are preserved. Each run in the execution history is linked to the version that produced it, so historical runs remain reproducible.
Large Result Handling and Execution History
If a query returns more than 100K rows, force the output format to CSV and disable HTML and PDF rendering — a 100K-row HTML table is unusable. Log each execution with: run_id, report_version, parameters, started_at, duration_ms, row_count, delivered_to[]. This history is visible to the report owner for auditing and debugging delivery issues.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture
See also: Atlassian Interview Guide