Low Level Design: Materialized Views and Streaming SQL

A materialized view is a pre-computed query result stored as a physical table, refreshed periodically or incrementally as source data changes. Materialized views are the bridge between raw data and fast analytical queries — avoiding repeated expensive computation at query time. Streaming SQL extends this concept to continuously updated views over unbounded streams of data, enabling sub-second latency for aggregate dashboards without full recomputation on every query.

Static Materialized Views

A static materialized view is refreshed on a schedule (complete refresh) or when the underlying tables change (incremental refresh). In PostgreSQL: CREATE MATERIALIZED VIEW daily_revenue AS SELECT date, SUM(amount) FROM orders GROUP BY date; — then REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue; rebuilds the view without blocking reads (this requires a unique index on the view). dbt (data build tool) manages materialized view refresh scheduling, dependency tracking, and incremental models for data warehouses. Use case: a dashboard querying daily revenue for the last year should not re-scan 3 years of order data on every page load.

-- PostgreSQL: create and refresh materialized view
CREATE MATERIALIZED VIEW product_sales_summary AS
SELECT
  p.category,
  DATE_TRUNC('day', o.created_at) AS sale_date,
  SUM(oi.quantity) AS units_sold,
  SUM(oi.quantity * oi.price) AS revenue
FROM order_items oi
JOIN orders o USING (order_id)
JOIN products p USING (product_id)
WHERE o.status = 'completed'
GROUP BY p.category, sale_date;

-- CONCURRENTLY requires a unique index on the view
CREATE UNIQUE INDEX ON product_sales_summary (sale_date, category);

-- Concurrent refresh: readers not blocked
REFRESH MATERIALIZED VIEW CONCURRENTLY product_sales_summary;

Incremental View Maintenance

Full refresh recomputes the entire view from scratch, which is expensive for large tables. Incremental view maintenance (IVM) applies only the changes caused by new inserts, updates, and deletes. For a SUM aggregate: when a new order is inserted, increment the relevant date/category bucket rather than re-summing all rows. IVM requires tracking which base table rows contribute to which view rows. PostgreSQL supports IVM via the pg_ivm extension. dbt incremental models implement a simpler variant: append new rows only (for append-only facts) or merge by primary key (for slowly changing data). IVM can reduce refresh time from minutes to milliseconds for high-frequency updates.
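The increment-a-bucket idea can be sketched in plain PostgreSQL without any extension. This is a hand-rolled version, not pg_ivm: it assumes a summary table with a unique key per bucket and a watermark of the last processed order timestamp; all table, column, and parameter names here are illustrative.

```sql
-- Hand-rolled incremental maintenance for a SUM aggregate (sketch).
-- Assumes: daily_revenue(order_date UNIQUE, revenue) and a stored
-- watermark :last_watermark; names are illustrative assumptions.
INSERT INTO daily_revenue (order_date, revenue)
SELECT DATE_TRUNC('day', created_at) AS order_date,
       SUM(amount)                   AS revenue
FROM orders
WHERE status = 'completed'
  AND created_at > :last_watermark   -- only rows since the last run
GROUP BY 1
ON CONFLICT (order_date)
DO UPDATE SET revenue = daily_revenue.revenue + EXCLUDED.revenue;
```

This handles inserts only: updates and deletes would need matching decrements, which is exactly why general-purpose IVM tracks which base rows contribute to which view rows.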

Streaming Materialized Views (Flink, Materialize)

Streaming materialized views maintain continuously updated results as source data arrives in real time. Materialize (a streaming database) and Apache Flink SQL allow writing standard SQL queries that are evaluated as standing queries over Kafka streams: CREATE MATERIALIZED VIEW active_user_count AS SELECT COUNT(DISTINCT user_id) FROM page_views WHERE event_time > NOW() - INTERVAL '5 minutes';. This view updates within milliseconds of each new page_view event, without polling or scheduled refresh. The result is held as an incrementally maintained dataflow: Materialize is built on differential dataflow, while Flink keeps the equivalent aggregation state in its state backend.
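A concrete Flink SQL version of this standing query can use a windowing table-valued function (available since Flink 1.13). The Kafka topic, broker address, and schema below are illustrative assumptions, not part of the original example:

```sql
-- Flink SQL: distinct active users per five-minute tumbling window,
-- over a Kafka topic. Topic/broker/schema are illustrative assumptions.
CREATE TABLE page_views (
  user_id    BIGINT,
  url        STRING,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'page_views',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
);

SELECT
  window_start,
  window_end,
  COUNT(DISTINCT user_id) AS active_users
FROM TABLE(
  TUMBLE(TABLE page_views, DESCRIPTOR(event_time), INTERVAL '5' MINUTES)
)
GROUP BY window_start, window_end;
```

The watermark declaration tells Flink how long to wait for late events before closing a window; five seconds here is an arbitrary choice for the sketch.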

When to Use Materialized Views vs. Caching

Materialized views are maintained by the database, stay consistent with source data, and support complex aggregations and joins. Caching (Redis) is faster (sub-millisecond) but stale (TTL-based invalidation), does not support SQL-style aggregations, and requires application code to manage cache keys. Use materialized views when: the query is expensive, accuracy matters (financial reports), and the data changes frequently enough that a cache TTL would be too stale. Use caching when: sub-millisecond latency is required, the data is simple (key-value lookup), and stale-by-TTL is acceptable.

Key Interview Discussion Points

  • View staleness: how fresh does the view need to be? Financial dashboards may require near-real-time; marketing dashboards may accept hourly refresh
  • Cascade refresh: when a base table changes, multiple dependent views must refresh — track dependency graphs to refresh in correct order
  • Differential dataflow: the theoretical foundation for efficient incremental computation — only recompute the delta of the output caused by the delta of the input
  • Approximate aggregations: for extremely high cardinality (count distinct users from billions of events), use HyperLogLog for approximate count or t-digest for percentiles — materializing exact counts may be too expensive
  • Streaming window functions: tumbling windows (fixed, non-overlapping intervals), sliding windows (fixed size, overlapping, advancing by a slide interval), session windows (activity-based grouping separated by gaps of inactivity) for stream aggregation over time
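The approximate-aggregation point above can be made concrete with the postgresql-hll extension, which stores a HyperLogLog sketch per row. Table and column names are illustrative; the useful property is that daily sketches are small and union cheaply into larger windows:

```sql
-- Approximate distinct users via HyperLogLog (postgresql-hll extension).
-- Table and column names are illustrative assumptions.
CREATE EXTENSION IF NOT EXISTS hll;

CREATE TABLE daily_uniques AS
SELECT DATE_TRUNC('day', event_time) AS day,
       hll_add_agg(hll_hash_bigint(user_id)) AS users_hll
FROM page_views
GROUP BY 1;

-- Approximate distinct count per day
SELECT day, hll_cardinality(users_hll) FROM daily_uniques;

-- Sketches compose: union 30 daily sketches into a 30-day distinct count
SELECT hll_cardinality(hll_union_agg(users_hll))
FROM daily_uniques
WHERE day > NOW() - INTERVAL '30 days';
```

An exact 30-day COUNT(DISTINCT user_id) would have to re-scan all raw events; the HLL version reads 30 small sketches at the cost of a bounded relative error.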