Time-Series Databases for Quant: kdb+, ClickHouse, InfluxDB, and What Quant Firms Actually Use

Quant firms generate and consume staggering volumes of time-series data: trade prints, quote updates, order book snapshots, position histories, signal scores, factor exposures, P&L attributions. A typical hedge fund’s research database holds billions of rows of tick data; HFT firms generate terabytes per day. Standard relational databases (PostgreSQL, MySQL) handle small amounts adequately but collapse at quant scale. Specialized time-series databases — kdb+, ClickHouse, InfluxDB, TimescaleDB, QuestDB, Parquet-based stacks — dominate quant infrastructure. For SWE and quant-developer interview candidates, understanding what these systems do and when each is appropriate is genuine domain knowledge that’s hard to acquire outside the industry.

What Makes Time-Series Different

Time-series workloads have characteristics that general-purpose databases don’t optimize for:

  • Append-heavy: data is inserted in roughly time order; updates and deletes are rare.
  • Time-keyed queries: “give me data from time T1 to T2” is the most common query.
  • Aggregations are dominant: “average price per minute,” “total volume per day,” “VWAP per session.”
  • Massive scale: billions of rows; terabytes per symbol-year.
  • High-cardinality joins: joining trades to quotes by timestamp; aligning multiple instruments.

Optimizing for these workloads gives 10–100x performance over general-purpose databases for typical quant queries.
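
For a concrete feel of this query shape, here is a rough sketch in pandas (hypothetical column names ts, symbol, price, size): slice a time window, then aggregate into one-minute buckets per symbol.

    import pandas as pd

    def minute_bars(trades: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
        """trades: DataFrame with ts (datetime64), symbol, price, size columns."""
        window = trades[(trades["ts"] >= start) & (trades["ts"] < end)]
        bars = (window
                .groupby(["symbol", pd.Grouper(key="ts", freq="1min")])
                .agg(n_trades=("price", "count"),     # trade count per bucket
                     volume=("size", "sum"),          # total volume per bucket
                     last_price=("price", "last")))   # last print in the bucket
        return bars.reset_index()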

kdb+ / q

The dominant time-series database in finance for over two decades. Developed by KX Systems. Used by most major banks, hedge funds, and HFT firms. Programming language: q (built on K, a descendant of APL).

Strengths

  • Extreme performance: in-memory tables, columnar storage, vectorized operations.
  • Compact code: q is famously concise; complex queries fit on a few lines.
  • Native time-series operations: time-window joins, as-of joins, rolling aggregations.
  • Industry standard: established tooling, large community, abundant talent (relatively).

Weaknesses

  • License cost: kdb+ is commercial and expensive (six figures per year for production deployments at scale).
  • q has a steep learning curve: terse syntax, idioms unlike most programming languages.
  • Operational complexity: tuning, capacity planning, distributed setups require expertise.

When to use

Industry-standard for tick database storage and replay at large hedge funds, banks, and HFT firms. Most commonly: a kdb+ tick database storing years of trade and quote data, queried by research and risk teams.

ClickHouse

Open-source columnar database originally built by Yandex. Increasingly popular in quant for analytics workloads.

Strengths

  • Open source, free.
  • SQL interface (familiar to most developers, unlike q).
  • Excellent compression and query speed.
  • Scales horizontally with sharding.
  • Active development, growing community.

Weaknesses

  • Less specialized for time-series than kdb+: ASOF JOIN covers the basic case, but the time-series vocabulary is thinner and queries are more verbose than in q.
  • Newer in finance; tooling and community knowledge less mature.
  • Some operational rough edges compared to mature commercial systems.

When to use

Cost-conscious teams or shops without legacy kdb+ investment; analytical workloads beyond pure tick storage (clickstream-like data, event analytics).
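
A minimal sketch of what this looks like in practice, assuming the clickhouse-connect Python client and a local server; the table and column names are illustrative, not a standard schema.

    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    # Columnar MergeTree table: partition by trading date, sort by (symbol, ts)
    # so date-range scans prune partitions and compression benefits from
    # adjacent similar values.
    client.command("""
        CREATE TABLE IF NOT EXISTS trades (
            ts     DateTime64(9),
            symbol LowCardinality(String),
            price  Float64,
            size   UInt32
        )
        ENGINE = MergeTree
        PARTITION BY toDate(ts)
        ORDER BY (symbol, ts)
    """)

    # Trade count and total volume per symbol for one session.
    result = client.query("""
        SELECT symbol, count() AS n_trades, sum(size) AS volume
        FROM trades
        WHERE toDate(ts) = '2024-01-02'
        GROUP BY symbol
        ORDER BY volume DESC
    """)
    for row in result.result_rows:
        print(row)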

InfluxDB / TimescaleDB / QuestDB

Other open-source time-series databases. Each has its niche.

  • InfluxDB: popular for IoT and DevOps monitoring. Has growing finance use but doesn’t match kdb+ or ClickHouse for high-volume tick data.
  • TimescaleDB: PostgreSQL extension for time-series. Familiar SQL; integrates with existing Postgres deployments. Performance is good for moderate scale; not optimal for HFT-scale tick data.
  • QuestDB: newer entrant designed for high-throughput finance use cases. Open source. Smaller community but growing.

When to use

For specific niches: InfluxDB for monitoring infrastructure, TimescaleDB for moderate-volume time-series alongside relational data, QuestDB as a kdb+ alternative for cost-conscious teams.

Parquet + Object Storage Stacks

Modern data engineering pattern: store time-series in Parquet (columnar file format) on object storage (S3, GCS, Azure Blob). Query with SQL engines (Spark, Trino / Presto, DuckDB) or DataFrame libraries (Pandas, Polars).
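
A sketch of this pattern with DuckDB reading date-partitioned Parquet directly from S3; the httpfs extension, the s3://ticks/... layout, and the symbol are assumptions for illustration, and S3 credentials are taken from the environment.

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs;")   # S3 support; credentials come from env/config
    con.execute("LOAD httpfs;")

    # Pull half an hour of one symbol's trades straight from object storage.
    df = con.execute("""
        SELECT ts, symbol, price, size
        FROM read_parquet('s3://ticks/trades/date=2024-01-02/*.parquet')
        WHERE symbol = 'AAPL'
          AND ts BETWEEN TIMESTAMP '2024-01-02 14:30:00'
                     AND TIMESTAMP '2024-01-02 15:00:00'
        ORDER BY ts
    """).df()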

Strengths

  • Cost-effective: object storage is cheap; you pay for compute only when querying.
  • Open: Parquet is widely supported.
  • Decouples storage from compute: query with whatever engine fits the use case.
  • Plays well with modern ML stacks.

Weaknesses

  • Latency is higher than purpose-built time-series databases (network access, query planning overhead).
  • Not suitable for low-latency operational queries; better for batch analytics and research.
  • Operational complexity of distributed query engines.

When to use

Research and analytics workloads where query latency measured in seconds is acceptable. Many systematic hedge funds use this stack for research; pair it with kdb+ or ClickHouse for low-latency operational queries.

Common Interview Questions

Choose a database

“You need to store 5 years of US equity tick data (~100M rows per day). What database do you use?” Discuss kdb+ as the industry standard if budget allows; ClickHouse as a cost-effective open-source alternative; Parquet + DuckDB for a research-only setup. Strong candidates discuss latency requirements, query patterns, team familiarity, and cost.

Design a tick database schema

“Design the schema for storing trades and quotes.” Columns for timestamp, symbol, side, price, quantity (for trades) plus bid, ask, bid_size, ask_size (for quotes). Discuss a compression-friendly sort order (sorting by symbol, then time keeps similar values adjacent within each column). Discuss partitioning by date for query efficiency. Discuss handling timestamp precision (microseconds vs nanoseconds).
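
One way to make the answer concrete is to pin the schema down in code; a pyarrow sketch (column names, types, and the date partitioning are illustrative):

    import pyarrow as pa

    trades_schema = pa.schema([
        ("ts",     pa.timestamp("ns")),                       # event time, UTC
        ("symbol", pa.dictionary(pa.int32(), pa.string())),   # low cardinality
        ("side",   pa.dictionary(pa.int8(), pa.string())),    # 'B' / 'S'
        ("price",  pa.float64()),
        ("size",   pa.uint32()),
    ])

    quotes_schema = pa.schema([
        ("ts",       pa.timestamp("ns")),
        ("symbol",   pa.dictionary(pa.int32(), pa.string())),
        ("bid",      pa.float64()),
        ("ask",      pa.float64()),
        ("bid_size", pa.uint32()),
        ("ask_size", pa.uint32()),
    ])

    # Files would typically be written partitioned by trading date, e.g.
    # trades/date=2024-01-02/part-0.parquet, so date-range queries prune partitions.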

Discuss as-of joins

“Join trades to quotes such that each trade gets the prevailing quote at the time of the trade.” Standard SQL is awkward; kdb+ has native as-of join (aj). ClickHouse has ASOF JOIN. Explain the algorithm: for each trade, binary search for the latest quote with timestamp ≤ trade time. Strong candidates discuss why this is a fundamental operation in finance.
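
A sketch of the same operation in pandas, whose merge_asof implements exactly the backward search described; the trades and quotes frames (ts and symbol columns) are assumed.

    import pandas as pd

    def asof_join(trades: pd.DataFrame, quotes: pd.DataFrame) -> pd.DataFrame:
        """Attach the prevailing quote to each trade."""
        return pd.merge_asof(
            trades.sort_values("ts"),
            quotes.sort_values("ts"),
            on="ts",
            by="symbol",            # match within the same symbol
            direction="backward",   # latest quote with quote.ts <= trade.ts
        )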

Compute VWAP

“Compute volume-weighted average price by symbol per day.” Aggregation: sum(price * volume) / sum(volume) per group. Trivial in any SQL-like language; discuss scaling considerations (do you compute VWAP per minute and aggregate, or aggregate raw trades?).
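
A minimal pandas version, assuming a trades frame with ts (datetime64), symbol, price, size:

    import pandas as pd

    def daily_vwap(trades: pd.DataFrame) -> pd.Series:
        t = trades.assign(notional=trades["price"] * trades["size"],
                          date=trades["ts"].dt.date)
        g = t.groupby(["symbol", "date"])
        return g["notional"].sum() / g["size"].sum()

If VWAP is pre-aggregated per minute, the partial sums of notional and volume must be carried and divided only at the end; averaging per-minute VWAPs directly weights every minute equally and gives the wrong answer.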

Discuss compression

“How does columnar compression work and why does it help time-series?” Same data type per column (run-length encoding, dictionary encoding); time-series often has correlated values (delta encoding); compression ratios of 5–20x are normal. Result: less data to read from disk, faster queries.
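
A toy numpy illustration of the delta-encoding point: absolute nanosecond timestamps need 64 bits, but their deltas are small and highly compressible.

    import numpy as np

    # One million monotonically increasing "timestamps" with gaps of 1-50 microseconds.
    ts = np.cumsum(np.random.randint(1_000, 50_000, size=1_000_000)).astype(np.int64)
    deltas = np.diff(ts)

    print(ts.dtype, ts.max())   # int64; absolute values around 10^10
    print(deltas.max())         # deltas fit comfortably in 16-32 bits

    # A columnar store applies this idea per column: delta encoding for timestamps,
    # dictionary encoding for symbols, run-length encoding for repeated values.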

Practical Patterns

Hot vs cold storage

Recent data (today, this week) lives in fast storage (kdb+ in-memory real-time database, ClickHouse on local SSDs). Older data moves to cheaper storage (kdb+ on-disk historical database, S3 / Parquet). Tiering policies move data automatically.

Real-time + historical

Real-time tick stream into a hot database; periodic batches written to historical store. Queries that span both use a federated query that hits both systems and merges.

Snapshotting

For order book data, storing every update is expensive. Common pattern: store snapshots (full book state) periodically plus deltas between snapshots. Reconstruct intermediate states by applying deltas to the most recent snapshot.
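
A sketch of the reconstruction step, with each side of the book modeled as a {price: size} dict; the data layout is illustrative, not a specific feed format.

    from copy import deepcopy

    def reconstruct(snapshot: dict, deltas: list, target_ts) -> dict:
        """snapshot: {'ts': ..., 'bids': {price: size}, 'asks': {price: size}}
        deltas: time-ordered dicts with 'ts', 'side' ('bids'/'asks'), 'price', 'size'."""
        book = deepcopy(snapshot)
        for d in deltas:
            if d["ts"] > target_ts:
                break                          # deltas are time-ordered; stop at target
            levels = book[d["side"]]
            if d["size"] == 0:
                levels.pop(d["price"], None)   # size 0 means the level was removed
            else:
                levels[d["price"]] = d["size"] # otherwise replace the level's size
        book["ts"] = target_ts
        return book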

Time-zone handling

Time-zone bugs are endemic in finance data. Standard practice: store UTC timestamps; convert at query time if needed. Beware of daylight saving time transitions; some venues operate in local time, requiring extra translation.
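
A small sketch of the store-UTC, convert-at-query-time convention using Python's standard-library zoneinfo; the exchange zone is illustrative.

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    utc_ts = datetime(2024, 3, 8, 14, 30, tzinfo=timezone.utc)    # stored in UTC
    local = utc_ts.astimezone(ZoneInfo("America/New_York"))       # 09:30 ET (before DST)

    # A week later the same 09:30 wall-clock open is 13:30 UTC, because US daylight
    # saving time began on 2024-03-10; converting through a named zone (not a fixed
    # offset) handles the transition correctly.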

Frequently Asked Questions

Do I need to learn kdb+ / q before interviewing?

Helpful but not required. Most quant firms don’t expect candidates to know kdb+ before joining; they’ll train you. But familiarity is a meaningful advantage at firms where kdb+ is dominant (most banks, many large hedge funds). For interview prep, knowing kdb+ exists, what it’s good at, and roughly how its data model works (tables as columnar arrays, q as the query language) is sufficient. Going deeper signals serious interest in finance infrastructure.

What’s the relationship between time-series databases and quant research workflows?

Tight. Quant researchers spend significant time querying time-series data: pulling history for a strategy, joining trades to quotes, computing rolling statistics, aggregating across symbols. The database performance directly affects researcher productivity. Slow queries (minutes to hours for routine pulls) drag on research velocity; fast queries enable exploration that wouldn’t be feasible otherwise. Quant firms invest heavily in time-series infrastructure for exactly this reason.

How big are quant time-series datasets in practice?

For a major hedge fund with US equity coverage: tick data alone is ~100GB per day raw, ~10GB per day compressed. Five years of history is 5–10TB. Add options (much higher message rates), futures (multiple exchanges), FX (24/7 trading), credit, and the totals climb to dozens or hundreds of TB. Major HFT firms often have petabyte-scale tick archives. Storage and access costs are real engineering concerns.

How does this differ from time-series in non-finance contexts (IoT, monitoring)?

Finance time-series has higher message rates, smaller events (a trade message is tens of bytes), higher emphasis on exact timestamp precision (microseconds matter), and richer query patterns (as-of joins, multi-symbol aggregations, intraday seasonality). IoT and monitoring time-series tend to have lower message rates per source but more sources; queries tend to be simpler aggregations over time windows. Different optimization targets; different DB choices.

Is the time-series database choice a strategic decision or an implementation detail?

Strategic. Switching from kdb+ to ClickHouse (or vice versa) is a multi-year project for established firms; the system is core infrastructure. For new firms, the choice shapes the team’s skill profile (kdb+ vs SQL-fluent), the cost structure (license vs commodity), and the operational model. Senior engineering candidates at quant firms should be ready to discuss these trade-offs; junior candidates should at least know that the choice exists and matters.

See also: Order Book Dynamics for Quant Interviews · Time-Series Analysis for Quant Interviews · Algorithmic Trading System Architecture
