System Design Interview: Financial Exchange and Matching Engine

Exchange Architecture Overview

A financial exchange matches buy and sell orders — a buyer willing to pay $100 for AAPL is matched with a seller willing to accept $100. This sounds simple, but at scale (NYSE processes 5-10 billion shares per day, Binance handles 1.4 million orders per second at peak), every microsecond matters. The matching engine is the core of an exchange: it maintains an order book (all outstanding orders) and matches incoming orders against existing ones with sub-millisecond latency.

Order Types

  • Market order: execute immediately at the best available price. No price guarantee — if the market moves before execution, you get a worse price (slippage). Never rest in the order book.
  • Limit order: execute at a specified price or better. If no match is available, the order rests in the order book until matched, cancelled, or expired. Most liquid markets are primarily limit orders.
  • Stop order: becomes a market or limit order when the price reaches a trigger level. Stored separately; activated when the last trade price crosses the stop level.
  • Immediate-or-Cancel (IOC): fill what is immediately available; cancel the rest. Never rests in the book.
  • Good-Till-Cancelled (GTC): rests in the book until manually cancelled (or end of trading session).

Order Book Data Structure

The order book maintains all resting limit orders, organized by price level. For each price level, orders are sorted by time (FIFO: first-in, first-out). Price-time priority means: the lowest ask price executes first; at the same price, the oldest order executes first.


# Order book structure:
# Bids (buy orders): descending by price (highest bid first)
# Asks (sell orders): ascending by price (lowest ask first)
# Spread: difference between best bid and best ask

Bids:
  $100.00 → [Order(qty=500, time=09:30:01), Order(qty=200, time=09:30:05)]
  $99.99  → [Order(qty=1000, time=09:30:00)]
  $99.98  → [Order(qty=800, time=09:30:02)]

Asks:
  $100.01 → [Order(qty=300, time=09:30:00)]
  $100.02 → [Order(qty=600, time=09:30:03)]
  $100.05 → [Order(qty=2000, time=09:30:01)]

Best bid: $100.00, Best ask: $100.01, Spread: $0.01

# Implementation using sorted dictionaries and deques:
from __future__ import annotations  # defer hint evaluation (Order is defined elsewhere)

from collections import deque

from sortedcontainers import SortedDict  # third-party: pip install sortedcontainers

class OrderBook:
    def __init__(self, symbol: str):
        self.symbol = symbol
        # SortedDict with a negating key function so bids iterate highest-price-first
        self.bids: SortedDict[float, deque] = SortedDict(lambda k: -k)
        self.asks: SortedDict[float, deque] = SortedDict()
        self.orders: dict[str, Order] = {}  # order_id → Order (for cancellation)

    def best_bid(self) -> float | None:
        return self.bids.keys()[0] if self.bids else None

    def best_ask(self) -> float | None:
        return self.asks.keys()[0] if self.asks else None
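The `orders` map above also enables O(1) cancellation without searching a price-level deque: mark the order dead and let the matching loop discard it lazily. A minimal stdlib sketch of that lazy-deletion idea (class and method names are hypothetical):

```python
from collections import deque

class MiniBook:
    """One price level, enough to show lazy cancellation."""
    def __init__(self):
        self.queue = deque()   # FIFO of orders at this price level
        self.orders = {}       # order_id -> order dict, for O(1) lookup

    def add(self, order_id: str, qty: int) -> None:
        order = {"id": order_id, "qty": qty}
        self.queue.append(order)
        self.orders[order_id] = order

    def cancel(self, order_id: str) -> None:
        # O(1): zero the quantity; the queue entry is skipped later
        order = self.orders.pop(order_id, None)
        if order:
            order["qty"] = 0

    def pop_next_live(self):
        # Called by the matching loop: discard cancelled entries lazily
        while self.queue and self.queue[0]["qty"] == 0:
            self.queue.popleft()
        return self.queue[0] if self.queue else None
```

The trade-off: cancelled entries linger in the deque until matching reaches them, so memory is reclaimed lazily rather than immediately.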

Matching Algorithm


def match(self, incoming: Order) -> list[Trade]:
    trades = []

    if incoming.side == "buy":
        # Match against asks (ascending price — lowest ask first)
        while incoming.qty > 0 and self.asks:
            best_ask_price, ask_queue = self.asks.peekitem(0)
            if incoming.price < best_ask_price:
                break  # best ask is above the limit; stop matching
                       # (a market buy can be modeled with price = float("inf"))
            while incoming.qty > 0 and ask_queue:
                resting = ask_queue[0]
                filled_qty = min(incoming.qty, resting.qty)
                trades.append(Trade(
                    buy_order_id=incoming.id,
                    sell_order_id=resting.id,
                    price=best_ask_price,  # trades execute at the resting order's price
                    qty=filled_qty,
                    symbol=self.symbol
                ))
                incoming.qty -= filled_qty
                resting.qty -= filled_qty
                if resting.qty == 0:
                    ask_queue.popleft()
                    del self.orders[resting.id]
            if not ask_queue:
                del self.asks[best_ask_price]
    # Mirror logic for sell orders matching against bids

    # Any remaining quantity rests in the book (if limit order)
    if incoming.qty > 0 and incoming.type == "limit":
        self.add_to_book(incoming)

    return trades
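The matching loop can be exercised end to end with a stdlib-only toy — a plain sorted list stands in for sortedcontainers, only the ask side is modeled, and all names are illustrative:

```python
import bisect
from collections import deque
from dataclasses import dataclass

@dataclass
class Order:
    id: str
    side: str      # "buy" or "sell"
    price: float
    qty: int

@dataclass
class Trade:
    buy_order_id: str
    sell_order_id: str
    price: float
    qty: int

class ToyBook:
    """Ask side only — enough to demonstrate price-time priority."""
    def __init__(self):
        self.ask_prices: list[float] = []        # sorted ascending
        self.ask_levels: dict[float, deque] = {}

    def add_ask(self, order: Order) -> None:
        if order.price not in self.ask_levels:
            bisect.insort(self.ask_prices, order.price)
            self.ask_levels[order.price] = deque()
        self.ask_levels[order.price].append(order)  # FIFO at each level

    def match_buy(self, incoming: Order) -> list[Trade]:
        trades = []
        while incoming.qty > 0 and self.ask_prices:
            best = self.ask_prices[0]
            if incoming.price < best:
                break                             # book is too expensive
            q = self.ask_levels[best]
            while incoming.qty > 0 and q:
                resting = q[0]
                fill = min(incoming.qty, resting.qty)
                trades.append(Trade(incoming.id, resting.id, best, fill))
                incoming.qty -= fill
                resting.qty -= fill
                if resting.qty == 0:
                    q.popleft()
            if not q:                             # level exhausted
                del self.ask_levels[best]
                self.ask_prices.pop(0)
        return trades
```

With asks of 300 @ $100.01 and 600 @ $100.02, a buy of 500 limit $100.02 produces two trades — 300 @ $100.01 then 200 @ $100.02 — and leaves 400 resting at $100.02.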

Sequence Numbers and Determinism

The matching engine must produce identical results when replayed — this is critical for auditability and disaster recovery. Every incoming order is assigned a monotonically increasing sequence number before entering the matching engine. The engine processes orders in strict sequence order. Given the same input sequence, the engine always produces the same trades and order book state.


# Order entry with sequence number:
from dataclasses import dataclass

@dataclass
class SequencedOrder:
    seq_no: int          # globally unique, monotonically increasing
    timestamp_ns: int    # nanosecond timestamp (hardware clock)
    order: "Order"       # the original order message, unchanged

# The engine only processes seq_no N after confirming seq_no N-1 is processed
# Gap detection: if seq_no 105 arrives before 104, wait for 104

# Persistence: every order and trade is written to a write-ahead log
# WAL format: seq_no | order_json | crc32_checksum
# On crash: replay WAL from last checkpoint to reconstruct order book state
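The WAL format sketched above (seq_no | order_json | crc32) might be implemented as follows. The text encoding and the stop-on-corrupt-tail recovery policy are assumptions for illustration, not any real exchange's format:

```python
import json
import zlib

def wal_append(path: str, seq_no: int, order: dict) -> None:
    """Append one entry: seq_no | order_json | crc32 (hex)."""
    payload = f"{seq_no}|{json.dumps(order, sort_keys=True)}"
    crc = zlib.crc32(payload.encode()) & 0xFFFFFFFF
    with open(path, "a") as f:
        f.write(f"{payload}|{crc:08x}\n")
        f.flush()  # a real engine would also fsync, or batch via group commit

def wal_replay(path: str) -> list[tuple[int, dict]]:
    """Replay entries in order, verifying checksums and detecting gaps."""
    entries, expected = [], None
    with open(path) as f:
        for line in f:
            payload, _, crc_hex = line.rstrip("\n").rpartition("|")
            if zlib.crc32(payload.encode()) & 0xFFFFFFFF != int(crc_hex, 16):
                break  # torn/corrupt tail after a crash: stop replay here
            seq_str, _, body = payload.partition("|")
            seq = int(seq_str)
            if expected is not None and seq != expected:
                raise ValueError(f"sequence gap: expected {expected}, got {seq}")
            expected = seq + 1
            entries.append((seq, json.loads(body)))
    return entries
```

Replaying the entries through the (deterministic) matching engine reconstructs the order book state exactly.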

Performance and Latency

Professional exchanges measure latency in microseconds. Optimization techniques used by high-frequency trading (HFT) infrastructure:

  • Lock-free data structures: the matching engine runs in a single thread per symbol — no locks needed. Multi-symbol parallelism: each symbol runs on its own CPU core.
  • CPU core pinning: the matching engine thread is pinned to a specific CPU core (NUMA-aware), preventing OS scheduling jitter. The kernel boot parameter isolcpus (e.g. isolcpus=3) isolates the matching core so no other processes are scheduled on it.
  • DPDK / kernel bypass: network packets are received directly by the application, bypassing the Linux kernel network stack (which adds 10-50 μs). DPDK (Data Plane Development Kit) polls the NIC directly.
  • Huge pages: 2 MB memory pages instead of 4 KB pages — each TLB entry then covers 512× as much memory, sharply reducing TLB misses for the order book.
  • Pre-allocated memory pools: order objects are pre-allocated and reused from a pool, eliminating malloc/free during trading hours.
  • Latency breakdown (typical exchange): NIC to CPU: 1 μs; kernel network stack: 5-20 μs (bypassed with DPDK); matching logic: 1-5 μs; persistence to NVMe: 10-50 μs. Total: 20-100 μs for a software exchange. FPGA-based exchanges achieve < 1 μs.
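The pre-allocated pool idea can be illustrated in Python, though real engines do this in C/C++ or on FPGAs. The free list below is a sketch, not production code:

```python
class OrderPool:
    """Pre-allocated free list: acquire/release instead of malloc/free."""
    def __init__(self, capacity: int):
        # Allocate all order slots up front, before trading hours
        self._free = [{"id": None, "price": 0.0, "qty": 0}
                      for _ in range(capacity)]

    def acquire(self, order_id, price, qty):
        if not self._free:
            raise MemoryError("pool exhausted — size it for peak load")
        slot = self._free.pop()          # O(1), no allocation
        slot["id"], slot["price"], slot["qty"] = order_id, price, qty
        return slot

    def release(self, slot) -> None:
        slot["id"] = None                # scrub and return to the free list
        self._free.append(slot)
```

Because slots are recycled rather than freed, there is no allocator activity (and no GC pressure) on the hot path — the same motivation as the C/C++ pools the bullet describes.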

Market Data Distribution

After a trade executes, market data must be broadcast to all participants (order book updates, trade confirmations, tick data). This is a high-fan-out problem: NYSE has thousands of market data subscribers.


# Market data feed (L2 data: full order book updates)
# Protocol: UDP multicast (not TCP — no acknowledgment overhead)
# Each message tagged with sequence number for gap detection

# Market data hierarchy:
# L1: best bid/ask only (cheapest, used by retail)
# L2: full order book up to 10 price levels (professional)
# L3: every individual order and cancellation (HFT, market makers)

# Consolidated tape: NASDAQ/NYSE publish trade data to the SIP (Securities
# Information Processor), which aggregates and redistributes it to all
# participants under Regulation NMS. (Separately, SEC Rule 613 mandates
# Consolidated Audit Trail (CAT) reporting to regulators.)

Interview Questions

  • Design the data structure for an order book that supports O(1) best bid/ask lookup and O(log N) insertion
  • How do you guarantee exactly-once order processing in a distributed exchange?
  • A matching engine processes 1M orders/second — how do you persist this without becoming the bottleneck?
  • How do circuit breakers work in stock exchanges and how would you implement one?
  • Design a crypto exchange that operates 24/7 with no scheduled downtime

Frequently Asked Questions

What is the order book data structure and how does price-time priority work?

The order book is the core data structure of a financial exchange — it maintains all outstanding limit orders organized by price, waiting to be matched with incoming orders. Two sides: the bid side (buy orders, highest price first) and the ask side (sell orders, lowest price first). Price-time priority (FIFO) governs matching: among all orders at the same price level, the order placed earliest is matched first. Implementation: for each side, use a sorted structure (sorted dictionary / tree map) keyed by price, with each price level holding a queue (FIFO) of orders at that price. Best bid lookup is O(1) (peek at the first key of the bid structure), insertion is O(log P) where P is the number of distinct price levels, and cancellation requires O(1) if we maintain a hash map from order_id to its position in the queue. When a market order arrives, it takes from the best price level on the opposite side, consuming from the front of the queue until filled. A limit buy order priced at $100 will match against asks priced at $100 or lower; if no asks are cheap enough, the order rests in the bid side at $100 until a sell order arrives priced at $100 or lower. Matching produces a trade record: buyer_order_id, seller_order_id, price, quantity, timestamp_ns. Sequence numbers on every order ensure deterministic replay — given the same input sequence, the matching engine always produces identical trades.

How do exchanges achieve microsecond latency in the matching engine?

Professional exchange matching engines measure latency in microseconds (1 μs = 0.001 ms) through a combination of hardware, OS, and software optimizations. Hardware: co-location (exchange customers rent rack space in the same data center as the matching engine, reducing network latency from 10+ ms to < 1 μs). FPGA-based matching engines (used by NASDAQ, CME) execute in sub-microsecond because logic runs directly on programmable gates rather than software. Software matching engines typically achieve 5-50 μs. OS-level: CPU core isolation (isolcpus kernel parameter reserves specific cores exclusively for the matching engine — no OS scheduling interrupts). NUMA awareness (matching engine memory allocated on the same NUMA node as its CPU — remote NUMA access adds ~60 ns). Real-time kernel patches or PREEMPT_RT to reduce scheduling jitter. Network: DPDK (Data Plane Development Kit) bypasses the Linux kernel network stack entirely — the matching engine polls the NIC directly, saving 5-20 μs of kernel overhead. RDMA (Remote Direct Memory Access) for IPC between co-located components. Application-level: single-threaded matching per symbol (no locks or synchronization). Pre-allocated object pools (no dynamic memory allocation during trading hours). Cache-friendly data layouts (order queues in contiguous memory). The aggregate result: software exchanges achieve 5-100 μs round-trip from order receipt to trade acknowledgment. Hardware (FPGA) exchanges: under 1 μs.

How do you persist trades and orders durably without becoming a bottleneck?

The matching engine runs at millions of events per second — traditional synchronous database writes would become the bottleneck immediately. Production-grade persistence uses a write-ahead log (WAL) pattern optimized for sequential I/O: (1) Sequencer assigns a monotonically increasing sequence number to each incoming order before it enters the matching engine. (2) The WAL writer appends each sequenced order to an append-only log on NVMe SSD (sequential writes reach 3-7 GB/s on NVMe — ample headroom for the 500K-1M events/second a matching engine produces). (3) The WAL entry contains: sequence_number, timestamp_ns, raw order message, CRC32 checksum. (4) The order is handed to the matching engine without waiting for the write to complete — the engine runs ahead of persistence and the WAL catches up asynchronously, with outbound acknowledgments held until the corresponding WAL entry is durable. (5) On crash recovery: replay all WAL entries from the last known checkpoint. The matching engine rebuilds the order book state deterministically. (6) Checkpointing: periodically snapshot the full order book state to disk. On recovery, load the snapshot and replay only the WAL entries since the last snapshot. Without snapshots, recovery requires replaying the entire day's WAL (could be billions of entries). In-memory databases like Redis or custom memory-mapped files are used for the live order book — disk persistence is only for the append-only WAL. This design exploits sequential NVMe write bandwidth while keeping the matching engine at microsecond latency.

