System Design Interview: Financial Exchange and Matching Engine

Exchange Architecture Overview

A financial exchange matches buy and sell orders — a buyer willing to pay $100 for AAPL is matched with a seller willing to accept $100. This sounds simple, but at scale (NYSE processes 5-10 billion shares per day, Binance handles 1.4 million orders per second at peak), every microsecond matters. The matching engine is the core of an exchange: it maintains an order book (all outstanding orders) and matches incoming orders against existing ones with sub-millisecond latency.

Order Types

  • Market order: execute immediately at the best available price. No price guarantee — if the market moves before execution, you get a worse price (slippage). Never rest in the order book.
  • Limit order: execute at a specified price or better. If no match is available, the order rests in the order book until matched, cancelled, or expired. Most liquid markets are primarily limit orders.
  • Stop order: becomes a market or limit order when the price reaches a trigger level. Stored separately; activated when the last trade price crosses the stop level.
  • Immediate-or-Cancel (IOC): fill what is immediately available; cancel the rest. Never rests in the book.
  • Good-Till-Cancelled (GTC): rests in the book until manually cancelled (or end of trading session).

Order Book Data Structure

The order book maintains all resting limit orders, organized by price level. For each price level, orders are sorted by time (FIFO: first-in, first-out). Price-time priority means: the lowest ask price executes first; at the same price, the oldest order executes first.


# Order book structure:
# Bids (buy orders): descending by price (highest bid first)
# Asks (sell orders): ascending by price (lowest ask first)
# Spread: difference between best bid and best ask

Bids:
  $100.00 → [Order(qty=500, time=09:30:01), Order(qty=200, time=09:30:05)]
  $99.99  → [Order(qty=1000, time=09:30:00)]
  $99.98  → [Order(qty=800, time=09:30:02)]

Asks:
  $100.01 → [Order(qty=300, time=09:30:00)]
  $100.02 → [Order(qty=600, time=09:30:03)]
  $100.05 → [Order(qty=2000, time=09:30:01)]

Best bid: $100.00, Best ask: $100.01, Spread: $0.01

# Implementation using sorted dictionaries and deques:
from __future__ import annotations  # defer hint evaluation (Order is defined elsewhere)

from collections import deque

from sortedcontainers import SortedDict  # third-party: pip install sortedcontainers

class OrderBook:
    def __init__(self, symbol: str):
        self.symbol = symbol
        # SortedDict with a negating key function so bids iterate highest-price-first
        self.bids: SortedDict[float, deque] = SortedDict(lambda k: -k)
        self.asks: SortedDict[float, deque] = SortedDict()
        self.orders: dict[str, Order] = {}  # order_id → Order (for cancellation)

    def best_bid(self) -> float | None:
        return self.bids.keys()[0] if self.bids else None

    def best_ask(self) -> float | None:
        return self.asks.keys()[0] if self.asks else None
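The `orders` map above also enables O(1) cancellation without searching a price-level deque: mark the order dead and let the matching loop discard it lazily. A minimal stdlib sketch of that lazy-deletion idea (class and method names are hypothetical):

```python
from collections import deque

class MiniBook:
    """One price level, enough to show lazy cancellation."""
    def __init__(self):
        self.queue = deque()   # FIFO of orders at this price level
        self.orders = {}       # order_id -> order dict, for O(1) lookup

    def add(self, order_id: str, qty: int) -> None:
        order = {"id": order_id, "qty": qty}
        self.queue.append(order)
        self.orders[order_id] = order

    def cancel(self, order_id: str) -> None:
        # O(1): zero the quantity; the queue entry is skipped later
        order = self.orders.pop(order_id, None)
        if order:
            order["qty"] = 0

    def pop_next_live(self):
        # Called by the matching loop: discard cancelled entries lazily
        while self.queue and self.queue[0]["qty"] == 0:
            self.queue.popleft()
        return self.queue[0] if self.queue else None
```

The trade-off: cancelled entries linger in the deque until matching reaches them, so memory is reclaimed lazily rather than immediately.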

Matching Algorithm


def match(self, incoming: Order) -> list[Trade]:
    trades = []

    if incoming.side == "buy":
        # Match against asks (ascending price — lowest ask first)
        while incoming.qty > 0 and self.asks:
            best_ask_price, ask_queue = self.asks.peekitem(0)
            if incoming.price < best_ask_price:
                break  # best ask is above the limit; stop matching
                       # (a market buy can be modeled with price = float("inf"))
            while incoming.qty > 0 and ask_queue:
                resting = ask_queue[0]
                filled_qty = min(incoming.qty, resting.qty)
                trades.append(Trade(
                    buy_order_id=incoming.id,
                    sell_order_id=resting.id,
                    price=best_ask_price,  # trades execute at the resting order's price
                    qty=filled_qty,
                    symbol=self.symbol
                ))
                incoming.qty -= filled_qty
                resting.qty -= filled_qty
                if resting.qty == 0:
                    ask_queue.popleft()
                    del self.orders[resting.id]
            if not ask_queue:
                del self.asks[best_ask_price]
    # Mirror logic for sell orders matching against bids

    # Any remaining quantity rests in the book (if limit order)
    if incoming.qty > 0 and incoming.type == "limit":
        self.add_to_book(incoming)

    return trades
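The matching loop can be exercised end to end with a stdlib-only toy — a plain sorted list stands in for sortedcontainers, only the ask side is modeled, and all names are illustrative:

```python
import bisect
from collections import deque
from dataclasses import dataclass

@dataclass
class Order:
    id: str
    side: str      # "buy" or "sell"
    price: float
    qty: int

@dataclass
class Trade:
    buy_order_id: str
    sell_order_id: str
    price: float
    qty: int

class ToyBook:
    """Ask side only — enough to demonstrate price-time priority."""
    def __init__(self):
        self.ask_prices: list[float] = []        # sorted ascending
        self.ask_levels: dict[float, deque] = {}

    def add_ask(self, order: Order) -> None:
        if order.price not in self.ask_levels:
            bisect.insort(self.ask_prices, order.price)
            self.ask_levels[order.price] = deque()
        self.ask_levels[order.price].append(order)  # FIFO at each level

    def match_buy(self, incoming: Order) -> list[Trade]:
        trades = []
        while incoming.qty > 0 and self.ask_prices:
            best = self.ask_prices[0]
            if incoming.price < best:
                break                             # book is too expensive
            q = self.ask_levels[best]
            while incoming.qty > 0 and q:
                resting = q[0]
                fill = min(incoming.qty, resting.qty)
                trades.append(Trade(incoming.id, resting.id, best, fill))
                incoming.qty -= fill
                resting.qty -= fill
                if resting.qty == 0:
                    q.popleft()
            if not q:                             # level exhausted
                del self.ask_levels[best]
                self.ask_prices.pop(0)
        return trades
```

With asks of 300 @ $100.01 and 600 @ $100.02, a buy of 500 limit $100.02 produces two trades — 300 @ $100.01 then 200 @ $100.02 — and leaves 400 resting at $100.02.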

Sequence Numbers and Determinism

The matching engine must produce identical results when replayed — this is critical for auditability and disaster recovery. Every incoming order is assigned a monotonically increasing sequence number before entering the matching engine. The engine processes orders in strict sequence order. Given the same input sequence, the engine always produces the same trades and order book state.


# Order entry with sequence number:
from dataclasses import dataclass

@dataclass
class SequencedOrder:
    seq_no: int          # globally unique, monotonically increasing
    timestamp_ns: int    # nanosecond timestamp (hardware clock)
    order: "Order"       # the original order message, unchanged

# The engine only processes seq_no N after confirming seq_no N-1 is processed
# Gap detection: if seq_no 105 arrives before 104, wait for 104

# Persistence: every order and trade is written to a write-ahead log
# WAL format: seq_no | order_json | crc32_checksum
# On crash: replay WAL from last checkpoint to reconstruct order book state
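The WAL format sketched above (seq_no | order_json | crc32) might be implemented as follows. The text encoding and the stop-on-corrupt-tail recovery policy are assumptions for illustration, not any real exchange's format:

```python
import json
import zlib

def wal_append(path: str, seq_no: int, order: dict) -> None:
    """Append one entry: seq_no | order_json | crc32 (hex)."""
    payload = f"{seq_no}|{json.dumps(order, sort_keys=True)}"
    crc = zlib.crc32(payload.encode()) & 0xFFFFFFFF
    with open(path, "a") as f:
        f.write(f"{payload}|{crc:08x}\n")
        f.flush()  # a real engine would also fsync, or batch via group commit

def wal_replay(path: str) -> list[tuple[int, dict]]:
    """Replay entries in order, verifying checksums and detecting gaps."""
    entries, expected = [], None
    with open(path) as f:
        for line in f:
            payload, _, crc_hex = line.rstrip("\n").rpartition("|")
            if zlib.crc32(payload.encode()) & 0xFFFFFFFF != int(crc_hex, 16):
                break  # torn/corrupt tail after a crash: stop replay here
            seq_str, _, body = payload.partition("|")
            seq = int(seq_str)
            if expected is not None and seq != expected:
                raise ValueError(f"sequence gap: expected {expected}, got {seq}")
            expected = seq + 1
            entries.append((seq, json.loads(body)))
    return entries
```

Replaying the entries through the (deterministic) matching engine reconstructs the order book state exactly.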

Performance and Latency

Professional exchanges measure latency in microseconds. Optimization techniques used by high-frequency trading (HFT) infrastructure:

  • Lock-free data structures: the matching engine runs in a single thread per symbol — no locks needed. Multi-symbol parallelism: each symbol runs on its own CPU core.
  • CPU core pinning: the matching engine thread is pinned to a specific CPU core (NUMA-aware), preventing OS scheduling jitter. The kernel boot parameter isolcpus (e.g. isolcpus=3) isolates the matching core so no other processes are scheduled on it.
  • DPDK / kernel bypass: network packets are received directly by the application, bypassing the Linux kernel network stack (which adds 10-50 μs). DPDK (Data Plane Development Kit) polls the NIC directly.
  • Huge pages: 2 MB memory pages instead of 4 KB pages — each TLB entry then covers 512× as much memory, sharply reducing TLB misses for the order book.
  • Pre-allocated memory pools: order objects are pre-allocated and reused from a pool, eliminating malloc/free during trading hours.
  • Latency breakdown (typical exchange): NIC to CPU: 1 μs; kernel network stack: 5-20 μs (bypassed with DPDK); matching logic: 1-5 μs; persistence to NVMe: 10-50 μs. Total: 20-100 μs for a software exchange. FPGA-based exchanges achieve < 1 μs.
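The pre-allocated pool idea can be illustrated in Python, though real engines do this in C/C++ or on FPGAs. The free list below is a sketch, not production code:

```python
class OrderPool:
    """Pre-allocated free list: acquire/release instead of malloc/free."""
    def __init__(self, capacity: int):
        # Allocate all order slots up front, before trading hours
        self._free = [{"id": None, "price": 0.0, "qty": 0}
                      for _ in range(capacity)]

    def acquire(self, order_id, price, qty):
        if not self._free:
            raise MemoryError("pool exhausted — size it for peak load")
        slot = self._free.pop()          # O(1), no allocation
        slot["id"], slot["price"], slot["qty"] = order_id, price, qty
        return slot

    def release(self, slot) -> None:
        slot["id"] = None                # scrub and return to the free list
        self._free.append(slot)
```

Because slots are recycled rather than freed, there is no allocator activity (and no GC pressure) on the hot path — the same motivation as the C/C++ pools the bullet describes.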

Market Data Distribution

After a trade executes, market data must be broadcast to all participants (order book updates, trade confirmations, tick data). This is a high-fan-out problem: NYSE has thousands of market data subscribers.


# Market data feed (L2 data: full order book updates)
# Protocol: UDP multicast (not TCP — no acknowledgment overhead)
# Each message tagged with sequence number for gap detection

# Market data hierarchy:
# L1: best bid/ask only (cheapest, used by retail)
# L2: full order book up to 10 price levels (professional)
# L3: every individual order and cancellation (HFT, market makers)

# Consolidated tape: NASDAQ/NYSE publish trade data to the SIP (Securities
# Information Processor), which aggregates and redistributes it to all
# participants under Regulation NMS. (Separately, SEC Rule 613 mandates
# Consolidated Audit Trail (CAT) reporting to regulators.)

Interview Questions

  • Design the data structure for an order book that supports O(1) best bid/ask lookup and O(log N) insertion
  • How do you guarantee exactly-once order processing in a distributed exchange?
  • A matching engine processes 1M orders/second — how do you persist this without becoming the bottleneck?
  • How do circuit breakers work in stock exchanges and how would you implement one?
  • Design a crypto exchange that operates 24/7 with no scheduled downtime

Frequently Asked Questions

What is the order book data structure and how does price-time priority work?

The order book is the core data structure of a financial exchange — it maintains all outstanding limit orders organized by price, waiting to be matched with incoming orders. Two sides: the bid side (buy orders, highest price first) and the ask side (sell orders, lowest price first). Price-time priority (FIFO) governs matching: among all orders at the same price level, the order placed earliest is matched first. Implementation: for each side, use a sorted structure (sorted dictionary / tree map) keyed by price, with each price level holding a queue (FIFO) of orders at that price. Best bid lookup is O(1) (peek at the first key of the bid structure), insertion is O(log P) where P is the number of distinct price levels, and cancellation requires O(1) if we maintain a hash map from order_id to its position in the queue. When a market order arrives, it takes from the best price level on the opposite side, consuming from the front of the queue until filled. A limit buy order priced at $100 will match against asks priced at $100 or lower; if no asks are cheap enough, the order rests in the bid side at $100 until a sell order arrives priced at $100 or lower. Matching produces a trade record: buyer_order_id, seller_order_id, price, quantity, timestamp_ns. Sequence numbers on every order ensure deterministic replay — given the same input sequence, the matching engine always produces identical trades.

How do exchanges achieve microsecond latency in the matching engine?

Professional exchange matching engines measure latency in microseconds (1 μs = 0.001 ms) through a combination of hardware, OS, and software optimizations. Hardware: co-location (exchange customers rent rack space in the same data center as the matching engine, reducing network latency from 10+ ms to < 1 μs). FPGA-based matching engines (used by NASDAQ, CME) execute in sub-microsecond because logic runs directly on programmable gates rather than software. Software matching engines typically achieve 5-50 μs. OS-level: CPU core isolation (isolcpus kernel parameter reserves specific cores exclusively for the matching engine — no OS scheduling interrupts). NUMA awareness (matching engine memory allocated on the same NUMA node as its CPU — remote NUMA access adds ~60 ns). Real-time kernel patches or PREEMPT_RT to reduce scheduling jitter. Network: DPDK (Data Plane Development Kit) bypasses the Linux kernel network stack entirely — the matching engine polls the NIC directly, saving 5-20 μs of kernel overhead. RDMA (Remote Direct Memory Access) for IPC between co-located components. Application-level: single-threaded matching per symbol (no locks or synchronization). Pre-allocated object pools (no dynamic memory allocation during trading hours). Cache-friendly data layouts (order queues in contiguous memory). The aggregate result: software exchanges achieve 5-100 μs round-trip from order receipt to trade acknowledgment. Hardware (FPGA) exchanges: under 1 μs.

How do you persist trades and orders durably without becoming a bottleneck?

The matching engine runs at millions of events per second — traditional synchronous database writes would become the bottleneck immediately. Production-grade persistence uses a write-ahead log (WAL) pattern optimized for sequential I/O: (1) Sequencer assigns a monotonically increasing sequence number to each incoming order before it enters the matching engine. (2) The WAL writer appends each sequenced order to an append-only log on NVMe SSD (sequential writes reach 3-7 GB/s on NVMe — ample headroom for the 500K-1M events/second a matching engine produces). (3) The WAL entry contains: sequence_number, timestamp_ns, raw order message, CRC32 checksum. (4) The order is handed to the matching engine without waiting for the write to complete — the engine runs ahead of persistence and the WAL catches up asynchronously, with outbound acknowledgments held until the corresponding WAL entry is durable. (5) On crash recovery: replay all WAL entries from the last known checkpoint. The matching engine rebuilds the order book state deterministically. (6) Checkpointing: periodically snapshot the full order book state to disk. On recovery, load the snapshot and replay only the WAL entries since the last snapshot. Without snapshots, recovery requires replaying the entire day's WAL (could be billions of entries). In-memory databases like Redis or custom memory-mapped files are used for the live order book — disk persistence is only for the append-only WAL. This design exploits sequential NVMe write bandwidth while keeping the matching engine at microsecond latency.

