Algorithmic Trading System Architecture: How Trading Platforms Are Built

For SWE candidates targeting low-latency trading firms (HRT, Jump Trading, Citadel Securities, Optiver, Tower) and trading-system roles at hedge funds (Two Sigma, D. E. Shaw, Citadel), understanding how an algorithmic trading platform is architected is essential interview material. Unlike typical tech systems-design questions (rate limiters, social feeds, distributed file systems), trading systems have specific concerns — latency budgets, deterministic behavior, regulatory compliance, fault tolerance, kill switches — that don’t show up in standard interview prep books. This guide covers how a real algo trading platform is structured and what comes up in systems-design interviews at quant firms.

The Core Components

An algorithmic trading platform has four major subsystems:

Market data ingestion: consumes real-time price and order book updates from exchanges, normalizes them into internal formats, distributes to consumers.
Strategy / signal engine: runs the algorithms that decide what to trade. Consumes market data, produces trade decisions.
Order management system (OMS): takes trade decisions, manages outstanding orders, sends, cancels, and amends orders at exchanges, and tracks fills and positions.
Risk and compliance: monitors positions, exposure, and trading patterns; enforces limits; provides kill-switch capability.

Around these are operational components: logging, monitoring, replay/debugging tools, regulatory reporting, post-trade analytics. Modern trading systems also include a research / backtesting platform that mirrors the production stack closely so strategies can be developed offline and deployed with minimal translation.

Market Data Layer

Every trading system starts with market data. Exchanges publish order book updates, trade reports, and reference data over specific protocols (FIX/FAST, ITCH, proprietary binary formats); order entry runs over separate protocols such as OUCH or FIX. The market data layer:

Subscribes to exchange feeds (often multiple per exchange: A and B feeds for redundancy)
Decodes incoming messages
Maintains an in-memory order book per instrument
Detects sequence gaps and triggers recovery
Distributes normalized updates to downstream consumers
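
As a rough illustration of the "in-memory order book per instrument" step, here is a minimal Python sketch of a price-level book. The message shape (side, price, size) and the names OrderBook and apply are assumptions made for this example, not any exchange's format; production books are usually written in C++ with array-based or otherwise cache-friendly layouts.

```python
from dataclasses import dataclass, field


@dataclass
class OrderBook:
    """Price-level book for one instrument: price -> aggregate resting size."""
    bids: dict = field(default_factory=dict)
    asks: dict = field(default_factory=dict)

    def apply(self, side: str, price: int, size: int) -> None:
        """Set the size at a price level; size 0 removes the level."""
        levels = self.bids if side == "B" else self.asks
        if size == 0:
            levels.pop(price, None)
        else:
            levels[price] = size

    def best_bid(self):
        """Highest bid price, or None if the bid side is empty."""
        return max(self.bids) if self.bids else None

    def best_ask(self):
        """Lowest ask price, or None if the ask side is empty."""
        return min(self.asks) if self.asks else None


book = OrderBook()
book.apply("B", 10000, 300)   # prices as integer ticks to avoid float drift
book.apply("B", 9999, 500)
book.apply("A", 10002, 200)
book.apply("B", 10000, 0)     # the top bid level is fully cancelled
print(book.best_bid(), book.best_ask())
```

Scanning a dict for the best price is O(number of levels); it keeps the sketch short but is not how a latency-sensitive book would find top of book.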

Key concerns:

Latency: measured in microseconds. Decode and dispatch sit on the critical path, so both must be fast.
Determinism: the same input produces the same output, exactly, every time. Important for replay and debugging.
Sequence tracking: every exchange message has a sequence number; gaps trigger recovery via backup feeds or replay servers.
Memory efficiency: the hot parts of the order books for thousands of instruments should stay in cache; padding and layout matter.
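
Sequence tracking is simple to sketch. The Python below assumes sequence numbers increase by one per message on a single feed; the class name GapDetector and its return values are made up for this example. What happens after a gap is reported (replay request, switch to the backup feed) is left out.

```python
class GapDetector:
    """Tracks the next expected sequence number on one feed."""

    def __init__(self, first_seq: int = 1):
        self.expected = first_seq

    def on_message(self, seq: int):
        """Return 'ok', 'stale', or a (first, last) range of missing numbers."""
        if seq < self.expected:
            # Already passed: a duplicate or a late arrival.
            return "stale"
        if seq > self.expected:
            missing = (self.expected, seq - 1)
            self.expected = seq + 1
            return missing
        self.expected = seq + 1
        return "ok"


det = GapDetector()
results = [det.on_message(s) for s in (1, 2, 5, 3, 6)]
print(results)
```

In this run, message 5 arrives while 3 is expected, so the detector reports the missing range (3, 4); the late copy of 3 is then classified as stale.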

HFT firms invest enormously in market data infrastructure because latency here directly translates to trading edge. Hardware acceleration (FPGAs) and kernel-bypass networking (for example Solarflare NICs with OpenOnload, or Mellanox NICs with DPDK) are common.

Strategy / Signal Engine

The strategy engine consumes market data and runs trading logic. Implementation varies dramatically:

HFT strategies

Highly optimized C++ inner loops; receive market data, compute decisions, emit orders in microseconds or sub-microseconds. Often hand-rolled with cache-aware data structures. Strategies are simple (relative to research-heavy strategies) because complexity costs latency.

Mid-frequency strategies

Decisions in milliseconds to seconds. More complex models (factor scores, ML predictions); typically C++ or modern systems languages. Strategies might run on signal bars (5-second bars, 1-minute bars) rather than every tick.

Slower / research-driven strategies

Decisions in minutes or hours. Often Python (with NumPy / pandas) or Java; complex models that take time to evaluate. The trade-off is latency for sophistication; for slow strategies, sophistication wins.

Common architectural patterns:

Event-driven: strategies react to market data events (top-of-book change, trade, etc.).
Bar-driven: strategies evaluate at fixed time intervals.
Mix: different strategies on different cadences within the same platform.
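
To make the event-driven versus bar-driven distinction concrete, here is a small Python sketch that turns a tick stream into fixed-width OHLC bars, which a bar-driven strategy would then evaluate once per bar. The function name ticks_to_bars and the (timestamp, price) tick shape are assumptions for this example; it also assumes ticks arrive in time order with integer-second timestamps.

```python
def ticks_to_bars(ticks, bar_seconds):
    """Group (timestamp_seconds, price) ticks into OHLC bars of fixed width."""
    bars = {}
    for ts, price in ticks:
        start = ts - ts % bar_seconds      # bar start time this tick falls into
        bar = bars.get(start)
        if bar is None:
            bars[start] = {"open": price, "high": price,
                           "low": price, "close": price}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price           # last tick seen so far in this bar
    return bars


ticks = [(0, 100), (2, 103), (4, 99), (5, 101), (9, 102)]
bars = ticks_to_bars(ticks, bar_seconds=5)
for start, bar in bars.items():
    print(start, bar)
```

An event-driven strategy would instead run its logic inside the loop on every tick; a mixed platform can do both from the same stream.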

Order Management System

The OMS sits between strategies and exchanges. It manages the lifecycle of orders:

Receives trade decisions from strategies
Translates them to exchange-specific protocols
Tracks acknowledgments, partial fills, and final fills
Handles order cancellation, modification (replace), and rejections
Maintains position and pending-order state

Key concerns:

State consistency: the OMS must track exactly what’s outstanding at every exchange. Bugs here cause overtrading or hung positions.
Idempotency: retries shouldn’t double-submit. The usual mechanism is a unique client order ID per order plus dedup logic keyed on it.
Latency: order send/receive paths must be fast.
Reconciliation: comparing internal state against exchange state at session boundaries; handling discrepancies.
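
The idempotency concern can be illustrated with a toy gateway that remembers client order IDs. This is a sketch under assumptions: OrderGateway, submit, and the ID format are invented for the example, and a real OMS would also persist the ID set so dedup survives a restart.

```python
class OrderGateway:
    """Remembers client order IDs so a retried submit is not sent twice."""

    def __init__(self):
        self.sent = {}        # client_order_id -> order payload
        self.wire_log = []    # stands in for messages actually sent out

    def submit(self, client_order_id: str, order: dict) -> bool:
        """Send the order once; return False if this ID was already sent."""
        if client_order_id in self.sent:
            return False
        self.sent[client_order_id] = order
        self.wire_log.append((client_order_id, order))
        return True


gw = OrderGateway()
order = {"symbol": "XYZ", "side": "buy", "qty": 100, "price": 10000}
first = gw.submit("strat1-000001", order)
retry = gw.submit("strat1-000001", order)   # e.g. the caller timed out and retried
print(first, retry, len(gw.wire_log))
```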

Risk and Compliance

Pre-trade risk: every order is checked against limits before submission. Common checks:

Position limits: are we exceeding allowed long/short positions?
Order size limits: is this order too large?
Price collar: is this order’s price reasonable given current market?
Frequency limits: are we sending too many orders too fast?
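
A minimal Python sketch of the first three checks follows. The names Limits and pre_trade_check, the limit values, and the choice of a percentage collar around a reference price are all assumptions for illustration. The position check here ignores pending orders, and the frequency limit is omitted; a real gate would handle both.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    max_position: int     # largest absolute net position allowed
    max_order_qty: int    # largest single order allowed
    collar_pct: float     # max distance from the reference price, as a fraction


def pre_trade_check(position, signed_qty, price, ref_price, limits):
    """Return the list of violated checks; an empty list means the order passes.

    signed_qty is positive for buys and negative for sells.
    """
    violations = []
    if abs(signed_qty) > limits.max_order_qty:
        violations.append("order_size")
    if abs(position + signed_qty) > limits.max_position:
        violations.append("position")
    if abs(price - ref_price) > limits.collar_pct * ref_price:
        violations.append("price_collar")
    return violations


limits = Limits(max_position=1000, max_order_qty=500, collar_pct=0.05)
print(pre_trade_check(900, 200, 101.0, 100.0, limits))   # would end at 1100
print(pre_trade_check(0, 100, 120.0, 100.0, limits))     # 20% from reference
print(pre_trade_check(0, 100, 100.5, 100.0, limits))     # passes
```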

Post-trade monitoring: watching position, P&L, exposure throughout the day; flagging anomalies; computing intraday risk metrics.

Kill switches: the ability to stop all trading instantly. Common implementations: a control plane that strategies and OMS check before acting; a hardware kill switch on network paths; combinations of both. Knight Capital’s 2012 incident (a deployment bug that lost the firm $440M in 45 minutes) is the canonical reminder of why kill switches matter.
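
The "control plane that strategies and OMS check before acting" pattern can be sketched as a shared flag that gates every outbound order. KillSwitch and send_order are hypothetical names, and the wire list stands in for the exchange connection; this shows only the software layer, not a hardware switch on the network path.

```python
import threading


class KillSwitch:
    """Process-wide flag; once tripped it stays tripped."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tripped = False
        self.reason = None

    def trip(self, reason: str) -> None:
        with self._lock:
            if not self._tripped:      # keep the first reason
                self._tripped = True
                self.reason = reason

    def is_tripped(self) -> bool:
        with self._lock:
            return self._tripped


def send_order(kill_switch, order, wire):
    """Gate every outbound order on the kill switch."""
    if kill_switch.is_tripped():
        return False
    wire.append(order)
    return True


ks = KillSwitch()
wire = []
before = send_order(ks, {"id": 1}, wire)
ks.trip("intraday loss limit breached")
after = send_order(ks, {"id": 2}, wire)
print(before, after, ks.reason)
```

A fuller version would also cancel resting orders when tripped; blocking new orders alone does not flatten existing exposure.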

Latency Budget Decomposition

For HFT systems, total tick-to-trade latency might be 1–5 microseconds. This breaks down roughly:

Network: 100ns–500ns each way (depends on co-location and switch fabric)
NIC and OS bypass: 50ns–200ns
Market data decode: 100ns–500ns
Strategy compute: 200ns–2μs (depends on complexity)
Order encoding: 100ns–500ns
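
Summing the ranges above is a quick sanity check on the 1–5 microsecond figure. The Python below just adds the listed numbers, under the assumption that the network cost is paid twice (inbound and outbound) and every other component once.

```python
# (low, high) in nanoseconds, taken from the breakdown above
budget_ns = {
    "network_inbound": (100, 500),
    "network_outbound": (100, 500),
    "nic_and_os_bypass": (50, 200),
    "market_data_decode": (100, 500),
    "strategy_compute": (200, 2000),
    "order_encoding": (100, 500),
}

best = sum(low for low, _ in budget_ns.values())
worst = sum(high for _, high in budget_ns.values())
print(f"{best} ns to {worst} ns")
```

Under those assumptions the total comes to roughly 0.65 to 4.2 microseconds, with strategy compute the largest single term at the high end.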

For a competitive HFT system, every component is squeezed. Slower work (database queries, complex computations) is pushed off the critical path; the hot loop only does what’s necessary to react.

For non-HFT systems (mid-frequency, hedge fund strategies), latency budgets are looser (milliseconds rather than microseconds), and architectural focus shifts to correctness, scalability, and operational robustness.

Common Systems-Design Questions

Design a market-data distribution system

Walk through: feed handlers, message normalization, in-memory order book, multicast or shared-memory distribution, subscriber management, replay/recovery. Discuss latency-vs-fanout trade-offs.

Design an order management system

State machine for orders (new, accepted, partially filled, fully filled, cancelled, rejected). Concurrency: how do you handle simultaneous strategy submissions? Reconciliation: handling exchange state vs internal state divergence. Idempotency in retries.
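
One way to start the state-machine discussion is with an explicit transition table. The Python sketch below uses the six states listed above; the table, the Order class, and its method names are assumptions for illustration. It deliberately ignores real-world races such as a fill arriving before the acknowledgment or a cancel crossing a fill.

```python
ALLOWED = {
    "new": {"accepted", "rejected"},
    "accepted": {"partially_filled", "filled", "cancelled"},
    "partially_filled": {"partially_filled", "filled", "cancelled"},
    "filled": set(),       # terminal
    "cancelled": set(),    # terminal
    "rejected": set(),     # terminal
}


class Order:
    def __init__(self, qty: int):
        self.state = "new"
        self.qty = qty
        self.filled = 0

    def transition(self, new_state: str) -> None:
        """Move to new_state, or raise if the table does not allow it."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

    def on_fill(self, fill_qty: int) -> None:
        """Apply a fill and move to partially_filled or filled."""
        total = self.filled + fill_qty
        if total > self.qty:
            raise ValueError("fill exceeds order quantity")
        self.transition("filled" if total == self.qty else "partially_filled")
        self.filled = total   # only updated once the transition is accepted


o = Order(qty=100)
o.transition("accepted")
o.on_fill(40)
o.on_fill(60)
print(o.state, o.filled)
```

Rejecting illegal transitions loudly is one way to surface the state-consistency bugs that otherwise show up as overtrading or hung positions.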

Design a backtesting framework

How do you replay historical data accurately? Slippage modeling. Look-ahead bias prevention. Walk-forward validation. Multi-strategy testing without interference.
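
Look-ahead bias prevention comes down to making sure a decision at time t sees only data available at time t. The Python sketch below enforces that by slicing the history before calling the signal. The backtest and momentum functions are invented for this example, and it assumes fills at the bar's close with zero slippage and no costs, which a real framework would model.

```python
def backtest(closes, signal_fn):
    """Decide on bar t using closes[:t+1], then earn the move from t to t+1."""
    pnl = 0.0
    for t in range(len(closes) - 1):
        position = signal_fn(closes[: t + 1])   # only data known at time t
        pnl += position * (closes[t + 1] - closes[t])
    return pnl


def momentum(history):
    """+1 if the last bar rose, -1 if it fell, 0 otherwise."""
    if len(history) < 2:
        return 0
    diff = history[-1] - history[-2]
    return (diff > 0) - (diff < 0)


closes = [100.0, 101.0, 103.0, 102.0, 104.0]
print(backtest(closes, momentum))
```

Passing the full closes list to signal_fn instead of the slice would let the signal peek at future bars, which is exactly the bug this structure is meant to rule out.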

Discuss the production / research split

Research code is exploratory and notebook-driven; production is rigorous and tested. How do you bridge them so research code can become production with minimal rewrite? Common patterns: shared Python libraries with strict interfaces; “research that compiles” via Cython or Numba; or rewrite-on-promotion with strict review.

Discuss kill switches

Multiple layers: software kill switch in OMS; hardware kill switch on network path; circuit breakers per strategy. Triggering criteria: P&L thresholds, position limits, error rates, manual intervention. How do you ensure a kill switch can’t itself fail?
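
One common answer to the "can't itself fail" question is to fail closed: trading is allowed only while a monitor keeps proving it is alive. The Python sketch below shows that idea as a heartbeat watchdog. The Watchdog class and its timeout are assumptions for this example, and time is passed in explicitly so the behavior is easy to follow.

```python
class Watchdog:
    """Trading is allowed only while the risk monitor's heartbeat is fresh."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_beat = None

    def beat(self, now: float) -> None:
        """Called periodically by the risk monitor."""
        self.last_beat = now

    def trading_allowed(self, now: float) -> bool:
        if self.last_beat is None:
            return False    # never heard from the monitor: fail closed
        return (now - self.last_beat) <= self.timeout_s


wd = Watchdog(timeout_s=0.5)
print(wd.trading_allowed(now=0.0))   # no heartbeat yet
wd.beat(now=1.0)
print(wd.trading_allowed(now=1.2))   # heartbeat is 0.2 s old
print(wd.trading_allowed(now=2.0))   # heartbeat is 1.0 s old
```

With this design a crashed or hung monitor stops trading by default, rather than leaving strategies running unsupervised.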

Frequently Asked Questions

How important are systems-design questions vs coding questions for trading-system roles?

Both matter. Coding questions test fluency in a language and ability to implement specific data structures correctly. Systems-design questions test understanding of trading-system architecture, latency considerations, fault tolerance, and operational concerns. Senior SWE roles weight systems design heavily; junior SWE roles weight coding more. Both rounds are typically present in HFT and trading-firm interview loops.

What level of latency awareness is expected at interviews?

Conversational at junior levels, fluent at senior levels. Junior candidates should know that latency matters, what cache hierarchies are, and that NIC/syscall overhead is substantial. Senior candidates should be able to discuss specific microsecond budgets, kernel bypass, lock-free data structures, FPGA acceleration, and similar low-level details. The bar at HRT and Jump Trading is genuinely high; candidates targeting mid-tier firms can prepare with several months of focused study.

Do all quant firms operate at HFT latencies?

No. HFT firms (HRT, Jump, Citadel Securities, parts of Tower and Optiver) compete at microsecond scale. Mid-frequency firms (most market makers, some hedge funds) operate at millisecond scale. Slower hedge funds (Two Sigma, D. E. Shaw, Bridgewater) operate at minute or longer scale. Architectural concerns scale accordingly: HFT systems are extreme; slower systems look more like normal high-performance distributed systems. Match interview prep to the firm tier you’re targeting.

What books should I use for trading-system architecture prep?

Algorithmic Trading and DMA by Barry Johnson covers execution-system architecture in detail. The Science of Algorithmic Trading and Portfolio Management by Robert Kissell covers the broader space. For low-level performance: Computer Architecture: A Quantitative Approach by Hennessy and Patterson is foundational. The Carnegie Mellon CS 15-440 course materials cover distributed systems in a way that translates to trading systems. Real-world experience matters too; if possible, work through tutorials on building a simple matching engine or market-data feed handler.

How do trading systems handle exchange outages or data gaps?

Several techniques, used in layers. Redundant feeds (most exchanges publish A and B feeds; consume both, take whichever copy of each message arrives first, and fall back to the other if one fails). Sequence number tracking with gap-fill via backup channels (replay servers, request-replay protocols). Trading strategies that detect uncertain market state and pause. Reconciliation procedures at session boundaries. Strong candidates discuss layered defenses: prevent gaps, detect them quickly, recover gracefully, and handle the period of uncertain state by trading less aggressively or not at all.
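
A/B feed arbitration can be sketched in a few lines of Python. The version below assumes both feeds carry the same messages with the same sequence numbers; FeedArbiter and its method names are invented for the example. It forwards whichever copy arrives first, drops the slower copy, and holds back out-of-order messages until the missing one shows up on either feed. Timeouts and replay requests for gaps that neither feed fills are left out.

```python
class FeedArbiter:
    """Merges A and B copies of a feed, releasing each sequence number once."""

    def __init__(self, first_seq: int = 1):
        self.next_seq = first_seq
        self.pending = {}    # seq -> (feed, payload), waiting for earlier seqs

    def on_message(self, feed: str, seq: int, payload: str):
        """Return the messages released, in order, by this arrival."""
        if seq < self.next_seq or seq in self.pending:
            return []        # slower copy of something already seen
        self.pending[seq] = (feed, payload)
        released = []
        while self.next_seq in self.pending:
            released.append((self.next_seq, *self.pending.pop(self.next_seq)))
            self.next_seq += 1
        return released


arb = FeedArbiter()
arrivals = [("A", 1, "x"), ("B", 1, "x"), ("B", 2, "y"),
            ("A", 4, "w"), ("B", 3, "z"), ("A", 2, "y")]
released = []
for feed, seq, payload in arrivals:
    released.extend(arb.on_message(feed, seq, payload))
print(released)
```

In this run, feed A never delivers message 3, so message 4 is held until feed B supplies 3; downstream consumers still see 1 through 4 in order, each exactly once.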