High-Frequency Trading Infrastructure: Latency Optimization

Published: March 1, 2026 Updated: May 21, 2026 Larry Qu 25 min read

Introduction

High-frequency trading (HFT) operates at the boundary of physics and computing — where a nanosecond of delay can determine profit or loss. The fastest firms achieve tick-to-trade latencies under 500 nanoseconds using custom FPGA pipelines, kernel-bypass networking, and servers placed meters from exchange matching engines.

This article breaks down the complete HFT infrastructure stack: from physical-layer optimization and colocation strategy through network architecture, FPGA acceleration, market data protocols, and order management. Each section covers proven techniques used by quantitative trading firms in 2026 production environments.

Prerequisites: Familiarity with Linux networking, C/Python, and basic digital logic concepts. No prior HFT experience required.


Understanding HFT Latency

Latency Hierarchy

Tick-to-trade latency decomposes into a chain of serial dependencies. Each link must be measured, understood, and optimized independently:

┌─────────────────────────────────────────────────────────────────────┐
│                    HFT LATENCY HIERARCHY                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  TICK-TO-TRADE LATENCY BREAKDOWN                                     │
│                                                                     │
│  Market Data Arrival                                                 │
│    └─ Network Latency (10μs - 1ms)                                  │
│         └─ NIC Processing (1-10μs)                                  │
│              └─ OS/Driver Latency (1-5μs)                           │
│                   └─ Application Processing (1-100μs)               │
│                        └─ Order Entry (1-50μs)                      │
│                             └─ Exchange Processing (10-100μs)       │
│                                                                     │
│  TARGET LATENCIES BY SYSTEM TYPE:                                    │
│  ┌──────────────────────┬─────────────────────────────────────────┐ │
│  │ System Type          │ Target Latency                          │ │
│  ├──────────────────────┼─────────────────────────────────────────┤ │
│  │ FPGA Trading         │ < 1 microsecond                         │ │
│  │ Direct Memory Access │ 1-10 microseconds                       │ │
│  │ Kernel Bypass        │ 10-100 microseconds                     │ │
│  │ Standard Linux       │ 100μs - 1 millisecond                   │ │
│  │ Cloud-based          │ 1-10 milliseconds                       │ │
│  └──────────────────────┴─────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

The critical insight: improving any single stage yields diminishing returns once another stage dominates the total. A kernel-bypass NIC that cuts network latency by 5μs provides little end-to-end benefit if the application layer still takes 50μs to process a tick.
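The bottleneck effect can be made concrete with a small latency-budget calculation. The sketch below is illustrative only: the stage names and microsecond figures are assumed values chosen to echo the example above, not measurements.

```python
# Illustrative tick-to-trade budget. All stage latencies (microseconds)
# are assumed values, not measurements.
stages_us = {
    "network": 10.0,
    "nic": 2.0,
    "os_driver": 3.0,
    "application": 50.0,
    "order_entry": 5.0,
}

def total_latency(stages):
    """Stages run serially, so end-to-end latency is their sum."""
    return sum(stages.values())

def bottleneck(stages):
    """Name of the stage contributing the most latency."""
    return max(stages, key=stages.get)

def relative_gain(stages, stage, saved_us):
    """Fraction of end-to-end latency removed by cutting one stage."""
    saved = min(saved_us, stages[stage])  # cannot save more than the stage costs
    return saved / total_latency(stages)

print("total:", total_latency(stages_us), "us")
print("bottleneck:", bottleneck(stages_us))
print("gain from 5 us network cut:",
      round(relative_gain(stages_us, "network", 5.0), 3))
```

With these assumed numbers, a 5μs network saving removes roughly 7% of the total, while the application stage accounts for most of the budget.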

Latency Measurement at the Cycle Level

Measure tick-to-trade latency with timestamps from the CPU’s TSC (time-stamp counter). clock_gettime is sufficient for microsecond-level profiling, but nanosecond precision requires the RDTSC instruction, calibrated against a PTP-synchronized reference clock:

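#include <stdint.h>
#include <time.h>

// Read the CPU time-stamp counter (x86-64, GCC/Clang inline asm).
// RDTSC is not serializing; use RDTSCP or a fence where ordering matters.
static inline uint64_t get_cycles(void) {
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

// Elapsed nanoseconds from *a to *b
static inline int64_t timespec_diff_ns(const struct timespec *a,
                                       const struct timespec *b) {
    return (int64_t)(b->tv_sec - a->tv_sec) * 1000000000LL
         + (b->tv_nsec - a->tv_nsec);
}

// Calibrate at startup: busy-wait for one full second and
// return nanoseconds per TSC cycle
static double calibrate_ns_per_cycle(void) {
    struct timespec start_ts, current;
    int64_t elapsed_ns;

    clock_gettime(CLOCK_MONOTONIC_RAW, &start_ts);
    uint64_t start_cycles = get_cycles();

    // Compare full elapsed time, not just tv_sec, so the wait is
    // a real second even when started late in the current second
    do {
        clock_gettime(CLOCK_MONOTONIC_RAW, &current);
        elapsed_ns = timespec_diff_ns(&start_ts, &current);
    } while (elapsed_ns < 1000000000LL);

    uint64_t end_cycles = get_cycles();
    return (double)elapsed_ns / (double)(end_cycles - start_cycles);
}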

// Lock-free latency recorder per core to avoid cache-line bouncing
typedef struct {
    uint64_t timestamp;
    uint32_t sequence;
    int32_t latency_ns;
    uint8_t padding[48];  // 16 bytes of fields + 48 = 64 bytes (one cache line)
} __attribute__((aligned(64))) LatencyRecord;

_Static_assert(sizeof(LatencyRecord) == 64, "Must fill one cache line");

typedef struct {
    LatencyRecord records[1024];
    uint32_t head;
    uint64_t total_latency;
    uint32_t count;
} PerCoreRecorder;

// Thread-local recorder — one per core, no atomics in hot path
static __thread PerCoreRecorder g_recorder;

void record_latency(uint64_t start_cycle, uint64_t end_cycle,
                    double ns_per_cycle) {
    uint64_t latency_ns =
        (uint64_t)((double)(end_cycle - start_cycle) * ns_per_cycle);
    uint32_t seq = g_recorder.head++;   // Monotonically increasing sample number
    uint32_t idx = seq & 1023;          // Ring-buffer slot (capacity 1024)
    g_recorder.records[idx].timestamp = end_cycle;
    g_recorder.records[idx].sequence = seq;
    g_recorder.records[idx].latency_ns = (int32_t)latency_ns;
    g_recorder.total_latency += latency_ns;
    g_recorder.count++;
}

Use perf stat -e cycles,instructions,cache-misses on the hot path to identify microarchitectural stalls. A high cache-miss rate signals poor data locality; reorganize order-book structures so the hot working set fits in L2 cache (1-2 MB per core on recent Intel Xeon Scalable parts, 256 KB on older generations).
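Samples gathered by a recorder like the one above are usually summarized as percentiles rather than means, because tail latency is what hurts. A minimal sketch, assuming a hypothetical 2 GHz TSC (0.5 ns per cycle) and made-up cycle deltas:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def cycles_to_ns(cycles, ns_per_cycle):
    """Convert a TSC cycle delta to nanoseconds using a calibrated ratio."""
    return cycles * ns_per_cycle

# Hypothetical cycle deltas; the last one models a cache-miss outlier.
cycle_deltas = [900, 930, 960, 990, 1020, 1050, 1080, 1110, 1140, 6000]
NS_PER_CYCLE = 0.5  # assumed 2 GHz TSC

latencies_ns = [cycles_to_ns(c, NS_PER_CYCLE) for c in cycle_deltas]
print("P50:", percentile(latencies_ns, 50), "ns")
print("P99:", percentile(latencies_ns, 99), "ns")
```

The single outlier barely moves the median but defines the P99, which is why the tail deserves its own metric.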


Physical-Layer Optimization and Colocation

The Speed-of-Light Constraint

The single most impactful latency optimization is proximity. Light in fiber travels at roughly 200,000 km/s (refractive index ~1.5), adding ~5μs per kilometer one way, or ~10μs per kilometer of round trip. Microwave transmission through air is faster (~300,000 km/s) but is susceptible to weather and requires line of sight.

Major exchange colocation centers and their round-trip fiber latencies:

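┌───────────────┬────────────────────┬───────────────────────────────────────┐
│ Location      │ Exchange Served    │ Typical Round-Trip to Matching Engine │
├───────────────┼────────────────────┼───────────────────────────────────────┤
│ Carteret, NJ  │ NASDAQ             │ < 1 μs (same room)                    │
│ Mahwah, NJ    │ NYSE               │ < 1 μs (same rack)                    │
│ Secaucus, NJ  │ BATS, IEX          │ 2-5 μs                                │
│ Aurora, IL    │ CME                │ < 1 μs (same floor)                   │
│ Basildon, UK  │ ICE Futures Europe │ < 1 μs                                │
└───────────────┴────────────────────┴───────────────────────────────────────┘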

Place all trading servers as close to the exchange’s matching engine as the facility allows, ideally in the same data center row. Every additional meter of cable adds roughly 5 nanoseconds of propagation delay in fiber and a comparable 4-5 nanoseconds in copper direct-attach cable; only free-space propagation approaches 3.3 nanoseconds per meter. The industry standard for competitive HFT is rack-to-rack distance under 50 meters.

When strategies require cross-exchange arbitrage (e.g., trading the same instrument on NYSE and NASDAQ), the inter-site link becomes critical:

TRANSMISSION COMPARISON: 200 km link
┌────────────────────────┬──────────────────┬──────────────────┐
│ Metric                 │ Single-mode Fiber│ Microwave (60GHz)│
├────────────────────────┼──────────────────┼──────────────────┤
│ Propagation speed      │ ~200,000 km/s    │ ~299,700 km/s    │
│ One-way latency        │ ~1,000 μs        │ ~667 μs          │
│ Advantage vs. fiber    │ —                │ ~333 μs          │
│ Weather sensitivity    │ None             │ Rain fade >3dB   │
│ Bandwidth              │ 100 Gbps+        │ ~1-10 Gbps       │
│ Regulatory complexity  │ Low (leased)     │ License required │
└────────────────────────┴──────────────────┴──────────────────┘

For reference, a 333μs advantage at 200 km is the difference between seeing a price change first and reacting to stale data. Major HFT firms operate private microwave networks between Chicago and New York (~1,200 km) where the total advantage over fiber is approximately 2-3 milliseconds.
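The figures in the comparison table follow directly from distance divided by propagation speed. The sketch below reproduces them using the speeds quoted above; it assumes straight-line distances, whereas real fiber routes are longer than the geodesic.

```python
FIBER_KM_S = 200_000.0      # light in single-mode fiber (refractive index ~1.5)
MICROWAVE_KM_S = 299_700.0  # radio propagation through air

def one_way_latency_us(distance_km, speed_km_s):
    """Propagation delay in microseconds for a given path length."""
    return distance_km / speed_km_s * 1e6

def microwave_advantage_us(distance_km):
    """One-way time saved by microwave over fiber on the same path length."""
    return (one_way_latency_us(distance_km, FIBER_KM_S)
            - one_way_latency_us(distance_km, MICROWAVE_KM_S))

for distance_km in (200, 1200):
    fiber = one_way_latency_us(distance_km, FIBER_KM_S)
    microwave = one_way_latency_us(distance_km, MICROWAVE_KM_S)
    print(f"{distance_km} km: fiber {fiber:.0f} us, "
          f"microwave {microwave:.0f} us, "
          f"advantage {microwave_advantage_us(distance_km):.0f} us")
```

At 1,200 km the straight-line advantage comes out near 2 ms one way; longer real-world fiber paths push it toward the upper end of the range quoted above.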

Colocation Deployment Checklist

  1. Rack position: Request space in the same aisle or adjacent row to the exchange cage
  2. Cross-connect: Use direct fiber cross-connects — never traverse shared switches
  3. Power isolation: Dedicated PDU feeds with battery backup to eliminate power-supply noise
  4. Cooling: Front-to-back cooling with perforated floor tiles positioned for laminar airflow
  5. Physical security: Biometric access, tamper-evident cages, 24/7 monitoring

Network Architecture

Data Center Topology

The HFT network follows a flat, minimal-hop design. Every intermediate switch adds latency (typically 200-500 nanoseconds per hop):

┌─────────────────────────────────────────────────────────────────────┐
│                 HFT NETWORK ARCHITECTURE                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  EXCHANGE DATA CENTER                                                │
│                                                                     │
│    ┌─────────────────────────────────────────────────────────┐     │
│    │                   TRADING SERVER RACK                     │     │
│    │                                                          │     │
│    │  ┌─────────┐  ┌─────────┐  ┌─────────┐                  │     │
│    │  │ Trading │  │ Trading │  │ Market  │                  │     │
│    │  │ Server  │  │ Server  │  │ Data    │                  │     │
│    │  │ (FPGA)  │  │ (FPGA)  │  │ Handler │                  │     │
│    │  └────┬────┘  └────┬────┘  └────┬────┘                  │     │
│    │       │            │            │                        │     │
│    │       └────────────┼────────────┘                        │     │
│    │                    │                                      │     │
│    │              ┌─────┴─────┐                               │     │
│    │              │  Switch   │  (Arista 7130 / Cisco 3550)  │     │
│    │              │  L1 only  │  <200ns cut-through           │     │
│    │              └─────┬─────┘                               │     │
│    │                    │                                      │     │
│    └────────────────────┼─────────────────────────────────────┘     │
│                         │                                           │
│    ┌────────────────────┼─────────────────────────────────────┐     │
│    │                    │                                      │     │
│    │           ┌────────┴────────┐                            │     │
│    │           │  Exchange       │                            │     │
│    │           │  Gateway        │                            │     │
│    │           │  (Co-located)   │                            │     │
│    │           └────────┬───────┘                            │     │
│    │                    │                                      │     │
│    │              ┌─────┴─────┐                               │     │
│    │              │ Exchange  │                               │     │
│    │              │ Match     │                               │     │
│    │              │ Engine    │                               │     │
│    │              └───────────┘                               │     │
│    └─────────────────────────────────────────────────────────┘     │
│                                                                     │
│  NETWORK LATENCY TARGETS:                                            │
│  • Server to Switch: < 100 nanoseconds                             │
│  • Switch to Exchange: < 1 microsecond                              │
│  • Total Network: < 2 microseconds                                 │
└─────────────────────────────────────────────────────────────────────┘

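Use Layer 1 switches or cut-through Layer 2 switches wherever possible. Layer 1 devices replicate signals at the physical layer in a few nanoseconds, and cut-through switches begin forwarding as soon as the destination header has been read (under 200ns on low-latency models), whereas store-and-forward switches buffer each entire frame before transmitting it. Arista 7130 series and Cisco Nexus 3550 are common in HFT environments.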
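The cost of store-and-forward buffering is the serialization delay of the frame itself, which can be estimated with simple arithmetic. A minimal sketch; the frame sizes, the 14-byte header figure, and the 10 Gbps link rate are illustrative assumptions.

```python
def serialization_delay_ns(frame_bytes, link_gbps):
    """Time to clock a frame onto the wire: bits / (Gbit/s) yields ns."""
    return frame_bytes * 8 / link_gbps

def store_and_forward_penalty_ns(frame_bytes, header_bytes, link_gbps):
    """Extra wait versus cut-through, which forwards once the header is in."""
    return serialization_delay_ns(frame_bytes - header_bytes, link_gbps)

ETH_HEADER_BYTES = 14  # destination MAC, source MAC, EtherType

for size in (64, 512, 1500):
    full = serialization_delay_ns(size, 10)
    penalty = store_and_forward_penalty_ns(size, ETH_HEADER_BYTES, 10)
    print(f"{size:>4} B at 10 Gbps: serialize {full:.1f} ns, "
          f"store-and-forward penalty {penalty:.1f} ns")
```

For a full-size 1,500-byte frame at 10 Gbps the buffering alone costs over a microsecond per hop, which is why cut-through forwarding matters most on feeds that carry large packets.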

Kernel Bypass with DPDK

Kernel bypass eliminates the operating system’s networking stack from the hot path. Instead of context-switching through the kernel, the application reads packets directly from the NIC’s DMA ring buffer into userspace memory:

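// DPDK initialisation and packet processing
#include <rte_byteorder.h>
#include <rte_ethdev.h>
#include <rte_ether.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

#define RX_DESC_DEFAULT 256   // Small rings keep descriptors cache-resident
#define TX_DESC_DEFAULT 256
#define MBUF_CACHE_SIZE 250
#define BURST_SIZE 32

// Created with rte_pktmbuf_pool_create() after rte_eal_init(),
// before port_init() is called
static struct rte_mempool *mbuf_pool;

// Application-specific decoder, defined elsewhere
void process_market_data_packet(struct rte_mbuf *pkt);

static int port_init(uint16_t port, struct rte_eth_conf *port_conf) {
    int socket = rte_eth_dev_socket_id(port);
    int ret;

    // The device must be configured (1 RX queue, 1 TX queue) before queue setup
    ret = rte_eth_dev_configure(port, 1, 1, port_conf);
    if (ret < 0) return ret;

    ret = rte_eth_rx_queue_setup(port, 0, RX_DESC_DEFAULT, socket,
                                 NULL, mbuf_pool);
    if (ret < 0) return ret;

    ret = rte_eth_tx_queue_setup(port, 0, TX_DESC_DEFAULT, socket, NULL);
    if (ret < 0) return ret;

    ret = rte_eth_dev_start(port);
    if (ret < 0) return ret;

    return rte_eth_promiscuous_enable(port);
}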

// Per-core packet loop — pinned to isolated CPU core
// Each core owns one RX/TX queue pair for zero contention
static int packet_processing_loop(void *arg) {
    uint16_t port = (uint16_t)(uintptr_t)arg;
    struct rte_mbuf *rx_bufs[BURST_SIZE];

    while (1) {
        uint16_t nb_rx = rte_eth_rx_burst(port, 0, rx_bufs, BURST_SIZE);
        if (nb_rx == 0) continue;

        for (uint16_t i = 0; i < nb_rx; i++) {
            struct rte_mbuf *pkt = rx_bufs[i];
            struct rte_ether_hdr *eth =
                rte_pktmbuf_mtod(pkt, struct rte_ether_hdr *);
            uint16_t ether_type = rte_be_to_cpu_16(eth->ether_type);

            // IPv4 frames only; process_market_data_packet() is expected
            // to filter further for UDP multicast market data
            if (ether_type == RTE_ETHER_TYPE_IPV4) {
                process_market_data_packet(pkt);
            }
            rte_pktmbuf_free(pkt);
        }
    }
    return 0;
}

DPDK achieves 10-25 Gbps line-rate processing on a single core. For HFT, the critical configuration parameters are:

  • Minimum RX descriptors (128-256): Reduces DMA descriptor cache pressure at the cost of higher packet-drop risk under burst
  • TX offload disabled: Disable checksum offloading and TSO to avoid NIC processing variation
  • Per-core queue: One RX/TX queue pair per CPU core, no RSS indirection for deterministic packet flow

The Kernel-Bypass Bottleneck Trap

A well-documented pattern in HFT infrastructure: organizations invest $1M-2M in kernel-bypass hardware and FPGA acceleration, achieve a 20-30% tick-to-trade reduction, declare victory — then 18 months later, P95 latency climbs again. The bottleneck has migrated. The constraint moved downstream from the NIC to the order processor.

At 50,000 messages per second (typical mid-frequency), the network is the binding constraint, so kernel bypass delivers visible gains. At 500,000+ messages per second (top-tier HFT), the constraint shifts to the application layer: order-book maintenance, risk checks, and strategy computation. At that point further NIC-level optimization yields little additional benefit.

Mitigation strategies:

  1. Profile the full pipeline continuously — not just network, but also application and memory latency
  2. FPGA offload of strategy logic — move decision-making into hardware to bypass the CPU entirely
  3. In-network computing — SmartNICs that execute simple trading logic directly on the NIC’s embedded processor

Always measure the complete tick-to-trade path before choosing where to invest. The highest-latency component may not be where you expect.
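One way to see why the bottleneck migrates with message rate is a simple queueing estimate. The sketch below treats the application stage as an M/M/1 queue, which is a deliberate simplification, and uses a hypothetical 1.5μs per-message service time.

```python
def utilization(msgs_per_sec, service_time_us):
    """Fraction of time the stage is busy at a given arrival rate."""
    return msgs_per_sec * service_time_us / 1e6

def mm1_sojourn_us(msgs_per_sec, service_time_us):
    """Mean time in an M/M/1 stage (queueing plus service), in microseconds."""
    rho = utilization(msgs_per_sec, service_time_us)
    if rho >= 1.0:
        return float("inf")  # the stage cannot keep up; the queue grows unbounded
    return service_time_us / (1.0 - rho)

APP_SERVICE_US = 1.5  # assumed per-message application cost

for rate in (50_000, 500_000, 700_000):
    print(f"{rate:>7} msg/s: utilization "
          f"{utilization(rate, APP_SERVICE_US):.3f}, "
          f"mean latency {mm1_sojourn_us(rate, APP_SERVICE_US):.2f} us")
```

Under these assumptions the same application code that adds about 1.6μs at 50,000 messages per second adds 6μs at 500,000, without any change to the network path.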

Network Time Synchronization with PTP

Microsecond-level accuracy across distributed servers requires hardware timestamping. PTP (IEEE 1588) with hardware support on the NIC achieves sub-microsecond synchronization:

# PTP monitoring: sample the offset reported by the local ptp4l instance
import re
import subprocess

def check_ptp_offset():
    """Return the current PTP offset from master in nanoseconds, or None."""
    result = subprocess.run(
        ["pmc", "-u", "-b", "0", "GET CURRENT_DATA_SET"],
        capture_output=True, text=True, timeout=5
    )
    # pmc prints a line such as "offsetFromMaster -12.0";
    # the value can be negative and carries a decimal part
    match = re.search(r'offsetFromMaster\s+(-?\d+(?:\.\d+)?)', result.stdout)
    if match:
        return float(match.group(1))
    return None

# Linux PTP configuration
# /etc/linuxptp/ptp4l.conf
"""
[global]
delay_mechanism         E2E
network_transport       L2
time_stamping           hardware
logMinDelayReqInterval  0
# logSyncInterval is log2 of seconds: -3 means a 125 ms sync interval
logSyncInterval         -3
priority1               128
priority2               128
domainNumber            0
ptp_dst_mac             01:1B:19:00:00:00
tx_timestamp_timeout    10
clockClass              248
"""

# Start ptp4l with hardware timestamping (-H)
# ptp4l -i eth0 -m -H -f /etc/linuxptp/ptp4l.conf

Monitor offset jitter (not just mean offset). Spikes above 1μs indicate PTP master instability or NIC driver issues — switch to a backup master or reconfigure the clock tree.
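Jitter monitoring can be sketched as a summary over a window of offset samples. The sample values and the 1μs spike threshold below are illustrative assumptions; in practice the samples would come from repeated offset queries against the PTP daemon.

```python
import statistics

def summarize_offsets(offsets_ns, spike_threshold_ns=1000.0):
    """Summarize a window of PTP offsets: mean, jitter (std dev), and spikes."""
    return {
        "mean_ns": statistics.fmean(offsets_ns),
        "jitter_ns": statistics.pstdev(offsets_ns),
        "spikes": [o for o in offsets_ns if abs(o) > spike_threshold_ns],
    }

# Hypothetical offsets in nanoseconds; one sample exceeds the 1 us threshold.
window = [-12.0, 8.0, 15.0, -20.0, 1450.0, 5.0, -9.0]
summary = summarize_offsets(window)

print(f"mean offset: {summary['mean_ns']:.1f} ns")
print(f"jitter:      {summary['jitter_ns']:.1f} ns")
print(f"spikes:      {summary['spikes']}")
```

A window whose mean looks acceptable can still hide a spike, which is why the spike list and standard deviation are reported separately from the mean.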


Market Data Feed Protocols

HFT systems consume market data from exchange-specific protocols. These are binary, low-overhead, multicast-based protocols designed for minimal parsing overhead.

Protocol Overview

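┌────────────┬──────────┬─────────────┬──────────────────────────────────────────────────────────────────────────┐
│ Protocol   │ Exchange │ Layer       │ Characteristics                                                          │
├────────────┼──────────┼─────────────┼──────────────────────────────────────────────────────────────────────────┤
│ ITCH       │ NASDAQ   │ Market Data │ Full order-book depth, binary encoding, sequenced by MoldUDP64           │
│ OUCH       │ NASDAQ   │ Order Entry │ Direct order entry without FIX overhead, single-byte message types       │
│ XDP        │ NYSE     │ Market Data │ Exchange Data Publisher, binary multicast order-book feeds               │
│ MoldUDP64  │ NASDAQ   │ Transport   │ Session-layer protocol for reliable multicast with sequencing            │
│ SoupBinTCP │ NASDAQ   │ Transport   │ TCP-based binary session protocol for order entry                        │
│ FAST       │ Multiple │ Compression │ FIX Adapted for Streaming, tag-value compression for low-bandwidth feeds │
└────────────┴──────────┴─────────────┴──────────────────────────────────────────────────────────────────────────┘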
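To illustrate how the transport layer supplies sequencing, the sketch below splits a MoldUDP64 downstream packet into its message blocks. It assumes the published framing (10-byte session, 8-byte sequence number, 2-byte message count, then length-prefixed blocks); the function names and the sample packet are hypothetical.

```python
import struct

# MoldUDP64 downstream header: session (10 bytes), sequence number, message count
MOLD_HEADER = struct.Struct(">10sQH")
END_OF_SESSION = 0xFFFF  # special message count carrying no message blocks

def split_mold_packet(payload):
    """Return (session, first_sequence, [message bytes, ...])."""
    session, sequence, count = MOLD_HEADER.unpack_from(payload, 0)
    if count == END_OF_SESSION:
        count = 0
    offset = MOLD_HEADER.size
    messages = []
    for _ in range(count):
        (length,) = struct.unpack_from(">H", payload, offset)
        offset += 2
        messages.append(payload[offset:offset + length])
        offset += length
    return session, sequence, messages

def next_expected_sequence(sequence, message_count):
    """Sequence number the next packet should start with (gap detection)."""
    return sequence + message_count

# Hypothetical packet carrying two short messages
packet = (MOLD_HEADER.pack(b"SESSION001", 42, 2)
          + struct.pack(">H", 3) + b"abc"
          + struct.pack(">H", 1) + b"Z")
session, seq, msgs = split_mold_packet(packet)
print(session, seq, msgs, next_expected_sequence(seq, len(msgs)))
```

If the next packet starts with a sequence number higher than the expected one, the handler has missed messages and must request a retransmission or fall back to a snapshot.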

Implementing an ITCH Feed Handler

The ITCH protocol is the de facto standard for NASDAQ market data. Each message type is a fixed-length binary record identified by a single-byte type code:

// ITCH 5.0 message layouts and a minimal parser
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Order-book callbacks, implemented elsewhere in the feed handler
void update_order_book(const char stock[8], uint8_t side, uint32_t price,
                       uint32_t shares, uint64_t order_ref);
void remove_order(uint64_t order_ref, uint32_t shares);
void cancel_order(uint64_t order_ref, uint32_t shares);

// Common prefix of every ITCH 5.0 message (11 bytes).
// All multi-byte integers are big-endian on the wire; the byte swaps
// below assume a little-endian host.
typedef struct __attribute__((packed)) {
    uint8_t  msg_type;         // 'S' = system, 'A' = add order, etc.
    uint16_t stock_locate;     // Per-day symbol index
    uint16_t tracking_number;
    uint8_t  timestamp[6];     // Nanoseconds since midnight (48-bit)
} ItchHeader;

// Add Order (type 'A', 36 bytes), the most frequent message in active stocks.
// Type 'F' has the same layout followed by a 4-byte MPID attribution.
typedef struct __attribute__((packed)) {
    ItchHeader header;
    uint64_t   order_ref;   // Unique order reference number
    uint8_t    buy_sell;    // 'B' = buy, 'S' = sell
    uint32_t   shares;      // Number of shares
    char       stock[8];    // Stock symbol, left-justified, space-padded
    uint32_t   price;       // Price in $0.0001 (4 decimal places)
} ItchAddOrder;

// Order Executed (type 'E', 31 bytes).
// Type 'C' appends a printable flag (1 byte) and execution price (4 bytes).
typedef struct __attribute__((packed)) {
    ItchHeader header;
    uint64_t   order_ref;   // Reference of the executed order
    uint32_t   executed_shares;
    uint64_t   match_number;
} ItchOrderExecuted;

// Order Cancel (type 'X', 23 bytes)
typedef struct __attribute__((packed)) {
    ItchHeader header;
    uint64_t   order_ref;
    uint32_t   cancelled_shares;
} ItchOrderCancel;

// Parse one MoldUDP64 message block: a 2-byte big-endian length
// followed by a single ITCH message of that length
bool parse_itch_message(const uint8_t *data, uint32_t length) {
    if (length < 2 + sizeof(ItchHeader)) return false;

    uint16_t msg_len = (uint16_t)((data[0] << 8) | data[1]);
    if (msg_len + 2u > length || msg_len < sizeof(ItchHeader))
        return false;  // Invalid or truncated

    const uint8_t *msg = data + 2;
    switch (msg[0]) {
    case 'A':  // Add Order
    case 'F':  // Add Order with MPID attribution
    {
        ItchAddOrder ao;
        if (msg_len < sizeof ao) return false;
        memcpy(&ao, msg, sizeof ao);  // Copy avoids unaligned access
        // Update top-of-book if this improves the bid/ask
        update_order_book(ao.stock, ao.buy_sell,
                          __builtin_bswap32(ao.price),
                          __builtin_bswap32(ao.shares),
                          __builtin_bswap64(ao.order_ref));
        break;
    }
    case 'E':  // Order Executed
    case 'C':  // Order Executed with Price
    {
        ItchOrderExecuted eo;
        if (msg_len < sizeof eo) return false;
        memcpy(&eo, msg, sizeof eo);
        remove_order(__builtin_bswap64(eo.order_ref),
                     __builtin_bswap32(eo.executed_shares));
        break;
    }
    case 'X':  // Order Cancel
    {
        // Remove the cancelled shares from the order book
        ItchOrderCancel xo;
        if (msg_len < sizeof xo) return false;
        memcpy(&xo, msg, sizeof xo);
        cancel_order(__builtin_bswap64(xo.order_ref),
                     __builtin_bswap32(xo.cancelled_shares));
        break;
    }
    case 'S':  // System event: start/end of messages, market open/close
        // No order book action needed
        break;
    default:
        break;  // Unknown message type, skip
    }
    return true;
}

The feed handler must parse messages at line rate, which can reach millions of messages per second across the full feed during bursts. Every branch and memory access matters. Place the order-book hash table and the per-symbol state in HBM (high-bandwidth memory) on FPGA cards, or in pinned, hugepage-backed memory on software-based handlers.
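For offline validation of a feed handler, it helps to have a slow but obviously correct reference decoder. The sketch below decodes a single Add Order message using the published ITCH 5.0 field layout (36 bytes, big-endian); the function name and the sample values are hypothetical.

```python
import struct

# ITCH 5.0 Add Order: type, stock locate, tracking number, 48-bit timestamp,
# order reference, side, shares, stock symbol, price (in units of $0.0001)
ADD_ORDER = struct.Struct(">cHH6sQcI8sI")

def decode_add_order(msg):
    """Decode one 36-byte Add Order message into a dictionary."""
    (mtype, locate, tracking, ts, order_ref,
     side, shares, stock, price) = ADD_ORDER.unpack_from(msg)
    if mtype != b"A":
        raise ValueError(f"not an Add Order message: {mtype!r}")
    return {
        "stock_locate": locate,
        "timestamp_ns": int.from_bytes(ts, "big"),  # nanoseconds since midnight
        "order_ref": order_ref,
        "side": side.decode("ascii"),
        "shares": shares,
        "stock": stock.decode("ascii").rstrip(),
        "price": price / 10_000,
    }

# Hypothetical message: buy 100 AAPL at $150.00, timestamped 09:30:00
sample = ADD_ORDER.pack(
    b"A", 1, 0, (34_200_000_000_000).to_bytes(6, "big"),
    12345, b"B", 100, b"AAPL    ", 1_500_000,
)
print(decode_add_order(sample))
```

Replaying captured traffic through a decoder like this and comparing field by field against the production parser is a cheap way to catch offset and byte-order mistakes.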

FPGA-Accelerated Feed Handling

For sub-microsecond feed processing, move the ITCH/OUCH parser directly into FPGA logic. The complete pipeline — from Ethernet MAC through ITCH decode to order-book update — runs in hardware with deterministic latency:

// ITCH 5.0 Add Order parser: one 64-bit word decoded per clock cycle.
// Assumes framing headers are already stripped, sop marks the first word
// of the ITCH message, and byte 0 arrives in data_in[63:56].
module itch_parser (
    input  wire        clk,          // 156.25 MHz for a 64-bit 10GbE datapath
    input  wire        rst,
    input  wire [63:0] data_in,      // 64-bit datapath
    input  wire        data_valid,
    input  wire        sop,          // Start of packet
    input  wire        eop,          // End of packet
    output reg  [63:0] order_ref,
    output reg  [31:0] price,
    output reg  [31:0] shares,
    output reg  [63:0] symbol,       // 8 ASCII characters, first in [63:56]
    output reg         buy_sell,     // 0=sell, 1=buy
    output reg         parsed_valid,
    output reg         parse_error
);

    reg [2:0] word_count;   // Index of the current 8-byte word
    reg       in_msg;       // High while inside a message
    reg       is_add;       // Message type is 'A'

    always @(posedge clk) begin
        parsed_valid <= 1'b0;   // Single-cycle pulses
        parse_error  <= 1'b0;

        if (rst) begin
            in_msg     <= 1'b0;
            is_add     <= 1'b0;
            word_count <= 3'd0;
        end else if (data_valid) begin
            if (sop) begin
                // Word 0 (bytes 0-7): type, stock locate, tracking, timestamp
                is_add     <= (data_in[63:56] == "A");
                in_msg     <= !eop;
                word_count <= 3'd1;
            end else if (in_msg) begin
                case (word_count)
                // Word 1 (bytes 8-15): order_ref bytes 11-15
                3'd1: order_ref[63:24] <= data_in[39:0];
                // Word 2 (bytes 16-23): order_ref tail, side, shares
                3'd2: begin
                    order_ref[23:0] <= data_in[63:40];
                    buy_sell        <= (data_in[39:32] == "B");
                    shares          <= data_in[31:0];
                end
                // Word 3 (bytes 24-31): stock symbol
                3'd3: symbol <= data_in;
                // Word 4 (bytes 32-35): price
                3'd4: price <= data_in[63:32];
                default: ;
                endcase

                if (word_count != 3'd7)
                    word_count <= word_count + 3'd1;

                if (eop) begin
                    in_msg <= 1'b0;
                    // A complete 36-byte Add Order ends on word 4
                    parsed_valid <= is_add && (word_count == 3'd4);
                    parse_error  <= is_add && (word_count != 3'd4);
                end
            end
        end
    end
endmodule

The FPGA approach removes most sources of jitter: every packet is decoded in the same number of clock cycles. Software-based parsers, even with DPDK, show P99/P999 latency variation due to CPU cache misses, branch mispredictions, and interrupt handling.


FPGA Acceleration

FPGA Trading System Architecture

Modern FPGA cards for HFT combine a high-bandwidth memory subsystem, hardened Ethernet MACs, and programmable logic that implements trading logic directly in hardware:

┌─────────────────────────────────────────────────────────────────────┐
│                    FPGA TRADING SYSTEM                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    FPGA LOGIC BLOCKS                         │   │
│  │                                                              │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐    │   │
│  │  │ 25GbE MAC   │  │ ITCH/OUCH   │  │   Order Entry   │    │   │
│  │  │ (Hard IP)   │─►│  Decoder    │─►│   Generator     │    │   │
│  │  └─────────────┘  └─────────────┘  └────────┬────────┘    │   │
│  │                                              │              │   │
│  │  ┌─────────────┐  ┌─────────────┐           │              │   │
│  │  │  Strategy   │  │  Risk       │           │              │   │
│  │  │  Engine     │◄─│  Check      │◄──────────┘              │   │
│  │  │  (Custom)   │  │  (HW)       │                          │   │
│  │  └──────┬──────┘  └─────────────┘                          │   │
│  │         │                                                   │   │
│  │  ┌──────┴──────┐                                            │   │
│  │  │  Order Book │   HBM2 memory: 4-8 GB @ 460 GB/s          │   │
│  │  │  (HBM)      │                                            │   │
│  │  └─────────────┘                                            │   │
│  │                                                              │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              ▼                                       │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                 HOST SOFTWARE (Control Plane)                │   │
│  │  • Strategy parameter updates via PCIe Gen4/5                │   │
│  │  • Position and risk reporting (not time-critical)           │   │
│  │  • Logging and analytics (side channel)                     │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  LATENCY BREAKDOWN:                                                 │
│  • NIC to FPGA (internal bus):     ~50ns                          │
│  • ITCH header decode:             ~15ns                           │
│  • Order-book lookup (HBM):        ~35ns                           │
│  • Strategy computation:           ~100ns - 500ns                  │
│  • Order generation + egress:      ~40ns                           │
│  • TOTAL:                          < 1 microsecond                 │
└─────────────────────────────────────────────────────────────────────┘

Current Hardware Landscape (2026)

AMD and Intel continue to release HFT-specific FPGA accelerators. The table below shows the leading cards:

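┌───────────────────────┬─────────────┬─────────────┬──────────┬────────────────────────────────────────────────────────────┐
│ Card                  │ Logic Cells │ HBM         │ Network  │ Key Advantage                                              │
├───────────────────────┼─────────────┼─────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ AMD Alveo UL3422      │ 1.5M LUT    │ 8 GB HBM2e  │ 2x100GbE │ Dedicated FinTech card, 7x latency reduction over prev gen │
│ AMD Alveo X3522PV     │ 1.3M LUT    │ 8 GB HBM2   │ 4x25GbE  │ Cost-effective, 18% more logic than U50                    │
│ AMD Alveo U55C        │ 1.7M LUT    │ 8 GB HBM2   │ 2x100GbE │ Highest density for complex strategies                     │
│ Intel Agilex 7        │ 2.0M LUT    │ 16 GB HBM2e │ 4x100GbE │ Hard PCIe Gen5 and Ethernet IP                             │
│ Xilinx VU13P (legacy) │ 1.3M LUT    │ 4 GB        │ 2x100GbE │ Still deployed in production racks                         │
└───────────────────────┴─────────────┴─────────────┴──────────┴────────────────────────────────────────────────────────────┘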

The AMD Alveo UL3422, announced in late 2024, follows the earlier UL3524 as an AMD card aimed specifically at electronic trading. AMD attributes a latency reduction of up to 7x over prior-generation FPGA technology to its low-latency transceiver architecture. The baseline tick-to-trade of under 300 nanoseconds without custom logic that is quoted for this card should be validated on your own hardware.

High-Level Synthesis for Faster Development

Writing Verilog for trading logic is slow and error-prone. HLS (High-Level Synthesis) allows developers to express trading strategies in C++ and compile them to FPGA bitstreams:

// HLS trading strategy — compiled to FPGA logic
#include <hls_stream.h>
#include <ap_int.h>

// Simplified market data tick
struct Tick {
    ap_uint<64> order_ref;
    ap_uint<32> price;      // Fixed-point: 4 decimal places
    ap_uint<32> shares;
    ap_uint<8>  side;       // 0=sell, 1=buy
    ap_uint<1>  is_trade;   // 1=execution, 0=order addition
};

// Strategy output
struct Order {
    ap_uint<64> order_ref;
    ap_uint<32> price;
    ap_uint<32> shares;
    ap_uint<8>  side;
    ap_uint<1>  is_cancel;
};

// Market-making strategy: maintain bid-ask spread
// Pipelined: processes one tick per clock cycle
void market_maker_strategy(
    hls::stream<Tick>  &ticks_in,
    hls::stream<Order> &orders_out,
    ap_uint<32>         target_spread  // in price ticks
) {
#pragma HLS INTERFACE axis port=ticks_in
#pragma HLS INTERFACE axis port=orders_out
#pragma HLS PIPELINE II=1  // One tick per cycle

    static Tick last_bid = {0, 0, 0, 1, 0};  // side=buy
    static Tick last_ask = {0, 0xFFFFFFFF, 0, 0, 0};

    Tick t = ticks_in.read();

    if (t.is_trade) {
        // Trade occurred — check if we need to update quotes
        if (t.side == 1 && t.price >= last_bid.price) {
            // Upward price movement — widen spread
            Order o = {0, t.price - target_spread, 100, 1, 0};
            orders_out.write(o);
        } else if (t.side == 0 && t.price <= last_ask.price) {
            Order o = {0, t.price + target_spread, 100, 0, 0};
            orders_out.write(o);
        }
    }

    // Update best bid/ask from order book
    if (t.side == 1 && t.price > last_bid.price)
        last_bid = t;
    if (t.side == 0 && t.price < last_ask.price)
        last_ask = t;
}

Xilinx Vitis HLS and Intel oneAPI both support this flow. The same C++ code can be simulated on a workstation for strategy validation, then synthesized to run at 300+ MHz on the FPGA.
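Alongside C++ simulation, a short software reference model is a common way to cross-check strategy behaviour before synthesis. The sketch below mirrors the decision logic of the HLS function above in plain Python; the class and field names are hypothetical, and the 100-share order size is carried over from the HLS example.

```python
from dataclasses import dataclass

@dataclass
class Tick:
    price: int       # fixed-point, 4 decimal places
    shares: int
    side: int        # 0 = sell, 1 = buy
    is_trade: bool   # True = execution, False = order addition

class MarketMakerModel:
    """Reference model of the quoting logic, one tick per call."""

    def __init__(self, target_spread):
        self.target_spread = target_spread
        self.best_bid = 0
        self.best_ask = 0xFFFFFFFF

    def on_tick(self, tick):
        """Return a list of (side, price, shares) orders for this tick."""
        orders = []
        if tick.is_trade:
            if tick.side == 1 and tick.price >= self.best_bid:
                orders.append((1, tick.price - self.target_spread, 100))
            elif tick.side == 0 and tick.price <= self.best_ask:
                orders.append((0, tick.price + self.target_spread, 100))
        # Track best bid/ask after the quoting decision, as in the HLS code
        if tick.side == 1 and tick.price > self.best_bid:
            self.best_bid = tick.price
        if tick.side == 0 and tick.price < self.best_ask:
            self.best_ask = tick.price
        return orders

model = MarketMakerModel(target_spread=5)
print(model.on_tick(Tick(1_000_000, 100, 1, False)))  # order addition only
print(model.on_tick(Tick(1_000_100, 50, 1, True)))    # buy-side trade
```

Feeding the same tick sequence to this model and to the C simulation of the HLS function gives a quick equivalence check before committing to a multi-hour synthesis run.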

AI/ML Integration in FPGA Pipelines

Machine learning inference in the FPGA pipeline enables predictive trading logic without leaving the hardware domain. A representative example is a small gradient-boosted tree ensemble compiled with HLS:

// XGBoost inference — compiled to FPGA with HLS
// Predicts short-term price movement from order-book features
#include <hls_math.h>

// Feature vector extracted from current order-book state
struct Features {
    ap_fixed<32,16> spread;          // Best ask - best bid
    ap_fixed<32,16> imbalance;       // (bid_vol - ask_vol) / total_vol
    ap_fixed<32,16> mid_price_delta; // Change vs. last tick
    ap_fixed<32,16> volatility;      // Rolling 10-tick std dev
    ap_fixed<32,16> order_flow;      // Net trade direction last 10 ticks
};

// Fixed-point decision tree ensemble (8 trees, depth 6)
// Pipelined: prediction every ~20 clock cycles
ap_uint<1> predict_movement(Features f) {
    const ap_fixed<32,16> thresholds[8][6] = { /* trained weights */ };
    ap_fixed<32,16> score = 0;

    for (int t = 0; t < 8; t++) {
#pragma HLS UNROLL
        if (f.spread < thresholds[t][0]) {
            if (f.imbalance < thresholds[t][1]) {
                score += (f.volatility < thresholds[t][3]) ? 1.0 : -1.0;
            } else {
                // ... additional tree traversal
            }
        }
    }
    return score > 0;
}

The key advantage of FPGA-based inference over GPU: latency is deterministic and sub-microsecond. A GPU inference call requires at least 5-10μs for PCIe transfer and kernel launch, while FPGA inference runs at hardware speed in under 100ns — but at lower model complexity. In practice, most HFT firms use lightweight models (decision trees, logistic regression, linear SVM) on FPGA and reserve deep learning for off-line signal generation.
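Part of the reason lightweight models fit so well is that their features and thresholds can be held in fixed point. The sketch below emulates a signed 32-bit format with 16 fractional bits, matching ap_fixed<32,16> under its default truncation and wraparound modes; the helper names are hypothetical.

```python
import math

TOTAL_BITS = 32
FRAC_BITS = 16

def to_fixed(x):
    """Quantize to signed Q16.16: truncate toward -inf, wrap on overflow."""
    raw = math.floor(x * (1 << FRAC_BITS))
    raw &= (1 << TOTAL_BITS) - 1            # keep the low 32 bits
    if raw >= 1 << (TOTAL_BITS - 1):        # reinterpret as two's complement
        raw -= 1 << TOTAL_BITS
    return raw

def to_float(raw):
    """Convert a Q16.16 raw integer back to a float."""
    return raw / (1 << FRAC_BITS)

def fixed_less_than(a, b):
    """Threshold comparison as hardware sees it: on quantized values."""
    return to_fixed(a) < to_fixed(b)

print(to_fixed(0.5), to_float(to_fixed(0.5)))
print(to_fixed(-0.25), to_float(to_fixed(-0.25)))
print(to_float(to_fixed(1.00001)))          # below the 2**-16 resolution
print(fixed_less_than(1.00001, 1.0))
```

Two values closer together than 2^-16 collapse to the same representation, so thresholds trained in floating point should be re-validated after quantization.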


Order Management System

Low-Latency Order Router

The order router must support all exchange protocols from a single code path. Pre-allocate message buffers per exchange to avoid allocation overhead on the hot path:

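// C++ order router with protocol-agnostic interface
#include <array>
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Supplied by the kernel-bypass layer: copies a message into the NIC TX ring
void write_dma_ring(const char *buf, std::size_t len);

// Protocol-specific templates, resolved at compile time.
// Each builder returns the number of bytes written to buf.
struct OUCHTemplate {
    static constexpr std::size_t NEW_ORDER_SIZE = 32;
    static constexpr std::size_t CANCEL_SIZE    = 16;

    static std::size_t build_new_order(char *buf, uint64_t order_id,
                                       uint64_t price, uint32_t qty, char side) {
        // Simplified OUCH-style layout; take real offsets from the exchange
        // specification. Wire integers are big-endian (the byte swaps below
        // assume a little-endian host).
        const uint64_t id_be    = __builtin_bswap64(order_id);
        const uint64_t price_be = __builtin_bswap64(price);
        const uint32_t qty_be   = __builtin_bswap32(qty);
        buf[0] = 'O';                           // Message type
        std::memcpy(buf + 1, &id_be, 8);        // Order token
        buf[9] = side;                          // B/S
        std::memcpy(buf + 10, &price_be, 8);    // Price (4 decimal places)
        std::memcpy(buf + 18, &qty_be, 4);      // Quantity
        // ... remaining fields
        return NEW_ORDER_SIZE;
    }
};

struct FIXTemplate {
    static constexpr std::size_t NEW_ORDER_SIZE = 128;

    static std::size_t build_new_order(char *buf, uint64_t order_id,
                                       uint64_t price, uint32_t qty, char side) {
        // FIX 5.0 SP2 body tags: MsgType, ClOrdID, Side, OrderQty, Price.
        // Standard header tags and the checksum trailer are omitted here.
        const int n = std::snprintf(buf, NEW_ORDER_SIZE,
            "35=D\x01" "11=%" PRIu64 "\x01" "54=%c\x01"
            "38=%" PRIu32 "\x01" "44=%" PRIu64 "\x01",
            order_id, side, qty, price);
        // ... remaining tags
        // Return 0 if formatting failed or the message was truncated
        return (n > 0 && static_cast<std::size_t>(n) < NEW_ORDER_SIZE)
                   ? static_cast<std::size_t>(n) : 0;
    }
};

template<typename Protocol>
class OrderRouter {
private:
    std::array<char, Protocol::NEW_ORDER_SIZE> send_buf{};

public:
    uint64_t send_new_order(uint64_t order_id, uint64_t price,
                            uint32_t qty, char side) {
        const std::size_t len = Protocol::build_new_order(
            send_buf.data(), order_id, price, qty, side);
        // Write directly to NIC DMA buffer (kernel bypass),
        // sending only the bytes the builder produced
        if (len > 0)
            write_dma_ring(send_buf.data(), len);
        return order_id;
    }
};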

Use direct NIC DMA ring writes for order submission — avoid any syscall. The order router should run on an isolated CPU core with isolcpus and nohz_full kernel parameters.
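The buffer-reuse idea is easy to check in isolation. The following is a minimal sketch, assuming a made-up 22-byte layout (type, 8-byte id, side, 8-byte price, 4-byte quantity) that follows the field order of the C++ template above rather than any real exchange specification. The class name PreallocatedEncoder is hypothetical. The buffer is allocated once, every order overwrites it in place, and integers are written in network byte order.

```python
import struct

# Hypothetical layout: type(1) | order_id(8) | side(1) | price(8) | qty(4).
# ">" selects big-endian with no padding, so the size is exactly 22 bytes.
NEW_ORDER = struct.Struct(">cQcQI")


class PreallocatedEncoder:
    """Encodes every order into one buffer that is allocated at startup."""

    def __init__(self):
        self._buf = bytearray(NEW_ORDER.size)
        self._view = memoryview(self._buf)

    def encode(self, order_id: int, price: int, qty: int, side: bytes) -> memoryview:
        # pack_into writes in place; no new bytes object is created per order
        NEW_ORDER.pack_into(self._buf, 0, b"O", order_id, side, price, qty)
        return self._view


encoder = PreallocatedEncoder()
first = encoder.encode(order_id=1, price=1_502_500, qty=100, side=b"B")
first_id_bytes = bytes(first[1:9])
second = encoder.encode(order_id=2, price=1_502_600, qty=200, side=b"S")
```

Both calls return a view over the same underlying buffer, so the caller must hand the bytes to the transmit path before encoding the next order. Python is used here only to make the layout visible; it is not a suitable language for the hot path itself.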

Lock-Free Order Book

The in-memory order book is the most performance-critical data structure. It tracks all resting limit orders at each price level and maintains the best bid and ask:

// Lock-free order book: one writer thread, any number of reader threads
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PRICE_LEVELS 10000
#define MAX_ORDERS_PER_LEVEL 256
#define BOOK_DEPTH 10              // Levels tracked per side

typedef struct {
    uint64_t         price;            // Fixed-point price × 10000; 0 = empty slot
    _Atomic uint32_t total_volume;     // Accumulated shares at this level
    _Atomic uint32_t order_count;
    uint64_t         first_order_ref;  // Head of linked orders
} PriceLevel;

typedef struct {
    uint64_t order_ref;
    uint64_t price;
    uint32_t remaining;
    uint32_t original_qty;
    uint8_t  side;             // 0=bid, 1=ask
} Order;

// Minimised order book: 10 bid and 10 ask levels only
// Keeps the whole structure small enough to stay cache-resident
typedef struct {
    PriceLevel       bids[BOOK_DEPTH];
    PriceLevel       asks[BOOK_DEPTH];
    _Atomic uint64_t best_bid;
    _Atomic uint64_t best_ask;
    uint64_t         last_trade_price;
} OrderBook;

static void level_reset(PriceLevel *lvl) {
    lvl->price = 0;                        // 0 marks the slot as free on both sides
    lvl->first_order_ref = 0;
    atomic_store(&lvl->total_volume, 0);
    atomic_store(&lvl->order_count, 0);
}

void book_init(OrderBook *book) {
    atomic_store(&book->best_bid, 0);
    atomic_store(&book->best_ask, UINT64_MAX);
    book->last_trade_price = 0;
    for (int i = 0; i < BOOK_DEPTH; i++) {
        level_reset(&book->bids[i]);
        level_reset(&book->asks[i]);
    }
}

static int find_or_create_level(PriceLevel *levels, uint64_t price,
                                int max_levels) {
    // Linear scan: 10 levels, fits in cache
    for (int i = 0; i < max_levels; i++) {
        if (levels[i].price == price)
            return i;
        if (levels[i].price == 0) {        // First free slot: claim it
            levels[i].price = price;
            return i;
        }
    }
    return -1;  // No matching level and no free slot
}

void book_add_order(OrderBook *book, uint64_t ref, uint64_t price,
                    uint32_t qty, uint8_t side) {
    (void)ref;  // Per-order tracking is not part of this minimal book
    PriceLevel *levels = side ? book->asks : book->bids;

    int idx = find_or_create_level(levels, price, BOOK_DEPTH);
    if (idx < 0) return;  // All tracked levels in use: ignore for speed

    atomic_fetch_add(&levels[idx].total_volume, qty);
    atomic_fetch_add(&levels[idx].order_count, 1);

    // Safe as load-then-store only because a single thread writes the book
    if (side == 0 && price > atomic_load(&book->best_bid))
        atomic_store(&book->best_bid, price);
    if (side == 1 && price < atomic_load(&book->best_ask))
        atomic_store(&book->best_ask, price);
}

Key design decisions:

  • Top 10 levels only: Deep books are built on FPGA or in separate processes. The trading engine only needs the best 5-10 levels for most strategies.
  • Flat arrays, not trees: A linear scan of 10 cache-hot entries is faster than a tree traversal that misses cache.
  • Per-price-level atomic counters: Avoid global locks. Each price level is independent and can be updated atomically.
  • No dynamic memory allocation: Pre-allocate all structures at startup. malloc in the hot path adds unpredictable latency.
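The C code above covers additions only. As a companion, here is a small reference model of one book side that also handles volume reductions, which is the case where the best price has to be recomputed. It is a sketch under stated assumptions: the class name BookSide and its methods are hypothetical, a price of 0 marks a free slot (so valid prices must be non-zero), and the model is meant as a test oracle for a faster implementation, not as production code.

```python
class BookSide:
    """One side of a fixed-depth book held in flat, preallocated lists."""

    def __init__(self, depth: int = 10, is_bid: bool = True):
        self.prices = [0] * depth      # 0 marks an empty slot
        self.volumes = [0] * depth
        self.is_bid = is_bid

    def add(self, price: int, qty: int) -> bool:
        """Add volume at a price; returns False if no slot is available."""
        free = -1
        for i, p in enumerate(self.prices):        # linear scan over all slots
            if p == price:
                self.volumes[i] += qty
                return True
            if p == 0 and free < 0:
                free = i                           # remember first free slot
        if free < 0:
            return False
        self.prices[free] = price
        self.volumes[free] = qty
        return True

    def reduce(self, price: int, qty: int) -> bool:
        """Remove volume (cancel or fill); frees the slot when it empties."""
        for i, p in enumerate(self.prices):
            if p == price:
                self.volumes[i] = max(0, self.volumes[i] - qty)
                if self.volumes[i] == 0:
                    self.prices[i] = 0
                return True
        return False

    def best(self):
        """Best live price: highest for bids, lowest for asks, None if empty."""
        live = [p for p in self.prices if p != 0]
        if not live:
            return None
        return max(live) if self.is_bid else min(live)


bids = BookSide(depth=3, is_bid=True)
bids.add(1_000_000, 100)
bids.add(1_000_100, 50)
best_before = bids.best()
bids.reduce(1_000_100, 50)          # best level is fully cancelled
best_after = bids.best()
```

The scan for a matching price runs over every slot before a free slot is claimed, which avoids creating a duplicate level once cancellations have left holes in the array.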

Tail Latency and Performance Jitter

In HFT, average latency is a misleading metric. A system that processes 99.9% of ticks in under 500ns but has a 1ms tail on 0.1% of ticks will lose money on that 0.1% — and potentially more if those slow ticks coincide with market-moving events.
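The arithmetic behind that example is worth spelling out. The sketch below treats the fast path as exactly 500 ns (a simplifying upper bound, since the text says "under 500ns") and the slow path as exactly 1 ms; the function name mixture_mean is hypothetical.

```python
import statistics


def mixture_mean(fast_ns: float, slow_ns: float, slow_fraction: float) -> float:
    """Mean latency when slow_fraction of ticks take slow_ns and the rest fast_ns."""
    return (1.0 - slow_fraction) * fast_ns + slow_fraction * slow_ns


# 999 fast ticks and 1 slow tick per 1000, matching the 99.9% / 0.1% split
samples = [500] * 999 + [1_000_000]

mean_ns = mixture_mean(500, 1_000_000, 0.001)
median_ns = statistics.median(samples)

print(f"mean   = {mean_ns:.1f} ns")
print(f"median = {median_ns:.1f} ns")
```

Under these assumptions the mean comes out at roughly 1,500 ns, about three times the median of 500 ns, even though only one tick in a thousand is slow. Neither number alone shows that the slow tick took a full millisecond, which is why the percentiles further out in the tail have to be reported separately.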

Sources of Jitter

Source | Typical Impact | Mitigation
CPU frequency scaling | 1-10μs spikes | cpupower frequency-set -g performance
SMT/hyperthreading | 2-5μs | Disable SMT in BIOS or boot with nosmt; isolcpus + nohz_full for the trading core
TLB misses (4K pages) | 1-3μs | 1 GB huge pages (default_hugepagesz=1G)
System management interrupts | 10-100μs | BIOS: disable SMI, VT-d
NUMA remote memory access | 100-300ns | numactl --cpunodebind=0 --membind=0
Network interrupt coalescing | 50-200μs | ethtool -C eth0 rx-usecs 0

Jitter Measurement Methodology

Record every individual tick latency, not just aggregates. High percentiles computed from the raw samples, or a histogram with nanosecond buckets, reveal tail behavior:

# Tail latency analysis from recorded traces
import numpy as np

class TailLatencyAnalyzer:
    def __init__(self, latencies_ns: np.ndarray):
        self.latencies = latencies_ns

    def report(self):
        p50 = np.percentile(self.latencies, 50)
        p99 = np.percentile(self.latencies, 99)
        p999 = np.percentile(self.latencies, 99.9)
        p9999 = np.percentile(self.latencies, 99.99)
        max_lat = np.max(self.latencies)

        print(f"P50:   {p50:>8.0f} ns ({p50/1000:>6.2f} μs)")
        print(f"P99:   {p99:>8.0f} ns ({p99/1000:>6.2f} μs)")
        print(f"P99.9: {p999:>8.0f} ns ({p999/1000:>6.2f} μs)")
        print(f"P99.99:{p9999:>8.0f} ns ({p9999/1000:>6.2f} μs)")
        print(f"MAX:   {max_lat:>8.0f} ns ({max_lat/1000:>6.2f} μs)")
        print(f"Tail ratio (P99.9/P50): {p999/p50:.2f}")

For HFT, target a tail ratio (P99.9 / P50) below 3.0. Higher ratios indicate systemic jitter that requires kernel, BIOS, or hardware reconfiguration.
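Percentiles summarize the tail, while a histogram shows its shape, for example whether slow ticks cluster around one value that points at a specific jitter source. The following is a minimal sketch using power-of-two buckets; that bucket choice is an assumption made to keep the output short and is coarser than per-nanosecond buckets. The function name log2_histogram is hypothetical.

```python
from collections import Counter


def log2_histogram(latencies_ns):
    """Count samples per power-of-two bucket.

    Bucket k covers latencies in [2**k, 2**(k+1)) nanoseconds.
    Values below 1 ns are placed in bucket 0.
    """
    hist = Counter()
    for value in latencies_ns:
        bucket = max(int(value), 1).bit_length() - 1
        hist[bucket] += 1
    return dict(sorted(hist.items()))


# Illustrative data: mostly fast ticks plus one millisecond-scale outlier
samples = [500] * 999 + [1_000_000]
hist = log2_histogram(samples)

for bucket, count in hist.items():
    low, high = 2 ** bucket, 2 ** (bucket + 1)
    print(f"[{low:>9} ns, {high:>9} ns): {count}")
```

With the sample data, the fast ticks land in the 256-512 ns bucket and the outlier lands in a bucket eleven powers of two higher. A gap like that between populated buckets is the signature of a discrete jitter source rather than gradual slowdown.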


Risk Management

Real-time risk checks must execute on every order before submission. In FPGA-based systems, risk logic runs in hardware alongside the strategy engine. In software-based systems, run the checks inside the trading process on the order path, and keep them cheap enough that validation does not dominate the latency budget:

# Pre-trade risk check: called before every order submission
import time
from dataclasses import dataclass
from typing import Dict

@dataclass
class RiskLimits:
    max_order_size: int = 10000
    max_position: int = 50000
    max_loss_per_day: float = 100000.0
    max_orders_per_second: int = 500

class PreTradeRisk:
    def __init__(self, limits: RiskLimits):
        self.limits = limits
        self.positions: Dict[str, int] = {}  # symbol -> net qty
        self.daily_pnl = 0.0
        self.order_count = 0
        self.epoch_second = 0

    def check(self, symbol: str, side: str,
              quantity: int, price: float, current_pnl: float) -> bool:
        # Order size limit
        if quantity > self.limits.max_order_size:
            return False

        # Position limit
        net = self.positions.get(symbol, 0)
        delta = quantity if side == 'BUY' else -quantity
        if abs(net + delta) > self.limits.max_position:
            return False

        # Daily loss limit
        if current_pnl < -self.limits.max_loss_per_day:
            return False

        # Rate limit (fixed one-second window: counter resets each new second)
        now = int(time.time())
        if now != self.epoch_second:
            self.epoch_second = now
            self.order_count = 0
        self.order_count += 1
        if self.order_count > self.limits.max_orders_per_second:
            return False

        # Position is updated when the order is accepted, not when it fills,
        # so the limit is enforced against worst-case exposure
        self.positions[symbol] = net + delta
        return True

In production, risk checks must run in the same process (or on the same FPGA) as the strategy engine. A separate risk-checking process introduces IPC latency that can exceed the trading engine’s total budget.
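The rate check in PreTradeRisk resets its counter at each whole-second boundary, so a burst that straddles the boundary can briefly reach twice the configured limit. A true sliding window closes that gap. The following is a minimal sketch, assuming the caller passes in a monotonic nanosecond timestamp; the class name SlidingWindowLimiter is hypothetical.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allows at most max_events within any window_ns-long interval."""

    def __init__(self, max_events: int, window_ns: int = 1_000_000_000):
        self.max_events = max_events
        self.window_ns = window_ns
        self._stamps = deque()          # timestamps of accepted events, oldest first

    def allow(self, now_ns: int) -> bool:
        # Drop timestamps that have aged out of the window
        while self._stamps and now_ns - self._stamps[0] >= self.window_ns:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_events:
            return False                # rejected events are not recorded
        self._stamps.append(now_ns)
        return True


# Two events per 1000 ns window, driven with explicit timestamps
limiter = SlidingWindowLimiter(max_events=2, window_ns=1000)
decisions = [limiter.allow(t) for t in (0, 10, 20, 999, 1000, 1005, 1010)]
```

In real use the timestamp would come from something like time.monotonic_ns(). The deque is convenient for a sketch; a hot-path version would more likely use a preallocated ring buffer of max_events timestamps to stay consistent with the no-allocation rule.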


Performance Testing

Latency Benchmarking Framework

Benchmark every component of the pipeline independently, then measure end-to-end:

# End-to-end latency benchmark
import time
import numpy as np

class TickToTradeBenchmark:
    def __init__(self, pipeline, warmup=10_000):
        self.pipeline = pipeline
        self.warmup = warmup      # Iterations discarded while caches warm up
        self.latencies = []

    def _generate_itch_ticks(self, num_ticks):
        # Placeholder: return num_ticks messages, ideally from a replayed trace
        raise NotImplementedError

    def run(self, num_ticks=100000):
        # Prepare all ticks up front so the timed loop does no I/O
        ticks = self._generate_itch_ticks(num_ticks)

        for i, tick in enumerate(ticks):
            start = time.perf_counter_ns()
            self.pipeline.feed_tick(tick)        # Simulate feed arrival
            self.pipeline.process()               # Decode + strategize
            self.pipeline.submit_orders()         # Output orders
            end = time.perf_counter_ns()
            if i >= self.warmup:                  # Skip warm-up iterations
                self.latencies.append(end - start)

        self._report()

    def _report(self):
        data = np.array(self.latencies)
        print(f"Mean:  {np.mean(data)/1000:.2f} μs")
        print(f"P50:   {np.median(data)/1000:.2f} μs")
        print(f"P99:   {np.percentile(data,99)/1000:.2f} μs")
        print(f"P99.9: {np.percentile(data,99.9)/1000:.2f} μs")
        print(f"Min:   {np.min(data)/1000:.2f} μs")
        print(f"Max:   {np.max(data)/1000:.2f} μs")

Key benchmark requirements:

  • Warm CPU caches: Discard first 10,000 iterations
  • Fixed CPU frequency: Pin to performance governor and disable Turbo Boost
  • Isolated cores: Run benchmark on core isolated from kernel scheduler
  • Realistic data: Use replayed market data traces, not synthetic patterns
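One further check in the same spirit is to measure the cost of the timer itself, since two clock reads surround every sample and their overhead is included in each recorded latency. The following is a minimal sketch; the function name timer_overhead_ns is hypothetical, and the result varies by machine and Python build.

```python
import time


def timer_overhead_ns(samples: int = 10_000) -> int:
    """Median cost in ns of two back-to-back perf_counter_ns() calls."""
    deltas = []
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        t1 = time.perf_counter_ns()
        deltas.append(t1 - t0)
    deltas.sort()
    return deltas[len(deltas) // 2]


overhead = timer_overhead_ns()
print(f"timer overhead (median): {overhead} ns")
```

If the overhead is a meaningful fraction of the latencies being measured, subtract it from the recorded values or move the measurement to hardware timestamps. The median is used rather than the mean so that an occasional preemption during calibration does not inflate the estimate.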

Acceptance Criteria for Production

Metric | Target | Warning Threshold
Mean tick-to-trade | < 1μs (FPGA) / < 10μs (SW + kernel bypass) | > 2μs / > 50μs
P99 tick-to-trade | < 2μs (FPGA) / < 25μs (SW) | > 5μs / > 100μs
Tail ratio (P99.9/P50) | < 3.0 | > 5.0
Max jitter (P99.9 - P50) | < 2μs | > 10μs
Packets delivered at peak load | 100% | < 99.99%
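These thresholds are straightforward to turn into an automated gate for benchmark runs. The following is a minimal sketch using the FPGA column of the table; the names FPGA_CRITERIA and classify are hypothetical, and the "marginal" label for values between target and warning threshold is an assumption, since the table does not name that band.

```python
# (target, warning) pairs taken from the FPGA column; lower is better for all
FPGA_CRITERIA = {
    "mean_us": (1.0, 2.0),
    "p99_us": (2.0, 5.0),
    "tail_ratio": (3.0, 5.0),
    "jitter_us": (2.0, 10.0),
}


def classify(metrics, criteria=FPGA_CRITERIA):
    """Label each metric 'pass', 'marginal', or 'warning'."""
    result = {}
    for name, (target, warning) in criteria.items():
        value = metrics[name]
        if value < target:
            result[name] = "pass"
        elif value > warning:
            result[name] = "warning"
        else:
            result[name] = "marginal"   # between target and warning threshold
    return result


# Illustrative measurements, not real benchmark results
measured = {"mean_us": 0.8, "p99_us": 3.0, "tail_ratio": 6.0, "jitter_us": 1.5}
verdict = classify(measured)
release_ok = all(label == "pass" for label in verdict.values())
```

In this example the mean and jitter pass, P99 is marginal, and the tail ratio trips the warning, so the run would not be accepted. Packet delivery is left out of the sketch because it is a higher-is-better metric and needs the comparison reversed.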

Conclusion

HFT latency optimization is a systems engineering discipline that spans physics, hardware design, and software architecture. The key principles:

  1. Proximity is the first-order effect: Colocate servers within meters of exchange matching engines. No amount of FPGA acceleration compensates for a 100 km fiber round trip.

  2. Profile before investing: The bottleneck migrates. After fixing network latency, the constraint moves to application processing. Measure the full tick-to-trade path continuously and invest where the actual bottleneck lies.

  3. FPGA for determinism, not just speed: The primary advantage of an FPGA is not raw throughput but deterministic latency with very low jitter. An FPGA that is 2x slower than a CPU on average but has 100x less jitter will win in production.

  4. Tail latency determines profitability: Average latency is a vanity metric. P99.9 and the max jitter control the downside. Target a tail ratio below 3.0.

  5. Risk checks must match the hot path speed: Pre-trade risk validation that adds 5μs to a 1μs tick-to-trade pipeline destroys the advantage. Run risk logic in hardware or in the same thread as the strategy engine.

The race to zero latency has no finish line — but these patterns provide a systematic approach to building infrastructure that competes at the nanosecond frontier.

