High-Frequency Trading Infrastructure: Latency Optimization

Introduction

High-frequency trading (HFT) sits at the cutting edge of financial technology, where a few microseconds of delay can decide whether a trade is profitable. Building HFT infrastructure requires an understanding of hardware acceleration, network optimization, and ultra-low-latency software design.

This guide covers the complete HFT stack: from network architecture and colocation to FPGA implementation and latency measurement. Whether you’re building a trading system or optimizing existing infrastructure, these patterns will help you achieve the lowest possible latency.

Understanding HFT Latency

Latency Hierarchy

┌─────────────────────────────────────────────────────────────────────┐
│                    HFT LATENCY HIERARCHY                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  TICK-TO-TRADE LATENCY BREAKDOWN                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                                                             │   │
│  │  Market Data Arrival                                         │   │
│  │  │                                                           │   │
│  │  │    Network Latency (10μs - 1ms)                          │   │
│  │  │    │                                                      │   │
│  │  │    │    NIC Processing (1-10μs)                          │   │
│  │  │    │    │                                                 │   │
│  │  │    │    │    OS/Driver Latency (1-5μs)                   │   │
│  │  │    │    │    │                                            │   │
│  │  │    │    │    │    Application Processing (1-100μs)      │   │
│  │  │    │    │    │    │                                       │   │
│  │  │    │    │    │    │    Order Entry (1-50μs)              │   │
│  │  │    │    │    │    │    │                                  │   │
│  │  │    │    │    │    │    │    Exchange Processing (10-100μs)│   │
│  │  ↓    ↓    ↓    ↓    ↓    ↓                                   │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  TARGET LATENCIES BY SYSTEM TYPE:                                    │
│  ┌─────────────────────┬────────────────────────────────────────┐  │
│  │ System Type         │ Target Latency                         │  │
│  ├─────────────────────┼────────────────────────────────────────┤  │
│  │ FPGA Trading        │ < 1 microsecond                        │  │
│  │ Direct Memory       │ 1-10 microseconds                      │  │
│  │ Kernel Bypass       │ 10-100 microseconds                    │  │
│  │ Standard Linux      │ 100 microseconds - 1 millisecond       │  │
│  │ Cloud-based         │ 1-10 milliseconds                      │  │
│  └─────────────────────┴────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
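The ranges in the diagram can be added up to see how wide the tick-to-trade budget is and which stage dominates it. A minimal sketch that uses only the figures from the diagram; the names STAGES_US, budget, and dominant_stage are chosen here for illustration:

```python
# Stage latency ranges in microseconds, copied from the diagram above.
STAGES_US = {
    "network": (10, 1000),
    "nic": (1, 10),
    "os_driver": (1, 5),
    "application": (1, 100),
    "order_entry": (1, 50),
    "exchange": (10, 100),
}

def budget(stages):
    """Return (best_case, worst_case) totals in microseconds."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst

def dominant_stage(stages):
    """Name of the stage with the largest worst-case latency."""
    return max(stages, key=lambda name: stages[name][1])

best_us, worst_us = budget(STAGES_US)
print(f"tick-to-trade: {best_us} to {worst_us} microseconds")
print(f"largest worst-case contributor: {dominant_stage(STAGES_US)}")
```

On these figures the network stage alone accounts for most of the worst case, which is why colocation comes before any software tuning.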

Latency Measurement

// Ultra-low latency timestamp implementation
#include <stdint.h>
#include <time.h>
#include <sys/time.h>

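// Hardware timestamp from the CPU's time-stamp counter (x86 only).
// rdtsc is not a serializing instruction, so the CPU may reorder it relative to
// nearby instructions; use rdtscp or an lfence when exact ordering matters.
static inline uint64_t get_cycles() {
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}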

// Convert cycles to nanoseconds (requires calibration)
static inline uint64_t cycles_to_ns(uint64_t cycles, double ns_per_cycle) {
    return (uint64_t)(cycles * ns_per_cycle);
}

// Timestamp with system time
static inline uint64_t get_timestamp_ns() {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

// Monotonic clock for intervals
static inline uint64_t get_monotonic_ns() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

// Lock-free latency recorder
typedef struct {
    uint64_t timestamp;
    uint32_t sequence;
    int32_t data;
} __attribute__((aligned(64))) LatencyRecord;

static_assert(sizeof(LatencyRecord) == 64, "LatencyRecord must fill one cache line");

// The recorder itself is C++; the helpers above compile as C or C++
class LatencyRecorder {
private:
    // Power-of-two capacity so the ring index can be computed with a mask
    static const uint32_t CAPACITY = 1024;

    LatencyRecord records[CAPACITY];  // preallocated ring buffer
    uint32_t current_index = 0;
    uint64_t total_latency = 0;
    uint32_t sample_count = 0;

public:
    void record(uint64_t start_cycle, uint64_t end_cycle, double ns_per_cycle) {
        uint64_t latency_ns = cycles_to_ns(end_cycle - start_cycle, ns_per_cycle);

        // Claim a slot atomically; the sequence keeps counting while the index wraps
        uint32_t sequence = __sync_fetch_and_add(&current_index, 1);
        uint32_t index = sequence & (CAPACITY - 1);

        records[index].timestamp = get_timestamp_ns();
        records[index].sequence = sequence;
        records[index].data = (int32_t)latency_ns;

        __sync_fetch_and_add(&total_latency, latency_ns);
        __sync_fetch_and_add(&sample_count, 1);
    }

    double get_average_latency_ns() {
        // Adding zero performs an atomic read
        uint32_t count = __sync_fetch_and_add(&sample_count, 0);
        if (count == 0) return 0;
        return (double)__sync_fetch_and_add(&total_latency, 0) / count;
    }
};
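cycles_to_ns takes an ns_per_cycle factor that has to be measured on the target machine. The arithmetic is sketched below in Python for brevity: sample a cycle counter and a reference clock at two points, then divide the elapsed nanoseconds by the elapsed cycles. The read_cycles argument is a hypothetical callable standing in for get_cycles(), and the fake 3 GHz counter exists only so the example runs anywhere.

```python
import time

def calibrate_ns_per_cycle(read_cycles, read_ns=time.monotonic_ns, interval_s=0.05):
    """Estimate nanoseconds per counter tick over a short interval."""
    start_ns, start_cycles = read_ns(), read_cycles()
    deadline = start_ns + int(interval_s * 1e9)
    while read_ns() < deadline:
        pass  # spin until the interval has elapsed
    end_ns, end_cycles = read_ns(), read_cycles()
    return (end_ns - start_ns) / (end_cycles - start_cycles)

def fake_3ghz_counter():
    """Stand-in for a TSC: 3 ticks per nanosecond of monotonic time."""
    return time.monotonic_ns() * 3

ns_per_cycle = calibrate_ns_per_cycle(fake_3ghz_counter)
print(f"ns per cycle: {ns_per_cycle:.4f}")
```

A longer interval reduces the relative error from the gap between the two clock reads, at the cost of a slower startup.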

Network Architecture

Colocation and Network Design

┌─────────────────────────────────────────────────────────────────────┐
│                 HFT NETWORK ARCHITECTURE                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  EXCHANGE DATA CENTER                                                │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                                                             │   │
│  │    ┌─────────────────────────────────────────────┐         │   │
│  │    │         TRADING SERVER RACK                  │         │   │
│  │    │                                              │         │   │
│  │    │  ┌─────────┐  ┌─────────┐  ┌─────────┐   │         │   │
│  │    │  │ Trading │  │ Trading │  │  Market │   │         │   │
│  │    │  │ Server  │  │ Server  │  │  Data   │   │         │   │
│  │    │  │  (FPGA)│  │  (FPGA)│  │ Handler │   │         │   │
│  │    │  └────┬────┘  └────┬────┘  └────┬────┘   │         │   │
│  │    │       │            │            │        │         │   │
│  │    │       └────────────┼────────────┘        │         │   │
│  │    │                    │                     │         │   │
│  │    │              ┌─────┴─────┐               │         │   │
│  │    │              │  Switch   │               │         │   │
│  │    │              │ (Arista/  │               │         │   │
│  │    │              │  Cisco)   │               │         │   │
│  │    │              └─────┬─────┘               │         │   │
│  │    │                    │                     │         │   │
│  │    └────────────────────┼─────────────────────┘         │   │
│  │                         │                               │   │
│  │    ┌────────────────────┼─────────────────────┐         │   │
│  │    │                    │                       │         │   │
│  │    │           ┌────────┴────────┐              │         │   │
│  │    │           │  Exchange      │              │         │   │
│  │    │           │  Gateway       │              │         │   │
│  │    │           │  (Co-located) │              │         │   │
│  │    │           └────────┬───────┘              │         │   │
│  │    │                    │                       │         │   │
│  │    │              ┌─────┴─────┐               │         │   │
│  │    │              │  Exchange │               │         │   │
│  │    │              │  Match    │               │         │   │
│  │    │              │  Engine   │               │         │   │
│  │    │              └───────────┘               │         │   │
│  │    │                                          │         │   │
│  │    └──────────────────────────────────────────┘         │   │
│  │                                                             │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  NETWORK LATENCY TARGETS:                                            │
│  • Server to Switch: < 100 nanoseconds                             │
│  • Switch to Exchange: < 1 microsecond                              │
│  • Total Network: < 2 microseconds                                 │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
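Targets this tight leave little room for cable length. As a rough check, light in optical fiber travels at about two thirds of its vacuum speed, close to 5 ns per meter. The refractive index of 1.47 below is an assumed typical value, not a figure from this guide:

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458
FIBER_REFRACTIVE_INDEX = 1.47  # assumed typical value for single-mode fiber

def fiber_delay_ns(length_m, refractive_index=FIBER_REFRACTIVE_INDEX):
    """One-way propagation delay through a fiber run, in nanoseconds."""
    speed_m_per_s = SPEED_OF_LIGHT_M_PER_S / refractive_index
    return length_m / speed_m_per_s * 1e9

for length_m in (1, 20, 100):
    print(f"{length_m:>4} m of fiber: {fiber_delay_ns(length_m):7.1f} ns")
```

Under this assumption a 20 m run already costs about 100 ns, the whole server-to-switch target, before any switching or serialization delay is counted.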

Kernel Bypass with DPDK

// DPDK-based ultra-low latency packet processing
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_cycles.h>

#define RX_DESC_DEFAULT 512
#define TX_DESC_DEFAULT 512
#define MBUF_CACHE_SIZE 250
#define BURST_SIZE 32

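// Packet buffer pool; must be created with rte_pktmbuf_pool_create() before port_init()
struct rte_mempool *mbuf_pool;

// Application-specific hooks, defined elsewhere
int process_market_data(struct rte_mbuf *pkt);
struct rte_mbuf *prepare_order(struct rte_mbuf *pkt);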

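int port_init(uint16_t port, struct rte_eth_conf *port_conf) {
    int socket_id = rte_eth_dev_socket_id(port);
    int ret;

    // The device must be configured before its queues are set up: 1 RX queue, 1 TX queue
    ret = rte_eth_dev_configure(port, 1, 1, port_conf);
    if (ret != 0) return ret;

    // Configure RX/TX queues (NULL selects the driver's default queue settings)
    ret = rte_eth_rx_queue_setup(
        port, 0, RX_DESC_DEFAULT,
        socket_id,
        NULL, mbuf_pool
    );
    if (ret < 0) return ret;

    ret = rte_eth_tx_queue_setup(
        port, 0, TX_DESC_DEFAULT,
        socket_id,
        NULL
    );
    if (ret < 0) return ret;

    // Start port
    ret = rte_eth_dev_start(port);
    if (ret < 0) return ret;

    rte_eth_promiscuous_enable(port);

    return 0;
}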

// Main packet processing loop
void packet_processing_loop(uint16_t port) {
    struct rte_mbuf *rx_bufs[BURST_SIZE];
    struct rte_mbuf *tx_bufs[BURST_SIZE];
    
    while (1) {
        // Receive packets
        uint16_t nb_rx = rte_eth_rx_burst(
            port, 0, rx_bufs, BURST_SIZE
        );
        
        if (nb_rx == 0) continue;
        
        uint16_t nb_tx = 0;
        
        // Process each packet
        for (uint16_t i = 0; i < nb_rx; i++) {
            struct rte_mbuf *pkt = rx_bufs[i];
            
            // Parse and process market data
            if (process_market_data(pkt) > 0) {
                // Decision made - prepare order
                tx_bufs[nb_tx++] = prepare_order(pkt);
            }
        }
        
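        // Send orders
        if (nb_tx > 0) {
            uint16_t nb_sent = rte_eth_tx_burst(port, 0, tx_bufs, nb_tx);

            // The NIC may accept fewer packets than requested; free the rest
            for (uint16_t i = nb_sent; i < nb_tx; i++) {
                rte_pktmbuf_free(tx_bufs[i]);
            }
        }

        // Free processed packets (prepare_order() must return a newly allocated
        // mbuf, not the received one, or this would free a buffer still in use)
        for (uint16_t i = 0; i < nb_rx; i++) {
            rte_pktmbuf_free(rx_bufs[i]);
        }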
    }
}

Network Time Synchronization

# PTP (Precision Time Protocol) Configuration

import socket
import struct
import time

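class PTPClient:
    """Precision Time Protocol client (skeleton; message handling is left unimplemented)"""

    # Linux socket option for nanosecond receive timestamps. The fallback, 35, is the
    # value in Linux's asm-generic headers, used if the socket module does not export it
    SO_TIMESTAMPNS = getattr(socket, 'SO_TIMESTAMPNS', 35)

    def __init__(self, interface='eth0'):
        self.interface = interface
        self.socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.socket.setsockopt(socket.SOL_SOCKET, self.SO_TIMESTAMPNS, 1)

    def get_hardware_timestamp(self):
        """Get hardware timestamp from NIC"""
        # Hardware timestamps are enabled with the SIOCSHWTSTAMP ioctl and the
        # SO_TIMESTAMPING socket option, then read from recvmsg() ancillary data
        pass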
    
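    def synchronize(self, master_ip='192.168.1.100'):
        """Estimate the clock offset to a PTP master (simplified exchange)"""

        # Build the request. In standard PTP the master sends Sync and the
        # client answers with Delay_Req on the event port, 319
        sync_msg = self._create_sync_message()

        # Software timestamps: time_ns() reads the system clock, not the NIC
        send_time = time.time_ns()
        self.socket.sendto(sync_msg, (master_ip, 319))

        # recvfrom() returns (data, sender address); take the receive time separately
        response, _sender = self.socket.recvfrom(1024)
        recv_time_ns = time.time_ns()

        # Calculate offset
        offset = self._calculate_offset(
            send_time, recv_time_ns, response
        )

        return offset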
    
    def _create_sync_message(self):
        # PTP message structure
        pass
    
    def _calculate_offset(self, send_time, recv_time, response):
        # Calculate clock offset
        pass

# Linux PTP configuration
"""
# /etc/linuxptp/ptp4l.conf
[global]
delay_mechanism         E2E
network_transport       L2
logMinDelayReqInterval  0
logSyncInterval         -3
priority1               128
priority2               128
domainNumber            0

# Run ptp4l
# ptp4l -i eth0 -m -f /etc/linuxptp/ptp4l.conf

# Check synchronization status
# pmc -u -b 0 'GET TIME_STATUS_NP'
"""

FPGA Acceleration

FPGA Trading System Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                    FPGA TRADING SYSTEM                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    FPGA LOGIC BLOCKS                         │   │
│  │                                                              │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐   │   │
│  │  │ 10GbE MAC   │  │  Market Data│  │   Order Entry   │   │   │
│  │  │ (Xilinx IP) │─►│  Parser     │─►│   Generator     │   │   │
│  │  └─────────────┘  └─────────────┘  └────────┬────────┘   │   │
│  │                                              │              │   │
│  │  ┌─────────────┐  ┌─────────────┐           │              │   │
│  │  │  Strategy   │  │  Risk       │           │              │   │
│  │  │  Engine     │◄─│  Check      │◄──────────┘              │   │
│  │  │  (Custom)   │  │  Limits     │                          │   │
│  │  └──────┬──────┘  └─────────────┘                          │   │
│  │         │                                                   │   │
│  │  ┌──────┴──────┐                                            │   │
│  │  │  Order Book │                                            │   │
│  │  │  Management │                                            │   │
│  │  └─────────────┘                                            │   │
│  │                                                              │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              ▼                                       │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                 HOST SOFTWARE                                │   │
│  │  • Strategy management                                      │   │
│  │  • Order and position management                           │   │
│  │  • Risk reporting                                           │   │
│  │  • Logging and analytics                                    │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  LATENCY BREAKDOWN:                                                 │
│  • NIC to FPGA: ~100ns                                            │
│  • FPGA parsing: ~50ns                                             │
│  • Strategy logic: ~100ns - 1μs                                    │
│  • Order generation: ~50ns                                         │
│  • FPGA to NIC: ~100ns                                            │
│  • TOTAL: < 1 microsecond                                         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
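Adding up the breakdown shows how much of the one-microsecond target is left for strategy logic. The sketch below uses only the figures from the diagram; the variable names are illustrative:

```python
# Fixed pipeline stages in nanoseconds, from the latency breakdown above.
FIXED_STAGES_NS = {
    "nic_to_fpga": 100,
    "parsing": 50,
    "order_generation": 50,
    "fpga_to_nic": 100,
}
STRATEGY_RANGE_NS = (100, 1000)  # quoted range for strategy logic
TARGET_NS = 1000                 # "< 1 microsecond" total

fixed_ns = sum(FIXED_STAGES_NS.values())
strategy_headroom_ns = TARGET_NS - fixed_ns
total_range_ns = (fixed_ns + STRATEGY_RANGE_NS[0], fixed_ns + STRATEGY_RANGE_NS[1])

print(f"fixed stages: {fixed_ns} ns")
print(f"strategy headroom under target: {strategy_headroom_ns} ns")
print(f"total range: {total_range_ns[0]} to {total_range_ns[1]} ns")
```

With 300 ns of fixed pipeline cost, the strategy stage has to finish in under 700 ns for the total to stay below 1 μs; the upper end of its quoted range would push the total past the target.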

Verilog Market Data Parser

// Market Data Parser - simplified message layout in the style of NASDAQ ITCH
module market_data_parser (
    input wire clk,
    input wire rst,
    
    // 10GbE input
    input wire [63:0] rx_data,
    input wire rx_valid,
    input wire rx_sop,
    input wire rx_eop,
    
    // Parsed output
    output reg [31:0] symbol,
    output reg [31:0] price,
    output reg [31:0] size,
    output reg [2:0] msg_type,
    output reg data_valid,
    output reg parse_error
);
    
    // Protocol constants
    localparam MSG_ADD_ORDER = 3'b001;
    localparam MSG_EXECUTE = 3'b010;
    localparam MSG_CANCEL = 3'b011;
    localparam MSG_DELETE = 3'b100;
    
    // State machine
    reg [2:0] state;
    localparam IDLE = 3'd0;
    localparam PARSE_HEADER = 3'd1;
    localparam PARSE_BODY = 3'd2;
    localparam VALIDATE = 3'd3;
    
    reg [15:0] msg_len;
    reg [15:0] byte_count;
    
    always @(posedge clk) begin
        if (rst) begin
            state <= IDLE;
            data_valid <= 0;
            parse_error <= 0;
        end else begin
            case (state)
                IDLE: begin
                    if (rx_valid && rx_sop) begin
                        state <= PARSE_HEADER;
                        byte_count <= 0;
                    end
                    data_valid <= 0;
                end
                
                PARSE_HEADER: begin
                    if (rx_valid) begin
                        // First word: Message type + Length
                        msg_type <= rx_data[2:0];
                        msg_len <= rx_data[31:16];
                        byte_count <= byte_count + 8;
                        
                        // Test the incoming field directly: msg_len is written with a
                        // non-blocking assignment and still holds its old value here
                        if (rx_data[31:16] > 0) begin
                            state <= PARSE_BODY;
                        end else begin
                            parse_error <= 1;
                            state <= IDLE;
                        end
                    end
                end
                
                PARSE_BODY: begin
                    if (rx_valid) begin
                        byte_count <= byte_count + 8;
                        
                        // Each beat carries 8 bytes, so byte_count is always a multiple of 8
                        
                        // Parse symbol (offset 8-11): low half of the word at offset 8
                        if (byte_count == 8) begin
                            symbol <= rx_data[31:0];
                        end
                        
                        // Parse price (offset 16-19) and size (offset 20-23):
                        // both fields arrive in the word at offset 16
                        if (byte_count == 16) begin
                            price <= rx_data[31:0];
                            size <= rx_data[63:32];
                        end
                        
                        // Leave once this beat covers the last byte of the message
                        if (byte_count + 16'd8 >= msg_len || rx_eop) begin
                            state <= VALIDATE;
                        end
                    end
                end
                
                VALIDATE: begin
                    // Basic validation
                    parse_error <= (price == 0) || (size == 0);
                    data_valid <= (price != 0) && (size != 0);
                    state <= IDLE;
                end
                
                // Recover from any unused state encoding
                default: state <= IDLE;
            endcase
        end
    end
endmodule
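A software reference model is a common way to produce expected values for an HDL testbench. The sketch below follows the field offsets given in the parser's comments (header word, symbol at offset 8, price at offset 16, size at offset 20) and assumes the lowest-numbered byte of each 64-bit word sits in the least significant bits. Both pack_message and parse_words are hypothetical helpers, not part of any existing toolchain.

```python
MASK32 = 0xFFFFFFFF

def pack_message(msg_type, symbol, price, size, msg_len=24):
    """Build the three 64-bit words of one message (header, symbol, price/size)."""
    header = (msg_type & 0x7) | ((msg_len & 0xFFFF) << 16)
    symbol_word = symbol & MASK32
    price_size_word = (price & MASK32) | ((size & MASK32) << 32)
    return [header, symbol_word, price_size_word]

def parse_words(words):
    """Extract the fields the hardware parser is expected to output."""
    header, symbol_word, price_size_word = words[0], words[1], words[2]
    fields = {
        "msg_type": header & 0x7,
        "msg_len": (header >> 16) & 0xFFFF,
        "symbol": symbol_word & MASK32,
        "price": price_size_word & MASK32,
        "size": (price_size_word >> 32) & MASK32,
    }
    # Same validity rule as the VALIDATE state: price and size must be non-zero
    fields["data_valid"] = fields["price"] != 0 and fields["size"] != 0
    return fields

words = pack_message(msg_type=1, symbol=0x4141504C, price=1_500_000, size=100)
print(parse_words(words))
```

Feeding the same words to the simulated module and comparing its outputs with parse_words gives a quick regression check for field offsets.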

Order Management System

Low-Latency Order Router

// Java low-latency order management
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

public class OrderRouter {
    
    private final ByteBuffer sendBuffer;
    private final long[] timestamps;
    private final AtomicLong orderId;
    
    // Pre-allocated order templates
    private static final byte[] NEW_ORDER_TEMPLATE;
    private static final byte[] CANCEL_TEMPLATE;
    private static final byte[] MODIFY_TEMPLATE;
    
    static {
        NEW_ORDER_TEMPLATE = new byte[64];
        CANCEL_TEMPLATE = new byte[32];
        MODIFY_TEMPLATE = new byte[48];
    }
    
    public OrderRouter(int capacity) {
        this.sendBuffer = ByteBuffer.allocateDirect(capacity);
        this.timestamps = new long[10000];
        this.orderId = new AtomicLong(0);
    }
    
    public long sendNewOrder(
        String symbol,
        Side side,
        long price,
        long size,
        TimeInForce tif
    ) {
        long startCycle = get_cycles();
        
        // Generate order ID
        long orderId = this.orderId.incrementAndGet();
        
        // Build OUCH-style binary message
        sendBuffer.clear();
        
        // Message header
        sendBuffer.put((byte) 'O');  // New Order
        sendBuffer.putLong(orderId);
        sendBuffer.putLong(System.nanoTime());  // Monotonic timestamp, not wall-clock time
        
        // Symbol (6 bytes, space-padded); assumes an ASCII ticker and avoids
        // allocating a byte array on the hot path
        int symbolLength = Math.min(6, symbol.length());
        for (int i = 0; i < 6; i++) {
            sendBuffer.put(i < symbolLength ? (byte) symbol.charAt(i) : (byte) ' ');
        }
        
        // Side
        sendBuffer.put(side == Side.BUY ? (byte) 'B' : (byte) 'S');
        
        // Price (8 bytes, long)
        sendBuffer.putLong(price);
        
        // Size (4 bytes)
        sendBuffer.putInt((int) size);
        
        // Time in force
        sendBuffer.put((byte) tif.getCode());
        
        // Send to exchange (flip switches the buffer from writing to reading)
        sendBuffer.flip();
        int bytesSent = sendToExchange(sendBuffer);
        
        // Record latency
        long endCycle = get_cycles();
        recordLatency(startCycle, endCycle, orderId);
        
        return orderId;
    }
    
    public long sendCancel(long originalOrderId) {
        sendBuffer.clear();
        
        sendBuffer.put((byte) 'X');  // Cancel
        sendBuffer.putLong(originalOrderId);
        
        sendBuffer.flip();  // Make the written bytes readable for the send
        return sendToExchange(sendBuffer);
    }
    
    public long sendModify(long originalOrderId, long newPrice, long newSize) {
        sendBuffer.clear();
        
        sendBuffer.put((byte) 'U');  // Modify
        sendBuffer.putLong(originalOrderId);
        sendBuffer.putLong(newPrice);
        sendBuffer.putInt((int) newSize);
        
        sendBuffer.flip();  // Make the written bytes readable for the send
        return sendToExchange(sendBuffer);
    }
    
    private int sendToExchange(ByteBuffer buffer) {
        // Direct NIO send or use kernel bypass
        // Implementation depends on connectivity
        return 0;
    }
    
    private native long get_cycles();
    private void recordLatency(long start, long end, long orderId) {}
}
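The wire layout that sendNewOrder builds can be written down as a single format string, which is handy for decoding captured messages in test tooling. A sketch using Python's struct module, assuming the field order shown in the Java code and ByteBuffer's default big-endian byte order; the time-in-force code 0 is a made-up example value:

```python
import struct

# Big-endian, no padding: type, order id, timestamp, symbol, side, price, size, TIF
NEW_ORDER_FORMAT = ">cqq6scqiB"

def pack_new_order(order_id, timestamp_ns, symbol, side, price, size, tif_code):
    """Encode a new-order message with the same field order as the Java router."""
    symbol_bytes = symbol.encode("ascii")[:6].ljust(6, b" ")  # space-padded to 6 bytes
    side_byte = b"B" if side == "BUY" else b"S"
    return struct.pack(NEW_ORDER_FORMAT, b"O", order_id, timestamp_ns,
                       symbol_bytes, side_byte, price, size, tif_code)

message = pack_new_order(1, 123_456_789, "AAPL", "BUY", 1_500_000, 100, 0)
print(len(message), message[17:23])
```

The message is 37 bytes long, so it fits in a single cache line and well within one Ethernet frame.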

In-Memory Order Book

// Lock-free order book implementation
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>   // NULL
#include <stdint.h>   // uint64_t, UINT64_MAX

#define MAX_PRICE_LEVELS 10000
#define MAX_ORDERS 100000

typedef struct {
    atomic_uint_fast64_t price;
    atomic_uint_fast32_t total_size;
    atomic_uint_fast32_t order_count;
    void* next;
} PriceLevel;

typedef struct {
    atomic_uint_fast64_t order_id;
    atomic_uint_fast64_t price;
    atomic_uint_fast32_t size;
    atomic_uint_fast32_t remaining;
    atomic_uint_fast8_t side;  // 0 = buy, 1 = sell
    atomic_uint_fast8_t status;
    atomic_uint_fast64_t timestamp;
} Order;

typedef struct {
    PriceLevel* bid_levels[MAX_PRICE_LEVELS];
    PriceLevel* ask_levels[MAX_PRICE_LEVELS];
    Order orders[MAX_ORDERS];
    
    atomic_uint_fast64_t best_bid;
    atomic_uint_fast64_t best_ask;
    atomic_uint_fast64_t last_trade_price;
    atomic_uint_fast32_t total_bid_size;
    atomic_uint_fast32_t total_ask_size;
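} OrderBook;

// Price-level lookup helpers, defined elsewhere
PriceLevel* find_level(PriceLevel** levels, uint64_t price);
PriceLevel* find_or_create_level(PriceLevel** levels, uint64_t price);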

void order_book_init(OrderBook* book) {
    atomic_store(&book->best_bid, 0);
    atomic_store(&book->best_ask, UINT64_MAX);
    atomic_store(&book->total_bid_size, 0);
    atomic_store(&book->total_ask_size, 0);
}

bool add_order(OrderBook* book, uint64_t order_id, 
               uint64_t price, uint32_t size, uint8_t side) {
    
    // Find or create price level
    PriceLevel* level = find_or_create_level(
        side == 0 ? book->bid_levels : book->ask_levels,
        price
    );
    
    if (level == NULL) return false;
    
    // Create order
    Order* order = &book->orders[order_id % MAX_ORDERS];
    
    atomic_store(&order->order_id, order_id);
    atomic_store(&order->price, price);
    atomic_store(&order->size, size);
    atomic_store(&order->remaining, size);
    atomic_store(&order->side, side);
    atomic_store(&order->status, 1);  // New
    
    // Update price level
    atomic_fetch_add(&level->total_size, size);
    atomic_fetch_add(&level->order_count, 1);
    
    // Update book totals
    if (side == 0) {
        atomic_fetch_add(&book->total_bid_size, size);
        
        // Update best bid if needed
        uint64_t current_best = atomic_load(&book->best_bid);
        while (price > current_best) {
            if (atomic_compare_exchange_weak(&book->best_bid, &current_best, price)) {
                break;
            }
        }
    } else {
        atomic_fetch_add(&book->total_ask_size, size);
        
        uint64_t current_best = atomic_load(&book->best_ask);
        while (price < current_best) {
            if (atomic_compare_exchange_weak(&book->best_ask, &current_best, price)) {
                break;
            }
        }
    }
    
    return true;
}

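bool execute_order(OrderBook* book, uint64_t order_id, uint32_t executed) {
    Order* order = &book->orders[order_id % MAX_ORDERS];
    
    // Subtract with a compare-and-swap loop so that a fill larger than the
    // remaining quantity is rejected instead of wrapping the unsigned counter
    uint_fast32_t remaining = atomic_load(&order->remaining);
    do {
        if (executed > remaining) return false;
    } while (!atomic_compare_exchange_weak(&order->remaining, &remaining,
                                           remaining - executed));
    
    // Update price level
    uint8_t side = atomic_load(&order->side);
    PriceLevel* level = find_level(
        side == 0 ? book->bid_levels : book->ask_levels,
        atomic_load(&order->price)
    );
    
    if (level != NULL) {
        atomic_fetch_sub(&level->total_size, executed);
    }
    
    // Keep the book totals in step with the price level
    atomic_fetch_sub(side == 0 ? &book->total_bid_size : &book->total_ask_size,
                     executed);
    
    // Check if order is fully filled (remaining holds the pre-fill quantity)
    if (remaining == executed) {
        atomic_store(&order->status, 2);  // Filled
    }
    
    return true;
}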
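A single-threaded reference model makes it easier to check a lock-free book: replay the same messages through both and compare the results. The sketch below is such a model in Python, not a port of the C code. ReferenceBook and its methods are hypothetical names, and sides use the same encoding as above (0 = buy, 1 = sell).

```python
class ReferenceBook:
    """Plain dictionary-based order book for cross-checking results."""

    def __init__(self):
        self.levels = {0: {}, 1: {}}  # side -> {price: total resting size}
        self.orders = {}              # order_id -> [side, price, remaining]

    def add_order(self, order_id, price, size, side):
        self.orders[order_id] = [side, price, size]
        self.levels[side][price] = self.levels[side].get(price, 0) + size

    def execute_order(self, order_id, executed):
        side, price, remaining = self.orders[order_id]
        if executed > remaining:
            return False  # reject fills larger than the resting quantity
        self.orders[order_id][2] = remaining - executed
        self.levels[side][price] -= executed
        if self.levels[side][price] == 0:
            del self.levels[side][price]  # an empty level no longer counts as best
        return True

    def best_bid(self):
        return max(self.levels[0], default=0)

    def best_ask(self):
        return min(self.levels[1], default=None)

book = ReferenceBook()
book.add_order(1, price=100, size=50, side=0)
book.add_order(2, price=101, size=20, side=0)
book.execute_order(2, 20)
print(book.best_bid())
```

The model recomputes the best price whenever a level empties, a case the compare-and-swap updates in add_order do not cover on their own.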

Risk Management

Real-Time Risk Checks

# Real-time risk management system

from dataclasses import dataclass
from typing import Dict, List
from collections import deque
import threading

@dataclass
class Position:
    symbol: str
    long_qty: int
    short_qty: int
    avg_long_price: float
    avg_short_price: float
    
    @property
    def net_qty(self):
        return self.long_qty - self.short_qty
    
    @property
    def market_value(self):
        return (self.long_qty * self.avg_long_price + 
                self.short_qty * self.avg_short_price)

@dataclass
class RiskLimits:
    max_position_size: int
    max_order_size: int
    max_loss_per_minute: float
    max_loss_per_day: float
    max_orders_per_second: int

class RiskManager:
    def __init__(self, limits: RiskLimits):
        self.limits = limits
        self.positions: Dict[str, Position] = {}
        self.order_history = deque(maxlen=10000)
        self.pnl_history = deque(maxlen=1440)  # most recent 1,440 trades as (minute, pnl)
        self.daily_pnl = 0.0
        self.lock = threading.Lock()
        
    def check_order(self, symbol: str, side: str, 
                   quantity: int, price: float) -> bool:
        """Check if order passes risk checks"""
        
        with self.lock:
            position = self.positions.get(symbol, Position(
                symbol, 0, 0, 0.0, 0.0
            ))
            
            # Check order size
            if quantity > self.limits.max_order_size:
                return False
            
            # Check position limit
            new_qty = position.net_qty + (quantity if side == 'BUY' else -quantity)
            if abs(new_qty) > self.limits.max_position_size:
                return False
            
            # Check minute loss
            if self._get_minute_loss() > self.limits.max_loss_per_minute:
                return False
            
            # Check daily loss
            if self.daily_pnl < -self.limits.max_loss_per_day:
                return False
            
            # Check order rate
            if self._get_orders_per_second() > self.limits.max_orders_per_second:
                return False
            
            return True
            
    def check_cancel(self, order_id: int) -> bool:
        """Allow all cancels"""
        return True
    
    def record_trade(self, symbol: str, side: str, 
                    quantity: int, price: float, pnl: float):
        """Record executed trade"""
        
        with self.lock:
            # Update position
            if symbol not in self.positions:
                self.positions[symbol] = Position(symbol, 0, 0, 0.0, 0.0)
            
            pos = self.positions[symbol]
            
            if side == 'BUY':
                total_cost = pos.long_qty * pos.avg_long_price + quantity * price
                pos.long_qty += quantity
                pos.avg_long_price = total_cost / pos.long_qty if pos.long_qty > 0 else 0
            else:
                total_cost = pos.short_qty * pos.avg_short_price + quantity * price
                pos.short_qty += quantity
                pos.avg_short_price = total_cost / pos.short_qty if pos.short_qty > 0 else 0
            
            # Update P&L
            self.daily_pnl += pnl
            self.pnl_history.append((self._get_current_minute(), pnl))
            
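    def _get_minute_loss(self) -> float:
        """Total loss in the current minute, as a positive number"""
        current_minute = self._get_current_minute()
        # Losses are stored as negative P&L, so negate the sum before it is
        # compared against the positive max_loss_per_minute limit
        return -sum(pnl for minute, pnl in self.pnl_history if minute == current_minute and pnl < 0)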
    
    def _get_orders_per_second(self) -> int:
        # Implementation to track orders per second
        return 0
    
    def _get_current_minute(self) -> int:
        import time
        return int(time.time() / 60)

Performance Testing

Latency Benchmarking

# Comprehensive latency testing

import numpy as np
import time
from dataclasses import dataclass
from typing import List

@dataclass
class LatencyStats:
    min: float
    max: float
    mean: float
    median: float
    p99: float
    p999: float
    std: float

class LatencyBenchmark:
    def __init__(self):
        self.measurements: List[float] = []
        
    def measure_round_trip(self, iterations=10000):
        """Measure round-trip latency"""
        
        for _ in range(iterations):
            start = time.perf_counter_ns()
            
            # Simulate processing
            self._simulate_workload()
            
            end = time.perf_counter_ns()
            self.measurements.append(end - start)
            
    def _simulate_workload(self):
        """Simulate typical trading workload"""
        # Market data parsing
        # Strategy calculation
        # Risk check
        # Order generation
        pass
    
    def get_statistics(self) -> LatencyStats:
        """Calculate latency statistics"""
        
        data = np.array(self.measurements)
        
        return LatencyStats(
            min=np.min(data),
            max=np.max(data),
            mean=np.mean(data),
            median=np.median(data),
            p99=np.percentile(data, 99),
            p999=np.percentile(data, 99.9),
            std=np.std(data)
        )
    
    def print_report(self):
        stats = self.get_statistics()
        
        print("=" * 60)
        print("LATENCY BENCHMARK REPORT")
        print("=" * 60)
        print(f"Iterations:       {len(self.measurements):,}")
        print(f"Min:              {stats.min:.0f} ns ({stats.min/1000:.2f} μs)")
        print(f"Max:              {stats.max:.0f} ns ({stats.max/1000:.2f} μs)")
        print(f"Mean:             {stats.mean:.0f} ns ({stats.mean/1000:.2f} μs)")
        print(f"Median:           {stats.median:.0f} ns ({stats.median/1000:.2f} μs)")
        print(f"P99:              {stats.p99:.0f} ns ({stats.p99/1000:.2f} μs)")
        print(f"P99.9:            {stats.p999:.0f} ns ({stats.p999/1000:.2f} μs)")
        print(f"Std Dev:          {stats.std:.0f} ns ({stats.std/1000:.2f} μs)")
        print("=" * 60)

Conclusion

Building high-frequency trading infrastructure requires attention to every microsecond. Key takeaways:

Network is Critical: Colocation, kernel bypass (DPDK), and PTP synchronization are essential.
FPGA for Ultimate Speed: For sub-microsecond latency, FPGA acceleration is required.
Lock-Free Data Structures: Avoid locks in the hot path; use atomic operations.
Comprehensive Testing: Benchmark everything and measure at the cycle level.
Risk Management: Real-time risk checks must be as fast as order execution.

The pursuit of lower latency is endless, but these patterns will help you build a competitive HFT system.