Real-time ML Inference: Latency & Performance Tuning

Published: June 27, 2025 Updated: June 24, 2026 Larry Qu 8 min read

Introduction

Real-time ML inference typically has to respond in under 100ms at scale. A single inference request passes through several stages, and each one is a potential bottleneck. This guide walks through each stage, shows where time is lost, and provides concrete techniques to cut latency by 4–100x depending on your stack.


Latency Breakdown

Where Time Goes in Inference

Every inference request has a cost at each stage of the pipeline. Understanding where time is spent is the first step to knowing where to optimize.

flowchart LR
    A[Client Request] --> B[Network\n~20ms]
    B --> C[Input Processing\n~10ms]
    C --> D[Data Transfer\n~5ms]
    D --> E[Model Inference\n~50ms]
    E --> F[Post-processing\n~10ms]
    F --> G[Response\nTotal: ~95ms]

    style E fill:#f96,stroke:#c63,color:#000

For a typical image classification service, the model itself accounts for about half the total latency (50ms of 95ms in the diagram). The rest is overhead that caching, batching, and hardware acceleration can often reduce substantially.
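To make the breakdown concrete, here is a minimal sketch that turns the stage figures from the diagram into shares of the total. The stage names and the `latency_shares` helper are illustrative, not part of any library.

```python
# Stage latencies in milliseconds, copied from the diagram above.
STAGES_MS = {
    "network": 20,
    "input_processing": 10,
    "data_transfer": 5,
    "model_inference": 50,
    "post_processing": 10,
}

def latency_shares(stages_ms):
    """Return (total_ms, {stage: fraction_of_total})."""
    total = sum(stages_ms.values())
    return total, {name: ms / total for name, ms in stages_ms.items()}

total_ms, shares = latency_shares(STAGES_MS)
# Print the stages from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<18}{STAGES_MS[name]:>4} ms  {share:6.1%}")
print(f"{'total':<18}{total_ms:>4} ms")
```

Running the same calculation on your own per-stage timings shows which stage to optimize first.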

Latency Requirements by Use Case

Different applications have very different tolerances. Deploying a model without matching it to the right target is a common source of production incidents.

| Application | Target latency | Tolerance |
|---|---|---|
| Autonomous vehicle | < 10ms | Extremely strict |
| Real-time recommendation | < 50ms | Strict |
| Voice assistant | < 100ms | Strict |
| Search ranking | < 100ms | Moderate |
| Video analysis | < 500ms | Moderate |
| Mobile app | < 200ms | Moderate |

Inference Pipeline Architecture

Before optimizing, it helps to see the full system topology. A production inference service typically spans model serving, caching, batching, and hardware acceleration layers.

flowchart TD
    Client -->|HTTP/gRPC| LB[Load Balancer]
    LB --> Cache{Prediction\nCache}
    Cache -->|HIT| Client
    Cache -->|MISS| Batcher[Dynamic Batcher]
    Batcher --> GPU[GPU Inference\nWorker Pool]
    GPU --> Model[Optimized Model\nFP16 / INT8 / TRT]
    Model --> Batcher
    Batcher --> Cache
    Batcher --> Client

    style Cache fill:#6af,stroke:#36c,color:#000
    style Model fill:#f96,stroke:#c63,color:#000

The cache layer is checked first — a cache hit costs microseconds vs. tens of milliseconds for a full inference. The batcher collects concurrent requests and amortizes the fixed cost of a GPU kernel launch across many inputs.


Model Optimization Techniques

Quantization

Quantization reduces the numerical precision of model weights and activations, from 32-bit floats (FP32) down to 8-bit integers (INT8). This shrinks the model to roughly a quarter of its size, cuts memory bandwidth by the same factor, and typically makes inference 2–4x faster with less than 1% accuracy loss.
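The mapping from floats to 8-bit integers is easier to see on a handful of numbers. The following is a minimal pure-Python sketch of affine (scale plus zero-point) quantization. The function names are illustrative, and real converters calibrate ranges per tensor or per channel rather than over a flat list.

```python
def quantize_int8(values):
    """Affine quantization of a list of floats to the INT8 range [-128, 127]."""
    qmin, qmax = -128, 127
    # The representable range must include 0.0 so that zero maps exactly.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map INT8 codes back to approximate float values."""
    return [(x - zero_point) * scale for x in q]

weights = [-0.8, -0.3, 0.0, 0.4, 1.2]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, zero_point, max_error)
```

Each restored value lands within about one quantization step (`scale`) of the original, which is why accuracy usually degrades only slightly.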

Post-training quantization is the easiest entry point: no retraining required.

import tensorflow as tf

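def quantize_trained_model(model, representative_dataset):
    """Post-training INT8 quantization: no retraining needed.

    representative_dataset is a generator function that yields lists of
    sample input arrays; the converter uses it to calibrate activation ranges.
    """
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Full-integer quantization needs calibration data; conversion fails without it
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS_INT8
    ]
    return converter.convert()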

For models where accuracy is critical, quantization-aware training (QAT) simulates INT8 precision during the training forward pass, letting the model adapt its weights before the actual conversion:

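import tensorflow as tf
import tensorflow_model_optimization as tfmot

def create_qat_model(base_model):
    """Quantization-aware training: simulate INT8 during training."""
    # quantize_model wraps supported layers with fake-quantization ops
    qat_model = tfmot.quantization.keras.quantize_model(base_model)
    # The wrapped model must be recompiled; this loss assumes integer labels and logit outputs
    qat_model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
    # Fine-tune with qat_model.fit(...) before converting to TFLite
    return qat_model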

Pruning

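Pruning removes weights that contribute little to model output, typically those close to zero. Unstructured (magnitude) pruning zeroes individual weights and needs sparse kernels to turn that into a speedup. Structured pruning removes entire neurons or channels, which maps directly to fewer compute operations at inference time.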

import tensorflow_model_optimization as tfmot

def prune_model(model, target_sparsity=0.5):
    """Wrap a model so low-magnitude weights are pruned during fine-tuning."""
    pruning_params = {
        'pruning_schedule': tfmot.sparsity.keras.ConstantSparsity(
            target_sparsity=target_sparsity,
            begin_step=0,
        ),
    }
    pruned = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)
    pruned.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    # Fine-tune with fit(..., callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
    return pruned

def finalize_pruned_model(pruned_model):
    """Strip pruning wrappers before export."""
    return tfmot.sparsity.keras.strip_pruning(pruned_model)

At 50% sparsity, you can expect roughly 2x speedup on CPU with specialized sparse kernels. At 90%+ sparsity, speedups of 5–10x are achievable with 1–2% accuracy loss.
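As a rough illustration of what magnitude pruning does to the weights themselves, the sketch below zeroes the smallest-magnitude half of a flat weight list. The `prune_by_magnitude` helper is hypothetical and ignores schedules, layers, and fine-tuning.

```python
def prune_by_magnitude(weights, target_sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    if not 0.0 <= target_sparsity < 1.0:
        raise ValueError("target_sparsity must be in [0, 1)")
    n_prune = int(len(weights) * target_sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.02, 0.4, -0.1]
pruned = prune_by_magnitude(weights, target_sparsity=0.5)
sparsity = pruned.count(0.0) / len(pruned)
print(pruned, sparsity)
```

The zeros only become a speedup when the runtime skips them, which is why the figures above depend on sparse kernels.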

Knowledge Distillation

Distillation trains a small, fast “student” model to mimic the outputs of a large, accurate “teacher”. The student learns from soft probability distributions (temperature-scaled logits) rather than hard labels, which gives it far more information per training example.

import tensorflow as tf

def distillation_loss(y_true, teacher_logits, student_logits, temperature=3.0, alpha=0.7):
    """Combined soft + hard target loss for knowledge distillation."""
    # Soft targets: high temperature smooths the distribution
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_student = tf.nn.softmax(student_logits / temperature)
    kl = tf.keras.losses.KLDivergence()(soft_teacher, soft_student)

    # Hard targets: standard classification loss
    hard = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(
        y_true, student_logits
    )
    return alpha * kl + (1 - alpha) * hard

A well-distilled student model can achieve 50–100x faster inference with only 1–3% accuracy drop versus the teacher.
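To see why temperature matters, the sketch below applies a temperature-scaled softmax to a made-up set of teacher logits. The logit values are purely illustrative.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [8.0, 2.0, 0.5]  # hypothetical teacher output for one example
hard = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=3.0)
print([round(p, 4) for p in hard])
print([round(p, 4) for p in soft])
```

At temperature 1 nearly all probability mass sits on the top class. At temperature 3 the runner-up classes become visible, and that relative ranking is the extra signal the student learns from.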


Batching and Inference Serving

Dynamic Batching

GPUs are massively parallel — a single inference on one sample uses a fraction of available compute. Dynamic batching collects requests arriving within a short time window and processes them as a single batch, amortizing kernel launch overhead across many requests.

import queue
import threading
import time

class DynamicBatcher:
    """Collect concurrent requests and run inference on them as one batch."""

    def __init__(self, model, max_batch_size=32, timeout_ms=20):
        self.model = model
        self.max_batch_size = max_batch_size
        self.timeout_ms = timeout_ms
        self.q = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def predict(self, data):
        event = threading.Event()
        req = {'data': data, 'event': event, 'result': None, 'error': None}
        self.q.put(req)
        event.wait()
        if req['error'] is not None:
            raise req['error']
        return req['result']

    def _loop(self):
        while True:
            # Block for the first request so the batching window starts on arrival
            batch = [self.q.get()]
            deadline = time.monotonic() + self.timeout_ms / 1000.0
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.q.get(timeout=remaining))
                except queue.Empty:
                    break
            try:
                preds = self.model.predict([r['data'] for r in batch])
                for req, pred in zip(batch, preds):
                    req['result'] = pred
            except Exception as exc:
                # Propagate the failure instead of leaving callers blocked forever
                for req in batch:
                    req['error'] = exc
            finally:
                for req in batch:
                    req['event'].set()

With a batch size of 32, throughput typically improves 8–16x versus single-request inference, while per-request latency increases only marginally (by the batching window, typically 10–20ms).
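The throughput gain follows from a simple cost model: each GPU call pays a fixed overhead plus a small per-item cost. The sketch below uses hypothetical numbers (10ms fixed, 0.5ms per item) chosen only to illustrate the shape of the tradeoff; measure your own model to get real values.

```python
def batch_stats(batch_size, fixed_ms, per_item_ms):
    """Linear cost model: batch latency = fixed overhead + per-item compute."""
    latency_ms = fixed_ms + per_item_ms * batch_size
    throughput_rps = batch_size / (latency_ms / 1000.0)
    return latency_ms, throughput_rps

# Hypothetical costs, not measurements.
FIXED_MS, PER_ITEM_MS = 10.0, 0.5

lat_1, rps_1 = batch_stats(1, FIXED_MS, PER_ITEM_MS)
lat_32, rps_32 = batch_stats(32, FIXED_MS, PER_ITEM_MS)
print(f"batch=1:  {lat_1:.1f} ms, {rps_1:.0f} req/s")
print(f"batch=32: {lat_32:.1f} ms, {rps_32:.0f} req/s ({rps_32 / rps_1:.1f}x)")
```

Under these assumptions the batch takes about 2.5x longer to compute but serves 32x the requests, so the fixed overhead is what batching amortizes.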


Hardware Acceleration

GPU Optimization

Three changes together can roughly double inference speed on modern NVIDIA GPUs: FP16 precision, cuDNN autotuning, and torch.compile graph fusion.

import torch

def optimize_for_gpu(model):
    """Apply FP16, cuDNN autotuning, and graph compilation."""
    model = model.eval().half().cuda()             # eval() disables dropout and batch-norm updates
    torch.backends.cudnn.benchmark = True          # Profile & pick fastest kernels
    model = torch.compile(model, mode="reduce-overhead")  # Fuse ops, reduce Python overhead
    # Inputs must also be FP16 CUDA tensors; run under torch.inference_mode()
    return model

torch.compile with mode="reduce-overhead" is most effective for fixed-shape inputs. For variable-length inputs (e.g., NLP), use mode="default" instead.
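Both cuDNN autotuning and torch.compile do extra work on the first few calls, so any before/after comparison should discard warmup iterations. Below is a minimal, framework-agnostic timing sketch. The `benchmark` helper is illustrative, and when timing GPU code the callable should synchronize (for example with `torch.cuda.synchronize()`) before returning, because CUDA kernels launch asynchronously.

```python
import statistics
import time

def benchmark(fn, warmup=10, iters=100):
    """Call fn() repeatedly after a warmup phase; return (p50_ms, p99_ms)."""
    for _ in range(warmup):
        fn()  # results discarded: absorbs autotuning and compilation cost
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return statistics.median(samples), cuts[98]

# Stand-in workload; replace with a call into your model.
p50, p99 = benchmark(lambda: sum(range(10_000)), warmup=5, iters=50)
print(f"p50={p50:.3f} ms  p99={p99:.3f} ms")
```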

TensorRT Optimization

TensorRT is NVIDIA’s inference optimizer. It fuses layers, picks precision per-layer, and generates GPU-specific kernels that are typically 3–5x faster than a standard PyTorch or TF model.

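# Convert ONNX model to a TensorRT engine
# (newer TensorRT releases replace --workspace with --memPoolSize=workspace:4096)
trtexec --onnx=model.onnx \
        --saveEngine=model.engine \
        --fp16 \
        --workspace=4096 \
        --avgRuns=10

import tensorrt as trt
import torch

def run_trt_inference(engine_path, input_tensor, output_shape):
    """Load a TensorRT engine and run inference (one input, one output).

    Assumes input_tensor is a contiguous CUDA tensor matching the engine's
    input binding and that the output has the same dtype.
    """
    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, 'rb') as f:
        runtime = trt.Runtime(logger)
        engine = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()
    output = torch.empty(output_shape, dtype=input_tensor.dtype, device='cuda')
    # execute_v2 takes device pointers in binding order and returns a success flag
    if not context.execute_v2([input_tensor.data_ptr(), output.data_ptr()]):
        raise RuntimeError('TensorRT execution failed')
    return output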

Edge / Mobile Deployment

For mobile or embedded targets, TFLite with INT8 quantization runs efficiently on ARM CPUs without a GPU.

import tensorflow as tf

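def deploy_to_edge(model, representative_dataset):
    """Convert to INT8 TFLite for edge deployment."""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Full-integer quantization needs a generator of sample inputs for calibration
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    return converter.convert()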

def run_tflite(tflite_model_path, input_data):
    interp = tf.lite.Interpreter(model_path=tflite_model_path)
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    out = interp.get_output_details()[0]
    interp.set_tensor(inp['index'], input_data)
    interp.invoke()
    return interp.get_tensor(out['index'])

Caching and Prefetching

Prediction Cache

For many real-world workloads, a significant fraction of requests are for the same or similar inputs — user recommendations, search queries, or popular product classifications. Caching predictions eliminates inference entirely for these cases.

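import hashlib
import time
from collections import OrderedDict

class PredictionCache:
    """LRU prediction cache with a per-entry TTL (not thread-safe)."""

    def __init__(self, max_size=100_000, ttl_seconds=3600):
        self.cache = OrderedDict()
        self.max_size = max_size
        self.ttl = ttl_seconds

    def _key(self, data):
        # str() abbreviates large NumPy arrays with "...", which would make
        # different inputs collide; hash the raw bytes when they are available
        if hasattr(data, 'tobytes'):
            raw = repr((data.shape, str(data.dtype))).encode() + data.tobytes()
        else:
            raw = repr(data).encode()
        return hashlib.md5(raw).hexdigest()

    def get(self, data):
        k = self._key(data)
        if k in self.cache:
            value, ts = self.cache[k]
            if time.time() - ts < self.ttl:
                self.cache.move_to_end(k)  # mark as most recently used
                return value
            del self.cache[k]
        return None

    def set(self, data, result):
        k = self._key(data)
        if k not in self.cache and len(self.cache) >= self.max_size:
            self.cache.popitem(last=False)  # evict the least recently used entry
        self.cache[k] = (result, time.time())
        self.cache.move_to_end(k)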

_cache = PredictionCache()

def cached_predict(model, data):
    result = _cache.get(data)
    if result is None:
        result = model.predict(data)
        _cache.set(data, result)
    return result

Cache hit rates of 30–60% are common in recommendation and search workloads, reducing average latency by a corresponding amount.
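The effect of hit rate on average latency is a weighted mean of the hit and miss paths. The sketch below assumes a 0.05ms cache lookup (a hypothetical figure in the "microseconds" range) and the 50ms model inference time from the latency breakdown.

```python
def expected_latency_ms(hit_rate, hit_ms, miss_ms):
    """Average latency when a fraction hit_rate of requests is served from cache."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

HIT_MS = 0.05    # assumed cache lookup cost
MISS_MS = 50.0   # model inference time from the latency breakdown

for hit_rate in (0.0, 0.3, 0.6):
    avg = expected_latency_ms(hit_rate, HIT_MS, MISS_MS)
    print(f"hit rate {hit_rate:.0%}: {avg:.2f} ms average")
```

Note that caching lowers the average but not the miss path, so tail latency for uncached inputs stays the same.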


Benchmark Results

The table below shows the combined impact of these techniques on a ResNet-50 image classification model running on an NVIDIA T4 GPU.

| Configuration | Latency | Throughput | Memory |
|---|---|---|---|
| ResNet-50 FP32 (baseline) | 50ms | 20 req/s | 500MB |
| ResNet-50 FP16 | 25ms | 40 req/s | 300MB |
| ResNet-50 INT8 | 12ms | 80 req/s | 150MB |
| ResNet-50 INT8 + Batch=4 | 15ms | 320 req/s | 150MB |
| ResNet-50 TFLite (mobile) | 100ms | 10 req/s | 50MB |

Key takeaways:

  • FP32 → INT8 cuts latency 75% and memory 70% with < 1% accuracy loss
  • Adding batching on top of INT8 yields 16x the baseline throughput (4x over unbatched INT8) at nearly the same per-request latency
  • Mobile (TFLite) trades raw speed for a 90% memory reduction — the right tradeoff for on-device inference
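The percentages in the takeaways can be rechecked directly from the table rows. This short sketch only does that arithmetic; the dictionary keys are illustrative labels for the table's configurations.

```python
# Rows copied from the benchmark table: (latency_ms, throughput_rps, memory_mb)
RESULTS = {
    "fp32": (50, 20, 500),
    "fp16": (25, 40, 300),
    "int8": (12, 80, 150),
    "int8_batch4": (15, 320, 150),
    "tflite_mobile": (100, 10, 50),
}

def reduction(before, after):
    """Fractional reduction going from before to after."""
    return (before - after) / before

base, int8 = RESULTS["fp32"], RESULTS["int8"]
batched, mobile = RESULTS["int8_batch4"], RESULTS["tflite_mobile"]

print(f"INT8 latency reduction:  {reduction(base[0], int8[0]):.0%}")
print(f"INT8 memory reduction:   {reduction(base[2], int8[2]):.0%}")
print(f"Batched vs FP32 throughput: {batched[1] / base[1]:.0f}x")
print(f"Batched vs INT8 throughput: {batched[1] / int8[1]:.0f}x")
print(f"Mobile memory reduction: {reduction(base[2], mobile[2]):.0%}")
```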

Monitoring

Track inference latency in production with Prometheus histograms. Percentiles matter more than averages: with a p99 of 500ms, 1% of requests take 500ms or longer even when the median is 20ms.

from prometheus_client import Histogram, Counter

inference_latency = Histogram(
    'ml_inference_latency_seconds',
    'Inference latency in seconds',
    # Histogram.time() observes seconds, so buckets are in seconds (10ms to 500ms)
    buckets=[0.01, 0.025, 0.05, 0.1, 0.2, 0.5],
)
cache_hits = Counter('ml_cache_hits_total', 'Prediction cache hits')
cache_misses = Counter('ml_cache_misses_total', 'Prediction cache misses')

def predict_with_metrics(model, data):
    result = _cache.get(data)
    if result is not None:
        cache_hits.inc()
        return result
    cache_misses.inc()
    with inference_latency.time():
        result = model.predict(data)
    _cache.set(data, result)
    return result

Alert on p99 > 2× your SLA target. If p99 diverges from p50, it usually signals batching queue buildup or GPU memory pressure rather than a model issue.
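The alert rule can be prototyped offline on a list of recorded latencies before wiring it into an alerting system. The sketch below is one possible formulation; `latency_alert` and the sample data are hypothetical, and a production rule would normally be evaluated by the monitoring stack rather than in application code.

```python
import statistics

def latency_alert(samples_ms, sla_ms, factor=2.0):
    """Compute p50/p99 from raw samples and flag p99 > factor * SLA."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    p50, p99 = cuts[49], cuts[98]
    return {"p50": p50, "p99": p99, "alert": p99 > factor * sla_ms}

# Synthetic data: most requests are fast, a small tail is very slow.
samples = [20] * 98 + [500] * 2
report = latency_alert(samples, sla_ms=100)
print(report)
```

The median here looks healthy while the tail breaches the threshold, which is the situation an average would hide.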


Glossary

  • Quantization — Reducing numerical precision (FP32 → INT8) to shrink model size and speed up math operations
  • Pruning — Removing low-magnitude weights to reduce effective model size
  • Distillation — Training a small student model to mimic a large teacher model
  • Dynamic batching — Grouping concurrent requests to maximize GPU utilization
  • TensorRT — NVIDIA’s inference optimizer; fuses layers and generates GPU-specific kernels
