Introduction
Deploying large language models efficiently requires specialized serving infrastructure. This guide compares three leading solutions: NVIDIA Triton Inference Server, vLLM, and Hugging Face Text Generation Inference (TGI).
Each offers different trade-offs between throughput, latency, features, and hardware optimization.
Understanding LLM Serving
Key concepts in LLM serving:
- Throughput: Tokens processed per second
- Latency: Time to first token (TTFT) and per-token latency
- Batching: Grouping requests for efficient processing
- KV Cache: Caching attention key-values for efficiency
- Quantization: Reducing model size for faster inference
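The KV-cache entry above is worth quantifying, since at long contexts it often dominates GPU memory. A rough per-sequence estimate (a sketch; the Llama-2-7B shape constants used below — 32 layers, 32 KV heads, head dimension 128 — are the published ones):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Approximate KV-cache size for one sequence: two tensors (K and V)
    per layer, each of shape [num_kv_heads, seq_len, head_dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Llama-2-7B with an FP16 cache:
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)    # 0.5 MB per token
full_ctx = kv_cache_bytes(32, 32, 128, seq_len=4096)  # ~2 GiB per sequence
```

At batch size 32 with full 4K contexts, the cache alone would need ~64 GiB, which is why cache management dominates serving-engine design.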
# Key metrics to measure
metrics = {
    "throughput": "tokens/second across all requests",
    "latency_p50": "50th percentile response time",
    "latency_p99": "99th percentile response time",
    "ttft": "time to first token",
    "batch_size": "average batching efficiency",
    "gpu_utilization": "GPU usage percentage"
}
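The p50/p99 entries can be computed directly from recorded per-request latencies; a minimal sketch using only the standard library (the sample values are made up):

```python
from statistics import quantiles

def latency_percentiles(samples_ms):
    """Return (p50, p99) from a list of request latencies in milliseconds."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return cuts[49], cuts[98]  # 50th and 99th percentiles

# Hypothetical latency samples; note how one slow request drives p99
latencies = [120, 130, 135, 140, 150, 155, 160, 400]
p50, p99 = latency_percentiles(latencies)
```

In production you would feed this from a rolling window or, more commonly, let Prometheus histograms do the aggregation.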
NVIDIA Triton Inference Server
Triton is a production-grade serving solution with broad framework support and optimization features.
Triton Setup
# config.pbtxt - Triton model configuration
name: "llama-2-7b"
platform: "pytorch_libtorch"
max_batch_size: 32
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, 32000 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 100000
}
optimization {
  graph {
    level: 1
  }
}
Triton Python Client
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Connect to Triton server
client = httpclient.InferenceServerClient(url="localhost:8000")

def generate_next_token(prompt: str) -> str:
    # Tokenize
    inputs = tokenizer(prompt, return_tensors="np")
    # Prepare Triton inputs
    input_ids = httpclient.InferInput(
        "input_ids",
        list(inputs["input_ids"].shape),
        np_to_triton_dtype(inputs["input_ids"].dtype)
    )
    input_ids.set_data_from_numpy(inputs["input_ids"])
    attention_mask = httpclient.InferInput(
        "attention_mask",
        list(inputs["attention_mask"].shape),
        np_to_triton_dtype(inputs["attention_mask"].dtype)
    )
    attention_mask.set_data_from_numpy(inputs["attention_mask"])
    # Run inference
    response = client.infer(
        model_name="llama-2-7b",
        inputs=[input_ids, attention_mask]
    )
    # A single forward pass returns logits, not text: greedily pick the
    # next token. Full generation repeats this call, appending each token.
    logits = response.as_numpy("logits")
    next_token_id = int(logits[0, -1].argmax())
    return tokenizer.decode([next_token_id])
Triton Ensemble Models
# Ensemble: tokenization -> model -> detokenization
name: "llm_ensemble"
platform: "ensemble"
input [
  {
    name: "prompt"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "response"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      input_map { key: "input" value: "prompt" }
      output_map { key: "output" value: "input_ids" }
    },
    {
      model_name: "llama-2-7b"
      input_map { key: "input_ids" value: "input_ids" }
      output_map { key: "logits" value: "logits" }
    },
    {
      model_name: "detokenizer"
      input_map { key: "logits" value: "logits" }
      output_map { key: "output" value: "response" }
    }
  ]
}
Triton Performance Optimization
# Optimized config with dynamic batching
name: "llama-2-7b-optimized"
platform: "pytorch_libtorch"
max_batch_size: 64
instance_group [
  {
    kind: KIND_GPU
    count: 4  # Multiple GPUs
  }
]
dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 50000  # 50ms max wait
}
optimization {
  cuda {
    graphs: true  # CUDA graphs
  }
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters {
          key: "precision_mode"
          value: "FP16"
        }
      }
    ]
  }
}
vLLM: High-Throughput LLM Serving
vLLM specializes in optimized throughput with PagedAttention and continuous batching.
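PagedAttention's core idea can be shown with simple block accounting: the KV cache is allocated in fixed-size blocks (16 tokens by default) as a sequence grows, rather than reserving a full-context-length region per sequence up front. The arithmetic below is illustrative, not vLLM's actual allocator:

```python
from math import ceil

BLOCK_SIZE = 16  # vLLM's default tokens per KV block

def blocks_needed(seq_len: int) -> int:
    """KV blocks allocated for a sequence of seq_len tokens."""
    return ceil(seq_len / BLOCK_SIZE)

def paged_waste(seq_len: int) -> int:
    """Internal fragmentation: unused token slots in the last block."""
    return blocks_needed(seq_len) * BLOCK_SIZE - seq_len

def contiguous_waste(seq_len: int, max_model_len: int = 4096) -> int:
    """Naive serving preallocates max_model_len slots per sequence."""
    return max_model_len - seq_len

# A 100-token sequence wastes at most BLOCK_SIZE - 1 slots under paging,
# versus thousands of slots under contiguous preallocation.
```

Bounding per-sequence waste to under one block is what lets vLLM pack far more concurrent sequences into the same GPU memory, which in turn feeds its continuous batching.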
vLLM Installation and Setup
# Install vLLM
pip install vllm
from vllm import LLM, SamplingParams
# Initialize vLLM
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,  # Use 2 GPUs
    trust_remote_code=True,
    dtype="half",  # FP16
    max_num_seqs=256  # Max concurrent sequences
)
# Generate
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256,
    stop="</s>"
)
prompts = [
    "Explain quantum computing in simple terms",
    "What is machine learning?",
    "How do neural networks work?"
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}\n")
vLLM API Server
# Start vLLM API server
vllm serve meta-llama/Llama-2-7b-hf \
--tensor-parallel-size 2 \
--dtype half \
--port 8000
# Query with OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-hf",
"messages": [{"role": "user", "content": "Hello!"}]
}'
vLLM Advanced Features
# Streaming: the offline LLM.generate call returns only completed outputs;
# for token-by-token streaming, use the OpenAI-compatible server (above)
# with "stream": true in the request body
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b-hf", "prompt": "Hello", "stream": true}'
# Beam search
beam_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    use_beam_search=True,
    best_of=5
)
# Chunked prefill
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    enable_chunked_prefill=True,
    max_num_seqs=256,
    max_model_len=8192
)
# Prefix caching
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_prefix_caching=True
)
vLLM Quantization
# AWQ quantization (requires an AWQ-quantized checkpoint)
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half"
)
# GPTQ quantization: bit width and group size are read from the
# checkpoint's quantization config, not passed as arguments
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    dtype="half"
)
# SqueezeLLM quantization (likewise loads a pre-quantized checkpoint)
llm = LLM(
model="meta-llama/Llama-2-7b-hf",
quantization="squeezellm",
dtype="half"
)
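The memory saving from these schemes is easy to estimate from bit width alone (weights only; KV cache and activations come on top):

```python
def weight_memory_gb(params_billions: float, bits: int) -> float:
    """Approximate weight storage: parameter count * bits / 8 bytes."""
    return params_billions * bits / 8

fp16 = weight_memory_gb(7, 16)  # 14.0 GB
int4 = weight_memory_gb(7, 4)   #  3.5 GB
```

Dropping from FP16 to 4-bit cuts weight memory 4x, which often makes the difference between needing multi-GPU tensor parallelism and fitting on a single card.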
Hugging Face Text Generation Inference
TGI provides optimized inference for Hugging Face models with production features.
TGI Docker Deployment
# Start TGI with Docker
docker run --gpus all \
-p 8080:80 \
-v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:1.4 \
--model-id meta-llama/Llama-2-7b-hf \
--num-shard 2 \
--max-input-length 4096 \
--max-total-tokens 8192 \
--quantize bitsandbytes-nf4
TGI API Usage
from text_generation import Client
client = Client("http://localhost:8080")
# Simple generation
response = client.generate(
    prompt="Explain quantum computing",
    max_new_tokens=256,
    temperature=0.7
)
print(response.generated_text)
# Streaming
for response in client.generate_stream(
    prompt="Tell me a story",
    max_new_tokens=100
):
    print(response.token.text, end="", flush=True)
# Batch processing: the client API takes one prompt per call; TGI's
# continuous batching groups concurrent requests on the server
prompts = [
    "What is AI?",
    "Define machine learning",
    "Explain neural networks"
]
responses = [client.generate(p, max_new_tokens=100) for p in prompts]
TGI Configuration
# TGI is configured via launcher flags (or matching environment variables),
# not a config file; sampling parameters such as temperature and top_p are
# set per request, not at server start
text-generation-launcher \
  --model-id meta-llama/Llama-2-70b-hf \
  --trust-remote-code \
  --num-shard 4 \
  --quantize bitsandbytes-nf4 \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --max-batch-prefill-tokens 8192 \
  --max-waiting-tokens 20
TGI with Custom Chat Template
# Custom chat template
from text_generation import Client
client = Client("http://localhost:8080")
# Use custom chat template
response = client.generate(
    prompt="<|system|>You are a helpful assistant.<|end|><|user|>Hello!<|end|><|assistant|>",
    max_new_tokens=256
)
# With inference endpoints
from huggingface_hub import InferenceClient
client = InferenceClient(
    model="meta-llama/Llama-2-70b-chat-hf",
    token="hf_..."
)
for chunk in client.chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
    stream=True
):
    print(chunk.choices[0].delta.content or "", end="")
Feature Comparison
| Feature | Triton | vLLM | TGI |
|---|---|---|---|
| Throughput | Good | Excellent | Excellent |
| Latency | Good | Very Good | Very Good |
| Batching | Dynamic | Continuous | Continuous |
| Quantization | TensorRT | AWQ, GPTQ, SqueezeLLM | bitsandbytes, GPTQ, AWQ |
| Multi-GPU | Yes | Yes | Yes |
| OpenAI API | Via add-on | Yes | Yes |
| Streaming | Yes | Yes | Yes |
| Frameworks | Any | PyTorch | Hugging Face |
When to Use Each Solution
Use Triton When:
- Need multi-framework support
- Already using NVIDIA ecosystem
- Need ensemble models
# Good: Triton for heterogeneous models
# Mix TensorFlow, PyTorch, ONNX in one pipeline
Use vLLM When:
- Maximum throughput is priority
- Running open models (LLaMA, Mistral, etc.)
- Need PagedAttention benefits
# Good: vLLM for high-throughput serving
llm = LLM(model="mistralai/Mistral-7B-v0.1", tensor_parallel_size=4)
Use TGI When:
- Using Hugging Face models
- Need easy deployment
- Want latest Hugging Face features
# Good: TGI for quick deployment
docker run ... ghcr.io/huggingface/text-generation-inference:latest
Bad Practices to Avoid
Bad Practice 1: No Batching Configuration
# Bad: Single request at a time
for prompt in prompts:
    result = llm.generate(prompt)  # Very slow!
# Good: Batch requests
outputs = llm.generate(prompts, sampling_params)
Bad Practice 2: Ignoring Quantization
# Bad: Using full precision
llm = LLM(model="70b-model", dtype="float32") # Needs ~280GB for weights alone!
# Good: Use quantization
llm = LLM(model="70b-model", quantization="awq")
Bad Practice 3: Wrong Tensor Parallelism
# Bad: Too few GPUs
llm = LLM(model="70b-model", tensor_parallel_size=1) # OOM!
# Good: Match to available GPUs
llm = LLM(model="70b-model", tensor_parallel_size=4) # 4 GPUs
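The right tensor_parallel_size can be sanity-checked before launch: weights are sharded roughly evenly across GPUs, so each GPU's shard must leave headroom for KV cache and activations. A sketch; the 80 GB figure assumes A100/H100-class cards and the 30% headroom fraction is a chosen assumption, not a vLLM constant:

```python
from math import ceil

def min_tensor_parallel(weight_gb: float, gpu_gb: float = 80,
                        headroom: float = 0.3) -> int:
    """Smallest TP degree whose per-GPU weight shard leaves `headroom`
    fraction of GPU memory free for KV cache and activations."""
    usable = gpu_gb * (1 - headroom)
    return ceil(weight_gb / usable)

# Llama-2-70B at FP16: ~140 GB of weights -> at least 3 x 80 GB GPUs
tp = min_tensor_parallel(140)
```

In practice you round up to a power of two (here, 4 GPUs), since attention heads must divide evenly across the tensor-parallel ranks.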
Good Practices Summary
Performance Tuning
# Good: Optimized vLLM configuration
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,
    dtype="half",
    max_num_seqs=256,
    max_model_len=8192,
    gpu_memory_utilization=0.95,
    enable_chunked_prefill=True,
    enforce_eager=False  # CUDA graphs
)
Monitoring
# Good: Track key metrics
from flask import Flask, request
from prometheus_client import Counter, Histogram

app = Flask(__name__)
request_count = Counter('llm_requests_total', 'Total requests')
request_latency = Histogram('llm_request_latency', 'Request latency')

@app.route("/generate", methods=["POST"])
def generate():
    request_count.inc()
    prompt = request.json["prompt"]
    with request_latency.time():
        outputs = llm.generate(prompt)
    return outputs[0].outputs[0].text
Capacity Planning
# Estimate GPU requirements (a rough back-of-the-envelope heuristic)
from math import ceil

def estimate_gpus(model_size_billions, quantize_bits=16, target_rps=10,
                  rps_per_gpu=2.0, gpu_memory_gb=80):
    # Weights: params * 4 bytes at FP32, scaled down by quantization
    fp32_memory = model_size_billions * 4  # GB
    compressed = fp32_memory * quantize_bits / 32
    overhead = compressed * 1.2  # KV cache, activations, etc.
    # Provision for whichever constraint binds: memory or throughput
    gpus_for_memory = ceil(overhead / gpu_memory_gb)
    gpus_for_throughput = ceil(target_rps / rps_per_gpu)
    return max(gpus_for_memory, gpus_for_throughput)
External Resources
- Triton Inference Server Documentation
- vLLM Documentation
- Text Generation Inference Documentation
- NVIDIA TensorRT-LLM
- PagedAttention Paper
- vLLM GitHub
- TGI GitHub
- LLM Serving Benchmark