Introduction
Deploying large language models efficiently requires specialized serving infrastructure. This guide compares three leading solutions: NVIDIA Triton Inference Server, vLLM, and Hugging Face Text Generation Inference (TGI).
Each offers different trade-offs between throughput, latency, features, and hardware optimization.
Understanding LLM Serving
Key concepts in LLM serving:
- Throughput: Tokens processed per second
- Latency: Time to first token (TTFT) and per-token latency
- Batching: Grouping requests for efficient processing
- KV Cache: Caching attention key-values for efficiency
- Quantization: Reducing model size for faster inference
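The KV-cache entry above is worth quantifying, since at long contexts it often dominates GPU memory. A rough per-sequence estimate (a sketch; the Llama-2-7B shape constants used below — 32 layers, 32 KV heads, head dimension 128 — are the published ones):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Approximate KV-cache size for one sequence: two tensors (K and V)
    per layer, each of shape [num_kv_heads, seq_len, head_dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Llama-2-7B with an FP16 cache:
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)    # 0.5 MB per token
full_ctx = kv_cache_bytes(32, 32, 128, seq_len=4096)  # ~2 GiB per sequence
```

At batch size 32 with full 4K contexts, the cache alone would need ~64 GiB, which is why cache management dominates serving-engine design.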
# Key metrics to measure
metrics = {
    "throughput": "tokens/second across all requests",
    "latency_p50": "50th percentile response time",
    "latency_p99": "99th percentile response time",
    "ttft": "time to first token",
    "batch_size": "average batching efficiency",
    "gpu_utilization": "GPU usage percentage"
}
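The p50/p99 entries can be computed directly from recorded per-request latencies; a minimal sketch using only the standard library (the sample values are made up):

```python
from statistics import quantiles

def latency_percentiles(samples_ms):
    """Return (p50, p99) from a list of request latencies in milliseconds."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return cuts[49], cuts[98]  # 50th and 99th percentiles

# Hypothetical latency samples; note how one slow request drives p99
latencies = [120, 130, 135, 140, 150, 155, 160, 400]
p50, p99 = latency_percentiles(latencies)
```

In production you would feed this from a rolling window or, more commonly, let Prometheus histograms do the aggregation.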
NVIDIA Triton Inference Server
Triton is a production-grade serving solution with broad framework support and optimization features.
Triton Setup
# config.pbtxt - Triton model configuration
name: "llama-2-7b"
platform: "pytorch_libtorch"
max_batch_size: 32
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, 32000 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 100000
}
optimization {
  graph {
    level: 1
  }
}
Triton Python Client
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Connect to Triton server
client = httpclient.InferenceServerClient(url="localhost:8000")

def generate_next_token(prompt: str) -> str:
    # Tokenize
    inputs = tokenizer(prompt, return_tensors="np")
    # Prepare Triton inputs
    input_ids = httpclient.InferInput(
        "input_ids",
        list(inputs["input_ids"].shape),
        np_to_triton_dtype(inputs["input_ids"].dtype)
    )
    input_ids.set_data_from_numpy(inputs["input_ids"])
    attention_mask = httpclient.InferInput(
        "attention_mask",
        list(inputs["attention_mask"].shape),
        np_to_triton_dtype(inputs["attention_mask"].dtype)
    )
    attention_mask.set_data_from_numpy(inputs["attention_mask"])
    # Run inference
    response = client.infer(
        model_name="llama-2-7b",
        inputs=[input_ids, attention_mask]
    )
    # A single forward pass returns logits, not text: greedily pick the
    # next token. Full generation repeats this call, appending each token.
    logits = response.as_numpy("logits")
    next_token_id = int(logits[0, -1].argmax())
    return tokenizer.decode([next_token_id])
Triton Ensemble Models
# Ensemble: tokenization -> model -> detokenization
name: "llm_ensemble"
platform: "ensemble"
input [
  {
    name: "prompt"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "response"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      input_map { key: "input" value: "prompt" }
      output_map { key: "output" value: "input_ids" }
    },
    {
      model_name: "llama-2-7b"
      input_map { key: "input_ids" value: "input_ids" }
      output_map { key: "logits" value: "logits" }
    },
    {
      model_name: "detokenizer"
      input_map { key: "logits" value: "logits" }
      output_map { key: "output" value: "response" }
    }
  ]
}
Triton Performance Optimization
# Optimized config with dynamic batching
name: "llama-2-7b-optimized"
platform: "pytorch_libtorch"
max_batch_size: 64
instance_group [
  {
    kind: KIND_GPU
    count: 4  # Multiple GPUs
  }
]
dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 50000  # 50ms max wait
}
optimization {
  cuda {
    graphs: true  # CUDA graphs
  }
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters {
          key: "precision_mode"
          value: "FP16"
        }
      }
    ]
  }
}
vLLM: High-Throughput LLM Serving
vLLM specializes in optimized throughput with PagedAttention and continuous batching.
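PagedAttention's core idea can be shown with simple block accounting: the KV cache is allocated in fixed-size blocks (16 tokens by default) as a sequence grows, rather than reserving a full-context-length region per sequence up front. The arithmetic below is illustrative, not vLLM's actual allocator:

```python
from math import ceil

BLOCK_SIZE = 16  # vLLM's default tokens per KV block

def blocks_needed(seq_len: int) -> int:
    """KV blocks allocated for a sequence of seq_len tokens."""
    return ceil(seq_len / BLOCK_SIZE)

def paged_waste(seq_len: int) -> int:
    """Internal fragmentation: unused token slots in the last block."""
    return blocks_needed(seq_len) * BLOCK_SIZE - seq_len

def contiguous_waste(seq_len: int, max_model_len: int = 4096) -> int:
    """Naive serving preallocates max_model_len slots per sequence."""
    return max_model_len - seq_len

# A 100-token sequence wastes at most BLOCK_SIZE - 1 slots under paging,
# versus thousands of slots under contiguous preallocation.
```

Bounding per-sequence waste to under one block is what lets vLLM pack far more concurrent sequences into the same GPU memory, which in turn feeds its continuous batching.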
vLLM Installation and Setup
# Install vLLM
pip install vllm
from vllm import LLM, SamplingParams
# Initialize vLLM
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,  # Use 2 GPUs
    trust_remote_code=True,
    dtype="half",  # FP16
    max_num_seqs=256  # Max concurrent sequences
)
# Generate
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256,
    stop="</s>"
)
prompts = [
    "Explain quantum computing in simple terms",
    "What is machine learning?",
    "How do neural networks work?"
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}\n")
vLLM API Server
# Start vLLM API server
vllm serve meta-llama/Llama-2-7b-hf \
--tensor-parallel-size 2 \
--dtype half \
--port 8000
# Query with OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-hf",
"messages": [{"role": "user", "content": "Hello!"}]
}'
vLLM Advanced Features
# Streaming: the offline LLM.generate call returns only completed outputs;
# for token-by-token streaming, use the OpenAI-compatible server (above)
# with "stream": true in the request body
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b-hf", "prompt": "Hello", "stream": true}'
# Beam search
beam_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    use_beam_search=True,
    best_of=5
)
# Chunked prefill
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    enable_chunked_prefill=True,
    max_num_seqs=256,
    max_model_len=8192
)
# Prefix caching
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_prefix_caching=True
)
vLLM Quantization
# AWQ quantization (requires an AWQ-quantized checkpoint)
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half"
)
# GPTQ quantization: bit width and group size are read from the
# checkpoint's quantization config, not passed as arguments
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    dtype="half"
)
# SqueezeLLM quantization (likewise loads a pre-quantized checkpoint)
llm = LLM(
model="meta-llama/Llama-2-7b-hf",
quantization="squeezellm",
dtype="half"
)
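The memory saving from these schemes is easy to estimate from bit width alone (weights only; KV cache and activations come on top):

```python
def weight_memory_gb(params_billions: float, bits: int) -> float:
    """Approximate weight storage: parameter count * bits / 8 bytes."""
    return params_billions * bits / 8

fp16 = weight_memory_gb(7, 16)  # 14.0 GB
int4 = weight_memory_gb(7, 4)   #  3.5 GB
```

Dropping from FP16 to 4-bit cuts weight memory 4x, which often makes the difference between needing multi-GPU tensor parallelism and fitting on a single card.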
Hugging Face Text Generation Inference
TGI provides optimized inference for Hugging Face models with production features.
TGI Docker Deployment
# Start TGI with Docker
docker run --gpus all \
-p 8080:80 \
-v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:1.4 \
--model-id meta-llama/Llama-2-7b-hf \
--num-shard 2 \
--max-input-length 4096 \
--max-total-tokens 8192 \
--quantize bitsandbytes-nf4
TGI API Usage
from text_generation import Client
client = Client("http://localhost:8080")
# Simple generation
response = client.generate(
    prompt="Explain quantum computing",
    max_new_tokens=256,
    temperature=0.7
)
print(response.generated_text)
# Streaming
for response in client.generate_stream(
    prompt="Tell me a story",
    max_new_tokens=100
):
    print(response.token.text, end="", flush=True)
# Batch processing: the client API takes one prompt per call; TGI's
# continuous batching groups concurrent requests on the server
prompts = [
    "What is AI?",
    "Define machine learning",
    "Explain neural networks"
]
responses = [client.generate(p, max_new_tokens=100) for p in prompts]
TGI Configuration
# TGI is configured via launcher flags (or matching environment variables),
# not a config file; sampling parameters such as temperature and top_p are
# set per request, not at server start
text-generation-launcher \
  --model-id meta-llama/Llama-2-70b-hf \
  --trust-remote-code \
  --num-shard 4 \
  --quantize bitsandbytes-nf4 \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --max-batch-prefill-tokens 8192 \
  --max-waiting-tokens 20
TGI with Custom Chat Template
# Custom chat template
from text_generation import Client
client = Client("http://localhost:8080")
# Use custom chat template
response = client.generate(
    prompt="<|system|>You are a helpful assistant.<|end|><|user|>Hello!<|end|><|assistant|>",
    max_new_tokens=256
)
# With inference endpoints
from huggingface_hub import InferenceClient
client = InferenceClient(
    model="meta-llama/Llama-2-70b-chat-hf",
    token="hf_..."
)
for chunk in client.chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
    stream=True
):
    print(chunk.choices[0].delta.content or "", end="")
Feature Comparison
| Feature | Triton | vLLM | TGI |
|---|---|---|---|
| Throughput | Good | Excellent | Excellent |
| Latency | Good | Very Good | Very Good |
| Batching | Dynamic | Continuous | Continuous |
| Quantization | TensorRT | AWQ, GPTQ, SqueezeLLM | bitsandbytes, GPTQ, AWQ |
| Multi-GPU | Yes | Yes | Yes |
| OpenAI API | Via add-on | Yes | Yes |
| Streaming | Yes | Yes | Yes |
| Frameworks | Any | PyTorch | Hugging Face |
When to Use Each Solution
Use Triton When:
- Need multi-framework support
- Already using NVIDIA ecosystem
- Need ensemble models
# Good: Triton for heterogeneous models
# Mix TensorFlow, PyTorch, ONNX in one pipeline
Use vLLM When:
- Maximum throughput is priority
- Running open models (LLaMA, Mistral, etc.)
- Need PagedAttention benefits
# Good: vLLM for high-throughput serving
llm = LLM(model="mistralai/Mistral-7B-v0.1", tensor_parallel_size=4)
Use TGI When:
- Using Hugging Face models
- Need easy deployment
- Want latest Hugging Face features
# Good: TGI for quick deployment
docker run ... ghcr.io/huggingface/text-generation-inference:latest
Bad Practices to Avoid
Bad Practice 1: No Batching Configuration
# Bad: Single request at a time
for prompt in prompts:
    result = llm.generate(prompt)  # Very slow!
# Good: Batch requests
outputs = llm.generate(prompts, sampling_params)
Bad Practice 2: Ignoring Quantization
# Bad: Using full precision
llm = LLM(model="70b-model", dtype="float32") # Needs ~280GB for weights alone!
# Good: Use quantization
llm = LLM(model="70b-model", quantization="awq")
Bad Practice 3: Wrong Tensor Parallelism
# Bad: Too few GPUs
llm = LLM(model="70b-model", tensor_parallel_size=1) # OOM!
# Good: Match to available GPUs
llm = LLM(model="70b-model", tensor_parallel_size=4) # 4 GPUs
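The right tensor_parallel_size can be sanity-checked before launch: weights are sharded roughly evenly across GPUs, so each GPU's shard must leave headroom for KV cache and activations. A sketch; the 80 GB figure assumes A100/H100-class cards and the 30% headroom fraction is a chosen assumption, not a vLLM constant:

```python
from math import ceil

def min_tensor_parallel(weight_gb: float, gpu_gb: float = 80,
                        headroom: float = 0.3) -> int:
    """Smallest TP degree whose per-GPU weight shard leaves `headroom`
    fraction of GPU memory free for KV cache and activations."""
    usable = gpu_gb * (1 - headroom)
    return ceil(weight_gb / usable)

# Llama-2-70B at FP16: ~140 GB of weights -> at least 3 x 80 GB GPUs
tp = min_tensor_parallel(140)
```

In practice you round up to a power of two (here, 4 GPUs), since attention heads must divide evenly across the tensor-parallel ranks.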
Good Practices Summary
Performance Tuning
# Good: Optimized vLLM configuration
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,
    dtype="half",
    max_num_seqs=256,
    max_model_len=8192,
    gpu_memory_utilization=0.95,
    enable_chunked_prefill=True,
    enforce_eager=False  # CUDA graphs
)
Monitoring
# Good: Track key metrics
from flask import Flask, request
from prometheus_client import Counter, Histogram

app = Flask(__name__)
request_count = Counter('llm_requests_total', 'Total requests')
request_latency = Histogram('llm_request_latency', 'Request latency')

@app.route("/generate", methods=["POST"])
def generate():
    request_count.inc()
    prompt = request.json["prompt"]
    with request_latency.time():
        outputs = llm.generate(prompt)
    return outputs[0].outputs[0].text
Capacity Planning
# Estimate GPU requirements (a rough back-of-the-envelope heuristic)
from math import ceil

def estimate_gpus(model_size_billions, quantize_bits=16, target_rps=10,
                  rps_per_gpu=2.0, gpu_memory_gb=80):
    # Weights: params * 4 bytes at FP32, scaled down by quantization
    fp32_memory = model_size_billions * 4  # GB
    compressed = fp32_memory * quantize_bits / 32
    overhead = compressed * 1.2  # KV cache, activations, etc.
    # Provision for whichever constraint binds: memory or throughput
    gpus_for_memory = ceil(overhead / gpu_memory_gb)
    gpus_for_throughput = ceil(target_rps / rps_per_gpu)
    return max(gpus_for_memory, gpus_for_throughput)
External Resources
- Triton Inference Server Documentation
- vLLM Documentation
- Text Generation Inference Documentation
- NVIDIA TensorRT-LLM
- PagedAttention Paper
- vLLM GitHub
- TGI GitHub
- LLM Serving Benchmark