Introduction
Deploying ML models in production is a distinct discipline from training them. A model that performs well in a Jupyter notebook still needs to handle concurrent requests, degrade gracefully under load, support versioning and rollback, and expose a stable API. Three tools dominate this space: TensorFlow Serving for TF-native deployments, TorchServe for PyTorch workloads, and KServe for multi-framework, Kubernetes-native serving at scale.
This guide compares all three, walks through essential setup, and helps you choose the right tool for your use case.
Deployment Architecture Overview
All three tools share the same high-level pattern: a client sends an inference request to a serving layer, which routes it to the appropriate model version. The model artifacts themselves are loaded from a storage backend such as S3 or GCS.
flowchart TD
Client["Client / API Gateway"]
LB["Load Balancer"]
S1["Serving Instance\n(Model v1)"]
S2["Serving Instance\n(Model v2)"]
Store["Model Storage\n(S3 / GCS)"]
Client --> LB
LB --> S1
LB --> S2
S1 & S2 --> Store
Where they diverge is in how they handle versioning, batching, scaling, and framework support — the tradeoffs that matter most in production.
Tool Comparison
The table below summarizes the key differences to guide your initial selection:
| Feature | TensorFlow Serving | TorchServe | KServe |
|---|---|---|---|
| Framework support | TensorFlow only | PyTorch only | TF, PyTorch, SKLearn, XGBoost, ONNX |
| Deployment target | Docker / bare metal | Docker / bare metal | Kubernetes (required) |
| Setup complexity | Medium | Low | High |
| Inference performance | Excellent | Excellent | Dependent on backend |
| Batching | Built-in | Built-in | Per-backend |
| Custom pre/post-processing | Limited | Full (handler API) | Transformer component |
| Auto-scaling | Manual | Manual | Knative (serverless) |
| A/B testing / canary | Not built-in | Limited | First-class |
| Explainability | No | No | Built-in (SHAP, Alibi) |
| Best for | TF models, low latency | PyTorch, custom logic | Multi-model, cloud-native |
TensorFlow Serving
TensorFlow Serving is purpose-built for serving SavedModel and TensorFlow Hub artifacts. It is the most mature option for TF workloads and provides extremely low latency out of the box because the serving binary is tightly integrated with the TF runtime.
Architecture
The Manager component watches a directory for new version directories, loads them, and handles traffic switchover atomically — so you can deploy a new version without restarting the server.
flowchart LR
Client --> REST["REST :8501\n/ gRPC :8500"]
REST --> Manager["Manager\n(version control)"]
Manager --> V1["Model v1"]
Manager --> V2["Model v2"]
Exporting a Model for Serving
Before starting the server, export your Keras model in SavedModel format. Each version lives in an integer-named subdirectory of the model directory, and TF Serving serves the highest available version by default.
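import tensorflow as tf
# Load MNIST and flatten each 28x28 image into a 784-vector scaled to [0, 1]
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(x_train, y_train, epochs=5)
# Version 1 lives at /models/mnist/1/
# (with Keras 3, model.export('/models/mnist/1') is the documented way to write a SavedModel)
tf.saved_model.save(model, '/models/mnist/1')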
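The "highest version wins" rule described above can be illustrated with a few lines of Python. This is only a sketch of the selection logic, not TF Serving's actual implementation; the helper name `latest_version` and the throwaway directory layout are assumptions made for the example.

```python
import os
import tempfile


def latest_version(model_dir):
    """Return the highest integer-named subdirectory, or None if there is none."""
    versions = [
        int(entry)
        for entry in os.listdir(model_dir)
        if entry.isdecimal() and os.path.isdir(os.path.join(model_dir, entry))
    ]
    return max(versions) if versions else None


# Demo on a temporary directory that mimics /models/mnist
with tempfile.TemporaryDirectory() as base:
    for name in ("1", "2", "10", "notes"):
        os.mkdir(os.path.join(base, name))
    newest = latest_version(base)  # non-numeric names such as "notes" are ignored

with tempfile.TemporaryDirectory() as empty:
    nothing = latest_version(empty)
```

Note that versions are compared as integers, so `10` ranks above `2`; a plain string sort would get this wrong.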
Starting the Server
The Docker image is the easiest way to run TF Serving. Mount your model directory and set MODEL_NAME to the directory name immediately under /models.
docker run -p 8500:8500 -p 8501:8501 \
-v /models/mnist:/models/mnist \
-e MODEL_NAME=mnist \
tensorflow/serving
Making an Inference Request
Use the gRPC client for latency-sensitive paths, or the REST API for simplicity. Here is the gRPC approach:
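import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc
# Placeholder input: one flattened 28x28 image; replace with real data
x_input = np.zeros((1, 784), dtype=np.float32)
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'mnist'
request.model_spec.signature_name = 'serving_default'
# The input key must match the SavedModel signature
# (check with: saved_model_cli show --dir /models/mnist/1 --all)
request.inputs['dense_input'].CopyFrom(
    tf.make_tensor_proto(x_input, dtype=tf.float32, shape=[1, 784])
)
response = stub.Predict(request, timeout=5.0)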
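For the REST path mentioned above, a request can be assembled with the standard library alone. The sketch below builds a row-format (`instances`) predict request for port 8501. The helper name `build_predict_request` is hypothetical, and the all-zeros input is a placeholder. The network call is left commented out so the snippet runs without a server.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances, version=None):
    """Build a POST request for TF Serving's REST predict endpoint."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        # Pin a specific version instead of the server's default
        path += f"/versions/{version}"
    url = f"http://{host}{path}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One flattened 28x28 image of zeros as placeholder input
req = build_predict_request("localhost:8501", "mnist", [[0.0] * 784])
pinned = build_predict_request("localhost:8501", "mnist", [[0.0] * 784], version=1)

# To send it against a running server:
# with urllib.request.urlopen(req, timeout=5.0) as resp:
#     predictions = json.loads(resp.read())["predictions"]
```

REST is easier to debug with curl or a browser, at the cost of JSON serialization overhead that the gRPC path avoids.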
TorchServe
TorchServe is the official PyTorch serving solution maintained by AWS and Meta. Its key strength is the handler API — a Python class that gives you full control over preprocessing, inference, and postprocessing, which is essential for models with complex input pipelines (NLP tokenization, image transforms, etc.).
Architecture
TorchServe exposes three ports: inference (8080), management (8081), and metrics (8082). Models are packaged as .mar archives, which bundle the model weights, handler code, and dependencies.
flowchart TD
Client -->|"Inference :8080"| Frontend["Frontend\n(API Layer)"]
Operator -->|"Management :8081"| Frontend
Frontend --> Worker1["Worker\n(Model v1)"]
Frontend --> Worker2["Worker\n(Model v1)"]
Metrics["Metrics :8082"] --- Frontend
Packaging a Model Archive
The torch-model-archiver CLI bundles everything TorchServe needs. The handler file defines the inference pipeline.
torch-model-archiver \
--model-name resnet50 \
--version 1.0 \
--serialized-file resnet50.pth \
--handler image_classifier \
--export-path /model-store
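The image_classifier handler is a built-in option. For custom logic, point --handler to your own Python file that subclasses BaseHandler. If resnet50.pth is an eager-mode checkpoint (a saved state_dict) rather than a TorchScript file, also pass --model-file with the Python file that defines the model class.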
Starting the Server and Registering a Model
After starting TorchServe, you register models dynamically via the management API without restarting the process.
# Start the server
torchserve --start --model-store /model-store --ncs
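# Register the model (batch_size=4, max delay 50ms)
# initial_workers defaults to 0, so request one worker or inference calls will fail
curl -X POST "http://localhost:8081/models?url=resnet50.mar&batch_size=4&max_batch_delay=50&initial_workers=1"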
# Run inference
curl -X POST -T cat.jpg http://localhost:8080/predictions/resnet50
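The batch_size and max_batch_delay parameters used at registration interact as follows: the server waits until either a full batch has arrived or the delay has expired, then hands over whatever it has. The following is a conceptual sketch of that behavior using a plain queue. It is not TorchServe's implementation, and `collect_batch` is a hypothetical name.

```python
import queue
import time


def collect_batch(requests, batch_size, max_batch_delay_ms):
    """Gather up to batch_size items, waiting at most max_batch_delay_ms in total."""
    deadline = time.monotonic() + max_batch_delay_ms / 1000.0
    batch = []
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # delay expired: send a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch


# Six requests are already waiting when the worker starts batching
pending = queue.Queue()
for i in range(6):
    pending.put(f"request-{i}")

first = collect_batch(pending, batch_size=4, max_batch_delay_ms=50)   # full batch
second = collect_batch(pending, batch_size=4, max_batch_delay_ms=50)  # partial batch after the delay
```

The second call returns only two requests, and only after waiting out the 50 ms delay. That wait is the latency cost of batching under light traffic.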
KServe
KServe (formerly KFServing) is a Kubernetes-native serving platform that sits a layer above model servers such as TF Serving and TorchServe, which it can run as serving runtimes. You describe what you want in a Kubernetes CRD (InferenceService) and KServe handles the rest: container provisioning, traffic splitting, and scale-to-zero via Knative.
Use KServe when you need multi-framework support, canary deployments, or serverless autoscaling — and when you are already running on Kubernetes.
Architecture
KServe decomposes serving into three components: a required Predictor (the model), plus an optional Transformer (pre/post-processing) and an optional Explainer (SHAP or Alibi for model explanations). Traffic routing between versions is handled by an Istio gateway.
flowchart TD
Client --> Istio["Istio Ingress Gateway"]
Istio -->|"90% traffic"| Stable["Predictor\n(Stable v1)"]
Istio -->|"10% canary"| Canary["Predictor\n(Canary v2)"]
Stable & Canary --> Store["Model Store\n(S3)"]
Transformer["Transformer\n(pre/post-processing)"] --> Stable
Explainer["Explainer\n(SHAP)"] --> Stable
Deploying an InferenceService
The following manifest deploys a scikit-learn model from S3 with a 10% canary rollout to a new version. KServe pulls the model artifact, wraps it in the appropriate serving container, and manages the Knative revision.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: kserve-test
spec:
  predictor:
    sklearn:
      storageUri: s3://my-bucket/iris/v1
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 512Mi
    canaryTrafficPercent: 10
Apply with kubectl apply -f inference-service.yaml. The canary version is set by updating storageUri to a new path and keeping canaryTrafficPercent at your desired split.
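To build intuition for what a 10% split means at the request level, here is a small Python sketch that assigns requests to the stable or canary revision. It uses a hash of a hypothetical request ID so the result is reproducible. The real routing is done by the ingress layer with weighted routes, not by this logic, and `pick_revision` is an illustrative name only.

```python
import hashlib


def pick_revision(request_id, canary_percent):
    """Map a request ID to 'canary' for roughly canary_percent of IDs."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


# Simulate 10,000 requests with canaryTrafficPercent = 10
counts = {"stable": 0, "canary": 0}
for i in range(10_000):
    counts[pick_revision(f"req-{i}", canary_percent=10)] += 1

canary_share = counts["canary"] / 10_000  # close to 0.10
```

At low traffic volumes the canary sees very few requests, so give a 10% rollout enough time to accumulate a meaningful sample before promoting it.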
Autoscaling Configuration
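In serverless mode, KServe relies on the Knative Pod Autoscaler. Set the min-scale annotation to "0" for scale-to-zero on low-traffic models, or to "1" to avoid cold starts in latency-sensitive paths. On an InferenceService, the corresponding predictor fields are minReplicas and maxReplicas.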
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  annotations:
    autoscaling.knative.dev/min-scale: "1"
    autoscaling.knative.dev/max-scale: "10"
    autoscaling.knative.dev/metric: "rps"
    autoscaling.knative.dev/target: "100"
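With the annotations above (an rps target of 100 per pod, bounded between 1 and 10 replicas), the steady-state replica count can be approximated with simple arithmetic. The sketch below is a deliberate simplification: the actual autoscaler also averages over time windows and has a separate burst mode, none of which is modeled here. `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(observed_rps, target_rps_per_pod, min_scale, max_scale):
    """Pods needed to keep each pod at or below the target rate, clamped to bounds."""
    wanted = math.ceil(observed_rps / target_rps_per_pod)
    return max(min_scale, min(max_scale, wanted))


print(desired_replicas(450, 100, 1, 10))   # 5: ceil(450 / 100)
print(desired_replicas(0, 100, 1, 10))     # 1: min-scale keeps one warm pod
print(desired_replicas(5000, 100, 1, 10))  # 10: capped by max-scale
print(desired_replicas(0, 100, 0, 10))     # 0: scale-to-zero when min-scale is 0
```

The cap matters for capacity planning: at 5000 rps, ten pods would each see about 500 rps, five times the target, so max-scale should be sized against expected peak load.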
When to Use Each Tool
Choose TensorFlow Serving when your team is fully committed to TensorFlow and needs the lowest possible inference latency on bare metal or Docker. It has the smallest operational footprint and is the most straightforward path from model.fit() to a production gRPC endpoint.
Choose TorchServe when you are working with PyTorch models that require custom input preprocessing — tokenization, image augmentation, multi-modal fusion. The handler API is significantly more flexible than TF Serving’s signature-based approach. It also supports serving multiple models from a single process, which is useful for cost-conscious deployments.
Choose KServe when you are operating on Kubernetes and need any of the following: multi-framework model serving from a single control plane, automatic canary rollouts, scale-to-zero for dev/staging environments, or model explainability out of the box. The operational overhead of Kubernetes is the trade-off; it is not the right choice for a small deployment that lives outside a cluster.
Production Checklist
Before going live with any of these tools, verify the following:
- Model artifact versioning is tracked in your experiment tracker (MLflow, W&B)
- Rollback is tested — you can switch versions in under 5 minutes
- p95 latency meets your SLA under peak load (run locust or vegeta benchmarks)
- Request rate, latency histograms, and error rates feed into Prometheus/Grafana
- GPU memory utilization is monitored to catch memory leaks between requests
- Batch size is tuned — larger batches increase throughput but add latency
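For the p95 check in the list above, it helps to be precise about how the percentile is computed from raw samples. Below is a minimal sketch using the nearest-rank definition. The latency values and the 50 ms SLA threshold are invented for illustration, and load-testing tools may use a slightly different interpolation.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 14,
                15, 12, 13, 14, 16, 13, 12, 15, 14, 13]

p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
meets_sla = p95 <= 50  # assumed SLA: p95 under 50 ms
```

Here the single 250 ms outlier does not affect p95 but dominates p99, which is why it is worth tracking both when tail latency matters.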