
Model Deployment: TensorFlow Serving, TorchServe, KServe

Published: June 24, 2025 Updated: May 8, 2026 Larry Qu 6 min read

Introduction

Deploying ML models in production is a distinct discipline from training them. A model that performs well in a Jupyter notebook still needs to handle concurrent requests, degrade gracefully under load, support versioning and rollback, and expose a stable API. Three tools dominate this space: TensorFlow Serving for TF-native deployments, TorchServe for PyTorch workloads, and KServe for multi-framework, Kubernetes-native serving at scale.

This guide compares all three, walks through essential setup, and helps you choose the right tool for your use case.


Deployment Architecture Overview

All three tools share the same high-level pattern: a client sends an inference request to a serving layer, which routes it to an instance running the appropriate model version. Each instance loads its model artifacts from a storage backend such as S3 or GCS.

flowchart TD
    Client["Client / API Gateway"]
    LB["Load Balancer"]
    S1["Serving Instance\n(Model v1)"]
    S2["Serving Instance\n(Model v2)"]
    Store["Model Storage\n(S3 / GCS)"]

    Client --> LB
    LB --> S1
    LB --> S2
    S1 & S2 --> Store

Where they diverge is in how they handle versioning, batching, scaling, and framework support — the tradeoffs that matter most in production.


Tool Comparison

The table below summarizes the key differences to guide your initial selection:

| Feature | TensorFlow Serving | TorchServe | KServe |
| --- | --- | --- | --- |
| Framework support | TensorFlow only | PyTorch only | TF, PyTorch, SKLearn, XGBoost, ONNX |
| Deployment target | Docker / bare metal | Docker / bare metal | Kubernetes (required) |
| Setup complexity | Medium | Low | High |
| Inference performance | Excellent | Excellent | Dependent on backend |
| Batching | Built-in | Built-in | Per-backend |
| Custom pre/post-processing | Limited | Full (handler API) | Transformer component |
| Auto-scaling | Manual | Manual | Knative (serverless) |
| A/B testing / canary | Not built-in | Limited | First-class |
| Explainability | No | No | Built-in (SHAP, Alibi) |
| Best for | TF models, low latency | PyTorch, custom logic | Multi-model, cloud-native |

TensorFlow Serving

TensorFlow Serving is purpose-built for serving SavedModel and TensorFlow Hub artifacts. It is the most mature option for TF workloads, and because the serving binary is a C++ server built directly on the TF runtime, it adds little overhead per request.

Architecture

The Manager component watches a directory for new version directories, loads them, and handles traffic switchover atomically — so you can deploy a new version without restarting the server.

flowchart LR
    Client --> REST["REST :8501\n/ gRPC :8500"]
    REST --> Manager["Manager\n(version control)"]
    Manager --> V1["Model v1"]
    Manager --> V2["Model v2"]

Exporting a Model for Serving

Before starting the server, export your Keras model in SavedModel format. Each version lives in a subdirectory whose name is an integer, and TF Serving serves the highest-numbered version by default.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
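model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# x_train / y_train: flattened MNIST images and integer labels, loaded elsewhere
model.fit(x_train, y_train, epochs=5)

# Version 1 lives at /models/mnist/1/
# With Keras 3 (TensorFlow 2.16+), model.export('/models/mnist/1') is the
# recommended way to write a SavedModel for serving.
tf.saved_model.save(model, '/models/mnist/1')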
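To make the version-directory convention concrete, here is a minimal sketch of how a "highest integer wins" lookup could work. It is an illustration only, not TF Serving's actual implementation; the `latest_version` helper and the temporary directory layout are assumptions made for the example.

```python
import os
import tempfile
from typing import Optional


def latest_version(model_dir: str) -> Optional[int]:
    """Return the highest integer-named subdirectory, or None if there is none."""
    versions = [
        int(name)
        for name in os.listdir(model_dir)
        if name.isdecimal() and os.path.isdir(os.path.join(model_dir, name))
    ]
    return max(versions, default=None)


# Build a throwaway layout that mimics /models/mnist/{1,2,10}
with tempfile.TemporaryDirectory() as root:
    for name in ("1", "2", "10", "notes"):
        os.makedirs(os.path.join(root, name))
    picked = latest_version(root)
    print(picked)
```

Versions are compared as integers, so 10 outranks 2; a plain string sort would get this wrong. Non-numeric directories such as `notes` are ignored.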

Starting the Server

The Docker image is the easiest way to run TF Serving. Mount your model directory and set MODEL_NAME to the directory name immediately under /models.

docker run -p 8500:8500 -p 8501:8501 \
  -v /models/mnist:/models/mnist \
  -e MODEL_NAME=mnist \
  tensorflow/serving

Making an Inference Request

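Use the gRPC client for latency-sensitive paths, or the REST API for simplicity. Here is the gRPC approach, which needs the tensorflow-serving-api package. The input key (dense_input below) must match the input name in the model's serving signature; you can inspect it with saved_model_cli show --dir /models/mnist/1 --all.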

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = 'mnist'
request.model_spec.signature_name = 'serving_default'
# x_input: one flattened 28x28 image (784 floats), prepared elsewhere
request.inputs['dense_input'].CopyFrom(
    tf.make_tensor_proto(x_input, dtype=tf.float32, shape=[1, 784])
)

response = stub.Predict(request, timeout=5.0)
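For the REST path, the request is a JSON body posted to the model's `:predict` endpoint on port 8501. The sketch below builds such a request with the standard library only and does not send it; the `build_predict_request` helper and the all-zero input are assumptions for illustration.

```python
import json
import urllib.request


def build_predict_request(host: str, model: str, instances: list) -> urllib.request.Request:
    """Build (but do not send) a TF Serving REST predict request."""
    url = f"http://{host}/v1/models/{model}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One flattened 28x28 image of zeros stands in for real pixel data.
req = build_predict_request("localhost:8501", "mnist", [[0.0] * 784])
print(req.full_url)

# To call a running server, uncomment:
# with urllib.request.urlopen(req, timeout=5) as resp:
#     predictions = json.loads(resp.read())["predictions"]
```

The row-oriented `instances` format takes one entry per example, so batching several inputs into one call only means adding more rows to the list.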

TorchServe

TorchServe is the PyTorch model server developed jointly by AWS and Meta. Its key strength is the handler API: a Python class that gives you full control over preprocessing, inference, and postprocessing, which is essential for models with complex input pipelines (NLP tokenization, image transforms, etc.). Check the project's current maintenance status before adopting it for a new deployment, since the upstream repository now carries a limited-maintenance notice.

Architecture

TorchServe exposes three ports: inference (8080), management (8081), and metrics (8082). Models are packaged as .mar archives, which bundle the model weights, handler code, and dependencies.

flowchart TD
    Client -->|"Inference :8080"| Frontend["Frontend\n(API Layer)"]
    Operator -->|"Management :8081"| Frontend
    Frontend --> Worker1["Worker\n(Model v1)"]
    Frontend --> Worker2["Worker\n(Model v1)"]
    Metrics["Metrics :8082"] --- Frontend

Packaging a Model Archive

The torch-model-archiver CLI bundles everything TorchServe needs. The handler file defines the inference pipeline.

torch-model-archiver \
  --model-name resnet50 \
  --version 1.0 \
  --serialized-file resnet50.pth \
  --handler image_classifier \
  --export-path /model-store

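The image_classifier handler is one of TorchServe's built-in handlers. For custom logic, point --handler to your own Python file containing a class that subclasses BaseHandler. If resnet50.pth holds an eager-mode state_dict rather than a TorchScript module, also pass --model-file with the Python file that defines the model class.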
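The shape of a custom handler is easier to see in code. The sketch below is a plain-Python stand-in that mirrors the preprocess, inference, postprocess flow; it deliberately does not import TorchServe, so `SketchHandler`, the `values` payload field, and the averaging "model" are all assumptions for illustration. A real handler would subclass `BaseHandler` and run a torch model instead.

```python
import json


class SketchHandler:
    """Stand-in that mirrors the preprocess -> inference -> postprocess flow.

    Not a real TorchServe handler: the "model" here is a plain function.
    """

    def __init__(self, model):
        self.model = model

    def preprocess(self, requests):
        # The handler receives one entry per request in the batch;
        # the payload sits under "data" or "body".
        rows = []
        for req in requests:
            payload = req.get("data") or req.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = json.loads(payload)
            rows.append([float(v) for v in payload["values"]])
        return rows

    def inference(self, rows):
        return [self.model(row) for row in rows]

    def postprocess(self, outputs):
        # Return one response per incoming request, in the same order.
        return [{"score": out} for out in outputs]

    def handle(self, requests, context=None):
        return self.postprocess(self.inference(self.preprocess(requests)))


handler = SketchHandler(model=lambda row: sum(row) / len(row))
batch = [{"body": b'{"values": [1, 2, 3]}'}, {"data": {"values": [4, 6]}}]
result = handler.handle(batch)
print(result)
```

The constraint worth remembering is in `postprocess`: the output list has to line up one-to-one with the incoming batch, otherwise responses cannot be matched back to their requests.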

Starting the Server and Registering a Model

After starting TorchServe, you register models dynamically via the management API without restarting the process.

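# Start the server
# (recent TorchServe releases also need --enable-model-api for dynamic registration,
#  and --disable-token-auth if you want unauthenticated local testing)
torchserve --start --model-store /model-store --ncs

# Register the model (batch_size=4, max delay 50ms, one worker)
# Without initial_workers the model registers with zero workers and cannot serve requests
curl -X POST "http://localhost:8081/models?url=resnet50.mar&batch_size=4&max_batch_delay=50&initial_workers=1"

# Run inference
curl -X POST -T cat.jpg http://localhost:8080/predictions/resnet50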
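The interaction between `batch_size` and `max_batch_delay` can be sketched with a small simulation. This is a simplified model, not TorchServe's scheduler: it assumes a batch is dispatched either when it is full or when the delay has elapsed since its first request, and the arrival times are made up.

```python
def form_batches(arrivals_ms, batch_size=4, max_batch_delay_ms=50):
    """Group request arrival times (in ms) into batches.

    A batch is dispatched when it holds `batch_size` requests, or when
    `max_batch_delay_ms` has passed since its first request arrived.
    """
    batches, current = [], []
    for t in sorted(arrivals_ms):
        if current and t - current[0] > max_batch_delay_ms:
            batches.append(current)  # delay expired before the batch filled
            current = []
        current.append(t)
        if len(current) == batch_size:
            batches.append(current)  # batch is full, dispatch immediately
            current = []
    if current:
        batches.append(current)
    return batches


batches = form_batches([0, 5, 10, 15, 20, 100, 160, 165])
print(batches)
```

Under bursty traffic the first four requests share one forward pass, while sparse requests go out alone after waiting up to the full delay. That wait is the latency cost of batching.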

KServe

KServe (formerly KFServing) is a Kubernetes-native serving platform that sits a layer above model servers such as TF Serving and TorchServe, which it can run as serving runtimes. You describe what you want in an InferenceService custom resource and KServe handles the rest: container provisioning, traffic splitting, and scale-to-zero via Knative.

Use KServe when you need multi-framework support, canary deployments, or serverless autoscaling — and when you are already running on Kubernetes.

Architecture

KServe decomposes serving into three components: a required Predictor (the model server), plus an optional Transformer (pre/post-processing) and an optional Explainer (SHAP or Alibi for model explanations). In the default serverless installation, traffic routing between revisions goes through an Istio ingress gateway.

flowchart TD
    Client --> Istio["Istio Ingress Gateway"]
    Istio -->|"90% traffic"| Stable["Predictor\n(Stable v1)"]
    Istio -->|"10% canary"| Canary["Predictor\n(Canary v2)"]
    Stable & Canary --> Store["Model Store\n(S3)"]
    Transformer["Transformer\n(pre/post-processing)"] --> Stable
    Explainer["Explainer\n(SHAP)"] --> Stable

Deploying an InferenceService

The following manifest deploys a scikit-learn model from S3 with a 10% canary rollout to a new version. KServe pulls the model artifact, wraps it in the appropriate serving container, and manages the Knative revision.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: kserve-test
spec:
  predictor:
    sklearn:
      storageUri: s3://my-bucket/iris/v1
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 512Mi
    canaryTrafficPercent: 10

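Apply with kubectl apply -f inference-service.yaml. To roll out a canary, update storageUri to the new model path while canaryTrafficPercent is set: KServe sends that share of traffic to the new revision and the rest to the previous one. Removing canaryTrafficPercent promotes the new revision to 100% of traffic.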
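What a 10% split means per request can be shown with a short simulation. This is only an illustration of weighted routing, not how the ingress gateway is implemented; the `route` function and the fixed seed are assumptions for the example.

```python
import random


def route(canary_percent: int, rng: random.Random) -> str:
    """Pick a revision for one request using a weighted coin flip."""
    return "canary" if rng.random() * 100 < canary_percent else "stable"


rng = random.Random(0)  # fixed seed so the run is repeatable
total = 10_000
canary_hits = sum(route(10, rng) == "canary" for _ in range(total))
canary_share = canary_hits / total
print(f"canary share: {canary_share:.3f}")
```

The split is probabilistic per request, so a low-traffic service may see noticeably more or less than 10% on the canary over a short window. Keep that in mind when judging canary metrics on small sample sizes.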

Autoscaling Configuration

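In serverless mode, KServe relies on the Knative Pod Autoscaler. Set minReplicas: 0 on the predictor for scale-to-zero on low-traffic models, or minReplicas: 1 to avoid cold starts on latency-sensitive paths. KServe translates these fields into the matching Knative autoscaling settings (min-scale, max-scale, metric, target) on the revision it manages, so configure them on the InferenceService rather than editing the Knative Service directly.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: kserve-test
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    scaleMetric: rps
    scaleTarget: 100
    sklearn:
      storageUri: s3://my-bucket/iris/v1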
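As a rough mental model of the settings above (an rps target of 100, bounded between 1 and 10 replicas), the autoscaler aims for enough pods that each one sees about the target load. The sketch below is a deliberate simplification: it ignores Knative's averaging windows, panic mode, and utilization adjustments, and the `desired_replicas` function is hypothetical.

```python
import math


def desired_replicas(observed_rps: float, target_rps: float,
                     min_scale: int, max_scale: int) -> int:
    """Replicas needed so each pod sees about `target_rps`, clamped to the bounds."""
    wanted = math.ceil(observed_rps / target_rps)
    return max(min_scale, min(max_scale, wanted))


for rps in (0, 250, 5000):
    print(rps, desired_replicas(rps, target_rps=100, min_scale=1, max_scale=10))
```

With a minimum of 1 the service never drops to zero pods, and with a maximum of 10 any load beyond roughly 1,000 rps is absorbed by the existing pods rather than by new ones. That makes the upper bound a capacity decision as much as a cost one.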

When to Use Each Tool

Choose TensorFlow Serving when your team is fully committed to TensorFlow and needs the lowest possible inference latency on bare metal or Docker. It has the smallest operational footprint and is the most straightforward path from model.fit() to a production gRPC endpoint.

Choose TorchServe when you are working with PyTorch models that require custom input preprocessing — tokenization, image augmentation, multi-modal fusion. The handler API is significantly more flexible than TF Serving’s signature-based approach. It also supports serving multiple models from a single process, which is useful for cost-conscious deployments.

Choose KServe when you are operating on Kubernetes and need any of the following: multi-framework model serving from a single control plane, automatic canary rollouts, scale-to-zero for dev/staging environments, or model explainability out of the box. The trade-off is the operational overhead of Kubernetes: KServe is not the right choice for a small deployment that lives outside a cluster.


Production Checklist

Before going live with any of these tools, verify the following:

  • Model artifact versioning is tracked in your experiment tracker (MLflow, W&B)
  • Rollback is tested — you can switch versions in under 5 minutes
  • p95 latency meets your SLA under peak load (run locust or vegeta benchmarks)
  • Request rate, latency histograms, and error rates feed into Prometheus/Grafana
  • GPU memory utilization is monitored to catch memory leaks between requests
  • Batch size is tuned — larger batches increase throughput but add latency
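For the latency item, a p95 check takes only a few lines once you have raw samples from a load test. The sketch below uses the nearest-rank method; the sample latencies and the 50 ms SLA threshold are made up for illustration, and a real benchmark would supply far more data points.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds, including one slow outlier
latencies_ms = [12, 15, 11, 14, 13, 18, 16, 250, 17, 19,
                12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
p95 = percentile(latencies_ms, 95)
sla_ms = 50  # hypothetical SLA threshold
print(p95, p95 <= sla_ms)
```

A single slow request out of twenty does not move p95 here, which is why it is worth tracking p99 or the maximum alongside it when tail latency matters.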
