Introduction
Deploying ML models in production requires specialized infrastructure. TensorFlow Serving, TorchServe, and KServe are the leading solutions for scalable, reliable model serving.
This guide compares these platforms and covers production deployment patterns.
Model Serving Fundamentals
Deployment Architecture
          ┌─────────────────────┐
          │     API Gateway     │
          │   (Load Balancer)   │
          └──────────┬──────────┘
                     │
      ┌─────────┬────┴────┬─────────┐
      │         │         │         │
  ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
  │Server │ │Server │ │Server │ │Server │
  │Model1 │ │Model2 │ │Model1 │ │Model3 │
  └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘
      │         │         │         │
      └─────────┴────┬────┴─────────┘
                     │
           ┌─────────────────┐
           │  Model Storage  │
           │    (S3/GCS)     │
           └─────────────────┘
Deployment Considerations
✓ Throughput: requests/second
✓ Latency: response time (p50, p95, p99)
✓ Availability: uptime and failover
✓ Scalability: horizontal and vertical
✓ Cost: infrastructure efficiency
✓ Monitoring: metrics and alerting
✓ Model versioning: A/B testing
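The latency percentiles above (p50, p95, p99) are computed over a window of recorded request times; a minimal pure-Python sketch using the nearest-rank method:

```python
def percentile(samples, p):
    """Nearest-rank p-th percentile of a list of latency samples (ms)."""
    ranked = sorted(samples)
    # Index of the value at or above which p% of samples fall
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]

# Recent request latencies in milliseconds (illustrative data)
latencies_ms = [12, 13, 14, 14, 15, 15, 16, 17, 18, 200]
p50 = percentile(latencies_ms, 50)   # typical request
p99 = percentile(latencies_ms, 99)   # tail latency, dominated by outliers
```

Note how a single slow outlier barely moves p50 but dominates p99, which is why tail percentiles, not averages, drive serving SLOs.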
TensorFlow Serving
What is TensorFlow Serving?
TensorFlow Serving (TF Serving):
├── Native TensorFlow optimization
├── Low-latency inference
├── Model versioning and rollback
├── Batching support
├── gRPC and REST APIs
└── Production battle-tested
Architecture
          Client
            │
 ┌─────────────────────┐
 │ REST/gRPC Endpoint  │
 └──────────┬──────────┘
            │
 ┌──────────────────────┐
 │ Manager (versioning) │
 └──────────┬───────────┘
            │
      ┌─────┴─────┐
      │           │
 ┌──────────┐ ┌──────────┐
 │Version 1 │ │Version 2 │
 │  Model   │ │  Model   │
 └──────────┘ └──────────┘
Setup and Installation
# Install TensorFlow Serving (Debian/Ubuntu; package is tensorflow-model-server)
echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install tensorflow-model-server

# Or Docker (8500 = gRPC, 8501 = REST)
docker run -p 8500:8500 -p 8501:8501 \
  -v /path/to/models/my_model:/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving
SavedModel Format
import tensorflow as tf

def create_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    return model

# Train (train_data/train_labels: your prepared training set)
model = create_model()
model.fit(train_data, train_labels, epochs=10)

# Save for serving (SavedModel format); the trailing "1" is the version number
tf.saved_model.save(model, '/models/mnist/1')

# Creates:
# /models/mnist/1/
# ├── assets/
# ├── saved_model.pb
# └── variables/
Deployment
# Directory structure
/models/
└── mnist/
    ├── 1/   (version 1)
    │   ├── saved_model.pb
    │   └── variables/
    └── 2/   (version 2)
        ├── saved_model.pb
        └── variables/

# Start serving (base path points at the model directory, not /models)
tensorflow_model_server \
  --port=8500 \
  --rest_api_port=8501 \
  --model_name=mnist \
  --model_base_path=/models/mnist
Inference
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

# gRPC client (plaintext channel for local testing; use
# grpc.secure_channel with credentials in production)
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Prepare request (data: a flattened 28x28 image as a float array)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'mnist'
request.model_spec.signature_name = 'serving_default'
request.inputs['input'].CopyFrom(
    tf.make_tensor_proto(data, dtype=tf.float32, shape=[1, 784])
)

# Predict
response = stub.Predict(request, timeout=10.0)
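The REST API on port 8501 accepts the same prediction with a JSON body of the documented `{"instances": [...]}` shape; a stdlib-only sketch (host, port, and model name match the server started above):

```python
import json
import urllib.request

def build_predict_payload(rows):
    """Build the TF Serving REST body: {"instances": [...]}."""
    return json.dumps({"instances": rows})

def predict_rest(rows, host="localhost", model="mnist"):
    """POST to the TF Serving REST endpoint and return predictions."""
    url = f"http://{host}:8501/v1/models/{model}:predict"
    req = urllib.request.Request(
        url,
        data=build_predict_payload(rows).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["predictions"]

# Example (requires a running server):
# predict_rest([[0.0] * 784])  # one flattened 28x28 image
```

REST is simpler to debug with curl; gRPC avoids JSON serialization overhead and is preferred for high-throughput paths.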
TorchServe
What is TorchServe?
TorchServe (PyTorch serving):
├── Optimized for PyTorch models
├── Multi-model serving
├── Custom handlers
├── A/B testing support
├── Metrics and monitoring
└── Active PyTorch community
Model Archive Creation
import torch
from ts.torch_handler.base_handler import BaseHandler

# Load the trained model (CustomModel defined in model.py)
model = CustomModel()
model.load_state_dict(torch.load('model.pth'))
model.eval()

# Custom handler (custom inference logic)
class ModelHandler(BaseHandler):
    def preprocess(self, data):
        # Decode the request payload into a tensor batch
        return processed_data

    def inference(self, data):
        # Run the model without tracking gradients
        with torch.no_grad():
            return self.model(data)

    def postprocess(self, inference_output):
        # Convert tensors to a JSON-serializable list
        return inference_output.tolist()
# Create model archive
torch-model-archiver \
--model-name my_model \
--version 1.0 \
--model-file model.py \
--serialized-file model.pth \
--handler model_handler.py \
--export-path /models
Deployment
# Start TorchServe
torchserve --start \
--model-store /models \
--ncs # No config snapshots
# Register model
curl -X POST \
"http://localhost:8081/models?url=my_model.mar&batch_size=4&max_batch_delay=100"
# Make prediction
curl -X POST \
-T test_image.jpg \
"http://localhost:8080/predictions/my_model"
Comparison: TensorFlow Serving vs TorchServe
Aspect            TensorFlow Serving   TorchServe
─────────────────────────────────────────────────
Model framework   TensorFlow only      PyTorch only
Ease of setup     Medium               Easy
Performance       Excellent            Excellent
Batching          Built-in             Built-in
Custom logic      Limited              Full control
Monitoring        Good                 Excellent
Community         Large                Growing
Multi-model       Limited              Excellent
KServe
What is KServe?
KServe = Kubernetes-native ML serving
├── Framework agnostic (TF, PyTorch, SKLearn, XGBoost)
├── Kubernetes CRDs
├── Auto-scaling (serverless)
├── A/B testing and canary
├── Explainability
└── Enterprise features
Architecture
┌──────────────────────────────────────┐
│          Kubernetes Cluster          │
├──────────────────────────────────────┤
│  ┌────────────────────────────────┐  │
│  │   InferenceService (KServe)    │  │
│  ├────────────────────────────────┤  │
│  │   ┌──────────┬───────────┐     │  │
│  │   │Predictor │ Explainer │     │  │
│  │   │  (Prod)  │ (Shadow)  │     │  │
│  │   └──────────┴───────────┘     │  │
│  │              │                 │  │
│  │  ┌─────────────────────────┐   │  │
│  │  │   KNative Autoscaling   │   │  │
│  │  └─────────────────────────┘   │  │
│  └────────────────────────────────┘  │
└──────────────────────────────────────┘
Installation
# Install KServe
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.10.0/kserve.yaml
# Install Knative (serverless)
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.9.0/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.9.0/serving-core.yaml
Deployment
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    # Canary: route 10% of traffic to the newest revision
    canaryTrafficPercent: 10
    sklearn:
      storageUri: s3://bucket/iris-model
      resources:
        requests:
          memory: "1Gi"
          cpu: "100m"
        limits:
          memory: "2Gi"
          cpu: "500m"
  explainer:
    # Explainer component (available explainer types vary by KServe version)
    shap:
      storageUri: s3://bucket/shap-explainer
Autoscaling
# Knative autoscaling is configured on the InferenceService itself
# rather than by creating a PodAutoscaler resource by hand
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    scaleMetric: rps    # scale by requests/sec
    scaleTarget: 100    # 100 requests/sec per pod
    sklearn:
      storageUri: s3://bucket/iris-model
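With a per-pod target like this, the autoscaler converges toward ceil(total load / target), clamped to the min/max bounds; a back-of-envelope capacity sketch (the 100 rps/pod target and 1-10 replica bounds are taken from the config above):

```python
import math

def pods_needed(total_rps, target_rps_per_pod=100, min_scale=1, max_scale=10):
    """Steady-state replica count the autoscaler converges toward."""
    wanted = math.ceil(total_rps / target_rps_per_pod)
    # Clamp to the configured scaling bounds
    return max(min_scale, min(max_scale, wanted))

# 750 rps at 100 rps/pod -> 8 pods; idle traffic still keeps minReplicas warm
```

Setting minReplicas to 0 enables scale-to-zero, which saves cost but adds cold-start latency to the first request.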
Production Deployment Checklist
Model Management
✓ Version control (which model, which data)
✓ Reproducible builds
✓ Rollback capability
✓ A/B testing setup
✓ Shadow mode for new models
✓ Performance baselines
Monitoring
✓ Request latency (p50, p95, p99)
✓ Throughput (requests/sec)
✓ Error rate and types
✓ Model accuracy on live data
✓ Data drift detection
✓ Resource utilization
Example: Monitoring with Prometheus
import time

from flask import Flask, jsonify, request
from prometheus_client import Counter, Gauge, Histogram

app = Flask(__name__)

# Define metrics
predictions = Counter(
    'model_predictions_total',
    'Total predictions',
    ['model', 'version']
)
latency = Histogram(
    'model_prediction_latency_seconds',
    'Prediction latency in seconds'
)
accuracy = Gauge(
    'model_accuracy',
    'Current model accuracy on test set'
)

@app.route('/predict', methods=['POST'])
def predict():
    start = time.time()
    result = model.predict(request.get_json())  # model: your loaded model

    # Log metrics
    predictions.labels(model='iris', version='1.0').inc()
    latency.observe(time.time() - start)
    return jsonify(result)
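The data-drift item in the checklist can also be fed into a Gauge; a common statistic is the Population Stability Index over binned feature values (a minimal sketch; the bins and the 0.2 alert threshold are illustrative conventions, not fixed rules):

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    expected/actual: lists of per-bin proportions, each summing to 1.
    Rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 investigate.
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total

train_bins = [0.25, 0.25, 0.25, 0.25]  # feature distribution at training time
live_bins = [0.10, 0.20, 0.30, 0.40]   # same bins measured on live traffic
drifted = psi(train_bins, live_bins) > 0.2
```

Exported as a Gauge per feature, this gives an alertable signal that the live input distribution has moved away from the training data.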
Cost Optimization
Infrastructure Costs
Comparison (monthly cost for 1M inference requests):
Setup                 Cost        Notes
───────────────────────────────────────────────
Single instance       $100-200    No autoscaling
Load balanced (3x)    $300-600    Manual scaling
Kubernetes (EKS)      $200-500    Autoscaling
KServe serverless     $150-400    Pay per request
Optimization Strategies
✓ Batch inference (reduce calls)
✓ Caching (Redis, memcached)
✓ Model compression (quantization, pruning)
✓ Edge deployment (reduce latency)
✓ Spot instances (60-90% savings)
✓ Right-sizing (match instance to workload)
Real-World Example: Multi-Model Serving
# KServe multi-model serving
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: multi-model-ensemble
spec:
  predictor:
    # Illustrative spec: exact multi-model fields vary by KServe version
    componentExtensionSpec:
      - name: model1
        spec:
          sklearn:
            storageUri: s3://bucket/model1
      - name: model2
        spec:
          pytorch:
            storageUri: s3://bucket/model2
      - name: model3
        spec:
          tensorflow:
            storageUri: s3://bucket/model3
  # Custom ensemble logic
  transformer:
    custom:
      container:
        image: ensemble:latest
        env:
          - name: MODELS
            value: "model1,model2,model3"
Glossary
- SavedModel: TensorFlow standard format
- Model Archive: TorchServe packaged model
- InferenceService: KServe Kubernetes resource
- Canary Deployment: Gradual rollout to subset
- Shadow Mode: Running new model without serving