Introduction
Deploying ML models in production requires specialized serving infrastructure. TensorFlow Serving, TorchServe, and KServe are three widely used solutions for scalable, reliable model serving.
This guide compares these platforms and covers production deployment patterns.
Model Serving Fundamentals
Deployment Architecture
┌─────────────────────┐
│ API Gateway │
│ (Load Balancer) │
└─────────┬───────────┘
│
┌─────┴─────┬──────────┬─────────┐
│ │ │ │
┌───────┐ ┌────────┐ ┌────────┐ ┌───────┐
│Server │ │Server │ │Server │ │Server │
│Model1 │ │Model2 │ │Model1 │ │Model3 │
└───────┘ └────────┘ └────────┘ └───────┘
│ │ │ │
└─────┬─────┴──────────┴─────────┘
│
┌─────────────────┐
│ Model Storage │
│ (S3/GCS) │
└─────────────────┘
Deployment Considerations
✅ Throughput: requests/second
✅ Latency: response time (p50, p95, p99; see the sketch after this list)
✅ Availability: uptime and failover
✅ Scalability: horizontal and vertical
✅ Cost: infrastructure efficiency
✅ Monitoring: metrics and alerting
✅ Model versioning: A/B testing
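Tail latency matters more than the average: one slow dependency can leave the mean looking healthy while p99 degrades. A minimal sketch of computing the percentiles above from recorded request timings (timings_ms is a stand-in for your own measurements):

import numpy as np

# Hypothetical per-request latencies in milliseconds
timings_ms = [12.1, 15.3, 11.8, 250.0, 13.2, 14.7, 12.9, 16.0]

for p in (50, 95, 99):
    print(f"p{p}: {np.percentile(timings_ms, p):.1f} ms")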
TensorFlow Serving
What is TensorFlow Serving?
TensorFlow Serving (TF Serving):
├── Native TensorFlow optimization
├── Low-latency inference
├── Model versioning and rollback
├── Batching support
├── gRPC and REST APIs
└── Production battle-tested
Architecture
Client
↓
┌─────────────────────┐
│ REST/gRPC Endpoint │
└────────────┬────────┘
│
┌────────────────────────┐
│ Manager (versioning) │
└────────────┬───────────┘
│
┌──────┴──────┐
│ │
┌──────────┐ ┌──────────┐
│Version 1 │ │Version 2 │
│Model │ │Model │
└──────────┘ └──────────┘
Setup and Installation
# Install TensorFlow Serving (Debian/Ubuntu)
echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install tensorflow-model-server

# Or Docker (8500 = gRPC, 8501 = REST)
docker run -p 8500:8500 -p 8501:8501 \
  -v /path/to/models/my_model:/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving
SavedModel Format
import tensorflow as tf

def create_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    return model

# Train (MNIST, flattened to 784 features and scaled to [0, 1])
(train_data, train_labels), _ = tf.keras.datasets.mnist.load_data()
train_data = train_data.reshape(-1, 784).astype('float32') / 255.0

model = create_model()
model.fit(train_data, train_labels, epochs=10)

# Save for serving (SavedModel format); the trailing "1" is the version
tf.saved_model.save(model, '/models/mnist/1')

# Creates:
# /models/mnist/1/
# ├── assets/
# ├── saved_model.pb
# └── variables/
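Before deploying, it is worth confirming what signature was actually exported, since both the gRPC and REST APIs resolve requests against it. A quick check, loading the SavedModel back in Python (saved_model_cli show --dir /models/mnist/1 --all gives the same information from the shell):

import tensorflow as tf

# Load the exported model and inspect its serving signature
loaded = tf.saved_model.load('/models/mnist/1')
sig = loaded.signatures['serving_default']
print(sig.structured_input_signature)  # input names and shapes
print(sig.structured_outputs)          # output names and shapes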
Deployment
# Directory structure
/models/
└── mnist/
    ├── 1/          (version 1)
    │   ├── saved_model.pb
    │   └── variables/
    └── 2/          (version 2)
        ├── saved_model.pb
        └── variables/

# Start serving (the newest version is loaded by default)
tensorflow_model_server \
  --port=8500 \
  --rest_api_port=8501 \
  --model_name=mnist \
  --model_base_path=/models/mnist
Inference
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

# gRPC client (plaintext channel; use secure_channel + TLS in production)
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Prepare request (dummy input; replace with real data)
data = np.random.rand(1, 784).astype(np.float32)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'mnist'
request.model_spec.signature_name = 'serving_default'
# 'input' must match the signature's input name (check with saved_model_cli)
request.inputs['input'].CopyFrom(
    tf.make_tensor_proto(data, dtype=tf.float32, shape=[1, 784])
)

# Predict
response = stub.Predict(request, timeout=10.0)
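TF Serving also exposes the same model over REST on the port given by --rest_api_port. A sketch using requests; the flat "instances" format works when the signature has a single input:

import numpy as np
import requests

data = np.random.rand(1, 784).astype('float32')

# REST predict call; the URL shape is /v1/models/<name>:predict
resp = requests.post(
    'http://localhost:8501/v1/models/mnist:predict',
    json={'instances': data.tolist()},
)
print(resp.json()['predictions'])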
TorchServe
What is TorchServe?
TorchServe (PyTorch serving):
├── Optimized for PyTorch models
├── Multi-model serving
├── Custom handlers
├── A/B testing support
├── Metrics and monitoring
└── Active PyTorch community
Model Archive Creation
import torch
from ts.torch_handler.base_handler import BaseHandler

# Load the trained model (CustomModel is your own nn.Module)
model = CustomModel()
model.load_state_dict(torch.load('model.pth'))
model.eval()

# Custom handler (custom inference logic), saved as model_handler.py
class ModelHandler(BaseHandler):
    def preprocess(self, data):
        # Turn the raw request payloads into a batched tensor
        return torch.as_tensor([row.get('data') or row.get('body') for row in data])

    def inference(self, data):
        # Run the model without tracking gradients
        with torch.no_grad():
            return self.model(data)

    def postprocess(self, inference_output):
        # TorchServe expects a list with one entry per request in the batch
        return inference_output.tolist()

# Create the model archive (shell command)
torch-model-archiver \
  --model-name my_model \
  --version 1.0 \
  --model-file model.py \
  --serialized-file model.pth \
  --handler model_handler.py \
  --export-path /models
Deployment
# Start TorchServe
torchserve --start \
  --model-store /models \
  --ncs   # no config snapshots

# Register the model (management API, port 8081); initial_workers is
# needed, otherwise the model is registered with zero workers
curl -X POST \
  "http://localhost:8081/models?url=my_model.mar&initial_workers=1&batch_size=4&max_batch_delay=100"

# Make a prediction (inference API, port 8080)
curl -X POST \
  -T test_image.jpg \
  "http://localhost:8080/predictions/my_model"
Comparison: TensorFlow Serving vs TorchServe
Aspect            TensorFlow Serving   TorchServe
─────────────────────────────────────────────────────
Model framework   TensorFlow only      PyTorch only
Ease of setup     Medium               Easy
Performance       Excellent            Excellent
Batching          Built-in             Built-in
Custom logic      Limited              Full control (handlers)
Monitoring        Good                 Excellent
Community         Large                Growing
Multi-model       Limited              Excellent
KServe
What is KServe?
KServe = Kubernetes-native ML serving
├── Framework agnostic (TF, PyTorch, SKLearn, XGBoost)
├── Kubernetes CRDs
├── Auto-scaling (serverless)
├── A/B testing and canary
├── Explainability
└── Enterprise features
Architecture
┌──────────────────────────────────────┐
│ Kubernetes Cluster │
├──────────────────────────────────────┤
│ ┌──────────────────────────────┐ │
│ │ InferenceService (KServe) │ │
│ ├──────────────────────────────┤ │
│ │ ┌──────────┬──────────┐ │ │
│ │ │Predictor │Explainer │ │ │
│ │ │ (Prod) │ (Shadow) │ │ │
│ │ └──────────┴──────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────┐ │ │
│ │ │ Knative Autoscaling │ │ │
│ │ └─────────────────────────┘ │ │
│ └──────────────────────────────┘ │
└──────────────────────────────────────┘
Installation
# Install Knative Serving first (KServe builds on it for serverless)
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.9.0/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.9.0/serving-core.yaml

# Then install KServe (cert-manager is also required; see the KServe docs)
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.10.0/kserve.yaml
Deployment
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    # Canary: route 10% of traffic to the latest revision
    canaryTrafficPercent: 10
    sklearn:
      storageUri: s3://bucket/iris-model
      resources:
        requests:
          memory: "1Gi"
          cpu: "100m"
        limits:
          memory: "2Gi"
          cpu: "500m"
  # Explainer (Alibi is one of the explainer types KServe supports)
  explainer:
    alibi:
      type: AnchorTabular
      storageUri: s3://bucket/explainer
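Once the InferenceService reports Ready, it serves the V1 inference protocol. Calls through the cluster ingress are routed by Host header, so the service hostname must be sent explicitly. A sketch with placeholder ingress values (take the real hostname from kubectl get inferenceservice sklearn-iris):

import requests

# Placeholders: substitute your ingress address and service hostname
ingress = 'http://<INGRESS_HOST>:<INGRESS_PORT>'
host = 'sklearn-iris.default.example.com'

resp = requests.post(
    f'{ingress}/v1/models/sklearn-iris:predict',
    headers={'Host': host},
    json={'instances': [[6.8, 2.8, 4.8, 1.4]]},
)
print(resp.json())  # {'predictions': [...]}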
Autoscaling
# Autoscaling is configured on the InferenceService itself;
# KServe wires these fields into the Knative autoscaler
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    scaleMetric: rps   # scale by requests/sec
    scaleTarget: 100   # target 100 requests/sec per pod
    sklearn:
      storageUri: s3://bucket/iris-model
Production Deployment Checklist
Model Management
✅ Version control (which model, which data)
✅ Reproducible builds
✅ Rollback capability
✅ A/B testing setup
✅ Shadow mode for new models
✅ Performance baselines
Monitoring
✅ Request latency (p50, p95, p99)
✅ Throughput (requests/sec)
✅ Error rate and types
✅ Model accuracy on live data
✅ Data drift detection (see the sketch after this list)
✅ Resource utilization
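For the drift item above, a lightweight starting point is a per-feature two-sample Kolmogorov-Smirnov test comparing live inputs against the training distribution. A minimal sketch, assuming both are available as arrays (the threshold alpha is an assumption to tune):

import numpy as np
from scipy.stats import ks_2samp

def detect_drift(train_col, live_col, alpha=0.01):
    # Flag drift when the KS test rejects "same distribution"
    stat, p_value = ks_2samp(train_col, live_col)
    return p_value < alpha, p_value

# Hypothetical data: live feature shifted relative to training
train = np.random.normal(0.0, 1.0, 10_000)
live = np.random.normal(0.5, 1.0, 1_000)
print(detect_drift(train, live))  # (True, tiny p-value)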
Example: Monitoring with Prometheus
import time

from flask import Flask, Response, jsonify, request
from prometheus_client import (CONTENT_TYPE_LATEST, Counter, Gauge,
                               Histogram, generate_latest)

app = Flask(__name__)

# Define metrics
predictions = Counter(
    'model_predictions_total',
    'Total predictions',
    ['model', 'version']
)
latency = Histogram(
    'model_prediction_latency_seconds',
    'Prediction latency in seconds'
)
accuracy = Gauge(
    'model_accuracy',
    'Current model accuracy on test set'
)

@app.route('/predict', methods=['POST'])
def predict():
    start = time.time()
    data = request.get_json()
    result = model.predict(data)   # `model` is your loaded model object

    # Log metrics
    predictions.labels(model='iris', version='1.0').inc()
    latency.observe(time.time() - start)
    return jsonify(result.tolist())

# Expose the metrics for Prometheus to scrape
@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)
Cost Optimization
Infrastructure Costs
Comparison (approximate monthly cost for 1M inference requests):

Setup                 Cost       Notes
───────────────────────────────────────────────
Single instance       $100-200   No autoscaling
Load balanced (3x)    $300-600   Manual scaling
Kubernetes (EKS)      $200-500   Autoscaling
KServe serverless     $150-400   Pay per request
Optimization Strategies
✅ Batch inference (reduce calls)
✅ Caching (Redis, memcached; see the sketch after this list)
✅ Model compression (quantization, pruning)
✅ Edge deployment (reduce latency)
✅ Spot instances (60-90% savings)
✅ Right-sizing (match instance to workload)
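For the caching strategy above, repeated identical inputs can be answered without touching the model at all. A hedged sketch with Redis, keying on a hash of the serialized input (the key scheme and TTL are illustrative choices, not a standard):

import hashlib
import json

import redis

r = redis.Redis(host='localhost', port=6379)

def cached_predict(model, features, ttl_seconds=3600):
    # Deterministic cache key derived from the input payload
    key = 'pred:' + hashlib.sha256(
        json.dumps(features, sort_keys=True).encode()
    ).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    result = model.predict([features]).tolist()
    r.setex(key, ttl_seconds, json.dumps(result))
    return result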
Real-World Example: Multi-Model Serving
# KServe multi-model ensemble: one InferenceService per model,
# plus a custom ensemble service that fans out to all three
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: model1
spec:
  predictor:
    sklearn:
      storageUri: s3://bucket/model1
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: model2
spec:
  predictor:
    pytorch:
      storageUri: s3://bucket/model2
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: model3
spec:
  predictor:
    tensorflow:
      storageUri: s3://bucket/model3
---
# Custom ensemble logic lives in its own container
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: multi-model-ensemble
spec:
  predictor:
    containers:
      - name: kserve-container
        image: ensemble:latest
        env:
          - name: MODELS
            value: "model1,model2,model3"
Glossary
- SavedModel: TensorFlow standard format
- Model Archive: TorchServe packaged model
- InferenceService: KServe Kubernetes resource
- Canary Deployment: Gradual rollout to a subset of traffic
- Shadow Mode: Running a new model on live traffic without serving its results to users