Introduction
Deploying ML models in production requires specialized infrastructure. TensorFlow Serving, TorchServe, and KServe are the leading solutions for scalable, reliable model serving.
This guide compares these platforms and covers production deployment patterns.
Model Serving Fundamentals
Deployment Architecture
          ┌─────────────────────┐
          │     API Gateway     │
          │   (Load Balancer)   │
          └──────────┬──────────┘
                     │
      ┌─────────┬────┴────┬─────────┐
      │         │         │         │
  ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
  │Server │ │Server │ │Server │ │Server │
  │Model1 │ │Model2 │ │Model1 │ │Model3 │
  └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘
      │         │         │         │
      └─────────┴────┬────┴─────────┘
                     │
           ┌─────────────────┐
           │  Model Storage  │
           │    (S3/GCS)     │
           └─────────────────┘
Deployment Considerations
✓ Throughput: requests/second
✓ Latency: response time (p50, p95, p99)
✓ Availability: uptime and failover
✓ Scalability: horizontal and vertical
✓ Cost: infrastructure efficiency
✓ Monitoring: metrics and alerting
✓ Model versioning: A/B testing
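The latency percentiles above (p50, p95, p99) are computed over a window of recorded request times; a minimal pure-Python sketch using the nearest-rank method:

```python
def percentile(samples, p):
    """Nearest-rank p-th percentile of a list of latency samples (ms)."""
    ranked = sorted(samples)
    # Index of the value at or above which p% of samples fall
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]

# Recent request latencies in milliseconds (illustrative data)
latencies_ms = [12, 13, 14, 14, 15, 15, 16, 17, 18, 200]
p50 = percentile(latencies_ms, 50)   # typical request
p99 = percentile(latencies_ms, 99)   # tail latency, dominated by outliers
```

Note how a single slow outlier barely moves p50 but dominates p99, which is why tail percentiles, not averages, drive serving SLOs.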
TensorFlow Serving
What is TensorFlow Serving?
TensorFlow Serving (TF Serving):
├── Native TensorFlow optimization
├── Low-latency inference
├── Model versioning and rollback
├── Batching support
├── gRPC and REST APIs
└── Production battle-tested
Architecture
          Client
            │
 ┌─────────────────────┐
 │ REST/gRPC Endpoint  │
 └──────────┬──────────┘
            │
 ┌──────────────────────┐
 │ Manager (versioning) │
 └──────────┬───────────┘
            │
      ┌─────┴─────┐
      │           │
 ┌──────────┐ ┌──────────┐
 │Version 1 │ │Version 2 │
 │  Model   │ │  Model   │
 └──────────┘ └──────────┘
Setup and Installation
# Install TensorFlow Serving (Debian/Ubuntu; package is tensorflow-model-server)
echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install tensorflow-model-server

# Or Docker (8500 = gRPC, 8501 = REST)
docker run -p 8500:8500 -p 8501:8501 \
  -v /path/to/models/my_model:/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving
SavedModel Format
import tensorflow as tf

def create_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    return model

# Train (train_data/train_labels: your prepared training set)
model = create_model()
model.fit(train_data, train_labels, epochs=10)

# Save for serving (SavedModel format); the trailing "1" is the version number
tf.saved_model.save(model, '/models/mnist/1')

# Creates:
# /models/mnist/1/
# ├── assets/
# ├── saved_model.pb
# └── variables/
Deployment
# Directory structure
/models/
└── mnist/
    ├── 1/   (version 1)
    │   ├── saved_model.pb
    │   └── variables/
    └── 2/   (version 2)
        ├── saved_model.pb
        └── variables/

# Start serving (base path points at the model directory, not /models)
tensorflow_model_server \
  --port=8500 \
  --rest_api_port=8501 \
  --model_name=mnist \
  --model_base_path=/models/mnist
Inference
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

# gRPC client (plaintext channel for local testing; use
# grpc.secure_channel with credentials in production)
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Prepare request (data: a flattened 28x28 image as a float array)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'mnist'
request.model_spec.signature_name = 'serving_default'
request.inputs['input'].CopyFrom(
    tf.make_tensor_proto(data, dtype=tf.float32, shape=[1, 784])
)

# Predict
response = stub.Predict(request, timeout=10.0)
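The REST API on port 8501 accepts the same prediction with a JSON body of the documented `{"instances": [...]}` shape; a stdlib-only sketch (host, port, and model name match the server started above):

```python
import json
import urllib.request

def build_predict_payload(rows):
    """Build the TF Serving REST body: {"instances": [...]}."""
    return json.dumps({"instances": rows})

def predict_rest(rows, host="localhost", model="mnist"):
    """POST to the TF Serving REST endpoint and return predictions."""
    url = f"http://{host}:8501/v1/models/{model}:predict"
    req = urllib.request.Request(
        url,
        data=build_predict_payload(rows).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["predictions"]

# Example (requires a running server):
# predict_rest([[0.0] * 784])  # one flattened 28x28 image
```

REST is simpler to debug with curl; gRPC avoids JSON serialization overhead and is preferred for high-throughput paths.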
TorchServe
What is TorchServe?
TorchServe (PyTorch serving):
├── Optimized for PyTorch models
├── Multi-model serving
├── Custom handlers
├── A/B testing support
├── Metrics and monitoring
└── Active PyTorch community
Model Archive Creation
import torch
from ts.torch_handler.base_handler import BaseHandler

# Load the trained model (CustomModel defined in model.py)
model = CustomModel()
model.load_state_dict(torch.load('model.pth'))
model.eval()

# Custom handler (custom inference logic)
class ModelHandler(BaseHandler):
    def preprocess(self, data):
        # Decode the request payload into a tensor batch
        return processed_data

    def inference(self, data):
        # Run the model without tracking gradients
        with torch.no_grad():
            return self.model(data)

    def postprocess(self, inference_output):
        # Convert tensors to a JSON-serializable list
        return inference_output.tolist()
# Create model archive
torch-model-archiver \
--model-name my_model \
--version 1.0 \
--model-file model.py \
--serialized-file model.pth \
--handler model_handler.py \
--export-path /models
Deployment
# Start TorchServe
torchserve --start \
--model-store /models \
--ncs # No config snapshots
# Register model
curl -X POST \
"http://localhost:8081/models?url=my_model.mar&batch_size=4&max_batch_delay=100"
# Make prediction
curl -X POST \
-T test_image.jpg \
"http://localhost:8080/predictions/my_model"
Comparison: TensorFlow Serving vs TorchServe
Aspect            TensorFlow Serving   TorchServe
─────────────────────────────────────────────────
Model framework   TensorFlow only      PyTorch only
Ease of setup     Medium               Easy
Performance       Excellent            Excellent
Batching          Built-in             Built-in
Custom logic      Limited              Full control
Monitoring        Good                 Excellent
Community         Large                Growing
Multi-model       Limited              Excellent
KServe
What is KServe?
KServe = Kubernetes-native ML serving
├── Framework agnostic (TF, PyTorch, SKLearn, XGBoost)
├── Kubernetes CRDs
├── Auto-scaling (serverless)
├── A/B testing and canary
├── Explainability
└── Enterprise features
Architecture
┌──────────────────────────────────────┐
│          Kubernetes Cluster          │
├──────────────────────────────────────┤
│  ┌────────────────────────────────┐  │
│  │   InferenceService (KServe)    │  │
│  ├────────────────────────────────┤  │
│  │   ┌──────────┬───────────┐     │  │
│  │   │Predictor │ Explainer │     │  │
│  │   │  (Prod)  │ (Shadow)  │     │  │
│  │   └──────────┴───────────┘     │  │
│  │              │                 │  │
│  │  ┌─────────────────────────┐   │  │
│  │  │   KNative Autoscaling   │   │  │
│  │  └─────────────────────────┘   │  │
│  └────────────────────────────────┘  │
└──────────────────────────────────────┘
Installation
# Install KServe
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.10.0/kserve.yaml
# Install Knative (serverless)
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.9.0/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.9.0/serving-core.yaml
Deployment
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    # Canary: route 10% of traffic to the newest revision
    canaryTrafficPercent: 10
    sklearn:
      storageUri: s3://bucket/iris-model
      resources:
        requests:
          memory: "1Gi"
          cpu: "100m"
        limits:
          memory: "2Gi"
          cpu: "500m"
  explainer:
    # Explainer component (available explainer types vary by KServe version)
    shap:
      storageUri: s3://bucket/shap-explainer
Autoscaling
# Knative autoscaling is configured on the InferenceService itself
# rather than by creating a PodAutoscaler resource by hand
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    scaleMetric: rps    # scale by requests/sec
    scaleTarget: 100    # 100 requests/sec per pod
    sklearn:
      storageUri: s3://bucket/iris-model
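With a per-pod target like this, the autoscaler converges toward ceil(total load / target), clamped to the min/max bounds; a back-of-envelope capacity sketch (the 100 rps/pod target and 1-10 replica bounds are taken from the config above):

```python
import math

def pods_needed(total_rps, target_rps_per_pod=100, min_scale=1, max_scale=10):
    """Steady-state replica count the autoscaler converges toward."""
    wanted = math.ceil(total_rps / target_rps_per_pod)
    # Clamp to the configured scaling bounds
    return max(min_scale, min(max_scale, wanted))

# 750 rps at 100 rps/pod -> 8 pods; idle traffic still keeps minReplicas warm
```

Setting minReplicas to 0 enables scale-to-zero, which saves cost but adds cold-start latency to the first request.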
Production Deployment Checklist
Model Management
✓ Version control (which model, which data)
✓ Reproducible builds
✓ Rollback capability
✓ A/B testing setup
✓ Shadow mode for new models
✓ Performance baselines
Monitoring
✓ Request latency (p50, p95, p99)
✓ Throughput (requests/sec)
✓ Error rate and types
✓ Model accuracy on live data
✓ Data drift detection
✓ Resource utilization
Example: Monitoring with Prometheus
import time

from flask import Flask, jsonify, request
from prometheus_client import Counter, Gauge, Histogram

app = Flask(__name__)

# Define metrics
predictions = Counter(
    'model_predictions_total',
    'Total predictions',
    ['model', 'version']
)
latency = Histogram(
    'model_prediction_latency_seconds',
    'Prediction latency in seconds'
)
accuracy = Gauge(
    'model_accuracy',
    'Current model accuracy on test set'
)

@app.route('/predict', methods=['POST'])
def predict():
    start = time.time()
    result = model.predict(request.get_json())  # model: your loaded model

    # Log metrics
    predictions.labels(model='iris', version='1.0').inc()
    latency.observe(time.time() - start)
    return jsonify(result)
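The data-drift item in the checklist can also be fed into a Gauge; a common statistic is the Population Stability Index over binned feature values (a minimal sketch; the bins and the 0.2 alert threshold are illustrative conventions, not fixed rules):

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    expected/actual: lists of per-bin proportions, each summing to 1.
    Rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 investigate.
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total

train_bins = [0.25, 0.25, 0.25, 0.25]  # feature distribution at training time
live_bins = [0.10, 0.20, 0.30, 0.40]   # same bins measured on live traffic
drifted = psi(train_bins, live_bins) > 0.2
```

Exported as a Gauge per feature, this gives an alertable signal that the live input distribution has moved away from the training data.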
Cost Optimization
Infrastructure Costs
Comparison (monthly cost for 1M inference requests):
Setup                 Cost        Notes
───────────────────────────────────────────────
Single instance       $100-200    No autoscaling
Load balanced (3x)    $300-600    Manual scaling
Kubernetes (EKS)      $200-500    Autoscaling
KServe serverless     $150-400    Pay per request
Optimization Strategies
✓ Batch inference (reduce calls)
✓ Caching (Redis, memcached)
✓ Model compression (quantization, pruning)
✓ Edge deployment (reduce latency)
✓ Spot instances (60-90% savings)
✓ Right-sizing (match instance to workload)
Real-World Example: Multi-Model Serving
# KServe multi-model serving
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: multi-model-ensemble
spec:
  predictor:
    # Illustrative spec: exact multi-model fields vary by KServe version
    componentExtensionSpec:
      - name: model1
        spec:
          sklearn:
            storageUri: s3://bucket/model1
      - name: model2
        spec:
          pytorch:
            storageUri: s3://bucket/model2
      - name: model3
        spec:
          tensorflow:
            storageUri: s3://bucket/model3
  # Custom ensemble logic
  transformer:
    custom:
      container:
        image: ensemble:latest
        env:
          - name: MODELS
            value: "model1,model2,model3"
Glossary
- SavedModel: TensorFlow standard format
- Model Archive: TorchServe packaged model
- InferenceService: KServe Kubernetes resource
- Canary Deployment: Gradual rollout to subset
- Shadow Mode: Running new model without serving