⚡ Calmops

Model Deployment: TensorFlow Serving, TorchServe, KServe

Introduction

Deploying ML models in production requires specialized infrastructure. TensorFlow Serving, TorchServe, and KServe are the leading solutions for scalable, reliable model serving.

This guide compares these platforms and covers production deployment patterns.


Model Serving Fundamentals

Deployment Architecture

┌─────────────────────┐
│   API Gateway       │
│   (Load Balancer)   │
└─────────┬───────────┘
          │
    ┌─────┴─────┬──────────┬─────────┐
    │           │          │         │
┌───────┐  ┌────────┐  ┌────────┐  ┌───────┐
│Server │  │Server  │  │Server  │  │Server │
│Model1 │  │Model2  │  │Model1  │  │Model3 │
└───────┘  └────────┘  └────────┘  └───────┘
    │           │          │         │
    └─────┬─────┴──────────┴─────────┘
          │
    ┌─────────────────┐
    │  Model Storage  │
    │  (S3/GCS)       │
    └─────────────────┘

Deployment Considerations

✅ Throughput: requests/second
✅ Latency: response time (p50, p95, p99)
✅ Availability: uptime and failover
✅ Scalability: horizontal and vertical
✅ Cost: infrastructure efficiency
✅ Monitoring: metrics and alerting
✅ Model versioning: A/B testing
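
The latency percentiles in the checklist above can be computed directly from raw request timings. A small sketch using only the standard library; `latency_percentiles` is a hypothetical helper name:

```python
import statistics

def latency_percentiles(samples_ms):
    """Summarize request latencies (in ms) at the usual SLO points."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Example: for latencies 1..100 ms, p50 is 50.5 ms
summary = latency_percentiles(list(range(1, 101)))
```

In production these numbers usually come from a metrics system (e.g. a Prometheus histogram, shown later) rather than in-process lists.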

TensorFlow Serving

What is TensorFlow Serving?

TensorFlow Serving (TF Serving):
├── Native TensorFlow optimization
├── Low-latency inference
├── Model versioning and rollback
├── Batching support
├── gRPC and REST APIs
└── Production battle-tested

Architecture

Client
  ↓
┌─────────────────────┐
│  REST/gRPC Endpoint │
└──────────┬──────────┘
           │
┌──────────────────────┐
│ Manager (versioning) │
└──────────┬───────────┘
           │
     ┌─────┴──────┐
     │            │
┌──────────┐  ┌──────────┐
│Version 1 │  │Version 2 │
│Model     │  │Model     │
└──────────┘  └──────────┘

Setup and Installation

# Install TensorFlow Serving (Debian/Ubuntu)
echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install tensorflow-model-server

# Or Docker (8500 = gRPC, 8501 = REST)
docker run -p 8500:8500 -p 8501:8501 \
  -v /path/to/models/my_model:/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving

SavedModel Format

import tensorflow as tf

def create_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    return model

# Train (train_data/train_labels: e.g. flattened MNIST images and digit labels)
model = create_model()
model.fit(train_data, train_labels, epochs=10)

# Save for serving (SavedModel format)
tf.saved_model.save(model, '/models/mnist/1')
# Creates:
# /models/mnist/1/
# ├── assets/
# ├── saved_model.pb
# └── variables/

Deployment

# Directory structure
/models/
└── mnist/
    ├── 1/  (version 1)
    │   ├── saved_model.pb
    │   └── variables/
    └── 2/  (version 2)
        ├── saved_model.pb
        └── variables/

# Start serving (the highest version number is loaded by default)
tensorflow_model_server \
  --port=8500 \
  --rest_api_port=8501 \
  --model_name=mnist \
  --model_base_path=/models/mnist
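
Batching, listed among TF Serving's features above, is enabled with two extra server flags plus a parameters file. A sketch, assuming the protobuf text format TF Serving expects:

```shell
# Enable server-side request batching
tensorflow_model_server \
  --port=8500 \
  --rest_api_port=8501 \
  --model_name=mnist \
  --model_base_path=/models/mnist \
  --enable_batching=true \
  --batching_parameters_file=/config/batching.conf

# /config/batching.conf (protobuf text format):
# max_batch_size { value: 32 }
# batch_timeout_micros { value: 1000 }
# max_enqueued_batches { value: 100 }
# num_batch_threads { value: 4 }
```

Larger `max_batch_size` raises throughput at the cost of tail latency; `batch_timeout_micros` bounds how long a request waits for a batch to fill.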

Inference

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

# gRPC client (plaintext for local serving; use
# grpc.secure_channel + ssl_channel_credentials() for TLS endpoints)
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Prepare request
data = np.zeros((1, 784), dtype=np.float32)  # replace with a real input
request = predict_pb2.PredictRequest()
request.model_spec.name = 'mnist'
request.model_spec.signature_name = 'serving_default'
# The input key must match the signature (inspect with saved_model_cli)
request.inputs['input'].CopyFrom(
    tf.make_tensor_proto(data, dtype=tf.float32, shape=[1, 784])
)

# Predict with a 10-second deadline
response = stub.Predict(request, timeout=10.0)
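
The same prediction can also go over TF Serving's REST endpoint on port 8501, which accepts a JSON body of the form `{"instances": [...]}` at `/v1/models/<name>:predict`. A standard-library-only sketch; `build_predict_request` is a hypothetical helper that constructs the request without sending it:

```python
import json
from urllib import request as urlrequest

def build_predict_request(instances, host="localhost", port=8501,
                          model="mnist", version=None):
    """Build (but do not send) a TF Serving REST :predict request."""
    version_part = f"/versions/{version}" if version is not None else ""
    url = f"http://{host}:{port}/v1/models/{model}{version_part}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urlrequest.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

req = build_predict_request([[0.0] * 784])
# Send with urlrequest.urlopen(req); the response body is JSON
# of the form {"predictions": [...]}
```

REST is easier to debug with curl; gRPC is the better choice for high-throughput, low-latency paths.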

TorchServe

What is TorchServe?

TorchServe (PyTorch serving):
├── Optimized for PyTorch models
├── Multi-model serving
├── Custom handlers
├── A/B testing support
├── Metrics and monitoring
└── Active PyTorch community

Model Archive Creation

import torch
from ts.torch_handler.base_handler import BaseHandler

# Load trained weights
model = CustomModel()
model.load_state_dict(torch.load('model.pth'))
model.eval()

# Custom handler (save as model_handler.py)
class ModelHandler(BaseHandler):
    def preprocess(self, data):
        # Decode and normalize the raw request payload
        return processed_data

    def inference(self, data):
        # Run the model forward pass
        with torch.no_grad():
            return self.model(data)

    def postprocess(self, inference_output):
        # Convert tensors to a JSON-serializable response
        return formatted_output

# Create the model archive (run in a shell)
torch-model-archiver \
  --model-name my_model \
  --version 1.0 \
  --model-file model.py \
  --serialized-file model.pth \
  --handler model_handler.py \
  --export-path /models

Deployment

# Start TorchServe
torchserve --start \
  --model-store /models \
  --ncs  # No config snapshots

# Register model (url is relative to the model store;
# initial_workers is required to actually serve traffic)
curl -X POST \
  "http://localhost:8081/models?url=my_model.mar&initial_workers=1&batch_size=4&max_batch_delay=100"

# Make prediction
curl -X POST \
  -T test_image.jpg \
  "http://localhost:8080/predictions/my_model"

Comparison: TensorFlow Serving vs TorchServe

Aspect              TensorFlow Serving    TorchServe
─────────────────────────────────────────────────────
Model Framework     TensorFlow only       PyTorch only
Ease of Setup       Medium                Easy
Performance         Excellent             Excellent
Batching            Built-in              Built-in
Custom Logic        Limited               Full control
Monitoring          Good                  Excellent
Community           Large                 Growing
Multi-model         Limited               Excellent

KServe

What is KServe?

KServe = Kubernetes-native ML serving
├── Framework agnostic (TF, PyTorch, SKLearn, XGBoost)
├── Kubernetes CRDs
├── Auto-scaling (serverless)
├── A/B testing and canary
├── Explainability
└── Enterprise features

Architecture

┌──────────────────────────────────────┐
│       Kubernetes Cluster             │
├──────────────────────────────────────┤
│  ┌──────────────────────────────┐    │
│  │  InferenceService (KServe)   │    │
│  ├──────────────────────────────┤    │
│  │  ┌──────────┬──────────┐     │    │
│  │  │Predictor │Explainer │     │    │
│  │  │ (Prod)   │ (Shadow) │     │    │
│  │  └──────────┴──────────┘     │    │
│  │                              │    │
│  │  ┌─────────────────────────┐ │    │
│  │  │ Knative Autoscaling     │ │    │
│  │  └─────────────────────────┘ │    │
│  └──────────────────────────────┘    │
└──────────────────────────────────────┘

Installation

# Install KServe
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.10.0/kserve.yaml

# Install Knative (serverless)
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.9.0/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.9.0/serving-core.yaml

Deployment

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    # Canary: route 10% of traffic to the latest revision
    canaryTrafficPercent: 10
    sklearn:
      storageUri: s3://bucket/iris-model
      resources:
        requests:
          memory: "1Gi"
          cpu: "100m"
        limits:
          memory: "2Gi"
          cpu: "500m"

  explainer:
    alibi:
      type: AnchorTabular
      storageUri: s3://bucket/explainer
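
Conceptually, `canaryTrafficPercent` is a weighted coin flip per request between the stable and canary revisions. A toy sketch of the split (not KServe internals; `pick_revision` is a hypothetical name):

```python
import random

def pick_revision(canary_traffic_percent=10, rng=random):
    """Route one request: 'canary' with the given probability,
    else 'stable' (mimics KServe's canaryTrafficPercent split)."""
    return ("canary"
            if rng.random() * 100 < canary_traffic_percent
            else "stable")

# Over many requests, roughly 10% land on the canary revision
```

If the canary's error rate and latency hold up, the percentage is raised stepwise until it serves 100% of traffic.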

Autoscaling

# Autoscaling is configured on the InferenceService itself;
# KServe passes these settings through to Knative:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    scaleMetric: rps   # scale by requests/sec
    scaleTarget: 100   # target 100 requests/sec per pod
    sklearn:
      storageUri: s3://bucket/iris-model

Production Deployment Checklist

Model Management

✅ Version control (which model, which data)
✅ Reproducible builds
✅ Rollback capability
✅ A/B testing setup
✅ Shadow mode for new models
✅ Performance baselines

Monitoring

✅ Request latency (p50, p95, p99)
✅ Throughput (requests/sec)
✅ Error rate and types
✅ Model accuracy on live data
✅ Data drift detection
✅ Resource utilization
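
For the drift-detection item above, even a crude statistical check catches gross input shifts. A minimal sketch for a single numeric feature, using a z-test on the mean (`mean_shift_drift` is a hypothetical helper; production systems typically use tests like Kolmogorov-Smirnov or PSI per feature):

```python
import statistics

def mean_shift_drift(baseline, live, threshold=3.0):
    """Flag drift when the live mean sits more than `threshold`
    standard errors away from the baseline mean."""
    mu = statistics.fmean(baseline)
    se = statistics.stdev(baseline) / len(baseline) ** 0.5
    z = abs(statistics.fmean(live) - mu) / se
    return z > threshold

# A small shift passes; a large shift trips the alert
base = list(range(100))
mean_shift_drift(base, [x + 1 for x in base])   # False
mean_shift_drift(base, [x + 10 for x in base])  # True
```

The `baseline` window comes from training data or a known-good serving period; alerts should feed the same system as the latency and error-rate metrics.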

Example: Monitoring with Prometheus

import time

from flask import Flask, jsonify, request
from prometheus_client import Counter, Gauge, Histogram

app = Flask(__name__)

# Define metrics
predictions = Counter(
    'model_predictions_total',
    'Total predictions',
    ['model', 'version']
)

latency = Histogram(
    'model_prediction_latency_seconds',
    'Prediction latency in seconds'
)

accuracy = Gauge(
    'model_accuracy',
    'Current model accuracy on test set'
)

@app.route('/predict', methods=['POST'])
def predict():
    start = time.time()

    data = request.get_json()
    result = model.predict(data)  # model: your loaded model object

    # Log metrics
    predictions.labels(model='iris', version='1.0').inc()
    latency.observe(time.time() - start)

    return jsonify(result)

Cost Optimization

Infrastructure Costs

Illustrative monthly cost for ~1M inference requests:

Setup                    Cost        Notes
───────────────────────────────────────────────
Single Instance         $100-200    No autoscaling
Load Balanced (3x)      $300-600    Manual scaling
Kubernetes (EKS)        $200-500    Autoscaling
KServe Serverless       $150-400    Scales to zero when idle

Optimization Strategies

✅ Batch inference (reduce calls)
✅ Caching (Redis, memcached)
✅ Model compression (quantization, pruning)
✅ Edge deployment (reduce latency)
✅ Spot instances (60-90% savings)
✅ Right-sizing (match instance to workload)
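
The caching item above can start as simply as an in-process LRU keyed on the request features, so repeated identical requests never hit the model. A sketch with a toy stand-in model (`make_cached_predictor` is a hypothetical helper; at scale a shared cache like Redis or memcached, as listed, replaces the in-process dict):

```python
import functools
import json

def make_cached_predictor(predict_fn, maxsize=1024):
    """Wrap a predict function with an LRU cache keyed on the
    JSON-serialized feature vector."""
    @functools.lru_cache(maxsize=maxsize)
    def _cached(key):
        return predict_fn(json.loads(key))

    def predict(features):
        # sort_keys makes equal dicts serialize identically
        return _cached(json.dumps(features, sort_keys=True))

    predict.cache_info = _cached.cache_info  # expose hit/miss stats
    return predict

predict = make_cached_predictor(lambda xs: sum(xs))  # toy "model"
predict([1, 2, 3])
predict([1, 2, 3])  # second call is served from the cache
```

Caching only pays off when inputs repeat (e.g. popular items, identical images); for unique inputs it adds overhead without savings.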

Real-World Example: Multi-Model Serving

# A KServe InferenceService runs one predictor; true multi-model
# serving uses ModelMesh. An ensemble over several models is
# typically packaged as a custom predictor container:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: multi-model-ensemble
spec:
  predictor:
    containers:
      - name: kserve-container
        image: ensemble:latest   # custom ensemble logic
        env:
          - name: MODELS
            value: "model1,model2,model3"
          - name: MODEL_URIS
            value: "s3://bucket/model1,s3://bucket/model2,s3://bucket/model3"

Glossary

  • SavedModel: TensorFlow standard format
  • Model Archive: TorchServe packaged model
  • InferenceService: KServe Kubernetes resource
  • Canary Deployment: Gradual rollout to subset
  • Shadow Mode: Running a new model on live traffic without returning its predictions to users
