Service meshes manage service-to-service communication in Kubernetes, providing traffic control, security, and observability.
Service Mesh Fundamentals
What is a Service Mesh?
A service mesh is a dedicated infrastructure layer handling service-to-service communication:
┌─────────────────────────────┐
│         Application         │
└──────────────┬──────────────┘
               │
┌──────────────▼──────────────┐
│    Service Mesh (Envoy)     │
│ - Load balancing            │
│ - Traffic routing           │
│ - mTLS encryption           │
│ - Observability             │
│ - Rate limiting             │
└──────────────┬──────────────┘
               │
┌──────────────▼──────────────┐
│       Other Services        │
└─────────────────────────────┘
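Concretely, the mesh gets into the request path by injecting a proxy container next to each application container, so every meshed pod runs two containers and all traffic in and out passes through the proxy. A simplified sketch of what an injected pod spec ends up looking like (names and image tags are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: api-pod
spec:
  containers:
  - name: app          # the application, unchanged
    image: example/api:1.0
  - name: istio-proxy  # injected Envoy sidecar that intercepts traffic
    image: istio/proxyv2:1.15.0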
Istio Architecture
Components
# Istio control plane (runs on Kubernetes)
apiVersion: v1
kind: Namespace
metadata:
  name: istio-system
---
# Istiod - unified control plane
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istiod
  namespace: istio-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: istiod
  template:
    metadata:
      labels:
        app: istiod
    spec:
      containers:
      - name: discovery
        image: istio/pilot:1.15.0
        ports:
        - containerPort: 15010  # XDS API
        - containerPort: 15017  # Validation webhook
---
# Note: the privileged iptables setup that redirects pod traffic through
# the sidecar does not run in istiod. It runs in each *application* pod
# as the injected istio-init init container (requiring NET_ADMIN/NET_RAW).
Traffic Management
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-service
spec:
  hosts:
  - api.example.com
  http:
  # Route 90% to v1, 10% to v2 (canary deployment)
  - match:
    - uri:
        prefix: "/api"
    route:
    - destination:
        host: api-service
        subset: v1
      weight: 90
    - destination:
        host: api-service
        subset: v2
      weight: 10
    timeout: 30s
    retries:
      attempts: 3
      perTryTimeout: 10s
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-service
spec:
  host: api-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
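The VirtualService above routes mesh traffic for api.example.com; to accept requests from outside the cluster, it is typically bound to an ingress Gateway via a gateways: field. A minimal sketch of such a Gateway:
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: api-gateway
spec:
  selector:
    istio: ingressgateway  # Istio's default ingress gateway pods
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - api.example.com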
Security with mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system  # applying to the root namespace makes this mesh-wide
spec:
  # Enforce mTLS for all traffic
  mtls:
    mode: STRICT  # Only allow mTLS
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: api-policy
spec:
  selector:
    matchLabels:
      app: api
  # ALLOW policies are default-deny: once this policy selects the workload,
  # any request not matched by a rule below is rejected, so no explicit
  # deny-all rule is needed. (An empty rule {} would match, and therefore
  # allow, every request.)
  rules:
  # Allow GET requests from the web service only
  - from:
    - source:
        principals: ["cluster.local/ns/default/sa/web"]
    to:
    - operation:
        methods: ["GET"]
        paths: ["/api/public/*"]
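Switching a live cluster straight to STRICT can cut off plaintext clients that are not yet in the mesh. A common migration step is PERMISSIVE mode scoped to a namespace, which accepts both mTLS and plaintext while workloads are onboarded; a sketch with a hypothetical legacy namespace:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: legacy  # hypothetical namespace still being migrated
spec:
  mtls:
    mode: PERMISSIVE  # accept both mTLS and plaintext during migration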
Linkerd Architecture
Lightweight Alternative
# Linkerd is lighter than Istio
# Data plane: lightweight Rust proxy (~2m CPU, ~10MB memory per pod)
# Control plane: Go services
# (In practice the control plane is generated by `linkerd install`;
# this is a simplified sketch.)
apiVersion: v1
kind: Namespace
metadata:
  name: linkerd
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: linkerd-controller
  namespace: linkerd
spec:
  replicas: 1
  selector:
    matchLabels:
      app: linkerd-controller
  template:
    metadata:
      labels:
        app: linkerd-controller
      annotations:
        # Inject the Linkerd proxy
        linkerd.io/inject: enabled
    spec:
      containers:
      - name: controller
        image: cr.l5d.io/linkerd/controller:edge
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
Traffic Splitting
Linkerd pairs with Flagger for automated canary analysis and progressive traffic shifting:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-canary
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  service:
    port: 8080
  # Analysis: shift traffic 5% at a time, up to 50%
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 5
    # Metric checks
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
    # Webhook for custom checks
    webhooks:
    - name: smoke-tests
      url: http://flagger-loadtester/smoke
      timeout: 30s
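With the Linkerd/SMI provider, Flagger drives the rollout by continuously rewriting an SMI TrafficSplit behind the scenes. A simplified sketch of what the generated resource might look like mid-rollout:
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: api
spec:
  service: api            # apex service that clients call
  backends:
  - service: api-primary  # stable version
    weight: 90
  - service: api-canary   # version under analysis
    weight: 10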
Observability
Distributed Tracing
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

app = FastAPI()

# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger-agent",
    agent_port=6831,
)
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
tracer = trace.get_tracer(__name__)

@app.get("/api/orders/{order_id}")
async def get_order(order_id: str):
    """Trace a call that fans out to downstream services."""
    with tracer.start_as_current_span("get_order") as span:
        span.set_attribute("order_id", order_id)
        # Call payment service
        with tracer.start_as_current_span("fetch_payment_status") as child:
            child.set_attribute("service", "payment")
            payment_status = await payment_service.get_status(order_id)
        # Call inventory service
        with tracer.start_as_current_span("fetch_inventory") as child:
            child.set_attribute("service", "inventory")
            inventory = await inventory_service.check(order_id)
        return {
            'order_id': order_id,
            'payment': payment_status,
            'inventory': inventory
        }
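Child spans only stitch into one distributed trace if the trace context travels with each outbound request; the sidecar forwards tracing headers but the application must propagate them. A sketch using OpenTelemetry's propagation API, assuming the downstream calls go through httpx:
import httpx
from opentelemetry.propagate import inject

async def call_downstream(url: str):
    # Copy the current trace context (traceparent header) into the request
    headers = {}
    inject(headers)
    async with httpx.AsyncClient() as client:
        response = await client.get(url, headers=headers)
        return response.json()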
Metrics Collection
# Prometheus scrape config
global:
  scrape_interval: 15s
scrape_configs:
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
# Sample metrics (Prometheus exposition format)
request_duration_seconds{service="api", method="GET", path="/orders", status="200"} 0.125
requests_total{service="api", method="GET", status="200"} 1543
errors_total{service="api", error_type="timeout"} 5
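Raw counters become useful once turned into rates. A sketch of a Prometheus recording rule for the error ratio, assuming a rule file wired in via rule_files in the config above (names are illustrative):
# rules.yml (hypothetical rule file)
groups:
- name: api-slo
  rules:
  - record: service:error_ratio:rate5m
    expr: |
      sum(rate(errors_total{service="api"}[5m]))
        /
      sum(rate(requests_total{service="api"}[5m]))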
Logging Aggregation
import logging

from fastapi import FastAPI, Request
from pythonjsonlogger import jsonlogger

app = FastAPI()

# Structured (JSON) logging for aggregation
logger = logging.getLogger()
logger.setLevel(logging.INFO)
log_handler = logging.StreamHandler()
log_handler.setFormatter(jsonlogger.JsonFormatter())
logger.addHandler(log_handler)

@app.get("/api/endpoint")
async def endpoint(request: Request):
    # Pull correlation fields from request headers (Envoy propagates x-request-id)
    request_id = request.headers.get("x-request-id", "")
    user_id = request.headers.get("x-user-id", "anonymous")
    # Log with structured context
    logger.info("endpoint called", extra={
        'service': 'api',
        'endpoint': '/api/endpoint',
        'user_id': user_id,
        'request_id': request_id,
        'duration_ms': 125
    })
    return {"status": "ok"}
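With the JSON formatter attached, each call emits one machine-parseable object per line that collectors such as Fluentd or Loki can index without custom parsing; the log call above produces roughly:
{"message": "endpoint called", "service": "api", "endpoint": "/api/endpoint", "user_id": "anonymous", "request_id": "", "duration_ms": 125}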
Circuit Breaker Pattern
import httpx
from circuitbreaker import circuit, CircuitBreakerError

class ServiceClient:
    @circuit(failure_threshold=5, recovery_timeout=60)
    async def call_external_service(self, endpoint: str):
        """Call an external service behind a circuit breaker."""
        # After 5 consecutive failures the circuit opens for 60 seconds
        # and further calls fail fast with CircuitBreakerError
        async with httpx.AsyncClient() as client:
            response = await client.get(endpoint, timeout=5)
            # Raise on HTTP errors so they count as breaker failures
            response.raise_for_status()
            return response.json()

# Usage
client = ServiceClient()
try:
    data = await client.call_external_service('https://api.example.com/data')
except CircuitBreakerError:
    # Circuit is open - use fallback
    data = await get_cached_data()
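For operational visibility you can also query breaker state at runtime, for example to surface open circuits on a health endpoint; a sketch using the same circuitbreaker package's monitor:
from circuitbreaker import CircuitBreakerMonitor

# Names of circuits currently open (failing fast)
open_circuits = [cb.name for cb in CircuitBreakerMonitor.get_open()]
# True only if every registered circuit is closed
healthy = CircuitBreakerMonitor.all_closed()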
Cost Comparison
class ServiceMeshCosts:
    """Back-of-the-envelope service mesh CPU cost estimates."""

    # Rough AWS on-demand price per vCPU-hour
    PRICE_PER_VCPU_HOUR = 0.023

    def calculate_istio_costs(self, num_pods: int):
        # Istio control plane (3 istiod replicas, ~1 CPU each)
        control_plane = 3 * 1
        # Data plane: one Envoy sidecar per pod (~10m CPU each)
        data_plane = num_pods * 0.01
        total_cpu = control_plane + data_plane
        monthly_cost = total_cpu * self.PRICE_PER_VCPU_HOUR * 730
        return {
            'control_plane_cpu': control_plane,
            'data_plane_cpu': data_plane,
            'total_cpu': total_cpu,
            'monthly_cost_usd': monthly_cost
        }

    def calculate_linkerd_costs(self, num_pods: int):
        # Much lighter than Istio:
        # control plane ~0.5 CPU, proxy ~2m CPU per pod (vs 10m for Istio)
        control_plane = 0.5
        data_plane = num_pods * 0.002
        total_cpu = control_plane + data_plane
        monthly_cost = total_cpu * self.PRICE_PER_VCPU_HOUR * 730
        return {
            'control_plane_cpu': control_plane,
            'data_plane_cpu': data_plane,
            'total_cpu': total_cpu,
            'monthly_cost_usd': monthly_cost
        }

# Example: 100 pods
calc = ServiceMeshCosts()
istio = calc.calculate_istio_costs(num_pods=100)
print(f"Istio: ${istio['monthly_cost_usd']:.0f}/month")
# Istio: $67/month (CPU only; excludes memory and operational overhead)
linkerd = calc.calculate_linkerd_costs(num_pods=100)
print(f"Linkerd: ${linkerd['monthly_cost_usd']:.0f}/month")
# Linkerd: $12/month (roughly 80% cheaper)
When to Use Service Mesh
Use When
- ✅ 10+ microservices
- ✅ Need fine-grained traffic control
- ✅ mTLS security required
- ✅ Observability critical
- ✅ Team experienced with Kubernetes
Don’t Use When
- ❌ < 5 services (over-engineering)
- ❌ Simple monolithic app
- ❌ Limited DevOps resources
- ❌ Cost-constrained startup
Glossary
- Data Plane: Proxies handling traffic (Envoy)
- Control Plane: Services managing configuration (Istiod)
- Virtual Service: Traffic routing rules
- Destination Rule: Load balancing policies
- mTLS: Mutual TLS encryption
- Canary: Gradual rollout to subset of users
Conclusion
Service meshes solve real operational problems (observability, security, and traffic management) but introduce real operational complexity of their own. Start with Linkerd for simplicity and performance, and graduate to Istio when you need its richer traffic management and authorization policies. Do not adopt a service mesh until your service count justifies it (the 10+ threshold above; many teams wait until 20-30 services) and you have a dedicated platform team.