Introduction
In microservices architectures, a single user request can flow through dozens of services, making debugging and performance optimization challenging. Distributed tracing provides visibility into these request flows, enabling teams to identify bottlenecks, understand dependencies, and troubleshoot issues quickly. This guide covers distributed tracing concepts, implementation, and best practices.
Distributed tracing is a method for profiling and monitoring applications, particularly those built on microservices, that pinpoints where failures occur and what causes poor performance.
Core Concepts
Traces and Spans
┌──────────────────────────────────────────────────────────────┐
│                      Distributed Trace                       │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Trace: User Request Flow                                    │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                                                        │  │
│  │  ┌──────────┐      ┌──────────┐      ┌──────────┐      │  │
│  │  │   API    │─────▶│ Payment  │─────▶│  Order   │      │  │
│  │  │ Gateway  │      │ Service  │      │ Service  │      │  │
│  │  └──────────┘      └──────────┘      └──────────┘      │  │
│  │                                                        │  │
│  │    Span 1            Span 2            Span 3          │  │
│  │                                                        │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
└──────────────────────────────────────────────────────────────┘
Trace Context
# W3C Trace Context
trace_context = {
"traceparent": "00-0af7651916cd43dd8448eb211c80319-c7b79e6978c6ff01-01",
# trace-id: 0af7651916cd43dd8448eb211c80319
# parent-id: c7b79e6978c6ff01
# trace-flags: 01 (sampled)
}
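An originating service that receives no incoming traceparent must mint one itself. A minimal sketch using only the standard library (`make_traceparent` is a hypothetical helper, not an OpenTelemetry API):

```python
import secrets

def make_traceparent(sampled=True):
    """Build a W3C traceparent header: version-traceid-parentid-flags."""
    trace_id = secrets.token_hex(16)   # 32 hex chars, must not be all zeros
    parent_id = secrets.token_hex(8)   # 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"

header = make_traceparent()
```

Downstream services reuse the trace-id from this header and substitute their own span-id as the parent-id before calling further services.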
Span Structure
import time

class Span:
    def __init__(self, name, trace_id, span_id, parent_id=None):
        self.name = name
        self.trace_id = trace_id
        self.span_id = span_id
        self.parent_id = parent_id
        self.start_time = None
        self.end_time = None
        self.attributes = {}
        self.events = []
        self.status = None

    def start(self):
        self.start_time = time.time()

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def add_event(self, name, timestamp=None, attributes=None):
        self.events.append({
            "name": name,
            "timestamp": timestamp or time.time(),
            "attributes": attributes or {}
        })

    def set_status(self, code, message=None):
        self.status = {"code": code, "message": message}

    def finish(self):
        self.end_time = time.time()

    @property
    def duration(self):
        if self.start_time and self.end_time:
            return self.end_time - self.start_time
        return None
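In practice, spans are opened and closed through a context manager so that timing and finishing cannot be forgotten, and child spans inherit the trace-id while pointing their parent-id at the enclosing span. A self-contained sketch of that pattern using plain dicts (`timed_span` is a hypothetical helper, not part of any tracing library):

```python
import time
import uuid
from contextlib import contextmanager

@contextmanager
def timed_span(name, trace_id, parent_id=None):
    """Open a span record, time the enclosed work, and finish it on exit."""
    span = {
        "name": name,
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        "start_time": time.time(),
    }
    try:
        yield span
    finally:
        span["end_time"] = time.time()

trace_id = uuid.uuid4().hex
with timed_span("checkout", trace_id) as parent:
    with timed_span("charge_card", trace_id, parent["span_id"]) as child:
        pass  # the traced work happens here
```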
OpenTelemetry
Setup and Configuration
# OpenTelemetry Python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import Resource
# Configure resource (service name)
resource = Resource(attributes={
"service.name": "payment-service",
"service.version": "1.2.3",
"deployment.environment": "production"
})
# Create tracer provider
provider = TracerProvider(resource=resource)
# Add Jaeger exporter
jaeger_exporter = JaegerExporter(
agent_host_name="jaeger",
agent_port=6831,
)
provider.add_span_processor(
BatchSpanProcessor(jaeger_exporter)
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
Instrumenting Applications
# Automatic instrumentation
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
# Instrument Flask
FlaskInstrumentor().instrument_app(app)
# Instrument requests
RequestsInstrumentor().instrument()
# Manual instrumentation
@app.route("/api/payments", methods=["POST"])
def create_payment():
with tracer.start_as_current_span("create_payment") as span:
# Add attributes
span.set_attribute("http.method", "POST")
span.set_attribute("http.url", request.url)
# Add events
span.add_event("Validating payment request")
# Process payment
try:
result = process_payment(request.json)
span.set_attribute("payment.id", result.id)
span.set_attribute("payment.amount", str(result.amount))
return jsonify(result)
except Exception as e:
span.set_status(trace.StatusCode.ERROR, str(e))
span.record_exception(e)
raise
Custom Span Creation
# Custom instrumentation
class PaymentService:
def __init__(self):
self.tracer = trace.get_tracer(__name__)
def process_payment(self, payment_data):
with self.tracer.start_as_current_span(
"payment.process",
attributes={
"payment.amount": str(payment_data["amount"]),
"payment.currency": payment_data["currency"]
}
) as span:
            # Validate payment
            with self.tracer.start_as_current_span("validate"):
                self._validate(payment_data)
            # Charge card
            with self.tracer.start_as_current_span("charge"):
                result = self._charge(payment_data)
            # Save to database
            with self.tracer.start_as_current_span("save"):
                self._save(result)
            return result
def _validate(self, data):
pass # Validation logic
def _charge(self, data):
pass # Payment gateway call
def _save(self, data):
pass # Database save
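When many methods need the same wrapping, the repeated with-block can be factored into a decorator. A stdlib-only sketch that records span name and duration into a list (a real implementation would call `tracer.start_as_current_span` instead; `traced` and `RECORDED_SPANS` are illustrative names):

```python
import functools
import time

RECORDED_SPANS = []  # stand-in for a real span exporter

def traced(span_name):
    """Decorator that times a function call and records it as a span."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return fn(*args, **kwargs)
            finally:
                RECORDED_SPANS.append(
                    {"name": span_name, "duration": time.time() - start}
                )
        return wrapper
    return decorator

@traced("payment.validate")
def validate(data):
    return bool(data)
```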
Tracing Backend Integration
Jaeger
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
namespace: observability
spec:
replicas: 1
selector:
matchLabels:
app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:1.47
ports:
- containerPort: 16686
- containerPort: 6831
env:
- name: COLLECTOR_OTLP_ENABLED
value: "true"
Zipkin
# Zipkin exporter
from opentelemetry.exporter.zipkin.json import ZipkinExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
zipkin_exporter = ZipkinExporter(
endpoint="http://zipkin:9411/api/v2/spans",
)
provider.add_span_processor(
BatchSpanProcessor(zipkin_exporter)
)
Grafana Tempo
# Tempo configuration
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
http:
grpc:
ingester:
max_block_duration: 5m
storage:
trace:
backend: local
wal:
path: /var/tempo/wal
local:
path: /var/tempo/blocks
Context Propagation
W3C Standard
# Extract trace context from HTTP headers
def extract_trace_context(headers):
"""Extract trace context from HTTP headers."""
traceparent = headers.get("traceparent")
if not traceparent:
return None
# Parse traceparent
    # Format: 00-0af7651916cd43dd8448eb211c80319c-c7b79e6978c6ff01-01
parts = traceparent.split("-")
if len(parts) != 4:
return None
return {
"trace_id": parts[1],
"parent_id": parts[2],
"trace_flags": parts[3]
}
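The counterpart on the outbound side is injecting the context into the headers of each downstream request. A minimal sketch mirroring the extractor above (`inject_trace_context` is a hypothetical helper):

```python
def inject_trace_context(headers, trace_id, span_id, sampled=True):
    """Write a W3C traceparent header into an outgoing headers dict."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers

headers = inject_trace_context(
    {}, "0af7651916cd43dd8448eb211c80319c", "c7b79e6978c6ff01"
)
```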
Custom Propagators
# Custom propagator for message queues
from dataclasses import dataclass

@dataclass
class TraceContext:
    trace_id: str
    span_id: str

class MessageQueuePropagator:
    """Propagate trace context via message queue headers."""

    def inject(self, span_context, carrier):
        carrier["x-trace-id"] = span_context.trace_id
        carrier["x-span-id"] = span_context.span_id

    def extract(self, carrier):
        if "x-trace-id" in carrier:
            return TraceContext(
                trace_id=carrier["x-trace-id"],
                span_id=carrier["x-span-id"]
            )
        return None
Analyzing Traces
Common Patterns
┌──────────────────────────────────────────────────────────────┐
│                   Trace Analysis Patterns                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Normal Request:                                             │
│  ──────────────────────────────────────                      │
│  Gateway (5ms) → Auth (2ms) → Payment (50ms) → DB (10ms)     │
│  Total: ~70ms                                                │
│                                                              │
│  With Bottleneck:                                            │
│  ──────────────────────────────────────                      │
│  Gateway (5ms) → Auth (2ms) ──────────────────▶ Payment      │
│                                                    │         │
│                                                    ▼         │
│                                        DB (2000ms) ← SLOW!   │
│                                                              │
│  With Cascade Failure:                                       │
│  ──────────────────────────────────────                      │
│  Gateway (5ms) → Auth (2ms) → Payment (ERROR) ← DB (TIMEOUT) │
│                                                              │
└──────────────────────────────────────────────────────────────┘
Performance Optimization
# Identify slow operations
def analyze_trace(spans):
"""Find performance bottlenecks."""
slow_spans = []
for span in spans:
duration = span.duration_ms
if duration > 1000: # Over 1 second
slow_spans.append({
"name": span.name,
"duration": duration,
"service": span.service_name
})
return sorted(slow_spans, key=lambda x: x["duration"], reverse=True)
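Beyond flagging individual slow spans, it helps to look at the distribution of span durations across many traces. A sketch computing latency percentiles with the standard library (`latency_percentiles` is an illustrative name):

```python
import statistics

def latency_percentiles(durations_ms):
    """Return p50/p95/p99 latency from a list of span durations (ms)."""
    # quantiles(n=100) yields the 99 percentile cut points p1..p99
    qs = statistics.quantiles(durations_ms, n=100)
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

latency_percentiles([12, 15, 18, 22, 30, 45, 60, 120, 800, 2000])
```

Comparing p50 against p99 distinguishes a uniformly slow operation from one with a long tail caused by occasional outliers.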
Best Practices
- Instrument everything: Add tracing to all service interactions
- Use semantic conventions: Follow OpenTelemetry naming standards
- Sample appropriately: Don’t trace 100% of requests in high-traffic systems
- Add meaningful attributes: Include business context
- Use spans strategically: Don’t over-segment
- Monitor span duration: Track latency percentiles
- Correlate with metrics: Connect traces to logs and metrics
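The sampling advice above is commonly implemented as a deterministic function of the trace-id, so every service in the request path makes the same keep/drop decision without coordination. A self-contained sketch of the idea (this is not the OpenTelemetry sampler API, just the underlying technique):

```python
def should_sample(trace_id_hex, rate=0.1):
    """Keep a trace iff its 128-bit id falls below rate * the id space.

    Because the decision depends only on the trace-id, all services
    sample the same subset of traces, keeping sampled traces complete.
    """
    bound = int(rate * (1 << 128))
    return int(trace_id_hex, 16) < bound

should_sample("0af7651916cd43dd8448eb211c80319c", rate=0.1)
```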
Conclusion
Distributed tracing is essential for understanding complex distributed systems. By implementing proper tracing with OpenTelemetry and using backend tools like Jaeger or Tempo, teams can quickly identify issues, optimize performance, and maintain reliable services.