Introduction
Distributed tracing provides visibility into complex microservices architectures. With requests spanning dozens of services, tracing is essential for debugging, performance optimization, and understanding system behavior.
Key Statistics:
- OpenTelemetry adoption: 48.5% of organizations (2026), with 81% considering it critical
- Average MTTR improvement with tracing: 60%
- eBPF-based zero-code instrumentation now covers HTTP, gRPC, SQL, and more
- 90%+ of new projects adopt OpenTelemetry as their instrumentation standard
Tracing Architecture
A trace represents a single request as it travels through a distributed system. Each unit of work is a span — an operation with a start time, duration, and metadata.
flowchart LR
R[User Request] --> A[API Gateway]
A --> B[Auth Service]
A --> C[Order Service]
C --> D[Payment Service]
C --> E[Inventory Service]
D --> F[Bank API]
subgraph OTLP[OpenTelemetry Collector]
G[Collect traces<br/>batch, sample, export]
end
A -.-> G
B -.-> G
C -.-> G
D -.-> G
E -.-> G
G --> H[(Trace Backend<br/>Jaeger / Tempo / Datadog)]
H --> I[Visualize & Query]
Each service contributes spans linked by trace context propagated via HTTP headers or gRPC metadata. The OpenTelemetry Collector receives spans, applies sampling and batching, then exports to one or more backends.
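As a rough illustration of how a backend turns those linked spans into one trace, the sketch below indexes spans by their parent and renders the resulting tree. The `Span` dataclass, its field names, and the sample values are assumptions made for this example; real OpenTelemetry spans carry many more fields.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Simplified span record (hypothetical, for illustration only)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: float


def build_tree(spans):
    """Index spans by parent so the trace can be walked top-down."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children[span.parent_id].append(span)
    return root, children


def render(root, children, depth=0):
    """Return one indented line per span, parents before children."""
    lines = [f"{'  ' * depth}{root.name} ({root.duration_ms} ms)"]
    for child in children[root.span_id]:
        lines.extend(render(child, children, depth + 1))
    return lines


spans = [
    Span("abc", "1", None, "api-gateway", 120.0),
    Span("abc", "2", "1", "auth-service", 15.0),
    Span("abc", "3", "1", "order-service", 90.0),
    Span("abc", "4", "3", "payment-service", 60.0),
]
root, children = build_tree(spans)
print("\n".join(render(root, children)))
```

Trace backends do essentially this grouping, keyed by trace ID, before drawing a waterfall view.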
Trace Context Propagation
The W3C Trace Context standard (traceparent and tracestate headers) is the universal mechanism for propagating trace IDs across service boundaries:
Request flow:
Service A                 Service B                 Service C
┌───────────────┐         ┌───────────────┐         ┌───────────────┐
│ trace_id: abc │──http──▶│ trace_id: abc │──http──▶│ trace_id: abc │
│ span_id: 1    │ header  │ span_id: 2    │ header  │ span_id: 3    │
│ parent_id: -  │         │ parent_id: 1  │         │ parent_id: 2  │
└───────────────┘         └───────────────┘         └───────────────┘
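To make the header format concrete, here is a minimal sketch that parses a `traceparent` value and derives the header a service would send on its own outgoing call. It handles only the four-field version `00` layout (`version-traceid-parentid-flags`); the function names are hypothetical, and in practice the OpenTelemetry SDK propagators do this for you.

```python
import re
import secrets

# version 00, 32 hex trace ID, 16 hex parent (span) ID, 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Split a traceparent header into its fields, or return None if malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero trace and parent IDs are invalid per the W3C spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming):
    """Build the header for an outgoing call: same trace ID, new span ID."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        raise ValueError("invalid traceparent header")
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(child_traceparent(incoming))
```

The trace ID stays constant across every hop, while each service substitutes its own span ID, which becomes the parent ID seen by the next service.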
OpenTelemetry: The Universal Standard
OpenTelemetry (OTel) has become the industry standard for instrumentation in 2026. It is a CNCF project with over 4,000 contributors from 1,200+ companies, and it defines one set of APIs, SDKs, and a wire protocol (OTLP) for traces, metrics, and logs.
Why OpenTelemetry Matters for Tracing
- Vendor-neutral: Instrument once, export to Jaeger, Tempo, Datadog, or any OTel-compatible backend
- Semantic conventions: Standardized attribute names (http.method, db.system, messaging.destination) ensure consistent telemetry across languages and frameworks
- Auto-instrumentation: Zero-code agents for Java, Python, Node.js, Go, .NET, Ruby, and PHP
- Collector ecosystem: The OTel Collector is a vendor-agnostic gateway for receiving, processing, and exporting telemetry
Automatic Instrumentation
#!/usr/bin/env python3
"""OpenTelemetry Python auto-instrumentation."""
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# The Flask application to instrument
app = Flask(__name__)

# Describe this service so backends can group and filter its spans
trace.set_tracer_provider(
    TracerProvider(
        resource=Resource.create({
            SERVICE_NAME: "my-service",
            "service.version": "1.0.0",
            "deployment.environment": "production",
        })
    )
)

# Batch finished spans and send them to the Collector over OTLP/gRPC
otlp_exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

# Create spans for incoming Flask requests and outgoing requests calls
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
Important: The OTLPSpanExporter is the standard OpenTelemetry Protocol (OTLP) exporter in 2026. Older Jaeger-specific exporters are deprecated. Always export to the OTel Collector first, which can then route to Jaeger, Tempo, or other backends.
Manual Instrumentation
#!/usr/bin/env python3
"""Manual span creation with OpenTelemetry."""
from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

tracer = trace.get_tracer(__name__)

# validate_order, check_inventory, process_payment, and confirm_order
# are application functions assumed to be defined elsewhere.


@tracer.start_as_current_span("process_order")
def process_order(order_id):
    # Each with-block creates a child span of process_order
    with tracer.start_as_current_span("validate_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute(SpanAttributes.HTTP_METHOD, "POST")
        span.add_event("validation.started", {"order_id": order_id})
        validate_order(order_id)

    with tracer.start_as_current_span("check_inventory") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.operation", "SELECT")
        check_inventory(order_id)

    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", order_id)
        process_payment(order_id)

    try:
        confirm_order(order_id)
    except Exception as e:
        # No child span is active here, so this is the process_order span
        span = trace.get_current_span()
        span.set_status(trace.StatusCode.ERROR, str(e))
        span.record_exception(e)
        raise
Semantic Conventions
Semantic conventions are the standardized attribute schema that ensures telemetry from different services and languages can be correlated. The table below lists the main domains, their status in 2026, and representative attributes:
| Domain | Status | Key Attributes |
|---|---|---|
| HTTP | Stable | http.request.method, url.full, http.response.status_code |
| Database | Stable | db.system.name, db.query.text, db.operation.name |
| Messaging | Stable | messaging.system, messaging.destination.name, messaging.operation.type |
| RPC | Stable | rpc.system, rpc.service, rpc.method |
| FaaS | RC | faas.name, faas.trigger, faas.invocation_id |
| Kubernetes | RC | k8s.pod.name, k8s.namespace.name, k8s.deployment.name |
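# Using semantic conventions ensures consistency
from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

tracer = trace.get_tracer(__name__)

# "create_order" is a placeholder span name for this example
with tracer.start_as_current_span("create_order") as span:
    span.set_attribute(SpanAttributes.HTTP_METHOD, "POST")
    span.set_attribute(SpanAttributes.HTTP_STATUS_CODE, 200)
    span.set_attribute(SpanAttributes.DB_SYSTEM, "postgresql")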
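Attribute names have changed between releases of the semantic conventions, so older instrumentation (including the `SpanAttributes` constants, which resolve to names such as `http.method`) may not match what newer tooling expects. The sketch below shows one way to normalize attributes in a processing step. The mapping is a partial list of renames and the function name is hypothetical; check the pairs against the semantic conventions version your SDK ships.

```python
# Legacy -> current attribute names (partial list, assumed for this sketch)
RENAMED_ATTRIBUTES = {
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
    "http.url": "url.full",
    "db.system": "db.system.name",
    "db.statement": "db.query.text",
    "db.operation": "db.operation.name",
    "messaging.destination": "messaging.destination.name",
}


def migrate_attributes(attributes):
    """Return a copy of the attributes with legacy keys renamed."""
    # Keep everything that is not a known legacy key as-is.
    migrated = {k: v for k, v in attributes.items() if k not in RENAMED_ATTRIBUTES}
    for old, new in RENAMED_ATTRIBUTES.items():
        if old in attributes:
            # A value already recorded under the current name wins.
            migrated.setdefault(new, attributes[old])
    return migrated


legacy = {"http.method": "POST", "http.status_code": 200, "order.id": "A-17"}
print(migrate_attributes(legacy))
```

Custom attributes such as `order.id` pass through unchanged; only keys listed in the mapping are rewritten.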
OpenTelemetry eBPF Instrumentation (OBI)
The biggest 2026 story in distributed tracing is zero-code instrumentation via eBPF. Grafana donated Beyla to OpenTelemetry in 2025, forming the OpenTelemetry eBPF Instrumentation (OBI) project. OBI uses eBPF to automatically inspect application executables and OS networking at the kernel level, capturing traces without any code changes, library installations, or service restarts.
How OBI Works
eBPF (extended Berkeley Packet Filter) runs sandboxed programs inside the Linux kernel. OBI attaches probes to kernel networking functions and system calls, and to functions in application runtimes and libraries, so it can observe requests without any change to the application. When a request arrives, OBI:
- Detects the HTTP/gRPC transaction at the kernel level
- Generates OpenTelemetry-compatible span data
- Propagates trace context via W3C headers injected at the network layer
- Exports spans to the OTel Collector
Supported Protocols and Languages (2026)
| Protocol/Language | Status |
|---|---|
| HTTP/HTTPS | Stable |
| gRPC | Stable |
| SQL (MySQL, PostgreSQL) | Stable |
| Redis | Stable |
| MQTT, AMQP, NATS | In development |
| MongoDB | In development |
| Go | Full support |
| Java | Full support |
| Python | Full support |
| Node.js | Full support |
| .NET (.NET 8+, Framework 4.x) | In development |
| Ruby, PHP | Community support |
Deployment
# OBI as a Kubernetes DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-ebpf
spec:
  selector:
    matchLabels:
      name: otel-ebpf
  template:
    metadata:
      labels:
        name: otel-ebpf
    spec:
      hostNetwork: true
      hostPID: true  # needed to see processes of other pods on the node
      containers:
        - name: obi
          image: otel/opentelemetry-ebpf-instrumentation:latest
          securityContext:
            privileged: true  # loading eBPF programs requires elevated privileges
          volumeMounts:
            - mountPath: /sys/kernel/debug
              name: kernel-debug
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector:4317"
      volumes:
        - name: kernel-debug
          hostPath:
            path: /sys/kernel/debug
When to Use OBI vs SDK Instrumentation
| Scenario | Recommended Approach |
|---|---|
| Greenfield applications | OTel SDK with auto-instrumentation for full context |
| Legacy apps you cannot modify | OBI (zero-code) |
| Quick visibility for a new service | OBI for immediate baseline, add SDK later |
| Business-specific context needed | OBI + manual SDK spans (hybrid) |
| High-cardinality custom attributes | SDK instrumentation |
| Multi-runtime monolith | OBI for immediate coverage |
OBI automatically detects when a service already emits OpenTelemetry signals and avoids duplicating data. This makes it safe to deploy in environments where some teams use SDKs and others do not.
Comparison of Distributed Tracing Tools
| Tool | Type | OTel Support | Self-Hosted | Log/Metric Correlation | Storage | Best For |
|---|---|---|---|---|---|---|
| Jaeger | Open-source | Native | Yes | No | Elasticsearch, Cassandra (Kafka as ingestion buffer) | Kubernetes-native teams |
| Zipkin | Open-source | Yes | Yes | No | MySQL, Cassandra, Elasticsearch | Simple, lightweight setups |
| Grafana Tempo | Open-source | Native | Yes | Yes (via Grafana) | Object storage (S3/GCS) | Grafana stack users |
| SigNoz | Open-source | Native | Yes | Yes | ClickHouse | Full-stack open-source observability |
| Datadog APM | SaaS | Yes | No | Yes | Managed | All-in-one enterprise teams |
| Honeycomb | SaaS | Yes | No | Partial | Managed | High-cardinality event analysis |
| Sematext | SaaS | Native | No | Yes | Managed | Cost-conscious teams |
Jaeger
Jaeger is a CNCF-graduated distributed tracing platform originally developed at Uber. It integrates natively with OpenTelemetry and is battle-tested at scale.
Architecture
Jaeger uses a modular architecture with separate components for ingestion (collector), storage, query, and UI. This allows independent scaling of each component.
# Jaeger operator deployment (production strategy)
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production
spec:
  strategy: production
  collector:
    maxReplicas: 3
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  query:
    replicas: 2
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
  ingress:
    enabled: true
    hosts:
      - jaeger.example.com
Sampling Strategies
Jaeger supports adaptive sampling that adjusts rates based on traffic. The OTel Collector handles most sampling in modern deployments:
# Tail-based sampling via OTel Collector configuration.
# Tail sampling needs complete traces, so it is configured in the
# Collector rather than in application code.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    policies:
      - name: always-sample-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
exporters:
  # The dedicated jaeger exporter was removed from the Collector;
  # Jaeger accepts OTLP directly, so the otlp exporter is used instead.
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/jaeger]
Querying Traces
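#!/usr/bin/env python3
"""Query Jaeger traces via its API."""
import json
from datetime import datetime, timedelta, timezone

import requests

JAEGER_QUERY = "http://jaeger-query:16686"


def find_slow_traces(service: str, min_duration_ms: int = 1000, limit: int = 10):
    """Return traces from the last hour that took at least min_duration_ms."""
    # Use an aware UTC datetime: timestamp() on a naive utcnow() value
    # would be interpreted as local time and shift the query window.
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=1)
    resp = requests.get(f"{JAEGER_QUERY}/api/traces", params={
        "service": service,
        "start": int(start.timestamp() * 1e6),  # microseconds since epoch
        "end": int(end.timestamp() * 1e6),
        "minDuration": f"{min_duration_ms}ms",
        "limit": limit,
    }, timeout=10)
    resp.raise_for_status()
    return resp.json().get("data", [])


def find_error_traces(service: str, limit: int = 10):
    """Return recent traces that contain a span tagged error=true."""
    resp = requests.get(f"{JAEGER_QUERY}/api/traces", params={
        "service": service,
        "tags": json.dumps({"error": "true"}),
        "limit": limit,
    }, timeout=10)
    resp.raise_for_status()
    return resp.json().get("data", [])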
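Once traces come back, a common next step is to find where the time went. The sketch below assumes the JSON shape the Jaeger query service returns for each trace (a `spans` list with `operationName`, `duration` in microseconds, and `processID`, plus a `processes` map holding `serviceName`); the helper name and the sample data are hypothetical.

```python
def slowest_spans(trace, top_n=3):
    """Return (service, operation, duration_ms) for the slowest spans of one trace."""
    processes = trace.get("processes", {})
    rows = []
    for span in trace.get("spans", []):
        process = processes.get(span.get("processID"), {})
        service = process.get("serviceName", "unknown")
        # Jaeger reports durations in microseconds; convert to milliseconds.
        rows.append((service, span["operationName"], span["duration"] / 1000))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:top_n]


# Hand-written sample in the assumed response shape
sample_trace = {
    "traceID": "abc",
    "spans": [
        {"operationName": "POST /orders", "duration": 120_000, "processID": "p1"},
        {"operationName": "charge_card", "duration": 60_000, "processID": "p2"},
        {"operationName": "SELECT inventory", "duration": 8_500, "processID": "p1"},
    ],
    "processes": {
        "p1": {"serviceName": "order-service"},
        "p2": {"serviceName": "payment-service"},
    },
}

for service, operation, ms in slowest_spans(sample_trace, top_n=2):
    print(f"{service:<16} {operation:<18} {ms:.1f} ms")
```

Each element of the list returned by the query functions above could be passed to `slowest_spans` in the same way.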
Strengths and Limitations
Strengths: OpenTelemetry-native, CNCF-graduated, adaptive sampling, Kubernetes ecosystem integration.
Limitations: No native log/metric correlation, limited query capabilities beyond basic filtering, Elasticsearch/Cassandra operational overhead at scale.
Zipkin
Zipkin is one of the original open-source distributed tracing systems, created at Twitter in 2012 (inspired by Google’s Dapper paper). In 2026, Zipkin is best viewed as a lightweight, educational, or infrastructure-level tracing backend rather than a modern tracing platform.
Deployment
# Simple Zipkin deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zipkin
spec:
  replicas: 1
  selector:
    matchLabels:
      app: zipkin
  template:
    metadata:
      labels:
        app: zipkin
    spec:
      containers:
        - name: zipkin
          image: openzipkin/zipkin:latest
          env:
            - name: STORAGE_TYPE
              value: elasticsearch
            - name: ES_HOSTS
              value: http://elasticsearch:9200
          ports:
            - containerPort: 9411
Architecture
Zipkin uses a simpler monolithic architecture compared to Jaeger — a single server handles both ingestion and querying. This makes setup faster but provides less flexibility at scale.
| Aspect | Zipkin | Jaeger |
|---|---|---|
| Default storage | In-memory (dev), Cassandra/ES (prod) | Cassandra, ES (Kafka as ingestion buffer) |
| Sampling | Fixed-rate, probability-based | Adaptive, remote sampling |
| Deployment | Single binary | Multi-component (agent, collector, query) |
| Python support | Community-driven | Official client |
| PHP support | Stronger community | Limited |
| Query capabilities | Simple filtering | Advanced filtering, service maps |
When to Use Zipkin
Zipkin is best for simpler setups, legacy systems, or teams that need wide language support with minimal operational overhead. For new production deployments in 2026, Grafana Tempo or Jaeger is usually the better choice.
Datadog APM
Datadog APM provides distributed tracing as part of its comprehensive observability platform. It is one of the most widely adopted commercial tracing solutions, known for extensive integrations and a polished UI.
Configuration
# Datadog agent with APM and OTLP ingestion
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-agent-config
data:
  datadog.yaml: |
    apm_config:
      enabled: true
      receiver_port: 8126
    otlp_config:
      receiver:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    logs_enabled: true
    process_config:
      enabled: true

# Enable APM in your application
export DD_SERVICE="my-service"
export DD_ENV="production"
export DD_VERSION="1.0.0"
export DD_PROFILING_ENABLED=true
Ingestion and Sampling
Datadog uses a combination of head-based and tail-based sampling:
#!/usr/bin/env python3
"""Datadog APM sampling configuration."""
# Configure via DD_TRACE_SAMPLING_RULES env var
# JSON array of rules with service, name, and sample_rate
"""
DD_TRACE_SAMPLING_RULES='[
  {"service": "payment-service", "sample_rate": 1.0},
  {"service": "notification-service", "sample_rate": 0.1},
  {"service": "*", "sample_rate": 0.5}
]'
"""
# Or via Datadog Agent config
"""
apm_config:
  sampling_rules:
    - service: payment-service
      sample_rate: 1.0
    - service: notification-service
      sample_rate: 0.1
"""
Pricing Reality
Datadog APM starts at $31 per host per month, with additional charges for indexed spans, custom metrics, and continuous profiling. Costs scale with traffic and can surprise teams at volume — a common complaint is “bill shock.” Datadog remains the most expensive tracing option for high-volume systems.
Strengths and Limitations
Strengths: Unified platform (traces, logs, metrics, RUM), AI-driven root cause analysis, automatic instrumentation for 500+ technologies, service maps.
Limitations: High cost at scale, vendor lock-in (proprietary agent recommended over OTel for full feature set), complex pricing with unexpected overages.
Grafana Tempo
Tempo is an open-source, high-scale distributed tracing backend designed for cost-effective operation. It stores traces in object storage (S3, GCS) rather than Elasticsearch or Cassandra, dramatically reducing infrastructure cost and complexity.
Key Features
- Object storage only: No Elasticsearch or Cassandra needed — uses S3, GCS, or Azure Blob Storage
- TraceQL: Powerful query language for finding traces by attribute, duration, status, and structure
- Deep Grafana integration: Traces to Logs, Traces to Metrics via exemplars
- Accepts OTLP, Jaeger, and Zipkin formats natively
# Tempo configuration
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
Querying with TraceQL
# Find traces with a payment-service span that took longer than 2 seconds
{ resource.service.name = "payment-service" && duration > 2s }
# Find traces with an error span in order-service
{ resource.service.name = "order-service" && status = error }
# Find traces where a specific user agent was used (attributes need a scope such as span.)
{ span.http.user_agent = "MobileApp/2.0" }
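Queries like these can also be issued programmatically. The sketch below only builds the request URL, assuming Tempo's HTTP search endpoint (`/api/search` with the TraceQL expression in the `q` parameter). The base address `http://tempo:3200` and both helper names are hypothetical, and the service name is not escaped, so treat this as an illustration, not a hardened client.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

TEMPO_URL = "http://tempo:3200"  # hypothetical address of the Tempo query endpoint


def service_slower_than(service, seconds):
    """Build a TraceQL filter for spans of one service above a duration."""
    # Doubled braces produce the literal { } that TraceQL expects.
    return f'{{ resource.service.name = "{service}" && duration > {seconds}s }}'


def build_search_url(traceql, limit=20, base_url=TEMPO_URL):
    """Return the search URL with the TraceQL query URL-encoded."""
    query = urlencode({"q": traceql, "limit": limit})
    return f"{base_url}/api/search?{query}"


traceql = service_slower_than("payment-service", 2)
print(traceql)
print(build_search_url(traceql, limit=5))
```

The resulting URL can be fetched with any HTTP client; encoding the query matters because TraceQL uses braces, quotes, and `&&`.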
OpenTelemetry Collector Pipeline
In 2026, the OTel Collector is the standard gateway for all tracing data. It handles receiving traces from SDKs and OBI, processing (sampling, batching, filtering), and exporting to one or more backends.
# Complete OTel Collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    policies:
      - name: sample-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: sample-regular
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
  datadog:
    api:
      key: "${env:DD_API_KEY}"
service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter runs first to shed load early; batch runs last
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp, datadog]
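The `batch` settings are easier to reason about with a small model. The sketch below approximates the idea behind `send_batch_size` and `timeout`: export as soon as the batch is full, or once the oldest buffered span has waited long enough. It is a simplified illustration with hypothetical class and method names, not the Collector's implementation.

```python
import time


class SpanBatcher:
    """Toy batcher: flush on size or on age of the oldest buffered span."""

    def __init__(self, export, send_batch_size=1024, timeout_s=1.0, clock=time.monotonic):
        self._export = export          # callable that receives a list of spans
        self._size = send_batch_size
        self._timeout = timeout_s
        self._clock = clock            # injectable so the behavior can be tested
        self._batch = []
        self._first_at = None

    def add(self, span):
        if not self._batch:
            self._first_at = self._clock()
        self._batch.append(span)
        if len(self._batch) >= self._size:
            self.flush()

    def tick(self):
        """Call periodically; flushes a partial batch once the timeout has passed."""
        if self._batch and self._clock() - self._first_at >= self._timeout:
            self.flush()

    def flush(self):
        if self._batch:
            batch, self._batch = self._batch, []
            self._export(batch)


sent = []
batcher = SpanBatcher(sent.append, send_batch_size=3, timeout_s=1.0)
for name in ["a", "b", "c", "d"]:
    batcher.add(name)
batcher.flush()  # drain the remainder on shutdown
print(sent)
```

Larger batches mean fewer export calls at the price of slightly delayed data, which is the trade-off the two settings control.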
Best Practices for Distributed Tracing
1. Use OpenTelemetry as Your Standard
Instrument all services with OTel SDKs or OBI. Avoid proprietary agents — they create vendor lock-in that makes future migrations costly.
2. Deploy a Collector Pipeline
Always send traces to the OTel Collector first, not directly to backends. The Collector provides buffering, sampling, retries, and multi-backend export without changing application code.
3. Implement Smart Sampling
| Strategy | When to Use | Storage Impact |
|---|---|---|
| Head-based (probabilistic) | High-volume, routine traffic | Low — static rate |
| Tail-based (policy-driven) | Need all errors + slow traces | Medium — buffer all, keep some |
| Adaptive | Variable traffic patterns | Low — adjusts automatically |
Always sample errors at 100%. Use tail-based sampling to keep complete traces for high-latency requests.
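The difference between the first two strategies can be sketched in a few lines. `head_sample` decides from the trace ID alone, so every service reaches the same verdict without coordination; `tail_keep` decides after the whole trace has been buffered, so it can look at errors and latency. Function names, the `FinishedSpan` type, and the default thresholds are assumptions for illustration, and real samplers (for example, the Collector's hash-based probabilistic policy) differ in detail.

```python
import random
from dataclasses import dataclass

_LOW_64_BITS = (1 << 64) - 1


def head_sample(trace_id_hex, ratio):
    """Head-based decision made from the trace ID, before any span finishes."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (1 << 64))
    # Same trace ID -> same answer in every service, so kept traces stay complete.
    return (int(trace_id_hex, 16) & _LOW_64_BITS) < bound


@dataclass
class FinishedSpan:
    start_ms: float
    end_ms: float
    is_error: bool


def tail_keep(spans, threshold_ms=1000, sampling_percentage=10, rng=random.random):
    """Tail-based decision made once all spans of a trace are available."""
    if not spans:
        return False
    if any(span.is_error for span in spans):
        return True  # keep every trace that contains an error
    trace_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if trace_ms >= threshold_ms:
        return True  # keep slow traces
    return rng() * 100 < sampling_percentage  # keep a share of the rest


print(head_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.1))
print(tail_keep([FinishedSpan(0, 40, False), FinishedSpan(5, 1500, False)]))
```

The tail-based path is what makes "keep all errors" possible, at the cost of holding every trace in memory until the decision is made.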
4. Name Spans Meaningfully
# BAD — generic, unhelpful
tracer.start_span("process")
tracer.start_span("query")
# GOOD — descriptive, searchable
tracer.start_span("validate_order_items")
tracer.start_span("db.users.find_by_email")
5. Add Business Context
Attach business-relevant attributes to spans for richer debugging:
span.set_attribute("order.id", order_id)
span.set_attribute("customer.tier", customer_tier)
span.set_attribute("payment.method", payment_method)
span.set_attribute("cart.item_count", len(items))
6. Use Semantic Conventions
Always use standardized attribute names from OpenTelemetry semantic conventions. This ensures your traces are queryable across services, languages, and observability platforms.
7. Start Small, Expand Gradually
Implement tracing in a single critical service first, understand the data you are collecting, and gradually expand coverage. Focus on the traces that matter: high-latency requests, error paths, and new deployments.
8. Monitor Trace Volume
Trace volume grows with traffic. Set up monitoring on span ingestion rates and configure sampling before hitting storage or cost limits. Use the OTel Collector’s memory_limiter processor as a safety net.
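A back-of-the-envelope estimate helps decide on a sampling rate before the bill or the disk does. The sketch below multiplies request rate, spans per request, and sampling ratio into a daily span count and storage figure. The default of 500 bytes per span is a placeholder assumption, not a measured value; substitute numbers observed in your own system.

```python
SECONDS_PER_DAY = 86_400


def estimate_daily_volume(requests_per_second, spans_per_request,
                          sampling_ratio, bytes_per_span=500):
    """Return (spans_per_day, gib_per_day) for the given traffic profile.

    bytes_per_span is an assumed average; measure it for your own backend.
    """
    spans_per_day = (requests_per_second * spans_per_request
                     * sampling_ratio * SECONDS_PER_DAY)
    gib_per_day = spans_per_day * bytes_per_span / 2**30
    return spans_per_day, gib_per_day


spans, gib = estimate_daily_volume(
    requests_per_second=1000, spans_per_request=20, sampling_ratio=0.1
)
print(f"{spans:,.0f} spans/day, about {gib:.0f} GiB/day")
```

With these example inputs the result is roughly 173 million spans and about 80 GiB per day, which shows how quickly volume adds up even at 10% sampling.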
Conclusion
Distributed tracing is essential for understanding behavior in microservice architectures. The 2026 landscape is dominated by OpenTelemetry as the universal instrumentation standard, with two major innovations:
- OpenTelemetry eBPF Instrumentation (OBI) — Zero-code tracing via kernel-level eBPF probes eliminates the instrumentation barrier for legacy services and rapid onboarding
- OpenTelemetry Collector as the universal gateway — All traces flow through the Collector, enabling consistent sampling, multi-backend export, and vendor independence
For most teams, the recommended stack is: OTel SDK (or OBI for zero-code) → OTel Collector → Grafana Tempo (self-hosted, cost-effective, object storage) or Jaeger (Kubernetes-native). For teams that need an all-in-one platform with logs and metrics, Grafana Tempo + Loki + Mimir or a commercial solution like Datadog (with budget awareness) are viable options.
Start small. Instrument one service first. Use semantic conventions. Monitor your trace volume. The visibility you gain into request flows, bottlenecks, and error paths will transform how you debug and optimize your systems.