Distributed Tracing: Jaeger, Zipkin, Datadog, and OpenTelemetry in 2026

Created: February 18, 2026 Larry Qu 12 min read

Introduction

Distributed tracing provides visibility into complex microservices architectures. With requests spanning dozens of services, tracing is essential for debugging, performance optimization, and understanding system behavior.

Key Statistics:

  • OpenTelemetry adoption: 48.5% of organizations (2026), with 81% considering it critical
  • Average MTTR improvement with tracing: 60%
  • eBPF-based zero-code instrumentation now covers HTTP, gRPC, SQL, and more
  • 90%+ of new projects adopt OpenTelemetry as their instrumentation standard

Tracing Architecture

A trace represents a single request as it travels through a distributed system. Each unit of work is a span — an operation with a start time, duration, and metadata.

flowchart LR
    R[User Request] --> A[API Gateway]
    A --> B[Auth Service]
    A --> C[Order Service]
    C --> D[Payment Service]
    C --> E[Inventory Service]
    D --> F[Bank API]
    subgraph OTLP[OpenTelemetry Collector]
        G[Collect traces<br/>batch, sample, export]
    end
    A -.-> G
    B -.-> G
    C -.-> G
    D -.-> G
    E -.-> G
    G --> H[(Trace Backend<br/>Jaeger / Tempo / Datadog)]
    H --> I[Visualize & Query]

Each service contributes spans linked by trace context propagated via HTTP headers or gRPC metadata. The OpenTelemetry Collector receives spans, applies sampling and batching, then exports to one or more backends.
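To make the parent/child linkage concrete, here is a minimal sketch of how a backend might rebuild a request tree from spans that share a trace ID. The `Span` dataclass and the `build_tree` and `render` helpers are hypothetical names used only for illustration; they are not part of any OpenTelemetry API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str


def build_tree(spans):
    """Group spans by parent ID so the trace can be walked from the root."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)
    return children


def render(children, parent_id=None, depth=0):
    """Return one indented line per span, depth-first from the root."""
    lines = []
    for span in children.get(parent_id, []):
        lines.append("  " * depth + span.name)
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


# Spans may arrive in any order; only the IDs tie them together.
spans = [
    Span("abc", "4", "3", "payment-service"),
    Span("abc", "1", None, "api-gateway"),
    Span("abc", "2", "1", "auth-service"),
    Span("abc", "3", "1", "order-service"),
]
tree_lines = render(build_tree(spans))
print("\n".join(tree_lines))
```

Real backends also handle missing parents, clock skew, and late-arriving spans, which this sketch ignores.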

Trace Context Propagation

The W3C Trace Context standard (traceparent and tracestate headers) is the universal mechanism for propagating trace IDs across service boundaries:

Request flow:
  Service A                  Service B                  Service C
  ┌───────────────┐          ┌───────────────┐          ┌───────────────┐
  │ trace_id: abc │──http───▶│ trace_id: abc │──http───▶│ trace_id: abc │
  │ span_id: 1    │  header  │ span_id: 2    │  header  │ span_id: 3    │
  │ parent_id: -  │          │ parent_id: 1  │          │ parent_id: 2  │
  └───────────────┘          └───────────────┘          └───────────────┘
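
The `traceparent` header itself is a small, fixed-format string: `version-traceid-parentid-flags`. The sketch below parses a version `00` header and builds the header a service would send downstream. The function names are hypothetical, and treating a missing or invalid header as the start of a new sampled trace is an assumption of this example, not a rule from the spec.

```python
import re
import secrets

# Version 00 layout: 2 hex chars, 32 hex chars, 16 hex chars, 2 hex chars.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(incoming):
    """Build the outgoing header: same trace ID, freshly generated span ID."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        # Assumption: start a new, sampled trace when no valid context arrives.
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, _, sampled = parsed
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(incoming))
print(child_traceparent(incoming))
```

In practice the OpenTelemetry SDK propagators do this for you; the sketch only shows what travels on the wire.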

OpenTelemetry: The Universal Standard

OpenTelemetry (OTel) has become the industry standard for instrumentation in 2026. It is a CNCF graduated project with over 4,000 contributors from 1,200+ companies. Adoption has reached 48.5% of organizations, and 81% of engineering teams consider it critical to their observability strategy.

Why OpenTelemetry Matters for Tracing

  • Vendor-neutral: Instrument once, export to Jaeger, Tempo, Datadog, or any OTel-compatible backend
  • Semantic conventions: Standardized attribute names (http.method, db.system, messaging.destination) ensure consistent telemetry across languages and frameworks
  • Auto-instrumentation: Zero-code agents for Java, Python, Node.js, Go, .NET, Ruby, and PHP
  • Collector ecosystem: The OTel Collector is a vendor-agnostic gateway for receiving, processing, and exporting telemetry

Automatic Instrumentation

#!/usr/bin/env python3
"""OpenTelemetry Python auto-instrumentation."""

from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Describe the service so every span carries its identity.
trace.set_tracer_provider(
    TracerProvider(
        resource=Resource.create({
            SERVICE_NAME: "my-service",
            "service.version": "1.0.0",
            "deployment.environment": "production"
        })
    )
)

# Send spans in batches to the Collector over OTLP/gRPC.
otlp_exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")

trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

app = Flask(__name__)

# Trace incoming Flask requests and outgoing calls made with requests.
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

Important: The OTLPSpanExporter is the standard OpenTelemetry Protocol (OTLP) exporter in 2026. Older Jaeger-specific exporters are deprecated. Always export to the OTel Collector first, which can then route to Jaeger, Tempo, or other backends.

Manual Instrumentation

#!/usr/bin/env python3
"""Manual span creation with OpenTelemetry."""

from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    with tracer.start_as_current_span("validate_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute(SpanAttributes.HTTP_METHOD, "POST")
        span.add_event("validation.started", {"order_id": order_id})
        validate_order(order_id)

    with tracer.start_as_current_span("check_inventory") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.operation", "SELECT")
        check_inventory(order_id)

    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", order_id)
        process_payment(order_id)

    try:
        confirm_order(order_id)
    except Exception as e:
        span = trace.get_current_span()
        span.set_status(trace.StatusCode.ERROR, str(e))
        span.record_exception(e)
        raise

Semantic Conventions

Semantic conventions are the standardized attribute schema that ensures telemetry from different services and languages can be correlated. In 2026, stable conventions exist for:

| Domain | Status | Key Attributes |
|---|---|---|
| HTTP | Stable | http.method, http.url, http.status_code |
| Database | Stable | db.system, db.statement, db.operation |
| Messaging | Stable | messaging.system, messaging.destination, messaging.operation |
| RPC | Stable | rpc.system, rpc.service, rpc.method |
| FaaS | RC | faas.name, faas.trigger, faas.invocation_id |
| Kubernetes | RC | k8s.pod.name, k8s.namespace.name, k8s.deployment.name |

# Using semantic conventions ensures consistency
from opentelemetry.semconv.trace import SpanAttributes

span.set_attribute(SpanAttributes.HTTP_METHOD, "POST")
span.set_attribute(SpanAttributes.HTTP_STATUS_CODE, 200)
span.set_attribute(SpanAttributes.DB_SYSTEM, "postgresql")
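
One caveat when mixing SDK versions: the stable HTTP conventions renamed several attributes (for example, `http.method` became `http.request.method`), so older and newer instrumentations can emit different keys for the same thing. A minimal sketch of normalizing them at query or processing time follows. The mapping is deliberately partial and the `normalize_attributes` helper is hypothetical.

```python
# Legacy HTTP attribute name -> current stable name (partial, illustrative map).
LEGACY_TO_STABLE = {
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
    "http.url": "url.full",
    "http.user_agent": "user_agent.original",
}


def normalize_attributes(attributes):
    """Rename known legacy keys and leave every other attribute untouched."""
    return {LEGACY_TO_STABLE.get(key, key): value for key, value in attributes.items()}


legacy_span_attrs = {"http.method": "POST", "http.status_code": 200, "order.id": "42"}
print(normalize_attributes(legacy_span_attrs))
```

In a real pipeline this kind of renaming is usually done once in the Collector rather than in every consumer.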

OpenTelemetry eBPF Instrumentation (OBI)

The biggest 2026 story in distributed tracing is zero-code instrumentation via eBPF. Grafana donated Beyla to OpenTelemetry in 2025, forming the OpenTelemetry eBPF Instrumentation (OBI) project. OBI uses eBPF to automatically inspect application executables and OS networking at the kernel level, capturing traces without any code changes, library installations, or service restarts.

How OBI Works

eBPF (extended Berkeley Packet Filter) runs sandboxed programs inside the Linux kernel. OBI attaches probes to kernel functions that handle network traffic, HTTP parsing, and system calls. When a request arrives, OBI:

  1. Detects the HTTP/gRPC transaction at the kernel level
  2. Generates OpenTelemetry-compatible span data
  3. Propagates trace context via W3C headers injected at the network layer
  4. Exports spans to the OTel Collector

Supported Protocols and Languages (2026)

| Protocol/Language | Status |
|---|---|
| HTTP/HTTPS | Stable |
| gRPC | Stable |
| SQL (MySQL, PostgreSQL) | Stable |
| Redis | Stable |
| MQTT, AMQP, NATS | In development |
| MongoDB | In development |
| Go | Full support |
| Java | Full support |
| Python | Full support |
| Node.js | Full support |
| .NET (.NET 8+, Framework 4.x) | In development |
| Ruby, PHP | Community support |

Deployment

# OBI as a Kubernetes DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-ebpf
spec:
  selector:
    matchLabels:
      name: otel-ebpf
  template:
    metadata:
      labels:
        name: otel-ebpf
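    spec:
      hostPID: true          # lets the agent see processes from other pods on the node
      hostNetwork: true
      containers:
      - name: obi
        image: otel/opentelemetry-ebpf-instrumentation:latest
        securityContext:
          privileged: true   # loading eBPF probes requires elevated privileges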
        volumeMounts:
        - mountPath: /sys/kernel/debug
          name: kernel-debug
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://otel-collector:4317"
      volumes:
      - name: kernel-debug
        hostPath:
          path: /sys/kernel/debug

When to Use OBI vs SDK Instrumentation

| Scenario | Recommended Approach |
|---|---|
| Greenfield applications | OTel SDK with auto-instrumentation for full context |
| Legacy apps you cannot modify | OBI (zero-code) |
| Quick visibility for a new service | OBI for immediate baseline, add SDK later |
| Business-specific context needed | OBI + manual SDK spans (hybrid) |
| High-cardinality custom attributes | SDK instrumentation |
| Multi-runtime monolith | OBI for immediate coverage |

OBI automatically detects when a service already emits OpenTelemetry signals and avoids duplicating data. This makes it safe to deploy in environments where some teams use SDKs and others do not.


Comparison of Distributed Tracing Tools

| Tool | Type | OTel Support | Self-Hosted | Log/Metric Correlation | Storage | Best For |
|---|---|---|---|---|---|---|
| Jaeger | Open-source | Native | Yes | No | Elasticsearch, Cassandra, Kafka | Kubernetes-native teams |
| Zipkin | Open-source | Yes | Yes | No | MySQL, Cassandra, Elasticsearch | Simple, lightweight setups |
| Grafana Tempo | Open-source | Native | Yes | Yes (via Grafana) | Object storage (S3/GCS) | Grafana stack users |
| SigNoz | Open-source | Native | Yes | Yes | ClickHouse | Full-stack open-source observability |
| Datadog APM | SaaS | Yes | No | Yes | Managed | All-in-one enterprise teams |
| Honeycomb | SaaS | Yes | No | Partial | Managed | High-cardinality event analysis |
| Sematext | SaaS | Native | No | Yes | Managed | Cost-conscious teams |

Jaeger

Jaeger is a CNCF-graduated distributed tracing platform originally developed at Uber. It integrates natively with OpenTelemetry and is battle-tested at scale.

Architecture

Jaeger uses a modular architecture with separate components for ingestion (collector), storage, query, and UI. This allows independent scaling of each component.

# Jaeger operator deployment (production strategy)
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production
spec:
  strategy: production
  collector:
    maxReplicas: 3
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  query:
    replicas: 2
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
  ingress:
    enabled: true
    hosts:
      - jaeger.example.com

Sampling Strategies

Jaeger supports adaptive sampling that adjusts rates based on traffic. The OTel Collector handles most sampling in modern deployments:

# Sampling strategies via OTel Collector configuration.
# In 2026, sampling is configured in the OTel Collector, not in application code.
# Example collector config for tail-based sampling:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  tail_sampling:
    decision_wait: 30s     # how long to buffer a trace before deciding
    num_traces: 100000     # traces held in memory while waiting
    policies:
      - name: always-sample-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  # Jaeger ingests OTLP natively; the dedicated jaeger exporter has been
  # removed from the Collector, so use the otlp exporter instead.
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/jaeger]

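The decision logic behind those three policies can be sketched in a few lines of Python. This is a simplified model, not the Collector's implementation: it approximates trace latency by the longest span, hashes the trace ID with CRC32 for the probabilistic bucket, and uses a hypothetical `SpanRecord` type.

```python
import zlib
from dataclasses import dataclass


@dataclass
class SpanRecord:
    trace_id: str
    duration_ms: float
    is_error: bool


def keep_trace(spans, latency_threshold_ms=1000, sampling_percentage=10):
    """Decide once per trace, after all of its spans have been buffered."""
    if not spans:
        return False
    if any(span.is_error for span in spans):
        return True  # error policy: keep every trace that contains an error
    if max(span.duration_ms for span in spans) >= latency_threshold_ms:
        return True  # latency policy: keep slow traces
    # Probabilistic policy: a stable hash of the trace ID picks a bucket 0..99,
    # so the same trace always gets the same decision.
    bucket = zlib.crc32(spans[0].trace_id.encode()) % 100
    return bucket < sampling_percentage


failed = [SpanRecord("t1", 20, False), SpanRecord("t1", 35, True)]
slow = [SpanRecord("t2", 1800, False)]
print(keep_trace(failed), keep_trace(slow))
```

The key difference from head-based sampling is visible in the signature: the function sees the whole trace, which is why the Collector has to buffer spans for `decision_wait` before deciding.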
Querying Traces

#!/usr/bin/env python3
"""Query Jaeger traces via its API."""

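import json
from datetime import datetime, timedelta, timezone

import requests

JAEGER_QUERY = "http://jaeger-query:16686"


def find_slow_traces(service: str, min_duration_ms: int = 1000, limit: int = 10):
    """Return traces from the last hour that took at least min_duration_ms."""
    # Use an aware UTC datetime: a naive utcnow() would be read as local time
    # by timestamp() and shift the query window.
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=1)
    resp = requests.get(f"{JAEGER_QUERY}/api/traces", params={
        "service": service,
        "start": int(start.timestamp() * 1e6),  # microseconds since epoch
        "end": int(end.timestamp() * 1e6),
        "minDuration": f"{min_duration_ms}ms",
        "limit": limit
    }, timeout=10)
    resp.raise_for_status()
    return resp.json().get("data", [])


def find_error_traces(service: str, limit: int = 10):
    """Return recent traces that carry the tag error=true."""
    resp = requests.get(f"{JAEGER_QUERY}/api/traces", params={
        "service": service,
        "tags": json.dumps({"error": "true"}),  # tags are passed as a JSON object
        "limit": limit
    }, timeout=10)
    resp.raise_for_status()
    return resp.json().get("data", [])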
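
Once you have the `data` list back, a small helper can turn each trace into something readable. The sketch below assumes the JSON shape returned by the Jaeger query service (`spans` with `startTime` and `duration` in microseconds, and a `processes` map keyed by `processID`). The `summarize_trace` helper and the sample payload are made up for illustration.

```python
def summarize_trace(trace):
    """Return the trace ID, end-to-end duration in ms, and the slowest operation."""
    spans = trace["spans"]
    start = min(span["startTime"] for span in spans)
    end = max(span["startTime"] + span["duration"] for span in spans)
    slowest = max(spans, key=lambda span: span["duration"])
    service = trace["processes"][slowest["processID"]]["serviceName"]
    return {
        "trace_id": trace["traceID"],
        "duration_ms": (end - start) / 1000,  # microseconds -> milliseconds
        "slowest_operation": f'{service}:{slowest["operationName"]}',
    }


# Hand-written sample in the assumed response shape.
sample_trace = {
    "traceID": "abc123",
    "spans": [
        {"spanID": "1", "operationName": "POST /orders", "processID": "p1",
         "startTime": 1_000_000, "duration": 1_500_000},
        {"spanID": "2", "operationName": "charge_card", "processID": "p2",
         "startTime": 1_200_000, "duration": 900_000},
    ],
    "processes": {
        "p1": {"serviceName": "order-service"},
        "p2": {"serviceName": "payment-service"},
    },
}
summary = summarize_trace(sample_trace)
print(summary)
```

Typical usage would be `[summarize_trace(t) for t in find_slow_traces("order-service")]`.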

Strengths and Limitations

Strengths: OpenTelemetry-native, CNCF-graduated, adaptive sampling, Kubernetes ecosystem integration.

Limitations: No native log/metric correlation, limited query capabilities beyond basic filtering, Elasticsearch/Cassandra operational overhead at scale.


Zipkin

Zipkin is one of the original open-source distributed tracing systems, created at Twitter in 2012 (inspired by Google’s Dapper paper). In 2026, Zipkin is best viewed as a lightweight, educational, or infrastructure-level tracing backend rather than a modern tracing platform.

Deployment

# Simple Zipkin deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zipkin
spec:
  replicas: 1
  selector:
    matchLabels:
      app: zipkin
  template:
    metadata:
      labels:
        app: zipkin
    spec:
      containers:
      - name: zipkin
        image: openzipkin/zipkin:latest
        env:
        - name: STORAGE_TYPE
          value: elasticsearch
        - name: ES_HOSTS
          value: http://elasticsearch:9200
        ports:
        - containerPort: 9411

Architecture

Zipkin uses a simpler monolithic architecture compared to Jaeger — a single server handles both ingestion and querying. This makes setup faster but provides less flexibility at scale.

| Aspect | Zipkin | Jaeger |
|---|---|---|
| Default storage | In-memory (dev), Cassandra/ES (prod) | Cassandra, ES, Kafka |
| Sampling | Fixed-rate, probability-based | Adaptive, remote sampling |
| Deployment | Single binary | Multi-component (collector, query, UI) |
| Python support | Community-driven | Official client |
| PHP support | Stronger community | Limited |
| Query capabilities | Simple filtering | Advanced filtering, service maps |

When to Use Zipkin

Zipkin is best for simpler setups, legacy systems, or teams that need wide language support with minimal operational overhead. For new production deployments, Grafana Tempo or Jaeger are recommended over Zipkin in 2026.


Datadog APM

Datadog APM provides distributed tracing as part of its comprehensive observability platform. It is the most widely adopted commercial tracing solution, known for extensive integrations and a polished UI.

Configuration

# Datadog agent with APM and OTLP ingestion
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-agent-config
data:
  datadog.yaml: |
    apm_config:
      enabled: true
      receiver_port: 8126
    otlp_config:
      receiver:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    logs_enabled: true
    process_config:
      enabled: true

# Enable APM in your application
export DD_SERVICE="my-service"
export DD_ENV="production"
export DD_VERSION="1.0.0"
export DD_PROFILING_ENABLED=true

Ingestion and Sampling

Datadog uses a combination of head-based and tail-based sampling:

#!/usr/bin/env python3
"""Datadog APM sampling configuration."""

# Configure via DD_TRACE_SAMPLING_RULES env var
# JSON array of rules with service, name, and sample_rate

"""
DD_TRACE_SAMPLING_RULES='[
  {"service": "payment-service", "sample_rate": 1.0},
  {"service": "notification-service", "sample_rate": 0.1},
  {"service": "*", "sample_rate": 0.5}
]'
"""

# Or via Datadog Agent config
"""
apm_config:
  sampling_rules:
    - service: payment-service
      sample_rate: 1.0
    - service: notification-service
      sample_rate: 0.1
"""

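To see how a rule list resolves for a given service, here is a minimal sketch. It assumes rules are evaluated in order and the first match wins, with glob-style matching on the service name. The `sample_rate_for` helper and its `default` fallback are hypothetical, not part of any Datadog library.

```python
import fnmatch
import json

rules = [
    {"service": "payment-service", "sample_rate": 1.0},
    {"service": "notification-service", "sample_rate": 0.1},
    {"service": "*", "sample_rate": 0.5},
]


def sample_rate_for(service, rules, default=1.0):
    """Return the rate of the first rule whose service pattern matches."""
    for rule in rules:
        if fnmatch.fnmatchcase(service, rule.get("service", "*")):
            return rule["sample_rate"]
    return default  # assumption: fall back to a caller-supplied rate


# The same list serialized the way the environment variable expects it.
env_value = json.dumps(rules)

print(sample_rate_for("payment-service", rules))
print(sample_rate_for("cart-service", rules))
print(env_value)
```

Because order matters under this assumption, the catch-all `*` rule has to come last or it will shadow the more specific ones.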
Pricing Reality

Datadog APM starts at $31 per host per month, with additional charges for indexed spans, custom metrics, and continuous profiling. Costs scale with traffic and can surprise teams at volume — a common complaint is “bill shock.” Datadog remains the most expensive tracing option for high-volume systems.

Strengths and Limitations

Strengths: Unified platform (traces, logs, metrics, RUM), AI-driven root cause analysis, automatic instrumentation for 500+ technologies, service maps.

Limitations: High cost at scale, vendor lock-in (proprietary agent recommended over OTel for full feature set), complex pricing with unexpected overages.


Grafana Tempo

Tempo is an open-source, high-scale distributed tracing backend designed for cost-effective operation. It stores traces in object storage (S3, GCS) rather than Elasticsearch or Cassandra, dramatically reducing infrastructure cost and complexity.

Key Features

  • Object storage only: No Elasticsearch or Cassandra needed — uses S3, GCS, or Azure Blob Storage
  • TraceQL: Powerful query language for finding traces by attribute, duration, status, and structure
  • Deep Grafana integration: Traces to Logs, Traces to Metrics via exemplars
  • Accepts OTLP, Jaeger, and Zipkin formats natively

# Tempo configuration
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317

Querying with TraceQL

# Find all traces from payment-service that took longer than 2 seconds
{ resource.service.name = "payment-service" && duration > 2s }

# Find all error traces in the checkout flow
{ resource.service.name = "order-service" && status = error }

# Find traces where a specific user agent was used
{ span.http.user_agent = "MobileApp/2.0" }
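
TraceQL queries can also be run programmatically through Tempo's HTTP search endpoint (`/api/search` with the query in the `q` parameter). The sketch below uses only the standard library. The base URL `http://tempo:3200` and the helper names are assumptions for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

TEMPO_URL = "http://tempo:3200"  # assumption: Tempo's HTTP API is reachable here


def build_search_url(traceql, limit=20, base_url=TEMPO_URL):
    """Build a search URL with the TraceQL query safely URL-encoded."""
    query = urlencode({"q": traceql, "limit": limit})
    return f"{base_url}/api/search?{query}"


def search_traces(traceql, limit=20):
    """Run the query and return the list of matching trace summaries."""
    with urlopen(build_search_url(traceql, limit), timeout=10) as resp:
        return json.load(resp).get("traces", [])


print(build_search_url('{ resource.service.name = "payment-service" && duration > 2s }'))
```

Encoding matters here because TraceQL uses braces, quotes, and `&&`, all of which would otherwise break the query string.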

OpenTelemetry Collector Pipeline

In 2026, the OTel Collector is the standard gateway for all tracing data. It handles receiving traces from SDKs and OBI, processing (sampling, batching, filtering), and exporting to one or more backends.

# Complete OTel Collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    policies:
      - name: sample-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: sample-regular
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
  datadog:
    api:
      key: "${DD_API_KEY}"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp, datadog]
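
The `batch` processor settings in that config (`timeout` and `send_batch_size`) describe a simple rule: flush when the buffer is full or when the oldest buffered span has waited long enough. A toy model of that rule follows. `SpanBatcher` is a hypothetical class, not Collector code, and it relies on the caller to invoke `tick()` periodically.

```python
import time


class SpanBatcher:
    """Buffer spans and export them by size or by age, whichever comes first."""

    def __init__(self, export, send_batch_size=1024, timeout_s=1.0, clock=time.monotonic):
        self._export = export
        self._size = send_batch_size
        self._timeout = timeout_s
        self._clock = clock          # injectable so the behavior can be tested
        self._buffer = []
        self._first_at = None

    def add(self, span):
        if not self._buffer:
            self._first_at = self._clock()  # age is measured from the first span
        self._buffer.append(span)
        if len(self._buffer) >= self._size:
            self.flush()

    def tick(self):
        """Flush a partial batch once the timeout has elapsed."""
        if self._buffer and self._clock() - self._first_at >= self._timeout:
            self.flush()

    def flush(self):
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._export(batch)


batches = []
batcher = SpanBatcher(batches.append, send_batch_size=2)
for name in ["span-a", "span-b", "span-c"]:
    batcher.add(name)
batcher.flush()  # drain whatever is left on shutdown
print(batches)
```

Larger batches reduce export overhead, while a shorter timeout reduces how long a span waits before it becomes visible in the backend.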

Best Practices for Distributed Tracing

1. Use OpenTelemetry as Your Standard

Instrument all services with OTel SDKs or OBI. Avoid proprietary agents — they create vendor lock-in that makes future migrations costly.

2. Deploy a Collector Pipeline

Always send traces to the OTel Collector first, not directly to backends. The Collector provides buffering, sampling, retries, and multi-backend export without changing application code.

3. Implement Smart Sampling

| Strategy | When to Use | Storage Impact |
|---|---|---|
| Head-based (probabilistic) | High-volume, routine traffic | Low (static rate) |
| Tail-based (policy-driven) | Need all errors + slow traces | Medium (buffer all, keep some) |
| Adaptive | Variable traffic patterns | Low (adjusts automatically) |

Always sample errors at 100%. Use tail-based sampling to keep complete traces for high-latency requests.
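
For contrast with tail-based sampling, head-based sampling decides at the start of a trace using nothing but the trace ID. The sketch below is modeled on the ratio-based approach used by OpenTelemetry SDK samplers: compare the low 64 bits of the trace ID against a bound derived from the rate. The `should_sample` function is an illustrative stand-in, not the SDK class.

```python
import secrets

_MASK_64 = (1 << 64) - 1


def should_sample(trace_id: int, rate: float) -> bool:
    """Deterministic head-based decision: same trace ID, same answer everywhere."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    bound = round(rate * (1 << 64))
    return (trace_id & _MASK_64) < bound


# Rough check of the kept fraction over random 128-bit trace IDs.
ids = [secrets.randbits(128) for _ in range(10_000)]
kept = sum(should_sample(trace_id, 0.1) for trace_id in ids)
print(f"kept {kept} of {len(ids)} traces")
```

Because the decision depends only on the trace ID, every service that applies the same rate keeps or drops the same traces, which avoids partially sampled traces.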

4. Name Spans Meaningfully

# BAD — generic, unhelpful
tracer.start_span("process")
tracer.start_span("query")

# GOOD — descriptive, searchable
tracer.start_span("validate_order_items")
tracer.start_span("db.users.find_by_email")

5. Add Business Context

Attach business-relevant attributes to spans for richer debugging:

span.set_attribute("order.id", order_id)
span.set_attribute("customer.tier", customer_tier)
span.set_attribute("payment.method", payment_method)
span.set_attribute("cart.item_count", len(items))

6. Use Semantic Conventions

Always use standardized attribute names from OpenTelemetry semantic conventions. This ensures your traces are queryable across services, languages, and observability platforms.

7. Start Small, Expand Gradually

Implement tracing in a single critical service first, understand the data you are collecting, and gradually expand coverage. Focus on the traces that matter: high-latency requests, error paths, and new deployments.

8. Monitor Trace Volume

Trace volume grows with traffic. Set up monitoring on span ingestion rates and configure sampling before hitting storage or cost limits. Use the OTel Collector’s memory_limiter processor as a safety net.
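
A quick back-of-the-envelope calculation helps pick a sampling percentage before limits are hit. The sketch below derives an ingestion rate from two counter readings and the percentage needed to stay within a span budget. Both helper names and the example numbers are hypothetical.

```python
def ingestion_rate(count_start, count_end, seconds):
    """Average spans per second between two readings of a span counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (count_end - count_start) / seconds


def required_sampling_percentage(spans_per_second, budget_spans_per_second):
    """Percentage of traffic to keep so ingestion stays within the budget."""
    if spans_per_second <= 0:
        return 100.0
    return min(100.0, 100.0 * budget_spans_per_second / spans_per_second)


# Example: the counter grew by 3,000,000 spans in one minute; budget is 5,000 spans/s.
rate = ingestion_rate(12_000_000, 15_000_000, 60)
print(rate, required_sampling_percentage(rate, 5_000))
```

The result is only a starting point: error and slow-trace policies keep extra traces on top of the probabilistic share, so leave headroom.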


Conclusion

Distributed tracing is essential for understanding behavior in microservice architectures. The 2026 landscape is dominated by OpenTelemetry as the universal instrumentation standard, with two major innovations:

  1. OpenTelemetry eBPF Instrumentation (OBI) — Zero-code tracing via kernel-level eBPF probes eliminates the instrumentation barrier for legacy services and rapid onboarding
  2. OpenTelemetry Collector as the universal gateway — All traces flow through the Collector, enabling consistent sampling, multi-backend export, and vendor independence

For most teams, the recommended stack is: OTel SDK (or OBI for zero-code) → OTel Collector → Grafana Tempo (self-hosted, cost-effective, object storage) or Jaeger (Kubernetes-native). For teams that need an all-in-one platform with logs and metrics, Grafana Tempo + Loki + Mimir or a commercial solution like Datadog (with budget awareness) are viable options.

Start small. Instrument one service first. Use semantic conventions. Monitor your trace volume. The visibility you gain into request flows, bottlenecks, and error paths will transform how you debug and optimize your systems.
