Distributed Tracing with Jaeger and OpenTelemetry: Mastering Observability in Microservices
A user reports that your application is slow. You check the API response time; it's fast. You check the database; it's responsive. You check individual services; they're all performing well. Yet the user still experiences a slow application. This is the distributed tracing problem.
In monolithic applications, debugging is straightforward: you look at logs and metrics from a single process. In microservices architectures, a single user request might traverse dozens of services, databases, and external APIs. Without distributed tracing, understanding what’s happening is nearly impossible.
Distributed tracing solves this by following requests through your entire system, showing exactly where time is spent and where failures occur. Combined with OpenTelemetry for instrumentation and Jaeger for visualization, you gain unprecedented visibility into your microservices.
Understanding Distributed Tracing
What Is Distributed Tracing?
Distributed tracing is a technique for tracking requests as they flow through a distributed system. It captures the path a request takes, the time spent in each service, and any errors that occur along the way.
Key components:
- Trace: A complete request journey through your system
- Span: An individual operation within that journey (e.g., database query, API call)
- Context: Information passed between services to correlate spans
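To make these three concepts concrete, here is a toy data model in plain Python. It is illustrative only, not the OpenTelemetry API: real spans carry attributes, events, timestamps, and status, but the essential relationships are the same, with every span sharing its trace's ID and pointing at its parent.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str                        # e.g. "db.query" or "GET /api/users"
    trace_id: str                    # shared by every span in the same trace
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None  # links the span into the trace's tree
    start: float = field(default_factory=time.time)

def new_trace_root(name: str) -> Span:
    # The first span of a request starts a brand-new trace
    return Span(name=name, trace_id=uuid.uuid4().hex)

def child_of(parent: Span, name: str) -> Span:
    # Children inherit the trace ID; parent_id records the tree edge
    return Span(name=name, trace_id=parent.trace_id, parent_id=parent.span_id)

root = new_trace_root("GET /api/orders")
db = child_of(root, "db.query")
assert db.trace_id == root.trace_id and db.parent_id == root.span_id
```

A trace, then, is nothing more than the tree of spans that share one trace ID; context is whatever mechanism carries that ID across process boundaries.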
Why Distributed Tracing Matters
Performance optimization: Identify which services are slow and why.
User Request (1000ms total)
├─ API Gateway (50ms)
├─ Auth Service (100ms)
├─ User Service (200ms)
│   └─ Database Query (180ms) ← Bottleneck!
├─ Order Service (400ms)
│   └─ Payment API (350ms) ← Slow external service
└─ Response Assembly (250ms)
Dependency mapping: Understand how services interact.
API Gateway
├─ Auth Service
├─ User Service
│   └─ Database
├─ Order Service
│   ├─ Database
│   └─ Payment API
└─ Notification Service
Root cause analysis: Find where errors originate.
Request fails in Order Service
  → Trace shows it called Payment API
  → Payment API returned 500 error
  → Root cause: External service failure
OpenTelemetry: The Instrumentation Standard
OpenTelemetry is a vendor-neutral observability framework that provides APIs and SDKs for collecting traces, metrics, and logs. It standardizes how applications are instrumented, making it easy to switch backends or use multiple backends simultaneously.
OpenTelemetry Architecture
Application Code
        ↓
OpenTelemetry API (Instrumentation)
        ↓
OpenTelemetry SDK (Collection & Processing)
        ↓
Exporters (Send to backends)
        ↓
Jaeger, Prometheus, Datadog, etc.
Key OpenTelemetry Concepts
Tracer: Creates and manages spans.
Span: Represents a single operation with timing and metadata.
Context: Carries trace information across service boundaries.
Exporter: Sends collected data to observability backends.
Instrumenting with OpenTelemetry
Here’s a practical example using Python:
from flask import Flask, request
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)

# Set up tracer provider
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

app = Flask(__name__)

# Auto-instrument Flask and requests
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

# Get tracer
tracer = trace.get_tracer(__name__)

# Create custom spans
@app.route('/api/orders', methods=['POST'])
def create_order():
    data = request.get_json()
    user_id, total = data["user_id"], data["total"]

    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("user_id", user_id)
        span.set_attribute("order_total", total)

        # Nested span for the database operation
        with tracer.start_as_current_span("save_to_database") as db_span:
            db_span.set_attribute("table", "orders")
            order = save_order(data)  # save the order to the database

        # Nested span for the external API call
        with tracer.start_as_current_span("process_payment") as payment_span:
            payment_span.set_attribute("amount", total)
            result = process_payment(order)  # call the payment provider

    return {"order_id": order.id}
Automatic Instrumentation
OpenTelemetry provides automatic instrumentation for popular frameworks:
# Install instrumentation packages
pip install opentelemetry-instrumentation-flask
pip install opentelemetry-instrumentation-django
pip install opentelemetry-instrumentation-requests
pip install opentelemetry-instrumentation-sqlalchemy
# Enable auto-instrumentation
opentelemetry-instrument python app.py
Jaeger: The Tracing Backend
Jaeger is an open-source distributed tracing platform that stores, processes, and visualizes traces. It’s designed for high-volume trace collection and provides powerful querying and analysis capabilities.
Jaeger Architecture
Applications (OpenTelemetry)
        ↓
Jaeger Agent (UDP receiver)
        ↓
Jaeger Collector (Processing)
        ↓
Storage Backend (Elasticsearch, Cassandra, Badger)
        ↓
Jaeger Query (API & UI)
Running Jaeger
The simplest way to get started is with Docker:
# Run all-in-one Jaeger (includes agent, collector, storage, UI)
docker run -d \
  -p 6831:6831/udp \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest
# Access UI at http://localhost:16686
For production, use a more robust setup:
# docker-compose.yml
version: '3'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.14.0
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"

  jaeger-collector:
    image: jaegertracing/jaeger-collector:latest
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
    ports:
      - "14268:14268"
      - "14250:14250"
    depends_on:
      - elasticsearch

  jaeger-query:
    image: jaegertracing/jaeger-query:latest
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
    ports:
      - "16686:16686"
    depends_on:
      - elasticsearch
OpenTelemetry and Jaeger Together
The Complete Pipeline
1. Application creates spans (OpenTelemetry)
2. Spans are batched and exported (OpenTelemetry SDK)
3. Jaeger Agent receives spans (UDP)
4. Jaeger Collector processes spans
5. Spans stored in backend (Elasticsearch)
6. Jaeger UI queries and visualizes traces
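Step 2 of the pipeline matters for performance: exporting every span synchronously would add latency to each request, so the SDK buffers finished spans and ships them in batches. The following is a toy sketch of that batching idea only; the hypothetical `BatchBuffer` class stands in for the SDK's real `BatchSpanProcessor`, which additionally runs on a background thread.

```python
import time

class BatchBuffer:
    """Toy batcher (hypothetical; the real class is the SDK's BatchSpanProcessor)."""

    def __init__(self, export_fn, max_batch=512, delay_s=5.0):
        self.export_fn = export_fn    # e.g. sends a batch to the Jaeger agent
        self.max_batch = max_batch
        self.delay_s = delay_s
        self.buffer = []
        self.last_flush = time.monotonic()

    def on_end(self, span):
        # Called when a span finishes; export only when a batch is full
        # or the flush interval has elapsed
        self.buffer.append(span)
        full = len(self.buffer) >= self.max_batch
        stale = time.monotonic() - self.last_flush >= self.delay_s
        if full or stale:
            self.flush()

    def flush(self):
        if self.buffer:
            self.export_fn(self.buffer)
            self.buffer = []
        self.last_flush = time.monotonic()
```

The trade-off is visibility delay: with a 5-second flush interval, a trace may take a few seconds to appear in the Jaeger UI after the request completes.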
Context Propagation
For tracing to work across services, context must be propagated. OpenTelemetry handles this automatically with standard headers:
# Service A: Outgoing request
import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Adds trace context headers
response = requests.get("http://service-b/api/data", headers=headers)

# Service B: Incoming request
from opentelemetry.propagate import extract

# Extract context from incoming headers
ctx = extract(request.headers)
with tracer.start_as_current_span("handle_request", context=ctx):
    # This span is automatically linked to the parent trace
    pass
Trace Correlation
Traces are automatically correlated across services using trace IDs:
Service A (Trace ID: abc123)
└─ Span: handle_request
    └─ Calls Service B with Trace ID: abc123

Service B (Trace ID: abc123)
├─ Span: receive_request
└─ Span: process_data
    └─ Calls Service C with Trace ID: abc123

Service C (Trace ID: abc123)
└─ Span: database_query
    └─ Returns to Service B

All spans with Trace ID: abc123 are grouped into one trace
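The shared trace ID travels between services in an HTTP header. OpenTelemetry's default propagator uses the W3C Trace Context format, a `traceparent` header shaped as `version-traceid-spanid-flags`. Here is a stdlib-only sketch of that format; the helper names (`make_traceparent`, `child_headers`) are hypothetical, since in practice `inject`/`extract` do this for you.

```python
import os

def make_traceparent(sampled=True):
    # W3C format: 2-hex version, 32-hex trace ID, 16-hex span ID, 2-hex flags
    trace_id = os.urandom(16).hex()
    span_id = os.urandom(8).hex()
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def child_headers(parent_traceparent):
    # Propagate: keep the trace ID, mint a fresh span ID for the outgoing call
    version, trace_id, _parent_span, flags = parent_traceparent.split("-")
    return {"traceparent": f"{version}-{trace_id}-{os.urandom(8).hex()}-{flags}"}

# Service A starts a trace; Services B and C reuse its trace ID
root = make_traceparent()
hop1 = child_headers(root)["traceparent"]
hop2 = child_headers(hop1)["traceparent"]
assert root.split("-")[1] == hop1.split("-")[1] == hop2.split("-")[1]
```

Because every hop carries the same 32-hex trace ID, the backend can stitch spans from all three services into one trace.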
Practical Implementation Guide
Step 1: Install Dependencies
# Core OpenTelemetry packages
pip install opentelemetry-api
pip install opentelemetry-sdk
pip install opentelemetry-exporter-jaeger
# Instrumentation for your frameworks
pip install opentelemetry-instrumentation-flask
pip install opentelemetry-instrumentation-sqlalchemy
pip install opentelemetry-instrumentation-requests
Step 2: Configure OpenTelemetry
# tracing.py
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

def init_tracing(service_name):
    jaeger_exporter = JaegerExporter(
        agent_host_name="localhost",
        agent_port=6831,
    )

    # Tag every span with the service name so Jaeger can group traces by service
    provider = TracerProvider(
        resource=Resource.create({SERVICE_NAME: service_name})
    )
    provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
    trace.set_tracer_provider(provider)

    # Auto-instrument frameworks
    FlaskInstrumentor().instrument()
    SQLAlchemyInstrumentor().instrument()

    return trace.get_tracer(service_name)
Step 3: Use Tracing in Your Application
# app.py
import requests
from flask import Flask
from tracing import init_tracing

app = Flask(__name__)
tracer = init_tracing("my-service")

@app.route('/api/users/<user_id>')
def get_user(user_id):
    with tracer.start_as_current_span("get_user") as span:
        span.set_attribute("user_id", user_id)

        # Database query is automatically traced
        user = User.query.get(user_id)

        # External API call is automatically traced
        profile = requests.get(f"https://api.example.com/profile/{user_id}")

    return {"user": user.to_dict(), "profile": profile.json()}
Step 4: Configure Sampling
Sampling reduces overhead by tracing only a percentage of requests:
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
# Sample 10% of traces
sampler = TraceIdRatioBased(0.1)
trace_provider = TracerProvider(sampler=sampler)
trace.set_tracer_provider(trace_provider)
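A useful property of `TraceIdRatioBased` is that the decision is computed from the trace ID itself, not from a random draw per service. The following stdlib-only sketch shows the idea in simplified form (it is not the SDK's exact algorithm): treat the low 64 bits of the trace ID as a number and sample when it falls below a threshold.

```python
import os

RATIO = 0.1
BOUND = int(RATIO * (2 ** 64))  # threshold in the 64-bit trace-ID space

def should_sample(trace_id: int) -> bool:
    # Deterministic: the same trace ID yields the same decision in every
    # service, so a trace is kept or dropped end to end, never half-sampled
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < BOUND

ids = [int.from_bytes(os.urandom(16), "big") for _ in range(100_000)]
rate = sum(should_sample(t) for t in ids) / len(ids)
print(f"sampled ≈ {rate:.1%}")
```

Because all services in a trace see the same trace ID, they all make the same keep-or-drop decision without any coordination.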
Key Concepts Deep Dive
Spans and Attributes
Spans represent operations. Attributes provide context:
with tracer.start_as_current_span("database_query") as span:
    # Set attributes for context
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.name", "users_db")
    span.set_attribute("db.statement", "SELECT * FROM users WHERE id = ?")
    span.set_attribute("db.rows_affected", 1)

    # Record events
    span.add_event("query_started")
    # ... execute query ...
    span.add_event("query_completed")
Span Status and Errors
Mark spans as failed when errors occur:
from opentelemetry.trace import Status, StatusCode

with tracer.start_as_current_span("risky_operation") as span:
    try:
        result = risky_operation()
    except Exception as e:
        # Record the failure on the span before re-raising,
        # while the span is still open
        span.set_status(Status(StatusCode.ERROR))
        span.record_exception(e)
        raise
Best Practices
1. Instrument at Service Boundaries
Focus on tracing service-to-service calls and external APIs:
# Good: Trace external calls
with tracer.start_as_current_span("external_api_call"):
    response = requests.get("https://api.example.com/data")

# Less important: Trace internal function calls
# (unless they're performance-critical)
2. Use Meaningful Span Names
Span names should be descriptive and consistent:
# Good
with tracer.start_as_current_span("fetch_user_profile"):
    pass

# Bad
with tracer.start_as_current_span("do_stuff"):
    pass
3. Add Relevant Attributes
Attributes help with filtering and analysis:
with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("payment_method", "credit_card")
    span.set_attribute("amount", 99.99)
    span.set_attribute("currency", "USD")
4. Sample Appropriately
Balance visibility with performance:
# High-traffic services: Sample 1-5%
sampler = TraceIdRatioBased(0.01)
# Low-traffic services: Sample 100%
sampler = TraceIdRatioBased(1.0)
# Sample errors at 100%, normal requests at 1% (simplified sketch;
# the SDK's Sampler interface returns a SamplingResult, not a bool)
import random
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class ErrorSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        if attributes and attributes.get("error"):
            return SamplingResult(Decision.RECORD_AND_SAMPLE)
        decision = (Decision.RECORD_AND_SAMPLE
                    if random.random() < 0.01 else Decision.DROP)
        return SamplingResult(decision)

    def get_description(self):
        return "ErrorSampler"
Common Pitfalls
Pitfall 1: Missing Context Propagation
Traces break when context isn’t propagated between services:
# Wrong: No context propagation
response = requests.get("http://service-b/api/data")
# Right: Propagate context
headers = {}
inject(headers)
response = requests.get("http://service-b/api/data", headers=headers)
Pitfall 2: Too Many Spans
Creating too many spans increases overhead:
# Wrong: Span for every operation
for item in items:
    with tracer.start_as_current_span("process_item"):
        process(item)

# Right: Single span for batch operation
with tracer.start_as_current_span("process_items") as span:
    span.set_attribute("item_count", len(items))
    for item in items:
        process(item)
Pitfall 3: Sensitive Data in Spans
Never include passwords, tokens, or PII in spans:
# Wrong
span.set_attribute("password", user_password)
span.set_attribute("api_key", secret_key)
# Right
span.set_attribute("user_id", user_id)
span.set_attribute("operation", "authentication")
Conclusion
Distributed tracing with OpenTelemetry and Jaeger transforms how you understand and debug microservices. By following requests through your entire system, you gain visibility that’s impossible with traditional monitoring alone.
Key takeaways:
- Distributed tracing is essential for understanding microservices behavior
- OpenTelemetry standardizes instrumentation across your stack
- Jaeger provides powerful visualization and analysis capabilities
- Context propagation is critical for trace correlation
- Start simple with auto-instrumentation, then add custom spans
- Sample strategically to balance visibility with performance
- Avoid common pitfalls like missing context propagation and sensitive data
The investment in distributed tracing pays dividends in faster debugging, better performance optimization, and ultimately, more reliable systems. Start implementing today, and you’ll wonder how you ever debugged microservices without it.