⚡ Calmops

Distributed Tracing with Jaeger and OpenTelemetry: Mastering Observability in Microservices

A user reports that your application is slow. You check the API response time: it's fast. You check the database: it's responsive. You check individual services: they're all performing well. Yet the user still experiences a slow application. This is the distributed tracing problem.

In monolithic applications, debugging is straightforward: you look at logs and metrics from a single process. In microservices architectures, a single user request might traverse dozens of services, databases, and external APIs. Without distributed tracing, understanding what’s happening is nearly impossible.

Distributed tracing solves this by following requests through your entire system, showing exactly where time is spent and where failures occur. Combined with OpenTelemetry for instrumentation and Jaeger for visualization, you gain unprecedented visibility into your microservices.

Understanding Distributed Tracing

What Is Distributed Tracing?

Distributed tracing is a technique for tracking requests as they flow through a distributed system. It captures the path a request takes, the time spent in each service, and any errors that occur along the way.

Key components:

  • Trace: A complete request journey through your system
  • Span: An individual operation within that journey (e.g., database query, API call)
  • Context: Information passed between services to correlate spans
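As a mental model, the three concepts can be sketched as plain data structures (illustrative names only, not the OpenTelemetry API):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Context:
    """Correlation info passed between services."""
    trace_id: str
    parent_span_id: Optional[str] = None

@dataclass
class Span:
    """One operation within a trace, with timing."""
    span_id: str
    name: str
    trace_id: str
    parent_span_id: Optional[str]
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

@dataclass
class Trace:
    """A complete request journey: every span sharing one trace ID."""
    trace_id: str
    spans: list = field(default_factory=list)
```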

Why Distributed Tracing Matters

Performance optimization: Identify which services are slow and why.

User Request (1000ms total)
├─ API Gateway (50ms)
├─ Auth Service (100ms)
├─ User Service (200ms)
│  └─ Database Query (180ms) ← Bottleneck!
├─ Order Service (400ms)
│  └─ Payment API (350ms) ← Slow external service
└─ Response Assembly (250ms)
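Given per-span durations like those in the tree above, locating the bottleneck is a simple comparison (names and times taken from the example trace):

```python
# (name, duration_ms) pairs from the example trace
spans = [
    ("API Gateway", 50),
    ("Auth Service", 100),
    ("User Service / Database Query", 180),
    ("Order Service / Payment API", 350),
    ("Response Assembly", 250),
]

def find_bottleneck(spans):
    """Return the span that consumed the most time."""
    return max(spans, key=lambda s: s[1])

print(find_bottleneck(spans))  # ('Order Service / Payment API', 350)
```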

Dependency mapping: Understand how services interact.

API Gateway
├─ Auth Service
├─ User Service
│  └─ Database
├─ Order Service
│  ├─ Database
│  └─ Payment API
└─ Notification Service

Root cause analysis: Find where errors originate.

Request fails in Order Service
→ Trace shows it called Payment API
→ Payment API returned 500 error
→ Root cause: External service failure

OpenTelemetry: The Instrumentation Standard

OpenTelemetry is a vendor-neutral observability framework that provides APIs and SDKs for collecting traces, metrics, and logs. It standardizes how applications are instrumented, making it easy to switch backends or use multiple backends simultaneously.

OpenTelemetry Architecture

Application Code
    ↓
OpenTelemetry API (Instrumentation)
    ↓
OpenTelemetry SDK (Collection & Processing)
    ↓
Exporters (Send to backends)
    ↓
Jaeger, Prometheus, Datadog, etc.

Key OpenTelemetry Concepts

Tracer: Creates and manages spans.

Span: Represents a single operation with timing and metadata.

Context: Carries trace information across service boundaries.

Exporter: Sends collected data to observability backends.

Instrumenting with OpenTelemetry

Here’s a practical example using Python:

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)

# Set up tracer provider
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

# Auto-instrument Flask and requests
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

# Get tracer
tracer = trace.get_tracer(__name__)

# Create custom spans (user_id, total, and data are assumed to come
# from the incoming request payload)
@app.route('/api/orders')
def create_order():
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("user_id", user_id)
        span.set_attribute("order_total", total)
        
        # Nested span for database operation
        with tracer.start_as_current_span("save_to_database") as db_span:
            db_span.set_attribute("table", "orders")
            # Save order to database
            order = save_order(data)
        
        # Nested span for external API call
        with tracer.start_as_current_span("process_payment") as payment_span:
            payment_span.set_attribute("amount", total)
            # Process payment
            result = process_payment(order)
        
        return {"order_id": order.id}

Automatic Instrumentation

OpenTelemetry provides automatic instrumentation for popular frameworks:

# Install instrumentation packages
pip install opentelemetry-instrumentation-flask
pip install opentelemetry-instrumentation-django
pip install opentelemetry-instrumentation-requests
pip install opentelemetry-instrumentation-sqlalchemy

# Enable auto-instrumentation
opentelemetry-instrument python app.py

Jaeger: The Tracing Backend

Jaeger is an open-source distributed tracing platform that stores, processes, and visualizes traces. It’s designed for high-volume trace collection and provides powerful querying and analysis capabilities.

Jaeger Architecture

Applications (OpenTelemetry)
    ↓
Jaeger Agent (UDP receiver)
    ↓
Jaeger Collector (Processing)
    ↓
Storage Backend (Elasticsearch, Cassandra, Badger)
    ↓
Jaeger Query (API & UI)

Running Jaeger

The simplest way to get started is with Docker:

# Run all-in-one Jaeger (includes agent, collector, storage, UI)
docker run -d \
  -p 6831:6831/udp \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest

# Access UI at http://localhost:16686

For production, use a more robust setup:

# docker-compose.yml
version: '3'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.14.0
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"

  jaeger-collector:
    image: jaegertracing/jaeger-collector:latest
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
    ports:
      - "14268:14268"
      - "14250:14250"
    depends_on:
      - elasticsearch

  jaeger-query:
    image: jaegertracing/jaeger-query:latest
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
    ports:
      - "16686:16686"
    depends_on:
      - elasticsearch

OpenTelemetry and Jaeger Together

The Complete Pipeline

1. Application creates spans (OpenTelemetry)
2. Spans are batched and exported (OpenTelemetry SDK)
3. Jaeger Agent receives spans (UDP)
4. Jaeger Collector processes spans
5. Spans stored in backend (Elasticsearch)
6. Jaeger UI queries and visualizes traces
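Step 2 above, batching, can be pictured as a buffer that flushes to the exporter once it reaches a fixed size. A minimal sketch of the idea (hypothetical class, not the SDK's actual BatchSpanProcessor internals):

```python
class BatchBuffer:
    """Collects finished spans and hands them to an exporter in batches."""
    def __init__(self, export_fn, max_batch_size=2):
        self.export_fn = export_fn
        self.max_batch_size = max_batch_size
        self.pending = []

    def on_end(self, span):
        self.pending.append(span)
        if len(self.pending) >= self.max_batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.export_fn(self.pending)
            self.pending = []

exported = []
buf = BatchBuffer(exported.append, max_batch_size=2)
buf.on_end("span-a")
buf.on_end("span-b")  # batch size reached, exported immediately
buf.on_end("span-c")
buf.flush()           # flush the remainder on shutdown
# exported == [["span-a", "span-b"], ["span-c"]]
```

Batching is why traces appear in Jaeger with a short delay, and why a clean shutdown (which triggers the final flush) matters.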

Context Propagation

For tracing to work across services, context must be propagated. OpenTelemetry's auto-instrumentation does this automatically using standard W3C Trace Context headers; when making calls manually, you inject and extract the context yourself:

# Service A: Outgoing request
import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Adds trace context headers
response = requests.get("http://service-b/api/data", headers=headers)

# Service B: Incoming request
from opentelemetry.propagate import extract

# Extract context from incoming headers
ctx = extract(request.headers)
with tracer.start_as_current_span("handle_request", context=ctx):
    # This span is automatically linked to the parent trace
    pass
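With the default W3C Trace Context propagator, inject() writes a single traceparent header carrying four dash-separated fields: version, trace ID, parent span ID, and flags. A small parser shows the layout (the header value below is the W3C specification's own example):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,          # currently always "00"
        "trace_id": trace_id,        # 32 hex chars
        "parent_span_id": span_id,   # 16 hex chars
        "sampled": flags == "01",    # flag bit 0 = sampled
    }

ctx = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
print(ctx["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
```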

Trace Correlation

Traces are automatically correlated across services using trace IDs:

Service A (Trace ID: abc123)
├─ Span: handle_request
└─ Calls Service B with Trace ID: abc123

Service B (Trace ID: abc123)
├─ Span: receive_request
├─ Span: process_data
└─ Calls Service C with Trace ID: abc123

Service C (Trace ID: abc123)
├─ Span: database_query
└─ Returns to Service B

All spans with Trace ID: abc123 are grouped into one trace
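Grouping spans into traces is essentially a bucket-by-trace-ID operation; a sketch of what the backend does as spans arrive from different services:

```python
from collections import defaultdict

# (trace_id, span_name) pairs as they arrive, interleaved across services
incoming = [
    ("abc123", "handle_request"),   # Service A
    ("abc123", "receive_request"),  # Service B
    ("def456", "unrelated_span"),   # some other request
    ("abc123", "database_query"),   # Service C
]

def group_by_trace(spans):
    """Bucket spans by their shared trace ID."""
    traces = defaultdict(list)
    for trace_id, name in spans:
        traces[trace_id].append(name)
    return dict(traces)

traces = group_by_trace(incoming)
# traces["abc123"] == ["handle_request", "receive_request", "database_query"]
```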

Practical Implementation Guide

Step 1: Install Dependencies

# Core OpenTelemetry packages
pip install opentelemetry-api
pip install opentelemetry-sdk
pip install opentelemetry-exporter-jaeger

# Instrumentation for your frameworks
pip install opentelemetry-instrumentation-flask
pip install opentelemetry-instrumentation-sqlalchemy
pip install opentelemetry-instrumentation-requests

Step 2: Configure OpenTelemetry

# tracing.py
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

def init_tracing(service_name):
    jaeger_exporter = JaegerExporter(
        agent_host_name="localhost",
        agent_port=6831,
    )

    # Attach the service name as a resource so Jaeger can group spans by service
    trace.set_tracer_provider(TracerProvider(
        resource=Resource.create({"service.name": service_name})
    ))
    trace.get_tracer_provider().add_span_processor(
        BatchSpanProcessor(jaeger_exporter)
    )

    # Auto-instrument frameworks
    FlaskInstrumentor().instrument()
    SQLAlchemyInstrumentor().instrument()

    return trace.get_tracer(__name__)

Step 3: Use Tracing in Your Application

# app.py
import requests
from flask import Flask
from tracing import init_tracing

app = Flask(__name__)
tracer = init_tracing("my-service")

@app.route('/api/users/<user_id>')
def get_user(user_id):
    with tracer.start_as_current_span("get_user") as span:
        span.set_attribute("user_id", user_id)
        
        # Database query is automatically traced
        user = User.query.get(user_id)
        
        # External API call is automatically traced
        profile = requests.get(f"https://api.example.com/profile/{user_id}")
        
        return {"user": user, "profile": profile.json()}

Step 4: Configure Sampling

Sampling reduces overhead by tracing only a percentage of requests:

from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Sample 10% of traces
sampler = TraceIdRatioBased(0.1)

trace_provider = TracerProvider(sampler=sampler)
trace.set_tracer_provider(trace_provider)
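TraceIdRatioBased is deterministic: the keep/drop decision is derived from the trace ID itself, so every service sampling at the same ratio agrees on the same traces and you never get half-sampled traces. A simplified sketch of the idea (not the SDK's exact arithmetic):

```python
def ratio_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: the same trace ID yields the same
    answer in every service, keeping sampled traces complete."""
    bound = int(ratio * (2 ** 64))
    # Compare the low 64 bits of the trace ID against the threshold
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < bound

tid = 0x4BF92F3577B34DA6A3CE929D0E0E4736
assert ratio_sample(tid, 1.0)      # ratio 1.0 keeps every trace
assert not ratio_sample(tid, 0.0)  # ratio 0.0 drops every trace
```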

Key Concepts Deep Dive

Spans and Attributes

Spans represent operations. Attributes provide context:

with tracer.start_as_current_span("database_query") as span:
    # Set attributes for context
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.name", "users_db")
    span.set_attribute("db.statement", "SELECT * FROM users WHERE id = ?")
    span.set_attribute("db.rows_affected", 1)
    
    # Record events
    span.add_event("query_started")
    # ... execute query ...
    span.add_event("query_completed")

Span Status and Errors

Mark spans as failed when errors occur:

from opentelemetry.trace import Status, StatusCode

with tracer.start_as_current_span("risky_operation") as span:
    try:
        result = risky_operation()
    except Exception as e:
        # Mark the span failed and attach the exception before re-raising;
        # this must happen inside the with block, while the span is active
        span.set_status(Status(StatusCode.ERROR))
        span.record_exception(e)
        raise

Best Practices

1. Instrument at Service Boundaries

Focus on tracing service-to-service calls and external APIs:

# Good: Trace external calls
with tracer.start_as_current_span("external_api_call"):
    response = requests.get("https://api.example.com/data")

# Less important: Trace internal function calls
# (unless they're performance-critical)

2. Use Meaningful Span Names

Span names should be descriptive and consistent:

# Good
with tracer.start_as_current_span("fetch_user_profile"):
    pass

# Bad
with tracer.start_as_current_span("do_stuff"):
    pass

3. Add Relevant Attributes

Attributes help with filtering and analysis:

with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("payment_method", "credit_card")
    span.set_attribute("amount", 99.99)
    span.set_attribute("currency", "USD")

4. Sample Appropriately

Balance visibility with performance:

import random
from opentelemetry.sdk.trace.sampling import (
    Decision, Sampler, SamplingResult, TraceIdRatioBased,
)

# High-traffic services: Sample 1-5%
sampler = TraceIdRatioBased(0.01)

# Low-traffic services: Sample 100%
sampler = TraceIdRatioBased(1.0)

# Sketch: sample errors at 100%, normal requests at 1%
class ErrorSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        if attributes and attributes.get("error"):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        if random.random() < 0.01:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)

    def get_description(self):
        return "ErrorSampler"

Common Pitfalls

Pitfall 1: Missing Context Propagation

Traces break when context isn’t propagated between services:

# Wrong: No context propagation
response = requests.get("http://service-b/api/data")

# Right: Propagate context
headers = {}
inject(headers)
response = requests.get("http://service-b/api/data", headers=headers)

Pitfall 2: Too Many Spans

Creating too many spans increases overhead:

# Wrong: Span for every operation
for item in items:
    with tracer.start_as_current_span("process_item"):
        process(item)

# Right: Single span for batch operation
with tracer.start_as_current_span("process_items") as span:
    span.set_attribute("item_count", len(items))
    for item in items:
        process(item)

Pitfall 3: Sensitive Data in Spans

Never include passwords, tokens, or PII in spans:

# Wrong
span.set_attribute("password", user_password)
span.set_attribute("api_key", secret_key)

# Right
span.set_attribute("user_id", user_id)
span.set_attribute("operation", "authentication")
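Beyond code review, a scrubbing step can enforce this rule before attributes are attached; a hypothetical helper (key list and function name are illustrative):

```python
SENSITIVE_KEYS = {"password", "api_key", "token", "secret", "authorization"}

def scrub_attributes(attrs: dict) -> dict:
    """Drop any attribute whose key looks sensitive before it reaches a span."""
    return {
        k: v for k, v in attrs.items()
        if not any(s in k.lower() for s in SENSITIVE_KEYS)
    }

safe = scrub_attributes({
    "user_id": "42",
    "password": "hunter2",            # dropped
    "payment_api_key": "sk_test_123", # dropped
})
# safe == {"user_id": "42"}
```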

Conclusion

Distributed tracing with OpenTelemetry and Jaeger transforms how you understand and debug microservices. By following requests through your entire system, you gain visibility that’s impossible with traditional monitoring alone.

Key takeaways:

  1. Distributed tracing is essential for understanding microservices behavior
  2. OpenTelemetry standardizes instrumentation across your stack
  3. Jaeger provides powerful visualization and analysis capabilities
  4. Context propagation is critical for trace correlation
  5. Start simple with auto-instrumentation, then add custom spans
  6. Sample strategically to balance visibility with performance
  7. Avoid common pitfalls like missing context propagation and sensitive data

The investment in distributed tracing pays dividends in faster debugging, better performance optimization, and ultimately, more reliable systems. Start implementing today, and you’ll wonder how you ever debugged microservices without it.
