
Observability Essentials: Prometheus, Jaeger, and ELK Stack for Complete System Visibility

In modern distributed systems, something is always failing. A service is slow, a database query times out, or a user reports an error. Without proper observability, finding the root cause is like searching for a needle in a haystack. With it, you have a clear picture of what’s happening across your entire system.

Observability is the ability to understand the internal state of a system based on its external outputs. It goes beyond traditional monitoring by providing deep insights into system behavior. The three pillars of observability (metrics, traces, and logs) each tell a different story about your system. Together, they provide comprehensive visibility.

In this guide, we’ll explore how Prometheus, Jaeger, and the ELK stack form the foundation of a complete observability solution.

The Three Pillars of Observability

Metrics: What’s Happening?

Metrics are numerical measurements of system behavior over time. They answer questions like: How many requests per second? What’s the CPU usage? How many errors occurred?

Characteristics:

  • Time-series data (values over time)
  • Aggregated and summarized
  • Low cardinality (limited unique combinations)
  • Efficient storage and querying

Traces: How Did It Happen?

Traces follow a request through your entire system, showing how different services interact. They answer questions like: Why is this request slow? Which service is the bottleneck? Where did the error originate?

Characteristics:

  • Request-level detail
  • Shows service dependencies
  • Captures timing information
  • High cardinality (many unique combinations)

Logs: What Exactly Happened?

Logs are detailed records of events in your system. They answer questions like: What error message did we get? What was the user doing? What state was the system in?

Characteristics:

  • Unstructured or semi-structured text
  • Event-level detail
  • High volume
  • Searchable and queryable
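Logs become far easier to search and aggregate when they are emitted as structured JSON rather than free-form text. A minimal sketch of JSON logging with Python's standard library follows; the field names (service, request_id) are illustrative conventions, not a required schema:

```python
# Emit one JSON object per log line so downstream tools (e.g. Logstash
# with a json codec) can index fields without grok parsing.
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Extra fields attached via the `extra=` argument to logger calls
            "service": getattr(record, "service", None),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment failed", extra={"service": "payments", "request_id": "abc-123"})
```

Each line this produces can be shipped to the ELK stack unchanged and queried by field.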

Prometheus: Metrics and Monitoring

Prometheus is a time-series database and monitoring system designed for reliability and operational simplicity. It collects metrics from your applications and infrastructure, stores them efficiently, and provides powerful querying capabilities.

How Prometheus Works

Prometheus uses a pull model: it scrapes metrics from applications at regular intervals. Applications expose metrics on an HTTP endpoint (typically /metrics), and Prometheus periodically fetches them.
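What Prometheus actually scrapes is plain text in the Prometheus exposition format. A sample response from a /metrics endpoint looks like this (metric names and values are illustrative):

```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/users"} 1027
http_requests_total{method="POST",endpoint="/users"} 3
# HELP process_resident_memory_bytes Resident memory size in bytes
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.6e+07
```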

# prometheus.yml - Basic configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'api-server'
    static_configs:
      - targets: ['localhost:8080']
  
  - job_name: 'postgres'
    static_configs:
      # Databases don't expose /metrics themselves; scrape an exporter
      # such as postgres_exporter (default port 9187) instead.
      - targets: ['localhost:9187']

Instrumenting Applications

Applications expose metrics using client libraries:

# Python example using prometheus_client
from prometheus_client import Counter, Histogram, start_http_server
import time

# Define metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration')

@request_duration.time()
def handle_request(method, endpoint):
    # Your request handling logic
    request_count.labels(method=method, endpoint=endpoint).inc()
    time.sleep(0.1)

# Start metrics server
start_http_server(8000)

# Metrics available at http://localhost:8000/metrics

Querying Metrics with PromQL

Prometheus Query Language (PromQL) allows powerful metric analysis:

# Current request rate (requests per second)
rate(http_requests_total[5m])

# 95th percentile request latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate percentage
(rate(http_requests_total{status="500"}[5m]) / rate(http_requests_total[5m])) * 100

# Fraction of memory still available
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

When to Use Prometheus

  • Monitoring infrastructure and applications
  • Alerting on threshold violations
  • Capacity planning and trend analysis
  • Performance optimization
  • SLA tracking
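Alerting is handled by rule files that Prometheus evaluates continuously. A sketch of a rule for the error-rate query above, assuming the file is referenced from prometheus.yml via `rule_files` (thresholds and labels are illustrative):

```yaml
# alert_rules.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          (rate(http_requests_total{status="500"}[5m])
            / rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for 10 minutes"
```

The `for: 10m` clause keeps the alert pending until the condition has held for ten minutes, which suppresses alerts on brief spikes.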

Jaeger: Distributed Tracing

Jaeger is a distributed tracing platform that helps you understand request flows through your microservices architecture. It’s particularly valuable in complex systems where requests span multiple services.

Understanding Distributed Tracing

A trace represents a complete request journey through your system. Each trace contains spans: individual operations within that journey.

User Request
    ↓
[API Gateway Span]
    ├─→ [Auth Service Span]
    ├─→ [User Service Span]
    │   └─→ [Database Query Span]
    └─→ [Cache Lookup Span]

Instrumenting with Jaeger

Applications send trace data to Jaeger using client libraries:

# Python example using jaeger-client (deprecated in favor of the
# OpenTelemetry SDK, but still common in existing deployments)
from jaeger_client import Config

def init_jaeger_tracer(service_name):
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
        },
        service_name=service_name,
    )
    return config.initialize_tracer()

tracer = init_jaeger_tracer('my-service')

# Create spans
with tracer.start_active_span('process_request') as scope:
    span = scope.span
    span.set_tag('user_id', user_id)
    
    # Nested span for database operation
    with tracer.start_active_span('database_query') as db_scope:
        db_scope.span.set_tag('query', 'SELECT * FROM users')
        # Execute query
        pass

Analyzing Traces

Jaeger UI provides visualization and analysis:

  • Trace view: See the complete request flow
  • Service dependencies: Understand how services interact
  • Latency analysis: Identify slow operations
  • Error tracking: Find where errors occur in the flow

When to Use Jaeger

  • Debugging slow requests
  • Understanding service dependencies
  • Identifying bottlenecks in microservices
  • Tracing errors across services
  • Performance optimization in distributed systems

ELK Stack: Log Aggregation and Analysis

The ELK stack (Elasticsearch, Logstash, Kibana) provides centralized log management. Elasticsearch stores logs, Logstash processes and enriches them, and Kibana provides visualization and analysis.

Architecture Overview

Applications
    ↓
Logstash (Processing)
    ↓
Elasticsearch (Storage)
    ↓
Kibana (Visualization)

Logstash Configuration

Logstash processes logs from various sources:

# logstash.conf
input {
  tcp {
    port => 5000
    codec => json
  }
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
  }
}

filter {
  if [type] == "api_request" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}

Querying Logs in Kibana

Kibana provides powerful search and visualization:

# Find all errors in the last hour
level: ERROR AND timestamp: [now-1h TO now]

# Find slow API requests
type: api_request AND response_time: [1000 TO *]

# Counting errors by service is an aggregation, not a query-bar search:
# build a terms aggregation on service_name in a Kibana visualization
# (or query Elasticsearch directly)
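For aggregations, you can also query Elasticsearch directly. A sketch of a terms aggregation counting errors per service, assuming the daily `logs-*` indices from the Logstash config above and an indexed `service_name` field:

```json
POST /logs-*/_search
{
  "size": 0,
  "query": { "term": { "level": "ERROR" } },
  "aggs": {
    "errors_by_service": {
      "terms": { "field": "service_name.keyword" }
    }
  }
}
```

`"size": 0` suppresses the individual hits so the response contains only the per-service counts.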

When to Use ELK Stack

  • Centralized log aggregation
  • Debugging application issues
  • Compliance and audit logging
  • Security analysis and threat detection
  • Operational troubleshooting

How They Work Together

These three tools complement each other, each providing different perspectives:

The Complete Picture

Prometheus (Metrics)
├─ "System is slow"
├─ "Error rate increased"
└─ "CPU usage spiked"
    ↓
Jaeger (Traces)
├─ "Request to service A takes 5s"
├─ "Database query is the bottleneck"
└─ "Error occurs in service B"
    ↓
ELK (Logs)
├─ "Database connection timeout"
├─ "Out of memory error"
└─ "Authentication failed"

Practical Investigation Workflow

  1. Alert from Prometheus: Error rate increased
  2. Check Jaeger: Identify which service is slow
  3. Review ELK logs: Find the root cause (database error, timeout, etc.)
  4. Correlate data: Use timestamps and request IDs to connect all three
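Step 4 hinges on every service stamping the same request ID on its logs and spans. A minimal sketch of that glue in Python, assuming the common `X-Request-ID` header convention (neither the header name nor the contextvar approach is mandated by any of the three tools):

```python
# Propagate one request ID per request and attach it to every log record,
# so a trace in Jaeger and its log lines in ELK can be joined on that ID.
import logging
import uuid
from contextvars import ContextVar

request_id_var: ContextVar[str] = ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


log = logging.getLogger("app")
log.addFilter(RequestIdFilter())


def handle_request(headers: dict) -> str:
    # Reuse the caller's ID if present so the whole call chain shares one
    # ID; otherwise mint a fresh one at the edge of the system.
    rid = headers.get("X-Request-ID", str(uuid.uuid4()))
    request_id_var.set(rid)
    # Records now carry request_id; a formatter can include it in output,
    # and the same ID would be set as a span tag in Jaeger.
    log.info("handling request")
    return rid
```

With this in place, an alert leads to a trace, the trace yields a request ID, and a single `request_id: <id>` search in Kibana returns every log line for that request.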

Example: Investigating a Performance Issue

1. Prometheus Alert: p95 latency > 1s
   ↓
2. Jaeger Investigation: 
   - Trace shows request spending 800ms in payment service
   - Payment service calls external API
   ↓
3. ELK Log Search:
   - Find logs from payment service during that time
   - Discover: "External API timeout after 5 retries"
   ↓
4. Root Cause: External API was down, causing timeouts

Implementation Considerations

Choosing What to Monitor

Metrics (Prometheus):

  • Application performance (latency, throughput, errors)
  • Resource utilization (CPU, memory, disk)
  • Business metrics (transactions, conversions)

Traces (Jaeger):

  • Request flows through microservices
  • Performance bottlenecks
  • Error propagation

Logs (ELK):

  • Detailed error messages
  • Application state at specific times
  • Security events

Sampling and Cost

Collecting everything is expensive. Use sampling strategically:

# Sample 10% of traces using jaeger_client's built-in sampler
from jaeger_client.sampler import ProbabilisticSampler

sampler = ProbabilisticSampler(rate=0.1)

# Keep all errors but only 1% of normal traffic. Note the sampler is
# normally fixed at tracer creation; making this decision per request
# requires tail-based sampling in the collector rather than app logic.
if is_error:
    sampler = ProbabilisticSampler(rate=1.0)
else:
    sampler = ProbabilisticSampler(rate=0.01)

Data Retention

Balance visibility with storage costs:

  • Metrics: 15 days (Prometheus default)
  • Traces: 72 hours (typical)
  • Logs: 30 days (common practice)

Integration Points

Modern observability platforms integrate these tools:

  • OpenTelemetry: Unified instrumentation standard
  • Grafana: Unified visualization across Prometheus and other sources
  • Correlation IDs: Link traces and logs using request IDs

Getting Started

Minimal Setup

# Start Prometheus
docker run -p 9090:9090 prom/prometheus

# Start Jaeger
docker run -p 6831:6831/udp -p 16686:16686 jaegertracing/all-in-one

# Start ELK Stack
docker-compose up -d elasticsearch logstash kibana

First Steps

  1. Instrument your application with metrics, traces, and logs
  2. Configure collection (Prometheus scrape, Jaeger agent, Logstash)
  3. Set up dashboards in Prometheus and Kibana
  4. Create alerts based on key metrics
  5. Practice investigation using all three tools together

Conclusion

Observability is not a single tool but a comprehensive approach to understanding your systems. Prometheus, Jaeger, and ELK stack each play a crucial role:

  • Prometheus answers “what’s happening” with metrics and alerting
  • Jaeger answers “how did it happen” with distributed tracing
  • ELK answers “what exactly happened” with detailed logs

Together, they provide the visibility needed to operate complex systems reliably.

Key takeaways:

  1. Implement all three pillars: Metrics, traces, and logs provide complementary insights
  2. Start simple: Begin with basic instrumentation and expand gradually
  3. Use correlation IDs: Link data across tools for easier investigation
  4. Sample strategically: Balance visibility with cost
  5. Practice investigation: Regularly use these tools to understand your systems

The investment in observability pays dividends in reduced mean time to resolution (MTTR), better system understanding, and ultimately, more reliable services. Start implementing these tools today, and you’ll be better equipped to handle the inevitable failures that come with distributed systems.
