
Observability Essentials: Prometheus, Jaeger, and ELK Stack for Complete System Visibility

In modern distributed systems, something is always failing. A service is slow, a database query times out, or a user reports an error. Without proper observability, finding the root cause is like searching for a needle in a haystack. With it, you have a clear picture of what’s happening across your entire system.

Observability is the ability to understand the internal state of a system based on its external outputs. It goes beyond traditional monitoring by providing deep insights into system behavior. The three pillars of observability (metrics, traces, and logs) each tell a different story about your system. Together, they provide comprehensive visibility.

In this guide, we’ll explore how Prometheus, Jaeger, and the ELK stack form the foundation of a complete observability solution.

The Three Pillars of Observability

Metrics: What’s Happening?

Metrics are numerical measurements of system behavior over time. They answer questions like: How many requests per second? What’s the CPU usage? How many errors occurred?

Characteristics:

  • Time-series data (values over time)
  • Aggregated and summarized
  • Low cardinality (limited unique combinations)
  • Efficient storage and querying

Traces: How Did It Happen?

Traces follow a request through your entire system, showing how different services interact. They answer questions like: Why is this request slow? Which service is the bottleneck? Where did the error originate?

Characteristics:

  • Request-level detail
  • Shows service dependencies
  • Captures timing information
  • High cardinality (many unique combinations)

Logs: What Exactly Happened?

Logs are detailed records of events in your system. They answer questions like: What error message did we get? What was the user doing? What state was the system in?

Characteristics:

  • Unstructured or semi-structured text
  • Event-level detail
  • High volume
  • Searchable and queryable
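Logs become far easier to search and aggregate when they are emitted as structured JSON rather than free-form text. A minimal sketch of JSON logging with Python's standard library follows; the field names (service, request_id) are illustrative conventions, not a required schema:

```python
# Emit one JSON object per log line so downstream tools (e.g. Logstash
# with a json codec) can index fields without grok parsing.
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Extra fields attached via the `extra=` argument to logger calls
            "service": getattr(record, "service", None),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment failed", extra={"service": "payments", "request_id": "abc-123"})
```

Each line this produces can be shipped to the ELK stack unchanged and queried by field.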

Prometheus: Metrics and Monitoring

Prometheus is a time-series database and monitoring system designed for reliability and operational simplicity. It collects metrics from your applications and infrastructure, stores them efficiently, and provides powerful querying capabilities.

How Prometheus Works

Prometheus uses a pull model: it scrapes metrics from applications at regular intervals. Applications expose metrics on an HTTP endpoint (typically /metrics), and Prometheus periodically fetches them.
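What Prometheus actually scrapes is plain text in the Prometheus exposition format. A sample response from a /metrics endpoint looks like this (metric names and values are illustrative):

```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/users"} 1027
http_requests_total{method="POST",endpoint="/users"} 3
# HELP process_resident_memory_bytes Resident memory size in bytes
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.6e+07
```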

# prometheus.yml - Basic configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'api-server'
    static_configs:
      - targets: ['localhost:8080']
  
  - job_name: 'postgres'
    static_configs:
      # Databases don't expose /metrics themselves; scrape an exporter
      # such as postgres_exporter (default port 9187) instead.
      - targets: ['localhost:9187']

Instrumenting Applications

Applications expose metrics using client libraries:

# Python example using prometheus_client
from prometheus_client import Counter, Histogram, start_http_server
import time

# Define metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration')

@request_duration.time()
def handle_request(method, endpoint):
    # Your request handling logic
    request_count.labels(method=method, endpoint=endpoint).inc()
    time.sleep(0.1)

# Start metrics server
start_http_server(8000)

# Metrics available at http://localhost:8000/metrics

Querying Metrics with PromQL

Prometheus Query Language (PromQL) allows powerful metric analysis:

# Current request rate (requests per second)
rate(http_requests_total[5m])

# 95th percentile request latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate percentage
(rate(http_requests_total{status="500"}[5m]) / rate(http_requests_total[5m])) * 100

# Fraction of memory still available
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

When to Use Prometheus

  • Monitoring infrastructure and applications
  • Alerting on threshold violations
  • Capacity planning and trend analysis
  • Performance optimization
  • SLA tracking
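Alerting is handled by rule files that Prometheus evaluates continuously. A sketch of a rule for the error-rate query above, assuming the file is referenced from prometheus.yml via `rule_files` (thresholds and labels are illustrative):

```yaml
# alert_rules.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          (rate(http_requests_total{status="500"}[5m])
            / rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for 10 minutes"
```

The `for: 10m` clause keeps the alert pending until the condition has held for ten minutes, which suppresses alerts on brief spikes.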

Jaeger: Distributed Tracing

Jaeger is a distributed tracing platform that helps you understand request flows through your microservices architecture. It’s particularly valuable in complex systems where requests span multiple services.

Understanding Distributed Tracing

A trace represents a complete request journey through your system. Each trace contains spans: individual operations within that journey.

User Request
    ↓
[API Gateway Span]
    ├─→ [Auth Service Span]
    ├─→ [User Service Span]
    │   └─→ [Database Query Span]
    └─→ [Cache Lookup Span]

Instrumenting with Jaeger

Applications send trace data to Jaeger using client libraries:

# Python example using jaeger-client (deprecated in favor of the
# OpenTelemetry SDK, but still common in existing deployments)
from jaeger_client import Config

def init_jaeger_tracer(service_name):
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
        },
        service_name=service_name,
    )
    return config.initialize_tracer()

tracer = init_jaeger_tracer('my-service')

# Create spans
with tracer.start_active_span('process_request') as scope:
    span = scope.span
    span.set_tag('user_id', user_id)
    
    # Nested span for database operation
    with tracer.start_active_span('database_query') as db_scope:
        db_scope.span.set_tag('query', 'SELECT * FROM users')
        # Execute query
        pass

Analyzing Traces

Jaeger UI provides visualization and analysis:

  • Trace view: See the complete request flow
  • Service dependencies: Understand how services interact
  • Latency analysis: Identify slow operations
  • Error tracking: Find where errors occur in the flow

When to Use Jaeger

  • Debugging slow requests
  • Understanding service dependencies
  • Identifying bottlenecks in microservices
  • Tracing errors across services
  • Performance optimization in distributed systems

ELK Stack: Log Aggregation and Analysis

The ELK stack (Elasticsearch, Logstash, Kibana) provides centralized log management. Elasticsearch stores logs, Logstash processes and enriches them, and Kibana provides visualization and analysis.

Architecture Overview

Applications
    ↓
Logstash (Processing)
    ↓
Elasticsearch (Storage)
    ↓
Kibana (Visualization)

Logstash Configuration

Logstash processes logs from various sources:

# logstash.conf
input {
  tcp {
    port => 5000
    codec => json
  }
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
  }
}

filter {
  if [type] == "api_request" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}

Querying Logs in Kibana

Kibana provides powerful search and visualization:

# Find all errors in the last hour
level: ERROR AND timestamp: [now-1h TO now]

# Find slow API requests
type: api_request AND response_time: [1000 TO *]

# Counting errors by service is an aggregation, not a query-bar search:
# build a terms aggregation on service_name in a Kibana visualization
# (or query Elasticsearch directly)
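For aggregations, you can also query Elasticsearch directly. A sketch of a terms aggregation counting errors per service, assuming the daily `logs-*` indices from the Logstash config above and an indexed `service_name` field:

```json
POST /logs-*/_search
{
  "size": 0,
  "query": { "term": { "level": "ERROR" } },
  "aggs": {
    "errors_by_service": {
      "terms": { "field": "service_name.keyword" }
    }
  }
}
```

`"size": 0` suppresses the individual hits so the response contains only the per-service counts.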

When to Use ELK Stack

  • Centralized log aggregation
  • Debugging application issues
  • Compliance and audit logging
  • Security analysis and threat detection
  • Operational troubleshooting

How They Work Together

These three tools complement each other, each providing different perspectives:

The Complete Picture

Prometheus (Metrics)
├─ "System is slow"
├─ "Error rate increased"
└─ "CPU usage spiked"
    ↓
Jaeger (Traces)
├─ "Request to service A takes 5s"
├─ "Database query is the bottleneck"
└─ "Error occurs in service B"
    ↓
ELK (Logs)
├─ "Database connection timeout"
├─ "Out of memory error"
└─ "Authentication failed"

Practical Investigation Workflow

  1. Alert from Prometheus: Error rate increased
  2. Check Jaeger: Identify which service is slow
  3. Review ELK logs: Find the root cause (database error, timeout, etc.)
  4. Correlate data: Use timestamps and request IDs to connect all three
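Step 4 hinges on every service stamping the same request ID on its logs and spans. A minimal sketch of that glue in Python, assuming the common `X-Request-ID` header convention (neither the header name nor the contextvar approach is mandated by any of the three tools):

```python
# Propagate one request ID per request and attach it to every log record,
# so a trace in Jaeger and its log lines in ELK can be joined on that ID.
import logging
import uuid
from contextvars import ContextVar

request_id_var: ContextVar[str] = ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


log = logging.getLogger("app")
log.addFilter(RequestIdFilter())


def handle_request(headers: dict) -> str:
    # Reuse the caller's ID if present so the whole call chain shares one
    # ID; otherwise mint a fresh one at the edge of the system.
    rid = headers.get("X-Request-ID", str(uuid.uuid4()))
    request_id_var.set(rid)
    # Records now carry request_id; a formatter can include it in output,
    # and the same ID would be set as a span tag in Jaeger.
    log.info("handling request")
    return rid
```

With this in place, an alert leads to a trace, the trace yields a request ID, and a single `request_id: <id>` search in Kibana returns every log line for that request.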

Example: Investigating a Performance Issue

1. Prometheus Alert: p95 latency > 1s
   ↓
2. Jaeger Investigation: 
   - Trace shows request spending 800ms in payment service
   - Payment service calls external API
   ↓
3. ELK Log Search:
   - Find logs from payment service during that time
   - Discover: "External API timeout after 5 retries"
   ↓
4. Root Cause: External API was down, causing timeouts

Implementation Considerations

Choosing What to Monitor

Metrics (Prometheus):

  • Application performance (latency, throughput, errors)
  • Resource utilization (CPU, memory, disk)
  • Business metrics (transactions, conversions)

Traces (Jaeger):

  • Request flows through microservices
  • Performance bottlenecks
  • Error propagation

Logs (ELK):

  • Detailed error messages
  • Application state at specific times
  • Security events

Sampling and Cost

Collecting everything is expensive. Use sampling strategically:

# Sample 10% of traces using jaeger_client's built-in sampler
from jaeger_client.sampler import ProbabilisticSampler

sampler = ProbabilisticSampler(rate=0.1)

# Keep all errors but only 1% of normal traffic. Note the sampler is
# normally fixed at tracer creation; making this decision per request
# requires tail-based sampling in the collector rather than app logic.
if is_error:
    sampler = ProbabilisticSampler(rate=1.0)
else:
    sampler = ProbabilisticSampler(rate=0.01)

Data Retention

Balance visibility with storage costs:

  • Metrics: 15 days (Prometheus default)
  • Traces: 72 hours (typical)
  • Logs: 30 days (common practice)

Integration Points

Modern observability platforms integrate these tools:

  • OpenTelemetry: Unified instrumentation standard
  • Grafana: Unified visualization across Prometheus and other sources
  • Correlation IDs: Link traces and logs using request IDs

Getting Started

Minimal Setup

# Start Prometheus
docker run -p 9090:9090 prom/prometheus

# Start Jaeger
docker run -p 6831:6831/udp -p 16686:16686 jaegertracing/all-in-one

# Start ELK Stack
docker-compose up -d elasticsearch logstash kibana

First Steps

  1. Instrument your application with metrics, traces, and logs
  2. Configure collection (Prometheus scrape, Jaeger agent, Logstash)
  3. Set up dashboards in Prometheus and Kibana
  4. Create alerts based on key metrics
  5. Practice investigation using all three tools together

Conclusion

Observability is not a single tool but a comprehensive approach to understanding your systems. Prometheus, Jaeger, and ELK stack each play a crucial role:

  • Prometheus answers “what’s happening” with metrics and alerting
  • Jaeger answers “how did it happen” with distributed tracing
  • ELK answers “what exactly happened” with detailed logs

Together, they provide the visibility needed to operate complex systems reliably.

Key takeaways:

  1. Implement all three pillars: Metrics, traces, and logs provide complementary insights
  2. Start simple: Begin with basic instrumentation and expand gradually
  3. Use correlation IDs: Link data across tools for easier investigation
  4. Sample strategically: Balance visibility with cost
  5. Practice investigation: Regularly use these tools to understand your systems

The investment in observability pays dividends in reduced mean time to resolution (MTTR), better system understanding, and ultimately, more reliable services. Start implementing these tools today, and you’ll be better equipped to handle the inevitable failures that come with distributed systems.
