
Observability Stack: Prometheus, Grafana, Jaeger Setup

Introduction

Observability is the foundation of reliable production systems. The three pillars (metrics, logs, and traces) provide visibility into system behavior. Prometheus collects metrics, Grafana visualizes them, and Jaeger traces requests across services. Building a comprehensive observability stack enables rapid problem detection and resolution.

This guide covers building and operating such a stack.


Core Concepts

Metrics

Quantitative measurements of system behavior (CPU, memory, requests).

Logs

Detailed records of events and errors.

Traces

Request flow across distributed services.

Prometheus

Time-series database for metrics.

Grafana

Visualization platform for metrics.

Jaeger

Distributed tracing system.

Scrape

Prometheus collecting metrics from targets.

Alert

Notification triggered by metric threshold.

Dashboard

Visualization of metrics and alerts.


Prometheus Setup

Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'prometheus'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

rule_files:
  - 'alerts.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'application'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 5s
  
  - job_name: 'kubernetes'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
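
With the scrape jobs above in place, Prometheus's own HTTP API reports the health of every target it scrapes (`GET /api/v1/targets`). A small Python sketch for checking that health, assuming Prometheus listens on localhost:9090 as configured; `unhealthy_targets` is our helper name:

```python
import json
import urllib.request

PROM_URL = "http://localhost:9090"  # assumed address from the config above

def unhealthy_targets(payload: dict) -> list:
    """Return scrape URLs of active targets whose last scrape was not healthy."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [t["scrapeUrl"] for t in active if t.get("health") != "up"]

def check_targets(prom_url: str = PROM_URL) -> list:
    """Fetch target status from a running Prometheus and report failures."""
    with urllib.request.urlopen(f"{prom_url}/api/v1/targets") as resp:
        return unhealthy_targets(json.load(resp))

# Against a live server: for url in check_targets(): print("down:", url)
```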

Alert Rules

# alerts.yml
groups:
  - name: application
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}"
      
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"
      
      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes / 1024 / 1024 > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value }}MB"
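
Before wiring an expression into a rule, you can run it as an instant query against Prometheus's HTTP API (`GET /api/v1/query`) and see which series would cross the threshold. A sketch, again assuming localhost:9090; `firing_series` is our helper name:

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumed
ERROR_RATE_EXPR = 'rate(http_requests_total{status=~"5.."}[5m])'

def firing_series(payload: dict, threshold: float) -> list:
    """Label sets from an instant-query result whose value exceeds threshold."""
    result = payload.get("data", {}).get("result", [])
    return [s["metric"] for s in result if float(s["value"][1]) > threshold]

def query_instant(expr: str, prom_url: str = PROM_URL) -> dict:
    """Run one instant query against a live Prometheus."""
    qs = urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(f"{prom_url}/api/v1/query?{qs}") as resp:
        return json.load(resp)

# Against a live server: firing_series(query_instant(ERROR_RATE_EXPR), 0.05)
```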

Grafana Setup

Data Source Configuration

{
  "name": "Prometheus",
  "type": "prometheus",
  "url": "http://prometheus:9090",
  "access": "proxy",
  "isDefault": true
}
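
This payload can be pushed through Grafana's HTTP API (`POST /api/datasources`) instead of being entered in the UI. A sketch in Python; the Grafana URL and API token below are placeholders you would substitute:

```python
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"  # placeholder
API_TOKEN = "YOUR_GRAFANA_API_TOKEN"   # placeholder service-account token

def datasource_request(payload: dict) -> urllib.request.Request:
    """Build the POST /api/datasources request for a data-source definition."""
    return urllib.request.Request(
        f"{GRAFANA_URL}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )

def create_datasource(payload: dict) -> int:
    """Send the request to a live Grafana; returns the HTTP status code."""
    with urllib.request.urlopen(datasource_request(payload)) as resp:
        return resp.status
```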

Dashboard Example

{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
          }
        ],
        "type": "graph"
      },
      {
        "title": "P95 Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "process_resident_memory_bytes / 1024 / 1024"
          }
        ],
        "type": "gauge"
      }
    ]
  }
}
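
Dashboard JSON like this gets unwieldy to hand-edit as panels accumulate; the same structure can be generated programmatically. A small sketch (the `panel` and `dashboard` helpers are our own names, not a Grafana API):

```python
def panel(title: str, expr: str, panel_type: str = "graph") -> dict:
    """One panel definition with a single Prometheus query target."""
    return {"title": title, "type": panel_type, "targets": [{"expr": expr}]}

def dashboard(title: str, panels: list) -> dict:
    """Wrap panels in the top-level dashboard structure shown above."""
    return {"dashboard": {"title": title, "panels": panels}}

app_dashboard = dashboard("Application Metrics", [
    panel("Request Rate", "rate(http_requests_total[5m])"),
    panel("Error Rate", 'rate(http_requests_total{status=~"5.."}[5m])'),
    panel("Memory Usage", "process_resident_memory_bytes / 1024 / 1024", "gauge"),
])
```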

Jaeger Setup

Configuration

# jaeger-config.yaml
sampler:
  type: const
  param: 1  # sample every trace; lower this in production

reporter:
  logSpans: true

Instrumentation

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)

# Set up tracer
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

tracer = trace.get_tracer(__name__)

# Use tracer
with tracer.start_as_current_span("process_request") as span:
    span.set_attribute("user_id", 123)
    span.set_attribute("request_path", "/api/users")
    
    # Do work
    result = process_request()
    
    span.set_attribute("status", "success")

Metrics Implementation

Custom Metrics

from prometheus_client import Counter, Gauge, Histogram, start_http_server
import time

# Define metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['endpoint'],
    buckets=(0.1, 0.5, 1.0, 2.0, 5.0)
)

active_connections = Gauge(
    'active_connections',
    'Active connections'
)

# Track metrics
def track_request(method, endpoint, status, duration):
    request_count.labels(method=method, endpoint=endpoint, status=status).inc()
    request_duration.labels(endpoint=endpoint).observe(duration)

# Start metrics server
start_http_server(8000)

# Simulate requests
for i in range(100):
    start = time.time()
    # Process request
    time.sleep(0.1)
    duration = time.time() - start
    
    track_request('GET', '/api/users', 200, duration)
    active_connections.set(i % 50)
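
prometheus_client also provides decorator and context-manager helpers, which can replace manual bookkeeping like the `track_request` function above. A short sketch:

```python
import time
from prometheus_client import Gauge, Histogram

handler_duration = Histogram("handler_duration_seconds", "Handler duration")
in_progress = Gauge("in_progress_requests", "Requests currently being handled")

@handler_duration.time()          # observe the call duration automatically
@in_progress.track_inprogress()   # inc on entry, dec on exit (even on error)
def handle_request():
    time.sleep(0.01)  # stand-in for real work
    return "ok"

handle_request()
```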

Alerting

Alert Manager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
  
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
  
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#warnings'

Best Practices

  1. Instrument Everything: Add metrics to all critical paths
  2. Meaningful Metrics: Track business and technical metrics
  3. Appropriate Cardinality: Avoid high-cardinality labels
  4. Retention Policy: Define data retention
  5. Alert Tuning: Avoid alert fatigue
  6. Dashboards: Create actionable dashboards
  7. Documentation: Document metrics and alerts
  8. Testing: Test alerting rules
  9. Runbooks: Create runbooks for alerts
  10. Continuous Improvement: Refine observability based on incidents
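
Point 3 is worth a concrete example: raw request paths contain unbounded IDs, and using them verbatim as a label value creates one time series per ID. A common mitigation is to normalize paths before labeling; a sketch:

```python
import re

def normalize_endpoint(path: str) -> str:
    """Collapse numeric path segments so the endpoint label stays bounded."""
    return re.sub(r"/\d+", "/:id", path)

# normalize_endpoint("/api/users/123/orders/456") -> "/api/users/:id/orders/:id"
```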

External Resources

Prometheus

Grafana

Jaeger


Conclusion

Observability is essential for production systems. By implementing metrics, logs, and traces, you gain visibility into system behavior and can rapidly detect and resolve issues.

Start with key metrics, build dashboards, and gradually expand observability as your system grows.

Observability is the foundation of reliable systems.
