Introduction
Observability is the foundation of reliable production systems. Its three pillars (metrics, logs, and traces) provide visibility into system behavior: Prometheus collects metrics, Grafana visualizes them, and Jaeger traces requests across services. Together they enable rapid problem detection and resolution.
This guide covers building and operating such an observability stack.
Core Concepts
Metrics
Quantitative measurements of system behavior (CPU, memory, requests).
Logs
Detailed records of events and errors.
Traces
Request flow across distributed services.
Prometheus
Time-series database for metrics.
Grafana
Visualization platform for metrics.
Jaeger
Distributed tracing system.
Scrape
Prometheus collecting metrics from targets.
Alert
Notification triggered by metric threshold.
Dashboard
Visualization of metrics and alerts.
Prometheus Setup
Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'prometheus'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

rule_files:
  - 'alerts.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'application'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 5s

  - job_name: 'kubernetes'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
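Before deploying, both files can be validated with promtool, the checking tool that ships with Prometheus (file paths are illustrative):

```shell
# validate the main configuration and the rule file it references
promtool check config prometheus.yml
promtool check rules alerts.yml
```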
Alert Rules
# alerts.yml
groups:
  - name: application
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} requests/s"
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"
      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes / 1024 / 1024 > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value }}MB"
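The HighLatency rule relies on histogram_quantile(), which estimates a quantile by linear interpolation inside the bucket containing the target rank. A simplified Python sketch of that estimate over cumulative bucket counts (numbers are illustrative; Prometheus's real implementation also handles the +Inf bucket and several edge cases):

```python
def histogram_quantile(q, bounds, cumulative):
    """Approximate Prometheus histogram_quantile(): linear interpolation
    inside the bucket that contains the q-th fraction of observations."""
    total = cumulative[-1]
    rank = q * total
    lower, prev_count = 0.0, 0
    for bound, count in zip(bounds, cumulative):
        if count >= rank:
            # fraction of this bucket's samples below the target rank
            frac = (rank - prev_count) / (count - prev_count)
            return lower + (bound - lower) * frac
        lower, prev_count = bound, count
    return bounds[-1]

# cumulative observations per bucket (le=0.1, 0.5, 1.0, 2.0, 5.0)
bounds = [0.1, 0.5, 1.0, 2.0, 5.0]
cumulative = [60, 85, 95, 99, 100]
p95 = histogram_quantile(0.95, bounds, cumulative)  # 1.0 s
```

With these counts, the 95th observation falls exactly at the top of the 0.5-to-1.0 bucket, so the estimate is 1.0 s and the alert above would be on the verge of firing.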
Grafana Setup
Data Source Configuration
{
  "name": "Prometheus",
  "type": "prometheus",
  "url": "http://prometheus:9090",
  "access": "proxy",
  "isDefault": true
}
Dashboard Example
{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
          }
        ],
        "type": "graph"
      },
      {
        "title": "P95 Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "process_resident_memory_bytes / 1024 / 1024"
          }
        ],
        "type": "gauge"
      }
    ]
  }
}
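A dashboard definition like this can be created through Grafana's HTTP API at the standard /api/dashboards/db endpoint (host, token, and file name here are placeholders):

```shell
curl -X POST http://grafana:3000/api/dashboards/db \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d @dashboard.json
```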
Jaeger Setup
Configuration
# jaeger-config.yaml
sampler:
  type: const      # sample every trace; consider 'probabilistic' in production
  param: 1
reporter:
  log_spans: true
logging: true
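A const sampler with param: 1 records every trace, which is usually too expensive in production. Head-based probabilistic sampling instead decides from the trace ID alone, so every service reaches the same decision without coordination. A stdlib sketch of the idea (a hypothetical helper, similar in spirit to OpenTelemetry's TraceIdRatioBased sampler, not part of the Jaeger client API):

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    # Deterministic head sampling: keep a trace when the numeric value of
    # the leading 8 bytes of its ID falls below ratio * 2^64.
    bound = int(ratio * (1 << 64))
    return int(trace_id_hex[:16], 16) < bound

# At ratio 0.5, the lowest possible ID is always kept, the highest never:
should_sample("0" * 32, 0.5)   # True
should_sample("f" * 32, 0.5)   # False
```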
Instrumentation
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure the Jaeger exporter (Thrift over UDP to the local agent)
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)

# Set up the tracer provider and export spans in batches
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
tracer = trace.get_tracer(__name__)

# Use the tracer
with tracer.start_as_current_span("process_request") as span:
    span.set_attribute("user_id", 123)
    span.set_attribute("request_path", "/api/users")
    result = process_request()  # do the actual work
    span.set_attribute("status", "success")
Metrics Implementation
Custom Metrics
from prometheus_client import Counter, Gauge, Histogram, start_http_server
import time

# Define metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['endpoint'],
    buckets=(0.1, 0.5, 1.0, 2.0, 5.0)
)
active_connections = Gauge(
    'active_connections',
    'Active connections'
)

# Track metrics
def track_request(method, endpoint, status, duration):
    request_count.labels(method=method, endpoint=endpoint, status=status).inc()
    request_duration.labels(endpoint=endpoint).observe(duration)

# Start metrics server
start_http_server(8000)

# Simulate requests
for i in range(100):
    start = time.time()
    time.sleep(0.1)  # process request
    duration = time.time() - start
    track_request('GET', '/api/users', 200, duration)
    active_connections.set(i % 50)
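The alert expressions earlier apply rate() to counters like http_requests_total above. A simplified stdlib sketch of what rate() computes: the per-second increase of a counter over a window, with counter resets (process restarts) handled by counting from zero (Prometheus additionally extrapolates to the window boundaries):

```python
def counter_rate(samples, window):
    """Approximate PromQL rate() over samples = [(timestamp, value), ...]:
    sum the increases, treating any decrease as a counter reset."""
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            increase += value          # reset: counter restarted from zero
        else:
            increase += value - prev
        prev = value
    return increase / window

# counter resets between t=60 and t=120 (160 -> 20)
samples = [(0, 100), (60, 160), (120, 20), (180, 80)]
r = counter_rate(samples, 180)   # (60 + 20 + 60) / 180 s
```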
Alerting
Alert Manager Configuration
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#warnings'
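As with the Prometheus configuration, this file can be validated before deployment with amtool, which ships alongside Alertmanager:

```shell
amtool check-config alertmanager.yml
```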
Best Practices
- Instrument Everything: Add metrics to all critical paths
- Meaningful Metrics: Track business and technical metrics
- Appropriate Cardinality: Avoid high-cardinality labels
- Retention Policy: Define data retention
- Alert Tuning: Avoid alert fatigue
- Dashboards: Create actionable dashboards
- Documentation: Document metrics and alerts
- Testing: Test alerting rules
- Runbooks: Create runbooks for alerts
- Continuous Improvement: Refine observability based on incidents
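The cardinality point is worth making concrete: Prometheus stores one time series per unique label combination, so series counts multiply across labels (numbers are illustrative):

```python
# series per metric = product of distinct values per label
methods, endpoints, statuses = 5, 40, 10
series = methods * endpoints * statuses          # 2,000 series: manageable

# adding a user_id label with 100,000 distinct users explodes this
series_per_user_label = series * 100_000         # 200,000,000 series
```

This is why identifiers like user IDs, request IDs, or full URLs belong in logs and traces, not in metric labels.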
External Resources
- Prometheus: https://prometheus.io
- Grafana: https://grafana.com
- Jaeger: https://www.jaegertracing.io
Conclusion
Observability is essential for production systems. By implementing metrics, logs, and traces, you gain visibility into system behavior and can rapidly detect and resolve issues.
Start with key metrics, build dashboards, and gradually expand observability as your system grows.
Observability is the foundation of reliable systems.