Introduction
Observability is the ability to understand a system's internal state from its external outputs. In modern distributed systems, observability goes beyond traditional monitoring: it lets you debug issues you have never seen before. This guide covers building a comprehensive observability stack.
The three pillars of observability (metrics, logs, and traces) work together to provide complete visibility into your systems.
The Three Pillars
Metrics, Logs, and Traces
┌──────────────────────────────────────────────────────────┐
│              Three Pillars of Observability              │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐  │
│  │   Metrics    │   │     Logs     │   │    Traces    │  │
│  │              │   │              │   │              │  │
│  │ Numeric      │   │ Event        │   │ Request      │  │
│  │ measurements │   │ records      │   │ flow         │  │
│  │              │   │              │   │              │  │
│  │ Aggregated   │   │ Timestamped  │   │ Distributed  │  │
│  │ over time    │   │ records      │   │ spans        │  │
│  │              │   │              │   │              │  │
│  │ - Counters   │   │ - Info       │   │ - Latency    │  │
│  │ - Gauges     │   │ - Warnings   │   │ - Errors     │  │
│  │ - Histograms │   │ - Errors     │   │ - Flow       │  │
│  └──────────────┘   └──────────────┘   └──────────────┘  │
│                                                          │
└──────────────────────────────────────────────────────────┘
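The pillars are most useful when they can be joined: a metric spike points you to a time window, logs from that window explain what happened, and a trace shows where. A minimal sketch of the glue that makes that join possible, a shared request ID carried in both the log record and the trace context (all names here are illustrative, not from any specific library):

```python
import json
import time
import uuid

def handle_request(payload):
    # In practice the ID would come from the incoming trace context
    # (e.g. a W3C traceparent header) rather than being minted here.
    request_id = str(uuid.uuid4())
    start = time.monotonic()
    result = payload.upper()  # stand-in for real work
    duration = time.monotonic() - start
    # Structured log record: the same request_id appears on the trace span,
    # so logs and traces for one request can be correlated later.
    log_record = {
        "level": "info",
        "request_id": request_id,
        "duration_seconds": round(duration, 6),
        "message": "request handled",
    }
    print(json.dumps(log_record))
    return result, request_id
```

The duration measured here is also exactly what you would feed into a latency histogram, which is where the metrics pillar picks up.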
Prometheus
Architecture
┌──────────────────────────────────────────────────────────┐
│                 Prometheus Architecture                  │
├──────────────────────────────────────────────────────────┤
│                                                          │
│   ┌────────────────────────────────────────────┐         │
│   │             Prometheus Server              │         │
│   │   ┌─────────┐  ┌─────────┐  ┌─────────┐    │         │
│   │   │ Scraper │  │  TSDB   │  │ Alerts  │    │         │
│   │   └─────────┘  └─────────┘  └─────────┘    │         │
│   └────────────────────────────────────────────┘         │
│        ▲                ▲                ▲               │
│        │                │                │               │
│  ┌─────┴──────┐   ┌─────┴──────┐   ┌─────┴──────┐        │
│  │ Targets    │   │ Alert-     │   │ Grafana    │        │
│  │ (Exporters)│   │ manager    │   │ Query UI   │        │
│  └────────────┘   └────────────┘   └────────────┘        │
│                                                          │
└──────────────────────────────────────────────────────────┘
Metrics Types
from flask import Flask
from prometheus_client import Counter, Gauge, Histogram, Summary

app = Flask(__name__)

# Counter: monotonically increasing
requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Gauge: can go up and down
active_connections = Gauge(
    'active_connections',
    'Number of active connections',
    ['service']
)

# Histogram: distribution of values across configurable buckets
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1, 5]
)

# Summary: like a histogram, but quantiles are computed client-side
request_size = Summary(
    'http_request_size_bytes',
    'HTTP request size'
)

# Using the metrics in a request handler
@app.route('/api/users')
def users():
    requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()
    with request_duration.labels(method='GET', endpoint='/api/users').time():
        # Process the request
        return {'users': []}
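Defining metrics does nothing on its own; Prometheus has to be able to scrape them. A minimal sketch of the exposition side (the registry and metric name here are illustrative; in a real service you would call `start_http_server(8000)` once at startup to serve `/metrics` in a background thread):

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# Illustrative standalone registry so the example is self-contained;
# the default global registry works the same way.
registry = CollectorRegistry()
demo_requests = Counter(
    'demo_requests_total',
    'Total demo requests',
    registry=registry,
)
demo_requests.inc()

# generate_latest() renders the plain-text exposition format that
# Prometheus scrapes from the /metrics endpoint.
exposition = generate_latest(registry).decode()
print(exposition)
```

The output is the same text format you see when you curl a service's `/metrics` endpoint by hand, which is a useful first debugging step when a target shows as down.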
Service Discovery
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-applications'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
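For the relabel rules above to pick a pod up, the pod has to carry the matching annotations. A sketch of what that looks like on the workload side (the pod name, image, and port are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-service
  annotations:
    prometheus.io/scrape: "true"    # kept by the first relabel rule
    prometheus.io/path: "/metrics"  # rewrites __metrics_path__
    prometheus.io/port: "8000"      # rewrites __address__ to pod-ip:8000
spec:
  containers:
    - name: app
      image: payment-service:latest
```

In practice these annotations usually live in a Deployment's pod template so every replica is discovered automatically.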
Grafana
Dashboard Configuration
{
  "dashboard": {
    "title": "Payment Service Overview",
    "panels": [
      {
        "title": "Requests per Second",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percentunit",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 0.01, "color": "yellow"},
                {"value": 0.05, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "P99"
          }
        ]
      }
    ]
  }
}
Alert Rules
groups:
  - name: payment-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(payment_requests_total{status=~"5.."}[5m]))
            / sum(rate(payment_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Payment service error rate is high"
          description: "{{ $value | humanizePercentage }} error rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(payment_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Payment service latency is high"
          description: "P99 latency is {{ $value | humanizeDuration }}"
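Alert expressions like these can be unit-tested offline with `promtool test rules`, which replays synthetic series against the rule file. A sketch, assuming the rules above live in `alerts.yml` (file names and series values are illustrative):

```yaml
# alerts_test.yml — run with: promtool test rules alerts_test.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'payment_requests_total{status="500"}'
        values: '0+10x10'   # 10 errors per minute
      - series: 'payment_requests_total{status="200"}'
        values: '0+10x10'   # 10 successes per minute -> 50% error rate
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
```

Running this in CI catches broken expressions and threshold mistakes before they reach production.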
Log Aggregation
Loki Configuration
# loki-config.yaml
server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2026-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
Log Queries
# Find errors in payment service
{job="payment-service"} |= "ERROR"
# Parse and filter
{job="payment-service"} | json | level="error" | duration > 1s
# Count by level
sum(count_over_time({job="payment-service"}[5m])) by (level)
# Latency percentiles (unwrap a parsed duration field into a range aggregation)
quantile_over_time(0.99, {job="api"} | json | unwrap duration [5m])
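The `| json` stage in the queries above assumes the application writes one JSON object per line. A minimal sketch of the producing side (the field names are illustrative and must match whatever your queries reference):

```python
import json
import sys
import time

def log(level, message, **fields):
    # One JSON object per line: Loki's `| json` parser extracts each
    # key (level, duration, ...) so queries can filter on it.
    record = {"ts": time.time(), "level": level, "message": message, **fields}
    line = json.dumps(record)
    sys.stdout.write(line + "\n")
    return line  # returned to make the helper easy to test

log("error", "payment failed", duration="1.2s")
```

Writing durations in a parseable form (`1.2s`) is what makes filters like `duration > 1s` and `unwrap duration` work.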
Building Dashboards
Key Metrics by Service
# Application metrics to track, by service
metrics_by_service = {
    "api": [
        "requests_total (counter)",
        "request_duration_seconds (histogram)",
        "request_size_bytes (histogram)",
        "errors_total (counter)"
    ],
    "database": [
        "connections_active (gauge)",
        "query_duration_seconds (histogram)",
        "queries_total (counter)"
    ],
    "cache": [
        "hits_total (counter)",
        "misses_total (counter)",
        "evictions_total (counter)"
    ]
}
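The names above follow the standard Prometheus conventions: counters end in `_total`, and histograms carry a base unit suffix such as `_seconds` or `_bytes`. A small hedged helper (my own sketch, not part of any library) to sanity-check an inventory like this:

```python
import re

def check_metric(name: str, metric_type: str) -> list:
    """Return a list of naming-convention problems (empty if none)."""
    problems = []
    # Valid Prometheus metric name characters
    if not re.fullmatch(r"[a-zA-Z_:][a-zA-Z0-9_:]*", name):
        problems.append("invalid characters")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counters should end in _total")
    if metric_type == "histogram" and not re.search(r"_(seconds|bytes)$", name):
        problems.append("histograms should name a base unit (e.g. _seconds)")
    return problems

check_metric("requests_total", "counter")      # conforms
check_metric("request_duration", "histogram")  # flagged: missing unit suffix
```

Running a check like this in CI keeps dashboards and alerts consistent as new metrics are added.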
Dashboard Best Practices
- Use red/green/yellow thresholds: Clear visual indicators
- Show trends: Include time series, not just current values
- Link to runbooks: Include links from alerts to documentation
- Set appropriate ranges: Match to your SLOs
- Use meaningful names: Service + metric description
Alert Design
Alert Quality
┌──────────────────────────────────────────────────────────┐
│                Good Alert Characteristics                │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ✓ Actionable:  Clear what to do                         │
│  ✓ Timely:      Fires before users notice                │
│  ✓ Accurate:    Low false positive rate                  │
│  ✓ Relevant:    Indicates a real problem                 │
│  ✓ Prioritized: Severity matches impact                  │
│                                                          │
└──────────────────────────────────────────────────────────┘
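One widely used way to get alerts with these properties is burn-rate alerting against an SLO: fire when the error rate is consuming error budget much faster than sustainable, checked over a long and a short window so the alert is both accurate and timely. A sketch for a hypothetical 99.9% availability SLO (the 14.4 multiplier corresponds to burning a 30-day budget in about two days; metric names are illustrative):

```yaml
- alert: ErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
    and
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
    )
  for: 2m
  labels:
    severity: critical
```

The short window makes the alert resolve quickly once the problem stops; the long window keeps brief blips from paging anyone.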
Runbook Integration
annotations:
  summary: "Database connection pool exhausted"
  description: "{{ $value }} active connections"
  runbook_url: "https://wiki.example.com/runbooks/db-connections"

# Runbook content at the URL above:
#
# # Database Connection Pool Exhaustion
#
# ## Symptoms
# - High number of active connections
# - Applications timing out on DB queries
#
# ## Impact
# - Failed database writes
# - User-facing errors
#
# ## Resolution
# 1. Check for long-running queries
# 2. Identify and kill blocking sessions
# 3. Scale up the connection pool if needed
Best Practices
- Start with SLOs: Define what good looks like
- Alert on symptoms: Not causes
- Use golden signals: Latency, traffic, errors, saturation
- Keep dashboards focused: One per service
- Automate remediation: Where possible
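As a starting point, the four golden signals map onto PromQL roughly as follows (metric names follow the earlier examples; the saturation query assumes node_exporter is deployed, and the right saturation metric is resource-specific):

```promql
# Latency: P99 over 5 minutes
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: share of responses that are 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: e.g. CPU busy fraction (node_exporter metric shown)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
```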
Conclusion
A comprehensive observability stack enables you to understand, debug, and optimize your systems. By combining metrics, logs, and traces with thoughtful alerting, you can maintain high reliability and quickly resolve issues.