Introduction
Observability is the ability to understand a system's internal state from its external outputs. In modern distributed systems, observability goes beyond traditional monitoring: it lets you debug issues you have never seen before. This guide covers building a comprehensive observability stack.
The three pillars of observability (metrics, logs, and traces) work together to provide complete visibility into your systems.
The Three Pillars
Metrics, Logs, and Traces
┌──────────────────────────────────────────────────────────┐
│              Three Pillars of Observability              │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐  │
│  │   Metrics    │   │     Logs     │   │    Traces    │  │
│  │              │   │              │   │              │  │
│  │ Numeric      │   │ Event        │   │ Request      │  │
│  │ measurements │   │ records      │   │ flow         │  │
│  │              │   │              │   │              │  │
│  │ Aggregated   │   │ Timestamped  │   │ Distributed  │  │
│  │ over time    │   │ records      │   │ spans        │  │
│  │              │   │              │   │              │  │
│  │ - Counters   │   │ - Info       │   │ - Latency    │  │
│  │ - Gauges     │   │ - Warnings   │   │ - Errors     │  │
│  │ - Histograms │   │ - Errors     │   │ - Flow       │  │
│  └──────────────┘   └──────────────┘   └──────────────┘  │
│                                                          │
└──────────────────────────────────────────────────────────┘
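The pillars are most useful when they can be joined: a metric spike points you to a time window, logs from that window explain what happened, and a trace shows where. A minimal sketch of the glue that makes that join possible, a shared request ID carried in both the log record and the trace context (all names here are illustrative, not from any specific library):

```python
import json
import time
import uuid

def handle_request(payload):
    # In practice the ID would come from the incoming trace context
    # (e.g. a W3C traceparent header) rather than being minted here.
    request_id = str(uuid.uuid4())
    start = time.monotonic()
    result = payload.upper()  # stand-in for real work
    duration = time.monotonic() - start
    # Structured log record: the same request_id appears on the trace span,
    # so logs and traces for one request can be correlated later.
    log_record = {
        "level": "info",
        "request_id": request_id,
        "duration_seconds": round(duration, 6),
        "message": "request handled",
    }
    print(json.dumps(log_record))
    return result, request_id
```

The duration measured here is also exactly what you would feed into a latency histogram, which is where the metrics pillar picks up.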
Prometheus
Architecture
┌──────────────────────────────────────────────────────────┐
│                 Prometheus Architecture                  │
├──────────────────────────────────────────────────────────┤
│                                                          │
│   ┌────────────────────────────────────────────┐         │
│   │             Prometheus Server              │         │
│   │   ┌─────────┐  ┌─────────┐  ┌─────────┐    │         │
│   │   │ Scraper │  │  TSDB   │  │ Alerts  │    │         │
│   │   └─────────┘  └─────────┘  └─────────┘    │         │
│   └────────────────────────────────────────────┘         │
│        ▲                ▲                ▲               │
│        │                │                │               │
│  ┌─────┴──────┐   ┌─────┴──────┐   ┌─────┴──────┐        │
│  │ Targets    │   │ Alert-     │   │ Grafana    │        │
│  │ (Exporters)│   │ manager    │   │ Query UI   │        │
│  └────────────┘   └────────────┘   └────────────┘        │
│                                                          │
└──────────────────────────────────────────────────────────┘
Metrics Types
from flask import Flask
from prometheus_client import Counter, Gauge, Histogram, Summary

app = Flask(__name__)

# Counter: monotonically increasing
requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Gauge: can go up and down
active_connections = Gauge(
    'active_connections',
    'Number of active connections',
    ['service']
)

# Histogram: distribution of values across configurable buckets
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1, 5]
)

# Summary: like a histogram, but quantiles are computed client-side
request_size = Summary(
    'http_request_size_bytes',
    'HTTP request size'
)

# Using the metrics in a request handler
@app.route('/api/users')
def users():
    requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()
    with request_duration.labels(method='GET', endpoint='/api/users').time():
        # Process the request
        return {'users': []}
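Defining metrics does nothing on its own; Prometheus has to be able to scrape them. A minimal sketch of the exposition side (the registry and metric name here are illustrative; in a real service you would call `start_http_server(8000)` once at startup to serve `/metrics` in a background thread):

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# Illustrative standalone registry so the example is self-contained;
# the default global registry works the same way.
registry = CollectorRegistry()
demo_requests = Counter(
    'demo_requests_total',
    'Total demo requests',
    registry=registry,
)
demo_requests.inc()

# generate_latest() renders the plain-text exposition format that
# Prometheus scrapes from the /metrics endpoint.
exposition = generate_latest(registry).decode()
print(exposition)
```

The output is the same text format you see when you curl a service's `/metrics` endpoint by hand, which is a useful first debugging step when a target shows as down.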
Service Discovery
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-applications'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
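For the relabel rules above to pick a pod up, the pod has to carry the matching annotations. A sketch of what that looks like on the workload side (the pod name, image, and port are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-service
  annotations:
    prometheus.io/scrape: "true"    # kept by the first relabel rule
    prometheus.io/path: "/metrics"  # rewrites __metrics_path__
    prometheus.io/port: "8000"      # rewrites __address__ to pod-ip:8000
spec:
  containers:
    - name: app
      image: payment-service:latest
```

In practice these annotations usually live in a Deployment's pod template so every replica is discovered automatically.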
Grafana
Dashboard Configuration
{
  "dashboard": {
    "title": "Payment Service Overview",
    "panels": [
      {
        "title": "Requests per Second",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percentunit",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 0.01, "color": "yellow"},
                {"value": 0.05, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "P99"
          }
        ]
      }
    ]
  }
}
Alert Rules
groups:
  - name: payment-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(payment_requests_total{status=~"5.."}[5m]))
            / sum(rate(payment_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Payment service error rate is high"
          description: "{{ $value | humanizePercentage }} error rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(payment_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Payment service latency is high"
          description: "P99 latency is {{ $value | humanizeDuration }}"
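Alert expressions like these can be unit-tested offline with `promtool test rules`, which replays synthetic series against the rule file. A sketch, assuming the rules above live in `alerts.yml` (file names and series values are illustrative):

```yaml
# alerts_test.yml — run with: promtool test rules alerts_test.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'payment_requests_total{status="500"}'
        values: '0+10x10'   # 10 errors per minute
      - series: 'payment_requests_total{status="200"}'
        values: '0+10x10'   # 10 successes per minute -> 50% error rate
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
```

Running this in CI catches broken expressions and threshold mistakes before they reach production.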
Log Aggregation
Loki Configuration
# loki-config.yaml
server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2026-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
Log Queries
# Find errors in payment service
{job="payment-service"} |= "ERROR"
# Parse and filter
{job="payment-service"} | json | level="error" | duration > 1s
# Count by level
sum(count_over_time({job="payment-service"}[5m])) by (level)
# Latency percentiles (unwrap a parsed duration field into a range aggregation)
quantile_over_time(0.99, {job="api"} | json | unwrap duration [5m])
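The `| json` stage in the queries above assumes the application writes one JSON object per line. A minimal sketch of the producing side (the field names are illustrative and must match whatever your queries reference):

```python
import json
import sys
import time

def log(level, message, **fields):
    # One JSON object per line: Loki's `| json` parser extracts each
    # key (level, duration, ...) so queries can filter on it.
    record = {"ts": time.time(), "level": level, "message": message, **fields}
    line = json.dumps(record)
    sys.stdout.write(line + "\n")
    return line  # returned to make the helper easy to test

log("error", "payment failed", duration="1.2s")
```

Writing durations in a parseable form (`1.2s`) is what makes filters like `duration > 1s` and `unwrap duration` work.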
Building Dashboards
Key Metrics by Service
# Application metrics to track, by service
metrics_by_service = {
    "api": [
        "requests_total (counter)",
        "request_duration_seconds (histogram)",
        "request_size_bytes (histogram)",
        "errors_total (counter)"
    ],
    "database": [
        "connections_active (gauge)",
        "query_duration_seconds (histogram)",
        "queries_total (counter)"
    ],
    "cache": [
        "hits_total (counter)",
        "misses_total (counter)",
        "evictions_total (counter)"
    ]
}
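The names above follow the standard Prometheus conventions: counters end in `_total`, and histograms carry a base unit suffix such as `_seconds` or `_bytes`. A small hedged helper (my own sketch, not part of any library) to sanity-check an inventory like this:

```python
import re

def check_metric(name: str, metric_type: str) -> list:
    """Return a list of naming-convention problems (empty if none)."""
    problems = []
    # Valid Prometheus metric name characters
    if not re.fullmatch(r"[a-zA-Z_:][a-zA-Z0-9_:]*", name):
        problems.append("invalid characters")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counters should end in _total")
    if metric_type == "histogram" and not re.search(r"_(seconds|bytes)$", name):
        problems.append("histograms should name a base unit (e.g. _seconds)")
    return problems

check_metric("requests_total", "counter")      # conforms
check_metric("request_duration", "histogram")  # flagged: missing unit suffix
```

Running a check like this in CI keeps dashboards and alerts consistent as new metrics are added.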
Dashboard Best Practices
- Use red/green/yellow thresholds: Clear visual indicators
- Show trends: Include time series, not just current values
- Link to runbooks: Include links from alerts to documentation
- Set appropriate ranges: Match to your SLOs
- Use meaningful names: Service + metric description
Alert Design
Alert Quality
┌──────────────────────────────────────────────────────────┐
│                Good Alert Characteristics                │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ✓ Actionable:  Clear what to do                         │
│  ✓ Timely:      Fires before users notice                │
│  ✓ Accurate:    Low false positive rate                  │
│  ✓ Relevant:    Indicates a real problem                 │
│  ✓ Prioritized: Severity matches impact                  │
│                                                          │
└──────────────────────────────────────────────────────────┘
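One widely used way to get alerts with these properties is burn-rate alerting against an SLO: fire when the error rate is consuming error budget much faster than sustainable, checked over a long and a short window so the alert is both accurate and timely. A sketch for a hypothetical 99.9% availability SLO (the 14.4 multiplier corresponds to burning a 30-day budget in about two days; metric names are illustrative):

```yaml
- alert: ErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
    and
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
    )
  for: 2m
  labels:
    severity: critical
```

The short window makes the alert resolve quickly once the problem stops; the long window keeps brief blips from paging anyone.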
Runbook Integration
annotations:
  summary: "Database connection pool exhausted"
  description: "{{ $value }} active connections"
  runbook_url: "https://wiki.example.com/runbooks/db-connections"

# Runbook content at the URL above:
#
# # Database Connection Pool Exhaustion
#
# ## Symptoms
# - High number of active connections
# - Applications timing out on DB queries
#
# ## Impact
# - Failed database writes
# - User-facing errors
#
# ## Resolution
# 1. Check for long-running queries
# 2. Identify and kill blocking sessions
# 3. Scale up the connection pool if needed
Best Practices
- Start with SLOs: Define what good looks like
- Alert on symptoms: Not causes
- Use golden signals: Latency, traffic, errors, saturation
- Keep dashboards focused: One per service
- Automate remediation: Where possible
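As a starting point, the four golden signals map onto PromQL roughly as follows (metric names follow the earlier examples; the saturation query assumes node_exporter is deployed, and the right saturation metric is resource-specific):

```promql
# Latency: P99 over 5 minutes
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: share of responses that are 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: e.g. CPU busy fraction (node_exporter metric shown)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
```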
Conclusion
A comprehensive observability stack enables you to understand, debug, and optimize your systems. By combining metrics, logs, and traces with thoughtful alerting, you can maintain high reliability and quickly resolve issues.