Introduction
Infrastructure monitoring forms the backbone of operational excellence. With industry surveys putting the average cost of downtime at roughly $300,000 per hour, robust monitoring is critical for any organization running production systems.
Commonly cited industry figures:
- 68% of organizations struggle with alert fatigue
- Without proper monitoring, Mean Time to Detection (MTTD) can run to months; the widely quoted figure for security breaches is 197 days
- Companies with mature observability practices cut Mean Time to Resolution (MTTR) by 50%
- Prometheus is used by roughly 75% of CNCF adopters
Prometheus Architecture
Core Components
┌─────────────────────────────────────────────────────────────────┐
│                       Prometheus Server                         │
│  ┌───────────────┐    ┌───────────────┐    ┌───────────────┐    │
│  │     TSDB      │    │   Retrieval   │    │  HTTP Server  │    │
│  │   (Storage)   │◄───┤   (Scraper)   ├───►│     (API)     │    │
│  └───────────────┘    └───────────────┘    └───────────────┘    │
└─────────────────────────────────────────────────────────────────┘
        │                      │                      │
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│   Exporters   │      │  Pushgateway  │      │ Alertmanager  │
│   (Metrics)   │      │  (Batch Jobs) │      │   (Alerts)    │
└───────────────┘      └───────────────┘      └───────────────┘
Installation
# Using Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
Prometheus Configuration
# prometheus.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager.monitoring.svc.cluster.local:9093

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
Exporters
Node Exporter
# Install node exporter
kubectl apply -f https://raw.githubusercontent.com/prometheus/node_exporter/master/examples/kubernetes/node-exporter-daemonset.yaml
# Key metrics to collect
# - node_cpu_seconds_total
# - node_memory_MemAvailable_bytes
# - node_filesystem_avail_bytes
# - node_network_receive_bytes_total
# - node_disk_read_bytes_total
Blackbox Exporter
# blackbox-config.yml
modules:
  http_2xx:
    prober: http
    http:
      preferred_ip_protocol: "ip4"
  tcp_connect:
    prober: tcp
  dns:
    prober: dns
    dns:
      query_name: "example.com"   # required: the name the probe resolves
      transport_protocol: "udp"
      preferred_ip_protocol: "ip4"
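A module definition alone does nothing until Prometheus sends probes through it. The standard pattern, documented in the blackbox_exporter README, passes the real target as a `?target=` URL parameter and rewrites `__address__` to point at the exporter itself. A sketch of the Prometheus-side scrape job, assuming the exporter is reachable as `blackbox-exporter:9115` and `https://example.com` stands in for your real endpoints:

scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]            # use the http_2xx module defined above
    static_configs:
      - targets:
          - https://example.com     # the endpoints to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target   # probed URL becomes ?target=...
      - source_labels: [__param_target]
        target_label: instance         # keep the probed URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115  # scrape the exporter, not the target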
# blackbox-exporter-deployment.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: blackbox-exporter-config
data:
  config.yml: |
    modules:
      http_2xx:
        prober: http
        timeout: 5s
      tcp_connect:
        prober: tcp
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: blackbox-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blackbox-exporter
  template:
    metadata:
      labels:
        app: blackbox-exporter
    spec:
      containers:
        - name: blackbox-exporter
          image: prom/blackbox-exporter:latest  # pin a specific tag in production
          args:
            - --config.file=/config/config.yml  # point at the mounted ConfigMap
          ports:
            - containerPort: 9115
          volumeMounts:
            - name: config
              mountPath: /config
      volumes:
        - name: config
          configMap:
            name: blackbox-exporter-config
Custom Application Exporter
#!/usr/bin/env python3
"""Custom metrics exporter."""
from prometheus_client import start_http_server, Counter, Gauge, Histogram
import random
import time

REQUESTS = Counter('app_requests_total', 'Total requests', ['method', 'status'])
ACTIVE_USERS = Gauge('app_active_users', 'Number of active users')
RESPONSE_TIME = Histogram('app_response_time_seconds', 'Response time in seconds')

def process_request():
    method = random.choice(['GET', 'POST', 'PUT', 'DELETE'])
    status = random.choice([200, 200, 200, 404, 500])
    REQUESTS.labels(method=method, status=status).inc()
    start = time.time()
    time.sleep(random.uniform(0.01, 0.5))  # simulate work
    RESPONSE_TIME.observe(time.time() - start)
    ACTIVE_USERS.set(random.randint(10, 100))

if __name__ == '__main__':
    start_http_server(8000)
    print("Exporter running on port 8000")
    while True:
        process_request()
        time.sleep(1)
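What Prometheus actually scrapes from this exporter is the text exposition format served on `/metrics`. The sketch below shows what those lines look like and parses the simple `name{labels} value` form with the standard library; the sample values are illustrative, not real scrape output.

```python
import re

# A few lines as they would appear on the exporter's /metrics endpoint
# (illustrative values).
sample = """\
# HELP app_requests_total Total requests
# TYPE app_requests_total counter
app_requests_total{method="GET",status="200"} 42.0
app_active_users 57.0
"""

# Minimal parser for 'name{labels} value' exposition lines.
line_re = re.compile(r'^(\w+)(?:\{(.*)\})?\s+([0-9.eE+-]+)$')

for line in sample.splitlines():
    if line.startswith("#"):
        continue  # HELP/TYPE metadata lines
    name, labels, value = line_re.match(line).groups()
    print(name, labels or "", float(value))
```

A real parser should use an exposition-format library; this only illustrates the shape of the data.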
Grafana Configuration
Dashboard as Code
# dashboard-config.yaml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
Example Dashboard JSON
{
  "dashboard": {
    "title": "Infrastructure Overview",
    "tags": ["infrastructure", "overview"],
    "timezone": "browser",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{instance}}"
          }
        ],
        "thresholds": [
          {"value": 80, "colorMode": "critical", "op": "gt"},
          {"value": 60, "colorMode": "warning", "op": "gt"}
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Network Traffic",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "rate(node_network_receive_bytes_total[5m])",
            "legendFormat": "{{instance}} - RX"
          },
          {
            "expr": "rate(node_network_transmit_bytes_total[5m])",
            "legendFormat": "{{instance}} - TX"
          }
        ]
      },
      {
        "title": "Disk Usage",
        "type": "gauge",
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "100 * (1 - (node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"}))",
            "legendFormat": "Root"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 70, "color": "yellow"},
                {"value": 90, "color": "red"}
              ]
            },
            "unit": "percent"
          }
        }
      }
    ]
  }
}
Alert Rules
# prometheus-rules.yml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-alerts
  labels:
    prometheus: k8s
spec:
  groups:
    - name: infrastructure
      rules:
        - alert: HighCPUUsage
          expr: |
            100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage detected"
            description: "CPU usage is above 80% for more than 5 minutes"
        - alert: HighMemoryUsage
          expr: |
            100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage"
            description: "Memory usage is above 85%"
        - alert: DiskSpaceLow
          expr: |
            (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Disk space critically low"
            description: "Less than 10% disk space remaining"
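A static free-space threshold fires late on a disk that is filling quickly and never fires on one that is slowly creeping toward full. PromQL's predict_linear function extrapolates the recent trend instead; a sketch of a rule in the same format as the alerts above (the 1h window and 4h horizon are illustrative choices):

- alert: DiskWillFillIn4Hours
  expr: |
    predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Filesystem predicted to fill within 4 hours"
    description: "Based on the last hour's trend, / is predicted to run out of space in under 4 hours"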
AlertManager Configuration
# alertmanager-config.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: 'alerts'
  smtp_auth_password: '${SMTP_PASSWORD}'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
  - name: 'critical-alerts'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
    slack_configs:
      - channel: '#critical-alerts'
        send_resolved: true
        api_url: '${SLACK_WEBHOOK_URL}'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_KEY}'
        severity: critical
  - name: 'warning-alerts'
    email_configs:
      - to: '[email protected]'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
Service Discovery
AWS EC2 Discovery
scrape_configs:
  - job_name: 'aws-ec2'
    ec2_sd_configs:
      - region: us-east-1
        access_key: ${AWS_ACCESS_KEY}
        secret_key: ${AWS_SECRET_KEY}
        port: 9100
        filters:
          - name: tag:Monitoring
            values: ["enabled"]
    relabel_configs:
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance
      - source_labels: [__meta_ec2_tag_Name]
        target_label: name
      - source_labels: [__meta_ec2_availability_zone]
        target_label: availability_zone
Kubernetes Service Discovery
# Full Kubernetes discovery
- job_name: 'kubernetes-services'
  kubernetes_sd_configs:
    - role: service
      namespaces:
        names:
          - production
          - staging
  relabel_configs:
    - action: keep
      regex: true
      source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    - action: replace
      regex: (.+)
      source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      target_label: __metrics_path__
    - action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      target_label: __address__
Best Practices
Metric Cardinality
# Bad - high cardinality: user_id and trace_id are unbounded
http_request_duration_seconds_bucket{le, method, endpoint, user_id, trace_id}
# Good - low cardinality: every label has a small, fixed set of values
http_request_duration_seconds_bucket{le, method, endpoint}
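The reason one unbounded label is so damaging: every unique label combination is its own time series, so the series count is the product of each label's distinct values, and the biggest factor dominates. The numbers below are illustrative assumptions, not measurements.

```python
# Series count = product of distinct values per label (illustrative numbers).
methods, endpoints, buckets = 5, 30, 12

good = methods * endpoints * buckets
print(f"without user_id: {good:,} series")   # 1,800

# An unbounded label like user_id multiplies everything that came before it.
users = 100_000
bad = good * users
print(f"with user_id:    {bad:,} series")    # 180,000,000
```

At 1,800 series the histogram is cheap; at 180 million it will take down the TSDB. Put unbounded identifiers in logs or traces, not in metric labels.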
Recording Rules
groups:
  - name: application
    interval: 30s
    rules:
      - record: job:http_requests_total:sum5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_request_duration_seconds:histogram_quantile_95
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
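The second rule relies on histogram_quantile, which estimates the quantile by finding the cumulative bucket where the target rank falls and linearly interpolating inside it. This stdlib sketch mirrors that calculation on made-up cumulative bucket counts (it omits Prometheus's special handling of the +Inf bucket):

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (le, count) buckets by linear
    interpolation inside the bucket where the rank falls (simplified)."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            # rank sits in this bucket: interpolate between its bounds
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]

# cumulative counts: 60 observations <= 0.1s, 90 <= 0.5s, 100 <= 1.0s
buckets = [(0.1, 60), (0.5, 90), (1.0, 100)]
print(histogram_quantile(0.95, buckets))  # 0.75
```

The interpolation is why coarse bucket boundaries give coarse quantile estimates: the p95 here can only land somewhere inside the 0.5s to 1.0s bucket.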
Federation
global:
  external_labels:
    cluster: 'us-west-2'

# remote_write continuously ships samples to a central Prometheus-compatible endpoint
remote_write:
  - url: https://central-prometheus/api/v1/write
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt
    basic_auth:
      username: federation
      password: ${FEDERATION_PASSWORD}
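The remote_write block above pushes samples out from each cluster. Classic federation works in the opposite direction: the central Prometheus scrapes a subset of series from each cluster's /federate endpoint. A sketch of that central-side scrape job, where the hostname is a placeholder and the match[] selectors assume you only want job-level recording-rule aggregates:

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'   # pull only recording-rule aggregates
    static_configs:
      - targets:
          - prometheus-us-west-2.example.com:9090

Federating only pre-aggregated series keeps the central server's cardinality manageable; pulling raw series through /federate defeats the purpose.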