Metrics Collection: Prometheus, StatsD, and Custom Metrics
TL;DR: This guide covers implementing metrics collection using Prometheus, StatsD, and custom application metrics. Learn about metrics types, instrumentation, and building observable systems.
Introduction
Metrics provide quantitative measurements of system behavior:
- Counters - Cumulative values (total requests)
- Gauges - Point-in-time values (memory usage)
- Histograms - Value distributions (request latency)
- Summaries - Aggregated percentiles (response times)
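Before wiring up a real client library, the semantics of the first three types can be sketched in plain Python (a toy illustration only, not a real metrics client; summaries are omitted since they aggregate percentiles client-side):

```python
# Toy illustration of metric-type semantics (not a real client library).

class Counter:
    """Cumulative and monotonically increasing -- it only goes up."""
    def __init__(self):
        self.value = 0
    def inc(self, n=1):
        self.value += n

class Gauge:
    """Point-in-time value; may go up or down."""
    def __init__(self):
        self.value = 0
    def set(self, v):
        self.value = v

class Histogram:
    """Counts observations into cumulative 'le' (less-or-equal) buckets."""
    def __init__(self, bounds):
        self.bounds = bounds
        self.counts = [0] * len(bounds)
    def observe(self, v):
        for i, b in enumerate(self.bounds):
            if v <= b:
                self.counts[i] += 1

requests = Counter(); requests.inc(); requests.inc()
memory = Gauge(); memory.set(512)
latency = Histogram([0.1, 0.5, 1.0]); latency.observe(0.3)
print(requests.value, memory.value, latency.counts)  # 2 512 [0, 1, 1]
```

Note the cumulative buckets: a 0.3s observation lands in every bucket whose bound it fits under, which is exactly how Prometheus histograms are exported.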
Prometheus Basics
Prometheus Installation
# Run Prometheus
docker run -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'my-application'
    static_configs:
      - targets: ['localhost:8080']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
Application Instrumentation
Go Application
import (
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: []float64{0.001, 0.01, 0.1, 0.5, 1, 2, 5},
		},
		[]string{"method", "endpoint"},
	)
	activeConnections = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)
)

func init() {
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(httpRequestDuration)
	prometheus.MustRegister(activeConnections)
}

// statusRecorder captures the status code the handler actually writes,
// so the counter is not hard-coded to "200".
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func metricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(
			r.Method, r.URL.Path,
		))
		defer timer.ObserveDuration()

		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)

		// Increment the counter with the real status after the handler runs
		httpRequestsTotal.WithLabelValues(
			r.Method, r.URL.Path, strconv.Itoa(rec.status),
		).Inc()
	})
}

func main() {
	mux := http.NewServeMux()
	// Expose the /metrics endpoint that Prometheus scrapes
	mux.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", metricsMiddleware(mux))
}
Python Application
from prometheus_client import (
    Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
)
from flask import Flask, Response, request

app = Flask(__name__)

# Define metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'],
    buckets=(0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0)
)
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

@app.before_request
def before_request():
    active_connections.inc()

@app.after_request
def after_request(response):
    http_requests_total.labels(
        method=request.method,
        endpoint=request.path,  # request.endpoint can be None for 404s
        status=str(response.status_code)
    ).inc()
    active_connections.dec()
    return response
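When Prometheus scrapes the /metrics endpoint, it receives the plain-text exposition format. To make that concrete, a single sample line can be assembled by hand with the stdlib (the metric name and labels here are illustrative):

```python
# Build one sample line of the Prometheus text exposition format by hand.
# Shape: metric_name{label="value",...} sample_value

def exposition_line(name, labels, value):
    # Prometheus sorts nothing itself, but sorting keeps output deterministic
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f'{name}{{{label_str}}} {value}'

line = exposition_line(
    "http_requests_total",
    {"method": "GET", "endpoint": "/", "status": "200"},
    42,
)
print(line)
# http_requests_total{endpoint="/",method="GET",status="200"} 42
```

Real client libraries also emit `# HELP` and `# TYPE` comment lines per metric, which this sketch omits.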
Custom Business Metrics
E-commerce Metrics
// Order processing metrics
var (
	ordersPlaced = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "orders_placed_total",
			Help: "Total number of orders placed",
		},
		[]string{"status", "payment_method"},
	)
	orderValue = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "order_value_dollars",
			Help:    "Value of orders in dollars",
			Buckets: []float64{10, 25, 50, 100, 250, 500, 1000},
		},
		[]string{"category"},
	)
	cartAbandonmentRate = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "cart_abandonment_rate",
			Help: "Rate of cart abandonment",
		},
	)
)

func recordOrder(order Order) {
	ordersPlaced.WithLabelValues(
		order.Status,
		order.PaymentMethod,
	).Inc()
	orderValue.WithLabelValues(
		order.Category,
	).Observe(order.Value)
}
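The cart-abandonment gauge above is declared but never set. One hedged sketch of how the value might be computed (the formula and function names are illustrative assumptions, shown in Python for brevity):

```python
# Hypothetical cart-abandonment computation:
# abandonment = (carts created - orders completed) / carts created

def cart_abandonment_rate(carts_created, orders_completed):
    if carts_created == 0:
        return 0.0  # avoid division by zero when there is no traffic
    return (carts_created - orders_completed) / carts_created

rate = cart_abandonment_rate(carts_created=200, orders_completed=150)
print(rate)  # 0.25

# In the Go example this value would be pushed periodically with:
# cartAbandonmentRate.Set(rate)
```

Since gauges are set rather than incremented, a background job recomputing this ratio on an interval is the usual pattern.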
StatsD Integration
StatsD Server
# Run StatsD
docker run -p 8125:8125/udp -p 8126:8126 \
  statsd/statsd
Sending StatsD Metrics
import statsd

# Initialize client
client = statsd.StatsClient('localhost', 8125)

# Increment counter
client.incr('requests.total')

# Record value
client.gauge('active_users', 150)

# Record timing
client.timing('request.duration', 250)  # milliseconds

# The statsd package has no tags= argument; DataDog-style tags
# require the DogStatsD client from the datadog package instead:
# from datadog import statsd as dogstatsd
# dogstatsd.increment('requests.total', tags=['env:production', 'service:api'])
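Under the hood the client is just sending small plain-text UDP datagrams, `name:value|type`, with no connection and no acknowledgment. The wire format can be sketched with the stdlib alone:

```python
import socket

# Format StatsD datagrams by hand: "<name>:<value>|<type>"
# c = counter, g = gauge, ms = timing in milliseconds
def statsd_packet(name, value, metric_type):
    return f"{name}:{value}|{metric_type}"

packets = [
    statsd_packet("requests.total", 1, "c"),
    statsd_packet("active_users", 150, "g"),
    statsd_packet("request.duration", 250, "ms"),
]

# Sending is fire-and-forget UDP -- no error if nothing is listening
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for p in packets:
    sock.sendto(p.encode(), ("localhost", 8125))
print(packets)
```

This fire-and-forget design is why StatsD instrumentation adds near-zero latency to the hot path, and also why lost packets go unnoticed.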
StatsD to Prometheus
# prometheus.yml -- scrape the statsd-exporter (prom/statsd-exporter),
# which receives StatsD packets and exposes them as Prometheus metrics on :9102
scrape_configs:
  - job_name: 'statsd-exporter'
    static_configs:
      - targets: ['localhost:9102']
Histograms and Percentiles
Understanding Buckets
// Typical HTTP latency buckets (close to prometheus.DefBuckets)
[]float64{
	0.005, // 5ms
	0.01,  // 10ms
	0.025, // 25ms
	0.05,  // 50ms
	0.1,   // 100ms
	0.25,  // 250ms
	0.5,   // 500ms
	1.0,   // 1s
	2.5,   // 2.5s
	5.0,   // 5s
}
// Querying percentiles in Prometheus (aggregate buckets by le first,
// so the quantile is computed across all instances, not per series)
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) // p50
histogram_quantile(0.90, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) // p90
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) // p95
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) // p99
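To make the bucket/quantile relationship concrete, here is a stdlib sketch of the linear interpolation histogram_quantile performs over cumulative le buckets (simplified: one series, no +Inf or empty-bucket edge cases):

```python
# Approximate Prometheus histogram_quantile() for a single series.
# buckets: list of (le_upper_bound, cumulative_count), sorted by bound.

def histogram_quantile(q, buckets):
    total = buckets[-1][1]
    rank = q * total  # the observation rank the quantile falls on
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket, as Prometheus does
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

buckets = [(0.1, 50), (0.5, 90), (1.0, 100)]
print(histogram_quantile(0.5, buckets))  # 0.1
print(histogram_quantile(0.9, buckets))  # 0.5
```

The interpolation assumes observations are uniformly distributed inside each bucket, which is why quantile accuracy depends entirely on how well the bucket bounds match the real latency distribution.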
Service Level Metrics
RED Metrics
| Metric | Description | Prometheus Query |
|---|---|---|
| Rate | Requests per second | sum(rate(http_requests_total[5m])) |
| Errors | Error rate | sum(rate(http_requests_total{status=~"5.."}[5m])) |
| Duration | Response time (p95) | histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) |
USE Metrics
| Metric | Description |
|---|---|
| Utilization | Resource usage |
| Saturation | Queue depth, load |
| Errors | Error rate |
Alerting Rules
# alerts.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
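Once the two rates are known, the HighErrorRate expression reduces to simple threshold arithmetic, which a sketch makes explicit:

```python
# Evaluate the HighErrorRate condition from the alert above:
# error_rps / total_rps > 0.05

def high_error_rate(error_rps, total_rps, threshold=0.05):
    if total_rps == 0:
        return False  # no traffic: nothing to alert on
    return error_rps / total_rps > threshold

print(high_error_rate(error_rps=6.0, total_rps=100.0))  # True  (6% > 5%)
print(high_error_rate(error_rps=2.0, total_rps=100.0))  # False (2% < 5%)
```

The `for: 5m` clause adds what this sketch cannot: the condition must hold continuously for five minutes before the alert fires, which suppresses brief spikes.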
Conclusion
Metrics collection enables:
- Quantitative monitoring - Measure system behavior
- Performance optimization - Identify bottlenecks
- Alerting - Detect anomalies
- Capacity planning - Plan for growth
- SLO tracking - Monitor service levels