Introduction
Prometheus collects metrics; Grafana visualizes them. Together they give you dashboards, alerts, and the data to answer “is my system healthy?” and “why is it slow?”
Prerequisites: Docker and Docker Compose installed. Basic understanding of metrics concepts.
Quick Start with Docker Compose
# docker-compose.monitoring.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # rule_files in prometheus.yml is resolved relative to the config file,
      # so the alerts directory has to be mounted next to it
      - ./alerts:/etc/prometheus/alerts
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin      # local use only; change for anything shared
      GF_USERS_ALLOW_SIGN_UP: "false"        # quoted: Compose rejects bare booleans here
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
volumes:
  prometheus-data:
  grafana-data:
docker compose -f docker-compose.monitoring.yml up -d
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3001 (admin/admin)
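Once the containers are up, it helps to confirm that each service actually answers before moving on. The following is a minimal sketch, not part of the stack itself: it assumes Node.js 18+ (for the global `fetch`) and the port mappings from the Compose file above. The file name `check-stack.js` and the helper names are hypothetical. The probed paths are the standard health endpoints: `/-/healthy` for Prometheus and Alertmanager, `/api/health` for Grafana.

```javascript
// check-stack.js (hypothetical helper for local verification)
// Assumes Node.js 18+ and the host ports mapped in docker-compose.monitoring.yml.
const ENDPOINTS = {
  prometheus: 'http://localhost:9090/-/healthy',
  grafana: 'http://localhost:3001/api/health',
  alertmanager: 'http://localhost:9093/-/healthy',
};

// Probe one endpoint. Never throws, so one dead service does not hide the others.
async function checkEndpoint(name, url, fetchImpl = globalThis.fetch) {
  try {
    const res = await fetchImpl(url);
    return { name, ok: res.ok, status: res.status };
  } catch (err) {
    return { name, ok: false, status: null, error: err.message };
  }
}

// Probe all endpoints in parallel and return one result object per service.
async function checkStack(endpoints = ENDPOINTS, fetchImpl = globalThis.fetch) {
  const entries = Object.entries(endpoints);
  return Promise.all(
    entries.map(([name, url]) => checkEndpoint(name, url, fetchImpl))
  );
}
```

Calling `checkStack().then(console.table)` prints one row per service. The `fetchImpl` parameter exists only so the functions can be exercised without a running stack.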
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s      # how often to scrape targets (15s assumed here, matching evaluation_interval)
  evaluation_interval: 15s  # how often to evaluate rules

# Alert rules
rule_files:
  - "alerts/*.yml"

# Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Scrape targets
scrape_configs:
  # Your Node.js application
  - job_name: 'myapp'
    static_configs:
      - targets: ['app:3000']
    metrics_path: '/metrics'

  # Node Exporter (system metrics: CPU, memory, disk)
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # PostgreSQL exporter
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
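After Prometheus loads this configuration, its HTTP API reports the health of every scrape target at `/api/v1/targets`. The sketch below groups that response by job so a misconfigured target stands out. It assumes Node.js 18+ and Prometheus on `localhost:9090`; `summarizeTargets` and `fetchTargetSummary` are hypothetical helper names. Note that the Compose file above defines no `app`, `node-exporter`, or `postgres-exporter` service, so those jobs will report as down until you add them.

```javascript
// targets.js (hypothetical helper)
// Group the active targets returned by GET /api/v1/targets by job and health.
function summarizeTargets(apiResponse) {
  const summary = {};
  for (const target of apiResponse.data.activeTargets) {
    const job = target.labels.job;
    if (!summary[job]) {
      summary[job] = { up: 0, down: 0, unknown: 0, errors: [] };
    }
    // Prometheus reports health as "up", "down", or "unknown".
    const health = ['up', 'down'].includes(target.health) ? target.health : 'unknown';
    summary[job][health] += 1;
    if (target.lastError) {
      summary[job].errors.push(`${target.scrapeUrl}: ${target.lastError}`);
    }
  }
  return summary;
}

// Fetch the target list from a running Prometheus and summarize it.
async function fetchTargetSummary(
  baseUrl = 'http://localhost:9090',
  fetchImpl = globalThis.fetch
) {
  const res = await fetchImpl(`${baseUrl}/api/v1/targets?state=active`);
  if (!res.ok) throw new Error(`Prometheus returned HTTP ${res.status}`);
  return summarizeTargets(await res.json());
}
```

Running `fetchTargetSummary().then(console.log)` against the stack shows, per job, how many targets are up or down and the last scrape error for each failing one.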
Instrumenting Node.js
npm install prom-client
// metrics.js
import { Registry, Counter, Histogram, Gauge, collectDefaultMetrics } from 'prom-client';

const registry = new Registry();

// Collect default metrics (memory, CPU, event loop lag)
collectDefaultMetrics({ register: registry });

// HTTP request counter
export const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [registry],
});

// HTTP request duration histogram
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [registry],
});

// Active connections gauge
export const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
  registers: [registry],
});

// Business metric: orders created
export const ordersCreated = new Counter({
  name: 'orders_created_total',
  help: 'Total orders created',
  labelNames: ['status'],
  registers: [registry],
});

export { registry };

// middleware/metrics.js: Express middleware
import { httpRequestsTotal, httpRequestDuration } from '../metrics.js';

export function metricsMiddleware(req, res, next) {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Prefer the matched route pattern; raw paths can inflate label cardinality
    const route = req.route?.path || req.path;
    const labels = {
      method: req.method,
      route,
      status_code: res.statusCode,
    };
    httpRequestsTotal.inc(labels);
    httpRequestDuration.observe(labels, duration);
  });
  next();
}

```javascript
// app.js
import express from 'express';
import { registry, ordersCreated } from './metrics.js';
import { metricsMiddleware } from './middleware/metrics.js';

const app = express();

// Parse JSON bodies so req.body is populated
app.use(express.json());

// Apply metrics middleware to all routes
app.use(metricsMiddleware);

// Expose metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.end(await registry.metrics());
});

// Your routes
app.get('/api/users', async (req, res) => {
  const users = await getUsers();
  res.json(users);
});

// Track business metrics
app.post('/api/orders', async (req, res, next) => {
  try {
    const order = await createOrder(req.body);
    ordersCreated.inc({ status: 'success' });
    res.json(order);
  } catch (err) {
    ordersCreated.inc({ status: 'error' });
    // Hand the error to Express; a bare throw in an async handler is not caught by Express 4
    next(err);
  }
});
```

PromQL: Querying Metrics
PromQL is Prometheus’s query language. These are the queries you'll use most:

```promql
# Request rate (requests per second over last 5 minutes)
rate(http_requests_total[5m])

# Error rate (percentage of 5xx responses)
# sum() on both sides is required: without it the division matches series
# label-for-label and every 5xx series is divided by itself
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# P95 latency
histogram_quantile(0.95,
  sum by (le) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# P99 latency by route
histogram_quantile(0.99,
  sum by (route, le) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# Total requests in last hour
increase(http_requests_total[1h])

# Memory usage (MB)
process_resident_memory_bytes / 1024 / 1024

# CPU usage percentage
rate(process_cpu_seconds_total[5m]) * 100

# Event loop lag (Node.js)
nodejs_eventloop_lag_seconds

# Active connections
active_connections

# Orders per minute
rate(orders_created_total{status="success"}[1m]) * 60
```

Grafana Dashboards
Provisioning a Dashboard
# grafana/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
    editable: false

### Key Panels for an Application Dashboard
```json
{
  "dashboard": {
    "title": "Application Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [{
          "expr": "sum(rate(http_requests_total[5m])) by (route)",
          "legendFormat": "{{route}}"
        }]
      },
      {
        "title": "Error Rate %",
        "type": "stat",
        "targets": [{
          "expr": "sum(rate(http_requests_total{status_code=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) * 100"
        }],
        "thresholds": {
          "steps": [
            {"color": "green", "value": 0},
            {"color": "yellow", "value": 1},
            {"color": "red", "value": 5}
          ]
        }
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
          "legendFormat": "p95"
        }]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [{
          "expr": "process_resident_memory_bytes / 1024 / 1024",
          "legendFormat": "RSS (MB)"
        }]
      }
    ]
  }
}
```

Import community dashboards: Grafana has thousands of pre-built dashboards at grafana.com/grafana/dashboards. Popular ones:
- Node.js: Dashboard ID `11159`
- Node Exporter (system): Dashboard ID `1860`
- PostgreSQL: Dashboard ID `9628`

## Alerting
### Alert Rules
```yaml
# alerts/app.yml
groups:
  - name: application
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $value | humanizePercentage }}"
          description: "Error rate has been above 5% for 5 minutes"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 1 second"

      # Service down
      - alert: ServiceDown
        expr: up{job="myapp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Application is down"

      # High memory
      - alert: HighMemoryUsage
        expr: |
          process_resident_memory_bytes / 1024 / 1024 > 512
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 512MB: {{ $value | humanize }}MB"
```

### Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
```

The Four Golden Signals
Monitor these four signals for any service:
| Signal | What it measures | Example metric |
|---|---|---|
| Latency | How long requests take | http_request_duration_seconds |
| Traffic | How much demand | http_requests_total |
| Errors | Rate of failed requests | http_requests_total{status_code=~"5.."} |
| Saturation | How “full” the service is | CPU, memory, queue depth |

Resources
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [PromQL Cheat Sheet](https://promlabs.com/promql-cheat-sheet/)
- [prom-client (Node.js)](https://github.com/siimon/prom-client)
- [Grafana Dashboard Library](https://grafana.com/grafana/dashboards/)
- [Google SRE: Four Golden Signals](https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals)