⚡ Calmops

# Monitoring with Prometheus and Grafana: A Practical Setup Guide

## Introduction

Prometheus collects and stores metrics; Grafana visualizes them. Together they give you dashboards, alerts, and the data to answer “is my system healthy?” and “why is it slow?”

Prerequisites: Docker and Docker Compose installed. Basic understanding of metrics concepts.
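If Prometheus-style metrics are new to you, the one type worth internalizing up front is the histogram: it stores cumulative `le` ("less than or equal") buckets plus a running sum and count, which is what `histogram_quantile()` queries later operate on. A rough sketch in plain JavaScript (illustration only; `makeHistogram` is a made-up helper, not part of any library — the real implementation used in this guide is prom-client):

```javascript
// Minimal sketch of a Prometheus-style histogram. Buckets are cumulative:
// each observation is counted in every bucket whose upper bound ("le")
// it fits under, and the sum and count are tracked alongside.
function makeHistogram(buckets) {
  const counts = new Map(buckets.map((b) => [b, 0]));
  let sum = 0;
  let count = 0;
  return {
    observe(value) {
      sum += value;
      count += 1;
      for (const b of buckets) {
        if (value <= b) counts.set(b, counts.get(b) + 1);
      }
    },
    snapshot() {
      return { buckets: Object.fromEntries(counts), sum, count };
    },
  };
}

// Three request durations against buckets of 0.5s, 1s, and 2.5s
const h = makeHistogram([0.5, 1, 2.5]);
[0.25, 0.5, 0.75].forEach((v) => h.observe(v));
console.log(h.snapshot());
// → { buckets: { '0.5': 2, '1': 3, '2.5': 3 }, sum: 1.5, count: 3 }
```

Note that the bucket counts never decrease as `le` grows; that cumulative shape is what lets Prometheus estimate quantiles from them.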

## Quick Start with Docker Compose

```yaml
# docker-compose.monitoring.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  prometheus-data:
  grafana-data:
```

```bash
docker compose -f docker-compose.monitoring.yml up -d
# Prometheus: http://localhost:9090
# Grafana:    http://localhost:3001 (admin/admin)
```
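Before moving on, it's worth confirming the stack actually came up. A quick check using the containers' built-in health endpoints (Prometheus exposes `/-/ready` and `/api/v1/targets`, Grafana exposes `/api/health`; these only respond once the containers above are running):

```shell
# Confirm all three containers are up
docker compose -f docker-compose.monitoring.yml ps

# Prometheus readiness and scrape-target status
curl -s http://localhost:9090/-/ready
curl -s http://localhost:9090/api/v1/targets | head

# Grafana health endpoint
curl -s http://localhost:3001/api/health
```

If a target shows as `down` in the targets output, the usual culprits are a wrong hostname in `prometheus.yml` or the app not being on the same Docker network.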

## Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s       # how often to scrape targets
  evaluation_interval: 15s   # how often to evaluate rules

# Alert rules
rule_files:
  - "alerts/*.yml"

# Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Scrape targets
scrape_configs:
  # Your Node.js application
  - job_name: 'myapp'
    static_configs:
      - targets: ['app:3000']
    metrics_path: '/metrics'

  # Node Exporter (system metrics: CPU, memory, disk)
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # PostgreSQL exporter
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
```
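With a 15-second `scrape_interval`, Prometheus ends up with raw counter samples every 15s, and nearly every useful query turns those into per-second rates. As a mental model, here is roughly what PromQL's `rate()` does with a window of samples (a simplified sketch, not library code — real Prometheus additionally extrapolates to the window boundaries, and the sample data is invented for illustration):

```javascript
// Rough model of PromQL's rate(): the per-second increase of a counter
// across a window of scraped samples, accounting for counter resets.
function simpleRate(samples) {
  // samples: [{ t: unixSeconds, v: counterValue }, ...] in time order
  if (samples.length < 2) return 0;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // Counters only go up; a drop means the process restarted, so the
    // post-reset value counts as fresh increase.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const windowSeconds = samples[samples.length - 1].t - samples[0].t;
  return increase / windowSeconds;
}

// Four scrapes 15s apart: the counter grew by 60 over 45s
const perSecond = simpleRate([
  { t: 0, v: 100 },
  { t: 15, v: 120 },
  { t: 30, v: 140 },
  { t: 45, v: 160 },
]);
console.log(perSecond); // → 1.3333333333333333 (requests/sec)
```

This is also why `rate()` windows should span several scrape intervals (e.g. `[5m]` at a 15s interval): with too few samples there is nothing to compute an increase over.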

## Instrumenting Node.js

```bash
npm install prom-client
```

```javascript
// metrics.js
import { Registry, Counter, Histogram, Gauge, collectDefaultMetrics } from 'prom-client';

const registry = new Registry();

// Collect default metrics (memory, CPU, event loop lag)
collectDefaultMetrics({ register: registry });

// HTTP request counter
export const httpRequestsTotal = new Counter({
    name: 'http_requests_total',
    help: 'Total number of HTTP requests',
    labelNames: ['method', 'route', 'status_code'],
    registers: [registry],
});

// HTTP request duration histogram
export const httpRequestDuration = new Histogram({
    name: 'http_request_duration_seconds',
    help: 'HTTP request duration in seconds',
    labelNames: ['method', 'route', 'status_code'],
    buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
    registers: [registry],
});

// Active connections gauge
export const activeConnections = new Gauge({
    name: 'active_connections',
    help: 'Number of active connections',
    registers: [registry],
});

// Business metric: orders created
export const ordersCreated = new Counter({
    name: 'orders_created_total',
    help: 'Total orders created',
    labelNames: ['status'],
    registers: [registry],
});

export { registry };
```

```javascript
// middleware/metrics.js - Express middleware
import { httpRequestsTotal, httpRequestDuration } from '../metrics.js';

export function metricsMiddleware(req, res, next) {
    const start = Date.now();

    res.on('finish', () => {
        const duration = (Date.now() - start) / 1000;
        const route = req.route?.path || req.path;
        const labels = {
            method: req.method,
            route,
            status_code: res.statusCode,
        };

        httpRequestsTotal.inc(labels);
        httpRequestDuration.observe(labels, duration);
    });

    next();
}
```

```javascript
// app.js
import express from 'express';
import { registry, ordersCreated } from './metrics.js';
import { metricsMiddleware } from './middleware/metrics.js';

const app = express();

// Apply metrics middleware to all routes
app.use(metricsMiddleware);

// Expose metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
    res.set('Content-Type', registry.contentType);
    res.end(await registry.metrics());
});

// Your routes
app.get('/api/users', async (req, res) => {
    const users = await getUsers();
    res.json(users);
});

// Track business metrics
app.post('/api/orders', async (req, res) => {
    try {
        const order = await createOrder(req.body);
        ordersCreated.inc({ status: 'success' });
        res.json(order);
    } catch (err) {
        ordersCreated.inc({ status: 'error' });
        throw err;
    }
});
```

## PromQL: Querying Metrics

PromQL is Prometheus’s query language. These are the queries you'll use most:

```promql
# Request rate (requests per second over last 5 minutes)
rate(http_requests_total[5m])

# Error rate (percentage of 5xx responses)
rate(http_requests_total{status_code=~"5.."}[5m])
/ rate(http_requests_total[5m]) * 100

# P95 latency
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)

# P99 latency by route
histogram_quantile(0.99,
  sum by (route, le) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# Total requests in last hour
increase(http_requests_total[1h])

# Memory usage (MB)
process_resident_memory_bytes / 1024 / 1024

# CPU usage percentage
rate(process_cpu_seconds_total[5m]) * 100

# Event loop lag (Node.js)
nodejs_eventloop_lag_seconds

# Active connections
active_connections

# Orders per minute
rate(orders_created_total{status="success"}[1m]) * 60
```

## Grafana Dashboards

### Provisioning a Dashboard

```yaml
# grafana/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
    editable: false
```

### Key Panels for an Application Dashboard

```json
{
  "dashboard": {
    "title": "Application Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [{
          "expr": "sum(rate(http_requests_total[5m])) by (route)",
          "legendFormat": "{{route}}"
        }]
      },
      {
        "title": "Error Rate %",
        "type": "stat",
        "targets": [{
          "expr": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
        }],
        "thresholds": {
          "steps": [
            {"color": "green", "value": 0},
            {"color": "yellow", "value": 1},
            {"color": "red", "value": 5}
          ]
        }
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
          "legendFormat": "p95"
        }]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [{
          "expr": "process_resident_memory_bytes / 1024 / 1024",
          "legendFormat": "RSS (MB)"
        }]
      }
    ]
  }
}
```

Import community dashboards: Grafana has thousands of pre-built dashboards at grafana.com/grafana/dashboards. Popular ones:

- Node.js: Dashboard ID `11159`
- Node Exporter (system): Dashboard ID `1860`
- PostgreSQL: Dashboard ID `9628`

## Alerting

### Alert Rules

```yaml
# alerts/app.yml
groups:
  - name: application
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $value | humanizePercentage }}"
          description: "Error rate has been above 5% for 5 minutes"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 1 second"

      # Service down
      - alert: ServiceDown
        expr: up{job="myapp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Application is down"

      # High memory
      - alert: HighMemoryUsage
        expr: |
          process_resident_memory_bytes / 1024 / 1024 > 512
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 512MB: {{ $value | humanize }}MB"
```

### Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts'

  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
```

## The Four Golden Signals

Monitor these four signals for any service:

| Signal | What it measures | Example metric |
|---|---|---|
| Latency | How long requests take | `http_request_duration_seconds` |
| Traffic | How much demand | `http_requests_total` |
| Errors | Rate of failed requests | `http_requests_total{status=~"5.."}` |
| Saturation | How “full” the service is | CPU, memory, queue depth |

## Resources

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [PromQL Cheat Sheet](https://promlabs.com/promql-cheat-sheet/)
- [prom-client (Node.js)](https://github.com/siimon/prom-client)
- [Grafana Dashboard Library](https://grafana.com/grafana/dashboards/)
- [Google SRE: Four Golden Signals](https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals)