Introduction
Monitoring is essential for maintaining healthy Linux systems. This guide covers monitoring tools and techniques — from quick command-line checks with top and htop to full production stacks with Prometheus and Grafana.
Command Line Tools
Essential Commands
# System resource usage
top # Interactive process viewer
htop # Enhanced top (colorful)
btop # Modern top (graphs, mouse support)
atop # Advanced top with persistent logging
glances # Cross-platform monitoring
# CPU
mpstat -P ALL 1 # Per-CPU stats
sar -u 1 # CPU utilization
# Memory
free -h # Memory usage
vmstat 1 # Virtual memory stats
# Disk
df -h # Disk usage
iostat -x 1 # I/O statistics
du -sh * # Directory sizes
# Network
iftop # Network bandwidth per connection
nload # Network bandwidth per interface
nethogs # Bandwidth per process
ss -tuln # Listening ports
top, htop, and btop Comparison
All three tools show running processes, but they differ in capability:
| Feature | top | htop | btop |
|---|---|---|---|
| Default install | Yes (pre-installed) | No (apt install htop) | No (apt install btop) |
| Interface | Minimal, monochrome | Colored, scrollable | Full-screen TUI with graphs |
| Mouse support | No | Yes | Yes |
| Tree view | Yes (V, forest view) | Yes (F5) | Yes |
| Sorting | Interactive (Shift+F) | Click column headers | Click column headers |
| Per-process IO | No | Yes | Yes |
| CPU frequency | No | Yes (configurable) | Yes (real-time graph) |
| Network graphs | No | No | Built-in |
| Kill processes | k | F9 | Click + confirm |
| Config file | Interactive only | ~/.config/htop/htoprc | ~/.config/btop/btop.conf |
For quick diagnostics on any server, top suffices. For daily interactive use, htop is the sweet spot. For monitoring dashboards in a terminal, btop provides the richest visual experience.
htop Customization
# Install htop
sudo apt install htop
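# Custom htop config
# ~/.config/htop/htoprc (flat key=value file, not YAML; htop rewrites it on exit)
show_cpu_usage=1
show_cpu_frequency=1
show_cpu_temperature=1
detailed_cpu_time=1
# Memory is shown by the Memory meter in the header (F2 Setup), not by a show_* key.
# Columns are easiest to set in the F2 Setup screen; the layout used here is:
# PID, USER, PRIORITY, NICE, M_SIZE, M_RESIDENT, STATE, CPU, MEM, TIME, Command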
btop Quick Start
sudo apt install btop
btop
Press 1 through 4 to toggle the CPU, memory, network, and process boxes, m or Esc for the menu, and ? for the full key bindings list. btop reads from /proc and /sys, so no kernel modules are needed.
CPU Monitoring
mpstat Per-CPU Breakdown
# Show per-CPU utilization every 2 seconds
mpstat -P ALL 2
# Output:
# 02:30:01 PM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
# 02:30:03 PM all 12.5 0.0 3.1 0.0 0.0 0.0 0.0 84.4
# 02:30:03 PM 0 15.0 0.0 5.0 0.0 0.0 0.0 0.0 80.0
# 02:30:03 PM 1 10.0 0.0 1.0 0.0 0.0 0.0 0.0 88.0
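High %iowait means CPUs sat idle while I/O requests were outstanding, so look for the process that is reading or writing heavily. High %sys indicates kernel activity (system calls, drivers). High %steal (in VMs) means the hypervisor gave CPU time to other guests while this one was ready to run, which usually points to an oversubscribed host.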
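The percentages mpstat prints are derived from the cumulative tick counters in /proc/stat. As a rough sketch of that arithmetic (the helper name cpu_pct and its two-snapshot-file interface are illustrative assumptions, not part of sysstat), the following compares two saved copies of /proc/stat and reports the aggregate busy, iowait, and steal shares:

```shell
# cpu_pct BEFORE AFTER
# Hypothetical helper: compare the aggregate "cpu" line of two /proc/stat
# snapshots. "busy" here means everything except idle and iowait.
cpu_pct() {
    awk '
        $1 == "cpu" {
            # Fields 2..9: user nice system idle iowait irq softirq steal
            total = 0
            for (i = 2; i <= 9; i++) total += $i
            if (NR == FNR) { t0 = total; idle0 = $5; wait0 = $6; steal0 = $9 }
            else           { t1 = total; idle1 = $5; wait1 = $6; steal1 = $9 }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print "no elapsed ticks between snapshots"; exit 1 }
            d_idle = idle1 - idle0; d_wait = wait1 - wait0; d_steal = steal1 - steal0
            printf "busy=%.1f iowait=%.1f steal=%.1f\n", 100 * (dt - d_idle - d_wait) / dt, 100 * d_wait / dt, 100 * d_steal / dt
        }
    ' "$1" "$2"
}
```

A typical run would be: cat /proc/stat > /tmp/a; sleep 2; cat /proc/stat > /tmp/b; cpu_pct /tmp/a /tmp/b.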
Memory Monitoring
free and vmstat
free -h
# total used free shared buff/cache available
# Mem: 31G 12G 8.2G 1.2G 11G 17G
# Swap: 2.0G 0.0G 2.0G
available is the key metric — it estimates how much memory is available for new processes without swapping. It includes free memory plus reclaimable cache. If available drops below 10% of total, consider adding RAM or reducing the workload.
vmstat 1
# procs ---------memory---------- ---swap-- --io-- --system-- -----cpu-----
# r b swpd free buff cache si so bi bo in cs us sy id wa
# 2 0 0 8.2G 2.1G 8.8G 0 0 10 55 500 800 12 3 84 1
Columns to watch:
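- r: processes waiting for CPU (run queue). If it stays above the CPU count, tasks are queuing for CPU time; sustained values several times the CPU count mean the system is overloaded.
- b: processes blocked on I/O.
- si/so: swap-in/swap-out. Sustained non-zero values mean memory pressure.
- wa: CPU time waiting for I/O.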
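The checks above can be automated with a small filter. This is only a sketch: the function name vmstat_flags is made up for illustration, and it assumes the standard vmstat header row (beginning with r) is present so that columns can be located by name rather than by position.

```shell
# vmstat_flags NCPU
# Hypothetical helper: read vmstat output on stdin and report samples where
# the run queue exceeds NCPU or any swap-in/swap-out activity occurred.
vmstat_flags() {
    awk -v ncpu="$1" '
        $1 == "r" {                          # header row: map names to positions
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("r" in col) && $1 ~ /^[0-9]+$/ {    # data row
            n++
            msg = ""
            if ($col["r"] > ncpu) msg = msg " run-queue>" ncpu
            if ($col["si"] > 0 || $col["so"] > 0) msg = msg " swapping"
            if (msg != "") print "sample " n ":" msg
        }
    '
}
```

For example, vmstat 1 10 | vmstat_flags "$(nproc)" prints only the samples worth a closer look. Keep in mind that the first vmstat sample reports averages since boot.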
/proc/meminfo
The raw data behind free and vmstat:
cat /proc/meminfo
# MemTotal: 32912320 kB
# MemFree: 8612340 kB
# MemAvailable: 17945678 kB
# Buffers: 234567 kB
# Cached: 9123456 kB
# SwapCached: 0 kB
# ...
Parse specific values for alerting:
# Memory usage percentage
awk '/MemTotal/{t=$2} /MemAvailable/{a=$2} END{printf "%.1f%%\n", (1-a/t)*100}' /proc/meminfo
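To turn that one-liner into something a cron job or health check can act on, the percentage can be compared against a threshold and reflected in the exit status. A minimal sketch, assuming a hypothetical function name mem_check and the 10% rule of thumb for available memory mentioned earlier:

```shell
# mem_check [MEMINFO_FILE] [MIN_AVAILABLE_PCT]
# Hypothetical helper: print the available-memory percentage and return 1
# when it falls below the minimum (default 10), 2 if MemTotal is missing.
mem_check() {
    awk -v min="${2:-10}" '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END {
            if (total <= 0) { print "MemTotal not found"; exit 2 }
            pct = 100 * avail / total
            printf "available=%.1f%%\n", pct
            if (pct < min) exit 1
        }
    ' "${1:-/proc/meminfo}"
}
```

Called as mem_check || echo "memory low", it reads /proc/meminfo by default; passing a file path makes it easy to test against saved snapshots.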
Disk I/O Monitoring
iostat
# Extended I/O stats, updated every 2 seconds
iostat -x 2
# Device r/s w/s rkB/s wkB/s await svctm %util
# sda 45.2 12.1 567.8 123.4 2.3 0.15 5.2%
Key metrics:
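- r/s, w/s: read/write operations per second.
- await: average I/O time (queue + service) in milliseconds. Sustained values above roughly 10ms suggest a slow or overloaded device. Newer sysstat versions split this into r_await and w_await.
- %util: percentage of time the device was busy. Values near 100% mean saturation on a single disk, although devices that serve requests in parallel (SSDs, RAID arrays) may still have headroom.
- svctm: average service time (how long the device takes to process a request). Very short (sub-ms) for SSDs. The field is deprecated and has been removed from recent sysstat releases, so do not rely on it.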
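Because the column layout of iostat -x differs between sysstat versions, a script that watches %util is safer locating the column by its header name. The following sketch does that; the function name busy_disks and the default threshold of 80 are illustrative assumptions.

```shell
# busy_disks [THRESHOLD]
# Hypothetical helper: read `iostat -x` output on stdin and print devices
# whose %util is at or above THRESHOLD (default 80).
busy_disks() {
    awk -v limit="${1:-80}" '
        NF == 0 { util = 0; next }           # blank line ends a device table
        $1 ~ /^Device/ {                     # header row: find the %util column
            util = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") util = i
            next
        }
        util && NF >= util && $util + 0 >= limit {
            printf "%s %.1f\n", $1, $util
        }
    '
}
```

For instance, iostat -x 2 5 | busy_disks 90 lists only devices that spent at least 90% of an interval busy.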
iotop
Monitor disk I/O per process in real time:
# Requires root
sudo iotop -o
# Total DISK READ: 45.67 M/s | Total DISK WRITE: 12.34 M/s
# PID PRIO DISK READ DISK WRITE COMMAND
# 1234 be/4 45.23 M/s 0.00 B/s nginx
# 5678 be/4 0.00 B/s 12.34 M/s postgres
The -o flag shows only processes actively doing I/O. Use -P to show only processes instead of individual threads.
Network Monitoring
nload
Show bandwidth usage per interface with real-time graphs:
nload eth0
nload displays incoming and outgoing traffic in a split-panel format with a running graph. Use the left/right arrow keys to switch interfaces.
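On a host where nload is not installed, the same per-interface rates can be approximated from the kernel's byte counters under /sys/class/net. This is a rough sketch, not how nload itself is implemented; the function name iface_rate and the SYSFS_NET override variable are assumptions made for illustration.

```shell
# iface_rate IFACE [SECONDS]
# Hypothetical helper: sample the rx/tx byte counters twice and print the
# average rate in bytes per second. SECONDS must be an integer >= 1.
iface_rate() {
    dir="${SYSFS_NET:-/sys/class/net}/$1/statistics"
    secs="${2:-1}"
    rx0=$(cat "$dir/rx_bytes") && tx0=$(cat "$dir/tx_bytes") || return 1
    sleep "$secs"
    rx1=$(cat "$dir/rx_bytes") && tx1=$(cat "$dir/tx_bytes") || return 1
    echo "rx=$(( (rx1 - rx0) / secs ))B/s tx=$(( (tx1 - tx0) / secs ))B/s"
}
```

For example, iface_rate eth0 5 averages over five seconds. The counters are cumulative since the interface came up, which is why two samples are needed.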
iftop
Show bandwidth per connection:
sudo iftop -i eth0
iftop lists traffic per pair of hosts, which is useful for identifying which remote hosts are consuming bandwidth. Press T to show cumulative totals, < to sort by source, p to toggle port display, and t to cycle through display modes.
nethogs
Show bandwidth per process:
sudo nethogs eth0
# PID USER DEVICE SENT RECEIVED COMMAND
# 1234 nginx eth0 1.2Mbps 3.4Mbps nginx: worker
# 5678 postgres eth0 0.5Mbps 2.1Mbps postgres
nethogs is the most actionable tool — it tells you which process is consuming bandwidth. Use it to identify a runaway download or an unexpected data transfer.
SAR (System Activity Reporter)
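The sysstat package collects and reports system activity data. Rather than a long-running daemon, it relies on a scheduled job (a cron entry or systemd timer, depending on the distribution) that runs the sa1 collector every 10 minutes by default. On Debian and Ubuntu, collection is switched on in /etc/default/sysstat.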
Installation and Setup
sudo apt install sysstat
# Enable data collection
sudo systemctl enable sysstat
sudo systemctl start sysstat
# Configuration in /etc/default/sysstat
ENABLED="true"
Historical Reports
# CPU report for today
sar -u
# Memory report for a specific date
sar -r -f /var/log/sysstat/sa25
# CPU report between specific hours
sar -u -s 08:00 -e 12:00
# Disk I/O (device-level)
sar -d 1 5
# Network interfaces
sar -n DEV 1
# Context switches and processes
sar -w 1 5
# Paging statistics
sar -B 1 5
Report Analysis
Key patterns to watch in SAR reports:
# CPU saturation: consistently high %user + %system (> 80%)
02:00:01 PM CPU %user %nice %system %iowait %steal %idle
02:10:01 PM all 75.2 0.0 12.1 0.5 0.0 12.2
# Memory pressure: high swap usage
02:00:01 PM kbmemfree kbmemused %memused kbbuffers kbcached kbswpused %swpused
02:10:01 PM 123456 31687690 96.1 234567 8912345 123456 6.0
High %iowait points to disk bottlenecks. High swap usage points to insufficient RAM.
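Scanning a day of sar output by eye gets tedious, so the CPU saturation pattern can be filtered out mechanically. The sketch below is one way to do it, assuming a hypothetical function name sar_cpu_hot and the 80% threshold used above; it finds the %user and %system columns from the header row so that 12-hour and 24-hour timestamp formats both work.

```shell
# sar_cpu_hot [THRESHOLD]
# Hypothetical helper: read `sar -u` output on stdin and print the intervals
# where %user + %system exceeds THRESHOLD (default 80).
sar_cpu_hot() {
    awk -v limit="${1:-80}" '
        /%user/ {                            # header row: locate columns by name
            for (i = 1; i <= NF; i++) {
                if ($i == "%user")   u = i
                if ($i == "%system") s = i
            }
            next
        }
        u && s && NF >= s && $1 != "Average:" {
            busy = $u + $s
            if (busy > limit) printf "%s busy=%.1f%%\n", $1, busy
        }
    '
}
```

Usage would look like sar -u -s 08:00 -e 12:00 | sar_cpu_hot 80, which prints one line per saturated interval and nothing when the morning was quiet.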
Monitoring Stack: Prometheus + Grafana
For a production environment, Prometheus and Grafana provide long-term metrics storage, flexible querying, and rich dashboards.
Docker Compose Setup
# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
  # Service name must match the scrape target in prometheus.yml (node-exporter:9100)
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
volumes:
  prometheus_data:
  grafana_data:
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'myservice'
    static_configs:
      - targets: ['myservice:8080']
Deploy the stack:
docker-compose up -d
Open Grafana at http://localhost:3000 (admin/admin), add Prometheus as a data source at http://prometheus:9090, and you have a full monitoring stack.
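Before opening the browser, it can help to confirm that each service answers on its published port. The sketch below polls the health endpoints of Prometheus (/-/healthy) and Grafana (/api/health) plus the node_exporter metrics page; the function name stack_health is hypothetical, and the URLs assume the port mappings from the Compose file above.

```shell
# stack_health
# Hypothetical helper: probe each service once and return non-zero if any
# of them fails to answer with a successful HTTP status.
stack_health() {
    rc=0
    for url in \
        http://localhost:9090/-/healthy \
        http://localhost:3000/api/health \
        http://localhost:9100/metrics
    do
        if curl -fsS -o /dev/null "$url"; then
            echo "ok   $url"
        else
            echo "FAIL $url"
            rc=1
        fi
    done
    return "$rc"
}
```

Containers can take a few seconds to start, so a failure right after docker-compose up -d is not necessarily a real problem; rerun the check before digging into logs.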
node_exporter Metrics
node_exporter exposes hundreds of system metrics. Key metrics for alerting and dashboards:
| Metric | Type | What It Measures |
|---|---|---|
| node_cpu_seconds_total | Counter | CPU time in each mode (user, system, idle, iowait) |
| node_memory_MemTotal_bytes | Gauge | Total physical memory |
| node_memory_MemAvailable_bytes | Gauge | Memory available for allocation |
| node_disk_io_time_seconds_total | Counter | Time spent doing I/O |
| node_disk_read_bytes_total | Counter | Total bytes read from disk |
| node_network_receive_bytes_total | Counter | Total bytes received over network |
| node_filesystem_avail_bytes | Gauge | Available disk space per mount |
| node_load1 / node_load5 / node_load15 | Gauge | System load averages |
CPU usage percentage query:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Memory usage percentage:
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Disk space remaining:
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100
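These expressions can also be run from a shell through the Prometheus HTTP API, which is handy for quick checks and scripts. A minimal sketch, assuming a hypothetical wrapper name prom_query and a PROM_URL variable that defaults to the port published by the Compose file:

```shell
# prom_query EXPR
# Hypothetical helper: send an instant query to /api/v1/query and print the
# raw JSON response. curl URL-encodes the expression for us.
prom_query() {
    curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
        --data-urlencode "query=$1"
}
```

For example, prom_query '(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100' returns the memory usage percentage as JSON, which can be piped into a JSON processor if one is installed.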
Alerting
Prometheus Alert Rules
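Create alerts.yml next to prometheus.yml, make it visible inside the Prometheus container (for example by adding ./alerts.yml:/etc/prometheus/alerts.yml to the Compose volumes), and reference it from prometheus.yml: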
# prometheus.yml
rule_files:
  - "alerts.yml"
# alerts.yml
groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for 5+ minutes"
      - alert: CriticalCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU usage on {{ $labels.instance }}"
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
      - alert: HighDiskIOWait
        # Averaged across CPUs so the alert fires once per host, not once per core
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High I/O wait on {{ $labels.instance }}"
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
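A YAML or PromQL mistake in these files stops Prometheus from loading them, so it is worth validating before a reload. Prometheus ships a promtool binary with check config and check rules subcommands; the wrapper below is only a sketch, and the function name validate_prometheus and its default file names are assumptions.

```shell
# validate_prometheus [CONFIG_FILE] [RULES_FILE]
# Hypothetical helper: validate the main config first, then the rule file.
# Returns non-zero as soon as either check fails.
validate_prometheus() {
    promtool check config "${1:-prometheus.yml}" &&
    promtool check rules "${2:-alerts.yml}"
}
```

If promtool is not installed on the host, it may be possible to run the same commands inside the Prometheus container instead, since the official image is generally understood to include it.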
Alertmanager Configuration
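Prometheus sends firing alerts to Alertmanager, which handles grouping, inhibition, and notifications (email, Slack, PagerDuty). Note that the Compose file above does not include an Alertmanager service, and Prometheus only forwards alerts once an alerting section in prometheus.yml points at the Alertmanager address. A basic Alertmanager configuration looks like this: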
# alertmanager.yml
route:
  receiver: 'slack'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
Grafana Dashboards
Dashboard Recommendations
| Dashboard | ID (Grafana.com) | Description |
|---|---|---|
| Node Exporter Full | 1860 | Comprehensive system metrics |
| Node Exporter Server Metrics | 16098 | Simplified server overview |
| Linux Hosts Metrics | 10180 | Multi-host view |
| 1 Node Dashboard | 11076 | Single node deep dive |
Import a dashboard from Grafana.com:
- Log in to Grafana (localhost:3000).
- Go to Create → Import.
- Enter the dashboard ID (e.g., 1860) and click Load.
- Select the Prometheus data source.
Custom Dashboard Queries
CPU usage per core:
100 - (avg by (cpu, instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Network traffic per interface:
rate(node_network_receive_bytes_total{instance="node-exporter:9100"}[5m])
Disk read/write throughput:
rate(node_disk_read_bytes_total{instance="node-exporter:9100"}[5m])
rate(node_disk_written_bytes_total{instance="node-exporter:9100"}[5m])
Conclusion
Monitoring is crucial for system reliability. Start with the command-line tools (htop, iostat, iftop, sar) for daily checks and reactive debugging. Deploy Prometheus + Grafana with node_exporter for proactive alerting and historical analysis. The combination covers you from “what is happening right now” to “what happened last week.”