Linux System Monitoring Complete Guide 2026

Published: March 6, 2026 | Updated: May 25, 2026 | Larry Qu | 9 min read

Introduction

Monitoring is essential for maintaining healthy Linux systems. This guide covers monitoring tools and techniques — from quick command-line checks with top and htop to full production stacks with Prometheus and Grafana.

Command Line Tools

Essential Commands

# System resource usage
top              # Interactive process viewer
htop             # Enhanced top (colorful)
btop             # Modern top (graphs, mouse support)
atop             # Advanced top with persistent logging
glances          # Cross-platform monitoring

# CPU
mpstat -P ALL 1  # Per-CPU stats
sar -u 1         # CPU utilization

# Memory
free -h          # Memory usage
vmstat 1          # Virtual memory stats

# Disk
df -h            # Disk usage
iostat -x 1      # I/O statistics
du -sh *         # Directory sizes

# Network
iftop            # Network bandwidth per connection
nload            # Network bandwidth per interface
nethogs          # Bandwidth per process
ss -tuln         # Listening ports

top, htop, and btop Comparison

All three tools show running processes, but they differ in capability:

| Feature | top | htop | btop |
| --- | --- | --- | --- |
| Default install | Yes (pre-installed) | No (apt install htop) | No (apt install btop) |
| Interface | Minimal, monochrome by default | Colored, scrollable | Graph-heavy terminal UI |
| Mouse support | No | Yes | Yes |
| Tree view | Yes (V, forest view) | Yes (F5) | Yes |
| Sorting | Interactive (Shift+F) | Click column headers | Click column headers |
| Per-process IO | No | Yes | Yes |
| CPU frequency | No | Yes (configurable) | Yes (real-time graph) |
| Network graphs | No | No | Built-in |
| Kill processes | k | F9 | Click + confirm |
| Config file | ~/.toprc or ~/.config/procps/toprc (written with W) | ~/.config/htop/htoprc | ~/.config/btop/btop.conf |

For quick diagnostics on any server, top suffices. For daily interactive use, htop is the sweet spot. For monitoring dashboards in a terminal, btop provides the richest visual experience.

htop Customization

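# Install htop
sudo apt install htop

# Custom htop config: ~/.config/htop/htoprc
# The file uses key=value lines, not YAML. htop rewrites it on exit,
# so edit it while htop is closed, or change settings with F2 instead.
show_cpu_usage=1
show_cpu_frequency=1
show_cpu_temperature=1
detailed_cpu_time=1

# Memory and swap are header meters (F2 > Meters), not a key in this file.

# Process columns by numeric field ID (IDs as used in htop's default htoprc):
# PID USER PRIORITY NICE M_SIZE M_RESIDENT STATE CPU MEM TIME Command
fields=0 48 17 18 38 39 2 46 47 49 1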

btop Quick Start

sudo apt install btop
btop

Press 1 to show or hide the CPU box (2, 3, and 4 do the same for the memory, network, and process boxes), m to open the main menu, and ? for the full key bindings list. btop reads from /proc and /sys, so no kernel modules are needed.

CPU Monitoring

mpstat Per-CPU Breakdown

# Show per-CPU utilization every 2 seconds
mpstat -P ALL 2

# Output:
# 02:30:01 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %idle
# 02:30:03 PM  all    12.5    0.0     3.1    0.0     0.0     0.0     0.0    84.4
# 02:30:03 PM    0    15.0    0.0     5.0    0.0     0.0     0.0     0.0    80.0
# 02:30:03 PM    1    10.0    0.0     1.2    0.0     0.0     0.0     0.0    88.8

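High %iowait means CPUs sat idle while I/O requests were outstanding; check which processes are reading or writing heavily (for example with iotop). High %sys indicates kernel activity (system calls, drivers). High %steal (in VMs) means the hypervisor was running other guests while this VM had work ready, which usually points to an overcommitted host.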
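
The percentages mpstat prints are deltas of the cumulative tick counters in /proc/stat. The following is a minimal Python sketch of that arithmetic, assuming the usual field order on the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal). The function name cpu_percentages and the two sample lines are made up for illustration; on a real host you would read the first line of /proc/stat twice, a second or two apart.

```python
FIELDS = ("usr", "nice", "sys", "idle", "iowait", "irq", "soft", "steal")

def cpu_percentages(before: str, after: str) -> dict[str, float]:
    """Turn two samples of a /proc/stat 'cpu' line into per-mode percentages."""
    a = [int(x) for x in before.split()[1:9]]
    b = [int(x) for x in after.split()[1:9]]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: round(100.0 * d / total, 1) for name, d in zip(FIELDS, deltas)}

# Hypothetical samples (not real measurements), taken a couple of seconds apart.
t0 = "cpu 1000 0 300 8000 100 0 0 0 0 0"
t1 = "cpu 1250 0 362 9688 100 0 0 0 0 0"
result = cpu_percentages(t0, t1)
print(result)
```

Because every mode is divided by the same total, the values always sum to about 100, which is a quick sanity check for any hand-rolled collector.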

Memory Monitoring

free and vmstat

free -h
#               total   used   free   shared  buff/cache  available
# Mem:           31G    12G    8.2G    1.2G      11G        17G
# Swap:          2.0G   0.0G   2.0G

available is the key metric — it estimates how much memory is available for new processes without swapping. It includes free memory plus reclaimable cache. If available drops below 10% of total, consider adding RAM or reducing the workload.

vmstat 1
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
#  r  b   swpd    free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
#  2  0      0 8612340 234567 9123456    0    0    10    55  500  800 12  3 84  1  0
# (memory columns are in KiB by default)

Columns to watch:

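  • r: runnable processes (running or waiting for a CPU). If r stays above the number of CPUs, the system is CPU-saturated.
  • b: processes in uninterruptible sleep, usually blocked on I/O.
  • si/so: memory swapped in from and out to disk per second. Sustained non-zero values mean memory pressure.
  • wa: percentage of CPU time spent idle while waiting for I/O.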
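
These checks are easy to automate. The sketch below is one possible way to do it in Python, assuming vmstat's default column order (r b swpd free buff cache si so bi bo in cs us sy id wa st). The function name check_vmstat, the queue_factor parameter (runnable tasks tolerated per CPU, defaulting to 1), and the 20% iowait limit are assumptions for illustration, not values vmstat defines.

```python
def check_vmstat(row: str, cpu_count: int, queue_factor: float = 1.0,
                 wa_limit: int = 20) -> list[str]:
    """Return warnings for one vmstat data row in the default column order."""
    cols = row.split()
    r = int(cols[0])                       # runnable processes
    si, so = int(cols[6]), int(cols[7])    # swap-in / swap-out
    wa = int(cols[15])                     # % CPU time waiting for I/O
    warnings = []
    limit = cpu_count * queue_factor
    if r > limit:
        warnings.append(f"run queue {r} above limit {limit:g}")
    if si > 0 or so > 0:
        warnings.append(f"swapping: si={si} so={so}")
    if wa > wa_limit:
        warnings.append(f"iowait {wa}% above {wa_limit}%")
    return warnings

# Hypothetical rows: one from a struggling 4-CPU host, one from a healthy one.
busy = check_vmstat("5 1 0 8612340 234567 9123456 0 12 10 55 500 800 12 3 60 25 0", cpu_count=4)
quiet = check_vmstat("2 0 0 8612340 234567 9123456 0 0 10 55 500 800 12 3 84 1 0", cpu_count=4)
print(busy)
print(quiet)
```

A single bad row is rarely meaningful; in practice you would only act when the same warning repeats over several consecutive samples.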

/proc/meminfo

The raw data behind free and vmstat:

cat /proc/meminfo
# MemTotal:       32912320 kB
# MemFree:         8612340 kB
# MemAvailable:   17945678 kB
# Buffers:          234567 kB
# Cached:          9123456 kB
# SwapCached:            0 kB
# ...

Parse specific values for alerting:

# Memory usage percentage
awk '/MemTotal/{t=$2} /MemAvailable/{a=$2} END{printf "%.1f%%\n", (1-a/t)*100}' /proc/meminfo
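
The same calculation in Python is handy when the value feeds a script rather than a terminal. This is a minimal sketch, assuming the MemTotal and MemAvailable keys shown above; the function name memory_usage_percent is hypothetical and the sample text reuses the figures from the listing. On a real host, pass open("/proc/meminfo").read() instead.

```python
def memory_usage_percent(meminfo_text: str) -> float:
    """Compute (1 - MemAvailable / MemTotal) * 100 from /proc/meminfo text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])  # first token is the number (kB)
    return (1 - values["MemAvailable"] / values["MemTotal"]) * 100

SAMPLE = """\
MemTotal:       32912320 kB
MemFree:         8612340 kB
MemAvailable:   17945678 kB
"""

used = memory_usage_percent(SAMPLE)
print(f"{used:.1f}%")
```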

Disk I/O Monitoring

iostat

# Extended I/O stats, updated every 2 seconds
iostat -x 2

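# Device  r/s   w/s  rkB/s  wkB/s  await  svctm  %util
# sda    45.2  12.1  567.8  123.4   2.3   0.15    5.2

Key metrics:

  • r/s / w/s: read/write operations per second.
  • await: average time per request (queue + service) in milliseconds. Recent sysstat releases split it into r_await and w_await. Sustained values above 10 ms are slow for an SSD; spinning disks sit closer to that range under normal load.
  • %util: percentage of time the device had at least one request in flight. Near 100% means saturation for a single spinning disk, but devices that serve requests in parallel (SSDs, NVMe, RAID arrays) can have headroom left at 100%.
  • svctm: average service time. The iostat man page warns that this field is unreliable, and recent sysstat releases no longer print it; rely on await instead.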
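
To make these definitions concrete, here is a rough Python sketch of how the numbers can be derived from two samples of a /proc/diskstats line. It assumes the kernel's documented field order after the device name (reads completed, reads merged, sectors read, ms reading, writes completed, writes merged, sectors written, ms writing, I/Os in progress, ms doing I/O, weighted ms). The function name disk_metrics and both sample lines are invented for illustration.

```python
def disk_metrics(before: str, after: str, interval_s: float) -> dict[str, float]:
    """Derive iostat-style rates from two /proc/diskstats samples of one device."""
    a = [int(x) for x in before.split()[3:14]]
    b = [int(x) for x in after.split()[3:14]]
    d = [y - x for x, y in zip(a, b)]
    reads, read_ms = d[0], d[3]
    writes, write_ms = d[4], d[7]
    busy_ms = d[9]                      # time with at least one request in flight
    ios = reads + writes
    return {
        "r/s": reads / interval_s,
        "w/s": writes / interval_s,
        "await": (read_ms + write_ms) / ios if ios else 0.0,
        "%util": 100.0 * busy_ms / (interval_s * 1000),
    }

# Hypothetical samples of the sda line, taken 2 seconds apart.
s0 = "8 0 sda 1000 0 80000 2000 500 0 40000 1500 0 3000 3500"
s1 = "8 0 sda 1090 0 87200 2180 524 0 41920 1582 0 3104 3762"
metrics = disk_metrics(s0, s1, interval_s=2.0)
print(metrics)
```

Note that await divides total wait time by completed requests, so it rises with queueing even when the device itself is fast, whereas %util only says how often the device was busy at all.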

iotop

Monitor disk I/O per process in real time:

# Requires root
sudo iotop -o

# Total DISK READ: 45.67 M/s | Total DISK WRITE: 12.34 M/s
#   PID  PRIO  DISK READ  DISK WRITE  COMMAND
#  1234 be/4   45.23 M/s   0.00 B/s   nginx
#  5678 be/4    0.00 B/s  12.34 M/s   postgres

The -o flag shows only processes or threads that are actually doing I/O. By default iotop lists individual threads; add -P to show whole processes instead.

Network Monitoring

nload

Show bandwidth usage per interface with real-time graphs:

nload eth0

nload displays incoming and outgoing traffic in a split-panel format with a running graph. Use the left/right arrow keys to switch interfaces.

iftop

Show bandwidth per connection:

sudo iftop -i eth0

iftop lists traffic per pair of hosts, which is useful for identifying which remote hosts are consuming bandwidth. Press p to show port numbers, T to show cumulative totals, < or > to sort by source or destination, and t to cycle through display modes.

nethogs

Show bandwidth per process:

sudo nethogs eth0

# PID   USER      DEVICE   SENT     RECEIVED  COMMAND
# 1234  nginx     eth0     1.2Mbps  3.4Mbps   nginx: worker
# 5678  postgres  eth0     0.5Mbps  2.1Mbps   postgres

nethogs is the most actionable tool — it tells you which process is consuming bandwidth. Use it to identify a runaway download or an unexpected data transfer.

SAR (System Activity Reporter)

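The sysstat package collects and reports system activity data. It is not a long-running daemon: a cron job (/etc/cron.d/sysstat) or systemd timer (sysstat-collect.timer) runs the sa1 collector every 10 minutes, and you change the interval there. On Debian and Ubuntu, /etc/default/sysstat only switches collection on or off.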

Installation and Setup

sudo apt install sysstat

# Debian/Ubuntu: switch collection on in /etc/default/sysstat
sudo sed -i 's/^ENABLED="false"/ENABLED="true"/' /etc/default/sysstat

# Enable and start the service
sudo systemctl enable --now sysstat

Historical Reports

# CPU report for today
sar -u

# Memory report for a specific day (sa25 = the 25th of the month)
sar -r -f /var/log/sysstat/sa25

# CPU report between specific hours
sar -u -s 08:00:00 -e 12:00:00

# Disk I/O (device-level)
sar -d 1 5

# Network interfaces
sar -n DEV 1

# Context switches and processes
sar -w 1 5

# Paging statistics
sar -B 1 5

Report Analysis

Key patterns to watch in SAR reports:

# CPU saturation: consistently high %user + %system (> 80%)
02:00:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
02:10:01 PM     all     75.2       0.0      12.1       0.5       0.0      12.2

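# Memory pressure: high memory and swap usage
# (memory columns come from sar -r, swap columns from sar -S; combined here for brevity)
02:00:01 PM kbmemfree kbmemused  %memused  kbbuffers  kbcached  kbswpused  %swpused
02:10:01 PM   123456   31687690     96.1     234567    8912345    123456        6.0

High %iowait points to disk bottlenecks. High swap usage points to insufficient RAM, especially when sar -W also shows sustained swap-in/swap-out activity.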
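
Scanning a day of sar -u output for saturated intervals can be scripted. The Python sketch below assumes the column layout shown above (CPU, %user, %nice, %system, ...) and the 80% rule of thumb from the comment; the function name saturated_intervals and the sample rows are illustrative only. Running sar with LC_ALL=C keeps the number format predictable.

```python
def saturated_intervals(sar_text: str, limit: float = 80.0) -> list[tuple[str, float]]:
    """Return (timestamp, %user + %system) for 'all' rows above the limit."""
    hits = []
    for line in sar_text.splitlines():
        cols = line.split()
        # Skip headers, blank lines, and the trailing Average: row.
        if "all" not in cols or line.startswith("Average"):
            continue
        i = cols.index("all")
        busy = float(cols[i + 1]) + float(cols[i + 3])  # %user + %system
        if busy > limit:
            hits.append((" ".join(cols[:i]), round(busy, 1)))
    return hits

# Hypothetical report excerpt in the same layout as above.
REPORT = """\
02:00:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
02:10:01 PM     all     75.2       0.0      12.1       0.5       0.0      12.2
02:20:01 PM     all     20.0       0.0       5.0       0.5       0.0      74.5
Average:        all     47.6       0.0       8.6       0.5       0.0      43.3
"""

hits = saturated_intervals(REPORT)
print(hits)
```

Locating the "all" token instead of using fixed positions lets the same function handle both 12-hour timestamps (with AM/PM) and 24-hour ones.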

Monitoring Stack: Prometheus + Grafana

For a production environment, Prometheus and Grafana provide long-term metrics storage, flexible querying, and rich dashboards.

Docker Compose Setup

# docker-compose.yml
# (the top-level "version" key is obsolete in Compose v2 and can be omitted)
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
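      # Default credentials for local testing only; change before exposing Grafana.
      # The pie chart is a built-in visualization in current Grafana, so no plugin is needed.
      - GF_SECURITY_ADMIN_PASSWORD=admin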

  node-exporter:   # service name must match the 'node-exporter:9100' target in prometheus.yml
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

volumes:
  prometheus_data:
  grafana_data:

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # Placeholder: replace with your own application's metrics endpoint, or remove
  - job_name: 'myservice'
    static_configs:
      - targets: ['myservice:8080']

Deploy the stack:

docker compose up -d   # Compose v2; use docker-compose on older installs

Open Grafana at http://localhost:3000 (admin/admin), add Prometheus as a data source at http://prometheus:9090, and you have a full monitoring stack.

node_exporter Metrics

node_exporter exposes hundreds of system metrics. Key metrics for alerting and dashboards:

| Metric | Type | What It Measures |
| --- | --- | --- |
| node_cpu_seconds_total | Counter | CPU time in each mode (user, system, idle, iowait) |
| node_memory_MemTotal_bytes | Gauge | Total physical memory |
| node_memory_MemAvailable_bytes | Gauge | Memory available for allocation |
| node_disk_io_time_seconds_total | Counter | Time spent doing I/O |
| node_disk_read_bytes_total | Counter | Total bytes read from disk |
| node_network_receive_bytes_total | Counter | Total bytes received over network |
| node_filesystem_avail_bytes | Gauge | Available disk space per mount |
| node_load1 / node_load5 / node_load15 | Gauge | System load averages |

CPU usage percentage query:

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory usage percentage:

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

Disk space remaining:

node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100
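
Outside Grafana, the same expressions can be evaluated through Prometheus's HTTP API (/api/v1/query). The Python sketch below uses only the standard library and assumes Prometheus is reachable at http://localhost:9090 as in the Compose file. The helper names query_url, parse_vector, and fetch are hypothetical, and the sample response is hand-written in the shape of an instant-vector result, not captured from a real server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

MEMORY_USED = "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"

def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"

def parse_vector(payload: dict) -> dict[str, float]:
    """Map each instance label to its sample value in an instant-vector result."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return {
        sample["metric"].get("instance", ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }

def fetch(base_url: str, promql: str) -> dict[str, float]:
    """Run the query against a live server (not called in this example)."""
    with urlopen(query_url(base_url, promql), timeout=10) as resp:
        return parse_vector(json.load(resp))

# Hand-written response in the API's documented shape, for illustration.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "node-exporter:9100"}, "value": [1767225600, "45.5"]},
        ],
    },
}
parsed = parse_vector(sample_response)
print(parsed)
```

With the stack running, fetch("http://localhost:9090", MEMORY_USED) would return one entry per scraped instance. Sample values arrive as strings in the JSON, which is why parse_vector converts them with float().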

Alerting

Prometheus Alert Rules

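Create alerts.yml next to prometheus.yml and reference it with rule_files. With the Docker Compose setup above, also mount it into the Prometheus container (for example ./alerts.yml:/etc/prometheus/alerts.yml), because relative rule_files paths are resolved against the directory that holds prometheus.yml:

# prometheus.yml
rule_files:
  - "alerts.yml"

# alerts.yml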
groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for 5+ minutes"

      - alert: CriticalCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU usage on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"

      - alert: HighDiskIOWait
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High I/O wait on {{ $labels.instance }}"

      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
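
The for: field in each rule delays firing until the expression has been true continuously for that long; until then the alert is pending. The Python sketch below is a simplified model of that behaviour, not Prometheus's actual implementation. The function name alert_states, the 15-second evaluation interval (taken from evaluation_interval above), and the sample values are assumptions for illustration.

```python
def alert_states(values: list[float], threshold: float, for_seconds: int,
                 interval_seconds: int = 15) -> list[str]:
    """Model inactive -> pending -> firing for a 'value > threshold' rule."""
    states = []
    active_since = None
    for step, value in enumerate(values):
        now = step * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now          # condition just became true
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None             # any false evaluation resets the timer
            states.append("inactive")
    return states

# Hypothetical CPU readings, one per evaluation, for a rule with "> 80" and "for: 30s".
states = alert_states([50, 90, 90, 90, 70, 90], threshold=80, for_seconds=30)
print(states)
```

The reset on a single false evaluation is why a short dip below the threshold sends the alert back to pending, and why longer for: values trade alert latency for fewer false alarms.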

Alertmanager Configuration

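Prometheus sends firing alerts to Alertmanager, which handles grouping, inhibition, and notifications (email, Slack, PagerDuty). The Compose file above does not include Alertmanager, so run it as an additional service and point Prometheus at it through the alerting.alertmanagers section of prometheus.yml. A minimal Alertmanager configuration: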

# alertmanager.yml
route:
  receiver: 'slack'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'

Grafana Dashboards

Dashboard Recommendations

| Dashboard | ID (Grafana.com) | Description |
| --- | --- | --- |
| Node Exporter Full | 1860 | Comprehensive system metrics |
| Node Exporter Server Metrics | 16098 | Simplified server overview |
| Linux Hosts Metrics | 10180 | Multi-host view |
| 1 Node Dashboard | 11076 | Single node deep dive |

Import a dashboard from Grafana.com:

  1. Log in to Grafana (localhost:3000).
  2. Go to Create > Import (Dashboards > New > Import in recent Grafana versions).
  3. Enter the dashboard ID (e.g., 1860) and click Load.
  4. Select the Prometheus data source.

Custom Dashboard Queries

CPU usage per core:

100 - (avg by (cpu, instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Network traffic per interface (the instance label is the scrape target from prometheus.yml):

rate(node_network_receive_bytes_total{instance="node-exporter:9100"}[5m])

Disk read/write throughput:

rate(node_disk_read_bytes_total{instance="node-exporter:9100"}[5m])
rate(node_disk_written_bytes_total{instance="node-exporter:9100"}[5m])

Conclusion

Monitoring is crucial for system reliability. Start with the command-line tools (htop, iostat, iftop, sar) for daily checks and reactive debugging. Deploy Prometheus + Grafana with node_exporter for proactive alerting and historical analysis. The combination covers you from “what is happening right now” to “what happened last week.”
