Introduction
Network performance monitoring (NPM) is the practice of measuring, analyzing, and optimizing network behavior — latency, throughput, packet loss, jitter, and availability. As applications move to the cloud, workforces become distributed, and infrastructure grows more complex, blind operation is no longer viable. Without monitoring, you discover issues only when users report them.
Modern NPM combines SNMP polling, flow analysis (NetFlow, sFlow, IPFIX), synthetic transactions, streaming telemetry (gNMI), and packet capture. The goal is proactive detection — finding and fixing issues before they affect users.
This guide covers core NPM metrics, the Prometheus monitoring stack (SNMP exporter, Blackbox exporter, Alertmanager), leading commercial and open-source tools, cloud monitoring strategies, and emerging trends like eBPF-based observability and AI-driven remediation.
Understanding Network Performance
Key Performance Metrics
Network performance is measured through several core metrics.
Latency measures the time for data to travel from source to destination. Low latency is essential for real-time applications. Latency is typically measured in milliseconds (ms).
Throughput measures the amount of data transmitted per unit of time. It’s typically expressed in bits per second (bps) or bytes per second. High throughput enables fast data transfer.
Packet loss measures the percentage of packets that fail to reach their destination. Even loss well below 1% can noticeably reduce TCP throughput and degrade real-time media.
Jitter measures variation in latency over time. High jitter disrupts real-time applications like VoIP and video conferencing.
Availability measures the percentage of time network resources are operational. High availability targets (99.9% or higher) require comprehensive monitoring.
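As a rough illustration of how these metrics relate to raw measurements, the sketch below derives latency, jitter, and loss from a list of probe round-trip times, and converts an availability target into a yearly downtime budget. The function names and sample values are hypothetical, and jitter is computed here as the mean absolute difference between consecutive replies, which is only one of several common definitions.

```python
from statistics import mean


def summarize_probes(rtts_ms):
    """Summarize probe results; a None entry marks a lost probe."""
    received = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(received)) / len(rtts_ms)
    # Jitter: mean absolute difference between consecutive received RTTs.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    return {
        "latency_ms": mean(received) if received else None,
        "jitter_ms": mean(deltas) if deltas else 0.0,
        "loss_pct": loss_pct,
    }


def downtime_minutes_per_year(availability_pct):
    """Minutes of allowed downtime per 365-day year for a given target."""
    minutes_per_year = 365 * 24 * 60
    return minutes_per_year * (1 - availability_pct / 100)


samples = [20.0, 22.0, None, 21.0, 25.0]  # hypothetical RTTs in ms
print(summarize_probes(samples))
print(round(downtime_minutes_per_year(99.9), 1))
```

A 99.9% target leaves a little under nine hours of downtime per year, which is why higher targets depend on fast detection.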
Application Performance Relationship
Network performance directly affects application performance. Understanding this relationship helps prioritize monitoring efforts.
Applications have different network requirements. File transfers need high throughput but tolerate latency. Database queries need low latency but modest bandwidth. Video conferencing needs both low latency and low jitter.
Effective monitoring correlates network metrics with application performance. This correlation helps identify whether issues originate in the network or application layers.
Monitoring Approaches
Active Monitoring
Active monitoring injects test traffic into the network to measure performance. Synthetic transactions simulate user activity without requiring actual user traffic.
Active monitoring advantages include: consistent measurement methodology, ability to test any path regardless of traffic, and testing before issues affect users.
Common active monitoring techniques include: ping tests for latency and availability, HTTP/S synthetic transactions for web application performance, and custom protocol tests for specific applications.
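A synthetic check can be as small as a timed TCP connection attempt. The following is a minimal sketch using only the Python standard library; `tcp_probe` is a hypothetical helper, not part of any monitoring product, and a production probe would add retries, scheduling, and metric export.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Return (success, elapsed_seconds) for one TCP connection attempt."""
    start = time.perf_counter()
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        ok = False
    return ok, time.perf_counter() - start
```

Calling `tcp_probe("example.com", 443)` would return a success flag and the time spent on name resolution plus the handshake, which approximates session establishment time as a user would experience it.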
Passive Monitoring
Passive monitoring observes actual network traffic without injecting additional data. It provides visibility into real user activity.
Passive monitoring advantages include: no additional network load, visibility into actual traffic patterns, and detection of issues that synthetic tests miss.
Passive monitoring techniques include: flow analysis (NetFlow, sFlow, IPFIX), packet capture and analysis, and SNMP monitoring.
Hybrid Approaches
Most comprehensive monitoring strategies combine active and passive approaches. Active monitoring provides consistent measurement and early warning. Passive monitoring offers real traffic visibility.
The combination provides the most complete picture of network performance.
Key Monitoring Technologies
SNMP and Prometheus
The Prometheus stack has become the dominant open-source monitoring platform for cloud-native environments, and it also covers traditional network devices through the SNMP exporter and Blackbox exporter.
flowchart LR
subgraph Devices["Network Devices"]
R1["Router"]
SW1["Switch"]
FW["Firewall"]
end
subgraph Exporters["Prometheus Exporters"]
SNMP["SNMP Exporter<br/>:9116"]
BB["Blackbox Exporter<br/>:9115"]
end
subgraph Stack["Monitoring Stack"]
P["Prometheus<br/>:9090"]
AM["Alertmanager"]
G["Grafana"]
end
R1 -->|"SNMP (UDP 161)"| SNMP
SW1 -->|"SNMP"| SNMP
FW -->|"SNMP"| SNMP
BB -->|"ICMP/TCP probes"| R1
BB -->|"ICMP/TCP probes"| SW1
SNMP -->|"scrape :9116/snmp"| P
BB -->|"scrape :9115/probe"| P
P --> AM
P --> G
SNMP Exporter Setup
The Prometheus SNMP exporter translates SNMP OID walks into Prometheus metrics. Deploy it alongside Prometheus; which OIDs it collects is defined by the modules in snmp.yml, which is typically produced with the exporter's generator tool:
# docker-compose.yml
services:
  snmp-exporter:
    image: prom/snmp-exporter:v0.25.0
    ports:
      - "9116:9116"
    volumes:
      - ./snmp.yml:/etc/snmp_exporter/snmp.yml
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  blackbox-exporter:
    image: prom/blackbox-exporter:latest
    ports:
      - "9115:9115"
    volumes:
      # The image loads /etc/blackbox_exporter/config.yml by default
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml
Configure Prometheus to scrape SNMP targets and route them through the exporter:
# prometheus.yml
scrape_configs:
  - job_name: "snmp-core-switches"
    metrics_path: /snmp
    params:
      module: [if_mib]
      auth: [public_v2]  # must match an auth entry in snmp.yml
    static_configs:
      - targets: ["192.168.1.1", "192.168.1.2"]
        labels:
          site: datacenter-east
    relabel_configs:
      # Pass the device address as ?target= and scrape the exporter instead
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116
  - job_name: "network-connectivity"
    metrics_path: /probe
    params:
      module: [icmp_probe]  # must match a module defined in blackbox.yml
    static_configs:
      - targets:
          - "8.8.8.8"
          - "1.1.1.1"
          - "192.168.1.1"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
Key PromQL Metrics for Network
Once data flows into Prometheus, query interface health and performance:
# Inbound interface throughput (bits/sec)
rate(ifHCInOctets[5m]) * 8
# Inbound interface utilization percentage (ifHighSpeed is reported in Mbit/s)
rate(ifHCInOctets[5m]) * 8 / ifHighSpeed / 1000000 * 100
# Inbound interface errors per second
rate(ifInErrors[5m])
# Device reachability (1 = up)
probe_success{job="network-connectivity"}
# ICMP round-trip time (seconds); probe_duration_seconds would also include resolution and setup time
probe_icmp_duration_seconds{job="network-connectivity", phase="rtt"}
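The utilization query is easier to sanity-check with the arithmetic written out. The sketch below reproduces the same calculation from two hypothetical counter readings; `utilization_pct` is an illustrative helper, and it ignores counter resets, which Prometheus `rate()` handles for you.

```python
def utilization_pct(octets_start, octets_end, interval_s, if_high_speed_mbps):
    """Percent utilization from two ifHCInOctets readings interval_s apart."""
    bits_per_second = (octets_end - octets_start) * 8 / interval_s
    # ifHighSpeed is reported in units of 1,000,000 bits per second.
    capacity_bps = if_high_speed_mbps * 1_000_000
    return bits_per_second / capacity_bps * 100


# Hypothetical reading: 3.75 GB received in 300 s on a 1 Gbit/s port.
print(utilization_pct(0, 3_750_000_000, 300, 1000))
```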
Alert on critical conditions:
# network-alerts.yml
groups:
  - name: network
    rules:
      - alert: InterfaceDown
        # Skip interfaces that are administratively shut down
        expr: ifOperStatus == 2 and on (instance, ifIndex) ifAdminStatus == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Interface {{ $labels.ifDescr }} on {{ $labels.instance }} is down"
      - alert: HighUtilization
        expr: rate(ifHCInOctets[5m]) * 8 / ifHighSpeed / 1000000 * 100 > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.ifDescr }} at {{ $value | printf \"%.1f\" }}% utilization"
      - alert: HostUnreachable
        expr: probe_success{job="network-connectivity"} == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable via ICMP"
Flow Analysis
Flow analysis examines network traffic patterns without full packet capture. Protocols like NetFlow, sFlow, and IPFIX export flow records containing source, destination, volume, and timing information.
Flow analysis enables: traffic analysis and baselining, bandwidth utilization by application and user, and identification of top talkers and applications.
Flow data is more compact than packet captures, enabling longer retention periods.
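Identifying top talkers is essentially a group-and-sum over flow records. A minimal sketch, assuming each record has already been decoded into a dictionary with hypothetical `src`, `dst`, and `bytes` keys (real collectors expose many more fields):

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Sum bytes per (src, dst) pair and return the n largest pairs."""
    totals = Counter()
    for flow in flows:
        totals[(flow["src"], flow["dst"])] += flow["bytes"]
    return totals.most_common(n)


# Hypothetical decoded flow records.
flows = [
    {"src": "10.0.0.5", "dst": "10.0.1.9", "bytes": 1200},
    {"src": "10.0.0.7", "dst": "10.0.1.9", "bytes": 300},
    {"src": "10.0.0.5", "dst": "10.0.1.9", "bytes": 800},
]
print(top_talkers(flows))
```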
Packet Capture and Analysis
Packet capture records individual packets for detailed analysis. It’s essential for troubleshooting complex issues and understanding application behavior.
Full packet capture generates massive data volumes, so it is usually applied selectively rather than continuously. tcpdump, Wireshark, and specialized tools provide packet capture and analysis capabilities.
Common use cases include: troubleshooting application issues, security incident investigation, and protocol debugging.
NetFlow and IPFIX
NetFlow was developed by Cisco and became a de facto industry standard. IPFIX (RFC 7011) is the IETF-standardized successor, based on NetFlow v9.
Flow records include: source and destination IP addresses and ports, protocol, byte and packet counts, timestamps, and, with some exporters, application identification.
Network devices generate flow data with minimal performance impact. Flow collectors aggregate and analyze the data.
sFlow
sFlow (sampled flow) provides statistical sampling of network traffic. Unlike NetFlow’s flow-based approach, sFlow samples one in every N packets at a configurable rate and exports the sampled packet headers along with periodic interface counters.
sFlow advantages include: scalability for high-speed networks and minimal impact on network devices.
The trade-off is less precise measurement due to sampling.
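The precision cost of sampling can be estimated. The sketch below scales a sample count up to an estimated total and applies the commonly cited sFlow approximation that the relative error at roughly 95% confidence is at most 196 * sqrt(1/c) percent, where c is the number of samples in the class of interest. The function name and figures are illustrative.

```python
import math


def estimate_from_samples(sampled_packets, sampling_rate):
    """Scale a sample count to an estimated total plus a ~95% error bound."""
    estimated_total = sampled_packets * sampling_rate
    # Approximate relative error bound in percent; shrinks as samples grow.
    error_pct = 196 * math.sqrt(1 / sampled_packets)
    return estimated_total, error_pct


# Hypothetical: 10,000 samples collected at a 1-in-1000 sampling rate.
total, err = estimate_from_samples(sampled_packets=10_000, sampling_rate=1000)
print(total, round(err, 2))
```

The bound depends on the number of samples rather than on the sampling rate itself, so busy links can tolerate coarser sampling than quiet ones.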
Monitoring Tools
Tool Comparison
| Tool | Type | Key Strengths | Best For |
|---|---|---|---|
| Prometheus + Grafana | Open-source stack | SNMP exporter, Blackbox exporter, PromQL, Kubernetes-native | Cloud-native, custom pipelines, SRE teams |
| Zabbix | Open-source | SNMP, IPMI, JMX, agent-based, auto-discovery | Enterprise on-prem with Linux expertise |
| SolarWinds NPM | Commercial | Auto-discovery, PerfStack troubleshooting, alerting | Mid-market to large enterprises |
| PRTG | Commercial | All-in-one sensors (SNMP, flow, packet, WMI) | SMBs, simple multi-vendor environments |
| LibreNMS | Open-source | Auto-discovery, polling, alerting, billing | Network engineers who prefer self-hosted |
| Kentik | SaaS/Commercial | Flow-based (NetFlow, sFlow, IPFIX), cloud-aware, AI analytics | Multi-cloud, high-volume traffic analysis |
| ThousandEyes (Cisco) | SaaS | Synthetic monitoring, BGP, Internet insights, WAN visibility | Distributed enterprises, SD-WAN, Internet-aware |
| Cilium + Hubble | Open-source | eBPF-based, Kubernetes network observability, service map | K8s-native, zero-trust, service mesh |
Prometheus + Grafana
The Prometheus stack has become the de facto standard for cloud-native network monitoring. It scrapes metrics from exporters, stores them in a time-series database, and visualizes through Grafana. The SNMP exporter and Blackbox exporter cover traditional network devices, while exporters for Kubernetes, databases, and applications fill in the rest. Alertmanager handles routing, deduplication, and notification of alerts across Slack, PagerDuty, email, and webhooks.
Zabbix
Zabbix provides mature, agent-based monitoring with auto-discovery of network devices. It supports SNMP, IPMI, JMX, and custom checks. The latest versions include improved visualization, native Prometheus integration, and machine learning-based anomaly detection. Zabbix requires more setup than commercial tools but offers excellent value for organizations with Linux skills.
SolarWinds Network Performance Monitor
SolarWinds NPM offers automatic network discovery, performance polling, alerting, and PerfStack cross-correlation troubleshooting. The Orion platform (since renamed the SolarWinds Platform) integrates with NetFlow Traffic Analyzer, IPAM, and server monitoring. It is well suited to mid-market and larger enterprises that need robust monitoring without excessive complexity.
PRTG Network Monitor
PRTG uses a sensor-based licensing model — each monitored parameter consumes one sensor. It supports SNMP, flow, packet sniffing, WMI, and REST API sensors out of the box. PRTG’s all-in-one approach is accessible for SMBs but can become expensive at scale.
LibreNMS
LibreNMS is a community-driven fork of the Observium project. It automatically discovers devices via CDP, FDP, LLDP, and OSPF, then polls interface statistics, CPU, memory, storage, and custom OIDs. The built-in alerting system supports email, Slack, Telegram, and webhooks. It bills itself as “fully featured” and is suitable for network engineers who prefer a self-hosted, no-cost solution.
Kentik
Kentik is a cloud-native NPM platform built on flow telemetry. It ingests NetFlow, sFlow, IPFIX, and cloud flow logs (AWS VPC, Azure, GCP) at scale, then applies AI-driven analytics for anomaly detection, capacity planning, and DDoS identification. It is particularly strong for multi-cloud environments where visibility across providers is critical.
ThousandEyes
ThousandEyes (acquired by Cisco) provides synthetic monitoring from multiple vantage points worldwide. It runs agent-based tests for HTTP, TCP, UDP, DNS, and BGP to measure Internet and WAN performance from the user’s perspective. This makes it particularly useful for confirming or ruling out “it’s the network” claims when working with SaaS providers and ISPs. It integrates with AppDynamics, ServiceNow, and Cisco Catalyst Center.
Cilium + Hubble
Cilium uses eBPF (extended Berkeley Packet Filter) to provide kernel-level network observability for Kubernetes. Hubble, the observability layer, exposes flow logs, service maps, and HTTP/gRPC request metrics without per-pod sidecars. It offers finer-grained visibility into Kubernetes traffic than traditional device-centric polling.
Implementation Best Practices
Define Monitoring Objectives
Before deploying monitoring, define clear objectives. What are you trying to achieve? What decisions will monitoring inform?
Common objectives include: proactive issue identification, capacity planning, service level compliance, and troubleshooting acceleration.
Clear objectives guide tool selection and configuration.
Establish Baselines
Understanding normal network behavior is essential for identifying anomalies. Establish baselines during stable operating periods.
Baselines should include: typical bandwidth utilization, normal latency ranges, baseline application response times, and common traffic patterns.
Use baselines to configure appropriate alerts that avoid alert fatigue.
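One simple way to turn a baseline into an alert threshold is mean plus a multiple of the standard deviation. This is a minimal sketch with hypothetical latency samples; the multiplier `k` is an assumption to tune, and skewed metrics are often better served by percentiles.

```python
from statistics import mean, stdev


def baseline_threshold(samples, k=3.0):
    """Return (mean, upper threshold) with threshold = mean + k * stdev."""
    mu = mean(samples)
    return mu, mu + k * stdev(samples)


# Hypothetical latency samples (ms) from a stable operating period.
latency_ms = [10, 12, 11, 13, 9, 11, 12, 10]
mu, limit = baseline_threshold(latency_ms)
print(mu, round(limit, 2))
```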
Implement Appropriate Alerts
Alerts should notify operators of issues requiring attention without overwhelming them with false positives.
Alert configuration principles include: alert on significant deviations from baseline, use severity levels appropriately, implement alert deduplication and correlation, and ensure clear alert documentation.
Plan for Scalability
Network monitoring generates significant data. Plan storage and processing capacity for growth.
Consider data retention requirements. Historical data supports capacity planning and forensic analysis.
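For a Prometheus-based stack, disk needs can be estimated as retention time multiplied by ingested samples per second and bytes per sample, following the rule of thumb in the Prometheus storage documentation (which cites roughly 1 to 2 bytes per sample). The device counts below are hypothetical.

```python
def prometheus_disk_bytes(series, scrape_interval_s, retention_days,
                          bytes_per_sample=2.0):
    """Estimate disk usage: retention * samples/sec * bytes per sample."""
    samples_per_second = series / scrape_interval_s
    retention_seconds = retention_days * 86_400
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical: 200 devices x 48 interfaces x 20 metrics each,
# scraped every 60 s and retained for 90 days.
series = 200 * 48 * 20
print(f"{prometheus_disk_bytes(series, 60, 90) / 1e9:.1f} GB")
```

Halving the scrape interval doubles the estimate, so polling frequency is usually the first lever to examine when storage grows faster than expected.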
Automate Response Where Possible
Automated responses can address common issues without operator intervention.
Automation examples include: automatic traffic rerouting during failures, automated VM migration during congestion, and auto-scaling based on utilization.
Metrics to Monitor
Metric Reference Table
| Category | Metric | Good | Warning | Critical | PromQL Example |
|---|---|---|---|---|---|
| Device | CPU utilization | < 50% | 50-80% | > 80% | avg by (instance) (cpmCPUTotal5minRev) |
| Device | Memory utilization | < 60% | 60-85% | > 85% | ciscoMemoryPoolUsed / (ciscoMemoryPoolUsed + ciscoMemoryPoolFree) * 100 |
| Interface | Utilization | < 50% | 50-80% | > 80% | rate(ifHCInOctets[5m]) * 8 / ifHighSpeed / 1000000 * 100 |
| Interface | Error rate | < 0.01% | 0.01-1% | > 1% | rate(ifInErrors[5m]) / rate(ifHCInUcastPkts[5m]) * 100 |
| Interface | Discard rate | 0 | < 0.1% | > 0.1% | rate(ifInDiscards[5m]) / rate(ifHCInUcastPkts[5m]) * 100 |
| Path | Latency (LAN) | < 1 ms | 1-10 ms | > 10 ms | probe_icmp_duration_seconds{phase="rtt"} * 1000 |
| Path | Latency (WAN) | < 20 ms | 20-100 ms | > 100 ms | probe_icmp_duration_seconds{phase="rtt"} * 1000 |
| Path | Latency (Internet) | < 50 ms | 50-200 ms | > 200 ms | probe_icmp_duration_seconds{phase="rtt"} * 1000 |
| Path | Packet loss | 0% | 0-1% | > 1% | (1 - avg_over_time(probe_success[5m])) * 100 |
| Path | Jitter | < 5 ms | 5-20 ms | > 20 ms | histogram_quantile(0.99, ...) |
Infrastructure Metrics
Device health metrics reveal underlying hardware or software problems before they cause outages:
- CPU utilization: sustained > 80% indicates overload — check for process spikes or insufficient capacity
- Memory utilization: sustained > 85% risks OOM conditions on control-plane processes
- Interface errors: FCS errors, CRC errors, runts, giants — typically indicate physical layer issues (bad cable, failing SFP, or duplex mismatch)
- Interface discards: packets dropped due to buffer exhaustion — signals congestion or microbursts
- Temperature and power supply: environmental monitoring prevents hardware failure
Network Path Metrics
End-to-end path metrics measure the network as the user experiences it:
- Latency: round-trip time by path and time of day. WAN links show predictable latency proportional to distance (roughly 1 ms of round-trip time per 100 km of fiber). Sudden increases indicate congestion or routing changes.
- Jitter: variation in latency over time. Real-time applications (VoIP, video) degrade above 20-30 ms of jitter, which is typically caused by queuing delays, bufferbloat, or route flaps.
- Packet loss: even 0.5% loss degrades TCP throughput significantly, because loss-based congestion control shrinks the congestion window on every loss event (Reno halves it; CUBIC reduces it by roughly 30%). Loss > 1% is service-affecting for most applications.
- Throughput: measure utilization relative to link capacity. 80% is the typical warning threshold — beyond that, queuing delay and loss increase non-linearly.
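Two of the figures above can be checked with short calculations. The sketch below estimates round-trip propagation delay in fiber, assuming light travels about 200 km per millisecond in glass, and applies the Mathis et al. approximation for steady-state TCP Reno throughput, rate ≈ (MSS / RTT) * sqrt(3/2) / sqrt(p). Both are simplified models: real paths add queuing and equipment delay, and modern congestion control algorithms behave differently.

```python
import math


def fiber_rtt_ms(distance_km, km_per_ms=200.0):
    """Round-trip propagation delay over fiber, ignoring equipment delay."""
    return 2 * distance_km / km_per_ms


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Approximate upper bound on steady-state TCP Reno throughput."""
    return (mss_bytes * 8 / rtt_s) * (math.sqrt(1.5) / math.sqrt(loss_rate))


print(fiber_rtt_ms(100))
# Hypothetical path: 1460-byte MSS and 50 ms RTT at several loss rates.
for loss in (0.0001, 0.005, 0.01):
    mbps = mathis_throughput_bps(1460, 0.050, loss) / 1e6
    print(f"loss={loss:.2%} -> ~{mbps:.1f} Mbit/s")
```

Because throughput scales with 1/sqrt(p), quadrupling the loss rate halves the achievable rate of a single flow, regardless of link capacity.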
Application Metrics
- Application response time: time from request to first byte. Correlate with network latency to determine whether the bottleneck is the network or the application server.
- Transaction success rate: percentage of completed transactions. Drops often correlate with network timeouts or firewall policy changes.
- Session establishment time: TCP handshake duration. Prolonged setup may indicate SYN queue exhaustion, asymmetric routing, or firewall inspection delay.
Security Metrics
- Denied connections: firewall deny counters spike during scanning or attack activity
- DNS query anomalies: unusual query volumes or NXDOMAIN rates may indicate malware C2 communication
- Authentication failures: repeated LDAP, RADIUS, or TACACS+ failures signal credential attacks
Cloud and Hybrid Monitoring
Cloud Monitoring Challenges
Cloud environments present unique monitoring challenges. Limited visibility into provider infrastructure, dynamic resource allocation, and multi-cloud complexity require adapted approaches.
Many organizations use cloud-native monitoring combined with traditional tools.
AWS Monitoring
AWS native monitoring centers on CloudWatch, with network-specific tools layered on top:
# Enable VPC Flow Logs for all ENIs in a VPC
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-abc123 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name vpc-flow-logs \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/FlowLogsRole
# Query flow logs with Athena for top talkers (Athena reads from S3, so this
# assumes logs delivered with --log-destination-type s3 and a vpc_flow_logs table)
# SELECT srcaddr, dstaddr, SUM(bytes) AS total_bytes
# FROM vpc_flow_logs
# WHERE action = 'ACCEPT'
# GROUP BY srcaddr, dstaddr
# ORDER BY total_bytes DESC
# LIMIT 10;
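VPC Flow Logs capture metadata (not payload) for most IP traffic; some traffic, such as queries to the Amazon DNS server and instance metadata requests, is not logged. Deliver the logs to S3 to query them with Amazon Athena for ad-hoc analysis, or stream them to third-party tools like Kentik, Datadog, or Prometheus via a log exporter.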
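If you process flow log records yourself, the default (version 2) format is a space-separated line with a fixed field order. The following is a minimal parsing sketch; `parse_flow_record` is a hypothetical helper, it assumes the default format rather than a custom one, and the sample line follows the pattern of the example records in the AWS documentation.

```python
# Field order of the default (version 2) VPC flow log format.
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()

NUMERIC = ("srcport", "dstport", "protocol", "packets", "bytes", "start", "end")


def parse_flow_record(line):
    """Parse one default-format flow log line into a dictionary."""
    record = dict(zip(FIELDS, line.split()))
    for key in NUMERIC:
        # NODATA and SKIPDATA records carry "-" in these fields.
        record[key] = int(record[key]) if record[key] != "-" else None
    return record


line = ("2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
        "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK")
rec = parse_flow_record(line)
print(rec["srcaddr"], rec["dstport"], rec["bytes"], rec["action"])
```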
Azure Monitoring
Azure Monitor and Network Watcher provide equivalent capabilities:
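# Enable NSG flow logs (--location is required; eastus is a placeholder)
# Note: Microsoft has announced the retirement of NSG flow logs in favor of
# virtual network flow logs, so check current Azure documentation first
az network watcher flow-log create \
  --location eastus \
  --resource-group prod-rg \
  --name nsg-flow-log \
  --nsg prod-nsg \
  --storage-account saflowlogs \
  --enabled true \
  --retention 90
# Run a connectivity check from a VM to an endpoint
az network watcher test-connectivity \
  --resource-group prod-rg \
  --source-resource vm-web-01 \
  --dest-address api.example.com \
  --dest-port 443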
Azure Monitor Metrics stores platform-level network metrics (bytes in/out, packets, drops) for 93 days by default. Export to Log Analytics for custom queries and correlation with application logs.
Hybrid Considerations
Organizations with hybrid environments must monitor both on-premises and cloud infrastructure.
Key considerations include: consistent metrics across environments, unified alerting and dashboards, and correlation across cloud and on-premises components.
Troubleshooting with Monitoring Data
Data-Driven Troubleshooting
Monitoring data accelerates troubleshooting by providing objective information about network state.
Effective troubleshooting uses monitoring data to: confirm or rule out network involvement, identify the scope and location of issues, establish timeline and impact, and verify resolution.
Common Troubleshooting Scenarios
Monitoring helps address common scenarios.
Slow application performance: Use latency and throughput data to identify bottlenecks. Correlate with application metrics.
Intermittent connectivity: Review historical data for patterns. Check for events coinciding with issues.
Bandwidth exhaustion: Identify top users and applications. Plan capacity additions.
Documentation
Document troubleshooting processes and findings. This documentation builds institutional knowledge and improves future response.
Future Trends
eBPF-Based Observability
eBPF (extended Berkeley Packet Filter) is transforming network observability by running sandboxed programs in the Linux kernel without modifying kernel source or loading modules. Tools like Cilium and Hubble use eBPF to observe network flows, TCP connections, and (with Layer 7 visibility enabled) HTTP requests inside the kernel datapath, typically with low overhead.
# Hubble CLI: observe HTTP traffic to the prod namespace in real time
hubble observe --protocol http --to-namespace prod
# Illustrative output (simplified; real output also shows identities and verdicts):
# Jan 15 14:23:01.234 pod/frontend:5432 -> pod/api:8080
# HTTP/1.1 GET 200 12ms (x-forwarded-for: 10.0.1.5)
Unlike traditional monitoring that relies on per-pod sidecars or application instrumentation, eBPF provides kernel-level visibility: it captures flows, service maps, and latency metrics without application changes. It has become a widely adopted approach for Kubernetes network observability.
AI-Driven Operations (AIOps)
Machine learning is shifting NPM from reactive to predictive. Tools analyze historical telemetry to establish dynamic baselines and flag anomalies that deviate from learned patterns — catching slow-burn issues like incremental bufferbloat or creeping interface errors that static thresholds miss.
Root cause analysis (RCA) engines correlate alerts across network, server, and application layers, which can substantially reduce mean time to resolution (MTTR).
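The idea of a dynamic baseline can be illustrated without any machine learning library. The sketch below keeps an exponentially weighted moving average (EWMA) and flags values that deviate from it by more than a set fraction. It is a deliberately simple stand-in for what commercial tools do; `alpha` and `tolerance` are assumed tuning values, and because the baseline adapts, very gradual drift would need an additional long-term comparison to detect.

```python
def ewma_anomalies(values, alpha=0.3, tolerance=0.5):
    """Return indexes whose value deviates from the EWMA by > tolerance."""
    baseline = values[0]
    flagged = []
    for i, value in enumerate(values[1:], start=1):
        if abs(value - baseline) > tolerance * baseline:
            flagged.append(i)
        else:
            # Fold only normal points into the baseline so spikes do not skew it.
            baseline = alpha * value + (1 - alpha) * baseline
    return flagged


# Hypothetical latency series (ms) with one spike.
latency_ms = [10, 11, 10, 12, 11, 30, 11, 10]
print(ewma_anomalies(latency_ms))
```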
Automated Remediation
Network automation closes the loop with monitoring. When the monitoring stack detects a problem, automation takes action:
- Traffic rerouting during link failure (via BGP or SDN controllers)
- VM migration when rack-level congestion is detected
- Auto-scaling of firewall or load balancer capacity
- Configuration rollback after a failed deployment triggers alert spike
Integration between Prometheus Alertmanager, Ansible, or custom webhook handlers makes this practical in production.
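A webhook handler for this kind of integration mostly needs to decide which alerts warrant which action. The sketch below parses a payload shaped like the Alertmanager webhook format (a top-level `alerts` list whose entries carry `status` and `labels`) and maps firing alerts to playbooks. The playbook names and the `PLAYBOOKS` mapping are hypothetical, and the step that actually runs Ansible is intentionally left out.

```python
import json

# Hypothetical mapping from alert name to a remediation playbook.
PLAYBOOKS = {
    "InterfaceDown": "reroute_traffic.yml",
    "HighUtilization": "scale_out.yml",
}


def plan_actions(payload):
    """Return (playbook, instance) pairs for firing alerts with a playbook."""
    actions = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        name = labels.get("alertname")
        # Resolved alerts and alerts without a playbook are ignored.
        if alert.get("status") == "firing" and name in PLAYBOOKS:
            actions.append((PLAYBOOKS[name], labels.get("instance")))
    return actions


body = json.loads("""{
  "version": "4", "status": "firing",
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "InterfaceDown", "instance": "192.168.1.1"}},
    {"status": "resolved",
     "labels": {"alertname": "HighUtilization", "instance": "192.168.1.2"}}
  ]
}""")
print(plan_actions(body))
```

Keeping the decision logic separate from execution makes it straightforward to log or dry-run planned actions before allowing automation to change the network.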
Integrated Observability
Network monitoring is converging with application performance monitoring (APM) and infrastructure monitoring into unified observability platforms. The goal: trace a user request from browser through CDN, load balancer, application server, and database, with network metrics available at every hop.
OpenTelemetry is driving this convergence by providing a standard way to emit traces, metrics, and logs from any component. Network device data from gNMI and flow telemetry can be brought into the same pipelines through collectors.
Resources
- Prometheus SNMP Exporter — SNMP-to-Prometheus metrics bridge
- Prometheus Blackbox Exporter — ICMP/TCP/HTTP probing
- Cilium + Hubble — eBPF-based Kubernetes network observability
- LibreNMS — Auto-discovering open-source NMS
- Kentik NPM — Cloud-native flow-based network monitoring
- ThousandEyes — Internet and WAN synthetic monitoring
- RFC Editor — Network protocol standards