⚡ Calmops

Open Source Monitoring Stack for Small Teams in 2026

Introduction

System monitoring is critical for maintaining application reliability and quickly diagnosing issues. For small teams, commercial monitoring solutions like Datadog or New Relic can cost thousands of dollars per month as your infrastructure scales. Open source alternatives have matured significantly, now offering capabilities that rival commercial products at a fraction of the cost.

In this guide, we’ll explore how to build a comprehensive monitoring stack using open source tools. The combination of Prometheus for metrics collection and Grafana for visualization has become the de facto standard for cloud-native monitoring. We’ll examine how to implement these tools effectively, extend them with additional capabilities, and operate them efficiently with limited resources.

Whether you’re monitoring a handful of servers or managing a Kubernetes cluster, this guide will help you establish robust observability without breaking your budget. The techniques covered here work for infrastructure of various sizes, though we’ll focus on implementations practical for small teams with limited operational capacity.

Understanding Observability Fundamentals

Modern monitoring goes beyond simple uptime checks and error logging. Observability encompasses three key pillars: metrics, logs, and traces. Understanding these concepts helps you choose the right tools and design effective monitoring strategies.

Metrics are numerical measurements collected over time, such as CPU usage, request latency, or error rates. They provide quantitative data about system behavior and are ideal for establishing baselines, detecting anomalies, and triggering alerts. Metrics are also the most efficient pillar for long-term storage and querying, making them the foundation of most monitoring systems.

Logs are timestamped records of events that occurred in your system. They provide detailed context about what happened, when, and often why. While essential for debugging specific issues, logs generate significantly more data than metrics and require more storage and processing resources. Effective log management requires thoughtful filtering and aggregation.

Distributed tracing follows requests as they flow through multiple services, enabling you to understand performance bottlenecks in complex architectures. Traces are particularly valuable for microservices applications where a single user request might traverse dozens of services. While more complex to implement, distributed tracing dramatically reduces debugging time for performance issues.

Prometheus: The Metrics Backbone

Prometheus has become the standard for metrics collection in cloud-native environments. Originally developed at SoundCloud and now a graduated project in the Cloud Native Computing Foundation (CNCF), Prometheus offers powerful features for gathering, storing, and querying time-series metrics.

How Prometheus Works

Prometheus uses a pull-based model, where the server scrapes metrics from configured targets at regular intervals. This design offers several advantages over push-based systems. Targets don’t need to know about Prometheus, making discovery simpler. The centralized scraping point provides a consistent view of metrics timing. Load remains predictable since Prometheus controls exactly when metrics are collected.

The architecture consists of several components working together. The Prometheus server handles data collection, storage, and querying. Exporters run on monitored systems, exposing metrics in Prometheus format. Alertmanager processes alerts generated from Prometheus rules, handling notification routing and grouping. The pushgateway supports short-lived jobs that can’t be scraped directly.
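The exposition format itself is deliberately simple: an exporter serves a plain-text payload over HTTP, and a scrape of its /metrics endpoint returns lines like the following (the metric names here are illustrative):

```
# HELP http_requests_total Total HTTP requests served
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="get",code="500"} 3
```

Each line pairs a metric name and label set with a current value; Prometheus timestamps the samples at scrape time.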

Setting Up Prometheus

Getting started with Prometheus is straightforward. For small deployments, running Prometheus in Docker provides a quick path to production:

version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

volumes:
  prometheus_data:

The Prometheus configuration file defines what to scrape and how often:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

For Kubernetes deployments, the Prometheus Operator automates much of the configuration management, using Custom Resource Definitions (CRDs) to define monitoring targets declaratively.
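As a sketch of that declarative style, a ServiceMonitor resource tells the Operator which Services to scrape; the application name and label selector below are assumptions for illustration:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                 # hypothetical application
  labels:
    release: prometheus        # must match the Operator's selector
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics            # named port on the target Service
      interval: 15s
```

The Operator translates resources like this into scrape configuration automatically, so teams can add monitoring targets without touching the central Prometheus config.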

Querying with PromQL

Prometheus’s query language, PromQL, enables sophisticated analysis of collected metrics. Understanding key PromQL patterns helps you extract meaningful insights from your metrics data.

Basic queries select metrics by name and optionally filter by labels:

# All CPU metrics across all jobs
process_cpu_seconds_total

# Specific job with label matching
container_memory_usage_bytes{job="kubernetes-pods"}

Range-vector functions compute values over a sliding time window:

# Average resident memory over 5 minutes (avg_over_time suits gauges)
avg_over_time(process_resident_memory_bytes[5m])

# Rate of requests per second
rate(http_requests_total[5m])

These queries can be combined into recording rules that pre-compute frequently used expressions, improving query performance for complex dashboards.
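A recording rule file uses the same groups structure as alert rules; for example, pre-computing a per-job request rate (the rule and metric names here are illustrative):

```yaml
groups:
  - name: http_rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Dashboards then query the cheap pre-computed series (job:http_requests:rate5m) instead of re-evaluating the expression on every refresh.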

Best Practices for Prometheus

Design your metric labeling strategy carefully, as labels determine how you can slice and dice your data. Avoid high-cardinality labels like user IDs or timestamps, which dramatically increase storage requirements. Instead, use labels that represent meaningful dimensions like service name, environment, or region.

Use recording rules to define frequently queried metric combinations. Pre-computing complex calculations improves dashboard responsiveness and reduces load on the Prometheus server. Group related recording rules into files organized by service or team.

Configure appropriate retention periods based on your needs. Prometheus stores data in two-hour blocks, and the default retention is 15 days. For cost-effective long-term storage, consider Thanos or Cortex, which provide centralized storage spanning multiple Prometheus instances.

Grafana: Visualization and Alerting

Grafana complements Prometheus by providing powerful visualization and alerting capabilities. The tool connects to multiple data sources, making it a centralized dashboard platform regardless of where your metrics originate.

Dashboard Design Principles

Effective dashboards communicate system health at a glance while enabling drill-down investigation. Design dashboards in layers, starting with high-level overview panels that immediately indicate overall system status. Use color coding consistently—green for healthy, yellow for degraded, red for critical—so users can quickly assess situations.

Organize panels logically, typically with the most critical metrics at the top. Group related metrics together and use consistent time ranges across panels so comparisons are meaningful. Avoid cramming too much information into a single view; instead, use links between dashboards to enable navigation from overview to detail.

Include both current state and historical trends. Current values without context are difficult to interpret—are 500 requests per minute good or bad? Include historical baselines or thresholds that provide context for current measurements.

Setting Up Grafana

Grafana installation follows patterns similar to Prometheus, with Docker providing an easy starting point:

version: '3.8'
services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password

volumes:
  grafana_data:

After startup, configure Prometheus as a data source through the Grafana UI. Navigate to Configuration → Data Sources → Add data source, select Prometheus, and enter your Prometheus server URL.

Building Effective Dashboards

Grafana’s panel editor offers numerous visualization options. For metrics over time, graph panels provide historical context. For current values, stat panels display single numbers with optional sparklines. Gauge panels work well for percentage-based metrics with defined thresholds.

Use variables to create dynamic dashboards that can filter across multiple dimensions. A variable for $service that queries available service labels allows users to focus on specific components without creating separate dashboards.

label_values(process_cpu_seconds_total, job)

Template variables also enable dashboard reuse across environments by switching between development, staging, and production contexts.
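In panel queries, the variable is interpolated like any other label value; for example, assuming a generic http_requests_total metric exposed by your services:

```
# Request rate for the service selected in the dashboard dropdown
rate(http_requests_total{job="$service"}[5m])
```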

Alert Configuration

Grafana’s alerting integrates with Prometheus Alertmanager, allowing you to define alerts that trigger notifications through various channels. Configure alert rules that evaluate PromQL expressions:

groups:
  - name: cpu_alerts
    rules:
      - alert: HighCpuUsage
        expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "{{ $labels.pod }} CPU usage above 80% for 5 minutes"

Configure notification channels for email, Slack, PagerDuty, or webhooks. For critical alerts, implement proper escalation paths that ensure someone responds even if the primary on-call person is unavailable.
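As a sketch, Alertmanager routes alerts to channels by matching labels; the receiver names, webhook URL, and integration key below are placeholders:

```yaml
route:
  receiver: 'team-slack'            # default for everything
  routes:
    - match:
        severity: critical
      receiver: 'oncall-pagerduty'  # critical alerts page on-call

receivers:
  - name: 'team-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'  # placeholder
        channel: '#alerts'
  - name: 'oncall-pagerduty'
    pagerduty_configs:
      - service_key: 'your-integration-key'  # placeholder
```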

Node Exporter: Infrastructure Monitoring

While Prometheus collects metrics, you need exporters to expose metrics from your systems. Node Exporter is the most common exporter, providing system-level metrics from Linux and Unix hosts.

Key Metrics Available

Node Exporter exposes hundreds of metrics covering CPU, memory, disk, network, and system-level information. Understanding the most important metrics helps you build effective alerts and dashboards.

CPU metrics include seconds spent in different states (user, system, idle, iowait, steal). The rate of change for these counters reveals system load patterns:

# CPU usage as percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory metrics track total, available, used, and cached memory. Linux memory accounting can be complex, but key queries simplify interpretation:

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

Disk metrics cover space, inode usage, and I/O operations. Network metrics include bytes and packets sent and received, errors, and drops. Filesystem metrics expose mount point-specific storage information.
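The filesystem metrics combine into a usage percentage in the same way as the memory query; for example:

```
# Filesystem usage percentage per mount point (excluding tmpfs)
100 - (node_filesystem_avail_bytes{fstype!~"tmpfs"} / node_filesystem_size_bytes * 100)
```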

Deploying Node Exporter

Node Exporter runs on each monitored host and exposes metrics on port 9100. For small deployments, running Node Exporter via systemd or Docker works well:

docker run -d \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host

The volume mount allows Node Exporter to access host filesystem information, enabling disk and filesystem metrics. For Kubernetes deployments, DaemonSets automatically deploy Node Exporter to all nodes.
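For reference, a minimal DaemonSet sketch might look like the following; the monitoring namespace and label names are assumptions:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring       # assumed namespace
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true       # expose on the node's own address
      hostPID: true
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:latest
          args: ['--path.rootfs=/host']
          ports:
            - containerPort: 9100
          volumeMounts:
            - name: root
              mountPath: /host
              readOnly: true
      volumes:
        - name: root
          hostPath:
            path: /           # host filesystem for disk metrics
```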

Extending the Stack

The Prometheus-Grafana combination forms the foundation of comprehensive monitoring. Additional exporters and tools extend capabilities for specific needs.

Blackbox Exporter for Endpoint Monitoring

The Blackbox Exporter enables probing of endpoints using HTTP, HTTPS, DNS, TCP, and ICMP. Use it to monitor external service availability and response times:

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: "ip4"

Create Prometheus targets that scrape the Blackbox Exporter, then visualize probe success rates and response times in Grafana.
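On the Prometheus side, the conventional relabeling pattern passes the real target as a URL parameter while pointing the scrape at the exporter itself (the exporter address and probe target below are assumptions):

```yaml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]          # module defined in the exporter config
    static_configs:
      - targets:
          - https://example.com   # endpoint to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target   # probed URL becomes ?target=
      - source_labels: [__param_target]
        target_label: instance         # keep the URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115  # actual scrape address
```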

Application Instrumentation

For custom applications, adding Prometheus metrics provides visibility into application-specific behavior. Client libraries exist for most major languages, making instrumentation straightforward:

from flask import Flask, jsonify
from prometheus_client import Counter, Histogram, generate_latest

app = Flask(__name__)

REQUEST_COUNT = Counter('app_requests_total', 'Total requests', ['method', 'endpoint'])
REQUEST_LATENCY = Histogram('app_request_latency_seconds', 'Request latency', ['method'])

@app.route('/api/users')
def get_users():
    with REQUEST_LATENCY.labels(method='GET').time():
        users = load_users()  # your logic here
    REQUEST_COUNT.labels(method='GET', endpoint='/api/users').inc()
    return jsonify(users)

@app.route('/metrics')
def metrics():
    return generate_latest()

Instrument critical business operations alongside technical metrics. Track orders processed, user signups, or payment transactions to correlate business health with system performance.

Container Monitoring

For Docker containers, cAdvisor provides container-specific metrics including CPU, memory, disk, and network usage. In Kubernetes, cAdvisor is integrated into the Kubelet, making container metrics available automatically.

Query container metrics to understand resource consumption by service:

container_memory_usage_bytes{container!=""}
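A sum by pod (or any other label) turns the raw per-container series into per-workload consumption:

```
# Total memory usage per pod
sum by (pod) (container_memory_usage_bytes{container!=""})
```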

Alerting Strategies

Effective alerting balances sensitivity with noise reduction. Too few alerts miss real issues; too many alerts cause alert fatigue and desensitization.

Defining Alert Severity

Classify alerts by impact and urgency. Critical alerts indicate immediate service degradation requiring rapid response—these should page on-call staff. Warning alerts indicate potential issues that should be addressed but don’t require immediate action. Informational alerts provide context without requiring response.

groups:
  - name: infrastructure
    rules:
      # Critical: Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          
      # Warning: High memory usage
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 85%"

Reducing Alert Fatigue

Configure appropriate "for" durations so that alerts require sustained conditions before firing. A brief CPU spike that resolves in 30 seconds rarely indicates a real problem, but 10 minutes of elevated CPU suggests genuine resource constraints.

Use alert grouping in Alertmanager to prevent notification storms during major incidents. When multiple alerts fire from the same root cause, grouped notifications present a coherent picture rather than dozens of separate alerts.
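Grouping behavior is controlled by a few timing knobs on the Alertmanager route; the receiver name below is a placeholder:

```yaml
route:
  receiver: 'team-notifications'      # placeholder receiver
  group_by: ['alertname', 'cluster']  # alerts sharing these labels batch together
  group_wait: 30s        # wait to collect alerts that fire together
  group_interval: 5m     # minimum gap between batches for a group
  repeat_interval: 4h    # how often a still-firing alert re-notifies
```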

Implement runbook URLs in alert annotations that direct responders to documented remediation procedures. This accelerates incident response and ensures consistent handling regardless of who’s on-call.

Cost Considerations

Building an open source monitoring stack has no software licensing costs, but infrastructure expenses still apply. Understanding these costs helps you plan appropriately.

Storage Requirements

Prometheus storage depends on your metric cardinality, scrape frequency, and retention period. Compressed samples occupy roughly one to two bytes each, so a typical small deployment with 1,000 time series scraped every 15 seconds generates on the order of 10MB of data daily. With 15-day retention, plan for a couple of hundred megabytes, and scale these figures linearly with series count.
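These figures can be sanity-checked with simple arithmetic; the sketch below assumes roughly two bytes per compressed sample, a commonly cited approximation:

```python
# Back-of-the-envelope Prometheus storage estimate.
# Assumes ~2 bytes per compressed sample (an approximation).

def storage_estimate_bytes(series, scrape_interval_s, retention_days,
                           bytes_per_sample=2):
    """Estimated on-disk size for the given workload."""
    samples_per_day = series * (86_400 / scrape_interval_s)
    return samples_per_day * retention_days * bytes_per_sample

daily = storage_estimate_bytes(1_000, 15, 1)
total = storage_estimate_bytes(1_000, 15, 15)
print(f"~{daily / 1e6:.0f} MB/day, ~{total / 1e6:.0f} MB over 15 days")
```

On these assumptions, 1,000 series at a 15-second interval work out to roughly 12MB per day and under 200MB over the default 15-day retention.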

Long-term storage solutions like Thanos or Cortex enable indefinite metric retention by storing historical data in object storage. These add complexity but eliminate the need to choose between retention and cost.

Resource Planning

Prometheus memory requirements scale with the number of time series and query complexity. For deployments under 100,000 time series, 2GB RAM is typically sufficient. Larger deployments may require 8GB or more.

Grafana is relatively lightweight, running comfortably on 512MB RAM for small deployments. The main resource consideration is dashboard complexity—dashboards with thousands of panels may require more memory for rendering.

Conclusion

Building an enterprise-grade monitoring infrastructure with open source tools is not only possible but often preferable for small teams. The Prometheus-Grafana combination provides capabilities that rival commercial solutions at a fraction of the cost, with the flexibility to adapt to your specific needs.

Start simple: deploy Prometheus and Grafana, add Node Exporter to your servers, and build basic dashboards showing system health. Gradually add application instrumentation, more sophisticated alerts, and specialized exporters as your needs evolve.

The initial investment in setting up proper monitoring pays dividends in faster incident response, better system understanding, and the confidence that comes from knowing your systems’ behavior. With the foundation in place, you can focus on building your product rather than worrying about what’s happening in your infrastructure.
