API Monitoring & Analytics: Complete Guide

Created: February 26, 2026 · Larry Qu · 7 min read

Monitoring and analytics are essential for maintaining reliable, high-performance APIs. This guide covers metrics collection, logging, error tracking, and building observability into your APIs.

Why API Monitoring Matters

Monitoring tells you whether your API is working, for whom, and how well. Without it, you are flying blind—outages go undetected until users complain, performance regressions compound silently, and capacity planning becomes guesswork. Effective monitoring enables:

  • Proactive detection: Find issues before users do, through alerting on early-warning signals
  • Usage insights: Understand which endpoints, clients, and patterns drive traffic
  • Performance optimization: Identify slow queries, heavy payloads, and inefficient code paths
  • Capacity planning: Track growth trends and predict when you will need more resources
  • SLA verification: Measure uptime, latency, and error rates against contractual targets

Key Metrics

The Four Golden Signals

Google’s SRE team identified four metrics that capture most failure modes in distributed systems:

Latency measures how long requests take to complete. Track the latency of successful and failed requests separately: errors that fail fast pull the overall average down and can mask a slowdown in successful requests, while slow errors such as timeouts inflate it. Always track tail latency (p95, p99) alongside averages; a 50 ms average can hide the fact that 1% of requests take 2 seconds.
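
To make the gap between averages and tail latency concrete, here is a minimal sketch using the nearest-rank percentile method. The `percentile` and `mean` helpers and the sample data are illustrative assumptions, not part of any monitoring library.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(Math.ceil((p * sorted.length) / 100), 1);
  return sorted[rank - 1];
}

function mean(samples) {
  return samples.reduce((sum, value) => sum + value, 0) / samples.length;
}

// Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
const latencies = [...Array(98).fill(30), 2000, 2000];

console.log(mean(latencies));           // 69.4
console.log(percentile(latencies, 50)); // 30
console.log(percentile(latencies, 99)); // 2000
```

The mean and median both look healthy here, while p99 exposes the two-second responses.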

Traffic measures demand on your system: requests per second, concurrent connections, or data throughput. Traffic patterns reveal usage cycles (daily peaks, seasonal spikes) and help size infrastructure. A sudden traffic drop may indicate a client-side issue or a routing problem.
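
As a rough illustration of measuring traffic in-process, the sketch below counts requests in one-second buckets and reports the average rate over a sliding window. The `RequestRate` class is a hypothetical helper; in production you would more likely derive this from a metrics counter.

```javascript
// Count requests in one-second buckets and report the average
// requests per second over the last `windowSeconds` seconds.
class RequestRate {
  constructor(windowSeconds = 60) {
    this.windowSeconds = windowSeconds;
    this.buckets = new Map(); // epoch second -> request count
  }

  record(nowMs = Date.now()) {
    const second = Math.floor(nowMs / 1000);
    this.buckets.set(second, (this.buckets.get(second) || 0) + 1);
    this.prune(second);
  }

  // Drop buckets that have fallen out of the window.
  prune(currentSecond) {
    for (const second of this.buckets.keys()) {
      if (second <= currentSecond - this.windowSeconds) {
        this.buckets.delete(second);
      }
    }
  }

  perSecond(nowMs = Date.now()) {
    this.prune(Math.floor(nowMs / 1000));
    let total = 0;
    for (const count of this.buckets.values()) total += count;
    return total / this.windowSeconds;
  }
}

// Usage: call rate.record() once per request, for example in middleware.
const rate = new RequestRate(60);
rate.record();
console.log(rate.perSecond());
```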

Errors measure failure rates: explicit failures (HTTP 500s), implicit failures (200 OK with wrong data), and infrastructure errors (timeouts, connection resets). Track error rate as a percentage of total requests, broken down by endpoint, status code, and client.
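
The breakdown by endpoint can be sketched as follows. This assumes request records shaped like `{ path, statusCode }` (a hypothetical format) and counts only 5xx responses as errors; implicit failures such as a 200 with wrong data need application-level checks.

```javascript
// Compute the share of 5xx responses per endpoint from request records.
function errorRateByEndpoint(records) {
  const stats = new Map();
  for (const { path, statusCode } of records) {
    const entry = stats.get(path) || { total: 0, errors: 0 };
    entry.total += 1;
    if (statusCode >= 500) entry.errors += 1;
    stats.set(path, entry);
  }

  const result = {};
  for (const [path, { total, errors }] of stats) {
    result[path] = { total, errors, errorRatePercent: (errors / total) * 100 };
  }
  return result;
}

// Hypothetical sample data
const sample = [
  { path: '/api/users', statusCode: 200 },
  { path: '/api/users', statusCode: 200 },
  { path: '/api/users', statusCode: 500 },
  { path: '/api/users', statusCode: 404 },
  { path: '/api/orders', statusCode: 503 },
  { path: '/api/orders', statusCode: 201 },
];

console.log(errorRateByEndpoint(sample));
// /api/users: 1 error in 4 requests (25%); /api/orders: 1 in 2 (50%)
```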

Saturation measures how close your system is to its capacity limit. Key indicators include CPU utilization, memory pressure, connection pool usage, and queue depth. Saturation often precedes latency increases and errors—it is your earliest warning signal.
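
A minimal sketch of turning these indicators into utilization ratios is shown below. The thresholds and the `pool` object shape (`inUse`, `max`) are assumptions for illustration; real connection pools expose these numbers under different names.

```javascript
const os = require('os');

// Utilization as a fraction of capacity, clamped to the range [0, 1].
function utilization(used, capacity) {
  if (capacity <= 0) return 1;
  return Math.min(Math.max(used / capacity, 0), 1);
}

// Thresholds are illustrative; tune them per resource.
function classify(ratio, warnAt = 0.7, criticalAt = 0.9) {
  if (ratio >= criticalAt) return 'critical';
  if (ratio >= warnAt) return 'warning';
  return 'ok';
}

function saturationSnapshot(pool) {
  const mem = process.memoryUsage();
  const signals = {
    heap: utilization(mem.heapUsed, mem.heapTotal),
    // 1-minute load average relative to core count (always 0 on Windows)
    cpuLoad: utilization(os.loadavg()[0], os.cpus().length),
    connectionPool: utilization(pool.inUse, pool.max),
  };
  return Object.fromEntries(
    Object.entries(signals).map(([name, ratio]) => [
      name,
      { ratio, state: classify(ratio) },
    ])
  );
}

console.log(saturationSnapshot({ inUse: 18, max: 20 }));
```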

Essential API Metrics

Beyond the four golden signals, API-specific metrics provide operational insight:

| Category  | Metrics                                              | Purpose                 |
|-----------|------------------------------------------------------|-------------------------|
| Request   | Total requests, requests/sec, request size           | Track volume and growth |
| Latency   | p50, p95, p99 response time, TTFB, TLS handshake     | Identify slowdowns      |
| Errors    | Error rate %, errors by status/endpoint, error reason | Detect failures         |
| Resources | CPU, memory, connection pool, DB query time          | Plan capacity           |
| Business  | Active users, requests per client, top endpoints     | Understand usage        |

Request Logging

Structured Logging

// Good: Structured JSON logging
const logger = {
  info: (message, meta = {}) => {
    console.log(JSON.stringify({
      timestamp: new Date().toISOString(),
      level: 'INFO',
      message,
      ...meta
    }));
  },
  
  error: (message, meta = {}) => {
    console.error(JSON.stringify({
      timestamp: new Date().toISOString(),
      level: 'ERROR',
      message,
      ...meta
    }));
  }
};

// Usage
logger.info('API request', {
  method: 'GET',
  path: '/api/users/123',
  statusCode: 200,
  latencyMs: 45,
  userId: 'user_456',
  requestId: 'req_abc123'
});

// Here `error` is the exception caught in the surrounding catch block
logger.error('Request failed', {
  method: 'POST',
  path: '/api/orders',
  statusCode: 500,
  error: error.message,
  stack: error.stack,
  userId: 'user_456'
});

Request/Response Logging Middleware

const { randomUUID } = require('crypto');

const requestLogger = (req, res, next) => {
  const startTime = Date.now();
  // Reuse the caller's request ID when present so logs correlate across services
  const requestId = req.headers['x-request-id'] || randomUUID();
  
  req.requestId = requestId;
  
  // Log request
  logger.info('Incoming request', {
    requestId,
    method: req.method,
    path: req.path,
    query: req.query,
    ip: req.ip,
    userAgent: req.headers['user-agent']
  });
  
  // Log once the response has been sent. Unlike patching res.send, the
  // 'finish' event also covers responses written with res.end() or streams.
  res.on('finish', () => {
    logger.info('Request completed', {
      requestId,
      method: req.method,
      path: req.path,
      statusCode: res.statusCode,
      latencyMs: Date.now() - startTime,
      responseSize: res.get('Content-Length')
    });
  });
  
  next();
};

For a broader framework on designing observable systems, see the API Design Best Practices Guide.

Performance Metrics Collection

Monitoring Pipeline

flowchart LR
    API[Your API] -->|exposes /metrics| P[Prometheus]
    P -->|queried by| G[Grafana Dashboard]
    P -->|evaluates rules| A[Alertmanager]
    A -->|notifies| PD[PagerDuty / Slack]
    API -->|logs| L[Log Aggregator]
    L -->|feeds| G

Custom Metrics with Prometheus

const express = require('express');
const promClient = require('prom-client');

const app = express();
const register = new promClient.Registry();

// Node.js process metrics (CPU, memory, event loop lag, GC)
promClient.collectDefaultMetrics({ register });

// Custom metrics, attached to our registry via the `registers` option
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
  registers: [register]
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
});

// Response sizes form a distribution, so a histogram fits better than a gauge
const apiResponseSize = new promClient.Histogram({
  name: 'api_response_size_bytes',
  help: 'Response size in bytes',
  labelNames: ['method', 'route'],
  buckets: [100, 1000, 10000, 100000, 1000000],
  registers: [register]
});

// Middleware
app.use((req, res, next) => {
  const start = Date.now();
  
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Label with the route pattern (/users/:id), not the raw path, so that
    // label cardinality stays bounded; unmatched requests share one value
    const route = req.route ? req.baseUrl + req.route.path : 'unmatched';
    
    httpRequestDuration.labels(req.method, route, res.statusCode).observe(duration);
    httpRequestTotal.labels(req.method, route, res.statusCode).inc();
    
    // Content-Length is absent for chunked responses; skip those
    const size = Number(res.get('Content-Length'));
    if (Number.isFinite(size)) {
      apiResponseSize.labels(req.method, route).observe(size);
    }
  });
  
  next();
});

// Endpoint scraped by Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

Metrics Dashboard

# Simplified dashboard sketch (Grafana stores real dashboards as JSON)
dashboard:
  title: "API Performance"
  panels:
    - title: "Requests per Second"
      type: "graph"
      targets:
        - expr: "sum(rate(http_requests_total[5m]))"
    
    - title: "Latency (p95)"
      type: "graph"
      targets:
        - expr: "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
    
    - title: "Error Rate"
      type: "graph"
      targets:
        - expr: "sum(rate(http_requests_total{status_code=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))"
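
The p95 panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified model of that idea with hypothetical bucket counts; the real function handles more edge cases.

```javascript
// Estimate a quantile from cumulative histogram buckets.
// buckets: [{ le: upperBound, count: cumulativeCount }], sorted by `le`,
// with a final bucket whose `le` is Infinity.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);

  // If the rank falls in the +Inf bucket, report the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[i - 1].le;

  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - lowerCount;
  const fraction = (rank - lowerCount) / inBucket;
  return lowerBound + (buckets[i].le - lowerBound) * fraction;
}

// Hypothetical cumulative counts for buckets 0.1s, 0.5s, 1s, +Inf
const buckets = [
  { le: 0.1, count: 800 },
  { le: 0.5, count: 940 },
  { le: 1, count: 990 },
  { le: Infinity, count: 1000 },
];

// Rank 950 falls in the (0.5, 1] bucket, 10 of 50 observations in
console.log(estimateQuantile(0.95, buckets)); // ≈ 0.6
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries: place buckets close to the latency thresholds you alert on.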

Error Tracking

Errors fall into two categories: expected (validation failures, 404s, rate limits) and unexpected (exceptions, crashes, infrastructure failures). Monitoring should focus on unexpected errors while tracking expected errors for usage patterns.

A robust error tracking pipeline captures:

  • Exception details: stack trace, error type, message
  • Request context: endpoint, parameters, headers, user ID
  • Environment: deployment version, host, region
  • Breadcrumbs: preceding log events leading to the failure
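
The four groups of context above can be assembled as in the following sketch. The field names, the `REGION` environment variable, and the list of redacted headers are assumptions for illustration; error tracking services collect most of this automatically.

```javascript
const os = require('os');

// Headers that should never be stored with an error report (assumed list)
const SENSITIVE_HEADERS = new Set(['authorization', 'cookie', 'x-api-key']);

function buildErrorReport(error, req, breadcrumbs = []) {
  const headers = Object.fromEntries(
    Object.entries(req.headers || {}).map(([name, value]) =>
      SENSITIVE_HEADERS.has(name.toLowerCase())
        ? [name, '[redacted]']
        : [name, value]
    )
  );

  return {
    exception: { type: error.name, message: error.message, stack: error.stack },
    request: {
      method: req.method,
      path: req.path,
      params: req.params,
      headers,
      userId: req.userId, // hypothetical field set by auth middleware
    },
    environment: {
      version: process.env.npm_package_version,
      host: os.hostname(),
      region: process.env.REGION, // hypothetical environment variable
    },
    // Keep only the most recent events leading up to the failure
    breadcrumbs: breadcrumbs.slice(-20),
  };
}
```

Redacting credentials before a report leaves the process matters because error reports are often stored by third-party services.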

Sentry Integration

const Sentry = require('@sentry/node');

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  release: process.env.npm_package_version
});

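// The Handlers API below matches Sentry SDK v7 and earlier; newer SDK
// versions use a different Express setup, so check the docs for your version.

// Request handler: must be registered before any routes
app.use(Sentry.Handlers.requestHandler());

// Manual error capture
app.get('/api/users/:id', async (req, res, next) => {
  try {
    const user = await getUser(req.params.id);
    if (!user) {
      // Capture with context
      Sentry.captureMessage('User not found', {
        level: 'warning',
        tags: { endpoint: 'getUser', userId: req.params.id }
      });
      return res.status(404).json({ error: 'User not found' });
    }
    res.json(user);
  } catch (error) {
    // Capture with extra context
    Sentry.captureException(error, {
      extra: {
        userId: req.params.id,
        requestId: req.requestId
      }
    });
    // Pass the error to Express; throwing inside an async handler
    // would surface as an unhandled rejection in Express 4
    next(error);
  }
});

// Error handler: must be registered after all routes
app.use(Sentry.Handlers.errorHandler());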

Usage Analytics

API Usage by Client

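// Assumes an ioredis-style client; new Redis() connects to localhost:6379 by default
const Redis = require('ioredis');
const redis = new Redis();

const usageAnalytics = {
  // Date-stamped keys so daily and monthly counters start fresh each period
  keyFor(key, period, now = new Date()) {
    const iso = now.toISOString(); // e.g. 2026-02-26T10:15:00.000Z (UTC)
    return `usage:${key}:${period}:${iso.slice(0, period === 'daily' ? 10 : 7)}`;
  },
  
  // Track usage per API key
  async track(key, endpoint, method, statusCode, latency) {
    await redis.hincrby(this.keyFor(key, 'daily'), endpoint, 1);
    await redis.hincrby(this.keyFor(key, 'monthly'), endpoint, 1);
    
    // Keep the 1,000 most recent latency samples for percentile calculations
    await redis.lpush(`latency:${key}:${endpoint}`, latency);
    await redis.ltrim(`latency:${key}:${endpoint}`, 0, 999);
  },
  
  // Get usage report for the current period
  async getReport(key, period = 'daily') {
    const usage = await redis.hgetall(this.keyFor(key, period));
    // Redis returns hash values as strings, so convert before summing
    const total = Object.values(usage).reduce((a, b) => a + Number(b), 0);
    
    return {
      total,
      byEndpoint: usage,
      period
    };
  }
};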

// Middleware
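app.use((req, res, next) => {
  const start = Date.now();
  
  res.on('finish', async () => {
    const key = req.headers['x-api-key'];
    if (!key) return;
    
    try {
      await usageAnalytics.track(
        key,
        req.path,
        req.method,
        res.statusCode,
        Date.now() - start
      );
    } catch (error) {
      // Analytics failures must never crash the process or affect responses
      console.error('Usage tracking failed:', error.message);
    }
  });
  
  next();
});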
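
The tracking code stores raw latency samples in a Redis list but does not show how to turn them into percentiles. One possible approach, assuming an ioredis-style client and the `latency:<key>:<endpoint>` list layout used above, is sketched here; `summarizeLatencies` and `getLatencySummary` are hypothetical helper names.

```javascript
// Redis returns list items as strings, so convert them before sorting.
// Percentiles use the nearest-rank method.
function summarizeLatencies(rawValues, percentiles = [50, 95, 99]) {
  const sorted = rawValues
    .map(Number)
    .filter(Number.isFinite)
    .sort((a, b) => a - b);

  const summary = { count: sorted.length };
  for (const p of percentiles) {
    const rank = Math.max(Math.ceil((p * sorted.length) / 100), 1);
    summary[`p${p}`] = sorted.length ? sorted[rank - 1] : null;
  }
  return summary;
}

// Fetch the stored samples for one client and endpoint and summarize them.
async function getLatencySummary(redis, key, endpoint) {
  const raw = await redis.lrange(`latency:${key}:${endpoint}`, 0, -1);
  return summarizeLatencies(raw);
}

console.log(summarizeLatencies(['45', '30', '120', '60', '900']));
// { count: 5, p50: 60, p95: 900, p99: 900 }
```

With at most 1,000 samples per list, sorting on read is cheap; for larger volumes a histogram is the more scalable choice.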

Usage Dashboard Data

{
  "period": "2024-01",
  "totalRequests": 1000000,
  "uniqueClients": 500,
  "topEndpoints": [
    { "path": "/api/users", "requests": 250000 },
    { "path": "/api/products", "requests": 180000 },
    { "path": "/api/orders", "requests": 120000 }
  ],
  "errorRatePercent": 0.5,
  "avgLatencyMs": 45,
  "p95LatencyMs": 120,
  "dataTransferMb": 5000
}

Health Checks

Basic Health Endpoint

app.get('/health', (req, res) => {
  res.json({
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    version: process.env.npm_package_version
  });
});

Detailed Health Check

// Mounted on its own path so it does not collide with the basic /health route
app.get('/health/ready', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    cache: await checkCache(),
    external: await checkExternalServices()
  };
  
  const allHealthy = Object.values(checks).every(c => c.healthy);
  
  res.status(allHealthy ? 200 : 503).json({
    status: allHealthy ? 'healthy' : 'unhealthy',
    timestamp: new Date().toISOString(),
    checks
  });
});

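async function checkDatabase() {
  const start = Date.now();
  try {
    await db.query('SELECT 1');
    // Report the measured round-trip time rather than a fixed value
    return { healthy: true, latencyMs: Date.now() - start };
  } catch (error) {
    return { healthy: false, error: error.message };
  }
}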
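
A health endpoint that hangs on a slow dependency is itself a failure mode. The sketch below wraps each check in a timeout and runs them concurrently. `runCheck` and `runAllChecks` are hypothetical helpers, and the 2-second default is an arbitrary assumption.

```javascript
// Run one dependency probe, reporting measured latency and failing
// if it takes longer than timeoutMs. `probe` returns a promise.
async function runCheck(probe, timeoutMs = 2000) {
  const started = Date.now();
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
  });

  try {
    await Promise.race([probe(), timeout]);
    return { healthy: true, latencyMs: Date.now() - started };
  } catch (error) {
    return {
      healthy: false,
      latencyMs: Date.now() - started,
      error: error.message,
    };
  } finally {
    clearTimeout(timer);
  }
}

// Run all probes concurrently so one slow dependency does not delay the rest.
async function runAllChecks(probes, timeoutMs) {
  const names = Object.keys(probes);
  const results = await Promise.all(
    names.map((name) => runCheck(probes[name], timeoutMs))
  );
  return Object.fromEntries(names.map((name, i) => [name, results[i]]));
}
```

In a handler this could be called as `runAllChecks({ database: () => db.query('SELECT 1') })`, with the result placed in the `checks` field of the response.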

Alerting

Alert Rules

# Prometheus alerting rules
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} over last 5 minutes"
      
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "p95 latency is {{ $value }}s"
      
      # Assumes the API exports rate_limit_remaining and rate_limit_total gauges
      - alert: RateLimitNear
        expr: rate_limit_remaining / rate_limit_total < 0.1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Approaching rate limit"
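
The `for:` clause in these rules means the condition must hold continuously for that duration before the alert fires, which filters out brief spikes. A simplified model of that behavior, not Prometheus's actual implementation, looks like this:

```javascript
// Model of an alert rule with a `for` duration: the condition must be true
// on every evaluation for at least `forMs` before the alert fires.
function createAlertState(forMs) {
  let activeSince = null;

  return function evaluate(conditionTrue, nowMs) {
    if (!conditionTrue) {
      activeSince = null; // any false evaluation resets the timer
      return 'inactive';
    }
    if (activeSince === null) activeSince = nowMs;
    return nowMs - activeSince >= forMs ? 'firing' : 'pending';
  };
}

const minute = 60 * 1000;
const evaluate = createAlertState(5 * minute);

console.log(evaluate(true, 0));           // pending
console.log(evaluate(true, 3 * minute));  // pending
console.log(evaluate(false, 4 * minute)); // inactive
console.log(evaluate(true, 5 * minute));  // pending
console.log(evaluate(true, 10 * minute)); // firing
```

A longer `for` reduces noise but delays notification, so critical alerts usually get shorter durations than warnings.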

Conclusion

API monitoring is not a one-time setup—it is an ongoing practice. Start with the four golden signals (latency, traffic, errors, saturation) and layer on API-specific metrics as you identify what matters for your domain. Instrument your code with structured logging and custom metrics from day one; retrofitting observability is much harder than building it in.

Aim for actionable alerts over noisy ones: every alert should require a human response or justify automation. Track your dashboard’s usefulness by how often teams refer to it during incidents.

For more on logging architecture and aggregation, see the API Error Handling Guide. For caching strategies that reduce backend load and improve latency metrics, see the API Caching Strategies Guide. For the full picture on designing maintainable APIs, see the REST API Design Best Practices Guide.
