API Monitoring & Analytics: Complete Guide

Monitoring and analytics are essential for maintaining reliable, high-performance APIs. This guide covers metrics collection, logging, error tracking, and building observability into your APIs.

Why API Monitoring Matters

Monitoring tells you whether your API is working, for whom, and how well. Without it, you are flying blind—outages go undetected until users complain, performance regressions compound silently, and capacity planning becomes guesswork. Effective monitoring enables:

Proactive detection: Find issues before users do, through alerting on early-warning signals
Usage insights: Understand which endpoints, clients, and patterns drive traffic
Performance optimization: Identify slow queries, heavy payloads, and inefficient code paths
Capacity planning: Track growth trends and predict when you will need more resources
SLA verification: Measure uptime, latency, and error rates against contractual targets

Key Metrics

The Four Golden Signals

Google’s SRE team identified four metrics that capture most failure modes in distributed systems:

Latency measures how long requests take to complete. Track the latency of successful and failed requests separately: failures often return quickly and pull the overall average down, which can hide a slowdown in the requests that succeed. Always track tail latency (p95, p99) alongside averages, because a 50 ms average can hide that 1% of users experience 2-second responses.
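To make the gap between averages and tail latency concrete, here is a minimal sketch using the nearest-rank percentile method. The `percentile` and `mean` helpers and the sample latencies are illustrative assumptions, not part of any library.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(samples) {
  return samples.reduce((sum, value) => sum + value, 0) / samples.length;
}

// Hypothetical latencies in ms: 94 fast requests and 6 slow ones.
const latencies = [...Array(94).fill(40), ...Array(6).fill(2000)];

const summary = {
  mean: mean(latencies),
  p50: percentile(latencies, 50),
  p95: percentile(latencies, 95),
  p99: percentile(latencies, 99),
};

console.log(summary);
```

In this made-up data set the median stays at the fast value while p95 and p99 land on the slow requests, which is the pattern an average alone would blur.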

Traffic measures demand on your system: requests per second, concurrent connections, or data throughput. Traffic patterns reveal usage cycles (daily peaks, seasonal spikes) and help size infrastructure. A sudden traffic drop may indicate a client-side issue or a routing problem.
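As a rough illustration of measuring traffic in-process, the sketch below counts requests in one-second buckets and reports the average rate over a sliding window. The `RateTracker` class is a hypothetical helper written for this example; in production a metrics system would normally compute this rate from a counter.

```javascript
// Sliding-window request rate, bucketed by whole seconds.
class RateTracker {
  constructor(windowSeconds = 60) {
    this.windowSeconds = windowSeconds;
    this.buckets = new Map(); // second (epoch) -> request count
  }

  // Record one request at the given time (defaults to now).
  record(nowMs = Date.now()) {
    const second = Math.floor(nowMs / 1000);
    this.buckets.set(second, (this.buckets.get(second) || 0) + 1);
    this.prune(second);
  }

  // Drop buckets that have fallen out of the window.
  prune(currentSecond) {
    for (const second of this.buckets.keys()) {
      if (second <= currentSecond - this.windowSeconds) {
        this.buckets.delete(second);
      }
    }
  }

  // Average requests per second across the window.
  ratePerSecond(nowMs = Date.now()) {
    this.prune(Math.floor(nowMs / 1000));
    let total = 0;
    for (const count of this.buckets.values()) total += count;
    return total / this.windowSeconds;
  }
}

// Usage: call record() once per request, read ratePerSecond() when reporting.
const tracker = new RateTracker(10);
tracker.record();
console.log(tracker.ratePerSecond());
```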

Errors measure failure rates: explicit failures (HTTP 500s), implicit failures (200 OK with wrong data), and infrastructure errors (timeouts, connection resets). Track error rate as a percentage of total requests, broken down by endpoint, status code, and client.
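The per-endpoint breakdown described above can be sketched as follows. The record shape (`endpoint`, `statusCode`, `timedOut`) and the `errorRateByEndpoint` helper are assumptions made for this example; implicit failures such as a 200 OK with wrong data cannot be detected from status codes and need separate validation.

```javascript
// A request counts as failed if the server returned a 5xx status
// or the request never completed (timeout).
function isFailure(record) {
  return record.timedOut === true || record.statusCode >= 500;
}

// Aggregate total requests, failures, and error rate (%) per endpoint.
function errorRateByEndpoint(records) {
  const stats = new Map();
  for (const record of records) {
    const entry = stats.get(record.endpoint) || { total: 0, errors: 0 };
    entry.total += 1;
    if (isFailure(record)) entry.errors += 1;
    stats.set(record.endpoint, entry);
  }

  const result = {};
  for (const [endpoint, { total, errors }] of stats) {
    result[endpoint] = { total, errors, errorRatePct: (errors / total) * 100 };
  }
  return result;
}

// Hypothetical sample of request records.
const sample = [
  { endpoint: '/api/users', statusCode: 200 },
  { endpoint: '/api/users', statusCode: 404 },
  { endpoint: '/api/users', statusCode: 500 },
  { endpoint: '/api/users', statusCode: 200 },
  { endpoint: '/api/orders', statusCode: null, timedOut: true },
];

console.log(errorRateByEndpoint(sample));
```

Note that the 404 is not counted as a failure here; whether client errors belong in the error rate is a choice each team has to make.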

Saturation measures how close your system is to its capacity limit. Key indicators include CPU utilization, memory pressure, connection pool usage, and queue depth. Saturation often precedes latency increases and errors—it is your earliest warning signal.
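A minimal sketch of sampling a few of these saturation indicators from inside a Node.js process is shown below. The `pool` argument is a hypothetical object with `inUse` and `max` fields; real connection pools expose this information under different names.

```javascript
const os = require('os');

// Take a point-in-time reading of several saturation indicators.
function saturationSnapshot(pool) {
  const { heapUsed, heapTotal } = process.memoryUsage();
  const cores = Math.max(os.cpus().length, 1);
  return {
    // 1-minute load average per core (os.loadavg() reports 0 on Windows)
    cpuLoadPerCore: os.loadavg()[0] / cores,
    // Fraction of the currently allocated V8 heap that is in use
    heapUsedRatio: heapUsed / heapTotal,
    // Fraction of pooled connections currently checked out
    poolUsedRatio: pool.inUse / pool.max,
  };
}

// Usage with a made-up pool state.
console.log(saturationSnapshot({ inUse: 8, max: 10 }));
```

Values that trend toward 1.0 are worth alerting on before they show up as latency or errors.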

Essential API Metrics

Beyond the four golden signals, API-specific metrics provide operational insight:

Category	Metrics	Purpose
Request	Total requests, requests/sec, request size	Track volume and growth
Latency	p50, p95, p99 response time, TTFB, TLS handshake	Identify slowdowns
Errors	Error rate %, errors by status/endpoint, error reason	Detect failures
Resources	CPU, memory, connection pool, DB query time	Plan capacity
Business	Active users, requests per client, top endpoints	Understand usage

Request Logging

Structured Logging

// Good: Structured JSON logging
const logger = {
  info: (message, meta = {}) => {
    console.log(JSON.stringify({
      timestamp: new Date().toISOString(),
      level: 'INFO',
      message,
      ...meta
    }));
  },
  
  error: (message, meta = {}) => {
    console.error(JSON.stringify({
      timestamp: new Date().toISOString(),
      level: 'ERROR',
      message,
      ...meta
    }));
  }
};

// Usage
logger.info('API request', {
  method: 'GET',
  path: '/api/users/123',
  statusCode: 200,
  latencyMs: 45,
  userId: 'user_456',
  requestId: 'req_abc123'
});

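try {
  // Stand-in for any operation that can throw
  throw new Error('Database connection lost');
} catch (error) {
  logger.error('Request failed', {
    method: 'POST',
    path: '/api/orders',
    statusCode: 500,
    error: error.message,
    stack: error.stack,
    userId: 'user_456',
    requestId: 'req_abc123'
  });
}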

Request/Response Logging Middleware

const { randomUUID } = require('crypto');

const requestLogger = (req, res, next) => {
  const startTime = Date.now();
  // Reuse the caller's ID when present so logs correlate across services
  const requestId = req.headers['x-request-id'] || randomUUID();
  
  req.requestId = requestId;
  res.setHeader('X-Request-Id', requestId);
  
  // Log request
  logger.info('Incoming request', {
    requestId,
    method: req.method,
    path: req.path,
    query: req.query,
    ip: req.ip,
    userAgent: req.headers['user-agent']
  });
  
  // Log on 'finish' rather than wrapping res.send, so responses sent with
  // res.end() or streams are captured and Content-Length is already set
  res.on('finish', () => {
    logger.info('Request completed', {
      requestId,
      method: req.method,
      path: req.path,
      statusCode: res.statusCode,
      latencyMs: Date.now() - startTime,
      responseSize: res.getHeader('content-length')
    });
  });
  
  next();
};

For a broader framework on designing observable systems, see the API Design Best Practices Guide.

Performance Metrics Collection

Monitoring Pipeline

flowchart LR
    API[Your API] -->|/metrics scraped by| P[Prometheus]
    P -->|queried by| G[Grafana Dashboard]
    P -->|fires alerts to| A[Alertmanager]
    A -->|notifies| PD[PagerDuty / Slack]
    API -->|logs| L[Log Aggregator]
    L -->|queried by| G

Custom Metrics with Prometheus

const promClient = require('prom-client');

const register = new promClient.Registry();

promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

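// Response size is a distribution, so use a Histogram (a Gauge would only
// keep the most recent value per label set)
const apiResponseSize = new promClient.Histogram({
  name: 'api_response_size_bytes',
  help: 'Response size in bytes',
  labelNames: ['method', 'route'],
  buckets: [100, 1000, 10000, 100000, 1000000]
});

register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
register.registerMetric(apiResponseSize);

// Middleware
app.use((req, res, next) => {
  const start = Date.now();
  
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Label with the route template (e.g. /api/users/:id), not the raw path,
    // so label cardinality stays bounded
    const route = req.route ? req.baseUrl + req.route.path : 'unmatched';
    
    httpRequestDuration.labels(req.method, route, res.statusCode).observe(duration);
    httpRequestTotal.labels(req.method, route, res.statusCode).inc();
    
    // Content-Length is absent for chunked responses; skip those
    const size = Number(res.getHeader('content-length'));
    if (Number.isFinite(size)) {
      apiResponseSize.labels(req.method, route).observe(size);
    }
  });
  
  next();
});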

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

Metrics Dashboard

# Simplified, illustrative panel definitions (Grafana's native dashboard format is JSON)
dashboard:
  title: "API Performance"
  panels:
    - title: "Requests per Second"
      type: "graph"
      targets:
        - expr: "sum(rate(http_requests_total[5m]))"
    
    - title: "Latency (p95)"
      type: "graph"
      targets:
        - expr: "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
    
    - title: "Error Rate"
      type: "graph"
      targets:
        - expr: "sum(rate(http_requests_total{status_code=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))"

Error Tracking

Errors fall into two categories: expected (validation failures, 404s, rate limits) and unexpected (exceptions, crashes, infrastructure failures). Monitoring should focus on unexpected errors while tracking expected errors for usage patterns.

A robust error tracking pipeline captures:

Exception details: stack trace, error type, message
Request context: endpoint, parameters, headers, user ID
Environment: deployment version, host, region
Breadcrumbs: preceding log events leading to the failure
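The four kinds of context listed above can be gathered into a single event object before it is sent to a tracker. The sketch below is one possible shape, not a standard: the `buildErrorEvent` helper, the `EXPECTED_STATUS` set, and the `APP_VERSION` and `APP_REGION` environment variables are all assumptions made for illustration.

```javascript
const os = require('os');

// Status codes treated as expected outcomes (an assumption; tune per API).
const EXPECTED_STATUS = new Set([400, 401, 403, 404, 409, 422, 429]);

function buildErrorEvent(error, req, breadcrumbs = []) {
  // Errors without an explicit status are treated as server failures.
  const statusCode = error.statusCode || 500;
  return {
    kind: EXPECTED_STATUS.has(statusCode) ? 'expected' : 'unexpected',
    exception: { type: error.name, message: error.message, stack: error.stack },
    request: {
      method: req.method,
      path: req.path,
      userId: req.userId,
      requestId: req.requestId,
    },
    environment: {
      version: process.env.APP_VERSION || 'unknown',
      host: os.hostname(),
      region: process.env.APP_REGION || 'unknown',
    },
    // Keep only the most recent log events leading up to the failure.
    breadcrumbs: breadcrumbs.slice(-20),
  };
}

// Usage with a made-up request and error.
const notFound = Object.assign(new Error('User not found'), { statusCode: 404 });
const event = buildErrorEvent(
  notFound,
  { method: 'GET', path: '/api/users/123', requestId: 'req_abc123' },
  ['Incoming request', 'Cache miss']
);
console.log(event.kind, event.exception.message);
```

Tagging each event as expected or unexpected at capture time makes it easier to alert only on the unexpected ones.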

Sentry Integration

const Sentry = require('@sentry/node');

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  release: process.env.npm_package_version
});

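// Note: Sentry.Handlers is the @sentry/node v7 API. Newer SDK versions
// replace it with Sentry.setupExpressErrorHandler(app); check the docs
// for the version you have installed.

// Request handler must be registered before any routes
app.use(Sentry.Handlers.requestHandler());

// Manual error capture
app.get('/api/users/:id', async (req, res, next) => {
  try {
    const user = await getUser(req.params.id);
    if (!user) {
      // Expected condition: record it with context, then return a 404
      Sentry.captureMessage('User not found', {
        level: 'warning',
        tags: { endpoint: 'getUser', userId: req.params.id }
      });
      return res.status(404).json({ error: 'User not found' });
    }
    res.json(user);
  } catch (error) {
    // Capture with extra context
    Sentry.captureException(error, {
      extra: {
        userId: req.params.id,
        requestId: req.requestId
      }
    });
    // Express 4 does not catch errors thrown in async handlers,
    // so forward the error explicitly
    next(error);
  }
});

// Error handler must be registered after all routes
app.use(Sentry.Handlers.errorHandler());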

Usage Analytics

API Usage by Client

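// Assumes `redis` is a connected client whose commands return promises (e.g. ioredis)

// Date-stamped suffix so each day and month gets its own hash:
// 'daily' -> 2024-01-15, 'monthly' -> 2024-01 (UTC)
const periodSuffix = (period) =>
  new Date().toISOString().slice(0, period === 'monthly' ? 7 : 10);

const usageAnalytics = {
  // Track usage per API key
  async track(key, endpoint, method, statusCode, latency) {
    await redis.hincrby(`usage:${key}:daily:${periodSuffix('daily')}`, endpoint, 1);
    await redis.hincrby(`usage:${key}:monthly:${periodSuffix('monthly')}`, endpoint, 1);
    
    // Keep the 1,000 most recent latency samples for percentile calculations
    await redis.lpush(`latency:${key}:${endpoint}`, latency);
    await redis.ltrim(`latency:${key}:${endpoint}`, 0, 999);
  },
  
  // Get usage report for the current day or month
  async getReport(key, period = 'daily') {
    const usage = await redis.hgetall(`usage:${key}:${period}:${periodSuffix(period)}`);
    // Redis returns hash values as strings, so convert before summing
    const byEndpoint = Object.fromEntries(
      Object.entries(usage).map(([endpoint, count]) => [endpoint, Number(count)])
    );
    const total = Object.values(byEndpoint).reduce((a, b) => a + b, 0);
    
    return { total, byEndpoint, period };
  }
};

// Middleware
app.use((req, res, next) => {
  const start = Date.now();
  
  res.on('finish', async () => {
    const key = req.headers['x-api-key'];
    if (!key) return;
    try {
      await usageAnalytics.track(key, req.path, req.method, res.statusCode, Date.now() - start);
    } catch (error) {
      // A tracking failure must not crash the process or affect the response
      logger.error('Usage tracking failed', { error: error.message });
    }
  });
  
  next();
});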

Usage Dashboard Data

{
  "period": "2024-01",
  "totalRequests": 1000000,
  "uniqueClients": 500,
  "topEndpoints": [
    { "path": "/api/users", "requests": 250000 },
    { "path": "/api/products", "requests": 180000 },
    { "path": "/api/orders", "requests": 120000 }
  ],
  "errorRate": 0.5,
  "avgLatencyMs": 45,
  "p95LatencyMs": 120,
  "dataTransferMb": 5000
}

Health Checks

Basic Health Endpoint

app.get('/health', (req, res) => {
  res.json({
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    version: process.env.npm_package_version
  });
});

Detailed Health Check

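// Alternative to the basic endpoint above; register only one handler per path
app.get('/health', async (req, res) => {
  // Run dependency checks concurrently so one slow check does not delay the rest
  const [database, cache, external] = await Promise.all([
    checkDatabase(),
    checkCache(),
    checkExternalServices()
  ]);
  const checks = { database, cache, external };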
  
  const allHealthy = Object.values(checks).every(c => c.healthy);
  
  res.status(allHealthy ? 200 : 503).json({
    status: allHealthy ? 'healthy' : 'unhealthy',
    timestamp: new Date().toISOString(),
    checks
  });
});

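async function checkDatabase() {
  const start = Date.now();
  try {
    await db.query('SELECT 1');
    // Report the measured round-trip time rather than a fixed value
    return { healthy: true, latencyMs: Date.now() - start };
  } catch (error) {
    return { healthy: false, error: error.message };
  }
}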
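A dependency that hangs can stall the whole health endpoint, so each check usually benefits from a time limit. The sketch below wraps any probe in a timeout and reports measured latency; `withTimeout` and `runCheck` are hypothetical helpers, and the 2-second default is an arbitrary assumption.

```javascript
// Reject if the promise does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run a probe function and convert its outcome into a health result.
async function runCheck(probe, timeoutMs = 2000) {
  const start = Date.now();
  try {
    await withTimeout(probe(), timeoutMs);
    return { healthy: true, latencyMs: Date.now() - start };
  } catch (error) {
    return { healthy: false, error: error.message };
  }
}

// Usage: a stand-in probe; a real one might ping a cache or call a dependency.
runCheck(async () => 'pong', 500).then((result) => console.log(result));
```

With this shape, a cache or external-service check can reuse the same wrapper and differ only in the probe it passes in.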

Alerting

Alert Rules

# Prometheus alerting rules
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        # The label must match the one set by the metrics middleware (status_code)
        expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} over last 5 minutes"
      
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "p95 latency is {{ $value }}s"
      
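      # Assumes the API exports rate_limit_remaining and rate_limit_total gauges;
      # they are not defined in the metrics examples in this guide
      - alert: RateLimitNear
        expr: rate_limit_remaining / rate_limit_total < 0.1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Approaching rate limit"
          description: "Less than 10% of the rate limit remains"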

Conclusion

API monitoring is not a one-time setup—it is an ongoing practice. Start with the four golden signals (latency, traffic, errors, saturation) and layer on API-specific metrics as you identify what matters for your domain. Instrument your code with structured logging and custom metrics from day one; retrofitting observability is much harder than building it in.

Aim for actionable alerts over noisy ones: every alert should require a human response or justify automation. Track your dashboard’s usefulness by how often teams refer to it during incidents.

For more on logging architecture and aggregation, see the API Error Handling Guide. For caching strategies that reduce backend load and improve latency metrics, see the API Caching Strategies Guide. For the full picture on designing maintainable APIs, see the REST API Design Best Practices Guide.

Resources

Prometheus Documentation - Metrics collection and alerting
Grafana Documentation - Dashboard and visualization
OpenTelemetry - Distributed tracing and observability
The Four Golden Signals (Google SRE) - Foundational monitoring principles
