Monitoring and analytics are essential for maintaining reliable, high-performance APIs. This guide covers metrics collection, logging, error tracking, and building observability into your APIs.
Why API Monitoring Matters
Monitoring tells you whether your API is working, for whom, and how well. Without it, you are flying blind—outages go undetected until users complain, performance regressions compound silently, and capacity planning becomes guesswork. Effective monitoring enables:
- Proactive detection: Find issues before users do, through alerting on early-warning signals
- Usage insights: Understand which endpoints, clients, and patterns drive traffic
- Performance optimization: Identify slow queries, heavy payloads, and inefficient code paths
- Capacity planning: Track growth trends and predict when you will need more resources
- SLA verification: Measure uptime, latency, and error rates against contractual targets
Key Metrics
The Four Golden Signals
Google’s SRE team identified four metrics that capture most failure modes in distributed systems:
Latency measures how long requests take to complete. Track successful and failed requests separately: errors that fail fast pull the average down and can hide a slowdown in successful requests, while slow errors such as timeouts inflate it. Always track tail latency (p95, p99) alongside averages. A 50 ms average can hide the fact that 1% of users experience 2-second responses, since the other 99% only need to average about 30 ms.
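To make the gap between averages and tail latency concrete, here is a minimal sketch using the nearest-rank percentile method. The sample data (97 requests at 30 ms, 3 at 2000 ms) and the helper names `percentile` and `mean` are illustrative assumptions, not part of any library.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(samples) {
  return samples.reduce((sum, value) => sum + value, 0) / samples.length;
}

// Hypothetical sample: 97 fast requests and 3 very slow ones.
const latenciesMs = [...Array(97).fill(30), ...Array(3).fill(2000)];

const summary = {
  mean: mean(latenciesMs),
  p50: percentile(latenciesMs, 50),
  p95: percentile(latenciesMs, 95),
  p99: percentile(latenciesMs, 99)
};
console.log(summary);
```

On this sample the mean is 89.1 ms and both p50 and p95 are 30 ms; only p99 (2000 ms) exposes the slow requests.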
Traffic measures demand on your system: requests per second, concurrent connections, or data throughput. Traffic patterns reveal usage cycles (daily peaks, seasonal spikes) and help size infrastructure. A sudden traffic drop may indicate a client-side issue or a routing problem.
Errors measure failure rates: explicit failures (HTTP 500s), implicit failures (200 OK with wrong data), and infrastructure errors (timeouts, connection resets). Track error rate as a percentage of total requests, broken down by endpoint, status code, and client.
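As a rough illustration of breaking the error rate down by endpoint, the sketch below aggregates a list of request records. The record shape and the choice to count only 5xx responses as errors are assumptions made for this example.

```javascript
// Hypothetical request records, as a logging middleware might emit them.
const requests = [
  { path: '/api/users', statusCode: 200 },
  { path: '/api/users', statusCode: 500 },
  { path: '/api/orders', statusCode: 201 },
  { path: '/api/orders', statusCode: 404 },
  { path: '/api/orders', statusCode: 503 },
  { path: '/api/orders', statusCode: 200 }
];

// Returns { [path]: { total, serverErrors, errorRatePercent } }.
function errorRateByEndpoint(records) {
  const stats = new Map();
  for (const { path, statusCode } of records) {
    const entry = stats.get(path) || { total: 0, serverErrors: 0 };
    entry.total += 1;
    if (statusCode >= 500) entry.serverErrors += 1; // assumption: only 5xx counts as an error
    stats.set(path, entry);
  }
  const result = {};
  for (const [path, { total, serverErrors }] of stats) {
    result[path] = { total, serverErrors, errorRatePercent: (serverErrors / total) * 100 };
  }
  return result;
}

console.log(errorRateByEndpoint(requests));
```

Here `/api/users` comes out at 50% and `/api/orders` at 25%, a difference that a single global error rate would blur.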
Saturation measures how close your system is to its capacity limit. Key indicators include CPU utilization, memory pressure, connection pool usage, and queue depth. Saturation often precedes latency increases and errors—it is your earliest warning signal.
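A minimal sketch of sampling a few saturation indicators from inside a Node.js process, using only the built-in `os` module and `process.memoryUsage()`. The function name and the field names are assumptions for illustration; connection pool usage and queue depth would come from your own pool or queue library.

```javascript
const os = require('os');

// Returns ratios where values approaching 1 suggest the resource is close to its limit.
function saturationSnapshot() {
  const memory = process.memoryUsage();
  const cpuCount = Math.max(os.cpus().length, 1);
  return {
    // 1-minute load average per CPU core (always 0 on Windows)
    loadPerCpu: os.loadavg()[0] / cpuCount,
    // Share of system memory currently in use
    systemMemoryUsedRatio: 1 - os.freemem() / os.totalmem(),
    // Share of the allocated V8 heap currently in use
    heapUsedRatio: memory.heapUsed / memory.heapTotal
  };
}

console.log(saturationSnapshot());
```

Sampling this on an interval and exporting the values as gauges is one way to get an early-warning trend line.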
Essential API Metrics
Beyond the four golden signals, API-specific metrics provide operational insight:
| Category | Metrics | Purpose |
|---|---|---|
| Request | Total requests, requests/sec, request size | Track volume and growth |
| Latency | p50, p95, p99 response time, TTFB, TLS handshake | Identify slowdowns |
| Errors | Error rate %, errors by status/endpoint, error reason | Detect failures |
| Resources | CPU, memory, connection pool, DB query time | Plan capacity |
| Business | Active users, requests per client, top endpoints | Understand usage |
Request Logging
Structured Logging
// Good: Structured JSON logging
const logger = {
  info: (message, meta = {}) => {
    console.log(JSON.stringify({
      timestamp: new Date().toISOString(),
      level: 'INFO',
      message,
      ...meta
    }));
  },
  error: (message, meta = {}) => {
    console.error(JSON.stringify({
      timestamp: new Date().toISOString(),
      level: 'ERROR',
      message,
      ...meta
    }));
  }
};

// Usage
logger.info('API request', {
  method: 'GET',
  path: '/api/users/123',
  statusCode: 200,
  latencyMs: 45,
  userId: 'user_456',
  requestId: 'req_abc123'
});

// In a real handler, `error` is whatever the catch block received
try {
  throw new Error('Database connection refused'); // stand-in failure for illustration
} catch (error) {
  logger.error('Request failed', {
    method: 'POST',
    path: '/api/orders',
    statusCode: 500,
    error: error.message,
    stack: error.stack,
    userId: 'user_456'
  });
}
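The payoff of structured logs is that they can be queried by field rather than by text search. A minimal sketch, assuming log lines in the JSON shape used above; the sample lines and the 500 ms threshold are made up for illustration.

```javascript
// Hypothetical log lines; one is deliberately not JSON.
const lines = [
  '{"level":"INFO","message":"API request","path":"/api/users/123","statusCode":200,"latencyMs":45}',
  '{"level":"ERROR","message":"Request failed","path":"/api/orders","statusCode":500,"latencyMs":1200}',
  'plain text line that is not JSON',
  '{"level":"INFO","message":"API request","path":"/api/orders","statusCode":200,"latencyMs":800}'
];

// Parses JSON log lines and silently skips anything that is not valid JSON.
function parseLogLines(rawLines) {
  const entries = [];
  for (const line of rawLines) {
    try {
      entries.push(JSON.parse(line));
    } catch {
      // Not a structured entry; ignore it
    }
  }
  return entries;
}

const entries = parseLogLines(lines);
// Field-based query: every entry that failed or took longer than 500 ms
const slowOrFailed = entries.filter(e => e.level === 'ERROR' || e.latencyMs > 500);
console.log(slowOrFailed.map(e => `${e.path} ${e.statusCode} ${e.latencyMs}ms`));
```

A log aggregator performs the same kind of filtering at scale, which only works when every entry uses consistent field names.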
Request/Response Logging Middleware
const { randomUUID } = require('crypto');

const requestLogger = (req, res, next) => {
  const startTime = Date.now();
  // Reuse the caller's request ID when present so logs correlate across services
  const requestId = req.headers['x-request-id'] || randomUUID();
  req.requestId = requestId;

  // Log request
  logger.info('Incoming request', {
    requestId,
    method: req.method,
    path: req.path,
    query: req.query,
    ip: req.ip,
    userAgent: req.headers['user-agent']
  });

  // Log when the response finishes. Wrapping res.send instead would log twice
  // for object bodies (res.send calls res.json, which calls res.send again)
  // and would miss responses written with res.end() or streams.
  res.on('finish', () => {
    logger.info('Request completed', {
      requestId,
      method: req.method,
      path: req.path,
      statusCode: res.statusCode,
      latencyMs: Date.now() - startTime,
      responseSize: res.get('Content-Length')
    });
  });

  next();
};
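Attaching the request ID to `req` works inside handlers, but code deeper in the call stack often has no access to `req`. One possible approach, sketched here with Node's built-in `AsyncLocalStorage`, keeps the ID in a request-scoped store. The helper names `withRequestContext`, `logWithContext`, and `loadUser` are hypothetical.

```javascript
const { AsyncLocalStorage } = require('async_hooks');
const { randomUUID } = require('crypto');

const requestContext = new AsyncLocalStorage();

// Runs `handler` with a request-scoped store holding the request ID.
function withRequestContext(headers, handler) {
  const requestId = headers['x-request-id'] || randomUUID();
  return requestContext.run({ requestId }, handler);
}

// Builds and prints a log entry, adding the request ID from the current context if any.
function logWithContext(message, meta = {}) {
  const store = requestContext.getStore();
  const entry = {
    timestamp: new Date().toISOString(),
    message,
    requestId: store ? store.requestId : undefined,
    ...meta
  };
  console.log(JSON.stringify(entry));
  return entry;
}

// Usage: the ID survives an await without being passed as an argument
async function loadUser(id) {
  await new Promise(resolve => setTimeout(resolve, 10));
  return logWithContext('Loaded user', { userId: id });
}

withRequestContext({ 'x-request-id': 'req_abc123' }, () => loadUser('user_456'));
```

In an Express app, the equivalent would be calling `requestContext.run(...)` around `next()` inside the logging middleware.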
For a broader framework on designing observable systems, see the API Design Best Practices Guide.
Performance Metrics Collection
Monitoring Pipeline
flowchart LR
    API[Your API] -->|/metrics scraped by| P[Prometheus]
    P -->|queried by| G[Grafana Dashboard]
    P -->|fires alerts to| A[Alertmanager]
    A -->|notifies| PD[PagerDuty / Slack]
    API -->|logs| L[Log Aggregator]
    L -->|feeds| G
Custom Metrics with Prometheus
const express = require('express');
const promClient = require('prom-client');

const app = express();
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register });

// Custom metrics, attached to the custom registry at creation time
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
  registers: [register]
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
});

// A histogram rather than a gauge: a gauge would keep only the most recent response size
const apiResponseSize = new promClient.Histogram({
  name: 'api_response_size_bytes',
  help: 'Response size in bytes',
  labelNames: ['method', 'route'],
  buckets: [100, 1000, 10000, 100000, 1000000],
  registers: [register]
});

// Middleware
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Use the route pattern (e.g. /api/users/:id), never the raw path: raw paths
    // contain IDs and would create an unbounded number of label values
    const route = req.route ? req.route.path : 'unmatched';
    httpRequestDuration.labels(req.method, route, res.statusCode).observe(duration);
    httpRequestTotal.labels(req.method, route, res.statusCode).inc();
    // Content-Length is absent for chunked responses, so record it only when known
    const size = Number(res.get('Content-Length'));
    if (Number.isFinite(size)) {
      apiResponseSize.labels(req.method, route).observe(size);
    }
  });
  next();
});

// Endpoint that Prometheus scrapes
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
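A histogram does not store individual durations, only cumulative counts per bucket, so percentiles are estimated rather than exact. The sketch below illustrates the linear-interpolation idea that Prometheus documents for `histogram_quantile`; it is a simplified illustration, not Prometheus code, and the bucket counts are invented. It assumes `0 < q <= 1` and at least two buckets, the last with an upper bound of `Infinity`.

```javascript
// buckets: [{ le: upperBound, count: cumulativeCount }, ...], sorted by `le`.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // First bucket whose cumulative count reaches the target rank
  const i = buckets.findIndex(b => b.count >= rank);
  // The +Inf bucket has no upper bound, so fall back to the highest finite bound
  if (buckets[i].le === Infinity) return buckets[i - 1].le;
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const countBelow = i === 0 ? 0 : buckets[i - 1].count;
  const countInBucket = buckets[i].count - countBelow;
  // Assume observations are spread evenly inside the bucket
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - countBelow) / countInBucket);
}

// Hypothetical cumulative counts for 1000 requests, using the bucket bounds above
const buckets = [
  { le: 0.01, count: 100 },
  { le: 0.05, count: 600 },
  { le: 0.1, count: 900 },
  { le: 0.5, count: 980 },
  { le: 1, count: 995 },
  { le: 2, count: 999 },
  { le: 5, count: 1000 },
  { le: Infinity, count: 1000 }
];

const p95 = histogramQuantile(0.95, buckets);
console.log(p95);
```

The estimate here is roughly 0.35 seconds, but all the histogram really knows is that the 950th request fell somewhere between 0.1 s and 0.5 s. Bucket boundaries placed near your latency targets make the estimate much tighter.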
Metrics Dashboard
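# Simplified dashboard outline. Grafana stores dashboards as JSON;
# this YAML only sketches the panels and their queries.
dashboard:
  title: "API Performance"
  panels:
    - title: "Requests per Second"
      type: "graph"
      targets:
        # sum() collapses the per-label series into one total
        - expr: "sum(rate(http_requests_total[5m]))"
    - title: "Latency (p95)"
      type: "graph"
      targets:
        # Aggregate buckets by le so the quantile covers all routes together
        - expr: "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
    - title: "Error Rate"
      type: "graph"
      targets:
        - expr: "sum(rate(http_requests_total{status_code=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))"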
Error Tracking
Errors fall into two categories: expected (validation failures, 404s, rate limits) and unexpected (exceptions, crashes, infrastructure failures). Monitoring should focus on unexpected errors while tracking expected errors for usage patterns.
A robust error tracking pipeline captures:
- Exception details: stack trace, error type, message
- Request context: endpoint, parameters, headers, user ID
- Environment: deployment version, host, region
- Breadcrumbs: preceding log events leading to the failure
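One simple way to separate the two categories is by status code. The sketch below assumes that 4xx responses are expected client-side errors and 5xx responses are unexpected; real services may need a finer mapping, and the function names are hypothetical.

```javascript
// Assumed mapping: 4xx = expected (client-side), 5xx = unexpected (server-side).
function classifyResponse(statusCode) {
  if (statusCode >= 500) return 'unexpected';
  if (statusCode >= 400) return 'expected';
  return 'success';
}

// Counts responses per category, e.g. to alert on one and chart the other.
function summarize(statusCodes) {
  const counts = { success: 0, expected: 0, unexpected: 0 };
  for (const code of statusCodes) {
    counts[classifyResponse(code)] += 1;
  }
  return counts;
}

console.log(summarize([200, 201, 400, 404, 429, 500, 503]));
// → { success: 2, expected: 3, unexpected: 2 }
```

Only the `unexpected` count would normally feed paging alerts, while the `expected` count is better suited to usage dashboards.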
Sentry Integration
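const Sentry = require('@sentry/node');

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  release: process.env.npm_package_version
});

// Request handler: register before any routes.
// Sentry.Handlers is the @sentry/node v7 API; v8 and later use
// Sentry.setupExpressErrorHandler(app) instead.
app.use(Sentry.Handlers.requestHandler());

// Manual error capture
app.get('/api/users/:id', async (req, res, next) => {
  try {
    const user = await getUser(req.params.id);
    if (!user) {
      // Capture with context
      Sentry.captureMessage('User not found', {
        level: 'warning',
        tags: { endpoint: 'getUser', userId: req.params.id }
      });
      // Return a 404 instead of sending an empty body with 200
      return res.status(404).json({ error: 'User not found' });
    }
    res.json(user);
  } catch (error) {
    // Capture with extra context
    Sentry.captureException(error, {
      extra: {
        userId: req.params.id,
        requestId: req.requestId
      }
    });
    // Express 4 does not catch errors thrown from async handlers, so pass it on
    next(error);
  }
});

// Error handler: register after all routes, before any other error middleware
app.use(Sentry.Handlers.errorHandler());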
Usage Analytics
API Usage by Client
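// Assumes `redis` is a connected client with promise-based commands (for example ioredis)
const dayStamp = () => new Date().toISOString().slice(0, 10);  // e.g. 2024-01-15
const monthStamp = () => new Date().toISOString().slice(0, 7); // e.g. 2024-01

const usageAnalytics = {
  // Track usage per API key
  async track(key, endpoint, method, statusCode, latency) {
    // Date-stamped keys give each day and month its own counters
    await redis.hincrby(`usage:${key}:daily:${dayStamp()}`, endpoint, 1);
    await redis.hincrby(`usage:${key}:monthly:${monthStamp()}`, endpoint, 1);
    // Keep the 1000 most recent latency samples for percentile calculations
    await redis.lpush(`latency:${key}:${endpoint}`, latency);
    await redis.ltrim(`latency:${key}:${endpoint}`, 0, 999);
  },
  // Get usage report for the current day or month
  async getReport(key, period = 'daily') {
    const stamp = period === 'daily' ? dayStamp() : monthStamp();
    const usage = await redis.hgetall(`usage:${key}:${period}:${stamp}`);
    // Redis returns hash values as strings; convert before summing
    const total = Object.values(usage).reduce((a, b) => a + Number(b), 0);
    return {
      total,
      byEndpoint: usage,
      period
    };
  }
};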
// Middleware
app.use(async (req, res, next) => {
const start = Date.now();
res.on('finish', async () => {
const key = req.headers['x-api-key'];
if (key) {
await usageAnalytics.track(
key,
req.path,
req.method,
res.statusCode,
Date.now() - start
);
}
});
next();
});
Usage Dashboard Data
{
  "period": "2024-01",
  "totalRequests": 1000000,
  "uniqueClients": 500,
  "topEndpoints": [
    { "path": "/api/users", "requests": 250000 },
    { "path": "/api/products", "requests": 180000 },
    { "path": "/api/orders", "requests": 120000 }
  ],
  "errorRatePercent": 0.5,
  "avgLatencyMs": 45,
  "p95LatencyMs": 120,
  "dataTransferMb": 5000
}
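A summary like this can be derived from per-endpoint counters. A minimal sketch, assuming the counters arrive as an object of string values (the form Redis returns for a hash); the extra `/api/search` entry and the helper names are invented for the example.

```javascript
// Hypothetical per-endpoint counters with string values.
const usageByEndpoint = {
  '/api/products': '180000',
  '/api/users': '250000',
  '/api/search': '90000',
  '/api/orders': '120000'
};

// Returns the `limit` busiest endpoints, highest request count first.
function topEndpoints(usage, limit = 3) {
  return Object.entries(usage)
    .map(([path, count]) => ({ path, requests: Number(count) }))
    .sort((a, b) => b.requests - a.requests)
    .slice(0, limit);
}

function totalRequests(usage) {
  return Object.values(usage).reduce((sum, count) => sum + Number(count), 0);
}

console.log({
  totalRequests: totalRequests(usageByEndpoint),
  topEndpoints: topEndpoints(usageByEndpoint)
});
```

Converting with `Number()` before sorting or summing matters here: comparing or adding the raw strings would give wrong results.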
Health Checks
Basic Health Endpoint
app.get('/health', (req, res) => {
  res.json({
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    version: process.env.npm_package_version
  });
});
Detailed Health Check
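// Mounted on its own path: a second handler on /health would never run,
// because the basic endpoint above already answers that route
app.get('/health/ready', async (req, res) => {
  // Run the checks in parallel so one slow dependency does not delay the others
  const [database, cache, external] = await Promise.all([
    checkDatabase(),
    checkCache(),
    checkExternalServices()
  ]);
  const checks = { database, cache, external };
  const allHealthy = Object.values(checks).every(c => c.healthy);
  res.status(allHealthy ? 200 : 503).json({
    status: allHealthy ? 'healthy' : 'unhealthy',
    timestamp: new Date().toISOString(),
    checks
  });
});

async function checkDatabase() {
  const start = Date.now();
  try {
    await db.query('SELECT 1');
    // Report the measured round-trip time rather than a fixed number
    return { healthy: true, latencyMs: Date.now() - start };
  } catch (error) {
    return { healthy: false, error: error.message };
  }
}
// checkCache() and checkExternalServices() follow the same pattern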
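A dependency that hangs rather than fails can make the health endpoint itself hang. One possible safeguard, sketched below, wraps each check in a timeout. The helper names `withTimeout`, `runCheck`, and `runAllChecks` and the stand-in checks are hypothetical.

```javascript
// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, name) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${name} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Runs one check and converts any failure or timeout into a result object.
async function runCheck(name, checkFn, timeoutMs) {
  const start = Date.now();
  try {
    await withTimeout(checkFn(), timeoutMs, name);
    return { name, healthy: true, latencyMs: Date.now() - start };
  } catch (error) {
    return { name, healthy: false, error: error.message };
  }
}

// Runs all checks in parallel; `checks` maps a name to an async function.
async function runAllChecks(checks, timeoutMs = 1000) {
  const results = await Promise.all(
    Object.entries(checks).map(([name, fn]) => runCheck(name, fn, timeoutMs))
  );
  return { healthy: results.every(r => r.healthy), results };
}

// Usage with stand-in checks: the "cache" check takes longer than the 50 ms limit
runAllChecks({
  database: async () => {},
  cache: () => new Promise(resolve => setTimeout(resolve, 200))
}, 50).then(report => console.log(JSON.stringify(report, null, 2)));
```

The timeout should stay well below the probe timeout of whatever load balancer or orchestrator calls the endpoint, so the caller receives a 503 with details instead of a dropped connection.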
Alerting
Alert Rules
# Prometheus alerting rules
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        # The label is status_code, matching the counter defined in the middleware
        expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          # $value is a ratio (0.05 = 5%), so format it as a percentage
          description: "Error rate is {{ $value | humanizePercentage }} over last 5 minutes"
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "p95 latency is {{ $value }}s"
      - alert: RateLimitNear
        # Assumes the API exports rate_limit_remaining and rate_limit_total gauges
        expr: rate_limit_remaining / rate_limit_total < 0.1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Approaching rate limit"
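The `for` clause is what keeps short spikes from paging anyone: the condition must hold on every evaluation for the full duration before the alert fires. The simulation below is a simplified sketch of that behavior, not Prometheus code; the function name and the sample values are invented.

```javascript
// Returns the alert state after each evaluation:
// 'inactive' (condition false), 'pending' (true, but not yet for `forMs`), or 'firing'.
function evaluateAlert(samples, threshold, forMs) {
  let pendingSince = null;
  return samples.map(({ timeMs, value }) => {
    if (value <= threshold) {
      pendingSince = null; // any evaluation below the threshold resets the clock
      return 'inactive';
    }
    if (pendingSince === null) pendingSince = timeMs;
    return timeMs - pendingSince >= forMs ? 'firing' : 'pending';
  });
}

const minute = 60 * 1000;
// Hypothetical error ratios, evaluated once per minute
const samples = [0.01, 0.08, 0.09, 0.02, 0.07, 0.07, 0.08, 0.09, 0.06, 0.07]
  .map((value, i) => ({ timeMs: i * minute, value }));

const states = evaluateAlert(samples, 0.05, 5 * minute);
console.log(states);
```

The two-minute spike at the start never leaves `pending`, and the dip at minute 3 resets the timer. Only the sustained breach starting at minute 4 reaches `firing`, at minute 9.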
Conclusion
API monitoring is not a one-time setup—it is an ongoing practice. Start with the four golden signals (latency, traffic, errors, saturation) and layer on API-specific metrics as you identify what matters for your domain. Instrument your code with structured logging and custom metrics from day one; retrofitting observability is much harder than building it in.
Aim for actionable alerts over noisy ones: every alert should require a human response or justify automation. Track your dashboard’s usefulness by how often teams refer to it during incidents.
For more on logging architecture and aggregation, see the API Error Handling Guide. For caching strategies that reduce backend load and improve latency metrics, see the API Caching Strategies Guide. For the full picture on designing maintainable APIs, see the REST API Design Best Practices Guide.
Resources
- Prometheus Documentation - Metrics collection and alerting
- Grafana Documentation - Dashboard and visualization
- OpenTelemetry - Distributed tracing and observability
- The Four Golden Signals (Google SRE) - Foundational monitoring principles