Introduction
Monitoring tells you when something is wrong. Profiling tells you why. Together they give you the visibility to keep production systems healthy and fast.
Monitoring = continuous observation of system health (metrics, alerts, dashboards).
Profiling = deep analysis of performance characteristics (CPU, memory, I/O).
APM: Application Performance Monitoring
APM tools automatically instrument your application to collect traces, metrics, and errors with minimal code changes.
Datadog APM
npm install dd-trace
// Must be the FIRST import in your entry file
import tracer from 'dd-trace';
tracer.init({
service: 'api-server',
env: process.env.NODE_ENV,
version: process.env.APP_VERSION,
logInjection: true, // adds trace IDs to logs
runtimeMetrics: true, // CPU, memory, event loop
});
// Everything after this is automatically instrumented:
// - HTTP requests (Express, Fastify, Koa)
// - Database queries (pg, mysql2, mongoose)
// - Redis operations
// - External HTTP calls (axios, node-fetch)
import express from 'express';
What Datadog captures automatically:
- Every HTTP request with duration, status code, route
- Database queries with query text and duration
- Redis operations
- External API calls
- Error stack traces with context
Custom Spans
import tracer from 'dd-trace';
async function processOrder(orderId) {
// Create a custom span for business logic
const span = tracer.startSpan('order.process');
span.setTag('order.id', orderId);
try {
const order = await db.getOrder(orderId);
span.setTag('order.total', order.total);
await chargePayment(order);
await updateInventory(order);
span.setTag('order.status', 'completed');
return order;
} catch (err) {
span.setTag('error', true);
span.setTag('error.message', err.message);
throw err;
} finally {
span.finish();
}
}
New Relic
npm install newrelic
// newrelic.js (config file)
exports.config = {
app_name: ['My Application'],
license_key: process.env.NEW_RELIC_LICENSE_KEY,
logging: { level: 'info' },
allow_all_headers: true,
distributed_tracing: { enabled: true },
};
// Entry file: require newrelic FIRST
require('newrelic');
const express = require('express');
OpenTelemetry (Vendor-Neutral)
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
npm install @opentelemetry/exporter-trace-otlp-http
// tracing.js: initialize before anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'api-server',
[SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
}),
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4318/v1/traces',
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': { enabled: false }, // too noisy
}),
],
});
sdk.start();
process.on('SIGTERM', () => sdk.shutdown());
Node.js Profiling with clinic.js
clinic.js is a suite of tools for diagnosing Node.js performance issues:
npm install -g clinic
# Doctor: overall health check - identifies the type of problem
clinic doctor -- node server.js
# Flame: CPU profiling - shows where time is spent
clinic flame -- node server.js
# Bubbleprof: async profiling - shows I/O bottlenecks
clinic bubbleprof -- node server.js
Workflow:
- Run clinic doctor first - it tells you which tool to use next
- Generate load while the tool is running: ab -n 1000 -c 10 http://localhost:3000/api/users
- Stop the server (Ctrl+C) - clinic generates an HTML report
Reading Flame Graphs
Wide bars = more CPU time spent here
Tall stacks = deep call chains
Look for:
- Wide bars near the top = hot functions (optimize these)
- Unexpected library code taking time
- Synchronous operations that should be async
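A classic wide-bar culprit is synchronous crypto on the request path. In this sketch, the sync variant blocks the event loop for the full key derivation (one wide bar on the flame graph, every concurrent request stalled), while the async variant runs in libuv's thread pool (the salt and iteration count here are for illustration only):

```javascript
import crypto from 'node:crypto';

// Blocks the event loop for the entire derivation - shows as one wide bar
// on a flame graph, and every concurrent request waits behind it.
function hashPasswordSync(password) {
  return crypto.pbkdf2Sync(password, 'salt', 100_000, 64, 'sha512').toString('hex');
}

// The callback variant runs in libuv's thread pool, keeping the loop free.
function hashPassword(password) {
  return new Promise((resolve, reject) => {
    crypto.pbkdf2(password, 'salt', 100_000, 64, 'sha512', (err, key) =>
      err ? reject(err) : resolve(key.toString('hex'))
    );
  });
}
```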
Built-in Node.js Profiler
# Generate V8 CPU profile
node --prof server.js
# Run load test while profiling
ab -n 1000 -c 10 http://localhost:3000/api/users
# Process the profile (creates readable output)
node --prof-process isolate-*.log > profile.txt
# Look for "Heavy (bottom up)" section
head -100 profile.txt
Memory Profiling
// Detect memory leaks with periodic heap snapshots
import v8 from 'v8';
import fs from 'fs';
function takeHeapSnapshot() {
const filename = `heap-${Date.now()}.heapsnapshot`;
const snapshot = v8.writeHeapSnapshot(filename);
console.log(`Heap snapshot written to ${snapshot}`);
}
// Take snapshot every 5 minutes in development
if (process.env.NODE_ENV === 'development') {
setInterval(takeHeapSnapshot, 5 * 60 * 1000);
}
# Open in Chrome DevTools:
# DevTools -> Memory -> Load profile -> select .heapsnapshot file
# Compare two snapshots to find what's growing
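The kind of leak that snapshot comparison surfaces most often is an unbounded module-level cache: the Map's retained size simply grows between captures. A hypothetical sketch of the leak and a simple fix (FIFO eviction shown for brevity; an LRU is the usual production choice):

```javascript
const responseCache = new Map();

// Leaks: entries are added but never evicted, so the Map grows forever.
function cacheResponse(key, value) {
  responseCache.set(key, value);
}

// Fix: bound the cache. Map iterates in insertion order, so the first key
// is the oldest entry - delete it before inserting when at capacity.
const MAX_ENTRIES = 1000;
function cacheResponseBounded(key, value) {
  if (responseCache.size >= MAX_ENTRIES) {
    const oldestKey = responseCache.keys().next().value;
    responseCache.delete(oldestKey);
  }
  responseCache.set(key, value);
}
```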
Event Loop Monitoring
// Monitor event loop lag: high lag = blocked event loop
import { monitorEventLoopDelay } from 'perf_hooks';
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();
setInterval(() => {
const lag = histogram.percentile(99) / 1e6; // convert ns to ms
if (lag > 100) {
console.warn(`Event loop lag p99: ${lag.toFixed(2)}ms - possible blocking operation`);
}
histogram.reset();
}, 10000);
Custom Metrics with prom-client
import { Registry, Histogram, Counter, Gauge } from 'prom-client';
const registry = new Registry();
// Track external API call performance
const externalApiDuration = new Histogram({
name: 'external_api_duration_seconds',
help: 'Duration of external API calls',
labelNames: ['service', 'endpoint', 'status'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
registers: [registry],
});
// Wrap external calls with timing
async function callExternalAPI(service, endpoint, fn) {
const end = externalApiDuration.startTimer({ service, endpoint });
try {
const result = await fn();
end({ status: 'success' });
return result;
} catch (err) {
end({ status: 'error' });
throw err;
}
}
// Usage
const user = await callExternalAPI('user-service', '/users', () =>
fetch('http://user-service/users/42').then(r => r.json())
);
Error Tracking with Sentry
npm install @sentry/node @sentry/profiling-node
import * as Sentry from '@sentry/node';
import { nodeProfilingIntegration } from '@sentry/profiling-node';
Sentry.init({
dsn: process.env.SENTRY_DSN,
environment: process.env.NODE_ENV,
release: process.env.APP_VERSION,
integrations: [
nodeProfilingIntegration(),
],
tracesSampleRate: 0.1, // sample 10% of transactions
profilesSampleRate: 0.1, // sample 10% for profiling
});
// Express error handler - register after all routes (SDK v8+;
// older SDKs used app.use(Sentry.Handlers.errorHandler()) instead)
Sentry.setupExpressErrorHandler(app);
// Capture errors manually
try {
riskyOperation();
} catch (err) {
Sentry.captureException(err, {
extra: { userId, orderId },
tags: { component: 'payment' },
});
throw err;
}
Choosing the Right Tool
| Need | Tool |
|---|---|
| Full APM with traces + metrics + logs | Datadog, New Relic |
| Open source, self-hosted | Prometheus + Grafana + Jaeger |
| Vendor-neutral instrumentation | OpenTelemetry |
| Node.js CPU profiling | clinic flame |
| Node.js I/O bottlenecks | clinic bubbleprof |
| Memory leak detection | Chrome DevTools heap snapshots |
| Error tracking | Sentry |
| Load testing + profiling | k6 + clinic |
Monitoring Checklist
- APM agent installed and sending traces
- Error tracking configured (Sentry or equivalent)
- Custom business metrics instrumented
- Event loop lag monitored
- Memory usage tracked over time
- Alerts set for p95 latency, error rate, memory
- Dashboards for key services
- Profiling run on production load patterns