Introduction
Monitoring tells you when something is wrong. Profiling tells you why. Together they give you the visibility to keep production systems healthy and fast. See Javascript Guide for more context.
- Monitoring = continuous observation of system health (metrics, alerts, dashboards)
- Profiling = deep analysis of performance characteristics (CPU, memory, I/O)
APM: Application Performance Monitoring
APM tools automatically instrument your application to collect traces, metrics, and errors with minimal code changes.
Datadog APM
npm install dd-trace
// Must be the FIRST import in your entry file
import tracer from 'dd-trace';
tracer.init({
service: 'api-server',
env: process.env.NODE_ENV,
version: process.env.APP_VERSION,
logInjection: true, // adds trace IDs to logs
runtimeMetrics: true, // CPU, memory, event loop
});
// Everything after this is automatically instrumented:
// - HTTP requests (Express, Fastify, Koa)
// - Database queries (pg, mysql2, mongoose)
// - Redis operations
// - External HTTP calls (axios, node-fetch)
import express from 'express';
What Datadog captures automatically:
- Every HTTP request with duration, status code, route
- Database queries with query text and duration
- Redis operations
- External API calls
- Error stack traces with context
Custom Spans
import tracer from 'dd-trace';
async function processOrder(orderId) {
// Create a custom span for business logic
const span = tracer.startSpan('order.process');
span.setTag('order.id', orderId);
try {
const order = await db.getOrder(orderId);
span.setTag('order.total', order.total);
await chargePayment(order);
await updateInventory(order);
span.setTag('order.status', 'completed');
return order;
} catch (err) {
span.setTag('error', true);
span.setTag('error.message', err.message);
throw err;
} finally {
span.finish();
}
}
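The try/catch/finally bookkeeping above can also be delegated to tracer.trace(), which creates the span, runs the callback, and handles finish() and error tagging itself when the callback returns a promise. A sketch using the same hypothetical db, chargePayment, and updateInventory helpers as the example above:

```javascript
import tracer from 'dd-trace';

// tracer.trace() finishes the span for us. If the callback
// returns a promise, the span ends when it settles, and a
// rejection is tagged as an error automatically.
async function processOrder(orderId) {
  return tracer.trace(
    'order.process',
    { tags: { 'order.id': orderId } },
    async (span) => {
      const order = await db.getOrder(orderId);
      span.setTag('order.total', order.total);
      await chargePayment(order);
      await updateInventory(order);
      return order;
    }
  );
}
```

This keeps the span lifecycle out of the business logic, at the cost of a little flexibility compared to managing the span by hand.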
New Relic
npm install newrelic
// newrelic.js (config file)
exports.config = {
app_name: ['My Application'],
license_key: process.env.NEW_RELIC_LICENSE_KEY,
logging: { level: 'info' },
allow_all_headers: true,
distributed_tracing: { enabled: true },
};
// Entry file — require newrelic FIRST
require('newrelic');
const express = require('express');
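Like Datadog, New Relic instruments the HTTP and database layers automatically; for business logic you can add custom segments to the current transaction by hand. A minimal sketch, assuming a hypothetical processOrder flow with a db helper (startSegment and addCustomAttribute are part of the newrelic agent API):

```javascript
const newrelic = require('newrelic');

// startSegment(name, record, handler) times the handler as a
// segment of the current transaction; record = true also
// records it as a metric.
async function processOrder(orderId) {
  return newrelic.startSegment('order.process', true, async () => {
    newrelic.addCustomAttribute('order.id', orderId);
    const order = await db.getOrder(orderId);
    await chargePayment(order);
    return order;
  });
}
```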
OpenTelemetry (Vendor-Neutral)
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
npm install @opentelemetry/exporter-trace-otlp-http
// tracing.js — initialize before anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'api-server',
[SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
}),
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4318/v1/traces',
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': { enabled: false }, // too noisy
}),
],
});
sdk.start();
process.on('SIGTERM', () => sdk.shutdown());
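Auto-instrumentation covers the frameworks; for custom spans you use the @opentelemetry/api package directly. A sketch mirroring the earlier Datadog custom-span example (db and chargePayment are the same hypothetical helpers):

```javascript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const otelTracer = trace.getTracer('api-server');

async function processOrder(orderId) {
  // startActiveSpan makes this span current for everything the
  // callback awaits, so nested auto-instrumented calls (DB,
  // HTTP) attach to it as children.
  return otelTracer.startActiveSpan('order.process', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      const order = await db.getOrder(orderId);
      await chargePayment(order);
      return order;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}
```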
Node.js Profiling with clinic.js
clinic.js is a well-established suite of tools for diagnosing Node.js performance issues:
npm install -g clinic
# Doctor: overall health check — identifies the type of problem
clinic doctor -- node server.js
# Flame: CPU profiling — shows where time is spent
clinic flame -- node server.js
# Bubbleprof: async profiling — shows I/O bottlenecks
clinic bubbleprof -- node server.js
Workflow:
- Run clinic doctor first — it tells you which tool to use next
- Generate load while the tool is running: ab -n 1000 -c 10 http://localhost:3000/api/users
- Stop the server (Ctrl+C) — clinic generates an HTML report
Reading Flame Graphs
Wide bars = more CPU time spent here
Tall stacks = deep call chains
Look for:
- Wide bars near the top = hot functions (optimize these)
- Unexpected library code taking time
- Synchronous operations that should be async
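A common pattern behind a wide bar is synchronous CPU work on the request path. A sketch with Node's built-in crypto module (the hard-coded salt and iteration count are illustrative only, not a recommendation):

```javascript
import { pbkdf2, pbkdf2Sync } from 'crypto';
import { promisify } from 'util';

// BAD: blocks the event loop for the whole hash. In a flame
// graph this shows up as one wide pbkdf2Sync bar.
function hashPasswordSync(password) {
  return pbkdf2Sync(password, 'salt', 100000, 64, 'sha512').toString('hex');
}

// BETTER: the callback version runs in libuv's thread pool,
// leaving the event loop free to serve other requests.
const pbkdf2Async = promisify(pbkdf2);
async function hashPassword(password) {
  const key = await pbkdf2Async(password, 'salt', 100000, 64, 'sha512');
  return key.toString('hex');
}
```

Both produce the same digest; only the blocking behavior differs, which is exactly the kind of difference a flame graph makes visible.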
Built-in Node.js Profiler
# Generate V8 CPU profile
node --prof server.js
# Run load test while profiling
ab -n 1000 -c 10 http://localhost:3000/api/users
# Process the profile (creates readable output)
node --prof-process isolate-*.log > profile.txt
# Look for "Heavy (bottom up)" section
head -100 profile.txt
Memory Profiling
// Detect memory leaks with periodic heap snapshots
import v8 from 'v8';
import fs from 'fs';
function takeHeapSnapshot() {
const filename = `heap-${Date.now()}.heapsnapshot`;
const snapshot = v8.writeHeapSnapshot(filename);
console.log(`Heap snapshot written to ${snapshot}`);
}
// Take snapshot every 5 minutes in development
if (process.env.NODE_ENV === 'development') {
setInterval(takeHeapSnapshot, 5 * 60 * 1000);
}
# Open in Chrome DevTools:
# DevTools → Memory → Load profile → select .heapsnapshot file
# Compare two snapshots to find what's growing
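Snapshots tell you what is growing; for a cheap continuous signal you can also log process.memoryUsage() over time. A heapUsed value that climbs steadily across many intervals, without ever dropping back after garbage collection, is the classic leak symptom. A minimal sketch:

```javascript
// Report heap and resident-set size in megabytes
function heapUsageMB() {
  const { heapUsed, heapTotal, rss } = process.memoryUsage();
  const toMB = (bytes) => Math.round(bytes / 1024 / 1024);
  return { heapUsed: toMB(heapUsed), heapTotal: toMB(heapTotal), rss: toMB(rss) };
}

// unref() so this timer never keeps the process alive on its own
setInterval(() => {
  console.log('memory (MB):', heapUsageMB());
}, 60 * 1000).unref();
```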
Event Loop Monitoring
// Monitor event loop lag — high lag = blocked event loop
import { monitorEventLoopDelay } from 'perf_hooks';
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();
setInterval(() => {
const lag = histogram.percentile(99) / 1e6; // convert ns to ms
if (lag > 100) {
console.warn(`Event loop lag p99: ${lag.toFixed(2)}ms — possible blocking operation`);
}
histogram.reset();
}, 10000);
Custom Metrics with prom-client
import { Registry, Histogram, Counter, Gauge } from 'prom-client';
const registry = new Registry();
// Track external API call performance
const externalApiDuration = new Histogram({
name: 'external_api_duration_seconds',
help: 'Duration of external API calls',
labelNames: ['service', 'endpoint', 'status'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
registers: [registry],
});
// Wrap external calls with timing
async function callExternalAPI(service, endpoint, fn) {
const end = externalApiDuration.startTimer({ service, endpoint });
try {
const result = await fn();
end({ status: 'success' });
return result;
} catch (err) {
end({ status: 'error' });
throw err;
}
}
// Usage
const user = await callExternalAPI('user-service', '/users', () =>
fetch('http://user-service/users/42').then(r => r.json())
);
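Prometheus pulls metrics over HTTP, so the registry also needs an exposition endpoint. A sketch wiring the registry defined above into Express (collectDefaultMetrics is part of prom-client and adds standard process metrics, including event loop lag):

```javascript
import express from 'express';
import { collectDefaultMetrics } from 'prom-client';

// Process-level metrics: CPU, heap, event loop lag, open handles
collectDefaultMetrics({ register: registry });

const app = express();

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.send(await registry.metrics()); // metrics() is async
});

app.listen(3000);
```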
Error Tracking with Sentry
npm install @sentry/node @sentry/profiling-node
import * as Sentry from '@sentry/node';
import { nodeProfilingIntegration } from '@sentry/profiling-node';
Sentry.init({
dsn: process.env.SENTRY_DSN,
environment: process.env.NODE_ENV,
release: process.env.APP_VERSION,
integrations: [
nodeProfilingIntegration(),
],
tracesSampleRate: 0.1, // sample 10% of transactions
profilesSampleRate: 0.1, // sample 10% for profiling
});
// Express error handler — register after all routes (SDK v8+ API)
Sentry.setupExpressErrorHandler(app);
// Capture errors manually
try {
riskyOperation();
} catch (err) {
Sentry.captureException(err, {
extra: { userId, orderId },
tags: { component: 'payment' },
});
throw err;
}
Choosing the Right Tool
| Need | Tool |
|---|---|
| Full APM with traces + metrics + logs | Datadog, New Relic |
| Open source, self-hosted | Prometheus + Grafana + Jaeger |
| Vendor-neutral instrumentation | OpenTelemetry |
| Node.js CPU profiling | clinic flame |
| Node.js I/O bottlenecks | clinic bubbleprof |
| Memory leak detection | Chrome DevTools heap snapshots |
| Error tracking | Sentry |
| Load testing + profiling | k6 + clinic |
Monitoring Checklist
- APM agent installed and sending traces
- Error tracking configured (Sentry or equivalent)
- Custom business metrics instrumented
- Event loop lag monitored
- Memory usage tracked over time
- Alerts set for p95 latency, error rate, memory
- Dashboards for key services
- Profiling run on production load patterns