
Application Monitoring and Profiling: APM, Tracing, and Node.js Profiling

Created: March 7, 2026 · Larry Qu · 5 min read

Introduction

Monitoring tells you when something is wrong. Profiling tells you why. Together they give you the visibility to keep production systems healthy and fast. See the Javascript Guide for more context.

  • Monitoring = continuous observation of system health (metrics, alerts, dashboards)
  • Profiling = deep analysis of performance characteristics (CPU, memory, I/O)

APM: Application Performance Monitoring

APM tools automatically instrument your application to collect traces, metrics, and errors with minimal code changes.

Datadog APM

npm install dd-trace
// Must be the FIRST import in your entry file
import tracer from 'dd-trace';

tracer.init({
    service: 'api-server',
    env: process.env.NODE_ENV,
    version: process.env.APP_VERSION,
    logInjection: true,  // adds trace IDs to logs
    runtimeMetrics: true, // CPU, memory, event loop
});

// Everything after this is automatically instrumented:
// - HTTP requests (Express, Fastify, Koa)
// - Database queries (pg, mysql2, mongoose)
// - Redis operations
// - External HTTP calls (axios, node-fetch)
import express from 'express';

What Datadog captures automatically:

  • Every HTTP request with duration, status code, route
  • Database queries with query text and duration
  • Redis operations
  • External API calls
  • Error stack traces with context

Custom Spans

import tracer from 'dd-trace';

async function processOrder(orderId) {
    // Create a custom span for business logic
    const span = tracer.startSpan('order.process');
    span.setTag('order.id', orderId);

    try {
        const order = await db.getOrder(orderId);
        span.setTag('order.total', order.total);

        await chargePayment(order);
        await updateInventory(order);

        span.setTag('order.status', 'completed');
        return order;
    } catch (err) {
        span.setTag('error', true);
        span.setTag('error.message', err.message);
        throw err;
    } finally {
        span.finish();
    }
}

New Relic

npm install newrelic
// newrelic.js (config file)
exports.config = {
    app_name: ['My Application'],
    license_key: process.env.NEW_RELIC_LICENSE_KEY,
    logging: { level: 'info' },
    allow_all_headers: true,
    distributed_tracing: { enabled: true },
};
// Entry file — require newrelic FIRST
require('newrelic');
const express = require('express');

OpenTelemetry (Vendor-Neutral)

npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
npm install @opentelemetry/exporter-trace-otlp-http @opentelemetry/resources @opentelemetry/semantic-conventions
// tracing.js — initialize before anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
    resource: new Resource({
        [SemanticResourceAttributes.SERVICE_NAME]: 'api-server',
        [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    }),
    traceExporter: new OTLPTraceExporter({
        url: 'http://otel-collector:4318/v1/traces',
    }),
    instrumentations: [
        getNodeAutoInstrumentations({
            '@opentelemetry/instrumentation-fs': { enabled: false },  // too noisy
        }),
    ],
});

sdk.start();

process.on('SIGTERM', () => sdk.shutdown());

Node.js Profiling with clinic.js

clinic.js is a widely used suite of tools for diagnosing Node.js performance issues:

npm install -g clinic

# Doctor: overall health check — identifies the type of problem
clinic doctor -- node server.js

# Flame: CPU profiling — shows where time is spent
clinic flame -- node server.js

# Bubbleprof: async profiling — shows I/O bottlenecks
clinic bubbleprof -- node server.js

Workflow:

  1. Run clinic doctor first — it tells you which tool to use next
  2. Generate load while the tool is running: ab -n 1000 -c 10 http://localhost:3000/api/users
  3. Stop the server (Ctrl+C) — clinic generates an HTML report

Reading Flame Graphs

Wide bars = more CPU time spent here
Tall stacks = deep call chains

Look for:
- Wide bars near the top = hot functions (optimize these)
- Unexpected library code taking time
- Synchronous operations that should be async
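The last item deserves a concrete picture. With Node's built-in crypto, pbkdf2Sync blocks the event loop for the entire key derivation (it shows up as one wide bar on the flame graph), while the callback form runs in libuv's thread pool and leaves the loop free. A minimal sketch; the fixed salt and iteration count are illustrative only:

```javascript
import { pbkdf2Sync, pbkdf2 } from 'node:crypto';

// Synchronous: stalls every other request while it runs; on a flame
// graph this is a wide bar with your route handler beneath it.
function hashPasswordSync(password) {
    return pbkdf2Sync(password, 'salt', 100_000, 64, 'sha512').toString('hex');
}

// Asynchronous: the derivation happens off the main thread, so the
// event loop keeps serving other requests in the meantime.
function hashPassword(password) {
    return new Promise((resolve, reject) => {
        pbkdf2(password, 'salt', 100_000, 64, 'sha512', (err, key) =>
            err ? reject(err) : resolve(key.toString('hex')));
    });
}
```

Real code should use a random per-user salt; the point here is only the sync/async contrast.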

Built-in Node.js Profiler

# Generate V8 CPU profile
node --prof server.js

# Run load test while profiling
ab -n 1000 -c 10 http://localhost:3000/api/users

# Process the profile (creates readable output)
node --prof-process isolate-*.log > profile.txt

# Look for "Heavy (bottom up)" section
head -100 profile.txt

Memory Profiling

// Detect memory leaks with periodic heap snapshots
import v8 from 'v8';
import fs from 'fs';

function takeHeapSnapshot() {
    const filename = `heap-${Date.now()}.heapsnapshot`;
    const snapshot = v8.writeHeapSnapshot(filename);
    console.log(`Heap snapshot written to ${snapshot}`);
}

// Take snapshot every 5 minutes in development
if (process.env.NODE_ENV === 'development') {
    setInterval(takeHeapSnapshot, 5 * 60 * 1000);
}
# Open in Chrome DevTools:
# DevTools → Memory → Load profile → select .heapsnapshot file
# Compare two snapshots to find what's growing
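Snapshots tell you what leaked; a cheap complement is sampling process.memoryUsage() over time so that steady growth is visible before it becomes an incident. A minimal sketch (the one-minute interval is arbitrary):

```javascript
// Log heap and RSS periodically; steady growth across many intervals
// is the classic signature of a leak.
function memorySnapshot() {
    const { rss, heapUsed, heapTotal, external } = process.memoryUsage();
    const mb = (n) => (n / 1024 / 1024).toFixed(1);
    return {
        rssMb: mb(rss),
        heapUsedMb: mb(heapUsed),
        heapTotalMb: mb(heapTotal),
        externalMb: mb(external),
    };
}

// unref() keeps this timer from holding the process open on shutdown
setInterval(() => console.log(memorySnapshot()), 60_000).unref();
```

Feeding these numbers into your metrics pipeline (rather than logs) makes the trend easy to alert on.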

Event Loop Monitoring

// Monitor event loop lag — high lag = blocked event loop
import { monitorEventLoopDelay } from 'perf_hooks';

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
    const lag = histogram.percentile(99) / 1e6;  // convert ns to ms
    if (lag > 100) {
        console.warn(`Event loop lag p99: ${lag.toFixed(2)}ms — possible blocking operation`);
    }
    histogram.reset();
}, 10000);

Custom Metrics with prom-client

import { Registry, Histogram, Counter, Gauge } from 'prom-client';

const registry = new Registry();

// Track external API call performance
const externalApiDuration = new Histogram({
    name: 'external_api_duration_seconds',
    help: 'Duration of external API calls',
    labelNames: ['service', 'endpoint', 'status'],
    buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
    registers: [registry],
});

// Wrap external calls with timing
async function callExternalAPI(service, endpoint, fn) {
    const end = externalApiDuration.startTimer({ service, endpoint });
    try {
        const result = await fn();
        end({ status: 'success' });
        return result;
    } catch (err) {
        end({ status: 'error' });
        throw err;
    }
}

// Usage
const user = await callExternalAPI('user-service', '/users', () =>
    fetch('http://user-service/users/42').then(r => r.json())
);

Error Tracking with Sentry

npm install @sentry/node @sentry/profiling-node
import * as Sentry from '@sentry/node';
import { nodeProfilingIntegration } from '@sentry/profiling-node';

Sentry.init({
    dsn: process.env.SENTRY_DSN,
    environment: process.env.NODE_ENV,
    release: process.env.APP_VERSION,
    integrations: [
        nodeProfilingIntegration(),
    ],
    tracesSampleRate: 0.1,    // sample 10% of transactions
    profilesSampleRate: 0.1,  // sample 10% for profiling
});

// Express error handler: register after all routes (Sentry SDK v8+)
Sentry.setupExpressErrorHandler(app);

// Capture errors manually
try {
    riskyOperation();
} catch (err) {
    Sentry.captureException(err, {
        extra: { userId, orderId },
        tags: { component: 'payment' },
    });
    throw err;
}

Choosing the Right Tool

  • Full APM with traces + metrics + logs: Datadog, New Relic
  • Open source, self-hosted: Prometheus + Grafana + Jaeger
  • Vendor-neutral instrumentation: OpenTelemetry
  • Node.js CPU profiling: clinic flame
  • Node.js I/O bottlenecks: clinic bubbleprof
  • Memory leak detection: Chrome DevTools heap snapshots
  • Error tracking: Sentry
  • Load testing + profiling: k6 + clinic

Monitoring Checklist

  • APM agent installed and sending traces
  • Error tracking configured (Sentry or equivalent)
  • Custom business metrics instrumented
  • Event loop lag monitored
  • Memory usage tracked over time
  • Alerts set for p95 latency, error rate, memory
  • Dashboards for key services
  • Profiling run on production load patterns
