⚡ Calmops

Application Monitoring and Profiling: APM, Tracing, and Node.js Profiling

Introduction

Monitoring tells you when something is wrong. Profiling tells you why. Together they give you the visibility to keep production systems healthy and fast.

Monitoring = continuous observation of system health (metrics, alerts, dashboards)
Profiling = deep analysis of performance characteristics (CPU, memory, I/O)

APM: Application Performance Monitoring

APM tools automatically instrument your application to collect traces, metrics, and errors with minimal code changes.

Datadog APM

npm install dd-trace
// Must be the FIRST import in your entry file
import tracer from 'dd-trace';

tracer.init({
    service: 'api-server',
    env: process.env.NODE_ENV,
    version: process.env.APP_VERSION,
    logInjection: true,  // adds trace IDs to logs
    runtimeMetrics: true, // CPU, memory, event loop
});

// Everything after this is automatically instrumented:
// - HTTP requests (Express, Fastify, Koa)
// - Database queries (pg, mysql2, mongoose)
// - Redis operations
// - External HTTP calls (axios, node-fetch)
import express from 'express';

What Datadog captures automatically:

  • Every HTTP request with duration, status code, route
  • Database queries with query text and duration
  • Redis operations
  • External API calls
  • Error stack traces with context

Custom Spans

import tracer from 'dd-trace';

async function processOrder(orderId) {
    // Create a custom span for business logic
    const span = tracer.startSpan('order.process');
    span.setTag('order.id', orderId);

    try {
        const order = await db.getOrder(orderId);
        span.setTag('order.total', order.total);

        await chargePayment(order);
        await updateInventory(order);

        span.setTag('order.status', 'completed');
        return order;
    } catch (err) {
        span.setTag('error', true);
        span.setTag('error.message', err.message);
        throw err;
    } finally {
        span.finish();
    }
}

New Relic

npm install newrelic
// newrelic.js (config file)
exports.config = {
    app_name: ['My Application'],
    license_key: process.env.NEW_RELIC_LICENSE_KEY,
    logging: { level: 'info' },
    allow_all_headers: true,
    distributed_tracing: { enabled: true },
};
// Entry file: require newrelic FIRST
require('newrelic');
const express = require('express');

OpenTelemetry (Vendor-Neutral)

npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
npm install @opentelemetry/exporter-trace-otlp-http
// tracing.js: initialize before anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
    resource: new Resource({
        [SemanticResourceAttributes.SERVICE_NAME]: 'api-server',
        [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    }),
    traceExporter: new OTLPTraceExporter({
        url: 'http://otel-collector:4318/v1/traces',
    }),
    instrumentations: [
        getNodeAutoInstrumentations({
            '@opentelemetry/instrumentation-fs': { enabled: false },  // too noisy
        }),
    ],
});

sdk.start();

process.on('SIGTERM', () => sdk.shutdown());

Node.js Profiling with clinic.js

clinic.js is a well-established suite of tools for diagnosing Node.js performance issues:

npm install -g clinic

# Doctor: overall health check - identifies the type of problem
clinic doctor -- node server.js

# Flame: CPU profiling - shows where time is spent
clinic flame -- node server.js

# Bubbleprof: async profiling - shows I/O bottlenecks
clinic bubbleprof -- node server.js

Workflow:

  1. Run clinic doctor first - it tells you which tool to use next
  2. Generate load while the tool is running: ab -n 1000 -c 10 http://localhost:3000/api/users
  3. Stop the server (Ctrl+C) - clinic generates an HTML report
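To have something worth profiling, it helps to know what a blocked event loop looks like in code. A minimal sketch (the function and values are illustrative, not from the original article): a synchronous recursive call like this shows up in clinic doctor as event-loop blocking, and in clinic flame as one wide bar.

```javascript
// A deliberately CPU-bound function: synchronous recursion occupies the
// event loop for the entire computation, so nothing else can run.
function fib(n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

// In a request handler, fib(40) would freeze every concurrent request.
// Under `clinic flame -- node server.js` it appears as a single wide bar,
// because nearly all CPU samples land inside fib.
console.log(fib(30)); // 832040
```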

Reading Flame Graphs

Wide bars = more CPU time spent here
Tall stacks = deep call chains

Look for:
- Wide bars at the top of a stack = functions actually on CPU (optimize these)
- Unexpected library code taking time
- Synchronous operations that should be async
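Once the flame graph points at synchronous I/O, the usual fix is swapping in the promise-based counterpart. A hedged sketch (the config-file scenario is illustrative):

```javascript
import fs from 'node:fs';
import { readFile } from 'node:fs/promises';

// Before: readFileSync blocks the event loop for the whole disk read,
// so it shows up as a wide bar in the flame graph under load.
function loadConfigSync(path) {
    return JSON.parse(fs.readFileSync(path, 'utf8'));
}

// After: await yields to the event loop while the OS performs the read,
// so other requests keep being served in the meantime.
async function loadConfig(path) {
    return JSON.parse(await readFile(path, 'utf8'));
}
```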

Built-in Node.js Profiler

# Generate V8 CPU profile
node --prof server.js

# Run load test while profiling
ab -n 1000 -c 10 http://localhost:3000/api/users

# Process the profile (creates readable output)
node --prof-process isolate-*.log > profile.txt

# Look for "Heavy (bottom up)" section
head -100 profile.txt
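Newer Node versions also ship a --cpu-prof flag, which skips the --prof-process step entirely by writing a .cpuprofile file that loads straight into the Chrome DevTools Performance panel (the inline script here is a throwaway stand-in for your real entry file):

```shell
# Writes a DevTools-compatible CPU profile on process exit
node --cpu-prof --cpu-prof-name profile.cpuprofile -e "for (let i = 0; i < 1e7; i++);"

# Open in Chrome DevTools: Performance tab, then load profile.cpuprofile
```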

Memory Profiling

// Detect memory leaks with periodic heap snapshots
import v8 from 'v8';
import fs from 'fs';

function takeHeapSnapshot() {
    const filename = `heap-${Date.now()}.heapsnapshot`;
    const snapshot = v8.writeHeapSnapshot(filename);
    console.log(`Heap snapshot written to ${snapshot}`);
}

// Take snapshot every 5 minutes in development
if (process.env.NODE_ENV === 'development') {
    setInterval(takeHeapSnapshot, 5 * 60 * 1000);
}
# Open in Chrome DevTools:
# DevTools -> Memory -> Load profile -> select .heapsnapshot file
# Compare two snapshots to find what's growing
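Snapshot diffs usually point at one ever-growing structure. The classic culprit is an unbounded in-process cache; a minimal sketch of the leak and a size-capped fix (names and the cap value are illustrative):

```javascript
// Leak: entries are added but never evicted, so the Map (and everything
// it references) survives every GC cycle and the heap grows without bound.
const leakyCache = new Map();
function cacheLeaky(key, value) {
    leakyCache.set(key, value);
}

// Fix: cap the size and evict the oldest entry first (simple FIFO).
const MAX_ENTRIES = 1000;
const boundedCache = new Map();
function cacheBounded(key, value) {
    if (boundedCache.size >= MAX_ENTRIES) {
        // Maps iterate in insertion order, so the first key is the oldest.
        const oldest = boundedCache.keys().next().value;
        boundedCache.delete(oldest);
    }
    boundedCache.set(key, value);
}
```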

Event Loop Monitoring

// Monitor event loop lag; high lag means a blocked event loop
import { monitorEventLoopDelay } from 'perf_hooks';

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
    const lag = histogram.percentile(99) / 1e6;  // convert ns to ms
    if (lag > 100) {
        console.warn(`Event loop lag p99: ${lag.toFixed(2)}ms - possible blocking operation`);
    }
    histogram.reset();
}, 10000);
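When those lag warnings fire, the usual remedy is moving CPU-bound work off the main thread. A sketch using worker_threads (the fib workload is illustrative; the worker source is inlined via `eval: true` only to keep the example self-contained, where a real app would point at a separate worker file):

```javascript
import { Worker } from 'node:worker_threads';

// Runs a CPU-bound job on a worker thread so the event loop stays free
// to serve other requests while the computation runs.
function fibInWorker(n) {
    const src = `
        const { parentPort, workerData } = require('node:worker_threads');
        const fib = (n) => (n < 2 ? n : fib(n - 1) + fib(n - 2));
        parentPort.postMessage(fib(workerData));
    `;
    return new Promise((resolve, reject) => {
        const worker = new Worker(src, { eval: true, workerData: n });
        worker.once('message', resolve);
        worker.once('error', reject);
    });
}
```

Spawning a worker per request is expensive; for steady load a worker pool (e.g. the piscina package) amortizes that cost.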

Custom Metrics with prom-client

import { Registry, Histogram, Counter, Gauge } from 'prom-client';

const registry = new Registry();

// Track external API call performance
const externalApiDuration = new Histogram({
    name: 'external_api_duration_seconds',
    help: 'Duration of external API calls',
    labelNames: ['service', 'endpoint', 'status'],
    buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
    registers: [registry],
});

// Wrap external calls with timing
async function callExternalAPI(service, endpoint, fn) {
    const end = externalApiDuration.startTimer({ service, endpoint });
    try {
        const result = await fn();
        end({ status: 'success' });
        return result;
    } catch (err) {
        end({ status: 'error' });
        throw err;
    }
}

// Usage
const user = await callExternalAPI('user-service', '/users', () =>
    fetch('http://user-service/users/42').then(r => r.json())
);
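The startTimer shape above is worth internalizing even outside prom-client: a function that records a start time and returns an `end` callback which merges the final labels. A stdlib-only sketch of the same pattern (the `observations` array is a stand-in for the real histogram sink):

```javascript
import { performance } from 'node:perf_hooks';

// Mimics prom-client's Histogram.startTimer(): call with initial labels,
// get back an end() that merges final labels and records elapsed seconds.
const observations = [];

function startTimer(labels) {
    const start = performance.now();
    return (endLabels = {}) => {
        const seconds = (performance.now() - start) / 1000;
        observations.push({ ...labels, ...endLabels, seconds });
        return seconds;
    };
}
```

This is why `end({ status: 'success' })` in the wrapper above can add a label that wasn't known when timing started.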

Error Tracking with Sentry

npm install @sentry/node @sentry/profiling-node
import * as Sentry from '@sentry/node';
import { nodeProfilingIntegration } from '@sentry/profiling-node';

Sentry.init({
    dsn: process.env.SENTRY_DSN,
    environment: process.env.NODE_ENV,
    release: process.env.APP_VERSION,
    integrations: [
        nodeProfilingIntegration(),
    ],
    tracesSampleRate: 0.1,    // sample 10% of transactions
    profilesSampleRate: 0.1,  // sample 10% for profiling
});

// Express error handler (Sentry SDK v8+): register after all routes
Sentry.setupExpressErrorHandler(app);

// Capture errors manually
try {
    riskyOperation();
} catch (err) {
    Sentry.captureException(err, {
        extra: { userId, orderId },
        tags: { component: 'payment' },
    });
    throw err;
}

Choosing the Right Tool

Need                                     Tool
----------------------------------------------------------------------
Full APM with traces + metrics + logs    Datadog, New Relic
Open source, self-hosted                 Prometheus + Grafana + Jaeger
Vendor-neutral instrumentation           OpenTelemetry
Node.js CPU profiling                    clinic flame
Node.js I/O bottlenecks                  clinic bubbleprof
Memory leak detection                    Chrome DevTools heap snapshots
Error tracking                           Sentry
Load testing + profiling                 k6 + clinic

Monitoring Checklist

  • APM agent installed and sending traces
  • Error tracking configured (Sentry or equivalent)
  • Custom business metrics instrumented
  • Event loop lag monitored
  • Memory usage tracked over time
  • Alerts set for p95 latency, error rate, memory
  • Dashboards for key services
  • Profiling run on production load patterns
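For the "memory usage tracked over time" item, process.memoryUsage() is enough to feed a gauge or a log line; a minimal sketch (the 30-second interval is an arbitrary choice):

```javascript
// Samples heap and RSS; feed these numbers into whatever metrics
// backend you already have.
function sampleMemory() {
    const { rss, heapUsed, heapTotal, external } = process.memoryUsage();
    const mb = (n) => (n / 1024 / 1024).toFixed(1);
    return {
        rssMb: mb(rss),
        heapUsedMb: mb(heapUsed),
        heapTotalMb: mb(heapTotal),
        externalMb: mb(external),
    };
}

// unref() so the interval never keeps the process alive on its own
setInterval(() => console.log(sampleMemory()), 30_000).unref();
```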
