⚡ Calmops

Application Monitoring and Profiling: APM, Tracing, and Node.js Profiling

Introduction

Monitoring tells you when something is wrong. Profiling tells you why. Together they give you the visibility to keep production systems healthy and fast.

Monitoring = continuous observation of system health (metrics, alerts, dashboards)
Profiling = deep analysis of performance characteristics (CPU, memory, I/O)

APM: Application Performance Monitoring

APM tools automatically instrument your application to collect traces, metrics, and errors with minimal code changes.

Datadog APM

npm install dd-trace
// Must be the FIRST import in your entry file
import tracer from 'dd-trace';

tracer.init({
    service: 'api-server',
    env: process.env.NODE_ENV,
    version: process.env.APP_VERSION,
    logInjection: true,  // adds trace IDs to logs
    runtimeMetrics: true, // CPU, memory, event loop
});

// Everything after this is automatically instrumented:
// - HTTP requests (Express, Fastify, Koa)
// - Database queries (pg, mysql2, mongoose)
// - Redis operations
// - External HTTP calls (axios, node-fetch)
import express from 'express';

What Datadog captures automatically:

  • Every HTTP request with duration, status code, route
  • Database queries with query text and duration
  • Redis operations
  • External API calls
  • Error stack traces with context

Custom Spans

import tracer from 'dd-trace';

async function processOrder(orderId) {
    // Create a custom span for business logic
    const span = tracer.startSpan('order.process');
    span.setTag('order.id', orderId);

    try {
        const order = await db.getOrder(orderId);
        span.setTag('order.total', order.total);

        await chargePayment(order);
        await updateInventory(order);

        span.setTag('order.status', 'completed');
        return order;
    } catch (err) {
        span.setTag('error', true);
        span.setTag('error.message', err.message);
        throw err;
    } finally {
        span.finish();
    }
}

New Relic

npm install newrelic
// newrelic.js (config file)
exports.config = {
    app_name: ['My Application'],
    license_key: process.env.NEW_RELIC_LICENSE_KEY,
    logging: { level: 'info' },
    allow_all_headers: true,
    distributed_tracing: { enabled: true },
};
// Entry file: require newrelic FIRST
require('newrelic');
const express = require('express');

OpenTelemetry (Vendor-Neutral)

npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
npm install @opentelemetry/exporter-trace-otlp-http
// tracing.js: initialize before anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
    resource: new Resource({
        [SemanticResourceAttributes.SERVICE_NAME]: 'api-server',
        [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    }),
    traceExporter: new OTLPTraceExporter({
        url: 'http://otel-collector:4318/v1/traces',
    }),
    instrumentations: [
        getNodeAutoInstrumentations({
            '@opentelemetry/instrumentation-fs': { enabled: false },  // too noisy
        }),
    ],
});

sdk.start();

process.on('SIGTERM', () => sdk.shutdown());

Node.js Profiling with clinic.js

clinic.js is a well-established suite of tools for diagnosing Node.js performance issues:

npm install -g clinic

# Doctor: overall health check - identifies the type of problem
clinic doctor -- node server.js

# Flame: CPU profiling - shows where time is spent
clinic flame -- node server.js

# Bubbleprof: async profiling - shows I/O bottlenecks
clinic bubbleprof -- node server.js

Workflow:

  1. Run clinic doctor first - it tells you which tool to use next
  2. Generate load while the tool is running: ab -n 1000 -c 10 http://localhost:3000/api/users
  3. Stop the server (Ctrl+C) - clinic generates an HTML report
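To have something worth profiling, it helps to know what a blocked event loop looks like in code. A minimal sketch (the function and values are illustrative, not from the original article): a synchronous recursive call like this shows up in clinic doctor as event-loop blocking, and in clinic flame as one wide bar.

```javascript
// A deliberately CPU-bound function: synchronous recursion occupies the
// event loop for the entire computation, so nothing else can run.
function fib(n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

// In a request handler, fib(40) would freeze every concurrent request.
// Under `clinic flame -- node server.js` it appears as a single wide bar,
// because nearly all CPU samples land inside fib.
console.log(fib(30)); // 832040
```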

Reading Flame Graphs

Wide bars = more CPU time spent here
Tall stacks = deep call chains

Look for:
- Wide bars at the top of a stack = functions actually on CPU (optimize these)
- Unexpected library code taking time
- Synchronous operations that should be async
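Once the flame graph points at synchronous I/O, the usual fix is swapping in the promise-based counterpart. A hedged sketch (the config-file scenario is illustrative):

```javascript
import fs from 'node:fs';
import { readFile } from 'node:fs/promises';

// Before: readFileSync blocks the event loop for the whole disk read,
// so it shows up as a wide bar in the flame graph under load.
function loadConfigSync(path) {
    return JSON.parse(fs.readFileSync(path, 'utf8'));
}

// After: await yields to the event loop while the OS performs the read,
// so other requests keep being served in the meantime.
async function loadConfig(path) {
    return JSON.parse(await readFile(path, 'utf8'));
}
```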

Built-in Node.js Profiler

# Generate V8 CPU profile
node --prof server.js

# Run load test while profiling
ab -n 1000 -c 10 http://localhost:3000/api/users

# Process the profile (creates readable output)
node --prof-process isolate-*.log > profile.txt

# Look for "Heavy (bottom up)" section
head -100 profile.txt
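Newer Node versions also ship a --cpu-prof flag, which skips the --prof-process step entirely by writing a .cpuprofile file that loads straight into the Chrome DevTools Performance panel (the inline script here is a throwaway stand-in for your real entry file):

```shell
# Writes a DevTools-compatible CPU profile on process exit
node --cpu-prof --cpu-prof-name profile.cpuprofile -e "for (let i = 0; i < 1e7; i++);"

# Open in Chrome DevTools: Performance tab, then load profile.cpuprofile
```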

Memory Profiling

// Detect memory leaks with periodic heap snapshots
import v8 from 'v8';
import fs from 'fs';

function takeHeapSnapshot() {
    const filename = `heap-${Date.now()}.heapsnapshot`;
    const snapshot = v8.writeHeapSnapshot(filename);
    console.log(`Heap snapshot written to ${snapshot}`);
}

// Take snapshot every 5 minutes in development
if (process.env.NODE_ENV === 'development') {
    setInterval(takeHeapSnapshot, 5 * 60 * 1000);
}
# Open in Chrome DevTools:
# DevTools -> Memory -> Load profile -> select .heapsnapshot file
# Compare two snapshots to find what's growing
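Snapshot diffs usually point at one ever-growing structure. The classic culprit is an unbounded in-process cache; a minimal sketch of the leak and a size-capped fix (names and the cap value are illustrative):

```javascript
// Leak: entries are added but never evicted, so the Map (and everything
// it references) survives every GC cycle and the heap grows without bound.
const leakyCache = new Map();
function cacheLeaky(key, value) {
    leakyCache.set(key, value);
}

// Fix: cap the size and evict the oldest entry first (simple FIFO).
const MAX_ENTRIES = 1000;
const boundedCache = new Map();
function cacheBounded(key, value) {
    if (boundedCache.size >= MAX_ENTRIES) {
        // Maps iterate in insertion order, so the first key is the oldest.
        const oldest = boundedCache.keys().next().value;
        boundedCache.delete(oldest);
    }
    boundedCache.set(key, value);
}
```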

Event Loop Monitoring

// Monitor event loop lag; high lag means a blocked event loop
import { monitorEventLoopDelay } from 'perf_hooks';

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
    const lag = histogram.percentile(99) / 1e6;  // convert ns to ms
    if (lag > 100) {
        console.warn(`Event loop lag p99: ${lag.toFixed(2)}ms - possible blocking operation`);
    }
    histogram.reset();
}, 10000);
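When those lag warnings fire, the usual remedy is moving CPU-bound work off the main thread. A sketch using worker_threads (the fib workload is illustrative; the worker source is inlined via `eval: true` only to keep the example self-contained, where a real app would point at a separate worker file):

```javascript
import { Worker } from 'node:worker_threads';

// Runs a CPU-bound job on a worker thread so the event loop stays free
// to serve other requests while the computation runs.
function fibInWorker(n) {
    const src = `
        const { parentPort, workerData } = require('node:worker_threads');
        const fib = (n) => (n < 2 ? n : fib(n - 1) + fib(n - 2));
        parentPort.postMessage(fib(workerData));
    `;
    return new Promise((resolve, reject) => {
        const worker = new Worker(src, { eval: true, workerData: n });
        worker.once('message', resolve);
        worker.once('error', reject);
    });
}
```

Spawning a worker per request is expensive; for steady load a worker pool (e.g. the piscina package) amortizes that cost.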

Custom Metrics with prom-client

import { Registry, Histogram, Counter, Gauge } from 'prom-client';

const registry = new Registry();

// Track external API call performance
const externalApiDuration = new Histogram({
    name: 'external_api_duration_seconds',
    help: 'Duration of external API calls',
    labelNames: ['service', 'endpoint', 'status'],
    buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
    registers: [registry],
});

// Wrap external calls with timing
async function callExternalAPI(service, endpoint, fn) {
    const end = externalApiDuration.startTimer({ service, endpoint });
    try {
        const result = await fn();
        end({ status: 'success' });
        return result;
    } catch (err) {
        end({ status: 'error' });
        throw err;
    }
}

// Usage
const user = await callExternalAPI('user-service', '/users', () =>
    fetch('http://user-service/users/42').then(r => r.json())
);
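The startTimer shape above is worth internalizing even outside prom-client: a function that records a start time and returns an `end` callback which merges the final labels. A stdlib-only sketch of the same pattern (the `observations` array is a stand-in for the real histogram sink):

```javascript
import { performance } from 'node:perf_hooks';

// Mimics prom-client's Histogram.startTimer(): call with initial labels,
// get back an end() that merges final labels and records elapsed seconds.
const observations = [];

function startTimer(labels) {
    const start = performance.now();
    return (endLabels = {}) => {
        const seconds = (performance.now() - start) / 1000;
        observations.push({ ...labels, ...endLabels, seconds });
        return seconds;
    };
}
```

This is why `end({ status: 'success' })` in the wrapper above can add a label that wasn't known when timing started.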

Error Tracking with Sentry

npm install @sentry/node @sentry/profiling-node
import * as Sentry from '@sentry/node';
import { nodeProfilingIntegration } from '@sentry/profiling-node';

Sentry.init({
    dsn: process.env.SENTRY_DSN,
    environment: process.env.NODE_ENV,
    release: process.env.APP_VERSION,
    integrations: [
        nodeProfilingIntegration(),
    ],
    tracesSampleRate: 0.1,    // sample 10% of transactions
    profilesSampleRate: 0.1,  // sample 10% for profiling
});

// Express error handler (Sentry SDK v8+): register after all routes
Sentry.setupExpressErrorHandler(app);

// Capture errors manually
try {
    riskyOperation();
} catch (err) {
    Sentry.captureException(err, {
        extra: { userId, orderId },
        tags: { component: 'payment' },
    });
    throw err;
}

Choosing the Right Tool

Need                                     Tool
----------------------------------------------------------------------
Full APM with traces + metrics + logs    Datadog, New Relic
Open source, self-hosted                 Prometheus + Grafana + Jaeger
Vendor-neutral instrumentation           OpenTelemetry
Node.js CPU profiling                    clinic flame
Node.js I/O bottlenecks                  clinic bubbleprof
Memory leak detection                    Chrome DevTools heap snapshots
Error tracking                           Sentry
Load testing + profiling                 k6 + clinic

Monitoring Checklist

  • APM agent installed and sending traces
  • Error tracking configured (Sentry or equivalent)
  • Custom business metrics instrumented
  • Event loop lag monitored
  • Memory usage tracked over time
  • Alerts set for p95 latency, error rate, memory
  • Dashboards for key services
  • Profiling run on production load patterns
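For the "memory usage tracked over time" item, process.memoryUsage() is enough to feed a gauge or a log line; a minimal sketch (the 30-second interval is an arbitrary choice):

```javascript
// Samples heap and RSS; feed these numbers into whatever metrics
// backend you already have.
function sampleMemory() {
    const { rss, heapUsed, heapTotal, external } = process.memoryUsage();
    const mb = (n) => (n / 1024 / 1024).toFixed(1);
    return {
        rssMb: mb(rss),
        heapUsedMb: mb(heapUsed),
        heapTotalMb: mb(heapTotal),
        externalMb: mb(external),
    };
}

// unref() so the interval never keeps the process alive on its own
setInterval(() => console.log(sampleMemory()), 30_000).unref();
```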
