
Centralized Logging: ELK Stack, Loki, and Structured Logging

Introduction

When you have multiple services running across multiple servers, ssh-ing into each one to read logs doesn’t scale. Centralized logging aggregates all logs into one place where you can search, filter, and alert across your entire system.

Two main options:

  • ELK Stack (Elasticsearch + Logstash + Kibana): powerful, feature-rich, higher resource usage
  • Grafana Loki: lightweight, designed for Kubernetes, integrates with Grafana

Structured Logging First

Before setting up log aggregation, make your logs machine-parseable. Structured (JSON) logs are far easier to query than plain text:

// BAD: unstructured, hard to query
console.log('User 42 logged in from 192.168.1.1 at 2026-03-30T10:00:00Z');

// GOOD: structured JSON, easy to filter and aggregate
logger.info('User login', {
    userId: 42,
    ip: '192.168.1.1',
    timestamp: new Date().toISOString(),
    userAgent: req.headers['user-agent'],
});
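If you just want the shape without pulling in a library, the same idea fits in a few lines. A dependency-free sketch (`logJson` is a name made up here, not a standard API):

```javascript
// Minimal structured logger sketch: one JSON object per line on stdout.
function logJson(level, message, fields = {}) {
    const entry = {
        level,
        message,
        timestamp: new Date().toISOString(),
        ...fields,
    };
    console.log(JSON.stringify(entry));
    return entry; // returned so callers and tests can inspect it
}

logJson('info', 'User login', { userId: 42, ip: '192.168.1.1' });
```

One JSON object per line is exactly what Logstash's json_lines codec and Loki's json parser expect downstream.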

Winston (Node.js)

npm install winston

// logger.js
import winston from 'winston';

const logger = winston.createLogger({
    level: process.env.LOG_LEVEL || 'info',
    format: winston.format.combine(
        winston.format.timestamp(),
        winston.format.errors({ stack: true }),
        winston.format.json()
    ),
    defaultMeta: {
        service: 'api-server',
        version: process.env.APP_VERSION,
        environment: process.env.NODE_ENV,
    },
    transports: [
        // Console output (for Docker/Kubernetes log collection)
        new winston.transports.Console(),

        // File output (optional)
        new winston.transports.File({
            filename: '/var/log/app/error.log',
            level: 'error',
        }),
    ],
});

export default logger;

// Usage
logger.info('Request received', {
    method: req.method,
    path: req.path,
    userId: req.user?.id,
    requestId: req.id,
});

logger.error('Database query failed', {
    error: err.message,
    stack: err.stack,
    query: sql,
    params,
});

logger.warn('Rate limit approaching', {
    userId: req.user.id,
    remaining: rateLimit.remaining,
    resetAt: rateLimit.resetAt,
});

Request Logging Middleware

// middleware/requestLogger.js
import { v4 as uuidv4 } from 'uuid';
import logger from '../logger.js';

export function requestLogger(req, res, next) {
    req.id = req.headers['x-request-id'] || uuidv4();
    res.setHeader('X-Request-ID', req.id);

    const start = Date.now();

    res.on('finish', () => {
        logger.info('HTTP request', {
            requestId: req.id,
            method: req.method,
            path: req.path,
            statusCode: res.statusCode,
            duration: Date.now() - start,
            userId: req.user?.id,
            ip: req.ip,
            userAgent: req.headers['user-agent'],
        });
    });

    next();
}
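Because the middleware is just a function of (req, res, next), its logic can be sanity-checked with plain objects, no Express required. A framework-free variant (`makeRequestLogger` and the mock req/res below are invented for illustration; the logger is injected so nothing external is needed):

```javascript
import { EventEmitter } from 'node:events';
import { randomUUID } from 'node:crypto';

// Same pattern as the middleware above, with the log function injected.
function makeRequestLogger(logFn) {
    return (req, res, next) => {
        req.id = req.headers['x-request-id'] || randomUUID();
        res.setHeader('X-Request-ID', req.id);

        const start = Date.now();

        res.on('finish', () => {
            logFn('HTTP request', {
                requestId: req.id,
                method: req.method,
                path: req.path,
                statusCode: res.statusCode,
                duration: Date.now() - start,
            });
        });

        next();
    };
}

// Exercise it with stand-in req/res objects:
const logged = [];
const middleware = makeRequestLogger((msg, meta) => logged.push(meta));

const res = Object.assign(new EventEmitter(), {
    statusCode: 200,
    headers: {},
    setHeader(name, value) { this.headers[name] = value; },
});
const req = { headers: { 'x-request-id': 'abc-123' }, method: 'GET', path: '/health' };

middleware(req, res, () => {});
res.emit('finish'); // logged[0] now holds the structured entry
```

Note how an incoming X-Request-ID header is reused rather than replaced, which is what lets you trace a request across services.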

ELK Stack Setup

Docker Compose

# docker-compose.elk.yml
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:9200/_cluster/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5

  logstash:
    image: docker.elastic.co/logstash/logstash:8.12.0
    ports:
      - "5044:5044"   # Beats input
      - "5000:5000"   # TCP input
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    depends_on:
      elasticsearch:
        condition: service_healthy

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    ports:
      - "5601:5601"
    environment:
      ELASTICSEARCH_HOSTS: http://elasticsearch:9200
    depends_on:
      elasticsearch:
        condition: service_healthy

volumes:
  elasticsearch-data:

Logstash Pipeline

# logstash/pipeline/app.conf
input {
  # Receive JSON logs over TCP
  tcp {
    port => 5000
    codec => json_lines
  }

  # Receive from Filebeat
  beats {
    port => 5044
  }
}

filter {
  # Parse timestamp
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }

  # Add geo info from IP
  if [ip] {
    geoip {
      source => "ip"
      target => "geoip"
    }
  }

  # Remove internal fields
  mutate {
    remove_field => ["@version", "host"]
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }

  # Debug: also print to stdout
  # stdout { codec => rubydebug }
}
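Before wiring the app in, you can smoke-test the TCP input with a raw socket: the json_lines codec only expects newline-delimited JSON. A sketch using Node's built-in net module (`sendLogLine` is a name invented here):

```javascript
import net from 'node:net';

// Ship one log entry to the Logstash TCP input defined above.
// json_lines expects newline-delimited JSON, hence the trailing '\n'.
function sendLogLine(host, port, entry) {
    return new Promise((resolve, reject) => {
        const socket = net.createConnection({ host, port }, () => {
            socket.end(JSON.stringify(entry) + '\n', resolve);
        });
        socket.on('error', reject);
    });
}

// e.g. sendLogLine('localhost', 5000, { level: 'info', message: 'deployed', service: 'api-server' });
```

If the pipeline is healthy, the entry shows up in Kibana under the app-logs-* index within a few seconds.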

Send Logs from Node.js to Logstash

npm install winston-logstash-transport

import winston from 'winston';
// Note: depending on the package version, the transport may be a named export:
// import { LogstashTransport } from 'winston-logstash-transport';
import WinstonLogstash from 'winston-logstash-transport';

const logger = winston.createLogger({
    transports: [
        new winston.transports.Console(),
        new WinstonLogstash({
            host: process.env.LOGSTASH_HOST || 'logstash',
            port: 5000,
            ssl_enable: false,
        }),
    ],
});

Grafana Loki (Lightweight Alternative)

Loki is designed for Kubernetes and works natively with Grafana. It indexes only labels (not the full log content), making it much cheaper to run than Elasticsearch.

Docker Compose

# docker-compose.loki.yml
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail-config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  loki-data:
  grafana-data:

# promtail-config.yml - collect Docker container logs
server:
  http_listen_port: 9080

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        target_label: container
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: service

Querying Loki with LogQL

# All logs from the api-server service
{service="api-server"}

# Error logs only
{service="api-server"} |= "error"

# JSON parsing and filtering
{service="api-server"} | json | statusCode >= 500

# Count errors per minute
sum(rate({service="api-server"} |= "error" [1m])) by (service)

# P95 latency from structured logs
quantile_over_time(0.95,
  {service="api-server"} | json | unwrap duration [5m]
) by (path)

# Logs from multiple services
{service=~"api-server|worker"}

# Filter to a specific user's requests (the time range itself is set in Grafana)
{service="api-server"} | json | userId = "42"

Kibana Queries (ELK)

# KQL (Kibana Query Language)

# Find all errors
statusCode: 500 OR statusCode: 503

# Find slow requests
duration > 1000

# Find specific user's requests
userId: "42" AND statusCode: 200

# Find errors in last hour
@timestamp > now-1h AND statusCode >= 500

# Full text search
"database connection failed"

# Wildcard
path: /api/users/*

Log Levels and When to Use Them

// ERROR: something failed that needs attention
logger.error('Payment processing failed', {
    orderId,
    error: err.message,
    userId,
});

// WARN: unexpected but handled; investigate later
logger.warn('Deprecated API endpoint called', {
    path: req.path,
    userId: req.user?.id,
});

// INFO: normal operations, key business events
logger.info('Order created', {
    orderId: order.id,
    userId: order.userId,
    total: order.total,
});

// DEBUG: detailed diagnostic info (disabled in production)
logger.debug('Cache miss', {
    key: cacheKey,
    ttl: 3600,
});

Production log level: Set LOG_LEVEL=info in production. debug generates too much volume.
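The mechanism behind that setting is simple: each level maps to a number, and anything less severe than the configured threshold is dropped before it is formatted or shipped. A toy sketch (the LEVELS map is simplified; Winston's npm levels also include http, verbose, and silly):

```javascript
// npm-style levels: lower number = more severe.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };

function shouldLog(level, threshold = process.env.LOG_LEVEL || 'info') {
    return LEVELS[level] <= LEVELS[threshold];
}

shouldLog('warn', 'info');   // true: warn passes an info threshold
shouldLog('debug', 'info');  // false: dropped before formatting
```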

Log Retention and Cost

Logs grow fast. Plan your retention:

# Elasticsearch ILM (Index Lifecycle Management)
# Hot: 0-7 days (fast SSD, full indexing)
# Warm: 7-30 days (slower storage, read-only)
# Cold: 30-90 days (cheapest storage)
# Delete: after 90 days

# Loki retention (loki-config.yml)
limits_config:
  retention_period: 30d  # delete logs older than 30 days

compactor:
  retention_enabled: true

Cost optimization:

  • Only log what you’ll actually query
  • Use sampling for high-volume debug logs
  • Archive to S3/GCS for compliance, query rarely
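Sampling can be as simple as wrapping the log call. A sketch that keeps roughly `rate` of calls and records the rate in each entry, so counts can be scaled back up at query time (`sampledLog` is a name invented here; the injectable `random` exists only to make the behavior testable):

```javascript
// Wrap a log function so only about `rate` of calls go through.
// Recording sampleRate lets you multiply counts back up when querying.
function sampledLog(logFn, rate, random = Math.random) {
    return (message, meta = {}) => {
        if (random() >= rate) return false; // dropped
        logFn(message, { ...meta, sampleRate: rate });
        return true;
    };
}

// Usage: keep ~1% of cache-miss debug logs
// const debugSampled = sampledLog((msg, meta) => logger.debug(msg, meta), 0.01);
```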

Alerting on Logs

# Grafana alert on error rate from Loki
- alert: HighErrorRate
  expr: |
    sum(rate({service="api-server"} |= "error" [5m])) > 10
  for: 5m
  annotations:
    summary: "More than 10 errors/second for 5 minutes"
